Abstract: For finding optimal solutions to n-dimensional problems, gradient descent is one of the most commonly used methods. Below, gradient descent is derived and explained in detail, from its past life to its present form. Note that gradient descent only descends into the valley in which the current point sits, and a function does not necessarily have just one valley. Gradient descent therefore finds a local solution, but not necessarily all solutions.
Preface

Machine learning has been getting more and more popular lately. A while ago Andrew Ng of Stanford personally recorded the Deep Learning Specialization course, which set off a huge wave of learning in China. Not wanting to be left behind by the times, I started digging into machine learning myself. People say it is very hard to learn, but how would you know its subtleties and its fun without trying it yourself? Only by trying, again and again, can you find the path that suits you best.

Please forgive the sentimental opening; now to the main topic. This article takes a thorough look at GradientDescent (gradient descent) as it appears in machine learning. After reading it, you should understand exactly how GradientDescent works.
The concept of gradient descent

Gradient descent is a first-order optimization algorithm, often also called steepest descent. To find a local minimum of a function with gradient descent, we repeatedly step from the current point by a prescribed distance in the direction opposite to the gradient (or an approximate gradient) of the function at that point. Gradient descent can therefore help us find a local minimum, or the minimum, of a function. For finding optimal solutions to n-dimensional problems it is one of the most commonly used methods. Below, gradient descent is derived and explained in detail, from its past life to its present form.
The past life of gradient descent

Let's start with something simple, the one-dimensional function below:
f(x) = x^3 + 2 * x - 3
In mathematics, if we want to find the solution of f(x) = 0, we can do so via the following error equation:
error = (f(x) - 0)^2
When error approaches its minimum, the corresponding x is the solution of f(x) = 0, which we can also see by plotting the error curve.

From that plot it is easy to see that to minimize the function we only have to move x to the bottom of the valley. We already learned how to find such a minimum in high school: differentiate the error function (i.e. take its slope):
derivative(x) = 6 * x^5 + 16 * x^3 - 18 * x^2 + 8 * x - 12
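As a quick chain-rule check (this polynomial is the derivative of error(x), not of f(x) itself):

derivative(x) = d/dx [ (x^3 + 2 * x - 3)^2 ]
              = 2 * (x^3 + 2 * x - 3) * (3 * x^2 + 2)
              = 6 * x^5 + 16 * x^3 - 18 * x^2 + 8 * x - 12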
To reach the minimum we only need to set derivative(x) = 0, which gives x = 1. Combining the plot with the derivative, we can also see that:
when x < 1, derivative(x) < 0, i.e. the slope is negative;
when x > 1, derivative(x) > 0, i.e. the slope is positive;
as x approaches 1, derivative(x) approaches 0, i.e. the slope goes to zero.
Based on these observations, we can use the following expression to move x along the function:

x = x - rate * derivative

When the slope is negative, x increases; when the slope is positive, x decreases. So x always moves toward the bottom of the valley, driving error down, and eventually we obtain the solution of f(x) = 0. Here rate controls how far x moves against the derivative in each step: the larger rate is, the farther x moves per step; the smaller it is, the less it moves. (If rate is too large, a step can overshoot the valley and the iteration may even fail to converge.)

That works for simple functions whose derivative we can write down directly. To handle more complicated functions, we can fall back on the definition of the derivative: if f(x) is differentiable at x0, then

derivative(x0) = lim[delta -> 0] (f(x0 + delta) - f(x0)) / delta

and in code we approximate this limit with a small but finite delta.

That was all formula work; now let's turn it into code. All of the code below is written in Python.
>>> def f(x):
...     return x**3 + 2 * x - 3
...
>>> def error(x):
...     return (f(x) - 0)**2
...
>>> def gradient_descent(x):
...     delta = 0.00000001
...     derivative = (error(x + delta) - error(x)) / delta
...     rate = 0.01
...     return x - rate * derivative
...
>>> x = 0.8
>>> for i in range(50):
...     x = gradient_descent(x)
...     print("x = {:6f}, f(x) = {:6f}".format(x, f(x)))
...
Running this program produces the following output:
x = 0.869619, f(x) = -0.603123
x = 0.921110, f(x) = -0.376268
x = 0.955316, f(x) = -0.217521
x = 0.975927, f(x) = -0.118638
x = 0.987453, f(x) = -0.062266
x = 0.993586, f(x) = -0.031946
x = 0.996756, f(x) = -0.016187
x = 0.998369, f(x) = -0.008149
x = 0.999182, f(x) = -0.004088
x = 0.999590, f(x) = -0.002048
x = 0.999795, f(x) = -0.001025
x = 0.999897, f(x) = -0.000513
x = 0.999949, f(x) = -0.000256
x = 0.999974, f(x) = -0.000128
x = 0.999987, f(x) = -0.000064
x = 0.999994, f(x) = -0.000032
x = 0.999997, f(x) = -0.000016
x = 0.999998, f(x) = -0.000008
x = 0.999999, f(x) = -0.000004
x = 1.000000, f(x) = -0.000002
x = 1.000000, f(x) = -0.000001
x = 1.000000, f(x) = -0.000001
x = 1.000000, f(x) = -0.000000
x = 1.000000, f(x) = -0.000000
x = 1.000000, f(x) = -0.000000
These results confirm our initial conclusion: at x = 1, f(x) = 0.

So with this method, as long as we take enough steps, we can obtain a very precise value.
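In practice, instead of running a fixed number of iterations, one usually stops once the update becomes negligible. A minimal sketch, reusing the f and gradient_descent defined above (the tolerance 1e-9 is an arbitrary choice for illustration):

x = 0.8
for step in range(1000):
    new_x = gradient_descent(x)
    if abs(new_x - x) < 1e-9:   # stop once x barely changes any more
        break
    x = new_x
print("x = {:6f}, f(x) = {:6f}".format(x, f(x)))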
That covered a one-dimensional function. How do we handle a multi-dimensional one? Look at the function below and you will see that the multi-dimensional case is just as simple.
f(x) = x[0] + 2 * x[1] + 4
Likewise, if we want the values of x[0] and x[1] where f(x) = 0, we can again minimize an error function and thereby solve f(x) = 0 indirectly. The only difference from the one-dimensional case is that we have to differentiate with respect to x[0] and x[1] separately. In mathematics these are called partial derivatives:

keep x[1] fixed and differentiate with respect to x[0]: the partial derivative of f(x) with respect to x[0];
keep x[0] fixed and differentiate with respect to x[1]: the partial derivative of f(x) with respect to x[1], written out for our error function right below.
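For our particular error function these partials can also be written in closed form; a quick chain-rule expansion (the names match the variables used in the code below):

error(x) = (x[0] + 2 * x[1] + 4)^2
derivative_x0 = 2 * (x[0] + 2 * x[1] + 4) * 1
derivative_x1 = 2 * (x[0] + 2 * x[1] + 4) * 2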
With this understanding in place, we define gradient_descent as follows:
>>> def gradient_descent(x):
...     delta = 0.00000001
...     derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
...     derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
...     rate = 0.01
...     x[0] = x[0] - rate * derivative_x0
...     x[1] = x[1] - rate * derivative_x1
...     return [x[0], x[1]]
...
rate plays the same role as before; the only difference is that x[0] and x[1] are each updated with their own partial derivative (note that the complete program below uses rate = 0.02 instead of 0.01). Here is the whole program:
>>> def f(x):
...     return x[0] + 2 * x[1] + 4
...
>>> def error(x):
...     return (f(x) - 0)**2
...
>>> def gradient_descent(x):
...     delta = 0.00000001
...     derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
...     derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
...     rate = 0.02
...     x[0] = x[0] - rate * derivative_x0
...     x[1] = x[1] - rate * derivative_x1
...     return [x[0], x[1]]
...
>>> x = [-0.5, -1.0]
>>> for i in range(100):
...     x = gradient_descent(x)
...     print("x = {:6f},{:6f}, f(x) = {:6f}".format(x[0], x[1], f(x)))
...
The output is:
x = -0.560000,-1.120000, f(x) = 1.200000
x = -0.608000,-1.216000, f(x) = 0.960000
x = -0.646400,-1.292800, f(x) = 0.768000
x = -0.677120,-1.354240, f(x) = 0.614400
x = -0.701696,-1.403392, f(x) = 0.491520
x = -0.721357,-1.442714, f(x) = 0.393216
x = -0.737085,-1.474171, f(x) = 0.314573
x = -0.749668,-1.499337, f(x) = 0.251658
x = -0.759735,-1.519469, f(x) = 0.201327
x = -0.767788,-1.535575, f(x) = 0.161061
x = -0.774230,-1.548460, f(x) = 0.128849
x = -0.779384,-1.558768, f(x) = 0.103079
x = -0.783507,-1.567015, f(x) = 0.082463
x = -0.786806,-1.573612, f(x) = 0.065971
x = -0.789445,-1.578889, f(x) = 0.052777
x = -0.791556,-1.583112, f(x) = 0.042221
x = -0.793245,-1.586489, f(x) = 0.033777
x = -0.794596,-1.589191, f(x) = 0.027022
x = -0.795677,-1.591353, f(x) = 0.021617
x = -0.796541,-1.593082, f(x) = 0.017294
x = -0.797233,-1.594466, f(x) = 0.013835
x = -0.797786,-1.595573, f(x) = 0.011068
x = -0.798229,-1.596458, f(x) = 0.008854
x = -0.798583,-1.597167, f(x) = 0.007084
x = -0.798867,-1.597733, f(x) = 0.005667
x = -0.799093,-1.598187, f(x) = 0.004533
x = -0.799275,-1.598549, f(x) = 0.003627
x = -0.799420,-1.598839, f(x) = 0.002901
x = -0.799536,-1.599072, f(x) = 0.002321
x = -0.799629,-1.599257, f(x) = 0.001857
x = -0.799703,-1.599406, f(x) = 0.001486
x = -0.799762,-1.599525, f(x) = 0.001188
x = -0.799810,-1.599620, f(x) = 0.000951
x = -0.799848,-1.599696, f(x) = 0.000761
x = -0.799878,-1.599757, f(x) = 0.000608
x = -0.799903,-1.599805, f(x) = 0.000487
x = -0.799922,-1.599844, f(x) = 0.000389
x = -0.799938,-1.599875, f(x) = 0.000312
x = -0.799950,-1.599900, f(x) = 0.000249
x = -0.799960,-1.599920, f(x) = 0.000199
x = -0.799968,-1.599936, f(x) = 0.000159
x = -0.799974,-1.599949, f(x) = 0.000128
x = -0.799980,-1.599959, f(x) = 0.000102
x = -0.799984,-1.599967, f(x) = 0.000082
x = -0.799987,-1.599974, f(x) = 0.000065
x = -0.799990,-1.599979, f(x) = 0.000052
x = -0.799992,-1.599983, f(x) = 0.000042
x = -0.799993,-1.599987, f(x) = 0.000033
x = -0.799995,-1.599989, f(x) = 0.000027
x = -0.799996,-1.599991, f(x) = 0.000021
x = -0.799997,-1.599993, f(x) = 0.000017
x = -0.799997,-1.599995, f(x) = 0.000014
x = -0.799998,-1.599996, f(x) = 0.000011
x = -0.799998,-1.599997, f(x) = 0.000009
x = -0.799999,-1.599997, f(x) = 0.000007
x = -0.799999,-1.599998, f(x) = 0.000006
x = -0.799999,-1.599998, f(x) = 0.000004
x = -0.799999,-1.599999, f(x) = 0.000004
x = -0.799999,-1.599999, f(x) = 0.000003
x = -0.800000,-1.599999, f(x) = 0.000002
x = -0.800000,-1.599999, f(x) = 0.000002
x = -0.800000,-1.599999, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000000
If you look closely you may notice that x = [-0.8, -1.6] is not the only solution of f(x) = 0; x = [-2, -1] is a solution as well. This is because gradient descent only walks down the valley in which the current point happens to sit, and the error function does not necessarily have just one valley where f(x) = 0. Gradient descent therefore finds a local solution, but not necessarily all solutions. Of course, for a very complex function, finding even a local solution is already quite valuable. The small experiment below shows how the starting point decides which solution we end up with.
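Here is a rough sketch of such an experiment, reusing f, error and gradient_descent from the program above and only varying the starting point (the starting values are arbitrary picks for illustration):

# Different starting points settle into different solutions of f(x) = 0.
for start in ([-0.5, -1.0], [1.0, 0.0], [3.0, 3.0]):
    x = list(start)
    for i in range(200):
        x = gradient_descent(x)
    print("start = {} -> x = {:.4f},{:.4f}, f(x) = {:.6f}".format(start, x[0], x[1], f(x)))

Each run should end at a different point on the line x[0] + 2 * x[1] + 4 = 0, which is exactly the "only a local solution" behaviour described above.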
Using it in tensorflow

With the examples above you should now have a basic feel for gradient descent. Let's go back to where we started and use gradient descent in tensorflow.
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
This is the example from the official tensorflow site. The code defines the model linear_model = W * x + b, whose error term is linear_model - y. The goal is to train W and b on a small data set x_train and y_train. To find the best fit over the whole data set, the squared error of every sample is summed into loss, and gradient descent is then applied to loss to find the optimal values.
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
Here rate is 0.01. Since this example is again multi-dimensional (two parameters, W and b), partial derivatives are used to move step by step toward the optimum.
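As a rough sketch of what each such update does for this particular model, here is the same sum-of-squares loss and one gradient step written in plain Python (an illustration using analytic partial derivatives, not the tensorflow internals):

# Plain-Python version of the loss and of a single gradient descent step.
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

def loss(W, b):
    return sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))

def step(W, b, rate=0.01):
    # partial derivatives of the sum-of-squares loss with respect to W and b
    dW = sum(2 * (W * x + b - y) * x for x, y in zip(x_train, y_train))
    db = sum(2 * (W * x + b - y) for x, y in zip(x_train, y_train))
    return W - rate * dW, b - rate * db

W, b = 0.3, -0.3                      # same initial values as the tensorflow code
for i in range(1000):
    W, b = step(W, b)
print("W: %s b: %s loss: %s" % (W, b, loss(W, b)))

If this sketch is right, its very first update should reproduce the first line of the log further below (W ≈ -0.22, b ≈ -0.456).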
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})
Finally, the gradient descent step is run in a loop. Here are some of the intermediate results produced along the way:
W: [-0.21999997] b: [-0.456] loss: 4.01814
W: [-0.39679998] b: [-0.49552] loss: 1.81987
W: [-0.45961601] b: [-0.4965184] loss: 1.54482
W: [-0.48454273] b: [-0.48487374] loss: 1.48251
W: [-0.49684232] b: [-0.46917531] loss: 1.4444
W: [-0.50490189] b: [-0.45227283] loss: 1.4097
W: [-0.5115062] b: [-0.43511063] loss: 1.3761
....
....
....
W: [-0.99999678] b: [ 0.99999058] loss: 5.84635e-11
W: [-0.99999684] b: [ 0.9999907] loss: 5.77707e-11
W: [-0.9999969] b: [ 0.99999082] loss: 5.69997e-11
I won't re-derive these numbers here; if you have read the past-and-present walkthrough of gradient descent above, you should be able to work them out yourself. Reading off the final results, we can take W = -1.0 and b = 1.0. Substituting them back into loss gives essentially 0.0, the smallest possible error, so W = -1.0 and b = 1.0 are the optimal fit for the x_train / y_train data; a quick check follows.
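A tiny sanity check in plain Python that these parameters reproduce the training targets:

# With W = -1.0 and b = 1.0, the model output equals y_train exactly.
W, b = -1.0, 1.0
x_train = [1, 2, 3, 4]
print([W * x + b for x in x_train])   # [0.0, -1.0, -2.0, -3.0]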
That's it for gradient descent; I hope it helps. If anything is lacking, you are welcome to discuss it with me. If you liked this article, you can follow my blog or my WeChat account 怪談時(shí)間到了 to read my other articles.