optimize week1

8 years ago · 65676fba28
--- a/notes/image/0f38a99c8ceb8aa5b90a5f12136fdf43.png
+++ b/notes/image/0f38a99c8ceb8aa5b90a5f12136fdf43.png
--- a/notes/image/20180105_212048.png
+++ b/notes/image/20180105_212048.png
--- a/notes/image/20180106_091307.png
+++ b/notes/image/20180106_091307.png
--- a/notes/image/20180106_101659.png
+++ b/notes/image/20180106_101659.png
--- a/notes/image/24e9420f16fdd758ccb7097788f879e7.png
+++ b/notes/image/24e9420f16fdd758ccb7097788f879e7.png
--- a/notes/week1.md
+++ b/notes/week1.md
@@ -33,11 +33,11 @@

  - Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

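    This definition is informal but is the earliest. It comes from a novice checkers player who could program: the computer played game after game, evaluated board positions from those games, kept "learning" and accumulating experience, and became a strong player.
    This definition is somewhat informal but was proposed earliest. It comes from a novice checkers player who knew how to program: his program had the computer play game after game and keep evaluating board positions in order to "learn" and accumulate experience, until the program became a strong player.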

  - Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some **task T** and some **performance measure P**, if its performance on T, as measured by P, improves with **experience E**. 

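    This is **the first formal definition of machine learning**. It is a bit of a mouthful; the video illustrates it with spam classification, where the three letters stand for:
    Tom Mitchell's definition is more modern and also a bit of a mouthful; the video illustrates it with spam classification, where the three letters stand for: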

    - T(task): the task of classifying email as spam or not spam.
    - P(Performance): the accuracy of the spam classification.
@@ -69,7 +69,7 @@

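   A regression problem is one of predicting **continuous values**.

   In the house-price example, we are given data on house sizes and use it to predict the price of a house of any size.
   In the house-price example, we are given data on house sizes and use it to predict the price of a house of any size. Another example: given a dataset of photos labeled with ages, predict the age of the person in a given photo.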

   ![](image\20180105_194712.png)

@@ -77,7 +77,9 @@

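   A classification problem is one of predicting **discrete values**.

   That is, predict from the data which class the object belongs to. The video uses cancer tumors as an example, classifying each diagnosis as benign or malignant. The spam classification problem from the previous video is likewise a classification problem in supervised learning.
   That is, predict from the data which class the object belongs to.

   The video uses cancer tumors as an example, classifying each diagnosis as benign or malignant. Spam classification is another example of a classification problem in supervised learning.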

   ![](image\20180105_194839.png)

@@ -86,21 +88,24 @@

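 ## 1.4 Unsupervised Learning

 In contrast to supervised learning, the training set carries no human-labeled results (no feedback); instead the computer analyzes it on its own with an unsupervised learning algorithm. The computer may group a given dataset into several distinct clusters, hence the name clustering algorithm.
 In contrast to supervised learning, the training set carries no human-labeled results (no feedback): we do not supply the results, or cannot know what the results for the training set look like. Instead the computer simply analyzes the data on its own with an unsupervised learning algorithm and "arrives at a result". The computer may group a given dataset into several distinct clusters, hence the name clustering algorithm.

 Unsupervised learning generally comes in two kinds:
 Unsupervised learning generally falls into two categories:
 1. Clustering
 2. Associative
   - News aggregation
   - DNA clustering of individuals
   - Astronomical data analysis
   - Market segmentation
   - Social network analysis
 2. Non-clustering
   - Cocktail party problem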

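 **News aggregation**

 Here are some examples of unsupervised learning:
 On a site such as Google News, the backend collects tens of thousands of news stories every day and groups them into news topics. Each of these clusters is the result of applying unsupervised learning.

 - News aggregation and classification
 - DNA clustering of individuals
 - Social networks
 - Market segmentation
 - Astronomical data analysis
 **Cocktail party problem**

 **Example: the cocktail party problem**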
 ![](image/20180105_201639.png)

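 At a cocktail party, everyone's voices overlap and it is hard to make out what the person in front of you is saying. Data like this is hard to label, but an unsupervised learning algorithm can separate a speaker's voice from the background music; judging by the demo in the video, it works quite well.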
@@ -110,7 +115,9 @@
 The magical one line of code:
 `[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');`

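 When starting out with machine learning, **an engineering-computation tool such as Octave is recommended**, because in languages such as C++ or Java the equivalent code needs complex libraries and a lot of boilerplate, which costs time. Consider moving to another language to build the system only after you have learned the material.
 **Programming language advice**

 When starting out with machine learning, **an engineering-computation programming environment such as Octave is recommended**, because in languages such as C++ or Java the equivalent code needs complex libraries and a lot of boilerplate, which costs time. Consider moving to another language to build the system only after you have learned the material.
 In addition, when **prototyping**, it is also better to start with a computation-friendly environment like Octave, and to port the model to another high-level programming language only once it works.

 > Note: Octave's syntax is close to MATLAB's. Because MATLAB is commercial software, the course uses Octave, which is open source and free.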
@@ -136,9 +143,9 @@

 2. **Problem-solving model**

 ：![](image/20180105_212048.png)
 ![](image/20180105_212048.png)

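 Here $h$ denotes the result function, also called the **hypothesis**. This result function takes the input (the size of a house) and outputs a prediction (the price of the house).
 Here $h$ denotes the result function, also called the **hypothesis**. The function $h$ takes the input (the size of a house) and outputs a prediction (the price of the house); that is, it is a mapping $X\to Y$.

 $h_\theta(x)=\theta_0+\theta_1x$ is one workable form.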

@@ -150,7 +157,7 @@ $h_\theta(x)=\theta_0+\theta_1x$ is one workable form.

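 ## 2.2 Cost Function

 The goal is to find the value of $\theta$ for which the prediction $h_\theta(x)$ is closest to the actual result $y$, so the problem can be stated as **minimizing $\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})$**.
 Our goal is to find the value of $\theta$ for which the prediction $h_\theta(x)$ is closest to the actual result $y$, so the problem can be stated as **minimizing $\sum\limits_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})$**.

 > $m$: the total number of samples in the training set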
 >
@@ -166,9 +173,9 @@ $h_\theta(x)=\theta_0+\theta_1x$ is one workable form.

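 To find the minimum, we introduce the concept of a cost function, which measures the modeling error. Since we want a minimum, we model the sum with a quadratic function, i.e. the squared-error loss from statistics (least squares):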

 $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}}} $$ 
 $$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}^{(i)}- y^{(i)} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$$

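 > The factor $\frac{1}{2}$ does not change which $\theta$ minimizes $J$ whether it is present or not; it is there to simplify the algebra when applying gradient descent.
 > The factor $\frac{1}{2}$ does not change which $\theta$ minimizes $J$ whether it is present or not; it is there to simplify the algebra when applying gradient descent, because the derivative of the square cancels the $\frac{1}{2}$.

 At this point, our problem has become **minimizing $J\left( \theta_0, \theta_1  \right)$**.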

@@ -183,13 +190,11 @@ $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\the
 - Cost Function: $J\left( \theta_0, \theta_1  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}}}$
 - Goal: $\underset{\theta_0, \theta_1}{\text{minimize}} J \left(\theta_0, \theta_1 \right)$

 To get an intuitive sense of what the cost function is doing, first assume $\theta_0 = 0$ and suppose the training set has three data points: $\left(1, 1\right), \left(2, 2\right), \left(3, 3\right)$.

 <!-->TODO: could be replaced with an animated image<-->
 To get an intuitive sense of what the cost function is doing, first assume $\theta_0 = 0$ and suppose the training set has three data points: $\left(1, 1\right), \left(2, 2\right), \left(3, 3\right)$. We can then plot $h_\theta\left(x\right)$ in the plane and analyze how $J\left(\theta_0, \theta_1\right)$ changes.

 ![](image/20180106_085915.png)

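 In the figure above, $J\left(\theta_0, \theta_1\right)$ varies with $\theta_1$; **when $\theta_1 = 1$, $J\left(\theta_0, \theta_1 \right) = 0$, its minimum.**
 In the right-hand plot, $J\left(\theta_0, \theta_1\right)$ varies with $\theta_1$; we can see that **when $\theta_1 = 1$, $J\left(\theta_0, \theta_1 \right) = 0$, its minimum,** which corresponds to the cyan line in the left-hand plot, i.e. the case where the function $h$ fits the data best.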
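
The curve described above can be reproduced numerically. The following is a minimal sketch in Python (the notes themselves use Octave), holding $\theta_0$ at 0 and evaluating $J$ on the three training points for a few values of $\theta_1$. The function names `hypothesis` and `cost` and the chosen $\theta_1$ values are illustrative assumptions, not part of the course material.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((h_theta(x) - y)^2)."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Training set from the notes: (1, 1), (2, 2), (3, 3)
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# theta0 is held at 0; the theta1 values are arbitrary sample points (assumption)
costs = {theta1: cost(0.0, theta1, xs, ys) for theta1 in (0.0, 0.5, 1.0, 1.5)}
for theta1, j in costs.items():
    print(f"theta1 = {theta1:.1f} -> J = {j:.4f}")
```

The cost is symmetric around $\theta_1 = 1$ for this data set and drops to exactly 0 there, which matches the bowl-shaped curve in the plot.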

 ## 2.4 Cost Function - Intuition II

@@ -203,11 +208,17 @@ $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\the

 ![](image/20180106_090904.png)

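 Because a 3-D plot is hard to annotate, it is converted into a **contour plot**, which is used below to build intuition.
 Because a 3-D plot is hard to annotate, it is converted into a **contour plot**. The contour plot (the right-hand plot in the figures below) is used to build intuition; each ring of a single color marks one height (the same value of $J\left(\theta\right)$).

 When $\theta_0 = 360, \theta_1 =0$: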

 ![](image/0f38a99c8ceb8aa5b90a5f12136fdf43.png)

 At roughly $\theta_0 = 250, \theta_1 = 0.12$:

 ![](image/20180106_092119.png)

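 In the contour plot on the right, each ring of a single color marks one height (the same value of $J\left(\theta\right)$). The innermost point (the red dot) is the lowest point of the surface, i.e. the minimum of the cost function. The corresponding fit of $h_\theta\left(x\right)$ to the data is shown on the left; it clearly fits well, so the predictions should be fairly accurate.
 In the figure above, the innermost point (the red dot) is very nearly the lowest point of the surface, i.e. the minimum of the cost function. The corresponding fit of $h_\theta\left(x\right)$ to the data is shown on the left; it clearly fits well, so the predictions should be fairly accurate.

 ## 2.5 Gradient Descent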

@@ -222,11 +233,11 @@ $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\the
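 The video uses the analogy of walking down a mountain: we stand somewhere near the top and, to get down, repeatedly look around to see **which next step** descends fastest, then **take that step**, repeating until we reach some **flat ground** at the bottom.

 The gradient descent formula is: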
 $$
 {{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)
 $$

 > ${\theta }_{j}$: the parameter for the $j$-th feature
 	repeat until convergence:
 		${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)$

 > ${\theta }_{j}$: the parameter for the $j$-th feature
 >
 > ":=": the assignment operator
 >
@@ -246,7 +257,9 @@ $$

 ![](image/20180106_184926.png)

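 Take the red point as the starting point. The slope of the red line tangent at that point shows that $J\left(\theta\right)$ has a **positive slope** there, i.e. a **positive derivative**, so by the gradient descent formula ${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)$, $\theta_1$ **moves to the left**. This repeats until convergence (a local minimum is reached, i.e. the slope is 0; of course, if $\theta$ already starts at a local minimum, gradient descent does nothing).
 Take the red point as the starting point. The slope of the red line tangent at that point shows that $J\left(\theta\right)$ has a **positive slope** there, i.e. a **positive derivative**, so by the gradient descent formula ${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)$, $\theta_1$ **moves to the left**. This repeats until convergence (a local minimum is reached, i.e. the slope is 0).

 Of course, if $\theta$ already starts at a local minimum, gradient descent does nothing ($\theta_1 := \theta_1 - \alpha*0$).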
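
The direction of a single update can be checked with a small sketch. It assumes the one-parameter case $h_\theta(x) = \theta_1 x$ (with $\theta_0$ fixed at 0) on the training points $(1,1), (2,2), (3,3)$, for which the derivative of $J$ is $\frac{1}{m}\sum_{i=1}^{m}(\theta_1 x^{(i)} - y^{(i)})x^{(i)}$. The names `dJ_dtheta1` and `step` and the learning rate 0.1 are illustrative assumptions.

```python
def dJ_dtheta1(theta1, xs, ys):
    """Derivative of J(theta1) = 1/(2m) * sum((theta1*x - y)^2), theta0 fixed at 0."""
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m


def step(theta1, alpha, xs, ys):
    """One gradient descent update: theta1 := theta1 - alpha * dJ/dtheta1."""
    return theta1 - alpha * dJ_dtheta1(theta1, xs, ys)


xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
alpha = 0.1  # assumed learning rate, chosen only for illustration

right_of_min = step(2.0, alpha, xs, ys)  # positive slope: theta1 moves left
left_of_min = step(0.0, alpha, xs, ys)   # negative slope: theta1 moves right
at_min = step(1.0, alpha, xs, ys)        # zero slope: theta1 stays where it is

print(right_of_min, left_of_min, at_min)
```

Starting to the right of the minimum decreases $\theta_1$, starting to the left increases it, and starting exactly at $\theta_1 = 1$ leaves it unchanged.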

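 > If you are not familiar with slope, think of it as the height of the triangle in the figure divided by its horizontal length; the exact way to find the slope is to take the derivative.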

@@ -286,7 +299,7 @@ $$

 ![](image/20180106_203726.png)

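 The video gives the partial-derivative formulas for $j = 0, j = 1$ directly; the derivation is given here:
 For $j = 0, j = 1$, the derivation of the partial-derivative formulas is as follows: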

 $\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)=\frac{\partial}{\partial\theta_j} \left(\frac{1}{2m}\sum\limits_{i=1}^{m}{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}} \right)=$

@@ -306,7 +319,11 @@ $\frac{\partial}{\partial\theta_1} J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m}{{\

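 The gradient descent discussed so far is batch gradient descent, i.e. every update uses the entire data set $\left(\sum\limits_{i=1}^{m}\right)$.

 Solving with loops makes the code verbose. Later we will cover how to use **vectorization** to simplify the code and speed up the computation, so that gradient descent runs faster and better.
 Because the cost function of linear regression is **bowl-shaped** and has **only one** global optimum, gradient descent **always** converges to the global minimum (provided the learning rate is not too large). The function $J$ is called a convex quadratic function, and finding the minimum for linear regression is a **convex optimization problem**.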

 ![](image/24e9420f16fdd758ccb7097788f879e7.png)

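 In addition, solving with loops makes the code verbose. Later we will cover how to use **vectorization** to simplify the code and speed up the computation, so that gradient descent runs faster and better.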
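
As a rough illustration of the loop-based form, here is a minimal Python sketch of batch gradient descent for $h_\theta(x)=\theta_0+\theta_1x$, where every update uses all $m$ samples and both parameters are updated simultaneously. The function name `batch_gradient_descent`, the learning rate, the iteration count, and the zero initialization are assumptions made for this example, not values from the course.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent (loop-based)."""
    theta0, theta1 = 0.0, 0.0  # assumed starting point
    m = len(xs)
    for _ in range(iterations):
        # Prediction errors over the whole training set (the "batch")
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Simultaneous update: both gradients use the old theta values
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
print(f"theta0 = {theta0:.6f}, theta1 = {theta1:.6f}")
```

On this data set the parameters approach $\theta_0 = 0$ and $\theta_1 = 1$, the single global minimum of the convex cost function. A much larger learning rate would make the updates diverge instead.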

 # 3 Linear Algebra Review

@@ -314,13 +331,315 @@ $\frac{\partial}{\partial\theta_1} J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m}{{\

 ## 3.1 Matrices and Vectors

 Octave/Matlab code:

 ```matlab
 % The ; denotes we are going back to a new row.
 A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12]

 % Initialize a vector 
 v = [1;2;3] 

 % Get the dimension of the matrix A where m = rows and n = columns
 [m,n] = size(A)

 % You could also store it this way
 dim_A = size(A)

 % Get the dimension of the vector v 
 dim_v = size(v)

 % Now let's index into the 2nd row 3rd column of matrix A
 A_23 = A(2,3)
 ```

 Output:

 ```
 A =

    1    2    3
    4    5    6
    7    8    9
   10   11   12

 v =

   1
   2
   3

 m =  4
 n =  3
 dim_A =

   4   3

 dim_v =

   3   1

 A_23 =  6
 ```

 ## 3.2 Addition and Scalar Multiplication

 Octave/Matlab code:

 ```matlab
 % Initialize matrix A and B 
 A = [1, 2, 4; 5, 3, 2]
 B = [1, 3, 4; 1, 1, 1]

 % Initialize constant s 
 s = 2

 % See how element-wise addition works
 add_AB = A + B 

 % See how element-wise subtraction works
 sub_AB = A - B

 % See how scalar multiplication works
 mult_As = A * s

 % Divide A by s
 div_As = A / s

 % What happens if we have a Matrix + scalar?
 add_As = A + s
 ```

 Output:

 ```
 A =

   1   2   4
   5   3   2

 B =

   1   3   4
   1   1   1

 s =  2
 add_AB =

   2   5   8
   6   4   3

 sub_AB =

   0  -1   0
   4   2   1

 mult_As =

    2    4    8
   10    6    4

 div_As =

   0.50000   1.00000   2.00000
   2.50000   1.50000   1.00000

 add_As =

   3   4   6
   7   5   4
 ```
 ## 3.3 Matrix Vector Multiplication

 Octave/Matlab code:

 ```matlab
 % Initialize matrix A 
 A = [1, 2, 3; 4, 5, 6;7, 8, 9] 

 % Initialize vector v 
 v = [1; 1; 1] 

 % Multiply A * v
 Av = A * v


 ```

 Output:

 ```
 A =

   1   2   3
   4   5   6
   7   8   9

 v =

   1
   1
   1

 Av =

    6
   15
   24

 ```
 ## 3.4 Matrix Matrix Multiplication

 Octave/Matlab code:

 ```matlab
 % Initialize a 3 by 2 matrix 
 A = [1, 2; 3, 4;5, 6]

 % Initialize a 2 by 1 matrix 
 B = [1; 2] 

 % We expect a resulting matrix of (3 by 2)*(2 by 1) = (3 by 1) 
 mult_AB = A*B

 % Make sure you understand why we got that result
 ```

 Output:

 ```
 A =

   1   2
   3   4
   5   6

 B =

   1
   2

 mult_AB =

    5
   11
   17

 ```
 ## 3.5 Matrix Multiplication Properties

 Octave/Matlab code:

 ```matlab
 % Initialize random matrices A and B 
 A = [1,2;4,5]
 B = [1,1;0,2]

 % Initialize a 2 by 2 identity matrix
 I = eye(2)

 % The above notation is the same as I = [1,0;0,1]

 % What happens when we multiply I*A ? 
 IA = I*A 

 % How about A*I ? 
 AI = A*I 

 % Compute A*B 
 AB = A*B 

 % Is it equal to B*A? 
 BA = B*A 

 % Note that IA = AI but AB != BA
 ```

 Output:

 ```
 A =

   1   2
   4   5

 B =

   1   1
   0   2

 I =

 Diagonal Matrix

   1   0
   0   1

 IA =

   1   2
   4   5

 AI =

   1   2
   4   5

 AB =

    1    5
    4   14

 BA =

    5    7
    8   10
 ```
 ## 3.6 Inverse and Transpose

 Octave/Matlab code:

 ```matlab
 % Initialize matrix A 
 A = [1,2,0;0,5,6;7,0,9]

 % Transpose A 
 A_trans = A' 

 % Take the inverse of A 
 A_inv = inv(A)

 % What is A^(-1)*A? 
 A_invA = inv(A)*A


 ```

 Output:

 ```
 A =

   1   2   0
   0   5   6
   7   0   9

 A_trans =

   1   0   7
   2   5   0
   0   6   9

 A_inv =

   0.348837  -0.139535   0.093023
   0.325581   0.069767  -0.046512
  -0.271318   0.108527   0.038760

 A_invA =

   1.00000  -0.00000   0.00000
   0.00000   1.00000  -0.00000
  -0.00000   0.00000   1.00000

 ```