diff --git a/notes/image/0f38a99c8ceb8aa5b90a5f12136fdf43.png b/notes/image/0f38a99c8ceb8aa5b90a5f12136fdf43.png
new file mode 100644
index 0000000..e3e94cb
Binary files /dev/null and b/notes/image/0f38a99c8ceb8aa5b90a5f12136fdf43.png differ
diff --git a/notes/image/20180105_212048.png b/notes/image/20180105_212048.png
index 0a8cab9..cf8671c 100644
Binary files a/notes/image/20180105_212048.png and b/notes/image/20180105_212048.png differ
diff --git a/notes/image/20180106_091307.png b/notes/image/20180106_091307.png
index 81adbeb..011dd24 100644
Binary files a/notes/image/20180106_091307.png and b/notes/image/20180106_091307.png differ
diff --git a/notes/image/20180106_101659.png b/notes/image/20180106_101659.png
index 5945029..7acbf0c 100644
Binary files a/notes/image/20180106_101659.png and b/notes/image/20180106_101659.png differ
diff --git a/notes/image/24e9420f16fdd758ccb7097788f879e7.png b/notes/image/24e9420f16fdd758ccb7097788f879e7.png
new file mode 100644
index 0000000..27dbfdd
Binary files /dev/null and b/notes/image/24e9420f16fdd758ccb7097788f879e7.png differ
diff --git a/notes/week1.md b/notes/week1.md
index 206bd29..93d3bd6 100644
--- a/notes/week1.md
+++ b/notes/week1.md
@@ -33,11 +33,11 @@
 
   - Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
 
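-    This definition is informal but the earliest one. It comes from a novice checkers player who knew how to program: by playing game after game and evaluating how good each board position was, the computer kept "learning", accumulated experience, and became a strong player.
+    This definition is somewhat informal but was proposed earliest. It comes from a novice checkers player who knew how to program: his program had the computer play game after game and "learn" by repeatedly evaluating how good each board position was, so that it accumulated experience and eventually became a strong player.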
 
   - Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some **task T** and some **performance measure P**, if its performance on T, as measured by P, improves with **experience E**. 
 
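-    This is **the first formal definition of machine learning**. It is a bit of a mouthful, so the video illustrates it with spam classification, where the three letters stand for:
+    Tom Mitchell's definition is more modern, though a bit of a mouthful, so the video illustrates it with spam classification, where the three letters stand for: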
 
     - T (task): classifying emails as spam or not spam.
     - P (performance): the accuracy of the spam classification.
@@ -69,7 +69,7 @@
 
    A regression problem is one of predicting **continuous values**.
 
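-   In the house price prediction example, we are given a set of house sizes and their prices and use that data to predict the price of a house of any size.
+   In the house price prediction example, we are given a set of house sizes and their prices and use that data to predict the price of a house of any size. Another example: given a dataset of photos labeled with ages, predict the age of the person in a given photo.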
 
    ![](image/20180105_194712.png)
 
@@ -77,7 +77,9 @@
 
    A classification problem is one of predicting **discrete values**.
 
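-   That is, use the data to predict which class an object belongs to. The video uses tumor diagnosis as an example, where each tumor is classified as benign or malignant. The spam classification problem from the previous video is also a supervised classification problem.
+   That is, use the data to predict which class an object belongs to.
+
+   The video uses tumor diagnosis as an example, where each tumor is classified as benign or malignant. Spam classification is likewise a supervised classification problem.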
 
    ![](image/20180105_194839.png)
 
@@ -86,21 +88,24 @@
 
 ## 1.4 Unsupervised Learning
 
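-Unlike supervised learning, the training set carries no human-provided labels (no feedback); instead, the computer analyzes the data on its own with an unsupervised learning algorithm. The computer may group a given dataset into several distinct clusters, hence the name clustering algorithm.
+Unlike supervised learning, the training set carries no human-provided labels (no feedback): we do not supply the results, or cannot know what they should look like, and the computer alone analyzes the data with an unsupervised learning algorithm to "arrive at a result". For example, the computer may group a given dataset into several distinct clusters; an algorithm that does this is called a clustering algorithm.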
 
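-Unsupervised learning generally has two kinds:
+Unsupervised learning generally falls into two kinds:
 1. Clustering
-2. Association
+   - News aggregation
+   - Clustering individuals by DNA
+   - Astronomical data analysis
+   - Market segmentation
+   - Social network analysis
+2. Non-clustering
+   - Cocktail party problem
+
+**News aggregation**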
 
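-Some examples of unsupervised learning:
+A site such as Google News collects thousands upon thousands of news stories every day and groups them into news topics. Each of these clusters is the result of applying unsupervised learning.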
 
-- News aggregation
-- Clustering individuals by DNA
-- Social networks
-- Market segmentation
-- Astronomical data analysis
+**Cocktail party problem**
 
-**Example: the cocktail party problem**
 ![](image/20180105_201639.png)
 
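 At a cocktail party, voices overlap so much that it is hard to make out what the person in front of you is saying. Labeling data for this problem is difficult, but an unsupervised learning algorithm can separate the speaker's voice from the background music. Judging by the demo in the video, it works quite well.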
@@ -110,7 +115,9 @@
 The magic one-liner:
 `[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');`
 
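-When starting out with machine learning, **a numerical computing environment such as Octave is recommended**, because in languages such as C++ or Java the equivalent code needs complex libraries and a lot of boilerplate, which is time-consuming. Consider building systems in other languages only after you have learned the material.
+**Programming language advice**
+
+When starting out with machine learning, **a numerical programming environment such as Octave is recommended**, because in languages such as C++ or Java the equivalent code needs complex libraries and a lot of boilerplate, which is time-consuming. Consider building systems in other languages only after you have learned the material.
 Likewise, when **prototyping**, first reach for a computation-friendly environment such as Octave, and port the model to another programming language only once it works.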
 
 > Note: Octave's syntax is close to MATLAB's. Because MATLAB is commercial software, the course uses Octave, which is free and open source.
@@ -136,9 +143,9 @@
 
 2. **Problem-solving model**
 
-：![](image/20180105_212048.png)
+![](image/20180105_212048.png)
 
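-Here $h$ denotes the output function, also called the **hypothesis**. Given an input (the size of a house), this function outputs a prediction (the price of the house).
+Here $h$ denotes the output function, also called the **hypothesis**. Given an input (the size of a house), $h$ outputs a prediction (the price of the house); that is, it is a mapping $X\to Y$.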
 
 $h_\theta(x)=\theta_0+\theta_1x$ is one possible form of the hypothesis.
 
@@ -150,7 +157,7 @@ $h_\theta(x)=\theta_0+\theta_1x$ is one possible form of the hypothesis.
 
 ## 2.2 Cost Function
 
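-The goal is to find the value of $\theta$ for which the prediction $h_\theta(x)$ is closest to the actual result $y$, so the problem can be stated as **minimizing $\sum\limits_{i=0}^{m}(h_\theta(x^{(i)})-y^{(i)})$**.
+Our goal is to find the value of $\theta$ for which the prediction $h_\theta(x)$ is closest to the actual result $y$, so the problem can be stated as **minimizing the total deviation $\sum\limits_{i=1}^{m}\left|h_\theta(x^{(i)})-y^{(i)}\right|$**.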
 
 > $m$: total number of examples in the training set
 >
@@ -166,9 +173,9 @@ $h_\theta(x)=\theta_0+\theta_1x$ is one possible form of the hypothesis.
 
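 To find this minimum, we introduce the cost function, which measures the modeling error. Since we need to compute a minimum, we model the sum with a quadratic, that is, the squared-error loss from statistics (least squares):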
 
-$$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}}} $$ 
+$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}^{(i)}- y^{(i)} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$$ 
 
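-> The factor $\frac{1}{2}$ does not affect the result whether it is present or not; it is there to simplify the math when applying gradient descent.
+> The factor $\frac{1}{2}$ does not change which $\theta$ minimizes $J$; it is there to simplify gradient descent, because the 2 produced by differentiating the square cancels the $\frac{1}{2}$.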
 
 At this point, our problem has become **minimizing $J\left( \theta_0, \theta_1  \right)$**.
 
@@ -183,13 +190,11 @@ $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\the
 - Cost function: $ J\left( \theta_0, \theta_1  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}}} $
 - Goal: $\underset{\theta_0, \theta_1}{\text{minimize}} J \left(\theta_0, \theta_1 \right)$
 
-To get an intuition for what the cost function is doing, first assume $\theta_1 = 0$, and suppose the training set has three data points: $\left(1, 1\right), \left(2, 2\right), \left(3, 3\right)$.
-
-<!-->TODO: 可更换为动图<-->
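+To get an intuition for what the cost function is doing, first assume $\theta_0 = 0$ (so that $h_\theta$ passes through the origin), and suppose the training set has three data points: $\left(1, 1\right), \left(2, 2\right), \left(3, 3\right)$. We can then plot $h_\theta\left(x\right)$ in the plane and analyze how $J\left(\theta_0, \theta_1\right)$ changes.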
 
 ![](image/20180106_085915.png)
 
-In the figure above, $J\left(\theta_0, \theta_1\right)$ changes as $\theta_1$ changes; **at $\theta_1 = 1$, $J\left(\theta_0, \theta_1 \right) = 0$, its minimum.**
+In the right-hand plot, $J\left(\theta_0, \theta_1\right)$ changes as $\theta_1$ changes. **At $\theta_1 = 1$, $J\left(\theta_0, \theta_1 \right) = 0$, its minimum**, which corresponds to the cyan line in the left-hand plot, the case where $h$ fits the data best.
 
 ## 2.4 Cost Function - Intuition II
 
@@ -203,11 +208,17 @@ $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\the
 
 ![](image/20180106_090904.png)
 
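-Because a 3-D plot is hard to annotate, it is converted into a **contour plot**, which is used below to build intuition.
+Because a 3-D plot is hard to annotate, it is converted into a **contour plot** (the right-hand plot in the figures below), which is used to build intuition. Each ring of a single color marks one height, that is, one value of $J\left(\theta\right)$.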
+
+At $\theta_0 = 360, \theta_1 =0$:
+
+![](image/0f38a99c8ceb8aa5b90a5f12136fdf43.png)
+
+At roughly $\theta_0 = 250, \theta_1 =0.12$:
 
 ![](image/20180106_092119.png)
 
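-In the contour plot on the right, each ring of a single color marks one height (one value of $J\left(\theta\right)$). The center point (the red dot) is the lowest point of the surface, i.e. the minimum of the cost function. The corresponding fit of $h_\theta\left(x\right)$ to the data is shown on the left; it clearly fits well, so predictions should be fairly accurate.
+The center point (the red dot) in the figure above is very nearly the lowest point of the surface, i.e. the minimum of the cost function. The corresponding fit of $h_\theta\left(x\right)$ to the data is shown on the left; it clearly fits well, so predictions should be fairly accurate.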
 
 ## 2.5 Gradient Descent
 
@@ -222,11 +233,11 @@ $$ J\left( \theta  \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\the
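 The video uses the analogy of walking down a mountain: standing somewhere near the top, we repeatedly look around to see **which next step** takes us downhill fastest, **take that step**, and repeat until we reach **level ground** somewhere at the bottom.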
 
 The gradient descent update rule is:
-$$
-{{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)
-$$
 
-> ${\theta }_{j}​$: 第 $j​$ 个特征参数
+​	repeat until convergence:
+​		${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)$
+
+> ${\theta }_{j}$: the parameter for the $j$-th feature
 >
 > ":=": assignment operator
 >
@@ -246,7 +257,9 @@ $$
 
 ![](image/20180106_184926.png)
 
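-Take the red dot as the starting point. The slope of the red tangent line there shows that $J\left(\theta\right)$ has a **positive slope**, i.e. a **positive derivative**, at the starting point, so by the gradient descent rule ${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)$, $\theta_1$ **moves to the left**. This repeats until convergence (reaching a local minimum, where the slope is 0; of course, if $\theta$ starts at a local minimum, gradient descent does nothing).
+Take the red dot as the starting point. The slope of the red tangent line there shows that $J\left(\theta\right)$ has a **positive slope**, i.e. a **positive derivative**, at the starting point, so by the gradient descent rule ${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta_0, \theta_1  \right)$, $\theta_1$ **moves to the left**. This repeats until convergence (reaching a local minimum, where the slope is 0).
+
+Of course, if $\theta$ already starts at a local minimum, gradient descent does nothing ($\theta_1 := \theta_1 - \alpha \cdot 0$).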
 
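 > If slope is unfamiliar, think of it as the height of the triangle in the figure divided by its horizontal length; the precise way to find a slope is to take the derivative.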
 
@@ -286,7 +299,7 @@ $$
 
 ![](image/20180106_203726.png)
 
-The video states the partial derivatives for $j = 0$ and $j = 1$ directly; the derivation is as follows:
+For $j = 0$ and $j = 1$, the partial derivatives are derived as follows:
 
 $\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1)=\frac{\partial}{\partial\theta_j} \left(\frac{1}{2m}\sum\limits_{i=1}^{m}{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}} \right)=$
 
@@ -306,7 +319,11 @@ $\frac{\partial}{\partial\theta_1} J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m}{{\
 
 The gradient descent discussed so far is batch gradient descent: every update uses the entire training set $\left(\sum\limits_{i=1}^{m}\right)$.
 
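-Solving this with loops makes the code verbose; later sections cover how **vectorization** simplifies the code and speeds up the computation so that gradient descent runs faster.
+Because the cost function of linear regression is **bowl-shaped** and has **only one** optimum, the global one, gradient descent **always** converges to the global minimum (provided the learning rate is not too large). $J$ is a convex quadratic function, and minimizing it for linear regression is a **convex optimization problem**.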
+
+![](image/24e9420f16fdd758ccb7097788f879e7.png)
+
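+In addition, solving this with loops makes the code verbose; later sections cover how **vectorization** simplifies the code and speeds up the computation so that gradient descent runs faster.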
 
 # 3 Linear Algebra Review
 
@@ -314,13 +331,315 @@ $\frac{\partial}{\partial\theta_1} J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m}{{\
 
 ## 3.1 Matrices and Vectors
 
+Octave/MATLAB code:
+
+```matlab
+% A semicolon (;) ends the current row and starts a new one.
+A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12]
+
+% Initialize a vector 
+v = [1;2;3] 
+
+% Get the dimension of the matrix A where m = rows and n = columns
+[m,n] = size(A)
+
+% You could also store it this way
+dim_A = size(A)
+
+% Get the dimension of the vector v 
+dim_v = size(v)
+
+% Now let's index into the 2nd row 3rd column of matrix A
+A_23 = A(2,3)
+```
+
+Output:
+
+```
+A =
+
+    1    2    3
+    4    5    6
+    7    8    9
+   10   11   12
+
+v =
+
+   1
+   2
+   3
+
+m =  4
+n =  3
+dim_A =
+
+   4   3
+
+dim_v =
+
+   3   1
+
+A_23 =  6
+```
+
 ## 3.2 Addition and Scalar Multiplication
 
+Octave/MATLAB code:
+
+```matlab
+% Initialize matrix A and B 
+A = [1, 2, 4; 5, 3, 2]
+B = [1, 3, 4; 1, 1, 1]
+
+% Initialize constant s 
+s = 2
+
+% See how element-wise addition works
+add_AB = A + B 
+
+% See how element-wise subtraction works
+sub_AB = A - B
+
+% See how scalar multiplication works
+mult_As = A * s
+
+% Divide A by s
+div_As = A / s
+
+% What happens if we have a Matrix + scalar?
+add_As = A + s
+```
+
+Output:
+
+```
+A =
+
+   1   2   4
+   5   3   2
+
+B =
+
+   1   3   4
+   1   1   1
+
+s =  2
+add_AB =
+
+   2   5   8
+   6   4   3
+
+sub_AB =
+
+   0  -1   0
+   4   2   1
+
+mult_As =
+
+    2    4    8
+   10    6    4
+
+div_As =
+
+   0.50000   1.00000   2.00000
+   2.50000   1.50000   1.00000
+
+add_As =
+
+   3   4   6
+   7   5   4
+```
 ## 3.3 Matrix Vector Multiplication
 
+Octave/MATLAB code:
+
+```matlab
+% Initialize matrix A 
+A = [1, 2, 3; 4, 5, 6;7, 8, 9] 
+
+% Initialize vector v 
+v = [1; 1; 1] 
+
+% Multiply A * v
+Av = A * v
+
+
+```
+
+Output:
+
+```
+A =
+
+   1   2   3
+   4   5   6
+   7   8   9
+
+v =
+
+   1
+   1
+   1
+
+Av =
+
+    6
+   15
+   24
+
+```
 ## 3.4 Matrix Matrix Multiplication
 
+Octave/MATLAB code:
+
+```matlab
+% Initialize a 3 by 2 matrix 
+A = [1, 2; 3, 4;5, 6]
+
+% Initialize a 2 by 1 matrix 
+B = [1; 2] 
+
+% We expect a resulting matrix of (3 by 2)*(2 by 1) = (3 by 1) 
+mult_AB = A*B
+
+% Make sure you understand why we got that result
+```
+
+Output:
+
+```
+A =
+
+   1   2
+   3   4
+   5   6
+
+B =
+
+   1
+   2
+
+mult_AB =
+
+    5
+   11
+   17
+
+```
 ## 3.5 Matrix Multiplication Properties
 
+Octave/MATLAB code:
+
+```matlab
+% Initialize two example matrices A and B 
+A = [1,2;4,5]
+B = [1,1;0,2]
+
+% Initialize a 2 by 2 identity matrix
+I = eye(2)
+
+% The above notation is the same as I = [1,0;0,1]
+
+% What happens when we multiply I*A ? 
+IA = I*A 
+
+% How about A*I ? 
+AI = A*I 
+
+% Compute A*B 
+AB = A*B 
+
+% Is it equal to B*A? 
+BA = B*A 
+
+% Note that IA = AI but AB != BA
+```
+
+Output:
+
+```
+A =
+
+   1   2
+   4   5
+
+B =
+
+   1   1
+   0   2
+
+I =
+
+Diagonal Matrix
+
+   1   0
+   0   1
+
+IA =
+
+   1   2
+   4   5
+
+AI =
+
+   1   2
+   4   5
+
+AB =
+
+    1    5
+    4   14
+
+BA =
+
+    5    7
+    8   10
+```
 ## 3.6 Inverse and Transpose
 
+Octave/MATLAB code:
+
+```matlab
+% Initialize matrix A 
+A = [1,2,0;0,5,6;7,0,9]
+
+% Transpose A 
+A_trans = A' 
+
+% Take the inverse of A 
+A_inv = inv(A)
+
+% What is A^(-1)*A? 
+A_invA = inv(A)*A
+
+
+```
+
+Output:
+
+```
+A =
+
+   1   2   0
+   0   5   6
+   7   0   9
+
+A_trans =
+
+   1   0   7
+   2   5   0
+   0   6   9
+
+A_inv =
+
+   0.348837  -0.139535   0.093023
+   0.325581   0.069767  -0.046512
+  -0.271318   0.108527   0.038760
+
+A_invA =
+
+   1.00000  -0.00000   0.00000
+   0.00000   1.00000  -0.00000
+  -0.00000   0.00000   1.00000
+
+```
\ No newline at end of file