Hung-yi Lee (李宏毅) Deep Learning — Homework 2
Task description
Binary classification is one of the most fundamental problems in machine learning. In this tutorial, you are going to build linear binary classifiers to predict whether the income of an individual exceeds 50,000 or not. We present a discriminative and a generative approach: logistic regression (LR) and linear discriminant analysis (LDA). You are encouraged to compare the differences between the two, or explore more methodologies.
Summary: in this assignment you build a linear binary classifier that predicts, from a person's personal attributes, whether their annual income exceeds USD 50,000. The task is solved with two models, logistic regression and a generative model, and you are encouraged to compare the two.
Data description
This dataset is derived, after some processing, from the Census-Income (KDD) Data Set of the UCI Machine Learning Repository.
In fact, only the three processed files X_train, Y_train and X_test are used during training; the two raw data files train.csv and test.csv can provide you with some extra information.
The raw data was processed as follows:
- Some unnecessary attributes were removed.
- Discrete (categorical) values were one-hot encoded (see the short example after this list).
- The ratio of positive to negative labels was slightly rebalanced.
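As a quick illustration of what one-hot encoding does (this is not the actual preprocessing script used to build X_train; the column name and values below are made up), a single categorical column expands into one 0/1 indicator column per category:

```python
import pandas as pd

# Hypothetical categorical column, for illustration only
df = pd.DataFrame({'education': ['High school', 'Bachelors', 'High school']})

# One-hot encode it: one 0/1 indicator column per distinct value
print(pd.get_dummies(df, columns=['education'], dtype=int))
#    education_Bachelors  education_High school
# 0                    0                      1
# 1                    1                      0
# 2                    0                      1
```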
X_train and X_test have the same format. Open X_train in a Jupyter notebook:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

X_train_fpath = '/Users/zhucan/Desktop/李宏毅深度學(xué)習(xí)作業(yè)/第二次作業(yè)/X_train'
Y_train_fpath = '/Users/zhucan/Desktop/李宏毅深度學(xué)習(xí)作業(yè)/第二次作業(yè)/Y_train'
X_test_fpath = '/Users/zhucan/Desktop/李宏毅深度學(xué)習(xí)作業(yè)/第二次作業(yè)/X_test'
output_fpath = './output_{}.csv'

data = pd.read_csv(X_train_fpath, index_col=0)
data
```

Result:
The first row is the header and the actual data starts from the second row. The header columns are people's personal attributes, such as age, sex, education, marital status, number of children, and so on.
Open the Y_train file:

```python
target = pd.read_csv(Y_train_fpath, index_col=0)
target
```

Result:
There are only two columns: the first is the person's ID, the second is the label — 1 if annual income > 50K USD, 0 if annual income ≤ 50K USD.
Task goal
Input: a person's personal attributes
Output: 0 (annual income ≤ 50K) or 1 (annual income > 50K)
Model: logistic regression or a generative model
Solution
Preprocessing
Convert the data into numpy arrays:
```python
import numpy as np

np.random.seed(0)

X_train_fpath = '/Users/zhucan/Desktop/李宏毅深度學(xué)習(xí)作業(yè)/第二次作業(yè)/X_train'
Y_train_fpath = '/Users/zhucan/Desktop/李宏毅深度學(xué)習(xí)作業(yè)/第二次作業(yè)/Y_train'
X_test_fpath = '/Users/zhucan/Desktop/李宏毅深度學(xué)習(xí)作業(yè)/第二次作業(yè)/X_test'
output_fpath = './output_{}.csv'

# Parse csv files to numpy arrays
with open(X_train_fpath) as f:
    next(f)  # next() skips the header line
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)

print(X_train)
print(Y_train)
print(X_test)
```

Output:

```
[[33.  1.  0. ... 52.  0.  1.]
 [63.  1.  0. ... 52.  0.  1.]
 [71.  0.  0. ...  0.  0.  1.]
 ...
 [16.  0.  0. ...  8.  1.  0.]
 [48.  1.  0. ... 52.  0.  1.]
 [48.  0.  0. ...  0.  0.  1.]]
[1. 0. 0. ... 0. 0. 0.]
[[37.  1.  0. ... 52.  0.  1.]
 [48.  1.  0. ... 52.  0.  1.]
 [68.  0.  0. ...  0.  1.  0.]
 ...
 [38.  1.  0. ... 52.  0.  1.]
 [17.  0.  0. ... 40.  1.  0.]
 [22.  0.  0. ... 25.  1.  0.]]
```

Normalization
Define a normalization function _normalize() with the following arguments:
- X: the data to be processed
- train: boolean; True for the training set, False for the test set
- specified_column: the columns to be normalized; if None, all columns are normalized
- X_mean: the mean of each column of the training set
- X_std: the standard deviation of each column of the training set

Then call the function on X_train and X_test to normalize them.
```python
def _normalize(X, train=True, specified_column=None, X_mean=None, X_std=None):
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)
    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)  # 1e-8 avoids division by zero
    return X, X_mean, X_std

# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train=True)
X_test, _, _ = _normalize(X_test, train=False, specified_column=None, X_mean=X_mean, X_std=X_std)
# _ is used to discard return values that are not needed
```

Output:

```
[[-0.42755297  0.99959459 -0.1822401  ...  0.80645986 -1.01485523  1.01485523]
 [ 1.19978055  0.99959459 -0.1822401  ...  0.80645986 -1.01485523  1.01485523]
 [ 1.63373616 -1.00040556 -0.1822401  ... -1.4553617  -1.01485523  1.01485523]
 ...
 [-1.34970863 -1.00040556 -0.1822401  ... -1.10738915  0.9853622  -0.9853622 ]
 [ 0.38611379  0.99959459 -0.1822401  ...  0.80645986 -1.01485523  1.01485523]
 [ 0.38611379 -1.00040556 -0.1822401  ... -1.4553617  -1.01485523  1.01485523]]
[[-0.21057517  0.99959459 -0.1822401  ...  0.80645986 -1.01485523  1.01485523]
 [ 0.38611379  0.99959459 -0.1822401  ...  0.80645986 -1.01485523  1.01485523]
 [ 1.47100281 -1.00040556 -0.1822401  ... -1.4553617   0.9853622  -0.9853622 ]
 ...
 [-0.15633072  0.99959459 -0.1822401  ...  0.80645986 -1.01485523  1.01485523]
 [-1.29546418 -1.00040556 -0.1822401  ...  0.28450104  0.9853622  -0.9853622 ]
 [-1.02424193 -1.00040556 -0.1822401  ... -0.36794749  0.9853622  -0.9853622 ]]
```

Splitting the training and validation sets
Split the original X_train with the ratio train:dev = 9:1. No shuffling is done here; the split is fixed.

```python
def _train_dev_split(X, Y, dev_ratio=0.25):
    # This function splits data into a training set and a development set.
    train_size = int(len(X) * (1 - dev_ratio))
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]

# Split the data into a training set and a development (validation) set
dev_ratio = 0.1
X_train, Y_train, X_dev, Y_dev = _train_dev_split(X_train, Y_train, dev_ratio=dev_ratio)

train_size = X_train.shape[0]  # training set
dev_size = X_dev.shape[0]      # development set
test_size = X_test.shape[0]    # testing set
data_dim = X_train.shape[1]
print('Size of training set: {}'.format(train_size))
print('Size of development set: {}'.format(dev_size))
print('Size of testing set: {}'.format(test_size))
print('Dimension of data: {}'.format(data_dim))
```

Output:

```
Size of training set: 48830
Size of development set: 5426
Size of testing set: 27622
Dimension of data: 510
```

The following helper functions may be used repeatedly during training.
```python
def _shuffle(X, Y):
    # This function shuffles two equal-length lists/arrays, X and Y, together.
    randomize = np.arange(len(X))
    np.random.shuffle(randomize)
    return (X[randomize], Y[randomize])

def _sigmoid(z):
    # Sigmoid function can be used to calculate probability.
    # To avoid overflow, minimum/maximum output values are set.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))
    # np.clip limits all values of the array to the range [1e-8, 1 - 1e-8]:
    #     a_min: anything smaller than 1e-8 is forced to 1e-8;
    #     a_max: anything larger than 1 - 1e-8 is forced to 1 - 1e-8.

def _f(X, w, b):
    # This is the logistic regression function, parameterized by w and b.
    # Arguments:
    #     X: input data, shape = [batch_size, data_dimension]
    #     w: weight vector, shape = [data_dimension, ]
    #     b: bias, scalar
    # Output:
    #     predicted probability of each row of X being positively labeled, shape = [batch_size, ]
    return _sigmoid(np.matmul(X, w) + b)

def _predict(X, w, b):
    # This function returns a truth value prediction for each row of X
    # by rounding the result of the logistic regression function.
    return np.round(_f(X, w, b)).astype(int)  # np.int is deprecated; the built-in int is used instead

def _accuracy(Y_pred, Y_label):
    # This function calculates prediction accuracy.
    acc = 1 - np.mean(np.abs(Y_pred - Y_label))
    return acc
```

Logistic Regression
Loss function (summed cross entropy) and gradient
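The code below implements the summed cross-entropy loss and its gradients. With $f_{w,b}(x)=\sigma(w\cdot x+b)$ as defined by _f above and $\hat{y}^{n}$ the ground-truth label of sample $x^{n}$:

$$L(w,b) = -\sum_{n}\Big[\hat{y}^{n}\ln f_{w,b}(x^{n}) + \big(1-\hat{y}^{n}\big)\ln\big(1-f_{w,b}(x^{n})\big)\Big]$$

$$\frac{\partial L}{\partial w} = -\sum_{n}\big(\hat{y}^{n}-f_{w,b}(x^{n})\big)\,x^{n},\qquad \frac{\partial L}{\partial b} = -\sum_{n}\big(\hat{y}^{n}-f_{w,b}(x^{n})\big)$$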
```python
def _cross_entropy_loss(y_pred, Y_label):
    # This function computes the cross entropy.
    # Arguments:
    #     y_pred: probabilistic predictions, float vector
    #     Y_label: ground truth labels, bool vector
    # Output:
    #     cross entropy, scalar
    cross_entropy = -np.dot(Y_label, np.log(y_pred)) - np.dot((1 - Y_label), np.log(1 - y_pred))
    return cross_entropy

def _gradient(X, Y_label, w, b):
    # This function computes the gradient of the cross entropy loss with respect to weight w and bias b.
    y_pred = _f(X, w, b)
    pred_error = Y_label - y_pred
    w_grad = -np.sum(pred_error * X.T, 1)
    b_grad = -np.sum(pred_error)
    return w_grad, b_grad
```

Training
Training is done with mini-batch gradient descent. The training data is divided into many small batches; for each mini-batch we compute its gradient and loss and update the model parameters on that batch. Once an iteration (epoch) is complete, i.e. every mini-batch of the whole training set has been used once, we shuffle all training data and re-divide it into new mini-batches for the next iteration, until the preset number of iterations is reached. (A sketch of such a training loop is given after the list below.)
- GD (Gradient Descent): no batch size is used; the gradient is computed from the entire dataset, so it is accurate, but with a large dataset the computation is very slow. Also, since neural networks are usually non-convex, the network may converge to a local optimum near its initialization.
- SGD (Stochastic Gradient Descent): batch size = 1; the gradient is computed from a single sample at a time, so it is noisy and the learning rate has to be reduced.
- Mini-batch SGD: SGD with a suitably chosen batch size. The noisy mini-batch gradients help, to some extent, to keep the optimizer from falling straight into a local optimum near the initialization, while the gradient is more accurate than in plain SGD, so the learning rate can be increased.
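The post does not reproduce the code of this basic training run before quoting its results, so here is a minimal sketch of such a mini-batch loop built from the helper functions above. The hyperparameter values (max_iter, batch_size, learning_rate) and the 1/√step learning-rate decay are assumptions for illustration, not necessarily what produced the numbers below.

```python
# Minimal sketch of a mini-batch logistic regression training loop (assumed hyperparameters)
w = np.zeros((data_dim,))   # initialize weights and bias to zero
b = np.zeros((1,))

max_iter = 10        # assumed number of passes over the training set
batch_size = 8       # assumed mini-batch size
learning_rate = 0.2  # assumed base learning rate

train_loss, dev_loss, train_acc, dev_acc = [], [], [], []

step = 1  # counts parameter updates, used to decay the learning rate
for epoch in range(max_iter):
    # Reshuffle the training data at the start of every epoch
    X_train, Y_train = _shuffle(X_train, Y_train)

    # Mini-batch gradient descent
    for idx in range(int(np.floor(train_size / batch_size))):
        X = X_train[idx * batch_size:(idx + 1) * batch_size]
        Y = Y_train[idx * batch_size:(idx + 1) * batch_size]

        w_grad, b_grad = _gradient(X, Y, w, b)

        # Learning rate decays with the square root of the update count
        w = w - learning_rate / np.sqrt(step) * w_grad
        b = b - learning_rate / np.sqrt(step) * b_grad
        step += 1

    # Record loss and accuracy on the training and development sets after each epoch
    y_train_pred = _f(X_train, w, b)
    train_acc.append(_accuracy(np.round(y_train_pred), Y_train))
    train_loss.append(_cross_entropy_loss(y_train_pred, Y_train) / train_size)

    y_dev_pred = _f(X_dev, w, b)
    dev_acc.append(_accuracy(np.round(y_dev_pred), Y_dev))
    dev_loss.append(_cross_entropy_loss(y_dev_pred, Y_dev) / dev_size)

print('Training loss: {}'.format(train_loss[-1]))
print('Development loss: {}'.format(dev_loss[-1]))
print('Training accuracy: {}'.format(train_acc[-1]))
print('Development accuracy: {}'.format(dev_acc[-1]))
```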
Result:

```
Training loss: 0.27375098820698607
Development loss: 0.29846019916163835
Training accuracy: 0.8825107515871391
Development accuracy: 0.877441946185035
```

Plot the loss and accuracy curves
```python
import matplotlib.pyplot as plt

# Loss curve
plt.plot(train_loss)
plt.plot(dev_loss)
plt.title('Loss')
plt.legend(['train', 'dev'])
plt.savefig('loss.png')
plt.show()

# Accuracy curve
plt.plot(train_acc)
plt.plot(dev_acc)
plt.title('Accuracy')
plt.legend(['train', 'dev'])
plt.savefig('acc.png')
plt.show()
```

Result: the loss and accuracy curves for the training and development sets (figures not reproduced here).
Testing
```python
output_fpath = './output_{}.csv'

# Predict testing labels
predictions = _predict(X_test, w, b)
with open(output_fpath.format('logistic'), 'w') as f:
    f.write('id,label\n')
    for i, label in enumerate(predictions):
        f.write('{},{}\n'.format(i, label))

# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
    content = f.readline().strip('\n').split(',')
features = np.array(content)
for i in ind[0:10]:
    print(features[i], w[i])
```

Result:
```
Other Rel <18 never married RP of subfamily -1.5156535032617535
Other Rel <18 ever marr RP of subfamily -1.2493025752946474
Unemployed full-time 1.1489343960724647
1 0.8323252735693378
Italy -0.7951922604515268
Neither parent present -0.7749673709650178
Kentucky -0.7717486769177805
num persons worked for employer 0.7617890642364086
Householder -0.753455652297259
dividends from stocks -0.6728525747897033
```

Probabilistic generative model
The training and test data are handled exactly the same way as for logistic regression. However, because the generative model has a closed-form (analytical) optimal solution, no development (validation) set is needed.
Data preprocessing
```python
# Parse csv files to numpy arrays
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)

# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train=True)
X_test, _, _ = _normalize(X_test, train=False, specified_column=None, X_mean=X_mean, X_std=X_std)
```

Mean and covariance matrix
```python
# Compute the means of class 0 and class 1 separately
X_train_0 = np.array([x for x, y in zip(X_train, Y_train) if y == 0])
X_train_1 = np.array([x for x, y in zip(X_train, Y_train) if y == 1])

mean_0 = np.mean(X_train_0, axis=0)
mean_1 = np.mean(X_train_1, axis=0)

# Compute the covariances of class 0 and class 1 separately
cov_0 = np.zeros((data_dim, data_dim))
cov_1 = np.zeros((data_dim, data_dim))

for x in X_train_0:
    cov_0 += np.dot(np.transpose([x - mean_0]), [x - mean_0]) / X_train_0.shape[0]
for x in X_train_1:
    cov_1 += np.dot(np.transpose([x - mean_1]), [x - mean_1]) / X_train_1.shape[0]

# Shared covariance = weighted average of the per-class covariances
cov = (cov_0 * X_train_0.shape[0] + cov_1 * X_train_1.shape[0]) / (X_train_0.shape[0] + X_train_1.shape[0])
```

Computing the weights and bias
The weight vector and bias can be computed directly in closed form; see video P10 (Classification) for the derivation.
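Concretely, with class means $\mu_0,\mu_1$, shared covariance $\Sigma$, and class sizes $N_0,N_1$, the quantities computed in the code below are

$$w = \Sigma^{-1}(\mu_0-\mu_1),\qquad b = -\tfrac{1}{2}\,\mu_0^{T}\Sigma^{-1}\mu_0 + \tfrac{1}{2}\,\mu_1^{T}\Sigma^{-1}\mu_1 + \ln\frac{N_0}{N_1}$$

so that $\sigma(w\cdot x+b)=P(C_0\mid x)$, the posterior probability of class 0 — which is why the class-1 prediction below is taken as 1 minus the output of _predict.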
```python
# Compute the inverse of the covariance matrix.
# The covariance matrix may be (nearly) singular, so calling np.linalg.inv() directly can fail.
# Via SVD, a (pseudo-)inverse of the covariance matrix can be obtained quickly and accurately.
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)

# Compute w and b
w = np.dot(inv, mean_0 - mean_1)
b = (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1))\
    + np.log(float(X_train_0.shape[0]) / X_train_1.shape[0])

# Compute the accuracy on the training set.
# Do not confuse this with logistic regression: here _predict(X_train, w, b) scores class 0,
# so the class-1 prediction is 1 minus it.
Y_train_pred = 1 - _predict(X_train, w, b)
print('Training accuracy: {}'.format(_accuracy(Y_train_pred, Y_train)))
```

Result:

```
Training accuracy: 0.8719404305514598
```

Prediction:
```python
# Predict testing labels
predictions = 1 - _predict(X_test, w, b)
with open(output_fpath.format('generative'), 'w') as f:
    f.write('id,label\n')
    for i, label in enumerate(predictions):
        f.write('{},{}\n'.format(i, label))

# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
    content = f.readline().strip('\n').split(',')
features = np.array(content)
for i in ind[0:10]:
    print(features[i], w[i])
```

Result:
```
Professional specialty -7.375
7 6.8125
Retail trade 6.76953125
29 6.7109375
MSA to nonMSA -6.5
Finance insurance and real estate -6.3125
Different state same division 6.078125
Abroad -6.0
Sales -5.15625
34 -5.041015625
```

Model modification
Adding quadratic features
```python
def _add_feature(X):
    X_2 = np.power(X, 2)
    X = np.concatenate([X, X_2], axis=1)
    return X

# Add quadratic (second-order) features
X_train = _add_feature(X_train)
X_test = _add_feature(X_test)
```

Adagrad
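The adagrad loop below uses w, b, max_iter, batch_size, learning_rate and train_size without (re)defining them. A minimal sketch of the assumed initialization follows; the concrete hyperparameter values are illustrative guesses, not taken from the original run.

```python
# Minimal sketch of the initialization the adagrad loop assumes (values are illustrative)
data_dim = X_train.shape[1]      # feature dimension has doubled after _add_feature
train_size = X_train.shape[0]

w = np.zeros((data_dim,))        # weight vector
b = np.zeros((1,))               # bias

max_iter = 40        # assumed number of epochs
batch_size = 8       # assumed mini-batch size
learning_rate = 0.2  # assumed base learning rate
```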
```python
# Accumulated squared gradients required by adagrad
adagrad_w = 0
adagrad_b = 0
# Avoid division by zero in adagrad
eps = 1e-8

# Iterative training
for epoch in range(max_iter):
    # Randomly shuffle the training data at the start of each epoch
    X_train, Y_train = _shuffle(X_train, Y_train)

    # Mini-batch training
    for idx in range(int(np.floor(train_size / batch_size))):
        X = X_train[idx * batch_size:(idx + 1) * batch_size]
        Y = Y_train[idx * batch_size:(idx + 1) * batch_size]

        # Compute the gradient
        w_grad, b_grad = _gradient(X, Y, w, b)
        adagrad_w += w_grad ** 2
        adagrad_b += b_grad ** 2

        # Adagrad update of w and b
        w = w - learning_rate / (np.sqrt(adagrad_w + eps)) * w_grad
        b = b - learning_rate / (np.sqrt(adagrad_b + eps)) * b_grad
```