Machine Learning Practice 5: Support Vector Machines (SVM)

Published: 2023/11/29

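Earlier posts covered several supervised learning algorithms. This one covers the support vector machine (SVM). Compared with logistic regression and neural networks, it often gives a cleaner and sometimes more powerful way to learn complex nonlinear functions.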

Support Vector Machines

SVM hypothesis

Example Dataset 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy.io import loadmat
from sklearn import svm

mat = loadmat("ex6data1.mat")
print(mat.keys())
X = mat['X']
y = mat['y']


def plot_data(X, y):
    # Scatter plot of the two features, coloured by class label.
    plt.figure(figsize=(6, 4))
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(), cmap='rainbow')
    plt.xlabel('X1')
    plt.ylabel('X2')


plot_data(X, y)
plt.show()


def plot_boundary(clf, X):
    # Predict on a dense grid and draw the contour that separates the classes.
    x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
    y_min, y_max = X[:, 1].min() * 1.1, X[:, 1].max() * 1.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                         np.linspace(y_min, y_max, 500))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z)


# C must be passed by keyword in current scikit-learn releases.
models = [svm.SVC(C=C, kernel='linear') for C in [1, 100]]
clfs = [model.fit(X, y.ravel()) for model in models]
titles = ['SVM Decision Boundary with C = {} (Example Dataset 1)'.format(C) for C in [1, 100]]
for model, title in zip(clfs, titles):
    plot_data(X, y)  # plot_data already opens a new figure
    plot_boundary(model, X)
    plt.title(title)
    plt.show()

SVM with Gaussian Kernels

Gaussian Kernel

def gauss_kernel(x1, x2, sigma):
    # Gaussian (RBF) similarity: exp(-||x1 - x2||^2 / (2 * sigma^2))
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))
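As a quick sanity check, the sketch below evaluates the kernel on two illustrative vectors and compares the result with scikit-learn's `rbf_kernel`. The vectors and `sigma = 2` are made-up test values, not data from the exercise. scikit-learn parameterises the kernel as exp(-gamma * ||x1 - x2||^2), so the conversion assumed here is gamma = 1 / (2 * sigma^2).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def gauss_kernel(x1, x2, sigma):
    # Gaussian (RBF) similarity: exp(-||x1 - x2||^2 / (2 * sigma^2))
    return np.exp(-((x1 - x2) ** 2).sum() / (2 * sigma ** 2))


# Illustrative inputs (assumed values, chosen only for the check).
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2.0

# Squared distance is 1 + 4 + 4 = 9, so the kernel is exp(-9 / 8), about 0.3247.
k_manual = gauss_kernel(x1, x2, sigma)

# The same value through scikit-learn, using gamma = 1 / (2 * sigma^2).
gamma = 1.0 / (2 * sigma ** 2)
k_sklearn = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```

If the two numbers agree, the `gamma` passed to `svm.SVC(kernel='rbf')` corresponds to the intended `sigma`.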

Example Dataset 2

mat = loadmat('ex6data2.mat')
X2 = mat['X']
y2 = mat['y']
plot_data(X2, y2)

sigma = 0.1
# scikit-learn's RBF kernel is exp(-gamma * ||x1 - x2||^2), so gamma = 1 / (2 * sigma^2).
gamma = np.power(sigma, -2.) / 2
clf = svm.SVC(C=1, kernel='rbf', gamma=gamma)
model = clf.fit(X2, y2.flatten())
plot_data(X2, y2)
plot_boundary(model, X2)
plt.show()

Example Dataset 3

mat3 = loadmat('ex6data3.mat')
X3, y3 = mat3['X'], mat3['y']
Xval, yval = mat3['Xval'], mat3['yval']
plot_data(X3, y3)
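Dataset 3 comes with a separate validation set (`Xval`, `yval`), which suggests choosing C and sigma by validation accuracy. The post does not show that step, so what follows is only a minimal sketch of one way to do it. It runs on synthetic data standing in for `ex6data3.mat`, and the candidate grid is an assumption, not something taken from the post.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)


def make_data(n):
    # Synthetic stand-in for ex6data3: points inside a circle form class 1.
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (np.hypot(X[:, 0], X[:, 1]) < 0.6).astype(int)
    return X, y


X_train, y_train = make_data(200)
X_val, y_val = make_data(100)

# Assumed candidate values, used for both C and sigma.
candidates = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]

best_score, best_C, best_sigma = -1.0, None, None
for C in candidates:
    for sigma in candidates:
        gamma = 1.0 / (2 * sigma ** 2)  # convert sigma to scikit-learn's gamma
        clf = svm.SVC(C=C, kernel='rbf', gamma=gamma).fit(X_train, y_train)
        score = clf.score(X_val, y_val)  # accuracy on the validation set
        if score > best_score:
            best_score, best_C, best_sigma = score, C, sigma

print(best_C, best_sigma, best_score)
```

With the real data, `X_train, y_train` would be `X3, y3.ravel()` and `X_val, y_val` would be `Xval, yval.ravel()`.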

Spam Classification

Preprocessing Emails

import re
import nltk

with open('emailSample1.txt', 'r') as f:
    email = f.read()
print(email)


# Apply every normalization step except word stemming and removal of non-words.
def process_email(email):
    email = email.lower()
    # Strip HTML tags: '<', then one or more characters that are neither '<' nor '>', then '>'.
    email = re.sub(r'<[^<>]+>', ' ', email)
    # Replace URLs: match everything after '//' up to the next whitespace character.
    email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)
    email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)
    email = re.sub(r'[\$]+', 'dollar', email)
    email = re.sub(r'[\d]+', 'number', email)
    return email


# Preprocess the email and return a clean list of word tokens.
def email2TokenList(email):
    # I'll use the NLTK stemmer because it more accurately duplicates the
    # performance of the OCTAVE implementation in the assignment
    stemmer = nltk.stem.porter.PorterStemmer()
    email = process_email(email)
    # Split the email into single words; re.split() accepts several delimiters at once.
    tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\{\}\,\'\"\>\_\<\;\%]', email)
    tokenlist = []
    for token in tokens:
        # Remove any non-alphanumeric characters.
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        # Skip empty strings.
        if not len(token):
            continue
        # Use the Porter stemmer to reduce the word to its stem.
        tokenlist.append(stemmer.stem(token))
    return tokenlist
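To see what the normalization step does without the email file or NLTK, the sketch below applies the same substitutions to a single made-up sentence. The function name `normalize_email` and the sample text are illustrative assumptions. The tag pattern is written as `<[^<>]+>` so that whole tags are matched.

```python
import re


def normalize_email(text):
    # Same substitutions as the preprocessing step, applied in the same order.
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                       # HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)   # URLs
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)          # email addresses
    text = re.sub(r'[\$]+', 'dollar', text)                     # dollar signs
    text = re.sub(r'[\d]+', 'number', text)                     # digit runs
    return text


# Made-up sample text (an assumption for illustration only).
sample = 'Visit <b>http://example.com/deal</b> or mail bob@example.com to win $100'
result = normalize_email(sample)
print(result.split())
```

The order matters: dollar signs are replaced before digits, so `$100` ends up as the single token `dollarnumber`.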

Vocabulary List

# Return the indices of the vocabulary words that appear in the email.
def email2VocabIndices(email, vocab):
    token = email2TokenList(email)
    index = [i for i in range(len(vocab)) if vocab[i] in token]
    return index

Extracting Features from Emails

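# Convert an email into a feature vector of length n, where n is the size of the vocabulary.
# Entries for words that occur in the email are set to 1; all others stay 0.
def email2FeatureVector(email):
    df = pd.read_table('data/vocab.txt', names=['words'])
    vocab = df['words'].to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
    vector = np.zeros(len(vocab))  # initialize the vector
    vocab_indices = email2VocabIndices(email, vocab)  # indices of vocabulary words found in the email
    # Set the entries for the words that are present to 1.
    for i in vocab_indices:
        vector[i] = 1
    return vector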
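The mapping from tokens to a binary feature vector can be illustrated on a toy vocabulary. The sketch below is an alternative formulation, not the code used in this post: it looks each token up in a dictionary instead of scanning the token list once per vocabulary word. The vocabulary, the tokens and the function name `tokens_to_feature_vector` are all hypothetical.

```python
import numpy as np

# Hypothetical six-word vocabulary; the exercise reads the real one from vocab.txt.
vocab = ['buy', 'click', 'dollar', 'free', 'meet', 'number']
word_to_index = {word: i for i, word in enumerate(vocab)}


def tokens_to_feature_vector(tokens, word_to_index):
    # One entry per vocabulary word: 1 if the word occurs in the email, else 0.
    vector = np.zeros(len(word_to_index))
    for token in tokens:
        i = word_to_index.get(token)
        if i is not None:  # tokens outside the vocabulary are ignored
            vector[i] = 1
    return vector


# Hypothetical token list, as email2TokenList might return it.
tokens = ['click', 'here', 'free', 'dollar', 'number', 'free']
vector = tokens_to_feature_vector(tokens, word_to_index)
print(vector, int(vector.sum()))
```

Repeated words such as `free` still produce a single 1, because the feature records presence, not frequency.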

Training SVM for Spam Classification

vector = email2FeatureVector(email)
print('length of vector = {}\nnum of non-zero = {}'.format(len(vector), int(vector.sum())))

# 2.3 Training SVM for Spam Classification
# Training set
mat1 = loadmat('spamTrain.mat')
X, y = mat1['X'], mat1['y']

# Test set
mat2 = scipy.io.loadmat('spamTest.mat')
Xtest, ytest = mat2['Xtest'], mat2['ytest']

clf = svm.SVC(C=0.1, kernel='linear')
clf.fit(X, y.ravel())  # SVC expects a 1-D label array

Top Predictors for Spam

# Accuracy on the training set and on the test set
predTrain = clf.score(X, y)
predTest = clf.score(Xtest, ytest)
print(predTrain, predTest)
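The code in this section reports accuracy only. One common way to find the words that most strongly indicate spam is to rank the weights of the linear SVM, which scikit-learn exposes as `coef_` when `kernel='linear'`. The sketch below shows the idea on a tiny invented dataset; the vocabulary, the feature matrix and the labels are assumptions, not the contents of `spamTrain.mat`.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy vocabulary and data (1 = spam, 0 = not spam).
vocab = np.array(['click', 'free', 'meeting', 'project', 'winner'])
X_toy = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf_toy = svm.SVC(C=0.1, kernel='linear').fit(X_toy, y_toy)

# For a binary linear SVC, coef_ has shape (1, n_features);
# larger positive weights push the prediction towards class 1 (spam).
weights = clf_toy.coef_.ravel()
order = np.argsort(weights)[::-1]  # indices sorted by descending weight
top_words = [(str(vocab[i]), float(weights[i])) for i in order[:3]]
print(top_words)
```

On the real data the same ranking would use the trained `clf` and the vocabulary loaded from `vocab.txt`.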

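How the parameters affect the algorithm:
C = 1/λ
Large C: low bias, high variance (corresponds to small λ)
Small C: high bias, low variance (corresponds to large λ)
Large σ^2: the similarity function varies more smoothly; high bias, low variance
Small σ^2: the similarity function is more sharply peaked; low bias, high variance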
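The effect of σ can be observed directly by training the same RBF SVM with different values and comparing accuracy on the training data with accuracy on fresh data. The sketch below uses synthetic XOR-like data with 10% label noise. The data, the three σ values and C = 1 are assumptions chosen only to make the trend visible.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(1)

# Synthetic training set: XOR-like labels with 10% of them flipped.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
flip = rng.random(200) < 0.1
y[flip] = 1 - y[flip]

# Fresh, noise-free data from the same distribution.
X_new = rng.normal(size=(200, 2))
y_new = (X_new[:, 0] * X_new[:, 1] > 0).astype(int)

results = {}
for sigma in [0.05, 0.5, 5.0]:
    gamma = 1.0 / (2 * sigma ** 2)
    clf = svm.SVC(C=1, kernel='rbf', gamma=gamma).fit(X, y)
    # Store (training accuracy, accuracy on fresh data) for this sigma.
    results[sigma] = (clf.score(X, y), clf.score(X_new, y_new))

for sigma, (train_acc, new_acc) in results.items():
    print(sigma, train_acc, new_acc)
```

A very small σ tends to fit the training points closely, noise included, while a very large σ gives a boundary too smooth to capture the XOR pattern.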

Steps for using an SVM:

Use an SVM software library to solve for the parameters θ.

Need to specify:

choice of parameter C

choice of kernel (similarity function):
e.g. no kernel ('linear kernel')
Gaussian kernel
need to choose σ^2

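Logistic regression vs. SVM
Here n is the number of features and m is the number of training examples.
(1) If n is much larger than m, so that the training set is too small to support a complex nonlinear model, use logistic regression or an SVM without a kernel.
(2) If n is small and m is of intermediate size, for example n between 1 and 1000 and m between 10 and 10000, use an SVM with a Gaussian kernel.
(3) If n is small and m is large, for example n between 1 and 1000 and m greater than 50000, an SVM with a Gaussian kernel will be very slow. The remedy is to create or add more features and then use logistic regression or an SVM without a kernel.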
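The three rules of thumb can be written down as a small helper. This is only an illustration: the function name `suggest_model` is hypothetical, and the cut-offs (n >= m for rule 1, m of 50000 or more for rule 3) are assumptions that turn the example ranges into hard thresholds. The rules themselves leave the region between 10000 and 50000 examples unspecified.

```python
def suggest_model(n, m):
    """Suggest a model family from n (features) and m (training examples).

    The thresholds are assumed cut-offs derived from the example ranges,
    not precise boundaries.
    """
    if n >= m:
        # Rule 1: many features relative to the number of examples.
        return 'logistic regression or SVM without a kernel'
    if m < 50000:
        # Rule 2: few features, intermediate number of examples.
        return 'SVM with a Gaussian kernel'
    # Rule 3: few features, very many examples.
    return 'add features, then logistic regression or SVM without a kernel'


print(suggest_model(n=10000, m=1000))
print(suggest_model(n=100, m=5000))
print(suggest_model(n=100, m=100000))
```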

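It is worth noting that a neural network is likely to perform well in all three cases, but it can be very slow to train. The main reason to choose an SVM is that its cost function is convex, so the optimizer cannot get stuck in a local minimum that is not the global one.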
