4. What Makes a Good Separating Line?

An SVM is the process of finding this separating line. The farther the line sits from the points of both classes, the better and more robust it is, and the less prone it is to classification errors.
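This distance can be read directly off a fitted linear SVM: the decision boundary is w·x + b = 0, and the margin width is 2/‖w‖. A minimal sketch on a made-up toy dataset (a large C approximates a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (toy data for illustration)
X = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0], [2.5, 2.5]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1000.0)  # large C: nearly hard-margin
clf.fit(X, y)

w = clf.coef_[0]                   # normal vector of the separating hyperplane
margin = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries
print("margin width:", margin)
```

Here the closest opposing points are (0.5, 0.5) and (2, 2), so the margin width comes out close to the distance between them.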

6. SVMs and Tricky Data Distributions

The SVM's first priority is to classify correctly; subject to that constraint, it maximizes the margin.

10. SVM in sklearn

>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict([[2., 2.]])
array([1])
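Beyond predict, the fitted classifier also exposes which training points ended up defining the boundary, and how far a query point sits from it. A small self-contained sketch on the same toy data:

```python
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC(kernel="linear")  # linear kernel keeps the toy example easy to inspect
clf.fit(X, y)

print(clf.support_vectors_)               # training points that define the boundary
print(clf.predict([[2., 2.]]))            # predicted class label
print(clf.decision_function([[2., 2.]]))  # signed distance from the boundary
```

With only two training points, both are necessarily support vectors; a positive decision-function value means the query falls on class 1's side.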

12. SVM Coding

• Create the classifier -> fit -> predict -> evaluate accuracy:
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import copy
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## SVM #################################
### we handle the import statement and SVC creation for you here
from sklearn.svm import SVC
clf = SVC(kernel="linear")

#### now your job is to fit the classifier
#### using the training features/labels, and to
#### make a set of predictions on the test data
clf.fit(features_train,labels_train)

#### store your predictions in a list named pred
pred = clf.predict(features_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)  # accuracy_score expects (y_true, y_pred)

def submitAccuracy():
    return acc
• Output:
Good job! Your output matches our solution.
0.92

20. Trying Various Kernels

from sklearn.svm import SVC
clf = SVC(kernel="linear")

• kernel (linear, rbf, poly, …)
• C: controls the tradeoff between a smooth decision boundary and classifying training points correctly; a larger C classifies more training points correctly at the cost of a more intricate boundary
• gamma: defines how far the influence of a single training example reaches; low values mean far, high values mean close
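The C tradeoff can be seen directly on training accuracy. A sketch on made-up overlapping blobs, with gamma held fixed at an arbitrary value:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping blobs (toy data): not perfectly separable
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [1.5, 1.5]])
y = np.array([0] * 50 + [1] * 50)

scores = []
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    # higher C -> boundary bends harder to classify training points correctly
    scores.append(clf.score(X, y))
    print("C =", C, "training accuracy:", scores[-1])
```

Training accuracy rises with C, but a high-C boundary may be overfit; test accuracy is what should drive the choice.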

27. SVM Author ID Accuracy / Timing

from time import time
from sklearn.svm import SVC

clf = SVC(kernel="linear")

#### now your job is to fit the classifier
#### using the training features/labels, and to
#### make a set of predictions on the test data
t0 = time()
clf.fit(features_train, labels_train)
print("fit training time:", round(time() - t0, 3), "s")

#### store your predictions in a list named pred
t0 = time()
pred = clf.predict(features_test)
print("predict training time:", round(time() - t0, 3), "s")

from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)

print(acc)
• Output:
no. of Chris training emails: 7936
no. of Sara training emails: 7884
fit training time: 187.351 s
predict training time: 19.866 s
0.984072810011

Exact times may vary a bit, but in general, the SVM is MUCH slower to train and use for predicting.

29. A Smaller Training Set

features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

fit training time: 0.1 s
predict training time: 1.094 s
0.884527872582

Voice recognition and transaction blocking need to happen in real time, with almost no delay. There’s no obvious need to predict an email author instantly.

31. Deploying an RBF Kernel / Tuning the C Parameter

• C=1.0 (default)
fit training time: 0.12 s
predict training time: 1.313 s
0.616040955631

• C=10
fit training time: 0.124 s
predict training time: 1.295 s
0.616040955631
• C=100
fit training time: 0.124 s
predict training time: 1.295 s
0.616040955631
• C=1000
fit training time: 0.118 s
predict training time: 1.246 s
0.821387940842
• C=10000
fit training time: 0.117 s
predict training time: 1.062 s
0.892491467577

• Full training set (C=10000)
fit training time: 120.092 s
predict training time: 11.973 s
0.990898748578
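Scanning C values by hand, as above, can be automated with sklearn's GridSearchCV, which cross-validates each candidate and keeps the best. A sketch on made-up blob data; the parameter grid mirrors the values tried above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
# Two toy blobs standing in for the real email features
X = np.vstack([rng.randn(40, 2), rng.randn(40, 2) + [2.0, 2.0]])
y = np.array([0] * 40 + [1] * 40)

param_grid = {"C": [1.0, 10.0, 100.0, 1000.0, 10000.0]}
search = GridSearchCV(SVC(kernel="rbf", gamma="auto"), param_grid, cv=3)
search.fit(X, y)  # fits one model per (C, fold) pair and cross-validates

print("best C:", search.best_params_["C"])
print("cv accuracy:", search.best_score_)
```

For a slow-to-train SVM like the author-ID one, running the grid search on the reduced 1% training set first is a cheap way to narrow the range of C.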

35. Extracting Predictions from an SVM

print(pred[10], pred[26], pred[50])

36. How Many Emails Are Predicted to Be from Chris?

n = 0
for result in pred:
    if result == 1:
        n = n + 1

print("Chris:", n)

fit training time: 125.144 s
predict training time: 12.211 s
Chris: 877
0.990898748578
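Since predict returns a NumPy array, the counting loop above can also be written as a single vectorized comparison. A sketch, where pred is a small stand-in for the real predictions:

```python
import numpy as np

pred = np.array([1, 0, 1, 1, 0])   # stand-in for clf.predict(features_test)
n_chris = int((pred == 1).sum())   # True counts as 1, so the sum is the count
print("Chris:", n_chris)
```

On arrays the size of the email test set, the vectorized form is both shorter and faster than the explicit loop.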