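sklearn is a comprehensive and easy-to-use third-party library for machine learning in Python, and people who have used it tend to speak well of it. This post records the various ways sklearn supports cross-validation, mainly by walking through the official documentation page "Cross-validation: evaluating estimator performance". If you read English comfortably, the official documentation is recommended, since it covers these points in detail.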
First, import the required libraries and the dataset.
    In [1]: import numpy as np

    In [2]: from sklearn.model_selection import train_test_split

    In [3]: from sklearn.datasets import load_iris

    In [4]: from sklearn import svm

    In [5]: iris = load_iris()

    In [6]: iris.data.shape, iris.target.shape
    Out[6]: ((150, 4), (150,))
1.train_test_split
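Quickly shuffles the dataset and splits it into a training set and a test set.
In effect, the data is shuffled first and then divided according to the given test_size.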
    In [7]: X_train, X_test, y_train, y_test = train_test_split(
       ...:     iris.data, iris.target, test_size=.4, random_state=0)  # 60/40 train/test split

    In [8]: X_train.shape, y_train.shape
    Out[8]: ((90, 4), (90,))

    In [9]: X_test.shape, y_test.shape
    Out[9]: ((60, 4), (60,))

    In [10]: iris.data[:5]
    Out[10]:
    array([[ 5.1,  3.5,  1.4,  0.2],
           [ 4.9,  3. ,  1.4,  0.2],
           [ 4.7,  3.2,  1.3,  0.2],
           [ 4.6,  3.1,  1.5,  0.2],
           [ 5. ,  3.6,  1.4,  0.2]])

    In [11]: X_train[:5]
    Out[11]:
    array([[ 6. ,  3.4,  4.5,  1.6],
           [ 4.8,  3.1,  1.6,  0.2],
           [ 5.8,  2.7,  5.1,  1.9],
           [ 5.6,  2.7,  4.2,  1.3],
           [ 5.6,  2.9,  3.6,  1.3]])

    In [12]: clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

    In [13]: clf.score(X_test, y_test)
    Out[13]: 0.96666666666666667
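A plain random split does not guarantee that every class is equally represented in the test set. The sketch below is not from the original post: it assumes you want the same 60/40 split but with class proportions preserved, and uses the stratify parameter of train_test_split for that.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same 60/40 split as above, but stratified on the class labels
# (stratify=y is an addition for illustration, not used in the post).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print(train_counts, test_counts)
```

Since iris has 50 samples per class, the stratified split should put 30 samples of each class in the training set and 20 in the test set.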
2.cross_val_score
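Runs the specified number of cross-validation rounds on the dataset and reports a score for each round.
By default (scoring=None) each round is scored with the estimator's own score method, which is accuracy for a classifier such as SVC. Other metrics for classification or regression, for example scoring='f1_macro', are selected through the scoring parameter of cross_val_score; building a custom scorer additionally requires from sklearn import metrics.
When cv is an int, the data is split with KFold or StratifiedKFold: StratifiedKFold if the estimator is a classifier and the target is binary or multiclass, KFold otherwise (neither shuffles the data by default). KFold and StratifiedKFold are both introduced below.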
    In [15]: from sklearn.model_selection import cross_val_score

    In [16]: clf = svm.SVC(kernel='linear', C=1)

    In [17]: scores = cross_val_score(clf, iris.data, iris.target, cv=5)

    In [18]: scores
    Out[18]: array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])

    In [19]: scores.mean()
    Out[19]: 0.98000000000000009
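To make the role of the scoring and cv parameters concrete, here is a minimal sketch (an illustration, not part of the original session). It checks that, for this classifier, leaving scoring unset gives the same numbers as scoring='accuracy', and that cv=5 behaves like an explicit StratifiedKFold(n_splits=5); f1_macro is computed only to show how another metric is requested.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# No scoring argument: the classifier's own score method (accuracy) is used.
default_scores = cross_val_score(clf, X, y, cv=5)
# The same metric requested explicitly by name.
accuracy_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# A different metric, selected through the scoring parameter.
f1_scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
# An int cv with a classifier is equivalent to an unshuffled StratifiedKFold.
explicit_cv_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

print(np.allclose(default_scores, accuracy_scores))
print(np.allclose(default_scores, explicit_cv_scores))
print(f1_scores)
```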
Besides the default strategy, you can specify the cross-validation scheme yourself, for example the number of rounds and the train/test ratio.
    In [20]: from sklearn.model_selection import ShuffleSplit

    In [21]: n_samples = iris.data.shape[0]

    In [22]: cv = ShuffleSplit(n_splits=3, test_size=.3, random_state=0)

    In [23]: cross_val_score(clf, iris.data, iris.target, cv=cv)
    Out[23]: array([ 0.97777778,  0.97777778,  1.        ])
cross_val_score also accepts a pipeline, so preprocessing and the estimator can be chained and cross-validated together.
    In [24]: from sklearn import preprocessing

    In [25]: from sklearn.pipeline import make_pipeline

    In [26]: clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

    In [27]: cross_val_score(clf, iris.data, iris.target, cv=cv)
    Out[27]: array([ 0.97777778,  0.93333333,  0.95555556])
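The point of passing a pipeline is that the scaler is refit on the training part of every split, so nothing from the test part leaks into preprocessing. A rough sketch of what that amounts to, written as an explicit loop for comparison (the loop is an illustration, not how the post does it):

```python
import numpy as np
from sklearn import preprocessing, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=3, test_size=.3, random_state=0)

# Pipeline version: scaling and fitting happen inside each split.
pipeline = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
pipeline_scores = cross_val_score(pipeline, X, y, cv=cv)

# Manual version: fit the scaler on the training indices only,
# then apply it to both the training and the test samples.
manual_scores = []
for train, test in cv.split(X):
    scaler = preprocessing.StandardScaler().fit(X[train])
    model = svm.SVC(C=1).fit(scaler.transform(X[train]), y[train])
    manual_scores.append(model.score(scaler.transform(X[test]), y[test]))

print(pipeline_scores)
print(np.array(manual_scores))
```

Because random_state is fixed, both versions see the same splits and should report the same scores.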
3.cross_val_predict
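cross_val_predict is very similar to cross_val_score, but instead of returning scores it returns the estimator's predictions (class labels, or regression values). This matters for improving a model later on: by comparing the predictions with the actual targets you can pinpoint exactly which samples were predicted wrongly, which is very helpful for parameter tuning and troubleshooting.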
    In [28]: from sklearn.model_selection import cross_val_predict

    In [29]: from sklearn import metrics

    In [30]: predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)

    In [31]: predicted
    Out[31]:
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

    In [32]: metrics.accuracy_score(iris.target, predicted)
    Out[32]: 0.96666666666666667
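As a sketch of the "locate the wrong predictions" idea described above, the following compares the predictions with the targets, lists the misclassified sample indices, and builds a confusion matrix. The variable names wrong and confusion are illustrative choices, not from the original post.

```python
import numpy as np
from sklearn import metrics, preprocessing, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

iris = load_iris()
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)

# Indices of the samples whose cross-validated prediction is wrong.
wrong = np.flatnonzero(predicted != iris.target)
for i in wrong:
    print(i, iris.target_names[iris.target[i]], '->', iris.target_names[predicted[i]])

# Rows are true classes, columns are predicted classes.
confusion = metrics.confusion_matrix(iris.target, predicted)
print(confusion)
```

The off-diagonal entries of the confusion matrix count exactly the samples listed in wrong.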
4.KFold
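K-fold cross-validation. This is the standard scheme for dividing a dataset into K parts (folds): over K splits, each fold serves as the test set once while the remaining K-1 folds form the training set, so every sample appears both in a training set and in a test set. The test sets of different splits never overlap, which amounts to sampling without replacement.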
    In [33]: from sklearn.model_selection import KFold

    In [34]: X = ['a', 'b', 'c', 'd']

    In [35]: kf = KFold(n_splits=2)

    In [36]: for train, test in kf.split(X):
        ...:     print(train, test)
        ...:     print(np.array(X)[train], np.array(X)[test])
        ...:     print('\n')
        ...:
    [2 3] [0 1]
    ['c' 'd'] ['a' 'b']


    [0 1] [2 3]
    ['a' 'b'] ['c' 'd']
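The "every sample is tested exactly once" property can be checked directly. The sketch below uses a slightly larger toy array and turns on shuffle=True (an optional KFold parameter not used in the post) to show that the property holds even when the fold membership is randomized.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample lands in a test fold.
test_counts = np.zeros(len(X), dtype=int)
for train, test in kf.split(X):
    # Within one split, training and test indices never overlap.
    assert len(np.intersect1d(train, test)) == 0
    test_counts[test] += 1

print(test_counts)
```

Every entry of test_counts should be 1: each sample is used for testing once and for training in the other four splits.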
5.LeaveOneOut
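LeaveOneOut is simply a special case of KFold in which n_splits equals the number of samples. It is used often enough to get its own class, but it can be reproduced entirely with KFold.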
    In [37]: from sklearn.model_selection import LeaveOneOut

    In [38]: X = [1, 2, 3, 4]

    In [39]: loo = LeaveOneOut()

    In [41]: for train, test in loo.split(X):
        ...:     print(train, test)
        ...:
    [1 2 3] [0]
    [0 2 3] [1]
    [0 1 3] [2]
    [0 1 2] [3]

    # Reproducing LeaveOneOut with KFold
    In [42]: kf = KFold(n_splits=len(X))

    In [43]: for train, test in kf.split(X):
        ...:     print(train, test)
        ...:
    [1 2 3] [0]
    [0 2 3] [1]
    [0 1 3] [2]
    [0 1 2] [3]
6.LeavePOut
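LeavePOut looks a lot like LeaveOneOut, but unlike LeaveOneOut it is not a special case of KFold: it enumerates every possible way of holding out p samples, which gives C(n, p) splits, and for p > 1 the test sets of different splits overlap.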
    In [44]: from sklearn.model_selection import LeavePOut

    In [45]: X = np.ones(4)

    In [46]: lpo = LeavePOut(p=2)

    In [47]: for train, test in lpo.split(X):
        ...:     print(train, test)
        ...:
    [2 3] [0 1]
    [1 3] [0 2]
    [1 2] [0 3]
    [0 3] [1 2]
    [0 2] [1 3]
    [0 1] [2 3]
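The six splits printed above are all C(4, 2) ways of choosing 2 test samples out of 4, and each sample shows up in several test sets. A small sketch that counts this, using math.comb from the standard library for the expected number:

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)

# LeavePOut generates one split per combination of p held-out samples.
n_splits = lpo.get_n_splits(X)
expected_splits = comb(len(X), 2)

# Count how many test sets each sample belongs to.
test_counts = np.zeros(len(X), dtype=int)
for train, test in lpo.split(X):
    test_counts[test] += 1

print(n_splits, expected_splits, test_counts)
```

With n = 4 and p = 2 there are 6 splits and every sample is tested 3 times, which is why LeavePOut gets expensive quickly as n grows.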
7.ShuffleSplit
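At first glance ShuffleSplit is used much like LeavePOut, but the two are quite different. LeavePOut enumerates the test sets exhaustively, so the test sets taken together are guaranteed to cover the whole dataset. ShuffleSplit draws every split independently at random: within one split the training and test sets are disjoint, but across splits a sample can land in the test set several times or not at all (sampling with replacement across splits), so full coverage of the dataset can only be expected after a sufficiently large number of splits.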
    In [48]: from sklearn.model_selection import ShuffleSplit

    In [49]: X = np.arange(5)

    In [50]: ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)

    In [51]: for train_index, test_index in ss.split(X):
        ...:     print(train_index, test_index)
        ...:
    [1 3 4] [2 0]
    [1 4 3] [0 2]
    [4 0 2] [1 3]
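In the output above, index 4 never reaches the test set while indices 0 and 2 are tested twice. The following sketch tallies test-set membership for the same configuration, so the uneven coverage can be seen as counts rather than read off the printed splits (which samples repeat may differ between library versions, so no specific counts are claimed here).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(5)
ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)

test_counts = np.zeros(len(X), dtype=int)  # times each sample is tested
split_sizes = []                           # (train size, test size) per split
for train, test in ss.split(X):
    # Inside a single split the two index sets are disjoint.
    assert len(np.intersect1d(train, test)) == 0
    split_sizes.append((len(train), len(test)))
    test_counts[test] += 1

print(split_sizes)
print(test_counts)
```

With 5 samples and test_size=.25 the test set is rounded up to 2 samples, leaving 3 for training in every split.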
8.StratifiedKFold
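This one is more interesting: it is KFold with stratification. The folds are built from the class labels y so that each test fold keeps roughly the same class proportions as the full dataset, and, as with KFold, the test folds do not overlap.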
    In [52]: from sklearn.model_selection import StratifiedKFold

    In [53]: X = np.ones(10)

    In [54]: y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

    In [55]: skf = StratifiedKFold(n_splits=3)

    In [56]: for train, test in skf.split(X, y):
        ...:     print(train, test)
        ...:
    [2 3 6 7 8 9] [0 1 4 5]
    [0 1 3 4 5 8 9] [2 6 7]
    [0 1 2 4 5 6 7] [3 8 9]
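To see what stratification buys, it helps to compare against plain KFold on the same labels. This is a minimal sketch under the same toy data; it only records which classes appear in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Classes present in each test fold with plain, unshuffled KFold.
kfold_classes = [np.unique(y[test]).tolist()
                 for _, test in KFold(n_splits=3).split(X)]

# Classes present in each test fold with StratifiedKFold.
stratified_classes = [np.unique(y[test]).tolist()
                      for _, test in StratifiedKFold(n_splits=3).split(X, y)]

print(kfold_classes)
print(stratified_classes)
```

Because the labels are sorted, unshuffled KFold produces test folds containing a single class, whereas every StratifiedKFold test fold contains both classes.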
9.GroupKFold
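GroupKFold resembles StratifiedKFold in that it uses extra information when building the folds, but it splits by group rather than by class: all samples sharing a group label stay together, so a group never appears in both the training set and the test set of the same split. In other words, the data is first divided into piles, the piles are distributed across the folds, and each pile stays intact.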
    In [57]: from sklearn.model_selection import GroupKFold

    In [58]: X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]

    In [59]: y = ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']

    In [60]: groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

    In [61]: gkf = GroupKFold(n_splits=3)

    In [62]: for train, test in gkf.split(X, y, groups=groups):
        ...:     print(train, test)
        ...:
    [0 1 2 3 4 5] [6 7 8 9]
    [0 1 2 6 7 8 9] [3 4 5]
    [3 4 5 6 7 8 9] [0 1 2]
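The defining guarantee of GroupKFold is that no group is shared between the training and test side of a split. A short sketch that checks this on the same data (the list name shared is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10])
y = np.array(['a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

# For every split, collect the group labels present on both sides.
shared = []
for train, test in GroupKFold(n_splits=3).split(X, y, groups=groups):
    shared.append(np.intersect1d(groups[train], groups[test]).tolist())

print(shared)
```

Each entry of shared should be an empty list, which is what makes GroupKFold suitable when samples from the same source (for example the same patient or user) must not leak across the split.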
10.LeaveOneGroupOut
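LeaveOneGroupOut leaves even less to chance than GroupKFold: each split holds out exactly one of the given groups as the test set, so there are as many splits as there are distinct groups.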
    In [63]: from sklearn.model_selection import LeaveOneGroupOut

    In [64]: X = [1, 5, 10, 50, 60, 70, 80]

    In [65]: y = [0, 1, 1, 2, 2, 2, 2]

    In [66]: groups = [1, 1, 2, 2, 3, 3, 3]

    In [67]: logo = LeaveOneGroupOut()

    In [68]: for train, test in logo.split(X, y, groups=groups):
        ...:     print(train, test)
        ...:
    [2 3 4 5 6] [0 1]
    [0 1 4 5 6] [2 3]
    [0 1 2 3] [4 5 6]
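Group-based splitters can also be plugged into cross_val_score by passing the group labels through its groups parameter. The sketch below does this on iris; note that the groups array is entirely hypothetical (iris has no natural grouping) and is made up here only so that there is something to hold out.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = load_iris(return_X_y=True)

# Hypothetical grouping for illustration: assign samples to 5 groups round-robin.
groups = np.arange(len(y)) % 5

logo = LeaveOneGroupOut()
clf = svm.SVC(kernel='linear', C=1)

# One score per held-out group.
scores = cross_val_score(clf, X, y, groups=groups, cv=logo)
n_splits = logo.get_n_splits(groups=groups)

print(n_splits, scores)
```

The number of scores equals the number of distinct groups, here 5.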
11.LeavePGroupsOut
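Not much to add here: it works like LeaveOneGroupOut, except that n_groups groups are held out in each split instead of a single one.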
    from sklearn.model_selection import LeavePGroupsOut

    X = np.arange(6)
    y = [1, 1, 1, 2, 2, 2]
    groups = [1, 1, 2, 2, 3, 3]
    lpgo = LeavePGroupsOut(n_groups=2)
    for train, test in lpgo.split(X, y, groups=groups):
        print(train, test)

    [4 5] [0 1 2 3]
    [2 3] [0 1 4 5]
    [0 1] [2 3 4 5]
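As with LeavePOut, the number of splits is a binomial coefficient, only counted over groups instead of samples. A brief sketch on the same data that lists which groups are held out in each split:

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)
y = np.array([1, 1, 1, 2, 2, 2])
groups = np.array([1, 1, 2, 2, 3, 3])

lpgo = LeavePGroupsOut(n_groups=2)

# Number of splits: ways of choosing 2 groups out of the 3 distinct groups.
n_splits = lpgo.get_n_splits(X, y, groups)
expected_splits = comb(len(np.unique(groups)), 2)

# Group labels held out in each split.
held_out = [np.unique(groups[test]).tolist()
            for _, test in lpgo.split(X, y, groups=groups)]

print(n_splits, expected_splits, held_out)
```

With three groups and n_groups=2 there are 3 splits, each holding out a different pair of groups.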
12.GroupShuffleSplit
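GroupShuffleSplit is the group-wise counterpart of ShuffleSplit: each split draws a random subset of groups for the test set, so across splits the same group can be drawn repeatedly (sampling with replacement).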
    In [75]: from sklearn.model_selection import GroupShuffleSplit

    In [76]: X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, .001]

    In [77]: y = ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'a']

    In [78]: groups = [1, 1, 2, 2, 3, 3, 4, 4]

    In [79]: gss = GroupShuffleSplit(n_splits=4, test_size=.5, random_state=0)

    In [80]: for train, test in gss.split(X, y, groups=groups):
        ...:     print(train, test)
        ...:
    [0 1 2 3] [4 5 6 7]
    [2 3 6 7] [0 1 4 5]
    [2 3 4 5] [0 1 6 7]
    [4 5 6 7] [0 1 2 3]
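One detail worth spelling out is that test_size here is a fraction of the groups, not of the samples. The sketch below records which groups are drawn for the test set in each split and checks that groups are never divided between the two sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
X = np.zeros(len(groups))  # feature values are irrelevant for the split itself

gss = GroupShuffleSplit(n_splits=4, test_size=.5, random_state=0)

test_groups = []
for train, test in gss.split(X, groups=groups):
    # A group is either entirely in training or entirely in test.
    assert len(np.intersect1d(groups[train], groups[test])) == 0
    test_groups.append(np.unique(groups[test]).tolist())

print(test_groups)
```

With four groups and test_size=.5, each split places two whole groups in the test set.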
13.TimeSeriesSplit
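For time series data. To prevent the model from using future data, each split trains on a leading portion of the series and tests on the samples that immediately follow it. (Describing this as cutting the data from front to back is not quite accurate, because the training window keeps growing from one split to the next.)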
    In [81]: from sklearn.model_selection import TimeSeriesSplit

    In [82]: X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

    In [83]: tscv = TimeSeriesSplit(n_splits=3)

    In [84]: for train, test in tscv.split(X):
        ...:     print(train, test)
        ...:
    [0 1 2] [3]
    [0 1 2 3] [4]
    [0 1 2 3 4] [5]
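The "no future data" property means that in every split all training indices come before all test indices. A small sketch that verifies this for the same shape of data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 time-ordered samples, 2 features each
tscv = TimeSeriesSplit(n_splits=3)

splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in splits:
    # Every training index precedes every test index.
    print(train, test, max(train) < min(test))
```

Each later split reuses the previous training window and extends it by the samples that were just tested.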
Original article: https://blog.csdn.net/xiaodongxiexie/article/details/71915259