Preface
I recently had a task at work that required pulling statistics out of CDN logs: traffic, status-code counts, top IPs, URLs, User-Agents, Referers, and so on. I used to do this with bash shell scripts, but once a log grows to several GB and tens of millions of lines, shell becomes inadequate and processing takes far too long. So I looked into pandas, the Python data-analysis library. It processes ten million lines of log in about 40 seconds.
The Code
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'Loya Chen'

import sys
import pandas as pd
from collections import OrderedDict

"""
Description: This script is used to analyse qiniu cdn logs.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Log sample
 [0]           [1][2]  [3]                    [4]    [5]
101.226.66.179 -  68   [16/Nov/2016:04:36:40 +0800] "GET http://www.qn.com/1.jpg -"
[6] [7] [8]  [9]
200 502 "-"  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""

if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    exit()
else:
    log_file = sys.argv[1]

# Positions of the fields to collect, by index in the log line
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9

# Read the log into a DataFrame, chunk by chunk
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunkSize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped
        loop = False
df = pd.concat(chunks, ignore_index=True)

byte_sum = df[size].sum()                                       # total traffic
top_status_code = pd.DataFrame(df[status_code].value_counts())  # status code counts
top_ip = df[ip].value_counts().head(10)                         # TOP IP
top_referer = df[referer].value_counts().head(10)               # TOP Referer
top_ua = df[ua].value_counts().head(10)                         # TOP User-Agent
top_status_code['percent'] = pd.DataFrame(top_status_code / top_status_code.sum() * 100)
top_url = df[url].value_counts().head(10)                       # TOP URL
# URLs with the most traffic
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float) / 1024 / 1024) \
               .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)
# IPs with the most traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float) / 1024 / 1024) \
              .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)

# Store the results in an ordered dict
result = OrderedDict([("Total traffic [GB]:", byte_sum / 1024 / 1024 / 1024),
                      ("Status codes [count|percent]:", top_status_code),
                      ("IP TOP 10:", top_ip),
                      ("Referer TOP 10:", top_referer),
                      ("UA TOP 10:", top_ua),
                      ("URL TOP 10:", top_url),
                      ("URL TOP 10 by traffic [MB]:", top_url_byte),
                      ("IP TOP 10 by traffic [MB]:", top_ip_byte)])

# Print the results
for k, v in result.items():
    print(k)
    print(v)
    print('=' * 80)
```
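As an aside, the manual get_chunk()/StopIteration loop above can be written more compactly with the chunksize argument of read_table, which yields the file as an iterator of DataFrame chunks. A minimal sketch of just the loading step, assuming the same space-separated log file (the chunk size of one million rows is an arbitrary choice here, not a tuned value):

```python
import sys
import pandas as pd

log_file = sys.argv[1]

# With chunksize set, read_table returns an iterable of DataFrames
# instead of one big frame; pd.concat stitches the chunks together.
chunks = pd.read_table(log_file, sep=' ', names=list(range(10)), chunksize=1000000)
df = pd.concat(chunks, ignore_index=True)
```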
pandas Study Notes
pandas has two fundamental data structures: Series and DataFrame. A Series is a one-dimensional array-like object consisting of a set of values and an associated index. A DataFrame is a tabular data structure with both a row index and a column index.
```python
from pandas import Series, DataFrame
import pandas as pd
```
Series
```python
In [1]: obj = Series([4, 7, -5, 3])

In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3
```
A Series is displayed with the index on the left and the values on the right. If no index is specified, an integer index from 0 to N-1 (where N is the length of the data) is created automatically. You can get the array representation and the index object through the Series' values and index attributes:
```python
In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])

In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)
```
An index is usually specified when creating a Series:
```python
In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3
```
Use the index to retrieve a single value or a set of values from a Series:
```python
In [7]: obj2['a']
Out[7]: -5

In [8]: obj2[['c', 'd']]
Out[8]:
c    3
d    4
```
Sorting
```python
In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4

In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7
```
Filtering and arithmetic
```python
In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3

In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6
```
Membership
```python
In [13]: 'b' in obj2
Out[13]: True

In [14]: 'e' in obj2
Out[14]: False
```
Creating a Series from a dict
```python
In [15]: sdata = {'Shanghai': 35000, 'Beijing': 40000, 'Nanjing': 26000, 'Hangzhou': 30000}

In [16]: obj3 = Series(sdata)

In [17]: obj3
Out[17]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000
```
If you pass only a dict, the index of the resulting Series is the dict's keys (in sorted order):
```python
In [18]: states = ['Beijing', 'Hangzhou', 'Shanghai', 'Suzhou']

In [19]: obj4 = Series(sdata, index=states)

In [20]: obj4
Out[20]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN
```
When an index is specified, the three values in sdata that match the states index are found and placed in the corresponding positions. Since no value is found for 'Suzhou', its result is NaN (not a number), which pandas uses to mark missing or NA values.
pandas' isnull and notnull functions can be used to detect missing data:
```python
In [21]: pd.isnull(obj4)
Out[21]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True

In [22]: pd.notnull(obj4)
Out[22]:
Beijing      True
Hangzhou     True
Shanghai     True
Suzhou      False
```
Series also has these as instance methods:
```python
In [23]: obj4.isnull()
Out[23]:
Beijing     False
Hangzhou    False
Shanghai    False
Suzhou       True
```
An important feature of Series is that it automatically aligns data by index in arithmetic operations:
```python
In [24]: obj3
Out[24]:
Beijing     40000
Hangzhou    30000
Nanjing     26000
Shanghai    35000

In [25]: obj4
Out[25]:
Beijing     40000.0
Hangzhou    30000.0
Shanghai    35000.0
Suzhou          NaN

In [26]: obj3 + obj4
Out[26]:
Beijing     80000.0
Hangzhou    60000.0
Nanjing         NaN
Shanghai    70000.0
Suzhou          NaN
```
A Series' index can be modified in place by assignment:
```python
In [27]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [28]: obj
Out[28]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
```
DataFrame
Reading a file with pandas
```python
In [29]: df = pd.read_table('pandas_test.txt', sep=' ', names=['name', 'age'])

In [30]: df
Out[30]:
    name  age
0    Bob   26
1   Loya   22
2  Denny   20
3   Mars   25
```
DataFrame column selection
```python
df['name']
```
```python
In [31]: df['name']
Out[31]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
```
DataFrame row selection
```python
df.iloc[0, :]  # first argument selects rows, second selects columns; here row 0, all columns
df.iloc[:, 0]  # all rows, column 0
```
```python
In [32]: df.iloc[0, :]
Out[32]:
name    Bob
age      26
Name: 0, dtype: object

In [33]: df.iloc[:, 0]
Out[33]:
0      Bob
1     Loya
2    Denny
3     Mars
Name: name, dtype: object
```
To get a single element you can use iloc, but iat is faster:
```python
In [34]: df.iloc[1, 1]
Out[34]: 22

In [35]: df.iat[1, 1]
Out[35]: 22
```
DataFrame block selection
```python
In [36]: df.loc[1:2, ['name', 'age']]
Out[36]:
    name  age
1   Loya   22
2  Denny   20
```
Filtering rows by condition
Put a boolean condition inside the brackets to filter rows; the condition must return True or False. (Note: the examples below assume the city column that is added in the next section.)
```python
In [37]: df[(df.index >= 1) & (df.index <= 3)]
Out[37]:
    name  age      city
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing

In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   Bob   26  Beijing
3  Mars   25  Nanjing
```
Adding a column
```python
In [39]: df['city'] = ['Beijing', 'Shanghai', 'Hangzhou', 'Nanjing']

In [40]: df
Out[40]:
    name  age      city
0    Bob   26   Beijing
1   Loya   22  Shanghai
2  Denny   20  Hangzhou
3   Mars   25   Nanjing
```
Sorting
Sort by a specified column:
```python
In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  Denny   20  Hangzhou
1   Loya   22  Shanghai
3   Mars   25   Nanjing
0    Bob   26   Beijing
```
```python
# import numpy to build a DataFrame
import numpy as np
```
```python
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                           columns=['d', 'a', 'b', 'c'])

In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
```
```python
# sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

# descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
```
Inspecting
```python
# first 5 rows
df.head(5)
# last 5 rows
df.tail(5)

# column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')

# current values
In [48]: df.values
Out[48]:
array([['Bob', 26, 'Beijing'],
       ['Loya', 22, 'Shanghai'],
       ['Denny', 20, 'Hangzhou'],
       ['Mars', 25, 'Nanjing']], dtype=object)
```
Transpose
```python
In [49]: df.T
Out[49]:
            0         1         2        3
name      Bob      Loya     Denny     Mars
age        26        22        20       25
city  Beijing  Shanghai  Hangzhou  Nanjing
```
Using isin
```python
In [50]: df2 = df.copy()

In [51]: df2[df2['city'].isin(['Shanghai', 'Nanjing'])]
Out[51]:
   name  age      city
1  Loya   22  Shanghai
3  Mars   25   Nanjing
```
Arithmetic operations:
```python
In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    ...:                   index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
```
```python
# sum by column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80

# sum by row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55
```
Grouping
Grouping refers to a process involving the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
(See the Grouping section of the pandas documentation.)
```python
In [57]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})

In [58]: df
Out[58]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
```
Group, then apply the sum function:
```python
In [59]: df.groupby('A').sum()
Out[59]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958

In [60]: df.groupby(['A', 'B']).sum()
Out[60]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
```
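Tying this back to the CDN-log script at the top: the top_url_byte statistic there is exactly this split-apply-combine pattern. A minimal sketch on a toy DataFrame (the column names and values below are invented for illustration, not taken from the script):

```python
import pandas as pd

# Toy stand-in for the parsed log: one row per request.
log = pd.DataFrame({
    'url':  ['/1.jpg', '/2.jpg', '/1.jpg', '/3.jpg', '/1.jpg'],
    'size': [502, 1024, 502, 2048, 502],  # response size in bytes
})

# Split by URL, sum each group's bytes, then rank in descending order:
# the same groupby/sum/sort_values chain the script uses.
top_url_byte = log.groupby('url')['size'].sum().sort_values(ascending=False).head(10)
print(top_url_byte)
```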
Summary
That is everything on using Python's pandas library for CDN log analysis. I hope the content is helpful for your study or work. If you have questions, feel free to leave a comment. Thanks to everyone for supporting 服务器之家.
Original article: http://qingkang.me/python-pandas.html