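Python CAPTCHA Recognition for Web Scraping
Preface:
Binarization, plain denoising, 8-neighborhood denoising
tesseract, tesserocr, PIL
Reference and code: https://github.com/liguobao/python-verify-code-ocr
1. Batch-download CAPTCHA images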

    import shutil

    import requests
    from loguru import logger

    for i in range(100):
        url = 'http://xxxx/create/validate/image'
        # stream=True exposes the raw body so it can be copied straight to disk
        response = requests.get(url, stream=True)
        with open(f'./imgs/{i}.png', 'wb') as out_file:
            # Let urllib3 undo gzip/deflate encoding before the bytes are written
            response.raw.decode_content = True
            shutil.copyfileobj(response.raw, out_file)
            logger.info(f"download {i}.png successfully.")
        del response

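One practical wrinkle with the loop above: it assumes `./imgs` already exists and that every request succeeds. A minimal sketch of a more defensive variant follows. The names `download_captchas` and `fetch_with_requests`, the `timeout` value, and the injectable `fetch` parameter are illustrative assumptions, not part of the original code.

```python
from pathlib import Path


def fetch_with_requests(url, timeout=10):
    """Fetch one image body with requests; raises on HTTP errors."""
    import requests  # imported lazily so download_captchas can be exercised offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def download_captchas(url, count, out_dir="./imgs", fetch=fetch_with_requests):
    """Save `count` CAPTCHA images as 0.png, 1.png, ... (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    saved = []
    for i in range(count):
        path = out / f"{i}.png"
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved
```

Passing `fetch` as a parameter is only a convenience for trying the helper without a network connection; in normal use the default is enough.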
2. Run OCR and see what happens

    from PIL import Image
    import tesserocr

    img = Image.open("./imgs/98.png")
    img.show()
    img_l = img.convert("L")  # grayscale image
    img_l.show()
    verify_code1 = tesserocr.image_to_text(img)
    verify_code2 = tesserocr.image_to_text(img_l)
    print(f"verify_code1:{verify_code1}")
    print(f"verify_code2:{verify_code2}")

Unsurprisingly, neither the original image nor the grayscale version produces any text.
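The preface lists binarization as one of the preprocessing techniques. A minimal sketch with Pillow is shown here, assuming a hypothetical helper named `binarize` and a threshold of 128; both are illustrative choices rather than values from the original post.

```python
from PIL import Image


def binarize(img, threshold=128):
    """Return a black-and-white copy of img (hypothetical helper).

    Pixels brighter than `threshold` become white (255); the rest become black (0).
    """
    gray = img.convert("L")
    # point() applies the function to every possible 8-bit value as a lookup table
    return gray.point(lambda p: 255 if p > threshold else 0)
```

For example, `binarize(Image.open("./imgs/98.png")).show()` previews the result. The right threshold depends on the CAPTCHA style, so expect to tune it.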
3. Denoising and removing interference
Denoising CAPTCHA images in Python: 8-neighborhood denoising

    from PIL import Image

    # Source: https://www.cnblogs.com/jhao/p/10345853.html
    # (Python CAPTCHA image denoising: 8-neighborhood denoising)


    def noise_remove_pil(image_name, k):
        """
        8-neighborhood denoising.
        Args:
            image_name: path of the image file
            k: threshold; a pixel with fewer than k non-white neighbors is cleared
        Returns:
            the denoised grayscale image
        """

        def calculate_noise_count(img_obj, w, h):
            """
            Count the non-white pixels among the 8 neighbors of (w, h).
            Args:
                img_obj: grayscale image object
                w: x coordinate of the pixel
                h: y coordinate of the pixel
            Returns:
                count (int)
            """
            count = 0
            width, height = img_obj.size
            for _w_ in [w - 1, w, w + 1]:
                for _h_ in [h - 1, h, h + 1]:
                    if _w_ > width - 1:
                        continue
                    if _h_ > height - 1:
                        continue
                    if _w_ == w and _h_ == h:
                        continue
                    # Grayscale image: values below 230 count as non-white
                    if img_obj.getpixel((_w_, _h_)) < 230:
                        count += 1
            return count

        img = Image.open(image_name)
        # Convert to grayscale
        gray_img = img.convert('L')

        w, h = gray_img.size
        for _w in range(w):
            for _h in range(h):
                # The first row and first column are always cleared to white
                if _w == 0 or _h == 0:
                    gray_img.putpixel((_w, _h), 255)
                    continue
                pixel = gray_img.getpixel((_w, _h))
                if pixel == 255:
                    continue
                # Clear the pixel when too few of its neighbors are non-white
                if calculate_noise_count(gray_img, _w, _h) < k:
                    gray_img.putpixel((_w, _h), 255)
        return gray_img


    if __name__ == '__main__':
        image = noise_remove_pil("./imgs/1.png", 4)
        image.show()

The result is close to usable, but there is still room for improvement.
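One detail of the 8-neighborhood function above is that it rewrites pixels in place while scanning, so a pixel cleared early in the pass changes the neighbor counts of pixels visited later. The sketch below is a variant, not the referenced author's method: it reads from the untouched source and writes to a copy, so every decision uses the original neighborhood. The name `denoise_8_neighbors` and the `dark_below` parameter are assumptions made for illustration, and unlike the original it does not force the first row and column to white.

```python
from PIL import Image


def denoise_8_neighbors(gray_img, k, dark_below=230):
    """Return a denoised copy of a mode "L" image (hypothetical helper)."""
    src = gray_img.load()          # read-only view of the original pixels
    out_img = gray_img.copy()
    out = out_img.load()           # writable view of the copy
    width, height = gray_img.size
    for x in range(width):
        for y in range(height):
            if src[x, y] >= dark_below:
                continue  # already white enough, nothing to clear
            dark = 0
            for nx in (x - 1, x, x + 1):
                for ny in (y - 1, y, y + 1):
                    if (nx, ny) == (x, y):
                        continue
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and src[nx, ny] < dark_below:
                        dark += 1
            if dark < k:
                out[x, y] = 255  # too few dark neighbors: treat as noise
    return out_img
```

The cost is one extra image copy. Whether the order-independent behavior actually improves OCR on a given CAPTCHA set is something to check empirically.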
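A new idea for improvement:
The interference lines here are red lines radiating from a single point,
so removing every red pixel removes the lines as well.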

    from PIL import Image
    import tesserocr

    img = Image.open("./imgs/98.png")
    img.show()

    # Try removing the red pixels (the comparison assumes RGB 3-tuples)
    w, h = img.size
    for _w in range(w):
        for _h in range(h):
            o_pixel = img.getpixel((_w, _h))
            if o_pixel == (255, 0, 0):
                img.putpixel((_w, _h), (255, 255, 255))
    img.show()

    img_l = img.convert("L")
    # img_l.show()
    verify_code1 = tesserocr.image_to_text(img)
    verify_code2 = tesserocr.image_to_text(img_l)
    print(f"verify_code1:{verify_code1}")
    print(f"verify_code2:{verify_code2}")

That looks OK. A few scattered blue pixels remain, and the same approach removes them too.
The OCR even returns a result right away.
That's a wrap.
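The red-pixel loop extends naturally to the blue pixels mentioned above. The sketch below assumes a hypothetical helper `whiten_colors` and assumes the noise is exactly pure red `(255, 0, 0)` and pure blue `(0, 0, 255)`, as in the original comparison. It also converts the image to RGB first, since an RGBA or palette PNG would not yield 3-tuples from pixel reads.

```python
from PIL import Image

RED = (255, 0, 0)
BLUE = (0, 0, 255)
WHITE = (255, 255, 255)


def whiten_colors(img, colors=(RED, BLUE)):
    """Return an RGB copy of img with the given exact colors painted white."""
    rgb = img.convert("RGB")  # normalizes RGBA/P images so pixels are 3-tuples
    px = rgb.load()
    width, height = rgb.size
    for x in range(width):
        for y in range(height):
            if px[x, y] in colors:
                px[x, y] = WHITE
    return rgb
```

Exact matching only works when the noise uses flat colors. Anti-aliased lines would need a tolerance-based comparison instead.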
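Later, though, I found that some red line segments and blue dots overlap the CAPTCHA characters.
Filling those pixels with white can cut a letter apart and make recognition worse.
So when the current pixel is red or blue, check whether at least two of its adjacent pixels are black.
If yes, fill it with black.
If no, fill it with white.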
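The fill rule can be tried on a tiny synthetic image before touching real CAPTCHAs. The sketch below is an illustration under assumptions: the helper names `count_black_neighbors` and `repair_colored_pixels` are hypothetical, only the four up/down/left/right neighbors are counted, and border pixels are treated as having no black neighbors.

```python
from PIL import Image

BLACK, WHITE = (0, 0, 0), (255, 255, 255)
RED, BLUE = (255, 0, 0), (0, 0, 255)


def count_black_neighbors(px, x, y, width, height):
    """Count black pixels among the up/down/left/right neighbors of (x, y)."""
    if x == 0 or y == 0 or x == width - 1 or y == height - 1:
        return 0  # a neighbor would fall outside the image
    neighbors = [px[x, y - 1], px[x, y + 1], px[x - 1, y], px[x + 1, y]]
    return sum(1 for p in neighbors if p == BLACK)


def repair_colored_pixels(img):
    """Return an RGB copy where red/blue pixels become black or white."""
    rgb = img.convert("RGB")
    px = rgb.load()
    width, height = rgb.size
    for x in range(width):
        for y in range(height):
            if px[x, y] in (RED, BLUE):
                enough = count_black_neighbors(px, x, y, width, height) >= 2
                px[x, y] = BLACK if enough else WHITE
    return rgb


# A vertical black stroke whose middle pixel was overwritten by a red line,
# plus a stray blue dot next to it.
demo = Image.new("RGB", (5, 5), WHITE)
demo.putpixel((2, 1), BLACK)
demo.putpixel((2, 3), BLACK)
demo.putpixel((2, 2), RED)
demo.putpixel((1, 1), BLUE)
result = repair_colored_pixels(demo)
print(result.getpixel((2, 2)))  # red pixel inside the stroke: (0, 0, 0)
print(result.getpixel((1, 1)))  # blue dot with one black neighbor: (255, 255, 255)
```

The red pixel has black pixels above and below it, so it is restored to black and the stroke stays connected. The blue dot touches only one black pixel, so it is cleared.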
Final complete code:

    from PIL import Image
    import tesserocr
    from loguru import logger


    class VerfyCodeOCR():
        def __init__(self) -> None:
            pass

        def ocr(self, img):
            """CAPTCHA OCR.

            Args:
                img: an image object or an image path

            Returns:
                str: the recognized text
            """
            img_obj = Image.open(img) if isinstance(img, str) else img
            # Cleans the image in place; pixel comparisons assume RGB 3-tuples
            self._remove_pil(img_obj)
            verify_code = tesserocr.image_to_text(img_obj)
            return verify_code.replace("\n", "").strip()

        def _get_p_black_count(self, img: Image.Image, _w: int, _h: int):
            """Count the black pixels adjacent to the given position.

            Args:
                img: image object
                _w (int): x coordinate
                _h (int): y coordinate

            Returns:
                int: number of black pixels among the up/down/left/right neighbors
            """
            w, h = img.size
            # Border pixels: a neighbor would fall outside the image
            if _w == 0 or _w == w - 1 or _h == 0 or _h == h - 1:
                return 0
            p_round_items = [
                img.getpixel((_w, _h - 1)),
                img.getpixel((_w, _h + 1)),
                img.getpixel((_w - 1, _h)),
                img.getpixel((_w + 1, _h)),
            ]
            p_black_count = 0
            for p_item in p_round_items:
                if p_item == (0, 0, 0):
                    p_black_count = p_black_count + 1
            return p_black_count

        def _remove_pil(self, img: Image.Image):
            """Remove the lines and noise dots that interfere with recognition.

            Args:
                img: image object

            Returns:
                the cleaned image object (the same object, modified in place)
            """
            w, h = img.size
            for _w in range(w):
                for _h in range(h):
                    o_pixel = img.getpixel((_w, _h))
                    # Current pixel is red (line segment) or blue (noise dot)
                    if o_pixel == (255, 0, 0) or o_pixel == (0, 0, 255):
                        # At least 2 adjacent black pixels: fill with black;
                        # otherwise cover with white
                        p_black_count = self._get_p_black_count(img, _w, _h)
                        if p_black_count >= 2:
                            img.putpixel((_w, _h), (0, 0, 0))
                        else:
                            img.putpixel((_w, _h), (255, 255, 255))

            logger.info("_remove_pil finish.")
            # img.show()
            return img


    if __name__ == '__main__':
        verfycodeocr = VerfyCodeOCR()
        img_path = "./imgs/51.png"
        img = Image.open(img_path)
        img.show()  # before cleaning
        ocr_result = verfycodeocr.ocr(img)
        img.show()  # after cleaning (ocr() modifies img in place)
        logger.info(ocr_result)

Original article: https://www.cnblogs.com/liguobao/p/15111849.html