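Crawling Sina Weibo normally requires logging in. Using the mobile site m.weibo.cn simplifies things, because a post can be accessed there directly by its Weibo id.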
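Analyzing how Sina Weibo fetches comments shows that they are loaded dynamically, so the json module is used to parse the JSON that comes back.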
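As a rough illustration of the JSON step, here is a minimal sketch that parses a response body offline. The sample payload is made up; only the `data` list and the per-comment `text` field are taken from the script in this post, and the helper name `extract_comment_texts` is hypothetical.

```python
import json

# Hypothetical sample body: the values are invented, only the "data" list
# and the "text" field mirror what the script in this post reads.
SAMPLE_BODY = '{"data": [{"text": "first comment"}, {"text": "second comment"}]}'


def extract_comment_texts(body):
    """Return the raw text of every comment found in a JSON response body."""
    payload = json.loads(body)
    # Assumption: a body without a usable "data" list means "no comments here".
    items = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    return [item["text"] for item in items if "text" in item]


print(extract_comment_texts(SAMPLE_BODY))
```

Returning an empty list instead of raising keeps a page loop running when one page comes back without comments.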
A separate cleanup function strips the noisy, distracting characters from the comment text.
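The cleanup idea can be sketched with the standard library alone. The reply and repost patterns are the ones used in this post's script; the crude tag-stripping regex and the final `strip()` are my own assumptions standing in for the lxml-based text extraction, and `clean_comment` is a hypothetical name.

```python
import html
import re

# "Reply to @user:" prefix, in simplified (回复) or traditional (回覆) form.
REPLY_PREFIX = re.compile(r'回[复覆]@.*?:')
# Quoted repost chain: everything from "//@" to the end of the text.
REPOST_CHAIN = re.compile(r'//@.*')
# Assumption: a simple tag stripper, not a full HTML parser.
TAG = re.compile(r'<[^>]+>')


def clean_comment(raw):
    """Remove markup, reply prefixes and repost chains from one comment."""
    text = TAG.sub('', raw)           # drop HTML tags such as <a ...>
    text = html.unescape(text)        # turn &amp; etc. back into characters
    text = REPLY_PREFIX.sub('', text)
    text = REPOST_CHAIN.sub('', text)
    return text.strip()


print(clean_comment('回复@<a href="/n/foo">foo</a>:nice photo'))
```

A regex tag stripper is enough for short comment snippets, but a real HTML parser is safer for arbitrary markup.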
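This program is the end goal of my writing web crawlers in Python, so it is organized into functions to make later optimization and new features easier to add.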
# -*- coding: utf-8 -*-
import json
import re

import requests
from lxml import html

# Test Weibo id: 4054483400791767
comments = []


def get_page(weibo_id):
    """Return the number of comment pages, at 10 comments per page."""
    url = 'https://m.weibo.cn/status/{}'.format(weibo_id)
    page_html = requests.get(url).text
    regcount = r'"comments_count": (.*?),'
    comments_count = int(re.findall(regcount, page_html)[-1])
    # Round up so that a partially filled last page is still fetched.
    return (comments_count + 9) // 10


def opt_comment(comment):
    """Strip HTML markup, reply prefixes and repost chains from a comment."""
    tree = html.fromstring(comment)
    strcom = tree.xpath('string(.)')
    reg1 = r'回复@.*?:'
    reg2 = r'回覆@.*?:'
    reg3 = r'//@.*'
    for reg in (reg1, reg2, reg3):
        strcom = re.sub(reg, '', strcom)
    return strcom


def get_responses(weibo_id, page):
    """Request one page of comments for the given Weibo id."""
    url = 'https://m.weibo.cn/api/comments/show?id={}&page={}'.format(weibo_id, page)
    return requests.get(url)


def get_weibo_comments(response):
    """Parse one JSON response and append its cleaned comments."""
    json_response = json.loads(response.text)
    # A page without a 'data' list has no comments to collect.
    for item in json_response.get('data', []):
        comments.append(opt_comment(item['text']))


weibo_id = int(input('Enter a Weibo id to fetch its comments: '))
print('\n')
page_count = get_page(weibo_id)
for page in range(1, page_count + 1):
    response = get_responses(weibo_id, page)
    get_weibo_comments(response)
for com in comments:
    print(com)
print(len(comments))
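The page-count step can also be tried in isolation, without touching the network. This is only a sketch: the HTML fragment is invented, the helper name `count_pages` is hypothetical, and the page size of 10 is the one the script divides by. It rounds up so that a partially filled last page is counted.

```python
import re

COMMENTS_PER_PAGE = 10  # page size taken from the division by 10 in the script

# Hypothetical fragment; a real status page embeds far more data than this.
SAMPLE_HTML = '{"id": "1", "comments_count": 37, "reposts_count": 2}'


def count_pages(page_html, per_page=COMMENTS_PER_PAGE):
    """Return how many comment pages the embedded comment count implies."""
    matches = re.findall(r'"comments_count": (\d+),', page_html)
    if not matches:
        return 0  # no count found, so nothing to fetch
    total = int(matches[-1])       # like the script, use the last match
    return -(-total // per_page)   # ceiling division without floats


print(count_pages(SAMPLE_HTML))
```

Matching `\d+` instead of `.*?` makes the `int()` conversion safe even if the surrounding markup changes slightly.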
Original link: https://blog.csdn.net/Joliph/article/details/77334354