This article walks through crawling Sina Weibo content with a Python crawler. It is shared here for your reference; the details are as follows:
We write a Python crawler that fetches the posts of a Weibo celebrity account, using one particular account as the example (crawling the Sina mobile site: https://m.weibo.cn/u/1259110474).
When crawling a website, the mobile (m) site is usually the first choice, the WAP site second, and the PC site last. This is not absolute: sometimes the PC site carries the most complete information, and if you happen to need all of it, the PC site is the better choice. Mobile sites generally prefix the domain with "m", so the address this article works with is m.weibo.cn.
Preparation
1. Proxy IP
There are many free proxy IP lists online, such as Xici free proxies (http://www.xicidaili.com/); pick one, confirm it works, and use it for testing.
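Before plugging a free proxy into the crawler, it is worth verifying that it actually responds. Below is a minimal sketch, not part of the original article, that routes a plain HTTP request through the proxy against httpbin.org/ip (an echo service chosen purely for convenience):

```python
# Minimal sketch for checking a free proxy before use (assumption: a single
# successful request against an echo service is good enough as a smoke test).
import urllib.request

def proxy_works(proxy_addr, timeout=10):
    """Return True if a plain HTTP request routed through proxy_addr succeeds."""
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy)
    try:
        with opener.open('http://httpbin.org/ip', timeout=timeout) as resp:
            return resp.getcode() == 200
    except Exception:
        return False

print(proxy_works("122.241.72.191:808"))
```

Free proxies die quickly, so expect to re-run a check like this whenever the crawler starts failing.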
2. Packet capture analysis
The address of the Weibo content API is obtained through packet capture. This is not covered in detail here; readers who are unfamiliar with it can look up tutorials on browser developer tools or packet-capture software. A small warm-up sketch follows, and then the complete code.
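For context, the endpoint the capture turns up, and which the full script below relies on, is https://m.weibo.cn/api/container/getIndex. A minimal sketch of hitting it directly (no proxy) and listing the top-level keys of the JSON it returns:

```python
# Minimal sketch of the JSON endpoint found via packet capture; no proxy here.
import json
import urllib.request

uid = '1259110474'
url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    payload = json.loads(resp.read().decode('utf-8'))

# Profile data lives under data -> userInfo; the tab list (including the
# containerid used later for paging through posts) is under data -> tabsInfo.
print(list(payload.get('data', {}).keys()))
```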
完整代码:
```python
# -*- coding: utf-8 -*-
import urllib.request
import json

# Weibo user ID of the account to crawl
id = '1259110474'

# Proxy IP (replace with a working proxy of your own)
proxy_addr = "122.241.72.191:808"

# Open a URL through the proxy and return the decoded page body
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent",
                   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data

# Get the containerid of the user's "weibo" tab; it is required when
# requesting the actual posts
def get_containerid(url):
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    containerid = None
    for tab in content.get('tabsInfo').get('tabs'):
        if tab.get('tab_type') == 'weibo':
            containerid = tab.get('containerid')
    return containerid

# Print the account's basic profile: screen name, profile URL, avatar,
# follow count, follower count, gender, level, etc.
def get_userInfo(id):
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    userInfo = content.get('userInfo')
    profile_image_url = userInfo.get('profile_image_url')
    description = userInfo.get('description')
    profile_url = userInfo.get('profile_url')
    verified = userInfo.get('verified')
    guanzhu = userInfo.get('follow_count')    # accounts this user follows
    name = userInfo.get('screen_name')
    fensi = userInfo.get('followers_count')   # followers
    gender = userInfo.get('gender')
    urank = userInfo.get('urank')             # Weibo user level
    print("Screen name: " + name + "\n"
          + "Profile URL: " + profile_url + "\n"
          + "Avatar URL: " + profile_image_url + "\n"
          + "Verified: " + str(verified) + "\n"
          + "Description: " + description + "\n"
          + "Following: " + str(guanzhu) + "\n"
          + "Followers: " + str(fensi) + "\n"
          + "Gender: " + gender + "\n"
          + "Level: " + str(urank) + "\n")

# Crawl the posts page by page and append them to a text file; each record
# holds the post text, detail-page URL, like/comment/repost counts, etc.
def get_weibo(id, file):
    i = 1
    while True:
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
        weibo_url = (url + '&containerid=' + get_containerid(url)
                     + '&page=' + str(i))
        try:
            data = use_proxy(weibo_url, proxy_addr)
            content = json.loads(data).get('data')
            cards = content.get('cards')
            if cards and len(cards) > 0:
                for j in range(len(cards)):
                    print("-----crawling page " + str(i) + ", post " + str(j) + "------")
                    card_type = cards[j].get('card_type')
                    if card_type == 9:  # card_type 9 is an ordinary post
                        mblog = cards[j].get('mblog')
                        attitudes_count = mblog.get('attitudes_count')
                        comments_count = mblog.get('comments_count')
                        created_at = mblog.get('created_at')
                        reposts_count = mblog.get('reposts_count')
                        scheme = cards[j].get('scheme')
                        text = mblog.get('text')
                        with open(file, 'a', encoding='utf-8') as fh:
                            fh.write("----page " + str(i) + ", post " + str(j) + "----" + "\n")
                            fh.write("Post URL: " + str(scheme) + "\n"
                                     + "Posted at: " + str(created_at) + "\n"
                                     + "Text: " + text + "\n"
                                     + "Likes: " + str(attitudes_count) + "\n"
                                     + "Comments: " + str(comments_count) + "\n"
                                     + "Reposts: " + str(reposts_count) + "\n")
                i += 1
            else:
                break
        except Exception as e:
            # On a bad or throttled response, stop rather than retrying the
            # same page forever
            print(e)
            break

if __name__ == "__main__":
    file = id + ".txt"
    get_userInfo(id)
    get_weibo(id, file)
```
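One caveat worth knowing: the `text` field returned by this API typically contains HTML markup (anchor tags for links, spans for emoji). If plain text is preferred in the output file, a rough post-processing step like the sketch below could be applied before writing. This is a heuristic of my own, not part of the original code, and a real HTML parser would be more robust:

```python
# Optional post-processing sketch: strip HTML tags from the 'text' field
# before writing it out. A plain regex is a rough heuristic, not a full
# HTML parser; use html.parser or BeautifulSoup for anything serious.
import html
import re

def clean_text(raw):
    no_tags = re.sub(r'<[^>]+>', '', raw)   # drop tags such as <a ...> and <span ...>
    return html.unescape(no_tags).strip()   # decode entities such as &amp;

print(clean_text('转发微博 <a href="https://weibo.com">link</a> &gt;'))
```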
Crawl results: the profile is printed to the console and the posts are appended to a text file named after the user ID (here, 1259110474.txt).
I hope this article helps you with your Python programming.
Original article: https://blog.csdn.net/d1240673769/article/details/74278547