Note: the basic steps and core method follow [1] with few changes, but because WeChat has since updated its backend, the scraping details differ quite a bit; some text-processing steps have also been added.
1. Register a WeChat Official Account -> create a new article -> insert a hyperlink -> right-click and choose "Inspect" -> at the top of the DevTools panel select "Network" -> in the left-hand pane search for the official account you want and select it -> watch the DevTools panel: a new entry starting with "appmsg" appears in the list below. Click that "appmsg" entry, then select "Headers" on the right. Under "Headers" there is a "General" section, and "General" contains the "Request URL":
https://mp.weixin.qq.com/cgi-bin/appmsgpublish?sub=list&search_field=null&begin=0&count=5&query=&type=101_1&free_publish_type=1&sub_action=list_ex&token=221080036&lang=zh_CN&f=json&ajax=1
2. Parse the "Request URL":
The link has three parts:
- https://mp.weixin.qq.com/cgi-bin/appmsgpublish — the base of the request
- ?sub=list — the start of the query string, commonly used by dynamic sites to generate different pages or return different results depending on the parameter values
- &search_field=null&begin=0&count=5&query=&type=101_1&free_publish_type=1&sub_action=list_ex&token=221080036&lang=zh_CN&f=json&ajax=1 — the remaining parameters
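For a quick sanity check, the captured Request URL can be decomposed with the standard library, which recovers the same base path and parameters described above:

```python
from urllib.parse import urlparse, parse_qs

request_url = ("https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
               "?sub=list&search_field=null&begin=0&count=5&query="
               "&type=101_1&free_publish_type=1&sub_action=list_ex"
               "&token=221080036&lang=zh_CN&f=json&ajax=1")

parsed = urlparse(request_url)
print(parsed.path)      # the request base: /cgi-bin/appmsgpublish
# keep_blank_values keeps the empty query= parameter
query = parse_qs(parsed.query, keep_blank_values=True)
print(query["sub"])     # ['list']
print(query["begin"])   # ['0']
```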
3. Get the Cookie and User-Agent
Accessing this URL directly with Python's Requests library does not return the expected result. The reason: when you insert a hyperlink from the Official Account backend you are logged in, while a plain Python request is not. You therefore need to copy the Cookie and User-Agent of the logged-in session by hand and pass them in the headers parameter of the Requests call. Here the account identifier fakeid and the token parameter are also saved in a yaml file, so they can be loaded when scraping. All of these can be found in the DevTools panel:
fakeid: Mzg5...
token: 22...
user-agent: Mozilla/5.0 ...
cookie: ua_id=9gk3jF5RLFPEru...
Read the parameters with the yaml package:

```python
import yaml

with open("chaos-gravity.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data)

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user-agent']
}
```
4. Set the request parameters and build the url
Adjust the parameters below to match the Request URL you found:
```python
# request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
begin = "0"
params = {
    "sub": "list",
    "search_field": "null",
    "sub_action": "list_ex",
    "begin": begin,
    "free_publish_type": "1",
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "101_1",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}
```
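Before firing any request, the dictionary can be rendered back into a query string and compared against the Request URL captured in DevTools. A sketch with placeholder fakeid/token values (substitute the ones from your own yaml config):

```python
from urllib.parse import urlencode

url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
params = {
    "sub": "list",
    "search_field": "null",
    "sub_action": "list_ex",
    "begin": "0",
    "free_publish_type": "1",
    "count": "5",
    "fakeid": "MzgXXXXXXXX",   # placeholder, not a real fakeid
    "type": "101_1",
    "token": "221080036",
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}

preview = url + "?" + urlencode(params)
print(preview)   # compare against the Request URL shown in DevTools
```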
5. Scrape the article title, author, link and other useful fields, and save them to a csv file
This part needs to be verified line by line in case the WeChat backend changes its rules again; the code below already differs substantially from reference [1], mostly because of changes in exactly these details.
```python
import time
import random
import requests

i = 0
column_name = "aid,appmsgid,author_name,title,cover_img,digest,link,create_time"
article_list_path = "article_list.csv"
with open(article_list_path, "a") as f:
    f.write(column_name + '\n')
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # pause a random few seconds so the requests are not frequent enough to get flagged
    time.sleep(random.randint(1, 10))
    resp = requests.get(url, headers=headers, params=params, verify=False)
    # WeChat rate limiting: wait an hour, then retry the same page
    if resp.json()['base_resp']['ret'] == 200013:
        print("frequency control, stop at {}".format(str(begin)))
        time.sleep(3600)
        continue
    if i == 0:
        total_count = eval(resp.json()['publish_page'])['total_count']
        print("We have " + str(total_count) + " articles.")
    publish_list = eval(resp.json()['publish_page'])['publish_list']
    # stop when the returned list is empty
    if len(publish_list) == 0:
        print("all articles parsed")
        break
    for publish in publish_list:
        publish = eval(publish['publish_info'].replace("true", "True").replace("false", "False"))['appmsgex'][0]
        info = '"{}","{}","{}","{}","{}","{}","{}","{}"'.format(
            str(publish["aid"]), str(publish['appmsgid']), str(publish['author_name']),
            str(publish['title'].replace("\n", "").replace(",", ";")),
            str(publish['cover']), str(publish['digest'].replace("\n", "").replace(",", ";")),
            str(publish['link']), str(publish['create_time']))
        with open(article_list_path, "a") as f:
            f.write(info + '\n')
        print("\n".join(info.split(",")))
        print("\n\n---------------------------------------------------------------------------------\n")
    # next page
    i += 1
```
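One fragile spot above is the `eval(...replace("true", "True")...)` dance: `publish_page` and `publish_info` are JSON strings, so `json.loads` parses them directly, booleans included. A sketch on a minimal synthetic payload (real responses carry many more fields):

```python
import json

# a minimal synthetic publish_info in the shape the loop expects
publish_info = json.dumps({
    "appmsgex": [{
        "aid": "123_1",
        "title": "Demo",
        "link": "https://mp.weixin.qq.com/s/xxx",
        "is_deleted": False   # JSON booleans are why eval() needed the replace() hack
    }]
})

# json.loads handles true/false/null natively, no string surgery required
msg = json.loads(publish_info)["appmsgex"][0]
print(msg["title"])        # Demo
print(msg["is_deleted"])   # False
```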
6. Full code, get_wechart_article_list.py:
Note: because the output is saved as a csv file, any commas in the title or digest are replaced with semicolons.
```python
import yaml
import time
import random
import requests


def get_headers(config):
    headers = {
        "Cookie": config['cookie'],
        "User-Agent": config['user-agent']
    }
    return headers


def get_params(config):
    url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
    begin = "0"
    params = {
        "sub": "list",
        "search_field": "null",
        "sub_action": "list_ex",
        "begin": begin,
        "free_publish_type": "1",
        "count": "5",
        "fakeid": config['fakeid'],
        "type": "101_1",
        "token": config['token'],
        "lang": "zh_CN",
        "f": "json",
        "ajax": "1"
    }
    return url, params


def get_article_list(url, headers, params):
    i = 0
    column_name = "aid,appmsgid,author_name,title,cover_img,digest,link,create_time"
    article_list_path = "article_list.csv"
    with open(article_list_path, "a") as f:
        f.write(column_name + '\n')
    while True:
        begin = i * 5
        params["begin"] = str(begin)
        # pause a random few seconds so the requests are not frequent enough to get flagged
        time.sleep(random.randint(1, 10))
        resp = requests.get(url, headers=headers, params=params, verify=False)
        # WeChat rate limiting: wait an hour, then retry the same page
        if resp.json()['base_resp']['ret'] == 200013:
            print("frequency control, stop at {}".format(str(begin)))
            time.sleep(3600)
            continue
        if i == 0:
            total_count = eval(resp.json()['publish_page'])['total_count']
            print("We have " + str(total_count) + " articles.")
        publish_list = eval(resp.json()['publish_page'])['publish_list']
        # stop when the returned list is empty
        if len(publish_list) == 0:
            print("all articles parsed")
            break
        for publish in publish_list:
            publish = eval(publish['publish_info'].replace("true", "True").replace("false", "False"))['appmsgex'][0]
            info = '"{}","{}","{}","{}","{}","{}","{}","{}"'.format(
                str(publish["aid"]), str(publish['appmsgid']), str(publish['author_name']),
                str(publish['title'].replace("\n", "").replace(",", ";")),
                str(publish['cover']), str(publish['digest'].replace("\n", "").replace(",", ";")),
                str(publish['link']), str(publish['create_time']))
            with open(article_list_path, "a") as f:
                f.write(info + '\n')
            print("\n".join(info.split(",")))
            print("\n\n---------------------------------------------------------------------------------\n")
        # next page
        i += 1


def main():
    with open("chaos-gravity.yaml", "r") as file:
        file_data = file.read()
    config = yaml.safe_load(file_data)
    headers = get_headers(config)
    url, params = get_params(config)
    get_article_list(url, headers, params)


if __name__ == '__main__':
    main()
```
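As an aside, the comma-to-semicolon replacement above works, but Python's csv module can quote fields instead, preserving the original text. A sketch with hypothetical field values:

```python
import csv

columns = ["aid", "appmsgid", "author_name", "title", "cover_img",
           "digest", "link", "create_time"]
# a hypothetical row whose title and digest contain commas
row = ["123_1", "123", "chaos", "Hello, world", "cover_url",
       "a digest, with a comma", "https://mp.weixin.qq.com/s/xxx", "1600000000"]

with open("article_list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(columns)
    writer.writerow(row)   # commas inside fields are quoted, not replaced

# round-trip check: the comma in the title survives intact
with open("article_list.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows[1][3])   # Hello, world
```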
This post scrapes the links; the next one scrapes the articles themselves.
References:
- https://zhuanlan.zhihu.com/p/379062852