How to Scrape All Articles from a WeChat Official Account (Python), Part 1

Written by Luna

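Note: the basic steps and the core method follow [1] with few changes. The scraping details differ quite a bit, though, probably because WeChat has updated its backend since then, and I have also added some text-processing steps.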

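1. Register a WeChat Official Account -> create a new image-and-text message -> Hyperlink -> right-click and choose "Inspect" from the context menu -> select the "Network" tab at the top of the inspection panel -> in the article search on the left, search for the Official Account you want and select it -> watch the inspection panel on the right: a new entry whose name starts with "appmsg" appears in the list at the bottom. Click that entry, then choose "Headers" on the right side of the panel. Under "Headers" there is a "General" section, and inside it the "Request URL":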

https://mp.weixin.qq.com/cgi-bin/appmsgpublish?sub=list&search_field=null&begin=0&count=5&query=&type=101_1&free_publish_type=1&sub_action=list_ex&token=221080036&lang=zh_CN&f=json&ajax=1

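2. Parse the "Request URL":

The link has three parts:

  1. https://mp.weixin.qq.com/cgi-bin/appmsgpublish is the base part of the request
  2. ?sub=list is the kind of query parameter dynamic sites use to generate different pages or return different results for different values
  3. &search_field=null&begin=0&count=5&query=&type=101_1&free_publish_type=1&sub_action=list_ex&token=221080036&lang=zh_CN&f=json&ajax=1 sets the remaining parameters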
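
The same breakdown can be checked programmatically. Below is a minimal sketch that splits the Request URL shown above with the standard library's urllib.parse; the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

request_url = (
    "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
    "?sub=list&search_field=null&begin=0&count=5&query=&type=101_1"
    "&free_publish_type=1&sub_action=list_ex&token=221080036"
    "&lang=zh_CN&f=json&ajax=1"
)

parts = urlsplit(request_url)
# Base part: scheme + host + path
base_url = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
# Query part: keep_blank_values=True keeps the empty "query=" parameter
query_params = dict(parse_qsl(parts.query, keep_blank_values=True))

print(base_url)
for name, value in query_params.items():
    print("{} = {!r}".format(name, value))
```

Printing the parameters one per line makes it easier to spot which ones (begin, count, token) need to change when paging through the list.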

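3. Get the Cookie and User-Agent

Accessing this URL directly with Python's Requests library does not return a normal result. The reason is that you are logged in when you insert a hyperlink in the web-based Official Account backend, whereas a plain Python request is not logged in. You therefore need to copy the Cookie and User-Agent of the browser request by hand and pass them in the headers argument when calling Requests. Here I store them, together with the Official Account identifier fakeid and the token parameter, in a YAML file so they are easy to load when scraping. All of these can be found by searching the inspection panel on the right: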

fakeid: Mzg5...
token: 22...
user-agent: Mozilla/5.0 ...
cookie: ua_id=9gk3jF5RLFPEru...

Read the parameters with the yaml package

import yaml
with open("chaos-gravity.yaml", "r") as file:
    file_data = file.read()
config = yaml.safe_load(file_data) 

headers = {
    "Cookie": config['cookie'],
    "User-Agent": config['user-agent'] 
}
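
Because these values are copied by hand from the browser, a key can easily end up missing or empty. The following is a small sketch of an early check; `check_config` and `REQUIRED_KEYS` are hypothetical helpers, not part of the original script, and the example dict stands in for the result of yaml.safe_load.

```python
# Keys the scraper reads from the YAML file
REQUIRED_KEYS = ("fakeid", "token", "user-agent", "cookie")


def check_config(config):
    """Return the required keys that are missing or empty in config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Made-up config with the cookie left out on purpose
example = {"fakeid": "placeholder", "token": 123, "user-agent": "Mozilla/5.0"}
missing = check_config(example)
if missing:
    print("Missing config keys: " + ", ".join(missing))
```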

4. Set the request parameters and build the URL

Adjust the parameters below to match the Request URL you found:

# Request parameters
url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
begin = "0"
params = {
    "sub": "list",
    "search_field":"null",
    "sub_action": "list_ex", 
    "begin": begin,
    "free_publish_type": "1",
    "count": "5",
    "fakeid": config['fakeid'],
    "type": "101_1",
    "token": config['token'],
    "lang": "zh_CN",
    "f": "json",
    "ajax": "1"
}
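
To compare the result with the Request URL in the browser before sending anything, the parameters can be encoded into a full URL. This is a sketch using urllib.parse.urlencode with a shortened parameter set; the fakeid and token values are placeholders, and the resulting query string should have the same form as the one Requests builds from params.

```python
from urllib.parse import urlencode

url = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
# Placeholder values stand in for config['fakeid'] and config['token']
params = {
    "sub": "list",
    "begin": "0",
    "count": "5",
    "fakeid": "FAKEID_PLACEHOLDER",
    "token": "TOKEN_PLACEHOLDER",
    "f": "json",
}
# urlencode keeps the dict's insertion order and escapes special characters
full_url = url + "?" + urlencode(params)
print(full_url)
```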

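5. Scrape article titles, authors, links and other useful fields, and save them to a CSV file

This part needs to be checked line by line in case WeChat has changed its rules. The code below differs considerably from reference [1], mainly because of such changes in the details.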

import csv
import json
import random
import time

import requests

column_names = ["aid", "appmsgid", "author_name", "title", "cover_img", "digest", "link", "create_time"]
article_list_path = "article_list.csv"
# Write the header row once; mode "w" starts a fresh file on every run
with open(article_list_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerow(column_names)

i = 0
while True:
    begin = i * 5
    params["begin"] = str(begin)
    # Pause a random few seconds so the requests are less likely to trigger rate limiting
    time.sleep(random.randint(1, 10))
    resp = requests.get(url, headers=headers, params=params)
    data = resp.json()
    # WeChat frequency control: wait an hour, then retry the same page
    if data['base_resp']['ret'] == 200013:
        print("frequency control, paused at {}".format(begin))
        time.sleep(3600)
        continue

    # publish_page is itself a JSON string, so parse it with json.loads instead of eval
    publish_page = json.loads(data['publish_page'])
    if i == 0:
        print("We have {} articles.".format(publish_page['total_count']))

    publish_list = publish_page['publish_list']
    # Stop when the returned list is empty
    if len(publish_list) == 0:
        print("all articles parsed")
        break

    for publish in publish_list:
        # publish_info is another JSON string; keep the first article of each publish
        publish = json.loads(publish['publish_info'])['appmsgex'][0]
        title = publish['title'].replace("\n", "").replace(",", ";")
        digest = publish['digest'].replace("\n", "").replace(",", ";")
        row = [publish['aid'], publish['appmsgid'], publish['author_name'], title,
               publish['cover'], digest, publish['link'], publish['create_time']]
        # csv.writer escapes any double quotes inside the fields
        with open(article_list_path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f, quoting=csv.QUOTE_ALL).writerow(row)
        print("\n".join(str(field) for field in row))
        print("\n\n---------------------------------------------------------------------------------\n")

    # Next page
    i += 1
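
The response nests JSON strings inside a JSON object: "publish_page" is a string, and each "publish_info" inside it is a string again. The sketch below shows the unpacking on a made-up response (the field values and `example_flag` are invented for illustration; `first_articles` is a hypothetical helper). json.loads understands the JSON literals true and false directly, so no text replacement is needed and a title that happens to contain the word "true" is left untouched.

```python
import json

# Made-up response in the shape the loop above expects
publish_info = json.dumps(
    {"appmsgex": [{"aid": "1_1", "title": "Is it true?", "example_flag": False}]}
)
publish_page = json.dumps(
    {"total_count": 1, "publish_list": [{"publish_info": publish_info}]}
)
response = {"base_resp": {"ret": 0}, "publish_page": publish_page}


def first_articles(response):
    """Return the first article dict of every entry in publish_list."""
    page = json.loads(response["publish_page"])
    return [
        json.loads(item["publish_info"])["appmsgex"][0]
        for item in page["publish_list"]
    ]


articles = first_articles(response)
print(articles[0]["title"])
```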

6. Complete code, get_wechart_article_list.py:

Note: because the results are saved as a CSV file, any commas in titles and digests are replaced with semicolons.

import csv
import json
import random
import time

import requests
import yaml

URL = "https://mp.weixin.qq.com/cgi-bin/appmsgpublish"
COLUMN_NAMES = ["aid", "appmsgid", "author_name", "title", "cover_img", "digest", "link", "create_time"]
ARTICLE_LIST_PATH = "article_list.csv"


def get_headers(config):
    return {
        "Cookie": config['cookie'],
        "User-Agent": config['user-agent']
    }


def get_params(config):
    return {
        "sub": "list",
        "search_field": "null",
        "sub_action": "list_ex",
        "begin": "0",
        "free_publish_type": "1",
        "count": "5",
        "fakeid": config['fakeid'],
        "type": "101_1",
        "token": config['token'],
        "lang": "zh_CN",
        "f": "json",
        "ajax": "1"
    }


def get_article_list(headers, params):
    # Write the header row once; mode "w" starts a fresh file on every run
    with open(ARTICLE_LIST_PATH, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerow(COLUMN_NAMES)

    i = 0
    while True:
        begin = i * 5
        params["begin"] = str(begin)
        # Pause a random few seconds so the requests are less likely to trigger rate limiting
        time.sleep(random.randint(1, 10))
        resp = requests.get(URL, headers=headers, params=params)
        data = resp.json()
        # WeChat frequency control: wait an hour, then retry the same page
        if data['base_resp']['ret'] == 200013:
            print("frequency control, paused at {}".format(begin))
            time.sleep(3600)
            continue

        # publish_page is itself a JSON string, so parse it with json.loads instead of eval
        publish_page = json.loads(data['publish_page'])
        if i == 0:
            print("We have {} articles.".format(publish_page['total_count']))

        publish_list = publish_page['publish_list']
        # Stop when the returned list is empty
        if len(publish_list) == 0:
            print("all articles parsed")
            break

        for publish in publish_list:
            # publish_info is another JSON string; keep the first article of each publish
            publish = json.loads(publish['publish_info'])['appmsgex'][0]
            title = publish['title'].replace("\n", "").replace(",", ";")
            digest = publish['digest'].replace("\n", "").replace(",", ";")
            row = [publish['aid'], publish['appmsgid'], publish['author_name'], title,
                   publish['cover'], digest, publish['link'], publish['create_time']]
            # csv.writer escapes any double quotes inside the fields
            with open(ARTICLE_LIST_PATH, "a", newline="", encoding="utf-8") as f:
                csv.writer(f, quoting=csv.QUOTE_ALL).writerow(row)
            print("\n".join(str(field) for field in row))
            print("\n\n---------------------------------------------------------------------------------\n")

        # Next page
        i += 1


def main():
    with open("chaos-gravity.yaml", "r", encoding="utf-8") as file:
        config = yaml.safe_load(file)
    headers = get_headers(config)
    params = get_params(config)
    get_article_list(headers, params)


if __name__ == '__main__':
    main()
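
Once article_list.csv exists, it can be read back with csv.DictReader. The sketch below works on an in-memory sample with a single made-up row instead of the real file, and it assumes that create_time is a Unix timestamp in seconds, which the original does not state.

```python
import csv
import io
from datetime import datetime, timezone

# Made-up sample in the same column layout as article_list.csv
sample = io.StringIO(
    "aid,appmsgid,author_name,title,cover_img,digest,link,create_time\n"
    '"1_1","1","Author","A title; with a semicolon","cover-url","digest","article-url","1700000000"\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Assumption: create_time is seconds since the Unix epoch
    created = datetime.fromtimestamp(int(row["create_time"]), tz=timezone.utc)
    print(row["title"], row["link"], created.isoformat())
```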

This post collects the links; the next one fetches the articles themselves.

References:

  1. https://zhuanlan.zhihu.com/p/379062852
