Lao Tangzhu's Crawler Notes: Simulating a Reply on Baidu Tieba
admin
2023-07-30 21:18:43

Lao Tangzhu's English may be excellent, but his Chinese still needs some study, so please bear with the various errors in this article.

Prerequisites

  1. Python 2.7.11
  2. requests

Try it yourself

Post a reply in the 老九门 (TV series) tieba with Chrome DevTools open to see what happens.

  1. Open http://tieba.baidu.com/p/4646166740
  2. Type: 苟利国家生死以 岂因祸福避趋之
  3. Click the Post (发表) button


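Screenshot 1

What the Network tab shows

1. The URL is http://tieba.baidu.com/f/commit/post/add,
and the method is POST.


Screenshot 2

2. The request headers, along with the cookie information, are shown below.


Screenshot 3

3. The form parameters sent with the request


Screenshot 4

4. The response


Screenshot 5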

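Working it out

Let's look at the form parameters first.
ie: presumably the encoding, utf-8

kw: in all likelihood the tieba name

fid: What big news is this 20578208? Where does it come from? Think back to what we did earlier: first we opened the thread at http://tieba.baidu.com/p/4646166740, then we clicked Post. So 20578208 most likely appears somewhere in the thread page's source. Is that really the case? Let's Ctrl-F the page source.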


Screenshot 6

Ha, just as expected. A regex digs it right out.

re.findall(r"""fid\s*:\s*'\s*(.*?)\s*'""", content)[0]

Later testing indicated that fid is the ID that identifies the tieba itself.
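A minimal sketch of this extraction step, run against a made-up page fragment rather than the live page. The sample HTML and the helper name `extract_page_field` are illustrative assumptions; only the shape of the regular expression comes from the article.

```python
import re

def extract_page_field(html, name):
    """Return the single-quoted value of `name: '...'` found in html, or None."""
    # Same shape as the article's regex, with the field name made a parameter.
    pattern = re.escape(name) + r"\s*:\s*'\s*(.*?)\s*'"
    match = re.search(pattern, html)
    return match.group(1) if match else None

# Hypothetical fragment imitating the inline script of a thread page.
sample_html = "PageData.forum = { fid : '20578208', forumName:'demo' };"

fid = extract_page_field(sample_html, "fid")
missing = extract_page_field(sample_html, "tbs")  # not present in the sample
```

Returning None instead of indexing `[0]` directly means a changed page layout shows up as a missing value rather than an IndexError.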

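tid: 4646166740? The URL http://tieba.baidu.com/p/4646166740 makes it obvious: it is the thread ID.

vcode_md5: just leave it empty.

floor_num: with excellent English this one is easy: the floor number of the reply. Testing shows the value is not strict; most values work, for example 500.

rich_text: just 1.

tbs: remember how we found fid? It comes out of the page source the same way.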

re.findall(r"""tbs:\s*'(.*?)',""", content)[0]

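content: the two lines of verse we posted earlier.

files: we attached no files, so [] will do.

mouse_pwd: now this is big news: the mouse trajectory! How exactly is it computed? In a word: no comment! But there is always a way around it.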

mouse_list = [
    "118,112,113,110,115,123,114,117,75,115,110,114,110,115,110,114,110,115,110,114,110,115,110,114,110,115,110,114,75,117,113,118,114,75,115,112,122,114,110,122,114,114,",
    "90,84,82,78,83,84,87,83,107,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,107,83,81,84,81,83,107,83,80,90,82,78,90,82,82,",
    "9,4,1,28,1,9,8,0,57,1,28,0,28,1,28,0,28,1,28,0,28,1,28,0,28,1,28,0,57,4,7,6,5,57,1,2,8,0,28,8,0,0,",
    "80,91,86,79,82,84,91,86,106,82,79,83,106,87,86,84,80,106,82,81,91,83,79,91,83,83,",
    "96,101,96,125,96,104,96,101,88,96,125,97,125,96,125,97,125,96,125,97,125,96,125,97,125,96,125,97,88,103,98,97,105,88,96,99,105,97,125,105,97,97,",
    "19,29,24,7,25,18,19,18,34,26,7,27,7,26,7,27,7,26,7,27,7,26,7,27,7,26,7,27,34,31,26,25,19,34,26,25,19,27,7,19,27,27,",
    "59,63,59,38,50,63,51,50,3,59,38,58,38,59,38,58,38,59,38,58,38,59,38,58,38,59,38,58,3,60,58,56,62,3,59,56,50,58,38,50,58,58,",
    "81,87,91,78,81,91,82,80,107,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,78,83,78,82,107,85,81,91,86,107,83,80,90,82,78,90,82,82,",
    "30,30,28,0,30,20,28,27,37,29,0,28,0,29,0,28,0,29,0,28,0,29,0,28,0,29,0,28,37,31,26,21,25,37,29,30,20,28,0,20,28,28,",
    "103,106,107,127,98,106,99,102,90,98,127,99,127,98,127,99,127,98,127,99,127,98,127,99,127,98,127,99,90,98,97,99,103,107,90,98,97,107,99,127,107,99,99,",
    "37,32,32,58,36,39,34,46,31,39,58,38,58,39,58,38,58,39,58,38,58,39,58,38,58,39,58,38,31,34,46,34,39,37,31,47,32,38,58,35,34,38,",
]
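One way to read "there is always a way around it" is to replay a recorded value instead of computing a new one. The sketch below assumes, as the complete script later in this article does with a single fixed string, that a recorded prefix followed by a timestamp is accepted. The helper name `build_mouse_pwd`, the shortened sample prefixes, and the timestamp value are all made up for illustration.

```python
import random

def build_mouse_pwd(samples, timestamp):
    """Pick one recorded prefix at random and append the timestamp string."""
    return random.choice(samples) + timestamp

# Two shortened, made-up prefixes standing in for the recorded entries.
demo_samples = ["118,112,113,", "90,84,82,"]
demo_timestamp = "1469801697123"  # hypothetical value

mouse_pwd = build_mouse_pwd(demo_samples, demo_timestamp)
```

Whether the server checks the prefix against the actual mouse movement is not established here; the article only reports that replayed values worked at the time.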

mouse_pwd_t: by the book, this is a timestamp.

str(time.time()).replace(".", "")
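`str(time.time())` produces a different number of digits depending on the Python version and on trailing zeros in the fraction. If the field is meant to look like a browser-style millisecond timestamp (an assumption; the article only establishes that it is a timestamp), a fixed-width variant could be sketched like this. The helper name `millis_timestamp` is hypothetical.

```python
import time

def millis_timestamp(now=None):
    """Return Unix time in whole milliseconds as a string of digits."""
    if now is None:
        now = time.time()
    return str(int(now * 1000))

stamp = millis_timestamp()              # current time
fixed = millis_timestamp(1469801697.5)  # fixed input, for illustration
```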

mouse_pwd_isclick: 0. Decided, 0 it is.

type: that would be "reply" (sent as __type__ in the code further down).
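The fields walked through above can be gathered in one place. This is only a sketch: the function name `build_reply_form` and every argument value in the usage lines are placeholders, and the fixed values are simply the ones the article settled on.

```python
def build_reply_form(kw, fid, tid, tbs, content, mouse_pwd, mouse_pwd_t):
    """Assemble the form fields described above into one dict."""
    return {
        "ie": "utf-8",
        "kw": kw,                    # tieba name
        "fid": fid,                  # tieba ID, from the page source
        "tid": tid,                  # thread ID, from the URL
        "vcode_md5": "",
        "floor_num": "500",          # not checked strictly, per the article
        "rich_text": "1",
        "tbs": tbs,                  # from the page source
        "content": content,
        "files": "[]",
        "mouse_pwd": mouse_pwd,
        "mouse_pwd_t": mouse_pwd_t,
        "mouse_pwd_isclick": "0",
        "__type__": "reply",
    }

# Placeholder values only.
form = build_reply_form(
    kw="demo", fid="20578208", tid="4646166740", tbs="abc123",
    content="hello", mouse_pwd="1,2,3,1469801697123",
    mouse_pwd_t="1469801697123",
)
```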

Next comes the login problem. Lao Tangzhu is going to be lazy here and simply reuse the browser's cookies.


Screenshot 7

Copy them out, then:

def get_cookies():
    # sb = "your own cookies"
    sb = "BAIDUID=A4C1F1C2DC2D78995C5E96C0B5823437:FG=1; PSTM=1469801697; BIDUPSID=C44D78ADAB15C76D89820BA40622B137; BDUSS=0FKempYflhxcjdZd2Z0Wnh5WmVlVW43U1VtSlVyNHZ6UkV-Q3NWeFkzWHM4Y0pYQUFBQUFBJCQAAAAAAAAAAAEAAADgvDV7ZGFyYnJhMDE4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOxkm1fsZJtXc; H_PS_PSSID=20739_1447_18280_17948_20416_17001_15706_11927_20698_20745_20705"
    # Split "k1=v1; k2=v2" into a dict that requests accepts as cookies=...
    tmap = {}
    for i in sb.strip().split(";"):
        # Split on the first "=" only, since values may contain "=" themselves.
        key, value = i.split("=", 1)
        tmap[key.strip()] = value.strip()
    return tmap
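The loop above raises a ValueError if the copied string ends with a semicolon, because the final empty piece has no "=" to split on. A slightly more forgiving variant might look like the sketch below; the function name `parse_cookie_header` and the sample cookie string are made up for illustration.

```python
def parse_cookie_header(raw):
    """Turn 'k1=v1; k2=v2' into a dict, skipping empty or malformed pieces."""
    cookies = {}
    for piece in raw.split(";"):
        if "=" not in piece:
            continue  # e.g. the blank piece after a trailing ';'
        key, value = piece.split("=", 1)
        cookies[key.strip()] = value.strip()
    return cookies

# Made-up cookie string; names and values are placeholders.
demo = "BAIDUID=AAAA:FG=1; BDUSS=xxxx; "
parsed = parse_cookie_header(demo)
```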

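Summary of the approach

Step 1: send a GET request for http://tieba.baidu.com/p/4646166740 and extract fid, tbs and the other parameters from the page.
Step 2: send a POST request to http://tieba.baidu.com/f/commit/post/add with fid, tid and the rest of the parameters, plus your own cookies.
If the "error_code" in the response is 0, the reply was posted successfully.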
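The success check can be written down as a small helper. This sketch assumes the response body is JSON with a top-level "error_code" field, as described above; the helper name `reply_succeeded`, the sample bodies, and the non-zero code 265 are invented for illustration.

```python
import json

def reply_succeeded(body):
    """Return True when the JSON body reports error_code 0."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # body is not JSON (for example an HTML login page)
    if not isinstance(payload, dict):
        return False
    # Accept both the number 0 and the string "0".
    return str(payload.get("error_code")) == "0"

ok = reply_succeeded('{"error_code": 0}')
failed = reply_succeeded('{"error_code": 265}')
garbage = reply_succeeded("<html>login required</html>")
```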

That said, what do these two lines of verse have to do with advertising? I AM ANGRY!


Screenshot 8

The code

# -*- coding: utf-8 -*-
# Written for Python 2.7 (see Prerequisites).
import time
import requests
import re
import random

def get_cookies():
    # Replace this with the Cookie header copied from your own logged-in browser.
    sb = """BAIDUID=A4C1F1C2DC2D78995C5E96C0B5823437:FG=1; PSTM=1469801697; BIDUPSID=C44D78ADAB15C76D89820BA40622B137; BDUSS=0FKempYflhxcjdZd2Z0Wnh5WmVlVW43U1VtSlVyNHZ6UkV-Q3NWeFkzWHM4Y0pYQUFBQUFBJCQAAAAAAAAAAAEAAADgvDV7ZGFyYnJhMDE4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOxkm1fsZJtXc; H_PS_PSSID=20739_1447_18280_17948_20416_17001_15706_11927_20698_20745_20705"""
    tmap = {}
    for i in sb.strip().split(";"):
        key, value = i.split("=", 1)
        tmap[key.strip()] = value.strip()
    return tmap

def get_headers1():
    # Headers for the GET request that loads the thread page.
    return {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip,deflate,sdch',
        'Accept-Language':'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2',
        'Cache-Control':'max-age=0',
        'Connection':'keep-alive',
        'Host':'tieba.baidu.com',
        'Referer':'http://tieba.baidu.com/p/4695010754',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36'}

def get_headers2():
    # Headers for the POST request that submits the reply.
    # Content-Length is left out: requests computes it from the form body.
    return {
        'Accept':'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding':'gzip,deflate,sdch',
        'Accept-Language':'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2',
        'Connection':'keep-alive',
        'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
        'Host':'tieba.baidu.com',
        'Origin':'http://tieba.baidu.com',
        'Referer':'http://tieba.baidu.com/p/4664815593',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest'
        }

# One small getter per form field, in the order the fields were discussed.

def get_ie():
    return "utf-8"

def get_kw(content):
    # Tieba name, taken from the thread page source.
    return re.findall("""forumName:'(.*?)', """, content)[0]

def get_fid(content):
    # Tieba ID, taken from the thread page source.
    return re.findall("""fid:'(.*?)',""", content)[0]

def get_tid(tid):
    return tid

def get_vcode_md5():
    return ""

def get_floor_num():
    return "500"

def get_rich_text():
    return "1"

def get_tbs(content):
    return re.findall(r"""tbs:\s*'(.*?)',""", content)[0]

def get_content():
    return "苟利国家生死以,岂因祸福避趋之"

def get_files():
    return "[]"

def get_sign_id(content):
    # Not used by post_one below.
    return re.findall('"sign_id":(.*?),', content)[0]

def get_mouse_pwd():
    # A recorded prefix followed by the current timestamp.
    return "113,114,115,111,113,116,118,122,74,114,111,115,111,114,111,115,111,114,111,115,111,114,111,115,111,114,111,115,74,114,115,117,112,114,74,122,117,115,111,118,119,115,"+str(time.time()).replace(".", "")

def get_mouse_pwd_t():
    return str(time.time()).replace(".", "")

def get_mouse_pwd_isclick():
    return "0"

def get_type():
    return "reply"


def post_one(tid):
    # tid is a list of thread IDs; one is picked at random.
    tid = random.choice(tid)
    s1 = requests.session()
    headers = get_headers1()
    # Step 1: GET the thread page to obtain kw, fid and tbs.
    g1 = s1.get("http://tieba.baidu.com/p/%s" % (tid), headers=headers, cookies=get_cookies())
    data = {
        "ie": get_ie(),
        "kw": get_kw(g1.content),
        "fid": get_fid(g1.content),
        "tid": get_tid(tid),
        "vcode_md5": get_vcode_md5(),
        "floor_num": get_floor_num(),
        "rich_text": get_rich_text(),
        "tbs": get_tbs(g1.content),
        "content": get_content(),
        "files": get_files(),
        "mouse_pwd": get_mouse_pwd(),
        "mouse_pwd_t": get_mouse_pwd_t(),
        "mouse_pwd_isclick": get_mouse_pwd_isclick(),
        "__type__": get_type()
        }
    # Step 2: POST the reply, with the Referer pointing at the same thread.
    headers = get_headers2()
    headers["Referer"] = 'http://tieba.baidu.com/p/%s' % (tid)
    p1 = s1.post("http://tieba.baidu.com/f/commit/post/add", headers=headers, cookies=get_cookies(), data=data)
    print(p1.content)
    return p1.content


post_one(["4646166740"])

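Tested and working. This time the same content was not deleted.


Screenshot 9


Screenshot 10

Fun and lovely: that is Lao Tangzhu's little crawler exchange.

By the way, what is this?


Captcha 1


Captcha 2

Let's talk again when there is a chance.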
