A Script for Fetching the 52pojie (吾爱破解) Thread List

黑麋鹿

2023-09-05

Background

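Today at work someone sent me photos of a contract, and the photos came out almost black when printed. I remembered seeing a small tool shared by a user on 52pojie that removes dark backgrounds very well, but I had not bookmarked the post, so I would have had to page through the thread list to find it. (I had already tried 52pojie's built-in search without success, and that search also has usage limits.) With more than 200 pages in total, browsing by hand was impractical. My idea was to first download the thread list and then filter it in Excel by specific keywords (such as 图片, "image") to find the post again. In the end I found the target post by trying different keywords in Baidu, but I am recording the code here anyway so that I do not have to redo the work if I need it again.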

import random
import time
from io import StringIO

import pandas as pd
import requests

# Local proxy on port 8889; drop the proxies argument below to connect directly.
proxies = {'http': 'http://127.0.0.1:8889', 'https': 'http://127.0.0.1:8889'}

# Session cookies copied from the browser. They expire, so refresh them before running.
cookies = {
    'wzws_sessionid': 'gDIyMi4yNDQuMTIzLjE3OIJkYjFjYWGBM2MwZWZmoGT2+fU=',
    'htVC_2132_saltkey': 'UW6PnZBb',
    'htVC_2132_lastvisit': '1693902481',
    'htVC_2132_lastact': '1693902666%09forum.php%09forumdisplay',
    'htVC_2132_st_t': '0%7C1693902666%7C8e0f8128b4a9faa43733ba3af60b3420',
    'htVC_2132_forum_lastvisit': 'D_16_1693901860D_2_1693902666',
    'htVC_2132_visitedfid': '2D16',
}

headers = {
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-CN;q=0.8,en;q=0.7,zh-HK;q=0.6',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'DNT': '1',
    'Pragma': 'no-cache',
    'Referer': 'https://www.52pojie.cn/forum-2-1.html',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}


def get_tiewen(page):
    """Return the HTML of one page of the thread list (fid=2, typeid=4)."""
    params = {
        'mod': 'forumdisplay',
        'fid': '2',
        'page': str(page),
        'typeid': '4',
        't': str(random.randint(189047396, 999999999)),  # random value for the t parameter
    }
    # verify=False skips TLS certificate checks because traffic goes through the local proxy.
    response = requests.get('https://www.52pojie.cn/forum.php', headers=headers, params=params,
                            cookies=cookies, proxies=proxies, verify=False, timeout=30)
    return response.text


# Fetch page 2 first, then append the remaining pages to it.
# The thread table is the fourth table on the page (index 3).
frames = [pd.read_html(StringIO(get_tiewen(2)))[3]]

for page in range(3, 234):
    try:
        html = get_tiewen(page)
        frames.append(pd.read_html(StringIO(html))[3])
    except (requests.RequestException, ValueError, IndexError):
        # Print the page number of any page that failed so it can be retried later.
        print(page)
    time.sleep(random.randint(1, 3))  # pause 1 to 3 seconds between requests

pd.concat(frames).to_csv("52.csv", index=False)
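
The keyword filtering described in the background was meant to be done in Excel, but it could probably be done in pandas as well. The following is only a minimal sketch: the function name `filter_threads` and the sample rows are made up for illustration, and because the column names of the scraped table are not known here, it simply searches every column.

```python
import re

import pandas as pd


def filter_threads(df, keywords):
    """Return the rows in which any cell contains at least one keyword."""
    # Escape each keyword so it is matched literally, then OR them together.
    pattern = "|".join(re.escape(k) for k in keywords)
    # The real column names are unknown (assumption), so convert every
    # column to text and search all of them.
    text = df.astype(str)
    mask = text.apply(lambda col: col.str.contains(pattern, case=False)).any(axis=1)
    return df[mask]


# Made-up sample rows standing in for the contents of 52.csv.
sample = pd.DataFrame({
    "title": ["图片去黑底小工具", "PDF merge tool", "Scanned image cleaner"],
    "author": ["user_a", "user_b", "user_c"],
})
matches = filter_threads(sample, ["图片", "image"])
print(matches)
```

With the real file, the call would look something like `filter_threads(pd.read_csv("52.csv"), ["图片"])`.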