Prerequisites
- Python 3.6
- PyCharm
- A reliable network connection
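Libraries

    import re
    import time

    import eventlet
    import requests
    from bs4 import BeautifulSoup

- requests: sends the HTTP requests
- re: regular expressions for picking out email addresses (standard library)
- eventlet: concurrent networking library, used here for timeouts
- time: random delays between requests (standard library)
- BeautifulSoup: parses the downloaded HTML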
Network requests
Request headers
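
    import random

    # Pool of User-Agent strings to choose from.
    USER_AGENTS = [
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        # ... ...
        # More User-Agent strings can be found with a Baidu search.
    ]
    # Pick one at random; repeat this line before each request to rotate.
    headers = {'User-Agent': random.choice(USER_AGENTS)}

Choosing among several User-Agent strings makes it less likely that the crawler gets blocked.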
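The library list mentions time for random delays, but the article does not show that step. A minimal sketch of how a random pause and a per-request User-Agent choice might look; the helper names `pick_headers` and `random_delay` and the 1 to 3 second default range are assumptions for illustration, not part of the original script:

```python
import random
import time

# Two of the User-Agent strings from the article; extend the pool as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; "
    "Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
]


def pick_headers(user_agents):
    """Return a headers dict with one randomly chosen User-Agent."""
    return {'User-Agent': random.choice(user_agents)}


def random_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval and return the delay that was used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Hypothetical usage: call both helpers once per request inside the page loop.
headers = pick_headers(USER_AGENTS)
print(headers['User-Agent'])
```

Calling `pick_headers` inside the loop, rather than once at the top of the script, is what actually varies the User-Agent from request to request.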
Requesting the result pages
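The request loop in this section steps the pn parameter from 0 to 990 in increments of 10 and concatenates it onto the URL by hand. As a sketch of the same idea, the URLs could be generated with `range` and `urllib.parse.urlencode`, which also percent-encodes the keyword. The function name `build_search_urls` and its parameters are hypothetical; only the base URL and the `wd` and `pn` parameter names come from the article:

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.baidu.com/s'


def build_search_urls(keyword, pages=100, page_size=10):
    """Yield one search URL per result page (pn = 0, 10, 20, ...)."""
    for offset in range(0, pages * page_size, page_size):
        # urlencode escapes spaces and non-ASCII characters in the keyword.
        yield BASE_URL + '?' + urlencode({'wd': keyword, 'pn': offset})


# Made-up keyword, first three pages only.
urls = list(build_search_urls('contact email', pages=3))
for url in urls:
    print(url)
```

With the defaults (100 pages of 10 results) this covers the same offsets as the hand-written list.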

    # pn is the result offset: 0, 10, 20, ..., 990 (100 pages).
    for a in range(0, 1000, 10):
        # Replace keywords with the actual search terms.
        url = 'http://www.baidu.com/s?wd="keywords"&pn=' + str(a)
        web_data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(web_data, 'lxml')
        # Title links: <a> directly inside <h3 class="t"> within <div class="result">.
        titles = soup.select('div.result h3.t > a')

Extracting the result titles and links, and recording them
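The recording step in this section writes `str()` of a (title, link) tuple, which is awkward to read back later. A minimal alternative sketch that stores one tab-separated record per line; the helper name `append_result`, the sample data, and the temporary file used in place of E:/test.txt are all assumptions for illustration:

```python
import os
import tempfile


def append_result(path, title, link):
    """Append one 'title<TAB>link' line to the results file."""
    # Collapse tabs and newlines in the title so each record stays on one line.
    clean_title = ' '.join(title.split())
    with open(path, 'a', encoding='utf-8') as f:
        f.write(clean_title + '\t' + link + '\n')


# Demo with made-up data, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'results.txt')
append_result(demo_path, 'Example  page\ntitle', 'http://example.com/a')
append_result(demo_path, 'Another page', 'http://example.com/b')

with open(demo_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Each line can then be split on the tab character to recover the title and the link.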
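
    # Runs inside the page loop, once per page of results.
    for title in titles:
        print(title.get_text(), title.get('href'))
        temp = title.get_text(), title.get('href')
        # Append the (title, link) pair; the with block closes the file.
        with open('E:/test.txt', 'a', encoding='utf-8') as f:
            f.write('\n' + '\n' + str(temp))

Visiting the recorded URLs and extracting email addresses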
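
    # Runs inside the "for title in titles" loop, once per result link.
    try:
        # Patch blocking socket calls so that eventlet.Timeout can interrupt
        # requests.get; calling this once at program start is enough.
        eventlet.monkey_patch()
        # Give up on this page after 10 seconds without raising an error.
        with eventlet.Timeout(10, False):
            # verify=False skips TLS certificate verification.
            web_data2 = requests.get(title.get('href'), verify=False)
            soup2 = BeautifulSoup(web_data2.text, 'lxml')
            # q is left out of the domain character class on purpose.
            regex = r'([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\.[a-zA-Z0-9-.]+)'
            emails = re.findall(regex, str(soup2))
            with open('E:/test.txt', 'a', encoding='utf-8') as f:
                for email in emails:
                    f.write('\n' + str(email))
    except Exception:
        # Skip pages that fail to load or parse and move on to the next link.
        pass
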
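Note: this regular expression does not capture QQ Mail addresses, because q is left out of the character class that matches the first part of the domain.
- When the program finishes, test.txt appears in the root of drive E: and contains the page titles, page links, and email addresses.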
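The effect of leaving q out of the domain character class can be checked offline. A minimal sketch, using the article's pattern with the uppercase range assumed to be `A-PR-Z`, run on made-up sample text; the helper name `extract_emails` and the de-duplication step are additions for illustration:

```python
import re

# Pattern from the article; the uppercase range A-PR-Z is an assumption
# about what the original character class was meant to be.
EMAIL_RE = re.compile(r'([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\.[a-zA-Z0-9-.]+)')


def extract_emails(text):
    """Return the unique matches in order of first appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


sample = 'alice@example.com 12345@qq.com sales@example.org alice@example.com'
emails = extract_emails(sample)
print(emails)  # ['alice@example.com', 'sales@example.org']
```

The same exclusion applies to any address whose first domain label contains the letter q, not only qq.com, so the pattern also skips an address such as bob@esquire.com.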
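About me
- National Huaqiao University
- Second-year software engineering student
- Independent video creator
- Interested in just about everything
- Contact: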
- qq: 1093846898
- wechat: czh-0526
- e-mail: 1093846898@qq.com