[Python 3] Batch-scraping email addresses from Baidu search results

Tools

  • Python 3.6
  • PyCharm
  • a stable network connection

Third-party libraries

import requests
import re
import eventlet
import time
from bs4 import BeautifulSoup
  • requests: sends the HTTP requests
  • re: regular-expression filtering of email addresses
  • eventlet: concurrency library, used here to time out slow requests
  • time: used to add random delays between requests
  • BeautifulSoup: parses the fetched HTML

Making the requests

  • Request headers

    import random

    # Pool of User-Agent strings to rotate through
    USER_AGENTS = [
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        # ... more User-Agent strings can be found online
    ]
    headers = {'User-Agent': random.choice(USER_AGENTS)}

    Rotating among several User-Agent strings makes the crawler harder to detect and block.
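Per-request rotation can be sketched as follows; the User-Agent strings and the `fresh_headers` helper name are illustrative, not from the original script:

```python
import random

# Example pool; any real browser User-Agent strings can be substituted.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def fresh_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Calling `fresh_headers()` before each `requests.get` gives every request a different fingerprint instead of reusing one string for the whole run.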

  • Fetching the result pages

    # Baidu shows 10 results per page; pn is the zero-based result offset
    for a in range(0, 1000, 10):
        url = 'http://www.baidu.com/s?wd="keywords"&pn=' + str(a)
        web_data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(web_data, 'lxml')
        titles = soup.select('div.result h3.t > a')
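The pagination scheme can be sketched on its own; `baidu_page_urls` is a hypothetical helper, and the 10-results-per-page stride is an assumption about Baidu's `pn` offset:

```python
from urllib.parse import quote

def baidu_page_urls(keyword, pages):
    """Build one search-result URL per page; pn is assumed to be the
    zero-based result offset, advancing by 10 per page."""
    return [
        "http://www.baidu.com/s?wd={}&pn={}".format(quote(keyword), pn)
        for pn in range(0, pages * 10, 10)
    ]
```

`quote()` percent-encodes spaces and non-ASCII keywords, which plain string concatenation does not do.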
  • Record the title and link of each search result

    for title in titles:
        print(title.get_text(), title.get('href'))
        temp = title.get_text(), title.get('href')
        with open('E:/test.txt', 'a', encoding='utf-8') as f:
            f.write('\n' + '\n' + str(temp))
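The `div.result h3.t > a` selector can be checked against a hand-made fragment; the HTML below is an assumption about Baidu's result markup, purely for illustration:

```python
from bs4 import BeautifulSoup

# Hand-made fragment mimicking the assumed structure of a Baidu result list.
html = """
<div class="result">
  <h3 class="t"><a href="http://example.com/1">First result</a></h3>
</div>
<div class="result">
  <h3 class="t"><a href="http://example.com/2">Second result</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; no lxml needed here
pairs = [(a.get_text(), a.get("href")) for a in soup.select("div.result h3.t > a")]
```

Each match yields a (title, link) pair, which is exactly the tuple the loop above writes to disk.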
  • Visit each recorded URL and scrape any email addresses on the page

    try:
        eventlet.monkey_patch()

        with eventlet.Timeout(10, False):  # give up on pages that take over 10 s
            web_data2 = requests.get(title.get('href'), verify=False)
            soup2 = BeautifulSoup(web_data2.text, 'lxml')
            # the domain character classes omit 'q'/'Q', so QQ-mail addresses are skipped
            regex = r'([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\.[a-zA-Z0-9-.]+)'
            emails = re.findall(regex, str(soup2))

            for email in emails:
                with open('E:/test.txt', 'a', encoding='utf-8') as f:
                    f.write('\n' + str(email))

    except Exception:
        pass  # skip pages that fail to load or parse

Note: the regex deliberately does not match QQ mailbox addresses (the domain character class omits the letter q).
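The exclusion can be verified directly; the pattern here writes the uppercase range out as `A-PR-Z` so both cases skip the letter q:

```python
import re

# Domain character classes skip the letter 'q', so any domain containing
# 'q' (such as qq.com) cannot match.
regex = r'([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\.[a-zA-Z0-9-.]+)'

sample = "contact alice@example.com or bob12345@qq.com for details"
emails = re.findall(regex, sample)
# only the non-QQ address is found: ['alice@example.com']
```

The local part (before the @) still allows q, so only the domain is filtered.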

  • When the script finishes, a test.txt file will appear in the root of drive E containing the page titles, page links, and harvested email addresses.
