[Python 3] Batch-scraping email addresses from Baidu search results

Tools

  • Python 3.6
  • PyCharm
  • a stable network connection

Third-party libraries

import requests
import re
import eventlet
import time
from bs4 import BeautifulSoup
  • requests: sends the HTTP requests
  • re: regular-expression filtering of email addresses
  • eventlet: concurrency library, used here to time out slow requests
  • time: used to add random delays between requests
  • BeautifulSoup: parses the fetched HTML

Making the requests

  • Request headers

    import random

    # Pool of User-Agent strings to rotate through
    USER_AGENTS = [
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        # ... more User-Agent strings can be found online
    ]
    headers = {'User-Agent': random.choice(USER_AGENTS)}

    Rotating among several User-Agent strings makes the crawler harder to detect and block.
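Per-request rotation can be sketched as follows; the User-Agent strings and the `fresh_headers` helper name are illustrative, not from the original script:

```python
import random

# Example pool; any real browser User-Agent strings can be substituted.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def fresh_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Calling `fresh_headers()` before each `requests.get` gives every request a different fingerprint instead of reusing one string for the whole run.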

  • Fetching the result pages

    # Baidu shows 10 results per page; pn is the zero-based result offset
    for a in range(0, 1000, 10):
        url = 'http://www.baidu.com/s?wd="keywords"&pn=' + str(a)
        web_data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(web_data, 'lxml')
        titles = soup.select('div.result h3.t > a')
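The pagination scheme can be sketched on its own; `baidu_page_urls` is a hypothetical helper, and the 10-results-per-page stride is an assumption about Baidu's `pn` offset:

```python
from urllib.parse import quote

def baidu_page_urls(keyword, pages):
    """Build one search-result URL per page; pn is assumed to be the
    zero-based result offset, advancing by 10 per page."""
    return [
        "http://www.baidu.com/s?wd={}&pn={}".format(quote(keyword), pn)
        for pn in range(0, pages * 10, 10)
    ]
```

`quote()` percent-encodes spaces and non-ASCII keywords, which plain string concatenation does not do.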
  • Record the title and link of each search result

    for title in titles:
        print(title.get_text(), title.get('href'))
        temp = title.get_text(), title.get('href')
        with open('E:/test.txt', 'a', encoding='utf-8') as f:
            f.write('\n' + '\n' + str(temp))
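The `div.result h3.t > a` selector can be checked against a hand-made fragment; the HTML below is an assumption about Baidu's result markup, purely for illustration:

```python
from bs4 import BeautifulSoup

# Hand-made fragment mimicking the assumed structure of a Baidu result list.
html = """
<div class="result">
  <h3 class="t"><a href="http://example.com/1">First result</a></h3>
</div>
<div class="result">
  <h3 class="t"><a href="http://example.com/2">Second result</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; no lxml needed here
pairs = [(a.get_text(), a.get("href")) for a in soup.select("div.result h3.t > a")]
```

Each match yields a (title, link) pair, which is exactly the tuple the loop above writes to disk.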
  • Visit each recorded URL and scrape any email addresses on the page

    try:
        eventlet.monkey_patch()

        with eventlet.Timeout(10, False):  # give up on pages that take over 10 s
            web_data2 = requests.get(title.get('href'), verify=False)
            soup2 = BeautifulSoup(web_data2.text, 'lxml')
            # the domain character classes omit 'q'/'Q', so QQ-mail addresses are skipped
            regex = r'([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\.[a-zA-Z0-9-.]+)'
            emails = re.findall(regex, str(soup2))

            for email in emails:
                with open('E:/test.txt', 'a', encoding='utf-8') as f:
                    f.write('\n' + str(email))

    except Exception:
        pass  # skip pages that fail to load or parse

Note: the regex deliberately does not match QQ mailbox addresses (the domain character class omits the letter q).
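The exclusion can be verified directly; the pattern here writes the uppercase range out as `A-PR-Z` so both cases skip the letter q:

```python
import re

# Domain character classes skip the letter 'q', so any domain containing
# 'q' (such as qq.com) cannot match.
regex = r'([a-zA-Z0-9_.+-]+@[a-pr-zA-PR-Z0-9-]+\.[a-zA-Z0-9-.]+)'

sample = "contact alice@example.com or bob12345@qq.com for details"
emails = re.findall(regex, sample)
# only the non-QQ address is found: ['alice@example.com']
```

The local part (before the @) still allows q, so only the domain is filtered.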

  • When the script finishes, a test.txt file will appear in the root of drive E containing the page titles, page links, and harvested email addresses.
