Scrapy Distributed Crawling (Part 2)


We are following along with the author's distributed crawler.
Tutorial: https://blog.csdn.net/seven_2016/article/details/72802961
Project: https://github.com/shisiying/tc_zufang

I. Preface

If this were only about crawling pages and storing the results, there would be little to say; we have done that kind of simple, repetitive work many times already.
What we can actually learn from this project: the Django pages, the web framework used by the IP proxy pool, and the way the author organizes and packages the code.
The Django pages have already been analyzed;
the IP proxy pool is an open-source project in its own right, and in my opinion it is harder to fully digest than this project itself;
as for code organization, the author shows a clean way to structure Python code, including how the helper functions are grouped, how the Scrapy middlewares are split up, and the anti-ban settings.

II. Approach

1. The author uses the open-source IP proxy project to serve the collected proxy IPs on local port 8000, i.e. by running ipproxy.py.
2. During crawling, the author's own ipproxy downloader middleware reads from that service and sets a proxy on each request; he also wrote a user-agent middleware, a redirect middleware and a timeout middleware, all of which are registered in settings (for how to write middleware and anti-ban settings, see my earlier posts).
The utils folder contains the helper functions, already split by purpose.
3. The crawler first pulls proxy IPs from the local service; the master crawls the URL lists while the slaves crawl the page details and store them in MongoDB (a minimal scrapy-redis spider sketch follows this list).
4. Finally, the Django pages read the data out of the database and display it.
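
To make the master/slave split in step 3 concrete, here is a minimal scrapy-redis sketch of a slave spider that pulls its start URLs from Redis. The class name, spider name and XPath are hypothetical placeholders, not the author's tczufang_detail_spider, and the redis_key is assumed to match the 'start_urls' list that InsertRedis.py pushes to.

# -*- coding: utf-8 -*-
# Minimal scrapy-redis slave sketch (hypothetical names, not the author's spider)
from scrapy_redis.spiders import RedisSpider

class DetailSpiderSketch(RedisSpider):
    name = 'detail_sketch'
    # scrapy-redis pops URLs from this Redis list and turns them into requests;
    # assumed to be the same list that inserintotc() pushes to
    redis_key = 'start_urls'

    def parse(self, response):
        # placeholder extraction: a real spider would pull the rental-listing fields here
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }

Started with scrapy crawl detail_sketch, such a spider simply blocks on Redis until URLs appear in the queue.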

III. IP Proxy Analysis

See the open-source author's blog post:
http://www.cnblogs.com/qiyeboy/p/5693128.html

IV. Django Analysis

See my earlier posts.

V. Master and Slave Code Analysis

The master and slave differ only in what information they scrape; the overall framework is the same,
as shown below:
├── __init__.py
├── __init__.pyc
├── Proxy_Middleware.py                        random proxy
├── Proxy_Middleware.pyc
├── redirect_middleware.py                     blocks redirects
├── redirect_middleware.pyc
├── rotate_useragent_dowmloadmiddleware.py     random User-Agent
├── rotate_useragent_dowmloadmiddleware.pyc
├── timeout_middleware.py                      timeout middleware
├── timeout_middleware.pyc
├── settings.py                                disables cookies (basic anti-ban), download delay, database settings, etc.
├── settings.pyc
├── spiders
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── logs
│   │   └── scrapy.log
│   ├── tczufang_detail_spider.py
│   └── tczufang_detail_spider.pyc
└── utils                                      helper function folder
    ├── GetProxyIp.py                          fetch proxy IPs
    ├── GetProxyIp.pyc
    ├── __init__.py
    ├── __init__.pyc
    ├── InsertRedis.py                         push data into Redis
    ├── InsertRedis.pyc
    ├── message.py                             send a warning email when the crawler is banned
    ├── message.pyc
    ├── result_parse.py
    └── result_parse.pyc
 
OK, let's look at the pieces in detail.

1) The Python files under utils

1.GetProxyIp.py

import requests

def GetIps():
    li=[]
    global count
    # ask the local IPProxyPool service for up to 300 usable proxies
    url ='http://127.0.0.1:8000/?types=0&count=300'
    ips=requests.get(url)
    # the service returns a list of (ip, port, ...) entries; join them as 'ip:port'
    for ip in eval(ips.content):
        li.append(ip[0]+':'+ip[1])
    return li
GetIps()

You can clearly see that it requests local port 8000; if you open that URL in a browser you will find the response really is a list, so the function simply parses that list into 'ip:port' strings.
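
As a side note, here is a sketch of the same lookup that parses the response with json.loads instead of eval; it assumes the service really returns JSON in the form [[ip, port, score], ...], which is what the browser output suggests, but that format is an assumption on my part.

import json
import requests

def get_ips_json(count=300):
    # same endpoint as GetProxyIp.py, parsed with json.loads instead of eval
    url = 'http://127.0.0.1:8000/?types=0&count=%d' % count
    resp = requests.get(url, timeout=5)
    proxies = []
    # assumed response format: [["1.2.3.4", 8080, score], ...]
    for entry in json.loads(resp.content):
        proxies.append('%s:%s' % (entry[0], entry[1]))
    return proxies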
2.InsertRedis.py

# -*- coding: utf-8 -*-
import redis
def inserintotc(str,type):
    # push a start URL onto the Redis list that the first (master) layer reads from
    try:
        r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    except:
        print 'failed to connect to redis'
    else:
        if type == 1:
            r.lpush('start_urls', str)
def inserintota(str,type):
    # push a detail-page request onto the slave's scrapy-redis request queue
    try:
        r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    except:
        print 'failed to connect to redis'
    else:
        if type == 2:
            r.lpush('tczufang_tc:requests', str)

Searching the project for where these functions are called, you will find they are both used from the spider's request-handling code:
inserintotc pushes to the first layer, i.e. the start_urls list,
and inserintota pushes to the second (detail) layer, the tczufang_tc:requests queue.
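
For a quick local test you can seed the queue by hand and confirm the spider picks it up; the snippet below is only a sketch, and the 58.com URL is a placeholder rather than one taken from the project.

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)
# seed the first-layer queue the same way inserintotc(url, 1) would
r.lpush('start_urls', 'http://cd.58.com/chuzu/')   # placeholder listing URL
print(r.llen('start_urls'))                        # number of seed URLs waiting in the queue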
3.message.py

# -*- coding: utf-8 -*-
import smtplib
from email.mime.text import MIMEText
from email.header import Header
def sendMessage_warning():
    server = smtplib.SMTP('smtp.163.com', 25)
    server.login('seven_2016@163.com', '*******')
    msg = MIMEText('爬虫Master被封警告!请求解封!', 'plain', 'utf-8')
    msg['From'] = 'seven_2016@163.com <seven_2016@163.com>'
    msg['Subject'] = Header(u'爬虫被封禁警告!', 'utf8').encode()
    msg['To'] = u'seven <751401459@qq.com>'
    server.sendmail('seven_2016@163.com', ['751401459@qq.com'], msg.as_string())

Sends a warning email when the crawler gets banned.
4.result_parse.py

# -*- coding: utf-8 -*-
# return None when the list is empty, e.g. when there is no next-page URL
list_first_item = lambda x:x[0] if x else None
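
A quick sanity check of how the helper behaves (the values are made up):

# behaves like a safe "first element or None"
print(list_first_item(['/chuzu/pn2/']))   # -> '/chuzu/pn2/'
print(list_first_item([]))                # -> None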

2) Middleware

1.Proxy_Middleware.py

# -*- coding: utf-8 -*-
# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64
import random
from tc_zufang.utils.GetProxyIp import GetIps
# Start your middleware class
class ProxyMiddleware():
    global count
    count=1
    global ips
    ips=[]
    # overwrite process request
    # def process_request(self, request, spider):
    #     # Set the location of the proxy
    #     global count
    #     if count%4==0:
    #         print '####'
    #         ip=GetIps()
    #         request.meta['proxy'] = ip
    #         print request.meta
    #         print 'get the proxyIp'
    #     count+=1
    #     print 'the ip count is %d'%(count)
    def process_request(self, request, spider):
        # Set the location of the proxy
        global count
        global ips
        if count == 1:
            # first request: fetch an initial batch of proxies from the local service
            ips = GetIps()
        elif count % 100 == 0:
            # refresh the proxy pool every 100 requests
            ips = []
            ips = GetIps()
        else:
            pass
        try:
            # pick a random proxy; randint's upper bound is inclusive, so an
            # out-of-range index is possible and is swallowed by the except below
            num = random.randint(0, len(ips))
            ress = 'http://' + ips[num]
        except:
            # no usable proxy: fall back to crawling from the local IP
            pass
        else:
            request.meta['proxy'] = str(ress)
            count += 1

What the author's code does: on the first request it fetches a batch of proxies from the local service, refreshes that batch every 100 requests, and then picks one IP at random for each request; if the pick fails (for example when the pool is empty), the request is sent from the local IP instead.
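
A slightly tidier variant of the same pick is sketched below: random.choice sidesteps the off-by-one that random.randint(0, len(ips)) allows, and the empty-pool case is handled explicitly. This is just an illustration, not the author's code.

import random

def pick_proxy(ips):
    # return a proxy URL from the pool, or None to fall back to the local IP
    if not ips:
        return None
    return 'http://' + random.choice(ips)

# inside process_request this would become:
#     proxy = pick_proxy(ips)
#     if proxy:
#         request.meta['proxy'] = proxy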

2.redirect_middleware.py

#!/usr/bin/python
#-*-coding:utf-8-*-
import urlparse
from scrapy.exceptions import IgnoreRequest
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from tc_zufang.utils.message import sendMessage_warning
class Redirect_Middleware():
    '''A note on handling the response freshly downloaded by the downloader: in the spider middleware I disabled the httperror middleware, because I also disabled Scrapy's built-in redirect, retry and meta-refresh downloader middlewares. The reason is that they badly hurt crawler performance: they only add overhead without bringing any benefit. Why? Every redirect can mean another DNS lookup and TCP/IP connection before the HTTP request is even sent, and there is no reason to waste that time. So my approach is to accept every response, handle it in a downloader middleware, and hand only 200 responses on to the engine. For redirects (301, 302, meta-refresh and so on) I extract the 'location' header from the response, build a new request object and hand it back to the scheduler. 404 responses are simply dropped. For 500+ responses, the original request object is handed back to the scheduler. This way normal crawling is not affected, and no request that still needs to be crawled is lost.'''
    global count
    count = 1
    def process_response(self, request, response, spider):
        # handle the freshly downloaded response
        # reschedule every 3xx response except 304
        http_code = response.status
        if http_code // 100 == 2:
            return response
        if http_code // 100 == 3 and http_code != 304:
            # get the redirect target url
            # url = response.headers['location']
            # domain = urlparse.urlparse(url).netloc
            # check whether the redirect target's domain is in allowed_domains
            # if domain in spider.allowed_domains:
            #     return Request(url=url, meta=request.meta)
            # else:
            global count
            if count == 1:
                sendMessage_warning()
            print '302'
            count += 1
            # hand the request back to the scheduler for another try
            return request.replace(dont_filter=True)
        if http_code // 100 == 4:
            # note: 403 is not a server error, it means access is forbidden
            raise IgnoreRequest(u'404')
        if http_code // 100 == 5:
            return request.replace(dont_filter=True)

3.rotate_useragent_dowmloadmiddleware.py

#!/usr/bin/python
#-*-coding:utf-8-*-
import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
class RotateUserAgentMiddleware(UserAgentMiddleware):
    """
        a useragent middleware which rotate the user agent when crawl websites
        if you set the USER_AGENT_LIST in settings,the rotate with it,if not,then use the default user_agent_list attribute instead.
    """
    #the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape
    #for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [\
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31',\
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17',\
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',\
        \
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',\
        'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)',\
        'Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)',\
        \
        'Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1',\
        'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',\
        'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:15.0) Gecko/20120910144328 Firefox/15.0.2',\
        \
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',\
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a3pre) Gecko/20070330',\
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203',\
        \
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',\
        'Opera/9.80 (X11; Linux x86_64; U; fr) Presto/2.9.168 Version/11.50',\
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; de) Presto/2.9.168 Version/11.52',\
        \
        'Mozilla/5.0 (Windows; U; Win 9x 4.90; SG; rv:1.9.2.4) Gecko/20101104 Netscape/9.1.0285',\
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3',\
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',\
    ]
    def __init__(self, user_agent=''):
        self.user_agent = user_agent
    def _user_agent(self, spider):
        if hasattr(spider, 'user_agent'):
            return spider.user_agent
        elif self.user_agent:
            return self.user_agent
        return random.choice(self.user_agent_list)
    def process_request(self, request, spider):
        ua = self._user_agent(spider)
        if ua:
            request.headers.setdefault('User-Agent', ua)

This adapts Scrapy's built-in UserAgentMiddleware so that each request gets a random User-Agent drawn from the list, unless the spider defines its own user_agent attribute.
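
The class docstring mentions a USER_AGENT_LIST setting, but the code above never reads it; below is a hedged sketch of how that could be wired up via from_crawler. USER_AGENT_LIST is a setting name I am assuming from the docstring, not something the project actually defines.

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class SettingsAwareRotateUA(UserAgentMiddleware):
    # sketch only: pull the UA pool from a USER_AGENT_LIST setting if present
    def __init__(self, user_agent='', ua_list=None):
        super(SettingsAwareRotateUA, self).__init__(user_agent)
        self.ua_list = ua_list or []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent=crawler.settings.get('USER_AGENT', ''),
            ua_list=crawler.settings.getlist('USER_AGENT_LIST'),
        )

    def process_request(self, request, spider):
        if self.ua_list:
            request.headers.setdefault('User-Agent', random.choice(self.ua_list))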

4.timeout_middleware.py    (timeout middleware)

# -*- coding: utf-8 -*-
from scrapy.http import Request
from scrapy.downloadermiddlewares.downloadtimeout import DownloadTimeoutMiddleware
class Timeout_Middleware(DownloadTimeoutMiddleware):
    def process_exception(self, request, exception, spider):
        # on a download exception (e.g. a timeout), log it and reschedule the same request
        print exception
        return request.replace(dont_filter=True)

The download delay (and the timeout itself) can also be set in the settings file.

5. settings.py

# -*- coding: utf-8 -*-
BOT_NAME = 'tc_zufang'
SPIDER_MODULES = ['tc_zufang.spiders']
NEWSPIDER_MODULE = 'tc_zufang.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tc_zufang (+http://www.yourdomain.com)'
#maximum number of items the item pipeline processes concurrently: 100
# CONCURRENT_ITEMS=100
#maximum concurrent requests handled by the scrapy downloader: 16
#CONCURRENT_REQUESTS=4
#maximum concurrent requests against a single site: 8
#CONCURRENT_REQUESTS_PER_DOMAIN=2
#maximum allowed crawl depth (0 means unlimited)
DEPTH_LIMIT=0
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
DOWNLOAD_TIMEOUT=10
DNSCACHE_ENABLED=True
#anti-ban strategy 1: disable cookies
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
CONCURRENT_REQUESTS=4
#CONCURRENT_REQUESTS_PER_IP=2
#CONCURRENT_REQUESTS_PER_DOMAIN=2
#set a download delay to avoid getting banned
DOWNLOAD_DELAY = 5
DOWNLOAD_DELAY = 5
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    "tc_zufang.Proxy_Middleware.ProxyMiddleware":100,
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'tc_zufang.timeout_middleware.Timeout_Middleware':610,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 300,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 400,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': None,
    'tc_zufang.rotate_useragent_dowmloadmiddleware.RotateUserAgentMiddleware':400,
    'tc_zufang.redirect_middleware.Redirect_Middleware':500,
}
#use scrapy-redis components so multiple spiders can run distributed
#configure the log storage directory
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1' # can also be changed to localhost if appropriate
REDIS_PORT = '6379'
#LOG_FILE = "logs/scrapy.log"

A few points worth calling out:
1) cookies are disabled
2) a download delay is set
3) a request timeout is set
4) Any middleware you write yourself must be declared in the settings file, referenced by its Python import path; the smaller the number, the higher its priority:
"tc_zufang.Proxy_Middleware.ProxyMiddleware": 100,
'tc_zufang.timeout_middleware.Timeout_Middleware': 610,
'tc_zufang.rotate_useragent_dowmloadmiddleware.RotateUserAgentMiddleware': 400,
'tc_zufang.redirect_middleware.Redirect_Middleware': 500,

6. I will not analyze the page-scraping spider code itself.

VI. Running the Distributed Crawl

If you have read up on Redis, or my earlier posts, you will know that all that is left is to point the project at your Redis and MongoDB instances and then start the components in order.
 
 
2018.9.2
