Simulated login with Scrapy


Logging in is mostly a matter of POSTing the right data to the right endpoint, which you have to find first.
For example, GitHub, like most sites, embeds a token such as authenticity_token in a hidden input field; you can find it by viewing the page source.
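To make that concrete, here is a minimal sketch of extracting such a hidden token with parsel (the selector library Scrapy uses underneath); the HTML string is a simplified stand-in for GitHub's real login form, not its actual markup:

from parsel import Selector

# A hidden token field as it typically appears in the page source.
html = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
sel = Selector(text=html)
# Selecting by name keeps working even if the page layout changes.
token = sel.xpath('//input[@name="authenticity_token"]/@value').get()
print(token)  # -> abc123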
Two settings in settings.py are worth mentioning (see the sketch below):
#COOKIES_ENABLED = False     cookie persistence: leave this commented out (cookies enabled, the default) so the login session carries across requests
#DOWNLOAD_DELAY = 3          delay between requests
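Put concretely, the relevant excerpt of settings.py for a login crawl might look like this (a sketch; the delay value is illustrative):

# settings.py (excerpt)
# COOKIES_ENABLED defaults to True; leave it enabled so the session cookie
# received at login is sent automatically on every later request.
COOKIES_ENABLED = True
# Pause between requests so the site is not hammered while testing.
DOWNLOAD_DELAY = 3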
There are two approaches: cookie login and form login. One example of each follows.

1. GitHub form login

Although a Referrer Policy warning shows up, the login does go through. (Referrer Policy is a cross-site protection, related to the same-origin policy.)
I also tried cookie login here, but it kept timing out, maybe a network issue. (GitHub is slow for me with or without a proxy...)

import scrapy

class GhSpider(scrapy.Spider):
    name = 'gh'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/']
    # Headers captured from a real browser session. Note the cookie value
    # must not repeat the "Cookie: " header name (the original did).
    post_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
        "Referer": "https://github.com/login",
        "Cookie": "logged_in=no; _octo=GH1.1.921088465.1533352339; _ga=GA1.2.58955335.1533352353; _gh_sess=WnNrTDlBVUZ4Nm4rbjh1dXFCNUdMdEJWcnluNmNlZS9DNEdhM0h2VTZQVldaVjdoR0F2SENtbFcxOU9KSDhubmpWSENEMnRxUHpmeDFJdlNEcktFOWVXRkhOU3grMmlVR0xtcVZhN3VJQnFCVDAzbjVOemRDMDZxcUlWYnZJQ0lWeTRLcjdzbzFubGtBK1p5aTNYQ0xLdUt6UEVXUDRKVUFyMVJmdUs5ZFNhcTdzSnpDa092alVvRG5oK0VMdGFvZk1FNTEvUU51L3lJVkJaeFRseWViQ2xFSmdZR2hQTnJWWmdVSUc3dWJsK1VheE9wK1IzbmJ4NzVJOHBHZjVaOTQ5cVNxV2E4K2Q4TW10Rjh4QUNaTVhUZFpaYVdTRmI2aDhpT0l0YUwrSzVwNVVoZGVPTWhHVm9kK1RoZk5KRlBEcnFCWTlyM0dnL1dFN3EwV09NMkFmeDQwNGh6S1I5NkgrK1JxZHBDQmRZUEFleW9mdE85cVl5L2NGUGdtOHFRb1VpdTZsWmNIN0xWd2wvUWlOL1ZuMmk2bS9yeU5mS0lIR2FkYS9ISjRMQT0tLXk5c0xNNEUyWlhlcTJPTmp6Q0FFV2c9PQ%3D%3D--a3d81cd424e5769fefd36631fba68b9c65b26b3b; tz=Asia%2FShanghai; _gid=GA1.2.519795943.1533875679; has_recent_activity=1; _gat=1"
    }

    def start_requests(self):
        # Fetch the login page first so the hidden token can be read from it.
        yield scrapy.Request("https://github.com/login", callback=self.post_login)

    def post_login(self, response):
        # GitHub puts authenticity_token in a hidden <input>; selecting it by
        # name is far more robust than a positional XPath.
        auto_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        yield scrapy.FormRequest(
            url="https://github.com/session",
            headers=self.post_headers,
            formdata={
                "utf8": "\u2713",  # the literal check mark: FormRequest URL-encodes it, so passing "%E2%9C%93" here would double-encode
                "login": "1543460413@qq.com",
                "password": "helloworld1999",
                "authenticity_token": auto_token,
            },
            callback=self.after_post,
            dont_filter=True,
        )

    def after_post(self, response):
        # This element only exists on the page when the session is signed in.
        res = response.xpath("/html/body/div[4]/div[1]/div[1]/div/div/a[1]").extract_first()
        if res:
            print("ok, logged in successfully!")
            print(res)
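Run it with scrapy crawl gh. As an aside, Scrapy also ships FormRequest.from_response(), which copies every field of the page's form, hidden inputs like authenticity_token included, so the manual token extraction above can be skipped. A minimal sketch of that variant (the spider name and credentials are placeholders, not from the original post):

import scrapy

class GhFromResponseSpider(scrapy.Spider):
    name = 'gh_fr'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response() pre-fills all fields of the matched form, including
        # authenticity_token; we only override the credentials.
        yield scrapy.FormRequest.from_response(
            response,
            formxpath='//form[contains(@action, "session")]',  # pick the login form, not e.g. the search box
            formdata={'login': 'you@example.com', 'password': 'your_password'},
            callback=self.after_post,
        )

    def after_post(self, response):
        self.logger.info('landed on %s after login', response.url)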

2. imooc cookie login

Source: https://blog.csdn.net/topleeyap/article/details/79144326

# coding=utf-8
import scrapy

class LoginByCookie(scrapy.Spider):
    """
    Simulated login, approach one: log in directly with cookies
    (here, logging in to imooc.com).
    """
    name = 'mk'
    allowed_domains = ['www.imooc.com']
    start_urls = []

    def start_requests(self):
        """Override start_requests() to attach cookies captured from a logged-in browser session."""
        home_url = 'https://www.imooc.com/u/2346025'
        login_cookie = {'imooc_uuid': 'c13c8cb7-442a-430e-a2c1-78d91c347b67',
                        'imooc_isnew_ct': '1515076153',
                        'imooc_isnew': '2',
                        'loginstate': '1',
                        'apsid': 'NhMDY2ZDFmODhmYWQ5ZmQ2NDI3ZDg0OTU0NWM3NTQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMjM0NjAyNQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA4MDkwMjE4MjNAcXEuY29tAAAAAAAAAAAAAAAAAAAAAGI3ZmJjOTUxMTU2YjBlOTVlOTIxYzM1ZDk0OTVmOGNhW3FQWltxUFo%3DYm',
                        'PHPSESSID': 'vd48nsltdovbbifsn48pu15763',
                        'IMCDNS': '0',
                        'Hm_lvt_f0cfcccd7b1393990c78efdeebff3968': '1515076155,1515221269,1515746784,1516641134',
                        'Hm_lpvt_f0cfcccd7b1393990c78efdeebff3968': '1516641134',
                        'cvde': '5a661b6d0246d-3'
                        }
        # Nothing to POST here, so a plain GET carrying the cookies is enough
        # (the original used FormRequest, which without formdata is a GET anyway).
        yield scrapy.Request(
            url=home_url, cookies=login_cookie, callback=self.parse_page)

    def parse_page(self, response):
        # If the cookies are still valid, this renders as the logged-in user's page.
        print(response.body.decode('utf-8'))
        print(response.xpath('//title/text()').extract_first())
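The cookie dict above comes straight from the browser: log in manually, copy the Cookie header out of DevTools, and split it into key/value pairs. A small helper sketch for that conversion (the function and the sample string are illustrative, not part of the original code):

def cookie_header_to_dict(header):
    """Turn a raw Cookie header value from DevTools into the dict
    that scrapy.Request(cookies=...) expects."""
    return dict(pair.split('=', 1) for pair in header.split('; ') if '=' in pair)

raw = 'loginstate=1; PHPSESSID=vd48nsltdovbbifsn48pu15763; IMCDNS=0'
print(cookie_header_to_dict(raw))
# -> {'loginstate': '1', 'PHPSESSID': 'vd48nsltdovbbifsn48pu15763', 'IMCDNS': '0'}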

2018.8.10
