Scrapy + Chromium + Proxy + Selenium

Pocher, published 2019-07-30 17:37 / 2397 reads

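Summary: The usual solution is to capture the traffic, inspect the request, and then capture the message that the AJAX call returns. To reduce the trouble caused by setting up the environment, Docker is used. Because Chromium has replaced Scrapy's own requests, the proxy has to be handled by Chromium directly. From the Scrapy proxy code, it is not hard to guess how Chromium handles a proxy.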

Last week covered the basics of Scrapy. This week is about the pitfalls I hit with proxies and JS rendering.

JS rendering

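JS is one of the more troublesome parts of writing a crawler. The usual solution is to capture the traffic, inspect the request, and then capture the message that the AJAX call returns.
However, when the JS rendering is particularly complex, this approach becomes very tedious. So we use the selenium package to drive Chromium and let it do the JS rendering.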

Installation

Install selenium

Install chromium

Install chromium-chromedriver

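Tip: why Chromium rather than Chrome? I originally installed Chrome. But Chrome also needs chromedriver, and the package managers of many Linux distributions ship neither a Chrome package nor a chromedriver package. If you fetch them yourself, it is easy to end up with a chromedriver version that does not match Chrome, and then nothing works.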

To reduce the trouble caused by setting up the environment, we use Docker here.
Dockerfile

FROM alpine:3.8
COPY requirements.txt /tmp
RUN apk update \
    && apk add --no-cache xvfb python3 python3-dev curl libxml2-dev libxslt-dev libffi-dev gcc musl-dev \
    && apk add --no-cache libgcc openssl-dev chromium=68.0.3440.75-r0 libexif udev chromium-chromedriver=68.0.3440.75-r0 \
    && curl https://bootstrap.pypa.io/get-pip.py | python3 \
    && adduser -g chromegroup -D chrome \
    && pip3 install -r /tmp/requirements.txt && rm /tmp/requirements.txt
USER chrome

Tip: there is another pitfall here. By default neither Chrome nor Chromium will run as root, and doing so is unsafe anyway, so it is best to create a dedicated user to run it. When using Docker, add the --privileged flag to docker run.

If you need to run Chrome as root, see this blog post:
"Installing the Chrome browser on Ubuntu 16.04 and fixing the problem that root cannot open it"

requirements.txt

Scrapy
selenium
Twisted
PyMysql
pyvirtualdisplay

Put requirements.txt in the same directory as the Dockerfile,
then run docker build -t "chromium-scrapy-image" . in that directory.

As for why xvfb and pyvirtualdisplay are installed: Chromium in headless mode cannot handle a proxy that requires a username and password. More on that shortly.

On Red Hat and Debian, look in the package repository for the latest chromium and the matching chromedriver and install them. The versions must match! Here we use chromium=68.0.3440.75-r0 and chromium-chromedriver=68.0.3440.75-r0.
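The version-match requirement can be checked mechanically. A minimal sketch, assuming the pins are written in apk's name=version form as in the Dockerfile; the helper name pinned_versions is hypothetical and not part of any tool:

```python
import re

# Matches apk-style pins such as "chromium=68.0.3440.75-r0".
PIN_PATTERN = re.compile(r"(?<![\w-])(chromium(?:-chromedriver)?)=(\S+)")


def pinned_versions(dockerfile_text):
    """Return {package: version} for the chromium pins found in the text."""
    return dict(PIN_PATTERN.findall(dockerfile_text))


if __name__ == "__main__":
    line = ("apk add --no-cache chromium=68.0.3440.75-r0 "
            "chromium-chromedriver=68.0.3440.75-r0")
    pins = pinned_versions(line)
    # The browser and the driver must be pinned to the same version.
    assert pins["chromium"] == pins["chromium-chromedriver"], pins
    print(pins)
```

Running a check like this before docker build would catch a pin that was bumped for only one of the two packages.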

Modifying Scrapy's middleware

With Chromium in place, we modify middlewares.py. The idea is to let Chromium replace Scrapy's own request handling, so we change the DownloaderMiddleware.

# DownloaderMiddleware
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class DemoDownloaderMiddleware(object):
    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        # Enable headless mode
        chrome_options.add_argument("--headless")
        # Disable the GPU
        chrome_options.add_argument("--disable-gpu")
        # Do not load images
        chrome_options.add_argument("--blink-settings=imagesEnabled=false")
        # "options=" supersedes the deprecated "chrome_options=" keyword
        self.driver = webdriver.Chrome(options=chrome_options)

    def __del__(self):
        self.driver.quit()

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Chromium handling
        # ...
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            request=request,
                            encoding="utf-8",
                            status=200)
        
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)

Tip: here a single middleware handles the request, which means all the logic passes through this point. That is why it returns the response directly.
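The Chromium handling step inside process_request is elided in the original. One plausible shape for it, as a sketch only: load the URL with the driver's get() and read page_source back. render_page and FakeDriver are hypothetical names; the fake stands in for webdriver.Chrome so that the snippet runs without a browser.

```python
def render_page(driver, url):
    """Load url in the browser and return the rendered HTML.

    driver is expected to behave like a Selenium WebDriver, that is,
    to offer get(url) and a page_source attribute.
    """
    driver.get(url)
    return driver.page_source


class FakeDriver:
    """Stand-in for webdriver.Chrome, used only for this demonstration."""

    def __init__(self):
        self.page_source = ""

    def get(self, url):
        # A real driver would navigate and execute the page's JavaScript.
        self.page_source = "<html><body>rendered %s</body></html>" % url


if __name__ == "__main__":
    print(render_page(FakeDriver(), "http://example.com"))
```

Inside process_request, the returned string would be what is passed as body to HtmlResponse.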

That takes care of installing selenium and Chromium.
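For Scrapy to call the middleware at all, it has to be enabled in the project's settings.py, a step the text does not show. A sketch, assuming the project package is called demo (a guess based on the class name) and using 543, the priority that Scrapy's project template suggests:

```python
# settings.py (fragment)
# "demo" is an assumed package name; adjust it to your project.
DOWNLOADER_MIDDLEWARES = {
    "demo.middlewares.DemoDownloaderMiddleware": 543,
}
```

Because process_request returns a response, Scrapy skips the actual download for every request that reaches this middleware.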

When Chromium does not support headless

If the Chromium version you installed is too old to support headless, don't worry. This is where the xvfb and pyvirtualdisplay we installed earlier come in.

from pyvirtualdisplay import Display
...
>>>
chrome_options.add_argument("--headless")

<<<
# chrome_options.add_argument("--headless")
# Keep the display on self so that __del__ can stop it later
self.display = Display(visible=0, size=(800, 800))
self.display.start()
...

>>>
self.driver.quit()

<<<
self.driver.quit()
self.display.stop()
...

We have now simulated a display, so Chromium can run on our server whether or not headless is enabled.

Proxy

Because Chromium has replaced Scrapy's own requests, the proxy can no longer be handled inside Scrapy.
We need to let Chromium handle the IP proxy directly.

This is how a proxy was set before Chromium was used:

import base64


class DemoProxyMiddleware(object):
    # overwrite process request

    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta["proxy"] = "https://proxy.com:8080"

        # Use the following lines if your proxy requires authentication
        
        proxy_user_pass = "username:password"
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode("utf-8"))

        # setup basic authentication for the proxy
        request.headers["Proxy-Authorization"] = "Basic " + str(encoded_user_pass, encoding="utf-8")

If your IP proxy does not need a username and password, just delete the last three statements.
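To make the authentication lines concrete, here is a stand-alone sketch of how the header value is built. The function name proxy_authorization is hypothetical; the encoding itself is plain HTTP Basic authentication.

```python
import base64


def proxy_authorization(user, password):
    """Build the value of a Basic Proxy-Authorization header."""
    credentials = "%s:%s" % (user, password)
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    print(proxy_authorization("username", "password"))
    # prints: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

Note that Base64 is an encoding, not encryption: anyone who sees the header can recover the credentials.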

From the code above, it is not hard to guess how Chromium handles a proxy.

chrome_options.add_argument("--proxy-server=proxy.com:8080")

Just add one argument.

But what about a proxy that needs a username and password?

Proxies with a username and password in Chromium

First create a .py file:

import string
import zipfile


def create_proxyauth_extension(proxy_host, proxy_port,
                               proxy_username, proxy_password,
                               scheme="http", plugin_path=None):
    """Proxy authentication extension.

    args:
        proxy_host (str): address or domain name of your proxy
        proxy_port (int): proxy port number
        proxy_username (str): user name
        proxy_password (str): password
    kwargs:
        scheme (str): proxy scheme, "http" by default
        plugin_path (str): absolute path of the extension

    return str -> plugin_path
    """

    if plugin_path is None:
        plugin_path = "vimm_chrome_proxyauth_plugin.zip"

    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """

    background_js = string.Template(
        """
        var config = {
                mode: "fixed_servers",
                rules: {
                  singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                  },
                  bypassList: ["foobar.com"]
                }
              };
    
        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
    
        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }
    
        chrome.webRequest.onAuthRequired.addListener(
                    callbackFn,
                    {urls: ["<all_urls>"]},
                    ["blocking"]
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )
    with zipfile.ZipFile(plugin_path, "w") as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path
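The background script is produced with string.Template, which replaces each ${name} placeholder with the keyword argument of the same name. A reduced sketch of that step, using a one-line template rather than the full script; the host and port values are made up for illustration:

```python
import string

template = string.Template('host: "${host}", port: parseInt(${port})')

# substitute() raises KeyError if a placeholder is left unfilled,
# which catches a forgotten argument early.
snippet = template.substitute(host="proxy.com", port=8080)
print(snippet)
# prints: host: "proxy.com", port: parseInt(8080)
```

The values are inserted verbatim, so a username or password containing a double quote or a backslash would produce broken JavaScript unless it is escaped first.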

Usage

    proxyauth_plugin_path = create_proxyauth_extension(
        proxy_host="host",
        proxy_port=port,
        proxy_username="user",
        proxy_password="pwd")
    chrome_options.add_extension(proxyauth_plugin_path)

That completes the proxy setup for Chromium. However, if headless mode is enabled, this method reports an error, so the fix is to turn headless mode off.
As for running Chromium without a GUI: as mentioned earlier, xvfb and pyvirtualdisplay take care of that.
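The resulting rule of thumb can be written down as a small helper. This is a sketch under the assumptions of this article, not an API of selenium: chrome_arguments is a hypothetical name, and --proxy-server is Chromium's command-line switch for an unauthenticated proxy.

```python
def chrome_arguments(proxy=None, proxy_needs_auth=False):
    """Return the Chromium command-line flags for a given proxy setup."""
    args = ["--disable-gpu", "--blink-settings=imagesEnabled=false"]
    if proxy_needs_auth:
        # The auth extension fails in headless mode, so run headed
        # (inside a virtual display) and let the extension set the proxy.
        return args
    args.append("--headless")
    if proxy:
        args.append("--proxy-server=%s" % proxy)
    return args


if __name__ == "__main__":
    print(chrome_arguments(proxy="proxy.com:8080"))
    print(chrome_arguments(proxy_needs_auth=True))
```

Each returned string would be passed to chrome_options.add_argument; in the authenticated case the extension is added with add_extension and the virtual display is started before the driver is created.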


When reposting, please cite the address of this article: http://hztianpu.com/yun/42310.html
