Downloader Middleware Demo

Core Methods

Each Downloader Middleware is a class that defines one or more methods. The three core methods are:

  1. process_request(request, spider)
  2. process_response(request, response, spider)
  3. process_exception(request, exception, spider)

process_request

process_request() is called before the Scrapy engine dispatches a Request to the Downloader. In other words, at any point between a Request being scheduled out of the queue and its download being executed by the Downloader, we can use process_request() to process that Request. The method must return one of None, a Response object, or a Request object, or raise an IgnoreRequest exception.

process_request() takes the following two arguments:

  1. request, a Request object: the Request being processed.
  2. spider, a Spider object: the Spider this Request belongs to.
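As an illustration of this return-value contract, here is a minimal hypothetical sketch (the class name, proxy addresses, and meta key are illustrative assumptions, not from the original): the middleware mutates the Request in place and returns None, which lets the Request continue through the remaining middlewares to the Downloader.

```python
import random


class ProxyMiddleware:
    """Hypothetical sketch: attach a random proxy to every Request.

    Returning None means "keep going": the Request proceeds through the
    other middlewares' process_request() methods and is then downloaded.
    """
    # Placeholder proxy addresses for illustration only.
    PROXIES = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']

    def process_request(self, request, spider):
        # Mutate the Request in place, then return None so that
        # downloading proceeds normally.
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None
```

Returning a Response instead would short-circuit the download entirely (useful for caching), and returning a different Request would put that new Request back into the scheduling queue.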

process_response

After the Downloader executes a Request, it produces the corresponding Response, which the Scrapy engine then sends to the Spider for parsing. Before it is sent, we can use process_response() to process the Response. The method must return either a Request object or a Response object, or raise an IgnoreRequest exception.

process_response() takes the following three arguments:

  1. request, a Request object: the Request that produced this Response.
  2. response, a Response object: the Response being processed.
  3. spider, a Spider object: the Spider this Response belongs to.
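To make the two possible return values concrete, here is a hypothetical sketch (the class name and retry policy are assumptions, not from the original): returning a Request re-schedules it for download, while returning the Response passes it on toward the Spider.

```python
class RetryOnServerErrorMiddleware:
    """Hypothetical sketch: re-schedule Requests that got a 5xx Response.

    Returning a Request from process_response() sends it back through
    the engine to be downloaded again; returning the Response lets it
    continue toward the Spider for parsing.
    """

    def process_response(self, request, response, spider):
        if 500 <= response.status < 600:
            retry = request.copy()    # a fresh copy of the failed Request
            retry.dont_filter = True  # bypass the duplicate filter on retry
            return retry
        return response               # pass healthy Responses through
```

In a real project Scrapy's built-in RetryMiddleware already covers this case; the sketch only demonstrates the return semantics.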

process_exception

process_exception() is called when the Downloader or a process_request() method raises an exception, for example an IgnoreRequest exception. The method must return one of None, a Response object, or a Request object.

process_exception() takes the following three arguments:

  1. request, a Request object: the Request that raised the exception.
  2. exception, an Exception object: the exception that was raised.
  3. spider, a Spider object: the Spider this Request belongs to.
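The three possible outcomes can be sketched with a hypothetical middleware that converts a timeout into one extra retry (the class name, retry limit, and meta key are assumptions, not from the original; a real Scrapy project would typically match twisted.internet.error.TimeoutError rather than the built-in TimeoutError used here for simplicity):

```python
class TimeoutRetryMiddleware:
    """Hypothetical sketch of process_exception() return semantics.

    Returning a Request re-schedules it for download; returning a
    Response feeds it straight into the process_response() chain as if
    the download had succeeded; returning None passes the exception on
    to the next middleware's process_exception().
    """
    MAX_RETRIES = 1  # assumed retry budget for this sketch

    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            retries = request.meta.get('timeout_retries', 0)
            if retries < self.MAX_RETRIES:
                retry = request.copy()
                retry.meta['timeout_retries'] = retries + 1
                return retry  # re-schedule the Request once
        return None  # let other middlewares (or the errback) handle it
```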

Basic Usage

Use process_request() to set a random User-Agent, and process_response() to modify the status code of the Response.

Httpbin.py

import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'Httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.logger.debug(response.text)
        self.logger.debug('Status Code: ' + str(response.status))

middlewares.py

import random


class RandomUserAgentMiddleware(object):
    def __init__(self):
        self.user_agents = [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.1 Safari/605.1.15",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

    def process_response(self, request, response, spider):
        response.status = 201
        return response

settings.py

DOWNLOADER_MIDDLEWARES = {
    'httpbin.middlewares.RandomUserAgentMiddleware': 543,
}
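A note on the priority number: in Scrapy's default DOWNLOADER_MIDDLEWARES_BASE, the built-in UserAgentMiddleware sits at priority 400, and for process_request() middlewares run in ascending priority order, so our middleware at 543 runs afterward and its random User-Agent overwrites the default one. If you prefer, the built-in middleware can also be disabled outright by mapping it to None (this alternative settings fragment is a suggestion, not part of the original example):

```python
# settings.py (optional): disable Scrapy's built-in User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    'httpbin.middlewares.RandomUserAgentMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```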

Result

{
    "args": {},
    "headers": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en",
        "Connection": "close",
        "Host": "httpbin.org",
        # "User-Agent": "Scrapy/1.5.0 (+https://scrapy.org)"
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
    },
    "origin": "218.82.103.201",
    "url": "http://httpbin.org/get"
}

Status Code: 201