Using scrapy_redis
URL deduplication
Define the deduplication rule (it is invoked and applied by the scheduler).

a. Internally, the following settings are used to connect to Redis:

```python
# REDIS_HOST = 'localhost'    # hostname
# REDIS_PORT = 6379           # port
# REDIS_URL = 'redis://user:pass@hostname:9001'  # connection URL (takes precedence over the settings above)
# REDIS_PARAMS = {}           # Redis connection parameters
#                             # default: {'socket_timeout': 30, 'socket_connect_timeout': 30,
#                             #           'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
# REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'  # Python class used for the Redis connection
#                                                      # default: redis.StrictRedis
# REDIS_ENCODING = "utf-8"    # Redis encoding, default: 'utf-8'
```

b. Deduplication is implemented with a Redis set. The set's key is:

```python
key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
# default setting:
# DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
```

c. The dedup rule converts each URL into a unique fingerprint, then checks whether that fingerprint already exists in the Redis set:

```python
from scrapy.utils import request
from scrapy.http import Request

req = Request(url='http://www.cnblogs.com/wupeiqi.html')
result = request.request_fingerprint(req)
print(result)  # 8ea4fd67887449313ccc12e5b6b92510cc53675c
```

Notes:
- URLs that differ only in query-parameter order produce the same fingerprint.
- Request headers are not included in the fingerprint by default; `include_headers` lets you add specific headers.

Example:

```python
from scrapy.utils import request
from scrapy.http import Request

req = Request(url='http://www.baidu.com?name=8&id=1',
              callback=lambda x: print(x),
              cookies={'k1': 'vvvvv'})
result = request.request_fingerprint(req, include_headers=['cookies'])
print(result)

req = Request(url='http://www.baidu.com?id=1&name=8',
              callback=lambda x: print(x),
              cookies={'k1': 666})
result = request.request_fingerprint(req, include_headers=['cookies'])
print(result)
```

To make all spiders share the same duplicates filter through Redis:

```python
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```
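The reason parameter order does not change the fingerprint is that the URL is canonicalized before hashing. Here is a minimal, self-contained sketch of that idea, not Scrapy's actual implementation: the canonicalization below just sorts the query parameters, and `simple_fingerprint` is a hypothetical helper hashing roughly the same ingredients (method, canonical URL, body).

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def canonicalize(url):
    """Sort the query parameters so equivalent URLs compare equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))

def simple_fingerprint(method, url, body=b''):
    """SHA1 over method + canonical URL + body (simplified sketch)."""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(canonicalize(url).encode())
    h.update(body)
    return h.hexdigest()

# Parameter order does not change the fingerprint:
a = simple_fingerprint('GET', 'http://www.baidu.com?name=8&id=1')
b = simple_fingerprint('GET', 'http://www.baidu.com?id=1&name=8')
print(a == b)  # True
```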
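The set-based check itself is simple: scrapy_redis adds the fingerprint to the Redis set with SADD, and a return value of 0 means the member was already there, i.e. the request is a duplicate. The sketch below mirrors that logic so it runs without a Redis server; `FakeRedisSet` and `DupeFilterSketch` are stand-ins for illustration, not part of either library.

```python
import time

# Default key template from the text above
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

class FakeRedisSet:
    """Stand-in for a Redis set: sadd returns 1 if the member was new, else 0."""
    def __init__(self):
        self._data = {}

    def sadd(self, key, member):
        members = self._data.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

class DupeFilterSketch:
    def __init__(self, server):
        self.server = server
        # The key embeds the creation timestamp, as in the default DUPEFILTER_KEY
        self.key = DUPEFILTER_KEY % {'timestamp': int(time.time())}

    def request_seen(self, fingerprint):
        # SADD returns 0 when the member already exists -> duplicate
        return self.server.sadd(self.key, fingerprint) == 0

df = DupeFilterSketch(FakeRedisSet())
print(df.request_seen('8ea4fd67887449313ccc12e5b6b92510cc53675c'))  # False: first sighting
print(df.request_seen('8ea4fd67887449313ccc12e5b6b92510cc53675c'))  # True: duplicate
```

Using SADD both to insert and to test membership makes the check a single atomic round trip, which is why all spiders sharing one Redis set stay consistent.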