
Scraping All of Dangdang's Book Listings with Scrapy and Storing Them in MongoDB

Robin

Abstract: There are plenty of ready-made examples online, but today I'm scraping Dangdang's book data myself. Using Dangdang's category page as the entry point, the spider walks the first-level and then the second-level categories and scrapes the books under each one, so that all books are covered; the data is stored in MongoDB.

First, create the Scrapy project:

scrapy startproject dangdang


The generated project layout looks like this:

dangdang
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │  
    └─spiders
            dangdang.py
            __init__.py
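
As an aside, the spider file under spiders/ can be written by hand (as here, named dangdang.py) or generated with Scrapy's genspider command. For example, the following would create spiders/dangdangspider.py with a skeleton class; either filename works, since Scrapy locates spiders by their name attribute rather than the file name:

cd dangdang
scrapy genspider dangdangspider dangdang.com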


Configure settings.py:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
DOWNLOAD_DELAY = 1  # wait 1 second between requests

ITEM_PIPELINES = {
   'dangdang.pipelines.DangdangPipeline': 300,
}

# MongoDB connection settings
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = "dangdang"
MONGODB_DOCNAME = "books"
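
Since the spider below attaches a browser User-Agent to each request by hand, an alternative (not used in this post) is to set it once globally in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'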


Next, define the fields to scrape in items.py:

import scrapy

class DangdangItem(scrapy.Item):
    _id = scrapy.Field()
    title = scrapy.Field()
    comments = scrapy.Field()
    time = scrapy.Field()
    price = scrapy.Field()
    discount = scrapy.Field()
    category_one = scrapy.Field()
    category_two = scrapy.Field()

Here we scrape the title, comment count, publication date, price, discount, first-level category, and second-level category.
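
A quick aside: a populated DangdangItem behaves like a dict, which is exactly what the pipeline below relies on when converting it for MongoDB. A minimal sketch with made-up values (for illustration only, assuming it is run from the project root so the dangdang package is importable):

from dangdang.items import DangdangItem

item = DangdangItem()
item['title'] = 'Some Book'         # made-up values, purely illustrative
item['price'] = '¥59.30'
item['category_one'] = 'Children'
print(dict(item))                   # {'title': 'Some Book', 'price': '¥59.30', 'category_one': 'Children'}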


Next, write the pipeline in pipelines.py:

import pymongo
from scrapy.conf import settings  # note: scrapy.conf is deprecated in newer Scrapy releases (see the sketch below)
from .items import DangdangItem

class DangdangPipeline(object):
    def __init__(self):
        # connect to MongoDB using the values defined in settings.py
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        tdb = client[db_name]
        self.post = tdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        if isinstance(item, DangdangItem):
            try:
                book_info = dict(item)  # convert the Item to a plain dict for insertion
                if self.post.insert(book_info):
                    print('Successful!')
            except Exception:
                pass
        return item

__init__ sets up the connection to MongoDB; process_item checks that the item is a DangdangItem and saves it to the database.
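
Note that from scrapy.conf import settings and pymongo's insert() are deprecated in newer releases. On a current Scrapy/pymongo stack, roughly the same pipeline could be written with the from_crawler hook and insert_one() instead; this is just a sketch, assuming the same MONGODB_* settings as above:

import pymongo
from .items import DangdangItem


class DangdangPipeline(object):
    def __init__(self, host, port, db_name, doc_name):
        self.client = pymongo.MongoClient(host=host, port=port)
        self.post = self.client[db_name][doc_name]

    @classmethod
    def from_crawler(cls, crawler):
        # read the MONGODB_* values defined in settings.py
        s = crawler.settings
        return cls(s.get('MONGODB_HOST'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DBNAME'), s.get('MONGODB_DOCNAME'))

    def process_item(self, item, spider):
        if isinstance(item, DangdangItem):
            self.post.insert_one(dict(item))  # insert_one() replaces the deprecated insert()
        return item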


Now for the main spider. The overall approach: start from the category index page, collect the links to all first-level categories, follow each one to collect its second-level category links, then crawl those listing pages and extract the fields we need. Here is the code:

import scrapy
import requests
from scrapy import Selector
from lxml import etree
from ..items import DangdangItem


class DangDangSpider(scrapy.Spider):
    name = 'dangdangspider'
    redis_key = 'dangdangspider:urls'  # not used by this spider
    allowed_domains = ['dangdang.com']
    start_urls = 'http://category.dangdang.com/cp01.00.00.00.00.00.html'  # single entry URL, consumed in start_requests


    def start_requests(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'
        headers = {'User-Agent': user_agent}
        yield scrapy.Request(url=self.start_urls, headers=headers, method='GET', callback=self.parse)


    def parse(self, response):
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'
        headers = {'User-Agent': user_agent}
        lists = response.body.decode('gbk')  # page body decoded from GBK (not used further below)
        # each span under the navigation block is one first-level category
        goodslists = response.selector.xpath('//*[@id="navigation"]/ul/li[1]/div[2]/div[1]/div/span')
        for goods in goodslists:
            try:
                category_big = goods.xpath('a/@title').extract()[0]  # first-level category name
                category_big_id = goods.xpath('a/@href').extract()[0].split('.')[1]  # first-level category id
                category_big_url = "http://category.dangdang.com/pg1-cp01.{}.00.00.00.00.html".format(str(category_big_id))
                yield scrapy.Request(url=category_big_url, headers=headers, callback=self.second_parse, meta={"ID1": category_big_id, "ID2": category_big})
            except Exception:
                pass

    def second_parse(self, response):
        '''
        ID1: first-level category id   ID2: first-level category name
        ID3: second-level category id  ID4: second-level category name
        '''
        url = 'http://category.dangdang.com/pg1-cp01.{}.00.00.00.00.html'.format(response.meta["ID1"])
        # fetch the first-level category page directly with requests and parse it with lxml
        category_small_content = requests.get(url).content.decode('gbk')
        contents = etree.HTML(category_small_content)
        goodslist = contents.xpath('//*[@id="navigation"]/ul/li[1]/div[2]/div[1]/div/span')
        for goods in goodslist:
            try:
                category_small_name = goods.xpath('a/text()').pop().replace(" ", "").split('(')[0]  # second-level category name
                category_small_id = goods.xpath('a/@href').pop().split('.')[2]  # second-level category id
                category_small_url = "http://category.dangdang.com/pg1-cp01.{}.{}.00.00.00.html".format(str(response.meta["ID1"]), str(category_small_id))
                yield scrapy.Request(url=category_small_url, callback=self.detail_parse, \
                                     meta={"ID1": response.meta["ID1"], "ID2": response.meta["ID2"], \
                                           "ID3": category_small_id, "ID4": category_small_name})
            except Exception:
                pass

    def detail_parse(self, response):
        # walk through up to 100 listing pages for this second-level category
        for i in range(1, 101):
            url = 'http://category.dangdang.com/pg{}-cp01.{}.{}.00.00.00.html'.format(str(i), response.meta["ID1"], response.meta["ID3"])
            try:
                contents = etree.HTML(requests.get(url).content.decode('gbk'))
                goodslist = contents.xpath('//ul[@class="bigimg"]/li')  # one <li> per book
                for goods in goodslist:
                    item = DangdangItem()
                    try:
                        item['title'] = goods.xpath('p[1]/a/text()').pop()
                        item['comments'] = goods.xpath('p[5]/a/text()').pop()
                        item['price'] = goods.xpath('p[3]/span[1]/text()').pop()
                        item['discount'] = goods.xpath('p[3]/span[3]/text()').pop().replace('\xa0(', '').replace(')', '')
                        item['time'] = goods.xpath('p[6]/span[2]/text()').pop().replace("/", "")
                        item['category_one'] = response.meta["ID2"]
                        item['category_two'] = response.meta["ID4"]
                    except Exception:
                        pass
                    yield item
            except Exception:
                pass
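
With everything in place, the spider is started from the project root with:

scrapy crawl dangdangspider

Once it has collected some data, the documents can be inspected with a few lines of pymongo (database and collection names as configured in settings.py above):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
books = client['dangdang']['books']
print(books.count_documents({}))   # number of books stored so far (use count() on older pymongo)
print(books.find_one())            # peek at one stored document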


After the crawler had been running for half an hour:

[screenshots of the scraped results]


And with that, the job is done!


This is an original article; please credit the source when reposting.
