Importing Qidian (起点) Novel Data into MongoDB with Scrapy
A post collected and organized by 生活随笔 (Life Essays), shared here for reference.
This post walks through using Scrapy to scrape data and store it in MongoDB. The target data: every work listed on Qidian, at https://www.qidian.com/all.
For downloading and installing Scrapy, see the earlier post "Scrapy: A Simple Example"; for downloading and installing MongoDB, see "Installing MongoDB".
The relevant code follows.
(1) Item definition: items.py
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class NovelItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    link = scrapy.Field()      # URL of the book's detail page
    category = scrapy.Field()  # genre/category
    bookname = scrapy.Field()  # book title
    author = scrapy.Field()
    content = scrapy.Field()   # short introduction text
```

(2) The spider
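The spider below does two things: it builds the paginated start URLs and pulls each book's fields out of the listing with XPath. As a stand-alone illustration of that logic, here is a sketch using only the standard library, run against a hand-written fragment that mimics the Qidian list markup (an assumption for illustration; the real page structure may differ, and Scrapy's own selectors are more forgiving than `xml.etree`):

```python
import xml.etree.ElementTree as ET

# Hand-written sample mimicking the qidian.com/all list markup (assumption
# for illustration only -- the live page may differ).
SAMPLE = """
<ul class="all-img-list cf">
  <li>
    <div class="book-mid-info">
      <h4><a href="//book.qidian.com/info/1">示例小说</a></h4>
      <p class="author">
        <a class="name">某作者</a>
        <a>玄幻</a>
      </p>
      <p class="intro">简介文字</p>
    </div>
  </li>
</ul>
"""

# Pagination: range(1, 6) yields pages 1..5, matching the 5 listing pages.
base = ("https://www.qidian.com/all?orderId=&style=1&pageSize=20"
        "&siteid=1&pubflag=0&hiddenField=0&page=")
start_urls = [base + str(x) for x in range(1, 6)]

# Field extraction: the same XPath shape the spider uses.
root = ET.fromstring(SAMPLE)
for li in root.findall("li"):
    part = li.find('div[@class="book-mid-info"]')
    item = {
        "bookname": part.find("h4/a").text,
        "link": part.find("h4/a").get("href"),
        "author": part.find('p[@class="author"]/a[@class="name"]').text,
        "category": part.findall('p[@class="author"]/a')[1].text,
        "content": part.find('p[@class="intro"]').text,
    }
    print(item)
```

Once the project is in place, the spider is launched with `scrapy crawl solve` (matching its `name` attribute).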
```python
# -*- coding: utf-8 -*-
import scrapy

from novel.items import NovelItem


class SolveSpider(scrapy.Spider):
    name = "solve"
    allowed_domains = ["qidian.com"]
    # The listing has 5 pages; range(1, 6) yields pages 1..5
    start_urls = [
        "https://www.qidian.com/all?orderId=&style=1&pageSize=20"
        "&siteid=1&pubflag=0&hiddenField=0&page=" + str(x)
        for x in range(1, 6)
    ]

    def parse(self, response):
        novels = response.xpath('//ul[@class="all-img-list cf"]/li')
        for each in novels:
            item = NovelItem()
            part = each.xpath('./div[@class="book-mid-info"]')
            item['bookname'] = part.xpath('./h4/a/text()').extract()[0]
            item['link'] = part.xpath('./h4/a/@href').extract()[0]
            item['author'] = part.xpath('./p[@class="author"]/a[@class="name"]/text()').extract()[0]
            item['category'] = part.xpath('./p[@class="author"]/a/text()').extract()[1]
            item['content'] = part.xpath('./p[@class="intro"]/text()').extract()[0]
            yield item
```

(3) The pipeline: pipelines.py
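The heart of the pipeline below is `process_item`, which converts the Scrapy item to a plain dict and inserts it into the collection. A server-free sketch of that conversion, using a stand-in collection in place of a live MongoDB (the `FakeCollection` class and sample values are assumptions for illustration; real pymongo adds an `ObjectId` as `_id`):

```python
class FakeCollection:
    """Stand-in for pymongo.collection.Collection (no server needed)."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        # Mimic pymongo: the stored document gains an _id field.
        doc = dict(doc)
        doc.setdefault("_id", len(self.docs) + 1)
        self.docs.append(doc)
        return doc


def process_item(collection, item):
    # dict(item) is how the pipeline turns a Scrapy item into a BSON-able dict;
    # it also shields the item from pymongo mutating the inserted document.
    collection.insert_one(dict(item))
    return item


col = FakeCollection()
item = {"bookname": "示例小说", "author": "某作者",
        "link": "//book.qidian.com/info/1", "category": "玄幻", "content": "简介"}
process_item(col, item)
print(col.docs[0]["bookname"])  # -> 示例小说
```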
```python
import pymongo


class MongoDBPipeline(object):
    collection_name = 'novel'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection parameters from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert
        self.collection.insert_one(dict(item))
        print("inserted one document")
        return item
```

(4) Configuration: settings.py
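The settings below supply `MONGO_URI` and `MONGO_DB`, which the pipeline's `from_crawler` classmethod reads at startup. A Scrapy-free sketch of that wiring (the `FakeCrawler` class is a stand-in for the real `Crawler` object, an assumption for illustration; a plain dict mimics `crawler.settings`, whose `.get()` the pipeline relies on):

```python
class FakeCrawler:
    """Stand-in for scrapy.crawler.Crawler (illustration only)."""
    def __init__(self, settings):
        self.settings = settings  # a dict has .get(), like Scrapy's Settings


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Same pattern as the pipeline above: read connection
        # parameters from the crawler's settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )


crawler = FakeCrawler({'MONGO_URI': '192.168.177.13', 'MONGO_DB': 'novels'})
pipe = MongoDBPipeline.from_crawler(crawler)
print(pipe.mongo_uri, pipe.mongo_db)  # -> 192.168.177.13 novels
```

Scrapy calls `from_crawler` itself when it instantiates each enabled pipeline, so the pipeline never hard-codes the connection details.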
```python
BOT_NAME = 'novel'

SPIDER_MODULES = ['novel.spiders']
NEWSPIDER_MODULE = 'novel.spiders'

# The class name must match the pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'novel.pipelines.MongoDBPipeline': 100,
}

MONGO_URI = "192.168.177.13"
MONGO_DB = "novels"
MONGO_COLLECTION = "novel"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'novel (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Route https through the plain HTTP handler so SSL certificates are not verified
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
}
```

(5) Query results
Summary
That is the complete walkthrough of scraping Qidian with Scrapy and importing the data into MongoDB; hopefully it helps you solve similar problems.