11_Jianshu Business Analysis
Table of Contents
- Jianshu structure analysis
- Creating the Jianshu scraper project
- Creating a crawl-template spider
- Configuring the Jianshu download format
Jianshu Structure Analysis
Creating the Jianshu Scraper Project
```shell
C:\Users\Administrator\Desktop>scrapy startproject jianshu
New Scrapy project 'jianshu', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\jianshu

You can start your first spider with:
    cd jianshu
    scrapy genspider example example.com
```
Creating a Crawl-Template Spider
The spiders created so far all used the basic template. This crawler needs to download Jianshu articles and match their URLs with regular expressions, so the crawl template is the better choice for generating the spider.
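Before wiring the pattern into a crawl rule, it can be sanity-checked against sample URLs with Python's `re` module. This is a minimal sketch; the 12-character `[0-9a-z]` article-ID format is an assumption based on the sample article links shown further below.

```python
import re

# Same pattern later passed to LinkExtractor(allow=...) in the spider
# (assumed: Jianshu article IDs are 12 lowercase hex/alphanumeric chars).
ARTICLE_RE = re.compile(r'https://www\.jianshu\.com/p/[0-9a-z]{12}.*')

print(bool(ARTICLE_RE.match('https://www.jianshu.com/p/df7cad4eb8d8')))       # True: article page
print(bool(ARTICLE_RE.match('https://www.jianshu.com/p/07b0456cbadb?x=1')))   # True: query string allowed by .*
print(bool(ARTICLE_RE.match('https://www.jianshu.com/u/someuser')))           # False: user page, not /p/
```

If the pattern misfires here, it will misfire inside the spider's `Rule` as well, so this is a cheap way to debug the regex in isolation.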
```shell
C:\Users\Administrator\Desktop>cd jianshu

C:\Users\Administrator\Desktop\jianshu>scrapy genspider -t crawl jianshu_spider jianshu.com
Created spider 'jianshu_spider' using template 'crawl' in module:
    jianshu.spiders.jianshu_spider
```
Configuring the Jianshu Download Format
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JianshuSpiderSpider(CrawlSpider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    # Crawl rules can be specified with regular expression support, e.g.:
    # https://www.jianshu.com/p/df7cad4eb8d8
    # https://www.jianshu.com/p/07b0456cbadb?*****
    # https://www.jianshu.com/p/.*
    rules = (
        Rule(LinkExtractor(allow=r'https://www.jianshu.com/p/[0-9a-z]{12}.*'),
             callback='parse_item', follow=True),
    )

    # name = title = url = collection = scrapy.Field()

    def parse_item(self, response):
        print(response.text)
```
Summary
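The `parse_item` callback above only prints the raw HTML. In a real spider you would pull fields out with `response.xpath()` or `response.css()`; the extraction idea can be illustrated with the standard library's `html.parser`, assuming (hypothetically) that the article title sits in the page's first `<h1>` tag — real Jianshu markup may differ.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Capture the text of the first <h1> tag (hypothetical title markup)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Start capturing only for the first <h1> encountered
        if tag == 'h1' and self.title is None:
            self.in_h1 = True

    def handle_data(self, data):
        if self.in_h1:
            self.title = data.strip()
            self.in_h1 = False


sample = '<html><body><h1>Hello Scrapy</h1><p>body text...</p></body></html>'
parser = TitleParser()
parser.feed(sample)
print(parser.title)  # Hello Scrapy
```

Inside the spider itself the equivalent would be a one-liner such as `response.xpath('//h1/text()').get()`, stored into a `scrapy.Item` with the fields hinted at in the commented-out line above (`title`, `url`, etc.).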