Python crawler: scraping NetEase News content with the Scrapy framework
- 1. Requirements [Preliminary setup]
- 2. Analysis and implementation: (1) Get the detail-page URLs of the five sections (2) Parse each section (3) Parse the detail page behind each headline in every section
1. Requirements
Scrape the headlines and body content of NetEase News (网易新闻) articles.
[Preliminary setup]
Create the project: scrapy startproject wangyiPro
Generate the spider file: scrapy genspider wangyi http://www.xxx.com
Configure settings.py:
```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/...'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```
2. Analysis and implementation
(1) Get the detail-page URLs of the five sections
We need the detail-page URLs of five sections on the NetEase News homepage: domestic (国内), international (国际), military (军事), aviation (航空) and drones (无人机). All five entries sit in li tags under the same ul, so first locate them with XPath:
XPath: //*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li
The five target sections are the li elements at indices 3, 4, 6, 7 and 8 of this list (zero-based, matching the code below); each section's URL is in the href attribute of the a tag inside its li.
- Writing the code (a sketch of the parse method is shown below)
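The parse method below uses the same logic that appears in the full wangyi.py in section (3): it collects the five section URLs from the li list and then requests each section page.

```python
def parse(self, response):
    # All section entries live in li tags under the same ul
    li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
    # Zero-based indices of the five target sections in the li list
    alist = [3, 4, 6, 7, 8]
    for index in alist:
        model_url = li_list[index].xpath('./a/@href').extract_first()
        self.models_urls.append(model_url)
    # Request each section page in turn
    for url in self.models_urls:
        yield scrapy.Request(url, callback=self.parse_model)
```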
(2) Parse each section
The five section pages that come back from the downloader contain only static HTML, while the news lists we actually want are loaded dynamically on those pages. We therefore need a downloader middleware, sitting between the engine and the downloader, that intercepts these responses and supplies the dynamically loaded news data instead.
wangyi.py
```python
# The news headlines on each section page are loaded dynamically
def parse_model(self, response):
    # Parse each section page for news titles and detail-page urls
    div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
    for div in div_list:
        title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
        new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
        item = WangyiproItem()
        item['title'] = title
        # Request the news detail page, passing the item along via meta
        yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})
```
The selenium module is added in wangyi.py to handle the dynamically loaded pages.
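A minimal sketch of how selenium is wired into the spider; it mirrors the __init__ and closed methods of the full wangyi.py shown in section (3), and the chromedriver path is the author's local path:

```python
import scrapy
from selenium import webdriver


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    start_urls = ['https://news.163.com/']

    def __init__(self):
        # One shared Chrome instance; the downloader middleware reuses it via spider.bro
        self.bro = webdriver.Chrome(executable_path=r'D:\WebCrawler\scrapy_program\wangyiPro\wangyiPro\spiders\chromedriver')

    def closed(self, spider):
        # Quit the browser once the spider finishes
        self.bro.quit()
```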
middlewares.py
```python
from scrapy import signals
from scrapy.http import HtmlResponse
from time import sleep


class WangyiproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # Intercept the responses of the five section pages and tamper with them
    def process_response(self, request, response, spider):
        # Get the browser object defined on the spider class
        bro = spider.bro
        # Pick out the responses that need to be tampered with:
        # the url identifies the request, and the request identifies the response;
        # spider is the spider object
        if request.url in spider.models_urls:  # requests for the five section pages
            bro.get(request.url)
            sleep(2)
            page_text = bro.page_source  # contains the dynamically loaded news data
            # response is the original response object of a section page.
            # Build a new response object that contains the dynamically loaded
            # news data and use it to replace the original one.
            # selenium makes it easy to obtain the dynamically loaded data.
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            # Responses of all other requests are passed through unchanged
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
```
In process_response(self, request, response, spider) the responses of the five section pages are intercepted and tampered with: the url identifies the request, and the request identifies the response. The bro browser object is taken from the spider, the dynamically rendered page source is read from it, and a new response object built from that source replaces the original static response.
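For this middleware to actually run, it has to be enabled in settings.py. The class path below is the one Scrapy's project template generates for wangyiPro and 543 is the template's default priority; both are assumed defaults:

```python
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
```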
(3) Parse the detail page behind each headline in every section
wangyi.py
```python
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.cccom']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the detail-page urls of the five sections

    # Instantiate a browser object for the dynamically loaded pages
    def __init__(self):
        self.bro = webdriver.Chrome(executable_path=r'D:\WebCrawler\scrapy_program\wangyiPro\wangyiPro\spiders\chromedriver')

    # Parse the detail-page urls of the five sections
    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3, 4, 6, 7, 8]
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)
        # Request each section page in turn
        for url in self.models_urls:
            yield scrapy.Request(url, callback=self.parse_model)

    # The news headlines on each section page are loaded dynamically
    def parse_model(self, response):
        # Parse each section page for news titles and detail-page urls
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            item = WangyiproItem()
            item['title'] = title
            # Request the news detail page, passing the item along via meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    # Parse the news body text
    def parse_detail(self, response):
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    def closed(self, spider):
        self.bro.quit()
```
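One compatibility note: Selenium 4.x deprecates, and later removes, the executable_path argument used above, so with a recent Selenium release the browser would be created through a Service object instead. A sketch keeping the same chromedriver path:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4.x style: the driver path is passed via a Service object
bro = webdriver.Chrome(service=Service(r'D:\WebCrawler\scrapy_program\wangyiPro\wangyiPro\spiders\chromedriver'))
```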
items.py
```python
import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```
pipelines.py
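A minimal pipeline sketch that persists each scraped item; the output file name wangyi.txt and the plain-text formatting are assumptions:

```python
class WangyiproPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.fp = open('./wangyi.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one news item (title + body) per call
        self.fp.write(item['title'] + '\n' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.fp.close()
```
The pipeline also has to be enabled in settings.py, e.g. ITEM_PIPELINES = {'wangyiPro.pipelines.WangyiproPipeline': 300}, with the class path assumed to match the generated default.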
Summary
To recap: the spider first collects the URLs of the five section pages from the NetEase News homepage, a selenium-backed downloader middleware replaces the static section responses with the dynamically rendered pages, and the spider then follows each headline to its detail page to extract the title and body text.
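With everything in place, the spider is run from the project root with scrapy crawl wangyi, and the scraped titles and contents flow through the item pipeline.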