Scraping dynamic data with scrapy + selenium + chromedriver

Scrapy is a web crawling framework.
To install Scrapy, using Anaconda is recommended. An introduction to installing Anaconda: http://www.scrapyd.cn/doc/124.html

After installing, configure the Tsinghua mirror. In the Anaconda Prompt, enter:
```
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
```
Then, in a cmd window, enter:

```
conda install scrapy
```

Type y at the prompt to start the download. When it finishes, run scrapy to check whether the installation succeeded.
Install the selenium module:

```
conda install selenium
```

Install the xlutils module:

```
conda install xlutils
```

Download chromedriver from http://npm.taobao.org/mirrors/chromedriver/, choosing the version that matches the Google Chrome installed on your machine.
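Since Chrome 70, chromedriver releases track Chrome's major version, so matching on the major version number is usually enough. A small helper for picking the right download, hypothetical and not part of the post's code:

```python
def chromedriver_series(chrome_version: str) -> str:
    """Extract the major version from a full Chrome version string.

    For Chrome >= 70, pick the chromedriver build whose version starts with
    the same major number (e.g. Chrome 74.x needs a chromedriver 74.x build).
    """
    return chrome_version.split(".")[0]

print(chromedriver_series("74.0.3729.108"))  # prints 74
```

Your Chrome version is shown under chrome://version in the browser's address bar.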
Create a Scrapy project:

```
scrapy startproject mingyan
```

Here mingyan is the project name. Under mingyan/spiders, create a new file index.py to hold the spider code:
```python
import scrapy
import xlwt
import xlrd

# Global state
row = 0                        # current spreadsheet row
book = xlwt.Workbook()         # create a new Excel workbook
sheet = book.add_sheet('全國')  # add a sheet ("Nationwide")
sheet.write(row, 0, "大學")         # university
sheet.write(row, 1, "圖片地址")     # image URL
sheet.write(row, 2, "省份")         # province
sheet.write(row, 3, "大學類型")     # university type
sheet.write(row, 4, "辦學性質")     # public/private status
sheet.write(row, 5, "所屬單位")     # supervising authority
sheet.write(row, 6, "標簽")         # tags
sheet.write(row, 7, "大學詳情網址")  # detail page URL

class yzy(scrapy.Spider):  # must inherit from scrapy.Spider
    name = "yzy"  # spider name

    # start_urls = [  # alternative: define start_urls and skip start_requests
    #     'http://lab.scrapyd.cn/page/1/',
    # ]

    def start_requests(self):  # generate the requests to crawl
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36"}
        # set cookies
        cookies = {
            'connect.sid': 's%3Ao5B5Jt2IE07URCM1yHnF7_qjEsx38tVP.CZ%2Fy0YUUNvuYoh0VaIUAusX8pCH8O%2BQZXC%2FUQ6O7OBw',
            'UM_distinctid': '16c26c375e62c7-04af75133362ad-3e385f05-100200-16c26c375e7de',
            'CNZZDATA1254568697': '902292085-1564015095-https%253A%252F%252Fwww.youzy.cn%252F%7C1564015095',
            'Youzy2CCurrentProvince': '%7B%22provinceId%22%3A%22849%22%2C%22provinceName%22%3A%22%E6%B9%96%E5%8C%97%22%2C%22isGaokaoVersion%22%3Atrue%7D'
        }

        # url = 'https://www.youzy.cn/tzy/search/colleges/collegeList?page=1'
        # yield scrapy.Request(url, headers=headers, cookies=cookies, callback=self.parse)
        # There are 144 pages; range(1, 145) yields 1 through 144
        for i in range(1, 145):
            url = 'https://www.youzy.cn/tzy/search/colleges/collegeList?page=' + str(i)
            yield scrapy.Request(url, headers=headers, cookies=cookies, callback=self.parse)

    def parse(self, response):  # called by Scrapy; response holds each crawled page
        # use the module-level globals
        global row
        global sheet
        global book
        li_box = response.css('.uzy-college-list li.clearfix')
        for item in li_box:  # loop over the result set
            imgSrc = item.css('.mark img::attr(src)').extract()[0]
            colegeName = item.css('.info .top a::text').extract()[0]
            classify = item.css(".info .bottom .quarter_1::text").extract()[0]
            collegeType = item.css(".info .bottom .quarter_2::text").extract()[0]
            industry = item.css(".info .bottom .quarter::text").extract()[0]
            # The location may be missing, which would push a fixed index out of
            # range, so take the last element instead
            province = item.css(".info .bottom .quarter::text").extract()[-1]
            tags = item.css(".info .bottom .college-types-txt::text").extract()
            tags = ",".join(tags)

            infoUrl = item.css('.mark a::attr(href)').extract()[0]
            infoUrl = "https://www.youzy.cn" + infoUrl

            # tags = item.css('.tags .tag::text').extract()
            # tags = ",".join(tags)
            # write the row to the spreadsheet
            row += 1
            sheet.write(row, 0, colegeName)
            sheet.write(row, 1, imgSrc)
            sheet.write(row, 2, province.strip())
            sheet.write(row, 3, classify)
            sheet.write(row, 4, collegeType.strip())
            sheet.write(row, 5, industry)
            sheet.write(row, 6, tags)
            sheet.write(row, 7, infoUrl)
        book.save('yzy.xls')
        # "Next page" crawling code, not used here: it is sequential, so it
        # would slow the crawl down considerably
        # next_page = response.css('li.next a::attr(href)').extract_first()
        # if next_page is not None:
        #     print(next_page)
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
```

About Scrapy settings
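The post does not show settings.py itself. To route requests through the Selenium middleware defined in myMiddleware.py below, a typical configuration might look like the following sketch; the module path and the priority value 543 are assumptions based on the project layout described above:

```python
# mingyan/settings.py (sketch; module path assumed from the project layout)

BOT_NAME = 'mingyan'

# Register the Selenium middleware so matching requests are rendered by Chrome
DOWNLOADER_MIDDLEWARES = {
    'mingyan.myMiddleware.javaScriptMiddleware': 543,  # 543 is Scrapy's conventional slot
}

# The spider supplies its own headers and cookies, so tutorials of this kind
# usually disable robots.txt checking
ROBOTSTXT_OBEY = False
```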
Using chromedriver

The middleware, in myMiddleware.py:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from scrapy.http import HtmlResponse
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=chrome_options,
                          executable_path='D:\\Anaconda3\\Scripts\\chromedriver.exe')

class javaScriptMiddleware(object):
    def process_request(self, request, spider):
        global driver
        if spider.name == 'yzy_majorinfo_bk':
            # driver = webdriver.Chrome('D:\\Anaconda3\\Scripts\\chromedriver.exe')
            driver.get(request.url)
            print(request.cookies)
            # write cookies into the browser session
            # for key in request.cookies:
            #     driver.add_cookie({'name': key, 'value': request.cookies[key]})
            # driver.refresh()
            # time.sleep(1)
            # js = "var q=document.documentElement.scrollTop=10000"
            # driver.execute_script(js)  # run JS to mimic the user scrolling to the bottom
            time.sleep(2)
            body = driver.page_source
            print("Visiting " + request.url)
            return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
        elif spider.name == 'yzy_majorinfo_zk':
            driver.get(request.url)
            time.sleep(2)
            body = driver.page_source
            print("Visiting " + request.url)
            return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
        else:
            return None
```
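The key contract here: when process_request returns a response, Scrapy skips its normal downloader and hands that response straight to the spider; returning None lets the request go through the regular download path. A toy, dependency-free illustration of that flow (the class and names below are illustrative, not Scrapy's actual API):

```python
class FakeResponse:
    """Minimal stand-in for scrapy.http.HtmlResponse."""
    def __init__(self, url, body):
        self.url = url
        self.body = body

class RenderMiddleware:
    """Toy stand-in for javaScriptMiddleware: render matching spiders' requests
    in a browser, pass everything else through untouched."""
    def __init__(self, render_names):
        self.render_names = set(render_names)

    def process_request(self, url, spider_name):
        if spider_name in self.render_names:
            # the real middleware does driver.get(url) + driver.page_source here
            return FakeResponse(url, "<html>rendered</html>")
        return None  # None means: let the normal downloader fetch it

mw = RenderMiddleware({"yzy_majorinfo_bk", "yzy_majorinfo_zk"})
assert mw.process_request("https://example.com", "yzy_majorinfo_bk").body == "<html>rendered</html>"
assert mw.process_request("https://example.com", "yzy") is None  # plain spider: normal download
```

This is why only the yzy_majorinfo_bk and yzy_majorinfo_zk spiders pay the Selenium rendering cost, while the yzy list spider keeps using Scrapy's fast built-in downloader.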
The full code is on Gitee: https://gitee.com/gw2_cs/pachonglianshou

Reposted from: https://www.cnblogs.com/csdcs/p/11255538.html