Crawler + data analysis: buy a house in Chongqing? Scraping Chongqing housing prices

These days, when a couple gets married, the bride's family usually expects a home in the city. To understand housing prices in recent years, the first step is to collect the listings published online. This post uses the new-home listings on Lianjia's Chongqing site as an example: we scrape the data, then analyze it.
The crawler

1. URL analysis
https://cq.fang.lianjia.com/loupan/
Now let's find where the information we want lives in the page. Open the browser's developer tools and inspect the elements; as the screenshot shows, each listing is stored in its own `li` tag.

Click into one `li` tag to locate the listing's name, address, and price.
Looking at the URL itself: clicking "next page" changes the `pg` parameter — page 1 is `pg1`, page 2 is `pg2`, and so on.
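That pagination pattern can be written down directly. A small sketch (the 100-page bound matches the loop used later in this post):

```python
# Build the listing-page URLs: pg1, pg2, ... pg100
base = "https://cq.fang.lianjia.com/loupan/pg{}/"
urls = [base.format(i) for i in range(1, 101)]

print(urls[0])   # https://cq.fang.lianjia.com/loupan/pg1/
print(urls[-1])  # https://cq.fang.lianjia.com/loupan/pg100/
```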
2. Scraping a single page

We fetch each page with requests and parse it with Beautiful Soup.
3. Extracting the listing information
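Before walking through the full parser, the CSS selectors can be sanity-checked against a toy fragment. The HTML below is made up for illustration — only the class names are copied from the real Lianjia page (this demo uses the built-in `html.parser`; the scraper itself uses `lxml`):

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like one Lianjia listing <li>
html = '''
<ul class="resblock-list-wrapper">
  <li>
    <div class="resblock-name"><a class="name">英华天元</a>
      <span class="resblock-type">住宅</span>
      <span class="sale-status">在售</span></div>
    <div class="resblock-location"><span>南岸</span><span>南坪</span>
      <a>铜元局轻轨站</a></div>
    <div class="resblock-price">
      <div class="main-price"><span class="number">16000</span></div>
      <div class="second">总价400万/套</div></div>
  </li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')
house = soup.select('.resblock-list-wrapper li')[0]   # one <li> per listing
name = [i.get_text() for i in house.select('.resblock-name a.name')]
district = [i.get_text() for i in house.select('.resblock-location span')]
price = [i.get_text() for i in house.select('.resblock-price .main-price .number')]
print(name, district, price)  # ['英华天元'] ['南岸', '南坪'] ['16000']
```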
```python
# Parse one page and extract each listing's fields.
# craw() (the fetch function) is defined in the next section.
def pase_page(url, page):
    html = craw(url, page)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        # -- each listing is one <li> tag --
        houses = soup.select('.resblock-list-wrapper li')
        # -- then pull each listing's fields --
        for house in houses:
            # name: 英华天元, 斌鑫江南御府 ...
            recommend_project = [i.get_text() for i in house.select('.resblock-name a.name')]
            # type: 写字楼, 底商 ...
            house_type = [i.get_text() for i in house.select('.resblock-name span.resblock-type')]
            # sale status: 在售, 售罄 ...
            sale_status = [i.get_text() for i in house.select('.resblock-name span.sale-status')]
            # district, e.g. ['南岸', '南坪'], ['巴南', '李家沱'] ...
            big_address = [i.get_text() for i in house.select('.resblock-location span')]
            # street address, e.g. 铜元局轻轨站菜园坝长江大桥南桥头堡上, 龙洲大道1788号 ...
            small_address = [i.get_text() for i in house.select('.resblock-location a')]
            # selling points, e.g. ['环线房', '近主干道', '配套齐全', '购物方便']
            advantage = [i.get_text() for i in house.select('.resblock-tag span')]
            # average price per m2: 16000, 25000, 价格待定 ...
            average_price = [i.get_text() for i in house.select('.resblock-price .main-price .number')]
            # total price (unit: 10k yuan), e.g. 总价400万/套
            total_price = [i.get_text() for i in house.select('.resblock-price .second')]
```

4. Crawling all pages and saving the results to a CSV
```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
from requests.exceptions import RequestException

# fetch one page
def craw(url, page):
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                                 "(KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"}
        html1 = requests.request("GET", url, headers=headers, timeout=10)
        html1.encoding = 'utf-8'  # set the encoding explicitly so the body decodes correctly
        return html1.text
    except RequestException:
        print('Failed to fetch page {0}'.format(page))
        return None

# parse one page and append its listings to the CSV
def pase_page(url, page):
    html = craw(url, page)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        houses = soup.select('.resblock-list-wrapper li')  # one <li> per listing
        for j in range(len(houses)):
            house = houses[j]
            # name / type / sale status
            recommend_project = ' '.join(i.get_text() for i in house.select('.resblock-name a.name'))
            house_type = ' '.join(i.get_text() for i in house.select('.resblock-name span.resblock-type'))
            sale_status = ' '.join(i.get_text() for i in house.select('.resblock-name span.sale-status'))
            # district (e.g. 南岸南坪) and street address
            big_address = ''.join(i.get_text() for i in house.select('.resblock-location span'))
            small_address = ' '.join(i.get_text() for i in house.select('.resblock-location a'))
            # selling points, e.g. 环线房 近主干道 配套齐全 购物方便
            advantage = ' '.join(i.get_text() for i in house.select('.resblock-tag span'))
            # average price per m2 and total price (unit: 10k yuan)
            average_price = ' '.join(i.get_text() for i in house.select('.resblock-price .main-price .number'))
            total_price = ' '.join(i.get_text() for i in house.select('.resblock-price .second'))

            # -------------- write one row to the CSV --------------
            information = [recommend_project, house_type, sale_status, big_address,
                           small_address, advantage, average_price, total_price]
            information = np.array(information).reshape(-1, 8)
            information = pd.DataFrame(information, columns=['名称', '类型', '销售状态', '大地址',
                                                            '具体地址', '优势', '均价', '总价'])
            if page == 1 and j == 0:
                # very first row: write the header too; mode='a+' appends
                information.to_csv('链家网重庆房子数据.csv', mode='a+', index=False)
            else:
                information.to_csv('链家网重庆房子数据.csv', mode='a+', index=False, header=False)
        print('Page {0} saved'.format(page))
    else:
        print('Parse failed')

for i in range(1, 101):  # pages 1-100
    url = "https://cq.fang.lianjia.com/loupan/pg" + str(i) + "/"
    pase_page(url, i)
print('Done')
```

5. Multithreaded crawling
```python
# craw() and pase_page() are the same as in section 4, except that
# pase_page() here always appends with header=False (pages finish out of order,
# so there is no reliable "first" row to attach the header to).
import threading

for i in range(1, 99, 2):  # two pages per iteration
    url1 = "https://cq.fang.lianjia.com/loupan/pg" + str(i) + "/"
    url2 = "https://cq.fang.lianjia.com/loupan/pg" + str(i + 1) + "/"
    t1 = threading.Thread(target=pase_page, args=(url1, i))      # thread 1
    t2 = threading.Thread(target=pase_page, args=(url2, i + 1))  # thread 2
    t1.start()
    t2.start()
    t1.join()  # wait for both threads, so at most two requests are in flight
    t2.join()
```

Probably due to network problems, many pages failed to come down.
Only about 438 records were saved, out of 1,838 listings on the site.

You could record the page numbers that failed and re-request them; I won't bother here — this will do.
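For completeness, here is one way that retry loop could look — a sketch, not from the original post. It assumes the fetch function returns a success flag (the original `pase_page` would need to be changed to do that); `flaky_fetch` below is a stand-in used only to demonstrate the loop:

```python
# Sketch: re-request failed pages for up to max_rounds rounds.
def crawl_with_retry(fetch_page, pages, max_rounds=3):
    pending = list(pages)
    for _ in range(max_rounds):
        if not pending:
            break
        # keep only the pages whose fetch returned False
        pending = [p for p in pending if not fetch_page(p)]
    return pending  # pages that still failed after all rounds

# demo: a fake fetcher that fails the first attempt on even pages
seen = set()
def flaky_fetch(page):
    if page % 2 == 0 and page not in seen:
        seen.add(page)
        return False  # first attempt on an even page fails
    return True

leftover = crawl_with_retry(flaky_fetch, range(1, 6))
print(leftover)  # [] -- everything succeeded on the second round
```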