Crawler + data analysis: buy a house in Chongqing? Scraping Chongqing housing prices

These days, when a couple gets married, the bride's family usually expects a home in the city. To understand housing prices in recent years, the first step is to collect the listings published online. This post uses the new-home listings on Lianjia's Chongqing site as an example: we scrape the data, then analyze it.
The crawler

1. URL analysis
https://cq.fang.lianjia.com/loupan/
Now let's find where the information we want lives in the page. Open the browser's developer tools and inspect the elements; as the screenshot shows, each listing is stored in its own `li` tag.

Click into one `li` tag to locate the listing's name, address, and price.
Looking at the URL itself: clicking "next page" changes the `pg` parameter — page 1 is `pg1`, page 2 is `pg2`, and so on.
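That pagination pattern can be written down directly. A small sketch (the 100-page bound matches the loop used later in this post):

```python
# Build the listing-page URLs: pg1, pg2, ... pg100
base = "https://cq.fang.lianjia.com/loupan/pg{}/"
urls = [base.format(i) for i in range(1, 101)]

print(urls[0])   # https://cq.fang.lianjia.com/loupan/pg1/
print(urls[-1])  # https://cq.fang.lianjia.com/loupan/pg100/
```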
2. Scraping a single page

We fetch each page with requests and parse it with Beautiful Soup.
3. Extracting the listing information
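Before walking through the full parser, the CSS selectors can be sanity-checked against a toy fragment. The HTML below is made up for illustration — only the class names are copied from the real Lianjia page (this demo uses the built-in `html.parser`; the scraper itself uses `lxml`):

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like one Lianjia listing <li>
html = '''
<ul class="resblock-list-wrapper">
  <li>
    <div class="resblock-name"><a class="name">英华天元</a>
      <span class="resblock-type">住宅</span>
      <span class="sale-status">在售</span></div>
    <div class="resblock-location"><span>南岸</span><span>南坪</span>
      <a>铜元局轻轨站</a></div>
    <div class="resblock-price">
      <div class="main-price"><span class="number">16000</span></div>
      <div class="second">总价400万/套</div></div>
  </li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')
house = soup.select('.resblock-list-wrapper li')[0]   # one <li> per listing
name = [i.get_text() for i in house.select('.resblock-name a.name')]
district = [i.get_text() for i in house.select('.resblock-location span')]
price = [i.get_text() for i in house.select('.resblock-price .main-price .number')]
print(name, district, price)  # ['英华天元'] ['南岸', '南坪'] ['16000']
```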
```python
# Parse one page and extract each listing's fields.
# craw() (the fetch function) is defined in the next section.
def pase_page(url, page):
    html = craw(url, page)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        # -- each listing is one <li> tag --
        houses = soup.select('.resblock-list-wrapper li')
        # -- then pull each listing's fields --
        for house in houses:
            # name: 英华天元, 斌鑫江南御府 ...
            recommend_project = [i.get_text() for i in house.select('.resblock-name a.name')]
            # type: 写字楼, 底商 ...
            house_type = [i.get_text() for i in house.select('.resblock-name span.resblock-type')]
            # sale status: 在售, 售罄 ...
            sale_status = [i.get_text() for i in house.select('.resblock-name span.sale-status')]
            # district, e.g. ['南岸', '南坪'], ['巴南', '李家沱'] ...
            big_address = [i.get_text() for i in house.select('.resblock-location span')]
            # street address, e.g. 铜元局轻轨站菜园坝长江大桥南桥头堡上, 龙洲大道1788号 ...
            small_address = [i.get_text() for i in house.select('.resblock-location a')]
            # selling points, e.g. ['环线房', '近主干道', '配套齐全', '购物方便']
            advantage = [i.get_text() for i in house.select('.resblock-tag span')]
            # average price per m2: 16000, 25000, 价格待定 ...
            average_price = [i.get_text() for i in house.select('.resblock-price .main-price .number')]
            # total price (unit: 10k yuan), e.g. 总价400万/套
            total_price = [i.get_text() for i in house.select('.resblock-price .second')]
```

4. Crawling all pages and saving the results to a CSV
```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
from requests.exceptions import RequestException

# fetch one page
def craw(url, page):
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                                 "(KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"}
        html1 = requests.request("GET", url, headers=headers, timeout=10)
        html1.encoding = 'utf-8'  # set the encoding explicitly so the body decodes correctly
        return html1.text
    except RequestException:
        print('Failed to fetch page {0}'.format(page))
        return None

# parse one page and append its listings to the CSV
def pase_page(url, page):
    html = craw(url, page)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        houses = soup.select('.resblock-list-wrapper li')  # one <li> per listing
        for j in range(len(houses)):
            house = houses[j]
            # name / type / sale status
            recommend_project = ' '.join(i.get_text() for i in house.select('.resblock-name a.name'))
            house_type = ' '.join(i.get_text() for i in house.select('.resblock-name span.resblock-type'))
            sale_status = ' '.join(i.get_text() for i in house.select('.resblock-name span.sale-status'))
            # district (e.g. 南岸南坪) and street address
            big_address = ''.join(i.get_text() for i in house.select('.resblock-location span'))
            small_address = ' '.join(i.get_text() for i in house.select('.resblock-location a'))
            # selling points, e.g. 环线房 近主干道 配套齐全 购物方便
            advantage = ' '.join(i.get_text() for i in house.select('.resblock-tag span'))
            # average price per m2 and total price (unit: 10k yuan)
            average_price = ' '.join(i.get_text() for i in house.select('.resblock-price .main-price .number'))
            total_price = ' '.join(i.get_text() for i in house.select('.resblock-price .second'))

            # -------------- write one row to the CSV --------------
            information = [recommend_project, house_type, sale_status, big_address,
                           small_address, advantage, average_price, total_price]
            information = np.array(information).reshape(-1, 8)
            information = pd.DataFrame(information, columns=['名称', '类型', '销售状态', '大地址',
                                                            '具体地址', '优势', '均价', '总价'])
            if page == 1 and j == 0:
                # very first row: write the header too; mode='a+' appends
                information.to_csv('链家网重庆房子数据.csv', mode='a+', index=False)
            else:
                information.to_csv('链家网重庆房子数据.csv', mode='a+', index=False, header=False)
        print('Page {0} saved'.format(page))
    else:
        print('Parse failed')

for i in range(1, 101):  # pages 1-100
    url = "https://cq.fang.lianjia.com/loupan/pg" + str(i) + "/"
    pase_page(url, i)
print('Done')
```

5. Multithreaded crawling
```python
# craw() and pase_page() are the same as in section 4, except that
# pase_page() here always appends with header=False (pages finish out of order,
# so there is no reliable "first" row to attach the header to).
import threading

for i in range(1, 99, 2):  # two pages per iteration
    url1 = "https://cq.fang.lianjia.com/loupan/pg" + str(i) + "/"
    url2 = "https://cq.fang.lianjia.com/loupan/pg" + str(i + 1) + "/"
    t1 = threading.Thread(target=pase_page, args=(url1, i))      # thread 1
    t2 = threading.Thread(target=pase_page, args=(url2, i + 1))  # thread 2
    t1.start()
    t2.start()
    t1.join()  # wait for both threads, so at most two requests are in flight
    t2.join()
```

Probably due to network problems, many pages failed to come down.
Only about 438 records were saved, out of 1,838 listings on the site.

You could record the page numbers that failed and re-request them; I won't bother here — this will do.
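For completeness, here is one way that retry loop could look — a sketch, not from the original post. It assumes the fetch function returns a success flag (the original `pase_page` would need to be changed to do that); `flaky_fetch` below is a stand-in used only to demonstrate the loop:

```python
# Sketch: re-request failed pages for up to max_rounds rounds.
def crawl_with_retry(fetch_page, pages, max_rounds=3):
    pending = list(pages)
    for _ in range(max_rounds):
        if not pending:
            break
        # keep only the pages whose fetch returned False
        pending = [p for p in pending if not fetch_page(p)]
    return pending  # pages that still failed after all rounds

# demo: a fake fetcher that fails the first attempt on even pages
seen = set()
def flaky_fetch(page):
    if page % 2 == 0 and page not in seen:
        seen.add(page)
        return False  # first attempt on an even page fails
    return True

leftover = crawl_with_retry(flaky_fetch, range(1, 6))
print(leftover)  # [] -- everything succeeded on the second round
```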