Scraping Giantfind Apartment (建方公寓) Listings
Scraping Giantfind Apartment listing data
- Background
- Full Code
- Closing Notes
Background
After the city-by-city scrape of Qingke (青客) apartment listings and the scrape of Uoko (優客逸家) listings, I found crawling quite fun, so today I took Giantfind Apartment (建方公寓) for practice, and nearly tripped up. Let me explain. Based on the previous two crawls, the semi-automatic approach I used for Qingke worked well, so this time the program takes just three inputs, the city name, the city code, and the total number of pages, and then runs. I rather like this interactive mode; it gives a sense of participation. But whenever I printed the parsed page, it kept telling me the elements I was looking for were not there. After some fiddling I found the request headers were the problem: at first I had only set a User-Agent, which the server most likely flagged as a crawler, so I copied the headers verbatim from the Network tab of the browser's developer tools:
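Before the full listing, the interactive idea in a nutshell: the three inputs feed one URL template, producing one listing-page URL per page. Below is a minimal sketch of that generator using the same URL template as the full code; the city code "GZ" is a made-up placeholder, not a value taken from the site.

```python
import urllib.parse  # URL-encode the Chinese city name


def generate_page(page_num, city, cityCode):
    """Yield the listing-page URL for each page from 1 to page_num."""
    url = ("http://www.giantfind.com.cn/findRoomPc/index_{}.jhtml"
           "?city={}&cityCode={}&reservationChannel=21")
    for next_page in range(1, int(page_num) + 1):
        yield url.format(next_page, city, cityCode)


# "GZ" is a placeholder city code for illustration only
city = urllib.parse.quote("广州")
pages = list(generate_page(2, city, "GZ"))
for link in pages:
    print(link)
```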
```python
header = {  # full request headers copied from the browser's Network tab
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "_site_id_cookie=1; clientlanguage=zh_CN; SESSION=62a74a27387f4f4a9ca7cf4e45768631; _cookie_city_name=%E5%B9%BF%E5%B7%9E",
    "Host": "www.giantfind.com.cn",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
}
```

After this change, printing the parsed page finally showed everything I was after, with no more "not found" complaints. My mood instantly improved. The full code follows.
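Incidentally, the `_cookie_city_name` value in the Cookie header is just the percent-encoded city name, the same encoding `main()` applies to the city name typed in at the prompt. A quick standard-library check:

```python
import urllib.parse

# %E5%B9%BF%E5%B7%9E in the Cookie above is the URL-encoded form of 广州 (Guangzhou)
encoded = urllib.parse.quote("广州")
print(encoded)                         # %E5%B9%BF%E5%B7%9E
print(urllib.parse.unquote(encoded))   # 广州
```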
Full Code
```python
# -*- coding: utf-8 -*-
"""
project_name: giantfind
@author: 帥帥de三叔
Created on Tue Aug 6 09:21:11 2019
"""
import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing
import urllib.parse            # URL-encode the Chinese city name
import re                      # regular expressions
import pymysql                 # MySQL access
import time                    # request throttling

host = "http://www.giantfind.com.cn"  # site root, used to build detail-page URLs
header = {  # full request headers copied from the browser's Network tab
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "_site_id_cookie=1; clientlanguage=zh_CN; SESSION=62a74a27387f4f4a9ca7cf4e45768631; _cookie_city_name=%E5%B9%BF%E5%B7%9E",
    "Host": "www.giantfind.com.cn",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
}

print("connecting mysql……\n")
db = pymysql.connect("localhost", "root", "123456", "giantfind", charset="utf8")  # connect to the database
print("connect successfully\n")
cursor = db.cursor()  # get a cursor

cursor.execute("drop table if exists giantfind_gz")  # recreate the table on every run
print("start creating table giantfind_gz")
c_sql = """CREATE TABLE giantfind_gz(
    district varchar(8),
    title varchar(20),
    area varchar(6),
    price varchar(6),
    house_type varchar(6),
    floor varchar(6),
    towards_or_style varchar(4),
    address varchar(30)
) Engine=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=UTF8"""
cursor.execute(c_sql)
print("table giantfind_gz has been created, please insert into data\n")


def generate_page(page_num, city, cityCode):  # yield the URL of every listing page
    url = "http://www.giantfind.com.cn/findRoomPc/index_{}.jhtml?city={}&cityCode={}&reservationChannel=21"
    for next_page in range(1, int(page_num) + 1):
        yield url.format(next_page, city, cityCode)


def get_detail_item(page_url):  # scrape every listing found on one listing page
    response = requests.get(page_url, headers=header)  # request the listing page
    time.sleep(1)  # pause 1 second to be gentle on the server
    soup = BeautifulSoup(response.text, "lxml")  # parse the listing page
    detail_list = soup.find("div", "content").find("div", class_="list-life list-lifen").findAll(
        "a", class_="list-la list-lb stat")  # all listings on this page
    for content in detail_list:
        detail_url = host + content["href"]  # build the detail-page URL
        answer = requests.get(detail_url, headers=header)  # request the detail page
        answer_json = BeautifulSoup(answer.text, "lxml")  # parse the detail page
        card = answer_json.find("div", class_="hos-csho")  # block holding all the fields
        district = card.find("p").get_text().replace("建方·家", "").replace("建方·寓", "").strip()  # district
        title = card.find("h2").find("span").get_text()  # listing name
        type_and_area = card.find("ul", class_="hos-clist").findAll("li")[0].find("i").find("span").get_text()
        area = type_and_area.split(" ")[1].replace("㎡", "")  # floor area
        house_type = type_and_area.split(" ")[0]  # layout
        pattern_price = re.compile(r"\d+")  # regex for the price digits
        price = re.search(pattern_price, card.find("div").find("strong").get_text()).group(0)  # monthly rent
        floor = card.find("ul", class_="hos-clist").findAll("li")[1].find("i").get_text().replace("層", "")  # floor
        towards_or_style = card.find("ul", class_="hos-clist").findAll("li")[2].find("i").get_text().strip()  # orientation
        address = card.find("ul", class_="hos-clist").findAll("li")[4].find("i").get_text().replace(">", "").strip()  # full address
        print(district, title, area, price, house_type, floor, towards_or_style, address)  # field check
        insert_data = ("INSERT INTO giantfind_gz"
                       "(district,title,area,price,house_type,floor,towards_or_style,address)"
                       "VALUES(%s,%s,%s,%s,%s,%s,%s,%s)")  # parameterized insert
        giantfind_data = [district, title, area, price, house_type, floor, towards_or_style, address]  # row to insert
        cursor.execute(insert_data, giantfind_data)  # run the insert
        db.commit()  # commit explicitly


def main():  # tie the pieces together
    city = urllib.parse.quote(input("please input city name:"))  # URL-encode the typed city name
    cityCode = input("please input city code:")  # city code
    page_num = input("please input total pages num:")  # total number of listing pages
    for page_link in generate_page(page_num, city, cityCode):
        get_detail_item(page_link)


if __name__ == "__main__":
    main()
```

Closing Notes
This post exists only to record the request-header problem I ran into; I am not going to analyze the code line by line. The crawler is for learning and exchange only. If it offends anyone, please let me know and I will take it down.
Further Reading
City-by-city scrape of Qingke (青客) apartment listings
Scrape of Uoko (優客逸家) apartment listings