Web Crawlers -- General
Overview
Is crawling legal?
Yes. Crawling itself is simply a computer-science technique, a tool.
What is a crawler?
A crawler is a program that simulates a browser and fetches data from the internet. What a crawler scrapes is ultimately the server's response data.
Classification of crawlers by use case
- General-purpose crawler: scrapes entire pages; the "fetch system" of a search engine
- Focused crawler: scrapes specified content within pages; built on top of a general-purpose crawl, it parses and filters the fetched data locally
- Incremental crawler: detects updates on a site and scrapes only the newly added data
Anti-crawling mechanisms
Techniques or policies a site deploys to stop crawler programs from scraping its page data
- Mechanism 1: the robots protocol, a plain-text convention asking crawlers to stay out of certain paths. Compliance is purely voluntary ("it keeps honest people honest"); a crawler can simply ignore it and scrape anyway.
- Mechanism 2: UA detection, where the server checks whether the request carrier is a real browser by inspecting the User-Agent header
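The robots protocol can be checked programmatically with the standard library. A minimal offline sketch, using a made-up robots.txt body instead of fetching a real one (a real file lives at `https://<site>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```

Whether you honor the answer is, as noted above, up to you.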
Anti-anti-crawling strategies
How a crawler defeats the anti-crawling policies a site has deployed
- Against mechanism 1: simply ignore robots.txt
- Against mechanism 2: UA spoofing
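UA spoofing with requests is just a headers dict. A minimal sketch (the Chrome UA string below is an example; copy a current one from your own browser's dev tools):

```python
import requests

# Without spoofing, requests announces itself, which UA detection can flag:
print(requests.utils.default_user_agent())  # e.g. "python-requests/2.31.0"

# UA spoofing: present a real browser's User-Agent instead
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    )
}
# A real request would then be: requests.get(url, headers=headers)
```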
The HTTP/HTTPS protocols
A format for data exchange between client and server
- Request headers:
    - User-Agent: identifies the request carrier
    - Connection: close (tear the connection down as soon as the request completes)
- Response headers:
    - Content-Type: json, ...
- https: the secure variant
    - Encryption schemes:
        - Symmetric-key encryption: the browser sends the key together with the ciphertext to the server, so anyone intercepting the traffic gets both; extremely insecure
        - Asymmetric-key encryption: the client has no guarantee the public key really came from the server; it may have been intercepted and swapped, so still insecure
        - Certificate-based encryption: secure, because a certificate authority vouches for the server's public key
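The header fields above can be sketched with illustrative values (these dicts are examples, not captured from a real exchange):

```python
# Illustrative request/response headers
request_headers = {
    "User-Agent": "Mozilla/5.0 ...",  # identity of the request carrier
    "Connection": "close",            # close the connection after the response
}
response_headers = {
    "Content-Type": "application/json; charset=utf-8",
}

# The media type is whatever precedes the ";" parameters:
mime = response_headers["Content-Type"].partition(";")[0].strip()
print(mime)  # application/json
```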
Jupyter
The environment used here for writing the crawler programs
Writing the programs
What is dynamically loaded data?
Data that is absent from the initial HTML and is fetched separately by the page through ajax requests (often POSTs) while or after it loads.
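What such an ajax request returns is typically JSON rather than HTML, which is why the examples below call `response.json()`. A tiny sketch with a made-up payload (a real one would be captured in the browser's Network/XHR panel):

```python
import json

# Made-up ajax response body for illustration
ajax_body = '{"data": [{"k": "dog", "v": "n. a domestic animal"}]}'
obj = json.loads(ajax_body)
print(obj["data"][0]["k"])  # dog
```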
相關(guān)模塊
-urllib #比較古老,用法繁瑣被requests模塊代替 requests:網(wǎng)絡(luò)請求的一個模塊.requests的作用: 模擬瀏覽器發(fā)請求。進而實現(xiàn)爬蟲
requests的編碼流程:
- 1.指定url
- 2.發(fā)起請求
- 3.獲取響應(yīng)數(shù)據(jù)
- 4.持久化存儲
Example 1: Sogou homepage data
```python
# A simple general-purpose crawler
import requests

# 1. Specify the URL
url = "https://www.sogou.com/"
# 2. Send the request: get() returns a response object
response = requests.get(url=url)
# 3. Get the response data as a string
page_text = response.text
# 4. Persist it
with open("./sougou.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
```

Example 2: scraping the Sogou results page for a user-supplied query
```python
import requests

url = "https://www.sogou.com/web"
content = input(">>> ").strip()
param = {"query": content}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
response = requests.get(url=url, params=param, headers=headers)
response.encoding = "utf-8"
page_text = response.text
name = content + ".html"
with open(name, "w", encoding="utf-8") as f:
    f.write(page_text)
print("Scraping succeeded")
```

Example 3: cracking Baidu Translate
```python
# Cracking Baidu Translate: the suggestion list is dynamically loaded data
import requests

content = input("Enter a word: ")
url = "https://fanyi.baidu.com/sug"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
data = {"kw": content}
response = requests.post(url=url, headers=headers, data=data)
obj_json = response.json()
print(obj_json)
```

Example 4: scraping movie details from Douban
```python
# Scraping Douban's movie chart; note the list itself is dynamically loaded
import requests
import json

url = "https://movie.douban.com/j/chart/top_list"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
param = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": "200",
}
response = requests.get(url=url, params=param, headers=headers)
movie_json = response.json()
name = "dz_movie" + ".json"
print(len(movie_json))
with open(name, "w", encoding="utf-8") as f:
    json.dump(movie_json, f)
print("Scraping and writing done")
```

Example 5: scraping KFC restaurant locations for any city
```python
import requests
import json

all_data = []
url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
content = input("Enter a city name: ").strip()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
for page in range(1, 8):
    data = {
        "cname": "",
        "pid": "",
        "keyword": content,
        "pageIndex": str(page),
        "pageSize": "10",
    }
    json_obj = requests.post(url=url, headers=headers, data=data).json()
    for shop in json_obj["Table1"]:
        all_data.append(shop)
name = "KFC.json"
with open(name, "w", encoding="utf-8") as f:
    json.dump(all_data, f)
print("KFC data is ok")
```
Example 6: cosmetics companies
#查看國家藥監(jiān)總局中基于中華人民共和國化妝品生產(chǎn)許可證相關(guān)數(shù)據(jù) import requests,json id_lst = [] #獲取所有企業(yè)UUID all_data = [] #存儲所有企業(yè)的詳情信息 post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" } for i in range(1,10):data = {"on": "true","page": str(i),"pageSize": "15","productName": "","conditionType": "1","applyname": "","applysn": ""}json_obj = requests.post(url=post_url,headers=headers,data=data).json()for dic in json_obj["list"]:ID = dic["ID"]id_lst.append(ID) for id in id_lst:detail_post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"data = {"id":id}detail_dic = requests.post(url=detail_post_url,data=data).json()all_data.append(detail_dic) name = "hzpqy"+".json" with open(name,"w",encoding="utf-8") as fb:json.dump(all_data,fb)print("data is ok!") 爬取化妝品企業(yè)信息?
...
轉(zhuǎn)載于:https://www.cnblogs.com/CrazySheldon1/p/10788588.html
Summary