Python crawler: scraping an English-learning website

Published: 2023/12/10 python 26 豆豆

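Basic concepts of crawlers

For the basic concepts behind crawlers, the blog https://xlzd.me/ is recommended; its introduction to crawlers is very easy to follow.
In short, between typing a URL and seeing the page, the browser does a lot of work. Two concepts are involved: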

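IP address: an IP address is a machine's address on the network. In most cases it is written as a dotted string of numbers such as 192.168.1.1. Visiting a URL really means requesting data from the IP address behind it.
DNS server: a DNS server stores the mapping between domain names and IP addresses. A crawler mostly does not need to care about this step. In practice a browser usually does not query a DNS server directly; the real process is more involved and first checks several other places, such as the browser's own cache and the hosts file, so the description above is only a simplification.
Once the browser has the IP address, it sends an HTTP request to that address and receives the HTML source of the home page from the web server. The browser engine parses that source, then sends further requests for the JavaScript, CSS, images, and other resources it references, and finally renders the page as you see it.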
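The request step described above can be sketched without touching the network. This is a minimal illustration, assuming a plain HTTP/1.1 GET; the helper name build_get_request is hypothetical, and a real browser sends many more headers than the two shown here.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return (host, port, raw request text) for a plain HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    # Default ports: 80 for http, 443 for https (used when the URL names none).
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request


host, port, request = build_get_request("http://usefulenglish.ru/phrases/")
print(host, port)
print(request)
```

The DNS step would turn host into an IP address (for instance with socket.gethostbyname(host) from the standard library) before the request text is sent over a connection to that address and port.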

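Setting up the environment

import requests                # downloads web pages
from bs4 import BeautifulSoup  # parses and analyzes page data
import re                      # regular expressions, for cleaning and tidying data
import os                      # system module, for creating result files
import codecs                  # for creating result files and writing to them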

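Implementing the crawler
Before writing a crawler, think about what it has to do, what is distinctive about the target site, and which architecture fits the volume and shape of the site's data.
Chrome's developer tools are recommended for inspecting page structure; the shortcut is F12.
Right-click an element and choose Inspect to see its HTML structure.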

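When a browser sends a request to a server, it normally includes a request header, User-Agent, that identifies the type of browser.
When Python code fetches the page directly, the default User-Agent it sends is python-requests/[version number].
Such a request may fail, because the server may check the User-Agent of incoming requests to detect crawlers. Sending a browser-like User-Agent usually gets around this check.
To set the User-Agent request header, pass it in headers; in the code, the string after 'User-Agent': is the simulated browser header.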
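To see what setting the header amounts to, here is a small sketch that builds a request object and inspects it without sending anything. It swaps in the standard library's urllib.request rather than requests so that it runs with no extra packages; the helper name make_request is hypothetical, and the User-Agent string is the one this article uses.

```python
import urllib.request

# Assumption: any realistic browser User-Agent string would do here.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/47.0.2526.80 Safari/537.36"
)


def make_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = make_request("http://usefulenglish.ru/phrases/")
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen would send it; with requests, the same header goes in the headers argument of requests.get.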

import requests


def fetch_html(url_link):
    """Download url_link with a browser-like User-Agent and return the raw bytes."""
    html = requests.get(url_link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}).content
    return html


DOWNLOAD_URL = 'http://usefulenglish.ru/phrases/'
html = fetch_html(DOWNLOAD_URL)

The code above downloads the entire page at url_link into the variable html.
Next, BeautifulSoup is used to parse the data in html.

#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p>The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a href="http://www.jb51.net" class="sister" id="link1">Elsie</a>,
<a href="http://www.jb51.net" class="sister" id="link2">Lacie</a> and
<a href="http://www.jb51.net" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title)         # output: <title>The Dormouse's story</title>
print(soup.title.name)    # output: title
print(soup.title.string)  # output: The Dormouse's story
print(soup.p)             # output: <p>The Dormouse's story</p>
print(soup.a)             # output: <a class="sister" href="http://www.jb51.net" id="link1">Elsie</a>
print(soup.find_all('a'))
print(soup.find(id='link3'))
print(soup.get_text())

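soup is the parsed document object that BeautifulSoup builds from the string.
soup.title returns the title tag.
soup.p returns the first p tag in the document.
soup.find_all('p') walks the tree and returns every tag that matches 'p'.
get_text() returns the text. Other attributes of a tag are available too: to get the value of the href attribute of the a tag, use print(soup.a['href']); other attributes such as class work the same way: soup.a['class'].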

Parsing an HTML document with BeautifulSoup yields the URLs, data, and so on that you need. The following code extracts the URL links from a string:

import re


def findLinks(htmlString):
    """
    Find every link address in htmlString so it can be reused later.
    htmlString is a string such as
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    """
    linkPattern = re.compile("(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")
    return linkPattern.findall(htmlString)
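The regular expression above matches href= anywhere in the string, including inside ordinary text. As a hedged alternative, the sketch below collects links with the standard library's html.parser, which only looks at real a tags. The names LinkCollector and find_links_with_parser are hypothetical and not part of the original code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_links_with_parser(html_string):
    """Return the href values of all <a> tags in html_string, in order."""
    collector = LinkCollector()
    collector.feed(html_string)
    collector.close()
    return collector.links


sample = '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
print(find_links_with_parser(sample))
```

For small, well-behaved pages either approach works; the parser-based version is mainly useful when the page text itself may contain something that looks like an href attribute.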

Regular expressions

In practice BeautifulSoup covers the vast majority of needs, but analyzing the scraped data further calls for regular expressions.

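Most symbols have a special meaning in a regular expression, so to match one literally you must escape it with \. For example, a literal '.' is written '\.'; the '/' character is not special in Python's re module, so '/a' and '\/a' match the same text.
[] matches any one of the characters inside it.
[^] matches any character except those inside it, for example pattern1 = re.compile("[^p]at")  # any character other than p, followed by "at"
The part of a regular expression inside () is the field that the match extracts (a capture group).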
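The rules above can be tried out in a few lines. This is only an illustrative sketch; the word list and the file name phrases.txt are made up for the example.

```python
import re

words = ["pat", "cat", "bat", "rat", "at"]

# [cb]at: "c" or "b" followed by "at".
either = re.compile(r"[cb]at")
# [^p]at: any single character except "p", followed by "at".
not_p = re.compile(r"[^p]at")
# \. matches a literal dot; (\w+) captures the file extension after it.
extension = re.compile(r"\.(\w+)$")

matched_either = [w for w in words if either.fullmatch(w)]
matched_not_p = [w for w in words if not_p.fullmatch(w)]
ext = extension.search("phrases.txt").group(1)

print(matched_either)  # ['cat', 'bat']
print(matched_not_p)   # ['cat', 'bat', 'rat']
print(ext)             # txt
```

Note that "at" matches neither pattern: both character classes must consume exactly one character before "at".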

Basic usage: for example, given a URL string, take the part after the last / as the name of the file to save:

topic_url = 'http://usefulenglish.ru/phrases/bus-taxi-train-plane'
filename_pattern = re.compile(r'.*/(.*)$')
filename = filename_pattern.findall(topic_url)[0] + '.txt'

Result:
filename = 'bus-taxi-train-plane.txt'
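The same file name can also be derived without a regular expression. The sketch below uses urllib.parse and posixpath from the standard library; the helper name filename_from_url is hypothetical. Unlike the pattern above, it ignores any query string and tolerates a trailing slash.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, suffix=".txt"):
    """Use the last path segment of `url` as a file name."""
    path = urlsplit(url).path  # drops scheme, host, query and fragment
    name = posixpath.basename(path.rstrip("/"))
    return name + suffix


print(filename_from_url("http://usefulenglish.ru/phrases/bus-taxi-train-plane"))
```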

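Finally, the complete code. It extracts everyday spoken phrases from the phrases section of the English-learning website (DOWNLOAD_URL in the code). The code has two main parts:
1) Parse the URLs of the sidebar entries and store them in url_list.
2) Download the English sentences from the body of each page in url_list.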
[Screenshot of the site page: image.png]

import requests
from bs4 import BeautifulSoup
import re
import codecs


def findLinks(htmlString):
    """Find every link address in htmlString so it can be reused later."""
    linkPattern = re.compile("(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")
    return linkPattern.findall(htmlString)


def fetch_html(url_link):
    """Download url_link with a browser-like User-Agent and return the raw bytes."""
    html = requests.get(url_link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}).content
    return html


DOWNLOAD_URL = 'http://usefulenglish.ru/phrases/'

# Part 1: collect the topic URLs from the sidebar.
html = fetch_html(DOWNLOAD_URL)
soup = BeautifulSoup(html, "lxml")
side_soup = soup.find('div', attrs={'id': 'sidebar'})

url_pattern = re.compile(r'^[<a]')
start = 0  # number of <li> items seen so far that contain no link
url_list = []
for items in side_soup.find_all('li'):
    if url_pattern.findall(str(items)):
        url = findLinks(str(items))
        if url == []:
            start = start + 1
            continue
        # Keep only links found between the first and second link-less <li>.
        if start == 1:
            url_list.append(url[0])

# Part 2: download each topic page and save its English sentences.
filename_pattern = re.compile(r'.*/(.*)')
pattern = re.compile(r'^[a-zA-Z]')  # paragraph starts with an English letter
for topic_url in url_list:
    print(topic_url)
    html = fetch_html(topic_url)
    soup = BeautifulSoup(html, "lxml")
    page_soup = soup.find('div', attrs={'class': 'body'})
    filename = filename_pattern.findall(topic_url)[0] + '.txt'

    ftext = []
    for table_div in page_soup.find_all('p'):
        text = table_div.getText()
        if pattern.match(text):
            # Strip ?, ., / and , from the sentence.
            line = re.sub(r'[?./,]', '', text)
            ftext.append(line)

    # Opening in write mode creates the file if it does not exist yet.
    with codecs.open(filename, 'w', encoding='utf-8') as fp:
        fp.write('{ftext}\n'.format(ftext='\n'.join(ftext)))
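The filtering step inside the download loop can be pulled out and tried on its own, with no network access. The sketch below assumes the same two rules as the listing (keep paragraphs that begin with an English letter, then strip ?, ., / and ,); the function name clean_paragraphs and the sample sentences are made up for illustration.

```python
import re

STARTS_WITH_LETTER = re.compile(r"^[a-zA-Z]")
PUNCTUATION = re.compile(r"[?./,]")


def clean_paragraphs(paragraphs):
    """Keep paragraphs that start with an ASCII letter and strip ? . / , from them."""
    cleaned = []
    for text in paragraphs:
        if STARTS_WITH_LETTER.match(text):
            cleaned.append(PUNCTUATION.sub("", text))
    return cleaned


# Hypothetical page content: English phrases mixed with a non-English line.
sample = [
    "Where is the bus stop?",
    "Где автобусная остановка?",
    "One ticket to London, please.",
]
print(clean_paragraphs(sample))
# ['Where is the bus stop', 'One ticket to London please']
```

Keeping this step as a separate function makes it possible to check the cleaning rules against a handful of strings before running the crawler against the live site.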

Appendix: regular expressions

References:
Scraping with requests: https://zhuanlan.zhihu.com/p/20410446
Scraping the Douban Top 250 movies (data from a page list):
https://xlzd.me/2015/12/16/python-crawler-03
A detailed guide to using BeautifulSoup: http://www.jb51.net/article/43572.htm
BeautifulSoup official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://www.169it.com/article/9913111281939258943.html
Python regular expressions:
http://www.runoob.com/python/python-reg-expressions.html
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
