Taking advantage of the summer break, I put the bit of Python data collection I picked up last semester to the test and wrote a crawler for Douban Books. Here is a summary.
This is what I set out to do:
1. Log in
2. Fetch the Douban Books category index
3. Enter each category, scrape the first page of books for the title, author, translator, publication date, and other details, store them in MySQL, then download each cover.
Step 1
First, even scrapers should have a code of honor, so let's check Douban's robots.txt:
```
User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /trailer/
Disallow: /doubanapp/card
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml
# Crawl-delay: 5

User-agent: Wandoujia Spider
Disallow: /
```
Now look at the pages I want to crawl:

```
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
https://book.douban.com/tag/?icn=index-nav
https://book.douban.com/tag/[tag name]
https://book.douban.com/subject/[book ID]/
```

Good: none of these violate the robots rules, so I can write the code with a clear conscience.
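This manual check can also be automated with the standard library's `urllib.robotparser`. The snippet below feeds it a small excerpt of the rules quoted above (not the full file) just to show the mechanics:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# a short excerpt of Douban's rules, enough to demonstrate the check
rp.parse("""
User-agent: *
Disallow: /subject_search
Disallow: /search
""".splitlines())

print(rp.can_fetch("*", "https://book.douban.com/tag/"))   # tag pages are allowed
print(rp.can_fetch("*", "https://www.douban.com/search"))  # /search is disallowed
```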
Step 2
Since I'm writing this anyway, I might as well do it properly, so first log in to Douban.
I used cookie-based login: Firefox's wonderful HttpFox extension captures the headers and cookies from a normal login session.
```python
import requests

def login(url):
    cookies = {}
    with open("C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt") as file:
        raw_cookies = file.read()
    # the raw string looks like "k1=v1; k2=v2; ..."; turn it into a dict
    for line in raw_cookies.split(';'):
        key, value = line.split('=', 1)
        cookies[key.strip()] = value
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/60.0.3112.78 Safari/537.36'}
    return requests.get(url, cookies=cookies, headers=headers)
```
Here the headers are pasted into the program while the cookies are read from a file. Note that the cookie string must be converted into a dict before requests' get() can use it to fetch the page.
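To make that dict-conversion step concrete, here is the same parsing logic run on a made-up cookie string (the values are placeholders, not real Douban cookies):

```python
# raw header copied from the browser looks like "k1=v1; k2=v2; ..."
raw_cookies = 'bid=abc123; dbcl2="12345:token"; ck=xyz'

cookies = {}
for line in raw_cookies.split(';'):
    key, value = line.split('=', 1)   # split only on the first '=':
    cookies[key.strip()] = value      # values may themselves contain '='
print(cookies['bid'], cookies['ck'])  # abc123 xyz
```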
Step 3
Start from the Douban Books category index:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
Scrape the category links from this page:
```python
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/tag/?icn=index-nav"
web = requests.get(url)
soup = BeautifulSoup(web.text, "lxml")
tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")
urls = []
for tag in tags:
    tag = tag.get_text()
    urls.append("https://book.douban.com/tag/" + str(tag))
with open("channel.txt", "w") as file:
    for link in urls:
        file.write(link + '\n')
```
The code above uses a CSS selector. You don't need to know CSS: open the page in a browser, open the developer tools, right-click the content you want in the Elements panel, and choose Copy → Copy selector (I use Chrome, where right-clicking the rendered page and choosing Inspect jumps straight to the matching element). Paste the copied selector in, but strip out any :nth-child(*) parts, or select() will raise an error.
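Stripping the pseudo-classes can itself be automated. This tiny helper is my own addition, not part of the original script:

```python
import re

def clean_selector(sel):
    # drop the :nth-child(...) parts Chrome's "Copy selector" inserts,
    # so the selector matches every row instead of one specific cell
    return re.sub(r':nth-child\(\d+\)', '', sel)

copied = ("#content > div > div.article > div > div:nth-child(2) > table > "
          "tbody > tr:nth-child(3) > td:nth-child(1) > a")
print(clean_selector(copied))
```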
That gives us the catalog of category links.
Step 4
Next, work out how to extract the data.
Using the same copy-selector trick described above, we can pull out the title, author, translator, number of ratings, score, plus the book's cover URL and blurb:
```python
title = bookSoup.select('#wrapper > h1 > span')[0].contents[0]
title = deal_title(title)
author = get_author(bookSoup.select("#info > a")[0].contents[0])
translator = bookSoup.select("#info > span > a")[0].contents[0]
person = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[0].contents[0]
scor = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > strong")[0].contents[0]
coverUrl = bookSoup.select("#mainpic > a > img")[0].attrs['src']
brief = get_brief(bookSoup.select('#link-report > div > div > p'))
```
A few things to watch out for:
- File names cannot contain :?<>"|\/* so clean the title with a regular expression:
```python
import re

def deal_title(raw_title):
    # note the escaped backslash: the class covers all of \ / * ? " < > | :
    r = re.compile(r'[\\/*?"<>|:]')
    return r.sub('~', raw_title)
```
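For example, a title containing several forbidden characters comes out like this (the helper is re-declared so the snippet runs on its own):

```python
import re

def deal_title(raw_title):
    # replace every character that is illegal in a Windows file name
    r = re.compile(r'[\\/*?"<>|:]')
    return r.sub('~', raw_title)

print(deal_title('C++ Primer: 5th/Edition?'))  # C++ Primer~ 5th~Edition~
```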
Then download the cover:
```python
from urllib.request import urlretrieve

path = "C:/Users/lenovo/OneDrive/projects/Scraping/covers/" + title + ".png"
urlretrieve(coverUrl, path)
```
```python
def get_author(raw_author):
    # the author field arrives with embedded newlines and indentation
    parts = raw_author.split('\n')
    return ''.join(map(str.strip, parts))

def get_brief(line_tags):
    # each <p> of the blurb is a separate tag; join their text with newlines
    brief = line_tags[0].contents
    for tag in line_tags[1:]:
        brief += tag.contents
    return '\n'.join(brief)
```
The publisher, publication date, ISBN, and list price can be fetched more concisely:
```python
info = bookSoup.select('#info')
infos = list(info[0].strings)
publish = infos[infos.index('出版社:') + 1]
ISBN = infos[infos.index('ISBN:') + 1]
Ptime = infos[infos.index('出版年:') + 1]
price = infos[infos.index('定价:') + 1]
```
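To see why this works: .strings flattens #info into a list of text nodes in which each value directly follows its label, so index() + 1 lands on the value. The list below is a hand-made stand-in for a real page (the publisher and price are sample values, not scraped data):

```python
# simulated output of list(info[0].strings) for one book page
infos = ['出版社:', ' 人民邮电出版社', '出版年:', ' 2017-1',
         '定价:', ' 59.00元', 'ISBN:', ' 9787115000000']

publish = infos[infos.index('出版社:') + 1].strip()
price = infos[infos.index('定价:') + 1].strip()
print(publish, price)  # 人民邮电出版社 59.00元
```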
Step 5
First create the database and table:
```sql
CREATE TABLE `allbooks` (
  `title`   char(255) NOT NULL,
  `scor`    char(255) DEFAULT NULL,
  `author`  char(255) DEFAULT NULL,
  `price`   char(255) DEFAULT NULL,
  `time`    char(255) DEFAULT NULL,
  `publish` char(255) DEFAULT NULL,
  `person`  char(255) DEFAULT NULL,
  `yizhe`   char(255) DEFAULT NULL,
  `tag`     char(255) DEFAULT NULL,
  `brief`   mediumtext,
  `ISBN`    char(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
Then use the executemany method to insert the rows conveniently in one batch:
```python
import pymysql

connection = pymysql.connect(host='your-host', user='your-user',
                             password='your-password', charset='utf8')
with connection.cursor() as cursor:
    cursor.execute("USE DOUBAN_DB;")
    sql = '''INSERT INTO allbooks (title, scor, author, price, time, publish,
                                   person, yizhe, tag, brief, ISBN)
             VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
    cursor.executemany(sql, data)
connection.commit()
```
Step 6
With all the pieces worked out, the only thing left is the complete program.
One more point: set a random interval between requests so the IP doesn't get banned.
The code is below; it is also kept up to date on GitHub, where stars are welcome (my GitHub link).
"""
Created on Sat Aug 12 13:29:17 2017@author: Throne
"""
import requests
from bs4
import BeautifulSoup
import time
import pymysql
import random
from urllib.request
import urlretrieve
import re connection = pymysql.connect(host=
'localhost',user=
'root',password=
'',charset=
'utf8')
with connection.cursor()
as cursor:sql =
"USE DOUBAN_DB;"cursor.execute(sql)
connection.commit()
def deal_title(raw_title):r = re.compile(
'[/\*?"<>|:]')
return r.sub(
'~',raw_title)
def get_brief(line_tags):brief = line_tags[
0].contents
for tag
in line_tags[
1:]:brief += tag.contentsbrief =
'\n'.join(brief)
return brief
def get_author(raw_author):parts = raw_author.split(
'\n')
return ''.join(map(str.strip,parts))
def login(url):cookies = {}
with open(
"C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt")
as file:raw_cookies = file.read();
for line
in raw_cookies.split(
';'):key,value = line.split(
'=',
1)cookies[key] = valueheaders = {
'User-Agent':
'''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'''}s = requests.get(url, cookies=cookies, headers=headers)
return s
def crawl():channel = []
with open(
'C:/Users/lenovo/OneDrive/projects/Scraping/channel.txt')
as file:channel = file.readlines()
for url
in channel:data = [] web_data = login(url.strip())soup = BeautifulSoup(web_data.text.encode(
'utf-8'),
'lxml')tag = url.split(
"?")[
0].split(
"/")[-
1]books = soup.select(
'''#subject_list > ul > li > div.info > h2 > a''')
for book
in books:bookurl = book.attrs[
'href']book_data = login(bookurl)bookSoup = BeautifulSoup(book_data.text.encode(
'utf-8'),
'lxml')info = bookSoup.select(
'#info')infos = list(info[
0].strings)
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator = bookSoup.select(
"#info > span > a")[
0].contents[
0]author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價:') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except :
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator =
""author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價:') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except:
continuefinally:path =
"C:/Users/lenovo/OneDrive/projects/Scraping/covers/"+title+
".png"urlretrieve(coverUrl,path);data.append([title,scor,author,price,Ptime,publish,person,translator,tag,brief,ISBN])
with connection.cursor()
as cursor:sql =
'''INSERT INTO allbooks (
title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''cursor.executemany(sql, data)connection.commit()
del datatime.sleep(random.randint(
0,
9)) start = time.clock()
crawl()
end = time.clock()
with connection.cursor()
as cursor:print(
"Time Usage:", end -start)count = cursor.execute(
'SELECT * FROM allbooks')print(
"Total of books:", count)
if connection.open:connection.close()
Results
This article is original; please contact the author before reposting.
Reference: http://www.jianshu.com/p/6c060433facf?appinstall=0