Taking advantage of the summer break, I put the bit of Python data collection I picked up last semester to the test and wrote a crawler for Douban Books. Here is a summary.
This is what I set out to do:
1. Log in
2. Fetch the Douban Books category index
3. Enter each category, scrape the first page of books for the title, author, translator, publication date, and other details, store them in MySQL, then download each cover.
Step 1
First, even scrapers should have a code of honor, so let's check Douban's robots.txt:
```
User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /trailer/
Disallow: /doubanapp/card
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml
# Crawl-delay: 5

User-agent: Wandoujia Spider
Disallow: /
```
Now look at the pages I want to crawl:

```
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
https://book.douban.com/tag/?icn=index-nav
https://book.douban.com/tag/[tag name]
https://book.douban.com/subject/[book ID]/
```

Good: none of these violate the robots rules, so I can write the code with a clear conscience.
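This manual check can also be automated with the standard library's `urllib.robotparser`. The snippet below feeds it a small excerpt of the rules quoted above (not the full file) just to show the mechanics:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# a short excerpt of Douban's rules, enough to demonstrate the check
rp.parse("""
User-agent: *
Disallow: /subject_search
Disallow: /search
""".splitlines())

print(rp.can_fetch("*", "https://book.douban.com/tag/"))   # tag pages are allowed
print(rp.can_fetch("*", "https://www.douban.com/search"))  # /search is disallowed
```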
Step 2
Since I'm writing this anyway, I might as well do it properly, so first log in to Douban.
I used cookie-based login: Firefox's wonderful HttpFox extension captures the headers and cookies from a normal login session.
```python
import requests

def login(url):
    cookies = {}
    with open("C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt") as file:
        raw_cookies = file.read()
    # the raw string looks like "k1=v1; k2=v2; ..."; turn it into a dict
    for line in raw_cookies.split(';'):
        key, value = line.split('=', 1)
        cookies[key.strip()] = value
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/60.0.3112.78 Safari/537.36'}
    return requests.get(url, cookies=cookies, headers=headers)
```
Here the headers are pasted into the program while the cookies are read from a file. Note that the cookie string must be converted into a dict before requests' get() can use it to fetch the page.
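To make that dict-conversion step concrete, here is the same parsing logic run on a made-up cookie string (the values are placeholders, not real Douban cookies):

```python
# raw header copied from the browser looks like "k1=v1; k2=v2; ..."
raw_cookies = 'bid=abc123; dbcl2="12345:token"; ck=xyz'

cookies = {}
for line in raw_cookies.split(';'):
    key, value = line.split('=', 1)   # split only on the first '=':
    cookies[key.strip()] = value      # values may themselves contain '='
print(cookies['bid'], cookies['ck'])  # abc123 xyz
```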
Step 3
Start from the Douban Books category index:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
Scrape the category links from this page:
```python
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/tag/?icn=index-nav"
web = requests.get(url)
soup = BeautifulSoup(web.text, "lxml")
tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")
urls = []
for tag in tags:
    tag = tag.get_text()
    urls.append("https://book.douban.com/tag/" + str(tag))
with open("channel.txt", "w") as file:
    for link in urls:
        file.write(link + '\n')
```
The code above uses a CSS selector. You don't need to know CSS: open the page in a browser, open the developer tools, right-click the content you want in the Elements panel, and choose Copy → Copy selector (I use Chrome, where right-clicking the rendered page and choosing Inspect jumps straight to the matching element). Paste the copied selector in, but strip out any :nth-child(*) parts, or select() will raise an error.
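Stripping the pseudo-classes can itself be automated. This tiny helper is my own addition, not part of the original script:

```python
import re

def clean_selector(sel):
    # drop the :nth-child(...) parts Chrome's "Copy selector" inserts,
    # so the selector matches every row instead of one specific cell
    return re.sub(r':nth-child\(\d+\)', '', sel)

copied = ("#content > div > div.article > div > div:nth-child(2) > table > "
          "tbody > tr:nth-child(3) > td:nth-child(1) > a")
print(clean_selector(copied))
```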
That gives us the catalog of category links.
Step 4
Next, work out how to extract the data.
Using the same copy-selector trick described above, we can pull out the title, author, translator, number of ratings, score, plus the book's cover URL and blurb:
```python
title = bookSoup.select('#wrapper > h1 > span')[0].contents[0]
title = deal_title(title)
author = get_author(bookSoup.select("#info > a")[0].contents[0])
translator = bookSoup.select("#info > span > a")[0].contents[0]
person = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[0].contents[0]
scor = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > strong")[0].contents[0]
coverUrl = bookSoup.select("#mainpic > a > img")[0].attrs['src']
brief = get_brief(bookSoup.select('#link-report > div > div > p'))
```
A few things to watch out for:
- File names cannot contain :?<>"|\/* so clean the title with a regular expression:
```python
import re

def deal_title(raw_title):
    # note the escaped backslash: the class covers all of \ / * ? " < > | :
    r = re.compile(r'[\\/*?"<>|:]')
    return r.sub('~', raw_title)
```
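For example, a title containing several forbidden characters comes out like this (the helper is re-declared so the snippet runs on its own):

```python
import re

def deal_title(raw_title):
    # replace every character that is illegal in a Windows file name
    r = re.compile(r'[\\/*?"<>|:]')
    return r.sub('~', raw_title)

print(deal_title('C++ Primer: 5th/Edition?'))  # C++ Primer~ 5th~Edition~
```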
Then download the cover:
```python
from urllib.request import urlretrieve

path = "C:/Users/lenovo/OneDrive/projects/Scraping/covers/" + title + ".png"
urlretrieve(coverUrl, path)
```
```python
def get_author(raw_author):
    # the author field arrives with embedded newlines and indentation
    parts = raw_author.split('\n')
    return ''.join(map(str.strip, parts))

def get_brief(line_tags):
    # each <p> of the blurb is a separate tag; join their text with newlines
    brief = line_tags[0].contents
    for tag in line_tags[1:]:
        brief += tag.contents
    return '\n'.join(brief)
```
The publisher, publication date, ISBN, and list price can be fetched more concisely:
```python
info = bookSoup.select('#info')
infos = list(info[0].strings)
publish = infos[infos.index('出版社:') + 1]
ISBN = infos[infos.index('ISBN:') + 1]
Ptime = infos[infos.index('出版年:') + 1]
price = infos[infos.index('定价:') + 1]
```
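To see why this works: .strings flattens #info into a list of text nodes in which each value directly follows its label, so index() + 1 lands on the value. The list below is a hand-made stand-in for a real page (the publisher and price are sample values, not scraped data):

```python
# simulated output of list(info[0].strings) for one book page
infos = ['出版社:', ' 人民邮电出版社', '出版年:', ' 2017-1',
         '定价:', ' 59.00元', 'ISBN:', ' 9787115000000']

publish = infos[infos.index('出版社:') + 1].strip()
price = infos[infos.index('定价:') + 1].strip()
print(publish, price)  # 人民邮电出版社 59.00元
```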
Step 5
First create the database and table:
```sql
CREATE TABLE `allbooks` (
  `title`   char(255) NOT NULL,
  `scor`    char(255) DEFAULT NULL,
  `author`  char(255) DEFAULT NULL,
  `price`   char(255) DEFAULT NULL,
  `time`    char(255) DEFAULT NULL,
  `publish` char(255) DEFAULT NULL,
  `person`  char(255) DEFAULT NULL,
  `yizhe`   char(255) DEFAULT NULL,
  `tag`     char(255) DEFAULT NULL,
  `brief`   mediumtext,
  `ISBN`    char(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
Then use the executemany method to insert the rows conveniently in one batch:
```python
import pymysql

connection = pymysql.connect(host='your-host', user='your-user',
                             password='your-password', charset='utf8')
with connection.cursor() as cursor:
    cursor.execute("USE DOUBAN_DB;")
    sql = '''INSERT INTO allbooks (title, scor, author, price, time, publish,
                                   person, yizhe, tag, brief, ISBN)
             VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
    cursor.executemany(sql, data)
connection.commit()
```
Step 6
With all the pieces worked out, the only thing left is the complete program.
One more point: set a random interval between requests so the IP doesn't get banned.
The code is below; it is also kept up to date on GitHub, where stars are welcome (my GitHub link).
"""
Created on Sat Aug 12 13:29:17 2017@author: Throne
"""
import requests
from bs4
import BeautifulSoup
import time
import pymysql
import random
from urllib.request
import urlretrieve
import re connection = pymysql.connect(host=
'localhost',user=
'root',password=
'',charset=
'utf8')
with connection.cursor()
as cursor:sql =
"USE DOUBAN_DB;"cursor.execute(sql)
connection.commit()
def deal_title(raw_title):r = re.compile(
'[/\*?"<>|:]')
return r.sub(
'~',raw_title)
def get_brief(line_tags):brief = line_tags[
0].contents
for tag
in line_tags[
1:]:brief += tag.contentsbrief =
'\n'.join(brief)
return brief
def get_author(raw_author):parts = raw_author.split(
'\n')
return ''.join(map(str.strip,parts))
def login(url):cookies = {}
with open(
"C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt")
as file:raw_cookies = file.read();
for line
in raw_cookies.split(
';'):key,value = line.split(
'=',
1)cookies[key] = valueheaders = {
'User-Agent':
'''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'''}s = requests.get(url, cookies=cookies, headers=headers)
return s
def crawl():channel = []
with open(
'C:/Users/lenovo/OneDrive/projects/Scraping/channel.txt')
as file:channel = file.readlines()
for url
in channel:data = [] web_data = login(url.strip())soup = BeautifulSoup(web_data.text.encode(
'utf-8'),
'lxml')tag = url.split(
"?")[
0].split(
"/")[-
1]books = soup.select(
'''#subject_list > ul > li > div.info > h2 > a''')
for book
in books:bookurl = book.attrs[
'href']book_data = login(bookurl)bookSoup = BeautifulSoup(book_data.text.encode(
'utf-8'),
'lxml')info = bookSoup.select(
'#info')infos = list(info[
0].strings)
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator = bookSoup.select(
"#info > span > a")[
0].contents[
0]author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價:') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except :
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator =
""author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價:') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except:
continuefinally:path =
"C:/Users/lenovo/OneDrive/projects/Scraping/covers/"+title+
".png"urlretrieve(coverUrl,path);data.append([title,scor,author,price,Ptime,publish,person,translator,tag,brief,ISBN])
with connection.cursor()
as cursor:sql =
'''INSERT INTO allbooks (
title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''cursor.executemany(sql, data)connection.commit()
del datatime.sleep(random.randint(
0,
9)) start = time.clock()
crawl()
end = time.clock()
with connection.cursor()
as cursor:print(
"Time Usage:", end -start)count = cursor.execute(
'SELECT * FROM allbooks')print(
"Total of books:", count)
if connection.open:connection.close()
Results
This article is original; please contact the author before reposting.
Reference: http://www.jianshu.com/p/6c060433facf?appinstall=0