Scraping Douban Top250 Data with a Python Crawler
Using requests and regular expressions, we scrape the Douban movie Top250 list at https://movie.douban.com/top250, extracting information such as each film's name, year, poster image, and one-line quote, and save the results to both a database and a text file.
Import the required packages:
```python
import json
import re
import time

import pymysql
import requests
from requests.exceptions import RequestException
```
Fetching a page: pass in the url, request it with a browser User-Agent header, and return the page source (or None on failure).
```python
def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
```
Parse the fetched page with a regular expression, matching each film's information (rank, poster URL, title, year, and one-line quote) and yielding it as a dictionary to form structured data.
```python
def parse_one_page(html):
    # Raw strings avoid the invalid-escape warning for \d in newer Pythons.
    pattern = re.compile(
        r'<li>'
        r'.*?<em class="">(.*?)</em>'
        r'.*?src="(.*?)" '
        r'.*?title">(.*?)</span>'
        r'.*?<br>.*?(\d+)'
        r'.*?inq">(.*?)</span>'
        r'.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'id': item[0],
            'image': item[1],
            'name': item[2],
            'year': item[3],
            'inq': item[4],
        }
Write the data to a text file, serializing each dictionary with the JSON library's dumps() method:
```python
def write_to_file(content):
    with open('D:/result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
```
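The effect of ensure_ascii=False can be seen without touching the file system: with the default (True), non-ASCII characters are written as \uXXXX escapes, while False keeps the Chinese text readable in the output file. A quick check with a sample dict:

```python
import json

movie = {'id': '1', 'name': '肖申克的救赎', 'year': '1994'}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
print(json.dumps(movie))

# ensure_ascii=False: the Chinese title is written as-is.
print(json.dumps(movie, ensure_ascii=False))
```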
Write the data to the database:
```python
def write_to_mysql(item):
    db = pymysql.connect(host='localhost', user='root', passwd='xjz01405',
                         db='test', port=3306)
    cursor = db.cursor()
    cursor.execute('SELECT VERSION()')
    data = cursor.fetchone()
    print('Database version:', data)
    table = 'movies'
    # Build the column list and placeholders from the dict itself.
    keys = ', '.join(item.keys())
    values = ', '.join(['%s'] * len(item))
    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(
        table=table, keys=keys, values=values)
    try:
        if cursor.execute(sql, tuple(item.values())):
            print('successful')
            db.commit()
    except Exception:
        print('failed')
        db.rollback()
    db.close()
```
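Because the INSERT statement is assembled from the dict's own keys, the same function works for any table whose columns match the dict. The string building can be sketched on its own, with no database connection (the item values here are made-up placeholders):

```python
# Hypothetical item in the same shape parse_one_page() yields.
item = {'id': '1', 'image': 'https://example.com/p.jpg',
        'name': 'Example', 'year': '1994', 'inq': 'A quote'}

table = 'movies'
keys = ', '.join(item.keys())              # column list from dict keys
values = ', '.join(['%s'] * len(item))     # one %s placeholder per value
sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(
    table=table, keys=keys, values=values)
print(sql)
# INSERT INTO movies(id, image, name, year, inq) VALUES (%s, %s, %s, %s, %s)
```

The actual values are passed separately as `tuple(item.values())`, so PyMySQL handles the escaping rather than the string formatting.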
Main function: call the methods implemented above and write the movie data to the database and to the file.
```python
def main(start):  # the parameter must be named start, since it is used below
    url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_mysql(item)
        update_to_mysql(item)
        write_to_file(item)
```
Paging: pass the start parameter into the main function's url to crawl all 250 entries.
```python
if __name__ == '__main__':
    for i in range(10):  # 10 pages x 25 films = all 250 entries
        main(start=i * 25)
        time.sleep(2)    # pause between requests to be polite to the server
```
Updating data: insert a row if it is new, and update it if it already exists, deciding by the primary key. MySQL's ON DUPLICATE KEY UPDATE clause does exactly this: if the primary key is absent the row is inserted, otherwise it is updated.
```python
def update_to_mysql(item):
    db = pymysql.connect(host='localhost', user='root', passwd='xjz01405',
                         db='test', port=3306)
    cursor = db.cursor()
    table = 'movies'
    keys = ', '.join(item.keys())
    values = ', '.join(['%s'] * len(item))
    sql = ('INSERT INTO {table}({keys}) VALUES ({values}) '
           'ON DUPLICATE KEY UPDATE'.format(table=table, keys=keys,
                                            values=values))
    update = ','.join([' {key} = %s'.format(key=key) for key in item])
    sql += update
    try:
        # Each value appears twice: once for INSERT, once for UPDATE.
        if cursor.execute(sql, tuple(item.values()) * 2):
            print('successful')
            db.commit()
    except Exception:
        print('failed')
        db.rollback()
    db.close()
```
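The generated upsert statement can also be inspected without a live database. Note how each key appears twice in the parameter tuple, matching the placeholders in the INSERT part and the UPDATE clause (the item here is a made-up three-column example):

```python
# Hypothetical three-column item for illustration.
item = {'id': '1', 'name': 'Example', 'year': '1994'}

table = 'movies'
keys = ', '.join(item.keys())
values = ', '.join(['%s'] * len(item))
sql = ('INSERT INTO {table}({keys}) VALUES ({values}) '
       'ON DUPLICATE KEY UPDATE'.format(table=table, keys=keys,
                                        values=values))
update = ','.join([' {key} = %s'.format(key=key) for key in item])
sql += update
params = tuple(item.values()) * 2   # values duplicated: INSERT half + UPDATE half

print(sql)
print(params)
```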