Python Crawler: Scraping CSDN Blog Posts
Last night, in order to download and save all the posts of a prolific CSDN blogger, I wrote a crawler that automatically scrapes the articles and saves them to a txt file (they could just as easily be saved as HTML pages). No more Ctrl+C and Ctrl+V, which is very convenient, and scraping other sites works in much the same way.
為了解析抓取的網(wǎng)頁(yè),用到了第三方模塊,BeautifulSoup,這個(gè)模塊對(duì)于解析html文件非常有用,當(dāng)然也可以自己使用正則表達(dá)式去解析,但是比較麻煩。
由于csdn網(wǎng)站的robots.txt文件中顯示禁止任何爬蟲(chóng),所以必須把爬蟲(chóng)偽裝成瀏覽器,而且不能頻繁抓取,得sleep一會(huì)再抓,使用頻繁會(huì)被封ip的,但可以使用代理ip。
# -*- encoding: utf-8 -*-
'''
Created on 2014-09-18 21:10:39
@author: Mangoer
@email: 2395528746@qq.com
'''
import urllib2
import re
from bs4 import BeautifulSoup
import random
import time


class CSDN_Blog_Spider:
    def __init__(self, url):
        print '\n'
        print 'Crawler started...'
        print 'Page URL: ' + url

        user_agents = [
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
            'Opera/9.25 (Windows NT 5.1; U; en)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
            'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
        ]

        # Optionally route requests through a proxy IP:
        # ips_list = ['60.220.204.2:63000', '123.150.92.91:80', '121.248.150.107:8080',
        #             '61.185.21.175:8080', '222.216.109.114:3128', '118.144.54.190:8118',
        #             '1.50.235.82:80', '203.80.144.4:80']
        # ip = random.choice(ips_list)
        # print 'Using proxy IP: ' + ip
        # proxy_support = urllib2.ProxyHandler({'http': 'http://' + ip})
        # opener = urllib2.build_opener(proxy_support)
        # urllib2.install_opener(opener)

        # Masquerade as a browser with a randomly chosen User-Agent
        agent = random.choice(user_agents)

        req = urllib2.Request(url)
        req.add_header('User-Agent', agent)
        req.add_header('Host', 'blog.csdn.net')
        req.add_header('Accept', '*/*')
        req.add_header('Referer', 'http://blog.csdn.net/mangoer_ys?viewmode=list')

        html = urllib2.urlopen(req)
        # CSDN pages of that era were GBK-encoded; re-encode to UTF-8
        page = html.read().decode('gbk', 'ignore').encode('utf-8')

        self.page = page
        self.title = self.getTitle()
        self.content = self.getContent()
        self.saveFile()

    def printInfo(self):
        print 'Article title: ' + self.title + '\n'
        print 'Content has been saved to out.txt!'

    def getTitle(self):
        rex = re.compile('<title>(.*?)</title>', re.DOTALL)
        match = rex.search(self.page)
        if match:
            return match.group(1)
        return 'NO TITLE'

    def getContent(self):
        bs = BeautifulSoup(self.page)
        html_content_list = bs.findAll('div', {'id': 'article_content', 'class': 'article_content'})
        html_content = str(html_content_list[0])

        # Pull the text out of every <p> tag in the article body
        rex_p = re.compile(r'<p(?:.*?)>(.*?)</p>', re.DOTALL)
        p_list = rex_p.findall(html_content)

        content = ''
        for p in p_list:
            if p.isspace() or p == '':
                continue
            content = content + p
        return content

    def saveFile(self):
        outfile = open('out.txt', 'a')
        outfile.write(self.content)
        outfile.close()

    def getNextArticle(self):
        bs2 = BeautifulSoup(self.page)
        html_nextArticle_list = bs2.findAll('li', {'class': 'prev_article'})
        html_nextArticle = str(html_nextArticle_list[0])

        # The "previous article" link holds the next URL to crawl
        rex_link = re.compile(r'<a href="(.*?)"')
        link = rex_link.search(html_nextArticle)

        if link:
            next_url = 'http://blog.csdn.net' + link.group(1)
            return next_url
        return None


class Scheduler:
    def __init__(self, url):
        self.start_url = url

    def start(self):
        spider = CSDN_Blog_Spider(self.start_url)
        spider.printInfo()

        while True:
            next_url = spider.getNextArticle()
            if next_url:
                spider = CSDN_Blog_Spider(next_url)
                spider.printInfo()
            else:
                print 'All articles have been downloaded!'
                break
            # Throttle requests so we do not get our IP banned
            time.sleep(10)


# url = raw_input('Enter a CSDN blog post URL: ')
url = "http://blog.csdn.net/mangoer_ys/article/details/38427979"

Scheduler(url).start()
程序中有個(gè)問(wèn)題一直不能解決:不能使用標(biāo)題去命名文件,所以所有的文章全部放在一個(gè)out.txt中,說(shuō)的編碼的問(wèn)題,希望大神可以解決這個(gè)問(wèn)題。