Python Crawler: Scraping CSDN Blog Posts
Last night, in order to download and save all the posts of a prolific CSDN blogger, I wrote a crawler that automatically scrapes the articles and saves them to a txt file (they could just as easily be saved as HTML pages). No more Ctrl+C and Ctrl+V, which is very convenient, and scraping other sites works in much the same way.
為了解析抓取的網(wǎng)頁(yè),用到了第三方模塊,BeautifulSoup,這個(gè)模塊對(duì)于解析html文件非常有用,當(dāng)然也可以自己使用正則表達(dá)式去解析,但是比較麻煩。
由于csdn網(wǎng)站的robots.txt文件中顯示禁止任何爬蟲(chóng),所以必須把爬蟲(chóng)偽裝成瀏覽器,而且不能頻繁抓取,得sleep一會(huì)再抓,使用頻繁會(huì)被封ip的,但可以使用代理ip。
# -*- encoding: utf-8 -*-
'''
Created on 2014-09-18 21:10:39
@author: Mangoer
@email: 2395528746@qq.com
'''
import urllib2
import re
from bs4 import BeautifulSoup
import random
import time


class CSDN_Blog_Spider:
    def __init__(self, url):
        print '\n'
        print 'Crawler started...'
        print 'Page URL: ' + url

        user_agents = [
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
            'Opera/9.25 (Windows NT 5.1; U; en)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
            'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
        ]

        # Optionally route requests through a proxy IP:
        # ips_list = ['60.220.204.2:63000', '123.150.92.91:80', '121.248.150.107:8080',
        #             '61.185.21.175:8080', '222.216.109.114:3128', '118.144.54.190:8118',
        #             '1.50.235.82:80', '203.80.144.4:80']
        # ip = random.choice(ips_list)
        # print 'Using proxy IP: ' + ip
        # proxy_support = urllib2.ProxyHandler({'http': 'http://' + ip})
        # opener = urllib2.build_opener(proxy_support)
        # urllib2.install_opener(opener)

        # Masquerade as a browser with a randomly chosen User-Agent
        agent = random.choice(user_agents)

        req = urllib2.Request(url)
        req.add_header('User-Agent', agent)
        req.add_header('Host', 'blog.csdn.net')
        req.add_header('Accept', '*/*')
        req.add_header('Referer', 'http://blog.csdn.net/mangoer_ys?viewmode=list')

        html = urllib2.urlopen(req)
        # CSDN pages of that era were GBK-encoded; re-encode to UTF-8
        page = html.read().decode('gbk', 'ignore').encode('utf-8')

        self.page = page
        self.title = self.getTitle()
        self.content = self.getContent()
        self.saveFile()

    def printInfo(self):
        print 'Article title: ' + self.title + '\n'
        print 'Content has been saved to out.txt!'

    def getTitle(self):
        rex = re.compile('<title>(.*?)</title>', re.DOTALL)
        match = rex.search(self.page)
        if match:
            return match.group(1)
        return 'NO TITLE'

    def getContent(self):
        bs = BeautifulSoup(self.page)
        html_content_list = bs.findAll('div', {'id': 'article_content', 'class': 'article_content'})
        html_content = str(html_content_list[0])

        # Pull the text out of every <p> tag in the article body
        rex_p = re.compile(r'<p(?:.*?)>(.*?)</p>', re.DOTALL)
        p_list = rex_p.findall(html_content)

        content = ''
        for p in p_list:
            if p.isspace() or p == '':
                continue
            content = content + p
        return content

    def saveFile(self):
        outfile = open('out.txt', 'a')
        outfile.write(self.content)
        outfile.close()

    def getNextArticle(self):
        bs2 = BeautifulSoup(self.page)
        html_nextArticle_list = bs2.findAll('li', {'class': 'prev_article'})
        html_nextArticle = str(html_nextArticle_list[0])

        # The "previous article" link holds the next URL to crawl
        rex_link = re.compile(r'<a href="(.*?)"')
        link = rex_link.search(html_nextArticle)

        if link:
            next_url = 'http://blog.csdn.net' + link.group(1)
            return next_url
        return None


class Scheduler:
    def __init__(self, url):
        self.start_url = url

    def start(self):
        spider = CSDN_Blog_Spider(self.start_url)
        spider.printInfo()

        while True:
            next_url = spider.getNextArticle()
            if next_url:
                spider = CSDN_Blog_Spider(next_url)
                spider.printInfo()
            else:
                print 'All articles have been downloaded!'
                break
            # Throttle requests so we do not get our IP banned
            time.sleep(10)


# url = raw_input('Enter a CSDN blog post URL: ')
url = "http://blog.csdn.net/mangoer_ys/article/details/38427979"

Scheduler(url).start()
程序中有個(gè)問(wèn)題一直不能解決:不能使用標(biāo)題去命名文件,所以所有的文章全部放在一個(gè)out.txt中,說(shuō)的編碼的問(wèn)題,希望大神可以解決這個(gè)問(wèn)題。