Python crawler: scraping the article list of a blog series
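Writing a crawler in Python really is simple. The script below uses urllib.request and pyquery to walk the 23 article-list pages of a Sina blog, pick out the posts whose title contains 教你炒股票, and print them as Markdown links sorted by installment number.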
import urllib.request
from pyquery import PyQuery as PQ


# Fetch the content at the given URL and decode it as UTF-8
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    html = html.decode('UTF-8')
    return html


# Parse the returned HTML and collect the matching articles into results
def getArtical(html, results):
    doc = PQ(html)
    # data = doc('.searchAtcList .searchAtc_top a')
    data = doc('.atc_title a')
    for x in data.items():
        title = x.text()
        href = x.attr('href')
        if title.find('教你炒股票') >= 0:
            # A truncated title has to be fetched in full from the article URL
            if title.find('…') >= 0:
                title = getArticalDetail(x.attr('href'))
            r = '[' + title + '](' + href + ')'
            # The installment number sits between the 5-character prefix and the colon
            index = title[5 : title.index(':')]
            results.append((int(index), r))


# Fetch the full article title from the article page
def getArticalDetail(url):
    html = getHtml(url)
    doc = PQ(html)
    data = doc('.articalTitle h2')
    title = data.text()
    return title


blog3 = 'http://blog.sina.com.cn/s/articlelist_1215172700_0_'
# http://blog.sina.com.cn/s/articlelist_1215172700_0_1.html
# http://blog.sina.com.cn/s/articlelist_1215172700_0_15.html
# blog = 'http://control.blog.sina.com.cn/search/search.php?uid=1215172700&keyword=%E8%82%A1%E7%A5%A8&page='
# blog2 = 'http://control.blog.sina.com.cn/search/search.php?uid=1215172700&keyword=%E8%82%A1%E7%A5%A8&page='

results = []

# There are 23 pages in total
for i in range(1, 24):
    url = blog3 + str(i) + '.html'
    print(url)
    html = getHtml(url)
    getArtical(html, results)

# Sort, then print
results.sort()
for x in results:
    print(x[1])
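The numbering and sorting step can be tried without touching the network. The following is only a sketch, not part of the original script: the helper names parse_series_index and build_sorted_links, the sample titles, and the example.com URLs are all made up for illustration. It also assumes that a title may use either an ASCII colon or a full-width colon after the number, and it skips titles without a number instead of raising an error.

```python
import re

SERIES_PREFIX = '教你炒股票'

# Assumption: the installment number directly follows the prefix and is
# terminated by either an ASCII colon or a full-width colon.
_INDEX_RE = re.compile(re.escape(SERIES_PREFIX) + r'\s*(\d+)\s*[::]')


def parse_series_index(title):
    """Return the installment number found in title, or None if absent."""
    match = _INDEX_RE.search(title)
    return int(match.group(1)) if match else None


def build_sorted_links(entries):
    """Turn (title, href) pairs into Markdown links ordered by installment."""
    results = []
    for title, href in entries:
        index = parse_series_index(title)
        if index is None:
            # Not part of the series, or the number could not be parsed
            continue
        results.append((index, '[' + title + '](' + href + ')'))
    results.sort()
    return [link for _, link in results]


# Hypothetical sample data, not real article titles or URLs
sample = [
    (SERIES_PREFIX + '10:ten', 'http://example.com/10'),
    ('unrelated post', 'http://example.com/x'),
    (SERIES_PREFIX + '2:two', 'http://example.com/2'),
]

for link in build_sorted_links(sample):
    print(link)
```

Sorting on the integer index rather than on the title text is what keeps installment 2 ahead of installment 10, which is the same reason the original script stores (int(index), r) tuples.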