Downloading Tianya articles

Published: 2024/6/14

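I wrote a Python script for downloading Tianya articles. It is a bit messy and very slow: single-threaded, and pieced together from regular expressions.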

It makes a simple attempt to detect posts in which the author is just chatting with other users, but it gets this wrong, and often!
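The chat-detection rule can be read in isolation as three checks: the post is by the thread author, it contains no long run of dashes (which the script treats as the mark of a quoted reply), and it is longer than 512. Below is a minimal Python 3 sketch of that rule. The function name `looks_like_story_post` and the 20-dash threshold are assumptions made for illustration, and `len()` here counts characters, whereas the Python 2 script counts bytes.

```python
import re

# Assumption: a run of 20 or more dashes marks a quoted reply.
# The script itself matches one fixed, long run of dashes.
QUOTE_SEPARATOR = re.compile(r"-{20,}")

# Length threshold taken from the script.
MIN_LENGTH = 512


def looks_like_story_post(poster, author, body):
    """Return True when a post is probably part of the article, not chat."""
    if poster != author:
        return False  # only the thread author's posts are kept
    if QUOTE_SEPARATOR.search(body):
        return False  # a quoted reply suggests a conversation
    return len(body) > MIN_LENGTH  # short posts are treated as chat
```

Because the rule relies only on length and one separator pattern, long replies without quotes and short story fragments are both misclassified, which matches the failures described above.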

Sometimes it hangs. My fix is to just run it again... sigh.
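One likely cause of the hang is that the script calls `urlopen` without a timeout, so a stalled connection blocks forever. A possible remedy, sketched here in Python 3 rather than the script's Python 2, is to pass a timeout and retry a few times. The name `fetch_with_retry` and its parameters are hypothetical, and the `opener` argument exists only so the function can be exercised without a network.

```python
import time
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen,
                     timeout=15, attempts=3, delay=2.0):
    """Fetch url, retrying when the connection fails or times out."""
    last_error = None
    for attempt in range(attempts):
        try:
            # The timeout bounds each blocking socket operation.
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:
            # URLError and socket timeouts are both OSError subclasses.
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # brief pause before the next attempt
    raise last_error
```

With this in place a stalled page fails after `timeout` seconds instead of freezing the run, and a transient failure is retried rather than requiring a manual restart.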

# -*- coding: utf-8 -*-
# Python 2 script: downloads the thread author's posts from a Tianya thread.
import urllib2
import re,os

def cn(s):
    # Convert UTF-8 bytes to GBK for the Windows console and file names.
    return s.decode("utf-8").encode("gbk")

def getUrlContent(url):
    return urllib2.urlopen(url).read()

def getFirst(cont):
    # Body of the opening post.
    p1 = re.findall('''<div class="bbs-content clearfix">(.+?)</div>''',cont,re.DOTALL)
    if len(p1)>0:
        return p1[0]
    else:
        return ""

def getNextPageUrl(cont):
    # Tianya pages are in Simplified Chinese, so the link text must be too.
    p1 = re.findall('''<a href="(.+?)" class="js-keyboard-next">下页</a>''',cont)
    if len(p1)>0:
        return "http://bbs.tianya.cn"+p1[0]
    else:
        return None

def getAuthor(cont):
    # Returns None when the page has no author meta tag.
    p1 = re.findall('''<meta name="author" content="(.+)">''',cont)
    if len(p1)>0:
        return p1[0]

def getTitle(cont):
    # Returns None when the title span is missing.
    p1 = re.findall('''<span class="s_title"><span style="font-weight:400;">(.+?)</span>''',cont)
    if len(p1)>0:
        return p1[0]

def getOnePage(cont,author,fp):
    # Each match is a tuple: (user name, time, post body).
    p='''<div class="atl-item".+?uname="(.+?)">.+?<span>(时间.+?)</span>.+?<div class="bbs-content">(.+?)</div>'''
    p1 = re.findall(p,cont,re.S)
    for t in p1:
        if t[0]==author:
            # Skip posts that quote another user (long dash separator) or are short.
            if re.findall("[^-]+?-----------------------------[^-]*?",t[2])==[] and len(t[2])>512:
                fp.writelines("<hr/>%s<br/>%s"%(t[1],t[2]))

def main(url):
    n=0
    print url
    cont=getUrlContent(url)
    if not cont:
        return
    print 'open OK'
    author=getAuthor(cont)
    if author is None:
        print "url error"
        return
    title = getTitle(cont)
    if title is None:  # the original re-checked author here by mistake
        print "url error"
        return
    time=re.findall("<span>(时间：.+?)</span>",cont)[0]
    print 'title:',cn(title)
    print 'author:',cn(author)
    print 'time:',cn(time)
    # Pick a file name that does not exist yet: title.htm, title[1].htm, ...
    while 1:
        if n>0:
            fn="%s[%d].htm"%(cn(title),n)
        else:
            fn="%s.htm"%cn(title)
        if os.path.isfile(fn):
            print "File %s already exists!"%fn
            n=n+1
        else:
            break
    fp=open(fn,'w')
    fp.writelines('''<html><head><meta charset="utf-8"/><title>%s</title></head><body>'''%(title))
    fp.writelines("【%s】<br/>【%s】\n<hr/>%s<br/>"%(title,author,time))
    fp.writelines(getFirst(cont))
    n=1
    # Follow the "next page" link until there is none.
    while 1:
        print "page:%d"%n
        getOnePage(cont,author,fp)
        url=getNextPageUrl(cont)
        if url is not None:
            cont=getUrlContent(url)
            n=n+1
        else:
            break
    fp.writelines('''</body></html>''')
    fp.close()
    print "download ok"

if __name__ == '__main__':
    url=raw_input('input url:')
    main(url)
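The page-walking loop in `main` can also be written as a generator, which keeps pagination separate from writing the file. The following is a Python 3 sketch under stated assumptions: `iter_pages`, its `fetch` parameter and the `max_pages` guard are hypothetical names not found in the script, and the link pattern uses `[^"]+` instead of the script's `.+?` so that a match cannot run across an earlier `<a href="` on the same line.

```python
import re

BASE_URL = "http://bbs.tianya.cn"

# Same "next page" anchor the script looks for, matched by its class attribute.
NEXT_PAGE = re.compile(r'<a href="([^"]+)" class="js-keyboard-next">')


def iter_pages(first_url, fetch, max_pages=1000):
    """Yield (url, html) for each page, following the next-page link."""
    url = first_url
    for _ in range(max_pages):  # guard against an endless link cycle
        html = fetch(url)
        yield url, html
        match = NEXT_PAGE.search(html)
        if match is None:
            return  # last page: no next-page link
        url = BASE_URL + match.group(1)
```

Passing the download function in as `fetch` means the loop can be tried on saved pages or a dictionary of strings before it is pointed at the live site.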

Reposted from: https://www.cnblogs.com/fwindpeak/p/3369383.html
