
Scraping a news site with Python and findall: titles, dates, and click counts

Published: 2024/9/27 python 31 豆豆


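I recently started working with Python crawlers. As a record of my progress, this post scrapes the titles, dates, and click counts of the articles on my university's news site.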

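So far, a Python crawler seems to come down to two steps:

Step 1. Fetch the page behind a URL (using the Python library urllib2: import urllib2)

Step 2. Use regular expressions to match and search the strings in the returned HTML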
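The two steps can be sketched as follows. This is only an illustration, written for Python 3 (where urllib2 was split into urllib.request and urllib.error), and it is not the script from this post. The names fetch_html and find_links, the sample HTML, and the pattern are all made up for the example.

```python
import re
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def fetch_html(url, encoding="utf-8"):
    """Step 1: download a page and decode it to text.

    The encoding is an assumption; check the charset the page declares.
    """
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


def find_links(html):
    """Step 2: pull every href value out of the HTML with re.findall."""
    # Hypothetical pattern: non-greedy capture up to the closing quote.
    return re.findall(r'<a href="(.*?)"', html)


# Hypothetical HTML standing in for a downloaded page, so no network is needed.
sample_html = '<a href="page-1.html">First</a> <a href="page-2.html">Second</a>'
links = find_links(sample_html)
print(links)  # ['page-1.html', 'page-2.html']
```

With a single capturing group in the pattern, re.findall returns a list of the captured strings rather than the full matches, which is why the result contains only the link targets.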

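Personally, I find the Sublime Text 2 editor really pleasant to use. Once Python is set up, it does not have the many small headaches that WingIDE and Notepad++ do. Recommended.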

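# -*- coding: UTF-8 -*-
# Python 2 script (urllib2 and the print statement do not exist in Python 3).

import urllib2
import re

#***********function definitions************#

def extract_url(info):
    # Return the absolute URL of every article linked from the list page.
    # NOTE: the original pattern was lost when this page was extracted (the
    # HTML tags inside the string were stripped). It matched each article link
    # in non-greedy mode; restore it from the list page's markup before running.
    rege = "..."
    re_url = re.findall(rege, info)
    return ["http://news.swjtu.edu.cn/" + u for u in re_url]

def extract_title(sub_web):
    # Return the article title(s) found in one article page.
    # NOTE: only the middle of the original pattern survived extraction; the
    # HTML tags that surrounded it were stripped.
    re_key = "\r\n (.*)\r\n"
    title = re.findall(re_key, sub_web)
    return title

def extract_date(sub_web):
    # Return the text following "日期：" (date).
    # NOTE: as extracted, the pattern ends in an optional space, so the lazy
    # group always matches an empty string. The original terminator may have
    # been altered by extraction; check it against the page markup.
    re_key = "日期：(.*?) ?"
    date = re.findall(re_key, sub_web)
    return date

def extract_counts(sub_web):
    # Return the text following "点击数：" (click count). Same caveat as above.
    re_key = "点击数：(.*?) ?"
    counts = re.findall(re_key, sub_web)
    return counts

#*************main**************#

fp = open('output.txt', 'w')
content = urllib2.urlopen('http://news.swjtu.edu.cn/ShowList-82-0-1.shtml').read()
url = extract_url(content)
string = ""
n = len(url)
print n
for i in range(0, n):
    # Download each article and append "title date clicks" as one line.
    sub_web = urllib2.urlopen(url[i]).read()
    sub_title = extract_title(sub_web)
    string += sub_title[0]
    string += ' '
    sub_date = extract_date(sub_web)
    string += "日期：" + sub_date[0]
    string += ' '
    sub_counts = extract_counts(sub_web)
    string += "点击数：" + sub_counts[0]
    string += '\n'
    # print string
print string
fp.write(string)  # the original opened output.txt but never wrote to it
fp.close()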
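One thing worth checking in extract_date and extract_counts: a lazy group such as (.*?) stops as early as it can, so whatever follows it in the pattern decides what gets captured. The sketch below (Python 3) shows the effect on a made-up sample line, not on the real page markup. The names sample, lazy_optional, lazy_required, and non_space are hypothetical, as are the date and count in the sample.

```python
import re

# Hypothetical text standing in for a fragment of an article page.
sample = "日期：2015-06-12 点击数：128 "

# Lazy group followed by an optional space: the group may match nothing,
# and the optional space may also match nothing, so the capture is empty.
lazy_optional = re.findall("日期：(.*?) ?", sample)

# Lazy group followed by a required space: the group has to grow until it
# reaches the first space.
lazy_required = re.findall("日期：(.*?) ", sample)

# Alternative that needs no terminator: capture a run of non-space characters.
non_space = re.findall(r"点击数：(\S+)", sample)

print(lazy_optional)  # ['']
print(lazy_required)  # ['2015-06-12']
print(non_space)      # ['128']
```

Which terminator is right depends on the actual HTML around the date and click count, so the pattern should be checked against the page source before relying on either variant.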

Original post: http://blog.csdn.net/u012717411/article/details/46486679
