Web programming module 1 (HTML): Core Python Programming notes, C20 Web Programming

Published: 2023/12/10 python 69 豆豆

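C20 Web Programming

20.1 Introduction

C/S (client/server) architecture: the server runs indefinitely, waiting for client requests.

HTTP is a stateless protocol: it does not keep track of information from one client request to the next, so each request is handled as an independent service request.

URLs and cookies are used to carry information from one request to another.

URL: Uniform Resource Locator

URI: Uniform Resource Identifier

URLs are a subset of URIs.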

prot_sch://net_loc/path;params?query#frag

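prot_sch    network protocol or download scheme

net_loc     server location and user information, in the form user:passwd@host:port

path        slash-delimited (/) path to a file or CGI application

params      optional parameters

query       key-value pairs joined by ampersands (&)

frag        fragment identifying a specific anchor within the document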
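The six components above can be seen by parsing a URL. This is a minimal sketch, assuming Python 3 (where the parser lives in urllib.parse rather than the Python 2 urlparse module used in these notes); the URL is a made-up example chosen so that every component is non-empty.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to exercise all six components.
url = "http://user:passwd@www.example.com:8080/cgi-bin/app.py;type=a?name=joe&lang=py#top"
parts = urlparse(url)

print(parts.scheme)    # prot_sch
print(parts.netloc)    # net_loc (user:passwd@host:port)
print(parts.path)      # path
print(parts.params)    # params
print(parts.query)     # query
print(parts.fragment)  # frag

# net_loc is broken down further by convenience attributes.
print(parts.username, parts.password, parts.hostname, parts.port)
```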

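20.2.2 urlparse module

.urlparse(urlstr, defProtSch=None, allowFrag=None)    parses urlstr into its six components; defProtSch is the scheme to assume when urlstr does not give one, and allowFrag determines whether fragment identifiers are allowed

>>> urlparse.urlparse('http://www.python.org/doc/FAQ.html')

('http', 'www.python.org', '/doc/FAQ.html', '', '', '')

.urlunparse(urlup)    joins the tuple urlup back into a single URL string

.urljoin(baseurl, newurl, allowFrag=None)    replaces part of baseurl's components with newurl (newurl does not need to be a complete URL) and returns the result as a string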
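A short sketch of the three functions working together, assuming Python 3 (urllib.parse in place of the Python 2 urlparse module). The relative paths passed to urljoin are illustrative only.

```python
from urllib.parse import urlparse, urlunparse, urljoin

url = "http://www.python.org/doc/FAQ.html"

# Split into the six components, then put them back together.
parts = urlparse(url)
rebuilt = urlunparse(parts)

# A relative newurl replaces only the last path segment of the base.
joined = urljoin(url, "current/lib/lib.html")

# A newurl starting with "/" replaces the whole path of the base.
absolute = urljoin(url, "/help/index.html")

print(tuple(parts))
print(rebuilt)
print(joined)
print(absolute)
```

urljoin keeps the scheme and host of the base URL in both cases; only the path is affected.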

20.2.3 urllib module

Provides a high-level Web library, so that lower-level modules such as httplib, ftplib, and gopherlib do not have to be used directly.

1. urllib.urlopen()

f = urllib.urlopen(urlstr, postQueryData=None)    # opens the URL as a file-like object

.read([bytes])    reads all, or bytes, bytes from f

.readline()       reads a single line from f

.readlines()      reads all lines from f and returns them as a list

.close()          closes the URL connection

.fileno()         returns the file handle (descriptor) of f

.info()           returns the MIME headers of f (they tell the client which kind of application should open the content)

.geturl()         returns the real URL that was opened for f
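The file-like methods can be tried without network access by opening a file: URL. This is a sketch under assumptions: it uses Python 3's urllib.request.urlopen (the successor of the urllib.urlopen described here), and fetch() is a hypothetical helper name, not part of the library.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

def fetch(url):
    """Open url like a file; return (first line, remaining lines, real URL)."""
    f = urlopen(url)
    try:
        first_line = f.readline()   # one line
        remaining = f.readlines()   # all remaining lines, as a list
        real_url = f.geturl()       # the URL that was actually opened
    finally:
        f.close()
    return first_line, remaining, real_url

# Create a small local file so the example needs no web server.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first line\nsecond line\nthird line\n")
sample_url = Path(tmp.name).as_uri()   # file:///... URL for the temp file

first_line, remaining, real_url = fetch(sample_url)
os.unlink(tmp.name)

print(first_line)
print(remaining)
print(real_url)
```

Note that in Python 3 the data comes back as bytes, not str.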

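2. urllib.urlretrieve(urlstr, localfile=None, downloadStatusHook=None)

Downloads the document at urlstr to localfile, or to a temporary file if localfile is not given. If downloadStatusHook is provided, it is called after each block is downloaded and receives progress statistics (blocks read so far, block size, and total file size).

3. urllib.quote(urldata, safe='/')

Encodes the characters of urldata that are not valid in a URL, so that the result can be printed or sent to a web server; characters listed in safe are left unencoded.

urllib.quote_plus(urldata, safe='')

Like quote(), but encodes spaces as plus signs (+) rather than %20. Its default safe is empty, so '/' is encoded as well.

4. urllib.unquote(urldata)

Decodes urldata.

urllib.unquote_plus(urldata)

Like unquote(), but also decodes plus signs into spaces.

5. urllib.urlencode(dict)

Encodes the key-value pairs of dict into a valid CGI request string; each key and value is encoded with quote_plus().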
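The difference between the quoting functions is easiest to see side by side. A minimal sketch, assuming Python 3, where these functions moved from urllib to urllib.parse; the sample strings are made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

raw = "a b/c?d"

quoted = quote(raw)            # space -> %20, '/' kept (safe='/')
plus_quoted = quote_plus(raw)  # space -> '+', '/' encoded too (safe='')

print(quoted)
print(plus_quoted)

# Decoding reverses each encoding.
print(unquote(quoted))
print(unquote_plus(plus_quoted))

# urlencode applies quote_plus to every key and value and joins with '&'.
query = urlencode({"name": "Georgina Garcia", "q": "a&b"})
print(query)
```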

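20.2.4 urllib2 module

Handles more complex requests, such as those that need a login name and password.

Method 1: create a basic authentication handler (urllib2.HTTPBasicAuthHandler) and register a login and password for a base URL or realm.

Method 2: simulate typing the user name and password when the browser prompts for them, that is, send a request that carries the appropriate Authorization header.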

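# urlopenAuth.py (Python 2)

import urllib2

LOGIN = 'wesc'   # placeholder login name; LOGIN was undefined in the original listing
PASSWD = "you'll never guess"
URL = 'http://localhost'

def handler_version(url):
    # Method 1: register the login and password with a basic-auth handler
    from urlparse import urlparse as up
    hdlr = urllib2.HTTPBasicAuthHandler()
    hdlr.add_password('Archives', up(url)[1], LOGIN, PASSWD)
    opener = urllib2.build_opener(hdlr)
    urllib2.install_opener(opener)   # later urlopen() calls use this opener
    return url

def request_version(url):
    # Method 2: build the Authorization header by hand
    from base64 import encodestring
    req = urllib2.Request(url)
    # Basic auth encodes "login:password"; [:-1] strips the trailing newline
    b64str = encodestring('%s:%s' % (LOGIN, PASSWD))[:-1]
    req.add_header("Authorization", "Basic %s" % b64str)
    return req

for funcType in ('handler', 'request'):
    print '***using %s' % funcType.upper()
    url = eval('%s_version' % funcType)(URL)
    f = urllib2.urlopen(url)
    print f.readline()
    f.close()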
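Method 2 comes down to base64-encoding the string "login:password" and placing it in an Authorization header. The sketch below shows only that step, assuming Python 3 (urllib.request in place of urllib2); basic_auth_request is a hypothetical helper name, and the credentials are the well-known sample pair from the HTTP Basic authentication specification, not values from these notes. No request is actually sent.

```python
import base64
from urllib.request import Request

def basic_auth_request(url, login, passwd):
    """Return a Request carrying a hand-built Basic Authorization header."""
    # Basic auth joins login and password with a colon before encoding.
    credentials = ("%s:%s" % (login, passwd)).encode("utf-8")
    b64str = base64.b64encode(credentials).decode("ascii")
    req = Request(url)
    req.add_header("Authorization", "Basic %s" % b64str)
    return req

req = basic_auth_request("http://localhost", "Aladdin", "open sesame")
print(req.get_header("Authorization"))
```

Unlike the Python 2 encodestring() call, b64encode() does not append a trailing newline, so no [:-1] slice is needed.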

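20.3 Advanced Web Clients

Web crawlers are used for: building indexes for search engines, offline browsing, downloading and saving pages for historical or archival purposes, and caching Web pages to save access time.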

# coding=UTF-8
# crawl.py (Python 2)

from sys import argv
from os import makedirs, unlink, sep
from os.path import dirname, exists, isdir, splitext
from string import replace, find, lower
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urlparse, urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO

class Retriever(object):    # downloads Web pages

    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)    # local file name derived from the URL

    def filename(self, url, deffile='index.htm'):    # build the local file path
        parsedurl = urlparse(url, 'http:', 0)
        path = parsedurl[1] + parsedurl[2]    # host name plus remote file path
        ext = splitext(path)
        if ext[1] == '':    # no file name, use the default
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path)    # local directory
        if sep != '/':
            ldir = replace(ldir, '/', sep)    # use the OS path separator
        if not isdir(ldir):    # create the directory if it does not exist
            if exists(ldir): unlink(ldir)
            makedirs(ldir or 'undenied')
        return path

    def download(self):    # fetch the page into the local file
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % self.url,)
        print "download done"
        return retval

    def parseAndGetLinks(self):    # parse the saved HTML and return its links
        self.parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist

class Crawler(object):

    count = 0    # number of pages downloaded so far

    def __init__(self, url):
        self.q = [url]    # queue of links still to download
        self.seen = []    # links already processed
        self.dom = urlparse(url)[1]    # domain of the starting URL

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()    # (file name, headers), or an error tuple
        # retval[0] is the whole message, so test its prefix, not equality with '*'
        if retval[0].startswith('*** ERROR'):
            print retval, '... skipping parse'
            return
        Crawler.count += 1
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)    # record the page as processed

        links = r.parseAndGetLinks()
        for eachLink in links:
            if eachLink[:4] != 'http' and find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)    # make relative links absolute
            print '* ', eachLink,

            if find(lower(eachLink), 'mailto:') != -1:    # skip mail links
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:    # not processed yet
                if find(eachLink, self.dom) == -1:    # skip links outside the domain
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)    # new, non-duplicate link
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'
            else:
                print '... discarded, already processed'

    def go(self):
        while self.q:    # keep crawling while the queue is not empty
            url = self.q.pop()    # take a link from the queue
            self.getPage(url)    # download it and analyze its links

def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter starting URL: ')    # e.g. http://www.baidu.com/index.html
        except (KeyboardInterrupt, EOFError):
            url = ''
    if not url: return
    robot = Crawler(url)    # entry point; also initializes the queue
    robot.go()

if __name__ == "__main__":
    main()    # start reading at main(), then follow go()
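The link-filtering step in getPage() can be tried on its own, without downloading anything. The following is a sketch under assumptions, not the listing's own method: it targets Python 3, uses html.parser.HTMLParser because htmllib does not exist there, and compares host names exactly instead of using the substring test in crawl.py. AnchorCollector and classify_links are hypothetical names, and the page and links are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, like htmllib's anchorlist."""
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)

def classify_links(page_url, html, seen, queue):
    """Apply the crawler's filters; append new links to queue.

    Returns a list of (absolute link, verdict) pairs.
    """
    dom = urlparse(page_url).netloc
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    results = []
    for link in parser.anchorlist:
        if link.lower().startswith("mailto:"):
            results.append((link, "discarded, mailto link"))
            continue
        link = urljoin(page_url, link)      # make relative links absolute
        if link in seen:
            verdict = "discarded, already processed"
        elif urlparse(link).netloc != dom:  # exact host match (assumption)
            verdict = "discarded, not in domain"
        elif link in queue:
            verdict = "discarded, already in Q"
        else:
            queue.append(link)
            verdict = "new, added to Q"
        results.append((link, verdict))
    return results

page = "http://www.example.com/docs/index.html"
html = ('<a href="intro.html">Intro</a> <a href="mailto:me@example.com">Mail</a> '
        '<a href="http://other.example.org/">Other</a> <a href="intro.html">Again</a> '
        '<a href="/docs/index.html">Self</a>')
seen, queue = [page], []
for link, verdict in classify_links(page, html, seen, queue):
    print("*", link, "...", verdict)
```

Keeping the filter separate from the download makes it possible to check the queue logic against a fixed HTML string.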

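20.4 CGI: Helping Web Servers Process Client Data

CGI programs differ from ordinary applications in how they take input, produce output, and handle the interaction between user and computer.

The main class of the cgi module is FieldStorage. Once instantiated, it holds a set of key-value pairs; each value can itself be a FieldStorage object, a MiniFieldStorage object, or a list of such objects.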
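The input and output difference can be sketched as follows: a CGI program receives the form data from the server (for a GET request, in the QUERY_STRING environment variable) and answers by writing headers, a blank line, and then the body. This is an illustration under assumptions, not the cgi module's API: it targets Python 3 and uses urllib.parse.parse_qs as a stand-in for FieldStorage, and cgi_response plus the form field "name" are hypothetical. A real CGI script would pass os.environ and print the result to standard output.

```python
from html import escape
from urllib.parse import parse_qs

def cgi_response(environ):
    """Build the text a CGI script would write to standard output."""
    # Input: key-value pairs decoded from the query string.
    form = parse_qs(environ.get("QUERY_STRING", ""))
    name = form.get("name", ["world"])[0]   # each value is a list
    # Output: headers, a blank line, then the HTML body.
    body = "<html><body><h1>Hello, %s!</h1></body></html>" % escape(name)
    return "Content-Type: text/html\r\n\r\n" + body

print(cgi_response({"QUERY_STRING": "name=Joe+Mama&lang=py"}))
```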

20.5 Building CGI Applications

20.5.1 Setting Up a Web Server

Apache can be used.

Alternatively, start a simple Python-based Web server, which listens on port 8000 by default:

$ python -m CGIHTTPServer

20.9 Related Modules

cgi, cgitb, htmllib, HTMLParser, htmlentitydefs, Cookie, cookielib, webbrowser, sgmllib, robotparser, httplib, xmllib, xml, xml.sax, xml.dom, xml.etree, xml.parsers.expat, xmlrpclib, SimpleXMLRPCServer, DocXMLRPCServer, BaseHTTPServer, SimpleHTTPServer, CGIHTTPServer, wsgiref, HTMLgen, BeautifulSoup, poplib, imaplib, email, mailbox, mailcap, mimetools, mimetypes, MimeWriter, multifile, quopri, rfc822, smtplib, base64, binascii, binhex, uu, ftplib, gopherlib, telnetlib, nntplib
