asp.net – How to submit a query to an .aspx page in Python
As an overview, you will need to perform four main tasks:
>submit requests to the web site,
>retrieve the responses from the site,
>parse these responses,
>have some logic to iterate over the tasks above, with parameters associated with navigation (to the "next" page in the results list)
The http request and response handling is done with methods and classes from Python's standard library urllib and urllib2. Parsing the html pages can be done with Python's standard library HTMLParser, or with other modules such as Beautiful Soup.
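Before the full Python 2 snippet below, the request-building part of steps 1 and 2 can be sketched in isolation. This sketch is written for Python 3, where urllib2 was folded into urllib.request; the field name 'q' and the short User-Agent string are placeholders, and no request is actually sent.

```python
# Minimal sketch of building (not sending) a POST request with the
# standard library, Python 3 style. Calling urlopen(req) would perform
# the actual network round-trip.
import urllib.parse
import urllib.request

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0',  # placeholder browser string
    'Content-Type': 'application/x-www-form-urlencoded',
}
# 'q'/'york' is a placeholder form field; supplying a data payload
# is what makes urllib use the POST method
body = urllib.parse.urlencode([('q', 'york')]).encode('ascii')
req = urllib.request.Request(uri, data=body, headers=headers)
print(req.get_method())  # POST
```

The same pattern appears in Python 2 form in the full snippet below, with urllib.urlencode and urllib2.Request.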
The following snippet demonstrates requesting and receiving a search on the site indicated in the question. The site is ASP-driven, so we need to ensure that we send several form fields, some of them with "horrible" values, because the ASP logic uses these fields to maintain state and, to some extent, to authenticate that the request is a genuine submission. The request must be sent with the http POST method, as that is what the ASP application expects. The main difficulty lies in identifying the form fields and associated values that ASP expects (getting pages with Python is the easy part).
This code is functional, or more precisely, it was functional until I removed most of the VSTATE value and possibly introduced a typo or two by adding the comments.
import urllib
import urllib2
uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
# the http headers are useful to simulate a particular browser (some sites
# deny access to non-browsers, i.e. bots etc.)
# also needed to pass the content type.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}
# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps
# with clarity and also makes it easy to encode them later.
formFields = (
    # the viewstate is actually 800+ characters in length! I truncated it
    # for this sample code. It can be lifted from the first page
    # obtained from the site. It may be ok to hardcode this value, or
    # it may have to be refreshed each time / each day, by essentially
    # running an extra page request-and-parse, for this specific value.
    (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),
    # following are more of these ASP form fields
    (r'__VIEWSTATE', r''),
    (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
    # the following fields were sent empty in this search; their values
    # were truncated in the original post, so empty strings stand in here
    (r'ctl00_RadScriptManager1_HiddenField', ''),
    (r'ctl00_tabTop_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
    # but then we come to fields of interest: the search
    # criteria, the collections to search from, etc.
    # Check boxes
    (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
    (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # legislative text
    (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
    # etc. (not all listed)
    (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),            # the search text
    (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),        # years to include
    (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),    # types to include
    (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # the Search button itself
)
# these have to be encoded
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri,encodedFields,headers)
f = urllib2.urlopen(req)  # that's the actual call to the http site.

# *** here would normally be the in-memory parsing of f's
#     contents, but instead I store this to a file;
#     this is useful during design, allowing one to have a
#     sample of what is to be parsed in a text editor, for analysis.
try:
    fout = open('tmp.htm', 'w')
    fout.writelines(f.readlines())
    fout.close()
except IOError:
    print('Could not open output file\n')
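As the comments in the snippet note, the __VSTATE / __VIEWSTATE value may have to be refreshed by requesting the first page and parsing the hidden field out of it. A minimal sketch of that extraction follows, using a regular expression; the sample_html string is an invented stand-in for the page body that urlopen() would return, and the field values shown are not real viewstates.

```python
# Sketch: pull ASP.NET hidden state fields out of a page body.
# sample_html is a made-up stand-in for the real first-page response.
import re

sample_html = (
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtNTA3" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAg==" />'
)

def extract_hidden_field(html, name):
    # hidden ASP.NET state fields are plain <input type="hidden"> tags;
    # capture whatever sits in their value attribute
    m = re.search(r'name="%s"[^>]*value="([^"]*)"' % re.escape(name), html)
    return m.group(1) if m else None

viewstate = extract_hidden_field(sample_html, '__VIEWSTATE')
eventvalidation = extract_hidden_field(sample_html, '__EVENTVALIDATION')
print(viewstate)        # dDwtNTA3
print(eventvalidation)  # /wEWAg==
```

The extracted values would then be substituted into formFields before encoding and posting the search request.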
So much for getting the initial page. As stated above, one then needs to parse the page, i.e. find the parts of interest and gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using html parsers, or XSLT-type technologies (indeed after parsing the html to xml), or even, for crude jobs, simple regular expressions. Also, one item typically extracted is the "next info", i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
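To illustrate the crudest of those options, here is a sketch of pulling the "next page" link out of the saved HTML with a simple regular expression. The HTML fragment and its link text are invented for the example; a real results page would need its actual pager markup inspected first.

```python
# Sketch of the "crude regular expressions" approach: find the anchor
# whose text is "Next" in an (invented) fragment of a results page.
import re

page = (
    '<a href="Legislation.aspx?page=1">1</a>'
    '<a href="Legislation.aspx?page=2">2</a>'
    '<a href="Legislation.aspx?page=2">Next</a>'
)

# collect every (href, link-text) pair in the page
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', page)
next_links = [href for href, text in links if text.strip().lower() == 'next']
print(next_links)  # ['Legislation.aspx?page=2']
```

The href found this way would feed the navigation parameters of the next POST request, closing the iteration loop described in the task list at the top.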
This should give you a rough flavor of what "long hand" html scraping is about. There are many other approaches to this, such as dedicated utilities, scripts in Mozilla's (FireFox) GreaseMonkey plug-in, XSLT, etc.