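Scraping and Parsing Web Page Data
One day I was handed this request: can the data on this kind of page be scraped?
The account, password, and site address were then provided:
Account: kytj1
Password: ******************
Login URL: http://student.tiaoji.kaoyan.com/tjadm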
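Main approach:
1. Use Fiddler4 to analyze how the HTTP requests work, including how data is sent (POST or GET) and which parameters are carried, and capture the returned data.
2. Simulate the HTTP requests in an Android program.
3. Parse the HTML in Java and extract the name, school applied to, major applied to, scores, phone number, publication date, and the other fields.
4. Write the extracted rows to a txt file and import that file into Excel for further processing.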
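Inspecting the packets with Fiddler
1. Open Fiddler.
2. Open the site, enter the username and password, and click Log in.
Login URL: http://student.tiaoji.kaoyan.com/tjadm
3. Look at the packets Fiddler captured.
The capture shows the HOST, the URL, the POST method, and the password in plain text.
4. Look at the page data.
After a successful login the page displays the data table, and Fiddler captures the corresponding request.
The capture shows the request's HOST and URL; the method is GET, and the returned data can be read from the response body.
5. The HTML code
The returned HTML page is as follows (excerpt):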
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=3.0,user-scalable=no "> <title>考研調劑中心_考研調劑意向發布系統_考研調劑_考研網(kaoyan.com)</title> <meta name="description" content="" /> <link rel="stylesheet" type="text/css" href="http://img.kaoyan.com/tiaoji/css/tiaoji-h5.css" /> <link href="http://img.kaoyan.com/global/style/header.css" rel="stylesheet"> <link href="http://img.kaoyan.com/yz/style/yz.index.css" rel="stylesheet"> <script type='text/javascript' src='http://cbjs.baidu.com/js/m.js'></script> </head> <body> <div class="kyHd"><div class="kyTop"><script src="http://img.kaoyan.com/www/header-tiaoji.js" type="text/javascript"></script><script src="http://img.kaoyan.com/www/headera.js" type="text/javascript"></script></div> </div> <div style="height:10px;"></div> <div class="w1000ad tc"><script type="text/javascript">/*考研網-大通欄-通用*/var cpro_id = "u1773335";</script><script src="http://cpro.baidustatic.com/cpro/ui/c.js" type="text/javascript"></script> </div> <ul class="nav" id="tjNav"><li><a href="http://tiaoji.kaoyan.com/" title="考研調劑首頁">調劑首頁</a></li><li><a href="http://www.kaoyan.com/kaoyan/27/474572/" title="考研調劑流程" target="_blank">調劑流程</a></li><li><a href="http://www.kaoyan.com/tiaoji/xinxi/" title="考研調劑信息">調劑信息</a></li><li><a href="http://tiaoji.kaoyan.com/xinwen/" title="考研調劑新聞">調劑新聞</a></li><li><a href="http://tiaoji.kaoyan.com/jingyan/" title="考研調劑經驗">調劑經驗</a></li><li><a href="http://tiaoji.bbs.kaoyan.com/" title="考研調劑論壇" target="_blank">調劑論壇</a></li> </ul> <div class="courseArea"><ul class="tjPicAd mt10 clear"><li><script type="text/javascript">BAIDU_CLB_fillSlot("850729");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("850747");</script></li><li><script 
type="text/javascript">BAIDU_CLB_fillSlot("850763");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("850766");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869710");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869712");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869713");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869714");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869898");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869899");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869901");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869902");</script></li></ul><ul class="tjPicAd clear"><li><script type="text/javascript">BAIDU_CLB_fillSlot("859514");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("859516");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("859517");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("859518");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869981");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869982");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869983");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("869984");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("1033455");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("1033457");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("1033458");</script></li><li><script type="text/javascript">BAIDU_CLB_fillSlot("1033459");</script></li></ul> </div><div class="box pc-index"><div class="tiaoji-content-nav"><ul><li><a href="http://www.kaoyan.com">考研網>></a></li><li><a href="http://tiaoji.kaoyan.com">考研調劑中心>></a></li><li><a href="http://student.tiaoji.kaoyan.com">考生調劑意向</a></li></ul></div><form 
action="" method="GET"><select name="course"><option value="">專業門類</option><option value="哲學">哲學</option><option value="經濟學">經濟學</option><option value="法學">法學</option><option value="教育學">教育學</option><option value="文學">文學</option><option value="歷史學">歷史學</option><option value="理學">理學</option><option value="工學">工學</option><option value="農學">農學</option><option value="醫學">醫學</option><option value="軍事學">軍事學</option><option value="管理學">管理學</option><option value="藝術學">藝術學</option></select>報考專業: <input name="major" value=""></input><input type="submit" value="搜索" /></form><div class="tiaoji-content"><div class="tiaoji-cont-top"><h5>考生調劑信息</h5><span><a href="/tjadm/logout">退出</a></span></div><table class="tiaoji-tab" cellpadding="0" cellspacing="0"><tr><th width="3%">姓名</th><th width="5%">報考學校</th><th width="5%">報考專業</th><th width="5%">專業門類</th><th width="2%">總分</th><th width="2%">政治</th><th width="2%">外語</th><th width="2%">專一</th><th width="2%">專二</th><th width="5%">電話</th><th width="5%">郵箱</th><th width="10%">調劑意向</th><th width="5%">發布時間</th></tr><tr><td>李***</td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">天津大學</td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">應用化學</td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">工學</td><td>244</td><td>58</td><td>53</td><td>133</td><td>0</td><td>15********15</td><td></td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">希望能調劑到211或者985院校,只要是與化學相關的都服從調劑</td><td>2016-03-01</td></tr> <tr><td>何***</td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">安徽大學</td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;">中國現當代文學</td><td style=" height:25px; line-height:25px; padding:0 5px; text-align:left;"></td><td>137</td><td>71</td><td>66</td><td>0</td><td>0</td><td>18********74</td><td></td><td style=" height:25px; line-height:25px; padding:0 5px; 
text-align:left;">服從調劑</td><td>2016-03-01</td></tr></table><table width="100%" align="center" border="0" cellpadding="2" cellspacing="1" class="tiaoji-fy"><tr><td colspan="2"> <span>[每頁顯示:20條/總共:161659條]</span> <a>上一頁</a> <b>1</b> <a href="/tjadm/2.html" >2</a> <a href="/tjadm/3.html" >3</a> <a href="/tjadm/4.html" >4</a> <a href="/tjadm/5.html" >5</a> <a href="/tjadm/6.html" >6</a> <a href="/tjadm/7.html" >7</a> <a href="/tjadm/8.html" >8</a> <a href="/tjadm/9.html" >9</a> <a href="/tjadm/10.html" >10</a> <span>...</span> <a href="/tjadm/8082.html" >8082</a> <a href="/tjadm/8083.html" >8083</a> <a href="/tjadm/2.html">下一頁</a></td></tr></table></div> </div><p class="clearFooter"></p> <div class="footer"> <script src="http://img.kaoyan.com/www/footera.js" type="text/javascript"></script>- <a href="http://www.kaoyan.com/sitemap/">網站地圖</a>- <a href="http://www.kaoyan.com/yzsitemap/">院校地圖</a>- <a href="http://www.kaoyan.com/update/">最新更新</a> <script src="http://img.kaoyan.com/www/footerb.js" type="text/javascript"></script> </div> <script src='http://img.kaoyan.com/global/js/gcc.js' type='text/javascript'></script> <script src="http://img.kaoyan.com/global/js/backtopnew.js?ver=2014092901" type="text/javascript"></script> <script type="text/javascript">/*考研網-全站對聯*/var cpro_id = "u1773154";</script> </body> </html>
The task is to parse the required data out of HTML in this format.
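The pager at the bottom of the HTML above reports 20 records per page and 161659 records in total, and its last link is /tjadm/8083.html. Those numbers agree: 161659 / 20 rounds up to 8083 pages. Below is a minimal sketch of that calculation, assuming the pager text always lists the page size first and the total second. PagerInfo and totalPages are hypothetical names, not part of the original project, and the sample string in main is an ASCII stand-in for the real pager text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: derives the page count from the pager text. */
final class PagerInfo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /**
     * Reads the first two integers in the pager text (assumed to be the
     * page size followed by the total record count) and returns the
     * number of pages, rounded up.
     */
    static int totalPages(String pagerText) {
        Matcher m = NUMBER.matcher(pagerText);
        if (!m.find()) {
            throw new IllegalArgumentException("no page size in: " + pagerText);
        }
        int pageSize = Integer.parseInt(m.group());
        if (!m.find()) {
            throw new IllegalArgumentException("no total count in: " + pagerText);
        }
        int total = Integer.parseInt(m.group());
        if (pageSize <= 0) {
            throw new IllegalArgumentException("page size must be positive");
        }
        // Ceiling division: a partially filled last page still counts.
        return (total + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // ASCII stand-in for the pager text found in the page.
        System.out.println(totalPages("[20 per page / 161659 total]"));
    }
}
```

A value like this could replace a hard-coded page limit when the whole data set is wanted rather than a fixed number of pages per run.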
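Simulating the HTTP requests in an Android program
The analysis above establishes the request URLs, the request methods, and the parameter format, so the next step is to reproduce the exchange in an Android program. (It does not have to be Android; doing it directly on a PC should work as well. I happen to be familiar with an HTTP library for Android, so I planned to implement it on Android.)
1. Open Eclipse, create a new project named TestGet, and copy the source of the HTTP library into the project. The library used is android-async-http.
Source: https://github.com/loopj/android-async-http
Tutorial: http://loopj.com/android-async-http/
This library is an asynchronous HTTP client built on top of Apache HttpClient. All network work runs off Android's UI thread, and results are delivered through callback methods.
In the project tree, the com.loopj.android.http package is the android-async-http source code.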
2. Create XcAsyncHttpClientUtil.java, define the request URLs, and wrap AsyncHttpClient's GET and POST requests:

package com.example.testget;

import org.apache.http.HttpEntity;

import android.content.Context;

import com.loopj.android.http.AsyncHttpClient;
import com.loopj.android.http.AsyncHttpResponseHandler;
import com.loopj.android.http.RequestParams;

public class XcAsyncHttpClientUtil {
    public static final String BASE_URL = "http://ntiaoji.kaoyan.com";
    public static final String LOGIN_URL = "/tjadm/login";
    public static final String INDEX1 = "/tjadm/1.html";

    // A single shared client, so every request goes through the same instance.
    private static AsyncHttpClient client = new AsyncHttpClient();

    public static void get(String url, RequestParams params,
            AsyncHttpResponseHandler responseHandler) {
        client.get(getAbsoluteUrl(url), params, responseHandler);
    }

    public static void post(String url, RequestParams params,
            AsyncHttpResponseHandler responseHandler) {
        client.post(getAbsoluteUrl(url), params, responseHandler);
    }

    public static void post(Context context, String url, HttpEntity entity,
            AsyncHttpResponseHandler responseHandler) {
        client.post(context, getAbsoluteUrl(url), entity, "", responseHandler);
    }

    // Prefix a relative path with BASE_URL.
    public static String getAbsoluteUrl(String relativeUrl) {
        return BASE_URL + relativeUrl;
    }
}
3. Edit activity_main.xml and add two buttons, one for logging in (登陸) and one for fetching the table data (獲取表格數據):

<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:paddingBottom="@dimen/activity_vertical_margin"
    android:paddingLeft="@dimen/activity_horizontal_margin"
    android:paddingRight="@dimen/activity_horizontal_margin"
    android:paddingTop="@dimen/activity_vertical_margin"
    tools:context="com.example.testget.MainActivity" >

    <TextView
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="@string/hello_world" />

    <Button
        android:id="@+id/btn"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="登陸" ></Button>

    <Button
        android:id="@+id/btn1"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="獲取表格數據" ></Button>

</LinearLayout>

The layout renders as a simple screen with the two buttons.
4. Edit MainActivity.java and add the button click handlers. dologin() performs the login and doGetData() fetches the table data. The page field is used to build the request URL; it starts at 1 and is incremented to fetch the following pages.

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    btn = (Button) findViewById(R.id.btn);
    btn.setOnClickListener(new OnClickListener() {
        @Override
        public void onClick(View v) {
            dologin();
        }
    });
    btn1 = (Button) findViewById(R.id.btn1);
    btn1.setOnClickListener(new OnClickListener() {
        @Override
        public void onClick(View v) {
            doGetData();
        }
    });
}

// POST the credentials to the login URL and log the response.
private void dologin() {
    RequestParams params = new RequestParams();
    params.put("username", "kytj1");
    params.put("password", "***********");
    XcAsyncHttpClientUtil.post(XcAsyncHttpClientUtil.LOGIN_URL, params,
            new AsyncHttpResponseHandler() {
                @Override
                public void onSuccess(int statusCode, Header[] headers,
                        byte[] responseBody) {
                    try {
                        String jsonString = new String(responseBody, "UTF-8");
                        Log.e("TAG", jsonString);
                    } catch (UnsupportedEncodingException e) {
                        e.printStackTrace();
                    }
                }

                @Override
                public void onFailure(int statusCode, Header[] headers,
                        byte[] responseBody, Throwable error) {
                    Log.e("Login", "onFailure");
                }
            });
}

// GET the current page and hand the decoded HTML to parse().
protected void doGetData() {
    RequestParams params = new RequestParams();
    XcAsyncHttpClientUtil.get("/tjadm/" + page + ".html", params,
            new AsyncHttpResponseHandler() {
                @Override
                public void onSuccess(int statusCode, Header[] headers,
                        byte[] responseBody) {
                    try {
                        // Despite its name, this string holds the HTML page.
                        String jsonString = new String(responseBody, "UTF-8");
                        parse(jsonString);
                    } catch (UnsupportedEncodingException e) {
                        e.printStackTrace();
                    }
                }

                @Override
                public void onFailure(int statusCode, Header[] headers,
                        byte[] responseBody, Throwable error) {
                    // Failures are ignored here, so the crawl simply stops.
                }
            });
}

Fetching the page content is not the end of the job; the required data still has to be parsed out of it. The responseBody received in doGetData() is UTF-8 encoded; once decoded into a string, it is handed to parse().
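As noted at the start of this section, the same exchange should also be possible on a PC. The following is a minimal sketch of that alternative using only the JDK, not the android-async-http approach used in this post. It reuses the base URL, paths, and the username and password parameter names from the code above. PlainJavaFetch and its methods are hypothetical names, and the sketch assumes the site tracks the login with a session cookie, which the original analysis does not state explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical PC-side equivalent of dologin() and doGetData(). */
final class PlainJavaFetch {
    // Same base URL as XcAsyncHttpClientUtil.BASE_URL.
    static final String BASE_URL = "http://ntiaoji.kaoyan.com";

    /** Builds the URL of a result page, e.g. /tjadm/2.html. */
    static String pageUrl(int page) {
        return BASE_URL + "/tjadm/" + page + ".html";
    }

    /** Encodes parameters as an application/x-www-form-urlencoded body. */
    static String formEncode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Sends one request and returns the response body decoded as UTF-8. */
    static String request(String method, String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        if (body != null) {
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: PlainJavaFetch <username> <password>");
            return;
        }
        // Assumption: the login is remembered through a session cookie,
        // so cookies must be kept between the POST and the GET.
        CookieHandler.setDefault(new CookieManager());
        Map<String, String> form = new LinkedHashMap<>();
        form.put("username", args[0]);
        form.put("password", args[1]);
        request("POST", BASE_URL + "/tjadm/login", formEncode(form));
        System.out.println(request("GET", pageUrl(1), null));
    }
}
```

The credentials are passed on the command line here so that the password does not have to be written into the source file.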
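Parsing the HTML data in Java
I had never parsed HTML before, so at first I was at a loss and went searching online for a solution, which is how I found jsoup. jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Official site: http://jsoup.org/
The jsoup library can be downloaded from the official site.
Add the downloaded jsoup-1.8.3.jar to the libs folder of the Android project.
The HTML to parse is the tiaoji-tab table from the page shown earlier.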
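In the original data the table contains 21 rows. Row 1 holds the column titles, such as name, school applied to, and major applied to; rows 2 to 21 hold the actual student records.
The Java code is as follows:

protected void parse(String html) {
    Document doc = Jsoup.parse(html);
    // first() returns null if the table is absent, for example when not logged in.
    Element tiaojiTab = doc.select("table.tiaoji-tab").first();
    Elements lists = tiaojiTab.getElementsByTag("tr");
    int size = lists.size();
    // Start at index 1 to skip the header row.
    for (int i = 1; i < size; i++) {
        Element item = lists.get(i);
        Elements els = item.getElementsByTag("td");
        String all = "";
        for (int j = 0; j < els.size(); j++) {
            Element value = els.get(j);
            String text = value.text();
            all = all + text + "#";
        }
        initData(all);
        Log.e("tag", all);
    }
    // Move on to the next page until totalsize pages have been fetched.
    page++;
    if (page < totalsize + 1) {
        doGetData();
    } else {
        page = 1;
    }
}

doc.select("table.tiaoji-tab").first() picks the table to parse out of the whole HTML document. getElementsByTag("tr") then returns its rows, and the for loop starts from the second row (index 1) so that the header is skipped.
A "#" is appended after every field so that the data is easy to split in Excel later.
initData(all) writes each row collected from the page to the txt file.
The final block, page++; if (page < totalsize + 1) { doGetData(); } else { page = 1; }, moves on to the next page. totalsize is set to 200, so each run of the program fetches 200 pages.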
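The "#" delimiter works as long as no cell contains a "#" itself; a free-text field such as the transfer intention could in principle contain one and shift every column after it. Below is a minimal sketch of the row-building step with that case handled, using a StringBuilder instead of repeated string concatenation. RowJoiner and joinCells are hypothetical names, and replacing a stray "#" with a space is an assumption of this sketch, not something the original code does.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns the cell texts of one table row into one line. */
final class RowJoiner {
    static final char DELIMITER = '#';

    /**
     * Appends the delimiter after every cell, as parse() does.
     * Assumption: a '#' inside a cell is replaced by a space so that the
     * column layout stays intact when the line is split later.
     */
    static String joinCells(List<String> cells) {
        StringBuilder row = new StringBuilder();
        for (String cell : cells) {
            row.append(cell.replace(DELIMITER, ' ')).append(DELIMITER);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // Sample scores and date from the first data row; the empty string
        // stands for the empty email cell.
        List<String> cells = Arrays.asList("244", "58", "53", "133", "0", "", "2016-03-01");
        System.out.println(joinCells(cells)); // prints 244#58#53#133#0##2016-03-01#
    }
}
```

An empty cell shows up as two consecutive delimiters, which Excel imports as an empty column.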
The code that writes the data to a local txt file:

private void initData(String msg) {
    String filePath = "/sdcard/Test/";
    String fileName = "tiaoji.txt";
    makeFilePath(filePath, fileName);
    writeTxtToFile(msg, filePath, fileName);
}

// Write a string to the text file.
public void writeTxtToFile(String strcontent, String filePath,
        String fileName) {
    // The folder must exist before the file is created, otherwise an error occurs.
    String strFilePath = filePath + fileName;
    // Every write ends with a line break.
    String strContent = strcontent + "\r\n";
    try {
        File file = new File(strFilePath);
        if (!file.exists()) {
            Log.d("TestFile", "Create the file:" + strFilePath);
            file.getParentFile().mkdirs();
            file.createNewFile();
        }
        // Seek to the end of the file so that the new line is appended.
        RandomAccessFile raf = new RandomAccessFile(file, "rwd");
        raf.seek(file.length());
        raf.write(strContent.getBytes());
        raf.close();
    } catch (Exception e) {
        Log.e("TestFile", "Error on write File:" + e);
    }
}

// Create the file.
public File makeFilePath(String filePath, String fileName) {
    File file = null;
    makeRootDirectory(filePath);
    try {
        file = new File(filePath + fileName);
        if (!file.exists()) {
            file.createNewFile();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return file;
}

// Create the folder.
public static void makeRootDirectory(String filePath) {
    File file = null;
    try {
        file = new File(filePath);
        if (!file.exists()) {
            file.mkdir();
        }
    } catch (Exception e) {
        Log.i("error:", e + "");
    }
}
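The create-then-seek-then-write sequence above can be expressed more compactly with java.nio.file. This is a minimal alternative sketch, not the code used in this post. LineAppender and appendLine are hypothetical names, the sketch writes UTF-8 explicitly rather than the platform default, and on Android it assumes API level 26 or later, where java.nio.file is available.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: appends one line to a text file, creating it if needed. */
final class LineAppender {

    /** Creates missing parent folders, then appends the line followed by CRLF. */
    static void appendLine(Path file, String line) throws IOException {
        Path parent = file.getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no error if it already exists
        }
        Files.write(file,
                (line + "\r\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,   // create the file on first use
                StandardOpenOption.APPEND);  // always write at the end
    }

    public static void main(String[] args) throws IOException {
        // Writes into the system temp folder so the demo has no side effects
        // outside of it; the folder and file names mirror the ones above.
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "Test", "tiaoji.txt");
        appendLine(file, "244#58#53#133#0#");
        System.out.println("appended to " + file);
    }
}
```

Fixing the encoding to UTF-8 matters at import time, because Excel's text import wizard asks for the file's encoding.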
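Importing the txt file into Excel
1. Connect the phone to the computer, open Yingyongbao (應用寶), choose File Manager in the toolbox, and copy the txt file to the computer.
2. Open Excel, choose Data > From Text, and follow the wizard: select the exported txt file, in step 2 of the wizard choose "Other" as the delimiter and enter "#", then finish.
With that, the data is displayed in Excel.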
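Before importing, it can help to check that every line splits into the expected number of columns. The table header has 13 columns, and every line written by parse() ends with a trailing "#". The following is a minimal sketch of such a check; ExportCheck, splitFields, and isWellFormed are hypothetical names, and the sample line in main uses placeholder values.

```java
import java.util.Arrays;

/** Hypothetical helper: checks the column count of one exported line. */
final class ExportCheck {
    // Number of header columns in the tiaoji-tab table.
    static final int EXPECTED_COLUMNS = 13;

    /** Splits a line on '#', keeping empty fields and dropping the trailing one. */
    static String[] splitFields(String line) {
        // A negative limit keeps empty strings, so "a##b" yields three fields.
        String[] parts = line.split("#", -1);
        // The trailing '#' produces one extra empty element at the end.
        if (parts.length > 0 && parts[parts.length - 1].isEmpty()) {
            return Arrays.copyOf(parts, parts.length - 1);
        }
        return parts;
    }

    /** Returns true if the line has exactly the expected number of columns. */
    static boolean isWellFormed(String line) {
        return splitFields(line).length == EXPECTED_COLUMNS;
    }

    public static void main(String[] args) {
        // Placeholder values; the empty field stands for a missing email.
        String line = "name#school#major#category#244#58#53#133#0#phone##intent#2016-03-01#";
        System.out.println(isWellFormed(line));
    }
}
```

Lines that fail the check most likely contain a "#" inside a free-text field and would shift columns in Excel.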
The complete MainActivity.java is as follows:

package com.example.testget;

import java.io.File;
import java.io.RandomAccessFile;
import java.io.UnsupportedEncodingException;

import org.apache.http.Header;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import android.app.Activity;
import android.os.Bundle;
import android.util.Log;
import android.view.View;
import android.view.View.OnClickListener;
import android.widget.Button;

import com.loopj.android.http.AsyncHttpResponseHandler;
import com.loopj.android.http.RequestParams;

public class MainActivity extends Activity {
    private Button btn, btn1;
    private int page = 1;
    private static final int totalsize = 200;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        btn = (Button) findViewById(R.id.btn);
        btn.setOnClickListener(new OnClickListener() {
            @Override
            public void onClick(View v) {
                dologin();
            }
        });
        btn1 = (Button) findViewById(R.id.btn1);
        btn1.setOnClickListener(new OnClickListener() {
            @Override
            public void onClick(View v) {
                doGetData();
            }
        });
    }

    // POST the credentials to the login URL and log the response.
    private void dologin() {
        RequestParams params = new RequestParams();
        params.put("username", "kytj1");
        params.put("password", "************");
        XcAsyncHttpClientUtil.post(XcAsyncHttpClientUtil.LOGIN_URL, params,
                new AsyncHttpResponseHandler() {
                    @Override
                    public void onSuccess(int statusCode, Header[] headers,
                            byte[] responseBody) {
                        try {
                            String jsonString = new String(responseBody, "UTF-8");
                            Log.e("TAG", jsonString);
                        } catch (UnsupportedEncodingException e) {
                            e.printStackTrace();
                        }
                    }

                    @Override
                    public void onFailure(int statusCode, Header[] headers,
                            byte[] responseBody, Throwable error) {
                        Log.e("Login", "onFailure");
                    }
                });
    }

    // GET the current page and hand the decoded HTML to parse().
    protected void doGetData() {
        RequestParams params = new RequestParams();
        XcAsyncHttpClientUtil.get("/tjadm/" + page + ".html", params,
                new AsyncHttpResponseHandler() {
                    @Override
                    public void onSuccess(int statusCode, Header[] headers,
                            byte[] responseBody) {
                        try {
                            // Despite its name, this string holds the HTML page.
                            String jsonString = new String(responseBody, "UTF-8");
                            parse(jsonString);
                        } catch (UnsupportedEncodingException e) {
                            e.printStackTrace();
                        }
                    }

                    @Override
                    public void onFailure(int statusCode, Header[] headers,
                            byte[] responseBody, Throwable error) {
                        // Failures are ignored here, so the crawl simply stops.
                    }
                });
    }

    // Extract every data row of table.tiaoji-tab and write it to the txt file.
    protected void parse(String html) {
        Document doc = Jsoup.parse(html);
        // first() returns null if the table is absent, for example when not logged in.
        Element tiaojiTab = doc.select("table.tiaoji-tab").first();
        Elements lists = tiaojiTab.getElementsByTag("tr");
        int size = lists.size();
        // Start at index 1 to skip the header row.
        for (int i = 1; i < size; i++) {
            Element item = lists.get(i);
            Elements els = item.getElementsByTag("td");
            String all = "";
            for (int j = 0; j < els.size(); j++) {
                Element value = els.get(j);
                String text = value.text();
                all = all + text + "#";
            }
            initData(all);
            Log.e("tag", all);
        }
        // Move on to the next page until totalsize pages have been fetched.
        page++;
        if (page < totalsize + 1) {
            doGetData();
        } else {
            page = 1;
        }
    }

    private void initData(String msg) {
        String filePath = "/sdcard/Test/";
        String fileName = "tiaoji.txt";
        makeFilePath(filePath, fileName);
        writeTxtToFile(msg, filePath, fileName);
    }

    // Write a string to the text file.
    public void writeTxtToFile(String strcontent, String filePath,
            String fileName) {
        // The folder must exist before the file is created, otherwise an error occurs.
        String strFilePath = filePath + fileName;
        // Every write ends with a line break.
        String strContent = strcontent + "\r\n";
        try {
            File file = new File(strFilePath);
            if (!file.exists()) {
                Log.d("TestFile", "Create the file:" + strFilePath);
                file.getParentFile().mkdirs();
                file.createNewFile();
            }
            // Seek to the end of the file so that the new line is appended.
            RandomAccessFile raf = new RandomAccessFile(file, "rwd");
            raf.seek(file.length());
            raf.write(strContent.getBytes());
            raf.close();
        } catch (Exception e) {
            Log.e("TestFile", "Error on write File:" + e);
        }
    }

    // Create the file.
    public File makeFilePath(String filePath, String fileName) {
        File file = null;
        makeRootDirectory(filePath);
        try {
            file = new File(filePath + fileName);
            if (!file.exists()) {
                file.createNewFile();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return file;
    }

    // Create the folder.
    public static void makeRootDirectory(String filePath) {
        File file = null;
        try {
            file = new File(filePath);
            if (!file.exists()) {
                file.mkdir();
            }
        } catch (Exception e) {
            Log.i("error:", e + "");
        }
    }
}
Ha ha ha!