c++ curl timeout: writing a web crawler in C/C++, using curl and gumbo together
Yes, you read that right: we're going to write a crawler in C++, or rather in C.
It's really not hard. It isn't as effortless as writing one in Python, but it isn't all that complicated either; plenty of experts have already written the libraries we need, and all we have to do is learn to use them.
Target page: https://acm.sjtu.edu.cn/OnlineJudge/status
We'll crawl everything in the judging status list on that page.
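The overall idea is simple: libcurl downloads the HTML (with a connect timeout and a total transfer timeout, which is the "curl timeout" part of the title), and Gumbo, through a C++ selector wrapper, parses the page so the rows of the status table can be picked out with a CSS-style selector. Just to illustrate the download half on its own, a bare-bones fetch with timeouts looks roughly like the sketch below. It is only an illustration: the helper name FetchPage, the callback name WriteToString and the 10-second defaults are placeholders of mine, not part of the project shared further down.

// Minimal sketch: fetch a URL into a std::string with connect/total timeouts.
// FetchPage and WriteToString are placeholder names, not code from the project.
#include <string>
#include <curl/curl.h>

// libcurl write callback: append each received chunk to the caller's string.
static size_t WriteToString(void* buffer, size_t size, size_t nmemb, void* userp)
{
    static_cast<std::string*>(userp)->append(static_cast<char*>(buffer), size * nmemb);
    return size * nmemb;
}

bool FetchPage(const char* url, std::string& out,
               long connectTimeoutSec = 10, long totalTimeoutSec = 10)
{
    CURL* curl = curl_easy_init();
    if (curl == NULL)
        return false;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToString);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &out);
    // CURLOPT_CONNECTTIMEOUT caps connection setup only;
    // CURLOPT_TIMEOUT caps the whole transfer.
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, connectTimeoutSec);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, totalTimeoutSec);
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;  // res is CURLE_OPERATION_TIMEDOUT when a timeout fires
}

The full program below sets both timeout options from the same timeout parameter; if either limit is hit, curl_easy_perform returns an error instead of hanging forever.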
The complete code is as follows:
#include <iostream>
#include <fstream>
#include <string>                 // assumed third header; std::string is used throughout
#include "gumbo/Document.h"
#include "gumbo/Node.h"
#include "MyStringFormat.h"
#include "curl/curl.h"

using namespace std;

#define URL_REFERER "https://acm.sjtu.edu.cn/OnlineJudge/"

// Parse the downloaded page and print every cell of the #status table.
void printFunc(string page)
{
    CDocument doc;
    doc.parse(page.c_str());
    CSelection c = doc.find("#status tr");
    for (int i = 0; i < c.nodeNum(); i++)
    {
        for (int j = 0; j < c.nodeAt(i).childNum(); j++)
        {
            CNode nd = c.nodeAt(i).childAt(j);
            cout << MyStringFormat::UTF_82ASCII(nd.text()).c_str() << " ";
        }
        cout << endl;
    }
}

// libcurl write callback: append each chunk of the response body to a std::string.
static size_t OnWriteData(void* buffer, size_t size, size_t nmemb, void* lpVoid)
{
    string* str = (string*)lpVoid;
    if (NULL == str || NULL == buffer)
    {
        return -1; // returning anything other than size * nmemb makes libcurl abort the transfer
    }
    char* pData = (char*)buffer;
    str->append(pData, size * nmemb);
    return size * nmemb;
}

// Download `url` into strResponse. `get` selects GET vs POST, `headers` adds one extra
// request header, `postdata` is the POST body, `bReserveHeaders` keeps the response
// headers in the body, and `timeout` is used for both the connect and total timeouts.
bool HttpRequest(const char* url, string& strResponse, bool get/* = true*/,
                 const char* headers/* = NULL*/, const char* postdata/* = NULL*/,
                 bool bReserveHeaders/* = false*/, int timeout/* = 10*/)
{
    CURLcode res;
    CURL* curl = curl_easy_init();
    if (NULL == curl)
    {
        return false;
    }
    curl_easy_setopt(curl, CURLOPT_URL, url);
    // Keep the response headers in the returned body if requested
    if (bReserveHeaders)
        curl_easy_setopt(curl, CURLOPT_HEADER, 1);
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");
    curl_easy_setopt(curl, CURLOPT_READFUNCTION, NULL);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, OnWriteData);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)&strResponse);
    curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1);
    // Do not verify the certificate or the host
    //curl_easy_setopt(curl, CURLOPT_PROXY, "127.0.0.1:8888"); // set a proxy
    //curl_easy_setopt(curl, CURLOPT_PROXYPORT, 9999);         // proxy server port
    curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, false);
    // Set the connect timeout and the total transfer timeout (both in seconds)
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT, timeout);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout);
    curl_easy_setopt(curl, CURLOPT_REFERER, URL_REFERER);
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36");
    // If no Accept-Encoding is set (or it is set to an empty string), libcurl
    // automatically decompresses compressed responses such as gzip
    //curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip, deflate, br");
    // Set the Host and Connection: Keep-Alive headers
    struct curl_slist *chunk = NULL;
    chunk = curl_slist_append(chunk, "Host: acm.sjtu.edu.cn");
    chunk = curl_slist_append(chunk, "Connection: Keep-Alive");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, chunk);
    // Append a caller-supplied custom header
    if (headers != NULL)
    {
        chunk = curl_slist_append(chunk, headers);
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, chunk);
    }
    if (!get && postdata != NULL)
    {
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, postdata);
    }
    res = curl_easy_perform(curl);
    bool bError = false;
    if (res == CURLE_OK)
    {
        long code; // CURLINFO_RESPONSE_CODE expects a long
        res = curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code);
        if (code != 200 && code != 302)
        {
            bError = true;
        }
    }
    else
    {
        bError = true;
    }
    curl_slist_free_all(chunk); // free the header list built above
    curl_easy_cleanup(curl);
    return !bError;
}

int main(int argc, char * argv[])
{
    string response;
    HttpRequest("https://acm.sjtu.edu.cn/OnlineJudge/status", response, true, NULL, NULL, false, 10);
    printFunc(response);
    system("pause");
    return 0;
}

I know that just pasting the code here won't let you run it, so I'm also sharing the project files. And so that nobody accuses me of fishing for points, everything is posted via a Baidu Cloud link.
Link: https://pan.baidu.com/s/1jBZ-6tT-4ne0uTMw4jFvKA
Extraction code: pmg6
If you like this, feel free to follow my WeChat public account, and follow my CSDN: wu_lian_nan
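One more note on the parsing side. CDocument, CSelection and CNode are not part of Gumbo itself; they come from a C++ selector wrapper around Google's Gumbo parser (the gumbo/Document.h and gumbo/Node.h headers in the includes look like the gumbo-query project), which is what lets us write doc.find("#status tr"). If you would rather depend on Gumbo alone, you can walk the parse tree with its C API directly. The sketch below only illustrates that idea and is not code from the project: PrintCells and PrintStatusTable are placeholder names, and instead of selecting #status tr it simply prints the text of every <td> cell it finds.

// Sketch only: walking the DOM with the raw Gumbo C API (no selector wrapper).
// Assumes gumbo.h is on the include path; PrintCells/PrintStatusTable are placeholders.
#include <iostream>
#include <string>
#include <gumbo.h>

// Recursively print the text content of every <td> element under `node`.
static void PrintCells(const GumboNode* node)
{
    if (node->type != GUMBO_NODE_ELEMENT)
        return;
    if (node->v.element.tag == GUMBO_TAG_TD)
    {
        const GumboVector* kids = &node->v.element.children;
        for (unsigned int i = 0; i < kids->length; i++)
        {
            const GumboNode* kid = (const GumboNode*)kids->data[i];
            if (kid->type == GUMBO_NODE_TEXT)
                std::cout << kid->v.text.text << " ";
        }
        return;
    }
    const GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; i++)
        PrintCells((const GumboNode*)children->data[i]);
}

void PrintStatusTable(const std::string& html)
{
    GumboOutput* output = gumbo_parse(html.c_str());
    PrintCells(output->root);   // output->root is the <html> element
    std::cout << std::endl;
    gumbo_destroy_output(&kGumboDefaultOptions, output);
}

The wrapper is clearly more convenient, since one CSS-style selector replaces the whole recursive walk, which is why the project uses it.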