當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

LeetCode 1236. 网络爬虫（BFS/DFS）

發布時間：2024/7/5 编程问答 33 豆豆

生活随笔收集整理的這篇文章主要介紹了 LeetCode 1236. 网络爬虫（BFS/DFS）小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

文章目錄

- 1. 題目
- 2. 解題
- - 2.1 BFS
  - 2.2 DFS

1. 題目

給定一個鏈接 startUrl 和一個接口 HtmlParser ，請你實現一個網絡爬蟲，以實現爬取同 startUrl 擁有相同域名標簽 的全部鏈接。該爬蟲得到的全部鏈接可以任何順序返回結果。

你的網絡爬蟲應當按照如下模式工作：

自鏈接 startUrl 開始爬取
調用 HtmlParser.getUrls(url) 來獲得鏈接url頁面中的全部鏈接
同一個鏈接最多只爬取一次
只輸出域名與 startUrl 相同的鏈接集合

如上所示的一個鏈接，其域名為 example.org。
簡單起見，你可以假設所有的鏈接都采用 http協議并沒有指定端口。
例如，鏈接 http://leetcode.com/problems 和 http://leetcode.com/contest 是同一個域名下的，
而鏈接http://example.org/test 和 http://example.com/abc 是不在同一域名下的。

HtmlParser 接口定義如下： interface HtmlParser {// 返回給定 url 對應的頁面中的全部 url 。public List<String> getUrls(String url); }

下面是兩個實例，用以解釋該問題的設計功能，對于自定義測試，你可以使用三個變量 urls, edges 和 startUrl。
注意在代碼實現中，你只可以訪問 startUrl ，而 urls 和 edges 不可以在你的代碼中被直接訪問。

示例 1：

輸入： urls = ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.google.com","http://news.yahoo.com/us" ] edges = [[2,0],[2,1],[3,2],[3,1],[0,4]] startUrl = "http://news.yahoo.com/news/topics/" 輸出：["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.yahoo.com/us" ]

示例 2：

輸入： urls = ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.google.com" ] edges = [[0,2],[2,1],[3,2],[3,1],[3,0]] startUrl = "http://news.google.com" 輸入：["http://news.google.com"] 解釋：startUrl 鏈接到所有其他不共享相同主機名的頁面。提示： 1 <= urls.length <= 1000 1 <= urls[i].length <= 300 startUrl 為 urls 中的一個。域名標簽的長為1到63個字符（包括點），只能包含從‘a’到‘z’的ASCII字母、‘0’到‘9’的數字以及連字符即減號（‘-’）。域名標簽不會以連字符即減號（‘-’）開頭或結尾。關于域名有效性的約束可參考: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames 你可以假定url庫中不包含重復項。

來源：力扣（LeetCode）鏈接：https://leetcode-cn.com/problems/web-crawler
著作權歸領扣網絡所有。商業轉載請聯系官方授權，非商業轉載請注明出處。

2. 解題

簡單的BFS或者DFS模板題

2.1 BFS

/*** // This is the HtmlParser's API interface.* // You should not implement it, or speculate about its implementation* class HtmlParser {* public:* vector<string> getUrls(string url);* };*/class Solution { public:vector<string> crawl(string startUrl, HtmlParser htmlParser) {unordered_set<string> visited;visited.insert(startUrl);queue<string> q;q.push(startUrl);vector<string> ans;string cur, domain;while(!q.empty()){cur = q.front();ans.push_back(cur);q.pop();domain = getdomain(cur);vector<string> sub = htmlParser.getUrls(cur);for(string& link : sub){if(getdomain(link) == domain && !visited.count(link)){q.push(link);visited.insert(link);}}}return ans;}string getdomain(string& url) {auto it = find(url.begin()+7, url.end(), '/');return string(url.begin(), it);} };

184 ms 32.9 MB

2.2 DFS

class Solution {unordered_set<string> visited;vector<string> ans; public:vector<string> crawl(string startUrl, HtmlParser htmlParser) {visited.insert(startUrl);dfs(startUrl, htmlParser);return ans;}void dfs(string& cur, HtmlParser& htmlParser){ans.push_back(cur);string domain = getdomain(cur);vector<string> sub = htmlParser.getUrls(cur);for(string& link : sub){if(getdomain(link) == domain && !visited.count(link)){visited.insert(link);dfs(link, htmlParser);}}}string getdomain(string& url) {auto it = find(url.begin()+7, url.end(), '/');return string(url.begin(), it);} };

我的CSDN博客地址 https://michael.blog.csdn.net/

長按或掃碼關注我的公眾號（Michael阿明），一起加油、一起學習進步！

總結

以上是生活随笔為你收集整理的LeetCode 1236. 网络爬虫（BFS/DFS）的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： LeetCode 第 28 场双周赛（5
下一篇： LeetCode 487. 最大连续1的