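How the Request.Browser.Crawler Property Works
I was asked how to tell whether a request comes from a search engine. My answer was to check the UserAgent string myself, but he told me that .NET provides a Crawler property. I had never actually used it, and my first reaction was: how could something built by Microsoft possibly recognize every search engine? That cannot be done; as far as I know, the web also has plenty of "special search engines", such as email-harvesting crawlers. So I wanted to find out what characteristics a search engine needs for the Crawler property to be true.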
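As everyone knows, a Web project has a Request object. It has a Browser property, an instance of a class that describes the browser's capabilities, and Browser in turn has a Boolean property named Crawler. MSDN describes it as a value indicating whether the browser is a search engine Web crawler.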
Let's see how its value is obtained. All code below was obtained by inspecting System.Web.dll with Reflector.
public bool Crawler
{
    get
    {
        // Parse the capability once, then serve the cached value.
        if (!this._havecrawler)
        {
            this._crawler = this.CapsParseBool("crawler");
            this._havecrawler = true;
        }
        return this._crawler;
    }
}
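The getter above is a parse-once, cache-afterwards pattern. A minimal sketch of the same idea follows; `CapabilitiesSketch` and `ParseCount` are hypothetical names introduced only for illustration and are not part of System.Web.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the capabilities class; not the real System.Web type.
public class CapabilitiesSketch
{
    private readonly IDictionary<string, string> _items;
    private bool _crawler;
    private bool _haveCrawler;

    // Counts how often the string is actually parsed (illustration only).
    public int ParseCount { get; private set; }

    public CapabilitiesSketch(IDictionary<string, string> items)
    {
        _items = items;
    }

    public bool Crawler
    {
        get
        {
            // Parse the stored string only on first access, then reuse the result.
            if (!_haveCrawler)
            {
                _crawler = bool.Parse(_items["crawler"]);
                _haveCrawler = true;
                ParseCount++;
            }
            return _crawler;
        }
    }
}

public static class LazyCrawlerDemo
{
    public static void Main()
    {
        var caps = new CapabilitiesSketch(
            new Dictionary<string, string> { { "crawler", "true" } });
        Console.WriteLine(caps.Crawler);    // first access parses the string
        Console.WriteLine(caps.Crawler);    // second access uses the cached value
        Console.WriteLine(caps.ParseCount); // 1
    }
}
```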
Clearly CapsParseBool("crawler") is the key here, so let's look at how that method is implemented.
private bool CapsParseBool(string capsKey)
{
    bool flag;
    try
    {
        // The capability is stored as a string and converted on demand.
        flag = bool.Parse(this[capsKey]);
    }
    catch (FormatException exception)
    {
        throw this.BuildParseError(exception, capsKey);
    }
    return flag;
}
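Since everything hinges on `bool.Parse`, it helps to see which strings it accepts. The sketch below uses a hypothetical helper, `BoolParseDemo.Describe`, to show that parsing is case-insensitive and tolerates surrounding whitespace, that other text raises `FormatException`, and that `null` raises `ArgumentNullException`, which the `catch (FormatException)` above would not intercept.

```csharp
using System;

public static class BoolParseDemo
{
    // Hypothetical helper: returns a short label describing how bool.Parse treats the input.
    public static string Describe(string value)
    {
        try
        {
            return bool.Parse(value) ? "true" : "false";
        }
        catch (FormatException)
        {
            return "FormatException";
        }
        catch (ArgumentNullException)
        {
            return "ArgumentNullException";
        }
    }

    public static void Main()
    {
        foreach (string s in new[] { "true", "False", " TRUE ", "yes", null })
        {
            Console.WriteLine("{0} -> {1}", s ?? "(null)", Describe(s));
        }
    }
}
```

In the code traced later in this article, the "crawler" key is always given a default of "false" before any matching happens, so the null case should not arise for this particular key.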
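Things are getting clearer: flag is the variable that holds whether the current request comes from a search engine, and the value itself is stored in this[capsKey]. Let's keep digging.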
public virtual string this[string key]
{
    get
    {
        return (string) this._items[key];
    }
}
One step further: the real value is stored in the _items collection. Here is how it is declared.
private IDictionary _items;
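Now we hit a problem:
This is a dictionary interface. How is it initialized, and how does it get the crawler key/value pair? At this point there seems to be no obvious lead for tracing where the value for the crawler key comes from.
Let's step back and sort out our thinking:
Crawler is a property of the Browser object, and Browser is a property of the Request class, so Crawler is probably initialized when Browser is. The Request object is initialized by the server after the client's request arrives. The Browser object must initialize itself from the current Request, so we need to look at how the Browser property is implemented.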
public HttpBrowserCapabilities Browser
{
    get
    {
        if (this._browsercaps == null)
        {
            if (!s_browserCapsEvaled)
            {
                lock (s_browserLock)
                {
                    if (!s_browserCapsEvaled)
                    {
                        HttpCapabilitiesBase.GetBrowserCapabilities(this);
                    }
                    s_browserCapsEvaled = true;
                }
            }
            this._browsercaps = HttpCapabilitiesBase.GetBrowserCapabilities(this);
        }
        return this._browsercaps;
    }
    set
    {
        this._browsercaps = value;
    }
}
this._browsercaps = HttpCapabilitiesBase.GetBrowserCapabilities(this);
is exactly the line we are looking for. Let's keep going.
internal static HttpBrowserCapabilities GetBrowserCapabilities(HttpRequest request)
{
    HttpCapabilitiesBase base2 = null;
    HttpCapabilitiesEvaluator browserCaps = RuntimeConfig.GetConfig(request.Context).BrowserCaps;
    if (browserCaps != null)
    {
        base2 = browserCaps.Evaluate(request);
    }
    return (HttpBrowserCapabilities) base2;
}
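Another big step forward. The code confirms our earlier guess: it does initialize itself from the current Request. Next, look at browserCaps.Evaluate(request);
This method is somewhat long, so to save space I won't reproduce it in full here. Although there is a lot of code, it is easy to follow, and much of it is cache checking and handling. Because the return value is the HttpCapabilitiesBase instance behind the Browser property we have been following, I focused entirely on the return value, result. Reading the code shows that result comes from one of two places: the cache or the EvaluateFinal method. Let's look at EvaluateFinal; the end is in sight.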
internal HttpCapabilitiesBase EvaluateFinal(HttpRequest request, bool onlyEvaluateUserAgent)
{
    HttpBrowserCapabilities httpBrowserCapabilities = this.BrowserCapFactory.GetHttpBrowserCapabilities(request);
    CapabilitiesState state = new CapabilitiesState(request, httpBrowserCapabilities.Capabilities);
    if (onlyEvaluateUserAgent)
    {
        state.EvaluateOnlyUserAgent = true;
    }
    if (this._rule != null)
    {
        string str = httpBrowserCapabilities["isMobileDevice"];
        httpBrowserCapabilities.Capabilities["isMobileDevice"] = null;
        this._rule.Evaluate(state);
        string str2 = httpBrowserCapabilities["isMobileDevice"];
        if (str2 == null)
        {
            httpBrowserCapabilities.Capabilities["isMobileDevice"] = str;
        }
        else if (str2.Equals("true"))
        {
            httpBrowserCapabilities.DisableOptimizedCacheKey();
        }
    }
    HttpCapabilitiesBase base2 = (HttpCapabilitiesBase) HttpRuntime.CreateNonPublicInstance(this._resultType);
    base2.InitInternal(httpBrowserCapabilities);
    return base2;
}
Excellent. The very first statement tells us that the Browser property is initialized in the BrowserCapFactory factory. Let's go look at the factory.
internal BrowserCapabilitiesFactoryBase BrowserCapFactory
{
    get
    {
        return BrowserCapabilitiesCompiler.BrowserCapabilitiesFactory;
    }
}
Microsoft uses the abstract factory design pattern here, so the actual factory class is instantiated from a type held in a private static variable.
internal static BrowserCapabilitiesFactoryBase BrowserCapabilitiesFactory
{
    get
    {
        if (_browserCapabilitiesFactoryBaseInstance == null)
        {
            lock (_lockObject)
            {
                if (_browserCapabilitiesFactoryBaseInstance == null)
                {
                    _browserCapabilitiesFactoryType = GetBrowserCapabilitiesType();
                    if (_browserCapabilitiesFactoryType != null)
                    {
                        _browserCapabilitiesFactoryBaseInstance = (BrowserCapabilitiesFactoryBase) Activator.CreateInstance(_browserCapabilitiesFactoryType);
                    }
                }
            }
        }
        return _browserCapabilitiesFactoryBaseInstance;
    }
}
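The property above combines two ideas: double-checked locking, and creating the concrete factory from a `Type` with `Activator.CreateInstance`. A minimal sketch of that combination follows. All class names are hypothetical, `ResolveFactoryType` merely stands in for `GetBrowserCapabilitiesType()`, and the `volatile` modifier is my own addition that does not appear in the decompiled code.

```csharp
using System;

// Hypothetical abstract factory base; not the real BrowserCapabilitiesFactoryBase.
public abstract class FactoryBaseSketch
{
    public abstract string Describe();
}

// Hypothetical concrete factory.
public class DefaultFactorySketch : FactoryBaseSketch
{
    public override string Describe() { return "default factory"; }
}

public static class FactoryHolderSketch
{
    private static readonly object LockObject = new object();
    private static volatile FactoryBaseSketch _instance;

    // Counts constructions (illustration only).
    public static int CreateCount;

    // Assumption: stands in for GetBrowserCapabilitiesType(), whose body is not shown.
    private static Type ResolveFactoryType()
    {
        return typeof(DefaultFactorySketch);
    }

    public static FactoryBaseSketch Instance
    {
        get
        {
            // First check avoids taking the lock once the instance exists.
            if (_instance == null)
            {
                lock (LockObject)
                {
                    // Second check guards against another thread having won the race.
                    if (_instance == null)
                    {
                        Type factoryType = ResolveFactoryType();
                        _instance = (FactoryBaseSketch)Activator.CreateInstance(factoryType);
                        CreateCount++;
                    }
                }
            }
            return _instance;
        }
    }
}

public static class FactoryDemo
{
    public static void Main()
    {
        Console.WriteLine(FactoryHolderSketch.Instance.Describe());
        Console.WriteLine(ReferenceEquals(FactoryHolderSketch.Instance, FactoryHolderSketch.Instance));
        Console.WriteLine(FactoryHolderSketch.CreateCount);
    }
}
```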
That won't stop us. Only BrowserCapabilitiesFactory inherits from BrowserCapabilitiesFactoryBase, so the real code should be in that factory class. Let's continue and look at the factory's GetHttpBrowserCapabilities method.
internal HttpBrowserCapabilities GetHttpBrowserCapabilities(HttpRequest request)
{
    if (request == null)
    {
        throw new ArgumentNullException("request");
    }
    NameValueCollection headers = request.Headers;
    HttpBrowserCapabilities browserCaps = new HttpBrowserCapabilities();
    Hashtable hashtable = new Hashtable(180, StringComparer.OrdinalIgnoreCase);
    // The UserAgent string is stored under the empty-string key.
    hashtable[string.Empty] = HttpCapabilitiesEvaluator.GetUserAgent(request);
    browserCaps.Capabilities = hashtable;
    this.ConfigureBrowserCapabilities(headers, browserCaps);
    return browserCaps;
}
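Two details of this method are easy to miss: the capability table compares keys case-insensitively, and the UserAgent string lives under the empty-string key. The sketch below reproduces just that table setup; `CapabilityTableDemo` and the sample UserAgent value are illustrative assumptions.

```csharp
using System;
using System.Collections;

public static class CapabilityTableDemo
{
    // Builds a table shaped like the one in GetHttpBrowserCapabilities (sketch only).
    public static Hashtable Build(string userAgent)
    {
        Hashtable table = new Hashtable(180, StringComparer.OrdinalIgnoreCase);
        table[string.Empty] = userAgent; // UserAgent is keyed by the empty string
        table["crawler"] = "false";      // default that the framework sets before matching
        return table;
    }

    public static void Main()
    {
        // Sample UserAgent string chosen for illustration.
        Hashtable table = Build("Googlebot/2.1 (+http://www.google.com/bot.html)");
        Console.WriteLine(table[""]);        // the stored UserAgent
        Console.WriteLine(table["CRAWLER"]); // same entry as "crawler"
    }
}
```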
Finally we arrive at the virtual method ConfigureBrowserCapabilities.
public override void ConfigureBrowserCapabilities(NameValueCollection headers, HttpBrowserCapabilities browserCaps)
{
    this.DefaultProcess(headers, browserCaps);
    if (base.IsBrowserUnknown(browserCaps))
    {
        this.DefaultDefaultProcess(headers, browserCaps);
    }
}

private bool DefaultProcess(NameValueCollection headers, HttpBrowserCapabilities browserCaps)
{
    IDictionary capabilities = browserCaps.Capabilities;
    ....
    // Default value; only CrawlerProcess can change it to "true".
    capabilities["crawler"] = "false";
    ....
    this.CrawlerProcess(headers, browserCaps);
    return true;
}

private bool CrawlerProcess(NameValueCollection headers, HttpBrowserCapabilities browserCaps)
{
    IDictionary capabilities = browserCaps.Capabilities;
    // The empty-string key holds the UserAgent string.
    string target = browserCaps[string.Empty];
    RegexWorker worker = new RegexWorker(browserCaps);
    if (!worker.ProcessRegex(target, "crawler|Crawler|Googlebot|msnbot"))
    {
        return false;
    }
    capabilities["crawler"] = "true";
    this.CrawlerProcessGateways(headers, browserCaps);
    bool ignoreApplicationBrowsers = false;
    this.CrawlerProcessBrowsers(ignoreApplicationBrowsers, headers, browserCaps);
    return true;
}

// Called above on the RegexWorker instance.
public bool ProcessRegex(string target, string regexExpression)
{
    if (target == null)
    {
        target = string.Empty;
    }
    Regex regex = new Regex(regexExpression, RegexOptions.ExplicitCapture);
    Match match = regex.Match(target);
    if (!match.Success)
    {
        return false;
    }
    string[] groupNames = regex.GetGroupNames();
    if (groupNames.Length > 0)
    {
        if (this._groups == null)
        {
            this._groups = new Hashtable();
        }
        for (int i = 0; i < groupNames.Length; i++)
        {
            this._groups[groupNames[i]] = match.Groups[i].Value;
        }
    }
    return true;
}
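To see what this pattern does and does not match, the core of ProcessRegex can be tried in isolation. The sketch below uses the same pattern and the same `RegexOptions.ExplicitCapture` option; `CrawlerPatternDemo` is a hypothetical name, and the UserAgent strings are samples picked for illustration. Because no `RegexOptions.IgnoreCase` is passed, matching is case-sensitive.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrawlerPatternDemo
{
    // The pattern passed to ProcessRegex in CrawlerProcess.
    public const string Pattern = "crawler|Crawler|Googlebot|msnbot";

    // Mirrors the match step of ProcessRegex, without the group bookkeeping.
    public static bool Matches(string userAgent)
    {
        Regex regex = new Regex(Pattern, RegexOptions.ExplicitCapture);
        return regex.Match(userAgent ?? string.Empty).Success;
    }

    public static void Main()
    {
        // Sample UserAgent strings (illustrative).
        string[] agents =
        {
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
            "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
            "googlebot"
        };
        foreach (string agent in agents)
        {
            Console.WriteLine("{0} -> {1}", agent, Matches(agent));
        }
    }
}
```

The first two samples match; the Baiduspider sample and the lowercase "googlebot" do not.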
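At last we have found the code that matters. We can say with confidence that the Crawler property is implemented by matching a regular expression, and that in this version the only search engines it can identify are those whose UserAgent string matches crawler|Crawler|Googlebot|msnbot. Baidu, for example, is not detected, because its crawler identifies itself as Baiduspider.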
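Given that limitation, checking the UserAgent string yourself remains a reasonable supplement. The following is a minimal sketch of that approach, not a definitive implementation: `CrawlerDetector`, `IsLikelyCrawler`, and the token list are hypothetical and would need to be adapted to the crawlers you care about.

```csharp
using System;

public static class CrawlerDetector
{
    // Assumption: an illustrative list of extra tokens; extend it as needed.
    private static readonly string[] ExtraTokens = { "Baiduspider", "bingbot", "Slurp" };

    // frameworkFlag is the value of Request.Browser.Crawler for the current request.
    public static bool IsLikelyCrawler(bool frameworkFlag, string userAgent)
    {
        if (frameworkFlag)
        {
            return true;
        }
        if (string.IsNullOrEmpty(userAgent))
        {
            return false;
        }
        foreach (string token in ExtraTokens)
        {
            // Case-insensitive substring check, unlike the built-in pattern.
            if (userAgent.IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(IsLikelyCrawler(false, "Baiduspider+(+http://www.baidu.com/search/spider.htm)"));
        Console.WriteLine(IsLikelyCrawler(false, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"));
    }
}
```

In a page, this could be called as `CrawlerDetector.IsLikelyCrawler(Request.Browser.Crawler, Request.UserAgent)`.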
Reposted from: https://www.cnblogs.com/SUPERAI/archive/2011/11/30/2269074.html