The New Beginnings of AI-Powered Web Data Gathering Solutions
Data gathering consists of many time-consuming and complex activities. These include proxy management, data parsing, infrastructure management, overcoming fingerprinting anti-measures, rendering JavaScript-heavy websites at scale, and much more. Is there a way to automate these processes? Absolutely.
Finding a more manageable solution for large-scale data gathering has been on the minds of many in the web scraping community. Specialists saw a lot of potential in applying AI (Artificial Intelligence) and ML (Machine Learning) to web scraping. However, concrete steps toward automating data gathering with AI have been taken only recently. This is no wonder, as AI and ML algorithms became robust at large scale only in recent years, together with advances in computing.
By applying AI-powered solutions in data gathering, we can help automate tedious manual work and ensure a much better quality of the collected data. To better grasp the struggles of web scraping, let’s look into the process of data gathering, its biggest challenges, and possible future solutions that might ease and potentially solve those challenges.
Data collection: step by step
To better understand the web scraping process, it’s best to visualize it in a value chain:
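As you can see, web scraping comprises four distinct actions:
- Building a crawling path and collecting URLs.
- Scraper development and its maintenance.
- Proxy acquisition and management.
- Data fetching and parsing.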
Anything that goes beyond those terms is considered to be data engineering or part of data analysis.
By pinpointing which actions belong to the web scraping category, it becomes easier to find the most common data gathering challenges. It also allows us to see which parts can be automated and improved with the help of AI- and ML-powered solutions.
Large-scale scraping challenges
Traditional data gathering from the web requires a lot of governance and quality assurance. Of course, the difficulties that come with data gathering increase together with the scale of the scraping project. Let’s dig a little deeper into the said challenges by going through our value chain’s actions and analyzing potential issues.
Building a crawling path and collecting URLs
Building a crawling path is the first and essential part of data gathering. To put it simply, a crawling path is a library of URLs from which data will be extracted. The biggest challenge here is not the collection of the website URLs that you want to scrape, but obtaining all the necessary URLs of the initial targets. That could mean dozens, if not hundreds of URLs that will need to be scraped, parsed, and identified as important URLs for your case.
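As a rough illustration of what collecting the necessary URLs can involve, the sketch below extracts links from one listing page, resolves them against the page URL, drops duplicates and off-site links, and keeps only those a caller-supplied rule marks as relevant. It is a minimal sketch using only the Python standard library; the site `shop.example.com`, the sample HTML, and the `/product/` rule are assumptions made up for the example, not part of any real target.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_crawl_path(page_url, html, keep):
    """Return ordered, de-duplicated same-site URLs for which keep(url) is true."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    seen, path = set(), []
    for href in collector.hrefs:
        # Resolve relative links and strip #fragments so duplicates collapse.
        url, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(url).netloc != site or url in seen:
            continue
        seen.add(url)
        if keep(url):
            path.append(url)
    return path


# Hypothetical listing page, used only for illustration.
SAMPLE_HTML = """
<a href="/product/1">Item 1</a>
<a href="/product/2#reviews">Item 2</a>
<a href="/product/1">Item 1 again</a>
<a href="/about">About</a>
<a href="https://other.example.org/product/9">Elsewhere</a>
"""

crawl_path = build_crawl_path(
    "https://shop.example.com/catalog",
    SAMPLE_HTML,
    keep=lambda url: "/product/" in url,
)
print(crawl_path)
```

In a real project the `keep` rule is usually the hard part: deciding which of the discovered URLs matter for your case is exactly the identification step described above.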
Scraper development and its maintenance
Building a scraper comes with a whole new set of issues. There are a lot of factors to look out for when doing so:
- Choosing the language, APIs, frameworks, etc.
- Testing out what you’ve built.
- Infrastructure management and maintenance.
- Overcoming fingerprinting anti-measures.
- Rendering JavaScript-heavy websites at scale.
These are just the tip of the iceberg of what you will encounter when building a web scraper. There are plenty of smaller, time-consuming tasks that accumulate into larger issues.
Proxy acquisition and management
Proxy management will be a challenge, especially for those new to scraping. Many small mistakes can get whole batches of proxies blocked before a site is scraped successfully. Proxy rotation is a good practice, but it doesn’t eliminate all the issues and requires constant management and upkeep of the infrastructure. So if you are relying on a proxy vendor, good and frequent communication will be necessary.
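To make the idea of proxy rotation and its upkeep more concrete, here is a minimal sketch of a round-robin rotator that retires proxies after repeated failures and retries a request through the next healthy one. The class and function names, the pool addresses, and the `fake_fetch` stand-in (which simulates one blocked proxy instead of making network calls) are all hypothetical and exist only for this illustration.

```python
import itertools


class ProxyRotator:
    """Round-robin over a proxy pool, retiring proxies that fail repeatedly."""

    def __init__(self, proxies, max_failures=2):
        self.failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures
        self._cycle = itertools.cycle(list(proxies))

    def healthy(self):
        return [p for p, n in self.failures.items() if n < self.max_failures]

    def next_proxy(self):
        if not self.healthy():
            raise RuntimeError("no healthy proxies left in the pool")
        for proxy in self._cycle:
            if self.failures[proxy] < self.max_failures:
                return proxy

    def report(self, proxy, ok):
        # A success resets the counter; a failure moves the proxy toward retirement.
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1


def fetch_with_rotation(url, rotator, fetch, max_attempts=5):
    """Call fetch(url, proxy) through rotating proxies until one succeeds."""
    for _ in range(max_attempts):
        proxy = rotator.next_proxy()
        try:
            body = fetch(url, proxy)
        except OSError:
            rotator.report(proxy, ok=False)
            continue
        rotator.report(proxy, ok=True)
        return body, proxy
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}")


# Hypothetical pool; the first proxy is treated as blocked by the target.
POOL = ["http://10.0.0.1:8000", "http://10.0.0.2:8000", "http://10.0.0.3:8000"]


def fake_fetch(url, proxy):
    if proxy == POOL[0]:
        raise OSError("blocked")
    return f"<html>{url}</html>"


rotator = ProxyRotator(POOL)
used = [
    fetch_with_rotation("https://example.com/", rotator, fake_fetch)[1]
    for _ in range(4)
]
print(used)
print(rotator.healthy())
```

Even this small version shows why rotation alone is not enough: someone still has to decide failure thresholds, replace retired proxies, and notice when the whole pool is running dry.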
Data fetching and parsing
Data parsing is the process of making the acquired data understandable and usable. While creating a parser might sound easy, maintaining it over time causes big problems. Adapting to different page formats and website changes will be a constant struggle and will require your development team’s attention more often than you might expect.
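The maintenance burden is easier to see with a small example. The sketch below tries several extraction strategies in order, so that a price can still be parsed when a page switches from one markup style to another, and returns `None` when nothing matches so the failure can be flagged to a developer. The three regular expressions and the sample snippets are illustrative assumptions about how a page might expose a price, not rules taken from any real site.

```python
import re
from decimal import Decimal

# Ordered from most to least structured; all three patterns are
# illustrative assumptions, not rules for any particular website.
STRATEGIES = [
    ("json-ld", re.compile(r'"price"\s*:\s*"?(\d+(?:\.\d+)?)"?')),
    ("itemprop", re.compile(r'itemprop="price"[^>]*content="(\d+(?:\.\d+)?)"')),
    ("visible-text", re.compile(r"[$€£]\s*(\d+(?:\.\d{2})?)")),
]


def parse_price(html):
    """Return (price, strategy_name) from the first matching strategy, else None."""
    for name, pattern in STRATEGIES:
        match = pattern.search(html)
        if match:
            return Decimal(match.group(1)), name
    return None


# Hypothetical snippets showing the same data in three page formats.
OLD_LAYOUT = '<script type="application/ld+json">{"@type": "Offer", "price": "19.99"}</script>'
NEW_LAYOUT = '<meta itemprop="price" content="24.50">'
PLAIN_TEXT = '<span class="amount">$ 7.00</span>'
UNKNOWN = "<div>Out of stock</div>"

for page in (OLD_LAYOUT, NEW_LAYOUT, PLAIN_TEXT, UNKNOWN):
    print(parse_price(page))
```

Every new page format means another hand-written strategy, which is the kind of recurring work the rest of this article suggests handing over to learned models.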
As you can see, traditional web scraping comes with many challenges and requires a lot of manual labour, time, and resources. However, the bright side of computing is that almost everything can be automated. And as AI- and ML-powered web scraping emerges, future-proof large-scale data gathering becomes a more realistic goal.
Making web scraping future-proof
In what ways can AI and ML innovate and improve web scraping? According to Oxylabs Next-Gen Residential Proxy AI & ML advisory board member Jonas Kubilius, an AI researcher, Marie Sklodowska-Curie Alumnus, and Co-Founder of Three Thirds:
“There are recurring patterns in web content that are typically scraped, such as how prices are encoded and displayed, so in principle, ML should be able to learn to spot these patterns and extract the relevant information. The research challenge here is to learn models that generalize well across various websites or that can learn from a few human-provided examples. The engineering challenge is to scale up these solutions to realistic web scraping loads and pipelines.”
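To give a feel for "learning from a few human-provided examples", here is a deliberately tiny sketch: a 1-nearest-neighbour classifier over five hand-picked binary features that labels a text snippet as a price or not. This is a toy chosen for readability, not the approach described in the quote or used by any product; the feature set, the class name, and the six labelled examples are all assumptions made up for illustration.

```python
import re

CURRENCY = re.compile(r"[$€£]")
TWO_DECIMALS = re.compile(r"\d[.,]\d{2}")


def features(text):
    """Map a text snippet to a small binary feature vector (hand-picked)."""
    return (
        int(bool(CURRENCY.search(text))),            # has a currency symbol
        int(any(ch.isdigit() for ch in text)),       # has at least one digit
        int(bool(TWO_DECIMALS.search(text))),        # has a two-decimal amount
        int(sum(ch.isalpha() for ch in text) > 3),   # has more than three letters
        int(len(text) <= 12),                        # is short
    )


def hamming(a, b):
    """Number of positions at which two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


class NearestNeighbourPriceSpotter:
    """Labels a snippet with the label of its closest training example."""

    def __init__(self, labelled_examples):
        self.memory = [(features(text), label) for text, label in labelled_examples]

    def is_price(self, text):
        target = features(text)
        _vector, label = min(self.memory, key=lambda item: hamming(item[0], target))
        return label


# A handful of human-labelled snippets (hypothetical).
EXAMPLES = [
    ("$19.99", True),
    ("€ 5,00", True),
    ("£120", True),
    ("Free shipping on orders", False),
    ("Model 2020", False),
    ("4.5 stars", False),
]

spotter = NearestNeighbourPriceSpotter(EXAMPLES)
for snippet in ("$7.50", "Size 42", "USD 1,299.00"):
    print(snippet, spotter.is_price(snippet))
```

The last snippet has no currency symbol and matches no training example exactly, yet it still lands closest to the price examples. That small amount of generalization is the property the quote asks for; the research and engineering challenges it mentions start where a toy like this stops working.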
Instead of manually developing and managing scraper code for each new website and URL, creating an AI- and ML-powered solution will simplify the data gathering pipeline. Such a solution can take care of proxy pool management, parser maintenance, and other tedious work.
Not only do AI- and ML-powered solutions enable developers to build highly scalable data extraction tools, but they also enable data science teams to prototype rapidly. They can also serve as a backup for your existing custom-built code if it ever breaks.
What the future holds for web scraping
As we already established, creating fast data processing pipelines along with cutting edge ML techniques can offer an unparalleled competitive advantage in the web scraping community. And looking at today’s market, the implementation of AI and ML in data gathering has already started.
For this reason, Oxylabs is introducing Next-Gen Residential Proxies which are powered by the latest AI applications.
Next-Gen Residential Proxies were built with heavy-duty data retrieval operations in mind. They enable web data extraction without delays or errors. The product is as customizable as a regular proxy, but at the same time, it guarantees a much higher success rate and requires less maintenance. Custom headers and IP stickiness are both supported, alongside reusable cookies and POST requests. Its main benefits are:
- 100% success rate
- AI-Powered Dynamic Fingerprinting (CAPTCHA, block, and website change handling)
- Machine Learning based HTML parsing
- Easy integration (like any other proxy)
- Auto-Retry system
- JavaScript rendering
- Patented proxy rotation system
Going back to our previous web scraping value chain, you can see which parts of web scraping can be automated and improved with AI and ML-powered Next-Gen Residential Proxies.
Source: Oxylabs’ design team
The Next-Gen Residential Proxy solution automates almost the whole scraping process, making it a truly strong competitor for future-proof web scraping.
This project will be continuously developed and improved by Oxylabs’ in-house ML engineering team and a board of advisors (Jonas Kubilius, Adi Andrei, Pujaa Rajan, and Ali Chaudhry) who specialize in the fields of Artificial Intelligence and ML engineering.
Wrapping up
As the scale of web scraping projects increases, automating data gathering becomes a high priority for businesses that want to stay ahead of the competition. The improvement of AI algorithms in recent years, along with the increase in compute power and the growth of the talent pool, has made AI implementations possible in a number of industries, web scraping included.
Establishing AI- and ML-powered data gathering techniques offers a great competitive advantage in the industry and saves copious amounts of time and resources. It is the new future of large-scale web scraping, and a good head start on the development of future-proof solutions.
Translated from: https://towardsdatascience.com/the-new-beginnings-of-ai-powered-web-data-gathering-solutions-a8e95f5e1d3f