Labelling Unstructured Text Data in Python
Labelled data has been a crucial requirement for supervised machine learning, giving rise to a whole new industry. For unstructured text data, labelling is an expensive and time-consuming activity that requires custom-made techniques and rules to assign appropriate labels.
With the advent of state-of-the-art ML models and framework pipelines like TensorFlow and PyTorch, data science practitioners have come to rely on them for a wide range of problems. But these can only be consumed if provided with well-labelled training datasets, and the cost and quality of this activity are positively correlated with the involvement of subject-matter experts (SMEs). These constraints have turned practitioners' attention towards Weak Supervision: an alternative way of labelling training data that uses high-level supervision from SMEs and abstracts away noisier inputs using task-specific heuristics and regular-expression patterns. These techniques are employed in open-source labelling frameworks such as Snorkel (via labelling functions) and in paid proprietary tools such as Ground Truth, Dataturks, etc.
The solution proposed is for a Multinational Enterprise Information Technology client that develops a wide variety of hardware components as well as software-related services for consumers & businesses. They deploy a robust Service team that supports customers through after-sales services. The client recognized the need for an in-depth, automated, and near-real-time analysis of customer communication logs. This has several benefits such as enabling proactive identification of product shortcomings and pinpointing improvements in future product releases.
We developed a two-phase solution strategy to address the problem at hand.
The first task was a binary classification to segregate customer calls into Operating System (OS) and Non-Operating System (Non-OS) calls. Since labelled data was not available in this case, we resorted to regular expressions for this classification exercise. Using regex has the added benefit of labelling the data into its respective categories. In the second phase, we targeted the ‘Non-OS’ category to tag other features.
The stepwise solution approach is as follows:
Preprocessing:
1. Create a corpus of frequently used OS phrases and abbreviations (e.g. windows install, windows activation, deployment issue, windows, VMware)
2. Similarly, form a corpus of phrases and words that may occur alongside the OS phrases but indicate non-OS calls.
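For illustration, the two corpora can be kept as plain Python lists. The entries below are hypothetical examples drawn from the sample terms later in this article; the production lists are curated with domain input:

```python
# Hypothetical phrase corpora; the real lists are built with SME review of call logs.
os_phrases = [
    "windows install", "windows activation", "deployment issue",
    "os install", "no boot", "subscription", "rhel", "redhat", "vmware",
]
non_os_phrases = [
    "hw", "hardware", "disk error",  # configuration mentions pointing to non-OS issues
]
```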
Core steps:
1. Standard text cleaning procedures such as:
a) Convert text to lower case
b) Remove multiple spaces
c) Remove punctuation and special characters
d) Remove non-ASCII characters
2. In the first search pass, identify OS-related words and phrases to tag the relevant calls as OS calls
3. In the second search pass, identify non-OS-related words and phrases to tag calls related to features other than operating systems. This is needed because most call logs record the system configuration, which could otherwise cause calls to be falsely tagged as OS calls
Details for the phrase and word search:
a) Create a dictionary over all the text, with each row's text split into words; save the list of words as the dictionary value keyed by the text or its unique id.
b) Now split each phrase of the corpus into words and search for each word of the phrase in each element of the dictionary. If all the words of the phrase are present in a given element of the dictionary, tag the respective text or unique id accordingly.
c) Similarly, search all the text for the single words in the corpus and tag successfully matched calls accordingly.
Code Snippets
Text cleaning:
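The original snippet here was published as an image; a minimal re-implementation of the four cleaning steps listed above might look like this:

```python
import re

def clean_text(text: str) -> str:
    """Apply the standard cleaning steps: lower-case, strip non-ASCII,
    remove punctuation/special characters, collapse multiple spaces."""
    text = text.lower()                              # a) convert to lower case
    text = text.encode("ascii", "ignore").decode()   # d) drop non-ASCII characters
    text = re.sub(r"[^a-z0-9\s]", " ", text)         # c) punctuation & special chars
    text = re.sub(r"\s+", " ", text).strip()         # b) collapse multiple spaces
    return text
```

For example, `clean_text("Windows™  Install FAILED!!")` returns `"windows install failed"`.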
Phrase search:
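The phrase-search snippet was likewise an image; the sketch below follows the dictionary-based search described in steps a)–c) above (function and variable names are my own):

```python
def build_word_dict(rows):
    """Step a): map each unique id to the set of words in its cleaned text."""
    return {uid: set(text.split()) for uid, text in rows.items()}

def tag_calls(word_dict, phrases, tag):
    """Steps b)/c): tag a call when every word of any corpus phrase
    (or the single corpus word itself) appears in the call's text."""
    tags = {}
    for uid, words in word_dict.items():
        if any(set(phrase.split()) <= words for phrase in phrases):
            tags[uid] = tag
    return tags
```

For instance, with `rows = {1: "customer cannot install windows after update", 2: "disk error reported on drive"}`, calling `tag_calls(build_word_dict(rows), ["windows install"], "OS")` returns `{1: "OS"}`: the phrase words need not be adjacent, only present.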
Limitations
1. Currently, the text is searched only for the phrases of a single product, and calls are tagged accordingly. As an improvement, phrases for multiple products could be included and the calls tagged in a similar fashion.
2. We can also add translation for foreign-language logs and checks for spelling mistakes.
3. Domain experts can help in creating an exclusive set of words and phrases for each product which can make the product more customizable for different industry segments.
3.領(lǐng)域?qū)<铱梢詭椭鸀槊總€(gè)產(chǎn)品創(chuàng)建一組專有的單詞和短語(yǔ),這可以使產(chǎn)品針對(duì)不同的行業(yè)細(xì)分而更加可定制。
Sample search results
1. OS Terms: RHEL, RedHat, OS install, no boot, subscription
2. Non-OS Terms: HW (Hardware), Disk Error
Proposed Future Enhancements
1. The labelled training data can be used to train an NLP-based binary classification model that classifies the call logs into OS and Non-OS classes.
2. Textual data needs to be converted into vectorized form, which can be achieved by using word embeddings for each token in the sentence. We can use pre-trained open-source embeddings like FastText, BERT, GloVe, etc.
3. State-of-the-art models such as neural networks can be used for the classification task, with RNN/GRU/LSTM layers learning representations of the text sequences.
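As a toy, dependency-free illustration of such a pipeline, the sketch below substitutes a plain bag-of-words vector and a nearest-centroid rule for the pretrained embeddings and RNN/GRU/LSTM layers suggested above (which a production model would use instead):

```python
from collections import Counter
import math

def bow_vector(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary (stand-in for embeddings)."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_centroid_predict(train, labels, vocab, text):
    """Classify text by cosine similarity to the centroid of each class's vectors."""
    centroids = {}
    for lbl in set(labels):
        vecs = [bow_vector(t, vocab) for t, l in zip(train, labels) if l == lbl]
        centroids[lbl] = [sum(col) / len(vecs) for col in zip(*vecs)]
    v = bow_vector(text, vocab)
    return max(centroids, key=lambda lbl: cosine(centroids[lbl], v))
```

Trained on a handful of regex-labelled logs, this already separates "windows install error" (OS) from "disk error" (Non-OS); real embeddings and a sequence model would replace both the vectorizer and the classifier.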
Translated from: https://towardsdatascience.com/labelling-unstructured-text-data-in-python-974e809b98d9