25 Python Text Processing Examples, Carefully Curated. Bookmark This!
Processing text is one of the most common things Python is used for. This article pulls together a wide range of text-extraction and NLP-related examples, and a lot of care went into them. It is a long read, so bear with it; if you can't, bookmark it, because you will need it sooner or later!
Extract PDF content
Extract Word content
Extract web page content
Read JSON data
Read CSV data
Remove punctuation from a string
Remove stop words with NLTK
Correct spelling with TextBlob
Word tokenization with NLTK and TextBlob
Stem the words of a sentence or phrase with NLTK
Lemmatize a sentence or phrase with NLTK
Find the frequency of each word in a text file with NLTK
Create a word cloud from a corpus
Lexical dispersion plot with NLTK
Convert text to numbers with CountVectorizer
Create a document-term matrix with TF-IDF
Generate N-grams for a given sentence
Vocabulary specification with sklearn CountVectorizer and bigrams
Extract noun phrases with TextBlob
How to compute a word-word co-occurrence matrix
Sentiment analysis with TextBlob
Language translation with Goslate
Language detection and translation with TextBlob
Get definitions and synonyms with TextBlob
Get a list of antonyms with TextBlob
1. Extract PDF Content
# pip install PyPDF2  (install PyPDF2)
import PyPDF2
from PyPDF2 import PdfFileReader

# Creating a pdf file object.
pdf = open("test.pdf", "rb")

# Creating pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking total number of pages in a pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Creating a page object.
page = pdf_reader.getPage(200)

# Extract data from a specific page number.
print(page.extractText())

# Closing the object.
pdf.close()
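Note that PdfFileReader, numPages, getPage and extractText belong to the legacy PyPDF2 1.x API. If the snippet above fails on a current install, here is a minimal sketch against pypdf, the maintained successor (assuming pip install pypdf):

from pypdf import PdfReader  # maintained successor to PyPDF2

reader = PdfReader("test.pdf")
print("Total number of Pages:", len(reader.pages))

# Pages are a plain list now; extract text from the first page.
print(reader.pages[0].extract_text())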
2. Extract Word Content

# pip install python-docx  (install python-docx)
import docx


def main():
    try:
        # Creating word reader object.
        doc = docx.Document('test.docx')
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()

3. Extract Web Page Content
# pip install bs4  (install bs4)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parsing
soup = BeautifulSoup(webpage, 'html.parser')

# Formatting the parsed html file
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag value
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag value
for x in soup.find_all('p'):
    print(x.text)
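If you prefer requests over urllib, the fetch step looks like this; this is only a sketch of the download (assuming pip install requests), and the BeautifulSoup parsing above is unchanged:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
                 headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.string)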
4. Read JSON Data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump data as string
data = json.dumps(res)
print(data)
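The same json module also reads local files. A minimal sketch, assuming you have saved the payload above to a local file named example_2.json:

import json

with open('example_2.json', 'r', encoding='utf-8') as f:
    res = json.load(f)  # parse the file into Python dicts/lists

print(res['quiz']['sport'])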
5. Read CSV Data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip first row
    for row in reader:
        print(row)
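When the CSV has a header row, csv.DictReader is often handier: it consumes the header itself and lets you access columns by name instead of position. A sketch against the same hypothetical test.csv:

import csv

with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)  # the header row becomes the keys
    for row in reader:
        print(row)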
6. Remove Punctuation from a String

import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!"

# Method 1: Regex
# Remove the special characters from the read string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)

7. Remove Stop Words with NLTK
from nltk.corpus import stopwords

# If the stopwords corpus is missing, run once: import nltk; nltk.download('stopwords')

data = ['Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!']

# Remove stop words
stopwords = set(stopwords.words('english'))

output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)

8. Correct Spelling with TextBlob
from textblob import TextBlob

data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."

output = TextBlob(data).correct()
print(output)

9. Word Tokenization with NLTK and TextBlob
import nltk
from textblob import TextBlob

# If the tokenizer data is missing, run once: nltk.download('punkt')

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)

Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']

10. Stem the Words of a Sentence or Phrase with NLTK
from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
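PorterStemmer is the classic choice; NLTK also ships SnowballStemmer (the "Porter2" algorithm), which supports several languages and is usually preferred today. A quick sketch on similar words:

from nltk.stem import SnowballStemmer

st2 = SnowballStemmer('english')
print(st2.stem('dancing'), st2.stem('danced'), st2.stem('dancer'))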
11. Lemmatize a Sentence or Phrase with NLTK

from nltk.stem import WordNetLemmatizer

# If the WordNet corpus is missing, run once: import nltk; nltk.download('wordnet')

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))

print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))

Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy

12. Find the Frequency of Each Word in a Text File with NLTK
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)

data_analysis.plot(25, cumulative=False)

Output:
[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...

13. Create a Word Cloud from a Corpus
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plotting the wordcloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

14. Lexical Dispersion Plot with NLTK
import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx")
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
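NLTK can draw this kind of chart itself: wrapping the token sequence in nltk.Text exposes dispersion_plot(). A minimal sketch, assuming matplotlib is installed:

import nltk
from nltk.corpus import webtext

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

# nltk.Text wraps a token sequence and provides corpus-exploration helpers.
nltk.Text(wt_words).dispersion_plot(['data', 'science', 'dataset'])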
15. Convert Text to Numbers with CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2

# Change column headers
df2.columns = df1.columns
print(df2)

Output:
             Go  Java  Python
and           2     2       2
application   0     1       0
are           1     0       1
bytecode      0     1       0
can           0     1       0
code          0     1       0
comes         1     0       1
compiled      0     1       0
derived       0     1       0
develops      0     1       0
for           0     2       0
from          0     1       0
functional    1     0       1
imperative    1     0       1
...

16. Create a Document-Term Matrix with TF-IDF
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2

# Change column headers
df2.columns = df1.columns
print(df2)

Output:
                   Go      Java    Python
and          0.323751  0.137553  0.323751
application  0.000000  0.116449  0.000000
are          0.208444  0.000000  0.208444
bytecode     0.000000  0.116449  0.000000
can          0.000000  0.116449  0.000000
code         0.000000  0.116449  0.000000
comes        0.208444  0.000000  0.208444
compiled     0.000000  0.116449  0.000000
derived      0.000000  0.116449  0.000000
develops     0.000000  0.116449  0.000000
for          0.000000  0.232898  0.000000
...

17. Generate N-grams for a Given Sentence
NLTK
import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

TextBlob
from textblob import TextBlob

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

Output:
1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']

18. Vocabulary Specification with sklearn CountVectorizer and Bigrams
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize with bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2

# Change column headers
df2.columns = df1.columns
print(df2)

Output:
                   Assembly  Machine
also either               0        1
and or                    0        1
are also                  0        1
are readable              1        0
are still                 1        0
assembly language         5        0
because each              1        0
but difficult             0        1
by computers              0        1
by people                 0        1
can execute               0        1
...

19. Extract Noun Phrases with TextBlob
from textblob import TextBlob

# Extract noun phrases
blob = TextBlob("Canada is a country in the northern part of North America.")

for nouns in blob.noun_phrases:
    print(nouns)

Output:
canada
northern part
america

20. How to Compute a Word-Word Co-occurrence Matrix
import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd


def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count

    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index
    return co_occurrence_matrix, vocab_index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one list using many lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index,
                           columns=vocab_index)
print(data_matrix)

Output:
            best  use  What  Where  ...   in   is  Python  used
best         0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   1.0
use          0.0  0.0   0.0    0.0  ...  0.0  1.0     0.0   0.0
What         1.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   0.0
Where        0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   0.0
Pythonused   0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   1.0
Why          0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   1.0
companies    0.0  1.0   0.0    1.0  ...  1.0  0.0     0.0   0.0
in           0.0  0.0   0.0    0.0  ...  0.0  0.0     1.0   0.0
is           0.0  0.0   1.0    0.0  ...  0.0  0.0     0.0   0.0
Python       0.0  0.0   0.0    0.0  ...  0.0  0.0     0.0   0.0
used         0.0  0.0   1.0    0.0  ...  0.0  0.0     0.0   0.0

[11 rows x 11 columns]
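The matrix above counts adjacent pairs (bigrams). A different, document-level notion of co-occurrence: build a term-document matrix with CountVectorizer and multiply it by its transpose, so cell (i, j) counts how often words i and j appear in the same document. A sketch, assuming a scikit-learn version recent enough for get_feature_names_out:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Where Python is used', 'What is Python used in',
        'Why Python is best', 'What companies use Python']

vec = CountVectorizer()
X = vec.fit_transform(docs)          # term-document counts
co = (X.T @ X).toarray()             # word-word co-occurrence within documents
names = vec.get_feature_names_out()

# With 0/1 counts, the diagonal is simply each word's document frequency.
print(pd.DataFrame(co, index=names, columns=names))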
21. Sentiment Analysis with TextBlob

from textblob import TextBlob


def sentiment(polarity):
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")


blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
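TextBlob's analyzer is one option; NLTK bundles the rule-based VADER analyzer, which is tuned for short, informal text. A minimal sketch (it needs the vader_lexicon resource, downloaded once):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the lexicon

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The movie was excellent!"))   # compound > 0 means positive
print(sia.polarity_scores("The movie was ridiculous."))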
22. Language Translation with Goslate

# pip install goslate  (install goslate)
import goslate

text = "Comment vas-tu ?"

gs = goslate.Goslate()

translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)
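Goslate scrapes an unofficial Google endpoint and tends to break whenever that endpoint changes. If it no longer works for you, deep-translator is a commonly used, maintained alternative; a sketch assuming pip install deep-translator (where 'zh-CN' is its code for Simplified Chinese):

from deep_translator import GoogleTranslator

text = "Comment vas-tu ?"

print(GoogleTranslator(source='auto', target='en').translate(text))
print(GoogleTranslator(source='auto', target='zh-CN').translate(text))
print(GoogleTranslator(source='auto', target='de').translate(text))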
23. Language Detection and Translation with TextBlob

from textblob import TextBlob

blob = TextBlob("Comment vas-tu ?")

print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))

Output:
fr
¿Como estas tu?
How are you?
你好吗?
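One caveat: recent TextBlob releases removed detect_language() and translate(), which also relied on Google's endpoint. For the detection half, the langdetect package is a simple stand-in (assuming pip install langdetect):

from langdetect import detect

print(detect("Comment vas-tu ?"))  # expected: 'fr'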
24. Get Definitions and Synonyms with TextBlob

from textblob import Word

# If the WordNet corpus is missing, run once: import nltk; nltk.download('wordnet')

text_word = Word('safe')

print(text_word.definitions)

synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)

Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}

25. Get a List of Antonyms with TextBlob
from textblob import Word

text_word = Word('safe')

antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)

Output:
{'dangerous', 'out'}

- END -