re and word clouds

Published: 2024/3/13


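Regular expressions:

re.S makes . match \n as well; re.I makes the pattern case-insensitive; re.X ignores whitespace in the pattern and any comment after #; re.M changes ^ and $ from matching the start and end of the whole text to matching the start and end of each line.

Example: remove the alex at the end of each line, ignoring case:

import re
s='''ja654alEx
runAlex87
90helloaLeX'''
m=re.sub('alex$','',s,count=0,flags=re.M|re.I)
print(m)
******************divider*******************
A pattern that contains a backslash followed by a regex letter (such as \w), or a backslash followed by a punctuation mark that should be taken literally, often still works without the r prefix. That only holds because Python passes unknown escapes through unchanged, which newer versions warn about, and '\b' without r is a backspace character. A literal backslash (\\) and a backreference to the num-th () such as \2 always need r, so it is simplest to write every pattern as a raw string. To match any of several words, use …(word_a|word_b|word_c)…, that is, wrap the | alternatives in ().

Example: extract names:

import re
pattern=re.compile('((Bob|Jerry|Tom) Lee)')
s='Jerry Lee666；Tom Lee+-*/、Bob Lee HELLO'
r0=[item[0] for item in pattern.findall(s)]    #extract every name
r1=pattern.findall(s)[0][0]    #these 3 lines each extract only the first name; findall needs two subscripts here
r2=pattern.match(s).groups()[0]
r3=pattern.match(s).group(1)
print(r0,r1,r2,r3,sep='\n')
******************divider*******************
Lookaround: after the target, (?=666) requires 666 to follow and (?!666) requires it not to follow; before the target, (?<=666) requires 666 to precede and (?<!666) requires it not to precede. Python's lookbehind only accepts fixed-width patterns, so quantifiers such as + * ? cannot be used inside it; avoid lookbehind where possible. The () of a lookaround does not take up a \num group number.

\b marks the start or end of a word. Example: the word res when it is not immediately followed by ￥: \bres\b(?!￥)

Example: add the prefix ids_ to every key of a JSON string:

import re
s='{"name" :"jerry", "age":32}'
s1=re.sub(r'(\w+?"\s*:)',r'ids_\1',s)    #re.sub does not accept \0, so wrap the pattern in () and use \1 for the whole match
s2=re.sub(r'(?<=")(\w+?)(?="\s*:)',r'ids_\1',s)
******************divider*******************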
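As a quick check of the flags and the key-prefix example above, here is a minimal sketch that uses only the standard re module. The sample strings are made up for illustration, and \g<0> is shown as one way to refer to the whole match in the replacement without adding an extra ().

```python
import re

text = 'first line\nsecond LINE'

# re.S: '.' also matches the newline, so the match can span both lines
spans_lines = re.search(r'first.*LINE', text, flags=re.S) is not None

# re.M: '$' matches at the end of every line; re.I ignores case
line_ends = re.findall(r'line$', text, flags=re.M | re.I)

# re.X: whitespace and '#' comments inside the pattern are ignored
number = re.compile(r'''
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
''', re.X)
price = number.search('cost: 12.50 yuan').group(0)

# \g<0> in the replacement stands for the whole match
s = '{"name" :"jerry", "age":32}'
prefixed = re.sub(r'\w+?"\s*:', r'ids_\g<0>', s)

print(spans_lines, line_ends, price, prefixed, sep='\n')
```

Combining flags with | rather than + is the more common spelling, since it stays correct even if the same flag is listed twice.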
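(): the content of the 3rd () in the search field or pattern is referenced in the replacement field as …$3… in PyCharm; VBA code and the replacement expressions of regex tools also use $. EditPlus uses …\3… instead, and so does regex in Python code, e.g. re.sub(r'…',r'*\num*',text). Group numbering starts at 1 and runs from outer to inner and from left to right. PyCharm's $0 and EditPlus's \0, on the other hand, stand for the entire match and are unrelated to any ().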
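The numbering rule can be checked in Python with a small sketch; the date pattern and the sample string are made up for illustration.

```python
import re

# Four groups: numbers follow the position of each opening parenthesis,
# so the outer group is 1 and the groups nested inside it are 2 and 3
pattern = re.compile(r'((\d{4})-(\d{2}))-(\d{2})')
m = pattern.match('2024-03-13')

whole = m.group(0)    # the entire match, independent of any ()
groups = m.groups()   # groups 1..4, outer before inner, left to right

# In the replacement, \3 refers to the third group (the month here)
swapped = pattern.sub(r'\4/\3/\2', '2024-03-13')

print(whole, groups, swapped, sep='\n')
```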
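In re.sub(*), the whole text matched by pattern is available in repl as lambda x:x[0], or in the longer form lambda x:x.group(0), which mirrors re.search(*).group(0). The shortest spelling would be r'\0', but \0 is read as \x00 and shows up as □, so it cannot be used; either wrap the pattern in () and use \1 for the whole match, or write \g<0>. When repl has to process the matched text after it is found, for example to run a replacement on it or to use it as a dict key, \num is no help and a function is needed.

Example: replace every a in the text of some HTML with ǎ, leaving the tags untouched:

import re
html='<div><a href="url">1a</a>1b 1c 1d 1a 1b 1d a1</div>'
r1=re.sub('a(?=[^<>]*?<)','ǎ',html)
r2=re.sub('>.*?<',lambda ptn:ptn[0].replace('a','ǎ'),html)

Example: use each item matched by the () as a dict key, and substitute the corresponding value for the whole match:

html='&w1;&as2;&d3f;&zx4y;'
d=dict(w1=2,as2='0',d3f='1',zx4y=8)
pattern=re.compile('&([a-z0-9]+?);')
html=pattern.sub(lambda ptn:str(d.get(ptn[1])),html)    #d[r'\1'] does not work
******************divider*******************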
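One detail of the dict-lookup example is worth a sketch: str(d.get(key)) writes the text None for a key that is missing from the dict. The variant below is an assumption about what is usually wanted rather than the original author's code; the helper name lookup and the missing key are hypothetical. It leaves unknown entities as they were.

```python
import re

entities = {'w1': 2, 'as2': '0', 'd3f': '1'}   # 'zx4y' is deliberately missing

def lookup(match):
    """Return the mapped value, or the original text when the key is unknown."""
    key = match[1]                  # text captured by the first ()
    if key in entities:
        return str(entities[key])
    return match[0]                 # whole match, left untouched

result = re.sub(r'&([a-z0-9]+?);', lookup, '&w1;&as2;&d3f;&zx4y;')
print(result)
```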
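(?:xy)+: treats the xy inside the () as a single unit without creating a \num group, so re.findall returns the whole match rather than only what this () matched. A pattern can also use \num+ itself, for example to match a word in the num-th () that repeats.

Example: collapse consecutive repeated words:

#in PyCharm's find-and-replace fields: replace (word)\1+ with $1
import re
s='wwwwert啦啦+-*/666嘿PythonPython'  #the inner () that matches the repeated word is the 2nd (), hence \2+
提取連續出現的詞=[x[0] for x in re.findall(r'((.+?)\2+)',s)]    #the repeated runs
提取去重之后的詞=re.findall(r'(.+?)\1+',s)    #the words after collapsing
****************************************divider****************************************
Print a classical poem vertically:

import re,itertools
poetry='鵝鵝鵝，曲項向天歌。。白毛浮綠水，，，紅掌撥清波~~！！'
t=re.split('[^一-龥]+',re.sub('[^一-龥]+$','',poetry))  #strip the trailing non-Chinese characters, then split on non-Chinese characters
t.reverse()
print(t)    #zip takes the items at the same position from each element of its arguments and groups them; missing positions are filled with the fill value
[print('   '.join(x)) for x in itertools.zip_longest(*t,fillvalue='')]
#[print(y,end='   ') if y!=x[-1] else print(y) for x in itertools.zip_longest(*t,fillvalue='') for y in x]
******************divider*******************
① To pass the elements of a parent sequence to zip as separate arguments, write *name_of_parent_sequence.
② Nested for loops in a comprehension: the outer for comes first, the inner for second.
③ In a list comprehension, an if written before the for is a conditional expression and needs an else (else 0 or similar); an if written after the for is a filter and does not. do_A if cond else do_B can also be written cond and do_A or do_B, but the two only agree when do_A evaluates to a truthy value.
******************divider*******************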
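Two of the notes above can be checked with a short sketch on made-up data: how zip_longest differs from zip when the inputs have different lengths, and where the and/or spelling of a conditional stops matching the if/else form.

```python
from itertools import zip_longest

lines = ['abc', 'de', 'f']

# zip stops at the shortest input; zip_longest pads the shorter ones
short = [''.join(col) for col in zip(*lines)]
padded = [''.join(col) for col in zip_longest(*lines, fillvalue=' ')]

# 'A if cond else B' and 'cond and A or B' differ when A is falsy
value = 0
ternary = value if True else -1
and_or = True and value or -1

print(short, padded, ternary, and_or)
```

Because of the last two lines, the if/else form is the safer default whenever the first branch can be 0, '' or None.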
Multiplication table:

1*1=1
1*2=2 2*2=4
………………

[print(f'{x}*{y}={x*y}',end=(' ' if x<y else '\n')) for y in range(1,10) for x in range(1,10) if x<=y]
#[print('%s*%s=%s' %(x,y,x*y),end=(' ' if x<y else '\n')) for y in range(1,10) for x in range(1,10) if x<=y]
****************************************divider****************************************
Replacing or extracting several keywords with flashtext: if two or more Chinese keywords, or a Chinese keyword and an English word, appear back to back in the text with no space or other separator between them, the later one is ignored.

from flashtext import KeywordProcessor
kp=KeywordProcessor()   #the case-sensitivity parameter defaults to False
def Egの大雜燴():    #a mixed bag of examples
    kp.remove_keywords_from_list(list(kp.get_all_keywords().keys()))
    kp.add_non_word_boundary('、')   #joins with the Chinese characters or words on its left and right to form one new word
    olds='abcd eFg higk lmn opQ rst'.split()    #if the matching new value is '' or None or missing, old maps to itself
    news=f'子丑寅卯辰巳午未 {""} 申酉戌亥'.split(' ')
    [kp.add_keyword(old,new) for old,new in zip(olds,news)]    #many-to-many
    kp.add_keywords_from_dict(dict(秦=['甲','乙','丙丁'],唐宋=['6']))    #many-to-one
    replace=kp.replace_keywords('乙甲乙hello,EFG OPQ rSt 6') #or .extract_keywords
    print(replace)
Egの大雜燴()
*******divider*******
def Egの去除文件中的若干關鍵詞(path):    #remove a set of keywords from a file
    kp.remove_keywords_from_list(list(kp.get_all_keywords().keys()))
    kp.add_keywords_from_dict({'乄乄':['的','了','是','有','在','不','子','個','世界']})
    with open(path) as f:
        result=kp.replace_keywords(f.read()).replace('乄乄','')
    with open(path,'w') as f:
        f.write(result)
Egの去除文件中的若干關鍵詞('E:/龍符.txt')
****************************************divider****************************************
Chinese characters → pinyin:

from xpinyin import Pinyin
py=Pinyin()
s='朝辭白帝彩云間，千里江陵一日還。兩岸猿聲啼不住，輕舟已過萬重山。'
拼音=py.get_pinyin(s,' ',True,'upper')    #defaults for arguments 2 to 4: separator -, no tone marks, lowercase (the form of the tone argument may differ between xpinyin versions)
首字母=py.get_initials('你好啊','|').lower()  #initials; default separator -, uppercase
******************divider*******************
Classification: was a line of verse written by Li Bai or by Du Fu?

python -m pip install textblob -i https://pypi.douban.com/simple/
python -m textblob.download_corpora

import jieba,os
from textblob.classifiers import NaiveBayesClassifier
def handleJieba(string):    #segment with jieba and drop common punctuation
    result=list(jieba.cut(string,True))
    for noise in ['，','。','\n','']:
        while noise in result:
            result.remove(noise)
    return result
def materialTrain(comparedFiles=()):   #use the collected poems of Li Bai, Du Fu and others as training material
    files={};train=[]
    for txt in comparedFiles:
        name=os.path.basename(txt).split('.')[0]
        files.update({name:0})
        with open(txt,encoding='utf8') as f:
            result=handleJieba(f.read())
        train.extend((word,name) for word in result)
    classifier=NaiveBayesClassifier(train)
    makeDecisions(files,classifier)
def makeDecisions(files,classifier):    #the final decision
    words=handleJieba(input('請輸入一句詩詞：'))
    for word in words:
        classifyResult=classifier.classify(word)
        if classifyResult in files:
            files[classifyResult]+=1
    for name in files:
        print(f'{name}的概率：%0.2f%%' %(files[name]/len(words)*100))
comparedFiles=['E:/李白.txt','E:/杜甫.txt']
materialTrain(comparedFiles)
******************divider*******************
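The per-word voting done in makeDecisions can be sketched without jieba or textblob. Everything below is illustrative: toy_model and classify are hypothetical stand-ins for a trained classifier, and the guard against empty input is an addition that the original code does not have.

```python
# Hypothetical stand-in for classifier.classify: maps a word to an author label
toy_model = {'月': '李白', '酒': '李白', '國': '杜甫', '秋': '杜甫'}

def classify(word):
    return toy_model.get(word)          # None when the word is unknown

def vote(words, labels):
    """Share of words attributed to each label, as percentages."""
    counts = dict.fromkeys(labels, 0)
    for word in words:
        label = classify(word)
        if label in counts:
            counts[label] += 1
    total = len(words) or 1             # avoid dividing by zero on empty input
    return {label: counts[label] / total * 100 for label in labels}

shares = vote(['月', '酒', '秋', '山'], ['李白', '杜甫'])
print(shares)
```

Since unknown words count toward the total but toward no author, the shares need not add up to 100.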
Short-text classification tool (tgrocery): currently supports Python 2 only.

from tgrocery import Grocery
gc=Grocery('短文本分類工具')
train=[('education', '名師指導托福語法技巧：名詞的復數形式'),
    ('education', '中國高考成績海外認可 是狼來了嗎？'),
    ('sports', '圖文：法網孟菲爾斯苦戰進16強 孟菲爾斯怒吼'),
    ('sports', '四川成都舉行全國長距登山挑戰賽 近萬人參與'),]
gc.train(train) #a list whose items are (category label, text) tuples; the 2nd argument, delimiter, defaults to tab and applies when a file path is given
#gc.train('E:/train.txt')    #a file path: each line (category label + tab + text) is one training sample
gc.save()   #creates a local folder named after the first argument given when the class was initialized
gc=Grocery('短文本分類工具')   #load the model again, then predict or test
gc.load()
問答=gc.predict('考生必讀：新托福寫作考試評分標準')    #predict a single text
test=[('education', '福建春招考試報名18日截止 2月6日考試'),
    ('sports', '意甲首輪補賽交戰記錄:國米10年連勝'),]
判斷=gc.test(test)    #evaluate against labeled samples
****************************************divider****************************************
jieba segmentation & word clouds: two of jieba's less useful methods, .cut(s) and .tokenize(s):

import jieba
s='qw ert qwe rt'
無意義的原始分詞結果 = list(jieba.cut(s))[:50]    #the raw segmentation result, of little use by itself
各單詞起閉止開的索引=list(jieba.tokenize(s))    #each word with its start (inclusive) and end (exclusive) index
for word in 各單詞起閉止開的索引:
    if 'er' in word[0]:    #position of the first occurrence of a word
        print(word);break
******************divider*******************
jieba.analyse:
.extract_tags(*): runs jieba.cut(s), drops meaningless words, tallies the rest, then takes the top N in order. The first argument is a str: for set_stop_words it is the path of a file with additional custom stop words, for extract_tags it is the text to analyze.

Example: build a word cloud for a novel:

from jieba.analyse import set_stop_words,extract_tags
from wordcloud import WordCloud,STOPWORDS as sw
import numpy as np
from PIL import Image
#① extract high-frequency Chinese words with jieba:
stopWords='D:/中文停用詞表.txt'   #one stop word per line
txtFile='F:/New Download/example.txt'
with open(txtFile) as f:
    sentence=f.read()
set_stop_words(stopWords)   #jieba filtering: custom Chinese stop words
#words=extract_tags(sentence,50)
#text=' '.join(words)
#for most word-cloud sites an ordered list is enough; wc.generate_from_frequencies({*}) and the like also need the frequencies
words=extract_tags(sentence,topK=50,withWeight=True)
frequencies={word[0]:int(word[1]*1000) for word in words}
#② feed the {word:frequency,} data into the word cloud:
backImg='F:/New Download/background.png'
bg=np.array(Image.open(backImg))    #bg=scipy.misc.imread(backImg)
wc=WordCloud('simhei.ttf',mask=bg,max_font_size=81) #the mask is an ndarray
#wc.stopwords=sw|set(open(stopWords).readlines())   #word-cloud filtering: built-in English plus custom Chinese stop words
#wc.generate_from_frequencies({word1:frequency1,}), wc.generate('words separated by spaces')
wc.generate_from_frequencies(frequencies)   #wc.generate(text)
#③ show the image: method 1 uses the Image library with the image path str, method 2 uses plt with the WordCloud object
saveImg='F:/New Download/result.jpg'
wc.to_file(saveImg)
Image.open(saveImg).show()
#import matplotlib.pyplot as plt
#plt.imshow(wc)
#plt.axis('off')
#plt.savefig(saveImg,dpi=240,bbox_inches='tight')
#plt.show()
******************divider*******************
Using the word-cloud site https://wordart.com/create (wait a few seconds after it loads). On the left: WORDS > Import → paste the words in order (tick both Remove options; if the words come with frequencies separated by \t, tick CSV) → SHAPES: pick an image → FONTS > Add font (e.g. a local YaHei font; none of the fonts the site provides support Chinese) → LAYOUT: set the font slant → STYLE > Custom: give the words varied colors → Visualize at the upper right → DOWNLOAD at the top right (with Chrome set to use its built-in downloader).
****************************************分割線****************************************
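The {word: frequency} dict that wc.generate_from_frequencies expects can also be built from a plain token list with the standard library. This is a minimal sketch on made-up tokens, not what extract_tags does internally (it returns weights rather than raw counts); the function name word_frequencies is hypothetical.

```python
from collections import Counter

def word_frequencies(words, stop_words=(), top_k=50):
    """Count words, skip stop words, and keep the top_k most common."""
    stop = set(stop_words)
    counts = Counter(w for w in words if w and w not in stop)
    return dict(counts.most_common(top_k))

# Made-up token list; in practice it would come from a tokenizer such as jieba.cut
tokens = ['雪', '的', '山', '雪', '山', '雪', '了']
frequencies = word_frequencies(tokens, stop_words=['的', '了'], top_k=2)
print(frequencies)
```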
Combine two images (their size and mode must match) into one new image:

from PIL import Image

backGround=Image.open('F:/666.png').convert('RGBA')
img=Image.open('F:/1.jpg').resize(backGround.size).convert('RGBA')
# backGround.paste(img,(0,40))    #combine at each image's own size
# backGround.show()  #.save('F:/result.png')
# result=Image.blend(img,backGround,0.2)  #combine by transparency
result=Image.alpha_composite(img,backGround)    #the background is a transparent png
result.show()
******************divider*******************
Example: arrange many images evenly inside one square:

import glob,random,math
from PIL import Image
def mixImages(totalSize=640):
    images=glob.glob(imagesFolderPath+'/*.jpg')
    totalNum=len(images)
    vnum=math.ceil(math.sqrt(totalNum))  #number of rows: ceiling of the square root of the total
    hnum1=math.ceil(totalNum/vnum)   #images per row except the last row, e.g. 2 for 5 or 6 images, 3 for 7 to 12
    frontNum=hnum1*(vnum-1)
    vsize=int(totalSize/vnum)
    hsize1=int(totalSize/hnum1);hsize2=int(totalSize/(totalNum-frontNum))
    sizes=[(hsize1,vsize) if n<frontNum else (hsize2,vsize) for n in range(totalNum)]
    #4-channel RGBA png images and 3-channel RGB bmp and jpg images can all be pasted onto the canvas
    toImage=Image.new('RGBA',(totalSize,totalSize))
    x=0;y=0 #canvas cursor
    random.shuffle(images)
    for index,name in enumerate(images):
        img=Image.open(name).resize(sizes[index])
        toImage.paste(img,(sizes[index][0]*x,vsize*y))
        x+=1
        if x==hnum1:
            x=0;y+=1
    toImage.show()
    r,g,b=toImage.split()[:3]
    toImage=Image.merge('RGB',(r,g,b))
    toImage.save('D:/合成圖.jpg')
if __name__ == '__main__':
    imagesFolderPath='D:/待整合的各圖'
    mixImages(totalSize=720)
******************divider*******************
Example: pad each image to a square and cut it evenly into 9 pieces:

from PIL import Image
import os,pathlib
def fill(originalImage): #paste the original image centered on a square canvas
    width,height=originalImage.size
    newLength=max(originalImage.size)
    newImage=Image.new(originalImage.mode,(newLength,newLength),color='white')
    leftup=(int((newLength-width)/2),0) if width<height else (0,int((newLength-height)/2))
    newImage.paste(originalImage,leftup)
    newImage.save(newFilePath)
    return newImage
def cut(newImage):  #cut the new square image evenly into 9 pieces
    width,height=newImage.size
    pieceLenth=int(width/3)
    pieces=[]
    for y in range(0,3):
        for x in range(0,3):
            piece=(x*pieceLenth,y*pieceLenth,(x+1)*pieceLenth,(y+1)*pieceLenth)
            pieces.append(newImage.crop(piece))
    return pieces
def save(pieces):   #save the 9 pieces cut from the new square image
    for index,piece in enumerate(pieces):
        piece.save(newFilePath.replace('_new','_'+str(index),1))
def walk(folderPath):   #walk through the original images to be cut
    global newFilePath
    filesPath=[str(path) for path in pathlib.Path(folderPath).rglob('*.jp*')]
    for filePath in filesPath:
        originalImage=Image.open(filePath)
        newFolder=os.path.splitext(filePath)[0]
        newFilePath=os.path.split(newFolder)[-1]+'_new'+os.path.splitext(filePath)[-1]
        if not os.path.isdir(newFolder):
            os.makedirs(newFolder)
        os.chdir(newFolder)
        newImage=fill(originalImage)
        pieces=cut(newImage)
        save(pieces)
if __name__=='__main__':
    folderPath='D:/圖片'
    walk(folderPath)
******************divider*******************
Generate a CAPTCHA:

import random,string
from PIL import Image,ImageDraw,ImageFont,ImageFilter
#each CAPTCHA "character" is itself made of several characters
def rand_font(couple):
    s=""
    for j in range(couple):
        n=random.randint(1,3)   #2 gives a digit, 1 and 3 give an upper- or lowercase letter
        if n==2:
            s+=str(random.randint(0,9))
        else:
            s+=random.choice(string.ascii_letters)
    return s
#color of each CAPTCHA character
def rand_fontColor():
    return (random.randint(64,255),random.randint(64,255),random.randint(64,255))
#color of each background pixel
def rand_drawPixelColor():
    return (random.randint(32,127),random.randint(32,127),random.randint(32,127))
#width and height of the background image
width=60*4
height=60
img=Image.new('RGB',(width,height),(0,0,0)) #create the background image: mode, size, color
draw=ImageDraw.Draw(img)    #create the drawing object
font=ImageFont.truetype('C:/Windows/Fonts/Arial.ttf',36) #create the font object
#fill every pixel of the background with a color
for i in range(width):
    for j in range(height):
        draw.point((i,j),rand_drawPixelColor())
#write 4 CAPTCHA characters, each made of 2 characters
for i in range(4):
    draw.text((60*i+10,10),text=rand_font(2),fill=rand_fontColor(),font=font)
#blur the image to make recognition harder
img=img.filter(ImageFilter.BLUR)
img.show()
******************divider*******************
Add a caption to an image:

from PIL import Image,ImageDraw,ImageFont
customFont=ImageFont.truetype('F:/msyh.ttc',50)
image=Image.open('F:/原圖.jpg')
width,height=image.size
draw=ImageDraw.Draw(image)  #create the drawing object
draw.text((width*1/3,height/2),'陳獨秀你坐下！！','#ff0000',customFont)   #write text on the image
image.save('F:/新圖.jpg','jpeg')
******************divider*******************
Recognize text in an image:

Install tesseract-ocr.exe to its default path and tick Chinese(simplified) under Additional language.
In pytesseract.py, set tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

To open the bytes of a scraped image directly without saving a file: Image.open(io.BytesIO(response.content)).show()

from PIL import Image
from pytesseract import image_to_string
config="--tessdata-dir 'C:/Program Files (x86)/Tesseract-OCR/tessdata'"
with Image.open('F:/New Download/1.jpg') as img:
    text=image_to_string(img,'chi_sim',config=config).replace(' ','')
print(text)
******************divider*******************
Pencil sketch:

from PIL import Image
import numpy as np
a=np.asarray(Image.open('D:/原圖.jpg').convert('L')).astype('float')
depth=10.  # (0-100)
grad=np.gradient(a)  # gradient of the grayscale image
grad_y,grad_x=grad  # np.gradient returns the row (vertical) gradient first, then the column (horizontal) one
grad_x=grad_x*depth / 100.
grad_y=grad_y*depth / 100.
A=np.sqrt(grad_x **2+grad_y **2+1.)
uni_x=grad_x / A
uni_y=grad_y / A
uni_z=1. / A
vec_el=np.pi / 2.2  # elevation angle of the light source, in radians
vec_az=np.pi / 4.  # azimuth angle of the light source, in radians
dx=np.cos(vec_el)*np.cos(vec_az)  # effect of the light source on the x axis
dy=np.cos(vec_el)*np.sin(vec_az)  # effect of the light source on the y axis
dz=np.sin(vec_el)  # effect of the light source on the z axis
b=255*(dx*uni_x+dy*uni_y+dz*uni_z)  # normalize the lighting
b=b.clip(0,255)
im=Image.fromarray(b.astype('uint8'))  # rebuild the image
im.save('D:/素描.jpg')
******************divider*******************
Falling snow:

import pygame,random

pygame.init()   #initialize
size=(1364,569) #window width and height, same as the background image
screen=pygame.display.set_mode(size)
bg=pygame.image.load('F:/New Download/snow.jpg')
pygame.display.set_caption('Snow Animation')

snows=[]
for i in range(200):    #initialize the snowflakes: [x, y, x speed, y speed]
    x=random.randrange(0,size[0])
    y=random.randrange(0,size[1])
    sx=random.randint(-2,2) #.randint(3,6)=.choice(range(3,7,1))=.randrange(3,7,1)
    sy=random.randint(4,7)
    snows.append([x,y,sx,sy])

clock=pygame.time.Clock()
num=0
done=False
while not done:
    screen.blit(bg,(0,0))   #image background; for a black background use screen.fill((0,0,0))
    for snow in snows: # loop over the snowflakes
        pygame.draw.circle(screen,(255,255,255),snow[:2],snow[3]-3) #draw a snowflake: color, position, size
        snow[0] +=snow[2]   # move the snowflake (takes effect on the next pass)
        snow[1] +=snow[3]
        if snow[1] > size[1]:   # if the snowflake has left the screen, reset its position
            snow[1]=random.randrange(-50,-10)
            snow[0]=random.randrange(0,size[0])
    pygame.display.flip()   # refresh the screen
    clock.tick(20)
    num+=1
    if num<5:
        pygame.image.save(screen,f'F:/New Download/snow-{num}.jpg')
    for event in pygame.event.get():
        if event.type==pygame.QUIT:
            done=True

pygame.quit()
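Going back to the image-grid example (mixImages) earlier in this section, its layout arithmetic is the least obvious part and can be isolated from Pillow. The sketch below restates it as a pure function; the name grid_layout and the returned tuple are illustrative choices, not part of the original code.

```python
import math

def grid_layout(total_num, total_size):
    """Return (rows, per_row, last_row, cell_height) for a square canvas."""
    rows = math.ceil(math.sqrt(total_num))        # vnum in the original
    per_row = math.ceil(total_num / rows)         # hnum1: images in each full row
    last_row = total_num - per_row * (rows - 1)   # images left for the final row
    cell_height = total_size // rows              # vsize
    return rows, per_row, last_row, cell_height

for n in (5, 7, 12):
    print(n, grid_layout(n, 720))
```

The final row holds fewer images than the others whenever the total does not divide evenly, which is why the original computes a separate, wider cell size (hsize2) for it.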

Reposted from: https://www.cnblogs.com/scrooge/p/7693541.html
