Scraping data online with Python and importing it into Neo4j to build a knowledge graph
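I have recently been learning Neo4j. Using the Douban Top 250 data as the subject, I scrape the data online with Python and write it into Neo4j to build a knowledge graph. The steps are described in detail below.
1. Knowledge graph design
Analysing and scraping the page yields movie, country, type, time, director, actor, score and similar information. Here I use movie, country, type, time, director and actor as nodes, with score as a property of movie. Many write-ups online use only movie, director and actor as nodes and store everything else as properties of movie. I tried that before, but the result was not what I wanted; the reason is explained later. The node and relationship design is shown in the figure below.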
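To make the design concrete, the node labels and relationship types can be written down as plain data. The sketch below is only an illustration: the labels and relationship names are the ones used by the import code in section 2, while `SCHEMA` and `describe` are hypothetical names introduced here.

```python
# Hypothetical summary of the graph design:
# (start label, relationship type, end label)
SCHEMA = [
    ("Person", "directed", "Movie"),
    ("Person", "acted_in", "Movie"),
    ("Movie", "created_in", "Time"),
    ("Movie", "produced_by", "Country"),
    ("Movie", "belong_to", "Type"),
]


def describe(schema):
    """Render each triple as a Cypher-style pattern string."""
    return ["(:%s)-[:%s]->(:%s)" % triple for triple in schema]


if __name__ == "__main__":
    for pattern in describe(SCHEMA):
        print(pattern)
```

Score does not appear in the list because it is stored as a property on the Movie node rather than as a node of its own.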
2. Scraping the data and writing it to Neo4j
Here is the code:
import random
import re
import urllib.request

from bs4 import BeautifulSoup
from py2neo import Graph

# User-Agent strings; one is picked at random for each request
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  # Chrome
    "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",  # Firefox
    "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",  # IE
    "Opera/9.99 (Windows NT 5.1; U; zh-CN) Presto/9.9.9",  # Opera
]

# special characters removed from names before they are written to the graph
punc = ':· - ...:-'

# Separator between the director and actor fields. It appears as '???' in the
# published listing; three non-breaking spaces are assumed here.
sep = '\xa0\xa0\xa0'


def clean(text):
    """Strip surrounding whitespace and remove the characters listed in punc."""
    return re.sub(r"[%s]+" % punc, "", text.strip())


if __name__ == "__main__":
    # connect to graph
    graph = Graph("http://localhost:11010/", username="admin", password="password")
    # Top 250 is 10 pages of 25 films (the original range(0, 9) stopped after 9 pages)
    for page_no in range(10):
        ua = random.choice(ua_list)
        url = 'https://movie.douban.com/top250?start=' + str(page_no * 25) + '&filter='
        req = urllib.request.Request(url, headers={'User-agent': ua})
        html = urllib.request.urlopen(req).read()
        soup = BeautifulSoup(html, 'lxml')
        page = soup.find_all('div', {'class': 'item'})
        list_item = []  # names already written as nodes (reset for every page)
        for item in page:
            try:
                lines = item.find('p', {'class': ""}).text.strip().split('\n')
                text0 = lines[0]  # director / actor line
                text1 = lines[1]  # year / country / type line
                # get film
                film = clean(item.find('span', {'class': 'title'}).text)
                # get score
                score = item.find('span', {'class': 'rating_num'}).text.strip()
                graph.run("CREATE (movie:Movie {name: $name, score: $score})",
                          {"name": film, "score": score})
                # get director
                directors = clean(text0.strip().split(sep)[0].strip().split(':')[1])
                if len(directors.split('/')) > 1:
                    print(film + ' has more than one director')
                # create the director node
                if directors not in list_item:
                    graph.run("CREATE (director:Person {name: $name})", {"name": directors})
                    list_item.append(directors)
                # create the director-movie relationship
                graph.run("MATCH (p:Person {name: $name}), (b:Movie {name: $film}) "
                          "CREATE (p)-[:directed]->(b)", {"name": directors, "film": film})
                # get actors
                actors = clean(text0.strip().split(sep)[1].strip().split(':')[1]).split('/')
                if len(actors) > 1:
                    actors = actors[:-1]  # the last entry is empty or cut off by "..."
                for actor in actors:
                    if actor not in list_item:
                        graph.run("CREATE (actor:Person {name: $name})", {"name": actor})
                        list_item.append(actor)
                    graph.run("MATCH (p:Person {name: $name}), (b:Movie {name: $film}) "
                              "CREATE (p)-[:acted_in]->(b)", {"name": actor, "film": film})
                # get year
                year = text1.strip().split('/')[0].strip()
                if year not in list_item:
                    graph.run("CREATE (time:Time {year: $year})", {"year": year})
                    list_item.append(year)
                graph.run("MATCH (p:Time {year: $year}), (b:Movie {name: $film}) "
                          "CREATE (b)-[:created_in]->(p)", {"year": year, "film": film})
                # get country (there may be more than one; only the first is used)
                country = text1.strip().split('/')[1].strip().split(' ')[0]
                if country not in list_item:
                    graph.run("CREATE (country:Country {name: $name})", {"name": country})
                    list_item.append(country)
                graph.run("MATCH (p:Country {name: $name}), (b:Movie {name: $film}) "
                          "CREATE (b)-[:produced_by]->(p)", {"name": country, "film": film})
                # get types; one loop covers films with a single type as well
                # (the original single-type branch concatenated a list and a string)
                types = text1.strip().split('/')[2].strip().split(' ')
                for genre in types:
                    if genre not in list_item:
                        graph.run("CREATE (type:Type {name: $name})", {"name": genre})
                        list_item.append(genre)
                    graph.run("MATCH (p:Type {name: $name}), (b:Movie {name: $film}) "
                              "CREATE (b)-[:belong_to]->(p)", {"name": genre, "film": film})
            except (AttributeError, IndexError):
                # skip entries whose layout does not match the expected pattern
                continue

The code is fairly rough; I will refine it later.
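The field extraction in the loop can be tried without a network connection or a database. The sketch below applies the same splitting steps to one hard-coded sample. The sample text is written by hand to resemble a Douban entry (it is not copied from the live page), `parse_info` is a hypothetical helper, and the separator between the director and actor fields is assumed to be three non-breaking spaces.

```python
import re

PUNC = ':· - ...:-'       # same characters the import script strips
SEP = '\xa0\xa0\xa0'      # assumed separator between director and actors


def clean(text):
    """Strip surrounding whitespace and remove the characters in PUNC."""
    return re.sub(r"[%s]+" % PUNC, "", text.strip())


def parse_info(info):
    """Split the two-line info text of one entry into a dict of fields."""
    text0, text1 = info.strip().split('\n')[:2]
    director_part, actor_part = text0.strip().split(SEP)[:2]
    parts = [p.strip() for p in text1.split('/')]
    actors = clean(actor_part.split(':')[1]).split('/')
    if len(actors) > 1:
        actors = actors[:-1]   # last entry is empty or cut off by "..."
    return {
        "director": clean(director_part.split(':')[1]),
        "actors": actors,
        "year": parts[0],
        "country": parts[1].split(' ')[0],   # first country only
        "types": parts[2].split(' '),
    }


# Hand-written sample, not copied from the live page.
SAMPLE = ("导演: Frank Darabont" + SEP + "主演: Tim Robbins / Morgan Freeman /...\n"
          "1994 / 美国 / 犯罪 剧情")

if __name__ == "__main__":
    print(parse_info(SAMPLE))
```

Because the stripped character set contains a space, names lose their internal spaces (for example `Frank Darabont` becomes `FrankDarabont`). If that is not wanted, the space would have to be taken out of the character set.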
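3. Knowledge graph showcase
The overall result is shown in the figure above: related films can be looked up directly through the country, type and time nodes. If only movie, director and actor were nodes, you would have to click on an individual movie node to see its country, type and time properties.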
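With country, type and time modelled as nodes, such a lookup is a one-hop pattern match. The sketch below builds a parameterized Cypher query for the films attached to one of these nodes. `films_linked_to` and `LINKS` are hypothetical names; the labels, property keys and relationship types are the ones used in the import code in section 2.

```python
# Relationship type and property key used for each categorical node label
LINKS = {
    "Country": ("produced_by", "name"),
    "Type": ("belong_to", "name"),
    "Time": ("created_in", "year"),
}


def films_linked_to(label, value):
    """Return (query, parameters) listing the films attached to one node."""
    rel, key = LINKS[label]
    query = (
        "MATCH (m:Movie)-[:%s]->(n:%s {%s: $value}) "
        "RETURN m.name AS name, m.score AS score"
        % (rel, label, key)
    )
    return query, {"value": value}


if __name__ == "__main__":
    query, params = films_linked_to("Country", "美国")
    print(query)
    print(params)
```

With a py2neo `Graph` object like the one created in section 2, the pair can be passed on as `graph.run(query, params)`.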
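With that, a simple Douban Top 250 knowledge graph is built. One problem remains, however: duplicated data. After finishing I found that not only nodes but also relationships are duplicated. I am still looking into this.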
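One possible explanation, offered as an assumption rather than a confirmed diagnosis: the script writes nodes with `CREATE` and tracks what it has already written in `list_item`, which is re-created for every page. A name that appears on two pages is therefore created twice, and a later `MATCH ... CREATE` attaches a relationship to every copy. A common remedy in Cypher is `MERGE`, which creates a node or relationship only when no match exists. The sketch below builds such statements; `merge_link` is a hypothetical helper and has not been run against the database used in this article.

```python
def merge_link(start_label, start_key, rel, end_label, end_key):
    """Build a Cypher statement that MERGEs both nodes and the relationship.

    MERGE matches an existing pattern or creates it, so running the
    statement again with the same parameters should not add duplicates.
    """
    return (
        "MERGE (a:%s {%s: $start}) "
        "MERGE (b:%s {%s: $end}) "
        "MERGE (a)-[:%s]->(b)"
        % (start_label, start_key, end_label, end_key, rel)
    )


# Statements for two of the relationships used in the import script
DIRECTED = merge_link("Person", "name", "directed", "Movie", "name")
BELONG_TO = merge_link("Movie", "name", "belong_to", "Type", "name")

if __name__ == "__main__":
    print(DIRECTED)
    print(BELONG_TO)
```

Each statement would be executed with parameters, for example `graph.run(DIRECTED, {"start": director, "end": film})`. Duplicates that are already in the database would still have to be removed, or the data re-imported into an empty graph.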