
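[Algorithm Competition Study] Digital China Innovation Contest, Smart Ocean Construction: Task 1, Common Tools for Geographic Data Analysis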

Published: 2023/12/15, Programming Q&A, 豆豆


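Smart Ocean Construction, Task 1: Common Tools for Geographic Data Analysis

Geospatial data analysis relies on a number of dedicated tools. This module briefly introduces the most common ones: shapely, geopandas, folium, kepler.gl, and geohash. shapely and geopandas are well suited to analyzing geospatial data, folium and kepler.gl are tools for visualizing it, and geohash is a scheme for encoding latitude/longitude coordinates. Knowing what each tool offers helps us decide how to analyze the data and extract features with the tools at hand.

Learning objectives

1. Learn the basic functionality of shapely and geopandas, and use these two Python libraries to perform spatial operations between geometric objects.
2. Learn to use the data visualization tools folium and kepler.gl.
3. Learn and master the geohash encoding method.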

Contents

shapely

Spatial data model
Useful features of geometric objects
Point
LineStrings
LinearRings
Polygon
Relationships between geometric objects

geopandas

Folium

Kepler.gl

GeoHash

Caveats

shapely

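References
Official documentation

Shapely is an open-source Python library for spatial geometry. It supports the basic geometry types Point, Curve, and Surface, together with the related spatial operations. Each geometric object is characterized by three point sets: its interior, its boundary, and its exterior.

Spatial data model

1. The point type is implemented by the Point class, the curve type by the LineString and LinearRing classes, and the surface type by the Polygon class.
2. Collections of points are implemented by the MultiPoint class, collections of curves by the MultiLineString class, and collections of surfaces by the MultiPolygon class.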

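Useful features of geometric objects

Point, LineString, and LinearRing offer a number of very useful features:

The coordinates of a geometric object can be converted to and from numpy.array.
You can compute the length of a line (length), the area of a surface (area), the distance between objects (distance), the Hausdorff distance between objects (hausdorff_distance), and the bounds tuple of an object (minx, miny, maxx, maxy).
You can test relationships between geometric objects, such as intersection (intersects) and containment (contains), and compute the intersecting region (intersection).
You can compute the geometric center (centroid), a buffer (buffer), the minimum rotated bounding rectangle (minimum_rotated_rectangle), and so on.
You can interpolate a point along a line (interpolate), find the distance along a line to the projection of a point (project), and find the nearest points between two geometric objects (ops.nearest_points).
You can rotate (rotate) and scale (scale) geometric objects.

    from shapely import geometry as geo
    from shapely import wkt
    from shapely import ops
    import numpy as np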

Point

class Point(coordinates)

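    # A Point can be constructed in three ways
    point = geo.Point(0.5, 0.5)
    point_2 = geo.Point((0, 0))
    point_3 = geo.Point(point)
    # Coordinates are available through coords or the x, y, z attributes
    print(list(point_3.coords))
    print(point_3.x)
    print(point_3.y)
    # Display several geometries together (rendered inline in a notebook)
    geo.GeometryCollection([point, point_2])
    # Convert the coordinates to a NumPy array. Shapely 1.x also accepted
    # np.array(point); that array interface was removed in Shapely 2.0.
    print(np.array(point.coords[0]))

    [(0.5, 0.5)]
    0.5
    0.5
    [0.5 0.5]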

LineStrings

class LineString(coordinates)
The LineString constructor takes an ordered sequence of two or more point tuples.

    # Example
    arr = np.array([(0, 0), (1, 1), (1, 0)])
    line = geo.LineString(arr)  # equivalent to geo.LineString([(0, 0), (1, 1), (1, 0)])
    # distance() works for line-to-line as well as point-to-line distances
    print('Distance between the two geometries: ' + str(geo.Point(2, 2).distance(line)))
    # For a point and a line, hausdorff_distance() is the distance from the point to the farthest point of the line
    print('Hausdorff distance between the two geometries: ' + str(geo.Point(2, 2).hausdorff_distance(line)))
    print('Area: ' + str(line.area))
    print('Bounds: ' + str(line.bounds))
    print('Length: ' + str(line.length))
    print('Geometry type: ' + str(line.geom_type))
    print('Coordinates: ' + str(list(line.coords)))
    center = line.centroid  # geometric center
    geo.GeometryCollection([line, center])

    Distance between the two geometries: 1.4142135623730951
    Hausdorff distance between the two geometries: 2.8284271247461903
    Area: 0.0
    Bounds: (0.0, 0.0, 1.0, 1.0)
    Length: 2.414213562373095
    Geometry type: LineString
    Coordinates: [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)]

    bbox = line.envelope  # envelope gives the minimum axis-aligned bounding rectangle
    geo.GeometryCollection([line, bbox])
    rect = line.minimum_rotated_rectangle  # minimum rotated bounding rectangle
    geo.GeometryCollection([line, rect])
    pt_half = line.interpolate(0.5, normalized=True)  # interpolate the point halfway along the line
    geo.GeometryCollection([line, pt_half])
    ratio = line.project(pt_half, normalized=True)  # project() is the inverse of interpolate()
    print(ratio)

The following applies the Douglas-Peucker algorithm, which is used frequently in trajectory analysis.

    line1 = geo.LineString([(0, 0), (1, -0.2), (2, 0.3), (3, -0.5), (5, 0.2), (7, 0)])
    line1_simplify = line1.simplify(0.4, preserve_topology=False)  # Douglas-Peucker algorithm
    print(line1)
    print(line1_simplify)
    line1_simplify

    LINESTRING (0 0, 1 -0.2, 2 0.3, 3 -0.5, 5 0.2, 7 0)
    LINESTRING (0 0, 2 0.3, 3 -0.5, 5 0.2, 7 0)

    buffer_with_circle = line1.buffer(0.2)  # the line ends are extended with semicircular caps
    geo.GeometryCollection([line1, buffer_with_circle])
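To make the simplification step less of a black box, here is a minimal pure-Python sketch of the classic Douglas-Peucker recursion. The helpers perpendicular_distance and douglas_peucker are hypothetical names written for this illustration, not Shapely functions, and the sketch measures distance to the infinite line through the two end points, which is an assumption that may differ in detail from what Shapely does internally.

```python
import math


def perpendicular_distance(pt, start, end):
    """Distance from pt to the infinite line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)
    if norm == 0.0:  # start and end coincide
        return math.hypot(x - x1, y - y1)
    return abs(dx * (y - y1) - dy * (x - x1)) / norm


def douglas_peucker(points, tolerance):
    """Simplify a polyline (illustrative helper, not Shapely's implementation)."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining the two ends.
    distances = [perpendicular_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    max_dist = max(distances)
    index = distances.index(max_dist) + 1
    # If every interior point is within the tolerance, keep only the ends.
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right


track = [(0, 0), (1, -0.2), (2, 0.3), (3, -0.5), (5, 0.2), (7, 0)]
simplified = douglas_peucker(track, 0.4)
print(simplified)
```

With the same six points and the same tolerance of 0.4 as the Shapely call above, the sketch drops only the point (1, -0.2).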

LinearRings

class LinearRing(coordinates)
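The LinearRing constructor takes an ordered sequence of point tuples.

The sequence can be closed explicitly by passing the same value as its first and last element. Otherwise, the first tuple is copied to the last position and the sequence is closed implicitly.
As with LineString, repeated points in the sequence are allowed, but they may cost performance and should be avoided.

    # from shapely.geometry.polygon import LinearRing
    ring = geo.polygon.LinearRing([(0, 0), (1, 1), (1, 0)])
    # Compared with the LineString example above, the length is now 3.41 because the sequence is closed
    print(ring.length)
    print(ring.area)
    geo.GeometryCollection([ring])

    3.414213562373095
    0.0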

Polygon

class Polygon(shell[, holes=None])
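Polygon takes two positional arguments. The first is an ordered sequence of point tuples, handled exactly as in LinearRing, that defines the exterior shell. The second is an optional sequence of rings that define the interior boundaries (holes).

    from shapely.geometry import Polygon
    polygon1 = Polygon([(0, 0), (1, 1), (1, 0)])
    ext = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]
    hole = [(1, 0), (0.5, 0.5), (1, 1), (1.5, 0.5), (1, 0)]  # named hole to avoid shadowing the builtin int
    polygon2 = Polygon(ext, [hole])
    print(polygon1.area)
    print(polygon1.length)
    print(polygon2.area)  # the area of ext minus the area of hole
    print(polygon2.length)  # the length of ext plus the length of hole
    print(np.array(polygon2.exterior.coords))  # coordinates of the exterior ring
    geo.GeometryCollection([polygon2])

    0.5
    3.414213562373095
    3.5
    10.82842712474619
    [[0. 0.]
     [0. 2.]
     [2. 2.]
     [2. 0.]
     [0. 0.]]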

As introduced earlier, MultiPoint, MultiLineString, and MultiPolygon are collections of several points, several LineStrings, and several Polygons respectively.
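The collection types have no example of their own in this section, so here is a small sketch of how they might be built and queried. It assumes Shapely is installed; the coordinates are made up for illustration.

```python
from shapely.geometry import MultiLineString, MultiPoint, MultiPolygon, Polygon

# Three points, two line segments, and two unit squares (illustrative values)
points = MultiPoint([(0, 0), (1, 1), (2, 0)])
lines = MultiLineString([[(0, 0), (1, 0)], [(0, 1), (0, 3)]])
squares = MultiPolygon([
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
])

n_points = len(points.geoms)    # members are exposed through .geoms
total_length = lines.length     # sum of the member lengths
total_area = squares.area       # sum of the member areas
print(n_points, total_length, total_area, squares.bounds)
```

Properties such as length, area, and bounds are evaluated over the collection as a whole, which is why the same analysis code can usually handle both single and multi-part geometries.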

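Relationships between geometric objects

A geometric object has an interior, a boundary, and an exterior. The descriptions below use these terms directly.

1. object.contains(other)
Returns True if no point of other lies in the exterior of object and at least one point of the interior of other lies in the interior of object.
a.contains(b) is equivalent to b.within(a).

    from shapely.geometry import LineString, Point, Polygon
    coords = [(0, 0), (1, 1)]
    print(LineString(coords).contains(Point(0.5, 0.5)))  # line vs. point
    # The boundary of the line (its end points) is not part of its interior, so this returns False
    print(LineString(coords).contains(Point(1.0, 1.0)))
    polygon1 = Polygon([(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)])
    print(polygon1.contains(Point(1.0, 1.0)))  # polygon vs. point
    # contains() applies in the same way to polygon vs. line and polygon vs. polygon
    geo.GeometryCollection([polygon1, Point(1.0, 1.0)])

    True
    False
    True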

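2. object.crosses(other)
Returns True if the interior of object intersects the interior of other but does not contain it.
3. object.disjoint(other)
Returns True if the boundary and interior of object do not intersect those of other at all.
4. object.intersects(other)
Returns True if the boundary or interior of object intersects those of other in any way.
5. object.convex_hull
Returns the smallest convex polygon containing all points of the object (the convex hull).

    from shapely.geometry import LineString, Point, Polygon
    print(LineString(coords).crosses(LineString([(0, 1), (1, 0)])))
    print(Point(0, 0).disjoint(Point(1, 1)))
    print(LineString(coords).intersects(LineString([(0, 1), (1, 0)])))

    True
    True
    True

    # Compute and draw the convex hull of 6 given points
    points1 = geo.MultiPoint([(0, 0), (1, 1), (0, 2), (2, 2), (3, 1), (1, 0)])
    hull1 = points1.convex_hull
    geo.GeometryCollection([hull1, points1])

    # object.intersection returns the intersection of two objects
    polygon1 = Polygon([(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)])
    hull1.intersection(polygon1)
    # union returns the union of two objects
    hull1.union(polygon1)
    # difference returns the part of hull1 that is not in polygon1
    hull1.difference(polygon1)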


6. Interoperation with NumPy and Python arrays
The coordinates of Point, LinearRing, and LineString objects can be converted to and from NumPy arrays.

    from shapely.geometry import Point, LineString, MultiPoint, Polygon
    import numpy as np
    # The asPoint/asLineString/asMultiPoint/asPolygon adapters were removed in Shapely 2.0;
    # the regular constructors accept NumPy arrays directly.
    pa = Point(np.array([0.0, 0.0]))  # NumPy array -> Point
    la = LineString(np.array([[1.0, 2.0], [3.0, 4.0]]))  # NumPy array -> LineString
    ma = MultiPoint(np.array([[1.1, 2.2], [3.3, 4.4], [5.5, 6.6]]))  # NumPy array -> MultiPoint
    pg = Polygon(np.array([[1.1, 2.2], [3.3, 4.4], [5.5, 6.6]]))  # NumPy array -> Polygon
    print(np.array(pa.coords[0]))  # Point coordinates -> NumPy array

    [0. 0.]

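There are also some very useful functions that are not methods of any geometry class. Consult the official documentation when you need them:

ops.nearest_points: find the nearest points of two geometries
ops.split: split a geometry
ops.substring: extract the part of a line between two distances
affinity.rotate: rotate a geometry
affinity.scale: scale a geometry
affinity.translate: translate (shift) a geometry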
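The functions listed above can be tried out on a short horizontal line. This is only a sketch on made-up coordinates, assuming Shapely is installed; consult the official documentation for the full set of options.

```python
from shapely import affinity, ops
from shapely.geometry import LineString, Point

line = LineString([(0, 0), (2, 0)])
pt = Point(1, 1)

# nearest_points returns one point per input geometry, in input order
on_line, on_pt = ops.nearest_points(line, pt)

# substring with normalized=True takes fractions of the line length
first_half = ops.substring(line, 0, 0.5, normalized=True)

# Affine transformations return new geometries and leave the input untouched
moved = affinity.translate(line, xoff=1.0, yoff=2.0)
rotated = affinity.rotate(line, 90, origin=(0, 0))  # degrees, counter-clockwise
scaled = affinity.scale(line, xfact=2.0, yfact=1.0, origin=(0, 0))

print(on_line, first_half, moved, rotated, scaled)
```

Passing an explicit origin to rotate and scale keeps the results easy to predict; by default both functions transform about the center of the geometry's bounding box.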

geopandas

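GeoPandas provides a high-level interface to geospatial data and makes working with such data in Python easier. It extends the data types used by pandas so that spatial operations can be applied to geometric types. The geometric operations are carried out by shapely. GeoPandas further depends on fiona for file access and on matplotlib for plotting.

Like pandas, geopandas has two data types:

GeoSeries
GeoDataFrame

They inherit most of the methods of the corresponding pandas data structures. Both can be viewed as containers for geospatial data, in effect a pandas representation of a shapefile.

A shapefile describes geometric objects: points, polylines, and polygons. For example, a shapefile can store the geometric positions of spatial objects such as wells, rivers, and lakes. Besides the geometry, a shp file can also store attributes of these objects, such as the name of a river or the temperature of a city.

For example, once geopandas is installed, the world map from its bundled datasets can be plotted directly with matplotlib.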

    import pandas as pd
    import geopandas
    import matplotlib.pyplot as plt
    # read_file reads a shapefile and converts it to the GeoSeries/GeoDataFrame data types.
    # Note: geopandas.datasets was removed in GeoPandas 1.0; with newer versions,
    # download the Natural Earth data and pass its path to read_file instead.
    world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
    world.plot()  # draw the GeoDataFrame, producing a world map
    plt.show()
    world.head()

         pop_est      continent                      name iso_a3  gdp_md_est                                           geometry
    0     920938        Oceania                      Fiji    FJI      8374.0  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
    1   53950935         Africa                  Tanzania    TZA    150600.0  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
    2     603253         Africa                 W. Sahara    ESH       906.5  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
    3   35623680  North America                    Canada    CAN   1674000.0  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
    4  326625791  North America  United States of America    USA  18560000.0  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

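    # Color each polygon by its pop_est value to show the population of each country
    fig, ax = plt.subplots(figsize=(9, 6), dpi=100)
    world.plot('pop_est', ax=ax, legend=True)
    plt.show()

As the GeoDataFrame instance world shows, the last column is geometry. Its geometric objects are of type MULTIPOLYGON and POLYGON, so they can be analyzed with the shapely library introduced above.

For the commonly used geopandas methods, see this article.

For Chinese-language geopandas examples and analyses, see this collection to get a feel for how it is used in practice.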

Folium

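Official documentation

folium covers the visualization scenarios we need most often, such as heat maps, choropleth maps, route maps, and scatter markers. folium maps and our data can also be served on a web page through flask, which is very convenient.

    import folium
    import os
    # First create a map centered on the given coordinates, here Beijing.
    # zoom_start sets the initial zoom level: the larger the value, the closer the view.
    m = folium.Map(location=[39.9, 116.4], zoom_start=10)
    m

folium.Map takes parameters such as location, zoom_start, and tiles, all of which are used in the examples of this section.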

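Example: drawing a heat map with Folium

    import folium
    import numpy as np
    from folium.plugins import HeatMap
    # Generate sample data by hand; each row is [latitude, longitude, value]
    data = (np.random.normal(size=(100, 3)) * np.array([[1, 1, 1]]) + np.array([[48, 5, 1]])).tolist()
    # data
    # Note: recent folium releases may no longer bundle the Stamen tiles; if this
    # raises an error, try another tile set such as 'OpenStreetMap'.
    m = folium.Map([48, 5], tiles='stamentoner', zoom_start=6)
    HeatMap(data).add_to(m)
    m

For other uses of folium, this fairly comprehensive Zhihu thread is a good reference: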
https://www.zhihu.com/question/33783546

Kepler.gl

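kepler.gl basic tutorial

Like folium, Kepler.gl is a graphical data visualization tool. It is a demo app built on deck.gl, Uber's open-source framework for large-scale data visualization. It currently supports three data formats: CSV, JSON, and GeoJSON.

The Kepler.gl website provides example visualizations for the Arc, Line, Hexagon, Point, Heatmap, GeoJSON, and Buildings layers.

Below, the competition data is given some simple preprocessing and then used to demonstrate the basics of kepler.gl.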

    import pandas as pd
    import geopandas as gpd
    from pyproj import Proj
    from keplergl import KeplerGl
    from tqdm import tqdm
    import os
    import matplotlib.pyplot as plt
    import shapely
    import numpy as np
    from datetime import datetime
    import warnings
    warnings.filterwarnings('ignore')
    plt.rcParams['font.sans-serif'] = ['SimSun']  # use SimSun as the default font
    plt.rcParams['axes.unicode_minus'] = False  # render the minus sign '-' correctly in saved figures

    # Read every file in the data folder
    def get_data(file_path, model):
        assert model in ['train', 'test'], '{} Not Support this type of file'.format(model)
        paths = os.listdir(file_path)
        # print(len(paths))
        tmp = []
        for t in tqdm(range(len(paths))):
            p = paths[t]
            with open('{}/{}'.format(file_path, p), encoding='utf-8') as f:
                next(f)  # skip the header line
                for line in f.readlines():
                    tmp.append(line.strip().split(','))
        tmp_df = pd.DataFrame(tmp)
        if model == 'train':
            tmp_df.columns = ['ID', 'lat', 'lon', 'speed', 'direction', 'time', 'type']
        else:
            tmp_df['type'] = 'unknown'
            tmp_df.columns = ['ID', 'lat', 'lon', 'speed', 'direction', 'time', 'type']
        tmp_df['lat'] = tmp_df['lat'].astype(float)
        tmp_df['lon'] = tmp_df['lon'].astype(float)
        tmp_df['speed'] = tmp_df['speed'].astype(float)
        tmp_df['direction'] = tmp_df['direction'].astype(int)  # if this line fails, try upgrading pandas
        return tmp_df

    # Convert planar coordinates to longitude/latitude (for the preliminary-round data)
    # Projection chosen: NAD83 / California zone 6 (ftUS) (EPSG:2230); lookup: https://mygeodata.cloud/cs2cs/
    def transform_xy2lonlat(df):
        x = df['lat'].values
        y = df['lon'].values
        p = Proj('+proj=lcc +lat_1=33.88333333333333 +lat_2=32.78333333333333 +lat_0=32.16666666666666 +lon_0=-116.25 +x_0=2000000.0001016 +y_0=500000.0001016001 +datum=NAD83 +units=us-ft +no_defs ')
        df['lon'], df['lat'] = p(y, x, inverse=True)
        return df

    # Change the time format of the data
    def reformat_strtime(time_str=None, START_YEAR="2019"):
        """Reformat the strtime with the form '08 14' to 'START_YEAR-08-14' """
        time_str_split = time_str.split(" ")
        time_str_reformat = START_YEAR + "-" + time_str_split[0][:2] + "-" + time_str_split[0][2:4]
        time_str_reformat = time_str_reformat + " " + time_str_split[1]
        # time_reformat=datetime.strptime(time_str_reformat,'%Y-%m-%d %H:%M:%S')
        return time_str_reformat

    # Distance between two points, in meters
    def haversine_np(lon1, lat1, lon2, lat2):
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        return km * 1000

    def compute_traj_diff_time_distance(traj=None):
        """Compute the sampling time and the coordinate distance."""
        # Time difference between consecutive points, in minutes
        time_diff_array = (traj["time"].iloc[1:].reset_index(drop=True) - traj["time"].iloc[:-1].reset_index(drop=True)).dt.total_seconds() / 60
        # Distance between consecutive coordinates
        dist_diff_array = haversine_np(traj["lon"].values[1:],   # lon_0
                                       traj["lat"].values[1:],   # lat_0
                                       traj["lon"].values[:-1],  # lon_1
                                       traj["lat"].values[:-1]   # lat_1
                                       )
        # Fill the first value with the mean
        time_diff_array = [time_diff_array.mean()] + time_diff_array.tolist()
        dist_diff_array = [dist_diff_array.mean()] + dist_diff_array.tolist()
        traj.loc[list(traj.index), 'time_array'] = time_diff_array
        traj.loc[list(traj.index), 'dist_array'] = dist_diff_array
        return traj

    # Remove anomalous points from a trajectory
    def assign_traj_anomaly_points_nan(traj=None, speed_maximum=23,
                                       time_interval_maximum=200,
                                       coord_speed_maximum=700):
        """Assign the anomaly points in traj to np.nan."""
        def thigma_data(data_y, n):
            ymean = np.mean(data_y)
            ystd = np.std(data_y)
            threshold1 = ymean - n * ystd
            threshold2 = ymean + n * ystd
            judge = []
            for data in data_y:
                if (data < threshold1) | (data > threshold2):
                    judge.append(True)
                else:
                    judge.append(False)
            return judge

        # Step 1: The speed anomaly repairing
        is_speed_anomaly = (traj["speed"] > speed_maximum) | (traj["speed"] < 0)
        traj.loc[is_speed_anomaly, "speed"] = np.nan  # .loc avoids chained assignment

        # Step 2: compute the speed from distance and time
        is_anomaly = np.array([False] * len(traj))
        traj["coord_speed"] = traj["dist_array"] / traj["time_array"]

        # Condition 1: use the 3-sigma rule to drop points with an anomalous coord_speed or a large time gap
        is_anomaly_tmp = pd.Series(thigma_data(traj["time_array"], 3)) | pd.Series(thigma_data(traj["coord_speed"], 3))
        is_anomaly = is_anomaly | is_anomaly_tmp
        is_anomaly.index = traj.index

        # Condition 2: 3-sigma filtering of the trajectory coordinates
        traj = traj[~is_anomaly].reset_index(drop=True)
        is_anomaly = np.array([False] * len(traj))
        if len(traj) != 0:
            lon_std, lon_mean = traj["lon"].std(), traj["lon"].mean()
            lat_std, lat_mean = traj["lat"].std(), traj["lat"].mean()
            lon_low, lon_high = lon_mean - 3 * lon_std, lon_mean + 3 * lon_std
            lat_low, lat_high = lat_mean - 3 * lat_std, lat_mean + 3 * lat_std
            is_anomaly = is_anomaly | (traj["lon"] > lon_high) | ((traj["lon"] < lon_low))
            is_anomaly = is_anomaly | (traj["lat"] > lat_high) | ((traj["lat"] < lat_low))
            traj = traj[~is_anomaly].reset_index(drop=True)
        return traj, [len(is_speed_anomaly) - len(traj)]

    df = get_data(r'hy_round1_train_20200102', 'train')

    100%|█████████████████████████████████████████████████████████████████████████████| 7000/7000 [00:15<00:00, 448.76it/s]

    df = transform_xy2lonlat(df)
    df['time'] = df['time'].apply(reformat_strtime)
    df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

    # This cell does not need to be run: DF.csv is already included in the attachment data on GitHub.
    # Remove anomalous trajectory points and linearly interpolate the NaN values
    ID_list = list(pd.DataFrame(df['ID'].value_counts()).index)
    DF_NEW = []
    Anomaly_count = []
    for ID in tqdm(ID_list):
        df_id = compute_traj_diff_time_distance(df[df['ID'] == ID])
        df_new, count = assign_traj_anomaly_points_nan(df_id)
        df_new["speed"] = df_new["speed"].interpolate(method="linear", axis=0)
        df_new = df_new.bfill()
        df_new = df_new.ffill()
        df_new["speed"] = df_new["speed"].clip(0, 23)
        Anomaly_count.append(count)  # number of anomalous points for each ID
        DF_NEW.append(df_new)
    DF = pd.concat(DF_NEW)

    # Load the preprocessed data from GitHub
    DF = pd.read_csv('DF.csv')
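A quick way to build confidence in the distance step is to check the haversine formula on a case with a known answer. The sketch below is a scalar, standard-library version written for this check (haversine_m is a hypothetical name, not part of the competition code); it reuses the same Earth radius of 6367 km as haversine_np.

```python
import math

EARTH_RADIUS_KM = 6367  # same radius constant as haversine_np


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a)) * EARTH_RADIUS_KM * 1000


# Moving one degree of latitude along a meridian covers an arc of one degree,
# so the distance should equal radius * radians(1).
one_degree_lat = haversine_m(120.0, 30.0, 120.0, 31.0)
expected = math.radians(1.0) * EARTH_RADIUS_KM * 1000
print(one_degree_lat, expected)
```

Both values come out at roughly 111 km, which is the familiar length of one degree of latitude. Dividing such a distance by the sampling interval in minutes gives the coord_speed used in the anomaly filter, in meters per minute.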

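Because the dataset is large, feeding the anomaly-filtered trajectories directly into kepler.gl can make the program stutter or fail to run. In that case, try simplifying the data with the Douglas-Peucker algorithm via geopandas. Effectively simplified vector data brings a large performance gain without losing much perceived visual accuracy.

    # Douglas-Peucker example: the trajectory of a single ID can first be simplified and compressed with geopandas
    line = shapely.geometry.LineString(np.array(df[df['ID'] == '11'][['lon', 'lat']]))
    ax = gpd.GeoSeries([line]).plot(color='red')
    ax = gpd.GeoSeries([line]).simplify(tolerance=0.000000001).plot(color='blue', ax=ax, linestyle='--')
    LegendElement = [plt.Line2D([], [], color='red', label='Before simplification'),
                     plt.Line2D([], [], color='blue', linestyle='--', label='After simplification')]
    # Pass the prepared legend handles to legend() and configure it
    ax.legend(handles=LegendElement, loc='upper left', fontsize=10)
    # ax.set_ylim((-2.1, 1))
    # ax.axis('off')
    # Count the vertices through .coords (np.array(geometry) no longer works in Shapely 2.0)
    print('Points before simplification: ' + str(len(gpd.GeoSeries([line])[0].coords)))
    print('Points after simplification: ' + str(len(gpd.GeoSeries([line]).simplify(tolerance=0.000000001)[0].coords)))

    Points before simplification: 377
    Points after simplification: 156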


    # Data simplification: use shapely to turn each vessel's lon/lat points into a LineString,
    # simplify it inside a GeoSeries, then collect all results in a GeoDataFrame.
    # Note: dict(list(df.groupby('ID'))) is rebuilt on every access, which makes this
    # loop slow; grouping once before the loop would avoid that.
    def simplify_dataframe(df):
        line_list = []
        for i in tqdm(dict(list(df.groupby('ID')))):
            line_dict = {}
            lat_lon = dict(list(df.groupby('ID')))[i][['lon', 'lat']]
            line = shapely.geometry.LineString(np.array(lat_lon))
            line_dict['ID'] = dict(list(df.groupby('ID')))[i].iloc[0]['ID']
            line_dict['type'] = dict(list(df.groupby('ID')))[i].iloc[0]['type']
            line_dict['geometry'] = gpd.GeoSeries([line]).simplify(tolerance=0.000000001)[0]
            line_list.append(line_dict)
        return gpd.GeoDataFrame(line_list)

    df_gpd_change = simplify_dataframe(DF)

    100%|████████████████████████████████████████████████████████████████████████████| 7000/7000 [6:22:09<00:00, 3.28s/it]

df_gpd_change.pkl holds the anomaly-filtered data after compression with the Douglas-Peucker algorithm. It is already included in the attachment data on GitHub.

    df_gpd_change = pd.read_pickle('df_gpd_change.pkl')
    map1 = KeplerGl(height=800)  # height sets the height of the map widget in pixels
    map1.add_data(data=df_gpd_change, name='data')
    # After this cell runs, a link to the kepler.gl user guide is printed below for reference
    map1

    User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
    KeplerGl(data={'data': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21…

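The kepler.gl visualization shows that trajectories of different classes occupy different locations, and it also reveals the characteristic shapes of the different vessel trajectories.

After simplification, the data can be converted back to a DataFrame, and geohash codes can then be computed as features for each vessel.

In addition, kepler.gl recently added an "incremental time window" feature, which helps a great deal when visualizing time-series data. When the dataset has a time-typed field, a time window appears once the corresponding Filter has been added.

Click the play button and switch from the default "Moving Time Window" mode to "Incremental Time Window" mode. In incremental mode, the data on screen accumulates continuously from the starting point.

If you are familiar with dash, it is easy to build an interactive web application that embeds kepler.gl in pure Python. See this link for details.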

GeoHash

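Reference: https://blog.csdn.net/zhufenghao/article/details/85568340

GeoHash encoding is widely used when analyzing latitude/longitude data and extracting features from it. It encodes a latitude/longitude coordinate as a short string of letters and digits, and has the following properties:

It is a hierarchical spatial data structure: locations are partitioned into a rectangular grid, and all points in the same cell share the same code.

With a sufficiently long code, a location can be represented to arbitrary precision.

The longer the common prefix of two codes, the closer the two locations.

For example, encoding the area around Zhongguancun Software Park in Beijing with 6-character GeoHash codes gives 9 adjacent grid cells that share a common prefix.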


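So how does the GeoHash algorithm encode a latitude/longitude coordinate? In short, it uses bisection to narrow the longitude and latitude intervals step by step, producing one binary code for each. The two codes are then interleaved bit by bit, and the result is written with letters and digits. As an example, take the coordinate 116.29513, 40.04920 and run the algorithm on the longitude:

Split the longitude interval [-180,180] into [-180,0] and [0,180]. 116.29513 lies in the right half, so record 1.

Split [0,180] into [0,90] and [90,180]. 116.29513 lies in the right half, so record 1.

Split [90,180] into [90,135] and [135,180]. 116.29513 lies in the left half, so record 0.

Repeat this process (0 for the left half, 1 for the right half) until the required precision is reached. This gives the binary code 11010 01010 11001.

In the same way, bisecting the latitude interval [-90,90] for the latitude 40.04920 gives the binary code 10111 00011 11010. The two codes are then merged into a new binary number with the longitude bits in the even positions and the latitude bits in the odd positions (counting from 0), giving 11100 11101 00100 01101 11110 00110. Finally, the result is written in base 32 using 32 digits and letters (the letters a, i, l, and o are excluded): each group of 5 bits is converted to a decimal number, 28 29 4 13 30 6, and each number is looked up in the encoding table, giving wx4ey6.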
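The worked example above can be reproduced step by step in a few lines. This is a minimal sketch using only the standard library; bisect_bits is a hypothetical helper name, and 15 bits per axis are used because that is what a 6-character code needs.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # digits plus letters without a, i, l, o


def bisect_bits(value, low, high, n_bits):
    """Binary-encode value by repeatedly halving [low, high]: right half 1, left half 0."""
    bits = []
    for _ in range(n_bits):
        mid = (low + high) / 2
        if value > mid:
            bits.append("1")
            low = mid
        else:
            bits.append("0")
            high = mid
    return "".join(bits)


lon_bits = bisect_bits(116.29513, -180.0, 180.0, 15)
lat_bits = bisect_bits(40.04920, -90.0, 90.0, 15)

# Interleave: longitude bits in even positions (0, 2, ...), latitude bits in odd positions.
merged = "".join(lo + la for lo, la in zip(lon_bits, lat_bits))

# Every 5 bits form one base-32 digit, which indexes into the alphabet.
digits = [int(merged[i:i + 5], 2) for i in range(0, len(merged), 5)]
geohash = "".join(BASE32[d] for d in digits)

print(lon_bits, lat_bits)
print(digits, geohash)
```

Printing the intermediate bit strings makes it easy to compare each stage with the hand calculation.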


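Sorting such GeoHash codes traverses space along a Z-shaped (Z-order) curve. Several other space-filling curves with different properties were developed later, but none is as simple and general as the Z-order curve, so they are not covered here.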
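The Z-shaped ordering can be seen on a tiny grid. The sketch below interleaves the bits of a column and a row index in the same way GeoHash interleaves longitude and latitude bits; z_index is a hypothetical helper, and the 4 x 4 grid is just an illustration.

```python
def z_index(col, row, n_bits):
    """Interleave the bits of col and row (col bit first), like lon/lat in GeoHash."""
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        code = (code << 1) | ((col >> bit) & 1)
        code = (code << 1) | ((row >> bit) & 1)
    return code


n_bits = 2  # 2 bits per axis gives a 4 x 4 grid
cells = [(col, row) for col in range(4) for row in range(4)]
ordered = sorted(cells, key=lambda c: z_index(c[0], c[1], n_bits))
print(ordered)

# Consecutive codes usually belong to touching cells, but not always:
# collect the steps where the two cells do not even share a corner.
jumps = [
    (a, b) for a, b in zip(ordered, ordered[1:])
    if max(abs(a[0] - b[0]), abs(a[1] - b[1])) > 1
]
print(jumps)
```

On this grid there is exactly one such jump, from cell (1, 3) to cell (2, 0), which is where the curve leaves the left half of the grid and restarts in the right half.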

Other curves can fill space as well, such as the well-known Hilbert curve. If you are interested, this bilibili video is a fun introduction:

https://www.bilibili.com/video/BV1Sf4y147J9?from=search&seid=12367619856156226126

    # reference: https://github.com/vinsci/geohash
    def geohash_encode(latitude, longitude, precision=12):
        """
        Encode a position given in float arguments latitude, longitude to
        a geohash which will have the character count precision.
        """
        lat_interval, lon_interval = (-90.0, 90.0), (-180.0, 180.0)
        base32 = '0123456789bcdefghjkmnpqrstuvwxyz'
        geohash = []
        bits = [16, 8, 4, 2, 1]
        bit = 0
        ch = 0
        even = True
        while len(geohash) < precision:
            if even:
                mid = (lon_interval[0] + lon_interval[1]) / 2
                if longitude > mid:
                    ch |= bits[bit]
                    lon_interval = (mid, lon_interval[1])
                else:
                    lon_interval = (lon_interval[0], mid)
            else:
                mid = (lat_interval[0] + lat_interval[1]) / 2
                if latitude > mid:
                    ch |= bits[bit]
                    lat_interval = (mid, lat_interval[1])
                else:
                    lat_interval = (lat_interval[0], mid)
            even = not even
            if bit < 4:
                bit += 1
            else:
                geohash += base32[ch]
                bit = 0
                ch = 0
        return ''.join(geohash)

    # Call the geohash function
    DF[DF['ID'] == 1].apply(lambda x: geohash_encode(x['lat'], x['lon'], 7), axis=1)

    1873158    9rc76bv
    1873159    9rc76cq
    1873160    9rc76fw
    1873161    9rc76gn
    1873162    9rc76gy
                ...
    1873517    9rc7xnv
    1873518    9rc7xnv
    1873519    9rc7xnv
    1873520    9rc7xnv
    1873521    9rc7xnv
    Length: 364, dtype: object

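Caveats

GeoHash's main value is that it encodes two-dimensional latitude/longitude coordinates into a one-dimensional string, so geographic indexing needs only string matching, which is convenient for caching and compression. When clustering or mining data with big-data tools such as Spark, GeoHash is especially convenient and efficient.

There are, however, some caveats when using GeoHash:

Because GeoHash fills space sequentially along a Z-order curve, and the curve jumps abruptly at its corners, some adjacent cells share a much shorter prefix than others. Prefix matching therefore finds some of the nearby regions but misses others.

All points inside a cell share one GeoHash value. A point near the edge of a cell can therefore be matched with a point that is relatively far away but in the same cell, while a much closer point just across the cell boundary is not matched. A common remedy is to increase the GeoHash length appropriately and to include the 8 neighboring cells in the search, because relying on a single GeoHash code alone can miss points that are genuinely close.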
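Both caveats can be demonstrated with two points that lie on opposite sides of the equator and the prime meridian. The sketch below is self-contained: encode is a compact encoder equivalent in spirit to geohash_encode, and cell_and_neighbours is a hypothetical helper that finds the surrounding cells by shifting the point by one cell width and height. It deliberately ignores wrap-around at the poles and the antimeridian.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def encode(lat, lon, precision):
    """Plain GeoHash encoder: bisect lon and lat alternately, 5 bits per character."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    chars, value = [], 0
    for i in range(precision * 5):
        if i % 2 == 0:  # even bit: longitude
            mid = (lon_lo + lon_hi) / 2
            bit = lon > mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:           # odd bit: latitude
            mid = (lat_lo + lat_hi) / 2
            bit = lat > mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        if i % 5 == 4:  # a full group of 5 bits becomes one character
            chars.append(BASE32[value])
            value = 0
    return "".join(chars)


def cell_and_neighbours(lat, lon, precision):
    """GeoHash of the point plus its 8 surrounding cells (no pole/antimeridian handling)."""
    n_bits = precision * 5
    d_lon = 360.0 / 2 ** math.ceil(n_bits / 2)  # cell width in degrees
    d_lat = 180.0 / 2 ** (n_bits // 2)          # cell height in degrees
    return {
        encode(lat + i * d_lat, lon + j * d_lon, precision)
        for i in (-1, 0, 1) for j in (-1, 0, 1)
    }


a = encode(0.0001, 0.0001, 5)     # just north-east of (0, 0)
b = encode(-0.0001, -0.0001, 5)   # just south-west of (0, 0), roughly 31 m away
neighbours = cell_and_neighbours(0.0001, 0.0001, 5)
print(a, b)
print(b in neighbours)
```

The two points are only about 31 m apart, yet their codes share no prefix at all, so prefix matching alone would never pair them. Searching the 8 neighboring cells recovers the match.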

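Homework

Basic:
1. Use kepler.gl visualization to analyze how the AIS data of the different vessel types is distributed, as groundwork for the feature engineering that follows.
Advanced:
2. This module introduced several libraries and their commonly used methods. If you can, take the anomaly-filtered data (DF), keep only the key points identified by the Douglas-Peucker algorithm, delete the points it discards, and then encode the remaining coordinates with geohash.

References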

https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.163c24d1HiGiFo&postId=110644
