Reading HDFS Files with Python
### Method 1: read the HDFS file with the hdfs library
### Pass encoding='utf-8' when reading; otherwise each line comes back as bytes (b'xxx')
### Collect the lines in a list, build a DataFrame from it, split each row into columns,
### and finally convert the columns to the required dtype
import pandas as pd
from hdfs.client import Client

client = Client("http://hadoop-1-1:50070")

lines = []
with client.read("/user/spark/H2O/Wholesale_customers_data.csv", encoding='utf-8') as reader:
    for line in reader:
        lines.append(line.strip())

column_str = lines[0]  ## the first line is the header
column_list = column_str.split(',')

data = {"item_list": lines[1:]}

df = pd.DataFrame(data=data)
df[column_list] = df["item_list"].apply(lambda x: pd.Series(x.split(",")))  ## split each row into the named columns
df.drop("item_list", axis=1, inplace=True)  ## drop the raw column

df.dtypes
"""
Region              object
Fresh               object
Milk                object
Grocery             object
Frozen              object
Detergents_Paper    object
Delicassen          object
target              object
dtype: object
"""

df = df.astype('int')  ## convert the object columns to int64
df.dtypes
"""
Region              int64
Fresh               int64
Milk                int64
Grocery             int64
Frozen              int64
Detergents_Paper    int64
Delicassen          int64
target              int64
dtype: object
"""

### Method 2: read the HDFS file with the pydoop library
import pydoop.hdfs as hdfs
import pandas as pd

lines = []
with hdfs.open('/user/spark/security/iris.csv', 'rt') as f:
    for line in f:
        ##print(line)
        lines.append(line.strip())

column_list = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

data = {"item_list": lines[0:]}  ## keep every line; the column names are supplied above

df = pd.DataFrame(data=data)
df[column_list] = df["item_list"].apply(lambda x: pd.Series(x.split(",")))  ## split each row into the named columns
df.drop("item_list", axis=1, inplace=True)  ## drop the raw column

## adjust the data types
df[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']] = df[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']].astype('float64')

df.dtypes
"""
Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Species          object
dtype: object
"""

### Read the data directly with pd.read_table
import pydoop.hdfs as hdfs
import pandas as pd

### This copy of the data has a header row
with hdfs.open('/user/spark/security/iris.csv', 'rt') as f:
    df = pd.read_table(f)  ## the default separator is a tab, so each line lands in a single column

column_list = df.columns[0].split(",")
df[column_list] = df.iloc[:, 0].apply(lambda x: pd.Series(x.split(",")))  ## note: select the first column with df.iloc[:, 0]

df.head()
"""
  Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0                                    5.1,3.5,1.4,0.2,setosa          5.1         3.5          1.4         0.2  setosa
1                                      4.9,3,1.4,0.2,setosa          4.9           3          1.4         0.2  setosa
2                                    4.7,3.2,1.3,0.2,setosa          4.7         3.2          1.3         0.2  setosa
3                                    4.6,3.1,1.5,0.2,setosa          4.6         3.1          1.5         0.2  setosa
4                                      5,3.6,1.4,0.2,setosa            5         3.6          1.4         0.2  setosa
"""

df.drop(df.columns[0], axis=1, inplace=True)  ## drop the unsplit first column
df.dtypes
"""
Sepal_Length    object
Sepal_Width     object
Petal_Length    object
Petal_Width     object
Species         object
dtype: object
"""

##### Convert the four fields 'Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width' to float
df[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']] = df[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']].astype('float')

df.dtypes
"""
Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Species          object
dtype: object
"""

Reposted from: https://my.oschina.net/kyo4321/blog/3016864
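Both methods above follow the same pattern: collect the raw text lines, split each line on commas, then convert the numeric fields. The block below is a minimal sketch of that split-and-convert step using only the standard library, so it can be tried without an HDFS cluster. It is not the original author's code: the helper name `parse_lines` and the `sample_lines` list are hypothetical, the sample rows are copied from the `df.head()` output above, and the in-memory list stands in for the lines that would be read from HDFS.

```python
import csv


def parse_lines(lines, numeric_columns, header=None):
    """Split comma-separated lines into dict rows and convert numeric columns.

    lines: stripped text lines, e.g. the list filled while iterating a reader.
    numeric_columns: names of the columns to convert to float.
    header: column names; if None, the first line is treated as the header.
    """
    reader = csv.reader(lines)
    columns = list(header) if header is not None else next(reader)
    rows = []
    for values in reader:
        if not values:  # csv.reader yields [] for blank lines; skip them
            continue
        row = dict(zip(columns, values))
        for name in numeric_columns:
            row[name] = float(row[name])  # raises ValueError on non-numeric text
        rows.append(row)
    return columns, rows


# Hypothetical stand-in for the lines collected from the HDFS file.
sample_lines = [
    "Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species",
    "5.1,3.5,1.4,0.2,setosa",
    "4.9,3,1.4,0.2,setosa",
    "4.7,3.2,1.3,0.2,setosa",
]
numeric = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]

columns, rows = parse_lines(sample_lines, numeric)
print(columns)
print(rows[0])
```

If a DataFrame is still needed, the resulting list of dicts can be passed to `pd.DataFrame(rows)`. For a file without a header row, pass the column names through the `header` argument so the first line is kept as data.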