
Hands-On PaddlePaddle (2): House Price Prediction

Published: 2023/12/10


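This exercise covers:

The typical machine learning workflow:

- Acquire the data
- Preprocess the data
- Train the model
- Apply the model

The basic steps for training a model with fluid:

- Configure the network structure
- Define the cost function (avg_loss)
- Define the optimizer
- Acquire the training data
- Define the compute device (place) and the executor (exe)
- Feed the data (feeder)
- Run training (exe.run)
- Run inference and plot the fitted line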

1 - Importing libraries

2 - Data preprocessing

3 - Defining the reader

4 - Training

5 - Prediction

1 - Importing libraries

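First load the libraries we need:

numpy: NumPy is an extension library for Python. It supports large multi-dimensional arrays and matrix operations, and provides a large collection of mathematical functions that operate on arrays. Its core data structure is the ndarray (n-dimensional array).
matplotlib.pyplot: used for plotting; it is used when checking model accuracy and when showing how the cost changes during training.
paddle.fluid: the fluid API of the PaddlePaddle deep learning framework.
pandas: a third-party Python library that provides high-performance, easy-to-use data structures and analysis tools. It is built on NumPy and is often used together with NumPy and Matplotlib.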

from __future__ import print_function

import math
import sys

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import paddle
import paddle.fluid as fluid

# %matplotlib inline

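2 - Data preprocessing

The dataset is the distribution of house prices in one district of a city in December 2016. To keep the model simple, floor area is assumed to be the only factor that affects price, so the dataset has just two columns and is stored as a txt file.

Real data usually cannot be used directly once it has been collected.
Here, for example, the file has only two columns and neither is labeled; they are floor area and house price. There is visibly some relationship between price and area, and finding that relationship is the goal of this prediction task. Start by printing the first five rows as a table.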

colnames = ['house area', 'house price']
print_data = pd.read_csv('./datasets/data.txt', names=colnames)
print_data.head()

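With a new dataset, the first issue to handle is usually mixed data types. If the attributes include both discrete and continuous values, the discrete ones need special treatment.

Discrete values are often written as numbers such as 0, 1, 2, but they do not mean the same thing as continuous values, because the differences between them carry no meaning. If 0, 1 and 2 stand for red, green and blue, that does not make blue "further" from red than green is. A discrete attribute with d possible values is usually converted into d binary attributes that take the value 0 or 1, or each possible value is mapped to a multi-dimensional vector.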
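The conversion into d binary attributes can be sketched as follows. This is a minimal illustration, not part of the original tutorial: the helper name one_hot and the colour codes are assumptions made up for the example.

```python
import numpy as np

def one_hot(indices, num_values):
    """Map integer category codes to rows of 0/1 indicator attributes."""
    indices = np.asarray(indices, dtype=np.int64)
    encoded = np.zeros((indices.size, num_values), dtype=np.float32)
    # Put a single 1 in each row, in the column given by the category code.
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded

# Hypothetical colour codes: 0 = red, 1 = green, 2 = blue.
colors = [0, 2, 1, 0]
encoded_colors = one_hot(colors, num_values=3)
print(encoded_colors)
```

After encoding, every pair of distinct colours is the same distance apart, so the arbitrary ordering of the codes no longer leaks into the model.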

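This dataset has no discrete values, so the issue does not arise here.

** Normalization **

Look at how the data is distributed. In general, when samples have several attributes, their value ranges can differ widely, which calls for a common operation: normalization. The goal is to rescale every attribute to a similar interval, for example [-0.5, 0.5]. A very common method is used here: subtract the mean, then divide by the original range (maximum minus minimum).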

# coding = utf-8
# global x_raw, train_data, test_data
data = np.loadtxt('./datasets/data.txt', delimiter=',')
x_raw = data.T[0].copy()

# axis=0 computes each statistic per column
# data.shape[0] is the number of rows in data
maximums, minimums, avgs = data.max(axis=0), data.min(axis=0), data.sum(axis=0) / data.shape[0]
print(maximums)
print(minimums)
print(avgs)
print("the raw area :", data[:, 0].max(axis=0))

# Normalize the area column (column 0); data[:, i] selects column i
for i in range(data.shape[0]):
    data[i, 0] = (data[i, 0] - avgs[0]) / (maximums[0] - minimums[0])
    # data[i, 0] = (data[i, 0] - minimums[0]) / (maximums[0] - minimums[0])
print('normalization:', data[:, 0].max(axis=0))
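The same "subtract the mean, divide by the range" rule can also be written without a loop. The sketch below is only an illustration: the function name mean_normalize and the sample areas are made up and are not taken from data.txt.

```python
import numpy as np

def mean_normalize(column):
    """Subtract the mean, then divide by the original range (max - min)."""
    column = np.asarray(column, dtype=np.float64)
    value_range = column.max() - column.min()
    return (column - column.mean()) / value_range

# Made-up areas, for illustration only.
areas = np.array([50.0, 100.0, 150.0, 200.0])
normalized_areas = mean_normalize(areas)
print(normalized_areas)
```

The result has zero mean and spans an interval of width 1. A column whose values are all equal has a range of zero and would need separate handling.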

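Almost all data should be normalized before use, for at least three reasons:

1. Value ranges that are too large or too small can cause floating-point overflow or underflow during computation.

2. Different value ranges give different attributes different importance to the model (at least early in training), and this implicit assumption is often unjustified. It makes optimization harder and training much longer.

3. Many machine learning techniques and models (for example L1 and L2 regularization, or the Vector Space Model) assume that all attributes have roughly zero mean and similar value ranges.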

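** Splitting the dataset **

Once the raw data has been turned into usable data, we split it into two parts, a training set and a test set, so that the model can be evaluated.

The training set is used to adjust the model's parameters, that is, to train the model; the model's error on it is called the training error.
The test set is used for testing; the model's error on it is called the test error.

The purpose of training is to find patterns in the training data that predict new, unseen data, so the test error is the better indicator of model performance. Two factors determine the split ratio: more training data lowers the variance of the parameter estimates and gives a more reliable model, while more test data lowers the variance of the test error and gives a more reliable error estimate. This example uses an 8:2 split.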

ratio = 0.8
offset = int(data.shape[0] * ratio)

train_data = data[:offset]
test_data = data[offset:]

print(len(data))
print(len(train_data))

3 - Defining the reader

Build a read_data() function that reads either the training set (train_data) or the test set (test_data). Inside read_data() we define a reader() that uses the yield keyword, which makes reader() a generator. yield is used much like return, except that it turns the function into a generator. We could build a list that holds all the data, but memory is limited, so an unbounded or very large list is not an option. Often a list with millions of entries is created and only the first few or few dozen are used, which is very wasteful. A generator instead computes the next value on each iteration and never builds the full list, which saves memory.

def read_data(data_set):
    """
    Build a reader.
    Args:
        data_set -- the dataset to read
    Return:
        reader -- a generator function that yields samples and their labels
    """
    def reader():
        """
        A reader.
        Return:
            data[:-1], data[-1:] -- yielded one sample at a time;
            data[:-1] is every element except the last (the features),
            data[-1:] is the last element (the label)
        """
        for data in data_set:
            yield data[:-1], data[-1:]
    return reader

# Test the reader
test_array = ([10, 100], [20, 200])
print("test_array for read_data:")
for value in read_data(test_array)():
    print(value)
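To see the memory argument in action, here is a small sketch of a reader that describes a million samples but only ever builds the ones that are requested. The name counting_reader and the synthetic samples are hypothetical and serve only as an illustration.

```python
from itertools import islice

def counting_reader(limit):
    """Return a reader; calling it yields (features, label) pairs lazily."""
    def reader():
        for i in range(limit):
            # Each pair is built only when the consumer asks for it.
            yield [float(i)], [float(i) * 10.0]
    return reader

# A million samples are described, but only three are ever materialised.
first_three = list(islice(counting_reader(10 ** 6)(), 3))
print(first_three)
```

islice stops pulling from the generator after three items, so the remaining samples are never computed or stored.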

Next we define the data provider used for training. It reads one batch of BATCH_SIZE samples at a time. To add randomness, specify both a batch size and a buffer size; the provider then fills the buffer, shuffles it, and draws batches from the shuffled samples. The batch size is set through batch_size and is usually a power of 2.

The arguments are:

paddle.reader.shuffle(read_data(train_data), buf_size=500) reads up to buf_size=500 samples from read_data(train_data) into a buffer and shuffles them
paddle.batch(reader, batch_size=BATCH_SIZE) takes BATCH_SIZE samples at a time from the shuffled data for one training iteration (BATCH_SIZE is 8 in the code below)

If buf_size is larger than the dataset, the whole dataset is shuffled at once; if it is smaller, the data is shuffled in chunks of buf_size.
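The behaviour described above can be imitated in plain Python. The sketch below is not Paddle's implementation; buffered_shuffle, make_batches and range_reader are hypothetical stand-ins written to mirror the description of paddle.reader.shuffle and paddle.batch.

```python
import random

def buffered_shuffle(reader, buf_size, seed=None):
    """Shuffle samples within consecutive buffers of at most buf_size items."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Flush whatever is left in the last, partially filled buffer.
        rng.shuffle(buf)
        for item in buf:
            yield item
    return shuffled_reader

def make_batches(reader, batch_size):
    """Group consecutive samples into lists of batch_size items."""
    def batch_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # keep the final, smaller batch
            yield batch
    return batch_reader

def range_reader():
    for i in range(10):
        yield i

shuffled = buffered_shuffle(range_reader, buf_size=4, seed=0)
batches = list(make_batches(shuffled, batch_size=3)())
print(batches)
```

With buf_size=4 and ten samples, the values 0 to 3 are shuffled among themselves, then 4 to 7, then 8 and 9. A sample never moves outside its own buffer, which is why a buf_size at least as large as the dataset is needed for a full shuffle.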

BATCH_SIZE = 8

# Training reader
train_reader = paddle.batch(
    paddle.reader.shuffle(read_data(train_data), buf_size=500),
    batch_size=BATCH_SIZE)

# Test reader
test_reader = paddle.batch(
    paddle.reader.shuffle(read_data(test_data), buf_size=500),
    batch_size=BATCH_SIZE)

4 - Training

With preprocessing done and read_data() in place, we move on to training. The key steps for defining a trainable linear regression model in PaddlePaddle are:

- Configure the network structure and set parameters
  - Configure the network structure
  - Define the loss function
  - Define the executor (used to initialize the parameters randomly)
  - Define the optimizer
- Train the model
- Predict
- Plot the fitted line

Define the compute device:

Start with the most basic definition, the compute device. In fluid it is initialized with place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace():

place is the device a fluid program runs on; the common choices are fluid.CUDAPlace(0) and fluid.CPUPlace()
use_cuda = False means the GPU is not used to accelerate training

# Train on CPU or GPU
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

Configure the network structure and set parameters:

# Input layer: fluid.layers.data defines a data layer named 'x' whose output is a tensor
# shape=[1]: each sample is a 1-dimensional vector
# dtype='float32': the data type is float32
x = fluid.layers.data(name='x', shape=[1], dtype='float32')

# Label data: a data layer named 'y' whose output is a tensor
# shape=[1]: each sample is a 1-dimensional vector
y = fluid.layers.data(name='y', shape=[1], dtype='float32')

# Output layer: fluid.layers.fc is a fully connected layer; input=x feeds x into it
# size=1: one neuron; act=None: no activation, so the output is linear
y_predict = fluid.layers.fc(input=x, size=1, act=None)

Define the loss function:

# Use squared error as the loss and average it over the batch; the result is named avg_loss
avg_loss = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_loss = fluid.layers.mean(avg_loss)
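As a worked example of what these two lines compute, the sketch below squares each prediction error and then averages the results. It uses plain Python rather than fluid, and the function name and the prices are made up for illustration.

```python
def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, labels)]
    return sum(squared_errors) / len(squared_errors)

# Made-up prices: every prediction is off by exactly 10.
predicted_prices = [100.0, 210.0, 290.0]
true_prices = [110.0, 200.0, 300.0]
loss = mean_squared_error(predicted_prices, true_prices)
print(loss)
```

Each error of 10 contributes 100 after squaring, so the mean is 100. Because prices here are not normalized, loss values of this magnitude are normal and are not a sign of a problem in themselves.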

Define the executor (the parameters are randomly initialized later, when the executor runs the startup program):

exe = fluid.Executor(place)

Configure the training programs:

main_program = fluid.default_main_program()  # the default (global) main program
startup_program = fluid.default_startup_program()  # the default (global) startup program

# Clone main_program to get test_program.
# Some operators behave differently in training and testing (for example batch_norm);
# the for_test argument tells the two uses apart.
# clone() does not remove any operators, so call it before backward and optimization.
test_program = main_program.clone(for_test=True)

Optimization method:

# Create the optimizer; see fluid.optimizer for other optimizers
learning_rate = 0.0005
sgd_optimizer = fluid.optimizer.SGD(learning_rate)
sgd_optimizer.minimize(avg_loss)
print("optimizer is ready")
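To make the update rule concrete, here is a minimal NumPy sketch of gradient descent for a one-variable linear model under mean squared error. It is an illustration, not what fluid runs internally: the name sgd_step, the synthetic data and the learning rate of 0.1 are assumptions, and each step uses the full dataset for simplicity, whereas SGD applies the same rule to one mini-batch at a time.

```python
import numpy as np

def sgd_step(w, b, inputs, targets, learning_rate):
    """One gradient-descent update of y_hat = w * x + b under mean squared error."""
    error = w * inputs + b - targets
    grad_w = 2.0 * np.mean(error * inputs)  # d(loss)/dw
    grad_b = 2.0 * np.mean(error)           # d(loss)/db
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Synthetic, noise-free data on the line y = 3x + 1 (inputs already normalized).
inputs = np.linspace(-0.5, 0.5, 20)
targets = 3.0 * inputs + 1.0

w_fit, b_fit = 0.0, 0.0
for _ in range(2000):
    w_fit, b_fit = sgd_step(w_fit, b_fit, inputs, targets, learning_rate=0.1)
print(w_fit, b_fit)
```

With inputs confined to [-0.5, 0.5], the gradient for the weight is small, so the weight converges more slowly than the bias. This is one reason the choice of learning rate interacts with how the data is normalized.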

Set up the training process:

Training needs a training program and a few required arguments; we also build a function that computes the test error during training. The arguments are executor, program, reader, feeder and fetch_list. executor is the executor created earlier. program is the program the executor runs, created earlier; if it is omitted, default_main_program is used. reader supplies the data, feeder maps that data to the input variables, and fetch_list names the variables or results to return.

# Compute the test cost during training
def train_test(executor, program, reader, feeder, fetch_list):
    accumulated = 1 * [0]
    count = 0
    for data_test in reader():
        outs = executor.run(program=program, feed=feeder.feed(data_test), fetch_list=fetch_list)
        accumulated = [x_c[0] + x_c[1][0] for x_c in zip(accumulated, outs)]  # accumulate the loss of each test batch
        count += 1  # count the test batches
    return [x_d / count for x_d in accumulated]  # average loss per batch

# params_dirname is the path where the model is saved
params_dirname = "easy_fit_a_line.inference.model"

Main training loop:

# Plot the training cost
from paddle.utils.plot import Ploter

train_prompt = "Train cost"
test_prompt = "Test cost"
plot_prompt = Ploter(train_prompt, test_prompt)
step = 0

# Main training loop
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe.run(startup_program)

exe_test = fluid.Executor(place)

# num_epochs = 150: make at most 150 passes over the training set
num_epochs = 150

for pass_id in range(num_epochs):
    for data_train in train_reader():
        avg_loss_value, = exe.run(main_program,
                                  feed=feeder.feed(data_train),
                                  fetch_list=[avg_loss])
        if step % 10 == 0:  # record the training loss every 10 batches
            plot_prompt.append(train_prompt, step, avg_loss_value[0])
            plot_prompt.plot()
            # print("%s, Step %d, Cost %f" % (train_prompt, step, avg_loss_value[0]))
        if step % 100 == 0:  # record the test loss every 100 batches
            test_metics = train_test(executor=exe_test,
                                     program=test_program,
                                     reader=test_reader,
                                     fetch_list=[avg_loss.name],
                                     feeder=feeder)
            plot_prompt.append(test_prompt, step, test_metics[0])
            plot_prompt.plot()
            # print("%s, Step %d, Cost %f" % (test_prompt, step, test_metics[0]))
            if test_metics[0] < 10.0:  # test loss is low enough; note this only leaves the inner loop
                break

        step += 1
        if math.isnan(float(avg_loss_value[0])):
            sys.exit("got NaN loss, training failed.")

# Save the trained parameters to the path given earlier
if params_dirname is not None:
    fluid.io.save_inference_model(params_dirname, ['x'], [y_predict], exe)

5 - Prediction

fluid.io.load_inference_model loads the trained model from params_dirname so that it can predict on data it has never seen.

print(test_metics)
infer_exe = fluid.Executor(place)
inference_scope = fluid.core.Scope()


tensor_x: batch_size random numbers drawn from the interval [0, 1], stored as a tensor (a float32 NumPy array)
results: the predicted house prices for the areas in tensor_x
raw_x: because the data was normalized during preprocessing, the random inputs are de-normalized back to raw areas so that the predictions are easier to judge

with fluid.scope_guard(inference_scope):
    [inference_program, feed_target_names, fetch_targets] = fluid.io.load_inference_model(params_dirname, infer_exe)  # load the trained model
    batch_size = 2
    tensor_x = np.random.uniform(0, 1, [batch_size, 1]).astype("float32")
    print("tensor_x is :", tensor_x)
    results = infer_exe.run(inference_program,
                            feed={feed_target_names[0]: tensor_x},
                            fetch_list=fetch_targets)  # run inference
    # raw_x = tensor_x * (maximums[i] - minimums[i]) + avgs[i]
    raw_x = tensor_x * (maximums[0] - minimums[0]) + avgs[0]
    # raw_x = tensor_x * (maximums[0] - minimums[0]) + minimums[0]
    print("the area is:", raw_x)
    print("infer results: ", results[0])

# Slope and intercept of the line through the two predicted points
a = (results[0][0][0] - results[0][1][0]) / (raw_x[0][0] - raw_x[1][0])
b = (results[0][0][0] - a * raw_x[0][0])
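The last two assignments recover the fitted line from two predicted points: the slope is the ratio of the differences, and the intercept follows from substituting one point. A standalone sketch of that arithmetic, with a hypothetical helper name and made-up points, looks like this:

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points need different x values")
    slope = (y1 - y2) / (x1 - x2)
    intercept = y1 - slope * x1
    return slope, intercept

# Made-up (area, price) pairs, for illustration only.
slope, intercept = line_through((80.0, 900.0), (120.0, 1300.0))
print(slope, intercept)
```

Two points are enough only because the model is exactly linear. If the two random inputs happen to be nearly equal, the division becomes numerically unstable, which the guard above only catches in the exactly equal case.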

** (5) Plotting the fitted line **

Training produces a fitted straight line. To judge the model visually, plot the fitted line together with the data.

import numpy as np
import matplotlib.pyplot as plt

def plot_data(data):
    x = data[:, 0]
    y = data[:, 1]
    y_predict = x * a + b
    plt.scatter(x, y, marker='.', c='r', label='True')
    plt.title('House Price Distributions')
    plt.xlabel('House Area ')
    plt.ylabel('House Price ')
    plt.xlim(0, 250)
    plt.ylim(0, 2500)
    predict = plt.plot(x, y_predict, label='Predict')
    plt.legend(loc='upper left')
    plt.savefig('result1.png')
    plt.show()

data = np.loadtxt('./datasets/data.txt', delimiter=',')
plot_data(data)
