[Keras Tutorial for Beginners] 11. Text Preprocessing in Keras

Published: 2024/10/8

@Author: Runsen

Contents

Text preprocessing
  - Tokenization of a sentence
  - One-hot encoding
  - Padding sequences
  - Word Embeddings
  - Word vectors
  - Embedding layer

This post shows how to preprocess text in Keras.

Text preprocessing

Keras API documentation: https://keras.io/preprocessing/text/

from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence, one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

Tokenization of a sentence

Tokenization: the process of converting a sequence of characters into a sequence of tokens (https://en.wikipedia.org/wiki/Lexical_analysis#Token)

sentences = ['Curiosity killed the cat.', 'But satisfaction brought it back']
tk = Tokenizer()            # create a Tokenizer instance
tk.fit_on_texts(sentences)  # the tokenizer must be fit on text data before use

A simple way to model text is to create a sequence of integers for each sentence.
Doing so preserves information about word order.

seq = tk.texts_to_sequences(sentences)
print(seq)

[[1, 2, 3, 4], [5, 6, 7, 8, 9]]
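
To make the integer sequences above less opaque, here is a rough pure-Python sketch of what the tokenizer appears to do: lowercase the text, treat punctuation as separators, and number the words from 1 in order of decreasing frequency. This is not the Keras implementation; the names FILTERS, split_words, build_word_index and to_sequences are made up for illustration, and the filter characters are only assumed to approximate the Keras defaults.

```python
from collections import Counter

# Characters treated as separators (assumed to approximate Keras' default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase the text, turn filtered characters into spaces, then split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer starting at 1, most frequent word first."""
    counts = Counter(word for text in texts for word in split_words(text))
    # most_common() keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace every known word with its integer index, keeping word order."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


sentences = ['Curiosity killed the cat.', 'But satisfaction brought it back']
word_index = build_word_index(sentences)
seq = to_sequences(sentences, word_index)
print(word_index)
print(seq)
```

Index 0 is never assigned to a word, which leaves it free to be used as the padding value later on.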

One-hot encoding

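Sometimes you only need to check whether a word appears in a sentence. This way of processing text is called one-hot encoding (strictly speaking, each row below is a multi-hot vector that marks every word present in one sentence).

If a word appears in the sentence, it is encoded as 1.
If not, it is encoded as 0.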

mat = tk.sequences_to_matrix(seq)
print(mat)

[[0. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]]
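
The matrix above can be reproduced by hand. The sketch below is a minimal pure-Python illustration of the binary encoding, not the Keras code; the function name sequences_to_binary_matrix is hypothetical. It assumes num_words is the vocabulary size plus one, because column 0 is reserved and never set.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Return one row per sequence with 1.0 in the column of every word present."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, sequence in zip(matrix, sequences):
        for index in sequence:
            row[index] = 1.0  # the same word appearing twice still yields 1.0
    return matrix


seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]
mat = sequences_to_binary_matrix(seq, num_words=10)  # 9 words + reserved index 0
for row in mat:
    print(row)
```

Note that this encoding discards word order: two sentences with the same words in a different order produce the same row.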

Padding sequences

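Sentences usually differ in length, so zero padding is applied to give every sequence the same length.
Padding works like padding an image, but is applied to sequences.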

# with padding='pre', zeros are added at the start of each sequence
pad_seq = pad_sequences(seq, padding='pre')
print(pad_seq)

[[0 1 2 3 4]
[5 6 7 8 9]]

# with padding='post', zeros are added at the end of each sequence
pad_seq = pad_sequences(seq, padding='post')
print(pad_seq)

[[1 2 3 4 0]
[5 6 7 8 9]]
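
The two padding modes can be sketched in a few lines of plain Python. This is an illustrative approximation, not the Keras implementation; the function name pad is hypothetical, and the defaults (pad and truncate at the front, fill with 0) are assumed to match those of pad_sequences.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)  # default: length of the longest one
    padded = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # drop items from the front ('pre') or from the back ('post')
            s = s[len(s) - maxlen:] if truncating == "pre" else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        padded.append(fill + s if padding == "pre" else s + fill)
    return padded


seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(pad(seq, padding="pre"))
print(pad(seq, padding="post"))
print(pad(seq, maxlen=3))  # sequences longer than maxlen are truncated
```

With truncation at the front, the last maxlen tokens of a long sentence are the ones that survive, which is worth keeping in mind when the start of a text carries the important words.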

Word Embeddings

Another way to represent words is to use word vectors that were pre-trained as word embeddings (e.g. word2vec, GloVe, fastText).
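
As a sketch of how pre-trained vectors could be lined up with a tokenizer's word index, the example below parses GloVe-style text lines (a word followed by its space-separated components) and fills one matrix row per word index. The vectors are made-up 3-dimensional toy values, not real GloVe data, and the names toy_file, load_vectors and build_embedding_matrix are hypothetical.

```python
import io

# Made-up 3-dimensional vectors in a GloVe-style plain-text layout.
toy_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "back 0.7 0.8 0.9\n"
)


def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        if word in vectors:
            matrix[index] = vectors[word]
    return matrix


# a subset of the word index produced by the tokenizer earlier in this post
word_index = {'curiosity': 1, 'killed': 2, 'the': 3, 'cat': 4}
matrix = build_embedding_matrix(word_index, load_vectors(toy_file), dim=3)
for row in matrix:
    print(row)
```

A matrix built this way has one row per token index, so it can serve as the initial weights of an embedding layer whose input_dim is len(word_index) + 1 and whose output_dim is the vector dimension.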

Word vectors

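Word embedding is the process of converting each word into a (word) vector of fixed dimension. The dimensionality of the embedding space (i.e. the vector space) is a hyperparameter and can be set to any positive integer.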

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense, Activation
from tensorflow.keras.datasets import reuters
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical

Embedding layer

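The output of an Embedding layer is therefore a 3-D tensor.

Output shape = (batch_size, input_length, output_dim)
input_dim: size of the input space (the number of unique tokens of interest)
output_dim: dimensionality of the embedding space
input_length: length of the input sequences (if None, the length may vary)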

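# when the input length is fixed
# (input_length is accepted by tf.keras in TensorFlow 2.x; newer Keras releases deprecate it)
model = Sequential()
model.add(Embedding(input_dim = 10, output_dim = 5, input_length = 3))
model.output_shape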

(None, 3, 5)

# when the input length varies
model = Sequential()
model.add(Embedding(input_dim = 10, output_dim = 5, input_length = None))
model.output_shape

(None, None, 5)
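
Conceptually, an embedding layer is a lookup table with input_dim rows of output_dim values each, which is why a batch of integer sequences comes out as a 3-D tensor. The following pure-Python sketch illustrates only the shapes; the random table stands in for trained weights, and the names table and embed are hypothetical.

```python
import random

random.seed(0)
input_dim, output_dim = 10, 5

# One row of output_dim values per token index; random numbers stand in for learned weights.
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]


def embed(batch):
    """Replace every token index in the batch with its row of the table."""
    return [[table[index] for index in sequence] for sequence in batch]


batch = [[1, 2, 3], [4, 5, 0]]  # batch_size = 2, input_length = 3
out = embed(batch)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (batch_size, input_length, output_dim)
```

The same token index always maps to the same vector, regardless of its position in the sequence.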

In a model, an embedding layer is usually used as the first layer when modeling text data.

# parameters to import dataset
num_words = 3000
maxlen = 50

(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words = num_words, maxlen = maxlen)

# pad every newswire to the same length
X_train = sequence.pad_sequences(X_train, maxlen = maxlen, padding = 'post')
X_test = sequence.pad_sequences(X_test, maxlen = maxlen, padding = 'post')

# convert the integer class labels to one-hot vectors (46 topics)
y_train = to_categorical(y_train, num_classes = 46)
y_test = to_categorical(y_test, num_classes = 46)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1595, 50)
(399, 50)
(1595, 46)
(399, 46)
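
The label shapes above have 46 columns because each integer class label is turned into a one-hot vector. A minimal pure-Python sketch of that conversion, with the hypothetical name to_one_hot and made-up example labels, might look like this:

```python
def to_one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1.0 at the label's position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


labels = [3, 0, 45]  # made-up class ids in the range 0..45
encoded = to_one_hot(labels, num_classes=46)
print(len(encoded), len(encoded[0]))
print([row.index(1.0) for row in encoded])  # recovers the original labels
```

This layout matches what the categorical_crossentropy loss expects: one probability target per class for every sample.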

input_dim = num_words
output_dim = 100        # we set the dimensionality of the embedding space to 100
input_length = maxlen

def reuters_model():
    model = Sequential()
    model.add(Embedding(input_dim = input_dim, output_dim = output_dim, input_length = input_length))
    model.add(GRU(50, return_sequences = False))
    model.add(Dense(100))
    model.add(Activation('relu'))
    model.add(Dense(46, activation = 'softmax'))
    model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
    return model

model = reuters_model()
model.summary()

model.fit(X_train, y_train, epochs = 10, batch_size = 256, verbose = 1)
result = model.evaluate(X_test, y_test)
print('Test Accuracy: ', result[1])

Test Accuracy: 0.7468671798706055
