05. Sequence Models W3. Sequence Models and Attention Mechanism (Assignments: Machine Translation + Trigger Word Detection)
Table of Contents
- Assignment 1: Machine Translation
- 1. Date Conversion
- 1.1 Dataset
- 2. Machine Translation with an Attention Model
- 2.1 Attention Mechanism
- 3. Visualizing Attention
- Assignment 2: Trigger Word Detection
- 1. Data Synthesis: Creating a Speech Dataset
- 1.1 Listen to the Data
- 1.2 Audio to Spectrogram
- 1.3 Generate a Single Training Example
- 1.4 Full Training Set
- 1.5 Dev Set
- 2. Model
- 2.1 Build the Model
- 2.2 Training
- 2.3 Test the Model
- 3. Predictions
- 3.3 Test on the Dev Set
- 4. Test with Your Own Sample
Quiz: see the companion blog post
Notes: W3. Sequence Models and Attention Mechanism
Assignment 1: Machine Translation
Build a Neural Machine Translation (NMT) model that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25").
We will do this with an attention model, one of the most sophisticated sequence-to-sequence models.
Note the packages to install:

```
pip install Faker==2.0.0
pip install babel
```

- Import packages
1. Date Conversion
The model takes as input dates written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987") and converts them into standardized, machine-readable dates ("1958-08-29", "1968-03-30", "1987-06-24"). We will have the model learn to output dates in the common machine-readable format YYYY-MM-DD.
1.1 Dataset
- 10,000 examples
- Print a few to take a look
Output:
```
[('9 may 1998', '1998-05-09'),
 ('10.11.19', '2019-11-10'),
 ('9/10/70', '1970-09-10'),
 ('saturday april 28 1990', '1990-04-28'),
 ('thursday january 26 1995', '1995-01-26'),
 ('monday march 7 1983', '1983-03-07'),
 ('sunday may 22 1988', '1988-05-22'),
 ('08 jul 2008', '2008-07-08'),
 ('8 sep 1999', '1999-09-08'),
 ('thursday january 1 1981', '1981-01-01')]
```
The loading step above gives us:
- dataset: a list of (human readable date, machine readable date) tuples
- human_vocab: a dictionary mapping characters of human-readable dates to integer indices
- machine_vocab: a dictionary mapping characters of machine-readable dates to integer indices
- inv_machine_vocab: the inverse mapping of machine_vocab, from indices back to characters
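For reference, a minimal sketch of how these objects and the padded/one-hot arrays are produced. The helper names `load_dataset` and `preprocess_data` come from the course's `nmt_utils` module and are assumed here:

```python
from nmt_utils import load_dataset, preprocess_data  # assignment helper module (assumption)

m = 10000                                   # number of examples
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

Tx = 30   # maximum length of a human-readable date (shorter inputs are padded)
Ty = 10   # length of "YYYY-MM-DD"
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape)      # padded index representation of the inputs
print("Xoh.shape:", Xoh.shape)  # one-hot representation
```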
Output:
```
X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)   # 37 = len(human_vocab)
Yoh.shape: (10000, 10, 11)   # 11 = characters appearing in machine dates: digits 0-9 and '-'
```
- Look at a sample (inputs shorter than the maximum length are padded, so every x has length 30)
Output:
```
Source date: saturday october 9 1976
Target date: 1976-10-09

Source after preprocessing (indices):
[29 13 30 31 28 16 13 34  0 26 15 30 26 14 17 28  0 12  0  4 12 10  9 36
 36 36 36 36 36 36]
Target after preprocessing (indices):
[ 2 10  8  7  0  2  1  0  1 10]

Source after preprocessing (one-hot):
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot):
[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
```
2. Machine Translation with an Attention Model
2.1 Attention Mechanism
$$context^{<t>} = \sum_{t' = 0}^{T_x} \alpha^{<t,t'>}\, a^{<t'>}$$
Repeat the input several times: https://keras.io/zh/layers/core/#repeatvector
Concatenate a list of input tensors along a given axis: https://keras.io/zh/layers/merge/#concatenate_1
Bidirectional wrapper: https://keras.io/zh/layers/wrappers/#bidirectional
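These layers are the building blocks of the attention step listed below. A minimal sketch of `one_step_attention`, assuming the shared layers are created once globally (as in the assignment) and that `softmax` is the custom softmax over the time axis supplied by the notebook's `nmt_utils` (an assumption here):

```python
from keras.layers import RepeatVector, Concatenate, Dense, Activation, Dot
from nmt_utils import softmax  # assumption: the notebook's softmax applied over axis=1

# Shared layers, created once so the same weights are reused at every output time step
repeator = RepeatVector(Tx)                         # copies s_prev across the Tx input steps
concatenator = Concatenate(axis=-1)                 # joins a and s_prev on the feature axis
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")
activator = Activation(softmax, name="attention_weights")  # normalize over the Tx axis
dotor = Dot(axes=1)                                 # weighted sum over the time axis

def one_step_attention(a, s_prev):
    """a: Bi-LSTM activations (m, Tx, 2*n_a); s_prev: previous post-attention LSTM state (m, n_s)."""
    s_prev = repeator(s_prev)               # (m, Tx, n_s)
    concat = concatenator([a, s_prev])      # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                     # intermediate "energies"
    energies = densor2(e)                   # (m, Tx, 1)
    alphas = activator(energies)            # attention weights alpha^{<t,t'>}
    context = dotor([alphas, a])            # context^{<t>} = sum_{t'} alpha^{<t,t'>} a^{<t'>}
    return context
```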
- Compute the attention (one_step_attention, sketched above)
- Define the model
- Define the optimizer and compile the model
- Train
- To save time, the instructor has provided pre-trained weights
Output:
```
source: 5th Otc 2019          output: 2019-10-05
source: 5 April 09            output: 2009-04-05
source: 21th of August 2016   output: 2016-08-20
source: Tue 10 Jul 2007       output: 2007-07-10
source: Saturday May 9 2018   output: 2018-05-09
source: March 3 2001          output: 2001-03-03
source: March 3rd 2001        output: 2001-03-03
source: 1 March 2001          output: 2001-03-01
```
3. Visualizing Attention
```python
attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, "Tuesday 09 Oct 1993", num = 7, n_s = 64)
```
You can see that most of the attention is used when predicting the year.
Assignment 2: Trigger Word Detection
- Import packages
1. Data Synthesis: Creating a Speech Dataset
1.1 Listen to the Data
There are positive clips of "activates" (the trigger word), negative clips (non-trigger words), and background noise.
```python
IPython.display.Audio("./raw_data/backgrounds/1.wav")
```
1.2 Audio to Spectrogram
The audio is recorded at 44100 Hz and each clip is 10 seconds long.
```python
x = graph_spectrogram("audio_examples/example_train.wav")
```
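For reference, a rough sketch of what the provided `graph_spectrogram` helper does; the window parameters below follow the course's `td_utils` and should be treated as assumptions:

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

def graph_spectrogram(wav_file):
    rate, data = wavfile.read(wav_file)     # 44100 Hz, 10 s clip -> 441000 samples
    nfft = 200                              # length of each window segment (assumed)
    fs = 8000                               # sampling frequency passed to specgram (assumed)
    noverlap = 120                          # overlap between windows (assumed)
    if data.ndim == 1:                      # mono
        pxx, freqs, bins, im = plt.specgram(data, nfft, fs, noverlap=noverlap)
    else:                                   # stereo: use the first channel
        pxx, freqs, bins, im = plt.specgram(data[:, 0], nfft, fs, noverlap=noverlap)
    return pxx                              # shape (101, 5511) for a 10 s clip
```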
In this assignment the training examples are 10 seconds long and the spectrogram has 5511 time steps, so $T_x = 5511$.
Output:
```
Time steps in audio recording before spectrogram (441000,)
Time steps in input after spectrogram (101, 5511)
```
- Define the parameters (see the sketch below)
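A sketch of the parameters being defined, with the values used in this assignment:

```python
Tx = 5511      # number of time steps input to the model from the spectrogram
n_freq = 101   # number of frequencies fed to the model at each spectrogram time step
Ty = 1375      # number of time steps in the output of our model
```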
1.3 Generate a Single Training Example
- Randomly pick a 10-second background-noise clip
- Randomly insert 0-4 "activate" (trigger-word) clips
- Randomly insert 0-2 negative (non-trigger-word) clips
Output:
```
background len: 10000
activate[0] len: 721
activate[1] len: 731
```
- Get a random time segment from the background audio (helper functions sketched after this list)
- Check whether an inserted clip overlaps with previously inserted ones
- Insert an audio clip
- Insert the 1 labels
- Synthesize a training example
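As referenced in the list above, a minimal sketch of these helper functions, following the assignment's conventions (10,000 ms backgrounds, `Ty = 1375`, and 50 positive labels after each inserted "activate"):

```python
import numpy as np

def get_random_time_segment(segment_ms):
    # Pick a random segment of length segment_ms inside the 10,000 ms background
    segment_start = np.random.randint(low=0, high=10000 - segment_ms)
    segment_end = segment_start + segment_ms - 1
    return (segment_start, segment_end)

def is_overlapping(segment_time, previous_segments):
    # True if the candidate segment overlaps any previously inserted segment
    segment_start, segment_end = segment_time
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            return True
    return False

def insert_ones(y, segment_end_ms):
    # Set 50 output labels to 1 right after the end of an inserted "activate" clip
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    for i in range(segment_end_y + 1, segment_end_y + 51):
        if i < Ty:
            y[0, i] = 1
    return y
```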
1.4 Full Training Set
The instructor has already preprocessed all the data.
```python
# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")
```
1.5 Dev Set
The dev set uses audio recorded by real people.
```python
# Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")
```
2. Model
- Import packages
2.1 Build the Model
The model first uses a 1-D convolutional layer to extract features; this also speeds up the computation, since the GRUs only need to process 1375 time steps instead of 5511.
Note: do not use a bidirectional RNN here. We want to output an action as soon as the trigger word is detected; with a bidirectional RNN we would have to wait for the entire 10-second clip to be recorded before making a decision.
- Some Keras references
Conv1D: https://keras.io/zh/layers/convolutional/#conv1d
BatchNormalization: https://keras.io/zh/layers/normalization/#batchnormalization
GRU: https://keras.io/zh/layers/recurrent/#gru
TimeDistributed: https://keras.io/zh/layers/wrappers/#timedistributed
```python
# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """

    X_input = Input(shape = input_shape)

    ### START CODE HERE ###

    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(filters=196, kernel_size=15, strides=4)(X_input)  # CONV1D
    X = BatchNormalization()(X)                                  # Batch normalization
    X = Activation('relu')(X)                                    # ReLu activation
    X = Dropout(rate=0.8)(X)                                     # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(128, return_sequences=True)(X)  # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                # dropout (use 0.8)
    X = BatchNormalization()(X)             # Batch normalization

    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(128, return_sequences=True)(X)  # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                # dropout (use 0.8)
    X = BatchNormalization()(X)             # Batch normalization
    X = Dropout(rate=0.8)(X)                # dropout (use 0.8)

    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation="sigmoid"))(X)  # time distributed (sigmoid)

    ### END CODE HERE ###

    model = Model(inputs=X_input, outputs=X)

    return model

model = model(input_shape = (Tx, n_freq))
model.summary()
```
Output:
Model: "model_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_2 (InputLayer) (None, 5511, 101) 0 _________________________________________________________________ conv1d_2 (Conv1D) (None, 1375, 196) 297136 _________________________________________________________________ batch_normalization_2 (Batch (None, 1375, 196) 784 _________________________________________________________________ activation_2 (Activation) (None, 1375, 196) 0 _________________________________________________________________ dropout_2 (Dropout) (None, 1375, 196) 0 _________________________________________________________________ gru_2 (GRU) (None, 1375, 128) 124800 _________________________________________________________________ dropout_3 (Dropout) (None, 1375, 128) 0 _________________________________________________________________ batch_normalization_3 (Batch (None, 1375, 128) 512 _________________________________________________________________ gru_3 (GRU) (None, 1375, 128) 98688 _________________________________________________________________ dropout_4 (Dropout) (None, 1375, 128) 0 _________________________________________________________________ batch_normalization_4 (Batch (None, 1375, 128) 512 _________________________________________________________________ dropout_5 (Dropout) (None, 1375, 128) 0 _________________________________________________________________ time_distributed_1 (TimeDist (None, 1375, 1) 129 ================================================================= Total params: 522,561 Trainable params: 521,657 Non-trainable params: 9042.2 訓練
Training is time-consuming, so the instructor has already trained the model on 4,000 examples.
```python
model = load_model('./models/tr_model.h5')
```
Then fine-tune it on our dataset for one more epoch:
```python
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)
```
2.3 Test the Model
```python
loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)
```
Output:
```
25/25 [==============================] - 1s 46ms/step
Dev set accuracy =  0.9427199959754944
```
However, accuracy is not a good metric here: most of the labels are 0, so a model that predicts 0 everywhere would also score high accuracy. A metric such as the F1 score should be used instead.
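This is not part of the assignment, but as a sketch of what a more informative metric could look like, one could threshold the per-time-step sigmoid outputs and compute an F1 score with scikit-learn:

```python
from sklearn.metrics import f1_score

y_prob = model.predict(X_dev)                 # shape (m, Ty, 1), per-step probabilities
y_pred = (y_prob > 0.5).astype(int).ravel()   # flatten all time steps
y_true = Y_dev.astype(int).ravel()
print("Dev set F1 =", f1_score(y_true, y_pred))
```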
3. Predictions
```python
def detect_triggerword(filename):
    plt.subplot(2, 1, 1)

    x = graph_spectrogram(filename)
    # the spectrogram outputs (freqs, Tx) and we want (Tx, freqs) to input into the model
    x = x.swapaxes(0, 1)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)

    plt.subplot(2, 1, 2)
    plt.plot(predictions[0, :, 0])
    plt.ylabel('probability')
    plt.show()
    return predictions
```
Once you have estimated the probability of detecting the word "activate" at each output step, you can trigger a "chiming" sound whenever that probability exceeds a certain threshold. Also, after "activate" is said, many consecutive values of y may be close to 1, yet we only want to chime once. So we insert at most one chime every 75 output steps. This helps prevent inserting two chimes for a single instance of "activate" (similar to non-max suppression in computer vision).
```python
chime_file = "audio_examples/chime.wav"

def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y
    for i in range(Ty):
        # Step 3: Increment consecutive output steps
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed
        if predictions[0, i, 0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds) * 1000)
            # Step 6: Reset consecutive output steps to 0
            consecutive_timesteps = 0

    audio_clip.export("chime_output.wav", format='wav')
```
3.3 Test on the Dev Set
- The first audio clip contains 1 trigger word
- The second audio clip contains 2 trigger words
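A sketch of how these two clips are run through the pipeline; the filenames under `./raw_data/dev/` are assumed from the assignment's folder layout:

```python
# First dev clip (1 trigger word expected)
filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

# Second dev clip (2 trigger words expected)
filename = "./raw_data/dev/2.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")
```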
4. Test with Your Own Sample
```python
# Preprocess the audio to the correct format
def preprocess_audio(filename):
    # Trim or pad audio segment to 10000ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export as wav
    segment.export(filename, format='wav')

your_filename = "audio_examples/my_audio.wav"
preprocess_audio(your_filename)
IPython.display.Audio(your_filename)  # listen to the audio you uploaded

chime_threshold = 0.5
prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, chime_threshold)
IPython.display.Audio("./chime_output.wav")
```
Original post: https://michael.blog.csdn.net/article/details/108933798
My CSDN blog: https://michael.blog.csdn.net/
Long-press or scan the QR code to follow my WeChat official account (Michael阿明). Let's keep working hard and learning together!