
MicrosoftAsia - Semantics-Aligned Representation Learning for Person Re-identification: paper reading notes and engineering implementation summary

Published: 2024/1/1


Hand me a bottle and a cigarette and I will code on the spot; the only thing I am short of is time.
Dear readers, welcome, take a seat.
Author's GitHub: https://github.com/wencoast

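Principles and pipeline

Abstract

SAN-PG is obtained by removing the ReID supervision and training the SAN directly (it is called SAN-PG because it is trained on the PIT dataset). The trained SAN-PG is then loaded from a separate program and, combined with the network definition, used to generate pseudo groundtruth texture images for the reID datasets.

a decoder (SA-Dec) for reconstructing or regressing the densely semantics aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation (the generated synthetic texture is aligned). Moreover, at the decoder, besides the reconstruction loss, we add Triplet ReID constraints over the feature maps as the perceptual losses. So there is a triplet ReID constraint in addition to the reconstruction loss. The reconstruction loss is not shown in the pipeline figure.

Then there is 1_004_1_01.png_ep_1.jpg. Is this the final regressed result? That has to be checked against the code.

The idea is to use a semantically aligned carrier as the supervision, so that what the network learns is itself semantically aligned.

Start reading from here:

In one sentence: this is still representation learning, but here the learned representation is semantically aligned for the Re-ID task.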

In this paper, we propose a framework that drives the re-ID network to
learn semantics-aligned feature representation through delicate
supervision designs

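They propose a framework that drives the reID network to learn a semantics-aligned feature representation through delicate supervision designs.

build a Semantics Aligning Network (SAN) (how is the network actually made semantically aligned?) which consists of a base network as encoder (SA-Enc, the semantics-aligned encoder for re-ID) and a decoder (SA-Dec, the semantics-aligned decoder) for reconstructing or regressing the densely semantics aligned full texture image.

So the overall SAN is made up of a semantics-aligned encoder and a semantics-aligned decoder, each with its own job. How is the work divided?

After a few months of digesting it: the encoder extracts features, and the decoder regresses the texture image, which in turn improves the reID feature.

Both solving person re-identification and generating the aligned texture serve as supervision signals, i.e. training is under the supervision of person re-identification and aligned texture generation.

In the decoder, besides the reconstruction loss, Triplet ReID constraints over the feature maps are added as the perceptual losses. At inference the decoder is discarded, which makes the method computationally more efficient. Ablation studies confirm the effectiveness of the design.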

The main challenges are large variations in:

- human pose
- capturing viewpoints
- incompleteness of the bodies (due to occlusion)

All of these result in semantics misalignment across 2D images.

What is a full texture image?

What is a texture image?

A texture image on the UV coordinate system represents the aligned full texture of the 3D surface of the person. That is, a texture image in UV space expresses the aligned full texture of the person's 3D surface (this works because humans share a generic 3D model). Besides, a texture image contains all the texture of the full 3D surface of a person.

Whoever the person is, the texture information of texture images in UV space is aligned.

Note that the texture images across different persons are densely semantically aligned. DensePose is used to obtain dense semantics from person images. So are the texture images used in this paper obtained with DensePose?

Worth noting: using an aligned texture image to synthesize a person image of another pose or view is not MicrosoftAsia's contribution. That work was done by Facebook AI Research and by Wang et al. in 2019.

For input images of different persons, the corresponding texture images are well semantics aligned.

At the same spatial position in different texture images, the semantics are the same.

For person images with different visible semantics/regions, their texture images are semantics consistent/aligned, since each one contains the full texture/information of the 3D person surface.

In this paper, the learned feature representation is inherently semantically aligned.

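Why use texture images?

As the person identity is mainly characterized by textures.
Person identity is mainly characterized by texture because humans share a generic 3D model, and everyone can perform the same actions and poses. The biggest difference between people is the texture of their appearance, so I think texture belongs to appearance.
Texture images for different persons/viewpoints/poses are densely semantically aligned, as illustrated in the following Figure.
For input images of different persons, the corresponding texture images are nevertheless well semantics aligned.
- First, at the same spatial position in different texture images, the semantics are the same: where an arm is expected there is an arm, and where a leg is expected there is a leg.
- Second, for person images with different visible semantics/regions (for example, one shows the complete upper body and another shows the upper body without the head), their texture images are still semantically aligned, since each one contains the full texture/information of the 3D person surface.

That does not mean my own texture image contains the full texture information of the 3D person surface, though. Theirs really is comprehensive and looks like a 360-degree view.

They also take the original image as input, but their model (SAN) has been trained on a synthetic dataset, and that model is then used to generate the pseudo ground-truth texture image.

So the question is: starting from what I have now, how do I obtain the full texture information of the 3D person surface?

The ones I generated are 64×64, while the ones released by the authors are 256×256.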

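First, a texture image is derived from the 3D human surface, and the 3D human surface relies on a dedicated surface-based coordinate system, namely UV space.

Each position (u,v) on the 3D human surface has a unique semantic identity in the texture image. For example, a pixel in the bottom-right corner of the texture image corresponds to some semantics of a hand.

Moreover, a texture image contains all the texture of the full 3D surface of a person, whereas an ordinary 2D person image shows only part of the surface texture.

In other words, the texture covers 360 degrees while an ordinary 2D person image is just one viewpoint. Is that what it means?

What exactly does "full" mean in "the full 3D surface of a person"? I could ask cena.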

How is the Pseudo Groundtruth Texture Images Generation done?

The strangest part: the texture image is obtained from a single image, so why do the authors call it a Pseudo Groundtruth Texture Image?

For any given input person image, we use a simplified SAN (i.e., SAN-PG) which consists of the SA-Enc and SA-Dec, but with only the reconstruction loss. Does the reconstruction loss exist only in the encoder-decoder setup?

They take the 3D-scanned texture data released by other authors (SURREAL), pair each texture with the corresponding input person images, and thus synthesize a Paired Image Texture dataset (PIT).

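What is semantic misalignment?

Spatial semantics misalignment: even when the viewpoints are similar, the same position in different images corresponds to different semantics of the human body (for a person, the semantics are things like legs, belly, arms).
For example, one is a leg while the other is the abdomen.
Inconsistency of visible body regions/semantics: the visible semantics differ. For example, one image shows the front side of the legs and another shows the back side. Both are legs, but one is the front and the other the back of the person, so the semantics are inconsistent.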

Alignment

Explicitly exploit human pose/landmark information (body part alignment), but body part alignment is coarse.
There is still spatial misalignment within the parts.
Based on estimated dense semantics (what does this mean? Can one estimate which body attribute each location corresponds to?)
Benefit of semantic alignment:
To achieve fine-granularity spatial alignment.
Was the earliest work on semantic alignment the 2018 paper by Guler and Neverova?

About Densely Semantically Aligned Person Re-Identification (CVPR2019):

The idea is to warp the originally misaligned images into the canonical UV coordinate system, which yields semantically aligned images. Does that mean the densely semantics aligned images are obtained first and then used as input for the subsequent ReID task?

But the CVPR2019 paper still has a problem:

the invisible body regions result in many holes in the warped images and thus the inconsistency of
visible body regions across images, so the dense semantics misalignment problem remains.

Our work

An aligned texture generation subtask is introduced. Unlike CVPR2019, the alignment here is placed on the texture: the supervision is a densely semantics aligned texture image.

Encoder

SA-Enc can be any baseline network used for person reID.
It produces a feature map of size $h \times w \times c$, which I first assumed would then be flattened to 1D.
On second thought, pooling comes before flattening: average pooling over the feature map gives the reID feature.
The reID losses are then attached after this reID feature.

To encourage SA-Enc to learn semantically aligned features, the paper introduces SA-Dec and configures it accordingly.
SA-Dec is required to regress/generate the densely semantically aligned full texture image (called "texture image" for short) under pseudo ground-truth supervision.
So these semantics aligned texture images are generated by SA-Dec, and the texture image generation relies on the synthetic dataset.

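Why does introducing and configuring the decoder achieve semantic alignment?

Because of: empowering the encoded feature map with aligned full texture generation capability. My reading is that the encoder first produces the reID feature, and the decoder then endows it with the aligned texture generation capability during decoding.

The semantic alignment constraint comes from giving the encoded feature map the ability to generate the aligned full texture. It seems to be the alignment of the generated texture that makes the features aligned.

So how the texture is generated matters a lot, which means looking at how SA-Dec works.

Decoder: for generating densely semantically aligned full texture image with supervision.

At the SA-Dec, besides the reconstruction loss, Triplet ReID constraints over the feature maps are used as the perceptual metric.
Earlier it was the reID loss; here it is the reconstruction loss plus the Triplet ReID constraints.
The ReID datasets themselves have no groundtruth aligned texture image, hence: generating pseudo groundtruth texture images by leveraging synthesized data with person image and aligned texture image pairs (where do these aligned texture image pairs come from?).
This is possible because of Figure 4: a texture image and a 3D mesh plus a background, with suitable rendering parameters, produce a synthesized person image. No decoder is involved at this point, so the textured person image generated this way is presumably not yet semantically aligned.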

Related Work

Semantics Aligned Human Texture
A human body could be represented by a 3D mesh (e.g. SMPL) and a texture image as illustrated in the following figure. As the figure shows, given a texture image and a 3D mesh, rendering produces the person image.

Note: the claim is not that every point on the 2D image has a semantic identity, but that every point on the 3D mesh has a unique semantic identity (represented by its (u,v) coordinates in UV space).

3. The Semantic Alignment Network

In this network, densely semantically aligned full texture images are taken as supervision to drive the learning of semantics aligned features.

How is this done? How is this extra information brought in and used as supervision?

How is it brought in?

First generate a separate folder of texture images, then read the texture images in with the code below.

img = read_image(img_path)
img_texture = read_image(img_path.replace('images_labeled', 'texture_cuhk03_labeled'))
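The path mapping used in that snippet can be tried on its own, without reading any image. This is a minimal sketch: `texture_path_for` is a hypothetical helper name, and it assumes, as the snippet does, that the texture folder sits next to `images_labeled` and reuses the same file names. Unlike the original one-liner it replaces only the first occurrence and fails loudly when the folder name is missing.

```python
def texture_path_for(img_path, image_dir='images_labeled',
                     texture_dir='texture_cuhk03_labeled'):
    """Return the texture path mirroring img_path (hypothetical helper)."""
    if image_dir not in img_path:
        # The original code would silently return the image path itself here.
        raise ValueError(f"{image_dir!r} not found in {img_path!r}")
    # Replace only the first occurrence of the folder name.
    return img_path.replace(image_dir, texture_dir, 1)


example = texture_path_for('/data/cuhk03/images_labeled/1_003_1_01.png')
print(example)
```

The explicit check is a design choice of this sketch: a missing texture folder then shows up as an error instead of the person image being loaded twice.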

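Once read, how are they handed to the network? With the following code:

def __getitem__(self, index):
    # (body elided in this excerpt)
    return img, pid, camid, img_path, img_texture

At this point the textures are loaded. Next, how exactly are they used as the supervision signal?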

The figure below is the framework: an encoder for ReID (which is simply a network) plus a decoder sub-network. It is the SA-Dec that generates the densely semantically aligned full texture with supervision. In other words, is the texture image really used as supervision through the SA-Dec?

model = models.init_model(name=args.arch, num_classes=dm.num_train_pids, loss={'xent', 'htri'})

This uses a ResNet50 pretrained on ImageNet (with an FC512 layer) as the architecture.

Note the loss here (debugger view):

loss={set:2}{'htri','xent'} num_classes={int}767 # num_classes is used together with xent

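How do the encoder and decoder work?

A decoder SA-Dec which enforces constraints over the encoder: the decoder actually constrains the encoder, by requiring the encoded feature (the encoded feature, not the decoded one as I first assumed) to be able to predict/regress the semantically aligned full texture image. How should this be understood, and how is it done?

In the decoder the number of channels gradually decreases, from an input_channel of 2048 down to a final_channel of 16, while the 2D spatial size keeps growing.

The figure also shows that the ReID feature vector f and the network's FC output are not the same thing. The FC output feeds the ID loss, while f feeds the triplet loss directly. Why attach a triplet loss here, and how does it work?

How does the encoder work?

The input image goes through the encoder for ReID. The encoder yields the ReID feature vector after pooling (more precisely, average pooling over the feature map of the encoder's last layer). My question was whether this ReID feature vector and the FC are the same thing. Probably not. The network parameters are supervised by the ReID loss, which is essentially cross entropy.

Answering my own question: they should not be the same thing, because:

what is held in self.global_avgpool and what is held in self.fc are clearly two different things.

And the role of the triplet loss here? It is the ranking loss of triplet loss with batch hard mining.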

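How does the decoder work?

The loss to look at is $L_{Rec}$.

A decoder provides supervision through the densely semantically aligned full texture image.

The decoder is appended right after the last layer of the SA-Enc so that, constrained by the pseudo groundtruth texture image, the SA-Dec reconstructs or regresses the densely semantically aligned full texture image (seen this way, the regressed output looks different and is indeed not the same thing as the pseudo groundtruth; I can print it out later to check). This amounts to supervised learning with the CUHK03 pseudo groundtruth texture images: the network learns by imitating them.

The generated texture image is indeed compared against the target: it is obtained by minimizing the L1 distance between it and the pseudo texture image.

The authors wrote a dedicated class for the decoder.

Why is the input 2048? It is supposed to follow the last layer of the SA-Enc directly, and shouldn't the last layer be 512? (The 2048-channel input is feat_texture, the last convolutional feature map, whose shape torch.Size([4, 2048, 16, 8]) is listed later in these notes; 512 is the size of the FC feature.)

There is first a UNet structure (does this mean the decoder architecture is U-Net rather than ResNet?).

Further similar descriptions of the network structure follow: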

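We can see that

there is another triplet term here. The earlier one is called the Triplet Loss, while this one is called the Triplet ReID Constraints ($L_{TR}$).

In the SA-Dec, Triplet ReID constraints are further incorporated at different layers/blocks as the high level perceptual metric to encourage identity preserving reconstruction.

It is not attached only at the end but used at several layers and blocks. I had not yet found the corresponding code, because it lives in the train function rather than in the .py file that defines the loss functions.

Because it is a high-level perceptual metric, it makes the reconstruction preserve identity better: same identities as close as possible, different ones as far apart as possible.

It can be regarded as part of the reconstruction objective of the encoder-decoder, and it affects the quality of what is reconstructed.

====

The Triplet ReID Constraints pull features of the same identity closer and push features of different identities apart, so that the reconstruction preserves identity. The reconstruction loss itself minimizes the L1 differences between the generated texture image (should be the one with the person, not that ugly texture image) and its corresponding pseudo groundtruth texture image.

There is also a loss at the decoder whose purpose is to make different identities more separable for the encoder.

What does that mean? The loss minimizes the L2 difference between features of the same class and maximizes the difference between features of different classes.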

The texture generation process.

Engineering implementation

dm = ImageDataManager(use_gpu, **image_dataset_kwargs(args))  # dm is the data manager

image_dataset_kwargs is a helper function for ImageDataManager, and ImageDataManager is a class defined in data_manager.py. So we need to see what this class takes as input and what it produces.

class ImageDataManager(BaseDataManager):
    """
    Image-ReID data manager
    """

In more detail:

class ImageDataManager(BaseDataManager):
    """Image-ReID data manager"""
    def __init__(self,
                 use_gpu,
                 source_names,
                 target_names,
                 root,
                 split_id=0,
                 height=256,
                 width=128,
                 train_batch_size=32,
                 test_batch_size=100,
                 workers=4,
                 train_sampler='',
                 num_instances=4,  # number of instances per identity (for RandomIdentitySampler)
                 cuhk03_labeled=False,  # use cuhk03's labeled or detected images
                 cuhk03_classic_split=False  # use cuhk03's classic split or 767/700 split
                 ):

Before digging into ImageDataManager, look at the image_dataset_kwargs function first.

def image_dataset_kwargs(parsed_args):
    """
    Build kwargs for ImageDataManager in data_manager.py from
    the parsed command-line arguments.
    """
    return {
        'source_names': parsed_args.source_names,  # {list:1}['cuhk03']: only the cuhk03 dataset is processed
        'target_names': parsed_args.target_names,  # {list:1}['cuhk03']: whichever dataset is processed is the one saved, so still cuhk03
        'root': parsed_args.root,  # {str}'/project/snow_datasets/Re_ID_datasets/data': parent directory of cuhk03 and the other datasets
        'split_id': parsed_args.split_id,  # 0: split index (note: 0-based); where exactly does it take effect?
        'height': parsed_args.height,  # 256: default image height
        'width': parsed_args.width,  # 128: default image width, although re-id images do not come in this size
        'train_batch_size': parsed_args.train_batch_size,  # 4
        'test_batch_size': parsed_args.test_batch_size,  # 4
        'workers': parsed_args.workers,  # 4
        'train_sampler': parsed_args.train_sampler,  # 'RandomIdentitySampler': seems to pick identities, rather than random samples of a fixed identity
        'num_instances': parsed_args.num_instances,  # 4: number of instances per identity (for RandomIdentitySampler)
        'cuhk03_labeled': parsed_args.cuhk03_labeled,  # True
        'cuhk03_classic_split': parsed_args.cuhk03_classic_split  # True, but Lan's project uses the new split protocol (767/700)
    }
# Input: the parsed args (the actual argument is args from main.py).
# Output: selected keys and values of the parsed args.

I later removed --cuhk03_classic_split from the command line, so when the dict is passed on as **kwargs it is equivalent to cuhk03_classic_split=False. The entries in the return statement of image_dataset_kwargs determine how many keyword arguments **kwargs actually carries.

For every point on the model, which points are visible and which model location they correspond to: at this step it is enough to get DensePose through the model.
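The effect of dropping a flag and unpacking the dict with `**` can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: `build_parser`, `toy_dataset_kwargs` and `ToyManager` are hypothetical stand-ins, only three of the options are kept, and declaring the flag with `action='store_true'` is an assumption about how the real parser defines it.

```python
import argparse


def build_parser():
    # Hypothetical, reduced parser: three of the options discussed above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--height', type=int, default=256)
    parser.add_argument('--width', type=int, default=128)
    # Assumed to be a boolean switch: absent on the command line means False.
    parser.add_argument('--cuhk03-classic-split', action='store_true')
    return parser


def toy_dataset_kwargs(parsed_args):
    # Same pattern as image_dataset_kwargs: pick keys out of the namespace.
    return {
        'height': parsed_args.height,
        'width': parsed_args.width,
        'cuhk03_classic_split': parsed_args.cuhk03_classic_split,
    }


class ToyManager:
    # Stand-in for ImageDataManager; it only stores what it receives.
    def __init__(self, use_gpu, height=256, width=128, cuhk03_classic_split=False):
        self.use_gpu = use_gpu
        self.height = height
        self.width = width
        self.cuhk03_classic_split = cuhk03_classic_split


args_without_flag = build_parser().parse_args([])
args_with_flag = build_parser().parse_args(['--cuhk03-classic-split'])
manager = ToyManager(False, **toy_dataset_kwargs(args_without_flag))
print(manager.cuhk03_classic_split, toy_dataset_kwargs(args_with_flag))
```

Each key of the returned dict becomes one keyword argument of the constructor, so a key that the constructor does not accept would raise a TypeError at the call.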

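Start with CUHK03 (labeled)

Dataset statistics:

Split protocol: 767/700
Number of identities involved: 843+440+77+58+49, i.e. the data of all five camera pairs are used

query={list:1400}: each element of the query list is one image, in the following format: ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_003_1_01.png', 3, 0]
File naming rule:
First number: index of the camera pair that captured the image (here, pair 1).
Second field (three digits): the identity index. No camera pair has more than 843 identities, so three digits are enough.
Third number: camera 1 or camera 2 within the camera pair.
Fourth number: which image of this person it is, at most 10 (from 1 to 10).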
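Copyright notice: this is an original article by CSDN blogger 「貝勒的杭蓋VanDebiao」, licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/HeavenerWen/article/details/106248257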
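Meaning of the remaining 3 and 0:
3 should be the identity index within that camera pair, the same as the 3 in 1_003_1_01.png.
0 should indicate which of the two cameras in the pair took the image (0 or 1). Camera 0 can be thought of as the one shooting from the side and camera 1 as the one shooting from behind.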
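The naming rule can be checked with a few lines of code. This is a minimal sketch; `parse_cuhk03_name` is a hypothetical helper, and the zero-based `camid` computed as camera minus 1 is inferred from the list entries shown in these notes (camera `1` appears with `0`, camera `2` with `1`), not taken from the project's code.

```python
import os


def parse_cuhk03_name(img_path):
    """Split a name such as 1_003_1_01.png into its four numeric fields."""
    stem = os.path.splitext(os.path.basename(img_path))[0]
    pair, identity, camera, image_no = (int(part) for part in stem.split('_'))
    return {
        'camera_pair': pair,    # which of the five camera pairs
        'identity': identity,   # identity index within that pair
        'camera': camera,       # 1 or 2 within the pair
        'image_no': image_no,   # 1..10
        'camid': camera - 1,    # assumed zero-based camera id used in the lists
    }


fields = parse_cuhk03_name(
    '/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_003_1_01.png')
print(fields)
```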

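gallery={list:5332}: each element of the gallery list is likewise one image.

Example from the gallery set

Example from the query set

The query example shows that each identity has 2 images in the query set, one from camera 0 and one from camera 1, i.e. one side view and one back view. They always seem to be the second and the eighth image. Is that to guarantee that both directions are represented among the query samples?

num_gallery_cams={int}2 num_query_cams={int}2 num_train_cams={int}2

There is a set of json files here, generated automatically from the program and the dataset in use. I am not sure they would still be generated correctly with another dataset.

Looking at the format of the training images: there are 767 training identities, and none of these 767 identities has been seen at test time.

1. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_001_1_01.png', 0, 0]
2. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_001_2_06.png', 0, 1]
3. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_002_1_01.png', 1, 0]
4. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_002_2_06.png', 1, 1]
5. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_004_1_01.png', 2, 0]
6. ['/project/snow_datasets/Re_ID_datasets/data/cuhk03/images_labeled/1_004_2_06.png', 2, 1]
# Here the second item is the label of the 767 ids, from 0 to 766. I checked: it really is 0 to 766.
# The third item is the camera: 0 for the side camera, 1 for the back camera.

Sample from the training set

So under this test protocol the data of all 5 camera pairs are used.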

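pid += self._num_train_pids
# i.e. pid = pid + self._num_train_pids: 0 + 0, 1 + 0, 2 + 0, ...
# Inspecting self shows that _num_train_pids equals 0 here.

After the statement self._num_train_cams += dataset.num_train_cams has run, self.train ends up looking like this.

Each entry is effectively (img_path, pid, camid). Actual training has not started yet, and not even the loading of the training data. The training data are really loaded starting from the code below: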

if self.train_sampler == 'RandomIdentitySampler':
    self.trainloader = DataLoader(
        ImageDataset(self.train, transform=transform_train),  # ImageDataset comes from: from .dataset_loader import ImageDataset
        sampler=RandomIdentitySampler(self.train, self.train_batch_size, self.num_instances),
        batch_size=self.train_batch_size, shuffle=False, num_workers=self.workers,
        pin_memory=self.pin_memory, drop_last=True
    )

The key piece here is DataLoader, imported at the top: from torch.utils.data import DataLoader.

It is a PyTorch class,

DataLoader in the PyTorch documentation

Reading this against the official PyTorch API, dataset = ImageDataset(self.train, transform=transform_train). But now there is also an ImageDataset, which looks confusingly similar to ImageDataManager.

from .dataset_loader import ImageDataset
# ImageDataset is another class; more precisely, the dataset class dedicated to ReID training data.

class ImageDataset(Dataset):
    """Image Person ReID Dataset"""
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
        self.totensor = ToTensor()
        self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)
        # Add by Xin Jin, for getting texture:
        img_texture = read_image(img_path.replace('images_labeled', 'texture_cuhk03_labeled'))
        if self.transform is not None:
            img = self.transform(img)
            img_texture = self.normalize(self.totensor(img_texture))
        return img, pid, camid, img_path, img_texture

The class can be viewed as follows:

class ImageDataset():
    """Image Person ReID Dataset"""
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
        self.totensor = ToTensor()
        self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)

Combined with dataset = ImageDataset(self.train, transform=transform_train), we get

self.dataset = dataset = self.train
self.transform = transform_train
# Two more attributes are created on self (the ImageDataset instance, i.e. the class dedicated to ReID training data):
self.totensor = ToTensor()
self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

The next step is __getitem__, which returns a single image sample and is where the synthetic texture is added. I should check the paper for how they describe obtaining and applying this texture.

if self.transform is not None:
    img = self.transform(img)  # transform the original image
    img_texture = self.normalize(self.totensor(img_texture))  # convert the texture image to a tensor, then normalize; we keep this normalization unchanged
return img, pid, camid, img_path, img_texture

Next comes the second argument of DataLoader, sampler:

sampler=RandomIdentitySampler(self.train, self.train_batch_size, self.num_instances)

RandomIdentitySampler is a class imported from:

from .samplers import RandomIdentitySampler

Details of the class:

class RandomIdentitySampler(Sampler):
    """
    Randomly sample N identities, then for each identity,
    randomly sample K instances, therefore batch size is N*K.
    (For PIT, N identities are picked in the same way, then K instances per identity.)

    Args:
    - data_source (list): list of (img_path, pid, camid).
    - num_instances (int): number of instances per identity in a batch.
    - batch_size (int): number of examples in a batch.
    """

What this class actually does is decide how a batch is put together.

    def __init__(self, data_source, batch_size, num_instances):
        self.data_source = data_source
        self.batch_size = batch_size  # train and test batch_size are both 4 here; for PIT it is 64
        self.num_instances = num_instances  # instances drawn per identity; still 4 for PIT
        # Identities per batch, since batch size = instances per identity * number of identities.
        # With batch_size=4: 4 // 4 = 1. With batch_size=64: 64 // 4 = 16, i.e. 16 identities per batch,
        # which does not mean that all 9 or 10 images of each of those 16 pids are used.
        self.num_pids_per_batch = self.batch_size // self.num_instances
        self.index_dic = defaultdict(list)
        for index, (_, pid, _) in enumerate(self.data_source):
            self.index_dic[pid].append(index)
        self.pids = list(self.index_dic.keys())

parser.add_argument('--num-instances', type=int, default=4, help="number of instances per identity")

Next comes a key piece, defaultdict:

from collections import defaultdict  # from the Python standard library

Let's also look at the other RandomIdentitySampler, the one in SANPG

Official Python 3.7 API description

What is a container datatype?

Leaving that detail aside, what does defaultdict do and why is it used here? Because "it gets more interesting when the values in a dictionary are collections (lists, dicts, etc.)".

defaultdict: dict subclass that calls a factory function to supply missing values

Explanation of factory functions

Quora's explanation of Python factory functions

How to use defaultdict and how it works

A blog post explaining defaultdict

The index_dic just obtained from self.index_dic = defaultdict(list):

index_dic={defaultdict:0}defaultdict(<class 'list'>, {})

In this case, the value (an empty list or dict) must be initialized the first time a given key is used. While this is relatively easy to do manually, the defaultdict type automates and simplifies these kinds of operations.

defaultdict(<class 'list'>, {0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 1: [10, 11, 12, 13, 14]})  # the values of this dict are collections, which is why defaultdict is used

Dictionary type:

defaultdict(<class 'list'>, ...)  # states which kind of collection

The dictionary:

{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 1: [10, 11, 12, 13, 14]}

Keys:

0 1

Values:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [10, 11, 12, 13, 14]

This level of understanding is enough for reading the rest of the program. The finer details of defaultdict can be studied later.
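To make the grouping concrete, here is a minimal sketch of the same indexing loop on a tiny hand-made `data_source` (the three file names are placeholders, not dataset files), next to the manual `setdefault` version that `defaultdict(list)` saves us from writing.

```python
from collections import defaultdict

# Toy data_source in the (img_path, pid, camid) format used by the sampler.
data_source = [('a.png', 0, 0), ('b.png', 0, 1), ('c.png', 1, 0)]

# defaultdict(list) creates the empty list the first time a pid is seen.
index_dic = defaultdict(list)
for index, (_, pid, _) in enumerate(data_source):
    index_dic[pid].append(index)
pids = list(index_dic.keys())

# The same grouping with a plain dict needs an explicit default.
plain = {}
for index, (_, pid, _) in enumerate(data_source):
    plain.setdefault(pid, []).append(index)

print(dict(index_dic), pids)
```

Each key is an identity label and each value is the list of positions in `data_source` that belong to that identity, which is exactly what the sampler later draws from.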

self.pids = list(self.index_dic.keys())
# This gives: pids={list:767}[0, 1, 2, 3, ..., 766]
# so sampling is over all ids

Then comes the __iter__ function. I will not dig into this part for now; if the images are not processed in any special way, it can be skipped at first.

def __iter__(self):
    batch_idxs_dict = defaultdict(list)  # same approach, now for the batches
    for pid in self.pids:
        # Plain assignment in Python does not copy objects, it only binds a name to the object.
        # deepcopy leaves the original self.index_dic[pid] unchanged.
        idxs = copy.deepcopy(self.index_dic[pid])
        if len(idxs) < self.num_instances:
            idxs = np.random.choice(idxs, size=self.num_instances, replace=True)
        random.shuffle(idxs)
        batch_idxs = []
        for idx in idxs:
            batch_idxs.append(idx)
            if len(batch_idxs) == self.num_instances:
                batch_idxs_dict[pid].append(batch_idxs)
                batch_idxs = []
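The per-identity chunking in this part of `__iter__` can be sketched with the standard library only. This is an illustration under assumptions, not the sampler itself: `chunk_per_identity` is a hypothetical name, `random.Random.choices` stands in for `np.random.choice(..., replace=True)`, and the later step that assembles chunks of several identities into batches is not shown because it is not part of the excerpt.

```python
import copy
import random
from collections import defaultdict


def chunk_per_identity(index_dic, num_instances, rng):
    """Split each identity's indices into chunks of exactly num_instances."""
    batch_idxs_dict = defaultdict(list)
    for pid, indices in index_dic.items():
        idxs = copy.deepcopy(indices)  # shuffle a copy, keep index_dic intact
        if len(idxs) < num_instances:
            # Too few images: sample with replacement up to num_instances.
            idxs = rng.choices(idxs, k=num_instances)
        rng.shuffle(idxs)
        # Keep only full chunks; an incomplete tail is dropped, as in the loop above.
        for start in range(0, len(idxs) - num_instances + 1, num_instances):
            batch_idxs_dict[pid].append(idxs[start:start + num_instances])
    return batch_idxs_dict


index_dic = {0: list(range(10)), 1: [10, 11]}
chunks = chunk_per_identity(index_dic, num_instances=4, rng=random.Random(0))
print({pid: len(groups) for pid, groups in chunks.items()})
```

With 10 images and 4 instances per identity, identity 0 yields two chunks and two images are left out of this pass, which matches the remark above that a batch does not necessarily cover all 9 or 10 images of an identity.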

For collections that are mutable or contain mutable items, a copy is sometimes needed so one can change one copy without changing the other.

The copy.deepcopy operation

copy.deepcopy(x[, memo]) Return a deep copy of x.

A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.

Two problems often exist with deep copy operations that don't exist with shallow copy operations:

Recursive objects (compound objects that, directly or indirectly, contain a reference to themselves) may cause a recursive loop.
Because deep copy copies everything it may copy too much, such as data which is intended to be shared between copies.
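The difference between binding a second name and taking a deep copy is easy to see on a small dictionary shaped like `index_dic`. A minimal sketch with made-up values:

```python
import copy

index_dic = {0: [0, 1, 2]}

alias = index_dic[0]                      # assignment: a second name for the same list
snapshot = copy.deepcopy(index_dic[0])    # deep copy: an independent list

snapshot.reverse()                        # changes only the copy
original_after_copy_edit = list(index_dic[0])

alias.reverse()                           # changes the list stored in index_dic
print(original_after_copy_edit, index_dic[0])
```

This is why the sampler shuffles a deep copy: the stored index lists stay in their original order for the next epoch.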

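Then, depending on train_sampler, there are two different kinds of self.trainloader.

The code is well written. That completes the organization of the training data; next is the test part. All data handling (train and test [query, gallery]) lives in one file, data_manager.py.

Since I am currently training, the test-related parts look as follows:

During the train phase the test data do not take part.

However, after the cuhk03 dataset is read once more,

for name in self.target_names:
    dataset = init_imgreid_dataset(
        root=self.root, name=name, split_id=self.split_id, cuhk03_labeled=self.cuhk03_labeled,
        cuhk03_classic_split=self.cuhk03_classic_split
    )

the contents of testloader_dict change.

During the train phase, after the data are read again, query and gallery get populated.

The query:

The gallery:

After the following code:

self.testdataset_dict[name]['query'] = dataset.query
self.testdataset_dict[name]['gallery'] = dataset.gallery

testdataset_dict has changed as well.

Only here is the data setup (train and test) complete. And so far we only have the data manager object dm; nothing has actually been loaded yet.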

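dm = ImageDataManager(use_gpu, **image_dataset_kwargs(args))
trainloader, testloader_dict = dm.return_dataloaders()  # why is testloader_dict returned rather than testdataset_dict?

trainloader

A few numbers are quite important,
such as 1492 and 7365. I should check what they mean when I have time. (1492 is at least the number of batches per epoch, as the [10/1492] counters in the training log further down show.)

This trainloader is very important, because it is what organizes the data during training:

The code that uses it during training:

for batch_idx, (imgs, pids, _, img_paths, imgs_texture) in enumerate(trainloader):
    # batch_idx: starts from 0, as I expected
    # imgs: I expected the original images; they are actually tensors of the original images
    # pids: I expected a single identity index; it is actually a list such as [556, 556, 556, 556]
    # img_paths: paths of the sample images, several absolute path strings
    # imgs_texture: the texture images read in alongside; actually tensors of the texture images

The following figures show the actual values, not my expectations (with batch_size=4).

imgs are not the raw images but their tensors, with shape torch.Size([4, 3, 256, 128]). This is probably why the texture images they provide are 256*256!

Judging from the figure above, these really are 4 images of a single identity.

Likewise, imgs_texture has shape torch.Size([4, 3, 256, 256]):

The _ entry (the camera id) says that the first three images come from the side camera and the last one from the back camera.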

Several intermediate tensors are important:

outputs, features, feat_texture, x_down1, x_down2, x_down3 = model(imgs)

The decoder also has several important tensors:

recon_texture, x_sim1, x_sim2, x_sim3, x_sim4 = model_decoder(feat_texture, x_down1, x_down2, x_down3)

What are x_sim1, x_sim2, x_sim3, x_sim4 for, and are they used later?

The most important one is recon_texture. outputs, features, feat_texture and recon_texture should be converted from tensors to numpy arrays and saved to see what they look like, especially feat_texture and recon_texture. Their dimensions are still {size:4} at this point, so they need to be split up further.

recon_texture has shape {size:4} torch.Size([4, 3, 256, 256]):

The reconstruction loss takes effect here:

loss_rec = nn.L1Loss()
loss_tri = nn.MSELoss()
loss_recon = loss_rec(recon_texture, imgs_texture)  # L1 constrains the gap between recon_texture and imgs_texture

But that alone shows no triplet term, only L1 for a better reconstruction.

# L1 loss to push same id's feat more similar:
loss_triplet_id_sim1 = 0.0
loss_triplet_id_sim2 = 0.0
loss_triplet_id_sim3 = 0.0
loss_triplet_id_sim4 = 0.0

This is where the triplet is really used again, in the reconstruction part, and it is also where x_sim1 to x_sim4 are consumed.

for i in range(0, ((args.train_batch_size//args.num_instances)-1)*args.num_instances, args.num_instances):
    loss_triplet_id_sim1 += max(loss_tri(x_sim1[i], x_sim1[i+1])-loss_tri(x_sim1[i], x_sim1[i+4])+0.3, 0.0)
    loss_triplet_id_sim2 += max(loss_tri(x_sim2[i+1], x_sim2[i+2])-loss_tri(x_sim2[i+1], x_sim2[i+5])+0.3, 0.0)
    loss_triplet_id_sim3 += max(loss_tri(x_sim3[i+2], x_sim3[i+3])-loss_tri(x_sim3[i+2], x_sim3[i+6])+0.3, 0.0)
    loss_triplet_id_sim4 += max(loss_tri(x_sim4[i], x_sim4[i+3])-loss_tri(x_sim4[i+3], x_sim4[i+4])+0.3, 0.0)
loss_same_id = loss_triplet_id_sim1 + loss_triplet_id_sim2 + loss_triplet_id_sim3 + loss_triplet_id_sim4

How exactly it is used and why: I have no time to write it up yet and will follow up later.
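As a starting point for that follow-up, the first of the four terms can be reproduced in plain Python. This is a minimal sketch under assumptions: feature maps are replaced by short lists of floats, `mse` imitates what `nn.MSELoss()` computes with its default mean reduction, `same_id_hinge` is a hypothetical name, and the hard-coded offset `i+4` of the original is written as `i + num_instances`, which is the same thing only when `num_instances` is 4 and each batch is grouped by identity.

```python
def mse(a, b):
    # Mean squared difference, the quantity nn.MSELoss() returns by default.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def same_id_hinge(feats, batch_size, num_instances, margin=0.3):
    """Sum of max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    total = 0.0
    stop = ((batch_size // num_instances) - 1) * num_instances
    for i in range(0, stop, num_instances):
        anchor = feats[i]
        positive = feats[i + 1]              # same identity, next sample
        negative = feats[i + num_instances]  # first sample of the next identity
        total += max(mse(anchor, positive) - mse(anchor, negative) + margin, 0.0)
    return total


# Two identities, four samples each: well separated features give zero loss.
separated = [[0.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
# All features identical: every triplet violates the margin by exactly 0.3.
collapsed = [[0.0, 0.0]] * 8

loss_separated = same_id_hinge(separated, batch_size=8, num_instances=4)
loss_collapsed = same_id_hinge(collapsed, batch_size=8, num_instances=4)
loss_single_identity = same_id_hinge(collapsed[:4], batch_size=4, num_instances=4)
print(loss_separated, loss_collapsed, loss_single_identity)
```

One consequence worth noting: with `train_batch_size=4` and `num_instances=4`, the setting used in the debugging runs above, the loop bound is `((4//4)-1)*4 = 0`, so the loop body never runs and `loss_same_id` stays at 0. The constraint only becomes active when a batch holds at least two identities.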

First outputs, with shape torch.Size([4, 767]). What is the 767 dimension, is it the FC output, and why 767?
Because 767 is the total number of classes. outputs here is a Tensor ({Tensor:4}), not a list or tuple, so the isinstance check returns False.

Then features, with shape torch.Size([4, 512]). features is also a Tensor rather than a tuple or list.

Then feat_texture, with shape torch.Size([4, 2048, 16, 8]).

Then x_down1, x_down2, x_down3, with shapes torch.Size([4, 256, 64, 32]), torch.Size([4, 512, 32, 16]) and torch.Size([4, 1024, 16, 8]). These should correspond to $f_{e1}$, $f_{e2}$, $f_{e3}$ in the pipeline figure.

The things most worth telling apart are outputs, features and feat_texture. Is outputs actually used later, and how?

It is: cross entropy uses outputs, and the triplet loss uses features.

Also, return_dataloaders() is a function in data_manager.py decorated with @property.

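Warm-up settings in person re-identification

# warm_up settings:
optimizer_warmup = torch.optim.Adam(model.parameters(), lr=8e-06, weight_decay=args.weight_decay, betas=(0.9, 0.999))  # the warm-up optimizer
scheduler_warmup = lr_scheduler.ExponentialLR(optimizer_warmup, gamma=1.259)  # the warm-up scheduler; does learning-rate warm-up always need a dedicated warm-up optimizer?

The baseline starts with a large constant learning rate. Other papers have proposed that a warm-up strategy works better for person re-identification models: start from a small learning rate and raise it gradually over a few epochs, as in the red part of the curve in the figure below, instead of starting with a large learning rate like the blue line:

The figure above is taken from the Zhihu account 那些板磚的日子.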
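The two numbers in the warm-up settings fit together in a simple way: `ExponentialLR` multiplies the learning rate by `gamma` at every scheduler step, and 1.259 is very close to the tenth root of 10. The sketch below just evaluates that geometric sequence in plain Python; the number of warm-up steps (10) is an assumption made for illustration, since the notes do not say how long the warm-up lasts.

```python
def warmup_lrs(base_lr=8e-06, gamma=1.259, steps=10):
    # Learning rate after k scheduler steps is base_lr * gamma ** k.
    return [base_lr * gamma ** k for k in range(steps + 1)]


lrs = warmup_lrs()
growth = lrs[-1] / lrs[0]
print(f"start {lrs[0]:.2e}, after 10 steps {lrs[-1]:.2e}, growth x{growth:.2f}")
```

Under that assumption the learning rate climbs from 8e-06 to roughly 8e-05, i.e. about one order of magnitude over the warm-up.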

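The encoder and decoder are optimized separately

At first I thought there was only optimizer_encoder,

optimizer_encoder = torch.optim.Adam(model.parameters(), lr=0.00001, betas=(0.5, 0.999))

but in fact there is also a separate optimizer for the decoder:

optimizer_encoder = torch.optim.Adam(model.parameters(), lr=0.00001, betas=(0.5, 0.999))

(The line pasted here repeats the encoder line; the decoder's own optimizer line was not captured in these notes.)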

About the printed output

Every 10 batches it prints Time, Data, Loss and the value of Loss_recon.

Epoch: [1][10/1492] Time 0.281 (424.272) Data 0.0095 (4.4704) Loss 6.7790 (6.6711) Loss_recon 0.4306 (0.5146)
Epoch: [1][20/1492] Time 0.282 (212.276) Data 0.0096 (2.2401) Loss 6.6996 (6.6891) Loss_recon 0.5033 (0.4998)
Epoch: [1][30/1492] Time 0.277 (141.610) Data 0.0096 (1.4967) Loss 6.5957 (6.6648) Loss_recon 0.5269 (0.4921)
Epoch: [1][40/1492] Time 0.277 (106.276) Data 0.0097 (1.1249) Loss 6.7289 (6.6543) Loss_recon 0.4795 (0.4902)
Epoch: [1][50/1492] Time 0.287 (85.077) Data 0.0106 (0.9019) Loss 6.6434 (6.6531) Loss_recon 0.3934 (0.4831)

Why does it only print up to here?

Epoch: [1][480/1492] Time 0.285 (9.119) Data 0.0097 (0.1031) Loss 6.3775 (6.6790) Loss_recon 0.2461 (0.3105)
Epoch: [1][490/1492] Time 0.280 (8.939) Data 0.0099 (0.1013) Loss 6.1843 (6.6781) Loss_recon 0.2337 (0.3099)
Epoch: [1][500/1492] Time 0.284 (8.766) Data 0.0098 (0.0995) Loss 6.9592 (6.6812) Loss_recon 0.3021 (0.3084)
Epoch: [1][510/1492] Time 0.279 (8.600) Data 0.0098 (0.0977) Loss 6.8927 (6.6811) Loss_recon 0.2927 (0.3065)
Epoch: [1][520/1492] Time 0.286 (8.440) Data 0.0106 (0.0960) Loss 6.8176 (6.6848) Loss_recon 0.3276 (0.3054)
Epoch: [1][530/1492] Time 0.287 (8.286) Data 0.0102 (0.0944) Loss 6.5900 (6.6861) Loss_recon 0.2101 (0.3042)

Training details: ResNet-50 is used to build the SA-Enc. ResNet-50 without the decoder (with ID loss and triplet loss) serves as the Baseline:

This is enough for Re-ID, since it extracts Re-ID features and adds cross entropy on top to form a classifier. In addition, under the triplet loss constraint the model pulls different classes further apart; the trained model is then used to extract features.

Train SAN-PG first

To provide the image pairs, i.e. the person image and its texture image, the authors synthesized a Paired-Image-Texture (PIT) dataset. The paper speaks of in total 9,290 different synthesized (person image, texture image) pairs. That looked wrong to me at first: I found 9290 person images but not 9290 texture images. There are 929 textures, each corresponding to 10 images in the images folder.

images: 9290 person images with different poses, viewpoints, and texture
GT_texture: 929 kinds of texture, each texture corresponds to 10 person images that are stored in the images folder
image-texture-label: saves the corresponding relationship of (person image, texture), which is used for person texture prediction/synthesis training.

To be as close to the real world as possible, the background images used for rendering are randomly sampled from the COCO dataset.

Aren't the texture images 512×512? They are not 256×256.

The SAN-PG is trained on the synthesized PIT dataset (SAN-PG consists of the SA-Enc and SA-Dec, but with only the reconstruction loss).

How is the PIT dataset loaded?
Also, how do I make sure the SAN-PG structure is the one being used?
The U-Net version used a pretrained model directly, with the U-Net network, so let me look at what the unet architecture looks like without pretrained weights.

SAN-PG

SAN-PG encoder

Which network is used?

A ResNet (whether it is ResNet-50 is not yet clear), with a custom $3 \times 3$ convolution conv3x3 and a custom $3 \times 3$ deconvolution deconv3x3. The deconvolution uses Upsample; it works like upsampling, with the number of channels shrinking and the spatial size growing. A basic resnet block is also defined.

def deconv3x3(in_planes, out_planes, stride=1):
    return nn.Sequential(
        nn.Upsample(scale_factor=stride, mode='bilinear'),
        nn.ReflectionPad2d(1),
        nn.Conv2d(in_planes, out_planes,
                  kernel_size=3, stride=1, padding=0)
    )

# Basic resnet block:
# x ---------------- shortcut ---------------x
# \___conv___norm____relu____conv____norm____/
This is how the shortcut is built; it refers to the structure in the figure below.

What is the difference between BasicResBlock and ConvResBlock?

ConvResBlock is the classic residual block with 2 conv layers. Its structure can be described as:

#ResBlock: A classic ResBlock with 2 conv layers and a up/downsample conv layer. (2+1)
#x ---- BasicConvBlock ---- ReLU ---- conv/upconv ----

It seems the network is built from BasicResBlock and ConvResBlock together.

Next is how the encoder is initialized, which involves changing the loss.

model = models.init_model(name=args.arch, num_classes=dm.num_train_pids, loss={'xent', 'htri'})
# How should xent and htri be changed here?
# The REID supervision has to be removed!

The network is the same as the original SAN, with the reID supervisions removed.

Given an input person image, the SAN-PG outputs predicted texture image as the pseudo groundtruth.

這句話，讓我懷疑的建立的一一對應作為監督對么？ SAN-PG的本意不是去利用合成數據然后生成偽紋理圖像么，怎么還拿偽紋理圖像作為監督呢？

怎么初始化？

SAN-PG decoder

用的什么網絡？

用到unet吧？是的，用的U-net做的decoder的architecture.

# For UNet structure:self.embed_layer3 = nn.Sequential(nn.Conv2d(in_channels=1024, out_channels=512,kernel_size=1, stride=1, padding=0),nn.BatchNorm2d(512),nn.ReLU(inplace=True))self.embed_layer2 = nn.Sequential(nn.Conv2d(in_channels=512, out_channels=256,kernel_size=1, stride=1, padding=0),nn.BatchNorm2d(256),nn.ReLU(inplace=True))self.embed_layer1 = nn.Sequential(nn.Conv2d(in_channels=256, out_channels=64,kernel_size=1, stride=1, padding=0),nn.BatchNorm2d(64),nn.ReLU(inplace=True))self.reduce_dim = nn.Sequential(nn.Conv2d(input_channel, input_channel//4, kernel_size=1, stride=1, padding=0), # 2048, 512, 1, 1, 0nn.BatchNorm2d(512),nn.ReLU(inplace=True)) # torch.Size([64, 512, 16, 8]) # 2048//4 = 512

先是定義了 ConvResDecoder，這個 ConvResDecoder被當作用于upsampling的convres block卷積殘差塊.
為啥叫做上采樣呢，感覺應該是從一個向量feature再變成一個圖像，所以叫做上采樣吧。難道這是經過上述u-net后的最后所得么？寬和高是分別是64，512，16是channel，8是啥？torch.Size([64, 512, 16, 8])

# 其中up代表要得到更大的圖像 self.up1 = ConvResBlock(512, 256, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True)) # torch.Size([64, 256, 32, 16]) self.up2 = ConvResBlock(256, 64, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True)) # torch.Size([64, 64, 64, 32]) self.up3 = ConvResBlock(64, 32, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True)) # torch.Size([64, 32, 128, 64]) self.up4 = ConvResBlock(32, 16, direction='up', stride=2, norm_layer=nn.BatchNorm2d, activation_layer=nn.ReLU(inplace=True)) # torch.Size([64, 16, 256, 128])

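What do the torch.Size values in the snippet above mean? What does each dimension of torch.Size([64, 16, 256, 128]) stand for? They follow PyTorch's (batch, channels, height, width) layout: a batch of 64, 16 channels, height 256 and width 128.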
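To make the shape comments on up1 to up4 easier to read, here is a minimal sketch that traces the (batch, channels, height, width) tuple through four stride-2 'up' blocks. It is plain Python, not the repository's ConvResBlock; the function name trace_decoder_shapes is hypothetical, and it assumes each block simply doubles height and width.

```python
def trace_decoder_shapes(batch, channels, height, width, out_channels_per_block):
    """Return the (N, C, H, W) shape after each stride-2 'up' block."""
    shapes = []
    for out_channels in out_channels_per_block:
        # Assumption: every 'up' block doubles both spatial dimensions.
        height, width = height * 2, width * 2
        shapes.append((batch, out_channels, height, width))
    return shapes


# Start from the reduce_dim output noted above: torch.Size([64, 512, 16, 8]).
shapes = trace_decoder_shapes(64, 512, 16, 8, [256, 64, 32, 16])
for name, shape in zip(["up1", "up2", "up3", "up4"], shapes):
    print(name, shape)
```

The last tuple matches the comment on up4, so the decoder output has the same 256 x 128 spatial size as the network input.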

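How is it initialized?

Processing the PIT dataset

How do we link this dataset's images folder with GT_texture?

And how is that correspondence implemented in the program?

What we have now are 00000, 00001, 00002, …, 00009, and each group of ten corresponds to the preceding 10 lines.

We could read 10 lines at a time as a subset and then map the subset's entries [0], [1], [2], … to 0 through 9.

It should work like this: read the txt file line by line, and for each line add 1 to the index in the path of the corresponding image.

How do height and width in args take effect in the program, and where?

My plan is to load PIT in the same way the reID data was organized and loaded, then write a file operation that attaches the texture used for the reconstruction constraint.

Get the PIT images loaded first; the loading still needs the same randomness.

After reading the original authors' code, we know that we don't care how the data is moved from the mat file into the labeled or detected folders. What matters is how the prepared data is fed into the model.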

def _extract_img(name):
    print("Processing {} images (extract and save) ...".format(name))
    meta_data = []
    imgs_dir = self.imgs_detected_dir if name == 'detected' else self.imgs_labeled_dir
    for campid, camp_ref in enumerate(mat[name][0]):
        camp = _deref(camp_ref)
        num_pids = camp.shape[0]
        for pid in range(num_pids):
            img_paths = _process_images(camp[pid,:], campid, pid, imgs_dir)
            assert len(img_paths) > 0, "campid{}-pid{} has no images".format(campid, pid)
            meta_data.append((campid+1, pid+1, img_paths))
        print("- done camera pair {} with {} identities".format(campid+1, num_pids))
    return meta_data

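_extract_img and _process_images both serve to extract the images from the mat file and name them; that is not the part I need to focus on.

When execution reaches self.imgs_detected_dir inside _extract_img, the directory is still empty. The code is only about to extract and save the images, one camera pair at a time.

for pid in range(num_pids):
    img_paths = _process_images(camp[pid,:], campid, pid, imgs_dir)

What is the img_paths returned here? One path or several? Judging from the assert on len(img_paths) in _extract_img, it is a collection of paths for that identity rather than a single path.

In main, pid seems to start from 1.

In my version, pid seems to start from 0.

Delete all the unused code related to query and gallery; what remains is a clean SAN-PG whose model is identical to SAN.

The figure below shows the SAN-PG case.

The figure below shows the main case.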

Different Reconstruction Guidance

SAN-PG obviously reconstructs, but does the reID SAN also reconstruct?

Yes. The addition of a reconstruction subtask helps improve the reID performance, which encourages the encoded feature to preserve more original information. So the original SAN already includes reconstruction.

Table 2 shows that on CUHK03 the authors also reconstruct, and that they tried reconstructing the input image, the pose aligned person image (this pose aligned person image set is generated alongside PIT), and the output of PN-GAN (another method for generating pose aligned person images). None of them worked as well as reconstructing the texture image. All of these experiments use the combination of reconstruction loss, reID loss and triplet loss.

The paper studies the effect of using different reconstruction guidance and reports the results in Table 2. The encoder-decoder network is the same in every case (it is SAN-basic, where "basic" means it is trained with reconstruction loss, reID loss and triplet loss), but there are three reconstruction targets:

reconstructing the input image
reconstructing the pose aligned person image
reconstructing the texture image

Note:

To have pose aligned person image as supervision, during synthesizing the PIT dataset, for each projected person image, we also synthesized a person image of a given fixed pose (frontal pose here). In other words, the pose aligned images are synthesized at the same time as the PIT dataset, following the method in the figure below.

These pose aligned person images are semantically aligned too (texture images are not the only aligned representation), but the program that produces them has not been open-sourced either.

This kind of reconstruction also keeps only part of the texture, and the authors point out that it does incur information loss.

In this case, only partial texture (front body regions) of the full 3D surface texture is retained with the information loss. Note that "information loss" here means information that is lost, not a loss term: a frontal-pose image keeps only the partial texture of the front body regions. Besides their own synthesized pose aligned person images, the authors also used pose aligned person images generated by PN-GAN.

4. Experiment

4.1 Datasets and Evaluation Metrics

4.2 Implementation Details

4.3 Ablation Study

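What kind of network is SAN-basic exactly? It is summarized in the table: ResNet50 + ConvResDecoder.

How many network variants are involved in total? Six.

Does every variant have an encoder and a decoder, with ResNet50 as the encoder? Not every variant has a decoder. Whether a decoder is present can be told from whether there is a decoder-side loss.

Baseline (ResNet-50): does this mean there is no decoder? Correct, there is no decoder.

SA-Enc (ResNet50) with reID loss and triplet loss: right, no decoder (because there is no triplet constraint and no reconstruction). Having a decoder or not can be understood as whether the decoder losses are present, or whether they are 0.

When I trained, did I use only those two losses? No, I also included the triplet constraint and reconstruction. My result matched the baseline, but what I ran was not really the baseline: it had all four losses.

I am sure the latter two losses were enabled during my training!? (This still needs to be double-checked.)

the last spatial down-sample operation in the last Conv layer is removed
This refers to the last Conv layer of ResNet50, whose final spatial down-sampling operation is removed. It has nothing to do with the decoder.

The decoder (SA-Dec) is formed by four residual up-sampling blocks, and the SA-Dec part has only one third of the parameters of SA-Enc (ResNet50).

Nothing here says the baseline involves SA-Dec.

In the paper's own words: We also take it as our baseline (Baseline) with both ID loss and triplet loss.

At the very least, you need to know which losses each training run used. Isn't the loss configuration what this experiment depends on most?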

SAN-basic
trained with the supervision of the pseudo groundtruth texture images (reID data has no scanned real textures, hence "pseudo groundtruth texture images"), with:

reconstruction loss, reID loss, triplet loss

SAN-PG

reconstruction loss only (reID loss = 0, triplet loss = 0)

SAN w/ L_{TR}

reconstruction loss, reID loss, triplet loss, Triplet ReID Constraints

SAN w/ syn. data

Uses the PIT dataset, with the following losses:

reconstruction loss, reID loss, triplet loss

SAN

Uses the PIT dataset and also the Triplet ReID Constraints, so the losses are:

reconstruction loss, reID loss, triplet loss, Triplet ReID Constraints
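Since the variants differ mainly in which loss terms are switched on, a small lookup table makes it harder to lose track of what a given run used. The sketch below is only an illustration of the list above: the dictionary keys, the 0/1 switches and the function total_loss are hypothetical names, and the real code weights the terms with lambda coefficients rather than plain switches.

```python
# 1 means the loss term is enabled for that variant, 0 means it is disabled.
# The switches mirror the list in these notes; real runs use lambda weights.
VARIANTS = {
    "Baseline":         {"reid": 1, "triplet": 1, "recon": 0, "triplet_constraint": 0},
    "SAN-basic":        {"reid": 1, "triplet": 1, "recon": 1, "triplet_constraint": 0},
    "SAN-PG":           {"reid": 0, "triplet": 0, "recon": 1, "triplet_constraint": 0},
    "SAN w/ L_TR":      {"reid": 1, "triplet": 1, "recon": 1, "triplet_constraint": 1},
    "SAN w/ syn. data": {"reid": 1, "triplet": 1, "recon": 1, "triplet_constraint": 0},
    "SAN":              {"reid": 1, "triplet": 1, "recon": 1, "triplet_constraint": 1},
}


def total_loss(variant, loss_values):
    """Sum only the loss terms that the chosen variant enables."""
    switches = VARIANTS[variant]
    return sum(switches[name] * loss_values[name] for name in switches)


# Made-up loss values, used only to show which terms contribute.
example = {"reid": 2.0, "triplet": 3.0, "recon": 0.5, "triplet_constraint": 4.0}
for variant in VARIANTS:
    print(variant, total_loss(variant, example))
```

Logging the chosen row of such a table at the start of every run answers the question "which losses did this run actually use" without having to re-read the training script.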

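Loss settings

For main.py:

We know that:

For the main.py program,

For SAN-PG.py:

While things are still uncertain, try this first. The other option is to set a combination of the $\lambda$ weights.

I need to find out why these images are generated; they really look inexplicable.

I want to know how to make use of the pth.tar file. That should determine which SAN flags need to be switched on.

I also have to confirm that the input and output of my SAN-PG match the SAN-PG described in the paper.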

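How to obtain SAN-PG

Keeping only one loss (the reconstruction loss) gives SAN-PG. Whether it then generates pseudo groundtruth is a later matter. The resulting checkpoint weights are then used to initialize SAN, and SAN is retrained.

First, work out how to apply the pretrained weights of SAN-PG to SAN.

Paper's SAN-PG input: reID person images
Paper's SAN-PG output: at first glance, texture images. But aren't texture images used as supervision? "the SAN-PG outputs predicted texture image as the pseudo groundtruth." The point is probably to obtain the weights and then run the generate_texture.py script, which likely needs its own network code too. Leave that aside for now, since we do already have the SAN-PG weights.
My SAN-PG input: PIT data
My SAN-PG output: the SAN-PG weights, plus the regressed semantically aligned full texture image.

An example of what my 300-epoch training regresses:

The 300-epoch run seems to take more than a day to train.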

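Training log

I checked: args.print_freq is 10, so a line is printed every 10 batches.

[1][10/232]: 1 is the epoch number, 10 is the number of batches processed so far in this epoch (the 10th batch, not a batch of size 10), and 232 is the length of trainloader.

232 is len(trainloader)
batch_time = AverageMeter(): not very important
data_time = AverageMeter(): not very important either
Loss = losses: important. Because I set both lambdas to 0, this value should indeed stay at 0
Loss_recon = losses_recon: this is the loss that really deserves attention

What we can tell:

How the loss changes:

loss_rec = nn.L1Loss()

Creates a criterion that measures the mean absolute error (MAE) between each element in the input x and target y. In other words, it averages the absolute element-wise differences. How exactly is that defined?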
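To answer the question of how the mean absolute error is defined: with the default reduction of nn.L1Loss, it is the sum of |x_i - y_i| over all elements divided by the number of elements. Below is a minimal plain-Python sketch of that definition on flat lists; the function name mean_absolute_error and the sample values are made up for illustration, and the real loss operates on tensors.

```python
def mean_absolute_error(prediction, target):
    """Average of |prediction_i - target_i| over all elements of two flat lists."""
    if len(prediction) != len(target):
        raise ValueError("prediction and target must have the same length")
    total = sum(abs(p - t) for p, t in zip(prediction, target))
    return total / len(prediction)


# Made-up values standing in for a reconstructed texture and its target.
recon = [0.5, -0.5, 1.0, 0.0]
texture = [0.0, -0.5, 0.5, 1.0]
print(mean_absolute_error(recon, texture))
```

Because both tensors are normalized to roughly [-1, 1] in this project, a per-element error cannot exceed about 2 unless the reconstruction leaves that range, which is a useful sanity bound when reading Loss_recon in the log.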

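Loss over the 300-epoch SAN run

Epochs 1 to 10: from 36.3492 to 9.2652

The above is the data for the first 11 epochs. The loss still swings a lot, from a minimum of about 2 to a maximum of 36.

Epochs 11 to 20: from 2.3651 to

On hyperparameter tuning

The infimum of the cross-entropy loss is 0, but it can never actually reach 0.

Reaching 0 would require every predicted distribution to match the true label exactly, i.e. to degenerate into a one-hot vector. In a real neural network, however, the output of the softmax layer cannot make every non-target component exactly 0. Cross entropy therefore never reaches 0, which means it can keep being optimized; this is one of the ways it is better than MSE loss.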
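The claim that cross entropy approaches 0 without reaching it can be checked numerically. The following is a small sketch, assuming a three-class problem with a one-hot target on class 0; the function name softmax_cross_entropy and the chosen logit margins are illustrative only.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross entropy of softmax(logits) against a one-hot target."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


# Increase the target logit while the other two stay at 0.
losses = [softmax_cross_entropy([margin, 0.0, 0.0], 0) for margin in (1.0, 5.0, 10.0)]
print(losses)
```

The loss keeps shrinking as the margin grows but stays positive, because the softmax probabilities of the non-target classes are never exactly zero.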

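Not converging

Failure to converge is a very broad problem with many possible causes:

The learning rate is set too high
Possibly because I removed label smoothing
I also removed the -t flag
As FrankZhu suggested, the images and textures may not be matched up correctly
Possibility 5

Learning rate set too high

A learning rate that is too large causes severe oscillation and prevents convergence; in some cases the loss can even keep growing without bound.

The loss still has not dropped much, so tomorrow I will look at the reconstruction results.

Determining whether the loss is faulty

This is the loss from wht's program with the loss multiplied by 10.

This is the loss from wht's program with the multiplication by 10 removed.

What I need to figure out now is whether cuhk03 actually works in wht's program, or whether -s and -t are simply written there but never used.

Verifying the correspondence

In the original MicroAsia code: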

class ImageDataset(Dataset):
    """Image Person ReID Dataset"""

    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform
        self.totensor = ToTensor()
        self.normalize = Normalize([.5, .5, .5], [.5, .5, .5])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)
        # Add by Xin Jin, for getting texture:
        img_texture = read_image(img_path.replace('images_labeled', 'texture_cuhk03_labeled'))
        if self.transform is not None:
            img = self.transform(img)
            img_texture = self.normalize(self.totensor(img_texture))
        return img, pid, camid, img_path, img_texture

Debugging:
1. Start from batch 0, since the loss is updated per batch. Check whether the reconstruction loss is 0.6; it must be 0.6 at the start. From there, introducing the other losses can push it from 0.6 up to 13, because the reID loss affects it.
2. The baseline is like this: no decoder. Did you train it that way? No, I have not trained it.
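The image-to-texture pairing in ImageDataset relies entirely on replacing one directory name in the path, so a quick offline check of that mapping can rule out the "not matched up" hypothesis. Here is a minimal sketch under that assumption; the helper names texture_path_for and is_matched and the data/cuhk03 prefix are hypothetical, while the two directory names come from the code above.

```python
import os


def texture_path_for(img_path, image_dir="images_labeled", texture_dir="texture_cuhk03_labeled"):
    """Map an image path to its texture path by swapping the directory name."""
    if image_dir not in img_path:
        raise ValueError("unexpected image path: " + img_path)
    return img_path.replace(image_dir, texture_dir)


def is_matched(img_path, texture_path):
    """A pair counts as matched when both files share the same base name."""
    return os.path.basename(img_path) == os.path.basename(texture_path)


img = "data/cuhk03/images_labeled/1_004_1_01.png"  # hypothetical location
tex = texture_path_for(img)
print(tex, is_matched(img, tex))
```

Running the same check over every entry of the dataset list, together with os.path.exists on the texture path, would show whether any image silently lacks its texture.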

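After adding the other losses, was there no effect? I haven't tried; I have mainly been working on SAN-PG.

I don't think I ever ran basic, i.e. setting the remaining one to 0 and keeping the other three.

cena: Which step have you reached?
Me: The second one.
cena: You did no pretraining before, and adding all the losses directly gave no improvement at all, right?
Me: You need to know which one you actually ran.
cena: So earlier you only ran the baseline, without adding any of the later losses?
Me: I haven't tested the effect of adding or removing losses.
cena: Did you run it with all the losses added? The paper adds them, so why didn't you? Is there code for the constraint?
Me: I said I probably hadn't added the constraint (but in fact I had).

Me:

The version I ran: cross entropy (confirmed), triplet (confirmed), constraint, recon.
The version I did not run: the baseline with only the first two.
My result is about the same as the baseline, even though I used four losses.

cena: You ran the baseline too, right? And your baseline is close to the paper's, right?

cena:

Run the baseline first, then see what changes relative to it.
What if adding the losses does help and the baseline is simply low? Good suggestion.
If adding the two gives no gain, the latter two are useless. If adding them gives a big gain, they matter a lot.

The third one wasn't added? None of the later losses were added,

First, establish which losses the earlier training runs used and put them in a table.
Compare the baseline without the two losses against the decoder version with them. No gain means they are useless or something is broken. This amounts to redoing the Table 1 experiment.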

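Program walkthrough

Optimizer:

There are two:

the optimizer for normal training, which runs backpropagation and computes the gradients
the optimizer used during warmup

Learning rate scheduler

There are two:

lr_scheduler.MultiStepLR, which changes the learning rate automatically once training reaches set epochs within the 300 epochs
lr_scheduler.ExponentialLR, which governs how the learning rate changes during the warm_up phase.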
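As a rough picture of what the two schedulers compute, the closed forms can be written in a few lines: MultiStepLR multiplies the base rate by gamma once for every milestone already reached, and ExponentialLR multiplies it by gamma once per epoch. This is a plain-Python sketch, not the PyTorch classes; the milestones, gamma and base rate below are hypothetical and not taken from the repository.

```python
from bisect import bisect_right


def multistep_lr(base_lr, epoch, milestones, gamma):
    """Learning rate after decaying by gamma at each milestone already reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


def exponential_lr(base_lr, epoch, gamma):
    """Learning rate after multiplying by gamma once per epoch."""
    return base_lr * gamma ** epoch


# Hypothetical settings, chosen only to show where the steps happen.
for epoch in (0, 99, 100, 200, 299):
    print(epoch, multistep_lr(0.1, epoch, milestones=[100, 200], gamma=0.1))
```

Printing the schedule like this before a 300-epoch run is a cheap way to confirm that the decay points fall where intended.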

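close_specified_layers(model, ['fc','classifier'])

Before closing, the parts containing fc and classifier look like this:

This shows the structure of the data but not its contents. That doesn't matter; there is no need to see them, and it is certain that the values are no longer updated.

Because of the following code:

My impression is that the two frozen parts stop changing, while the weights of the layers that are not frozen still change. So the two added losses do affect the parameters of the ResNet50 encoder.

Check the length of trainloader, query and so on.

Difference between the epochs at which the two kinds of output are printed and saved

Printed every 50 epochs

if (epoch + 1) % 50 == 0:

Here the epoch values are 49, 99, 149, 199, 249 and 299; only at these epochs is the reconstruction result checked.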

Test

# The key check is (epoch + 1) % args.eval_freq == 0, i.e. which epoch values give a remainder of 0 when epoch + 1 is divided by 80.
# 79, 159, 239, 319 (out of range); don't forget 299.
# Note: (epoch + 1) == 300, i.e. epoch = 299, also triggers.
if (epoch + 1) > args.start_eval and args.eval_freq > 0 and (epoch + 1) % args.eval_freq == 0 or (epoch + 1) == args.max_epoch:

Here the epoch values are 79, 159, 239 and 299.
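The two epoch lists can be reproduced by evaluating both conditions over the whole run. The sketch below assumes max_epoch is 300, eval_freq is 80 and start_eval is 0, as these notes suggest; the function names are hypothetical.

```python
def recon_check_epochs(max_epoch, every=50):
    """Epochs at which (epoch + 1) % every == 0, i.e. the reconstruction check."""
    return [epoch for epoch in range(max_epoch) if (epoch + 1) % every == 0]


def eval_epochs(max_epoch, eval_freq, start_eval=0):
    """Epochs selected by the test condition, including the final epoch."""
    selected = []
    for epoch in range(max_epoch):
        periodic = (epoch + 1) > start_eval and eval_freq > 0 and (epoch + 1) % eval_freq == 0
        if periodic or (epoch + 1) == max_epoch:
            selected.append(epoch)
    return selected


print(recon_check_epochs(300))
print(eval_epochs(300, 80))
```

Epoch 299 is the only value that appears in both lists, which is why the saved file names need to say which loader produced them.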

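Even though both include 299, nothing gets overwritten, because:

# Check the reconstruction result during training
print('finish: ', os.path.join(args.save_dir, img_paths[0].split('images_labeled/')[-1].split('.png')[0]+'_ep_'+str(epoch)+'_trainloader_'+'.jpg'))
cv2.imwrite(os.path.join(args.save_dir,img_paths[0].split('images_labeled/')[-1].split('.png')[0] + '_ep_' + str(epoch) +'_trainloader_'+ '.jpg'), out)
print('finish: ', os.path.join(args.save_dir, img_paths[0].split('images_labeled/')[-1].split('.png')[0]+'_ep_'+str(epoch)+'_queryloader_'+'.jpg'))
cv2.imwrite(os.path.join(args.save_dir, img_paths[0].split('images_labeled/')[-1].split('.png')[0]+'_ep_'+str(epoch)+'_queryloader_'+'.jpg'), out)

Note that even when epoch is 299 in both cases, I added a suffix naming the loader the image came from, so the test output does not overwrite the epoch-299 output of the reconstruction check.

The figure above shows that 299, 239, 159 and 79 are indeed written by the test step and 49 by the texture check. But why is there an output for 0?

And why are the train_loader and query_loader images exactly identical?

One more check: the SAN-PG results

In the SAN-PG code I have commented out the whole test part, so output should only appear at the texture-check epochs 49, 99, 149, 199, 249 and 299.

The figure above shows that, for the same image, the results saved from colon and out differ; the out result is more bluish. One possible cause is channel order: cv2.imwrite expects BGR, so an RGB array saved directly has red and blue swapped. Let's run 300 epochs, move the output over from PG, and take a careful look.

Try deleting the cython folder
Try adding the environment variable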

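Cannot convert to a CUDA variable

Before applying the configuration described in the Git repo's README,

At present the debugger hangs on the "connected" screen. Next, try doing the configuration and see whether it still hangs there.

After reinstalling PyCharm with the Linux apt command, the "connected" problem during train and evaluate disappeared.

However, this step in turn depends on:

configuring with the environment activated
configuring without the environment activated. This does not work; it reports "Cython evaluation is unavailable".

It must be configured with the environment activated.

Configuring without the environment activated:

You can see that eval_metrics_cy.so does appear. Now try again: does it still hang at "connected"?

With the environment activated, run make again. An extra x86_64_linux_gnu.so appears.

The configuration has to be done with the environment activated; otherwise it shows "Cython evaluation is unavailable".

The recloned version that no longer hangs at "connected" already has it

Possible sources of the problem:

PyCharm
the train part of the code itself
the evaluation_only part of the code itself, where the actual arguments changed.
There are too many factors; I would rather set up VS Code.

File contents that worked when running with VS Code

Original program

With the file above, both train and evaluate_only can be debugged. Checked.

Reclone program

With the file above, both train and evaluate_only can be debugged. Checked.

Both also run smoothly in the pytorch conda environment, so the environment should be fine too.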

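Data range before executing out = (out / 2.0 + 0.5) * 255.:

Below is the range after executing it. My impression is:

The values were all around -1, which is odd; they should lie in 0 to 1 or -1 to 1. I suspected out = (out / 2.0 + 0.5) * 255. was wrong and that it should instead look like batch = np.transpose(batch, [0, 2, 3, 1]) * 0.5 + 0.5. However, out / 2.0 + 0.5 and out * 0.5 + 0.5 are the same mapping, so the formula is fine as long as out stays within [-1, 1].

This is the usual way to map -1..1 to -0.5..0.5 and then to 0..1.

Then, after converting the data type to np.uint8:

Why are there 9 prints before the tidy log lines appear?

Because the condition that epoch is a multiple of 1 is satisfied. Every image of every batch then gets saved, so to save disk space use fewer batches with a larger batch size.

That is, when batch_idx is 9, 19, 29, …, i.e. when batch_idx + 1 is a multiple of 10, a line like Epoch:[1][XX/XXX] is printed.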
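The printing rule can be reproduced with a few lines, which also shows how many log lines one epoch yields. This sketch assumes print_freq is 10 and the loader has 232 batches, as noted in the training log; the function name printed_batches is hypothetical.

```python
def printed_batches(num_batches, print_freq=10):
    """Return the 1-based batch counters at which a log line is printed."""
    return [batch_idx + 1 for batch_idx in range(num_batches) if (batch_idx + 1) % print_freq == 0]


steps = printed_batches(232)
print("Epoch: [1][{}/{}]".format(steps[0], 232))
print(len(steps), steps[-1])
```

With 232 batches the last log line of an epoch is for batch 230, so the final two batches of each epoch are never printed.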

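Some parts still need to be changed to None

How the triplet constraint takes effect

With training_batch_size=16:

Execution has not yet reached the point where the triplet constraint is added.

The figure below is after adding the triplet constraint. With training_batch_size=16, i takes the values 0, 4, 8.

After adding the constraint, the reconstruction loss becomes:

OUT is still out of range! Not necessarily: some of the values carry an e-05 exponent.

So this time it is not out of range, because, as the figure below shows, the values are still within (127.49, 127.49).

Even when values really are out of range they can still be converted to integers, but that does not mean there is no problem.

This one really is out of range: it exceeds 1, although it does not go below -1. The safest option is to clip. Without clipping we even see a value of 308.

1.54 maps to 323, which is likewise out of range. After the conversion everything appears to be in 0-255, but in fact many values have been silently mangled, which is why the saved images look wrong.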
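The arithmetic behind the 323 can be checked on single values, together with the effect of clipping first. The sketch below applies the same (out / 2.0 + 0.5) * 255 mapping to plain floats; the function name to_pixel and the clip flag are hypothetical, and the real code works on whole arrays.

```python
def to_pixel(value, clip=True):
    """Map a value from the nominal [-1, 1] range to an integer pixel level."""
    if clip:
        value = max(-1.0, min(1.0, value))  # clamp before scaling
    return int((value / 2.0 + 0.5) * 255.0)


for v in (-1.0, 0.0, 1.0, 1.54):
    print(v, to_pixel(v, clip=False), to_pixel(v))
```

For NumPy arrays the equivalent clamp is np.clip(out, -1.0, 1.0) applied before the scaling, so that every value lands in 0 to 255 before the conversion to np.uint8.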

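If recon_texture is out of range, it is bound to affect training, because of this line:

loss_recon = loss_rec(recon_texture, imgs_texture)

Whenever recon_texture is off, its differences from the normal imgs_texture become abnormally large, which affects the loss and in turn the whole training.

But recon_texture itself comes from:

recon_texture, x_sim1, x_sim2, x_sim3, x_sim4 = model_decoder(feat_texture, x_down1, x_down2, x_down3)

Should I trace the cause further back to feat_texture, x_down1, x_down2 and x_down3?

Leave that for now. First try clipping feat_texture and see.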

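About data size

Whatever the size of the person images in the reID dataset, the program resizes them to 256 x 128 (height x width).

The figure above shows that images of any size are resized to 256 and 128.

Initial loss before clipping = 0.4975:

Initial loss after clipping = 0.4542:

So my clip has no large effect on the loss value, and the loss stays in a sensible numeric range.

At the next print the loss went up, which seems normal. A single reading should not be judged in isolation.

For query the epochs are 79, 159, 239 and 299. 79, 159 and 239 are saved by the test function; 299 could come from either.

Indeed, epoch 299 covers both trainloader and queryloader.

Now look at trainloader: 49, 99, 149, 199, 249, 299. Because of the added suffix, 299 is not overwritten through a name clash.

Get the GitHub release running and see what result it gives. Also track the specific changes with GitHub. Done.
First use his SAN-PG (remember: the dict handling in the part that loads the san_basic weights must be changed). In progress, not finished.
If that fails, multiply the reconstruction loss by 0.0001, to stress that our final focus is reID rather than texture reconstruction.

The checkpoint does indeed contain state_dict and decoder_state_dict.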
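One common way to adapt the dict handling when loading weights from a different training stage is to keep only the entries whose name and shape match the target model. The sketch below shows that filtering idea with plain dictionaries standing in for tensors; the function select_matching_weights, every key name and every shape are hypothetical, and only the two top-level keys state_dict and decoder_state_dict come from these notes.

```python
def select_matching_weights(pretrained, model_shapes):
    """Keep only pretrained entries whose key and shape both match the model."""
    return {
        name: weights
        for name, weights in pretrained.items()
        if name in model_shapes and weights["shape"] == model_shapes[name]
    }


# Hypothetical stand-in for a checkpoint; real entries would be tensors.
checkpoint = {
    "state_dict": {
        "conv1.weight": {"shape": (64, 3, 7, 7)},
        "classifier.weight": {"shape": (767, 2048)},
    },
    "decoder_state_dict": {"up1.conv.weight": {"shape": (256, 512, 3, 3)}},
}
# Hypothetical target model whose classifier has a different number of classes.
encoder_shapes = {"conv1.weight": (64, 3, 7, 7), "classifier.weight": (1367, 2048)}
usable = select_matching_weights(checkpoint["state_dict"], encoder_shapes)
print(sorted(usable))
```

In the actual PyTorch code the shape test would compare tensor sizes, and the filtered dictionary would be merged into the model's own state dict before loading, so that a classifier trained on a different identity count is skipped rather than causing a size mismatch.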

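The figure below is before the update with SAN-PG.

Check whether, after the update, the value changes from 2.248e-01 to 2.1573e-01. Yes, it did change. Below is the state after the update:

Next, check whether using SAN-PG makes the loss start out much lower,

rather than as large as without it.

First, here is a screenshot without SAN-PG.

And one with SAN-PG:

Below is the case without warmup and with a very small batch_size.

Seen this way, the loss is indeed smaller than without SAN-PG. Run one epoch locally first. Then check the test part; if it is fine, upload and train.

This is the result of the small-batch local run. It doesn't seem to show much. Better to finish checking and upload the full version first.

What I need to do now is retrain a SAN-PG and obtain its weights.

Retrain the SAN-PG weights based on wht's code, then load them again and check.
Heavily reworking my own SAN-PG code is on hold for now; it takes too long.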
