Artificial Intelligence | ShowMeAI资讯日报 #2022.06.22
The ShowMeAI Daily series has been fully upgraded! It covers AI tools & frameworks | projects & code | blog posts & talks | data & resources | research & papers, and more. Click the article archive (历史文章列表) to browse past issues, and subscribe to the topic #ShowMeAI资讯日报 inside the official account to receive the latest daily digest. Click 专题合辑&电子月刊 to browse the complete topic collections, or click here and reply with the keyword 日报 to get the AI e-magazine and resource pack for free.
1. Tools & Frameworks
Tool: Unclutter - Immersive Reading Mode, a browser extension that removes distractions from web articles so you can focus on reading
‘Unclutter - Immersive Reading Mode - A reader mode browser extension to remove distractions from web articles.’ by lindylearn
GitHub: https://github.com/lindylearn/unclutter
Library: scikit-opt - a swarm intelligence optimization library written in pure Python
It implements many classic heuristics (differential evolution, genetic algorithm, particle swarm optimization, simulated annealing, ant colony optimization, artificial fish swarm, immune optimization). The library is lightweight, easy to deploy, and supports GPU acceleration; a minimal usage sketch follows the repo link below.
GitHub: https://github.com/guofei9987/scikit-opt
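Below is a minimal sketch of how such a library is typically driven, based on the `sko.GA` interface shown in the project README (the objective function and bounds are made-up toy values, not taken from the repo):

```python
# pip install scikit-opt
import numpy as np
from sko.GA import GA  # genetic algorithm; the library also provides PSO, DE, SA, ant colony, fish swarm, immune optimizers

# Toy objective: minimize the sphere function f(x) = sum(x_i^2)
def sphere(x):
    x = np.asarray(x)
    return float(np.sum(x ** 2))

# 3-dimensional search space, population of 50, 200 generations
ga = GA(func=sphere, n_dim=3, size_pop=50, max_iter=200,
        lb=[-1, -1, -1], ub=[1, 1, 1], precision=1e-7)
best_x, best_y = ga.run()
print("best solution:", best_x, "objective value:", best_y)
```

Swapping in another optimizer (e.g. `sko.PSO.PSO`) keeps the same pattern: construct it with the objective and bounds, call `run()`, then read off the best solution found.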
Tool: Hayabusa - a sigma-based Windows event log analysis tool
It helps security analysts track down security threats quickly.
GitHub: https://github.com/Yamato-Security/hayabusa
Tool: Gifsicle - a GIF editor that runs entirely in the browser
Gifsicle can compress, rotate, and crop GIF images, among other operations.
GitHub: https://github.com/renzhezhilu/gifsicle-wasm-browser
Library: AREkit - a document-level attitude and relation extraction toolkit
‘AREkit - Document level Attitude and Relation Extraction toolkit (AREkit) for mass-media news and analytical articles’ by Nicolay Rusnachenko
GitHub: https://github.com/nicolay-r/AREkit
2. Blog Posts & Talks
Course: 3D Computer Vision, National University of Singapore
《3D Computer Vision | National University of Singapore - YouTube》
Link: https://www.youtube.com/playlist?list=PLxg0CGqViygP47ERvqHw_v7FVnUovJeaz
Blog post: a comprehensive reference for Vim commands, operations, and shortcuts
Link: https://weibo.com/ttarticle/p/show?id=2309404335205144998402
3. Data & Resources
Resource list: the latest papers on deep learning for 3D vision
‘Trending-in-3D-Vision - An on-going paper list on new trends in 3D vision with deep learning’ by Xiaolong
GitHub: https://github.com/dragonlong/Trending-in-3D-Vision
Book: Python Data Science Handbook
A book introducing data science and its applications. It covers: ① the computing environment data scientists rely on: IPython and Jupyter; ② NumPy and scientific computing; ③ Pandas and data processing; ④ Matplotlib and data visualization; ⑤ Scikit-Learn and machine learning. A small end-to-end sketch of this stack is given after the links below.
English original: https://jakevdp.github.io/PythonDataScienceHandbook/
Unofficial Chinese translation: https://github.com/wangyingsm/Python-Data-Science-Handbook
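To give a flavor of the stack the book covers, here is a tiny self-contained sketch combining Pandas, Scikit-Learn, and Matplotlib on synthetic data (this example is ours, not taken from the book):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic dataset: a noisy linear relationship, wrapped in a DataFrame (Pandas)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
df = pd.DataFrame({"x": x, "y": 2.5 * x + rng.normal(0, 2.0, 200)})

# Fit a simple model (Scikit-Learn)
model = LinearRegression().fit(df[["x"]], df["y"])
print("estimated slope:", model.coef_[0], "intercept:", model.intercept_)

# Visualize the data and the fitted line (Matplotlib)
order = df.sort_values("x")
plt.scatter(df["x"], df["y"], s=8, alpha=0.5, label="data")
plt.plot(order["x"], model.predict(order[["x"]]), "r-", label="fit")
plt.xlabel("x"); plt.ylabel("y"); plt.legend()
plt.show()
```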
4. Research & Papers
You can click here and reply with the keyword 日报 to get the curated June paper collection for free.
Paper: Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Title: Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Date: 16 Jun 2022
Field: Speech
Tasks: Speech Synthesis, Text-To-Speech Synthesis
Paper link: https://arxiv.org/abs/2206.07956
Code: https://github.com/daisyqk/automatic-prosody-annotation
Authors: Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu
Summary: Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability.
Abstract: Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. The experimental results on both automatic evaluation and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of automatic prosodic boundary annotations is comparable to human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems that use manual ones.
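To make the text-speech annotation idea concrete, here is a conceptual PyTorch sketch of an annotator that fuses a pre-trained text encoder with a pre-trained audio encoder and predicts a prosodic-boundary label per text token. This is our own illustration under assumed shapes and module names, not the authors' architecture or code:

```python
import torch
import torch.nn as nn

class ProsodyBoundaryAnnotator(nn.Module):
    """Conceptual sketch: fuse pre-trained text and audio encoders,
    then classify a prosodic-boundary label for every text token."""

    def __init__(self, text_encoder, audio_encoder, text_dim, audio_dim, num_labels=4):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. a pre-trained character/phoneme encoder
        self.audio_encoder = audio_encoder    # e.g. a pre-trained speech encoder
        # Cross-attention from text tokens to audio frames (num_heads must divide text_dim)
        self.fuse = nn.MultiheadAttention(embed_dim=text_dim, kdim=audio_dim,
                                          vdim=audio_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(text_dim, num_labels)

    def forward(self, text_tokens, audio_frames):
        t = self.text_encoder(text_tokens)    # (B, T_text, text_dim)
        a = self.audio_encoder(audio_frames)  # (B, T_audio, audio_dim)
        # Each text token attends to the audio frames to pick up acoustic cues
        # (pauses, pitch resets) that signal prosodic boundaries.
        fused, _ = self.fuse(query=t, key=a, value=a)
        return self.classifier(fused)         # per-token boundary logits
```

Fine-tuning such a model on {speech, text, prosody} triplets, as the paper describes, would then amount to a standard per-token cross-entropy over the boundary labels.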
Paper: Level 2 Autonomous Driving on a Single Device: Diving into the Devils of Openpilot
Title: Level 2 Autonomous Driving on a Single Device: Diving into the Devils of Openpilot
Date: 16 Jun 2022
Field: Computer Vision
Tasks: Autonomous Driving
Paper link: https://arxiv.org/abs/2206.08176
Code: https://github.com/openperceptionx/openpilot-deepdive
Authors: Li Chen, Tutian Tang, Zhitian Cai, Yang Li, Penghao Wu, Hongyang Li, Jianping Shi, Junchi Yan, Yu Qiao
Summary: Equipped with a wide span of sensors, predominant autonomous driving solutions are becoming more modular-oriented for safe system design.
Abstract: Equipped with a wide span of sensors, predominant autonomous driving solutions are becoming more modular-oriented for safe system design. Though these sensors have laid a solid foundation, most massive-production solutions up to date still fall into L2 phase. Among these, Comma.ai comes to our sight, claiming one $999 aftermarket device mounted with a single camera and board inside owns the ability to handle L2 scenarios. Together with open-sourced software of the entire system released by Comma.ai, the project is named Openpilot. Is it possible? If so, how is it made possible? With curiosity in mind, we deep-dive into Openpilot and conclude that its key to success is the end-to-end system design instead of a conventional modular framework. The model is briefed as Supercombo, and it can predict the ego vehicle's future trajectory and other road semantics on the fly from monocular input. Unfortunately, the training process and massive amount of data to make all these work are not publicly available. To achieve an intensive investigation, we try to reimplement the training details and test the pipeline on public benchmarks. The refactored network proposed in this work is referred to as OP-Deepdive. For a fair comparison of our version to the original Supercombo, we introduce a dual-model deployment scheme to test the driving performance in the real world. Experimental results on nuScenes, Comma2k19, CARLA, and in-house realistic scenarios verify that a low-cost device can indeed achieve most L2 functionalities and be on par with the original Supercombo model. In this report, we would like to share our latest findings, shed some light on the new perspective of end-to-end autonomous driving from an industrial product-level side, and potentially inspire the community to continue improving the performance. Our code and benchmarks are at https://github.com/OpenPerceptionX/Openpilot-Deepdive
Paper: Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation
Title: Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation
Date: 15 Jun 2022
Field: Computer Vision
Tasks: Contrastive Learning, Denoising, Image Generation, Music Generation
Paper link: https://arxiv.org/abs/2206.07771
Code: https://github.com/l-yezhu/cdcd
Authors: Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan
Summary: To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process.
Abstract: Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route – we enhance input-output connections by maximizing their mutual information using contrastive learning. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process. We formulate CDCD by connecting it with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations with three diverse, multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we achieve state-of-the-art or higher synthesis quality and improve the input-output correspondence. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks, significantly increasing the inference speed.
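The CDCD loss itself operates on discrete diffusion states and is defined in the paper; as a rough, generic illustration of the underlying idea (coupling a denoising objective with an InfoNCE-style contrastive term between condition and output embeddings), one could write something like the PyTorch sketch below. All names, shapes, and the loss weighting are our assumptions, and the continuous MSE denoising term stands in for the paper's discrete formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_conditioning_loss(cond_emb, out_emb, temperature=0.1):
    """InfoNCE-style term: each condition embedding should match its own
    output embedding and repel the other samples in the batch."""
    cond = F.normalize(cond_emb, dim=-1)            # (B, D)
    out = F.normalize(out_emb, dim=-1)              # (B, D)
    logits = cond @ out.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(cond.size(0), device=cond.device)
    return F.cross_entropy(logits, targets)

def training_step(denoiser, cond_encoder, out_encoder, x_noisy, noise, cond, lam=0.1):
    """Combine a standard denoising objective with the contrastive coupling term."""
    pred_noise, hidden = denoiser(x_noisy, cond)    # hypothetical model that also returns features
    denoise_loss = F.mse_loss(pred_noise, noise)    # generic diffusion denoising loss
    contrast_loss = contrastive_conditioning_loss(cond_encoder(cond), out_encoder(hidden))
    return denoise_loss + lam * contrast_loss
```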
Paper: GLIPv2: Unifying Localization and Vision-Language Understanding
Title: GLIPv2: Unifying Localization and Vision-Language Understanding
Date: 12 Jun 2022
Field: Computer Vision, Natural Language Processing
Tasks: Contrastive Learning, Image Captioning, Instance Segmentation, Language Modelling, Masked Language Modeling, Object Detection, Phrase Grounding, Referring Expression Segmentation, Semantic Segmentation, Visual Question Answering (VQA)
Paper link: https://arxiv.org/abs/2206.05836
Code: https://github.com/microsoft/GLIP
Authors: Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao
Summary: We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning).
Abstract: We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and the masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaption performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP
Paper: Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging
Title: Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging
Date: 20 May 2022
Field: Computer Vision
Tasks: Compressive Sensing, Image Reconstruction, Image Restoration
Paper link: https://arxiv.org/abs/2205.10102
Code: https://github.com/caiyuanhao1998/MST
Authors: Yuanhao Cai, Jing Lin, Haoqian Wang, Xin Yuan, Henghui Ding, Yulun Zhang, Radu Timofte, Luc van Gool
Summary: In coded aperture snapshot spectral compressive imaging (CASSI) systems, hyperspectral image (HSI) reconstruction methods are employed to recover the spatial-spectral signal from a compressed measurement.
Abstract: In coded aperture snapshot spectral compressive imaging (CASSI) systems, hyperspectral image (HSI) reconstruction methods are employed to recover the spatial-spectral signal from a compressed measurement. Among these algorithms, deep unfolding methods demonstrate promising performance but suffer from two issues. Firstly, they do not estimate the degradation patterns and ill-posedness degree from the highly related CASSI to guide the iterative learning. Secondly, they are mainly CNN-based, showing limitations in capturing long-range dependencies. In this paper, we propose a principled Degradation-Aware Unfolding Framework (DAUF) that estimates parameters from the compressed image and physical mask, and then uses these parameters to control each iteration. Moreover, we customize a novel Half-Shuffle Transformer (HST) that simultaneously captures local contents and non-local dependencies. By plugging HST into DAUF, we establish the first Transformer-based deep unfolding method, Degradation-Aware Unfolding Half-Shuffle Transformer (DAUHST), for HSI reconstruction. Experiments show that DAUHST significantly surpasses state-of-the-art methods while requiring cheaper computational and memory costs. Code and models will be released at https://github.com/caiyuanhao1998/MST
Paper: HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video
Title: HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video
Date: CVPR 2022
Field: Computer Vision
Paper link: https://arxiv.org/abs/2201.04127
Code: https://github.com/chungyiweng/humannerf
Authors: Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, Ira Kemelmacher-Shlizerman
Summary: Our method optimizes for a volumetric representation of the person in a canonical T-pose, in concert with a motion field that maps the estimated canonical representation to every frame of the video via backward warps.
Abstract: We introduce a free-viewpoint rendering method – HumanNeRF – that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints or even a full 360-degree camera path for that particular frame and body pose. This task is particularly challenging, as it requires synthesizing photorealistic details of the body, as seen from various camera angles that may not exist in the input video, as well as synthesizing fine details such as cloth folds and facial appearance. Our method optimizes for a volumetric representation of the person in a canonical T-pose, in concert with a motion field that maps the estimated canonical representation to every frame of the video via backward warps. The motion field is decomposed into skeletal rigid and non-rigid motions, produced by deep networks. We show significant performance improvements over prior work, and compelling examples of free-viewpoint renderings from monocular video of moving humans in challenging uncontrolled capture scenarios.
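The backward warp mentioned above (from the posed observation space back to the canonical T-pose, split into a skeletal rigid part and a non-rigid correction) can be illustrated with a generic inverse linear-blend-skinning sketch. This is our own simplification with assumed shapes and names, not the HumanNeRF implementation, which additionally learns the blend weights and refines the body pose:

```python
import torch

def backward_warp(x_obs, bone_transforms, skin_weights, nonrigid_offset_fn):
    """Map observation-space sample points back to the canonical T-pose.

    x_obs:              (N, 3) points sampled along camera rays in the posed frame
    bone_transforms:    (J, 4, 4) per-bone rigid transforms, canonical -> observation
    skin_weights:       (N, J) blend weights of each point with respect to each bone
    nonrigid_offset_fn: network predicting a small non-rigid offset in canonical space
    """
    x_h = torch.cat([x_obs, torch.ones_like(x_obs[:, :1])], dim=-1)  # (N, 4) homogeneous
    inv_T = torch.linalg.inv(bone_transforms)                        # observation -> canonical, per bone
    per_bone = torch.einsum('jab,nb->nja', inv_T, x_h)[..., :3]      # (N, J, 3)
    # Skeletal (rigid) part: blend the per-bone inverse transforms with the skin weights
    x_canonical = (skin_weights.unsqueeze(-1) * per_bone).sum(dim=1)
    # Non-rigid part: learned correction for cloth and soft-tissue deformation
    return x_canonical + nonrigid_offset_fn(x_canonical)
```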
Paper: SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
Title: SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
Date: CVPR 2022
Field: Computer Vision
Tasks: Disentanglement, Facial Editing, Image Generation, Transfer Learning
Paper link: https://arxiv.org/abs/2112.02236
Code: https://github.com/seasonSH/SemanticStyleGAN
Authors: Yichun Shi, Xiao Yang, Yangyue Wan, Xiaohui Shen
Summary: When combined with editing methods designed for StyleGANs, it can achieve a more fine-grained control to edit synthesized or real images.
Abstract: Recent studies have shown that StyleGANs provide promising prior models for downstream tasks on image synthesis and editing. However, since the latent codes of StyleGANs are designed to control global styles, it is hard to achieve a fine-grained control over synthesized images. We present SemanticStyleGAN, where a generator is trained to model local semantic parts separately and synthesizes images in a compositional way. The structure and texture of different local parts are controlled by corresponding latent codes. Experimental results demonstrate that our model provides a strong disentanglement between different spatial areas. When combined with editing methods designed for StyleGANs, it can achieve a more fine-grained control to edit synthesized or real images. The model can also be extended to other domains via transfer learning. Thus, as a generic prior model with built-in disentanglement, it could facilitate the development of GAN-based applications and enable more potential downstream tasks.
Paper: 3D-aware Image Synthesis via Learning Structural and Textural Representations
Title: 3D-aware Image Synthesis via Learning Structural and Textural Representations
Date: CVPR 2022
Field: Computer Vision
Tasks: 3D-Aware Image Synthesis, Image Generation
Paper link: https://arxiv.org/abs/2112.10759
Code: https://github.com/genforce/volumegan
Authors: Yinghao Xu, Sida Peng, Ceyuan Yang, Yujun Shen, Bolei Zhou
Summary: The feature field is further accumulated into a 2D feature map as the textural representation, followed by a neural renderer for appearance synthesis.
Abstract: Making generative models 3D-aware bridges the 2D image space and the 3D physical world yet remains challenging. Recent attempts equip a Generative Adversarial Network (GAN) with a Neural Radiance Field (NeRF), which maps 3D coordinates to pixel values, as a 3D prior. However, the implicit function in NeRF has a very local receptive field, making the generator hard to become aware of the global structure. Meanwhile, NeRF is built on volume rendering which can be too costly to produce high-resolution results, increasing the optimization difficulty. To alleviate these two problems, we propose a novel framework, termed as VolumeGAN, for high-fidelity 3D-aware image synthesis, through explicitly learning a structural representation and a textural representation. We first learn a feature volume to represent the underlying structure, which is then converted to a feature field using a NeRF-like model. The feature field is further accumulated into a 2D feature map as the textural representation, followed by a neural renderer for appearance synthesis. Such a design enables independent control of the shape and the appearance. Extensive experiments on a wide range of datasets show that our approach achieves sufficiently higher image quality and better 3D control than the previous methods.
We are ShowMeAI, committed to spreading high-quality AI content and sharing industry solutions, using knowledge to accelerate every step of technical growth! Click the article archive (历史文章列表) to browse past issues, and subscribe to the topic #ShowMeAI资讯日报 inside the official account to receive the latest daily digest. Click 专题合辑&电子月刊 to browse the complete topic collections, or click here and reply with the keyword 日报 to get the AI e-magazine and resource pack for free.
- Author: 韩信子@ShowMeAI
- Article archive (历史文章列表)
- Topic collections & e-magazines (专题合辑&电子月刊)
- Notice: all rights reserved; for reprints, please contact the platform and the author and credit the source
- Replies and likes are welcome; please recommend valuable articles, tools, or suggestions in the comments, and we will respond as soon as we can