5 Essential Papers on AI Training Data
Many data scientists claim that around 80% of their time is spent on data preprocessing, and for good reason; collecting, annotating, and formatting data are crucial tasks in machine learning. This article will help you understand the importance of these tasks, as well as learn methods and tips from other researchers.
Below, we will highlight academic papers from reputable universities and research teams on various training data topics. The topics include the importance of high-quality human annotators, how to create large datasets in a relatively short time, ways to securely handle training data that may include private information, and more.
1. How Important are Human Annotators?
This paper presents a firsthand account of how annotator quality can greatly affect your training data, and in turn, the accuracy of your model. In this sentiment classification project, researchers from the Jožef Stefan Institute analyze a large dataset of sentiment-annotated tweets in multiple languages. Interestingly, the project finds no statistically significant difference in performance between the top classification models; instead, the quality of the human annotators was the larger factor determining the accuracy of the model.
To evaluate their annotators, the team used both inter-annotator agreement and self-agreement. In their research, they found that while self-agreement is a good measure for weeding out poor-performing annotators, inter-annotator agreement can be used to measure the objective difficulty of the task.
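Inter-annotator agreement of the kind the paper relies on is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. Below is a minimal sketch; the labels and the two-annotator setup are illustrative, and the paper's exact agreement measures may differ:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lbl] * cb[lbl] for lbl in ca.keys() | cb.keys()) / n ** 2
    if expected == 1:  # degenerate case: both always assign one label
        return 1.0
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same six tweets:
ann1 = ["pos", "neg", "neu", "pos", "neg", "pos"]
ann2 = ["pos", "neg", "pos", "pos", "neg", "neu"]
print(cohens_kappa(ann1, ann2))
```

Self-agreement is the same statistic computed between one annotator's two passes over the same items: low self-agreement flags an unreliable annotator, while low inter-annotator agreement signals an objectively hard task.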
Research Paper: Multilingual Twitter Sentiment Classification: The Role of Human Annotators
Authors / Contributors: Igor Mozetič, Miha Grčar, Jasmina Smailović (all authors from the Jožef Stefan Institute)
Date Published / Last Updated: May 5, 2016
2. A Survey on Data Collection for Machine Learning
From a research team at the Korea Advanced Institute of Science and Technology (KAIST), this paper is perfect for beginners looking to get a better understanding of the data collection, management, and annotation landscape. Furthermore, the paper introduces and explains the processes of data acquisition, data augmentation, and data generation.
For those new to machine learning, this paper is a great resource for learning many of the techniques commonly used in the field today to create high-quality datasets.
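To give a taste of the data augmentation techniques such surveys cover, here is a minimal text-augmentation sketch that randomly deletes tokens to create noisy training variants. The specific technique and the deletion probability are illustrative choices, not taken from the survey:

```python
import random

random.seed(42)

def random_deletion(sentence, p=0.2):
    """Drop each token with probability p to produce an augmented variant."""
    words = sentence.split()
    kept = [w for w in words if random.random() > p]
    # Never return an empty string; fall back to one random surviving word.
    return " ".join(kept) if kept else random.choice(words)

print(random_deletion("the quick brown fox jumps over the lazy dog"))
```

In practice one would generate several such variants per labeled example, effectively multiplying the size of a small training set.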
Research Paper: A Survey on Data Collection for Machine Learning
Authors / Contributors: Yuji Roh, Geon Heo, Steven Euijong Whang (all authors from KAIST)
Date Published / Last Updated: August 12, 2019
3. Using Weak Supervision to Label Large Volumes of Data
For many machine learning projects, sourcing and annotating large datasets takes up substantial amounts of time. In this paper, researchers from Stanford University propose a system for the automatic creation of datasets through a process called “data programming”.
A table in the paper (not reproduced here) shows precision, recall, and F1 scores for data programming (DP) compared with the distant supervision ITR approach.
The proposed system employs weak supervision strategies to label subsets of the data. The resulting labels will inevitably contain some noise. The team then denoises the training set by modeling the label-generation process as a generative model, and presents ways to modify a loss function so that it is "noise-aware".
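The idea can be sketched with a few heuristic "labeling functions". Note one simplification: this toy example resolves conflicts by majority vote, whereas the paper's data programming approach fits a generative model over the labeling functions' agreements and disagreements. All function names and heuristics below are invented for illustration:

```python
import re

# Heuristic labeling functions for a toy spam-detection task:
# each returns +1 (spam), -1 (ham), or 0 (abstain).
def lf_keywords(text):
    return 1 if re.search(r"free|winner|prize", text, re.I) else 0

def lf_shouting(text):
    letters = [c for c in text if c.isalpha()]
    mostly_caps = letters and sum(c.isupper() for c in letters) / len(letters) > 0.5
    return 1 if mostly_caps else 0

def lf_greeting(text):
    return -1 if text.lower().startswith(("hi", "hello", "dear")) else 0

def weak_label(text, lfs=(lf_keywords, lf_shouting, lf_greeting)):
    """Combine labeling-function votes; here, by simple majority."""
    total = sum(lf(text) for lf in lfs)
    if total > 0:
        return "spam"
    if total < 0:
        return "ham"
    return None  # all functions abstained or cancelled out: leave unlabeled

print(weak_label("FREE PRIZE!! Claim now"))
print(weak_label("Hi team, meeting notes attached"))
```

Writing a handful of such functions takes minutes, yet can produce labels for millions of examples, which is exactly the speedup the paper's title promises.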
Research Paper: Data Programming: Creating Large Training Sets, Quickly
Authors / Contributors: Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré (all authors from Stanford University)
Date Published / Last Updated: January 8, 2017
4. How to Use Semi-supervised Knowledge Transfer to Handle Personally Identifiable Information (PII)
From researchers at Google and Pennsylvania State University, this paper introduces an approach to dealing with sensitive data such as medical histories and private user information. This approach, known as Private Aggregation of Teacher Ensembles (PATE), can be applied to any model and was able to achieve state-of-the-art privacy/utility trade-offs on the MNIST and SVHN datasets.
However, as Data Scientist Alejandro Aristizabal states in his article, one major issue with PATE is that the framework requires the student model to share its data with the teacher models. In this process, privacy is not guaranteed. Therefore, Aristizabal proposes an additional step that adds encryption to the student model’s dataset. You can read about this process in his article, Making PATE Bidirectionally Private, but please make sure you read the original research paper first.
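The core of PATE is a noisy aggregation step: an ensemble of "teacher" models, each trained on a disjoint shard of the private data, votes on labels for the student's unlabeled queries, and Laplace noise is added to the vote counts before the winning label is released. A minimal sketch follows; the teacher count, class count, and epsilon are illustrative, and the paper's full privacy accounting is considerably more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def pate_aggregate(teacher_votes, num_classes, epsilon=1.0):
    # Count each teacher's predicted class, then add Laplace noise so that
    # no single teacher's vote (and hence no single private training
    # example) can be inferred from the released label.
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=1.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))

# 250 teachers, each trained on a disjoint shard of the private data,
# vote on one unlabeled query from the student model.
votes = rng.integers(0, 10, size=250)
student_label = pate_aggregate(votes, num_classes=10)
```

The student model never sees the private data or the raw vote counts, only the noisy labels, which is what makes the framework model-agnostic.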
Research Paper: Semi-Supervised Knowledge Transfer for Deep Learning From Private Training Data
Authors / Contributors: Nicolas Papernot (Pennsylvania State University), Martín Abadi (Google Brain), Úlfar Erlingsson (Google), Ian Goodfellow (Google Brain), Kunal Talwar (Google Brain)
Date Published / Last Updated: March 3, 2017
5. Advanced Data Augmentation for Semi-supervised Learning and Transfer Learning
One of the biggest problems facing data scientists today is access to training data: most deep learning models require large amounts of labeled data to function with a high degree of accuracy. To help address this, researchers from Google and Carnegie Mellon University have developed a framework for training models on substantially less labeled data.
The team proposes using advanced data augmentation methods to efficiently add noise to the unlabeled examples used in semi-supervised learning. The results are striking: on the IMDb text classification benchmark, their method outperformed state-of-the-art models while training on only 20 labeled examples, and on the CIFAR-10 benchmark it outperformed all previous approaches.
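At a high level, the training objective combines a standard supervised loss on the labeled examples with a consistency term that pushes the model's prediction on an augmented unlabeled example toward its prediction on the original. Here is a toy sketch with fixed probability vectors; real UDA computes these terms from model outputs inside a training loop and uses stronger augmentations such as back-translation, and all the numbers below are made up:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete probability vectors."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def uda_loss(p_labeled, y_true, p_unlabeled, p_augmented, lam=1.0):
    # Supervised term: cross-entropy on a labeled example.
    ce = -np.log(p_labeled[y_true] + 1e-8)
    # Consistency term: prediction on the augmented unlabeled example
    # should match the prediction on the original unlabeled example.
    consistency = kl(p_unlabeled, p_augmented)
    return ce + lam * consistency

p_lab = np.array([0.7, 0.2, 0.1])   # model output on a labeled example
p_unl = np.array([0.6, 0.3, 0.1])   # output on an unlabeled example
p_aug = np.array([0.5, 0.35, 0.15]) # output on its augmented version
loss = uda_loss(p_lab, y_true=0, p_unlabeled=p_unl, p_augmented=p_aug)
```

The quality of the augmentation matters because it determines how informative the consistency term is, which is why the paper pairs this objective with state-of-the-art augmentation methods.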
Research Paper: Unsupervised Data Augmentation for Consistency Training
Authors / Contributors: Qizhe Xie (Google Research Brain Team, Carnegie Mellon University), Zihang Dai (Google Research Brain Team, Carnegie Mellon University), Eduard Hovy (Carnegie Mellon University), Minh-Thang Luong (Google Research Brain Team), Quoc V. Le (Google Research Brain Team)
Date Published / Last Updated: September 30, 2019
Hopefully these machine learning papers on training data and data processing helped you learn something new that you can apply to your own projects. For more machine learning articles, view our top stories below, and be sure to follow me on Medium.
Original article reposted with permission.
Translated from: https://medium.com/datadriveninvestor/5-essential-papers-on-ai-training-data-aba8ea359f79