[NN] Synthetic Training Data
Mostly a collection of reference material.
2018
The Ultimate Guide to Synthetic Data: Uses, Benefits & Tools
Benefits
Synthetic data has several benefits over real data:
- Overcoming real data usage restrictions: Real data may have usage constraints due to privacy rules or other regulations. Synthetic data can replicate all important statistical properties of real data without exposing real data, thereby eliminating the issue.
- Creating data to simulate not yet encountered conditions: Where real data does not exist, synthetic data is the only solution.
- Immunity to some common statistical problems: These can include item nonresponse, skip patterns, and other logical constraints.
- Focuses on relationships: Synthetic data aims to preserve the multivariate relationships between variables instead of specific statistics alone.
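The "relationships" point can be illustrated with a minimal sketch: fit a simple joint model to real data, then sample fresh rows from the model instead of releasing the originals. Everything here is an assumption made for illustration: the toy "real" table (age and income), the bivariate Gaussian model, and the helper names `fit_bivariate_gaussian` and `sample_synthetic` are hypothetical and do not come from the guide or from any vendor's tool.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, using the same n - 1 denominator as statistics.stdev.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    # Clamp guards against tiny negative values from floating-point error.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # correlated with x by rho
        rows.append((x, y))
    return rows


rng = random.Random(42)

# Toy stand-in for a sensitive "real" table: (age, income), made up here.
real = []
for _ in range(500):
    age = rng.gauss(40, 10)
    real.append((age, 1000 * age + rng.gauss(0, 8000)))

params = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(params, 20000, rng)

print("real correlation:     ", round(params[4], 3))
print("synthetic correlation:", round(fit_bivariate_gaussian(synthetic)[4], 3))
```

The synthetic rows are new draws rather than copies of real rows, yet the correlation between the two columns stays close to the original. A Gaussian fit only captures linear relationships, and sampling from a fitted model is not by itself a privacy guarantee, so this should be read as an illustration of the idea rather than a production approach.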
Synthetic data tools
The tools related to synthetic data are often developed to meet one of the following needs:
- Test data for software development and similar purposes
- Training data for machine learning models
We prepared a regularly updated, comprehensive sortable/filterable list of leading vendors in synthetic data generation software. Some common vendors that are working in this space include:
NAME | FOUNDED | STATUS | NUMBER OF EMPLOYEES | NOTES |
---|---|---|---|---|
BizDataX | 2005 | Private | 51-200 | |
CA Technologies Datamaker | 1976 | Public | 10,001+ | |
CVEDIA | 2016 | Private | 11-50 | |
Deep Vision Data by Kinetic Vision | 1985 | Private | 51-200 | |
Delphix Test Data Management | 2008 | Private | 501-1000 | |
Genrocket | 2012 | Private | 11-50 | |
Hazy | 2017 | Private | 11-50 | |
Informatica Test Data Management Tool | 1993 | Private | 5,001-10,000 | |
Mostly AI | 2017 | Private | 11-50 | |
Neuromation | 2016 | Private | 11-50 | |
Solix EDMS | 2002 | Private | 201-500 | |
Supervisely | 2017 | Private | 2-10 | Fast annotation only |
TwentyBN | 2015 | Private | 11-50 | 3D simulation |
These tools are just a small representation of a growing market of tools and platforms related to the creation and usage of synthetic data. For the full list, please refer to our comprehensive list.
Synthetic data is a way to enable the processing of sensitive data or to create data for machine learning projects. To learn more about related topics on data, be sure to see our research on data.
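Looking at the market as a whole, the data annotation industry got a relatively late start in the domestic (Chinese) market. Representative companies in the industry include Appen, with a market capitalization of over USD 2.8 billion; AMT, owned by Amazon; Scale AI, valued at USD 1 billion; and Labelbox, which recently closed a USD 25 million Series B round.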
Ref: https://labelbox.com/blog/labelbox-ceo-discusses-breakthroughs-in-ai-training-data
2019
Ref: Synthetic Data for Deep Learning
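The reference list at the end is a great resource!
Similar, but incomplete; it does not appreciate the value of random bg (random backgrounds).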
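For context on the "random bg" remark, here is a minimal sketch of one common reading of it: pasting the same labeled object onto a different random background for every training sample, so that a model cannot rely on background cues. This is an assumed interpretation, not something taken from the referenced work. The function name `composite_on_random_background`, the tiny 4x4 grayscale "image", and the uniform-noise background are all hypothetical choices made for illustration.

```python
import random


def composite_on_random_background(foreground, mask, rng):
    """Paste masked foreground pixels onto a uniform-noise background.

    foreground: 2D list of grayscale values (0-255)
    mask: 2D list of 0/1 flags, 1 where the object is
    """
    height, width = len(foreground), len(foreground[0])
    # Fresh random background for every call.
    image = [[rng.randrange(256) for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                image[r][c] = foreground[r][c]  # object pixels overwrite noise
    return image


rng = random.Random(0)

# Hypothetical 4x4 object: a bright 2x2 square in the centre.
mask = [[1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(4)]
        for r in range(4)]
foreground = [[255 if mask[r][c] else 0 for c in range(4)] for r in range(4)]

# Three training samples: same object and same mask, different backgrounds.
samples = [composite_on_random_background(foreground, mask, rng)
           for _ in range(3)]
```

The mask doubles as the annotation, so each generated image comes with its label for free. Real pipelines would typically use photographs or textures as backgrounds rather than pixel noise.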
End.