Jan 5 -- The Federal Chief Data Officers (CDO) Council publishes this Request for Information (RFI) for the public to provide input on key questions concerning synthetic data generation. Responses to this RFI will inform the CDO Council's work to establish best practices for synthetic data generation. This RFI is intended for Chief Data Officers, data scientists, technologists, data stewards, and subject matter experts in data and evidence building from the public, private, and academic sectors. We will consider comments received by February 5, 2024.
Pursuant to the Foundations for Evidence-Based Policymaking Act of 2018, the CDO Council is charged with establishing best practices for the use, protection, dissemination, and generation of data in the Federal Government. In reviewing existing activities and literature from across the Federal Government, the CDO Council has determined that:
-- the Federal Government would benefit from developing consensus on a more formalized definition of synthetic data generation,
-- synthetic data generation has wide-ranging applications, and
-- there are challenges and limitations with synthetic data generation.
The CDO Council is interested in consolidating feedback and input from qualified experts to gain additional insight and assist with establishing a best practice guide on synthetic data generation. The CDO Council has preliminarily drafted a working definition of synthetic data generation and several key questions to better inform its work.
Information and Key Questions
Section 1: Defining Synthetic Data Generation -- Synthetic data generation is an important part of modern data science work. In the broadest sense, synthetic data generation involves the creation of a new synthetic or artificial dataset using computational methods. Synthetic data generation can be contrasted with real-world data collection. Real-world data collection involves gathering data from a first-hand source, such as through surveys, observations, interviews, forms, and other methods. Synthetic data generation is a broad field that employs varied techniques and can be applied to many different kinds of problems. Data may be fully or partially synthetic. A fully synthetic dataset wholly consists of points created using computational methods, whereas a partially synthetic dataset may involve a mix of first-hand and computationally generated synthetic data. . . .
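The distinction between fully and partially synthetic data can be illustrated with a minimal sketch. It is not drawn from the RFI: the function names, the example records, and the choice of fitting an independent normal distribution to each column are all assumptions made for illustration. Real generators typically also model relationships between columns.

```python
import random
import statistics

def fully_synthetic(real_rows, n, seed=0):
    """Draw n new rows from per-column normal distributions fitted to real_rows.

    Assumption: columns are numeric and treated as independent.
    """
    rng = random.Random(seed)
    columns = list(zip(*real_rows))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]

def partially_synthetic(real_rows, column, seed=0):
    """Keep first-hand values except in one column, which is replaced by synthetic draws."""
    rng = random.Random(seed)
    values = [row[column] for row in real_rows]
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    out = []
    for row in real_rows:
        new_row = list(row)
        new_row[column] = rng.gauss(mu, sigma)  # only this field is generated
        out.append(tuple(new_row))
    return out

# Hypothetical first-hand records: (age, income)
real = [(34, 52000.0), (45, 61000.0), (29, 48000.0), (52, 75000.0)]
full = fully_synthetic(real, n=6)              # every value is generated
partial = partially_synthetic(real, column=1)  # ages are real, incomes are generated
```

In the fully synthetic output no row corresponds to a real record, while the partially synthetic output keeps each real age and replaces only the income field.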
Section 2: Applying Synthetic Data Generation -- Synthetic data generation can enable the creation of larger and more diverse datasets, enhance model performance, and protect individual privacy. The CDO Council's review of potential applications of synthetic data generation found examples in: . . . .
Section 3: Challenges and Limitations in Synthetic Data Generation -- The CDO Council recognizes that synthetic data generation can be a valuable technique. However, the technique has challenges and limitations. For example, it can be difficult to generate data that realistically simulates the real world and the diversity of real data. Additionally, evaluating the quality of a synthetic dataset may be extremely challenging.
Synthetic data generation is also subject to challenges common to statistical methods in general, such as overfitting and imbalances in the source data. These challenges reduce the utility of the generated synthetic data because the data may not be properly representative and may, for example, fail to represent rare classes. . . .
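One simple way to check for the rare-class problem described above is to compare class shares in the source and synthetic data. The sketch below is an illustrative assumption, not a method proposed in the RFI: the function names, the example labels, and the 0.5 tolerance are hypothetical.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the total number of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def underrepresented_classes(source_labels, synthetic_labels, tolerance=0.5):
    """Return classes whose synthetic share is below tolerance * source share.

    Assumption: tolerance is an arbitrary illustrative threshold.
    """
    source = class_shares(source_labels)
    synthetic = class_shares(synthetic_labels)
    return sorted(
        label for label, share in source.items()
        if synthetic.get(label, 0.0) < tolerance * share
    )

# Hypothetical labels: the generator dropped the rare class entirely.
source_labels = ["common"] * 95 + ["rare"] * 5
synthetic_labels = ["common"] * 100
flagged = underrepresented_classes(source_labels, synthetic_labels)
```

Here `flagged` contains only the `"rare"` class, since it makes up 5 percent of the source labels but never appears in the synthetic labels. A check like this covers only one aspect of representativeness and does not by itself establish overall dataset quality.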
Section 4: Ethics and Equity Considerations in Synthetic Data Generation -- Synthetic data generation techniques hold great promise, but also introduce questions of ethics and equity. Consistent with Federal privacy practices, any data generation technique involving individuals must respect their privacy rights and obtain informed consent before using real-world data to generate synthetic data. As noted in Section 3, synthetic data generation is also subject to challenges commonly facing any statistical methods and has the potential to introduce and encode errors or bias, potentially leading to discriminatory outcomes.
Uses of generated synthetic data must also be carefully considered. The context and quality of the generated synthetic data will affect its practical utility and impact. Assessing and understanding the fitness of a generated synthetic dataset is essential. For instance, a generated synthetic dataset may not sufficiently represent the diversity of the source dataset. In addition, a generated synthetic dataset may not contain sufficient variables to fully represent the system and the drivers of differences in the phenomenon it is meant to represent. . . . .
Section 5: Synthetic Data Generation and Evidence-Building -- Synthetic data generation can enable the production of evidence for use in policymaking. Applications such as simulation or modeling can help policymakers explore scenarios and their potential impacts. Likewise, policymakers can conduct controlled experiments of potential policy interventions to better understand their impacts. Data synthesis may help policymakers make more data publicly available to spur research and other foundational fact-finding activities that can inform policymaking. . . . .
FRN:
https://www.federalregister.gov/d/2024-00036