Author: DING Daoqin
Abstract: Data collection and processing in the training phase of generative artificial intelligence (AI) face numerous legal issues, which have given rise to a variety of lawsuits worldwide. Data protection issues in the training phase are concentrated in the pre-training and model fine-tuning segments, and involve the legality of data sources, data quality management, the improper capture and utilization of public data, inadequate protection of personal data rights, and unlawful bias and discrimination. As for the choice of data governance paths in the training phase, typical countries and regions present different characteristics: the EU adopts a classification-based, hierarchical, and entity-specific approach and focuses on the transparency of training data; the U.S. takes a positive attitude toward the utilization of publicly available personal information and is exploring exemptions for public data collection; the U.K. proposes a three-step test as the assessment criterion for legitimate interests; and Singapore has created business improvement and scientific research exceptions for data processing. Generative AI is still evolving. To solve the data law problems of the training stage, at the macro level China needs to maintain a balance between AI industry development and safety regulation, promote the legalization of industry promotion policies, adhere to a legislative orientation of inclusive, prudent, and classified-and-graded regulation, and establish experimental regulatory systems, such as the regulatory sandbox, suited to the development stage of China's AI industry. In terms of the construction of specific data rules, it is necessary to distinguish between the research and development training and commercial provision phases, establish a safe harbor system, introduce a reasonable data use system with exceptions for scientific research and business improvement, further refine the rules for public data utilization, strengthen data quality management, unify data anonymization standards, and create new rights and rules for handling data in machine learning scenarios, so as to reasonably construct a data governance system for China's generative AI training data.
The rise of generative artificial intelligence, represented by ChatGPT, is increasingly changing the way people work and live, and is gradually becoming the information infrastructure of the digital era. The rapid development of general-purpose large language model technology and industry is inseparable from the combined contribution of key factors such as computing power, algorithms, and data. From the perspective of technological development, the progress of generative artificial intelligence (hereinafter referred to as "generative AI") is constrained by improvements in computing power in the medium and long term, and by high-quality data in the short term. In a sense, high-quality data has a decisive impact on the development of the generative AI industry: if data quality is poor, then even rapidly improved computing power cannot prevent it from directly degrading the performance of generative AI systems. High-quality data is therefore of vital importance to generative AI.
The collection and processing of data in the training phase of generative AI face many legal issues, such as data issues, copyright issues, and competition issues, and have even triggered various lawsuits around the world. In 2023, more than a dozen judicial lawsuits over the training of large AI models were filed in the U.S. For example, sixteen anonymous individuals filed a class action against OpenAI and Microsoft, bringing fifteen counts, including violations of the Electronic Communications Privacy Act, the Computer Fraud and Abuse Act, the California Invasion of Privacy Act (CIPA), California's Unfair Competition Law and Business and Professions Code, the Biometric Information Privacy Act, the Illinois Consumer Fraud and Deceptive Business Practices Act, and the New York General Business Law, as well as gross negligence, invasion of privacy, intrusion upon seclusion, theft/receipt of stolen property, misappropriation, unjust enrichment, and failure to warn. The plaintiffs allege that the defendants stole personal information by scraping data across the Internet, all in secret and without notice or consent, to build AI products, and then profited by selling access to those products.
Data legal protection plays a pivotal role in training data compliance for generative AI. In view of this, Article 7 of the Interim Measures for the Administration of Generative Artificial Intelligence Services, jointly issued by the Cyberspace Administration of China and six other departments, and the Basic Requirements for the Security of Generative Artificial Intelligence Services, issued by the National Information Security Standardization Technical Committee, explicitly set out requirements for the security of corpus sources and corpus content, among others. Neither the EU General Data Protection Regulation (GDPR) nor the Personal Information Protection Law of the People's Republic of China (hereinafter the "Personal Information Protection Law") contemplated general-purpose large model scenarios such as generative AI when they were formulated. What challenges generative AI training data scenarios pose to the GDPR and to the Personal Information Protection Law and its supporting regulations and standards, what data law issues exactly need to be resolved, and how the rules should be improved to address them are questions in urgent need of study. The following sections discuss the fundamentals of generative AI training data, focus on the data legal issues on the input side of the training phase, and then put forward suggestions for improvement.
1. Data Legal Issues in the Training Phase of Generative Artificial Intelligence
At present, there is no uniform, standardized definition of AI-generated content (AIGC). Article 22 of China's Interim Measures for the Administration of Generative Artificial Intelligence Services defines "generative artificial intelligence technology" by the form of content generated and the manner of provision, as "models and related technologies that have the ability to generate content such as text, pictures, audio, and video," including the provision of generative AI services through programmable interfaces and other means. That is, generative AI is a type of AI that can generate new content, such as text, images, and audio/video, by learning models from pre-existing data; it encompasses various technologies and techniques of artificial intelligence and machine learning. Overall, the generative AI industry chain can be divided into a three-layer architecture: the computing power base layer, the algorithmic model layer, and the vertical application layer. From the industry chain perspective, the generative AI training data process mainly consists of three links: pre-training, model (instruction) fine-tuning, and capability access and application. Pre-trained models were first born in the field of computer vision and achieved good results there. Pre-training refers to the process of training a model in advance by providing it with data to learn from; the data provided is often referred to as the training dataset. Generative AI models, especially large language models, i.e., natural language processing models with a very large number of parameters (currently on the order of hundreds of billions) that use large-scale corpora for self-supervised learning during pre-training, require a massive amount of data for training. Pre-training is unsupervised learning on large amounts of data that allows the network to learn generic feature representations; model fine-tuning is the process of training a previously trained model on new data or otherwise adapting an existing model, i.e., using a task-specific dataset to retrain a model that has already been pre-trained in order to improve its performance on that task. The fine-tuning process is in effect a second stage of training the model on a specific downstream task; its purpose is to adapt the pre-trained model so that it performs better on that task.
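To make the two stages concrete, here is a minimal, schematic sketch in Python (PyTorch) of unsupervised pre-training followed by task-specific fine-tuning. The tiny network, the reconstruction objective, and the randomly generated stand-in data are all illustrative assumptions; real large language models pre-train on massive text corpora with far larger parameter counts.

```python
# A minimal sketch of the two-stage process described above: unsupervised
# pre-training on unlabeled data, then supervised fine-tuning on a small
# task-specific dataset. All data here is randomly generated for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared "backbone" whose generic feature representations pre-training teaches.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

# Stage 1: pre-training. A simple autoencoder-style reconstruction objective
# stands in for large-scale self-supervised learning on a corpus.
decoder = nn.Linear(16, 32)
pretrain_opt = torch.optim.Adam(
    list(backbone.parameters()) + list(decoder.parameters()), lr=1e-3)
unlabeled = torch.randn(512, 32)          # stand-in for a large unlabeled corpus
for _ in range(50):
    recon = decoder(backbone(unlabeled))
    loss = nn.functional.mse_loss(recon, unlabeled)
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()

# Stage 2: fine-tuning. A new task head is attached to the pre-trained
# backbone and both are trained on a small labeled downstream dataset.
head = nn.Linear(16, 2)
finetune_opt = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
x_task = torch.randn(64, 32)              # small task-specific dataset
y_task = torch.randint(0, 2, (64,))
for _ in range(50):
    logits = head(backbone(x_task))
    loss = nn.functional.cross_entropy(logits, y_task)
    finetune_opt.zero_grad(); loss.backward(); finetune_opt.step()
```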
From the perspective of industry chain development, the generative AI industry can be roughly divided into the stages of research and development, deployment, and application. Training data is an important R&D link: ChatGPT's training data has an important impact on the performance and quality of the model. In general, using more and higher-quality training data can improve the performance and accuracy of the model, and the diversity of training data likewise has an important impact on the model's performance and generalization ability. Article 3 of the EU Artificial Intelligence Act defines "training data," "validation data," "test data," and "input data" respectively: "training data" means data used to train an AI system by fitting its learnable parameters, and "input data" means data provided to or directly acquired by an AI system, on the basis of which the system produces an output. Article 23 of China's Provisions on the Administration of Deep Synthesis of Internet-Based Information Services refers to "labeled or benchmark datasets that are used to train machine learning models." Overall, generative AI technology has room to play in digital-content-related fields across industries, and its industry chain involves data, algorithms, computing power, scenarios, and other elements. Taken together, the legal issues in the training phase of generative AI cluster around the fundamental conflict between the massive training data required by AI on the one hand, and the protection of personal information and data and the exclusivity of copyright on the other. In terms of data legal issues alone, problems exist at both the macro and micro levels. At the macro level, technology and law are mismatched: it is difficult to balance companies' commercial interests against the goal of personal data protection (a public interest), to balance technological innovation against consumer interests, and to balance regulation against technological development; and although the law distinguishes between categories of data, data scraping and data training cannot identify and distinguish those categories, while AI companies often do not understand the specifics of model training. At the micro level, there are problems such as the lack of a legality basis, or an unclear legality basis, for the collection of personal data by AI companies. On that basis, large model companies may share data with third parties; if the subsequent use of the data differs from the purpose claimed at initial collection, it may constitute a violation of the law. This paper focuses on the legal issues of data collection and processing involved in each part of the large model training phase at the micro level. Machine learning can be subdivided into eight steps: problem definition, data collection, data cleansing, summary statistics review, data partitioning, model selection, model training, and model deployment. For legal research, these can be divided into processing the data (the first seven steps) and running the model, as the sketch below illustrates.
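As a rough illustration of how the eight steps map onto the "processing the data" versus "running the model" division, the following is a minimal sketch in Python using scikit-learn and synthetic stand-in data; the feature count, model choice, and split ratio are illustrative assumptions only.

```python
# A minimal sketch of the eight machine-learning steps enumerated above; the
# first seven constitute "processing the data", the last "running the model".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1. Problem definition: binary classification from 5 numeric features.
# 2. Data collection: here, randomly generated stand-in data.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 3. Data cleansing: drop rows with missing values (none here, shown for form).
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# 4. Summary statistics review: inspect distributions before modeling.
print("feature means:", X.mean(axis=0).round(2))

# 5. Data partitioning: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 6. Model selection: a simple, interpretable baseline.
model = LogisticRegression()

# 7. Model training.
model.fit(X_train, y_train)

# 8. Model deployment / running the model: score new inputs.
print("test accuracy:", model.score(X_test, y_test))
```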
The data protection issues in the training phase of generative AI are concentrated in the pre-training and model fine-tuning segments, which involve data collection and training sets, whether by collecting third-party data, using one's own data, or seeking copyright licensing cooperation so that new content can be generated through autonomous learning.
1.1 Problems of Data Source Legality
Artificial intelligence technologies face many legal issues throughout the development process, the most prominent of which is the legality of data sources. A good dataset must satisfy four basic criteria: it must be large enough to cover multiple iterations of the problem; the data must be clearly labeled and annotated; the data must be representative and unbiased; and the data must comply with privacy regulations. The legality of data sources is the cornerstone of training data compliance, and most model capability comes from pre-training, which depends on large amounts of high-quality data. Therefore, Article 7 of China's Interim Measures for the Administration of Generative Artificial Intelligence Services explicitly requires that generative AI service providers shall, in accordance with the law, carry out pre-training, optimization training, and other training data processing activities, and use data and foundation models from legitimate sources.
From the perspective of industry practice, the data sources of generative AI mainly include proprietary data, open source datasets, outsourced data, automated data collection, and synthetic data. Problems with the legality of data sources may arise in many forms: processing personal information without authorized consent or beyond the scope of authorization; acquiring datasets by illegal means; violating the license agreements of open source datasets; illegally acquiring computer information system data by taking intrusive measures or increasing the burden on the crawled party's servers; collecting data by unlawful means such as violating the Robots protocol or circumventing anti-scraping measures; and automatically collecting data that contains copyright-protected content.
1.2 Data Quality Management Issues
Training data quality requirements reflect the rationality of legal norms intervening in technical activities. Data quality and discrimination/bias are two sides of the same coin: if training data lacks diversity, it easily leads to data discrimination and bias, and if the accuracy of training data is low, the quality of model training is difficult to guarantee. For example, if a dataset contains inaccurate or unreliable information, such as illegal and undesirable content related to pornography, politics, or gambling, sensitive personal information, false or exaggerated propaganda, or absolute claims, data quality cannot be guaranteed and model training will easily be biased. Data quality risks are a core issue of machine learning with a direct impact on supervised learning techniques: the objectivity, timeliness, and representativeness of the data play an important role in model prediction, and objectively incorrect training data leads to incorrect model predictions. Companies relying on incorrect data may be required to compensate those harmed by its use, potentially even triggering punitive damages. At the same time, data quality is not limited to objective correctness, but also includes the timeliness and representativeness of the data. It is therefore often necessary to establish legally actionable quality standards for training data. With this in mind, the French data protection supervisory authority (CNIL) requires data controllers to assess a series of questions: Has the accuracy of the data been verified, from the raw data through to a quality training dataset? If annotation methods were used, were they checked? Does the data used represent the data observed in a real-world environment, which methods were used to ensure this representativeness, and has a formal study of it been conducted? For AI systems that use continuous learning, which mechanisms should be implemented to ensure the quality of the data used on an ongoing basis? Do regular mechanisms exist to assess the risks associated with loss of data quality or changes in data distribution?
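Several of CNIL's questions lend themselves to automated checks. Below is a minimal sketch in Python (pandas) of such a check routine; the column names ("group", "collected_at"), the reference shares, and the one-year staleness threshold are illustrative assumptions, not CNIL-prescribed values.

```python
# A minimal sketch of automated checks against a training set, loosely
# operationalizing the quality questions above (accuracy proxies,
# representativeness, timeliness). Column names and thresholds are assumed.
import pandas as pd

def quality_report(df: pd.DataFrame, reference_share: dict,
                   max_age_days: int = 365) -> dict:
    """Return simple quality indicators for a labeled training dataframe."""
    report = {}
    # Accuracy proxies: share of rows with missing values or duplicates.
    report["missing_ratio"] = float(df.isna().any(axis=1).mean())
    report["duplicate_ratio"] = float(df.duplicated().mean())
    # Representativeness: compare group shares with those observed in the
    # real-world environment the system is meant to operate in.
    observed = df["group"].value_counts(normalize=True)
    report["group_share_gap"] = {
        g: round(abs(float(observed.get(g, 0.0)) - share), 3)
        for g, share in reference_share.items()
    }
    # Timeliness: fraction of records older than the allowed age.
    age = pd.Timestamp.now() - pd.to_datetime(df["collected_at"])
    report["stale_ratio"] = float((age.dt.days > max_age_days).mean())
    return report

# Usage with a tiny illustrative dataset.
df = pd.DataFrame({
    "feature": [1.0, 2.0, None, 4.0],
    "group": ["a", "a", "b", "a"],
    "collected_at": ["2024-01-01", "2023-01-01", "2024-06-01", "2021-01-01"],
})
print(quality_report(df, reference_share={"a": 0.5, "b": 0.5}))
```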
1.3 Problems of Improperly Capturing and Utilizing Public Data
The improper capture and utilization of public data is a legal risk in the use of training data: many training datasets come from public channels that are filled with improperly licensed data, which easily gives rise to disputes over the fair use of public data. In general, the training data for ChatGPT's large model mainly comes from textual datasets on the Internet, a very large portion of which comes from public domain content and open data. According to media reports, OpenAI used at least five distinct datasets for training: first, the Common Crawl database, a dataset formed through large-scale webpage crawling and owned by a nonprofit organization of the same name, which has been indexing and storing web pages for over ten years at a rate of nearly 3 billion pages archived every month; second, WebText2, OpenAI's dedicated corpus, built by crawling every web page linked from the social media site Reddit and fed into training the large language model; third, Books1; fourth, Books2; and fifth, Wikipedia. These datasets are very large and must be collected and organized by crawlers and other means; when the data is organized, it must be cleaned and screened to ensure its quality and availability.
Europe and the United States take very different stances on publicly available data. The European Union holds that individuals retain rights after their data is made public and adopts a strict protection model for publicly available personal data: personal data is strictly protected, the rights of the individual are respected, publicly available personal data must not be processed without the individual's knowledge or consent, and the duty to inform must be fulfilled for personal data not obtained from the data subject. The United Kingdom requires a lawful basis for obtaining personal data from publicly accessible sources, with notification to the individual, and notification and assessment where data processing exceeds the individual's expectations. France requires that a third party scraping publicly available personal data obtain the user's consent. The United States, by contrast, has adopted an exception model, under which publicly available personal data is treated as an exception to personal data protection and can be processed without obtaining the individual's consent. Draft legislation at the federal level and state legislation in the U.S. take a consistent position in excluding publicly available information from the definition of personal information. Some statutes provide that as long as the relevant party has a reasonable basis to believe that publicly available personal information was lawfully made available to the public, it is publicly available information that is not protected; others delineate the scope of "publicly available information" through positive and negative enumeration, with the positive scope being broader or narrower. For example, the California Consumer Privacy Act does not prohibit data scraping because: (1) the companies that scrape the data do not scrape it directly from the user, but from the public domain; (2) the user decides on his or her own to make the information public; and (3) there is no good technical solution for implementing a notification program.
1.4 Lack of Protection of Personal Data Rights
In light of public concerns about the protection of personal data rights in relation to large model training data, on May 16, 2023, Sam Altman, CEO and co-founder of OpenAI, stated at a U.S. Senate Judiciary Subcommittee hearing in Washington, D.C., that OpenAI does not use any user data for advertising, promoting OpenAI's services, or selling data to third parties to create profiles of people; OpenAI may use ChatGPT conversations to help improve its models, but it offers users several ways to control how their conversations are used. Any ChatGPT user can choose not to have their conversations used to improve OpenAI's models; users can delete their account, remove specific conversations from the history sidebar, and disable their chat history at any time. And while some of the information OpenAI uses to train its models may include personal information from the public Internet, OpenAI endeavors to remove personal information from its training dataset where feasible.
However, the ChatGPT model is trained on data collected from a variety of sources across the Internet, and while the exact sources are not known, the sheer volume makes it virtually impossible to identify and inform the individuals concerned of the relevant processing or to obtain declarations regarding the processing of their personal data. Training datasets are subject to third-party rights; how can the processing and use of the data be licensed by the rights holders? In practice, it is simply assumed that personal data found on the Internet may be processed through the models. This effectively nullifies the data subject's right to information under Article 13 of the GDPR. There is a fundamental mismatch between data-consuming models such as ChatGPT and the protection of individuals under data protection law, and this also means that the data subject's other rights, such as the right to rectification (Article 16 GDPR) or the right to erasure (Article 17 GDPR), remain on paper and cannot be enforced. The collective harm caused by the nearly unlimited scraping of personal data from the Internet transcends the individual dimension: in the case of predictive models built on the collective database of millions of users, users not only have no control over the models but are also unable to make use of their own data. One open question for ChatGPT is whether it can comply with the right to be forgotten under Article 17 of the GDPR, which requires the complete removal of personal data from the model at an individual's request. The difficulty for generative AI in implementing the right to be forgotten is that the data created by these systems is persistent: natural language processing generates responses based on the data collected, making it nearly impossible to remove all traces of personal information. It is uncertain whether ChatGPT or other generative AI models will be able to comply with the right to be forgotten under Article 17 of the GDPR. In addition, there is a fundamental conflict between the right to privacy and an individual's right to demand an explanation when affected by automated decision-making. Machine learning is a data-driven model-fitting process based on large datasets; when data subjects demand accurate and truthful explanations of automated decisions, the training data must be viewed (rather than anonymized or partial), thereby infringing the privacy rights of the subjects from whom the training data originated.
1.5 Issues of Unlawful Bias and Discrimination
Training data is a major source of algorithmic discrimination. Real-world cases in fields such as facial recognition, AI recruitment, and personalized advertising illustrate this point. If the data quality for a specific protected group is generally adversely affected, the risk of discrimination partially correlates with data quality risk or may even be a consequence of it. In the operation of ChatGPT’s algorithmic model, the “machine learning + human annotation” technique serves as the core of the algorithm, fundamentally supporting the goal of generative AI. This combination aims to enhance ChatGPT’s intelligence and accuracy, but it simultaneously increases the legal risks associated with algorithmic bias. The joint use of machine learning and human annotation amplifies the impact of human will and preference compared to previous pure machine learning frameworks, as the influence of personal preferences from human annotation layers atop the inherent biases in machine learning algorithms, resulting in a compounded negative effect from algorithmic bias. Consequently, the sources of algorithmic bias become more varied and difficult to trace and prevent. The U.S. Consumer Financial Protection Bureau (CFPB), the U.S. Department of Justice (DOJ), the Equal Employment Opportunity Commission (EEOC), and the Federal Trade Commission (FTC) issued a joint statement on law enforcement efforts opposing discrimination and bias in automated systems, suggesting that such systems may involve unlawful discrimination in violation of federal law. Many automated systems analyze vast data to identify patterns or correlations, which are then applied to new data to perform tasks, make recommendations, or predict outcomes. While these tools are in operation, they may yield discriminatory outcomes that violate the law, often stemming from biases in data and datasets, models, design, or use. Insufficiently representative or imbalanced datasets that include historically discriminatory or otherwise erroneous data may contribute to discrimination.
2. Data Governance Pathways for the Training Phase of Generative AI
In terms of data governance pathways for the training phase of generative AI, typical countries and regions like the EU and the United States are exploring different approaches. The EU’s Artificial Intelligence Act adopts a classification-based, hierarchical approach that emphasizes transparency in training data. The United States, in contrast, favors a pragmatic attitude prioritizing industry development, with data governance pathways largely relying on industry and corporate self-regulation. The U.S. takes an active stance on the use of publicly available personal information and is also exploring exemptions for the collection of public data.
2.1 Classification-Based, Hierarchical, and Entity-Specific Approach
For generative AI, the EU has pioneered a classification-based, hierarchical approach tailored to specific entities, regulating high-risk AI systems primarily from perspectives of transparency, purpose, proportionality, and anti-discrimination.
First, the EU Artificial Intelligence Act mandates that providers of high-risk AI systems document the entire data processing lifecycle. Providers of foundational models are required to process only governed data, scrutinize data sources, and disclose the use of copyrighted content within generative AI. Article 10, “Data and Data Governance,” explicitly requires that data governance be applied to high-risk AI systems utilizing data for model training. These systems must be built on training, validation, and test datasets that meet a set of quality standards, which include but are not limited to the following: (1) Training data should be managed in alignment with the intended purpose of the AI system, with the data collection and processing processes governed by principles of transparency, purpose, and proportionality. This management should include: transparency regarding the initial purpose of data collection; data preparation and processing operations (such as annotation, labeling, cleaning, updating, augmentation, and aggregation); and an assessment of the dataset’s availability, volume, and suitability. (2) To prevent discrimination, training datasets should be relevant, sufficiently representative, and subject to appropriate error review, aiming for as much completeness as possible given the intended purpose. (3) Datasets should be tailored to the specific geographic, situational, behavioral, or functional context in which the high-risk AI system is expected to operate or might reasonably be misused, covering all necessary characteristics or factors aligned with the system’s intended purpose or potential misuse.
Second, Article 17, “Quality Management Systems,” requires high-risk AI providers to implement quality management systems documenting the entire data processing lifecycle and all pre- and post-market activities. This system includes comprehensive data management procedures for data acquisition, collection, analysis, tagging, storage, filtering, mining, aggregation, retention, and any other data-related operations performed before and during the deployment of the high-risk AI system.
Finally, Article 28(b) defines the obligations for foundational model providers. Before placing such models on the market or using them, providers must ensure compliance with the requirements of this article. This includes the use of datasets processed under robust foundational model governance measures, particularly by verifying the appropriateness of data sources, identifying potential biases, and implementing suitable mitigation measures. Generative AI providers must also publicly disclose a summary of the copyrighted data used for training.
2.2 Emphasis on Training Data Transparency
To address compliance issues concerning data sources, protection of personal data rights, data quality, and risks of unlawful discrimination and bias in the training phase of generative AI, the EU emphasizes transparency in training data to improve the visibility of data processing for data subjects during training. For instance, the European Commission’s Guidelines on Artificial Intelligence and Data Protection suggest that, while large datasets are essential for machine learning during training, it is crucial to adopt a design paradigm that rigorously evaluates the nature and volume of data, reduces redundant or marginal data, and gradually expands the training dataset size. Additionally, research has investigated the development of algorithms with automatic forgetting mechanisms, which gradually delete data, though this approach may impact the post-decision interpretability of AI. Using synthetic data, anonymized and based on subsets of personal data, in algorithm training also aligns with data minimization principles. At the national level, countries like France and Italy focus on the legal grounds for data processing, data accuracy, and transparency. Italy highlights information transparency, legal justification for data processing, data accuracy, and protections for minors. France’s data protection authority, CNIL, emphasizes data sources, legal grounds for processing, sensitive data, data minimization, anonymization, data accuracy and representativeness, and data quality, quantity, and bias. CNIL is developing concrete recommendations for the design of AI systems and the construction of machine learning databases, aiming to address issues such as the use of research systems in constructing and reusing training databases, the purpose principle in foundational models like general AI and large language models, and clarifying shared responsibility among database assemblers, model trainers, and users of the trained models. CNIL’s Guide to Rules on Sharing and Reuse of Public Data outlines unresolved issues, including the construction and use of training databases for research purposes, application of the purpose, accuracy, and minimization principles, responsibility allocation among various entities involved in the data processing chain, and the management of individual rights.
The UK also recommends enhancing transparency regarding training data. For instance, the UK Department for Science, Innovation, and Technology, in its Pro-Innovation AI Regulation (2023), suggests improving transparency in training data. It encourages regulators to set expectations for relevant entities throughout the AI lifecycle to proactively disclose information about the data they use, as well as details related to training data, in alignment with the principles of reasonable transparency and explainability.
Regarding training data transparency requirements, China’s Interim Measures for the Management of Generative Artificial Intelligence Services also sets corresponding standards for generative AI service providers. These providers must conduct pre-training and optimization training activities in compliance with the law, respect intellectual property rights, and uphold the legitimate rights of others. Based on the characteristics of the service type, effective measures should be taken to enhance the transparency of generative AI services, improving the accuracy and reliability of generated content. Additionally, generative AI providers are required to label relevant generated content in accordance with the Provisions on the Administration of Deep Synthesis of Internet Information Services.
2.3 Active Use of Publicly Available Personal Information and Exploration of Public Data Collection Exemptions, Establishing Business Improvement and Research Exceptions for Data Processing
In terms of handling publicly available personal data, the United States adopts a pragmatic, industry-first approach, favoring the active circulation and use of publicly accessible personal information. Federal legislative drafts and state laws hold a consistent stance, excluding publicly accessible information from the definition of personal information. Instead of defining “public personal information,” they use the broader concepts of “publicly obtainable” and “publicly available information.” Many U.S. state laws directly exclude public information when defining personal information. Currently, there is no comprehensive federal data privacy law in the United States; data governance relies mainly on industry self-regulation and self-governance. Although Congress has passed specific laws that set data requirements for certain industries and data subcategories, these protections are not comprehensive. In the first half of 2023, several U.S. legislators introduced four AI-related proposals, each with different focuses but without proposing substantial regulatory frameworks. The Congressional Research Service continues to monitor data and copyright issues arising from generative AI. At this stage, the focus of AI regulation is on interpreting how existing laws apply to AI technologies rather than introducing new AI-specific laws. For instance, the Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems asserts that existing laws apply to automated and innovative systems in the same way as to other industry practices. The Federal Trade Commission concentrates on the legality of data collection and whether it leads to discriminatory outcomes, mandating companies to delete algorithms and outputs derived from improperly collected data. Under current U.S. law, generative AI may be subject to certain privacy laws depending on the context, developers, data type, and intended purpose of the model. For example, if a company offers chatbots in video games or other online services targeting children, it may be required to comply with the Children’s Online Privacy Protection Act (COPPA). In addition, privacy, biometric, and AI-related laws in certain states may impact the use of generative AI. In many cases, the collection of personal information is governed by state privacy laws that grant individuals the “right to know” what information companies collect about them, how it is used and shared, and the “right to access and delete” specific data or “opt-out” of data transfers and sales. However, some of these laws include exemptions for the collection of public data, potentially raising questions about how, or if, these exemptions apply to generative AI tools that gather information from the internet. Regarding public data openness and sharing, the U.S. federal level has established a unified open platform (data.gov) with standardized data formats and structures. This platform provides a high volume of diverse and frequently updated data, enhancing the accessibility of data for AI training purposes.
Regarding compliance in data processing for generative AI, the UK Information Commissioner’s Office (ICO) highlights that supervised machine learning primarily utilizes data during the training and inference phases. When a model is used to predict or classify individuals, both phases involve personal data. During the training phase, machine learning algorithms apply to datasets containing individual characteristics that generate predictions or classifications, though not all features in a dataset are necessarily relevant to the intended purpose. For instance, not all financial and demographic features are suitable for predicting credit risk. Therefore, it is essential to assess which features (and data) are relevant to the purpose and limit processing to these, minimizing the use of personal data. Additionally, privacy-enhancing techniques such as data perturbation or “noise” addition, synthetic data, and federated learning should be employed. In the inference phase, personal data minimization can be achieved by converting personal data into a less “human-readable” format, conducting inference locally, and using privacy-protecting query methods. For the lawful basis of using scraped data to train generative AI, the ICO’s consultation on Generative AI and Data Protection proposes a legitimate interest assessment framework, specifying that AI developers should conduct a three-step test: first, the purpose test to confirm the processing purpose is legitimate; second, the necessity test to verify the processing is essential for that purpose; and third, the balancing test to ensure individual rights do not override the interests pursued by the AI developer.
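Two of the privacy-enhancing ideas mentioned above, data minimization and perturbation ("noise" addition), can be sketched briefly. The following Python fragment is an illustrative sketch only: a Laplace mechanism in the style of differential privacy stands in for the ICO's general notion of perturbation, and the epsilon, sensitivity, and feature indices are assumed values, not regulator-prescribed parameters.

```python
# A minimal sketch of data minimization (keep only features relevant to the
# stated purpose) and perturbation (add calibrated noise before training).
import numpy as np

rng = np.random.default_rng(0)

def minimize_features(X: np.ndarray, relevant_idx: list) -> np.ndarray:
    """Data minimization: drop columns not needed for the stated purpose."""
    return X[:, relevant_idx]

def perturb(X: np.ndarray, epsilon: float = 1.0,
            sensitivity: float = 1.0) -> np.ndarray:
    """Perturbation: add Laplace noise scaled to sensitivity/epsilon, in the
    style of differential privacy; smaller epsilon means stronger privacy."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=X.shape)
    return X + noise

# Records with financial and demographic columns; suppose only the first
# three are actually relevant to predicting credit risk.
X = rng.normal(size=(100, 6))
X_min = minimize_features(X, relevant_idx=[0, 1, 2])
X_train = perturb(X_min, epsilon=0.5)
print(X_train.shape)  # (100, 3): fewer columns, noisier values
```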
Regarding the reasonable use of personal data in AI, Singapore has clearly defined business improvement and research exceptions to promote industry development. For example, on March 1, 2024, the Personal Data Protection Commission (PDPC) of Singapore issued the Guidance on the Use of Personal Data in AI Recommendation and Decision Systems (hereinafter referred to as the “Guidance”) under the 2012 Personal Data Protection Act (PDPA). This Guidance establishes a business improvement exception, which allows enterprises to use personal data collected in compliance with PDPA’s data protection requirements without the need for consent or notification, provided that the data use falls within the scope of business improvement or research purposes. Article 5.2 of the Guidance stipulates that under Part 5 of the First Schedule and Chapter 2 of Part 2 of the Second Schedule of PDPA, organizations may use personal data collected under PDPA without individual consent if the data use fulfills the following business improvement purposes (the “business improvement exception”): (c) learning about or understanding the behaviors and preferences of individuals (including groups profiled by user characteristics); (d) identifying products and services that may be suitable for individuals (including groups profiled by user characteristics) or personalizing or customizing any such products or services. Article 5.4 provides illustrative examples, stating that AI system development may fall under the business improvement exception, such as (d) using AI systems or machine learning models to deliver new product features and functionalities that enhance product and service competitiveness.
3. Legislative Recommendations for Data Governance in Generative AI Training
As generative AI continues to evolve, China must balance the development and safety of artificial intelligence from a strategic perspective, prioritizing industry growth to enhance national competitiveness. Adopting an inclusive, prudent, and classification-based regulatory stance, China should build a governance framework for training data in generative AI. To address issues such as the compliance of data sources during AI training, protection of personal data rights, data quality, unlawful discrimination and bias, and misuse of public data, policy upgrades are needed at both the macro and the specific-rule level. At the macro level, this means upgrading supportive AI policies and formalizing them in industry legislation, with a legislative orientation that emphasizes inclusivity, prudence, and tiered regulation, including the establishment of an AI regulatory sandbox to foster innovation under regulatory oversight. At the level of specific data rules, it is necessary to establish a reasonable data use system, standards for data anonymization, rules for using publicly available personal data, and new rights for processing machine learning data.
3.1 Upgrading AI Development Promotion Policies and Legalizing Industry Promotion Policies to Enhance National Competitiveness
Artificial intelligence is a key driver of the new wave of technological revolution and industrial transformation. Accelerating the development of next-generation AI is strategically crucial for China to seize opportunities in this era of technological and industrial change. Therefore, to ensure China remains competitive and rides the wave of the new technological revolution, it is vital to achieve high-quality development of the general AI industry. It is recommended that, under the national strategy of building China into a strong technological power, the state updates and upgrades AI development promotion policies to drive a new round of AI industry upgrades and formalizes these policies within a future Artificial Intelligence Law. Historically, industrial policies have played a critical role in the growth of China’s information industry. For example, in the late 1990s, the “Four Incentive Policies” for the electronics and information industry, along with policies such as the Electronics Development Fund, initial telephone installation fees, and incentives for software and integrated circuits, greatly motivated enterprises to develop the information industry, positioning it as a key driver of national economic growth. For specific industry promotion policies and legislation, it is recommended to follow the example of the Notice of the State Council on Policies to Promote the High-Quality Development of the Integrated Circuit and Software Industries in the New Era (State Council No. 8 [2020]), which was an “upgraded iteration” of the Several Policies to Encourage the Development of the Software and Integrated Circuit Industries (State Council No. 18 [2000]). Similar to these past policies, dedicated promotion policies should be introduced for generative AI, covering areas such as taxation, investment, financing, research and development, imports and exports, intellectual property, and market applications. These industry promotion policies should be enshrined in a future Artificial Intelligence Law, explicitly defining relevant promotional provisions. Additionally, local governments and sectoral departments should be encouraged to pioneer these initiatives and issue tailored policy measures that account for regional and industry-specific needs.
3.2 Upholding an Inclusive, Prudent, and Tiered Regulatory Approach; Establishing a Regulatory Sandbox and Other Experimental Oversight Systems Suitable for the Development Stage of China’s AI Industry
The EU’s Artificial Intelligence Act classifies and regulates AI systems according to their risk levels. For “high-risk” AI applications, the Act imposes stricter requirements on data quality, transparency, and accuracy, with heightened mandatory provisions. China, following the principle of balanced development and regulation, specifies in Article 3 of the Interim Measures for the Management of Generative AI Services that the state “adheres to the principles of balancing development with security, promoting innovation alongside lawful governance, and encourages generative AI innovation through inclusive, prudent, and tiered regulatory measures.” Therefore, it is recommended to adopt an inclusive and tiered regulatory approach as the foundational legislative direction for China’s future AI legislation. A framework should be established to classify and regulate models and risks at various levels, each level corresponding to a distinct regulatory approach.
Overall, the regulatory sandbox mechanism embodies principles of proactive intervention, agile oversight, inclusiveness, and proportionality. An AI regulatory sandbox is a flexible and agile risk governance tool that allows regulators, generative AI service providers, and users to better observe and manage the risks brought by generative AI. The EU’s Artificial Intelligence Act mandates the AI regulatory sandbox for all member states, reducing regulatory burdens on businesses. Chapter Five, “Measures to Support Innovation,” specifies the purpose and function of the AI regulatory sandbox. Article 54, in particular, addresses the processing and usage of personal data for AI systems developed in the public interest, allowing further data processing under lawful circumstances. Spain and the UK are leading pilot AI regulatory sandbox initiatives, while more than ten other EU member states also plan to establish their own sandboxes.
Therefore, it is recommended that the future Artificial Intelligence Law establish a regulatory sandbox and other experimental regulatory systems suited to the current development stage of China’s AI industry. The design should include four main stages: entry standards, structural experimentation, sandbox evaluation, and system framework. In the accompanying regulations of the future Artificial Intelligence Law, a fair entry threshold should be established, where AI companies applying to join the regulatory sandbox must meet certain entry criteria regarding corporate governance, personnel allocation, and technological capability. Further refinement of operational rules within the sandbox is suggested, allowing for differentiated rules under a unified framework. For example, exemption methods and testing durations could first be trialed in different regional sandboxes, with adjustments made based on practical outcomes. A unified standard for sandbox data should be developed, alongside a platform for data transmission, integration, and sharing to facilitate interconnectivity and data sharing. This will enhance information disclosure and increase transparency in law enforcement throughout the sandbox testing process.
3.3 Distinguish Between the Stages of R&D Training and Commercial Provision; Establish a “Safe Harbor” System for Training Data and Introduce Reasonable Data Use Exceptions for Research and Business Improvement
During the training phase of generative AI, it is inevitable to use datasets containing copyrighted content, personal information, and publicly available data. To foster AI research, industrial innovation, and business improvement, it is crucial to establish a reasonable data use system. When the EU's GDPR and China's Personal Information Protection Law were first legislated, they did not account for machine learning or training data scenarios. However, both regions have extended existing data protection laws to the AI field, emphasizing compliance of data sources and transparency in data processing. While overly strict personal information protection rules may conflict with the development, deployment, and application of general-purpose models, it is still necessary to strengthen the compliance of data sources and processing in training datasets, setting reasonable degrees of protection for generative AI training data. Currently, the training use of vast data resources by generative AI faces legal obstacles. It is recommended that the future Artificial Intelligence Law distinguish between the R&D training and commercial provision stages, drawing on the "Safe Harbor" concept applied during the early development of internet search engines. This "Safe Harbor" system for training data would allow users to leverage data for R&D or application even if they are unaware of the legality of its source; in the event of subsequent rights claims, users would be required to pay or compensate according to legal provisions. Additionally, personal information protection rules could be refined by drawing on the EU's GDPR, the UK ICO's legitimate interest assessment standards, and Singapore's PDPC guidance on the use of personal data in AI recommendation and decision systems, creating exceptions for scientific research and business improvement. For example, the laws of EU member states may restrict certain personal information rights, such as access, rectification, restriction of processing, and objection, for purposes of public interest, scientific or historical research, or statistics, and may provide a defense against deletion requests. It is also recommended to add provisions in the revised Regulations for the Implementation of the Copyright Law of the People's Republic of China, clearly stipulating that text and data analysis, training, and mining are limitations or exceptions under copyright law, thereby clearing legal obstacles to high-quality data collection for AI model training.
3.4 Further Refine Open Data Utilization Rules, Strengthen Data Quality Management, and Standardize Data Anonymization
High-quality, large-scale open datasets are essential for AI model training, with a growing emphasis on data quality standards. To better utilize public data resources for accessible training datasets and to facilitate the legal and compliant flow of open data, further refinement of open data utilization rules is needed based on China’s Civil Code and Personal Information Protection Law. The Artificial Intelligence Law should include a dedicated chapter with specific provisions on the acquisition, use, circulation, and processing of open data, allowing selective access to certain public data for training and use, thereby expanding the openness of public data. This would promote standardized processes for data collection, cleansing, labeling, and storage, removing obstacles to public data access and utilization for large models. In addition, data quality management should be strengthened by establishing standardized goals, data formats, labeling methods, quality indicators, and data labeling rules. Necessary training for labeling personnel should also be implemented to form standardized operating procedures and develop quality control plans, ensuring high-quality labeling results.
From a technical standards perspective, a data anonymization system will be essential for data entering production and circulation. A reasonable anonymization standard should be comprehensive and applied throughout the entire data lifecycle, including collection, processing, use, and reuse. Therefore, a unified data anonymization standard should be established, along with detailed technical standards and guidelines for anonymization practices. This standard should follow the “reasonable anonymization” principle, which means that under current technological conditions, if a normal and rational person, using conventional methods, cannot trace back the anonymized data, the anonymization obligation should be considered fulfilled.
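One way a unified standard could make "reasonable anonymization" testable is to adopt measurable indicators such as k-anonymity over quasi-identifiers. The following is a minimal sketch in Python (pandas) of such an indicator; the column names and the choice of k are illustrative assumptions, and k-anonymity is only one of several possible technical criteria for the legal standard described above.

```python
# A minimal sketch of a k-anonymity check: every combination of
# quasi-identifier values must occur in at least k records, so that no
# individual can be singled out by those columns alone.
import pandas as pd

def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: list,
                          k: int = 5) -> bool:
    """True if each quasi-identifier combination covers at least k records."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Usage with a tiny illustrative dataset.
df = pd.DataFrame({
    "age_band": ["30-39", "30-39", "40-49", "40-49", "30-39", "40-49"],
    "region": ["north", "north", "south", "south", "north", "south"],
    "diagnosis": ["a", "b", "a", "a", "b", "b"],
})
# True here: each (age_band, region) group contains 3 records.
print(satisfies_k_anonymity(df, ["age_band", "region"], k=3))
```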
3.5 Establish New Rights and Rules for Data Processing in Machine Learning Scenarios
To further protect individual data rights and address privacy issues arising from anonymization failures, it is necessary to establish new rights and rules for data processing in machine learning contexts, including defining a system for synthetic data use. Future AI legislation will need to address data and privacy protection for training datasets, as data protection and privacy pose obstacles to sharing high-quality data. Training datasets often involve third-party rights, requiring permission from rights holders to process and use the data. Companies may also protect their investment in training AI models by keeping datasets and entire databases confidential through contractual and technical means. Additionally, fear of GDPR compliance has significantly hindered AI and data startups from quickly launching and scaling. Uncertainty around legal data ownership further complicates matters, as stakeholders often lack clarity on who legally owns the data and what permissible actions data holders may take. Further refinement of data privacy rules for generative AI is necessary, with explicit provisions allowing the use of synthetic data. To some extent, synthetic datasets can outperform traditional anonymization techniques by addressing anonymization failures. Privacy regulations mandate that personally identifiable information must not be disclosed. Synthetic data protects privacy by adding statistically similar information rather than merely removing unique identifiers. For instance, the UK’s ICO employs methods such as perturbation or “noise,” synthetic data, and federated learning to enhance privacy. Therefore, it is recommended that China’s future Artificial Intelligence Law establish new rights and rules for data processing in machine learning scenarios. This would involve creating legal frameworks that permit data access, sharing, and reuse, while also constructing fair methods for accessing, sharing, and reusing machine learning training, testing, and validation datasets. A new data processing right specifically for machine learning purposes should be introduced, allowing for data access, sharing, and reuse in AI and Internet of Things contexts.
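The idea that synthetic data "adds statistically similar information rather than merely removing unique identifiers" can be illustrated with a toy generator. The sketch below fits a multivariate Gaussian to numeric data and samples fresh records; real synthetic-data tools use far richer generative models, so this is an assumption-laden illustration preserving only means, variances, and linear correlations, not a production technique.

```python
# A minimal sketch of generating statistically similar synthetic records:
# fit simple joint statistics on the real numeric data and sample new rows,
# so no original record (or unique identifier) appears in the output.
import numpy as np

rng = np.random.default_rng(0)

def synthesize(real: np.ndarray, n_samples: int) -> np.ndarray:
    """Sample synthetic rows from a Gaussian fitted to the real data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Correlated stand-in "real" data with four numeric columns.
real = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
synthetic = synthesize(real, n_samples=500)

# The synthetic sample should roughly reproduce the real column means.
print(np.allclose(real.mean(axis=0), synthetic.mean(axis=0), atol=0.3))
```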
Conclusion
Generative AI holds significant strategic value, representing a key area in future technological competition and serving as a crucial foundation for intelligent infrastructure, meriting attention from the perspective of national competitiveness. As generative AI continues to evolve, it enhances productivity and social welfare while introducing numerous challenges across political, economic, social, cultural, and legal-ethical dimensions. Countries are adjusting the stringency of AI development policies and regulatory frameworks based on their respective societal conditions and stages of industrial development. At this stage, China should enhance its AI development policies at a macro level, formalizing industry promotion policies through legal measures. By adhering to an inclusive, prudent, and tiered legislative approach, experimental regulatory systems like regulatory sandboxes should be established to suit the current stage of AI industry development. On the data-specific regulatory level, it is essential to distinguish between the R&D and commercial phases, establish a “safe harbor” system for training data, and introduce reasonable data use exceptions for research and business improvement. Additionally, refining public data utilization rules, strengthening data quality management, standardizing data anonymization practices, and creating new rights and rules for data processing in machine learning contexts will contribute to a robust data governance framework for generative AI training data in China.
The original article was published in the Administrative Law Review, Issue 6 2024, and is reposted from the WeChat official account “Editorial Department of Administrative Law Review”.