
On Fair Use of Training Data for Large Language Models



LI Mingxuan

Lecturer, School of Interdisciplinary Studies and Big Data and Responsible Artificial Intelligence for National Governance, Renmin University of China


Abstract: The primary sources of training data for large language models are publicly available data on the internet. Developers typically collect these data at scale through web crawling and by aggregating open-source datasets. However, as the protection of data property rights is increasingly reinforced, the legitimacy of this approach faces growing legal challenges. The large number of data rightsholders and the difficulty of tracing data usage significantly increase transaction costs, making it impractical for developers to obtain individual licenses through market mechanisms to ensure the lawful use of training data. In this context of market failure, permitting the fair use of data for training large language models can increase social welfare and generally does not harm the market interests of data rightsholders. Alternatives such as collective management or statutory licensing offer limited benefits to rightsholders while imposing higher institutional costs and potentially hindering the development of large language models in China. Therefore, a fair use rule for training data should be established to provide legal certainty for technological innovation. In terms of rule design, fair use should be limited to publicly available data, apply solely for the purpose of pre-training, cover the data processing involved in training, and allow data rightsholders to opt out through technical measures.

Keywords: Large Language Models; Training Data; Data Property Rights and Interests; Fair Use; Market Failure


Introduction

In recent years, artificial intelligence applications based on large language models (hereinafter "large models"), such as ChatGPT and DeepSeek, have developed rapidly and profoundly changed people's lives. A large model is "a language model constructed from a deep neural network containing hundreds of billions of parameters or more." Unlike earlier artificial intelligence models, large models demonstrate extremely powerful language capabilities and a general ability to handle a wide variety of tasks, opening up the possibility of achieving general artificial intelligence (i.e., strong artificial intelligence). Large models have therefore become one of the most important technologies in the field of artificial intelligence.

Data is a key element in the development of large models. Compared with earlier models, large models place higher demands on both the scale and the diversity of data. The performance of large models follows a "scaling law": as the amount of training data increases, model performance improves accordingly. For instance, the training data volume of OpenAI's GPT series grew from 5 GB for GPT-1 to 40 GB for GPT-2, and then expanded rapidly to 45 TB for GPT-3. In addition, the types of data required for training large models are becoming increasingly diverse, including web pages, books, and encyclopedias. The diversity of training data not only helps enhance the general capabilities of large models but also contributes to their fairness and inclusiveness. Large-scale and diverse training data has therefore become an important foundation for the development of large models.
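As a point of reference, the machine-learning literature on scaling laws (which this paper invokes but does not formalize) often approximates the relationship between data and performance as a power law, roughly of the form

$L(D) \approx (D_c / D)^{\alpha}$

where $L$ is the model's test loss (lower is better), $D$ is the amount of training data, and $D_c$ and $\alpha$ are empirically fitted constants; as $D$ grows, the loss falls and performance improves. The constants vary across studies and model families, so this should be read as an illustrative form rather than a claim made in the original text.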

However, the acquisition and use of training data for large models are subject to legal restrictions. Data is not only an input to artificial intelligence but also a carrier of legal rights and interests, including copyright, personal information rights, and data property rights. The acquisition and use of training data may fall within the scope of control of these rights, which means that developers of large models may not obtain and use the data without the permission of the rightsholders. This imposes significant restrictions on the acquisition and utilization of training data and may in turn affect the development of large model technology. Many commentators therefore suggest that, in the context of large model training, limitations on copyright and personal information rights should be strengthened to remove potential legal obstacles.

Existing research, however, focuses mainly on the impact of copyright and personal information protection on the training of large models, and rarely discusses the problems raised by the protection of data property rights. Yet the impact of data property rights protection on the acquisition and use of training data cannot be ignored. Even if developers of large models enjoy exemptions from liability in the fields of copyright or personal information, data holders who can assert property rights in the data may still restrict developers' acquisition and use of it. From a systemic perspective, therefore, removing the legal obstacles to the acquisition and use of training data requires a comprehensive examination of the various rights and interests in the data "bundle of rights" and coordinated, consistent institutional design. Disputes over data property rights and interests have also arisen in practice. On June 4, 2025, Reddit filed a lawsuit against Anthropic, accusing it of using web crawlers to obtain content data from its platform without permission and using that data to train large models, which Reddit alleged constituted breach of contract, unjust enrichment, and tortious conduct (including unfair competition). In this case Reddit, as the plaintiff, is neither the copyright owner nor the personal information subject; the dispute therefore turns not on whether the developer infringed copyright or personal information rights, but on whether it infringed property rights in data.

This article therefore discusses the problems arising from the protection of data property rights in the context of large model training and proposes workable solutions. Although there is not yet consensus on the institutional design of data property rights in China, this does not affect the premise of this article: the acquisition and use of training data for large models may fall within the scope of control of data property rights, and as the protection of those rights is strengthened, the acquisition and use of training data will be subject to more restrictions. This paper argues that, in the face of the obstacles that data property rights protection creates for large model training, introducing a fair use rule for large model training data is a feasible solution. Accordingly, this paper focuses on the fair use of training data for large models, discusses the background and theoretical basis for introducing such a rule, and proposes a specific plan for its construction.

1. Background for Introducing Fair Use of Training Data for Large Models

In this section, the paper aims to show that the protection of data property rights and interests may lead to failure of the market for large model training data. This is an important part of the background for introducing fair use of training data. The section first examines the main sources of training data for large models, then discusses the impact of data property rights protection on the acquisition and use of training data, and finally analyzes the market failure that may result.

1.1 The main sources of training data for large models

Large models are essentially language models whose core objective is to model the probability distribution of natural language. Take the GPT models as an example: their task is to predict the next word given an existing word sequence, that is, to construct a conditional probability distribution. Model training is the process of building such a model from data; it mainly refers to the iterative optimization of model parameters on input training data so as to minimize the error between the model's predicted output and the labels in the real data. The training data for large models such as GPT is therefore essentially the pairing of word sequences (features) with the next word (label), and the vast amount of publicly available text on the internet provides rich raw material for constructing such training data.
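In simplified notation (a generic formulation of autoregressive language modeling, not a description of any particular model's implementation), the model with parameters $\theta$ learns the conditional distribution of the next word and is trained by minimizing the negative log-likelihood over the training corpus:

$p_\theta(w_t \mid w_1, \dots, w_{t-1})$, with training objective $\min_\theta \; -\sum_{t} \log p_\theta(w_t \mid w_1, \dots, w_{t-1})$.

Each position $t$ in a text thus supplies one training pair: the preceding words $w_1, \dots, w_{t-1}$ serve as the features and the actual next word $w_t$ as the label, which is why ordinary public text can be converted into training data at essentially no additional labeling cost.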

In terms of sources, the training data for large models comes mainly from publicly available data on the internet, which developers obtain at scale in two ways. The first is crawling public web pages. Developers can use technologies such as web crawlers to obtain large amounts of data from public web pages and use it to build training datasets. For instance, OpenAI built a high-quality web text dataset called WebText by crawling the pages behind outbound links posted on Reddit that had received at least three karma, and used it to train the GPT series of models. At present, many developers have deployed dedicated web crawlers, such as OpenAI's GPTBot, to collect data from public web pages at scale, providing rich raw material for training datasets. The second is collecting open-source data, that is, data resources released under open-source licenses whose publishers generally allow others to use the collected data freely and without charge. Developers frequently gather large amounts of open-source data and build training datasets on that basis. For example, developers commonly use Common Crawl or datasets derived from it. Common Crawl is a non-profit organization that regularly crawls web pages across the entire internet, accumulating data at the petabyte level, which is stored and made available for public access and download. Because the volume is so large, developers building training datasets usually extract suitable subsets from Common Crawl or use other open-source datasets built on top of it, such as the Colossal Clean Crawled Corpus (C4).

So far, the vast majority of developers have been able to build datasets suitable for large model training because the two acquisition methods described above have been, in effect, both unrestricted and free of charge. First, developers can freely obtain large amounts of data through these methods without paying the high search and negotiation costs of obtaining licenses. Second, developers have not paid compensation for these data, saving substantial costs. As a result, many developers have been able to obtain and use data of sufficient quantity and variety to train large models, driving continuous innovation and development in the field.

1.2 Protection of Data Property Rights and Interests and Its Impact

The data property rights and interests discussed in this article refer to the general property rights and interests that data holders enjoy in their data. The concept does not include copyright or personal information rights, nor does it cover contractual rights regarding data. Under current Chinese law, data property rights and interests mainly refer to the property interests of data holders protected through tort law and the Anti-Unfair Competition Law. Apart from trade secret protection for non-public data, Chinese courts mainly protect public data under the Anti-Unfair Competition Law. In practice, data holders often claim that others' acquisition and use of their data constitutes unfair competition under Article 2 of the Anti-Unfair Competition Law (the general clause), and many courts have found such conduct to constitute unfair competition and to infringe the legitimate rights and interests of data holders. Through the application of the general clause, Chinese courts have in effect created a "competitive property right" in public data, enabling data holders to prevent competitors from improperly obtaining and using their data. In June 2025, the newly revised Anti-Unfair Competition Law added a paragraph to Article 13, confirming this judicial practice and specifying the infringement of data property rights and interests as a statutory type of unfair competition.

There have long been calls in China to strengthen the protection of data property rights and interests. Many commentators hold that the current regime's protection of data property interests is insufficient and that a rights-based protection model should be adopted, that is, legislation should confirm ex ante that data holders enjoy property rights in data. These proposals remain at the stage of theoretical discussion and have not yet been translated into specific legal institutions. Judging from policy trends in recent years, however, China is actively exploring the establishment of a data property rights system and may put it into practice in the future. In December 2022, the Central Committee of the Communist Party of China and the State Council issued the "Opinions on Building a Data Foundation System to Better Leverage the Role of Data Elements" (hereinafter the "Data Twenty Articles"), explicitly stating the need to "establish a data property rights system that safeguards rights and interests and ensures compliant use," signaling support for confirming data rights at the national policy level and providing a framework for future legislation. These signs indicate that China may gradually establish a data property rights system and further strengthen the protection of data property rights and interests.

As the protection of data property rights and interests is strengthened, the two main methods currently used by developers to obtain massive amounts of data will face serious legal challenges. First, the legal space for crawling public web pages is tightening. Crawling public web pages is likely to fall within the scope of control of data property rights and interests, and will be further restricted as their protection is strengthened. Under the current interest-protection model, the threshold for finding that data crawling constitutes unfair competition is already low. In particular, when assessing impropriety, a crawler that violates a website's robots protocol or terms of service is highly likely to be found to have acted improperly. In practice, it is increasingly common for websites to restrict crawling by large model developers through robots protocols or terms of service. One study examined three of the open-source datasets most commonly used for large model training, C4, RefinedWeb, and Dolma, and found that since the rise of large models many websites have steadily tightened restrictions on crawling through robots protocols and terms of service. For instance, on the most critical websites, approximately 20% to 33% of content was restricted by robots protocols in April 2024, compared with less than 3% a year earlier; crawling restrictions in website terms of service also increased by 26% to 53%. The low threshold for a finding of unfair competition, combined with ever-expanding contractual restrictions, makes it increasingly likely that developers' unauthorized crawling of public web pages will be held to constitute unfair competition. Moreover, if a rights-based protection model is adopted in the future, developers would need a website's permission before crawling its data even where the website has not expressly restricted crawling through a robots protocol or terms of service; otherwise the crawling would infringe data property rights. It will therefore become increasingly difficult for developers to obtain massive amounts of training data by crawling public web pages.
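To illustrate how the robots protocol expresses such restrictions in practice, the following sketch uses a hypothetical robots.txt that blocks OpenAI's GPTBot crawler site-wide while leaving other agents unrestricted, and checks it with Python's standard urllib.robotparser; the site and paths are invented for illustration only.

from urllib import robotparser

# Hypothetical robots.txt: disallow GPTBot everywhere, allow all other agents.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler checks the policy before fetching any page.
print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True

Whether violating such a directive amounts to unfair competition is, as discussed above, a legal question; the snippet only shows the technical mechanism by which websites signal their restrictions to crawlers.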

Second, the legality of using open-source data is uncertain, and building large-scale training datasets by collecting open-source data is also significantly constrained. To begin with, the construction of many open-source datasets itself relies on crawling public web pages. As the legal space for such crawling tightens, the resources available for building open-source datasets shrink as well. If the construction of an open-source dataset involved acts that infringed data property rights and interests, the legitimacy of the dataset itself is questionable, and developers who use it may face legal risk. Furthermore, even if the open-source dataset itself is unproblematic, it does not follow that developers may lawfully use it for large model training. On the one hand, developers may exceed the scope of the publisher's license, for example where a dataset is licensed only for scientific research but is used to train commercial models. On the other hand, even use within the license scope is not necessarily lawful, because the license granted by the dataset's publisher may exceed the authority the publisher itself enjoys. For instance, some websites allow non-profit organizations such as Common Crawl to crawl their web pages but restrict the crawlers that large model developers use to collect data. Once Common Crawl has crawled those pages, however, other developers may use the data to train large models, and Common Crawl's terms of use do not expressly restrict such use. If the website holds general data property rights in its data, a developer's use of the data to train large models may infringe those rights even though it does not violate Common Crawl's terms of use. As the protection of data property rights and interests is strengthened, developers therefore face more restrictions and greater uncertainty when building training datasets from open-source data.

1.3 Failure of the training data market and its causes

Against the backdrop of increasingly strong protection of data property rights, the main methods of collecting massive training data face legal obstacles, which limits the freedom to train large models on data. In many cases, developers must obtain the permission of data property rightsholders through market mechanisms to ensure the legality of their training data. Based on current practice, however, obtaining such permission through the market is very difficult, mainly because of high transaction costs. Transaction costs typically include search and information costs, bargaining and decision costs, and monitoring and enforcement costs. In the market for large model training data, these costs are extremely high for several reasons, hindering transactions from taking place.

First, the training data of large models involves an enormous number of data property rightsholders, which imposes extremely high search and information costs and bargaining and decision costs on developers. In the past, typical data utilization scenarios involved specific purposes, functions, or fields, and the required data was often concentrated in the hands of a limited number of stakeholders or intermediaries. In large model training, by contrast, the number of rightsholders implicated in the training data can increase sharply. Web page data, which accounts for the largest share, involves an enormous number of websites. For instance, the C4 dataset commonly used for large model training draws on approximately 15 million websites and contains as many as 156 billion tokens. The web data used to train many of today's large models has long since exceeded the scale of C4, reaching trillions of tokens, and the number of websites involved very likely far exceeds the roughly 15 million in C4. In principle, every website could be an independent holder of data property rights, so the number of rights subjects involved in training a large model could run into the tens of millions or even hundreds of millions. For developers, locating such a vast number of dispersed rightsholders entails extremely high search and information costs, and negotiating with or paying each of these websites one by one entails extremely high bargaining and decision costs. In this situation, continued emphasis on the protection of data property rights and interests would place a heavy burden on developers, who would almost never be able to obtain and use these data through market mechanisms.

Second, the use of training data for large models is in practice difficult to trace, which sharply increases monitoring and enforcement costs. In typical data disputes in the past, many acts infringing data property rights and interests, such as providing or displaying others' data, were public and therefore easy to trace. In large model training, by contrast, data use is more concealed and harder to prove. To begin with, the training of large models, including the data use it involves, generally does not take place in a public setting, so outsiders cannot directly observe or monitor it. Moreover, for commercial developers, information about training methods, including the sources, content, and processing of training data, is of vital competitive importance, and developers keep it confidential to maintain their edge. Finally, the output of training is a set of model parameter weights rather than a copy of the training data, and although developers ultimately offer large model services to the public, it is difficult for outsiders to infer the training data used from the model itself or from its outputs. Because tracing the use of training data is so difficult, strictly enforcing a data property rights protection regime in these circumstances would require very substantial monitoring and enforcement costs.

It is evident that high transaction costs prevent developers from obtaining the permission of data property rightsholders through market mechanisms. In reality, this phenomenon has been accompanied by the "illegal rise" of large models. Because permission cannot feasibly be obtained through the market, the training of many large models in fact acquires and uses data that may be protected by others' rights without authorization, creating significant legal risk; for this reason, developers face numerous lawsuits brought by data holders. This "illegal rise" indicates, to some extent, that in the large model training scenario the market can hardly perform its role of facilitating transactions between artificial intelligence companies and data holders. In this situation, strictly enforcing the data property rights protection regime would not only incur extremely high monitoring and enforcement costs but also prevent developers from obtaining sufficient data for training. In other words, overly strong protection of data property rights may lead to market failure: developers cannot, through market mechanisms, reach socially beneficial transactions with data rightsholders, resulting in inefficient outcomes.

2. Institutional Justification for Fair Use of Training Data for Large Models

The potential failure of the market for large model training data provides the necessary background for introducing a fair use regime, since fair use is regarded as one of the principal means of addressing market failure. However, market failure is a necessary but not a sufficient condition for introducing fair use; justifying fair use of large model training data requires argument under additional conditions. Nor is fair use the only means of addressing market failure: collective management, statutory licensing, and other regimes are also possible responses. To justify fair use of training data, it is necessary to compare these alternatives and show that fair use is the better choice. In this section, the paper therefore first reviews the theories that justify the fair use regime and analyzes the legitimacy of introducing fair use for large model training data, and then discusses the alternative responses to market failure, showing that they cannot fully substitute for fair use.

2.1 The theory of institutional justification

Fair use originated in copyright law and refers to a regime under which, in specific circumstances, persons other than the copyright owner may use a work free of charge and without the owner's permission. The theory of fair use has been discussed most thoroughly in copyright law, where scholars have long built highly explanatory accounts of the regime around the concept of market failure. In her pioneering paper, Professor Wendy J. Gordon argues that fair use in copyright should be understood as an institutional response to market failure. Copyright law grants authors exclusive rights over their works to encourage literary and artistic creation; thereafter, the task of transferring works to those who can make the most valuable use of them is left mainly to the market. Owing to transaction costs and other factors, however, the market does not always bring about socially beneficial transfers, producing inefficient outcomes, that is, market failure. When the market fails, the problem must be addressed through channels outside the market. Fair use resolves market failure and enhances overall social efficiency by allowing potential users to use works freely and without charge, so that more people can enjoy the benefits the works bring.

However, the mere existence of market failure does not fully justify the fair use regime. Fair use is the broadest form of rights restriction: it not only grants users the freedom to engage in particular uses but also exempts them from paying consideration. Its justification therefore requires more stringent conditions. Professor Gordon was aware of this and sought to construct a complete theory justifying fair use. She pointed out that, beyond the existence of market failure, two further conditions must be met for a situation to constitute fair use: first, allowing the use must enhance social welfare; second, allowing the use must not substantially impair the incentives of copyright holders.

The market failure theory of copyright fair use and its analytical framework can be extended to the field of data property rights to justify a fair use regime for such rights. This follows mainly from the similarity in principle between data property rights and copyright. Like copyright, data property rights provide a reasonable return on the data holder's investment and serve as an effective incentive for data production and supply. Once data property rights are established, the market will, most of the time, channel data to the users best able to realize its value; but when the market cannot effectively achieve this goal, legal instruments such as fair use may intervene. By analogy with the conditions for justifying copyright fair use, three conditions must be met to show that a given situation constitutes fair use of data property rights: first, market failure exists, that is, users cannot obtain permission to use the data by paying appropriate fees through market mechanisms; second, allowing the use enhances social welfare; and third, allowing the use does not substantially impair the incentives of data property rightsholders.

2.2 Justification of fair use

Part 1 has already shown that the market for large model training data may fail. To establish the legitimacy of fair use of training data, the remaining two conditions must be verified: that allowing fair use enhances social welfare and that it does not substantially impair the incentives of data property rightsholders.

(1) Fair use enhances social welfare

The most direct evidence for assessing whether fair use enhances social welfare comes from current reality. The "illegal rise" of large models and the social benefits it has brought largely demonstrate that fair use can greatly enhance social welfare. Under the "illegal rise," the conditions under which developers obtain and use data to train large models are exactly those that would obtain under fair use: developers neither obtain the permission of data rightsholders nor pay them any fees. Yet in this situation large model technology has developed rapidly and generated tremendous social value. Large models are currently applied mainly in generative artificial intelligence, which has greatly advanced fields such as literature and art. By enhancing the content generation capabilities of generative artificial intelligence, large models provide new tools for literary and artistic creation: on the one hand, they lower the threshold for creation, giving more ordinary people the opportunity to participate; on the other, they offer new opportunities to professional artists, expanding the modes of and space for creation. Beyond literature and art, the general-purpose character of large models gives them the potential to generate value in other fields. General-purpose large models can replace specialized models across fields and tasks, reducing the difficulty and cost of development and enabling more people to harness their capabilities to build artificial intelligence applications that solve practical problems. This broad application prospect indicates that the development of large models will greatly improve individual efficiency and overall social welfare. In conclusion, introducing fair use of large model training data would legalize the current "illegal rise" of large models without imposing any additional burden on developers, thereby securing the rapid development of large model technology and the continued growth of social welfare.

(2) Fair use will not cause substantial harm

Compared with the justification of copyright fair use, the justification of fair use of data faces a greater challenge on this second condition. In copyright terms, developers' use of works as data usually constitutes non-expressive use, so it generally causes no substantial harm to the market interests of copyright owners. From the perspective of data property rights, however, developers' use of data may indeed harm the market interests of data property rightsholders and thereby affect their incentives. The first situation involves harm to the market for data services or products. If the services or products offered by a large model developer can substantially substitute for those offered by the data property rightsholder, the rightsholder's interests in the existing market for data services or products are very likely to be harmed. For instance, a rightsholder may have invested heavily to accumulate a substantial body of legal data and, on that basis, provide public legal information services; if a developer obtains that public data without permission, trains a large model on it, and then provides the public with the same or similar legal information services through the model, and such conduct were included within fair use, the rightsholder's interests in the existing market would very likely suffer significant harm. The second situation involves harm to the data licensing market. If a licensing market for large model training data has already formed between developers and data property rightsholders, or could plausibly be established, allowing developers to rely on fair use would harm the rightsholders' interests in that existing or potential licensing market. For instance, although the data on platforms such as Reddit is public, they have begun to build a business model of licensing public data, requiring developers to obtain permission and pay before using it. Against this backdrop, if the law allowed developers to obtain and use such platforms' data for free, it would inevitably affect the emerging data licensing market and harm the potential gains of data property rightsholders.

Nevertheless, this paper holds that these circumstances are not sufficient to deny the legitimacy of fair use of large model training data altogether. First, in most cases, allowing developers to use data under fair use to train large models will not harm the market interests of data property rightsholders. On the one hand, the services or products offered by developers usually do not compete directly with, or substantially substitute for, those offered by rightsholders. Developers use data to train large models mainly so that the models can learn the knowledge the data contains and improve their language and general capabilities, not to provide the same or similar services and products as the rightsholders. As long as the services or products offered to the public on the basis of large models differ significantly from those of the data rightsholders, introducing fair use will generally not harm the rightsholders' existing market interests. Empirical research likewise shows that the main practical uses of products built on large models, such as ChatGPT, differ markedly from the market segments from which their training data mainly comes. For instance, over 30% of ChatGPT conversations involve creative writing, yet creative writing accounts for only a small share of ChatGPT's training data; news-related data accounts for a relatively high share of the training data, yet less than 1% of ChatGPT usage relates to news. On the other hand, although some data licensing transactions have indeed taken place between developers and data property rightsholders, the data they cover is only a tiny fraction of the data used by large models, and they occur mainly between a few dominant developers and individual platform-type rightsholders. These transactions are likely isolated cases and cannot show that the failure of the training data market has been cured. In most cases, the extremely high transaction costs in that market mean that many transactions simply cannot occur, and it is very difficult for rightsholders to claim lost expected gains. Allowing developers to use data under fair use for training will therefore, in most cases, not harm rightsholders' interests in the data licensing market.

Second, although a small number of situations may harm the interests of data property rightsholders, these can be excluded from fair use by defining its scope of application precisely, thereby ensuring that rightsholders' interests are not harmed. Specifically, by clearly delimiting the objects, purposes, and methods of fair use, situations likely to harm rightsholders can be kept outside its scope. Rightsholders can also be allowed to opt out of fair use under specific conditions, enabling them to reach licensing agreements with developers at lower transaction cost and leaving room for an effective market in training data licensing to form.

In conclusion, the situations in which fair use may harm the rights and interests of data property rightsholders are exceptional. As long as the scope of fair use is appropriately limited, introducing fair use of large model training data will, in most cases, not harm the market interests of data property rightsholders.

2.3 Comparison of alternative solutions

Some argue that China can draw on experience in the copyright field and reduce transaction costs by introducing collective management or statutory licensing of data property rights, thereby solving the problem of market failure. Collective management is a regime in which rightsholders, through collective management organizations, license uses and collect the corresponding remuneration on their behalf. Statutory licensing is a regime in which, as provided by law, a user may use the protected subject matter in specified ways without the rightsholder's permission but must pay remuneration. Both have mature experience in copyright: by concentrating the exercise of rights or by fixing licensing fees directly, they address the excessive transaction costs that can arise under one-to-one authorization. Against this backdrop, some propose extending collective management or statutory licensing to large model training to address the extensive use of works involved; by the same logic, where training data implicates a large volume of data property rights, a similar collective management or statutory licensing regime could be adopted.

Whether under collective management or statutory licensing, the common feature relative to fair use is that developers must still pay fees to data property rightsholders. Supporters might argue that the advantage of these schemes over fair use is that they take rightsholders' interests into account: by providing economic compensation, they strike a better balance of interests. This article holds, however, that compared with fair use the advantages of collective management or statutory licensing are not obvious, while the institutional costs they entail cannot be ignored.

First, the benefits that collective management or statutory licensing would bring to data property rightsholders are very limited. On the one hand, because the number of rightsholders implicated in training data is so large, even if developers pay a considerable sum in aggregate, the benefit most rightsholders can obtain is extremely meager. Suppose the training data of a given large model involves 10 million websites; even if the developer were willing to pay a licensing fee of 100 million yuan, the average revenue allocated to each website would be only 10 yuan. Moreover, since the data also carries other rights such as copyright, data property rightsholders may need to share these proceeds with other rightsholders, so the benefit actually retained may be even smaller. On the other hand, data use in large model training is also less frequent. Many of the typical situations addressed by copyright collective management or statutory licensing involve continuous and frequent uses of works, which allow copyright holders to accumulate relatively substantial returns over time. Training data for large models lacks this feature: training is extremely costly, and the vast majority of developers do not use data for training frequently, so rightsholders may be unable to accumulate sufficient returns through repeated uses.

Second, collective management or statutory licensing would entail higher institutional costs. Because these schemes require establishing collective management organizations and determining, collecting, and distributing licensing fees, their establishment and operation cost more than fair use. Moreover, compared with copyright collective management or statutory licensing, such schemes for large model training data may face even higher institutional costs. Take the establishment and operation of collective management as an example. First, compared with copyright, establishing a collective management organization for data would cost more. In the copyright field there are already mature collective management organizations that have obtained authorization for many works, so extending copyright collective management to large model training would rest on a relatively solid foundation; in the field of data property rights there is no comparable foundation, and establishing an organization and obtaining authorization at sufficient scale would require greater cost. Second, large model training differs considerably from the typical situations that copyright collective management addresses. On the one hand, the types of data involved are more diverse: traditional copyright collective management organizations usually manage a single type of work, whereas large model training involves many types of data, and this diversity raises operating costs; the experience of managing a single type of work may not transfer to managing diverse training data. On the other hand, the scale of the objects to be managed is far larger. In large model training, the data and the property rightsholders involved may far exceed what traditional copyright collective management organizations handle. The largest copyright collective management organization in China is the Music Copyright Society of China, which as of the end of 2024 had 14,064 members and managed approximately 23 million musical works. As noted above, the data property rightsholders implicated in the training data of a single large model number in the tens of millions, and the web pages involved number in the hundreds of millions, far exceeding the scale managed by traditional copyright collective management. With a much larger scale of managed objects, the operating costs of a data collective management system would also rise significantly.

Third, collective management or statutory licensing would adversely affect the development of large models in China. On the one hand, such schemes would entrench the competitive advantage of large enterprises and create extremely high entry barriers for start-ups. Using data at scale would incur extremely high licensing fees: judging from the training data transactions concluded in practice, licensing fees between individual large model developers and individual platform-type data property rightsholders have already reached tens of millions of dollars per year, and paying all data property rightsholders would likely cost far more. Costs of this magnitude are beyond the reach of all but a few enterprises, which would greatly reduce competition in the field of large models and harm innovation. On the other hand, such schemes could affect China's international competitiveness in this field. Scholars have pointed out that if some countries choose not to require large model developers to pay, developers and development activity may migrate to those more permissive jurisdictions, a phenomenon of "innovation arbitrage." At a stage when large model technology is still evolving, if China took the lead in imposing this cost burden on developers, it could dampen their enthusiasm for developing and deploying the technology in China and weaken China's international competitiveness in artificial intelligence.

3. Rule Construction for Fair Use of Training Data for Large Models

Having established the legitimacy of fair use of large model training data, it is necessary to explore the construction of its specific rules. Unlike copyright fair use, fair use of data is more likely to affect the interests of data property rightsholders, so its scope of application must be strictly limited: it must address the problem of market failure while avoiding adverse effects on rightsholders' incentives. In this section, the article discusses the construction of specific rules for fair use of large model training data in terms of its objects, purposes, methods, and opt-out mechanism, and offers suggestions for improving legislation.

3.1 Objects of fair use: Public data

Data can be classified as public or non-public according to whether unspecified members of the public can in fact freely access and obtain its content. This article holds that the objects of fair use of large model training data should be limited to public data, for the following reasons:

First, as a matter of necessity, applying fair use to public data alone is sufficient to resolve the market failure in large model training. To begin with, the main source of training data for large models is public data, so securing the acquisition and use of public data is the first step toward securing training data. Moreover, the market failure in large model training data arises mainly in the acquisition and use of public data: the problems of numerous rightsholders and hard-to-trace usage, which are the main drivers of transaction costs, appear most prominently there, whereas transaction costs for non-public data are not obviously high. On the one hand, so long as access to and use of public data is secured, developers can generally obtain training data of sufficient scale; although additional non-public, high-quality data also helps training, its scale and the number of rightsholders involved are relatively small, making one-to-one direct licensing feasible. On the other hand, unauthorized acquisition of non-public data usually leaves more traces, and much non-public data is held exclusively by the rightsholder, so the rightsholder faces relatively little evidentiary difficulty in proving a developer's improper acquisition. There is therefore little need to extend the objects of fair use to non-public data.

Second, as a matter of reasonableness, the protection of and restrictions on public and non-public data should be differentiated. Although both are protected by law, the degree of property protection for public data should be weaker than for non-public data, and correspondingly the restrictions on rights in public data should be stronger. The primary reason for treating them differently is that public data is regarded as open and public in nature. A popular view holds that the internet is inherently open: "it allows anyone in the world to post information that anyone else can access without the need for authentication." When the owner of a computer decides to host a web server so that files can be accessed over the network, the default is to allow public access to those files. In other words, information and data made publicly available on the internet should by default be open to access and acquisition by others. On this view, public data, given its public character, should be subject to fair use to ensure the sharing and circulation of data. A further reason is that unauthorized acquisition and use cause different degrees of harm for public and non-public data, which relates to the holder's expectations of benefit. Overall, data holders have stronger expectations of benefit in non-public data: obtaining it usually requires defeating the confidentiality measures the holder has taken, which causes more serious harm to the holder's expected interests and disrupts social order. Public data, by comparison, is more readily obtained and used by others, and in some cases the holder can even be presumed to have implicitly consented to such acquisition and use.

3.2 Purpose of fair use: Pre-training

The training of large models is generally divided into two stages: pre-training and fine-tuning. Pre-training refers to "the initial training of model parameters using large-scale data unrelated to downstream tasks." Fine-tuning refers to additional training of a pre-trained model on specific tasks or data, typically including instruction tuning and alignment. The two stages have different objectives. Put simply, pre-training enables large models to learn a wide range of knowledge, endowing them with general abilities to understand and generate language, while fine-tuning enables them to learn intensively in specific domains so as to better complete specific tasks and align with human values (a simplified formalization of the two stages is sketched at the end of this subsection). This paper holds that the purpose of fair use of large model training data should be limited to pre-training rather than fine-tuning, for the following reasons:

First, market failure arises mainly in the acquisition and use of pre-training data. Because the purposes differ, the data used in pre-training and fine-tuning differ significantly. In scale, pre-training uses very large amounts of data, while fine-tuning uses relatively little; in type, pre-training data is not limited to any one domain, while fine-tuning data usually targets specific domains or tasks. The direct consequence is that pre-training data implicates far more numerous and diverse data property rightsholders, so its acquisition and use incur higher transaction costs and are more prone to market failure. Fine-tuning data, being small in scale and usually targeted at specific domains or tasks, involves a relatively limited number and range of rightsholders and generally does not give rise to particularly high transaction costs. Fair use of large model training data should therefore apply mainly to the pre-training stage, where market failure is more likely to occur.

Second, applying fair use to the acquisition and use of fine-tuning data is more likely to cause harm. From the perspective of the purpose of data use, use in the pre-training stage is more transformative than use in the fine-tuning stage. The concept of transformative use originated in copyright law and originally referred to using a work in a different manner or for a different purpose; transformative use of data, correspondingly, means that a subsequent user uses the data in a manner or for a purpose different from that of the data property rightsholder. Where the use is transformative, the subsequent user's manner or purpose differs markedly from the prior rightsholder's, so the impact on the rightsholder's market interests is smaller and harm is less likely. The purpose of pre-training is to endow large models with general language capabilities, which differs markedly from the purposes for which the vast majority of data property rightsholders use their data; it is highly transformative and generally does not directly affect rightsholders' market interests. The main purpose of fine-tuning, by contrast, is to endow large models with domain-specific knowledge or the ability to handle specific tasks so that they can be applied directly to particular services or products, and the data used is often closely related to those services or products. For instance, to build a model that provides legal consultation services, a developer may fine-tune a general large model on legal question-and-answer data, giving it richer legal knowledge and generation capabilities better suited to legal consultation. Such fine-tuning data often comes from data property rightsholders who provide the same or similar services or products, including legal database providers and legal Q&A websites. The services or products offered by the fine-tuned model are very likely to resemble those of the rightsholders closely and may even substantially substitute for them. The degree of transformation in the use of fine-tuning data is thus relatively low, and such use is likely to seriously affect rightsholders' market interests; it should not be broadly included within the scope of fair use.
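As a simplified formalization of the two stages discussed in this subsection (a stylized sketch; real training pipelines differ in detail, and alignment steps such as reinforcement learning from human feedback are not captured), pre-training maximizes next-token likelihood over a broad, task-agnostic corpus $D_{\mathrm{pre}}$, while instruction-style fine-tuning continues training on a much smaller set $D_{\mathrm{ft}}$ of task-specific prompt-response pairs $(c, y)$:

pre-training: $\max_\theta \sum_{x \in D_{\mathrm{pre}}} \sum_{t} \log p_\theta(x_t \mid x_{<t})$

fine-tuning: $\max_\theta \sum_{(c, y) \in D_{\mathrm{ft}}} \sum_{t} \log p_\theta(y_t \mid c, y_{<t})$

The contrast between the two sums mirrors the legal argument above: $D_{\mathrm{pre}}$ is vast and heterogeneous, implicating countless rightsholders, whereas $D_{\mathrm{ft}}$ is small, domain-specific, and typically traceable to identifiable sources.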

3.3 Methods of fair use: Data processing acts involved in training

Fair use of large model training data should be limited to the data processing acts involved in training. Data processing encompasses the collection, storage, use, processing, transmission, provision, and disclosure of data. Training large models mainly involves collection, storage, use, processing, and transmission. For instance, before training begins, developers need to collect the required data, process the collected raw data into machine-readable, formatted data, and store it on some medium; during training, the data must be transmitted to the training servers and used to train the model. These processing acts may fall within the scope of control of data property rights and interests and should be exempted from infringement liability through fair use, thereby enabling the lawful training of large models.
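As a purely illustrative aid, the sketch below maps the processing acts just listed onto a toy Python workflow; the sample text, source label, and file name are hypothetical, and real pre-training pipelines are of course vastly larger and distributed.

```python
# Illustrative sketch only: maps the data processing acts discussed above
# (collection, processing, storage, and transmission/use in training) onto a
# toy workflow. The sample text, source label, and file name are hypothetical.
import json

# 1. Collection: in practice gathered at scale by web crawlers; a literal
#    string stands in for one collected document here.
raw_document = "<p>  An example   public web page used   for illustration. </p>"

# 2. Processing: clean the raw data into a machine-readable, formatted record.
record = {
    "text": " ".join(raw_document.replace("<p>", "").replace("</p>", "").split()),
    "source": "hypothetical-public-site",
}

# 3. Storage: persist the formatted record on a storage medium.
with open("corpus.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# 4. Transmission and use: the stored corpus would then be shipped to training
#    servers and consumed by the training loop; reading it back stands in here.
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # ...a training loop would consume example["text"] at this point...
```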

Whether the acts of providing and disclosing data fall within the fair use of large model training data requires careful consideration. First, training large models generally does not involve providing or disclosing data. The scope of fair use of training data should be limited to the data processing acts required for training to proceed normally, and greater caution is warranted before bringing processing acts beyond this scope within fair use. Second, providing and disclosing data without permission may in some cases substantially harm the interests of data property rights holders. Relevant cases show that many data rightsholders seek mainly to prevent the provision and disclosure of their data rather than other uses, because in many situations providing and disclosing the data would constitute a substantial substitute for the data holder's products or services and seriously harm its interests. Similar situations may arise in the context of large models. If large model developers train models on public data from other online platforms and then use those models to provide users directly with the same or similar information content services, this may substantially substitute for the information content services of those platforms. Such conduct should clearly be excluded from the scope of fair use.

However, in some cases, allowing the provision and public disclosure of large model training data can enhance social welfare and generally will not substantially harm the interests of data property rights holders. For instance, when large model developers disclose part of their training data to satisfy legal requirements on the transparency of artificial intelligence, this enhances the transparency of the models and strengthens public supervision of and trust in large model technology. Likewise, some enterprises specialize in large model training data services, taking charge of data collection, cleaning, and annotation and providing the data to multiple large model developers. Allowing such enterprises to supply training data to developers can reduce the cost of duplicative collection and processing and improve social efficiency. Therefore, whether providing and disclosing large model training data constitutes fair use should not be answered with a one-size-fits-all rule; instead, a scenario-based determination should be made, comprehensively weighing factors such as the impact of the conduct on the interests of data property rights holders.

3.4 Opting out of fair use: Opt-out through technical measures

As mentioned above, the market failure concerning large model training data is not a complete market failure. Under certain circumstances, large model developers may enter into data licensing transactions with some data property rights holders. If overly broad fair use rules are constructed, developers' incentive to reach such transactions may be weakened, harming the potential interests of data rightsholders. One feasible solution is to define the applicable conditions of fair use through legislation and explicitly exclude these circumstances from it. However, legislators are constrained by information costs and cannot fully anticipate all possible scenarios. The parties to a transaction often have more information and can, based on the specific circumstances, make decisions that serve both their own interests and social efficiency. Therefore, granting data property rights holders the right to opt out of fair use can help rightsholders and large model developers reach transactions at lower transaction cost, thereby more effectively safeguarding the rightsholders' interests.

However, allowing an opt-out from fair use may also create new problems. Opponents of an opt-out argue that if the cost of opting out is too low, rightsholders may abuse the option and undermine the fair use system. Whether an opt-out mechanism for fair use is sound has long been debated in academia, the most typical example being the discussion of the legal effect of contract terms that exclude fair use. Most commentators hold that such contract terms should be denied legal effect, mainly because fair use is an important mechanism in copyright law for maintaining the balance of interests between copyright holders and the public. If copyright holders could exclude the application of fair use by contract, this would be no different from unilaterally redefining the content and boundaries of copyright protection, constituting "private intellectual property rights" and disrupting the balance of interests set by the legislature. Especially given the widespread use of click-through contracts, online platforms could easily upset this institutional balance by contract and intensify the erosion of public interests. In addition, the EU's introduction of an opt-out mechanism in its rules on the fair use of text and data mining has also been controversial. Under Article 4 of the Directive on Copyright in the Digital Single Market, text and data mining by commercial entities can constitute fair use, but copyright holders have the right to opt out. Some argue that the opt-out means established by this provision costs rightsholders extremely little, which not only frustrates the institutional goal of promoting text and data mining but may also generate significant negative social externalities.

This article maintains that, to prevent data property rights holders from abusing the opt-out mechanism, the threshold for opting out of fair use should be raised when the mechanism is introduced. As many have worried, if the opt-out threshold is too low, then even where transaction costs are high and the probability of actually obtaining a licensing fee is very small, rightsholders can circumvent the fair use limitation at almost no cost; they may therefore opt out speculatively in order to preserve the possibility of asserting data property claims against users. This would lead to more and more situations being excluded from the fair use of large model training data, defeating its institutional purpose. For instance, if the terms of a robots protocol or a service agreement sufficed to effect the opt-out, rightsholders could circumvent the fair use limitation at virtually no cost, and the vast majority of them would very likely modify their robots protocols or service agreements to retain the opportunity to assert rights against large model developers. As empirical research shows, the proportion of cases in which large model developers are restricted from crawling data through robots protocols or service agreements is continually increasing. A more reasonable solution is to allow data property rights holders to opt out of the fair use of large model training data through technical measures such as paywalls and software locks. Such measures usually require rightsholders to bear relatively high costs, so when deciding whether to opt out they will weigh the benefits of opting out against its costs. In general, only when the likelihood of reaching a deal with large model developers is relatively high and the expected gains are substantial will data property rights holders choose to opt out of the fair use of large model training data. In that case, the strategies adopted by rightsholders generally coincide with the socially most efficient decisions, better safeguarding their own interests while promoting overall social welfare.
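To illustrate why a robots protocol opt-out costs a rightsholder almost nothing, the following is a minimal sketch using Python's standard urllib.robotparser; the user-agent names and URL are hypothetical examples, not drawn from the article.

```python
# Minimal sketch (hypothetical names and URLs): a one-line robots.txt entry is
# enough to signal an opt-out to an AI-training crawler, which is why this
# route costs data rightsholders almost nothing.
from urllib import robotparser

# Hypothetical robots.txt published by a data rightsholder: it blocks an
# AI-training crawler site-wide while leaving other crawlers unaffected.
robots_txt_lines = [
    "User-agent: ExampleAIBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow:",
]

parser = robotparser.RobotFileParser()
parser.parse(robots_txt_lines)

# A compliant developer would check the protocol before collecting a page.
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/1"))      # False: opted out
print(parser.can_fetch("OrdinarySearchBot", "https://example.com/articles/1"))  # True: still allowed
```

By contrast, technical measures such as paywalls or software locks require real engineering and maintenance costs, which is precisely why the article treats them as a more credible signal of the rightsholder's intent to license.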

3.5 Improving fair use legislation

Under current law, an effect close to introducing fair use can be achieved by reasonably interpreting and applying the relevant provisions of the Anti-Unfair Competition Law. First, by interpreting the requirements of the general clause, courts can exclude some reasonable data acquisition and use from its scope of regulation. Protecting data property rights and interests under the general clause requires several conditions to be met, including the existence of a competitive relationship, the unfairness of the conduct, and actual damage to the data rightsholder. In the context of large model training, courts can interpret these elements so as to leave appropriate legal space for developers to acquire and use data. For instance, if a developer acquires and uses data mainly for scientific research, it can generally be found not to be in a competitive relationship with the data rightsholder. Even if the developer acquires and uses data for commercial purposes, if the services or products it provides and those provided by the data rightsholder belong to largely unrelated markets, the two can be found not to compete. Alternatively, as long as the training of large models causes no substantial damage to the data rightsholder, the developer's acquisition and use of data can be exempted from liability through interpretation of the damage element. Second, the newly added data provision of the Anti-Unfair Competition Law also leaves room for interpretation. For instance, the provision stipulates that business operators must not obtain or use data lawfully held by other business operators through "improper means"; in the future, courts can interpret "improper means" so as to exclude fair use situations from the scope of control of data property rights and interests.

However, this judicial solution cannot replace legislative improvement. First, the interpretation and application of the above provisions is highly flexible and cannot provide the certainty that legislation offers. The general clause itself gives no specific guidance for determining fair use; when interpreting and applying it, judges enjoy considerable discretion, which creates great uncertainty in such determinations. "Improper means" is an indeterminate concept whose scope cannot be clearly delimited in advance, so it can hardly give large model developers stable expectations of lawfulness or play the definite role that fair use legislation would. Second, as data property interests are increasingly formalized into rights, fair use, as a limitation on rights, should in principle be clearly stipulated by legislation. Data property rights, like copyright and personality rights, are private rights. Under the private law principles that private rights are inviolable and governed by the autonomy of will, limitations on private rights are exceptional; in principle they should be enumerated by legislation and should not be freely created by judicial interpretation. For this reason, both the fair use of copyright and the fair use of personality rights (including the right to personal information) are stipulated in China through legislative enumeration, and the fair use of data property rights and interests should be treated in the same way. It is therefore necessary to introduce fair use rules for large model training data at the legislative level.

At present, legislative activity in the fields of artificial intelligence and data is accelerating. In 2023 and 2024, the State Council successively included the draft Artificial Intelligence Law among the bills to be submitted to the Standing Committee of the National People's Congress for deliberation, and the Standing Committee has included "legislative projects on the healthy development of artificial intelligence and related matters" among the preliminary review items of its legislative work plans for 2024 and 2025. These legislative opportunities should be seized to give serious consideration to establishing a fair use system for large model training data, providing a clear legal basis for large model training and safeguarding the development of artificial intelligence technology and industry. Possible options include: first, establishing a dedicated provision on the fair use of large model training data in the Artificial Intelligence Law; second, when legislating on data property rights, establishing broader fair use provisions for data and explicitly listing the fair use of large model training data as one of the enumerated circumstances.

Conclusion

The strengthening of data property rights protection has created legal challenges for the acquisition and use of large model training data. The market failure analysis above shows that, in most cases, allowing developers to make fair use of data for large model training can enhance social welfare without harming the market interests of data property rights holders. Compared with alternatives such as collective management or statutory licensing, fair use is also the better choice. As long as the fair use rules for large model training data are appropriately designed so that their scope of application remains necessary and reasonable, an effective balance can be struck between technological development and rights protection. In the future, China should seriously consider introducing this system in its legislation on artificial intelligence and data, providing better legal guarantees for the development of the data element market and the artificial intelligence industry.


The original text was published in the "Thematic Discussion One: Multi-dimensional Perspectives on Digital Law Research" column of the 5th issue of "Jurist" in 2025, and is reprinted from the WeChat public account "Jurist Magazine".



Assistant Editor: Yang Ziyue

Editor: Zhao Zerui

Reviewed by Ji Weidong