ZHANG Jiyu, WANG Saifei | Norms on Fair Use in Training Large Models
2024-11-16


Norms on Fair Use in Training Large Models



*Author   Zhang Jiyu

Associate Professor, Law School, Renmin University of China

Researcher, Law and Technology Institute, Renmin University of China

*Author Wang Saifei  

PhD Student, Law School, Renmin University of China



Abstract: Copyright law, which originated in the field of literature and art, needs to respond positively to the demands of technological development in the era of artificial intelligence and to construct fair use rules compatible with social and technological progress. The use of works in training large AI models is a form of incidental copying within a technical process and has a strongly transformative purpose. The normal use of a trained large model is not to generate infringing content; such models have broad fields of application and are of positive significance to social development. However, large model training requires massive amounts of high-quality works, and those works need to be rich and diverse. Owing to factors such as high transaction costs, the stacking of license fees, limited and selective willingness to license, and public interest considerations, the market mechanism can hardly achieve a reasonable allocation of resources. It is therefore necessary to establish a machine learning fair use clause that clarifies the legality of using works to train large AI models while reasonably regulating the output side of AI. Doing so would better balance the interests of copyright holders, the public, and developers of large AI models; promote individual, enterprise, and social innovation; and encourage copyright holders and developers to establish innovative cooperation mechanisms, thereby advancing cultural prosperity and the realization of a better life under the principle of AI for good.


1. Presentation of the problem


In the past decade, driven by advances in algorithms, growth in computing power, and the use of large-scale data, artificial intelligence technology represented by machine learning has made significant progress. Large AI models in particular now demonstrate strong content generation abilities, and even some capacity for logical reasoning and mathematical operations. A large model can effectively learn the correlations between symbols from a large amount of data and store that knowledge implicitly in its parameters and data, which gives it a certain degree of generality. Many fields have therefore launched application research and development based on large AI models.

Large model training relies on massive amounts of data, and the training data often contains a large number of works protected by copyright law. How copyright law should evaluate the use of works in large model training has attracted great attention worldwide. The United States has seen an increasing number of lawsuits brought against artificial intelligence companies by authors, writers' associations, and other copyright holders. In China, copyright owners have likewise formally sued AI painting software companies for using their works to train AI painting models without permission. These lawsuits focus mainly on two questions: first, whether copying the training data into computers during the training of AI models infringes copyright; and second, whether AI-generated content infringes because it is substantially similar to prior works. The “training of AI” and the “output of AI” are two related processes that can be discussed separately. On the one hand, large models are not always used to generate “works”; they can be used for a variety of predictive and judgmental purposes, such as image recognition and speech recognition. Some large models are now also considered to show generality and can be used for a variety of different tasks. On the other hand, even in content generation scenarios, the copying and processing of data during training and the output of content after the application has been released to the market are two different acts. Even if the act on the output side is found to infringe copyright, it does not necessarily follow that the use of works in the machine learning process of large model training infringes copyright.

This paper focuses on whether the use of works in the training of large AI models infringes copyright, and in particular whether it can be recognized as fair use. The legislation and case law of some countries and regions had already addressed the legality of using works in computational analysis and text and data mining before the current round of large model development. Nevertheless, national legislation still differs, and its application to large model training remains uncertain. Some researchers believe that the use of works in large model training should not constitute fair use. For example, the American scholar Prof. Justin Hughes argues that Books3, a widely used training set for generative artificial intelligence, consists of a “shadow library” of nearly 200,000 pirated books, and that the use of works in model training is a kind of “quasi-expressive” use, which is an infringement. Differences in legislation, intense litigation, and conflicting views show that consensus on the legality of using works in large model training is still far off.

Compared with the past text data mining and machine learning in the era of “small models”, the fair use problem in the training of large models of artificial intelligence has certain characteristics, which is the reason for its prominent controversy. First of all, the “generative nature” and a certain degree of “generality” of large models make the analysis of fair use more complicated. Typical text data mining in the past had a single purpose and often did not result in content that competed with the copyright owner's work, which was easier to justify and had a limited impact on the copyright owner. The “generative” nature of big models makes many copyright holders believe that the market for their works has been seriously affected, but the “generalizability” of big models can bring social public benefits such as promoting scientific and technological innovation and industrial upgrading in various fields. Secondly, the training of big models generally requires large-scale and high-quality data, and only large enough models and training data can produce “emergent ability”. Data quality and abundance are even more important indicators of the fairness, accuracy, and robustness of AI models. This is a technical characteristic that should not be ignored when discussing this issue.

In this context, this paper takes large model training, the most controversial case in machine learning, as its example. It first discusses the function and analytical framework of the fair use system, then analyzes whether the use of works in the training of large AI models constitutes fair use, and finally puts forward suggestions for building the corresponding rules in China.


2. Functions of the fair use system and the basic analytical framework


2.1 Social response function of the fair use system

Copyright law has a clear legislative purpose of promoting social and cultural development, and fair use is an institutional tool indispensable for realizing that purpose. The fair use rules richly embody the promotion of a good life: ideas such as freedom of information, interconnection, sharing and the common good, and the protection of vulnerable groups are manifested in specific rules and cases, reflecting the pursuit of fair competition and the public interest.

Although the fair use rule is embodied as a restriction on copyright, this restriction is actively constructed based on the legislative purpose of the copyright law. “Fair use should not be thought of as a strange and occasionally tolerable departure from the grand concept of copyright monopoly. Rather, it is a necessary part of the overall design (of the copyright system).” Its core purpose is to safeguard breathing space within copyright, i.e., free space that allows for the allocation of a portion of the various ways in which a work can be used to the general public, provided that the incentive function of copyright can be realized.

Contemporary society is developing rapidly in terms of innovation, and new ways of using works are appearing all the time. The fair use rule is a concrete embodiment of the “responsive law” required by a rapidly evolving society. In contrast to “repressive law” and “autonomous law”, responsive law is not a passive response to society, but an active response to society, whereby law-enforcement agencies need to interpret and apply the law flexibly in accordance with the trend of social change. In particular, it will be difficult to adapt to the development of digital technology if the copyright law is rigidly applied. When the copyright law was first established, it was mainly oriented to the field of literature and art, and the scope of rights and other rules were mainly set up for literary and artistic works. It is necessary to respond positively to the urgent and legitimate needs of the digital era for the innovative development of intelligent science and technology through the rules of fair use and other rules.


2.2 Responses to fair use rules in the digital environment

The discussion of the scope of the right of reproduction and fair use arising from technological developments is not new. Programs automatically generate temporary copies from the hard disk to memory space when they run on a computer, users generate temporary copies in memory when they browse for information on the Internet, computer systems of network intermediary service providers generate automatic copies during the transmission of users' information, Internet browsers create a cache on the hard disk of a computer in order to improve the efficiency of web browsing, and so on. In order to realize the new purpose of these technologies, there is the act of copying works, and the copyright system enables people to better reconcile the tension between technological development and copyright protection through the interpretation of the “right of reproduction” or the rule of fair use.

In the new round of technological revolution, copyright law has been adjusting in response all over the world. Before generative artificial intelligence based on large models made significant progress, text and data mining was a key concern of copyright legislation. Text and data mining is any automated analytical technique designed to analyze text and data in digital form in order to generate information, including, but not limited to, patterns, trends, and correlations. Japan has long been concerned with constructing fair use rules for data analysis scenarios. The Copyright Law of Japan provides, in Article 30-4, an exception for uses whose purpose is not to enjoy the thoughts and feelings expressed in the work, and lists scenarios such as data analysis and computer data processing; in Article 47-4, an exception for incidental use on computers; and in Article 47-5, an exception for minor use in information processing. This leaves ample room for fair use in the development of the information industry, while making clear that such use is conditional on not unreasonably prejudicing the interests of copyright holders, thus safeguarding their legitimate rights.

The EU also responded early to the issue of copyright in the information technology environment. Article 5 of Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society provides that temporary reproduction does not infringe the right of reproduction where it is transient or incidental, is an integral and essential part of a technological process, has the sole purpose of transmitting the work over a network or of making a lawful use of the work, and has no independent economic significance. With the further development of digital technology, the European Union developed a “Digital Single Market Strategy” and in 2019 adopted the “Directive on Copyright in the Digital Single Market”. Article 3 of that Directive regulates text and data mining by research organizations and cultural heritage institutions for the purpose of scientific research. Article 4 provides that text and data mining in general constitutes fair use, but adds the crucial prerequisite that the right holder has not expressly reserved the use of the work or other content in an appropriate manner, and provides that copies “shall be reserved only for a limited period of time” and that users “shall not modify the content”. The provision that allows right holders to “opt out” is quite controversial. Some researchers have commented that it is “conceptually wrong, theoretically flawed, and normatively unambitious”. However, the EU Artificial Intelligence Act still explicitly states that providers of general-purpose AI models need to respect EU copyright law, and in particular need to use state-of-the-art technologies to identify and respect reservations of rights expressly made under the Directive on Copyright in the Digital Single Market.

Unlike the legislative model in Japan and the EU, the U.S. is more flexible in making judgments mainly through the four-factor analysis of fair use in the judicial context. The U.S. jurisprudence has clarified many specific situations that constitute fair use, such as the copying and storage of website content by search engines, the provision of thumbnails to users by search engines, the copying of large amounts of reference text in plagiarism detection services, and the digitization of large quantities of books for analysis and retrieval, to name just a few. These responses shed light on our continuing exploration of fair use rules in digital spaces in the age of artificial intelligence.

However, as described in the first part of this paper, the specific case of big model training has some of its own characteristics compared to these previous scenarios. In the face of the rapid development of big models of artificial intelligence, copyright law needs to carefully analyze and weigh interests to clarify whether the use of works in the training phase can constitute fair use.


2.3 The analytical framework for fair use in China's copyright legislation

China's copyright law provides, mainly through explicit enumeration, for a number of specific circumstances in which a work can be used free of charge without a license. To simplify the expression, this article uses “fair use” to refer to this system of copyright exceptions. China's fair use rules are based mainly on the framework of the “three-step test” originally established in the Berne Convention for the Protection of Literary and Artistic Works. When China's Copyright Law was amended in 2020, the requirements of the “three-step test” were explicitly introduced into its provisions: the use of a work “shall not interfere with the normal use of the work and shall not unreasonably prejudice the lawful rights and interests of the copyright owner”. This requirement regulates and guides the exceptions under which a work can be used free of charge without a license. In the same amendment, with the functional objective of responding to the new needs of a rapidly developing society, China also added a flexible provision to the fair use clause, namely “other cases stipulated by laws and administrative regulations”, which provides an institutional interface for introducing new types of fair use through legislation in the future. In legal practice, China needs to make full use of this institutional improvement and actively study and respond to the question of whether new specific uses of works can constitute fair use, so as to give fuller play to the institutional function of fair use.

Economic analysis is an important paradigm for the study of fair use rules. For example, American scholars such as Professor Wendy Gordon hold that the use of a work should constitute fair use when it meets three conditions: first, market failure is a reality; second, it is in the interest of society to allow users to use the work in this way; and third, granting fair use will not materially impair the incentives of the copyright owner. Professor Xiong Qi has also argued in detail that the judgment of fair use should return to the path of economic analysis.

In addition, researchers have proposed a series of concepts such as “transformative use”, “technical use”, “non-expressive use”, and “non-appreciative use”, which serve as important clues for discussing fair use and point to a stronger possibility of constituting it. To some extent, these concepts share a common feature: the uses under discussion are not carried out in the manner envisaged when copyright law was enacted. They are therefore more likely to serve a stronger public interest, and recognizing them as fair use tends not to detract from the necessary incentives for authors.

China's fair use rules are established under the framework of the “three-step test”, but the meaning and scope of the concepts of “normal use” and “legitimate interests” are not very clear. In particular, when new ways of using works emerge and new interests may be derived from them, two different views often arise: one applies the “principle of extension of interests” and the other the “principle of proportionality of interests”. The former usually holds that where the modes of use of a work expand, the interests of the copyright owner should generally expand as well. The latter argues that copyright law allocates economic benefits to authors in order to provide the necessary incentives for creation, and that copyright law needs to strike “the most efficient and productive balance between protection (incentives) and the dissemination of information in order to promote learning, culture and development”.

The “principle of extension of interests” is generally more favored in author's rights systems and has had a strong influence. According to Dr. Mihaly Ficsor, who was Assistant Director General of WIPO, “All forms of exploitation of works which are or may be of great economic or practical importance should be reserved for authors, and any exception limiting the author's interest in respect of such forms of exploitation is impermissible.” In the 2000 dispute involving the U.S. copyright law, the WTO Dispute Settlement Panel also gave a broader interpretation of “normal exploitation of works”. However, from the perspective of the theoretical development of the copyright system, and especially the comprehensive legislative purpose of China's copyright law, we should adhere to the “principle of proportionality of interests” and pay attention to the social response function of the fair use system. Copyright should not be understood as a right that absolutely protects the full value of a work. Efforts to make intellectual property rights holders internalize all the benefits of their creations will inevitably upset the right balance. In the field of intellectual property, the theory that authors or inventors should receive the full value associated with their creations is false. In fact, in no economic field should social benefits be fully internalized to specific right holders. Professor Lin Xiuqin has criticized the “three-step test” for its technical flaws in legislation and for an overly restrictive interpretation that squeezes and erodes the appropriate space for fair use, providing neither concrete operational rules nor a purpose or value goal for fair use. China is not a country with an author's rights system, and scholars such as Professor Cui Guobin have also specifically addressed the utilitarian purpose of our legislation.
An overly strict interpretation of the “three-step test” will hinder the realization of the legislative purpose of the copyright law. We should avoid applying the “principle of extension of interests” and clearly adhere to the “principle of proportionality of interests”, so as to find the balance of interests that best promotes cultural prosperity and social welfare, and to shape a reasonable boundary for copyright.

To summarize, under the framework of the three-step test, China's consideration of whether to establish new cases of fair use through laws or administrative regulations can be divided into four main points. (1) Specify the particular act of use, and consider whether such use is in line with the public interest and has clear social value. (2) Consider whether there are obvious obstacles to such use of works, in particular whether there is a market failure. Copyright protection imposes certain costs on the use of a work, but in many cases it does not significantly impede that use. However, if there are obvious barriers under the copyright market mechanism, such as high transaction costs, this will usually be an important reason for a finding of fair use. It may also be an important argument that the use does not unreasonably affect the legitimate interests of the copyright owner, since a market is difficult to form in the first place. (3) Test whether the use affects the normal use of the work. An overly broad interpretation of the scope of normal use will seriously impair the function of the fair use system. In determining whether the normal use of a work is affected, the much-discussed concepts of transformative use, technical use, and non-expressive use are important clues. Generally speaking, these types of use differ from the uses of works foreseen in copyright law, often bring new value distinct from the traditional use of works, and should be recognized as not affecting the normal use of works. (4) Test whether the use unreasonably impairs the legitimate rights and interests of the copyright owner. Under the principle of proportionality of interests, this test focuses on the impact of the use on the market for the work, in order to determine whether the incentive to create is impaired.
Of the four points, the first and second ensure that fair use is established only for socially valuable and necessary circumstances, while the third and fourth correspond to the last two requirements of the three-step test and ensure that the necessary incentives for copyright owners are not jeopardized. Since the first and third points are relatively simple to determine, they can usually be examined first. Of course, whether market failure exists and how the use affects the interests of copyright owners may change as practice develops. Legislation needs to be forward-looking and precautionary, but it should not be speculative; the analysis should rest mainly on the current situation and on developments that can be foreseen with reasonable clarity. The requirement in the Copyright Law that the use shall neither affect the normal use of the work nor unreasonably prejudice the legitimate rights and interests of the copyright owner will always remain a necessary condition for a finding of fair use, providing a safeguard should circumstances change significantly in the future.


3. Justification Analysis of Fair Use in Big Model Training


Based on the previous discussion, this part discusses the justification that the use of works in the process of AI big model training should constitute fair use.


3.1 The Value of Big Models and the Need for Using Works in Training

3.1.1 Characteristics of Artificial Intelligence Big Models and Social Value of Big Model Development

Currently, large AI models have three characteristics: scale, with parameter counts that need to reach the tens of billions; emergence, meaning that unexpected new capabilities can arise; and generality, meaning that they are not limited to specialized problems or domains. The first characteristic lays the foundation for the capabilities of large models, and the latter two give them broad application potential. Many specific downstream applications can be built on top of large models. The development of large AI models will bring more means of and support for human innovation, and will generate a great deal of important value that cannot be internalized within the copyright system. It is an important strategic technology for enhancing national competitiveness and national security.

First of all, the development of big models provides new modes and space for human creation. Innovation is an important demand of human beings, and innovative activities are constantly revolutionized with the development of society. The progress of science and technology has lowered the threshold of creation and expanded the forms and fields of creation. The development of large models of artificial intelligence provides new tools for human creation. On the one hand, this tool can better provide assistance and convenience for the creation and expression of the general public, so that the creative needs of ordinary people can be more satisfied. On the other hand, big models are also expanding the mode and space of literary and artistic creation. Some professional artists have begun to explore how to utilize AI tools in artistic creation, believing that “new technology allows people to stand in a higher place and see more diverse things” and “AI shows more possibilities for creators”.

Creation is also socially and culturally interconnected. “Every writer, composer, and filmmaker draws on the work of their predecessors when creating new works, and most are inspired by what their contemporaries are creating.” New works usually contain some of the ideas of past works; people learn from pre-existing works and thereby create more, and admirers of pre-existing works become new creators. Promoting the learning and use of existing works by AI can connect past and future creations in a new way and link AI developers, author groups, the public, and other parties. It helps more individuals in society move from being consumers of works to being creators, and it expands the forms and boundaries of creation. This is in keeping with the purpose of copyright law: to promote cultural prosperity and the realization of a better life.

Secondly, big models have the value of promoting functional innovation and social development outside the field of literature and art. Xi Jinping, General Secretary of the Communist Party of China Central Committee, pointed out, “Accelerating the development of a new generation of artificial intelligence is an important strategic grip for us to win the initiative of global scientific and technological competition, and an important strategic resource for promoting the leapfrog development of China's science and technology, the optimization and upgrading of industries, and the overall leap in productivity.” Artificial Intelligence Big Model is a new technical tool for analyzing, understanding, and generating combinations of symbols, and its function in social applications far exceeds the scope of the field of literature and art. The big model provides a new development space for human-computer interaction, which makes people use computers to accomplish tasks with a greatly reduced degree of complexity. Big models will become the key underlying architecture in the new generation of information technology. Deloitte's research report has sorted out 60 important applications of generative AI based on big models in industries such as consumer, financial services, government and public services, life sciences and healthcare, industry, and telecommunications. It is precisely because of this wide range of application possibilities that there is a high expectation for big models to move from “emergence of intelligence” to “emergence of value”. Of course, from the development of technical capabilities to the realization and popularization of a large number of practical products, there is still a lot of work to be done.

3.1.2 The Need for Training Using Massive Works in the Development of Artificial Intelligence Big Models

The technical characteristics of large AI models make their development highly dependent on the availability of massive, high-quality training data. First of all, under the current technical route, a large model can achieve good results only by relying on massive training data. In recent years, the development of artificial intelligence has exhibited a “scaling law”: the size of the model greatly affects its capability, and once the model size reaches a certain level, capabilities may appear that cannot be observed in small models. This is the aforementioned phenomenon of “emergence”, or “emergence of intelligence”. Unlike earlier, smaller models, large models have a very large number of parameters and require larger and more extensive training data. As a result, copyright regimes are likely to have a significant impact on the development of AI technologies, and they are of particular concern to micro, small and medium-sized enterprises wishing to enter this market.

Second, large model development is highly dependent not only on the quantity of training data, but also on its quality and richness. The quality and richness of training data matter for the output quality of generative AI based on large models, for avoiding discrimination and bias, and for safeguarding the richness of content and culture. The UNESCO Recommendation on the Ethics of Artificial Intelligence specifically emphasizes the ethical imperative of ensuring diversity and inclusiveness, stating that Member States should strive to make “AI systems that respect multilingualism and cultural diversity” accessible to all, and suggesting that “the development of AI technology requires a corresponding increase in data, media and information literacy, as well as access to independent, pluralistic and credible data sources”. China's Interim Measures for the Administration of Generative Artificial Intelligence Services also stipulates that providers of generative AI services should take effective measures to improve the quality of training data and enhance its diversity.

Therefore, the dataset used to conduct large model training needs to contain as many high-quality works as possible, and as rich and diverse types of works and sources of works as possible, in order to better meet the needs of technical performance and the requirements of safeguarding social and ethical values.


3.2 Incidental copying and transformative use in large model training

Although the use of a large number of works in big model training is required, the immediate purpose of the use is to produce models that reflect the symbolic laws in the training collection of works, not to provide copies of the works. The use of works in training does not provide copies of the works in the marketplace and should not be considered to interfere with the normal use of the works.

3.2.1 Incidental copying

The use of “copies” of a work in the training of a large artificial intelligence model is a form of “incidental copying”, which can also be called “intermediate copying”. Intermediate copying is part of the technical process of obtaining an AI model. The expression of the work is not stored directly in the trained model, let alone copied or made available as copies for third parties to use.

Machine learning or model training is the process of learning a model from data and “aims to design methods and algorithms by ‘learning’ from data.” The training process usually involves copying large amounts of training data onto the servers used for preprocessing or training, and performing a series of preprocessing steps, such as quality filtering, deduplication, privacy removal, and segmentation, as necessary for the machine to learn. These training data may contain a large number of works, but the specific expression of these works is not directly replicated in the model. In this way of using data, the purpose of reproduction lies neither in appreciating the artistic value of the works nor in compiling the works in order to present them to users in their original form in the future, but in learning and extracting the laws and features behind a large number of works.
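The preprocessing steps named above can be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions and does not describe any particular developer's pipeline: the function names, the five-word quality threshold, the e-mail pattern used to stand in for privacy removal, and the simple word-level segmentation are all hypothetical choices made for the example.

```python
import hashlib
import re

# Hypothetical pattern standing in for "privacy removal": masks e-mail addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def quality_ok(text, min_words=5):
    """Quality filtering (assumed rule): keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def deduplicate(docs):
    """Deduplication: drop documents identical after lowercasing and whitespace collapsing."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def remove_private_data(text):
    """Privacy removal: replace e-mail addresses with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def segment(text):
    """Segmentation: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def preprocess(docs):
    """Run the four steps in order and return one token list per retained document."""
    kept = [d for d in docs if quality_ok(d)]
    kept = deduplicate(kept)
    kept = [remove_private_data(d) for d in kept]
    return [segment(d) for d in kept]
```

In this sketch the copies of the documents exist only as intermediate inputs to the pipeline; what leaves the pipeline is a sequence of tokens for the learning algorithm, which is the sense in which the copying is “intermediate”.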

As mentioned earlier, the special nature of this type of incidental copying in the digital environment has long been recognized, with Japan, the EU and others establishing certain special rules. The EU has also specifically explained in the preamble to the 2019 Directive on Copyright in the Digital Single Market that the exception for temporary reproduction set out in Article 5 of the 2001 Copyright Directive will still apply to text and data mining, as long as it does not involve reproduction beyond the scope of the exception. In terms of value judgment, the evaluation of incidental copying should reject the principle of extension of benefits and distinguish such copying from the general copying of works.

3.2.2 Transformative use

The use of works in large model training is obviously different from the original use of works. Its purpose is to create a training environment in which large models can “learn” important laws from the works, and even develop emergent “intelligence”, such as reasoning, in order to better accomplish a variety of tasks. Taking the large language model as an example, Prof. Geoffrey Hinton, a pioneer in deep learning and winner of the Turing Award, pointed out that the meaning of a symbol exists in its association with other symbols. What the large models learn “from these millions of features and the billions of interactions between the features they learn is understanding.” This “understanding” of the basic words that form the basis of people's expression and creativity has always been outside the realm of copyright law protection. This “transformative” purpose of the use of works in large model training is an important basis on which it may constitute fair use.

In addition, even if we take a step back and consider not only the direct purpose of use at the model training stage but also the output of the large model in generative AI applications, the analysis in the previous section has fully demonstrated that, except in cases of illegal use, generative AI does not aim at generating copies of existing works. Instead, it has a wide range of technical application scenarios, from conversational shopping to mine risk identification and disposal. This also illustrates that the use of works in large model training has a strongly transformative purpose in terms of the overall application or service.


3.3 Market failure in licensing works for large model training

Market failure is one of the most important reasons to recognize “fair use”. High transaction costs or some market failures may prevent the realization of licensing consent. According to Prof. Gordon, in the context of fair use, market failure means that the market cannot be trusted to act as a good allocator of social resources. Such failures include both technical failures, such as failures due to transaction costs, strategic behavior, income and endowment effects, etc., and potentially more serious problems, such as the inappropriate use of market transactions in particular scenarios, i.e., scenarios in which the market is not as good at incentivizing creativity and diffusion as some other models. Market failure is a reality in scenarios where large models of AI are trained.

First, reasonable pricing is difficult and transaction costs are too high. Large model training requires a huge number of works, but no clear and feasible market mechanism exists yet. Requiring AI developers to find or connect with a large number of dispersed rights holders, negotiate licenses, and pay licensing fees entails significant transaction costs. While collective management organizations can play a role, and internet platforms could theoretically develop some matching mechanisms, a number of very salient difficulties remain. On the one hand, mechanisms such as collective management also have significant costs. Professor Pamela Samuelson argues that the types of works covered by large artificial intelligence models are particularly broad, that it is not really feasible to create an effective collective licensing mechanism for each type of work, and that even if such a mechanism were established, the cost of implementation would be very high: a large portion of the fees received from AI companies would go to the collective management organizations, and the fees received by copyright owners would be too limited to provide them with meaningful financial support. On the other hand, it is still difficult for many developers of large AI models to predict at this point what their future profit model will be and exactly what kind of benefits they will be able to generate. There is no mature model for calculating the value of different types of works for model training, and different rights holders perceive the value of their own works differently. The endowment effect also tends to make rights holders' assessment of the license fee they should receive significantly higher than the amount that AI companies think they should pay, which makes it even more difficult to reach a deal. Such difficult negotiations can impose huge social costs.

Second, there is the problem of license fee stacking. The amount of data required for large model training is extremely large. Unlike traditional fields, where billing can be based on the use of a single work, large model training uses every work in the dataset, which exacerbates the problem of royalty stacking: the accumulation of many royalties makes their sum too high for commercial activity to bear. License fee stacking arises not only from the fragmentation of rights, but also from the growing variety of rights holders, including neighboring rights holders and holders of rights arising from technical measures. In a given dataset, there may be not only copyright in each work, but also neighboring rights in performances, audio and video recordings, and so on.

Because the stacked license fees may be enormous, they are clearly unaffordable for small and medium-sized enterprises or startups, even if leading large enterprises can afford them. Large enterprises already have an inherent advantage in training large models thanks to their existing data and computing power. If high license fees must be paid to train large models legally, the development of small and medium-sized enterprises, which are often more innovative, will obviously be disadvantaged further, which will harm market competition in the AI industry and hinder innovation in large models. This is an important reason why this paper argues that statutory licenses are not the appropriate system here either. Prof. Mark A. Lemley points out that overcompensating copyright holders would be detrimental to the market, distorting it away from the norms of competition and creating dynamic inefficiencies by interfering with the ability of other creators to do their work.

Third, there is the issue of the limited and selective nature of the willingness to license. Even leaving aside the transaction costs and license fee stacking discussed above, rights holders have a limited willingness to license, and that willingness varies among different rights holders. Enterprises and copyright holders alike want to lock up the information they produce or hold, yet need others' information for their own future information production. Currently, data silos and data blocking are widespread. Rights holders are always wary of licensing their data rights, especially in the network economy and the attention economy, where market participants often worry that a practitioner seemingly unrelated to their business may also divert their traffic and become a competitor. The characteristics of such economies limit the willingness to license, making it difficult for the market to be an effective means of allocating resources in such situations.

The selectivity and variability of the willingness to license among different copyright holders, in turn, affects the quality of the trained large models. On the one hand, as Prof. Thomas Margoni, among others, states, “Unable to compete with dominant AI players, smaller companies or new market entrants may find it economically attractive to train algorithms on ‘cheaper’ data, which usually means older, more inaccurate or biased data, leading companies that cannot afford the cost of Tier 1 AI to develop ‘sub-par’ AI applications, thereby contributing to algorithmic discrimination and inequality.” On the other hand, copyright holders who refuse to license often regard their works as having a higher value, and the absence of such works, which are more likely to be of high quality, also limits the accuracy and richness of large models, which may create inherent bias and discrimination. Again, such issues are difficult for the market to address reasonably. Being able to use a wider range of training data will make AI systems better, safer, and fairer.

Fourth, there are public interest considerations in the development of large artificial intelligence models. The benefits that can be generated by using works to train large models extend far beyond the field of copyright and are difficult for the market to allocate rationally. In fields such as medical care, autonomous driving, and mining operations, the capabilities of large models bear on basic rights and interests such as human life and health. The development of large artificial intelligence models also bears on national competitiveness and long-term national security. Open-source large models are a further example: they continue to drive innovation and application in the field of AI, yet they bring few direct benefits to their developers and have a strong positive externality in promoting the public interest. According to Stanford University's Artificial Intelligence Index Report 2024, a total of 149 foundation models were released in 2023, 65.7% of which were open source. If the development of open-source models is compromised by barriers to the use of works for training, the subsequent development of AI will suffer. While the need for fair use has decreased in some areas as transaction costs have fallen, the need for fair use in meeting certain public needs is as great as ever.


3.4 The question of whether it unreasonably prejudices the legitimate interests of copyright owners

Whether the use of a work “unreasonably impairs the legitimate rights and interests of the copyright owner” is another prerequisite for determining whether such use can be categorized as fair use. However, the scope of “legitimate interests” has been the subject of much debate. The difficulty in determining the impact of the development of digital technology on the legitimate rights and interests of copyright holders lies in the fact that the development of some digital technology may create new interests, which are often different from those based on the dissemination and appreciation of works.

This paper argues that the arbitrary application of the “extension of benefits principle” should be rejected. High licensing fees will constitute a significant barrier to market entry, and limited and selective licensing will affect the quality, richness and fairness of large artificial intelligence models. Copyright, which originally only adjusted interests in the fields of literature, art and science, would thereby come to have a significant impact on technological development and market competition, and this needs to be guarded against and avoided. Judging from current developments, this paper argues that the use of works in the training of large artificial intelligence models will not unreasonably prejudice the legitimate rights and interests of copyright holders. The main reasons include the following:

First, as mentioned earlier, what is obtained after training is a large model that stores statistical laws drawn from a huge number of works, not a collection of works. The large model itself is a product in the field of technology and does not belong to the market in the fields of literature, art and science where the works themselves are located. Second, if large model training and the subsequent generation of content are considered together, it should also be seen that the normal use of a large AI model is not to reproduce or plagiarize existing works. While large AI models may generate content that is substantially similar to existing works, in normal use they are more likely to generate content that is not substantially similar to existing works, and they can be used in a wide range of fields, with very rich functionality beyond generating content that resembles works. The ability to develop large models and fair competition in the marketplace should not be compromised by restricting the use of works in training. As for possible infringement on the use side, the conduct of providers and users of AI systems or services should be regulated on that side, including through the reasonable determination of liability for damages, the reasonable allocation of the duty of care, and the requirement to take certain measures to avoid generating obviously infringing content, so as to avoid unreasonable prejudice to the rights and interests of copyright owners.

Some argue that content generated by AI, even if it does not constitute a substantial similarity to a work, infringes on the market interest of the work, and therefore the use of the work should be restricted. This view lacks theoretical basis and realistic evidence. On the one hand, promoting the creation and dissemination of new works is the core purpose of copyright law. If the generated content brings about competition in the market, it is also necessary to distinguish whether it is due to the availability in the market of content that constitutes a substantial similarity to the established work, or due to the availability of new content in the market. While the former usually falls under the category of behavior that affects the normal use of the work and unreasonably affects the legitimate interests of the copyright owner, and should be regulated by behavior on the output side, the latter has traditionally been something that copyright law has wanted to encourage. For example, in Sega v. Accolade, the plaintiff argued that the defendant's intermediate copying behavior in reverse engineering was the first step in the development of a competing product, resulting in a market impact on its own product. In analyzing the impact of the new work on the market for the original copyrighted work, the judge noted that “generating a growth of creative expression based on the dissemination of other creative works and unprotected ideas in those works is precisely the goal that copyright law is designed to promote.”

On the other hand, some researchers have envisioned that if an AI can generate works that are stylistically similar to those of a particular author, it could affect the author's artistic and personal life, even if the AI does not produce a pirated “copy” of the original work. First of all, the extent to which this scenario will become reality remains highly uncertain. Natural persons can also easily imitate an author's style, but copyright law does not prohibit it. People usually evaluate the creator of a particular style and its followers very differently. As a result, although styles are not protected, the reputation of the pioneer of a new style and the price of his or her work tend to be naturally enhanced. For higher-quality human creations, there is no evidence that their value can be replaced by AI, which instead provides a more open creative space for the author community. Photographic technology, for example, affected traditional artistic activities, but it also opened up new artistic spaces and drove new developments in Western painting, where realism was once one of the highest ideals. According to one artist, “Artificial Intelligence will be one of the biggest art movements of the century.” The New York Times points to a shrinking market for journalistic works, but this is arguably a consequence of the growth of the Internet as a whole. It is all the more unlikely that people who really care about journalism, especially those who understand the limitations of large model technology, will abandon professional news outlets and rely on artificial intelligence instead. Artificial intelligence may be an enabler of some creative endeavors and may raise the threshold for valuable human creativity, but at this point it is still not a substitute for many human creations, especially those that embody human insight, deep thought, subtle emotion, and innovative style. “Intellectual property law is justified only by ensuring that creators can charge a sufficient amount to ensure a profit sufficient to recover fixed and marginal expenses.” In this way, the goal of copyright law to incentivize creativity can still be achieved. It is therefore inappropriate to speculate too much about possible harms at the early stages of the development of emerging technologies. To take a step back, even if such concerns become reality in the future and there is indeed an unreasonable impact on the author community and human creativity, there are many interest-balancing mechanisms that deserve to be considered in a holistic manner, such as special taxes levied on such use at the output end, rather than restrictions on the use of works at the training end for large AI models with broad fields of application. Generating works for enjoyment is only one part of the application landscape of large AI models, and this always needs to be kept in mind when making legal judgments.

It is important to note that even if the use of works in model training is recognized as fair use, copyright holders can still engage in various forms of cooperation with large model developers and deployers. For example, copyright holders can provide ultra-high-definition files of works that are not publicly available and receive a return. Copyright holders, collective management organizations and other parties can develop data products specifically for training and provide them to AI developers at a reasonable price. Developers, for their part, have an incentive to establish a more harmonious cooperation mechanism with copyright holders while reducing their own costs of data collection and cleaning, so as to achieve a win-win situation.


4.  Construction of machine learning fair use and output-side governance rules


4.1 Construction of Machine Learning Fair Use Rules

Article 24(1)(13) of China's Copyright Law stipulates that fair use can include other circumstances provided for by laws and administrative regulations. This provides an institutional interface for expanding the rules of fair use, which can be done by amending the Regulations for the Implementation of the Copyright Law or by establishing specific rules in AI-related legislation.

When establishing the machine learning exception, China should not limit it to the scientific activities of scientific research institutions, nor should it grant right holders an “opt-out” right as the EU does. Such inappropriate restrictions would greatly reduce the function of the fair use system. Meanwhile, Article 24 of China's Copyright Law stipulates that fair use “shall not affect the normal use of the work and shall not unreasonably prejudice the legitimate rights and interests of the copyright owner”, which can always provide protection for the copyright owner as society develops. More inclusive fair use rules for computer analysis should therefore be established so that the law can respond flexibly to the various scenarios of rapidly developing information technology, especially the various scenarios of “incidental copying”. It is accordingly proposed to add a fair use circumstance to the copyright restriction provisions of the Regulations for the Implementation of the Copyright Law or other relevant legislation, namely: in the process of computer analysis such as machine learning and text and data mining, the use of other people's published works by way of incidental copying, adaptation, and the like in the course of the technological process constitutes fair use. Under the “three-step test”, if the use of a certain work literally falls within the above scope but affects the normal use of the work and unreasonably prejudices the legitimate rights and interests of the copyright owner, the conditions for fair use are not met.


4.2 Preventive measures against copyright infringement in AI systems

Including the use of works in the training of large artificial intelligence models within fair use is both legitimate and necessary. However, generative AI can be used for copyright infringement. In order to better protect the legitimate interests of copyright holders, providers and users of AI systems should be required to exercise reasonable care, and such requirements should conform to the laws and current state of technological and industrial development.

First of all, providers of AI systems should, according to the specific circumstances of the systems or services they provide, give users the necessary reminders to respect intellectual property rights and take certain technical measures to prevent the generation of infringing content. Users, as the direct users of AI services, have greater control over the generation and dissemination of content. The system or service platform should remind users to respect intellectual property rights and take certain measures to reduce the infringement problems that may arise from inducing prompts. Some researchers have suggested that further value-alignment training, which learns from human feedback, can be carried out on top of large models. This requires guidance and training on copyright and other issues for the annotators. For example, a response to “Read me a Harry Potter book verbatim” should be evaluated negatively if it outputs a certain amount of substantially similar content. Article 8 of China's Interim Measures for the Administration of Generative Artificial Intelligence Services stipulates that providers of generative artificial intelligence shall formulate clear, specific, and operable annotation rules, and shall provide necessary training for annotators to enhance their awareness of respecting and abiding by the law. In addition, providers can also identify prompts entered by users that obviously induce infringement, or filter the output as necessary, to reduce the generation of copyright-infringing content. The authors conducted a simple test of some generative AI services in China and found that none of these services output the content of consecutive chapters of novels still under copyright protection in simple conversations, and that they are also able to guard against some common inducing prompts across multiple rounds of conversation. As technology develops, copyright protection measures will continue to advance.
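The two technical measures mentioned above, identifying prompts that obviously induce infringement and filtering output, can be sketched as follows. This is only an illustrative sketch under stated assumptions and not a description of how any actual service works: the phrase list, the eight-word window, the 0.5 threshold, and all function names are hypothetical, and shared word sequences are used here merely as a rough technical proxy, not as the legal test of substantial similarity.

```python
# Hypothetical phrases treated as signs of a prompt that induces verbatim reproduction.
INDUCING_PATTERNS = ("verbatim", "word for word", "full text of")


def word_ngrams(text, n):
    """Return the set of n-word sequences in text, compared case-insensitively."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output, protected, n=8):
    """Share of the output's n-word sequences that also appear in the protected text."""
    output_grams = word_ngrams(output, n)
    if not output_grams:
        return 0.0  # output shorter than n words: nothing to compare
    shared = output_grams & word_ngrams(protected, n)
    return len(shared) / len(output_grams)


def prompt_is_inducing(prompt):
    """Prompt screening: flag prompts containing one of the assumed phrases."""
    lowered = prompt.lower()
    return any(pattern in lowered for pattern in INDUCING_PATTERNS)


def should_block(prompt, output, protected_texts, threshold=0.5, n=8):
    """Block when the prompt is flagged or the output overlaps heavily with a protected text."""
    if prompt_is_inducing(prompt):
        return True
    return any(overlap_ratio(output, text, n) >= threshold for text in protected_texts)
```

A real system would need far more careful design, since a check of this kind can both miss paraphrased reproduction and wrongly block lawful quotation; the sketch only shows that such measures operate on the output side and do not depend on restricting what the model was trained on.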

Secondly, on the basis of fault-based liability, the law should set up a “safe harbor” clause for providers of large artificial intelligence models and of generative artificial intelligence systems or services, so as to clarify the boundaries of the responsibilities of the relevant parties. Large models and generative AI are still at an early stage of development, and China is still catching up with the U.S. On the premise that the “principle of fault liability” is made clear, and taking into account the state of technological development and other factors, the law should reasonably delimit the liability of AI providers and require network platforms to take necessary measures where these are technically feasible and affordable.

Finally, the circulation of data in society should be further promoted to increase the accessibility of data. Richer training data can help reduce the probability of outputting infringing content. In addition, cooperation between AI providers, network platforms and copyright holders should be strengthened. Generative AI providers or deployment platforms can be encouraged to seek permission from copyright holders where generated content is similar to their works, add links pointing to works related to the prompts or the output, recommend the online contact details of authors whose styles users prefer, and set up collaborative innovation programs between AI companies and artists, among many other mechanisms for win-win cooperation, so that copyright holders can gain more traffic and revenue opportunities from the development of generative AI services.


5. Conclusion


General-purpose AI, despite raising some concerns, is still an exciting field of exploration. People seem to have found a path toward AI with some degree of generality, but a great deal of research and exploration remains to be done. The current path cannot be separated from the use of massive amounts of data to train models. By training large models, a great many associative relationships and much statistical information have been acquired, and some unexpected capabilities with a wide range of applications have emerged, which inspires continued exploration. How to promote the use of works in the training of large artificial intelligence models legally and efficiently is not only a matter of copyright law, but also a matter of technological and social development. In order to promote the realization of a better life, copyright law needs to balance and coordinate the interests of copyright holders, AI developers, and the public so as to promote social and cultural prosperity. It also needs to respond positively to the innovative development of science and technology through rules such as fair use, and to avoid inappropriately hindering technological progress. In the current situation, where there is clearly a market failure in the licensing of works at the training end, while the claim of an unreasonable impact on the interests of copyright holders lacks a clear basis, China should clearly establish rules on machine learning fair use at the training end in order to encourage the innovative development of AI technology and fair competition. At the same time, it should realize a win-win situation through governance at the output end and by encouraging cooperation between copyright holders and AI enterprises.


The original article was published in the East China University of Political Science and Law Journal, Issue 4 2024, and is reposted from the WeChat official account East China University of Political Science and Law Journal.