China Institute for Socio-Legal Studies, Shanghai Jiao Tong University

2023-07-28

Wang Lusheng: ChatGPT technology: an innovator or a disruptor of legal artificial intelligence?

Wang Lusheng

Professor, School of Law, Southeast University; researcher, People's Court Judicial Big Data Research Base, Southeast University

Abstract: Since the turn of the 21st century, legal artificial intelligence has enjoyed a prosperous revival. Behind the enthusiasm, however, the linguistic complexity and knowledge richness of the legal field mean that legal artificial intelligence still faces technical bottlenecks in natural language processing and knowledge generation. The large-scale language models and generative AI technology represented by ChatGPT are expected to break the natural language understanding bottleneck of legal AI, greatly improve its interactivity, generativity, and embeddedness, and make the connection between legal AI and its users essential, high-frequency, and highly sticky. However, the underlying logic of existing ChatGPT technology cannot fully meet the legal domain's demands for knowledge richness, rigor, and creativity. The mismatch between fluent language processing and relatively weak knowledge generation produces three illusions: the knowledge completion illusion, the knowledge authority illusion, and the knowledge generation illusion. These limit the extent to which large-scale language models and generative AI architectures can fundamentally disrupt legal artificial intelligence. Going forward, overcoming the "knowledge illusion" and achieving further iteration of legal AI will require strengthening the supply of high-quality multimodal legal data, building an instruction fine-tuning mechanism based on legal instruction sets, and building a reinforcement learning mechanism based on knowledge feedback from legal professionals. At the same time, from the perspective of the sociology of technology, we should ease the tension between innovation diffusion and social justice, avoid digital divides in accessibility and usability, and truly achieve a new round of empowerment around legal knowledge across society.

Keywords: ChatGPT technology; large-scale language model; generative artificial intelligence; legal artificial intelligence

Although legal artificial intelligence is a recently developed technology, its intellectual origins can be traced back to Leibniz's 17th-century discussion of how mathematical formalism could improve law and to Weber's image of the modern judge as a vending machine. The seasonal metaphor often used to narrate the development of artificial intelligence also applies to legal artificial intelligence. Research and development in legal artificial intelligence originated in the 1950s, passed through an "early spring", and reached "midsummer" in 1980, when a group of systems represented by legal expert systems emerged. Unable to break through its technical bottlenecks, legal artificial intelligence fell into a "cold winter" at the end of the last century. Then, under the triple influence of judicial digitization, technological accumulation, and venture capital, it entered "summer" again in the past decade. Despite this rapid recent development, large-scale practical application still faces obstacles of varying degrees, and a pessimistic sense that "winter is approaching" has begun to spread through the field. In November 2022, ChatGPT was released. Relying on a large-scale language model and the pre-training framework of generative artificial intelligence, it achieved impressive leaps in language understanding, intent recognition, logical reasoning, content generation, and other areas. Since its inception in the mid-20th century, legal artificial intelligence has carried the widely shared expectation that technology can improve the quantity, quality, and accessibility of legal services. 
ChatGPT's strong performance in specialized fields, including law, has rekindled the ardent expectation of legal futurism that this technical framework will bring revolutionary change to legal AI and produce strong legal AI: an AI that can adapt to the real legal world, solve complex legal problems, and communicate and cooperate with legal professionals without obstacles. So, to what extent can the large-scale language models and generative artificial intelligence technologies represented by ChatGPT (hereinafter "ChatGPT technology") be expected to enhance current legal artificial intelligence? Is that enhancement an "improvement" or a "disruption"? This article offers a brief discussion of these questions.

1 The Contemporary Landscape of Legal Artificial Intelligence

From the perspective of practical application, legal artificial intelligence is the use of automation technology to complete work, such as prosecution and adjudication, that is usually carried out by legal professionals drawing on their expertise and judgment. It is the cutting-edge part of legal technology that provides substantive legal service solutions. Since the turn of the 21st century, big data applications and intelligent algorithms have gradually moved from theory to practice, and connectionist intelligence, with deep neural networks at its core, has developed explosively. Legal artificial intelligence is thriving again in this era and is producing research results in a range of specific business scenarios. However, limited by technical bottlenecks in natural language processing and knowledge generation, the various types of legal artificial intelligence have not yet linked their existing practical results to an institutionalized logic of sustainable development, and there is still room for improvement in functionality, generality, and coverage. An objective depiction of this landscape helps in understanding how ChatGPT technology may transform legal artificial intelligence.

1.1 The Contemporary Revival of Legal Artificial Intelligence

It is generally believed that specialized publications on legal artificial intelligence can be traced back to Layman Allen's 1957 article "Symbolic Logic: A Razor-Edged Tool for Drafting and Interpreting Legal Documents". Over some 70 years of development, legal AI has not only undergone a paradigm shift from the symbolist school to the connectionist school, but has also gone through several rises and falls. Over the past decade or so, legal artificial intelligence has clearly been in a period of revival, with progress in key areas such as legal knowledge retrieval, legal text generation, and prediction of legal outcomes. China's legal community, and the judicial system in particular, has also consciously focused on emerging technologies such as artificial intelligence and has set the goals of building "smart courts" and "smart procuratorates". This contemporary revival is closely tied to the rapid development of data, algorithms, and computing power, and to the maturing of artificial intelligence infrastructure.

First, the widespread digitization of justice worldwide has provided substantial data reserves for the current round of legal artificial intelligence. In the wave of informatization, the traditional concept of "judicial openness" has taken on a new meaning: judicial organs should promote the electronic disclosure of judicial documents to secure citizens' right to a fair trial and right to know. Electronic disclosure of judicial documents has become a global trend in the 21st century, and highly convergent systems of online disclosure have emerged to regulate it and put it into practice. Although most of these disclosure efforts were not designed primarily to support legal artificial intelligence, online disclosure has objectively increased the richness and accessibility of the electronic legal corpus, providing convenient and abundant data for legal artificial intelligence, and especially for the current connectionist generation of it.

Second, the long-term efforts of experts and scholars working at the intersection of law and artificial intelligence have provided the technological accumulation for the current round of development. In 1987, the first International Conference on Artificial Intelligence and Law (ICAIL) marked the formal formation of an academic community for legal artificial intelligence. At the third ICAIL in 1991, the International Association for Artificial Intelligence and Law (IAAIL) was established, further promoting research and development in the field. In Europe, the Foundation for Legal Knowledge Based Systems (JURIX), composed of legal and computer science researchers, has held an international conference on legal knowledge and information systems annually since 1988; together with ICAIL, it has become a benchmark for international research on legal artificial intelligence. Even during the cold winter of legal artificial intelligence, a group of scholars kept working on it. This allowed the legal AI community, as the paradigm shifted from the symbolist school to the connectionist school and as connectionist technologies such as machine learning and deep neural networks took off, to absorb the latest advances quickly and build new automation models that kept pace with the times.

Finally, venture capital has financed a wide range of legal technology companies, providing solid infrastructure support for the current round of development. According to Crunchbase data, 175 venture capital deals in the United States legal technology sector in 2021 totaled more than $1.6 billion, far exceeding the $522 million invested in 2020 and the $989 million invested in 2019, and setting a new record. In China, according to incomplete statistics, total investment in legal technology companies in 2021 also exceeded 1.5 billion yuan. Overall, as data and algorithms have accumulated over the long term, the demand of legal artificial intelligence for data processing capacity has grown significantly, and venture capital can make up for the infrastructure shortfalls of these legal technology start-ups.

In summary, with the rapid development of data, algorithms, and computing power, and with the maturing of artificial intelligence infrastructure, legal artificial intelligence has flourished in major countries around the world. Current legal artificial intelligence not only attempts to provide differentiated intelligent legal services for lawyers, the public, enterprises, judges, prosecutors, and other actors, but has also produced research and development results with some practical capability in specific business scenarios such as legal Q&A (legal Q&A robots), similar-case recommendation, legal document generation and review, prediction of judgment outcomes, litigation risk analysis, and warnings of inconsistent judgments in similar cases.

1.2 The Technical Bottlenecks of Legal Artificial Intelligence

With the triple support of data, algorithms, and computing power, legal artificial intelligence has entered a period of revival in recent years. Compared with the vision of legal futurism, however, the technological singularity of legal AI has not yet arrived. In terms of technical logic, the essence of a legal AI application is to identify users' legal needs accurately through natural language processing and map them to the best available set of responses in the system's legal knowledge. The complexity of legal language and the richness of legal knowledge create and entrench the technical bottlenecks of legal AI, raise the difficulty of developing AI systems for legal tasks, and cause many systems to fall short of expectations.

1.2.1 Technical bottlenecks of natural language processing in the legal field

Natural language processing is an important research field of artificial intelligence that aims to build computer systems able to understand and respond to human speech or written text. The legal field depends heavily on language, so for legal AI natural language processing is a key shared underlying technology. Its level of development determines the accuracy and professionalism of language understanding, intent recognition, logical reasoning, and content generation in the legal field. Legal language, however, is highly complex, and even professional lawyers need years of study to become comfortable using it. On the one hand, there is a "semantic gap" between legal language and everyday language, reflected in a large number of specialized concepts and terms and in highly stylized, meticulously precise wording. For non-lawyers, legal language is often associated with words such as "obscure" and "difficult to understand". At the same time, because law is a code of conduct that applies to all members of society, legal language must meet requirements of uniformity and standardization that ordinary language does not, which shows in its heavy use of complex phrases and long sentences and in its mixing of coordinating, supplementary, progressive, and adversative clause groups. As a result, the data samples, model parameters, and computing power needed to train on a legal corpus differ significantly from those needed for general natural language understanding. On the other hand, even seemingly precise legal language is full of subtle semantic differences. The same term may carry different context-specific meanings in different legal texts. Moreover, the meanings of many legal terms are not fixed by dictionaries but change with the interplay of social interests and with changing circumstances. 
Semantics of this depth and breadth usually require a large amount of domain-specific expertise to be embedded in any natural language processing system that works with legal texts. These characteristics mean that natural language understanding in the legal field not only needs a large, high-quality professional corpus, but also usually makes higher demands on parameter count and computational scale. Because of the professional barriers around legal knowledge, most data cleaning and labeling tasks have to be done by law students or even professional lawyers, which greatly increases the difficulty and cost of building legal knowledge. For this reason, the traditional paradigm, a technical framework built on supervised learning and small language models, is in practice not up to the demands of natural language processing in the legal field. Legal artificial intelligence therefore has real difficulty understanding the natural language of the various types of users in the legal field, and the content it generates departs from the usage habits of the field. It is not surprising that legal artificial intelligence appears to "not understand" and "not be able to speak" in human-computer interaction. Yet natural language processing and interaction, as the underlying technology, directly shape a user's first impression of legal AI. This capability bottleneck means that the resulting products are, to varying degrees, stuck in a state of "more manual work, less intelligence", which seriously discourages users from deeper use and from building trust.

1.2.2 Technical bottlenecks in the construction and generation of knowledge in the legal field

If natural language processing and smooth human-computer interaction are only the external capabilities of AI, then knowledge construction and generation are core capabilities that every AI must have. In the legal field, this means fully meeting the requirements of richness, rigor, and creativity in legal knowledge, integrating multiple heterogeneous resources, and achieving both the autonomous construction of legal knowledge and the automatic generation of legal knowledge suited to the task. The indispensable role of knowledge in building artificial intelligence was recognized as early as the embryonic stage of AI research. However, because knowledge is diverse, variable, and fuzzy, integrating artificial intelligence with knowledge has always been extremely difficult. In law, the richness of knowledge derives largely from the diversity of the sources of law. Beyond statutes, precedents, and customs, law also draws on a large body of legal theory, moral norms, and conceptions of justice, and even religious rules, village rules and conventions, and philosophical ideas can play the role of "law" in specific scenarios. The differing understandings of law held by different legal schools further blur the boundaries of legal knowledge and even bring them into conflict. Moreover, law contains a large amount of local knowledge. Some of it is relatively explicit, such as the standards for determining amounts in criminal cases of property infringement, while some is tacit and practical, and may not even be expressible in language or general propositions.

In the early days of artificial intelligence, knowledge was regarded as information represented by linguistic symbols that could be entered into computer systems in the form of "if-then" logic rules. Even professional knowledge in fields such as law and finance could be expressed and constructed through close cooperation between domain experts and technical personnel. This approach, represented by expert systems, is known as the symbolist school of artificial intelligence. Today's mainstream connectionist approach instead completes knowledge construction and generation automatically by aggregating, identifying, and analyzing massive amounts of data. The rise of this new paradigm is due, on the one hand, to the symbolist school's inability to enumerate exhaustively the vast body of domain knowledge, legal rules, and even everyday common sense; on the other hand, it benefits from the gains in both efficiency and performance of knowledge construction brought by advances in data, algorithms, and computing power. However, the current connectionist approach, with deep learning at its core, still cannot effectively resolve the bottleneck in constructing and generating knowledge for legal scenarios. This is because connectionist legal artificial intelligence relies mainly on large amounts of high-quality annotated legal data as its training corpus. In reality, the total amount of data in the legal field is very limited, far less than in finance, transportation, or healthcare. In terms of format, moreover, legal data mostly takes unlabeled and unstructured forms such as text, audio, images, and video, which computers cannot easily process automatically. 
Because the legal corpus is so limited, current legal artificial intelligence still cannot fully reflect the richness of legal knowledge, and the requirements of creativity and rigor are even harder to meet. In practice, current legal artificial intelligence cannot fully take account of the numerous and scattered needs of the many parties in legal scenarios, nor can it use methods such as knowledge calculus, knowledge reasoning, and knowledge filtering to construct and generate complex legal knowledge in depth and respond precisely to public needs.
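The "if-then" rule representation described above can be illustrated with a minimal sketch. The rule names, fact keys, and the numeric threshold below are hypothetical placeholders chosen for illustration; they are not real legal standards and are not drawn from any actual expert system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, object]


@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]  # the "if" part
    conclusion: str                     # the "then" part


# Hypothetical rules; the threshold is a placeholder, not a real legal standard.
RULES: List[Rule] = [
    Rule("large_amount",
         lambda f: f.get("amount", 0) >= 1000,
         "amount_is_large"),
    Rule("taking_without_consent",
         lambda f: f.get("took_property") is True and f.get("consent") is False,
         "act_is_unauthorized_taking"),
]


def infer(facts: Facts, rules: List[Rule]) -> List[str]:
    """Return the conclusion of every rule whose condition holds for the facts."""
    return [rule.conclusion for rule in rules if rule.condition(facts)]


case = {"amount": 1500, "took_property": True, "consent": False}
print(infer(case, RULES))

# Facts that no hand-written rule anticipates produce no conclusion at all.
print(infer({"custom": "local village convention"}, RULES))
```

The second call hints at the exhaustion problem noted above: every situation the system should handle has to be anticipated and written down as a rule in advance.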

2 ChatGPT technology empowers the evolution of legal artificial intelligence

Although ChatGPT has been available for less than a year, it may have a profound impact on legal artificial intelligence. Since its release at the end of 2022, law has been one of the main areas of ChatGPT research in the English-speaking world. According to the AI Occupational Exposure (AIOE) ranking, the legal industry also ranks first among the industries most closely related to large-scale language models. In fact, the technical framework of large-scale language models and generative intelligence fits the needs of legal AI closely. It is expected to break the natural language understanding bottleneck of legal AI, drive its technological evolution, greatly improve its interactivity, generativity, and embeddedness, and substantially empower both legal professionals and the public. With ChatGPT technology, the connection between legal artificial intelligence and its users, both legal professionals and the public, may evolve from optional, low-frequency, and weakly sticky to essential, high-frequency, and highly sticky.

2.1 Enhancing the Interactivity of Legal Artificial Intelligence

With the support of ChatGPT technology, legal AI is expected to break through the natural language processing bottleneck in the legal field and significantly improve the user experience of legal human-computer interaction. As the interface through which people and computers exchange information, and as a human-centered approach to system design, human-computer interaction plays a crucial role in the development of artificial intelligence. Unlike the traditional paradigm, which is built on supervised learning and small language models, ChatGPT, which was iterated from GPT-3, rests on a huge corpus, parameter count, and amount of computation. Its training corpus runs to some 300 billion tokens. The entire English Wikipedia makes up only about 0.6% of the whole dataset by size, although it is weighted to account for 3% of the training mix. The majority (60%) of the training mix comes from the 2016 to 2019 Common Crawl dataset, a large public dataset that crawls web pages periodically (every few months) and extracts their text, often covering more than 2 billion pages per crawl. Trained on this enormous volume of data, the underlying GPT-3 model has 175 billion parameters, which store a large amount of knowledge. This knowledge includes not only factual and common-sense knowledge in the usual sense, but also professional knowledge in fields such as law, finance, and healthcare, as well as linguistic knowledge such as morphology, grammar, and syntax. Such linguistic knowledge is generally considered the key to breakthroughs in natural language processing. 
It is not hard to imagine that ChatGPT, supported by this massive corpus and parameter count, can break through the natural language processing bottleneck that constrained the legal field in the small-model era and achieve smooth interaction and systematic conversion between everyday language and legal language.

In addition, ChatGPT technology is further optimized through instruction tuning, reinforcement learning from human feedback, and related techniques, which strengthen the model's ability to recognize intent, follow instructions, and sustain multi-round dialogue. The process includes at least three stages: supervised fine-tuning on a wide range of natural language instructions submitted by users, scoring and ranking of the model's answers by professional annotators, and reinforcement learning that uses those rankings to encourage the model to generate better answers. Surprisingly, once the number of instructions fed to the model reaches a certain threshold, ChatGPT acquires the ability to generalize to new tasks: the model can respond effectively even to instructions it has never seen before. Especially notable is the smooth multi-round dialogue ability of ChatGPT technology, which differs markedly from earlier legal question-answering robots that could manage only closed, rigid, short exchanges of one or a few rounds. Open, coherent, and smooth interaction can greatly improve the user experience of legal artificial intelligence. The intuitive impression is that ChatGPT technology not only "understands", but also "communicates coherently" and "understands people well".
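The step from annotator rankings to a training signal can be sketched as follows. This is a minimal illustration under an assumption not stated in the text: that a ranking is split into preferred/rejected pairs and scored with a pairwise logistic loss, as commonly described for reward models. The function names and the reward values are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def ranking_to_pairs(ranked_best_to_worst: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn one annotator ranking into (preferred, rejected) answer pairs."""
    return list(combinations(ranked_best_to_worst, 2))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model already scores the
    preferred answer above the rejected one, and large otherwise.
    """
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))


# Hypothetical ranking of three candidate answers to one legal question.
pairs = ranking_to_pairs(["answer_a", "answer_b", "answer_c"])
print(pairs)

# Hypothetical reward scores: agreeing with the annotator costs less.
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))
```

A reward model trained to minimize this loss over many such pairs can then score new answers, and the reinforcement learning stage optimizes the language model against those scores.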

In general, ChatGPT technology has shown better semantic understanding of legal language, intent recognition, and multi-round dialogue ability than earlier legal AI. As the technology continues to iterate and improve, it is expected to greatly enhance the interactivity of legal artificial intelligence and thereby meet the needs of all kinds of users to understand and use it through human-computer interaction.

2.2 Enhancing the Generativity of Legal Artificial Intelligence

Another prominent feature that sets ChatGPT technology apart from the artificial intelligence of the small-model era is its generative nature. The letter "G" in its name stands for "generative". According to Gartner, a well-known consulting company, generative AI is intelligent technology that automatically learns the characteristics of objects from an original corpus and can generate new, original content that resembles the original data. Before the rise of ChatGPT technology, research and development was dominated by analytical artificial intelligence based on machine learning and deep neural networks, which searches large amounts of data for hidden patterns and uses them to classify and predict. Although ChatGPT technology and traditional connectionist artificial intelligence both learn patterns from large amounts of data, the form of their outputs differs significantly. A simple example: traditional analytical artificial intelligence is trained to identify which animal appears in an image, whereas generative artificial intelligence of the ChatGPT kind can generate an image of an animal that does not exist in the real world. Because generative AI has such enormous capacity for content production, it is considered to have broad prospects in industrial design, drug development, materials science, and data synthesis, and may even upend the global ecosystem of Internet content production.

Content production is one of the core tasks of legal work. It covers not only the advice and responses given in legal consultation and Q&A, but also legal documents such as complaints, answers, and judgments across the entire life cycle of litigation. As in other fields, however, the current revival of legal artificial intelligence has centered on analytical intelligence: mining the regularities contained in massive amounts of legal data through analysis, deduction, or computation, and then applying them to specific classification tasks such as litigation risk assessment and prediction of litigation outcomes. Generative intelligence had not yet made substantial progress in the legal field. By contrast, ChatGPT technology, built on a generative architecture, can autonomously generate new professional legal content through deep learning on a massive corpus, moving legal artificial intelligence from "analytical intelligence" toward "generative intelligence". Evaluations show that ChatGPT technology can automatically generate legal documents, assess legal risks, and make legal decisions based on inputs such as case information, party information, and evidence. It is foreseeable that, as ChatGPT technology continues to iterate and develop, its role in generating content for legal scenarios will grow further.

2.3 Enhancing the Embeddedness of Legal Artificial Intelligence

ChatGPT technology has the potential to enhance the embeddedness of legal artificial intelligence and bring it closer to legal professionals. The current revival of legal artificial intelligence has produced applications such as knowledge retrieval and Q&A, document review and generation, and data analysis and prediction. Constrained by the richness of legal knowledge, however, these applications are usually developed in a fragmented, modular way, with a different model paradigm for each specific task. As a result, they are poorly compatible with one another, cannot meet the demand for intelligent systems with full-process coverage, full-scenario integration, and centralized management, and lead to duplicated construction and wasted resources. By contrast, ChatGPT technology replaces the traditional decentralized and fragmented approach to building artificial intelligence with a unified large-scale language model, and its general-purpose architecture effectively improves the model's ability to be embedded. The field of artificial intelligence has long distinguished between "general artificial intelligence" and "specialized artificial intelligence". The former refers to systems that can perform as wide a range of tasks as humans, while the latter exists only as an intelligent solution for a specific field or problem. Under this classification, traditional legal knowledge retrieval and Q&A, document review and generation, data analysis and prediction, and so on all belong to specialized artificial intelligence. Although the 1956 Dartmouth Conference and the earliest artificial intelligence researchers were committed to building general artificial intelligence in the form of "thinking machines", today's mainstream artificial intelligence products unfortunately do not fall within that category. 
ChatGPT is entirely different: it has no dedicated intent recognition module and none of the task units of traditional legal AI, such as document generation or summarization. In the world of ChatGPT technology, a generative artificial intelligence feeds user instructions into a single unified large-scale language model and generates, on that basis, character sequences that have been calibrated by humans.

The general-purpose architecture of ChatGPT technology can be embedded effectively and flexibly into a wide range of existing everyday applications, empowering the systems already in place. The "New Bing" search engine that Microsoft released in February 2023 is a typical example of this path. It embeds OpenAI's large-scale language model into the Bing search product to deliver higher-quality and more accurate information retrieval and answer generation. Similarly, it is not difficult to embed ChatGPT technology into common office software to provide document proofreading, grammar checking, and even intelligent generation of spreadsheets and presentations. In the legal field, ChatGPT technology can also be combined with traditional professional legal knowledge bases such as Westlaw, LexisNexis, and China's PKULaw (Beida Fabao), supplying verifiable legal provisions or judicial cases for legal question answering and document generation and improving the verifiability of what large-scale language models generate. It is foreseeable that large-scale language models such as ChatGPT technology, with open and generative interaction at their core, will be deeply embedded in existing legal artificial intelligence systems such as intelligent legal consultation, automated litigation guidance, intelligent document generation, and summarization of disputed issues, making it possible to build and deploy integrated, centralized legal artificial intelligence. By greatly reducing the cost to users of learning and interacting with these systems, greater embeddedness will move the connection between legal artificial intelligence and legal professionals from low-frequency and weakly sticky to high-frequency and highly sticky.
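The idea of pairing a language model with a legal knowledge base so that answers cite verifiable provisions can be sketched as below. This is only an illustration: the knowledge base entries, their IDs, the keyword-overlap retrieval, and the prompt wording are all hypothetical assumptions, and no real Westlaw, LexisNexis, or PKULaw interface is used or implied.

```python
import re
from typing import Dict, List

# Hypothetical mini knowledge base; IDs and texts are placeholders, not real provisions.
KNOWLEDGE_BASE: Dict[str, str] = {
    "PROV-001": "Placeholder provision about lease contracts and written form",
    "PROV-002": "Placeholder provision about rent payment deadlines owed by the lessee",
}


def tokenize(text: str) -> set:
    """Lowercase the text and keep alphabetic words only."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, kb: Dict[str, str], top_k: int = 1) -> List[str]:
    """Rank provisions by simple keyword overlap with the question."""
    q = tokenize(question)
    ranked = sorted(kb, key=lambda pid: len(q & tokenize(kb[pid])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, kb: Dict[str, str]) -> str:
    """Attach retrieved provisions so a model's answer can cite checkable sources."""
    sources = "\n".join(f"[{pid}] {kb[pid]}" for pid in retrieve(question, kb))
    return (
        "Answer using only the sources below and cite their IDs.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )


print(build_prompt("When must the lessee pay rent?", KNOWLEDGE_BASE))
```

The resulting prompt would then be passed to a language model. Because each cited ID points back to an entry in the knowledge base, a reader can check the generated answer against the underlying provision.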

3 The Limits of ChatGPT Technology Empowering Legal Artificial Intelligence

ChatGPT technology is expected to break through the bottlenecks of semantic understanding and content generation that traditional legal artificial intelligence has encountered, and thus to improve the interactivity, generativity, and embeddedness of legal artificial intelligence. It nevertheless has technical limits in empowering legal artificial intelligence. The reason is that its existing technical framework still cannot respond effectively to the domain-specific demands of legal knowledge. The mismatch between excellent natural language processing ability and relatively low knowledge generation ability has produced a series of "knowledge illusion" phenomena. The term "hallucination" is widely used in the technical community to describe how artificial intelligence models generate text that is fluent, natural, and grammatically correct but meaningless in content or factually wrong. Put more simply, it means "talking nonsense with a straight face". The "knowledge illusion" discussed in this article is broader in scope: it covers not only erroneous generation, a matter of technical governance, but also the unreasonable trust and dependence that arise from heavy use of ChatGPT technology. These "knowledge illusions" will greatly constrain the extent to which large-scale language models and generative artificial intelligence can fundamentally overturn existing legal artificial intelligence.

3.1 The richness of legal knowledge and the "knowledge completion illusion" of ChatGPT technology

Diverse sources of knowledge, localized knowledge content, and an open knowledge structure together shape the richness of legal knowledge. First, the richness of legal knowledge comes from the diversity of the sources of law. The change, interweaving, and blending of formal and informal "legal" norms give legal knowledge a complex and ever-changing face. Second, the highly regional character of legal knowledge adds to its richness. The knowledge system centered on the doctrine of precedent in Anglo-American law differs vastly from the rule-based reasoning model of continental law, and even within the same legal system a great deal of local knowledge may lie hidden. Finally, the openness of legal knowledge is an important source of its richness. Hart gave the classic account of this openness through the concept of "open texture". In his view, precedent and legislation are tools for communicating legal knowledge and standards of conduct, and however smoothly they apply to most cases, at some point their application will become problematic and prove indeterminate. Legal words, sentences, and rules therefore all have a core of "definite meaning" as well as a "penumbra of doubt". A sharper criticism comes from Frank, who held that the certainty of law is a human fantasy, a "fundamental misconception", and that people's desire for legal certainty is "undesirable" and "impossible to achieve". In short, the richness of legal knowledge largely determines the technological bottleneck in the development of legal artificial intelligence.

By adopting large-scale language models and a generative artificial intelligence architecture, ChatGPT technology has made great strides in data, model parameters, and computing power, and has greatly improved its natural language processing capabilities. It is nevertheless still unable to respond fully to the richness of legal knowledge and to output complete and comprehensive answers. Although ChatGPT technology uses massive internet data as its training corpus, the sheer scale of such a sample does not guarantee richness. In the production of online data, young people and users from developed countries plainly contribute more than the elderly and users from developing countries. Likewise, those who hold mainstream values and hegemonic views can easily produce large amounts of data, while underrepresented groups cannot. This gives ChatGPT technology the following drawbacks in absorbing and reproducing knowledge. First, a sample corpus in data-driven form cannot cover the large amount of implicit and tacit knowledge in the legal field. Second, sample corpora based on network data do not reflect real society evenly, and mainstream values and hegemonic views may cause local legal knowledge to fade in the model parameters. Finally, ChatGPT technology does remove the self-enclosed boundaries of traditional law through its generalized technical architecture, so that the open structure of law can be supplemented by knowledge from other fields; yet its modeling approach is often far removed from traditional legal elements, and its black-box style of knowledge generation challenges accuracy and fairness. 
In fact, for most of the development of legal artificial intelligence, there has been more interest in modeling the reasoning that explains a result (and in giving reasons for other possible outcomes) than in predicting the result itself. In summary, although ChatGPT technology can give seemingly reliable answers to every question users raise, in the face of the richness of legal knowledge such answers often carry an illusion of "knowledge completeness". Does ChatGPT technology really have all the background knowledge required to answer an endless range of questions, and do the generated answers fully address the questions asked?

3.2 The rigor of legal knowledge and the "knowledge authority illusion" of ChatGPT technology

The legal field places great weight on the rigor of knowledge. This is because law, as an authoritative social norm, plays an important role in defining patterns of behavior, maintaining social order, and managing public affairs. Different schools of jurisprudence hold different views on the source of legal authority: Habermas attributes it to "communicative rationality", Austin to obedience to the sovereign, and Hart to the "rule of recognition". Yet jurists broadly agree that law is authoritative, and this is reflected in the strict requirements of modern lawmaking, a complete system of legal implementation, and the backing of coercive force. Under the governance of legal authority, legal knowledge should naturally be solemn, rigorous, comprehensive, and prudent. Legal knowledge is therefore subject to normative requirements different from those of literature, art, and aesthetics, and within a specific time and place there is always an objective difference between "truth" and "fallacy".

However, an analysis of the architecture of ChatGPT technology shows that its acquisition and generation of knowledge cannot guarantee absolute accuracy, and so falls short of the rigor that legal knowledge requires. The large-scale language model and generative artificial intelligence architecture adopted by ChatGPT technology perform "knowledge acquisition" on an existing massive corpus and remain, in essence, within the connectionist paradigm. As mentioned earlier, ChatGPT technology relies mainly on the Common Crawl web dataset during model training and knowledge construction. The volume of this data is vast and its quality is uneven. Without manual calibration and supervised learning, large-scale language models often absorb the erroneous knowledge and values it contains. Given these defects in the training corpus, ChatGPT technology cannot avoid factual errors, false statements, and wrong data. Especially when facing complex problems in professional fields, where the available corpus is limited, large-scale language models such as ChatGPT cannot guarantee correct answers. Bard, a dialogue robot launched by Google on the basis of a large-scale language model, made a factual error when answering a question about the James Webb Space Telescope, causing widespread concern. Although such problems can be mitigated through instruction fine-tuning and reinforcement learning from human feedback, the correctness of generated content still cannot be guaranteed. When OpenAI recently released its latest large-scale language model, GPT-4, it still stated clearly that although GPT-4 has significantly reduced hallucination compared with previous models, it is not fully reliable and may produce incorrect answers. Factual errors caused by the training corpus can equally occur when large-scale language models are applied in the legal field. 
On the one hand, the quality of legal text on the internet varies, and large-scale language models may have "learned" both correct and incorrect legal knowledge during training. Given how specialized legal knowledge is, for legal issues of some complexity the corpus may contain far more erroneous knowledge than correct knowledge. On the other hand, the revision and repeal of laws bring significant updates and adjustments to legal knowledge. It is common for the answer to the same question to differ significantly before and after a law is revised, yet without sufficient new training data a large-scale language model can only generate knowledge from the data it already has. Under the existing technological architecture, it is therefore inevitable that ChatGPT technology will sometimes "misjudge" or "fabricate" on legal issues. For example, it has fabricated a "University Law" of China to assess the issue of university teachers having to leave if they are not promoted, and it has asserted that China's current criminal procedure law contains provisions on big data investigation. ChatGPT technology understands and expresses language fluently and naturally, which greatly narrows the communication gap between users and intelligent agents and strengthens trust between humans and machines. However, the identification and deference that this high interactive ability inspires are mismatched with its lower generation accuracy. The illusion of knowledge authority, that is, treating what artificial intelligence generates as authoritative knowledge, has seriously hindered the deep application of ChatGPT technology in the legal field.

3.3 The creativity of legal knowledge and the "knowledge generation illusion" of ChatGPT technology

The legal field places great weight on the creativity of knowledge, which is reflected above all in adjudication, especially in hard cases, and in legal reasoning. The classic account comes from Dworkin's exploration of "constructive interpretation". He pointed out that the interpretation of law is closer to the interpretation of literature and art than to scientific interpretation. This is because the object of interpretation in law, as in literature and art, is something created by people rather than something that exists objectively. Moreover, it is the purpose of the interpreter, rather than a purely causal relationship, that plays the decisive role in the interpretation. This makes statutory interpretation creative: it aims at the result the interpreter expects rather than at recovering the original intention of the law. In fact, even laws that claim to be rigorous and determinate must adapt to the practical needs of rapid social change, and various potential changes lie hidden within them, although judges in the continental legal system rarely acknowledge this creative generation of knowledge in public.

However, a closer look at ChatGPT technology shows that, although it adopts a generative artificial intelligence architecture, its "creative" generation is to a large extent an "illusion". ChatGPT technology resembles the traditional machine learning methods of connectionist legal artificial intelligence in that both form an understanding of regularities by examining patterns in large amounts of data. Both are essentially mathematical models that predict the future from past data, and they share the basic assumption that patterns will repeat. For machine learning applications in legal artificial intelligence such as sentencing prediction, this understanding of regularity comes from the relationship between specific data such as the circumstances of a case and the sentence; for ChatGPT technology, it comes from the latent correspondence between input characters and output characters. Although these methods of constructing and acquiring knowledge may yield results similar to human cognition, they are fundamentally different from the higher-order cognitive systems of humans. One simple explanation is that ChatGPT technology generates its output by choosing the most probable continuation under a probability distribution over strings. But in legal reasoning, especially in adjudication, probabilistic inference is almost never used to reach conclusions. An 80% probability means that 20% of cases will be wrongly decided, which is by no means just. Another simple explanation is that ChatGPT technology generates results by looking back over past data, while the generation of legal knowledge centered on adjudication is a constructive process oriented towards the future. 
When shared social values are changing and the "patterns" and "regularities" no longer repeat, legal knowledge summarized from past data may be unable to meet the real needs of current and future cases. In fact, the "work" done by ChatGPT technology can only be regarded as "information processing": it is a derivative generation "from something to something", in which one existing thing produces another, rather than a breakthrough generation that makes "something out of nothing". Like sentencing prediction, ChatGPT technology takes existing information and content as the core of model generation, which means it cannot generate texts that set out new legal theories or overturn existing viewpoints. Excessive reliance on past data will fundamentally limit the creative evolution of adjudication in response to changes in social values or adjustments in legal concepts.
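The contrast between probability-based generation and adjudication can be made concrete with a small sketch. It assumes a toy next-token distribution (the words and probabilities in `next_token_probs` are invented) and uses greedy selection, one simple decoding strategy; the source does not say which decoding method ChatGPT actually uses. The second half simply restates the article's arithmetic: if a conclusion is right with probability 0.8, about one case in five would be decided wrongly.

```python
def greedy_next_token(distribution: dict[str, float]) -> str:
    """Return the candidate with the highest probability (greedy decoding)."""
    return max(distribution, key=distribution.get)


# Hypothetical distribution over the next word after "The defendant is ...".
next_token_probs = {"liable": 0.80, "not": 0.15, "unclear": 0.05}

choice = greedy_next_token(next_token_probs)

# If the chosen continuation were correct exactly as often as its probability
# suggests, the remaining probability mass is the expected error rate.
expected_error_rate = 1.0 - next_token_probs[choice]
expected_wrong_per_1000 = round(1000 * expected_error_rate)

print(choice)
print(expected_wrong_per_1000)
```

The model always outputs its most likely continuation and has no notion of the individual case in front of it, which is why an error rate that is acceptable in prediction is unacceptable in a judgment.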

4 ChatGPT technology empowers the future of legal artificial intelligence

Given the remarkable advantages of large-scale language models such as ChatGPT in natural language processing and human-computer interaction in the legal field, their combination with legal AI is only a matter of time. Discussing how ChatGPT technology can empower legal artificial intelligence is therefore not a matter of blindly following a trend or inflating a "pseudo-problem". Large-scale language models and generative artificial intelligence, iteratively optimized on legal data, will become the core of future research and development in legal artificial intelligence, with "intensive cultivation" carried out on top of pre-training to form more efficient, accurate, and professional systems. However, this path should not be burdened with unrealistic expectations: overemphasizing the functional value of technology easily falls into the trap of technological supremacy and ultimately leads astray. The richness, rigor, and creativity of legal knowledge make it difficult for current ChatGPT technology to fulfil everything legal professionals imagine of legal technology. Looking ahead, in addition to clarifying the boundaries of intervention for legal artificial intelligence and securing the much-discussed ethical requirement of keeping a "human in the loop", it is also necessary to strengthen the supply of legal data and the verification of legal knowledge so as to minimize knowledge illusions. On this basis, accessibility and equality should be ensured as ChatGPT technology spreads, so that the whole public is empowered by it.

4.1 Strengthening the supply of legal data, fine-tuning legal directives, and verifying legal knowledge

At present, ChatGPT still faces difficulties with the accuracy, robustness, and verifiability of the legal knowledge it generates, which show first in incorrect answers to simple legal questions. The mismatch between fluent language processing ability and relatively low knowledge generation ability exposes ChatGPT technology to more complex "knowledge illusion" phenomena. From a technical perspective, this should be attributed to the insufficient participation of legal knowledge in the development of artificial intelligence. Research and development in legal artificial intelligence still follows a three-dimensional paradigm of "data + algorithms + computing power", which does not fully reflect the driving role of legal knowledge. Even current ChatGPT technology is only a further deepening along these three dimensions, namely the mining and use of massive data and the production of excellent generalization and generation capabilities through complex model parameters. Behind the intelligent generation of ChatGPT technology, however, lies relatively low efficiency in the use of data: despite significant progress, large-scale language models see far more text during training than any person could read in a lifetime. The reason is that human decision-making and generation rely more on symbolic, abstract, and theoretical knowledge for reasoning, while large-scale language models still cannot accurately understand the real meaning behind language. Therefore, in empowering legal artificial intelligence with ChatGPT technology in the future, specific optimization and governance measures are needed to make full use of the fundamental role of legal data and domain knowledge in the construction of intelligent facilities. 
While forming a four-dimensional driving paradigm of "data + algorithms + computing power + knowledge", it is necessary to apply reinforcement across the whole process, namely strengthening the supply of legal data, fine-tuning on legal instructions, and verifying legal knowledge, so as to further improve the accuracy of knowledge construction and generation.

First, at the source of the corpus, the supply of high-quality legal data should be strengthened and a multimodal Chinese dataset for the legal field should be formed. At present, ChatGPT technology developed overseas makes very limited use of Chinese-language material. According to statistics, in the training corpus of GPT-3, Chinese accounts for only 1.1‰ of documents, 1.0‰ of words, and 1.6‰ of characters, ranking 15th, 17th, and 14th among all languages respectively. Therefore, to reduce the knowledge illusion of ChatGPT technology in the legal field effectively, the primary task is to strengthen the supply of high-quality Chinese legal datasets. High-quality legal text on the Chinese internet is currently rather limited and exists mainly in the form of judicial documents. Other high-quality Chinese legal corpora, such as laws and regulations, legal papers, and legal consultations (legal Q&A), remain fragmented and scattered among state agencies or enterprises, making effective data coordination and sharing difficult. As a next step, under the coordination of the National Data Bureau, the collection and aggregation of non-classified legal documents and of laws and regulations can be promoted at the level of government data. On this basis, legal data will gradually develop from single-mode text into multimodal data that includes images, audio, and video, eventually forming a high-quality, multimodal, and open Chinese legal dataset.

Second, in the training process, the collection and integration of legal instructions should be enhanced, and the generated content should be fine-tuned on domain-specific instructions. In current ChatGPT technology, instruction fine-tuning on top of a pre-trained model plays a very important role: it significantly improves the model's ability to recognize intentions, follow instructions, and generalize. For legal artificial intelligence, a "legal instruction set" is in essence the collection of all potential demands for intelligent assistance in the legal field, such as asking the system to write pleadings and judgments, analyze litigation risks, or predict litigation outcomes. Authenticity, adaptability, and richness are the basic requirements of a legal instruction set. First, it should reflect the real business demands and real habits of expression of users, so as to simulate user needs accurately and improve the accuracy and efficiency of the model. Second, it should differ for different user groups (such as the general public, lawyers, judges, and prosecutors) and different geographical settings, so as to fit different usage scenarios precisely. Third, the exponential accumulation and in-depth exploitation of legal instruction sets will comprehensively expand the ability of ChatGPT technology to generalize and transfer within the legal field, allowing it to draw inferences from one case to others. However, because traditional analytical AI fixes the content of each task in advance, the targeted accumulation of instruction sets for legal scenarios, and of the corresponding demand examples, is currently very limited, which will greatly restrict the targeted training and fine-tuning of ChatGPT technology in the legal field. 
Therefore, in the process of empowering legal artificial intelligence with ChatGPT technology, it is necessary to fully fine-tune the content generation of the model based on the real instructions of the legal scene during model training, further unleashing the powerful potential of ChatGPT technology, and achieving true matching and generalization processing of diversified requirements for legal scenes.
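To make the idea of a legal instruction set more tangible, the sketch below shows one possible record layout and a simple coverage check. Everything here is an assumption made for illustration: the field names in `LegalInstruction`, the sample records, and the notion of checking which (user group, task) pairs are still missing are not taken from the source, which describes only the requirements of authenticity, adaptability, and richness.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalInstruction:
    instruction: str       # the request as a real user might phrase it
    task: str              # hypothetical task label
    user_group: str        # e.g. general public, lawyer, judge, prosecutor
    region: str            # hypothetical region label
    reference_output: str  # answer written or approved by a legal professional


# Invented sample records; reference outputs are elided placeholders.
SAMPLES = [
    LegalInstruction("Draft a complaint for unpaid wages.",
                     "drafting", "general public", "Region A", "..."),
    LegalInstruction("Summarize the disputed issues in this case file.",
                     "summarization", "judge", "Region B", "..."),
    LegalInstruction("Assess the litigation risk of this loan dispute.",
                     "risk analysis", "lawyer", "Region A", "..."),
]


def coverage(instructions: list[LegalInstruction]) -> Counter:
    """Count how many instructions exist for each (user group, task) pair."""
    return Counter((i.user_group, i.task) for i in instructions)


def missing_pairs(instructions, groups, tasks):
    """List the (user group, task) pairs that have no instruction yet."""
    seen = coverage(instructions)
    return [(g, t) for g in groups for t in tasks if seen[(g, t)] == 0]


gaps = missing_pairs(SAMPLES, ["judge", "lawyer"], ["drafting", "summarization"])
print(gaps)
```

A check of this kind is one way to see where the directed accumulation of instructions is still thin before fine-tuning begins.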

Finally, at the stage of result generation, the verification of legal knowledge should be enhanced by building a reinforcement learning mechanism based on feedback from legal professionals. Through reinforcement learning from human feedback (RLHF), ChatGPT can answer questions in a way consistent with human intentions, knowledge, and values, which the technical community calls "AI alignment". However, given the particular values of legal settings and the rigor of legal knowledge, basic RLHF alone cannot align a model effectively with legal knowledge, cognition, needs, and values. It is therefore necessary to build reinforcement learning based on feedback from legal professionals, so as to optimize ChatGPT technology for the legal domain. On the one hand, the expertise of legal professionals can be used to evaluate legal texts generated by ChatGPT technology and to correct the model where it produces incorrect answers, bringing the language model closer to common knowledge and cognition in the legal field. On the other hand, in the field of law, ontological value is particularly important: any system worthy of being called "law" must attend to certain fundamental values that transcend the relativity of specific social and economic structures. Here it is all the more necessary to use reinforcement learning based on feedback from legal professionals, drawing on knowledge of the legal field and the theoretical propositions and value orientations behind it to exercise ethical control over and correction of intelligent technology, so that generated content accords better with the values and ethical norms of the legal profession.
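One common way to turn expert feedback into a training signal is a pairwise preference loss for a reward model, as widely described in the RLHF literature. The sketch below illustrates that idea only; the source does not specify how feedback from legal professionals would be encoded, and the `Feedback` records and reward scores are invented numbers. The loss is small when the answer the reviewer preferred already receives the higher reward and large when the ranking is reversed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    question: str
    reward_preferred: float  # hypothetical score of the answer the reviewer chose
    reward_rejected: float   # hypothetical score of the answer the reviewer rejected


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Invented feedback: in the second pair the reward model ranks the answers wrongly.
FEEDBACK = [
    Feedback("Is an oral lease valid?", 2.0, 0.0),
    Feedback("Which court has jurisdiction?", 0.0, 2.0),
]

losses = [preference_loss(f.reward_preferred, f.reward_rejected) for f in FEEDBACK]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

Minimizing this loss over many reviewed pairs pushes the reward model toward the judgments of the reviewers, and the reward model can then steer the language model during reinforcement learning.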

4.2 Ensure accessibility and equalization of technology diffusion

The interaction between the diffusion of new technologies and social justice has always been one of the most important issues in the field of technological sociology. Throughout the history of human society, technological diffusion often triggers the "digital divide", with the proliferation and application of mobile phones, the internet, or social media being no exception. Therefore, how to narrow the "digital divide" in the process of technology diffusion and address the social equity issues brought about by new technology diffusion is undoubtedly a major issue that must be paid attention to in the governance of risk societies. It is generally believed that the "Digital Divide" is an imbalanced phenomenon in the popularization and application of emerging technologies, which is comprehensively influenced by the level of economic development, knowledge development ability, openness to the outside world, and the level of communication technology introduction. This imbalance is not only reflected in different geographical regions, countries with different levels of human development, and countries with different levels of economic development, but also in different regions and populations within a country. Its essence is the polarization trend caused by the differences in the ownership and application of new technologies among different social groups. In a broad sense, the "digital divide" includes two levels: "Have or Not Have" and "Use or Not Use", respectively representing differences in access or usage after access. The former points to a country's public policies and infrastructure supply, while the latter points to the inequality caused by differences in technology application among users.

The monopolistic tendencies that arise in the development of ChatGPT technology, together with differences in technological literacy among groups, pose a serious "digital divide" risk to the diffusion of this kind of innovation in the legal field. On the one hand, the "monopoly tendency" of ChatGPT technology will hinder innovation diffusion and create a "digital divide" in terms of accessibility. "Innovation diffusion" refers to the process by which an innovation spreads among the members of a social group through specific channels over a period of time. The characteristics of the innovation, communication channels, time, and the social system are the four basic elements that affect it. According to diffusion of innovations theory, the innovative characteristics and comparative advantages of ChatGPT technology, namely its interactivity, generativity, and embeddedness, are among the conditions that allow it to spread far more widely and quickly than traditional legal AI technologies. Beyond the characteristics of the innovation, the diffusion of technology is also constrained by broader features of the social system, such as commercialization and market competition. Commercialization, market competition, and government industrial policy eventually carry new technologies to those who were at a disadvantage in the initial diffusion, and with this second wave of diffusion the differences in access to the technology gradually disappear. However, the huge demands of ChatGPT technology for data, algorithms, and computing power gave it a distinct monopoly tendency from the outset. Data show that the total computing power consumption of ChatGPT is approximately 3,640 PF-days, requiring 7 to 8 data centers, each with an investment of 3 billion yuan and a computing power of 500P, to support its operation. 
Apart from the hundreds of millions invested in GPU equipment and other training costs in the early stages, the maintenance cost of ChatGPT is also staggering, with daily expenses reaching up to 700,000 US dollars. Investment on this scale may confine competition over ChatGPT technology to a few internet giants. The market for applications of this technology may therefore fall short of sufficient competition for a long time, and it is difficult to expect commercialization and market competition to deliver the next wave of diffusion. Driven by monopoly interests, future legal ChatGPT technology will inevitably be shaped by profit-seeking considerations in pricing, regional differentiation, and even competition and monopoly strategies when it is implemented and deployed. If the related products are priced beyond the reach of most of the public, if the free and paid versions differ fundamentally, or if the products are restricted to specific users or regions for particular reasons, then legal artificial intelligence, which carries the expectation of equalization and accessibility, will instead become an important driver of a new round of "digital divide" and "knowledge poverty". Large parts of society will lose the opportunity to participate in the development of emerging technologies and to enjoy high-quality legal services. On the other hand, differences in technological literacy will lead to large differences in the ability to benefit among the groups that do have access to ChatGPT technology. As a result, the dividends of the technology are distributed very unevenly among those with access, producing a time lag in technology diffusion and shaping a "digital divide" in terms of usability. As some scholars have said, we may be heading for a society more unequal than the existing one. 
This inequality is comprehensive, running from starting point to outcome, and it is one that Rawls, Sandel, and Sen together cannot resolve.
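The compute figures cited above can be checked with simple arithmetic. The sketch assumes that "PF-days" means petaflop/s-days and that "500P" means a data center delivering 500 petaflop/s; the source does not spell out either unit, so both readings are assumptions. Under them, 3,640 PF-days corresponds to about 7.28 data-center-days, which is consistent with the "7 to 8 data centers" in the text if the workload is to finish in roughly one day.

```python
import math

TOTAL_PF_DAYS = 3640   # total compute cited in the text (assumed petaflop/s-days)
CENTER_PFLOPS = 500    # assumed reading of "500P": 500 petaflop/s per data center

# Days one such data center would need to deliver the total compute.
days_on_one_center = TOTAL_PF_DAYS / CENTER_PFLOPS

# Whole data centers needed to deliver the same compute within a single day.
centers_for_one_day = math.ceil(days_on_one_center)

print(days_on_one_center)
print(centers_for_one_day)
```

The calculation says nothing about cost per center; the 3 billion yuan figure is taken from the text as given.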

Although ChatGPT technology has only just emerged and has not yet been widely applied in legal practice, its enormous potential requires us to consider how to ensure that future technological services are accessible and equal, and so to return to the purpose of emerging technology, which is empowerment. This again has two aspects, accessibility and usability. On accessibility: "if old regulatory tools cannot adapt to new uses, the progress of human society will be delayed and slow." Public legal services are an important component of the government's public functions, and the goal is to promote a modern public legal service system that covers urban and rural areas and is convenient, efficient, and equally inclusive. At the same time, deeply integrating AI technology with judicial big data resources, promoting the equalization and accessibility of legal services and rule-of-law products, and bridging the gaps in access to legal service resources between urban and rural areas, between regions, and between groups of people are important goals for the design, research, and development of legal artificial intelligence, especially at the stage of application and deployment. From the perspective of "empowerment", legal artificial intelligence powered by ChatGPT technology can be provided to the public through government procurement of public services, and governance tools such as industrial policy can be used to promote the innovation and diffusion of ChatGPT technology in the legal field, which helps to avoid a digital divide in terms of accessibility. Furthermore, still from the perspective of "empowerment", because individuals inevitably differ in abilities and resources, their use of advanced legal artificial intelligence in the digital age will inevitably be uneven. 
Individuals therefore need to understand this imbalance fully, consciously cultivate technological literacy, and actively embrace scientific and technological development, so that legal technology can help realize a better vision of social governance and the digital divide at the level of usability can be effectively avoided.

Conclusion

ChatGPT technology is a product of a further leap in deep neural networks under the connectionist paradigm, supported by data, algorithms, and computing power. In terms of results, it seems to have opened a possible path toward artificial general intelligence. The modular solutions of traditional legal artificial intelligence research and development, such as knowledge retrieval and question answering, document review and generation, and data analysis and prediction in the legal field, will gradually decline and be replaced by "intensive cultivation" based on large-scale language models. However, as scholars have pointed out, deep neural networks in the legal field can easily encourage the tempting idea of "solving problems without skill or effort". For the rule of law, connectionism and deep neural networks cannot be regarded as the only development path for legal artificial intelligence, and it is difficult to accept high-performance AI that lacks any professional legal knowledge. ChatGPT technology has largely solved the technical bottlenecks of natural language processing and barrier-free human-computer interaction, but a series of "knowledge illusion" problems remain because of the richness, rigor, and creativity of legal knowledge. This means that, in the era of ChatGPT technology, legal artificial intelligence is neither a "panacea" that meets every legal need nor a "monster" that will put legal professionals entirely out of work. Even so, the striking capabilities of ChatGPT technology may herald the arrival of a new "summer" of legal artificial intelligence. Significant technological progress has changed the practice of law as a whole, so all legal professionals must keep pace with technology. By fully drawing on the historical experience of the development of legal artificial intelligence and organically integrating its patterns of evolution, technological logic, the spirit of law, and rule-of-law concepts, we can promote the healthy development of legal technology while preserving the independent personality, emotions, and thought of humanity.
