2024-06-28




How will the text-to-video model Sora affect the construction and upgrading of smart courts?



 Qiu Yaokun

Associate Professor at Koguan School of Law/China Institute for Socio-Legal Studies, Shanghai Jiao Tong University.


For the construction of smart courts, the text-to-video model brings both opportunities and risks. On the one hand, in substance it can vividly reproduce the facts of a case and help maintain the balance between objective truth and legal truth; in form it can promote the metaverse transformation of online court trials. It can also translate judgment documents into judgment videos, further raising the visualization and intelligence level of the judiciary. On the other hand, it may exacerbate old problems such as technological dependence, technological black boxes, and harm to fairness, leading to the evasion of judicial responsibility, the obstruction of judicial transparency, and the neglect of judicial fairness, which would undermine the benefits of improved judicial efficiency. Therefore, while the text-to-video model is actively studied and applied, its potential risks should be guarded against: the dominant position of judges should be firmly upheld, the model should be required to have moderate interpretability, and the review of artificial intelligence judgment results should be strengthened.


Introduction

The emergence of Sora, a text-to-video model, has once again amazed people with the development speed and creative potential of artificial intelligence. It can generate videos of up to 60 seconds, with multiple shots, complex scenes, and characters, from user-provided text, images, or videos, and in doing so it appears to understand human language and the logic of the physical world. One cannot help but imagine a metaverse in which the virtual and the real coexist, or even in which the virtual is the real. Artificial general intelligence may soon no longer be a fantasy.

What does Sora mean for the construction of smart courts? What new opportunities may it bring and what new risks may it create? This is an important issue worth studying. The Opinion of the Supreme People's Court on Standardizing and Strengthening the Judicial Application of Artificial Intelligence proposes to "accelerate the deep integration of artificial intelligence technology with trial execution, litigation services, judicial management, and social governance services, standardize the application of judicial artificial intelligence technology, improve the effectiveness of artificial intelligence judicial application, promote the modernization of the trial system and trial capacity, and provide strong judicial services for the comprehensive construction of a socialist modernized country and the comprehensive promotion of the great rejuvenation of the Chinese nation." The judicial application of the text-to-video model is therefore the general trend, and it is necessary to study its judicial impact.

In fact, scholars have extensively explored the positive effects, limitations, negative effects, and responses associated with artificial intelligence in the judiciary, believing that it may change or even reshape the judicial system while raising concerns about excessive reliance on technology. They have also actively followed the application prospects, risks, and regulation of the latest generative artificial intelligence in the judiciary, examining its ability to generate demonstrative evidence and legal documents in order to assess its impact on trials. How to make the fullest use of artificial intelligence to promote the construction of smart courts while minimizing potential risks is the core issue of research in this field. As a further development of generative artificial intelligence, the text-to-video model naturally falls within this line of research.

This article discusses the application prospects, potential risks, and risk prevention of the text-to-video model in the construction of smart courts, with the aim of using new digital technologies to continuously upgrade smart court construction.

1. The Application Prospects of the Text-to-video Model in the Construction of Smart Courts

The current construction of smart courts already uses the generation, matching, and review functions of artificial intelligence for applications such as assisted generation of legal documents, retrieval and push of similar cases, and intelligent review management. The text-to-video model can further raise the visualization and intelligence level of the judiciary, thereby improving judicial efficiency, ensuring judicial fairness, and strengthening judicial transparency. Specifically, this includes the following.

1.1 Vividly reproducing the adjudicative facts of the case

Firstly, with its powerful video generation ability, the text-to-video model can produce vivid videos that reproduce the adjudicative facts of a case from the existing text, image, and video evidence. These videos can serve as demonstrative evidence that helps judges uncover the truth. Demonstrative evidence refers to visual materials presented to explain the original evidence or the circumstances of the case, such as floor plans, location maps, fund-flow diagrams, relationship diagrams, and simulation animations. New digital technologies such as 3D printing and virtual reality (VR) have opened up new possibilities for producing and displaying demonstrative evidence. For example, in the intentional homicide case of Zhang tried by the First Intermediate People's Court of Beijing, the prosecution used VR technology to demonstrate how the defendant stabbed the victim with a knife. The prosecution's opinion in the Beijing case of Wang and Gu, who produced and sold counterfeit drugs, likewise endorsed the idea of "producing multimedia display documents with strong visual and impactful evidence".

Based on existing textual evidence such as suspect statements and witness testimony, the text-to-video model can generate more expressive and more easily understood images or videos. It can realistically restore the parties, with their expressions, actions, and rich emotions, as well as the crime scene, with a precise subject and complex background details, and present them through narrative multi-shot footage. It can also take existing image and video evidence as a basis, animating image content or extending existing videos, in order to fill in gaps in the original evidence and restore the full picture of the case. More importantly, current practice relies mainly or entirely on manual labor (3D printing and VR display still depend on manually produced demonstrative evidence and only enhance the display effect). By comparison, the model's strong generation capability makes the restoration of adjudicative facts more convenient and more vivid, so that more cases, and even all cases, rather than only a few typical ones, can be judged fairly on the basis of a thorough investigation of the facts.

Secondly, the text-to-video model's understanding of human language and the logic of the physical world, and especially the emergent abilities of generative artificial intelligence, may enable it to discover details and clues that judges overlook, giving it a better chance of approaching the objective truth of a case than human judgment alone. Even setting aside the distortion of judges' incentives by external factors, and assuming that all judges are diligent, well-intentioned actors who want to handle cases well, limited human intelligence may still constrain fact-finding. Not only is a God's-eye omniscience unattainable, but limited command of language and scientific knowledge, and even certain cognitive habits inherent in what behavioral science calls "System One", may also make it difficult to determine the facts effectively. It should also be considered that the quality of judges varies within the large body of judges in our country, and that heavy caseloads often prevent judges from fully considering every aspect of each case, so omissions and negligence are inevitable.

In contrast, the text-to-video model can reproduce the facts of a case fully and accurately thanks to its high level of language understanding and its command of scientific knowledge, which may compensate for human bounded rationality and reveal case possibilities that no one had imagined. Because it works more efficiently, produces more consistent output, and lacks human cognitive habits, it may also achieve higher overall accuracy in fact-finding. More importantly, as a form of generative artificial intelligence, the text-to-video model has emergent abilities. This means that it may rely on its own intelligence to produce results that go beyond the limits of the input evidence, thereby filling in missing links in the facts of the case, or at least giving judges a reference for restoring the full picture, and thus potentially coming ever closer to an omniscient grasp of objective truth. Moreover, artificial intelligence does not suffer from distorted incentives, cannot be corrupted, and is not improperly affected by external factors, so it can carry out the tasks that humans assign to it more steadfastly.

Finally, because the text-to-video model is strictly limited to the evidence data it is given, it can keep judges from presuming facts and thus adhere to evidence-based legal truth better than human judgment alone. The limitations of human intelligence described above may lead not only to too little cognition on the part of judges but also to too much: going beyond the limits of the evidence, filling in the facts of the case by imagination, and reaching incorrect judgments. The effective operation of artificial intelligence is precisely the best constraint on judicial arbitrariness. Its basic principle is to learn from countless past cases of all types and discover the rules of adjudication; its specific operation is to combine all the evidence in the current case to obtain a result, so the input data substantially constrains the output. By referring to the vivid demonstrative evidence provided by the text-to-video model, judges can examine whether their own fact-finding contains any unsubstantiated speculation, and thereby discover cognitive limitations they may have.

Admittedly, the tension between objective truth and legal truth still exists in judicial decision-making assisted by the text-to-video model; it is merely transformed into a tension between emergent abilities and data limitations. Emergence may bring artificial intelligence closer to objective truth, but it may also lead it away, in the form of the model's own presumptions and imaginative gap-filling. However, the jolt of an unexpected result, amplified by the video format, further prompts judges to review and rigorously examine the existing evidence, just as an implausible line or plot twist in a play can instantly break the audience's immersion. This immersion-breaking effect of emergence can actually encourage judges to ensure that every factual finding is supported by solid evidence. Therefore, whether the unexpected result is right or wrong, and whether or not it is supported by evidence, judges can draw a useful reference from it and, by moving back and forth between the output of artificial intelligence and their own reasoning, achieve a better balance between objective truth and legal truth.

1.2 Continuously optimizing intelligent trial and adjudication

On the one hand, the text-to-video model can promote the metaverse transformation of online court hearings, allowing all parties to take part in the trial as virtual avatars. This would further improve trial efficiency while keeping external factors unrelated to the case from affecting the trial. Current online trials already allow judges, parties, and other participants to be in different places, as long as they access the trial system and join the video conference at the same time. Outside the hearing stage, the parties do not even need to participate simultaneously: they only need to follow the guidance of a structured web interface and complete each step of the litigation process within the prescribed deadline, which is known as the "asynchronous trial" mode.

However, online trials based on video conferencing still have two major technical problems. Firstly, the network conditions of the parties are uneven, which degrades the video conference and harms the online trial. More generally, this reflects a digital divide that has not yet been eliminated. If one party comes from a remote area or has a lower income, the problem becomes even more serious and hampers an effort to protect their rights that is already difficult. Secondly, video requires all parties to tidy their appearance and surroundings, which is still a litigation cost and runs against the original intention of online litigation. Admittedly, a virtual background can spare some of the cost of arranging the surroundings, and such costs can also encourage the parties to take the trial more seriously and add to its solemnity. However, if there are better ways to reduce costs and increase efficiency, the improvement is worth pursuing.

The text-to-video model offers a better way to improve court hearings. It can generate virtual trial scenes and virtual trial participants, upgrading video-conference-style online trials into virtual-world-style metaverse trials and relieving the parties of the burden of turning on their cameras. This can reduce data traffic, improve access efficiency, narrow the digital divide, and remove the cost of arranging one's appearance and surroundings for video, further enhancing access to justice. More importantly, taking part in metaverse trials through virtual avatars can effectively screen out potentially discriminatory factors such as gender, race, age, and geography that may interfere with the judge's careful consideration of the case. Judges would then consider only the factors and issues related to the case, truly achieving "blindfolded justice". For victims, it can also reduce the pressure of appearing in court, which helps them protect their rights.

On the other hand, the text-to-video model can translate judgment documents into judgment videos, making the information easier for the audience to take in, helping the parties accept the outcome, and facilitating the wide dissemination of judicial decisions among the public. The automatic generation of judicial documents has already gradually turned judges from writers into editors, greatly reducing their burden and allowing them to devote more time and energy to substantive judicial reasoning and difficult cases. It has even raised concerns that judges rely too heavily on artificial intelligence, outsource their judicial responsibilities, and may lose control over judicial activities.

However, generative artificial intelligence still has enormous room for development in the drafting of judicial documents. At present, judicial artificial intelligence only drafts simple documents for online financial loan disputes based on formatted facts provided by banks; some judges are held back by inertia, difficulty in learning, and distrust of technology; and insufficient human and financial reserves in the courts leave existing judicial artificial intelligence underused. Even leaving all of this aside, generative artificial intelligence that helps write lengthy, terminology-heavy, and hard-to-follow judgment documents may actually reinforce the existing obstacles that keep the parties and the public from understanding and accepting the substance of a judgment. Admittedly, this is the thinking, language, and style that legal professionalism requires, but excessive legal professionalism has also been criticized as an invisible barrier that the legal profession cultivates to maintain its monopoly, and it does not help bridge the cognitive gap between legal professionals and ordinary people.

The text-to-video model provides a powerful tool for optimizing judgment documents. It can translate the judgment documents compiled by judges into judgment videos, which not only vividly reproduce the adjudicative facts as described above but also present the reasoning on legal issues alongside the text. It can even condense long documents into short or medium-length videos, making it much easier for the parties and the general public to take in the content of a judgment and accept it, which helps to continuously improve the judicial and social effects of adjudication. With this higher level of intelligent assistance, judicial artificial intelligence can be applied to the drafting of documents in more complex cases, and judges may find it increasingly cost-effective and become more receptive to it and more proficient in using it.

In summary, applying the text-to-video model to the construction of smart courts can, in substance, vividly reproduce the facts of a case and maintain the balance between objective truth and legal truth. In form, it can promote the metaverse transformation of online court trials and translate judgment documents into judgment videos, further raising the visualization and intelligence level of the judiciary. As a result, judicial efficiency can be improved, judicial fairness can be safeguarded, and judicial transparency can be strengthened.

2. The Potential Risks of the Text-to-video Model in the Construction of Smart Courts

For the construction of smart courts, the text-to-video model represents a new opportunity to further enhance visualization and intelligence, but it also exacerbates old problems such as technological dependence, technological black boxes, and harm to fairness. It may lead to the evasion of judicial responsibility, the obstruction of judicial openness, and the neglect of judicial fairness, which may offset the benefits of improved judicial efficiency. The specific risks are as follows.

2.1 Technology dependence and evasion of judicial responsibility

The higher intelligence and stronger capabilities of the text-to-video model may make judges more reliant on technology, evade judicial responsibility, and lead to the transformation of artificial intelligence from an auxiliary to a dominant role.

Firstly, in terms of fact-finding, the model's vivid reproduction of case facts may lead judges to place too much trust in the demonstrative evidence it produces, to stop carefully examining the more important original evidence, and even to leave the determination of factual issues entirely to artificial intelligence. However, demonstrative evidence is ultimately not real evidence and cannot serve as the factual basis for deciding a case. Moreover, because the technology is not yet mature and its emergent abilities are uncertain, the model may fail to approach objective truth, or may focus excessively on objective truth while ignoring legal truth. Whichever happens, it is a serious problem that the absence of human control may cause.

Secondly, in terms of the hearing, the metaverse transformation of online trials may lead judges to favor documentary evidence over oral testimony and to neglect direct examination of the statements of parties and witnesses, which hinders the judge's free evaluation of evidence. In fact, although video-conference-style hearings have improved judicial efficiency, they have raised concerns that judges cannot assess the authenticity of oral statements and that judicial fairness may be harmed. This is why the digital transformation of criminal justice has been relatively slow. Metaverse trials go further by erasing the characteristics of the participants and turning everyone into a pure provider of testimony. While this removes discriminatory factors, it also removes factors that may help judges evaluate evidence freely, and its impact on the judiciary is still unknown.

Finally, in terms of legal reasoning, converting judgment documents into video may relieve judges of their editing responsibilities after they have already been relieved of their writing responsibilities: they may stop checking the judgment documents generated by artificial intelligence and simply defer to its conclusions on legal issues. In fact, having other personnel write in place of the judge has already given rise to concerns about the loss of the judge's leading role, or at least about changes in the writing style of legal documents, as has happened in countries with more developed judicial assistant systems. The higher intelligence and richer modal transformations of the text-to-video model may lead judges to rely on it even more, especially given their heavy caseloads. At least in the large number of simple cases, it is not difficult to imagine more applications of artificial intelligence.

If both factual and legal issues are determined by the text-to-video model and its results are not subject to any human review, then once an error occurs, responsibility cannot be attributed to any specific judge. This clearly runs against the original intention of using the model to improve the visualization and intelligence of the judiciary. Admittedly, judicial personnel who fail to conduct a review are negligent and can still be held accountable. However, after-the-fact accountability cannot undo the resulting losses, and the damage that wrongful judgments do to judicial credibility cannot be reversed.

More importantly, the evasion of judicial responsibility and the dominance of artificial intelligence mean that providers of text-to-video technology may exert improper influence on the judiciary. Neither the text-to-video model nor any other artificial intelligence is an autonomous robot that thinks for itself; each depends largely on the settings chosen by its technology provider. If judges give up reviewing the output of artificial intelligence, they in effect transfer judicial decision-making power to the technology provider. If technology providers have private interests of their own, the consequences of their interference with public power are hard to predict and will pose significant risks to national and public security.

2.2 The Obstacles of Technology Black Box to Judicial Disclosure

On the one hand, the model's vivid reproduction of adjudicative facts turns the process of fact-finding into a black box. How case evidence is transformed into the resulting video is difficult to understand because of the inherent complexity of machine learning, and cannot be disclosed because the underlying technology requires intellectual property and trade secret protection. We can therefore only obtain the result without understanding the process, knowing what it is but not why. The phenomenon of emergence further increases the difficulty. If a technical black box merely transformed input data into output according to fixed rules, its operating rules could be summarized from inputs and outputs even without knowledge of its internal details, making some degree of understanding possible. But emergence means that this assumed strict causal relationship is broken: even if certain regularities have held so far, artificial intelligence may "suddenly break the rules", making the black box even harder to understand.

Admittedly, judges' free evaluation of factual issues is also largely a black box, since our scientific understanding of the human brain is less extensive than we imagine and research in neuroscience and related fields still has many gaps. However, we minimize the risk of human decision-making errors through a series of institutional designs, and we have a basic understanding of how humans generally think. Even if we cannot scientifically explain how the brain operates, we can form an intuitive, common-sense understanding of it, and so we can trust the judgments of fellow human beings. This is the fundamental reason why highly complex machine intelligence cannot be understood and trusted in the same way: technology is, after all, different from humans.

On the other hand, the model's continuous optimization of intelligent trial and adjudication turns the trial process itself into a technological black box. Why does a metaverse trial display certain information and block other information? In other words, what counts as relevant information to be considered at trial is very much open to debate. But when technology is in control, we see some things and not others, often without the strict examination and reflection this requires. Ultimately, the way the trial environment is configured will have a subtle and profound effect on the judge's cognition. How video-based judgments raise reasoning from one dimension to three, and how they abbreviate and summarize, also sets a high threshold for public understanding. What is retained or hidden in this process of dimension-raising and compression can shape the judicial understanding of anyone who lacks the time or ability to read the original judgment. As a result, the gain in judicial transparency from the model's enhanced visualization is offset by the complexity of the technology itself.

The text-to-video model is a further development of generative artificial intelligence, so its parameter scale will grow exponentially and its technological complexity will multiply accordingly. The technological black box will further aggravate the obstacles to judicial transparency. GPT-1 has 117 million parameters, GPT-2 has 1.5 billion, and GPT-3 has an astonishing 175 billion; although the parameter count of GPT-4 has not been disclosed, multiple predictions put it at around 10 trillion. The parameter count of the text-to-video model can be expected to grow further still, far beyond the existing technological level. The judicial process of applying the text-to-video model will therefore face the accusation that its serious black box problem obstructs judicial transparency.

2.3 The neglect of judicial fairness in pursuit of technological efficiency

On the one hand, the text-to-video model is still trained on big data through machine learning combined with human feedback, and its judicial application may overlook the particular features of individual cases. As a form of generative artificial intelligence, it is also trained through self-supervised learning similar to a cloze test. The algorithm prepares its own training data and continuously adjusts its parameters according to whether the words it fills in match the words that were removed, until the model can complete the fill-in-the-blank task. During this process, humans can guide and improve the model by providing example prompts or feedback on its output. Its advance over earlier generative artificial intelligence is this: to convert generated content between modalities, text tokens and visual patches are first mapped into a common low-dimensional space, and the same training method is applied in that space to discover the correlations among text tokens, spatial patches, and temporal patches. Its training principle is thus akin to summarizing experience from past cases and applying it to new ones. It therefore shares the problem of similar-case adjudication: whether a large model built on experience can take into account new, special circumstances in a case. Emergent abilities, being uncertain, cannot guarantee accurate factual conclusions in individual cases.
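The cloze-style training described above can be illustrated with a deliberately tiny sketch. It is not how Sora or any GPT model is actually implemented: real models adjust billions of numeric parameters by gradient descent, whereas this toy simply counts which word filled each blank. The function names (`make_cloze_examples`, `train`, `fill_blank`) and the three-sentence corpus are hypothetical and exist only for illustration. What the sketch does show is the point made in the text: the model prepares its own fill-in-the-blank data, and it has nothing to offer when a context falls outside its past experience.

```python
from collections import Counter, defaultdict


def make_cloze_examples(sentence):
    """Self-supervision: remove each interior word and keep it as the answer.

    Returns a list of ((left_word, right_word), removed_word) pairs.
    """
    words = sentence.lower().split()
    return [((words[i - 1], words[i + 1]), words[i])
            for i in range(1, len(words) - 1)]


def train(corpus):
    """Count which word filled each (left, right) context.

    These counts stand in for the parameters of a real model (an assumption
    made for simplicity; real models learn numeric weights instead).
    """
    table = defaultdict(Counter)
    for sentence in corpus:
        for context, answer in make_cloze_examples(sentence):
            table[context][answer] += 1
    return table


def fill_blank(table, left, right):
    """Return the most frequent filler, or None for a never-seen context."""
    candidates = table.get((left, right))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical miniature corpus of "past cases".
corpus = [
    "the court heard the case",
    "the court heard the appeal",
    "the judge heard the case",
]
model = train(corpus)

print(fill_blank(model, "court", "the"))      # heard
print(fill_blank(model, "the", "heard"))      # court (seen twice, "judge" once)
print(fill_blank(model, "defendant", "the"))  # None: context never seen
```

The last call mirrors the concern raised here about individual cases: a model that generalizes from past examples gives the majority answer for familiar contexts and has no grounded answer for a genuinely new one.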

On the other hand, the powerful intelligence of the text-to-video model may dominate the judge's thinking and allow erroneous results to pass unreviewed. As mentioned earlier, the stronger the technology, the more likely judges are to rely on it and evade their own responsibilities, to the point that the model is left to decide how court evidence is presented, how judicial facts are determined, and what legal conclusions are reached. From an internal perspective, summarizing the patterns of similar cases and applying them to a specific case may seem unproblematic. From an external perspective, however, factual or legal errors may still appear, especially when legal provisions fail to keep pace with social reality and conflict with people's notions of fairness and justice. In the same way, algorithms trained on data containing social or economic discrimination will only keep producing discriminatory results and reinforcing discriminatory effects. Yet because of judges' technological dependence and evasion of responsibility, such erroneous output will go unchecked, ultimately affecting the rights and obligations of the parties and the stability of the social order in substance, and even realizing the legal and social order that the technology provider's private interests desire. And because technology is efficient, the reach of this negative effect will be as broad as that of the positive one, resulting in a massive number of misjudged cases.

Because the text-to-video model is still at a limited stage of development, its ability to handle cases, especially difficult ones, has always been questioned. Allowing it to lead the trial and blindly pursuing judicial efficiency will therefore inevitably lead to the neglect of judicial fairness. In the short term, the model's generation ability is far from stable, and it may still produce many results that defy the logic of the real world. In the long run, like other technologies, its content generation may remain suited only to tasks with clear rules and black-and-white answers. It is therefore questionable whether it can play a role in difficult cases where standards are unclear or the law has gaps. The judicial process of applying the text-to-video model will thus be criticized for neglecting fairness through excessive emphasis on efficiency.

3. How will Sora, a large text-to-video model, affect the construction, upgrading, and iteration of smart courts

Since the text-to-video model brings both opportunities and risks for the construction of smart courts, it is necessary to find ways to prevent the risks while actively promoting its judicial application. In this regard, we should maintain the dominant position of judges, require moderate interpretability of the text-to-video model, and strengthen the review of artificial intelligence judgment results. Before elaborating further, however, it should be emphasized that the current level of development and application of judicial artificial intelligence is actually relatively low: it can only generate simple documents for online financial loan disputes based on formatted facts provided by banks. Moreover, because of the subjective limitations of many judges (especially older ones), such as inertia, difficulty in learning, and distrust of technology, and the objective limitation of insufficient human and financial reserves in many courts (especially in remote areas), even low-level judicial artificial intelligence still has huge room for promotion and application. Therefore, although prudence and prevention are very important, the importance of learning and application should not be underestimated.

3.1 Judge-led, assisted by artificial intelligence

On the issue of technological dependence, on the one hand, the auxiliary role of the text-to-video model in the trial process should be reiterated, emphasizing that judges lead the trial and are responsible for its results. No matter how much the intelligence level of the model improves, it cannot replace the leading role of judges in factual determination and legal reasoning. Its output can only serve as a reference for the trial or for trial supervision and management. In terms of factual determination, judges should use the schematic evidence generated by the text-to-video model only as an aid to understanding the original evidence, and should carefully review the schematic evidence against the original evidence, pursuing objective truth neither excessively nor insufficiently, so as to accurately determine the facts on which the judgment rests. In trying a case, the judge should place equal emphasis on documentary evidence and oral testimony and, if necessary, require the parties or witnesses to turn on their video to make a statement, examine their statements directly, and thereby form an inner conviction through free evaluation of the evidence. In terms of legal reasoning, judges should hold to the bottom line of their responsibility for editing the results of artificial intelligence judgments, and must not allow those results to be overstated or truncated in ways that have a significant negative impact on the parties involved and the public. Leading the trial naturally means taking responsibility, and that responsibility cannot be shifted to the artificial intelligence or confined to cases of negligence. Just as the judicial responsibility system is the other side of prohibiting others from interfering in cases, the basic requirement that those who hear a case decide it, and that those who decide it are held accountable, remains unbreakable in judicial artificial intelligence. Therefore, it should be ensured that judicial decisions are always made by judges, judicial powers are always exercised by the judicial organization, and judicial responsibility is ultimately borne by the judges.

On the other hand, the text-to-video model in use should be strictly reviewed in order to improve technical and data security and to avoid improper influence of technology providers on the judiciary. The providers of the text-to-video model should be carefully screened: whether they are internal technical staff of the court or external technical suppliers, all should have high-level technical capabilities. The large models they provide also need to be reviewed and filed to ensure that they comply with the principles of security and legality, fairness and justice, auxiliary adjudication, transparency and trustworthiness, and public order and good customs. Special attention should be paid to guarding the security bottom line: not damaging national security, not infringing on legitimate rights and interests, ensuring that state secrets, network security, data security, and personal information are not compromised, and protecting personal privacy. It is also worth emphasizing that the models used must not contain backdoors, so as to prevent technology providers from retaining control over them and using that control to improperly influence judicial decisions, obtain important information and data, and undermine the judge's leadership of the judiciary described above. Because the judicial application of artificial intelligence involves a significant public interest, that interest can outweigh technology providers' demands for intellectual property and trade secret protection, justifying higher levels of information disclosure and technical explanation.

3.2 Moderate interpretation to enhance judicial transparency

For the technical black box issue, first of all, a moderate explanation of the large text-to-video model should be given. On the premise of not harming public interests and intellectual property rights, the technical principles of the large text-to-video model should be disclosed in a way that the general public can understand, so that people can see how the model generates videos from text. The application of artificial intelligence in the judiciary is conducive to realizing public interests, so the demand for intellectual property and trade secret protection should be appropriately restrained in this scenario. Moreover, it is only broader disclosure of the model to the public, rather than narrower disclosure to the court's internal technical security reviewers, that may produce the adverse consequence of algorithm avoidance, in which those affected by judicial artificial intelligence use their knowledge of the technical principles to evade the technology and cover up illegal purposes in a legal form, thereby seeking illegitimate benefits. In promoting information disclosure and technical explanation, the text-to-video model can itself serve this purpose by explaining its principles accurately and accessibly through video, further lowering the threshold for public understanding.

Secondly, it is necessary to strengthen communication with the parties about the application of the text-to-video model. In specific cases, judges can explain to the parties face to face how the text-to-video model reconstructs case facts and generates trial scenes, making it easier for them to accept this artificial intelligence in the trial process and in the judgment results. Because trust sometimes rests not on more comprehensive information disclosure but on stable expectations formed through long-term interaction, an effective way to build trust in technology is to involve users deeply in the development, application, and optimization of generative artificial intelligence, so that personal participation improves their understanding of and trust in algorithms and dispels their fear of technical black boxes. Their personal experience of the trial makes the parties ideally placed to be such participants. Combined with the general explanation of the large model described above, it also becomes easier for them to understand how their own conduct and materials lead to the corresponding outputs, so that they can achieve stronger technical trust than ordinary members of the public, with greater judicial significance.

Finally, if explanation and communication are still insufficient to dispel the doubts of the parties and the public about the judicial application of the text-to-video model, limiting its scope of application or even delaying its application is also a viable solution. Article 24, paragraphs 2 and 3 of the Personal Information Protection Law of the People's Republic of China stipulate: "When information is pushed or commercially marketed to individuals through automated decision-making, options that do not target their personal characteristics shall be provided at the same time, or convenient ways of refusal shall be provided to individuals. When decisions that have a significant impact on personal rights are made through automated decision-making, individuals have the right to request explanations from the personal information processor and the right to refuse decisions made solely through automated decision-making." In fact, this applies not only to automated decisions made using personal information: all types of users have the right to choose whether to use the assistance provided by judicial artificial intelligence, and the right to withdraw from interaction with artificial intelligence products and services at any time. If they cannot understand or trust the technology, refusal and withdrawal are the best protection of their rights.

3.3 Manual review to unify efficiency and fairness

To address the issue of harm to fairness, it is first necessary to strengthen the manual review of cases in which the text-to-video model is applied. To achieve the unity of efficiency and fairness, courts should continue to separate complex cases from simple ones and focus review on the application of artificial intelligence in difficult cases. Human leadership and control have always been the golden rule of applying judicial artificial intelligence: they not only correspond to the judge's leadership and control over the judiciary described above, but also ensure the accuracy, or at least the accountability, of the output of artificial intelligence. If excessive gains in judicial efficiency may harm fairness, then manual review is the best way to trade some efficiency for fairness. However, to avoid damaging judicial efficiency too much and eroding the benefits of applying artificial intelligence, separating complex cases from simple ones is an effective solution. Simple cases receive lenient review, with greater trust placed in the results of artificial intelligence processing; complex cases receive strict review, starting from a stronger presumption of distrust toward artificial intelligence adjudication. The work of separating complex from simple cases is itself an auxiliary judicial task that artificial intelligence can support, which is conducive to improving judicial efficiency.

Secondly, we need to improve intelligent applications such as early warning of deviation in case judgments, verification of concluded cases, automatic inspection of non-standard judicial behavior, and risk prevention and control for judicial integrity, in order to use technology to constrain technology. An important component of the application of artificial intelligence in the judiciary is to strengthen intelligent judicial management, improve the quality and efficiency of that management, and ensure the integrity of the judiciary. Using review-based artificial intelligence to constrain generative artificial intelligence is therefore part of the essence of judicial artificial intelligence, and it helps to partially offset the loss of judicial efficiency caused by manual review. Before the manual review described above, intelligent judicial review and management applications can be used to detect in advance potential misjudgments, as well as non-standard judicial behavior or judicial corruption that may lead to misjudgments, thereby indicating where to focus and reducing the burden of manual review. Review-based artificial intelligence can even act more proactively on the training process of the generative artificial intelligence it constrains, providing more guidance and feedback to its learning process and playing a role similar to human supervision.

Finally, in addition to top-down review, it is also important to pay attention to the feelings of the parties involved in the case and listen to their views on the application of the text-to-video model, in order to achieve the unity of reason and law. Technical review and manual review both belong to the judiciary's top-down internal self-inspection. Because of a single perspective and excessive professionalism, errors may go unrecognized and many opportunities to discover problems may be missed. A bottom-up external perspective is therefore needed as a supplement. This is the feedback of the parties involved, which can play a role similar to user comments and credit systems, enabling judicial staff to understand the real effects of using artificial intelligence, supplementing internal review, and promoting model improvement. More importantly, since the goal of the judicial application of the text-to-video model is to make the people feel fairness and justice in every judicial case and to achieve the unity of efficiency and fairness, the impact of its application cannot be assessed without the evaluation and feedback of the parties involved. Only with that feedback can the goal of serving litigation truly be achieved.

In summary, the construction of smart courts in China should seize the new opportunities provided by the text-to-video model to improve the visualization and intelligence level of the judiciary, while preventing old risks from intensifying and avoiding the old problems of technological dependence, technical black boxes, and harm to fairness. In actively embracing digital technology, the courts should enable the people to feel fairness and justice in every judicial case.


The original text was published in the second issue of China Applied Law in 2024. Thanks to the WeChat official account of China Applied Law for authorizing this reprint.