Unlocking Sora: A Comprehensive Guide to Replicating Its Architecture, Scaling Parameters, and Optimizing Costs

Unlock the secrets behind Sora's architecture, its impressive encoder/decoder mechanism, the vastness of its training data, and strategic insights for replicating its success while optimizing costs

Preface

OpenAI is becoming less "open": it has released several groundbreaking but closed-source large models, from ChatGPT to Sora, accompanied by technical reports that are less "technical" and more akin to marketing. The technical blog for Sora explicitly states that it will not share implementation details, only the design philosophy of the model and its "cool" effect demonstrations.


Sora, still in beta testing, has created a sensation second only to ChatGPT's "nuclear explosion" over a year ago. In terms of results, Sora surpasses existing video generation models not only in maintaining high-resolution output but also in generation length and consistency.


While the world marvels at Sora's impressive results and recognizes this moment of intersection between human and AI civilizations, some lament the ever-widening gap with OpenAI, while various experts offer technical analysis, commentary, or deconstruction from multiple perspectives.


Today, AI stands once again at a crossroads. As technologists, we are curious about the exact implementation details of Sora. As entrepreneurs, we also ask: can Sora be replicated? What are the challenges in replicating it? Can we forge a path for a strong and capable counterattack?


This article aims to offer a well-reasoned “guess” on the technical roadmap of Sora, primarily based on the practical experience of training the cross-modal large language model “Sequential Monkey” by Mobvoi and the review of related literature.


Any errors are our own responsibility, and we welcome criticisms and corrections from peers in the industry, hoping that we can encourage each other.


This article mainly answers the following questions:


  • What is the architecture of Sora?
  • What are its encoder and decoder?
  • How large is the training data?
  • Is there extensive use of model-regenerated data?
  • What are the model scale and training costs?
  • What role should large language models play?
  • What should be the focus when replicating Sora?

Generic Multimodal Model Architecture

A Case Study on Text-to-Video

In the early stages of a technology, before consensus has been reached, the classification and discussion of various video-related models can be quite chaotic, with some categorizations not even on the same dimension, causing confusion for researchers. Therefore, to understand the technical principles behind Sora, we have reviewed the relevant literature and first define a more general framework for multimodal models. For ease of description, we use the text-to-video task as a case study.

Generally, a multimodal data processing system can be divided into three main modules or steps:

  1. Tokenizer/Encoder: This step compresses video data in both the spatial and temporal dimensions to obtain a latent representation, then segments that representation into "spacetime patches." A patch is what is commonly referred to as a Token, the atomic unit of data processing (a minimal patchify sketch follows this list). Note that the value of each visual Token can be a discrete representation (e.g., via VQ-VAE) or a continuous representation (e.g., via VAE).

  2. Cross-modality Alignment & Transfer: Once data from various modalities are compressed into the same latent space, the model needs to perform alignment or transformation within this space. Specifically, during training, the focus is on alignment, while during inference, the focus shifts to transformation. For example, in text-to-video, training primarily involves finding correspondences between text and video and within the video in the spacetime dimension. Inference involves converting text prompts into video. There are two dimensions to this step:

    • Network Architecture (U-net vs Transformer): The neural network used for this alignment or transformation can be either U-net or Transformer.

    • Model (Diffusion vs Autoregressive): If choosing a diffusion model, the optimization goal is to predict noise; if choosing an autoregressive model (like GPT), the optimization goal is to predict the next Token.

  3. De-Tokenizer / Decoder: The decoder converts the Latent Tokens generated in the second step back into Image/Video. This process is generally the inverse of the Tokenizer step, though a separate decoder can also be trained.
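
To make step 1 concrete, here is a minimal sketch of the "spacetime patchify" idea in Python/NumPy. The patch sizes, the function name, and the omission of a preceding VAE compression stage are my own illustrative assumptions, not details disclosed by OpenAI; in a real system the video would first be compressed into a latent space before patchifying.

```python
# A minimal sketch (not Sora's actual code) of turning a video into
# "spacetime patches". Patch sizes are illustrative assumptions; a real
# encoder would patchify a VAE latent rather than raw pixels.
import numpy as np

def spacetime_patchify(video: np.ndarray, t: int = 4, p: int = 16) -> np.ndarray:
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch covers `t` frames and a `p` x `p` spatial window; the result
    has shape (num_patches, t * p * p * C) and plays the role of a Token sequence.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "pad/crop to multiples first"
    patches = video.reshape(T // t, t, H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # group dims by patch index
    return patches.reshape(-1, t * p * p * C)           # (num_patches, patch_dim)

# Example: 16 frames of 128x128 RGB -> (16/4) * (128/16)^2 = 256 patches.
video = np.random.rand(16, 128, 128, 3).astype(np.float32)
tokens = spacetime_patchify(video)
print(tokens.shape)  # (256, 3072)
```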

This framework is essentially consistent with Large Language Models (LLM), except that the Tokenizer/De-Tokenizer for text modality in language models is a very simple input/output interface, with the main focus being on the second step, as is well known with GPT.

Two Different Video Generation Model Architectures

Based on the generic architecture components described above, multiple different architectures can be constructed. Generally, although Tokenizer/De-Tokenizer is important, the focus of discussion often lies in the core part of the model architecture, which is the cross-modality alignment and transformation generation. There are at least the following possibilities:

  1. Diffusion Models: These models use U-net for modeling, with representative examples including Stable Diffusion (SD), Gen-2, and Pika. There are also architectures that replace U-net with Transformer, known as DiT (Diffusion Transformer). Sora is widely believed to adopt DiT or a variant of it. Compared to U-net, DiT leverages the powerful scaling capabilities of Transformer to improve the quality of video generation. (A schematic comparison of the two families' training objectives, noise prediction vs. next-Token prediction, follows this list.)

  2. GPT Models: This approach draws from LLMs (mainly GPT) to model the alignment and transformation between text and video. Thanks to GPT’s long context window, the coherence and consistency of the video generation process are better ensured. Moreover, this type of GPT model also naturally inherits the LLM-friendly conversational prompt interface and can utilize in-context learning to enhance its capability to handle various new tasks.

    Generally, GPT models generate Tokens in the latent space from text, followed by a process of converting these Tokens into pixel-level videos, with several specific implementation methods:

    a. GPT + Codec Decoder: The GPT model directly outputs a Token representation of the video, which a Codec Decoder then restores to pixel-level video. In this scenario, the restoration capability of the Codec Decoder determines the final generation quality.

    b. GPT + Super Resolution (non-autoregressive Transformer / Diffusion): The GPT model outputs a Token representation of a "video blueprint," which a post-processing model then renders at higher resolution (SR: super resolution). This post-processing model can be a non-autoregressive Transformer (as used in VideoPoet) or a Diffusion model.

    c. End-to-end GPT (end-to-end autoregressive model): Methods a and b require, in addition to the large GPT model itself, dedicated post-processing models (such as Diffusion) to decode high-resolution video. However, as GPT context windows continue to expand (recent work has reached millions or even tens of millions of Tokens), a pure GPT model can also directly model the coarse-to-fine process typical of Diffusion: a coarse version of the Token sequence serves as the context for the next, finer version of the Token sequence.

    This end-to-end GPT model is particularly beneficial for research and development: R&D teams can focus on a single model architecture, striving to perfect every detail within GPT, with other efforts mainly involving data management and optimizing the Tokenizer/De-Tokenizer interfaces for various modalities. I am very much looking forward to the emergence of such multimodal GPTs that can even integrate video post-processing. Mobvoi has its own LLM and has been actively exploring video generation applications, so we are also dedicated to researching and experimenting with these innovative models.
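
To make the contrast between the two families concrete, below is a schematic PyTorch sketch of the two training objectives: noise prediction for the diffusion family and next-Token prediction for the GPT family. The shared backbone, dimensions, toy noising schedule, and codebook size are all illustrative assumptions, not a description of Sora's or any real system's implementation; text conditioning, timestep embeddings, and causal masking are omitted for brevity.

```python
# Schematic contrast of the two training objectives discussed above.
# `backbone` stands in for whatever Transformer (DiT blocks or GPT blocks)
# is used; all numbers are toy assumptions.
import torch
import torch.nn.functional as F

B, N, D, V = 2, 256, 512, 4096   # batch, tokens per clip, width, toy codebook
                                 # (a MAGViT V2-scale codebook would be ~262k)
backbone = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)

# --- (1) Diffusion objective: predict the noise added to continuous latents ---
latents = torch.randn(B, N, D)          # continuous spacetime patches
t = torch.rand(B, 1, 1)                 # one random timestep per sample
noise = torch.randn_like(latents)
noisy = (1 - t) * latents + t * noise   # toy linear noising schedule
pred_noise = backbone(noisy)            # timestep/text conditioning omitted
diffusion_loss = F.mse_loss(pred_noise, noise)

# --- (2) Autoregressive objective: predict the next discrete visual Token -----
embed = torch.nn.Embedding(V, D)
to_logits = torch.nn.Linear(D, V)
token_ids = torch.randint(0, V, (B, N))          # tokens from a VQ tokenizer
hidden = backbone(embed(token_ids[:, :-1]))      # causal mask omitted for brevity
ar_loss = F.cross_entropy(
    to_logits(hidden).reshape(-1, V), token_ids[:, 1:].reshape(-1)
)

print(float(diffusion_loss), float(ar_loss))
```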

The architectures mentioned above all presuppose the existence of specialized Tokenizer/De-Tokenizer (encoders/decoders) for converting between video and Tokens. However, future developments may witness more innovative attempts, such as directly using a single VQ-VAE Decoder and scaling it up to generate videos, or even eliminating the presence of the latent space altogether, which theoretically is feasible. Such methods further simplify the model architecture, potentially speeding up model inference while also demanding the model to more directly extract semantic alignment information from text or other modal inputs.

Sora Model Architecture and Its Encoder/Decoder

Sora Model Architecture

While OpenAI has not confirmed it, many speculate that Sora utilizes an architecture similar to DiT (Diffusion Transformer), but it has been expanded from image generation to video generation, truly realizing the scaling of visual models, thereby producing astonishing effects.

 

Sora’s Tokenizer/De-Tokenizer

 

Beyond the core architecture, the encoder/decoder is also crucial. Sora’s technical blog does not elaborate much on this aspect. After reviewing some literature, I found the following projects to be most relevant:

 

  • ViT (June 2021): Introduced the concept of Patchify early on, using Transformer to convert images into Tokens.

  • ViViT (November 2021): Proposed the concept of spacetime Patches early on, extending ViT from images to videos, converting videos into Tokens.

  • NaViT (July 2023): Previous Tokenizers generally could only handle fixed resolutions and aspect ratios, often converting various resolutions or aspect ratios into a uniform format that the system could process before training. NaViT primarily solved this problem, capable of processing video data with different resolutions and aspect ratios.

  • MAGViT V2 (October 2023): Previous Tokenizers for images and videos generally used different vocabularies and were processed separately. MAGViT V2 integrated images and videos into the same vocabulary, allowing them to be jointly trained within the same model. Additionally, whereas vocabularies were previously small (e.g., 8,192 entries), MAGViT V2 used a lookup-free method to increase the vocabulary size to roughly 262,000 (2^18), significantly improving video compression and generation quality.
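
For intuition, here is a rough sketch of the "lookup-free" quantization idea as I understand it from the MAGViT V2 paper: each latent channel is binarized independently, so a d-channel latent directly yields a 2^d-way vocabulary (2^18 = 262,144) without any codebook lookup. The shapes and channel count below are illustrative assumptions, not the paper's exact configuration.

```python
# Rough sketch of lookup-free quantization (LFQ), as I understand it:
# binarize each latent channel, and read the token id off the binary code.
import numpy as np

def lookup_free_quantize(z):
    """z: (..., d) continuous latents -> (quantized ±1 latents, integer token ids)."""
    d = z.shape[-1]
    bits = (z > 0).astype(np.int64)              # binarize each channel
    quantized = bits * 2 - 1                     # values in {-1, +1}
    powers = 2 ** np.arange(d, dtype=np.int64)   # weight of each bit
    token_ids = (bits * powers).sum(axis=-1)     # id in [0, 2^d)
    return quantized.astype(z.dtype), token_ids

z = np.random.randn(4, 4, 18).astype(np.float32)  # 18 channels -> 262,144-way vocab
q, ids = lookup_free_quantize(z)
print(ids.shape, int(ids.max()) < 2**18)           # (4, 4) True
```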

Sora’s technical blog emphasizes converting video data into spacetime Patches/Tokens, the joint training of videos and images, and the ability to process training data with different resolutions and aspect ratios. It references the aforementioned literature but does not disclose how OpenAI specifically used these concepts, what modifications or innovations were made. It even seems to intentionally divert attention to DiT, which might be considered less crucial from a replication perspective. Is this perhaps a bit “sly”?

 

Additionally, guess who published these projects and the corresponding papers? Yes, it's Google, OpenAI's neighbor, who often seems to "get up early but arrive late to the fair."

How Large Is the Training Data for Sora?

OpenAI has not disclosed the scale of Sora's training data. However, an educated guess might be that the image data consists of several billion images and the video data comprises at least several million hours; the total number of Tokens after tokenizing the images and videos is likely on the order of several trillion to over ten trillion.

 

Why? Here are two approaches to estimate:

 

Estimation from Past Data Scales of Voice, Image, and Video Models

  • Voice: Experience in voice processing suggests that baseline data needs to reach 100k hours, with the number of Tokens after tokenization being around ten billion (10B). This aligns with Amazon’s report on “BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data.” To generate voices in multiple world languages, at least 1,000k hours of data are needed, equating to a hundred billion (100B) Tokens.

  • Images and Videos: The table below lists the data used by several representative image/video models in recent years. Image data ranges from roughly a hundred million to a few billion images, and video data from tens to hundreds of millions of clips, roughly tens to hundreds of thousands of hours. Some papers report video datasets of several hundred thousand hours; the video data for Sora would need to be an order of magnitude larger, i.e., several million hours.
    Model      | Images  | Videos
    Flamingo   | 2,297M  | 27M
    VideoPoet  | 1,000M  | 270M
    DALL-E     | 250M    | 0
    DALL-E 2   | 650M    | 0
    BLIP-2     | 129M    | 0
    CLIP       | 400M    | 0

Using the MAGViT encoding and compression scheme, if a video with a resolution of 128×128 at 8 fps requires 1,280 Tokens for 2.125 seconds, then one minute requires approximately 36,000 Tokens and one hour about 2.16 million Tokens. For 100,000 hours of video, that would be 216 billion Tokens.

But considering Sora’s training data could be several million hours and typically at a resolution higher than 128×128, the final volume of Tokens should start at least in the trillion range. For reference, if Sora used 5 million hours of video data for training, the volume of data it used would be roughly equivalent to the amount of data produced on YouTube in 9 days.
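
The arithmetic above is easy to reproduce; the few lines below simply restate the estimate, with every input being an assumption carried over from the text rather than a measured value.

```python
# Reproducing the back-of-the-envelope token estimate above
# (all inputs are the article's assumptions, not measured values).
TOKENS_PER_CLIP = 1280          # MAGViT-style encoding of a 128x128, 8 fps clip
CLIP_SECONDS = 2.125

tokens_per_second = TOKENS_PER_CLIP / CLIP_SECONDS   # ~602
tokens_per_hour = tokens_per_second * 3600           # ~2.16 million

for hours in (100_000, 5_000_000):                   # 100k hours vs. assumed Sora scale
    total = tokens_per_hour * hours
    print(f"{hours:>9,} hours -> {total / 1e12:.2f} trillion tokens")
# 100,000 hours -> ~0.22 trillion tokens; 5,000,000 hours -> ~10.8 trillion,
# before accounting for resolutions above 128x128 or the image data.
```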

Estimation from Learning Complexity Across Modalities

The complexity of the learning task increases from voice to images to videos, and the demand for data rises correspondingly. If we assume each modality needs roughly an order of magnitude more data than the previous one, we can estimate the amounts: if voice requires on the order of a hundred billion (100B) Tokens, image data would need around a trillion (1,000B) Tokens, and video data around ten trillion (10,000B) Tokens. This estimate aligns closely with the bottom-up calculation above.

Does Sora Make Extensive Use of Regenerated Data?

It’s plausible to assume that Sora primarily trained on a vast amount of data scraped from YouTube, which includes a significant volume of gaming content. Thus, it likely encompasses data produced using 3D rendering technologies like the Unreal Engine (UE) for dynamic camera movements in 3D environments.

Unlike images, video data often lacks precise textual annotations. Sora's technical report explicitly states that all training videos were aligned with text, with captions produced using the re-captioning technique from DALL-E 3. Although this is auto-generated text, achieving full coverage is a substantial undertaking; for anyone else, outsourcing this annotation via API calls to such a model would also be costly.

 

Interestingly, Sora’s practice of extensively using regenerated data was somewhat unexpected: they utilized regenerated data “without reservations.” Besides regenerating all captions (equivalent to video explanations or “scripts”), they also employed GPT-4 to auto-expand user prompts, adding detailed descriptions of video scenes to better align with training data captions, aiding in the creation of more realistic and detail-rich videos.

 

Many believe that high-fidelity game scene data regenerated using UE also formed a significant part of Sora’s training data. Otherwise, its ability to simulate digital world scenes with such high fidelity would be hard to imagine.

 

Previously, the emphasis in the large-model field was on using high-quality "natural data" that had been cleansed and deduplicated, as this was thought to be the correct approach. An underlying assumption was that since natural data was created by humans (amounting to natural "annotation"), AI could learn the real thing by learning from humans, whereas data regenerated by models or engines was considered inferior.

 

For a time, many were somewhat embarrassed to use GPT-4’s regenerated data for training their models (a process referred to as “distillation”). Of course, this was partly due to OpenAI’s somewhat absurd and practically unenforceable stipulation against using its generated data for training other models on a large scale.

 

Calling it absurd stems from the fact that much of the data used to train the GPT series itself wasn’t fully authorized. For instance, The New York Times recently sued them for infringement, and it’s uncertain how many more datasets were used without authorization. In the era of big data and large models, overly emphasizing copyright might hinder AI development, necessitating a balance between copyright protection and technological advancement.

 

The reality is that the quality of data generated by mature AI models, such as GPT-4V, Gemini 1.5, Claude, Midjourney V6 / DALL-E 3, and some domestic models (including Mobvoi's "Sequential Monkey"), often surpasses the average quality of human-generated "natural" data.

 

For many annotation tasks, models are more stable and consistent than human annotators, often matching or exceeding the quality of human annotations under work/time pressures.

 

Moreover, in an era where truth and falsehood are increasingly blurred, regenerated data will soon mix with natural data online, making it impossible to completely distinguish between the two based on quality.

 

Eventually, the supply of human-generated natural data will run out. Humans, as carbon-based life forms, generate data far less efficiently than silicon-based models, which can generate data indefinitely with a power supply.

 

In summary, Sora offers an important insight: as long as there is a clear goal for the needed data, we can confidently use contemporary large models to generate data, replacing costly and inefficient manual annotation. The results generated in this way often outperform those using only natural data.

 

The realization that regenerated data can be utilized on such a scale to achieve remarkable effects is a recent phenomenon, becoming apparent over the last six months. Previous models weren’t reliable enough for people to trust their generated quality.

 

Now, with super-large models outperforming any individual human, we’ve entered an era where the importance of regenerated data and the concept of “models nurturing models” should be emphasized.

 

An additional concern with “models nurturing models” isn’t the fear of insufficient machine quality, but a kind of “psychological barrier” for humans. Theoretically, the automatic feedback loop of regenerated data, feeding models for further training and iteration, paints a “frightening” prospect: AI could theoretically improve and iterate autonomously, rendering humans irrelevant or sidelined. How does humanity, which prides itself as the “crown of creation,” cope with this?

Model Parameter Scale and Training Cost of Sora?

As usual, the ever-less-open OpenAI has not disclosed the parameter scale of Sora, nor the computational or financial cost involved. However, let's attempt an educated guess: intuitively, Sora's parameter scale is likely in the billions, with training costs in the tens of millions of dollars.

 

To estimate the scale and cost of this project, we could look at GPU usage reported in past papers (e.g., GPT, OPT, DiT) in a data-driven, bottom-up approach. Engineers’ real-world experience, such as those from Mobvoi training the “Sequential Monkey” model, has also been considered. However, due to the multitude of variables involved, it’s challenging to arrive at a consistent conclusion.

 

Let’s approach this from a budget perspective, using a top-down method:

 

Estimating from a Computational Power Budget

If I were the CEO of OpenAI, given the importance of the video modality and the fact that video generation technology is still in its early, non-convergent stages, I might decide to allocate a budget equivalent to that of training a language model to develop a visual model. Considering that the mainstream models of large model companies are in the hundred-billion parameter range, what might the computational power budget for training a hundred-billion parameter LLM be?

 

If an LLM has 150 billion (150B) parameters and trains on three trillion Tokens, the computational cost might be around ten million USD.

 

So, what scale of video model training could a ten million USD budget support? To answer this, we need to understand two basic conclusions:

 

  1. The computational power consumption for training an LLM model is roughly proportional to the product of “model parameter scale” and “training data Tokens number.” With the same computational power budget, we can adjust the ratio between the model’s parameter scale and the training data’s scale. The trend over the past year has been towards smaller model parameters but more training data (the so-called big data, small model). Additionally, if the model is smaller, it also benefits inference in terms of cost and speed, under the same training computational power.

  2. Although GPT LLMs are autoregressive models and Sora is a diffusion model, both are built on the Transformer architecture. In theory, the compute for a single training step is comparable given the same parameter scale and sequence length, so past LLM training compute requirements can be used to estimate the training compute for a video diffusion model like Sora. (Note that although diffusion models like DiT require many sampling iterations to produce a result at inference time, this is not needed during training: each training step samples a single random diffusion timestep per example to compute the loss, rather than running many successive denoising steps. Therefore, at the same parameter scale and Token count, the training compute consumed by DiT-style diffusion models and LLMs is comparable.)
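
As a sanity check on the "ten million USD for a 150B-parameter LLM on three trillion Tokens" figure, here is a back-of-the-envelope calculation using the common ~6 × parameters × Tokens FLOPs rule of thumb. The GPU peak throughput, utilization, and hourly price are my own assumptions for illustration, not OpenAI's numbers.

```python
# Rough check of the "~$10M for 150B params on 3T tokens" figure, using the
# standard ~6*N*D training-FLOPs rule of thumb. Throughput, utilization and
# price are assumed values for illustration only.
params = 150e9                      # model parameters (N)
tokens = 3e12                       # training tokens (D)
train_flops = 6 * params * tokens   # ~2.7e24 FLOPs

peak_flops = 312e12                 # assumed A100-class BF16 peak, FLOP/s
utilization = 0.4                   # assumed model FLOPs utilization
price_per_gpu_hour = 1.8            # assumed cloud price, USD

gpu_hours = train_flops / (peak_flops * utilization) / 3600
print(f"{gpu_hours/1e6:.1f}M GPU-hours, ~${gpu_hours * price_per_gpu_hour / 1e6:.0f}M")
# ~6.0M GPU-hours, roughly $11M: the same order as the budget assumed above.
```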

Based on these points, if the budget to train a hundred-billion-parameter language model is ten million USD, and if the training data scale for a video model is similar to that of a language model, then the parameter scale of the video model would also be similar, i.e., in the hundred-billion range. However, given the complexity of video data, its Tokens number might be an order of magnitude higher than text. Therefore, under the same training computational power budget, the video model’s parameter count would likely be an order of magnitude smaller, i.e., in the billion range, aligning with the recent trend of big data and small models.

 

Another reason to believe Sora might be a smaller model (billion rather than hundred-billion scale) is considering the cost of inference: since diffusion models sample multiple times (e.g., 20 times) during inference, a large model could make inference duration and cost problematic for scalable applications. Thus, for Sora to be productized, even if it’s not currently a smaller model, it will inevitably need to optimize towards smaller models in the future.
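
A rough illustration of why inference cost pushes toward smaller diffusion models: per generated clip, compute scales roughly with 2 × parameters × Tokens per forward pass, multiplied by the number of sampling steps. Every number below is an assumed placeholder, not a measured figure for Sora.

```python
# Why model size matters so much at inference time for diffusion video models:
# per-clip compute ~ 2 * params * tokens per forward pass, times sampling steps.
# All numbers are illustrative assumptions.
def clip_inference_flops(params, tokens_per_clip, steps):
    return 2 * params * tokens_per_clip * steps

tokens_per_clip = 1e6               # assumed latent tokens for one clip
for params, steps in [(3e9, 20), (150e9, 20)]:
    flops = clip_inference_flops(params, tokens_per_clip, steps)
    print(f"{params/1e9:>5.0f}B params, {steps} steps -> {flops:.1e} FLOPs per clip")
# A 150B model needs ~50x the compute of a 3B model per clip, which is why a
# smaller model is far easier to productize.
```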

 

Of course, this speculation is very rough, and the actual costs could vary widely, potentially reaching hundreds of millions of dollars or dropping to millions. The model’s parameter scale could also be larger, up to the hundred-billion level, but that would require adjusting the data scale if the budget remains constant.

 

Any change in factors could lead to deviations from these estimates. For example, after a year of efforts in training GPT large language models, the optimization level of GPT models could be much higher than that of Sora-like diffusion models. Training costs depend on the computing framework used, the unit price of computational power, the optimization level of the algorithm framework, the computational utilization rate of GPUs, the proficiency of engineering personnel in training models, etc., so actual costs can vary significantly between companies and even within the same company at different stages.

 

For instance, training a model of the same scale as Mobvoi’s “Sequential Monkey” now might cost half or even less than it did a year ago.

 

It’s worth noting that the tens of millions of dollars cost is the price paid by OpenAI, a pioneer in video models. Over time, as knowledge spreads and various details continue to be optimized, these costs are expected to decrease significantly, potentially by several times or even an order of magnitude. This “latecomer advantage” in training costs has become very apparent in the past year’s “hundred model battle,” as the open-source community and major model companies have raced to catch up with GPT-3.5.

 

Additionally, these calculations are based on the prices of public clouds in the United States. For Chinese entrepreneurs, the cost of computational power could be lower, thanks to the intense competition among various public cloud providers in China.

 

Regardless, I believe that in the near future, with the support of large foundational language models, deep integration of video and language models, and an additional budget of tens of millions of dollars, replicating Sora’s results is highly probable.

What Role Should Large Language Models (LLMs) Play?

Is Sora similar to Gemini or RT-2, utilizing a Large Language Model (LLM) as a starting point for pre-training and then continuing training with visual data? Or is it like SVD, where the language model is frozen during the training of the video model, and the text embeddings generated by the language model serve merely as a condition to guide video generation? What role does the language model actually play?

 

From OpenAI’s technical blog, Sora seems to be more like the latter; it has not yet made extensive, systematic use of LLMs (though this is speculative, as the blog could deliberately avoid certain core technical topics).

 

Regardless of how OpenAI chooses to utilize LLMs, we know that generating videos with consistent narratives over extended durations requires a wealth of world knowledge. The impressive consistency demonstrated by Sora in demo videos suggests some form of LLM enhancement, which is intriguing.

 

If Sora has not deeply integrated with LLMs, how does it learn such rich world knowledge and logic from the “limited” textual data aligned with videos? Or does video data inherently make it easier to learn world knowledge than relying solely on text, with aligned data facilitating a comprehensive understanding of the physical world? The term “limited” is used here relative to the vast amount of text used to train LLMs.

 

Regardless of the actual scenario, future versions of video generation should attempt to utilize LLMs as a starting point, followed by joint training with video and its aligned data. The subsequent process of generating high-quality videos could either involve a separate Diffusion model, or if GPT can support sufficiently long contexts, it could directly use GPT for modeling. Here, the multimodal model based on LLMs is central, modeling the correspondence between text and “video blueprints.”
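
The two roles an LLM can play, frozen conditioner versus jointly trained backbone, can be sketched schematically as follows. The stand-in modules, shapes, and placeholder objective are purely illustrative assumptions, not how Sora, SVD, Gemini, or RT-2 are actually implemented.

```python
# Two ways a language model can participate in a text-to-video system,
# sketched schematically; all modules and shapes are toy assumptions.
import torch

llm = torch.nn.TransformerEncoder(          # stand-in for a pretrained LLM backbone
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), 2)
video_model = torch.nn.TransformerEncoder(  # stand-in for a separate video model
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), 2)

text_emb = torch.randn(2, 32, 512)          # embedded text prompt
video_tokens = torch.randn(2, 256, 512)     # latent video tokens

# (a) SVD-style: LLM frozen, its embeddings only *condition* the video model.
with torch.no_grad():
    cond = llm(text_emb)
out_a = video_model(torch.cat([cond, video_tokens], dim=1))

# (b) Gemini/RT-2-style: start from the LLM weights and keep training the same
#     backbone jointly on interleaved text + video tokens.
out_b = llm(torch.cat([text_emb, video_tokens], dim=1))
loss = out_b.pow(2).mean()                  # placeholder objective
loss.backward()                             # gradients flow through the unfrozen LLM
print(out_a.shape, out_b.shape)
```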

 

Why is the cognitive empowerment of LLMs and their seamless integration with video models so crucial?

 

For videos generated by the model to adhere to physical laws, the model needs a vast repository of world knowledge. Where does this knowledge come from? While we can learn these principles from a large volume of video data, inheriting the vast commonsense knowledge embedded in language models can significantly reduce the quality and quantity requirements for video data, as well as the difficulty of model training.

 

For example, if we ask Sora to generate a video of a cup falling onto the floor, today’s large language models, like Mobvoi’s “Sequential Monkey,” already contain commonsense knowledge that glass breaks and water splashes.

 

With this knowledge, video generation models would not need extensive video data of glasses falling to train, thereby significantly lowering the difficulty of generating realistic videos.

 

Language models also contain descriptions of other physical laws (e.g., optics, collisions), which can be transferred to downstream models in other modalities.

 

The ability of LLMs to transfer knowledge to multimodal models has been repeatedly demonstrated by Google's RT-2 and Gemini. Before Sora's release, Mobvoi officially launched its voice model based on "Sequential Monkey" in DupDub, marking a significant achievement in this direction. By encoding voice with unified Tokens and conducting unified autoregressive multimodal joint training, the new generation of voice synthesis sounds very natural.

 

Evaluations show that DupDub has generated a new generation of high-quality voices characterized by conversational tones, approaching the voice quality of GPT-4.

 

Especially noteworthy is that the language comprehension ability of “Sequential Monkey” naturally transitions to voice generation: language expressing joy naturally generates a voice expression of elation, while reporting bad news carries a “crying tone.” The emotions conveyed by text are naturally integrated by the LLM into the expressive range of the voice.

 

In contrast, before the unified architecture, voice systems, no matter how advanced, had to handle emotion generation through specific processes (considered “hard coding,” rather than the “natural emergence” of voice capabilities in large models).

What Should Be the Focus When Replicating Sora?

Because OpenAI has published no real technical details, the emergence of Sora has made it difficult for people to quickly form a unified opinion, though high-quality technical analyses are still available.


So, what should be the focus when attempting to replicate Sora? Currently, most technical analyses translate or explain OpenAI’s technical blog without delving into the related papers in depth. There’s also an excessive focus on architectures like DiT, which are easier to understand and replicate, neglecting other critical details.

Specifically, efforts to replicate Sora should focus on:


  • The details of image and video encoders/decoders

  • The crawling and processing of high-quality “natural” video data

  • Utilizing other models or engines to generate data

  • Deep integration of video models with large language models

  • Joint training of images and videos and the unified support for various formats (resolution, aspect ratio, duration)

Summary of Key Answers to Sora

In summary, the quantified answers to the opening questions can be encapsulated in a single sentence: billions of parameters, trillions of Tokens, and tens of millions of dollars in training costs.

Here are the specific answers:

 

  1. What is the architecture of Sora?

    • Sora is likely a diffusion video model that replaces the commonly used U-net with Transformer, achieving a massive scale-up of the visual model.

  2. What are its encoder and decoder?

    • Sora’s encoder/decoder likely draws heavily on MAGViT V2 to compress video into spacetime Tokens, integrating the training of images and videos. For various resolutions and aspect ratios, it probably utilizes NaViT’s approach.

  3. How large is the model?

    • Sora’s parameter scale is likely in the order of 10 billion.

  4. How large is the training data?

    • The video data scale is at least several million hours, with the total number of Tokens from images and videos after tokenization likely in the trillion range.

  5. Is there extensive use of model-regenerated data?

    • Given Sora’s complexity and data demands, this is a likely strategy.

  6. What are the model scale and training costs?

    • The training cost for Sora is estimated to be in the tens of millions of dollars.

  7. What role should large language models play?

    • The world knowledge from LLMs can reduce dependency on massive aligned data for video generation and improve the consistency of longer videos.

  8. What should be the focus when replicating?

    • Efficient encoding/decoding techniques, high-quality native and regenerative data, deep integration of video and language models, joint training of images and videos, and unified support for various formats (resolution, aspect ratio, duration).

Conclusion

In the course of human civilization's development, a world that was once hard to imagine is fast approaching: video generation models mark a new starting point toward a future unified by multimodal capabilities.


Reflecting on the progression from ChatGPT to Sora, through exploring two potential architectural approaches, we aim to unlock the future of video generation, emphasizing the importance of integrating text understanding models with video generation technology.


The trend towards unified multimodal models is becoming increasingly clear, driven by data volume, model parameters, computational power requirements, and cost-effectiveness. As technology converges and the open-source ecosystem evolves, the future will see higher-level models achieved at lower costs, opening new possibilities for creating and understanding complex multimodal content.


A scientific battle that respects the essence of technology also requires a profound commitment to science and engineering.


This is a moment of demystification for Sora. As AI entrepreneurs, we no longer feel bewildered or powerless as we did at our first encounter with Sora, but rather we are filled with a more solid strength and conviction.


“Ready for battle, with a determination to confront challenges, always fearing ancestors may have expected more from us.”


Time is of the essence; we must prepare and sharpen our skills.