The latest report from Epoch AI indicates that large language models will exhaust the world's stock of public, high-quality text training data within the next few years. The report predicts that the existing stock of about 300 trillion tokens will be used up between 2026 and 2032, and that model "overtraining" is accelerating the process. The 8B version of Meta's Llama 3 is overtrained by a staggering factor of 100, and if all models adopted this approach, the data could be exhausted as early as 2025. Facing the coming "data shortage", Epoch AI has proposed four potential solutions that point to new directions for data acquisition in artificial intelligence.

The researchers single out "overtraining", that is, training a model on far more data than is compute-optimal for its size, as the main culprit in accelerating the consumption of training data. The 8B version of Meta's latest open-source Llama 3, for example, is overtrained by an astonishing factor of 100! If every model were trained this way, the data could run out as early as 2025.
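
To get a feel for where a figure like "100 times" can come from, here is a rough back-of-the-envelope sketch. It rests on two assumptions that are not stated in this article: the widely cited "Chinchilla" rule of thumb of about 20 training tokens per parameter for compute-optimal training, and Meta's publicly stated figure of roughly 15 trillion training tokens for Llama 3. The function name `overtraining_factor` is made up for illustration, and Epoch AI's own methodology may differ.

```python
# Back-of-the-envelope check of the "100x overtrained" claim for Llama 3 8B.
# The constants marked "assumption" do not come from the article itself.

PARAMS = 8e9                  # Llama 3 8B: about 8 billion parameters
TRAINING_TOKENS = 15e12       # assumption: ~15 trillion training tokens
TOKENS_PER_PARAM = 20         # assumption: Chinchilla rule of thumb
PUBLIC_STOCK_TOKENS = 300e12  # from the article: ~300 trillion public tokens


def overtraining_factor(params, training_tokens, tokens_per_param=TOKENS_PER_PARAM):
    """Ratio of actual training tokens to the compute-optimal token count."""
    optimal_tokens = params * tokens_per_param
    return training_tokens / optimal_tokens


factor = overtraining_factor(PARAMS, TRAINING_TOKENS)
share_of_stock = TRAINING_TOKENS / PUBLIC_STOCK_TOKENS

print(f"Overtraining factor: about {factor:.0f}x")
print(f"Share of public stock used by one training run: {share_of_stock:.0%}")
```

Under these assumptions the factor comes out at roughly 94, which is in the same ballpark as the "100 times" quoted above, and a single training run of this size already touches about 5% of the estimated public stock.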

But don't worry, there is still a way forward. Epoch AI proposes four ways to obtain new training data, so that the "data shortage" in the AI world need not become a nightmare.
1) Synthetic data: Just like a meal made from a cooking kit, synthetic data is produced by deep learning models that imitate real data to generate brand-new data. But don't get too excited: the quality of synthetic data can be uneven, models trained on it are prone to overfitting, and it can lack the nuanced linguistic features of real text.
2) Multi-modal and cross-domain data learning: This method is not limited to text but also draws on images, video, audio, and other data types. It is like a KTV night where you not only sing but also dance and act. Multi-modal learning allows the model to understand and handle complex tasks more comprehensively.
3) Private data: At present, the total amount of private text data in the world is about 3,100 trillion tokens, more than 10 times the roughly 300 trillion tokens of public data! But private data has to be used with care, since privacy and security are serious concerns. Moreover, obtaining and integrating non-public data can be a complex process.
4) Real-time interactive learning with the real world: The model learns and improves through direct interaction with the real world. This approach requires models to be autonomous and adaptable, and able to accurately understand user instructions and act on them in the real world.
The four methods proposed by Epoch AI each have their own advantages and disadvantages and face different challenges. How effectively the data problem is solved will directly affect the future development and application of artificial intelligence technology. Researchers and industry will need to work together to find more effective ways to obtain and use data and so ensure the sustainable, healthy development of artificial intelligence.