Dark Side of the Moon's Kimi Open Platform announced that its Context Caching technology has officially entered public beta. Without raising API prices, the technology can cut developers' cost of using the long-text flagship large models by up to 90% and significantly improve model response speed. By storing frequently requested data in advance, context caching reduces repeated computation and data retrieval, saving time and resources. It is particularly suited to scenarios where a large initial context is requested repeatedly, such as asking many questions against a fixed document.
Yesterday, Dark Side of the Moon's Kimi Open Platform announced that Context Caching has entered public testing. The technology can reduce developers' cost of using the long-text flagship large models by up to 90% while keeping API prices unchanged, and significantly improves model response speed.
Context Caching is an efficient data-management technique that lets a system pre-store large amounts of data or information likely to be requested frequently. When the same information is requested again, the system serves it directly from the cache instead of recomputing it or retrieving it from the original source, saving time and resources. Context Caching is particularly well suited to scenarios with frequent requests that repeatedly reference a large initial context, where it can significantly cut the cost of long-text models and improve efficiency.
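To make the mechanism concrete, here is a minimal, purely illustrative sketch (not the platform's actual implementation): the long, fixed context is stored once under a key and reused across requests, so it does not have to be resent or reprocessed each time.

```python
import hashlib

# Minimal illustration of the idea behind context caching (not Kimi's implementation):
# a long, fixed context is stored once under a key and reused across requests,
# so only the new question needs to be handled on each call.
_context_cache: dict[str, str] = {}

def cache_context(document: str) -> str:
    """Store a long document once and return a key for later reuse."""
    key = hashlib.sha256(document.encode("utf-8")).hexdigest()
    _context_cache[key] = document
    return key

def build_prompt(cache_key: str, question: str) -> str:
    """Reuse the cached document instead of resending and reprocessing it."""
    document = _context_cache[cache_key]  # served from the cache, no re-retrieval
    return f"{document}\n\nQuestion: {question}"

key = cache_context("<placeholder for a ~90,000-word product manual>")
print(build_prompt(key, "What is the warranty period?")[:80])
```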

Specifically, "context caching" can be applied to scenarios where frequent requests and a large number of initial contexts are repeatedly referenced, bringing the following two effects:
Costs reduced by up to 90%: in scenarios that require asking many questions against a fixed document, context caching saves substantial cost. For example, with a hardware product manual of roughly 90,000 words, pre-sales support staff need to run many rounds of Q&A within a short period; after enabling the context cache, the cost drops to about 10% of the original price (see the usage sketch after this list).
Time to first token reduced by 83%: a request to a 128k model normally takes about 30 seconds to return the first token. With context caching, time to first token drops to under 5 seconds on average, a reduction of roughly 83%.
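The announcement includes no code, but the flow it describes (create a cache for a fixed document once, then ask many questions against it) might look roughly like the sketch below. The endpoint path, payload fields, and cache-reference format here are assumptions for illustration, not the platform's documented API; consult the Kimi Open Platform documentation for the actual interface.

```python
import requests

API_BASE = "https://api.moonshot.cn/v1"  # assumed base URL, for illustration only
API_KEY = "sk-..."                       # placeholder key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Step 1 (hypothetical): create a cache holding the fixed ~90,000-word manual once.
# The endpoint name and payload fields are assumptions, not documented API details.
create_resp = requests.post(
    f"{API_BASE}/caching",
    headers=HEADERS,
    json={
        "model": "moonshot-v1-128k",
        "messages": [{"role": "system", "content": "<90,000-word product manual>"}],
        "ttl": 3600,  # assumed: keep the cache alive for one hour
    },
)
cache_id = create_resp.json()["id"]

# Step 2 (hypothetical): each pre-sales question references the cache instead of
# resending the manual, so only the incremental tokens are billed at full price.
chat_resp = requests.post(
    f"{API_BASE}/chat/completions",
    headers=HEADERS,
    json={
        "model": "moonshot-v1-128k",
        "messages": [
            {"role": "cache", "content": f"cache_id={cache_id}"},  # assumed reference format
            {"role": "user", "content": "What is the warranty period?"},
        ],
    },
)
print(chat_resp.json())
```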
Context Caching is billed in three parts:

Cache creation fee: when the cache-creation interface is called and the Cache is created successfully, the actual number of tokens stored in the Cache is billed once, at 24 yuan per million tokens.

Cache storage fee: while the Cache is alive, storage is billed per minute, at 10 yuan per million tokens per minute.

Cache call fee: incremental tokens in a Cache call are billed at the model's original price. In addition, a per-call charge applies: while the Cache is alive, when a user requests a successfully created Cache through the chat interface and the chat message content matches a live Cache, a fee of 0.02 yuan per call is charged. A worked cost calculation follows below.
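To make the price list concrete, here is a small cost calculation for the fixed-document Q&A scenario above, using the published rates (24 yuan/M tokens for creation, 10 yuan/M tokens/minute for storage, 0.02 yuan per cache hit). The token count, cache lifetime, number of questions, and the model's per-token price used for comparison are illustrative assumptions, not figures from the announcement.

```python
# Published Context Caching rates (from the announcement).
CREATE_PER_M_TOKEN = 24.0       # yuan per million cached tokens, one-time
STORAGE_PER_M_TOKEN_MIN = 10.0  # yuan per million cached tokens per minute alive
CALL_FEE = 0.02                 # yuan per successful cache hit

def cache_cost(cached_tokens: int, minutes_alive: float, calls: int) -> float:
    """Total Context Caching cost (creation + storage + per-call fees), in yuan.
    Incremental question/answer tokens are still billed at the model's original
    price and are omitted here for brevity."""
    millions = cached_tokens / 1_000_000
    return (millions * CREATE_PER_M_TOKEN
            + millions * minutes_alive * STORAGE_PER_M_TOKEN_MIN
            + calls * CALL_FEE)

# Illustrative scenario (assumed numbers): a ~90,000-word manual of ~100k tokens,
# cached for 10 minutes while 50 questions are asked against it.
tokens, minutes, calls = 100_000, 10, 50
with_cache = cache_cost(tokens, minutes, calls)

# Without caching, the full manual would be billed as input on every call.
# The model's input price is not given in the announcement; 60 yuan/M tokens is
# an assumed placeholder purely for comparison.
ASSUMED_INPUT_PRICE_PER_M = 60.0
without_cache = calls * (tokens / 1_000_000) * ASSUMED_INPUT_PRICE_PER_M

print(f"with cache:    {with_cache:.2f} yuan")   # 2.40 + 10.00 + 1.00 = 13.40
print(f"without cache: {without_cache:.2f} yuan") # 300.00 under the assumed price
```

Under these assumed numbers the cached workflow costs a small fraction of resending the document on every request, which is consistent with the announced reduction of up to 90%.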
In short, the Kimi Open Platform's Context Caching technology gives developers a more cost-effective option, markedly reducing the usage cost and response latency of long-text large models and improving development efficiency. This matters greatly for application scenarios that need to process large volumes of text data.