DeepSeek’s AI Model Shows Similarities to Google’s Gemini

Earlier this year, OpenAI disclosed evidence that DeepSeek, a Chinese AI lab, had used a technique known as distillation to train its artificial intelligence models. Distillation trains a smaller AI model on the outputs of a larger, more sophisticated one. Microsoft, a major partner of OpenAI, raised concerns about unusual data extraction from developer accounts linked to DeepSeek, incidents that reportedly occurred in the last quarter of 2024. Used legitimately, distillation enables more efficient and streamlined model development, potentially leading to breakthroughs in various applications, including lab protocol generation.
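To make the idea concrete, here is a minimal, hypothetical sketch of the core of distillation (a toy illustration, not DeepSeek's actual pipeline): a "student" model is scored on how closely its output distribution matches the temperature-softened distribution of a "teacher" model. All function names and example logits below are invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's softened distribution -- the core term of a distillation loss."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# A higher temperature exposes the teacher's relative confidence across
# all answers, not just its top prediction.
teacher = [4.0, 1.0, 0.2]
aligned_student = [3.9, 1.1, 0.1]
misaligned_student = [0.1, 1.1, 3.9]
assert distillation_loss(aligned_student, teacher) < distillation_loss(misaligned_student, teacher)
```

In practice the student is trained by gradient descent to minimize a loss of this shape over the teacher's outputs, which is what makes access to a sophisticated model's responses so valuable as training data.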

DeepSeek recently came under increased scrutiny following the unveiling of its latest model, R1-0528, an updated version of its R1 reasoning model launched just a week ago. Despite demonstrating impressive capabilities in mathematical reasoning and code generation tasks, R1-0528 has sparked controversy, not over its performance but over the provenance of its training data.

DeepSeek has remained largely silent regarding the specific dataset used to train R1-0528, fueling speculation within the developer community. Theories have emerged suggesting that outputs from Google’s Gemini may have been a key component of the training data.

Melbourne-based developer Sam Paech, known for creating “emotional intelligence” assessments for AI systems, has shared evidence that may point to the DeepSeek model being trained on Gemini outputs. In a post on the developer platform X, Paech noted that R1-0528 often favors “words and expressions similar to those that Google’s Gemini 2.5 Pro prefers.” The observation has fueled speculation that DeepSeek may have shifted from synthetic OpenAI data to synthetic Gemini outputs for training its models.
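Evidence of this kind rests on comparing the wording two models tend to produce. As a hypothetical illustration (not Paech's actual methodology), one simple way to quantify whether two models "prefer" similar phrasing is to measure the Jaccard overlap of word n-grams in their outputs; the sample sentences below are invented for the example.

```python
def ngrams(text, n=2):
    """Return the set of word n-grams (default bigrams) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(text_a, text_b, n=2):
    """Jaccard index of the two texts' n-gram sets: |A & B| / |A | B|."""
    a, b = ngrams(text_a, n), ngrams(text_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented sample outputs: two models with similar phrasing habits
# score higher against each other than against a third.
model_x = "let us delve into the key considerations of this problem"
model_y = "we should delve into the key considerations carefully"
model_z = "the answer follows directly from the given constraints"
assert jaccard_similarity(model_x, model_y) > jaccard_similarity(model_x, model_z)
```

Real stylometric analyses are far more sophisticated, comparing large corpora of outputs rather than single sentences, but the underlying intuition is the same: distinctive shared phrasing is a statistical fingerprint.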

The potential shift in DeepSeek’s data sourcing strategy raises important questions about transparency and ethics in AI training. If DeepSeek is indeed using Gemini outputs to train its models, it would suggest a growing trend of AI labs leveraging data from other sophisticated AI systems to fast-track model development. This approach is not without advantages: it can make AI development more efficient and potentially yield models with improved performance across a range of applications.

However, the concerns raised by Microsoft regarding unusual data extraction from developer accounts associated with DeepSeek underscore the need for transparency and ethical guidelines in AI development. The AI community, stakeholders, and regulatory bodies must continue to scrutinize and debate these practices to ensure that the rapid advancement of AI technology is balanced with respect for data privacy and intellectual property rights.

Read more from techjuice.pk
