HLBank Research Highlights

2025 Eye on the Market Outlook

HLInvest
Publish date: Fri, 31 Jan 2025, 10:03 AM
This blog publishes research reports from Hong Leong Investment Bank

The sincerest form of flattery: on DeepSeek, NVIDIA, OpenAI and the futility of US chip bans

The DeepSeek episode can be two things at once: (i) a reflection of impressive Chinese AI innovation in the face of US chip bans and other restrictions, and (ii) the by-product of probable terms of service and copyright violations by DeepSeek against OpenAI. A Shakespearean irony: OpenAI may have had its terms of service violated after spending years training its own models on other people’s data.

What did DeepSeek announce? Let’s start with its V3 model introduced last December:

  • DeepSeek appears to have trained its models 45x more efficiently than other leading-edge models. To be clear, most of DeepSeek’s approaches already existed. Its greatest accomplishment: figuring out how to deploy them all at once in the face of a chip ban, and to introduce its own self-reinforcement learning
  • Mixture of Experts: GPT-3.5 uses its entire model for both training and inference to solve problems despite the fact that only a small part of the model might be needed. In contrast, GPT-4 and DeepSeek are mixture of experts (MoE) models which only activate the parts of the model that are needed to solve each problem. DeepSeek V3 is quite massive with 671 billion parameters, but only 37 billion are active at any given time
  • MLA refers to “multi-head latent attention”, which is jargon for how DeepSeek maintains a smaller memory cache while running
  • Other DeepSeek efficiency approaches: while parameters are stored with BF16 or FP32 precision, they are reduced to FP8 precision for training purposes. The models also use multi-token prediction (MTP) rather than just predicting the next token, which reduces accuracy by ~10% but doubles inference speed
  • DeepSeek claims that V3 was very cheap to train, requiring about 2.8 million H800 GPU hours, which at a cost of $2/GPU hour is just $5.6 million. The comparable number of GPU hours for the Llama 3.1 405B final training run was ~10x higher. DeepSeek made clear that this was the cost of the final training run, excluding “costs associated with prior research and ablation experiments on architectures, algorithms or data”
  • DeepSeek V3 performance is competitive with OpenAI's 4o and Anthropic's Sonnet-3.5 and appears to be better than Llama's biggest model with lower training costs. DeepSeek provides API access at $0.14 per million tokens while OpenAI charges $7.50 per million tokens; perhaps some degree of loss leader pricing
  • DeepSeek may have “over-specified” its model: it tortured it to do well on the MMLU benchmark but when the questions changed slightly, its performance declined at a faster rate than other models did. More analysis is needed to determine whether this overspecialization is a broader issue
  • DeepSeek just announced another release this morning: a multimodal model (text, image generation and interpretation). Unsurprisingly, DeepSeek makes no pretense of data privacy and stores everything
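
The headline numbers in the bullets above are easy to sanity-check. The sketch below is only back-of-envelope arithmetic: the constant and function names are my own, every input is a figure quoted above, and the one exception is the GPU-hour count, where I use the unrounded 2.788 million hours from DeepSeek's V3 technical report.

```python
# Back-of-envelope checks on the figures quoted in the bullets above.
# Names are illustrative; inputs come from the text, except GPU_HOURS,
# which uses the unrounded figure from DeepSeek's V3 technical report.

GPU_HOURS = 2.788e6        # H800 GPU hours, final training run only
USD_PER_GPU_HOUR = 2.00    # rental rate assumed in the text
TOTAL_PARAMS = 671e9       # DeepSeek V3 total parameters
ACTIVE_PARAMS = 37e9       # parameters active at any given time (MoE)
DEEPSEEK_USD_PER_M_TOKENS = 0.14
OPENAI_USD_PER_M_TOKENS = 7.50


def training_cost(gpu_hours: float, usd_per_hour: float) -> float:
    """Cost of a training run priced at a flat hourly GPU rental rate."""
    return gpu_hours * usd_per_hour


def active_share(active: float, total: float) -> float:
    """Fraction of a mixture-of-experts model that is active at once."""
    return active / total


def price_multiple(expensive: float, cheap: float) -> float:
    """How many times more expensive one API price is than another."""
    return expensive / cheap


if __name__ == "__main__":
    cost = training_cost(GPU_HOURS, USD_PER_GPU_HOUR)
    share = active_share(ACTIVE_PARAMS, TOTAL_PARAMS)
    multiple = price_multiple(OPENAI_USD_PER_M_TOKENS, DEEPSEEK_USD_PER_M_TOKENS)
    print(f"Final-run training cost: ${cost / 1e6:.2f} million")
    print(f"Share of parameters active: {share:.1%}")
    print(f"OpenAI price as a multiple of DeepSeek's: {multiple:.1f}x")
```

None of this captures research, ablation or data costs, which DeepSeek itself excluded from the training figure.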

Did DeepSeek V2/V3 models benefit from “distillation”, which entails training a model by accessing other AI models? It sure looks that way

  • DeepSeek trained its models on 14.8 trillion tokens, which is a massive sample similar to Llama
  • Some AI analysts believe that DeepSeek sent prompts to a GPT-4 or ChatGPT teacher model, and then used the responses to train its own student model, at least for part of the training process. Companies like OpenAI do this when deriving GPT-4 Turbo from GPT-4, but they are training their own models. Companies like OpenAI and Anthropic typically make clear that it would be a terms of service violation to use their models to train another model (although start-ups and researchers probably do this all the time such as the Stanford Alpaca project, which disclosed what it did)
  • Going forward, will OpenAI and other LLM companies more aggressively monitor how/who/when/why their models are being used and control access to them via IP address banning or rate limiting? And will start-ups figure out ways to mask themselves?
  • As for the open-source approach DeepSeek is taking, we wrote about such risks to closed source models a year ago. The chart below shows how adapted open-source models performed just as well as closed source models across multiple domains. We also cited the leaked and infamous Google memo entitled “We have no moat…and neither does OpenAI”
  • The open-source issue may be a catalyst for the gradual divorce between Microsoft and OpenAI. Microsoft presumably wants to provide inference to customers, but may be reluctant to fund billions for data centers to train models that may end up commoditized
  • When answering questions about VW car sales in China, ChatGPT, Grok and Gemini all gave very different answers, while DeepSeek’s answer was almost identically worded to ChatGPT
  • Formatting is another highly identifiable LLM footprint. When asked to program an impossible graphics function, DeepSeek’s answer was 95% similar to ChatGPT but very different from the garbage that Copilot, Grok and Gemini produced
  • Why would a Chinese chatbot be trained on what happened at Tiananmen Square in 1989, and be so easily cajoled into talking about it? Why does it talk about Presidents and “best cities to live in” by talking about American ones, even when asked in German?
  • Why would a Chinese chatbot refer to a single-party state as being “a dictatorship” and reject the one-party system unless it was trained on Western data with strong ideological beliefs?
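
For readers unfamiliar with the mechanics, the teacher/student approach described above boils down to a data-collection loop. What follows is a minimal sketch under loud assumptions: `build_distillation_set` and `toy_teacher` are hypothetical names of mine, the teacher is a local stand-in function rather than any real model or API, and nothing here reflects what DeepSeek actually did.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    prompts: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Send each prompt to the teacher and keep the (prompt, response) pairs.

    The resulting pairs are what a student model would later be
    fine-tuned on; the fine-tuning step itself is not shown.
    """
    examples = []
    for prompt in prompts:
        response = teacher(prompt)
        if response.strip():  # drop empty or whitespace-only answers
            examples.append({"prompt": prompt, "completion": response})
    return examples


def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a hosted teacher model: upper-cases the prompt."""
    return prompt.upper()


if __name__ == "__main__":
    pairs = build_distillation_set(["what is a moat?", "   "], toy_teacher)
    for pair in pairs:
        print(pair)
```

The point of the sketch is how little the student side needs: prompts in, responses out, and a supervised training set at the end. That is also why the footprints listed above (near-identical wording and formatting) are the kind of evidence analysts look for.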

Why did DeepSeek’s R1 announcement clobber NVIDIA, and what are implications for OpenAI and Anthropic?

  • DeepSeek’s R1 is a chain-of-thought reasoning model like OpenAI's o1. It can think through a problem and produce higher quality results in areas like coding, math and logic. As shown in the chart on the first page, R1 offers similar performance at lower costs
  • The most important aspects of DeepSeek’s R1 model were already known a month ago when DeepSeek V2/V3 was released. Equity markets started to pay more attention when DeepSeek’s app became more popular than ChatGPT in the App Store
  • One market shock: even after acknowledging DeepSeek’s probable piggy-backing off of OpenAI, China is further along on AI-LLMs than many market participants thought. AI-LLM breakthroughs are no longer just in the US domain
  • Another market shock: more efficient training/inference processes and possible alternatives to NVIDIA software could eventually affect long run projections of NVIDIA’s order book. One example: a company could conceivably run inference models on AMD GPUs which are half the price of NVIDIA on a $/FLOP basis, if DeepSeek coding disclosures help users mitigate AMD’s inferior chip-to-chip communications capabilities
  • I’ve read in a few places that the US chip ban on China indirectly led to DeepSeek’s success: by forcing China to innovate with less cutting edge hardware and software, Chinese engineers figured it out and developed innovations along the way. One thing’s for sure: DeepSeek’s intention to make everything public stands in stark contrast to OpenAI’s pronouncements at the time of GPT-2’s release that they would not release datasets, training codes or model weights due to concerns of such data being misused by the great unwashed proletariat
  • Where to from here for OpenAI, Anthropic, Cohere, Mistral, etc? The questions on how closed source AI models will monetize IP become more challenging to answer. Even Sam Altman acknowledged last night that “DeepSeek’s R1 is an impressive model, particularly around what they are able to deliver for the price”

What does the long run look like for big tech and consumer companies?

  • Model commoditization and cheaper inference is probably good for Big Tech and large consumer-facing companies in the long run. The cost of providing inference models to customers would go down, which could increase AI adoption. That said, I cannot stop thinking about the massive amounts of money spent already on AI compute infrastructure, which we discussed on the first page of last week’s piece
  • Amazon could benefit; it hasn’t created its own high-quality model, but can now benefit from low-cost, high quality open source models like DeepSeek
  • Apple’s hardware could benefit from cheaper and more efficient inference models
  • Meta could benefit as well since almost every aspect of its business is AI related at this point, although it will be important to follow the impact on Llama
  • Google may be less well positioned: in a world of possibly decreased hardware requirements, Google’s TPUs are less of an advantage. Also, lower inference costs may increase the viability and likelihood of products that displace Google search
  • All of these implications depend on whether DeepSeek and other low-cost, open-source models can thrive in a world where training data might not be as readily available.

How powerful are NVIDIA moats?

  • Most AI projects rely on NVIDIA’s CUDA software, which only works on NVIDIA chips. NVIDIA drivers are battle-tested and perform well on Linux (unlike AMD which is notorious for low quality and instability of their Linux drivers), and benefit from highly optimized open-source code in libraries like PyTorch. NVIDIA also has a huge lead in terms of its ability to combine multiple chips together into one large virtual GPU. NVIDIA’s industry-leading interconnect technology dates back to its purchase of Mellanox in 2019
  • But there have been competitors circling around NVIDIA for a while: Cerebras (create one massive chip rather than a lot of little ones, thus eliminating interconnection challenges); Groq (deterministic computing chips that can offer better economics if GPU utilization rates are high enough); and several projects that are attempting to design code that works on a variety of different GPUs and TPUs (MLX, sponsored by Apple; Triton, sponsored by OpenAI; and JAX, developed by Google)
  • Yesterday was a “shoot first, ask questions later” market response; NVIDIA P/E based on forward earnings expectations declined towards the very low end of the range since 2020, assuming no material changes to NVIDIA’s order book…and that’s the big question.

What about implications for energy consumption due to more energy efficient training and inference models?

  • We should all dial down the frenzy about increased electricity demand from data centers. Even before DeepSeek, there were already strong incentives to reduce training and computation costs by developing more energy efficient chips and to develop and apply software innovations that require less training, fewer model solutions and much less movement of model solutions between nodes/chips on the network
  • Politics may slow US electricity demand growth as well. We will cover Trump 2.0 energy policies in more detail in the energy paper in March. The short version: solar, wind, battery, EV, carbon capture and other tax credits might be reduced through a Congressional reconciliation bill in which these reductions pay for tax cuts. Remember: tariffs don’t count towards reported fiscal outcomes unless they’re legislated (if tariffs are simply imposed by the President, they would not count as revenue offsets in a reconciliation process)
  • The low end of the US electricity demand forecast above is growth of just 7%, even after including EVs, electrification of home heating and new data centers

Source: Hong Leong Investment Bank Research - 31 Jan 2025
