diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..bda3692 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the more closed approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. +The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Alongside these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 relies on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL. +2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a thinking tag, before answering with a final summary.
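+
To make the format concrete, here is an illustrative, made-up example of what an R1-style response looks like, with the reasoning wrapped in a thinking tag followed by a short final answer:
```
<think>
The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.
</think>
17 * 24 = 408.
```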
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. +R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training strategy: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF +R1-Zero: Pretrained → RL +R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from. +First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing. +Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. +Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities. +Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1. +They also did model distillation for several Qwen and Llama models on the reasoning traces to get the distilled-R1 models.
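+
To make the flow easier to follow, here is a minimal, purely illustrative sketch of the pipeline above; the stage functions are placeholder stubs standing in for full training runs, not real APIs:
```python
# Purely illustrative stubs; each would be a full training run in practice.
def sft(model: str, data: str) -> str:            # supervised fine-tuning stage
    return f"{model} fine-tuned on {data}"

def grpo_rl(model: str, rewards: list) -> str:    # GRPO reinforcement-learning stage
    return f"{model} RL-tuned with rewards {rewards}"

def rejection_sample(model: str, n: int) -> str:  # keep only high-quality samples
    return f"{n} reasoning samples from {model}"

base = "DeepSeek-V3-Base"

# Stage 1: cold-start SFT on a few thousand CoT samples
checkpoint = sft(base, "cold-start CoT samples")
# Stage 2: rule-based RL for reasoning correctness and formatting
checkpoint = grpo_rl(checkpoint, ["accuracy", "format"])
# Stage 3: rejection sampling (~600k reasoning) + ~200k general SFT samples,
#          applied to the base model again rather than to the RL checkpoint
sft_data = rejection_sample(checkpoint, 600_000) + " + 200k general samples"
checkpoint = sft(base, sft_data)
# Stage 4: second RL stage adds helpfulness/harmlessness rewards
deepseek_r1 = grpo_rl(checkpoint, ["accuracy", "format", "helpfulness", "harmlessness"])
print(deepseek_r1)
```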
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. +The teacher is typically a larger model than the student.
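+
A minimal sketch of the data-generation side of distillation, assuming a hypothetical model name; the actual distilled-R1 models were produced by fine-tuning Qwen and Llama checkpoints on R1's reasoning traces with plain SFT:
```python
# Sketch only: "teacher-reasoning-model" is a hypothetical placeholder name.
from transformers import pipeline

teacher = pipeline("text-generation", model="teacher-reasoning-model")

prompts = ["Prove that the sum of two even numbers is even."]

# 1) The teacher generates reasoning traces for a pool of prompts.
distill_data = []
for prompt in prompts:
    completion = teacher(prompt, max_new_tokens=512)[0]["generated_text"]
    distill_data.append({"prompt": prompt, "completion": completion})

# 2) The student is then trained on these (prompt, completion) pairs with
#    ordinary supervised fine-tuning, not RL.
```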
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. +They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. +Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. +Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting (e.g., reasoning inside thinking tags), and if the language of the answer matches that of the prompt. +Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
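+
As an illustration of what such rule-based rewards can look like (the exact rules and weights DeepSeek used are not published as code, so treat this as a sketch), a reward function might combine a correctness check, a format check, and a crude language-consistency check:
```python
import re

def rule_based_reward(response: str, reference_answer: str, prompt_is_english: bool) -> float:
    """Toy reward: correctness + format + language consistency."""
    reward = 0.0

    # 1) Correctness: compare the text after the thinking block to a reference.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer:
        reward += 1.0

    # 2) Format: the chain of thought must sit inside thinking tags.
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.5

    # 3) Language consistency: penalize CJK characters in a response to an
    #    English prompt (a deliberately crude heuristic for illustration).
    if prompt_is_english and re.search(r"[\u4e00-\u9fff]", response):
        reward -= 0.5

    return reward
```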
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates several different responses. +2. Each response gets a scalar reward based on factors like accuracy, format, and language consistency. +3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others. +4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
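+
A minimal sketch of step 3, the group-relative advantage computation, assuming one prompt and a group of scalar rewards (this mirrors the normalize-by-group-mean-and-std idea from the DeepSeekMath paper, not DeepSeek's actual code):
```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize each response's reward against the group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled responses to the same prompt, scored by a rule-based reward.
print(group_relative_advantages([1.5, 0.0, 1.5, 0.5]))
# Responses above the group mean get positive advantages; the policy update then
# uses these in a PPO-style clipped objective with a KL penalty toward a reference model.
```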
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the thinking-tag syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. +Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
To put it simply, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. +Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. +The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
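+
A representative llama.cpp invocation for this kind of setup (the GGUF file name, thread count, and prompt are placeholders; the flag values follow the configuration described above):
```bash
./llama-cli \
  --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --n-gpu-layers 29 \
  --threads 26 \
  --prompt "Why is the sky blue?"
```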
+
29 layers seemed to be the sweet spot given this configuration.
+
Performance:
+
A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup. +Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these huge models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher. +We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
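+
Running it is a one-liner, assuming the 70B tag from Ollama's deepseek-r1 library entry:
```bash
ollama run deepseek-r1:70b
```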
+
GPU utilization shoots up here, as expected, compared to the mostly CPU-powered run of 671B that I showcased above.
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning +[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models +DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube) +DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs +The Illustrated DeepSeek-R1 - by Jay Alammar +Explainer: What's R1 & Everything Else? - Tim Kellogg +DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com +GitHub - deepseek-ai/DeepSeek-R1 +deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images. +DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques. +DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage. +DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective. +DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling. +DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. +DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University reproduces R1 results (Jan 25, '25). +- Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25). +- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file