Tweets

Andrej Karpathy

· 2d ago

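Excited to release a new repo: nanochat!

Unlike my earlier nanoGPT repo, which covered only pretraining, nanochat is a minimal, full-stack training/inference pipeline for a simple ChatGPT clone, all in a single, dependency-minimal codebase.

You boot up a cloud GPU box, run a single script, and as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.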

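It is about 8,000 lines of what I consider quite clean code to:

- Train the tokenizer using a new Rust implementation.
- Midtrain on user-assistant conversations from SmolTalk, multiple-choice questions, and tool use.
- Run SFT, then evaluate the chat model on world-knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), and code (HumanEval).
- Optionally RL the model on GSM8K with GRPO.
- Run efficient inference in an engine with a KV cache, simple prefill/decode, and tool use (a Python interpreter in a lightweight sandbox); talk to the model over a CLI or a ChatGPT-like WebUI.
- Write a single Markdown report card that summarizes and gamifies the whole run.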
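For readers unfamiliar with GRPO, here is a minimal sketch of the group-relative advantage it is usually described with: sample several completions for the same prompt, score each one, and normalize each reward against the rest of the group. The function name, the `eps` value, and the example rewards are illustrative assumptions. nanochat's own RL code may differ, for instance in whether it divides by the standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scores for several completions of the SAME prompt
             (e.g. 1.0 if a GSM8K answer is correct, else 0.0).
    eps:     hypothetical small constant to avoid division by zero
             when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem: two correct, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer is wrong carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, so no separate value network is needed to provide a baseline.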

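Even for as little as about $100 (about 4 hours on an 8XH100 node), you can train a small ChatGPT clone that can hold simple conversations, write stories and poems, and answer simple questions. After about 12 hours of training, the model surpasses GPT-2 on the CORE metric.

As you scale up toward about $1,000 (about 41.6 hours of training), the model becomes much more coherent and can solve simple math and code problems and take multiple-choice tests. For example, a depth-30 model trained for 24 hours scores in the 40s on MMLU, the 70s on ARC-Easy, and the 20s on GSM8K. *(Note: 24 hours of training is roughly equal to the FLOPs of GPT-3 Small 125M, and about 1/1000th of GPT-3.)*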
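As a quick sanity check on the budget figures above, the sketch below derives the hourly node price implied by each tier. Only the dollar amounts and hour counts come from the thread. The variable names are made up for illustration, and the per-GPU rate and the 24-hour cost estimate are derived values, not quoted prices.

```python
GPUS_PER_NODE = 8  # "8XH100 node" from the thread

# (approximate budget in USD, approximate training hours), as quoted above
speedrun = (100.0, 4.0)
scaled_up = (1000.0, 41.6)

# Implied price of the whole node per hour for each tier
rate_speedrun = speedrun[0] / speedrun[1]      # 25.0 USD per node-hour
rate_scaled_up = scaled_up[0] / scaled_up[1]   # about 24.04 USD per node-hour

# Implied price per GPU-hour (derived, not a quoted price)
gpu_rate = rate_scaled_up / GPUS_PER_NODE

# Rough cost of the 24-hour depth-30 run at the same implied rate
cost_24h = 24.0 * rate_scaled_up

print(f"node-hour: ${rate_speedrun:.2f} vs ${rate_scaled_up:.2f}")
print(f"GPU-hour:  ${gpu_rate:.2f}")
print(f"24h run:   ${cost_24h:.0f}")
```

Both tiers imply roughly $24 to $25 per node-hour, which is why $1,000 maps to the oddly specific 41.6 hours.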

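My goal is to get a complete, strong stack into one cohesive codebase that is easy to read, modify, and fork, as a foundation for LLM development. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has the potential to grow into a research harness or a benchmark, similar to nanoGPT before it.

A link to the repo and a detailed walkthrough of the nanochat speedrun are in the reply.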

Andrej Karpathy

· 2d ago

Basically Llama-like, a bit simpler, some influences from modded-nanoGPT. Tried to find a solid baseline for this scale:

- dense transformer
- rotary embeddings (and no positional embeddings)
- QK norm
- untied weights for embedding and unembedding
- norm after token embedding
- relu^2 activation in MLP
- no learnable params in rmsnorm
- no biases in linear layers
- Multi-Query Attention (MQA)
- logit softcap
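Three of the items above are small enough to sketch in plain Python on lists of floats: the relu^2 activation, RMSNorm without learnable parameters, and a logit softcap. This is an illustrative sketch, not the repo's code. The function names, the `eps` value, and the cap of 15.0 are assumptions, and the tanh form of the softcap is the commonly used one rather than something stated in the thread.

```python
import math


def relu_squared(x):
    """relu^2 activation: zero for negative inputs, x**2 otherwise."""
    return max(0.0, x) ** 2


def rms_norm(v, eps=1e-6):
    """RMSNorm with no learnable gain: divide by the root mean square.

    eps is an assumed small constant for numerical stability.
    """
    mean_square = sum(x * x for x in v) / len(v)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [x * scale for x in v]


def softcap(logits, cap=15.0):
    """Smoothly squash logits into [-cap, cap] using tanh.

    The cap value here is an assumption chosen for illustration.
    """
    return [cap * math.tanh(z / cap) for z in logits]


hidden = [3.0, -4.0, 0.5, 2.0]
normed = rms_norm(hidden)
activated = [relu_squared(x) for x in normed]
capped = softcap([0.1, 5.0, 40.0, -1000.0])
```

After `rms_norm` the vector has a mean square of about 1 whatever its input scale, and `softcap` leaves small logits nearly unchanged while bounding extreme ones.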

Optimizer is Muon+AdamW, heavily influenced by modded-nanoGPT. I have a TODO to tune the Adam LRs well (e.g. per module) so I can remove Muon; I haven't tried hard enough yet.

Andrej Karpathy

· 2d ago

@zenitsu_aprntc Good question, it's basically entirely hand-written (with tab autocomplete). I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful, possibly the repo is too far off the data distribution.

Andrej Karpathy

· 2d ago

GitHub repo:
github.com/karpathy/nanoc…

A lot more detailed and technical walkthrough:
github.com/karpathy/nanoc…

Example conversation with the $100, 4-hour nanochat in the WebUI. It's... entertaining :) Larger models (e.g. a 12-hour depth 26 or a 24-hour depth 30) quickly get more coherent.

Andrej Karpathy

· 2d ago

And an example of some of the summary metrics produced by the $100 speedrun in the report card to start. The current code base is a bit over 8000 lines, but I tried to keep them clean and well-commented.

Now comes the fun part - of tuning and hillclimbing.
