1. Data Construction

This section introduces the methods and workflow for constructing data for the pre-training stage.

References
- HuggingFace FineWeb dataset construction guide: FineWeb: decanting the web for the finest text data at scale (a Hugging Face Space by HuggingFaceFW)
- Paper accompanying the FineWeb dataset construction guide: The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
- Extracting data from PDFs: FinePDFs: Liberating 3T of the finest tokens from PDFs (a Hugging Face Space by HuggingFaceFW)
- Extending to other languages: Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks (a Hugging Face Space by HuggingFaceFW)
- Developing an LLM: Building, Training, Finetuning

