This page introduces the methods and workflow for building datasets for the pre-training stage.
References
Hugging Face FineWeb dataset construction guide: FineWeb: decanting the web for the finest text data at scale (a Hugging Face Space by HuggingFaceFW)
Paper accompanying the FineWeb dataset construction guide: The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Extracting data from PDFs: FinePDFs: Liberating 3T of the finest tokens from PDFs (a Hugging Face Space by HuggingFaceFW)
Extending to other languages: Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks (a Hugging Face Space by HuggingFaceFW)
Developing an LLM: Building, Training, Finetuning