Ruida Docs

1. Data Construction

An introduction to the methods and workflow for constructing data in the pre-training stage.

  • References

    • Hugging Face FineWeb dataset construction guide: FineWeb: decanting the web for the finest text data at scale (a Hugging Face Space by HuggingFaceFW)

    • Paper accompanying the FineWeb dataset construction guide: The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    • Extracting data from PDFs: FinePDFs: Liberating 3T of the finest tokens from PDFs (a Hugging Face Space by HuggingFaceFW)

    • Extending to other languages: Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks (a Hugging Face Space by HuggingFaceFW)

    • Developing an LLM: Building, Training, Finetuning


Last updated 46 minutes ago


Created By Ruida