What Can We Learn from LLMs? Train a model to train a model

LLM-stages

The training phases of an LLM

Data Collection & preprocessing

Gold in, gold out

Quality & accuracy (noise, outliers, errors, inconsistency & incoherence, biases)

Filtering / curation / cleaning, including checks for profanity, sensitive information, and toxicity
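A minimal sketch of the filtering and cleaning step, to make it concrete. The blocklist, the e-mail pattern, the `[EMAIL]` placeholder, and the `min_words` threshold are all illustrative assumptions; real pipelines typically rely on curated lexicons or trained classifiers rather than a hard-coded word list.

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKLIST = {"badword"}

# Simple pattern for e-mail-like strings (an approximation, not a full validator).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean_record(text, min_words=3):
    """Return cleaned text, or None if the record should be dropped."""
    text = " ".join(text.split())  # collapse runs of whitespace
    words = text.lower().split()
    if len(words) < min_words:  # too short to be useful
        return None
    if any(w.strip(".,!?") in BLOCKLIST for w in words):  # profanity check
        return None
    return EMAIL_RE.sub("[EMAIL]", text)  # mask sensitive info
```

Records that return `None` are dropped; the rest continue to the next stage with e-mail-like strings masked.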

Scale & diversity (de-duplication)

Categorization (summarization, Q&A, code, math, languages, tables, charts, numbers, PDFs, emails)

Sentiment classification

Topic classification
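De-duplication can be sketched as hashing a normalized copy of each document and keeping only the first occurrence. This is exact-match de-duplication under an assumed normalization (lowercasing and whitespace collapsing); near-duplicate detection, for example with MinHash, is a separate technique not shown here. The function names are hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Storing hashes rather than full texts keeps memory use bounded per document, which matters at corpus scale.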

Data augmentation

Unified format, e.g., JSON format

Structuring e.g., pairs for Q&A tasks
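As a sketch of the unified-format and structuring steps, Q&A pairs can be serialized as one JSON object per line (JSON Lines). The field names `task`, `prompt`, and `response` are assumptions chosen for illustration, not a required schema.

```python
import json


def to_jsonl(pairs, task="qa"):
    """Serialize (question, answer) pairs as one JSON object per line."""
    lines = []
    for question, answer in pairs:
        record = {
            "task": task,  # category label, e.g. "qa" or "summarization"
            "prompt": question.strip(),
            "response": answer.strip(),
        }
        # ensure_ascii=False keeps non-English text readable in the output
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

One record per line lets large datasets be streamed and sharded without parsing the whole file.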

Tokenization
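A toy illustration of tokenization: text is split into tokens and each token is mapped to an integer ID. This sketch assumes whitespace splitting and a word-level vocabulary; LLMs in practice use subword schemes such as byte-pair encoding, which this does not implement. The `<unk>` token and function names are illustrative.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct lowercase token in the corpus."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for unseen tokens
    for text in corpus:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Map text to a list of token IDs, using <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
```

For example, with a vocabulary built from "the cat sat" and "the dog", the unseen word in "the bird sat" maps to the `<unk>` ID.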

Learn patterns and relationships

Improve generalisation to new data samples