ガチでスクラッチから作ってみたいそんなあなたに

今、X.com上でこんなリポジトリが盛り上がってます。

GitHub – FareedKhan-dev/train-llm-from-scratch: A straightforward method for training your LLM, from downloading data to generating text.

A straightforward method for training your LLM, from downloading data to generating text. – FareedKhan-dev/train-llm-fro…

これ、何が入ってるかと言いますと、スクラッチでTransformerを使用したLLMを事前学習から始めるためのキットが詰まっています。

全体構成

全体構成は以下のようになっています。

train-llm-from-scratch/
├── src/
│   ├── models/
│   │   ├── mlp.py                 # 多層感知機（MLP）モジュールの定義
│   │   ├── attention.py           # アテンションメカニズム（シングル/マルチヘッド）の定義
│   │   ├── transformer_block.py   # 単一のTransformerブロックの定義
│   │   └── transformer.py         # メインのTransformerモデルの定義
├── config/
│   └── config.py                  # デフォルト設定（モデルパラメータ、ファイルパスなど）
├── data_loader/
│   └── data_loader.py             # データローダー/イテレータ作成関数
├── scripts/
│   ├── train_transformer.py       # Transformerモデル訓練用スクリプト
│   ├── data_download.py           # データセットダウンロードスクリプト
│   ├── data_preprocess.py         # ダウンロードしたデータの事前処理用スクリプト
│   └── generate_text.py           # 訓練済みモデルを使用したテキスト生成スクリプト
├── data/                          # データセット保存ディレクトリ
│   ├── train/                     # 訓練データ
│   └── val/                       # 検証データ
└── models/                        # 訓練済みモデル保存ディレクトリ

train-llm-from-scratch/
├── src/
│   ├── models/
│   │   ├── mlp.py                 # 多層感知機（MLP）モジュールの定義
│   │   ├── attention.py           # アテンションメカニズム（シングル/マルチヘッド）の定義
│   │   ├── transformer_block.py   # 単一のTransformerブロックの定義
│   │   └── transformer.py         # メインのTransformerモデルの定義
├── config/
│   └── config.py                  # デフォルト設定（モデルパラメータ、ファイルパスなど）
├── data_loader/
│   └── data_loader.py             # データローダー/イテレータ作成関数
├── scripts/
│   ├── train_transformer.py       # Transformerモデル訓練用スクリプト
│   ├── data_download.py           # データセットダウンロードスクリプト
│   ├── data_preprocess.py         # ダウンロードしたデータの事前処理用スクリプト
│   └── generate_text.py           # 訓練済みモデルを使用したテキスト生成スクリプト
├── data/                          # データセット保存ディレクトリ
│   ├── train/                     # 訓練データ
│   └── val/                       # 検証データ
└── models/                        # 訓練済みモデル保存ディレクトリ

models配下には、記憶部分であるMLP、文法エンジンとなるAttention、それらを統合して一つのブロックを構成するResidual Attention Block部分、さらにそれらを重ね合わせて構成されるTransformer全体を表しています。

ハイパーパラメータはconfig.pyに配置、データを読み出すために必要なdata_loader部分、そしてこれらを使って訓練するためのツール群scripts・・という流れでできてます。

TokenizerはどうやらTiktokenを使用するよう定義されており、こちらは既製製品を使うという構成になりますが、それ以外はデータセットから一から作るという構成で臨む感じになるようです。素敵。

出来上がりそうなモデルの仕様

config.pyを覗く

config.pyを覗いてみると、以下のような内容になっていました。

# --- Configuration ---

# Define vocabulary size and transformer configuration (3 Billion)
VOCAB_SIZE = 50304          # Number of unique tokens in the vocabulary
CONTEXT_LENGTH = 512        # Maximum sequence length for the model
N_EMBED = 2048              # Dimension of the embedding space
N_HEAD = 16                 # Number of attention heads in each transformer block
N_BLOCKS = 64               # Number of transformer blocks in the model

# Paths to training and development datasets
TRAIN_PATH = "data/train/pile_train.h5"  # File path for the training dataset
DEV_PATH = "data/val/pile_dev.h5"      # File path for the validation dataset

# Transformer training parameters
T_BATCH_SIZE = 32          # Number of samples per training batch
T_CONTEXT_LENGTH = 16      # Context length for training batches
T_TRAIN_STEPS = 200000     # Total number of training steps
T_EVAL_STEPS = 1000        # Frequency (in steps) to perform evaluation
T_EVAL_ITERS = 250         # Number of iterations to evaluate the model
T_LR_DECAY_STEP = 50000    # Step at which to decay the learning rate
T_LR = 5e-4                # Initial learning rate for training
T_LR_DECAYED = 5e-5        # Learning rate after decay
T_OUT_PATH = "models/transformer_B.pt"  # Path to save the trained model

# Device configuration
DEVICE = 'cuda'

# Store all configurations in a dictionary for easy access and modification
default_config = {
    'vocab_size': VOCAB_SIZE,
    'context_length': CONTEXT_LENGTH,
    'n_embed': N_EMBED,
    'n_head': N_HEAD,
    'n_blocks': N_BLOCKS,
    'train_path': TRAIN_PATH,
    'dev_path': DEV_PATH,
    't_batch_size': T_BATCH_SIZE,
    't_context_length': T_CONTEXT_LENGTH,
    't_train_steps': T_TRAIN_STEPS,
    't_eval_steps': T_EVAL_STEPS,
    't_eval_iters': T_EVAL_ITERS,
    't_lr_decay_step': T_LR_DECAY_STEP,
    't_lr': T_LR,
    't_lr_decayed': T_LR_DECAYED,
    't_out_path': T_OUT_PATH,
    'device': DEVICE,
}

# --- Configuration ---

# Define vocabulary size and transformer configuration (3 Billion)
VOCAB_SIZE = 50304          # Number of unique tokens in the vocabulary
CONTEXT_LENGTH = 512        # Maximum sequence length for the model
N_EMBED = 2048              # Dimension of the embedding space
N_HEAD = 16                 # Number of attention heads in each transformer block
N_BLOCKS = 64               # Number of transformer blocks in the model

# Paths to training and development datasets
TRAIN_PATH = "data/train/pile_train.h5"  # File path for the training dataset
DEV_PATH = "data/val/pile_dev.h5"      # File path for the validation dataset

# Transformer training parameters
T_BATCH_SIZE = 32          # Number of samples per training batch
T_CONTEXT_LENGTH = 16      # Context length for training batches
T_TRAIN_STEPS = 200000     # Total number of training steps
T_EVAL_STEPS = 1000        # Frequency (in steps) to perform evaluation
T_EVAL_ITERS = 250         # Number of iterations to evaluate the model
T_LR_DECAY_STEP = 50000    # Step at which to decay the learning rate
T_LR = 5e-4                # Initial learning rate for training
T_LR_DECAYED = 5e-5        # Learning rate after decay
T_OUT_PATH = "models/transformer_B.pt"  # Path to save the trained model

# Device configuration
DEVICE = 'cuda'

# Store all configurations in a dictionary for easy access and modification
default_config = {
    'vocab_size': VOCAB_SIZE,
    'context_length': CONTEXT_LENGTH,
    'n_embed': N_EMBED,
    'n_head': N_HEAD,
    'n_blocks': N_BLOCKS,
    'train_path': TRAIN_PATH,
    'dev_path': DEV_PATH,
    't_batch_size': T_BATCH_SIZE,
    't_context_length': T_CONTEXT_LENGTH,
    't_train_steps': T_TRAIN_STEPS,
    't_eval_steps': T_EVAL_STEPS,
    't_eval_iters': T_EVAL_ITERS,
    't_lr_decay_step': T_LR_DECAY_STEP,
    't_lr': T_LR,
    't_lr_decayed': T_LR_DECAYED,
    't_out_path': T_OUT_PATH,
    'device': DEVICE,
}

このことから、Vocab数はおよそ50k程度、最大コンテキスト長は512、隠れ層次元数は2048、Multihead Attentionのヘッド数は16、層数は64といったものが出来上がりそうです。コメントの内容から、総パラメータ数は3B相当とのこと。この値を若干減じることで1B相当などサイズを設定できそうです。

活性関数

活性関数は src/models/mlp.py に記載があります。

    def __init__(self, n_embed: int) -> None:
        """
        Initializes the MLP module.

        Args:
            n_embed (int): The dimensionality of the input embedding.
        """
        super().__init__()
        self.hidden = nn.Linear(n_embed, 4 * n_embed)  # Linear layer to expand embedding size
        self.relu = nn.ReLU()                        # ReLU activation function
        self.proj = nn.Linear(4 * n_embed, n_embed)  # Linear layer to project back to original size

    def __init__(self, n_embed: int) -> None:
        """
        Initializes the MLP module.

        Args:
            n_embed (int): The dimensionality of the input embedding.
        """
        super().__init__()
        self.hidden = nn.Linear(n_embed, 4 * n_embed)  # Linear layer to expand embedding size
        self.relu = nn.ReLU()                        # ReLU activation function
        self.proj = nn.Linear(4 * n_embed, n_embed)  # Linear layer to project back to original size

活性関数はReLUです。比較的Transformerでは初期によく用いられてきた活性化関数ですね。Pytorchのnnクラスを用いてるため、一生懸命数学的な関数を作らずに済むところはありがたいですね。こういうところの関数を変えながら進めていくのもまた一つ面白ポイントとしてあるかもしれません。

なお、最近よく見かける活性化関数はSiLUです。

このMLPに入る前は2048次元であったテンソルをここで4倍の8192次元に広げ、情報のキメを細かくして表現を洗練させるようになっています。

マルチヘッドアテンション層

マルチヘッドアテンション層で使われるパラメータですが、QKV(Query/Key/Value)の3つが記述されています。入力次元数はそれぞれ2048のままとなっています。

    def __init__(self, head_size: int, n_embed: int, context_length: int) -> None:
        """
        Initializes the attention head.

        Args:
            head_size (int): The dimensionality of the key, query, and value projections.
            n_embed (int): The dimensionality of the input embedding.
            context_length (int): The maximum length of the input sequence.
        """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)   # Key projection
        self.query = nn.Linear(n_embed, head_size, bias=False) # Query projection
        self.value = nn.Linear(n_embed, head_size, bias=False) # Value projection
        # Lower triangular matrix for causal masking
        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

    def __init__(self, head_size: int, n_embed: int, context_length: int) -> None:
        """
        Initializes the attention head.

        Args:
            head_size (int): The dimensionality of the key, query, and value projections.
            n_embed (int): The dimensionality of the input embedding.
            context_length (int): The maximum length of the input sequence.
        """
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)   # Key projection
        self.query = nn.Linear(n_embed, head_size, bias=False) # Query projection
        self.value = nn.Linear(n_embed, head_size, bias=False) # Value projection
        # Lower triangular matrix for causal masking
        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

これを以下のロジックに放り込んで処理をさせていくことになります。

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the attention head.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).

        Returns:
            torch.Tensor: Output tensor after applying attention.
        """
        B, T, C = x.shape
        head_size = self.key.out_features
        k = self.key(x)     # (B, T, head_size)
        q = self.query(x)   # (B, T, head_size)
        scale_factor = 1 / math.sqrt(head_size)
        # Calculate attention weights: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        attn_weights = q @ k.transpose(-2, -1) * scale_factor
        # Apply causal masking
        attn_weights = attn_weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        attn_weights = F.softmax(attn_weights, dim=-1)
        v = self.value(x)   # (B, T, head_size)
        # Apply attention weights to values
        out = attn_weights @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the attention head.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).

        Returns:
            torch.Tensor: Output tensor after applying attention.
        """
        B, T, C = x.shape
        head_size = self.key.out_features
        k = self.key(x)     # (B, T, head_size)
        q = self.query(x)   # (B, T, head_size)
        scale_factor = 1 / math.sqrt(head_size)
        # Calculate attention weights: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        attn_weights = q @ k.transpose(-2, -1) * scale_factor
        # Apply causal masking
        attn_weights = attn_weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        attn_weights = F.softmax(attn_weights, dim=-1)
        v = self.value(x)   # (B, T, head_size)
        # Apply attention weights to values
        out = attn_weights @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

正味の処理部分を上から見ていくとこんな感じです。

ソースの内容	処理の概要
B, T, C = x.shape head_size = self.key.out_features	入力テンソルxと、変数B,T,Cの関係を定義
`k = self.key(x)` `q = self.query(x)`	k及びqの投影を実施。テンソルの構造は（B,T,head_size）実際には、各トークンベクトルに対して独立した線形変換を行っている
scale_factor = 1 / math.sqrt(head_size)	スケール因子設定。アテンション計算時の分散を抑えるための正則仮定数sを定義。
attn_weights = q @ k.transpose(-2, -1) * scale_factor	ドット積アテンション(内積)による類似度計算を実施する。これにより、Query行列と（Key行列の転置行列）の行列積を計算し、スケーリングを行う。結果として得られるのは、トークンiとトークンjの類似度を反映した内積である。
attn_weights = attn_weights.masked_fill(self.tril[:T, :T] == 0, float(‘-inf’))	マスク行列を適用し、マスク値が0である要素に対して値を負値に置き換える。こうすることで、未来の情報へ注意を向けさせないようにする。
attn_weights = F.softmax(attn_weights, dim=-1)	各行iに対してSoftmax関数を適用する。すべてのトークンに対する重みの和が1になるよう正規化定数を求める。こうすることにより、所謂次のトークンを予測するためのオッズ表が作成された状態になる。
v = self.value(x)	KeyやQueryと同様に、重み行列を用いて右から行列積を行う。
out = attn_weights @ v	正規化されたアテンション重み行列とValue行列の行列積を求める。
return out	得られた結果を呼び出し元へ返す

図に直すとこうなります。まさにAttention機構でよく解説に使われる図そのままが出てくるような感じになりましたね。

モデルの全体像

出来上がるモデルの全体像は以下のようになります。GPT-1などに例えられるDecoder Only Transformerの校正に似ていて、GPT-1と違うのは活性化関数がGeLUではなくReLUである点です。それ以外についてはおおむねオーソドックスな構成と言えると思います。

引用：このツールの使用方法について

元リポジトリのreadme.mdにたいていのことが書かれており、その中でUsageセクションを日本語に訳してみました。参考になればどうぞ。

作業を始める前に、環境に Git がインストールされていることを確認してください。

1. リポジトリのクローンと依存関係のインストール

git clone https://github.com/FareedKhan-dev/train-llm-from-scratch.git
cd train-llm-from-scratch
pip install -r requirements.txt

2. 学習データのダウンロード

python scripts/data_download.py

主な引数:
- --train_max: ダウンロードする訓練用ファイルの最大数（デフォルトは1、最大30まで。1ファイルあたり約11GB）。1ファイルのみ（00.jsonl.zst）に制限することで、ダウンロード時間を削減できます。
- --train_dir: 訓練用データの保存先ディレクトリ（デフォルトは data/train）。
- --val_dir: 検証用データの保存先ディレクトリ（デフォルトは data/val）。

3. データの事前処理（トークナイズ）

高品質な単語出力を行うため、OpenAIのオープンソースのトークナイザー tiktoken (ChatGPT/GPT-3で使用される r50k_base) を使用して、ダウンロードしたデータセットをエンコード（トークナイズ）します。

python scripts/data_preprocess.py

4. モデルの訓練

src/models/transformer.py でモデルの構成を、config/config.py で学習パラメータを編集できます。訓練を実行するには以下のスクリプトを実行します。

python scripts/train_transformer.py

学習が開始され、訓練済みモデルはデフォルトの models/ ディレクトリに保存されます。

5. テキストの生成

訓練完了後、以下のスクリプトを使用してテキストを生成できます。

python scripts/generate_text.py --model_path models/your_trained_model.pth --input_text "Once upon a time" --max_new_tokens 100

主な引数:
- --model_path: 訓練済みモデルのパス。
- --input_text: 生成を開始するプロンプトテキスト。
- --max_new_tokens: 生成する最大トークン数（デフォルトは 100）。