logo
0
0
WeChat Login
manayang<manayang@tencent.com>
Tencent Hunyuan A13B opensource

中文  | English



Model Training

Hunyuan-A13B provides processes related to model training. This section details how to process training data for model training purposes.

Training Data Format and Processing

Hunyuan-A13B-Instruct supports both "slow thinking" and "fast thinking" modes. By default, the model outputs in slow thinking mode. If you wish the model to use fast thinking, you can prepend "/no_think" to the query.

The training data should be formatted as a list of messages. By default, the system prompt for both training and inference is empty, but you may customize it as needed.

# Thinking pattern
think = ""
answer = ""
think_pattern = f"<think>\n{think}\n</think>\n<answer>\n{answer}\n</answer>"

# Fast thinking pattern
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "/no_think Why is seawater salty?" },
    {"role": "assistant", "content": "<think>\n\n</think>\n<answer>\nSeawater is primarily saline due to dissolved salts and minerals. These substances come from the chemical materials in rocks and soil on the Earth's surface, which are carried into the ocean over time. When seawater evaporates, the water vapor leaves, but the salts and minerals remain, making the seawater saltier. Therefore, the salinity of seawater is determined by the amount of salts and minerals it contains.\n</answer>"}
]

# Slow thinking pattern
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "1+1=" },
    {"role": "assistant", "content": "<think>\nThe user is asking for the result of 1 + 1. First, I must confirm this is a basic arithmetic question. In the decimal numeral system, 1 + 1 typically equals 2. While alternative interpretations might exist in different numeral systems (e.g., binary) or contextual riddles, no special context is specified here, so the default assumption is the decimal system. Additionally, there are occasional riddle-like scenarios where 1 + 1 could equal 1 (e.g., one drop of water plus another drop still forms one drop), but in standard mathematical contexts, the answer is 2. Therefore, the most accurate response is 2.</think>\n<answer>\nIn basic decimal arithmetic, 1 plus 1 equals 2. This operation adheres to the basic rules of natural number addition, so: 1 + 1 = 2.\n</answer>"}
    
from tokenizers import ByteLevelBPETokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models", use_fast=False, trust_remote_code=True)
ids = tokenizer.apply_chat_template(messages)

 

Quick Start

You can quickly get started by following the instructions in the Quick Start Guide.

 

Model Training

Hardware Requirements

Based on testing, when make_moe_param_leaf_module and zero3+offload are disabled and max_seq_length is set to 2048, full fine-tuning requires at least a single machine with 8 GPUs (each with at least 80GB of memory).

Launch Methods

Reference: HuggingFace Transformers Trainer

Single-Machine Training

In the train directory, execute:

pip install -r requirements.txt
bash train.sh

Multi-Machine Training

To launch training across multiple machines, please follow the steps below and ensure all machines are within the same cluster.

Configure Passwordless SSH Login Between Machines

The following instructions use two machines as an example, with their IPs denoted as ${ip1} and ${ip2}. All steps should be performed inside the Docker container.

First, configure passwordless SSH for each container on every machine:

ssh-keygen                  # Generate id_rsa and id_rsa.pub for passwordless login
ssh-keygen -t rsa -A        # Generate /etc/ssh/ssh_host_rsa_key and ssh_host_ecdsa_key for SSH listening
/usr/sbin/sshd -p 36005 -o ListenAddress=0.0.0.0        # Start SSH listening
echo "Port 36005" > ~/.ssh/config   # Set SSH connection port to 36005
passwd root    # Set the root password to avoid monitoring platform alerts

Note: 36005 is an example port. You may use any available port, but ensure it is open and not occupied by other processes.

Next, in each machine's container, execute:

cat ~/.ssh/id_rsa.pub

Copy the output SSH public key and paste it into the ~/.ssh/authorized_keys file, one key per line. This must be done on every machine. In the end, the ~/.ssh/authorized_keys file on each machine should be identical and contain the public keys of all machines.

Please note that for multi-node training, the code executed on each node must be identical. It is recommended to mount a shared network drive. If this is not possible, you must manually copy the dataset, scripts, and code to the same directory on each machine.

Launching Multi-Machine Training

Once the above preparations are complete and all dependencies are installed (if not, run pip install -r requirements.txt), add the following configuration at the beginning of train.sh:

export HOST_GPU_NUM=8
# Local machine IP
export LOCAL_IP=${ip1}
# Comma-separated list of node IPs and GPU counts
export NODE_IP_LIST="${ip1}:8,${ip2}:8"
# Number of nodes
export NODES=2
export NODE_NUM=$((${NODES} * ${HOST_GPU_NUM}))

Note: Replace ${ip1} and ${ip2} with the actual IP addresses!

Then, on the machine with ${ip1}, execute bash train.sh in the train/ directory. On first launch, you may see the following output:

The authenticity of host '[ip]:36005 ([ip]:36005)' can't be established.
ECDSA key fingerprint is xxxxxx.
ECDSA key fingerprint is MD5:xxxxxx.
Are you sure you want to continue connecting (yes/no)?

Type yes to continue.

Key Parameters

The key parameters in the script are as follows:

  • --deepspeed: Path to the DeepSpeed configuration file. Three default DeepSpeed configuration files are provided in the train folder: ds_zero2_no_offload.json, ds_zero3_no_offload.json, and ds_zero3_offload.json, with decreasing memory requirements in that order.
  • --model_name_or_path: Path to the HF pre-trained model weights. Ensure this directory contains both modeling_hunyuan.py and configuration_hunyuan.py, otherwise loading will fail.
  • --tokenizer_name_or_path: Path to the tokenizer folder. Ensure this directory contains tokenization_hy.py, otherwise loading will fail.
  • --train_data_file: Path to the training file, which should be a jsonl file.
  • --output_dir: Output directory where logs, tensorboard files, and model weights will be stored.
  • --per_device_train_batch_size: Batch size per GPU.
  • --gradient_accumulation_steps: Number of gradient accumulation steps. The global batch size is per_device_train_batch_size * gradient_accumulation_steps * dp_size.
  • --max_steps: Total number of training steps.
  • --save_steps: Number of steps between saving checkpoints.
  • --use_lora: Whether to use LoRA training. Also accepts --lora_rank, --lora_alpha, and --lora_dropout parameters. By default, LoRA is applied to "q_proj", "k_proj", "v_proj", and "o_proj". To change this, modify the code. Note: ** When using LoRA training, only the LoRA weights are saved, not the base model weights. ** To merge LoRA weights, see the "LoRA Weight Merging" section below.
  • --make_moe_param_leaf_module: When using zero3 and MoE training, treat the MoE module as a leaf module, i.e., its parameters are not partitioned by zero3. This option is expected to significantly increase memory usage.
  • --gradient_checkpointing: Enable gradient checkpointing.
  • --train_attention_params_only: Whether to train only attention parameters.
  • --learning_rate: Maximum learning rate during training.
  • --min_lr: Minimum learning rate during training.
  • --use_flash_attn: Enable flash-attention for accelerated training.

Notes:

  • To resume training from a previously saved checkpoint rather than loading pre-trained weights, specify --resume_from_checkpoint with the path to the checkpoint. Do not specify --model_name_or_path, this will load only the weights, not the training state.
  • When resuming from a checkpoint, there may be minor differences in loss due to the randomness of some non-deterministic algorithms. This is normal. See: HuggingFace Transformers Trainer Randomness
  • When --model_name_or_path is specified, all model-related parameters will be ignored.
  • Samples within a batch are padded to the length of the longest sample in the batch, but the maximum length of each sample is max_seq_length. Any excess will be truncated.
  • If you see a warning about bias weights not being loaded, you can ignore it. Hunyuan-Large does not use bias.

What if GPU Memory is Insufficient?

Reference: DeepSpeed Configuration

You can try modifying the DeepSpeed configuration by removing the auto attribute from the following parameters and reducing their values:

  • stage3_param_persistence_threshold
  • stage3_prefetch_bucket_size
  • stage3_max_reuse_distance

LoRA Weight Merging

LoRA weights saved during training cannot be merged into the zero3 model at runtime, as zero3 partitions model weights across data parallel ranks. To merge LoRA weights into the base model, you can do so offline to obtain a merged weight file. Run merge_lora_weight.sh to merge the LoRA and base model weights. The parameters are:

  • --base_model_path: Directory of the base model weights
  • --adapter_model_path: Directory of the LoRA weights
  • --output_path: Directory to save the merged weights
  • --save_dtype: Data type for saving the merged weights; options are: fp16, bf16, fp32

LLaMA-Factory Support

If you are familiar with LLaMA-Factory, you may use https://github.com/hiyouga/LLaMA-Factory/tree/main for fine-tuning. We provide a sample configuration file for LLaMA-Factory training at ./train/llama_factory_support/hunyuan_a13b_full_sft.yaml.