
Training

This repository supports finetuning SAM3 models on custom datasets, either locally or in a multi-node setup. The training script is located at sam3/train/train.py and uses Hydra configuration management to handle complex training setups.

Installation

```shell
cd sam3
pip install -e ".[train]"
```

Training Script Usage

The main training script is located at sam3/train/train.py. It uses Hydra configuration management to handle complex training setups.

Basic Usage

```shell
# Example: Train on Roboflow dataset
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml

# Example: Train on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only_train.yaml
```

Follow Roboflow 100-VL to download the Roboflow 100-VL datasets, and follow GLIP to download the ODinW datasets. Organize the data folders as shown below, and set roboflow_vl_100_root and odinw_data_root in the job configs.

```
roboflow_vl_100_root:
├── 13-lkc01
│   ├── train
│   ├── valid
│   └── test
├── 2024-frc
├── actions
└── ...

odinw_data_root:
├── AerialMaritimeDrone
│   └── large
│       ├── train
│       ├── valid
│       └── test
├── Aquarium
└── ...
```

Command Line Arguments

The training script supports several command line arguments:

```shell
python sam3/train/train.py \
    -c CONFIG_NAME \
    [--use-cluster 0|1] \
    [--partition PARTITION_NAME] \
    [--account ACCOUNT_NAME] \
    [--qos QOS_NAME] \
    [--num-gpus NUM_GPUS] \
    [--num-nodes NUM_NODES]
```

Arguments:

  • -c, --config: Required. Path to the configuration file (e.g., sam3/train/configs/roboflow_v100_full_ft_100_images.yaml)
  • --use-cluster: Whether to launch on a cluster (0: local, 1: cluster). Default: uses config setting
  • --partition: SLURM partition name for cluster execution
  • --account: SLURM account name for cluster execution
  • --qos: SLURM QOS (Quality of Service) setting
  • --num-gpus: Number of GPUs per node. Default: uses config setting
  • --num-nodes: Number of nodes for distributed training. Default: uses config setting

Local Training Examples

```shell
# Single GPU training
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0 --num-gpus 1

# Multi-GPU training on a single node
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0 --num-gpus 4

# Force local execution even if config specifies GPUs
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 0
```

Cluster Training Examples

```shell
# Basic cluster training with default settings from config
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml --use-cluster 1

# Cluster training with specific SLURM settings
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
    --use-cluster 1 \
    --partition gpu_partition \
    --account my_account \
    --qos high_priority \
    --num-gpus 8 \
    --num-nodes 2
```

Configuration Files

Training configurations are stored in sam3/train/configs/. The configuration files use Hydra's YAML format and support:

  • Dataset Configuration: Data paths, transforms, and loading parameters
  • Model Configuration: Architecture settings, checkpoint paths, and model parameters
  • Training Configuration: Batch sizes, learning rates, optimization settings
  • Launcher Configuration: Distributed training and cluster settings
  • Logging Configuration: TensorBoard, experiment tracking, and output directories

Key Configuration Sections

```yaml
# Paths to datasets and checkpoints
paths:
  bpe_path: /path/to/bpe/file
  dataset_root: /path/to/dataset
  experiment_log_dir: /path/to/logs

# Launcher settings for local/cluster execution
launcher:
  num_nodes: 1
  gpus_per_node: 2
  experiment_log_dir: ${paths.experiment_log_dir}

# Cluster execution settings
submitit:
  use_cluster: True
  timeout_hour: 72
  cpus_per_task: 10
  partition: null
  account: null
```

Monitoring Training

The training script automatically sets up logging and saves outputs to the experiment directory:

```
# Logs are saved to the experiment_log_dir specified in config
experiment_log_dir/
├── config.yaml           # Original configuration
├── config_resolved.yaml  # Resolved configuration with all variables expanded
├── checkpoints/          # Model checkpoints (if skip_checkpointing=False)
├── tensorboard/          # TensorBoard logs
├── logs/                 # Text logs
└── submitit_logs/        # Cluster job logs (if using cluster)
```

You can monitor training progress using TensorBoard:

```shell
tensorboard --logdir /path/to/experiment_log_dir/tensorboard
```

Job Arrays for Dataset Sweeps

The Roboflow and ODinW configurations support job arrays for training multiple models on different datasets. This feature is enabled via:

```yaml
submitit:
  job_array:
    num_tasks: 100
    task_index: 0
```

The configuration includes the complete list of 100 Roboflow supercategories, and submitit.job_array.task_index automatically selects which dataset to use based on the array job index.

```shell
# Submit job array to train on different Roboflow datasets
# The job array index selects which dataset from all_roboflow_supercategories
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_full_ft_100_images.yaml \
    --use-cluster 1
```
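Conceptually, the per-task dataset selection can be sketched as below. Note this is an illustration, not the actual code: the list is truncated to three entries here, and select_dataset is a hypothetical helper, whereas the real config defines all 100 supercategories and the selection happens inside the config resolution.

```python
# Hypothetical sketch of how a job array index picks a dataset.
# The real config lists all 100 Roboflow supercategories; three are shown here.
all_roboflow_supercategories = ["13-lkc01", "2024-frc", "actions"]

def select_dataset(task_index: int) -> str:
    """Return the dataset assigned to this array task."""
    # Each array task receives a distinct index in [0, num_tasks),
    # so task 0 trains on the first supercategory, task 1 on the second, etc.
    return all_roboflow_supercategories[task_index]
```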

Reproduce ODinW13 10-shot results

Running the following job reproduces the ODinW13 results for seed 300 (see odinw_train.train_file: fewshot_train_shot10_seed300 in the config file).

```shell
# Example: Train on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only_train.yaml
```

Change odinw_train.train_file to fewshot_train_shot10_seed30 or fewshot_train_shot10_seed3 to get the results for the other two seeds. Final results are aggregated across the three seeds. Note that a small number of jobs may diverge during training; in that case, we use the result from the last checkpoint before divergence.
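The aggregation across seeds can be sketched as follows. This is a minimal illustration, assuming a simple mean over one metric: the metric values and the aggregate_seeds helper are placeholders, not names or numbers from the codebase.

```python
# Illustrative sketch: average a per-seed metric across the three seeds.
# The values below are placeholders, not real results.
results_by_seed = {
    "fewshot_train_shot10_seed300": 0.40,
    "fewshot_train_shot10_seed30": 0.42,
    "fewshot_train_shot10_seed3": 0.41,
}

def aggregate_seeds(results: dict) -> float:
    """Mean of the per-seed metrics, reported as the final number."""
    return sum(results.values()) / len(results)
```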

Eval Script Usage

With a setup similar to the training configs, the training script sam3/train/train.py can also be used for evaluation by setting trainer.mode = val in the job config. Running the following jobs gives the zero-shot results on the RF100-VL and ODinW13 datasets.
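The relevant setting in the job config is a fragment like the one below (shown in isolation as a minimal sketch; the full trainer section in the actual configs contains additional fields):

```yaml
# Switch the trainer from training to validation-only mode
trainer:
  mode: val
```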

```shell
# Example: Evaluate on Roboflow dataset
python sam3/train/train.py -c configs/roboflow_v100/roboflow_v100_eval.yaml

# Example: Evaluate on ODinW13 dataset
python sam3/train/train.py -c configs/odinw13/odinw_text_only.yaml
```