sam3/scripts/eval/veval/README.md

Tengyu Ma<tym@meta.com>

Update veval README.md for frame shifting alert on sa-co/veval yt1b (#213)

2d1cbaea

SA-Co/VEval Dataset

License: each domain has its own license.

SA-Co/VEval - SA-V: CC-BY-NC 4.0
SA-Co/VEval - YT-Temporal-1B: CC-BY-NC 4.0
SA-Co/VEval - SmartGlasses: CC-BY 4.0

SA-Co/VEval is an evaluation dataset comprising 3 domains; each domain has a val and a test split.

SA-Co/VEval - SA-V: videos are from the SA-V dataset
SA-Co/VEval - YT-Temporal-1B: videos are from the YT-Temporal-1B
SA-Co/VEval - SmartGlasses: egocentric videos from Smart Glasses

Environment

Install the environment required for SA-Co/VEval:


pip install -e ".[veval]"

This will allow us to run:

scripts/eval/veval/saco_yt1b_downloader.py: prepares frames for SA-Co/VEval - YT-Temporal-1B
examples/saco_veval_eval_example.ipynb: example of running an offline evaluator
examples/saco_veval_vis_example.ipynb: example of loading and visualizing the data

Download

The expected folder structure

The following folder structure is expected after completing all the download and pre-processing steps in this section:


data/
├── annotation/
│   ├── saco_veval_sav_test.json
│   ├── saco_veval_sav_val.json
│   ├── saco_veval_smartglasses_test.json
│   ├── saco_veval_smartglasses_val.json
│   ├── saco_veval_yt1b_test.json
│   └── saco_veval_yt1b_val.json
└── media/
    ├── saco_sav
    │   └── JPEGImages_24fps
    ├── saco_sg
    │   └── JPEGImages_6fps
    └── saco_yt1b
        └── JPEGImages_6fps
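
Before running the evaluator, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repo: the helper name `find_missing` and the `data` root are assumptions, while the paths are copied from the tree above.

```python
from pathlib import Path

# Paths copied from the folder tree above, relative to the data root.
EXPECTED_ANNOTATIONS = [
    f"annotation/saco_veval_{domain}_{split}.json"
    for domain in ("sav", "smartglasses", "yt1b")
    for split in ("test", "val")
]
EXPECTED_MEDIA_DIRS = [
    "media/saco_sav/JPEGImages_24fps",
    "media/saco_sg/JPEGImages_6fps",
    "media/saco_yt1b/JPEGImages_6fps",
]


def find_missing(data_root):
    """Return the expected files and folders that are absent under data_root.

    Hypothetical helper for illustration; it only checks for existence.
    """
    root = Path(data_root)
    missing = [p for p in EXPECTED_ANNOTATIONS if not (root / p).is_file()]
    missing += [p for p in EXPECTED_MEDIA_DIRS if not (root / p).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumption: the script is run from the directory that contains data/.
    for path in find_missing("data"):
        print(f"missing: {path}")
```

An empty result only means the files and folders exist; it says nothing about whether the frames inside them are complete.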

Download ready-to-use data

The following links provide ready-to-use data hosted on Roboflow. This data has already gone through the pre-processing steps outlined in the next section.

For each domain:

For all three domains:

SA-Co/VEval

Special note on SA-Co/VEval - YT-Temporal-1B:

Frame Shifting Alert!
The ready-to-use data hosted on Roboflow was produced by following the preprocessing steps below. Therefore, the frame-shifting issue for YT-Temporal-1B still exists: due to the nature of YouTube videos, the re-downloaded videos may not be exactly the same as those used during annotation, which can affect the reproducibility of evaluation numbers.

Download via preprocessing steps

Download annotations

The GT annotations are available at Hugging Face:

SA-Co/VEval
- SA-Co/VEval SA-V
  - Test: annotation/saco_veval_sav_test.json
  - Val: annotation/saco_veval_sav_val.json
- SA-Co/VEval YT-Temporal-1B
  - Test: annotation/saco_veval_yt1b_test.json
  - Val: annotation/saco_veval_yt1b_val.json
- SA-Co/VEval SmartGlasses
  - Test: annotation/saco_veval_smartglasses_test.json
  - Val: annotation/saco_veval_smartglasses_val.json

Download videos or frames

SA-Co/VEval - SA-V

Follow the instructions for the SA-V dataset. Only the following two archives are needed:

sav_test.tar
sav_val.tar

After untar:


sav_test/
├── Annotations_6fps [ignore: this is the SAM 2 annotation]
└── JPEGImages_24fps
sav_val/
├── Annotations_6fps [ignore: this is the SAM 2 annotation]
└── JPEGImages_24fps

Then merge the two JPEGImages_24fps folders into one so that the layout matches the file paths in our annotation json, e.g.


media/
    └── saco_sav
        └── JPEGImages_24fps [merged from the two JPEGImages_24fps above]

Example commands to download and merge folders


cd ../data/media/saco_sav
wget -O sav_test.tar <sav_test.tar download link from the SA-V dataset page>
wget -O sav_val.tar <sav_val.tar download link from the SA-V dataset page>
tar -xf sav_test.tar
tar -xf sav_val.tar
mkdir JPEGImages_24fps
chmod -R u+w sav_test/
chmod -R u+w sav_val/
mv sav_test/JPEGImages_24fps/* JPEGImages_24fps/
mv sav_val/JPEGImages_24fps/* JPEGImages_24fps/

SA-Co/VEval - YT-Temporal-1B

Two files are needed to download the SA-Co/VEval - YT-Temporal-1B YouTube videos:

Download media/yt1b_start_end_time.json from SA-Co/VEval, which contains the YouTube video ids and the start and end times used in SA-Co/VEval - YT-Temporal-1B.
Prepare the cookies.txt file. Follow the instructions in yt-dlp exporting-youtube-cookies and pass-cookies-to-yt-dlp to prepare the cookies_file.
- Please see the full WARNINGS in yt-dlp regarding the risk of a YouTube account ban!!

Then run scripts/eval/veval/saco_yt1b_downloader.py to download the videos and prepare the frames, e.g.


python saco_yt1b_downloader.py \
--data_dir ../data/media/saco_yt1b \
--cookies_file ../data/media/saco_yt1b/cookies.txt \
--yt1b_start_end_time_file ../data/media/saco_yt1b/yt1b_start_end_time.json \
--yt1b_frame_prep_log_file ../data/media/saco_yt1b/yt1b_frame_prep.log

data_dir: the directory in which to download the YouTube videos and store the extracted frames
cookies_file: the cookies.txt downloaded above
yt1b_start_end_time_file: the yt1b_start_end_time.json downloaded above
yt1b_frame_prep_log_file: a log file to track the video downloading and frame extracting status

Then run scripts/eval/veval/saco_yt1b_annot_update.py to update the annotations based on video availability, e.g.


python saco_yt1b_annot_update.py \
--yt1b_media_dir ../data/media/saco_yt1b/JPEGImages_6fps \
--yt1b_input_annot_path ../data/annotation/saco_veval_yt1b_val.json \
--yt1b_output_annot_path ../data/annotation/saco_veval_yt1b_val_updated.json \
--yt1b_annot_update_log_path ../data/annotation/saco_veval_yt1b_val_updated.log

NOTE:

Not all YouTube videos may be available, as YouTube videos can be deleted or made private. The script saco_yt1b_annot_update.py removes the annotations of the unavailable videos.
Frame Shifting Alert!! Even when the videos are still available, their specifications, such as fps and duration, may differ from those used during annotation when re-downloaded from YouTube. In addition, ffmpeg does not always extract identical frames from the same video across different environments. As a result, the re-downloaded and re-extracted frames may be misaligned with our annotations due to frame shifting. Please be aware of this caveat when evaluating on SA-Co/VEval - YT-Temporal-1B.
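
A basic sanity check after frame extraction is to compare the frames on disk against the videos entries of the annotation json. The following is a minimal sketch, not part of the repo. It assumes that each entry in file_names is a path relative to the JPEGImages_6fps folder and that length is the number of frames; the helper name `check_frames` is hypothetical. It can only detect missing frames and count mismatches. It cannot detect frames whose content has shifted.

```python
import json
from pathlib import Path


def check_frames(annot_path, frames_dir):
    """Map video_name -> list of problems found for that video.

    Hypothetical helper for illustration only.
    """
    with open(annot_path) as f:
        videos = json.load(f)["videos"]
    frames_root = Path(frames_dir)
    problems = {}
    for video in videos:
        issues = []
        # Assumption: "length" is the number of frames listed in "file_names".
        if len(video["file_names"]) != video["length"]:
            issues.append("len(file_names) != length")
        # Assumption: file_names are paths relative to frames_dir.
        missing = [n for n in video["file_names"] if not (frames_root / n).is_file()]
        if missing:
            issues.append(f"{len(missing)} frame file(s) missing")
        if issues:
            problems[video["video_name"]] = issues
    return problems
```

For example, `check_frames("../data/annotation/saco_veval_yt1b_val_updated.json", "../data/media/saco_yt1b/JPEGImages_6fps")` would return an empty dict when every listed frame exists.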

SA-Co/VEval - SmartGlasses

Go to SACo-VEval and download media/saco_sg.tar.gz, e.g.


cd ../data
hf download facebook/SACo-VEval media/saco_sg.tar.gz --repo-type dataset --local-dir .
cd ../data/media
tar -xzf saco_sg.tar.gz

Annotation Format

The format is similar to the YTVIS format.

In the annotation json, e.g. saco_veval_sav_test.json there are 5 fields:

info
- A dict containing the dataset info
- E.g. {'version': 'v1', 'date': '2025-09-24', 'description': 'SA-Co/VEval SA-V Test'}
videos
- A list of videos that are used in the current annotation json
- It contains {id, video_name, file_names, height, width, length}
annotations
- A list of positive masklets and their related info
- It contains {id, segmentations, bboxes, areas, iscrowd, video_id, height, width, category_id, noun_phrase}
  - video_id should match the videos - id field above
  - category_id should match the categories - id field below
  - segmentations is a list of RLE
categories
- A global noun phrase id map, shared across all 3 domains.
- It contains {id, name}
  - name is the noun phrase
video_np_pairs
- A list of video-np pairs, both positive and negative, used in the current annotation json
- It contains {id, video_id, category_id, noun_phrase, num_masklets}
  - video_id should match the videos - id above
  - category_id should match the categories - id above
  - when num_masklets > 0 it is a positive video-np pair, and its masklets can be found in the annotations field
  - when num_masklets = 0 it is a negative video-np pair, meaning no masklet is present at all


data {
    "info": info
    "videos": [video]
    "annotations": [annotation]
    "categories": [category]
    "video_np_pairs": [video_np_pair]
}
video {
    "id": int
    "video_name": str  # e.g. sav_000000
    "file_names": List[str]
    "height": int
    "width": int
    "length": int
}
annotation {
    "id": int
    "segmentations": List[RLE]
    "bboxes": List[List[int, int, int, int]]
    "areas": List[int]
    "iscrowd": int
    "video_id": str
    "height": int
    "width": int
    "category_id": int
    "noun_phrase": str
}
category {
    "id": int
    "name": str
}
video_np_pair {
    "id": int
    "video_id": str
    "category_id": int
    "noun_phrase": str
    "num_masklets": int
}

sam3/examples/saco_veval_vis_example.ipynb shows some examples of the data format and data visualization.
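
As a complement to the notebook, the following minimal sketch shows one way to work with the fields described above: splitting video_np_pairs into positive and negative pairs, and grouping masklets by (video_id, category_id). The helper names and the toy values are made up for illustration; only the field names come from the format above.

```python
from collections import defaultdict


def split_video_np_pairs(data):
    """Split video-np pairs into positives (num_masklets > 0) and negatives."""
    positives = [p for p in data["video_np_pairs"] if p["num_masklets"] > 0]
    negatives = [p for p in data["video_np_pairs"] if p["num_masklets"] == 0]
    return positives, negatives


def masklets_by_pair(data):
    """Group annotations (masklets) by their (video_id, category_id) key."""
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[(ann["video_id"], ann["category_id"])].append(ann)
    return grouped


# Toy data with made-up values; a real file would be loaded with json.load
# and would also carry segmentations, bboxes, areas, and so on.
toy = {
    "video_np_pairs": [
        {"id": 1, "video_id": "0", "category_id": 7, "noun_phrase": "dog", "num_masklets": 2},
        {"id": 2, "video_id": "0", "category_id": 9, "noun_phrase": "cat", "num_masklets": 0},
    ],
    "annotations": [
        {"id": 1, "video_id": "0", "category_id": 7, "noun_phrase": "dog"},
        {"id": 2, "video_id": "0", "category_id": 7, "noun_phrase": "dog"},
    ],
}

positives, negatives = split_video_np_pairs(toy)
grouped = masklets_by_pair(toy)
for pair in positives:
    key = (pair["video_id"], pair["category_id"])
    print(pair["noun_phrase"], "masklets found:", len(grouped[key]))
```

Grouping by (video_id, category_id) is a design choice of this sketch; it relies on the two ids matching between annotations and video_np_pairs as described above.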

Run Offline Eval

An example notebook and an eval script have been provided for offline evaluation.


sam3/
├── examples/
│   └── saco_veval_eval_example.ipynb  # this notebook will load eval res or run the eval on the fly, and print the results
└── sam3/eval/
    └── saco_veval_eval.py  # this script will run the offline evaluator

saco_veval_eval.py supports two modes: one and all.

one: evaluates a single pair of GT and pred files
all: evaluates all 6 SA-Co/VEval datasets

Example usage


python saco_veval_eval.py one \
--gt_annot_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_gt.json \
--pred_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_pred.json \
--eval_res_file ../sam3/assets/veval/toy_gt_and_pred/toy_saco_veval_sav_test_eval_res.json

gt_annot_file: the location of the GT file
pred_file: the location of the Pred file
eval_res_file: the location where the eval result will be written to


python saco_veval_eval.py all \
--gt_annot_dir ../data/annotation \
--pred_dir ../data/pred \
--eval_res_dir ../data/pred

gt_annot_dir: the location of the GT files
pred_dir: the location of the Pred files
eval_res_dir: the location where the eval results will be written to
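
If you prefer to run the one mode per dataset, the command lines can be generated programmatically. The following is a minimal sketch, not part of the repo. The flags are the ones shown above, but the `_pred.json` and `_eval_res.json` file names are hypothetical (they mirror the toy file names above), and the helper `build_one_mode_commands` is an assumption.

```python
from pathlib import Path

DOMAINS = ("sav", "smartglasses", "yt1b")
SPLITS = ("test", "val")


def build_one_mode_commands(gt_annot_dir, pred_dir, eval_res_dir):
    """Build one `saco_veval_eval.py one ...` argument list per dataset."""
    commands = []
    for domain in DOMAINS:
        for split in SPLITS:
            name = f"saco_veval_{domain}_{split}"
            commands.append([
                "python", "saco_veval_eval.py", "one",
                "--gt_annot_file", str(Path(gt_annot_dir) / f"{name}.json"),
                # Hypothetical naming: adjust to how your pred files are named.
                "--pred_file", str(Path(pred_dir) / f"{name}_pred.json"),
                # Hypothetical naming for the output file as well.
                "--eval_res_file", str(Path(eval_res_dir) / f"{name}_eval_res.json"),
            ])
    return commands


if __name__ == "__main__":
    for cmd in build_one_mode_commands("../data/annotation", "../data/pred", "../data/pred"):
        print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to execute the evaluation.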
