ModernBERT-base-zeroshot-v2.0

Model description

This model is answerdotai/ModernBERT-base fine-tuned on the same dataset mix as the zeroshot-v2.0 models in the Zeroshot Classifiers Collection.
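
A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as `MoritzLaurer/ModernBERT-base-zeroshot-v2.0` (the id is not stated in this card) and that it loads with the `transformers` zero-shot classification pipeline. The helper names, hypothesis template, and labels are illustrative only.

```python
# Assumption: Hub id of this checkpoint; adjust if it is hosted elsewhere.
MODEL_ID = "MoritzLaurer/ModernBERT-base-zeroshot-v2.0"
# Illustrative template; each candidate label is inserted at "{}".
HYPOTHESIS_TEMPLATE = "This text is about {}"


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Return the hypothesis sentences that get paired with the input text."""
    return [template.format(label) for label in labels]


def classify(text, labels, model_id=MODEL_ID, multi_label=False):
    """Run zero-shot classification; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    classifier = pipeline("zero-shot-classification", model=model_id)
    return classifier(
        text,
        labels,
        hypothesis_template=HYPOTHESIS_TEMPLATE,
        multi_label=multi_label,
    )


labels = ["politics", "economy", "entertainment", "environment"]
hypotheses = build_hypotheses(labels)
print(hypotheses[0])  # This text is about politics
```

Calling `classify("Angela Merkel is a politician in Germany", labels)` should return a dict whose `labels` and `scores` entries are sorted from most to least likely.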

General takeaways:

  • The model is very fast and memory efficient. It is several times faster and uses several times less memory than DeBERTav3. The memory efficiency enables larger batch sizes. I got a ~2x speed increase by enabling bf16 (instead of fp16).
  • It performs slightly worse than DeBERTav3 on average on the tasks tested below.
  • I'm in the process of preparing a newer version trained on better synthetic data to make full use of the 8k context window and to update the training mix of the older zeroshot-v2.0 models.
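
The bf16 and batch-size remarks above can be tried out with a sketch like the following. It assumes a CUDA GPU with bf16 support and the Hub id `MoritzLaurer/ModernBERT-base-zeroshot-v2.0` (not stated in this card). `measure_throughput` is a hypothetical helper, not the script used for the benchmarks in this card.

```python
import time


def load_bf16_classifier(model_id="MoritzLaurer/ModernBERT-base-zeroshot-v2.0"):
    """Load the zero-shot pipeline in bf16 on GPU 0 (model id is an assumption)."""
    import torch
    from transformers import pipeline

    return pipeline(
        "zero-shot-classification",
        model=model_id,
        torch_dtype=torch.bfloat16,  # swap for torch.float16 to compare
        device=0,
    )


def measure_throughput(predict, texts, batch_size=32):
    """Return texts/sec for `predict`, a callable that takes a list of texts."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        predict(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")
```

With a loaded classifier `clf`, something like `measure_throughput(lambda batch: clf(batch, ["positive", "negative"], batch_size=32), texts)` gives a rough texts/sec figure. Numbers will vary with text length, label count, and hardware.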

Training results

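| Datasets | Accuracy | F1 macro | Inference text/sec (A100 40GB GPU, batch=32) |
|---|---|---|---|
| Mean | 0.85 | 0.834 | 1116.0 |
| Mean w/o NLI | 0.851 | 0.835 | 1104.0 |
| mnli_m | 0.942 | 0.935 | 1039.0 |
| mnli_mm | 0.944 | 0.938 | 1241.0 |
| fevernli | 0.894 | 0.882 | 1138.0 |
| anli_r1 | 0.812 | 0.795 | 1102.0 |
| anli_r2 | 0.717 | 0.688 | 1124.0 |
| anli_r3 | 0.716 | 0.676 | 1133.0 |
| wanli | 0.836 | 0.823 | 1251.0 |
| lingnli | 0.909 | 0.898 | 1240.0 |
| wellformedquery | 0.815 | 0.814 | 1263.0 |
| rottentomatoes | 0.899 | 0.899 | 1231.0 |
| amazonpolarity | 0.964 | 0.964 | 1054.0 |
| imdb | 0.951 | 0.951 | 559.0 |
| yelpreviews | 0.984 | 0.984 | 795.0 |
| hatexplain | 0.814 | 0.77 | 1238.0 |
| massive | 0.8 | 0.753 | 1312.0 |
| banking77 | 0.744 | 0.763 | 1285.0 |
| emotiondair | 0.752 | 0.69 | 1273.0 |
| emocontext | 0.802 | 0.805 | 1268.0 |
| empathetic | 0.544 | 0.533 | 992.0 |
| agnews | 0.899 | 0.899 | 1222.0 |
| yahootopics | 0.735 | 0.729 | 894.0 |
| biasframes_sex | 0.934 | 0.925 | 1176.0 |
| biasframes_offensive | 0.864 | 0.864 | 1194.0 |
| biasframes_intent | 0.877 | 0.877 | 1197.0 |
| financialphrasebank | 0.913 | 0.901 | 1206.0 |
| appreviews | 0.953 | 0.953 | 1166.0 |
| hateoffensive | 0.921 | 0.855 | 1227.0 |
| trueteacher | 0.821 | 0.821 | 541.0 |
| spam | 0.989 | 0.983 | 1199.0 |
| wikitoxic_toxicaggregated | 0.901 | 0.901 | 1045.0 |
| wikitoxic_obscene | 0.927 | 0.927 | 1054.0 |
| wikitoxic_identityhate | 0.931 | 0.931 | 1020.0 |
| wikitoxic_threat | 0.959 | 0.952 | 1005.0 |
| wikitoxic_insult | 0.911 | 0.911 | 1063.0 |
| manifesto | 0.497 | 0.362 | 1214.0 |
| capsotu | 0.73 | 0.662 | 1220.0 |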
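
The card does not say how the two summary columns are computed. The reported accuracy values are consistent with unweighted means over the per-dataset scores, assuming the NLI group is the eight datasets from mnli_m through lingnli. The sketch below checks that reading against the accuracy figures above.

```python
# Per-dataset accuracies copied from the results above.
# Assumption: these eight datasets form the "NLI" group.
nli = {
    "mnli_m": 0.942, "mnli_mm": 0.944, "fevernli": 0.894, "anli_r1": 0.812,
    "anli_r2": 0.717, "anli_r3": 0.716, "wanli": 0.836, "lingnli": 0.909,
}
other = {
    "wellformedquery": 0.815, "rottentomatoes": 0.899, "amazonpolarity": 0.964,
    "imdb": 0.951, "yelpreviews": 0.984, "hatexplain": 0.814, "massive": 0.8,
    "banking77": 0.744, "emotiondair": 0.752, "emocontext": 0.802,
    "empathetic": 0.544, "agnews": 0.899, "yahootopics": 0.735,
    "biasframes_sex": 0.934, "biasframes_offensive": 0.864,
    "biasframes_intent": 0.877, "financialphrasebank": 0.913,
    "appreviews": 0.953, "hateoffensive": 0.921, "trueteacher": 0.821,
    "spam": 0.989, "wikitoxic_toxicaggregated": 0.901,
    "wikitoxic_obscene": 0.927, "wikitoxic_identityhate": 0.931,
    "wikitoxic_threat": 0.959, "wikitoxic_insult": 0.911,
    "manifesto": 0.497, "capsotu": 0.73,
}

# Unweighted mean over all 36 datasets, and over the 28 non-NLI ones.
mean_all = (sum(nli.values()) + sum(other.values())) / (len(nli) + len(other))
mean_wo_nli = sum(other.values()) / len(other)

print(round(mean_all, 3), round(mean_wo_nli, 3))  # 0.85 0.851
```

Because the means are unweighted, small and hard datasets such as manifesto and empathetic pull the average down as much as large ones.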

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 9e-06
  • train_batch_size: 16
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.06
  • num_epochs: 2
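
The hyperparameter names above look like the ones the Hugging Face `Trainer` writes into auto-generated model cards, so they can be mapped onto `TrainingArguments` roughly as follows. This is a sketch rather than the author's training script: the `output_dir` value and the single-device setup are assumptions.

```python
# Values copied from the hyperparameter list above, keyed by the
# corresponding TrainingArguments parameter names.
hparams = {
    "learning_rate": 9e-06,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "seed": 42,
    "gradient_accumulation_steps": 2,
    "optim": "adamw_torch",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.06,
    "num_train_epochs": 2,
}

# Assumption: a single device, which reproduces total_train_batch_size = 32.
num_devices = 1
total_train_batch_size = (
    hparams["per_device_train_batch_size"]
    * hparams["gradient_accumulation_steps"]
    * num_devices
)
print(total_train_batch_size)  # 32


def make_training_args(output_dir="./modernbert-zeroshot"):  # hypothetical path
    """Build TrainingArguments from the values above (needs transformers + torch)."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **hparams)
```

The effective batch size is the per-device batch size times the gradient accumulation steps times the number of devices, which is why 16 and 2 give the listed total of 32 on one GPU.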

Framework versions

  • Transformers 4.48.0.dev0
  • PyTorch 2.5.1+cu124
  • Datasets 3.2.0
  • Tokenizers 0.21.0
