
⚡️BytevalKit-Emb: One-Stop Embedding Model Evaluation Tool


English | 中文

Overview

BytevalKit-Emb is a modular embedding model evaluation framework that automates model performance assessment through standardized pipelines. The framework adopts a configuration-driven design and supports multiple task types and model architectures.

Core Features

  • Multi-type Model Support: Supports multiple model backends, including GritLM, SentenceTransformers, and GME, covering both single-modal and multi-modal models
  • Automated Evaluation Pipeline: A fully automated "dataset loading → model inference → metric calculation" pipeline
  • Extended Evaluation Methods: Supports not only MTEB and MMEB evaluation tasks but also custom Retrieval, Classification, and Similarity Classification tasks
  • Flexible Configuration System: YAML-based configuration, easy to customize and extend
  • Extensible and Reproducible: New models and evaluation tasks can be added quickly by subclassing BaseModel and BaseTask (see the sketch after this list); embeddings and related results are fully recorded during evaluation, so results can be reproduced and debugged
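
As a rough illustration of the extension point, the sketch below shows the shape a custom model wrapper might take. The class and method names here are assumptions for illustration; consult BaseModel in the source for the actual interface.

```python
# A minimal sketch of a custom model wrapper. The encode() signature and the
# idea of subclassing BaseModel follow the feature description above; the real
# BytevalKit-Emb interface may differ, so treat these names as placeholders.
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


class MyEmbeddingModel:  # in the framework, this would subclass BaseModel
    def __init__(self, path_or_dir: str, **model_kwargs):
        # Load a checkpoint from a local path or a model ID.
        self.model = SentenceTransformer(path_or_dir, **model_kwargs)

    def encode(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        # Return one embedding vector per input text.
        return self.model.encode(texts, batch_size=batch_size, convert_to_numpy=True)
```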

Changelog

  • 🎉 [2025.06.13]: BytevalKit-Emb v1.0.0 first open-source release
  • 📚 [2025.06.13]: Documentation and tutorials are now online

Installation

Install from Source

Clone the repository and install:

Recommended Python version: 3.9 or above

```bash
git clone https://github.com/bytedance/BytevalKit-Emb.git
cd BytevalKit-Emb
pip install -r requirements.txt
```
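
Optionally (this check is not part of the project's instructions), you can confirm your interpreter meets the recommended minimum version before installing:

```python
# Optional sanity check: the framework recommends Python 3.9 or above.
import sys

assert sys.version_info >= (3, 9), f"Python 3.9+ recommended, found {sys.version}"
```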

Quick Start

For more detailed usage instructions, including how to evaluate models and how to add custom models, datasets, and evaluation metrics, please refer to the Usage Instructions.

Basic Usage

Start an evaluation task ({workspace} is a placeholder for your local checkout path):

```bash
python3 run.py --yaml-path={workspace}/configs/config.yaml
```

For an example YAML configuration, refer to the Example YAML Configuration.

Configuration Parameters

```yaml
DEFAULT:  # Task-level configuration
    task_name: eval_task_1  # Evaluation task name
    work_dir: {workspace}/outputs  # Directory for evaluation inference results, metric results, etc.

DATASET:  # Dataset-level configuration
    dataset_xxxx:
        type: mteb_classification  # Evaluation task type; options: classification, mteb_classification, retrieval, similarity_classification
        name: IFlyTek  # Evaluation dataset name
        data_dir: {workspace}/demo/datasets/mteb_classification/IFlyTek-classification  # Evaluation dataset path
        data_type: parquet  # Dataset file format
        # For other configuration parameters, refer to the documentation for each evaluation task

MODEL:  # Model-level configuration
    model_paraphrase-multilingual-MiniLM-L12-v2:
        type: sentence_transformer  # Model type; options: sentence_transformer, gritlm
        name: paraphrase-multilingual-MiniLM-L12-v2  # Model name
        path_or_dir: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2  # Local path or model ID
        model_kwargs:  # Model loading parameters
            revision: "v1.1"
        preprocessors: []  # Pre-inference processors
        worker_num: 20  # Number of concurrent inference workers
```
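
Before launching run.py, it can help to verify that a config file parses and contains the three top-level sections shown above. The snippet below is an illustrative check using PyYAML, not part of BytevalKit-Emb itself; the framework's own loader may perform stricter validation.

```python
# Illustrative config check using PyYAML; not part of BytevalKit-Emb itself.
import yaml

with open("configs/config.yaml") as f:  # adjust the path to your config
    cfg = yaml.safe_load(f)

# The three top-level sections shown above.
for section in ("DEFAULT", "DATASET", "MODEL"):
    assert section in cfg, f"missing top-level section: {section}"

print("task:", cfg["DEFAULT"]["task_name"])
print("datasets:", list(cfg["DATASET"]))
print("models:", list(cfg["MODEL"]))
```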

Benchmark

Note: To demonstrate that the framework is compatible with the MTEB and MMEB evaluation methods, we validated it with open-source models on a subset of their evaluation datasets; both the datasets and the evaluation logic are taken from the official MTEB and MMEB evaluation scripts.

The results below are framework validation results only; models are listed in no particular order.

MTEB-Classification

| Model | IFlyTek-classification | JDReview-classification | MultilingualSentiment-classification | OnlineShopping-classification | TNews-classification | waimai-classification |
| --- | --- | --- | --- | --- | --- | --- |
| xiaobu-embedding | 49.29 | 85.56 | 76.83 | 92.75 | 26.01 | 88.1 |
| xiaobu-embedding-v2 | 51.21 | 88.47 | 79.38 | 94.5 | 27.3 | 88.85 |
| Conan-embedding-v1 | 51.52 | 90.07 | 78.6 | 95 | 27.5 | 89.7 |
| gte-base-zh | 47.67 | 85.83 | 75.28 | 93.8 | 26.72 | 87.85 |
| gte-large-zh | 49.83 | 88 | 76.33 | 91.75 | 25.8 | 88.05 |
| gte-Qwen2-1.5B-instruct | 39.75 | 80.49 | 67.92 | 87.6 | 25.23 | 84.75 |
| bge-large-zh-v1.5 | 48.21 | 85.02 | 74.15 | 92.74 | 26.08 | 86.7 |

MTEB-Similarity Classification

| Model | CMNLI | Ocnli |
| --- | --- | --- |
| xiaobu-embedding | 55.3 | 55.93 |
| xiaobu-embedding-v2 | 51.44 | 51.27 |
| Conan-embedding-v1 | 54.46 | 51.38 |
| gte-base-zh | 63.04 | 60.8 |
| gte-large-zh | 76.2 | 73.03 |
| gte-Qwen2-1.5B-instruct | 53.27 | 53.65 |
| bge-large-zh-v1.5 | 67.66 | 62.59 |

MTEB-Retrieval (NDCG@10)

| Model | CmedqaRetrieval | CovidRetrieval | DuRetrieval | MedicalRetrieval | MMarcoRetrieval | T2Retrieval | VideoRetrieval |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xiaobu-embedding | 44.47 | 87.75 | 86.81 | 63.19 | 78.39 | 86.22 | 73.17 |
| xiaobu-embedding-v2 | 47.38 | 89.5 | 89.68 | 67.98 | 82.26 | 85.59 | 80.08 |
| Conan-embedding-v1 | 47.78 | 91.23 | 88.79 | 67.13 | 82.27 | 83.79 | 80.29 |
| gte-base-zh | 44.57 | 75.71 | 84.09 | 65.02 | 77.71 | 83.91 | 74.38 |
| gte-large-zh | 43.42 | 88.44 | 85.65 | 62.81 | 77.52 | 82.95 | 73.01 |
| bge-large-zh-v1.5 | 41.81 | 73.03 | 88.76 | 57.35 | 78.77 | 84.29 | 70.89 |

MMEB

| Model | ChartQA | DocVQA | ImageNet-1K | ImageNet-A | ImageNet-R | MSCOCO_t2i | ObjectNet | OK-VQA | VisDial |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gme-Qwen2-VL-2B-Instruct | 8.3 | 17.5 | 26.5 | 12.5 | 60.1 | 53.5 | 31.1 | 11.8 | 30.1 |
| gme-Qwen2-VL-7B-Instruct | 15.3 | 33.6 | 65.2 | 42.3 | 87.1 | 71.1 | 66.6 | 32.3 | 62.5 |

System Architecture

[Architecture Design diagram]

Contributing

This project is developed by the BytevalKit team. Development members:

{Zirui Guo, Hanyu Li, Shenwei Huang}, Yaling Mou, Xianxian Ma, 
Ming Jiang, Haizhen Liao, Jingwei Sun, Binbin Xing

{*} Names in braces contributed equally.

We also thank the Bytedance Douyin Content Team for their support:

Jiefeng Long, Zhihe Wan, Zhenming Sun, Yongchao Liu, Xulei Lou, Shuang Zeng, Xing Lin, Chao Wang, 
Fubang Zhao, QingSong Liu, Song Chen, Xiao Liang, Yixing Chen, Mingyu Guo, Bolun Cai, 
Yi Lin, Junfeng Yao, Chao Feng, Jiao Ran

We also thank the Product Design and Byteval Platform teams for their support:

Ziyu Shi, Zhao Lin, Yang Li, Jing Yang, Zhen Wang, Guojun Ma

And the AI Platform team:

Huiyu Yu, Lin Dong, Yong Zhang

We welcome contributions of all kinds! Please check our Contributing Guide for details.

Citation

If you use BytevalKit-Emb in your research, please consider citing:

@misc{BytevalKit-Emb-2025,
  title={BytevalKit-Emb: Comprehensive Embedding Model Evaluation Framework},
  author={BytevalKit},
  year={2025},
  howpublished={\url{https://github.com/bytedance/BytevalKit-Emb}}
}

License

BytevalKit-Emb is licensed under the Apache License 2.0.

Contact Us

If you have any questions, feel free to contact us at: [email protected]
