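This project is built on the Jittor framework. Following the content and requirements of Track 1 of the competition, it uses a pretrained CLIP model and a very small number of multi-domain training samples for image classification.
The competition dataset consists of four sub-datasets (Tsinghua-Dog, Caltech-101, Food-101, and a self-built animal classification dataset), with 374 classes in total. For each class, participants may pick any 4 images from the training set to train their model. After training, every image in the test set is classified and the Top-5 predictions for each image are output.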
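The two mechanical parts of this task, picking 4 images per class and emitting Top-5 predictions, can be sketched as follows. This is a minimal illustration, not the project's own code: the function names `sample_k_per_class` and `top5` are hypothetical, and the input is assumed to be a list of `(path, label)` pairs and a score matrix of shape `(num_images, num_classes)`.

```python
import random
from collections import defaultdict

import numpy as np


def sample_k_per_class(samples, k=4, seed=0):
    """Pick at most k (path, label) pairs per label, reproducibly."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    picked = []
    for label in sorted(by_label):
        paths = by_label[label]
        chosen = rng.sample(paths, min(k, len(paths)))
        picked.extend((p, label) for p in chosen)
    return picked


def top5(scores):
    """Return the indices of the 5 highest scores per row, best first."""
    scores = np.asarray(scores)
    return np.argsort(-scores, axis=1)[:, :5]
```

With 374 classes and `k=4`, `sample_k_per_class` would return at most 1496 training pairs.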
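The competition baseline is based on CLIP. Two versions of the baseline are provided (Training and Training-Free), as shown in the figure below:
(Note: the baseline does not ship the RN101.pkl file. It only provides a conversion script that turns PyTorch weights into Jittor weights, and that script requires the CUDA versions of both PyTorch and Jittor to be installed in the same environment, otherwise it fails with an error. If you installed the two frameworks in different environments, see self.ipynb for how to convert the weights, or click the link above to download the converted weight file directly.)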
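When PyTorch and Jittor live in separate environments, one possible workaround is to export the weights as plain NumPy arrays on the PyTorch side and load that file on the Jittor side. The sketch below is not the baseline's conversion script and not necessarily what self.ipynb does; `to_numpy_state_dict` and `save_as_pkl` are hypothetical helpers, and it is an assumption that a pickled dict of NumPy arrays is an acceptable input for the Jittor model loader.

```python
import pickle

import numpy as np


def to_numpy_state_dict(state_dict):
    """Convert a mapping of tensors/arrays into plain NumPy arrays."""
    converted = {}
    for name, value in state_dict.items():
        # PyTorch tensors expose detach()/cpu()/numpy(); anything else
        # (lists, NumPy arrays) goes through np.asarray directly.
        if hasattr(value, "detach"):
            value = value.detach().cpu().numpy()
        converted[name] = np.asarray(value)
    return converted


def save_as_pkl(state_dict, out_path):
    """Pickle the converted weights (assumed format, see the note above)."""
    with open(out_path, "wb") as f:
        pickle.dump(to_numpy_state_dict(state_dict), f)
```

Because the saved file contains only NumPy arrays, it can be opened in an environment that has no PyTorch installed.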
This project can run on a single NVIDIA GeForce RTX 3090.
Run the following command to install jittor:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple jittor
Install the other dependencies:
pip install -r requirements.txt
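Run the corresponding code in self.ipynb to generate it.
The two pretrained models, ViT-B-32.pkl and RN101.pkl, are first trained and lightly fine-tuned on the training set, and the better-performing one is selected. On top of that model, different fine-tuning methods such as Linear Probe and Tip-Adapter are used for a second round of training and fine-tuning, and the models obtained from the different methods are finally ensembled.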
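The final ensembling step can be illustrated with a simple weighted average of per-model class probabilities. The text above does not say how the models are fused, so probability averaging is an assumption here, and `softmax` and `ensemble_probs` are hypothetical names rather than functions from this repository.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def ensemble_probs(logits_per_model, weights=None):
    """Weighted average of per-model softmax probabilities.

    logits_per_model: sequence of arrays, each (num_images, num_classes).
    weights: optional per-model weights; uniform when omitted.
    """
    probs = np.stack([softmax(l) for l in logits_per_model])  # (M, N, C)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalize so rows still sum to 1
    return np.tensordot(weights, probs, axes=1)  # (N, C)
```

Averaging probabilities rather than raw logits keeps models with different logit scales from dominating the result; per-model weights could be tuned on a held-out set such as TestSetZ.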
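Replace the prompt "A photo of XXX" with "A photo of XXX with xxx_length", where XXX is the class name and xxx_length is the text length of the class name. For example: "A photo of ant with 3", "A photo of Bernese_mountain_dog with 20", and so on.
Use "A photo of XXX and the length of the prompt is xxx_length" as the text description for all classes.
For more prompt variants, see generate_prompt.
Below are some of the fine-tuning results:
- 2024/5/10: trained for 50 epochs; final accuracy 0.6923 on TestSetZ and 0.6541 on TestSetA;
- 2024/5/11: tuned the learning rate and trained for 200 epochs; final accuracy 0.7220 on TestSetZ and 0.6772 on TestSetA;
- 2024/5/11: tuned the learning rate and trained for 250 epochs; final accuracy 0.7243 on TestSetZ and 0.6769 on TestSetA;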
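The length-augmented prompt templates described before the results can be sketched as below. The helper names `build_prompt` and `build_long_prompt` are hypothetical; the repository's own generate_prompt may be organized differently.

```python
def build_prompt(class_name):
    """Short template: class name followed by its character length."""
    return f"A photo of {class_name} with {len(class_name)}"


def build_long_prompt(class_name):
    """Longer template that spells out what the number means."""
    return (
        f"A photo of {class_name} "
        f"and the length of the prompt is {len(class_name)}"
    )


if __name__ == "__main__":
    for name in ["ant", "Bernese_mountain_dog"]:
        print(build_prompt(name))
```

Both helpers take the length from the raw class-name string, underscores included, which matches the two examples given above (3 and 20).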
├── root
│   ├── JCLIP
│   │   ├── configs
│   │   │   ├── no_finetune.yaml
│   │   │   ├── no_finetune_v1.yaml
│   │   │   ├── Tip-Adapter-F.yaml
│   │   │   ├── Coop.yaml
│   │   │   ├── cross-modal-Adapter.yaml
│   │   │   ├── FD-Align.yaml
│   │   ├── finetune
│   │   │   ├── cache_module.py
│   │   │   ├── Coop.py
│   │   │   ├── Cross_modal_Adapter.py
│   │   │   ├── FD_Align.py
│   │   │   ├── Tipadapter.py
│   │   ├── jclip
│   │   ├── train.py
│   │   ├── test.py
│   │   ├── utils.py
│   │   ├── train1.sh
│   │   ├── train2.sh
│   │   ├── self.ipynb
│   ├── TestSetA
│   ├── TestSetB
│   ├── TestSetZ
│   ├── Weights
│   │   ├── initial_weight
│   │   │   ├── RN101.pkl
│   │   │   ├── ViT-B-32.pkl
│   ├── results
│   ├── train.txt
│   ├── train_4class.txt
│   ├── TestSetZ-label.txt
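1. In no_finetune.yaml, set the pretrained model path to the location of ViT-B-32.pkl, then run: sh train1.sh
2. In Tip-Adapter-F.yaml, set the pretrained model path to the location of the weights produced in the previous step, then run: sh train2.sh
3. In test.py, set your root directory (root_TrainSet), and set pkl_path and Tip_Adapter to the weight paths obtained in steps 1 and 2 respectively, then run: python test.py. The final result file is saved in the results folder.
Changing the value of method_name is sufficient.
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling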
@article{zhang2021tip,
title={Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling},
author={Zhang, Renrui and Fang, Rongyao and Gao, Peng and Zhang, Wei and Li, Kunchang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng},
journal={arXiv preprint arXiv:2111.03930},
year={2021}
}
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
@misc{lin2023crossmodal,
title={Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models},
author={Lin, Zhiqiu and Yu, Samuel and Kuang, Zhiyi and Pathak, Deepak and Ramanan, Deva},
year={2023},
eprint={2301.06267},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning
@article{song2023FD,
title={FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning},
author={Kun Song and Huimin Ma and Bochao Zou and Huishuai Zhang and Weiran Huang},
journal={NeurIPS},
year={2023}
}
Learning to Prompt for Vision-Language Models
@article{zhou2022coop,
title={Learning to Prompt for Vision-Language Models},
author={Zhou, Kaiyang and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei},
journal={International Journal of Computer Vision (IJCV)},
year={2022}
}
Robust fine-tuning of zero-shot models
@article{wortsman2021robust,
title={Robust fine-tuning of zero-shot models},
author={Wortsman, Mitchell and Ilharco, Gabriel and Kim, Jong Wook and Li, Mike and Kornblith, Simon and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Hajishirzi, Hannaneh and Farhadi, Ali and Namkoong, Hongseok and Schmidt, Ludwig},
journal={arXiv preprint arXiv:2109.01903},
note={\url{https://arxiv.org/abs/2109.01903}},
year={2021}
}