Abstract: Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to the large gap between vision and text, these models might not be ...