Abstract: Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to a large gap between vision and text, they might not be ...