Abstract: Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable ...