Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, Jiebo Luo
Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, but they often suffer from a limited number of parts and hea...
Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian et al.
Temporally localizing actions in a video is a fundamental challenge in video understanding. Most existing approaches draw inspiration from image object detection and extend its advances, e.g., SSD and Faster R-CNN, to produce temporal locations of an action in a 1D sequence. Neverthele...
Kun Liu, Qi Liu, Xinchen Liu, Jie Li et al.
Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes from text. However, current T2V models often cannot generate human-object interaction (HOI) precisely because of the lack of large-scale videos with accurate captions for HOI. To address this issue, ...