Abstract: Recent neural models for video captioning are typically built using a framework that combines a pre-trained visual encoder with a large language model(LLM) decoder. However, large language ...
Abstract: 6D object tracking plays an important role in various applications, including robotic manipulation and virtual reality. While current methodologies have achieved significant advancements ...