😄 😄 😄 I am currently seeking job opportunities for 2027. My research interests include video analysis, image restoration, multimodal pre-training, multimodal large language models, computer vision, and other multimodal-related fields! Feel free to contact me via WeChat (ID: Alocus).
I am a Ph.D. student at Beijing University of Technology. My research focuses on sports video understanding, multimodal learning, video captioning, low-level vision, diffusion models, and knowledge graphs. I received my M.S. degree from Minzu University of China.
I'm interested in computer vision, deep learning, and image processing.
Most of my research focuses on video understanding, especially video captioning and identity-aware sports video captioning.
Previously, I worked on image restoration tasks, including image dehazing and the digital restoration of Dunhuang murals; the latter was the topic of my master's thesis.
We propose EMKG, a video captioning framework that enhances generalization via a ConceptVision Knowledge Graph (CVKG) built from two subgraphs: ConceptCore (C3) and VisionVivid (V3). To reduce noise, a Cross-modal Fine-grained Adaptive Fusion (CmFAF) module dynamically adjusts node and relation weights using cross-modal context, and a Hardest Sample Attribute-anchored Alignment (HSA-Aligner) strengthens visual-textual alignment through attribute-guided hard negative mining. Experiments on MSR-VTT and MSVD show that EMKG outperforms state-of-the-art methods with superior generalization.
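To give a flavor of what hardest-negative alignment looks like in practice, here is a minimal PyTorch sketch; the attribute-anchoring step is omitted, and the function name and margin value are my illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def hardest_negative_alignment(video_emb, text_emb, margin=0.2):
    """Triplet-style loss: each video is pulled toward its paired caption
    and pushed away from the hardest (most similar) non-matching caption,
    and vice versa for each caption."""
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)    # (B, D)
    sim = v @ t.T                        # (B, B) cosine similarities
    pos = sim.diag()                     # matched video-text pairs
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, float('-inf'))   # exclude positives
    hardest_v2t = neg.max(dim=1).values  # hardest caption per video
    hardest_t2v = neg.max(dim=0).values  # hardest video per caption
    return (F.relu(margin + hardest_v2t - pos)
            + F.relu(margin + hardest_t2v - pos)).mean()
```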
This paper proposes MK-VC, a multi-modal knowledge framework for video captioning that targets the long-tail word distribution problem. By integrating dynamic context and static concept knowledge through fine-grained adaptive fusion and attribute-guided alignment, the framework improves caption generation and handles rare words more effectively.
We propose UHCL, a unified hierarchical contrastive learning method for video captioning that enhances caption distinctiveness and overall quality through triamese decoders and adaptive token fusion, achieving state-of-the-art results on the MSR-VTT and MSVD datasets.
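For intuition, hierarchical contrastive learning can be pictured as InfoNCE terms applied at several granularities and summed; the minimal PyTorch sketch below is my own illustration (the level choices and weights are assumptions, not UHCL's actual design).

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

def hierarchical_contrastive_loss(level_pairs, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of contrastive terms computed at several granularities,
    e.g. video-, clip-, and word-level pooled embeddings."""
    return sum(w * info_nce(v, t) for w, (v, t) in zip(weights, level_pairs))
```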
We present LLM-VC, a player-centric multimodal prompt generation network for identity-aware sports video captioning. It integrates visual and semantic cues to recognize player identities and generate accurate, player-specific descriptions, achieving state-of-the-art performance on the NBA-Identity and VC-NBA-2022 datasets.
We propose EIKA, an entity-aware sports video captioning framework that integrates explicit player knowledge and implicit scene understanding to generate fine-grained, informative captions, achieving state-of-the-art results on multiple benchmarks.
We propose DSSM-KG, a dual-stream state-space model with cross-modal knowledge injection for video captioning. By integrating Transformer–Mamba hybrid modules for joint spatiotemporal modeling and adaptively injecting a commonsense-enhanced knowledge graph, DSSM-KG achieves competitive results on MSVD and MSR-VTT.
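One way to picture the adaptive injection step is gated cross-attention from video tokens onto retrieved knowledge-graph embeddings; the sketch below is a minimal PyTorch illustration under my own assumptions about layer shapes and the gating form, not DSSM-KG's released code.

```python
import torch
import torch.nn as nn

class AdaptiveKnowledgeInjection(nn.Module):
    """Video tokens cross-attend over retrieved knowledge-graph node
    embeddings; a per-token sigmoid gate decides how much external
    knowledge to mix back into the visual stream."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, video_feats, kg_emb):
        # video_feats: (B, T, D) frame tokens; kg_emb: (B, K, D) KG nodes
        knowledge, _ = self.cross_attn(video_feats, kg_emb, kg_emb)
        g = torch.sigmoid(self.gate(video_feats))   # (B, T, 1) gate
        return video_feats + g * knowledge
```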
This paper proposes ST2, a SpatioTemporal-enhanced State Space Model and Transformer for video captioning, which runs Mamba and Transformer branches in parallel for efficient joint spatiotemporal modeling, achieving competitive performance on the MSVD and MSR-VTT datasets.
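A minimal sketch of such a parallel Transformer-plus-SSM block is below, assuming the open-source `mamba_ssm` package (which needs a CUDA build) as the state-space branch; the gated fusion and all hyperparameters are my illustrative reading, not the authors' implementation.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed SSM backbone; requires a CUDA build

class ParallelSSMTransformerBlock(nn.Module):
    """Runs a Transformer encoder layer (global pairwise attention) and a
    Mamba layer (linear-time sequential scan) side by side over the same
    frame-feature sequence, then fuses the two views with a learned gate."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn_branch = nn.TransformerEncoderLayer(d_model, n_heads,
                                                      batch_first=True)
        self.ssm_branch = Mamba(d_model=d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                  # x: (B, T, D) frame features
        a = self.attn_branch(x)            # attention view
        s = self.ssm_branch(x)             # state-space view
        g = torch.sigmoid(self.gate(torch.cat([a, s], dim=-1)))
        return g * a + (1 - g) * s         # gated fusion
```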
We propose a VAE-GAN-based hybrid sample learning framework for image dehazing that jointly leverages synthetic and real data through latent-space alignment and feature-adaptive fusion, achieving clearer dehazing results and higher PSNR on real-world hazy images.
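To make the latent-space alignment idea concrete, here is a toy PyTorch sketch in which a small discriminator on encoder latents encourages synthetic and real hazy images to share one latent distribution; module names, sizes, and the adversarial form are my assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDomainDiscriminator(nn.Module):
    """Predicts whether an encoder latent came from synthetic or real haze."""
    def __init__(self, z_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 64),
                                 nn.LeakyReLU(0.2),
                                 nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z)

def latent_alignment_loss(disc, z_syn, z_real):
    """Discriminator objective: label real-domain latents 1 and synthetic
    latents 0. Training the shared encoder adversarially against this loss
    pulls both domains toward one latent distribution, so a decoder fitted
    on paired synthetic data transfers better to real hazy images."""
    logit_r, logit_s = disc(z_real), disc(z_syn)
    return (F.binary_cross_entropy_with_logits(logit_r, torch.ones_like(logit_r))
            + F.binary_cross_entropy_with_logits(logit_s, torch.zeros_like(logit_s)))
```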