Wei Xiong

I am a senior research scientist at NVIDIA. I obtained my Ph.D. in Computer Science at the University of Rochester, under the supervision of Prof. Jiebo Luo.

My recent research focuses on visual generative modeling, including foundational generative models like pixel-space diffusion models, generative rendering, text-to-game generation, streaming video generation, and interactive world models. I am also interested in image composition, relighting, shadow synthesis, and representation learning.

I am open to long-term university research collaborations, where I can provide mentorship for research projects, including: high-level and detailed research ideas, co-debugging if necessary, and connections to top researchers in the related fields. If you are interested, feel free to reach out with your resume.

Email / Google Scholar / LinkedIn / GitHub

News!

Dec. 2025 -- Introducing our recent work "PixelDiT: Pixel Diffusion Transformers for Image Generation", a pixel-space diffusion model that is directly pretrained at 1024x1024 resolution in pixel space for text-to-image generation.
Nov. 2025 -- Our work "MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models" is accepted to NeurIPS 2025.
Sep. 2025 -- I will serve as an Area Chair for ICLR 2026.
Jun. 2025 -- I am serving as an Area Chair for AAAI 2026.
Jun. 2025 -- Our work "DIVE: Taming DINO for Subject-Driven Video Editing" is accepted to ICCV 2025.
Mar. 2025 -- Our work "MetaShadow" is accepted to CVPR 2025.
Jan. 2025 -- Our work "Refine-by-Align" is accepted to ICLR 2025.
Oct. 2024 -- Our works IMPRINT and Relightful Harmonization have been successfully productized and introduced in Adobe Max Sneaks! See the YouTube introductions: ProjectBlend in Adobe Max and ProjectVisionCast in Adobe Summit Sneaks.
Sep. 2024 -- Check out our new work "GroundingBooth: Grounding Text-to-Image Customization", an advanced image customization model.
Jul. 2024 -- Two papers are accepted to ECCV 2024. Many thanks to the collaborators! Check the details below.
Feb. 2024 -- Three papers are accepted to CVPR 2024. Many thanks to the collaborators! Check the details below.
Feb. 2024 -- Our work "IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation" is accepted to CVPR 2024.
Feb. 2024 -- Our work "Relightful Harmonization: Lighting-aware Portrait Background Replacement" is accepted to CVPR 2024.
Feb. 2024 -- Our work "InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning" is accepted to CVPR 2024. [Project Page]
Sep. 2023 -- Our paper "Photoswap: Personalized subject swapping in images" is accepted to NeurIPS 2023. [Project Page]
Jun. 2022 -- I obtained my Ph.D. from the University of Rochester. See my PhD thesis. Many thanks to my advisor Jiebo Luo, my mentors, and my collaborators, who have given me this unforgettable experience.
Jun. 2022 -- Our paper "Unsupervised Low-light Image Enhancement with Decoupled Networks" is accepted to ICPR 2022.
Mar. 2022 -- Our paper "Breast Cancer Induced Bone Osteolysis Prediction Using Temporal Variational Auto-Encoders" is accepted to BME Frontiers.

Selected Research Work

I have recently been focusing on visual generative modeling, especially content creation with identity preservation, including generative image enhancement and personalized visual concept generation. Representative works are highlighted. See the full list on my Google Scholar.

	PixelDiT: Pixel Diffusion Transformers for Image Generation Yongsheng Yu, Wei Xiong^†, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo (^† Project Lead & Main Advising) arXiv 2025 PDF / Project Page PixelDiT scales diffusion transformers directly in pixel space, pretraining at 1024×1024 resolution to deliver high-fidelity text-to-image generation and editing.
	DIVE: Taming DINO for Subject-Driven Video Editing Yi Huang, Wei Xiong^†, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen (^† Project Lead) ICCV 2025 PDF / Project Page We propose DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities.
	MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong*, Jiebo Luo NeurIPS 2025 PDF / Project Page / Code / Data We propose Multi-Modal Image Generation Benchmark (MMIG-Bench), a comprehensive benchmark for evaluating multi-modal image generation models. MMIG-Bench unifies compositional evaluation across T2I and customized generation, introduces explainable aspect-level metrics, and provides extensive human and automatic evaluations.
	Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian L. Price, Scott Cohen, Jianming Zhang, Daniel Aliaga ICLR 2025 PDF / Project Page We propose a new approach to improve the identity preservation of generated objects. Specifically, we automatically locate and align the visual tokens in the reference with the target region that needs to be refined.
	GroundingBooth: Grounding Text-to-Image Customization Zhexiao Xiong, Wei Xiong^†, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs (^† Main Advising & Project Lead) arXiv 2024 [PDF] [Project Page] We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task.
	WAS: Dataset and Methods for Artistic Text Segmentation Xudong Xie, Yuzhe Li, Yang Liu, Zhifei Zhang, Zhaowen Wang, Wei Xiong, Xiang Bai ECCV 2024 [PDF] We tackle the task of artistic text segmentation and construct a real artistic text segmentation dataset.
	SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing Jing Gu, Yilin Wang, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang ECCV 2024 [PDF] [Project Page] We introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged.
	IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian L. Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga CVPR 2024 [PDF] [Project Page] Our work achieves advanced image composition with decent identity preservation, automatic object viewpoint/pose adjustment, color and lighting harmonization, and shadow synthesis. All these effects are achieved in a single framework!
	Relightful Harmonization: Lighting-aware Portrait Background Replacement Mengwei Ren, Wei Xiong, Jae Shin Yoon, Zhixin Shu, Jianming Zhang, HyunJoon Jung, Guido Gerig, He Zhang (Work done while Mengwei was an intern at Adobe) CVPR 2024 [PDF] [Project Page] We introduce Relightful Harmonization, a lighting-aware diffusion model designed to seamlessly harmonize sophisticated lighting effects for the foreground portrait using any background image.
	PHOTOSWAP: Personalized Subject Swapping in Images Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang NeurIPS 2023 [PDF] [Project Page] [Code] We present Photoswap, a novel approach that enables image editing through personalized subject swapping in existing images.
	InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning Jing Shi, Wei Xiong*, Zhe Lin, HyunJoon Jung (* Equal Contribution) CVPR 2024 [PDF] [Project Page] We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables fast personalized text-to-image generation without test-time finetuning.
	Guidance-driven Visual Synthesis with Generative Models Wei Xiong PhD Thesis 2022 [PDF] My PhD thesis summarizes my main research works on Guided Visual Content Creation during my PhD program. Part I introduces guidance-driven visually pleasing data synthesis. Part II presents guidance-driven synthesis for downstream visual recognition tasks.
	Unsupervised Low-light Image Enhancement with Decoupled Networks Wei Xiong, Ding Liu, Xiaohui Shen, Chen Fang, Jiebo Luo ICPR 2022 [PDF] This is among the few pioneering works on unsupervised real-world low-light image enhancement. Specifically, we tackle the problem of enhancing real-world low-light images with significant noise in an unsupervised fashion. To this end, we explicitly decouple this task into two sub-problems: illumination enhancement and noise suppression.
	Breast Cancer Induced Bone Osteolysis Prediction Using Temporal Variational Auto-Encoders Wei Xiong, Neil Yeung, Shubo Wang, Haofu Liao, Jiebo Luo, Liyun Wang (* Equal Contribution) BME Frontiers 2022 [PDF] We adopt a temporal variational auto-encoder (T-VAE) model for bone osteolysis prediction on computed tomography (CT) images of murine breast cancer bone metastases.
	Image Sentiment Transfer Tianlang Chen, Wei Xiong, Haitian Zheng, Jiebo Luo ACM MM 2020 [PDF] We introduce an important but still unexplored research task, Image Sentiment Transfer, and propose an effective and flexible framework that performs image sentiment transfer at both the image level and the object level.
	Example-Guided Image Synthesis across Arbitrary Scenes using Masked Spatial-Channel Attention and Self-Supervision Haitian Zheng, Haofu Liao, Lele Chen, Wei Xiong, Tianlang Chen, Jiebo Luo ECCV 2020 [PDF] We tackle a challenging exemplar-guided image synthesis task, where the exemplar providing the style guidance is an arbitrary scene image which is semantically different from the given pixel-wise label map.
	Fine-grained Image-to-Image Transformation towards Visual Recognition Wei Xiong, Yutong He, Yixuan Zhang, Wenhan Luo, Lin Ma, Jiebo Luo CVPR 2020 [PDF] [Supplementary] We aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image, which can thereby benefit the subsequent fine-grained image recognition and few-shot learning tasks.
	CariGAN: Caricature Generation through Weakly Paired Adversarial Learning Wei Xiong, Wenbin Li, Haofu Liao, Jing Huo, Yang Gao, Jiebo Luo (* Equal Contribution) Neural Networks 2020 [PDF] We frame the caricature generation task as a weakly paired image-to-image translation task, and propose the CariGAN model to generate high-fidelity caricature images from human faces with proper exaggerations.
	Foreground-aware Image Inpainting Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, Jiebo Luo CVPR 2019 [PDF] [Hole Mask Dataset] We propose a foreground-aware image inpainting system that explicitly disentangles structure inference and content completion. Our model first learns to predict the foreground contour, and then inpaints the missing region using the predicted contour as guidance.
	Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, Jiebo Luo CVPR 2018 [Project Page] [PDF] [Code] [Timelapse Video Dataset] We propose a two-stage GAN model to generate vivid yet content-preserving time-lapse videos from only a single starting frame. To this end, we disentangle the task into content generation and motion enhancement.
	Stacked Convolutional Denoising Auto-Encoders for Feature Representation Bo Du, Wei Xiong, Jia Wu, Lefei Zhang, Liangpei Zhang, Dacheng Tao (First author was my advisor) IEEE Trans. Cybernetics 2017 [PDF] We propose an unsupervised feature learning model, named the Stacked Convolutional Denoising Auto-Encoders, that can map an image to hierarchical representations without any label information.
	Regularizing Deep Convolutional Neural Networks with a Structured Decorrelation Constraint Wei Xiong, Bo Du, Lefei Zhang, Ruimin Hu, Dacheng Tao ICDM 2016 [PDF] We propose a group regularization method, Structured Decorrelation Constraint (SDC), that regularizes the activations of the hidden layers in groups to achieve better generalization.