Building More Controllable Text-to-Image Generation
Recently, text-to-image generation models have gained tremendous popularity thanks to their ability to produce accurate, diverse, and even creative images from text prompts. However, text prompts are highly ambiguous as a means of conveying visual control. For example, we may want to generate an image containing "my own backpack", or an image with "my backyard" as the background; such control signals cannot be well represented as text. We therefore need diverse types of control signals to complement the text-to-image generation process. Specifically, we work on two novel tasks: (1) subject-driven image generation, where the model must generate images containing a given subject (e.g., a specific dog or backpack), and (2) subject-driven image editing, where the model must swap or add a given subject into a given scene. We first introduce new benchmarks for these two tasks and then propose new training algorithms to address them.
Bio: Wenhu Chen is an assistant professor in Computer Science at the University of Waterloo, a faculty member at the Vector Institute, a CIFAR AI Chair, and a part-time researcher at Google DeepMind. He obtained his PhD from the Computer Science department of the University of California, Santa Barbara in 2021, and spent a wonderful postdoctoral year at Google Research. His main research interests include natural language processing, large language models, vision-language interaction, and image generation.