
Aligning Text, Images, and 3D Structure Token-by-Token
We present a unified LLM that aligns language, images, and structured 3D scenes, demonstrating applications in rendering, recognition, instruction following, and 3D QA.
I am a Ph.D. student at Caltech, advised by Prof. Georgia Gkioxari and Prof. Pietro Perona.
Previously, I was a student researcher at the MIT–IBM Watson AI Lab, where I worked on pretraining large language models and on vision–language models with Dr. Rameswar Panda, Dr. Rogerio Feris, and Prof. Yoon Kim. I completed a dual degree (B.Tech + M.Tech) at IIT Kharagpur, working in the Computer Vision and Intelligence Research Lab under Prof. Abir Das.
In Summer 2021, I worked with Prof. Kate Saenko (Boston University) and Prof. Trevor Darrell (UC Berkeley) as a research intern for the DARPA LwLL project.
My research interests lie in understanding the principles of learning across multiple modalities and how knowledge transfers between them, with the goal of designing embodied multimodal agents that benefit society. Questions like “Do toddlers use similar principles to learn new languages as they do to learn to walk?” excite me.
We propose a semi-supervised prompt-learning framework that leverages unlabeled data to improve vision–language model (VLM) adaptation via cross-model consistency.
We propose a domain-alignment approach with switchable depth, width, and input resolution to realize accuracy–efficiency trade-offs under different resource constraints.
We introduce a dataset repair strategy combining a classifier with a GAN to augment minority-class examples.
Email: aadarsh.sahoo.99@gmail.com
Profiles: Google Scholar · LinkedIn · GitHub · X/Twitter