
Aligning Text, Images, and 3D Structure Token-by-Token
We present a unified LLM that aligns language, images, and structured 3D scenes, demonstrating applications in rendering, recognition, instruction following, and 3D QA.
I am a Ph.D. student at Caltech advised by Prof. Georgia Gkioxari and Prof. Pietro Perona.
Previously, I was a student researcher at the MIT–IBM Watson AI Lab, pretraining large language models and doing research on vision–language models with Dr. Rameswar Panda, Dr. Rogerio Feris, and Prof. Yoon Kim. I completed a dual degree (B.Tech + M.Tech) at IIT Kharagpur, where I worked in the Computer Vision and Intelligence Research Lab under Prof. Abir Das.
In Summer 2021, I worked with Prof. Kate Saenko (Boston University) and Prof. Trevor Darrell (UC Berkeley) as a research intern for the DARPA LwLL project.
My research focuses on building multimodal foundation models that perceive, reason, and interact in 3D environments. I am particularly interested in developing unified representations that bridge geometry, vision, and language, enabling systems that can understand spatial relationships, engage in grounded conversations, and perform complex reasoning tasks. My work spans 3D tokenization for sequential modeling, conversational visual understanding, and parameter-efficient adaptation of large models. Ultimately, I aim to advance embodied AI systems that can robustly interpret and act within real-world environments to benefit society.
A semi-supervised prompt learning framework leveraging unlabeled data to improve VLM adaptation via cross-model consistency.
We propose a domain-alignment approach with switchable depth, width, and input resolution to realize accuracy–efficiency trade-offs under different constraints.
We introduce a dataset repair strategy combining a classifier with a GAN to augment minority-class examples.
Email: aadarsh.sahoo.99@gmail.com
Profiles: Google Scholar · LinkedIn · GitHub · X/Twitter