Kevin (Qi) Zhao 赵齐


About Me

Greetings! I am Qi Zhao; feel free to call me Kevin. I am a second-year master's student at Brown University, fortunately advised by Professor Chen Sun on computer vision. Together we have connected language and video tasks and studied in depth the implications that LLMs bring to video understanding. I also work with Professor George Konidaris on multimodal embodied agents, advancing language grounding for decision-making. Before coming to Brown, I worked at Zoo Capital, a leading investment institution in Shanghai, China. I received my Bachelor of Science degree from NYU Stern in 2021, with a double major in Finance and Mathematics.

My recent research projects focus on bridging large language models (LLMs) with vision foundation models to advance video understanding. In particular, I have worked on incorporating the emergent reasoning capabilities of LLMs into goal-conditioned modeling of human action dynamics. My ongoing research is on building multimodal decision-making embodied agents through environment feedback (RLEF). Specifically, my collaborators and I are interested in building a generalizable reward model that captures environment feedback to ground natural language instructions into hierarchical subgoal decompositions, leveraging the versatility of LLMs as reasoning backbones.

As for my future research goals, I am interested in connecting the "concepts" across the knowledge that foundation models pick up during multimodal representation learning, and in empowering embodied agents and robots with such representations. To this end, I am broadly interested in computer vision and robot learning. In the near future, I aspire to advance multimodal learning with embodied agents and robots through various forms of feedback and supervision, drawing inspiration from cognitive science, which sounds like the inverse of my second goal :)

In my free time, I watch movies, play board games, and advise start-ups. I am always down for a chat about AI, sports, and good food :)



Research

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Qi Zhao*, Shijie Wang*, Ce Zhang, Changcheng Fu, Minh Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

Accepted at ICLR 2024

[website] [arXiv] [video] [code coming soon]

Vamos: Versatile Action Models for Video Understanding

Shijie Wang, Qi Zhao, Minh Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

Preprint, in submission to CVPR 2024

[website] [arXiv]

RLEF: Building Multimodal Decision-making Agents from Environment Feedback

Planned Submission to ICML 2024

[ALFRED Leaderboard] No.1 RLEF by Kevinz



Applications

Phoodify: Edge-Computing AI Mobile App for Diet Tracking

Founder & Developer

[website] [iOS]

Contact Me

Email: qi_zhao [at] brown [dot] edu