Ellis Brown

I am a CS PhD Student at NYU Courant advised by Profs. Saining Xie and Rob Fergus. My research is supported by the NDSEG Fellowship. I recently interned with Ross Girshick at the Allen Institute for AI (Ai2).

Before NYU, I completed a Master’s at Carnegie Mellon, where I was advised by Profs. Deepak Pathak and Alyosha Efros. Before that, I was a founding research engineer at BlackRock AI Labs, working with Profs. Mykel Kochenderfer, Stephen Boyd, and Trevor Hastie on applied research & finance, and a non-degree grad student at Stanford and Columbia. I did my undergrad at Vanderbilt, where I majored in CS & Math and did research in CogSci & Vision with Prof. Maithilee Kunda. I’m originally from St. Louis, MO and am a proud member of the Osage Nation.

→ If you haven’t made time for a regular check-in with a doctor recently, please do! Even if you feel perfectly healthy.

news

May, 2025	Honored to be recognized as a CVPR 2025 Outstanding Reviewer!
Sep., 2024	Cambrian was accepted to NeurIPS 2024 as an oral presentation 🪼🎉
Mar., 2024	Thrilled to have been awarded the NDSEG Fellowship to support my PhD research at NYU!
Feb., 2024	I will be joining AllenAI (AI2) as a Research Intern this summer in Seattle, working with Ross Girshick!
Aug., 2023	Excited to be starting my PhD at NYU advised by Profs. Saining Xie and Rob Fergus 🎉🗽

selected research (all)

My research interests lie at the intersection of deep learning, computer vision, and robotics—particularly in the areas of (multimodal) representation learning, self-supervised learning, open-endedness, and agents.

publications

2024

NeurIPS
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Shengbang Tong^*, Ellis Brown^*, Penghao Wu^*, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie

In NeurIPS, 2024

Selected for Oral Presentation (1.8%) at NeurIPS 2024

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures—self-supervised, strongly supervised, or combinations thereof—based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
@inproceedings{tong2024cambrian,
  title = {{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
  author = {Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Ziteng and Fergus, Rob and LeCun, Yann and Xie, Saining},
  year = {2024},
  booktitle = {NeurIPS},
}

ECCV
V-IRL: Grounding Virtual Intelligence in Real Life

Jihan Yang, Runyu Ding, Ellis Brown, Xiaojuan Qi, and Saining Xie

In ECCV, 2024

There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce V-IRL: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. Our platform serves as a playground for developing agents that can accomplish various practical tasks and as a vast testbed for measuring progress in capabilities spanning perception, decision-making, and interaction with real-world data across the entire globe.
@inproceedings{yang2024virl,
  author = {Yang, Jihan and Ding, Runyu and Brown, Ellis and Qi, Xiaojuan and Xie, Saining},
  title = {V-IRL: Grounding Virtual Intelligence in Real Life},
  year = {2024},
  booktitle = {ECCV}
}

2023

ICCV
Your Diffusion Model is Secretly a Zero-Shot Classifier

Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak

In ICCV, 2023

The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation.

In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. Although a gap remains between generative and discriminative approaches on zero-shot recognition tasks, our diffusion-based approach has significantly stronger multimodal compositional reasoning ability than competing discriminative approaches.

Finally, we use Diffusion Classifier to extract standard classifiers from class-conditional diffusion models trained on ImageNet. Our models achieve strong classification performance using only weak augmentations and exhibit qualitatively better "effective robustness" to distribution shift. Overall, our results are a step toward using generative over discriminative models for downstream tasks.
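
The zero-shot classification idea described above can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it assumes that a class's conditional density can be ranked by how well a class-conditional noise predictor recovers the noise added to the input, and it replaces the real diffusion model with a hypothetical `toy_eps_model` in which each class is a single prototype point. The names `toy_eps_model`, `classify`, `prototypes`, and the `alpha_bars` noise levels are all illustrative assumptions.

```python
import math
import random


def toy_eps_model(x_t, alpha_bar, prototype):
    """Hypothetical stand-in for a class-conditional noise predictor.

    Assumes the class is a point mass at `prototype`, in which case the
    ideal noise estimate has the closed form below.
    """
    scale = math.sqrt(1.0 - alpha_bar)
    return [(xt - math.sqrt(alpha_bar) * p) / scale
            for xt, p in zip(x_t, prototype)]


def classify(x0, prototypes, alpha_bars=(0.9, 0.5, 0.1), n_trials=8, seed=0):
    """Return the label whose conditioning gives the lowest total noise-prediction error."""
    rng = random.Random(seed)
    errors = {label: 0.0 for label in prototypes}
    for _ in range(n_trials):
        for alpha_bar in alpha_bars:
            # Noise the input once, then score every class on the same noisy sample.
            eps = [rng.gauss(0.0, 1.0) for _ in x0]
            x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
                   for x, e in zip(x0, eps)]
            for label, proto in prototypes.items():
                eps_hat = toy_eps_model(x_t, alpha_bar, proto)
                errors[label] += sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
    return min(errors, key=errors.get)


# Two illustrative classes in a 2-D "image" space (assumed values).
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
prediction = classify([0.9, 0.2], prototypes)
```

In this toy setting the error for each class reduces to a scaled squared distance to its prototype, so the rule picks the nearest prototype; with a real text-to-image model, the noise predictor would instead be conditioned on a text prompt per class.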
@inproceedings{li2023diffusion,
  title = {Your Diffusion Model is Secretly a Zero-Shot Classifier},
  author = {Li, Alexander C. and Prabhudesai, Mihir and Duggal, Shivam and Brown, Ellis and Pathak, Deepak},
  year = {2023},
  booktitle = {ICCV},
  pages = {2206-2217},
}

ICML
Internet Explorer: Targeted Representation Learning on the Open Web

Alexander C. Li^*, Ellis Brown^*, Alexei A. Efros, and Deepak Pathak

In ICML, 2023

Vision models heavily rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only understand knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet—where billions of images are uploaded each day. We suggest an alternate approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30–40 hours.
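
The search, score, and reprioritize cycle described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the actual system: `search` and `usefulness` are hypothetical callables supplied by the caller, the self-supervised training step is left as a comment, and the moving-average update for query priorities is an assumed choice.

```python
import random


def explore(concepts, search, usefulness, rounds=20, seed=0):
    """Toy query loop: sample a query, score its results, update its priority."""
    rng = random.Random(seed)
    scores = {c: 1.0 for c in concepts}  # optimistic start so every concept gets tried
    collected = []
    for _ in range(rounds):
        # 1. Choose what to search for, favoring concepts that paid off before.
        weights = [max(scores[c], 1e-6) for c in concepts]
        query = rng.choices(concepts, weights=weights)[0]
        # 2. "Download" results for the text query (hypothetical search function).
        images = search(query)
        # 3. Self-supervised training on the new images would happen here.
        # 4. Decide which images were useful for the target dataset.
        rewards = [usefulness(img) for img in images]
        collected.extend(img for img, r in zip(images, rewards) if r > 0.5)
        # 5. Reprioritize: moving average of the mean reward for this query.
        scores[query] = 0.5 * scores[query] + 0.5 * sum(rewards) / len(rewards)
    return scores, collected


# Illustrative stand-ins: a flower-like target task and a fake search engine.
relevant = {"rose", "tulip"}


def fake_search(query):
    return [f"{query}-{i}" for i in range(3)]


def fake_usefulness(img):
    return 1.0 if img.split("-")[0] in relevant else 0.0


scores, collected = explore(["rose", "tulip", "truck"], fake_search, fake_usefulness)
```

Over the rounds, concepts whose results keep scoring as useful retain a high priority, while unhelpful ones decay and are sampled less often.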
@inproceedings{li2023internet,
  title = {Internet Explorer: Targeted Representation Learning on the Open Web},
  author = {Li, Alexander C. and Brown, Ellis and Efros, Alexei A. and Pathak, Deepak},
  year = {2023},
  booktitle = {ICML},
}
