I’m interested in self-supervised learning, representation learning, curiosity-based exploration, and leveraging internet-scale models and data. I am keen to draw inspiration from intelligence in humans and nature—especially as a goal-post rather than a blueprint. My long-term goal is to develop intelligent agents that can generalize and continually adapt as robustly and efficiently as humans do, allowing them to be safely deployed in the real world.
Publications
2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures—self-supervised, strongly supervised, or combinations thereof—based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
V-IRL: Grounding Virtual Intelligence in Real Life
There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce V-IRL: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. Our platform serves as a playground for developing agents that can accomplish various practical tasks and as a vast testbed for measuring progress in capabilities spanning perception, decision-making, and interaction with real-world data across the entire globe.
The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation.
In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. Although a gap remains between generative and discriminative approaches on zero-shot recognition tasks, our diffusion-based approach has significantly stronger multimodal compositional reasoning ability than competing discriminative approaches.
Finally, we use Diffusion Classifier to extract standard classifiers from class-conditional diffusion models trained on ImageNet. Our models achieve strong classification performance using only weak augmentations and exhibit qualitatively better "effective robustness" to distribution shift. Overall, our results are a step toward using generative over discriminative models for downstream tasks.
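The generative approach to classification described above amounts to scoring each candidate class by a conditional density estimate and choosing the highest-scoring class. The sketch below illustrates only that decision rule. It assumes a toy isotropic Gaussian per class as a stand-in for the diffusion model's density estimate, plus a uniform class prior. The names `gaussian_log_density` and `generative_classify` and the example class means are hypothetical and are not taken from the paper.

```python
import math


def gaussian_log_density(x, mean, std):
    """Log density of x under an isotropic Gaussian (toy stand-in for log p(x | class))."""
    log_norm = math.log(std * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((xi - mi) / std) ** 2 - log_norm for xi, mi in zip(x, mean))


def generative_classify(x, class_means, std=1.0):
    """Pick the class whose conditional density estimate of x is highest.

    With a uniform prior over classes, argmax_c p(c | x) equals argmax_c p(x | c).
    """
    scores = {label: gaussian_log_density(x, mean, std) for label, mean in class_means.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical two-class example in a 2-D feature space.
class_means = {"cat": [0.0, 0.0], "dog": [3.0, 3.0]}
label, scores = generative_classify([0.2, -0.1], class_means)
```

In a real system the Gaussian would be replaced by whatever density estimate the generative model provides; only the argmax over classes carries over.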
Internet Explorer: Targeted Representation Learning on the Open Web
Vision models heavily rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only understand knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet—where billions of images are uploaded each day. We suggest an alternate approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30–40 hours.
2022
Internet Curiosity: Directed Unsupervised Learning on Uncurated Internet Data
We show that a curiosity-driven computer vision algorithm can learn to efficiently query Internet text-to-image search engines for images that improve the model’s performance on a specified dataset. In contrast to typical self-supervised computer vision algorithms, which learn from static datasets, our model actively expands its training set with the most relevant images. First, we calculate the image-level curiosity reward as the negative distance of an image’s representation to its nearest neighbor in the targeted dataset. This reward is easily estimated using only unlabeled data from the targeted dataset, and can be aggregated into a query-level reward that effectively identifies useful queries. Second, we use text embedding similarity scores to propagate observed curiosity rewards to untried text queries. This efficiently identifies relevant semantic clusters without any need for class labels or label names from the targeted dataset. Our method significantly outperforms models that require 1-2 orders of magnitude more compute and data.
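The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: Euclidean distance is assumed for the nearest-neighbor reward, the mean is assumed as the query-level aggregate, and a softmax over cosine similarities (with a hypothetical `temperature` parameter) is assumed for propagating rewards to untried queries. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def image_reward(image_feat, target_feats):
    """Negative distance from an image's representation to its nearest target-set neighbor."""
    return -min(euclidean(image_feat, t) for t in target_feats)


def query_reward(image_feats, target_feats):
    """Aggregate image-level rewards into one query-level reward (mean is an assumption)."""
    rewards = [image_reward(f, target_feats) for f in image_feats]
    return sum(rewards) / len(rewards)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def propagate_reward(untried_emb, tried_embs, tried_rewards, temperature=0.1):
    """Estimate an untried query's reward as a similarity-weighted average of observed rewards."""
    sims = [cosine(untried_emb, e) for e in tried_embs]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, tried_rewards)) / total


# Hypothetical 2-D features: two target images and two downloaded images.
targets = [[0.0, 0.0], [1.0, 1.0]]
downloaded = [[0.0, 1.0], [0.0, 0.0]]
q_reward = query_reward(downloaded, targets)
estimate = propagate_reward([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [-0.1, -2.0])
```

Note that only unlabeled target features enter the reward, which matches the claim that no class labels or label names are needed.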
2018
An Architecture for Spatiotemporal Template-Based Search
Ellis Brown, Soobeen Park, Noel Warford, Adriane Seiffert, Kazuhiko Kawamura, Joseph Lappin, and Maithilee Kunda
Visual search for a spatiotemporal target occurs frequently in human experience, from military or aviation staff monitoring complex displays of multiple moving objects to daycare teachers monitoring a group of children on a playground for risky behaviors. In spatiotemporal search, unlike more traditional visual search tasks, the target cannot be identified from a single frame of visual experience; because the target is a spatiotemporal pattern that unfolds over time, detection of the target must also integrate information over time. We propose a new computational cognitive architecture to model and understand human visual attention in the specific context of visual search for a spatiotemporal target. Results from a previous human participant study found that humans show interesting attentional capacity limitations in this type of search task. Our architecture, called the SpatioTemporal Template-based Search (STTS) architecture, solves the same search task from the study using a wide variety of parameterized models that each represent a different cognitive theory of visual attention from the psychological literature. We present results from initial computational experiments using STTS as a first step towards understanding the computational nature of attentional bottlenecks in this type of search task, and we discuss how continued STTS experiments will help determine which theoretical models best explain the capacity limitations shown by humans. We expect that results from this research will help refine the design of visual information displays to help human operators perform difficult, real-world monitoring tasks.
ACS-18
SpatioTemporal Template-based Search: An Architecture for Spatiotemporal Template-Based Search
Ellis Brown, Soobeen Park, Noel Warford, Adriane Seiffert, Kazuhiko Kawamura, Joe Lappin, and Maithilee Kunda
In Proceedings of the 6th Conference on Advances in Cognitive Systems, Aug 2018
Many optimization problems involve minimizing a sum of univariate functions, each with a different variable, subject to coupling constraints. We present PiecewiseQuadratics.jl and SeparableOptimization.jl, two Julia packages for solving such problems when these univariate functions in the objective are piecewise-quadratic.
2019
AISES-19
Modeling Uncertainty in Bayesian Neural Networks with Dropout: The effect of weight prior and network architecture selection
Ellis Brown*, Melanie Manko*, and Ethan Matlin*
In American Indian Science and Engineering Society National Conference, Oct 2019
🎖️ Third Place, Graduate Student Research Competition
While neural networks are quite successful at making predictions, these predictions are usually point estimates lacking any notion of uncertainty. However, when fed data very different from its training data, it is useful for a neural network to realize that its predictions could very well be wrong and encode that information through uncertainty bands around its point estimate prediction. Bayesian Neural Networks trained with Dropout are a natural way of modeling this uncertainty with theoretical foundations relating them to Variational Inference approximating Gaussian Process posteriors. In this paper, we investigate the effects of weight prior selection and network architecture on uncertainty estimates derived from Dropout Bayesian Neural Networks.
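The idea of reading uncertainty off a dropout network can be sketched as Monte Carlo dropout: keep dropout active at prediction time, run many stochastic forward passes, and use the spread of the outputs as the uncertainty band. The example below is a minimal sketch under stated assumptions, not the model studied in the paper: it uses a hypothetical one-hidden-layer ReLU network with fixed, hand-picked weights, and the names `forward_with_dropout` and `mc_dropout_predict` are illustrative.

```python
import random
import statistics


def forward_with_dropout(x, w_hidden, w_out, p_drop, rng):
    """One stochastic forward pass with dropout left ON at prediction time."""
    keep = 1.0 - p_drop
    out = 0.0
    for w_h, w_o in zip(w_hidden, w_out):
        hidden = max(0.0, w_h * x)  # ReLU activation
        if rng.random() < keep:  # keep this hidden unit with probability `keep`
            out += w_o * hidden / keep  # inverted-dropout rescaling
    return out


def mc_dropout_predict(x, w_hidden, w_out, p_drop=0.5, n_samples=200, seed=0):
    """Return the mean prediction and its standard deviation over sampled passes."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, w_hidden, w_out, p_drop, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical fixed weights; a real network would be trained first.
w_hidden = [1.0, 0.5, 2.0]
w_out = [1.0, 1.0, 1.0]
mean_near, std_near = mc_dropout_predict(1.0, w_hidden, w_out)
mean_far, std_far = mc_dropout_predict(10.0, w_hidden, w_out)
```

In this toy setup the spread grows with the magnitude of the input, which loosely mirrors the goal of wider uncertainty bands on inputs far from the training data.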
2017
AISES-17
Computational Cognitive Systems to Model Information Salience
Ellis Brown, Adriane Seiffert, Noel Warford, Soobeen Park, and Maithilee Kunda
In American Indian Science and Engineering Society National Conference, Sep 2017
This project seeks to model human information salience—in this case, how much a person will notice a piece of visual information—using an approach from artificial intelligence that involves building computational cognitive systems models of human performance on certain tasks. Too much visual information can be overwhelming, making it difficult for people to discern the important parts. This can be detrimental in many situations where visual attention is crucial, such as air traffic control systems, military information displays, or TSA screening stations. Results from a previous human participant study found that humans show interesting limitations in attentional capacity when asked to monitor a complex moving display. We create artificial computational agents based on various cognitive models of visual attention to solve the same visual information salience task and run computational experiments to measure cognitive performance. In follow-up work, we will compare the performance of our models to the human data to determine which models best explain the capacity limitations shown by people.
Reports
2022
CMU 16-824
Self-Supervised Representation Learning via Curiosity-Driven Exploration
Alvin Shek, Ellis Brown, Nilay Pande, and David Noursi
“The performance of machine learning methods is heavily dependent on the choice of data representation” (Bengio et al., 2012). As machine learning continues to be applied to more complex and important tasks, this dependence on the data representation will only increase. While current machine learning methods are bottlenecked by representation quality, current methods for learning representations are bottlenecked by dataset size. However, creating large static datasets, as is the mainstream practice, is expensive, time-consuming, and heavily prone to human bias. Machine learning practitioners have increasingly focused on paradigms such as unsupervised and self-supervised learning to alleviate the expense of supervision when working with bigger datasets; however, these methods still suffer from the issues of static datasets. One promising approach to learning good representations without a fixed dataset is to interact directly with the environment. The visual state space of real environments and simulators can be huge and intractable to explore fully. Hence, in this project, we investigate intelligent curiosity-driven exploration strategies to learn good representations from a simulator using self-supervised learning objectives. We discuss the effectiveness of different strategies, open issues, and future directions of research in this field.
2021
CMU 16-811
Scaling Interpretable Reinforcement Learning via Decision Trees to Minecraft
Deep reinforcement learning is a powerful tool for learning complex control tasks; however, neural networks are notoriously “black boxes” and lack many properties desirable of autonomous systems deployed in safety-critical environments. In this project, we focus on methods that result in a final control policy specified via a decision tree, which is thus interpretable and verifiable. We build upon a prior method, VIPER, that first learns a high-performing “expert” policy via any standard Deep RL technique, and then distills the expert policy into a decision tree. Our method, called MSVIPER, is specifically designed to scale to complex environments that greatly benefit from (or require) curriculum learning to be solved; we leverage the structure in the curriculum stages to enable more efficient learning and a smaller (and thus more interpretable) decision tree. To demonstrate the ability of our method to succeed in complex environments, we apply it to Minecraft, a challenging open-world environment. We highlight that our method is amenable to post-training verification and to modification or improvement.
This paper presents a method to determine an optimal policy for the lending of securities by large institutions in the securities finance market as a final project for the Stanford University AA222 Engineering Design Optimization class. The securities lending process is formulated as a Markov decision process in which the lender decides whether to accept or reject incoming offers from borrowers. This formulation allows for a policy that maximizes the expected return with each decision to be derived using dynamic programming. The framework presented is easily extensible through the creation of more realistic models of the dynamics of the securities lending market.
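The accept-or-reject formulation above can be illustrated with backward induction on a deliberately simplified model. The assumptions here are not from the report: a single security, a finite horizon, one offer per step drawn independently from a known discrete distribution, and accepting an offer ends the process. The names `solve_lending_mdp` and `accept` and the offer values are hypothetical.

```python
def solve_lending_mdp(offers, probs, horizon):
    """Backward induction over a finite horizon.

    values[t] is the expected return from step t onward under the optimal
    policy, before the offer at step t is observed. values[horizon] is 0.
    """
    values = [0.0] * (horizon + 1)
    for t in range(horizon - 1, -1, -1):
        continuation = values[t + 1]  # expected value of rejecting and waiting
        values[t] = sum(p * max(o, continuation) for o, p in zip(offers, probs))
    return values


def accept(offer, t, values):
    """Optimal decision at step t: accept if the offer beats the value of waiting."""
    return offer >= values[t + 1]


# Hypothetical offer distribution: lending fees of 1, 2, or 3 units.
offers = [1.0, 2.0, 3.0]
probs = [0.5, 0.25, 0.25]
values = solve_lending_mdp(offers, probs, horizon=3)
```

With these numbers the lender rejects a fee of 1.0 at the first step but accepts any offer at the last step, since waiting is then worth nothing. A more realistic market model would change the transition dynamics, not the recursion.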
2019
Columbia CS E6699
Modeling Uncertainty in Bayesian Neural Networks with Dropout
While neural networks are quite successful at making predictions, these predictions are usually point estimates lacking any notion of uncertainty. However, when fed data very different from its training data, it is useful for a neural network to realize that its predictions could very well be wrong and encode that information through uncertainty bands around its point estimate prediction. Bayesian Neural Networks trained with Dropout are a natural way of modeling this uncertainty with theoretical foundations relating them to Variational Inference approximating Gaussian Process posteriors. In this paper, we investigate the effects of weight prior selection and network architecture on uncertainty estimates derived from Dropout Bayesian Neural Networks.