2024 SCIEN Affiliates Meeting Poster Presentations
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images by Boyang Deng, Kyle Genova, Songyou Peng, Gordon Wetzstein, Noah Snavely, Leonidas Guibas, Thomas Funkhouser
Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction by Youming Deng, Wenqi Xian, Guandao Yang, Leonidas Guibas, Gordon Wetzstein, Steve Marschner, Paul Debevec
Orthogonal Adaptation for Multi-concept Fine-tuning of Text-to-Image Diffusion Models by Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein
Linearizing Diffusion Transformers for Long-Context 4K Image Generation by Guandao Yang, Youjin Song, Leonidas Guibas, Gordon Wetzstein
3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation by Hansheng Chen, Bokui Shen, Yulin Liu, Ruoxi Shi, Linqi Zhou, Connor Z. Lin, Jiayuan Gu, Hao Su, Gordon Wetzstein, Leonidas Guibas
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models by Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Max Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Donglai Xiang, Gordon Wetzstein, Ming-Yu Liu, Tsung-Yi Lin
T3DGesture: Text- and 3D-Labeled Multi-modal Synthetic Hand Gesture Dataset by Menghe Zhang, Haley M. So, Mohammad Asadi, Gordon Wetzstein, Yangwen Liang, Shuangquan Wang, Kee-Bong Song, Donghoon Kim
Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control by Zhengfei Kuang, Shengqu Cai*, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, Gordon Wetzstein
AI-based Metasurface Lens Design by Jiazhou Cheng*, Gun-Yeal Lee*, Gordon Wetzstein
X-RAI: Scalable 3D Reconstruction for X-Ray Single Particle Imaging Based on Online Machine Learning by Jay Shenoy, Axel Levy, Kartik Ayyer, Frédéric Poitevin, Gordon Wetzstein
Predicting human eye movements by Hyunwoo Gu, Justin Gardner
Cognitive Metrics in Human-Centered Augmentation: Bridging Neuroscience, Computer Vision, and AEC by Alberto Tono, Hari Subramonyam, Martin Fischer
Examining Looking Behavior Patterns Across Cognitive Tasks in Natural Environments by Jiwon Yeon, Hyunwoo Gu, Justin Gardner
Lightning Pose: Improved animal pose estimation via semi-supervised learning, Bayesian ensembling, and cloud-native open-source tools by Dan Biderman, Matt Whiteway, Cole Hurwitz, Nicholas Greenspan, Robert S Lee, Ankit Vishnubhotla, Richard Warren, Federico Pedraja, Dillon Noone, Michael Schartner, Julia M Huntenburg, Anup Khanal, Guido T Meijer, Jean-Paul Noel, Alejandro Pan-Vazquez, Karolina Z Socha, Anne E Urai, The International Brain Laboratory, John P Cunningham, Nathaniel B Sawtell, and Liam Paninski
AIpparel: A Large Multimodal Generative Model for Digital Garments by George Nakayama, Jan Ackermann, Timurs Kesdogan, Yang Zheng, Maria Korosteleva, Olga Sorkine-Hornung, Leonidas Guibas, Guandao Yang, Gordon Wetzstein
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals by Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Mysore Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins
Holographic parallax improves 3D perceptual realism by Suyeon Choi*, Dongyeon Kim*, Seong-Woo Nam*, Jong-Mo Seo, Gordon Wetzstein, Yoonchan Jeong
Engineering Large-Scale Optical Tweezer Arrays for Next-Generation Quantum Processors by Timothy Chang*, Benjamin Kroul*, Yang Xu, Zephy Leung, and Joonhee Choi
SARLink: Satellite Backscatter Connectivity using Synthetic Aperture Radar by Geneva Ecola, Bill Yen, Ana Morgado, Bodhi Priyantha, Ranveer Chandra, and Zerina Kapetanovic
Med-VAE: Large-scale Generalizable Autoencoders for Medical Imaging by Maya Varma*, Ashwin Kumar*, Rogier van der Sluijs*, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari
Constrained Diffusion with Trust Sampling by William Huang, Yifeng Jiang, Tom Van Wouwe, Karen Liu
Tunable Free Space Compression by Matthew Beutel, Shanhui Fan
Title: Classification of amyloid β status using 3D quantitative-amplified Magnetic Resonance Imaging (3D q-aMRI) – A preliminary study
Abstract: Amplified Magnetic Resonance Imaging (aMRI) is a method for visualizing pulsatile brain motion, producing high-contrast, high-temporal-resolution ‘videos’ that have shown promise as a tool in assessing various neurological disorders. Recently, this approach was advanced with 3D quantitative aMRI (q-aMRI), enabling sub-voxel displacement quantification. 3D q-aMRI has shown promise in detecting abnormal brain motion in patients with neurodegenerative diseases. Abnormal accumulation of β-amyloid (Aβ) in the brain is an early indicator of Alzheimer’s disease. However, assessing Aβ status typically relies on invasive procedures such as PET scans or cerebrospinal fluid (CSF) assays. In this study, the sub-voxel displacement field generated by 3D q-aMRI was used as input to a multi-layer perceptron to classify patients as Aβ+ or Aβ-.
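The classification step described above can be pictured with a small sketch. The following is a minimal, illustrative multi-layer perceptron in PyTorch, not the study's actual model: the input dimensionality, layer widths, and training setup are assumptions standing in for flattened 3D q-aMRI displacement-field features and binary amyloid-status labels.

import torch
import torch.nn as nn

# Illustrative MLP classifier for amyloid status from displacement-field features.
# Sizes and architecture are assumptions, not the study's configuration.
class AmyloidMLP(nn.Module):
    def __init__(self, in_dim=4096, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                     # logits for Abeta- / Abeta+
        )

    def forward(self, x):
        return self.net(x)

model = AmyloidMLP()
features = torch.randn(8, 4096)                       # stand-in displacement-field features
labels = torch.randint(0, 2, (8,))                    # stand-in amyloid-status labels
loss = nn.CrossEntropyLoss()(model(features), labels)
loss.backward()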
Authors: Itamar Terem, Kyan Younes, Skylar Weiss, Andrew Dreisbach, Yonatan Urman, Elizabeth C. Mormino, Samantha Holdsworth, and Kawin Setsompop
Bio: Itamar Terem is a PhD candidate in the Department of Electrical Engineering at Stanford University and an NSF Graduate Research Fellow. His current research focuses on developing computational and acquisition techniques in Magnetic Resonance Imaging to explore pulsatile brain dynamics and their potential as biomarkers for various neurological conditions, as well as the tissue mechanical response and clearance mechanisms associated with blood pulsation and cerebrospinal fluid (CSF) motion.
Title: Full-color 3D holographic augmented-reality displays with metasurface waveguides
Abstract: Emerging spatial computing systems seamlessly superimpose digital information on the physical environment observed by a user, enabling transformative experiences across various domains, such as entertainment, education, communication and training. However, the widespread adoption of augmented-reality (AR) displays has been limited due to the bulky projection optics of their light engines and their inability to accurately portray three-dimensional (3D) depth cues for virtual content, among other factors. We will discuss a holographic AR system that overcomes these challenges using a unique combination of inverse-designed full-color metasurface gratings, a compact dispersion-compensating waveguide geometry, and artificial-intelligence-driven holography algorithms. These elements are co-designed to eliminate the need for bulky collimation optics between the spatial light modulator and the waveguide and to present vibrant, full-color, 3D AR content in a compact device form factor. To deliver unprecedented visual quality with our prototype, we developed an innovative image formation model that combines a physically accurate waveguide model with learned components that are automatically calibrated using camera feedback. Our unique co-design of a nanophotonic metasurface waveguide and artificial-intelligence-driven holographic algorithms represents a significant advancement in creating visually compelling 3D AR experiences in a compact wearable device.
Authors: Manu Gopakumar*, Gun-Yeal Lee*, Suyeon Choi, Brian Chao, Yifan Peng, Jonghyun Kim, Gordon Wetzstein
Bio: Gun-Yeal is a postdoctoral researcher at Stanford University, working with Professor Gordon Wetzstein at the Stanford Computational Imaging Lab. He is broadly interested in Optics and Photonics, with a particular focus on nanophotonics and optical system engineering. His recent research at the intersection of optics and computer vision focuses on developing next-generation optical imaging, display, and computing systems, utilizing advanced photonic devices and AI-driven algorithms. Gun-Yeal completed his PhD at Seoul National University in 2021 under the guidance of Professor Byoungho Lee. For his undergraduate studies, he double-majored in Electrical and Computer Engineering and Physics, also at Seoul National University. He is a recipient of the OSA Incubic/Milton Chang Award, the SPIE Optics and Photonics Education Scholarship, and an NRF postdoctoral fellowship, and was a finalist for the OSA Emil Wolf Award.
Manu is a 5th year PhD candidate in the Electrical Engineering Department at Stanford University working with Professor Gordon Wetzstein at the Stanford Computational Imaging Lab. His research interests are centered on the co-design of optical systems and computational algorithms. More specifically, he is currently focused on utilizing novel computational algorithms to unlock high quality 3D and 4D holography and more compact form-factors for holographic displays. Prior to coming to Stanford, Manu received a Bachelor’s and Master’s degree in Electrical and Computer Engineering from Carnegie Mellon University during which he worked with Pulkit Grover and Aswin Sankaranarayanan.
Title: Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Abstract: This project presents a method for using a Multimodal Large Language Model (MLLM) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes across a city over a certain period (which we call “trends”) and to provide visual evidence for them. Unlike most previous visual analysis systems, ours is designed to answer open-ended queries (e.g., “what are the frequent types of changes in the city?”) without any externally provided search terms or training supervision. At first glance, this looks like a problem for an MLLM. However, our datasets are four orders of magnitude too large for an MLLM to ingest as context. We therefore introduce a bottom-up procedure that decomposes the massive visual search problem into a sequence of more tractable problems, each of which can be solved with high accuracy using an MLLM. During experiments and ablation studies with this method, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., “addition of outdoor dining,” “overpass was painted blue,” etc.).
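As a concrete illustration of the bottom-up decomposition, the Python sketch below shows the overall shape of such a pipeline; it is not the authors' system, and query_mllm_for_change is a hypothetical stand-in for an MLLM API call. Each image pair is analyzed independently (a small, tractable problem), and the resulting local change descriptions are aggregated into candidate trends.

from collections import Counter

def query_mllm_for_change(image_before, image_after):
    # Hypothetical stand-in: a real system would prompt a multimodal LLM with both
    # images and return a short textual description of the observed change (or None).
    return "addition of outdoor dining"

def discover_trends(image_pairs, min_support=50):
    descriptions = []
    for before, after in image_pairs:
        change = query_mllm_for_change(before, after)      # small, tractable sub-problem
        if change is not None:
            descriptions.append(change.strip().lower())
    # Aggregate local observations into city-level "trends" by frequency; the real
    # system would additionally verify and deduplicate candidates with further MLLM calls.
    counts = Counter(descriptions)
    return [(desc, n) for desc, n in counts.most_common() if n >= min_support]

print(discover_trends([("cafe_2021.jpg", "cafe_2023.jpg")], min_support=1))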
Authors: Boyang Deng, Kyle Genova, Songyou Peng, Gordon Wetzstein, Noah Snavely, Leonidas Guibas, Thomas Funkhouser
Bio: Boyang Deng is a 3rd year PhD student in Computer Science at Stanford University. He is advised by Gordon Wetzstein and Leonidas Guibas. His research focuses on solving problems in Computer Graphics and Computer Vision, using Machine Learning.
Title: Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction
Abstract: In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. In particular, our technique enables high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. Our approach introduces a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets.
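One ingredient named above, an invertible residual mapping for lens distortion, can be sketched as follows. This is an illustrative PyTorch example rather than the paper's code; the hybrid explicit grid, camera parameters, and Gaussian Splatting optimization are omitted, and the contraction factor and network sizes are assumptions.

import torch
import torch.nn as nn

class InvertibleDistortion(nn.Module):
    # Invertible residual map y = x + 0.5 * f(x) on 2D image coordinates, with f kept
    # 1-Lipschitz via spectral normalization so that the inverse exists and can be
    # recovered by fixed-point iteration.
    def __init__(self, hidden=32):
        super().__init__()
        self.f = nn.Sequential(
            nn.utils.parametrizations.spectral_norm(nn.Linear(2, hidden)), nn.Tanh(),
            nn.utils.parametrizations.spectral_norm(nn.Linear(hidden, 2)),
        )

    def forward(self, x):                       # ideal -> distorted coordinates
        return x + 0.5 * self.f(x)

    def inverse(self, y, iters=30):             # distorted -> ideal, by fixed-point iteration
        x = y.clone()
        for _ in range(iters):
            x = y - 0.5 * self.f(x)
        return x

m = InvertibleDistortion().eval()               # eval() freezes the spectral-norm estimate
pts = torch.rand(5, 2)
print((m.inverse(m(pts)) - pts).abs().max())    # ~0: the residual map is invertible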
Authors: Youming Deng, Wenqi Xian, Guandao Yang, Leonidas Guibas, Gordon Wetzstein, Steve Marschner, Paul Debevec
Bio: Guandao Yang is a postdoctoral scholar at Stanford, working with Prof. Leonidas Guibas and Prof. Gordon Wetzstein. He earned his PhD at Cornell Tech, advised by Prof. Serge Belongie and Prof. Bharath Hariharan. During his doctoral studies, he interned at various industry labs, including NVIDIA, Intel, and Google. Prior to his PhD, he received his Bachelor’s degree from Cornell University in Ithaca, majoring in Mathematics and Computer Science.
Title: Orthogonal Adaptation for Multi-concept Fine-tuning of Text-to-Image Diffusion Models
Abstract: Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed with minimal interference. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models.
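The core mechanism, independently trained low-rank residuals that merge by summation because their shared factors are orthogonal, can be illustrated with a small PyTorch sketch. This is not the released implementation; the dimensions, ranks, and the way orthogonal bases are allocated to concepts are assumptions.

import torch

torch.manual_seed(0)
d, rank, num_concepts = 64, 4, 3

# Fixed, mutually orthogonal down-projections A_i (disjoint rows of an orthogonal basis);
# only the per-concept up-projections B_i would be trained during fine-tuning.
Q, _ = torch.linalg.qr(torch.randn(d, d))
A = [Q[i * rank:(i + 1) * rank, :] for i in range(num_concepts)]
B = [torch.zeros(d, rank, requires_grad=True) for _ in range(num_concepts)]

# Merging independently fine-tuned concepts is a simple sum of low-rank residuals.
W0 = torch.randn(d, d)                                        # pretrained weight (stand-in)
W_merged = W0 + sum(B[i].detach() @ A[i] for i in range(num_concepts))

# Because the A_i are orthogonal to each other, the summed residuals interfere minimally.
print(torch.allclose(A[0] @ A[1].T, torch.zeros(rank, rank), atol=1e-5))   # True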
Authors: Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein
Bio: Ryan Po is a PhD student at Stanford University working with Prof. Gordon Wetzstein at the Stanford Computational Imaging Lab. His research lies in 3D content generation and reconstruction.
Title: Linearizing Diffusion Transformers for Long-Context 4K Image Generation
Abstract: This project aims to develop a scalable long-context diffusion model capable of generating high-quality, aesthetic 4K photographs. Unlike methods that treat such images as mere super-resolution outputs, our approach focuses on creating large canvases with coherent and visually appealing elements. Capturing relationships between distant pixels is critical, as larger images include more objects and require a cohesive structure, making spatial configuration increasingly important for quality. Although the Diffusion Transformer (DiT) leverages global context effectively, its self-attention mechanism scales quadratically with image size, posing computational challenges. To address this, we explore linearization through efficient RNNs, particularly test-time training (TTT) layers, which have shown promise in handling long-context natural language tasks. Recognizing the unique demands of vision tasks, we are refining the TTT layer’s self-supervision loop to better align with image-based applications.
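For intuition, the sketch below shows a toy test-time-training (TTT) style layer in PyTorch: a per-sequence fast-weight matrix is updated with one online gradient step of a self-supervised reconstruction loss at each token, giving cost linear in sequence length. It is an illustration of the general TTT idea under simplifying assumptions, not the layer being developed in this project.

import torch

def ttt_linear(tokens, lr=0.1):
    # tokens: (N, d). The "hidden state" is a fast-weight matrix W that is updated at
    # every token by one gradient step on a self-supervised reconstruction loss, so the
    # cost grows linearly in N rather than quadratically as in self-attention.
    N, d = tokens.shape
    W = torch.zeros(d, d)
    outputs = []
    for t in range(N):
        x = tokens[t]
        err = W @ x - x                        # reconstruction error of the inner loss
        W = W - lr * torch.outer(err, x)       # one online gradient step on the fast weights
        outputs.append(W @ x)                  # emit the token's output with the updated W
    return torch.stack(outputs)

out = ttt_linear(torch.randn(16, 8))
print(out.shape)                               # torch.Size([16, 8])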
Authors: Guandao Yang, Youjin Song, Leonidas Guibas, Gordon Wetzstein
Bio: Guandao Yang is a postdoctoral scholar at Stanford, working with Prof. Leonidas Guibas and Prof. Gordon Wetzstein. His research aims to develop Spatial Intelligence capable of creating, editing, and analyzing geometries in both the virtual and physical worlds, utilizing advancements in machine learning, computer vision, and computer graphics. Youjin Song is a master’s student in the Department of Electrical Engineering at Stanford.
Title: 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation
Abstract: Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
Authors: Hansheng Chen, Bokui Shen, Yulin Liu, Ruoxi Shi, Linqi Zhou, Connor Z. Lin, Jiayuan Gu, Hao Su, Gordon Wetzstein, Leonidas Guibas
Bio: Hansheng Chen is a second-year PhD student at Stanford University, co-advised by Prof. Leonidas Guibas and Prof. Gordon Wetzstein. His research focuses on generative models and their applications in vision and graphics, particularly diffusion models and 3D generation. He is the recipient of CVPR 2022 Best Student Paper and Qualcomm Innovation Fellowship.
Title: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Abstract: Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input–output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks; as a result, they lack temporal planning and reasoning capabilities. In this paper, we introduce CoT-VLA, a state-of-the-art 7B VLA that incorporates explicit visual chain-of-thought (CoT) reasoning by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve those goals, and that can understand and generate both visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
Authors: Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Max Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Donglai Xiang, Gordon Wetzstein, Ming-Yu Liu, Tsung-Yi Lin
Bio: Qingqing Zhao is a final-year Ph.D. student in Electrical Engineering in the Stanford Computational Imaging Lab, advised by Prof. Gordon Wetzstein. She is interested in foundation models for perception, control, and modeling.
Title: T3DGesture: Text- and 3D-Labeled Multi-modal Synthetic Hand Gesture Dataset
Abstract: We present T3DGesture, a high-fidelity, large-scale synthetic 3D hand gesture recognition (HGR) dataset designed to advance bare-hand 3D interaction tasks for eXtended Reality (XR). T3DGesture addresses key limitations of existing datasets, such as insufficient diversity, lack of full 3D annotations, and inconsistent representations across modalities. T3DGesture bridges these gaps by offering 22.6K synthetic RGB-D hand gesture video clips, annotated with high-resolution 3D meshes (including wrist and forearm), point clouds, 2D/3D keypoints, camera metadata, and semantic text labels. Generated with a kinematics-aware CVAE framework with two-fold biomechanical constraints, T3DGesture ensures physically valid and realistic hand motions. As the largest and most comprehensive dynamic HGR dataset to date, T3DGesture offers increased motion variance and appearance diversity, including variations in hand shapes, textures, skin tones, lighting, camera angles, and configurations. These features empower researchers to explore a broad range of hand-related tasks and enrich bare-hand interaction categories, particularly for 3D interactions in XR scenarios. Quantitative experiments demonstrate that models trained on T3DGesture outperform those trained on existing HGR datasets across multiple modalities. Additionally, T3DGesture enables novel downstream tasks, such as stereo depth estimation for XR interaction scenes and multi-modal fusion tasks. T3DGesture provides a valuable resource for advancing research in hand motion analysis and human–computer interaction. The dataset will be made publicly available.
Authors: Menghe Zhang, Haley M. So, Mohammad Asadi, Gordon Wetzstein, Yangwen Liang, Shuangquan Wang, Kee-Bong Song, Donghoon Kim
Bio: Haley So is a PhD candidate in Professor Gordon Wetzstein’s Computational Imaging Lab. She is interested in utilizing emerging sensors to rethink imaging algorithms and computer vision tasks.
Title: Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
Abstract: Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments.
Authors: Zhengfei Kuang, Shengqu Cai*, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, Gordon Wetzstein
Bio: Zhengfei Kuang is a third-year Ph.D. student at Stanford University, primarily advised by Prof. Gordon Wetzstein. His main research interests are neural rendering, 3D/4D reconstruction and generation, and video diffusion models. Before Stanford, he studied at the University of Southern California (advised by Prof. Hao Li) and Tsinghua University.
Title: AI-based Metasurface Lens Design
Abstract: Conventional optical imaging systems are bulky and complex, requiring multiple elements to correct aberrations. Optical metasurfaces, planar structures that are capable of manipulating light at subwavelength scales, are compact alternatives to conventional refractive optical elements. Their miniature volume is suitable for technologies like AR/VR displays and wearables. However, existing metalenses face significant challenges from monochromatic (e.g., coma) and chromatic aberrations, limiting their applicability. Here we present an end-to-end AI-based computational method that parametrizes the profile of metalenses and optimizes it based on customized loss functions. This innovation enables wide-angle imaging with corrected aberrations while retaining a single-layer form factor, overcoming the key limitations of existing metalenses and advancing their potential for miniaturized imaging systems.
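The optimization pattern described above can be sketched as follows: a phase profile parametrized by a few even polynomial coefficients in the normalized radius is optimized by gradient descent against a loss. This is an illustrative PyTorch example, not the actual design code; the loss here is a stand-in (matching an ideal hyperbolic lens phase), whereas the real pipeline would differentiate through an image-formation model over field angles and wavelengths, and all numerical values are assumptions.

import torch

# Radial coordinate and an ideal hyperbolic lens phase, used only as a stand-in target.
wavelength, focal_length, radius = 550e-9, 5e-3, 0.25e-3
r = torch.linspace(0.0, radius, 512)
target_phase = -2 * torch.pi / wavelength * (torch.sqrt(r**2 + focal_length**2) - focal_length)

# Parametrize the metalens phase with even polynomial coefficients and optimize them
# by gradient descent; the step count and learning rate are illustrative.
coeffs = torch.zeros(4, requires_grad=True)                  # coefficients of rho^2 .. rho^8
powers = torch.stack([(r / radius) ** (2 * (i + 1)) for i in range(4)], dim=0)
opt = torch.optim.Adam([coeffs], lr=1e-1)

for step in range(2000):
    opt.zero_grad()
    phase = (coeffs[:, None] * powers).sum(dim=0)            # parametrized phase profile
    loss = ((phase - target_phase) ** 2).mean()              # customized loss goes here
    loss.backward()
    opt.step()

print(coeffs.detach())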
Authors: Jiazhou Cheng*, Gun-Yeal Lee*, Gordon Wetzstein
Bio: Jiazhou “Jasmine” Cheng is a first-year PhD student supervised by Prof. Gordon Wetzstein, interested in computational imaging technologies.
Title: X-RAI: Scalable 3D Reconstruction for X-Ray Single Particle Imaging Based on Online Machine Learning
Abstract: X-ray free-electron lasers (XFELs) offer unique capabilities for measuring the structure and dynamics of biomolecules, helping us understand the basic building blocks of life. Notably, high-repetition-rate XFELs enable single particle imaging (X-ray SPI) where individual, weakly scattering biomolecules are imaged under near-physiological conditions with the opportunity to access fleeting states that cannot be captured in cryogenic or crystallized conditions. Existing X-ray SPI reconstruction algorithms, which estimate the unknown orientation of a particle in each captured image as well as its shared 3D structure, are inadequate in handling the massive datasets generated by these emerging XFELs. Here, we introduce X-RAI, an online reconstruction framework that estimates the structure of a 3D macromolecule from large X-ray SPI datasets. X-RAI consists of a convolutional encoder, which amortizes pose estimation over large datasets, as well as a physics-based decoder, which employs an implicit neural representation to enable high-quality 3D reconstruction in an end-to-end, self-supervised manner. We demonstrate that X-RAI achieves state-of-the-art performance for small-scale datasets in simulation and challenging experimental settings and demonstrate its unprecedented ability to process large datasets containing millions of diffraction images in an online fashion. These abilities signify a paradigm shift in X-ray SPI towards real-time capture and reconstruction.
Authors: Jay Shenoy, Axel Levy, Kartik Ayyer, Frédéric Poitevin, Gordon Wetzstein
Bio: Jay is a third-year PhD student in the Stanford Computational Imaging Lab, focusing on inverse problems in scientific imaging.
Title: Predicting human eye movements
Abstract: A model that accurately predicts typical eye movement paths during visual search has many applications, such as optimizing resource allocation in AR/VR imaging systems, enhancing video game and web-based advertisement design, and aiding in neurological diagnostics. Prominent models of visual search are based on the idea that people create an internal template that approximates the target, which they use to guide their search behavior (Najemnik & Geisler, 2005). These templates are often assumed to be ideal, in the sense that they perfectly match the target’s appearance. We tested this assumption using a novel method that allowed us to estimate the template from a person’s eye movements during search tasks. In both highly controlled and naturalistic scenes, we found that the human-derived templates differ from those of an ideal searcher. We are further exploring the range of templates across individuals and task contexts with the ultimate goal of building a more accurate model of human eye movements.
Authors: Hyunwoo Gu, Justin Gardner
Bio: Hyunwoo Gu is a third-year Ph.D. student at Stanford University, advised by Prof. Justin Gardner. His research focuses on modeling human visual behaviors by integrating classical visual psychophysics with recent advancements in vision-language models and diffusion models.
Title: Cognitive Metrics in Human-Centered Augmentation: Bridging Neuroscience, Computer Vision, and AEC
Abstract: Vitruvius articulated that an optimal designer should consider firmitas (strength), utilitas (utility), and venustas (aesthetics) during the initial design phases. How, then, do we objectively measure this consideration at the early stage of a project, setting it on the right track? More importantly, how do we assess whether novel 3D GenAI tools allow these three aspects to be considered during these initial phases, augmenting the designer’s skills? This research aims to answer whether these novel 3D GenAI tools augment human designers’ skills. We introduce a dataset, two 3D GenAI methods, and an IRB-approved protocol to test a novel 3D GenAI interface. This research can inform Brain-Computer Interfaces (BCIs) that identify these design elements and highlight overlooked factors, helping set projects on a successful path early in the design process.
Authors: Alberto Tono, Hari Subramonyam, Martin Fischer
Bio: Alberto Tono is a Ph.D. candidate at Stanford University under the supervision of Kumagai Professor Martin Fischer in Civil and Environmental Engineering. He is also the founder of the Computational Design Institute, where he explores ways in which the convergence of digital technologies and the humanities can facilitate cross-pollination between different industries. Following this mission, he became a Stanford HAI Graduate Fellow, researching Human-Centered AI solutions for augmenting and amplifying human capabilities in the design process.
Title: Examining Looking Behavior Patterns Across Cognitive Tasks in Natural Environments
Abstract: How do internal goals shape patterns of visual attention in consistent sensory environments? To explore this question, we conducted a study in which participants walked the same path multiple times while performing distinct cognitive tasks: learning the path (Practice), memorizing a word list (Memory), or attending to their surroundings (No-memory). Eye movement data was collected using wearable eye-tracking glasses, capturing gaze metrics across different sublocations defined by changes in visual scenes. Preliminary analyses revealed task-dependent differences in gaze patterns, with cognitive demands influencing how participants allocated their attention. For example, tasks involving memory recall showed unique gaze behaviors compared to those requiring general environmental awareness. These findings suggest that internal cognitive goals may play a significant role in shaping visual behaviors in natural environments. Leveraging computational modeling, such patterns could be used to predict internal states or goals, offering potential applications in adaptive systems, augmented reality, and human-AI interaction.
Authors: Jiwon Yeon, Hyunwoo Gu, Justin Gardner
Bio: Jiwon Yeon is a postdoctoral researcher in Justin Gardner’s lab at Stanford University. Her research focuses on developing predictive models for future eye movements and identifying brain degeneration by detecting deviations from predicted eye movement patterns. In 2022, she was awarded the prestigious Wu Tsai-Human AI (HAI) Interdisciplinary Scholar Fellowship.
Title: Lightning Pose: Improved animal pose estimation via semi-supervised learning, Bayesian ensembling, and cloud-native open-source tools
Abstract: Contemporary pose estimation methods enable precise measurements of behavior via supervised deep learning with hand-labeled video frames. Although effective in many cases, the supervised approach requires extensive labeling and often produces outputs that are unreliable for downstream analyses. Here, we introduce “Lightning Pose,” an efficient pose estimation package with three algorithmic contributions. First, in addition to training on a few labeled video frames, we use many unlabeled videos and penalize the network whenever its predictions violate motion continuity, multiple-view geometry, and posture plausibility (semi-supervised learning). Second, we introduce a network architecture that resolves occlusions by predicting pose on any given frame using surrounding unlabeled frames. Third, we refine the pose predictions post-hoc by combining ensembling and Kalman smoothing. Together, these components render pose trajectories more accurate and scientifically usable. We release a cloud application that allows users to label data, train networks, and predict new videos directly from the browser.
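As an example of the unsupervised penalties mentioned above, the sketch below implements a simple temporal-continuity loss in PyTorch that penalizes implausibly large frame-to-frame jumps in predicted keypoints on unlabeled video. It is illustrative only, not the package's loss; the threshold and tensor shapes are assumptions.

import torch

def temporal_continuity_loss(keypoints, max_jump_px=20.0):
    # keypoints: (T, K, 2) predicted (x, y) positions of K keypoints over T video frames.
    jumps = (keypoints[1:] - keypoints[:-1]).norm(dim=-1)     # (T-1, K) frame-to-frame motion
    excess = torch.clamp(jumps - max_jump_px, min=0.0)        # penalize only implausible jumps
    return excess.mean()

preds = torch.randn(100, 17, 2) * 50                          # stand-in predictions (T=100, K=17)
print(temporal_continuity_loss(preds))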
Authors: Dan Biderman, Matt Whiteway, Cole Hurwitz, Nicholas Greenspan, Robert S Lee, Ankit Vishnubhotla, Richard Warren, Federico Pedraja, Dillon Noone, Michael Schartner, Julia M Huntenburg, Anup Khanal, Guido T Meijer, Jean-Paul Noel, Alejandro Pan-Vazquez, Karolina Z Socha, Anne E Urai, The International Brain Laboratory, John P Cunningham, Nathaniel B Sawtell, and Liam Paninski
Bio: Dan Biderman is a postdoctoral researcher co-advised by Scott Linderman (Statistics) and Christopher Ré (Computer Science). He obtained his Ph.D. at Columbia’s Center for Theoretical Neuroscience, where he worked with John Cunningham and Liam Paninski. His work develops large-scale AI systems with applications to neuroscience.
Title: AIpparel: A Large Multimodal Generative Model for Digital Garments
Abstract: Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and it enables novel multimodal garment generation applications such as interactive garment editing.
Authors: George Nakayama, Jan Ackermann, Timurs Kesdogan, Yang Zheng, Maria Korosteleva, Olga Sorkine-Hornung, Leonidas Guibas, Guandao Yang, Gordon Wetzstein
Bio: George Nakayama is a Coterm Master’s student at Stanford University majoring in mathematics and computer science. He is generally interested in computer vision, graphics and machine learning. His recent research focuses on the representation and generation of objects/scenes for 2D and 3D content creation.
Jan Ackermann is a Master’s student at ETH Zurich majoring in computer science and also a visiting student researcher at the Computational Imaging Lab, working with Prof. Gordon Wetzstein. His research interests broadly lie at the intersection of computer vision, graphics, and deep learning. Recent works focus on reconstruction and generation of 3D scenes/objects.
Title: Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals
Abstract: A major challenge in representation learning from visual inputs is extracting information from the learned representations into an explicit and usable form. This is most commonly done by learning readout layers with supervision or using highly specialized heuristics. This is challenging primarily because the pretext tasks and the downstream tasks that extract information are not tightly connected in a principled manner; improving the former does not guarantee improvements in the latter. The recently proposed counterfactual world modeling paradigm aims to address this challenge through a masked next-frame predictor base model, which enables simple counterfactual extraction procedures for extracting optical flow, segments, and depth. In this work, we take the next step and parameterize and optimize the counterfactual extraction of optical flow by solving the same simple next-frame prediction task as the base model. Our approach, Opt-CWM, achieves state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data. This work sets the foundation for future methods that extract more complex visual structures, such as segments and depth, with high accuracy.
Authors: Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Mysore Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins
Bio: Stefan Stojanov is a postdoctoral researcher working with Professors Jiajun Wu and Daniel Yamins. He is interested in building computer vision systems guided by our knowledge about the generalization, adaptability, and efficiency of human perception and its development. Stefan completed his PhD at the Georgia Institute of Technology, where he worked on self-supervised and data-efficient computer vision algorithms.
Title: Holographic parallax improves 3D perceptual realism
Abstract: Holographic near-eye displays are a promising technology to solve long-standing challenges in virtual and augmented reality display systems. Over the last few years, many different computer-generated holography (CGH) algorithms have been proposed that are supervised by different types of target content, such as 2.5D RGB-depth maps, 3D focal stacks, and 4D light fields. It is unclear, however, what the perceptual implications are of the choice of algorithm and target content type. In this work, we build a perceptual testbed of a full-color, high-quality holographic near-eye display. Under natural viewing conditions, we examine the effects of various CGH supervision formats and conduct user studies to assess their perceptual impacts on 3D realism. Our results indicate that CGH algorithms designed for specific viewpoints exhibit noticeable deficiencies in achieving 3D realism. In contrast, holograms incorporating parallax cues consistently outperform other formats across different viewing conditions, including the center of the eyebox. This finding is particularly interesting and suggests that the inclusion of parallax cues in CGH rendering plays a crucial role in enhancing the overall quality of the holographic experience. This work represents an initial stride towards delivering a perceptually realistic 3D experience with holographic near-eye displays.
Authors: Suyeon Choi*, Dongyeon Kim*, Seong-Woo Nam*, Jong-Mo Seo, Gordon Wetzstein, Yoonchan Jeong
Bio: Suyeon Choi is a postdoctoral scholar at Stanford University, working with Professor Gordon Wetzstein. His research focuses on developing computational optical systems at the intersection of graphics, computational optics, artificial intelligence, and applied vision science.
Title: Engineering Large-Scale Optical Tweezer Arrays for Next-Generation Quantum Processors
Abstract: Advances in quantum computing and simulation herald a revolutionary approach to solving computationally intensive problems that have long remained intractable for classical computers. In recent years, arrays of neutral atoms trapped by optical tweezers have emerged as a groundbreaking architecture for quantum information processing, offering exceptional scalability and programmable interactions. Foundational work in optical phase engineering and acousto-optics has been especially instrumental in developing atom-based quantum processors, enabling state-of-the-art implementations of arbitrary interaction geometries, parallelizable quantum compilers, and high-fidelity local gate controllers. In this work, we present progress towards generating both static and dynamic arrays of up to 1024 optical tweezers through a high-NA objective, using a Spatial Light Modulator (SLM) and crossed Acousto-Optic Deflectors (AODs), respectively. We demonstrate preliminary work on optimizing intensity uniformity, correcting aberrations, and developing in-house phase engineering solutions to address key optical challenges in high-density tweezer arrays.
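As a minimal example of the kind of phase engineering involved, the NumPy sketch below runs a textbook Gerchberg-Saxton loop to compute an SLM phase mask that produces a grid of focal-plane spots. It is not the group's in-house solution; practical systems add weighting for intensity uniformity and aberration-correction terms, and the beam, grid size, and iteration count here are arbitrary assumptions.

import numpy as np

n = 256
y, x = np.mgrid[0:n, 0:n]
incident = np.exp(-((x - n / 2) ** 2 + (y - n / 2) ** 2) / (2 * (n / 4) ** 2))  # Gaussian beam
target = np.zeros((n, n))
target[::16, ::16] = 1.0                          # desired spot array in the focal plane

phase = 2 * np.pi * np.random.rand(n, n)          # random initial SLM phase
for _ in range(50):
    far = np.fft.fft2(incident * np.exp(1j * phase))
    far = target * np.exp(1j * np.angle(far))     # impose target amplitude, keep far-field phase
    near = np.fft.ifft2(far)
    phase = np.angle(near)                        # keep only the phase (the SLM is phase-only)

print(phase.shape)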
Authors: Timothy Chang*, Benjamin Kroul*, Yang Xu, Zephy Leung, and Joonhee Choi
Bio: Timmy Chang and Ben Kroul are building utility-scale quantum processors for both benchmarked analog quantum simulation and fault-tolerant quantum computation, advised by Prof. Joonhee Choi in the Electrical Engineering department. Timmy is a second-year PhD student in EE, and Ben is a co-term student in Applied Physics.
Title: SARLink: Satellite Backscatter Connectivity using Synthetic Aperture Radar
Abstract: The number of satellites in low Earth orbit is rapidly increasing to tens of thousands of devices. These systems enable detailed imagery of the earth and provide internet connectivity, but also introduce new challenges. The growing number of satellites exacerbates issues of spectrum efficiency and co-existence with existing networks, obstructs astronomical observations, increases light pollution, and generates more space debris. These challenges have sparked a renewed interest in using space infrastructure more efficiently through joint sensing and communication techniques. Spaceborne synthetic aperture radar (SAR) imagery systems create detailed images of the earth’s terrain, but can also be jointly used for communication. In this work, we present SARLink, a system that enables passive satellite backscatter communication using existing SAR imagery satellites, such as the European Space Agency’s Sentinel-1. This work presents a cooperative ground target, a mechanically modulating reflector, for applying amplitude modulation to SAR backscatter signals and devises a processing algorithm to extract the communication bits from a single SAR image by leveraging SAR subaperture processing techniques. A theoretical analysis of the expected signal model, throughput, and bit error rate (BER) of this communication system is presented, which shows that publicly available images from Sentinel-1 can be used to send 60 bits per satellite pass with a BER of 1% from a 5′ × 5′ square trihedral modulating corner reflector. To demonstrate the feasibility of this system, we design and evaluate the effectiveness of a 2′ × 2′ mechanically modulating reflector and find that it achieves a 10 dB difference in its radar cross-section. We deploy static and modulating reflectors during Sentinel-1 imaging and demonstrate that the proposed processing algorithm matches the expected results shown in the theoretical analysis. Further, we demonstrate the effectiveness of our subaperture processing technique on a modulating reflector in the field.
Authors: Geneva Ecola, Bill Yen, Ana Morgado, Bodhi Priyantha, Ranveer Chandra, and Zerina Kapetanovic
Bio: Geneva Ecola is a PhD candidate advised by Zerina Kapetanovic in the Electrical Engineering department at Stanford University. She is interested in satellites and ultra-low power communication and sensing systems.
Title: Med-VAE: Large-scale Generalizable Autoencoders for Medical Imaging
Abstract: Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce Med-VAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train each autoencoder on over one million medical images using a novel two-stage training approach. Across diverse tasks obtained from 19 medical image datasets, we demonstrate that (1) utilizing Med-VAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) Med-VAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain.
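The usage pattern, training downstream models on frozen autoencoder latents instead of full-resolution pixels, can be sketched as below. This is an illustrative PyTorch example, not the released Med-VAE models; the encoder architecture, image size, and downsizing factor are assumptions.

import torch
import torch.nn as nn

# Stand-in "frozen autoencoder" encoder; the real encoders are large pretrained models.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=4, stride=4), nn.ReLU(),
    nn.Conv2d(16, 4, kernel_size=4, stride=4),
).eval()
for p in encoder.parameters():
    p.requires_grad_(False)

classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(2))    # trained on latents only

image = torch.randn(1, 1, 512, 512)                           # stand-in high-resolution image
with torch.no_grad():
    latent = encoder(image)                                   # spatially 16x smaller: (1, 4, 32, 32)
logits = classifier(latent)
print(latent.shape, logits.shape)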
Authors: Maya Varma*, Ashwin Kumar*, Rogier van der Sluijs*, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari
Bio: Ashwin Kumar is a third-year PhD student in Biomedical Physics, advised by Akshay Chaudhari and Greg Zaharchuk. He focuses on developing deep learning methodologies to advance medical image acquisition and analysis. He is supported by the Stanford Knight-Hennessy fellowship.
Title: Constrained Diffusion with Trust Sampling
Abstract: Diffusion models have demonstrated significant promise in various generative tasks; however, they often struggle to satisfy challenging constraints. Our approach addresses this limitation by rethinking training-free loss-guided diffusion from an optimization perspective. We formulate a series of constrained optimizations throughout the inference process of a diffusion model. In each optimization, we allow the sample to take multiple steps along the gradient of the proxy constraint function until we can no longer trust the proxy, according to the variance at each diffusion level. Additionally, we estimate the state manifold of the diffusion model to allow for early termination when the sample starts to wander away from the state manifold at each diffusion step. Trust sampling effectively balances between following the unconditional diffusion model and adhering to the loss guidance, enabling more flexible and accurate constrained generation. We demonstrate the efficacy of our method through extensive experiments on complex tasks, and in the drastically different domains of images and 3D motion generation, showing significant improvements over existing methods in terms of generation quality.
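The sampling loop described above can be sketched schematically. The toy PyTorch example below takes repeated gradient steps on a proxy constraint at each noise level and stops once the proposed step exceeds a trust budget tied to that level; the denoiser, noise schedule, constraint, and trust criterion are all simplified stand-ins rather than the paper's formulation.

import torch

def denoiser(x, sigma):
    return x / (1.0 + sigma**2)                      # toy denoiser: shrink toward zero

def constraint(x0):
    return (x0.sum() - 5.0) ** 2                     # toy constraint: components sum to 5

def trust_sample(steps=20, guidance_steps=10, lr=0.1):
    sigmas = torch.linspace(5.0, 0.05, steps)
    x = torch.randn(4) * sigmas[0]
    for sigma in sigmas:
        for _ in range(guidance_steps):
            x = x.detach().requires_grad_(True)
            loss = constraint(denoiser(x, sigma))    # proxy constraint on the denoised estimate
            (grad,) = torch.autograd.grad(loss, x)
            step = -lr * grad
            if step.norm() > sigma:                  # crude trust budget: larger noise levels
                break                                # tolerate larger guided moves
            x = (x + step).detach()
        x = denoiser(x.detach(), sigma) + torch.randn_like(x) * sigma * 0.5  # toy transition
    return x

print(trust_sample())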
Authors: William Huang, Yifeng Jiang, Tom Van Wouwe, Karen Liu
Bio: William is an MS/BS Computer Science student at Stanford. A former IPhO gold medalist turned NeurIPS first author, he has researched a broad range of ML topics, including multimodal transformers, RL, and generative diffusion models. He was previously a Quantitative Researcher at Citadel Securities and an ML Researcher under Prof. Liu (Stanford), Prof. Ng (Stanford), and Prof. Rajpurkar (Harvard Med).
Title: Tunable Free Space Compression
Abstract: With the goal of miniaturization, there has been significant interest in the use of flat optics to shrink free space. Due to the momentum-dependent transfer function of free space, local optics are ruled out. In our work, we show that not only can free space be replaced with non-local flat optics in the momentum domain, but that this can be done in a tunable manner. We describe the operating principle and, through the use of twisted bilayer photonic crystal systems and Moiré periodicity, provide a specific setup that accomplishes this function. This functionality is key to the miniaturization of many present-day technologies.
Authors: Matthew Beutel, Shanhui Fan
Bio: Matthew Beutel is a second-year Applied Physics PhD student. He is advised by Shanhui Fan and is working on photonic crystal device design.