2023 SCIEN Affiliates Meeting Poster Presentations

Index to Posters 

Diffusion in the Dark: A Diffusion Model for Low-Light Text Recognition by Cindy M. Nguyen, Eric R. Chan, Alexander W. Bergman, Gordon Wetzstein

Controlling Energy Flow to Improve Solid-State Upconversion by Pournima Narayanan, Manchen Hu, Emma Belliveau, Ghada Ahmed, Qi Zhou, Will Michaels, Martin Sebastian Fernandez, Linda Pucurimay, Arynn O. Gallegos, Vongaishe Mutatu, Dan Congreve

Real-Time Hand Keypoint Detection on Edge by Mohammad Asadi, Haley So, Gordon Wetzstein

Multifunctional Spaceplates for Chromatic and Spherical Aberration Correction by Yixuan Shao, Robert Lupoiu, Jiaqi Jiang, You Zhou, Jonathan A. Fan

Physics-based Lens Flare Simulation for Nighttime Driving by Zhenyi Liu, Devesh Shah, Alireza Rahimpour, Devesh Upadhyay, Joyce Farrell, Brian Wandell

Make a Donut: Language-Guided Hierarchical EMD-Space Planning for Zero-shot Deformable Object Manipulation by Yang You, Bokui Shen, Congyue Deng, Haoran Geng, He Wang, Leonidas Guibas

Full-Color Metasurface Waveguide Holography by Manu Gopakumar, Gun-Yeal Lee, Suyeon Choi, Brian Chao, Yifan Peng, Jonghyun Kim, Gordon Wetzstein

Deconvolution Volumetric Additive Manufacturing by Triplet-Triplet Annihilation Upconversion by Hao-Chi Yen, Qi Chou, Arynn Gallegos, Dan Congreve

Improved Water Sound Synthesis using Coupled Bubbles by Kangrui Xue, Ryan M. Aronson, Jui-Hsien Wang, Timothy R. Langlois, Doug L. James

TAG: Tracking at Any Granularity by Adam W Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Sheldon Shiqian Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Rares Andrei Ambrus, Katerina Fragkiadaki, Leonidas Guibas

PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields by Zhengfei Kuang, Fujun Luan, Sai Bi, Zhixin Shu, Gordon Wetzstein, Kalyan Sunkavalli

PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors by Haley So, Laurie Bose, Piotr Dudek, and Gordon Wetzstein

Gaussian Shell Maps for Efficient 3D Human Generation by Rameen Abdal, Yifan Wang, Zifan Shi, Gordon Wetzstein

Automatic Neural Spatial Integration by Zilu Li, Guandao Yang, Xi Deng, Bharath Hariharan, Leonidas Guibas, Gordon Wetzstein

Pose-to-Motion: Cross-Domain Motion Retargeting with Pose Prior by Qingqing Zhao, Peizhuo Li, Wang Yifan, Olga Sorkine-Hornung, Gordon Wetzstein

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models by Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Huang, Tuanfeng Wang, Gordon Wetzstein

An Online Method for 3D Protein Reconstruction in X-Ray Single Particle Imaging by Jay Shenoy, Axel Levy, Frédéric Poitevin, Gordon Wetzstein

Orthogonal Adaptation for Multi-concept Fine-tuning of Text-to-Image Diffusion Models by Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein

Synthesize City Walk Videos from Street Maps by Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Gordon Wetzstein, Noah Snavely

Quantitative AFI (Autofluorescence Imaging) for Oral Cancer Screening by Xi Mou, Zhenyi Liu, Haomiao Jiang, Brian Wandell, Joyce Farrell

Efficient Geometry-Aware 3D Generative Adversarial Networks by Eric Chan*, Connor Lin*, Matthew Chan*, Koki Nagano*, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, Gordon Wetzstein

Volumetric Reconstruction Resolves Off-Resonance Artifacts in Static and Dynamic PROPELLER MRI by Annesha Ghosh, Gordon Wetzstein, Mert Pilanci, Sara Fridovich-Keil

Saliency-guided Image Generation by Yunxiang Zhang, Connor Lin, Nan Wu, Qi Sun, Gordon Wetzstein

GPT-4V(ision) is a Versatile and Human-Aligned Evaluator for Text-to-3D Generation by Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein

Image Feature Consensus with Deep Functional Maps by Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas

Toward general neural surrogate PDE solvers with specialized neural accelerators by Chenkai Mao, Robert Lupoiu, Mingkun Chen, Tianxiang Dai, Jonathan Fan

Editing Motion Graphics Video via Vectorization and Transformation by Sharon Zhang, Jiaju Ma, Jiajun Wu, Daniel Ritchie, Maneesh Agrawala

DRGN-AI: Ab initio reconstruction of heterogeneous structural ensembles by Axel Levy, Frederic Poitevin, Gordon Wetzstein, Ellen Zhong

Inferring Hybrid Neural Fluid Fields from Videos by Koven Yu, Yang Zheng, Yuan Gao, Yitong Deng, Bo Zhu, Jiajun Wu 

Thermal Radiance Fields by Yvette Lin, Xin-Yi Pan, Sara Fridovich-Keil, Gordon Wetzstein  

Abstracts


Title: Diffusion in the Dark: A Diffusion Model for Low-Light Text Recognition

Abstract: Capturing images is a key part of automation for high-level tasks such as scene text recognition. Low-light conditions pose a challenge for high-level perception stacks, which are often optimized on well-lit, artifact-free images. Reconstruction methods for low-light images can produce well-lit counterparts, but typically at the cost of high-frequency details critical for downstream tasks. We propose Diffusion in the Dark (DiD), a diffusion model for low-light image reconstruction for text recognition. DiD provides reconstructions qualitatively competitive with state-of-the-art (SOTA) methods while preserving high-frequency details even in extremely noisy, dark conditions. We demonstrate that DiD, without any task-specific optimization, can outperform SOTA low-light methods in low-light text recognition on real images, bolstering the potential of diffusion models to solve ill-posed inverse problems.

Authors: Cindy M. Nguyen, Eric R. Chan, Alexander W. Bergman, Gordon Wetzstein

Bio: Cindy Nguyen is a fifth-year PhD Candidate in the Stanford Computational Imaging Lab. Her interests lie in computational photography and image reconstruction using deep learning and generative AI.


Title: Controlling Energy Flow to Improve Solid-State Upconversion

Abstract: Recently, upconversion (UC) of low-energy photons to higher-energy photons has found applications in various fields such as 3D printing, bioimaging, and photochemistry. In particular, solid-state UC of incoherent NIR photons into visible photons has been identified as a process that can revolutionize current photovoltaics technologies by capturing sub-bandgap photons and can also be deployed to enable passive night vision beyond bulky, externally powered devices. Triplet-triplet annihilation upconversion (TTA-UC) is particularly attractive for these applications due to its low thresholds and broadband, tunable absorption. However, current NIR-to-visible TTA-UC devices using PbS quantum dots and rubrene are limited by low absorption, low energy transfer rates, and, importantly, highly parasitic back-transfer processes that lead to low external quantum efficiencies (EQEs). We propose the introduction of a “blocker layer” that can mitigate FRET-based back transfer to improve the EQE. We demonstrate the use of 5-tetracene carboxylic acid (TCA) as a ligand/blocker layer to improve Dexter energy transfer and alleviate parasitic back FRET, leading to a 3-5x improvement in EQE. Finally, we deconvolute the mechanism of improvement through spectroscopic comparison of the traditional device and our novel UC device. We achieved up to 0.1% EQE using our novel device architecture. This improvement in EQE is a key step forward toward the realization of TTA-UC in photovoltaics and night vision technologies. Additionally, we highlight that this novel architecture relies on simple design principles that can be broadly used to improve EQEs across different wavelength regimes.

Authors: Pournima Narayanan, Manchen Hu, Emma Belliveau, Ghada Ahmed, Qi Zhou, Will Michaels, Martin Sebastian Fernandez, Linda Pucurimay, Arynn O. Gallegos, Vongaishe Mutatu, Dan Congreve

Bio: Pournima Narayanan received her Honors B.Sc. degree in Chemistry from the University of Toronto, where she did her thesis project on the purification of polymer-grafted nanocrystals under Prof. Eugenia Kumacheva. As an SGF and Chevron Fellow, Pournima is currently working in Dan Congreve’s lab on optimizing triplet-triplet annihilation upconversion for applications in imaging, biomedicine, photovoltaics, etc.


Title: Real-Time Hand Keypoint Detection on Edge

Authors: Mohammad Asadi, Haley So, Gordon Wetzstein

Abstract: We present real-time 2D hand keypoint detection on MINOTAUR using the RTMPose pipeline with a CSPNeXt backbone. In this work, we demonstrate quantization and compression of the RTMPose real-time keypoint detection model and its adaptation to the MINOTAUR platform.
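
For illustration, a minimal sketch of post-training dynamic quantization in PyTorch is shown below; the model is a placeholder standing in for an RTMPose-style network, and the MINOTAUR compilation step itself is not shown.

    # Minimal sketch of post-training dynamic quantization in PyTorch.
    # "pose_model" stands in for an RTMPose-style network; the actual
    # MINOTAUR adaptation step is not shown here.
    import torch

    pose_model = torch.nn.Sequential(        # placeholder for the real model
        torch.nn.Linear(256, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 42),            # 21 keypoints x (x, y)
    )
    pose_model.eval()

    # Quantize the weights of Linear layers to int8; activations stay float.
    quantized = torch.ao.quantization.quantize_dynamic(
        pose_model, {torch.nn.Linear}, dtype=torch.qint8
    )

    features = torch.randn(1, 256)
    keypoints = quantized(features).reshape(-1, 21, 2)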

Bio: I am a first-year EE PhD student and currently a rotation student in the Computational Imaging Lab. Previously, I worked on interpretable AI-based feedback systems for education as an intern at the ML4ED laboratory at EPFL, and on uncertainty estimation of human motion recognition models for autonomous vehicles at the VITA laboratory at EPFL.


Title: Multifunctional Spaceplates for Chromatic and Spherical Aberration Correction

Authors: Yixuan Shao, Robert Lupoiu, Jiaqi Jiang, You Zhou, Jonathan A. Fan

Abstract: Over the last decade, substantial research endeavors have been devoted to miniaturizing imaging systems. Recently, the invention of a new optical device, known as the spaceplate, has offered an innovative approach to shrinking the thicknesses of air gaps between lenses in imaging systems. Spaceplates emulate the optical response of free space within a reduced physical space. A typical spaceplate design employs optimized thin-film multilayer structures, which provide an extensive range of design freedom and the possibility of integrating additional functionalities. In this study, we leverage this versatility of spaceplates to assume part of the aberration correction responsibility traditionally assigned to lenses. We present the design and application of multifunctional spaceplates aimed at rectifying chromatic and spherical aberrations. This approach obviates the need for multiple corrector lenses without expanding the system’s footprint by effectively reusing the air gaps’ space, making spaceplates ideal for compact integrated systems where size constraints are a critical concern, such as virtual reality and augmented reality applications.

Bio: Yixuan Shao is a 3rd-year PhD student in electrical engineering. Working with Prof. Jonathan Fan, he is doing research in designing ultra-thin imaging systems with reduced optical aberrations.


Title: Physics-based Lens Flare Simulation for Nighttime Driving

Authors: Zhenyi Liu, Devesh Shah, Alireza Rahimpour, Devesh Upadhyay, Joyce Farrell, Brian Wandell

Abstract: Nighttime driving images present unique challenges compared to daytime images, including low-intensity regions and bright light sources causing sensor saturation and lens flare artifacts. These issues impair computer vision models and make image labeling for network training both costly and error-prone. To address this, we developed an end-to-end image system simulation for creating realistic nighttime images. Our approach involves characterizing the simulation system, generating a synthetic nighttime dataset with detailed labels for training, and demonstrating its effectiveness in tasks like flare removal and object detection.

Bio: Zhenyi Liu is a postdoctoral scholar in the Psychology Department at Stanford University, advised by Prof. Brian Wandell and Dr. Joyce Farrell. His research interests focus on physically based imaging system full pipeline simulation for autonomous driving and consumer photography.


Title: Make a Donut: Language-Guided Hierarchical EMD-Space Planning for Zero-shot Deformable Object Manipulation

Authors: Yang You, Bokui Shen, Congyue Deng, Haoran Geng, He Wang, Leonidas Guibas

Abstract: Deformable object manipulation stands as one of the most captivating yet formidable challenges in robotics. While previous techniques have predominantly relied on learning latent dynamics through demonstrations, typically represented as either particles or images, there exists a pertinent limitation: acquiring suitable demonstrations, especially for long-horizon tasks, can be elusive. Moreover, basing learning entirely on demonstrations can hamper the model’s ability to generalize beyond the demonstrated tasks. In this work, we introduce a demonstration-free hierarchical planning approach capable of tackling intricate long-horizon tasks without necessitating any training. We employ large language models (LLMs) to articulate a high-level, stage-by-stage plan corresponding to a specified task. For every individual stage, the LLM provides both the tool’s name and the Python code to craft intermediate subgoal point clouds. With the tool and subgoal for a particular stage at our disposal, we present a granular, closed-loop model predictive control strategy that iteratively applies a Differentiable Physics with Point-to-Point correspondence (DiffPhysics-P2P) loss in the earth mover’s distance (EMD) space. Experimental findings affirm that our technique surpasses multiple benchmarks in dough manipulation, spanning both short and long horizons. Remarkably, our model demonstrates robust generalization capabilities to novel and previously unencountered complex tasks without any preliminary demonstrations. We further substantiate our approach with experimental trials on real-world robotic platforms.
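
As a minimal illustration of the planning objective, the sketch below computes the earth mover’s distance between two equal-size point clouds via optimal assignment; this is the quantity the closed-loop controller descends, though the paper’s DiffPhysics-P2P loss and LLM planning loop are not reproduced here.

    # Sketch: earth mover's distance between two equal-size point clouds,
    # computed exactly via optimal assignment (Hungarian algorithm).
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def emd(cloud_a: np.ndarray, cloud_b: np.ndarray) -> float:
        """EMD between (n, 3) point clouds with equal point counts."""
        cost = cdist(cloud_a, cloud_b)            # pairwise distances
        rows, cols = linear_sum_assignment(cost)  # optimal matching
        return cost[rows, cols].mean()

    current = np.random.rand(512, 3)   # current dough state (placeholder)
    subgoal = np.random.rand(512, 3)   # LLM-generated subgoal point cloud
    print(emd(current, subgoal))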

Bio: Yang You is a postdoc in the Geometric Computation group at Stanford University, working with Prof. Leonidas Guibas. His interests lie in 3D vision and embodied AI.


Title: Full-Color Metasurface Waveguide Holography

Authors: Manu Gopakumar, Gun-Yeal Lee, Suyeon Choi, Brian Chao, Yifan Peng, Jonghyun Kim, Gordon Wetzstein

Abstract: Recent advances in augmented reality (AR) technology are expected to revolutionize the way digital data is integrated with users’ perception of the real world, opening up new possibilities in various fields such as entertainment, education, communication, and training. Despite its potential, the widespread implementation of AR display technology faces obstacles due to the bulky projection optics of display engines and the challenge of displaying accurate 3D depth cues for virtual content, among other factors. Here, we present an innovative holographic AR display system that addresses these challenges through a unique and synergistic combination of nanophotonic hardware technology and artificial intelligence (AI) software technology. Thanks to a compact metasurface waveguide system and AI-driven holography algorithms, our AR holographic display system delivers high-quality, full-color 3D augmented reality content in a compact device form factor. The core techniques of our work are to use inverse-designed metasurfaces with dispersion-compensating waveguide geometry and an innovative image formation model, taking into account both physical waveguide models and learned components that are automatically calibrated using camera-in-the-loop technology. The groundbreaking integration of nanophotonic metasurface waveguides and AI-based holography algorithms represents a major leap forward in producing immersive 3D augmented reality experiences with a compact wearable device.

Bio: Manu Gopakumar is a PhD student in the Stanford Computational Imaging Lab; his research interests center on the co-design of optical systems and computational algorithms to build next-generation VR and AR headsets. Gun-Yeal Lee is a postdoctoral researcher in the Stanford Computational Imaging Lab; his current research focuses on novel optical applications using nanophotonics and metasurface optical elements to develop next-generation display and imaging systems.


Title: Deconvolution Volumetric Additive Manufacturing by Triplet-Triplet Annihilation Upconversion

Authors: Hao-Chi Yen, Qi Chou, Arynn Gallegos, Dan Congreve 

Abstract: Triplet-triplet annihilation upconversion (TTA-UC) introduces a breakthrough in 3D printing, combining low-power consumption with high-speed, high-resolution output. This method diverges from traditional two-photon absorption processes by transforming two low-energy photons into a single high-energy photon via excitonic state manipulation in molecules. A key feature of TTA-UC is its efficiency with low-power light sources like LEDs, enabling deeper polymerization for intricate volumetric printing. The integration of deconvolution techniques further enhances this process by mitigating the Gaussian blur effect typically associated with each pixel in 3D printing. This refinement improves the overall image quality, allowing for sharper, more precise prints. The synergy between TTA-UC and deconvolution marks a significant leap forward in micro- and nanoscale fabrication, paving the way for a wide array of advanced applications in various fields.
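
As a minimal illustration of the deconvolution step, the sketch below pre-corrects a projection pattern for per-pixel Gaussian blur with Richardson-Lucy deconvolution in scikit-image; the PSF width is a placeholder, not a measured system value.

    # Sketch: pre-correcting a projected pattern for per-pixel Gaussian
    # blur via Richardson-Lucy deconvolution (scikit-image). The PSF
    # width is a placeholder, not a measured value.
    import numpy as np
    from scipy.ndimage import gaussian_filter
    from skimage.restoration import richardson_lucy

    target = np.zeros((64, 64))
    target[24:40, 24:40] = 1.0                   # desired cured region

    # Assumed optical PSF: unit-mass Gaussian kernel.
    psf = np.zeros((9, 9)); psf[4, 4] = 1.0
    psf = gaussian_filter(psf, sigma=1.5)
    psf /= psf.sum()

    # Deconvolved pattern: projecting this (and letting the optics blur
    # it) lands closer to the sharp target than projecting target itself.
    pattern = richardson_lucy(target, psf)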

Bio: Hao-Chi Yen is a second-year master’s student in Materials Science & Engineering in Prof. Dan Congreve’s lab, working on volumetric 3D printing via triplet-triplet annihilation upconversion.


Title: Improved Water Sound Synthesis using Coupled Bubbles

Authors: Kangrui Xue, Ryan M. Aronson, Jui-Hsien Wang, Timothy R. Langlois, Doug L. James

Abstract: We introduce a practical framework for synthesizing bubble-based water sounds that captures the rich inter-bubble coupling effects responsible for low-frequency acoustic emissions from bubble clouds. We propose coupled bubble oscillator models with regularized singularities, and techniques to reduce the computational cost of time stepping with dense, time-varying mass matrices. Airborne acoustic emissions are estimated using finite-difference time-domain (FDTD) methods. We propose a simple, analytical surface acceleration model, and a sample-and-hold GPU wavesolver that is both simpler and faster than prior CPU wavesolvers. Sound synthesis results are demonstrated using bubbly flows from incompressible, two-phase simulations, as well as procedurally generated examples using single-phase FLIP fluid animations. Our results demonstrate sound simulations with hundreds of thousands of bubbles, and perceptually significant frequency transformations with fuller low-frequency content.
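
Schematically, the coupled oscillators form a second-order system (a generic form for intuition; the paper’s exact discretization and regularization differ):

    \mathbf{M}(t)\,\ddot{\mathbf{q}} + \mathbf{C}\,\dot{\mathbf{q}} + \mathbf{K}\,\mathbf{q} = \mathbf{f}(t),

where q collects the bubble oscillation coordinates and the dense, time-varying off-diagonal entries of M(t) encode the acoustic coupling between bubbles; the time-stepping techniques above target exactly these dense solves.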

Bio: Kangrui (Cong-ray) Xue is a 2nd-year PhD student in Computer Science at Stanford University, advised by Prof. Doug James. His research focuses on physics-based sound simulation for computer animation (i.e., methods for augmenting video with sound). Previously, he completed a BS in Electrical Engineering and a BA in Music, also at Stanford. 


Title: TAG: Tracking at Any Granularity

Authors: Adam W Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Sheldon Shiqian Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Rares Andrei Ambrus, Katerina Fragkiadaki, Leonidas Guibas

Abstract: We introduce the Tracking at Any Granularity (TAG) project: a new task, model, and dataset for tracking arbitrary targets in video. We seek a tracking method that treats points, parts, and objects as equally trackable target types, embracing the fact that the distinction between these granularities is ambiguous. We introduce a generic high-capacity transformer for the task, which accepts multi-modal prompt inputs: targets can be indicated by clicks, boxes, masks, or natural language. To train the model, we aggregate all publicly available tracking datasets that we are aware of (currently 62), comprising millions of clips with tracking annotations and including a long tail of rare subjects, such as body keypoints on insects and microscopy data. Our model is competitive with the state of the art on standard benchmarks for point tracking, mask tracking, and box tracking, but more importantly, it achieves zero-shot performance far superior to prior work, largely thanks to the data effort. We will publicly release our code, model, and aggregated dataset to provide a foundation model for motion and video understanding and to facilitate future work in this direction.

Bio: Adam is a postdoc at Stanford University, working with Leonidas Guibas. He recently completed his Ph.D. at The Robotics Institute at Carnegie Mellon University, where he worked with Katerina Fragkiadaki. His research interests lie in Computer Vision and Machine Learning, particularly for 3D understanding and fine-grained tracking.


Title: PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields

Authors: Zhengfei Kuang, Fujun Luan, Sai Bi, Zhixin Shu, Gordon Wetzstein, Kalyan Sunkavalli

Abstract: Recent advances in neural radiance fields have enabled the high-fidelity 3D reconstruction of complex scenes for novel view synthesis. However, it remains underexplored how the appearance of such representations can be efficiently edited while maintaining photorealism. In this work, we present PaletteNeRF, a novel method for photorealistic appearance editing of neural radiance fields (NeRF) based on 3D color decomposition. Our method decomposes the appearance of each 3D point into a linear combination of palette-based bases (i.e., 3D segmentations defined by a group of NeRF-type functions) that are shared across the scene. While our palette-based bases are view-independent, we also predict a view-dependent function to capture the color residual (e.g., specular shading). During training, we jointly optimize the basis functions and the color palettes, and we also introduce novel regularizers to encourage the spatial coherence of the decomposition. Our method allows users to efficiently edit the appearance of the 3D scene by modifying the color palettes. We also extend our framework with compressed semantic features for semantic-aware appearance editing. We demonstrate that our technique is superior to baseline methods both quantitatively and qualitatively for appearance editing of complex real-world scenes.
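
In equation form (a sketch consistent with the description above, not the paper’s exact parameterization), the radiance at point x viewed from direction d decomposes as

    c(\mathbf{x}, \mathbf{d}) = \sum_{i=1}^{K} w_i(\mathbf{x})\,\mathbf{p}_i + r(\mathbf{x}, \mathbf{d}),

where the p_i are the shared color palettes, the w_i are view-independent basis weights, and r is the view-dependent residual (e.g., specular shading); editing then amounts to modifying the palettes p_i while the optimized weights remain untouched.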

Bio: Zhengfei Kuang is a second-year Ph.D. student at Stanford University, advised by Gordon Wetzstein and Ron Fedkiw. His current research areas are neural rendering, 3D/4D content generation, and human digitization.


Title: PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors

Authors: Haley So, Laurie Bose, Piotr Dudek, and Gordon Wetzstein

Abstract: Conventional image sensors digitize high-resolution images at fast frame rates, producing a large amount of data that needs to be transmitted off the sensor for further processing. This is challenging for perception systems operating on edge devices, because communication is power inefficient and induces latency. Fueled by innovations in stacked image sensor fabrication, emerging sensor–processors offer programmability and minimal processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture, PixelRNN, that encodes spatio-temporal features on the sensor using purely binary operations. PixelRNN reduces the amount of data to be transmitted off the sensor by factors up to 256 compared to the raw sensor data while offering competitive accuracy for hand gesture recognition and lip reading tasks. We experimentally validate PixelRNN using a prototype implementation on the SCAMP-5 sensor–processor platform. 
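
Below is a toy sketch of a recurrent cell with binary states of the kind described, written in PyTorch; the layer sizes and sign-based binarization are illustrative assumptions, not the paper’s exact architecture.

    # Toy sketch of a recurrent cell with binary states and activations,
    # in the spirit of PixelRNN's on-sensor encoder. Sizes and the
    # sign-based binarization are illustrative assumptions.
    import torch

    class BinaryRNNCell(torch.nn.Module):
        def __init__(self, in_ch=1, hid_ch=8):
            super().__init__()
            self.conv_x = torch.nn.Conv2d(in_ch, hid_ch, 3, padding=1)
            self.conv_h = torch.nn.Conv2d(hid_ch, hid_ch, 3, padding=1)

        def forward(self, x, h):
            # Binarize the pre-activation so the stored state (and any
            # data sent off-sensor) is 1 bit per pixel per channel.
            pre = self.conv_x(x) + self.conv_h(h)
            return torch.sign(pre)

    cell = BinaryRNNCell()
    h = torch.zeros(1, 8, 32, 32)
    for t in range(4):                      # a short frame sequence
        frame = torch.randn(1, 1, 32, 32)
        h = cell(frame, h)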

Bio: Haley So is a PhD candidate in Professor Gordon Wetzstein’s Computational Imaging Lab. She is interested in utilizing emerging sensors to rethink imaging algorithms and computer vision tasks.


Title: Gaussian Shell Maps for Efficient 3D Human Generation

Authors: Rameen Abdal, Yifan Wang, Zifan Shi, Gordon Wetzstein

Abstract: Efficient generation of 3D digital humans is important in several industries, including virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures, however, typically rely on volume representations, which are slow to render, thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi-shell–based scaffold. In this setting, a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of 512 x 512 pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ, DeepFashion, and AIST++.

Bio: Rameen Abdal is a postdoc working under the supervision of Prof. Gordon Wetzstein. He completed his PhD in Computer Science from KAUST, Saudi Arabia. He is interested in 3D generative modeling using GANs and Diffusion models. Yifan Wang is a postdoc working under the supervision of Prof. Gordon Wetzstein. She completed her PhD in Computer Science from ETH Zurich, Switzerland. She is interested in developing intelligent systems for automatic 3D creation and modeling. Zifan Shi is a final-year PhD student at the Hong Kong University of Science and Technology (HKUST). She is a visiting student at Stanford working with Prof. Gordon Wetzstein. Her research interests include generative models and 3D vision.


Title: Automatic Neural Spatial Integration

Authors: Zilu Li, Guandao Yang, Xi Deng, Bharath Hariharan, Leonidas Guibas, Gordon Wetzstein

Abstract: Spatial integration is essential for a number of scientific computing applications, such as solving partial differential equations (PDEs). Spatial integrals are usually computed numerically via Monte Carlo methods, which produce accurate and unbiased results. However, they can be slow, since achieving accurate low-variance results requires evaluating the integrand many times. Recently, researchers have proposed using neural networks to approximate integration results. While networks are very fast to evaluate at test time, they can only approximate the integration results and thus produce biased estimates. In this paper, we propose to combine these two complementary classes of methods to create a fast and unbiased estimator. The key idea is that instead of relying on the neural network’s approximate output directly, we use the network as a control variate for the Monte Carlo estimator. We propose a principled way to construct such estimators and derive a training objective that minimizes their variance. We also provide preliminary results showing that our proposed estimator can both reduce the variance of Monte Carlo PDE solvers and produce unbiased results in solving Laplace and Poisson equations.
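
A minimal numerical sketch of the control-variate idea follows, with a toy 1D integrand; here g plays the role of the network, and its integral G is assumed known in closed form (in the paper’s setting, by construction of the network).

    # Control-variate Monte Carlo: unbiased regardless of how well the
    # "network" g approximates the integrand f, with variance shrinking
    # as g improves. Toy 1D example on [0, 1].
    import numpy as np

    f = lambda x: np.exp(np.sin(3 * x))   # expensive integrand (toy)
    g = lambda x: 1.0 + np.sin(3 * x)     # approximation of f (toy)
    G = 1.0 + (1 - np.cos(3.0)) / 3.0     # exact integral of g on [0, 1]

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=10_000)

    plain_mc = f(x).mean()                       # unbiased, higher variance
    control_variate = G + (f(x) - g(x)).mean()   # unbiased, lower variance
    print(plain_mc, control_variate)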

Bio: Zilu Li: I’m a visiting undergraduate student from Cornell University working with Prof. Gordon Wetzstein. My research revolves around computer vision, computer graphics, and machine learning. These days, I’m interested in exploring how to use neural fields efficiently for different physics and geometric computing problems, such as solving PDEs and integrals.
Guandao Yang: I’m a postdoctoral researcher at Stanford University, advised by Prof. Leonidas Guibas and Prof. Gordon Wetzstein. I do research at the intersection of computer vision, machine learning, and computer graphics. I did my PhD at Cornell University, advised by Prof. Serge Belongie and Prof. Bharath Hariharan. My recent research focus is developing a geometry processing pipeline that can create, manipulate, and analyze 3D shapes in an intelligent and data-driven way.


Title: Pose-to-Motion: Cross-Domain Motion Retargeting with Pose Prior

Authors: Qingqing Zhao, Peizhuo Li, Wang Yifan, Olga Sorkine-Hornung, Gordon Wetzstein

Abstract: Creating believable motions for various characters has long been a goal in computer graphics. Current learning-based motion synthesis methods depend on extensive motion datasets, which are often challenging, if not impossible, to obtain. On the other hand, pose data is more accessible, since static posed characters are easier to create and can even be extracted from images using recent advancements in computer vision. In this paper, we utilize this alternative data source and introduce a neural motion synthesis approach through retargeting. Our method generates plausible motions for characters that have only pose data by transferring motion from an existing motion capture dataset of another character, which can have a drastically different skeleton. Our experiments show that our method effectively combines the motion features of the source character with the pose features of the target character, and performs robustly with small or noisy pose datasets, ranging from a few artist-created poses to noisy poses estimated directly from images. Additionally, a user study indicated that a majority of participants found our retargeted motion more enjoyable to watch, more lifelike in appearance, and less prone to artifacts.

Bio: Qingqing Zhao is a 4th-year EE PhD student advised by Prof. Gordon Wetzstein. Her research interest lies at the intersection of machine learning and dynamic modeling, including both physics-based simulation and character animation.


Title: Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

Authors: Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Huang, Tuanfeng Wang, Gordon Wetzstein

Abstract: Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene’s geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which can be automated by emerging text-to-video diffusion models. Despite great promise, video diffusion models are difficult to control, hindering a user from applying their creativity rather than amplifying it. To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressiveness and editability of emerging diffusion models. For this purpose, our approach takes an animated, low-fidelity rendered mesh as input and injects the ground truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality and temporally consistent frames. We demonstrate our approach on various examples where motion can be obtained by changing the camera path or animating rigged assets. By utilizing the underlying 4D spatio-temporal representation, our method can capture long-range interactions between multiple objects that undergo occlusions, which are otherwise very challenging to handle with existing methods.

Bio: I am a first-year PhD student in computer science at Stanford, advised by Gordon Wetzstein. Prior to Stanford, I obtained my master’s from ETH Zurich in Switzerland and my bachelor’s from King’s College in the UK. My research focuses on solving graphics or inverse graphics tasks that are fundamentally ill-posed for traditional methods. I have been working primarily on neural rendering, including but not limited to generative models, inverse rendering, unsupervised learning, scene representations, etc. I like making cool videos and demos.


Title: An Online Method for 3D Protein Reconstruction in X-Ray Single Particle Imaging

Authors: Jay Shenoy, Axel Levy, Frédéric Poitevin, Gordon Wetzstein

Abstract: X-ray single particle imaging (SPI) is a nascent technique that can capture the dynamics of biomolecules at room temperature. SPI experiments will one day collect tens of millions of images of the same molecule in order to overcome the weak scattering of individual proteins. Existing reconstruction algorithms will be unable to scale to datasets of this size because they perform computationally expensive search steps to estimate the orientation of the molecule in each image. In this work, we propose a reconstruction algorithm that amortizes the estimation of pose via an autoencoder framework. Our approach consists of a convolutional encoder that maps X-ray images to predicted poses and a physics-based decoder that implicitly fuses all the 2D scattering images into a volumetric representation of the molecule. We validate our method on 6 synthetic datasets of 2 distinct proteins, showing that for the largest datasets containing 5 million images, our technique can reconstruct the electron density in a single pass.

Bio: Jay is a second-year PhD student in the lab of Prof. Gordon Wetzstein conducting research in the area of scientific imaging. He is interested in physics-based reconstruction of both static and dynamic entities in the natural sciences.


Title: Orthogonal Adaptation for Multi-concept Fine-tuning of Text-to-Image Diffusion Models

Authors: Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein

Abstract: Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed up without interfering with each other. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models.
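
Below is a small numerical sketch of the merging step, assuming LoRA-style low-rank residuals; the orthogonality construction is simplified here to disjoint row spaces, standing in for the paper’s method.

    # Sketch of merging independently fine-tuned low-rank residuals.
    # If the down-projection rows of different concepts are orthogonal,
    # summing the residuals causes no cross-talk between concepts.
    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 64, 4
    W0 = rng.normal(size=(d, d))                 # shared base weight

    # Orthonormal basis split between two concepts' down-projections.
    Q, _ = np.linalg.qr(rng.normal(size=(d, 2 * r)))
    A1, A2 = Q[:, :r].T, Q[:, r:].T              # (r, d), mutually orthogonal
    B1 = rng.normal(size=(d, r))                 # learned up-projections
    B2 = rng.normal(size=(d, r))

    W_merged = W0 + B1 @ A1 + B2 @ A2            # one model, no extra cost

    # Inputs in concept 1's subspace are untouched by concept 2's residual:
    x = A1.T @ rng.normal(size=r)
    print(np.allclose((B2 @ A2) @ x, 0.0, atol=1e-10))   # True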

Bio: Ryan Po is a second-year PhD student at Stanford University working with Prof. Gordon Wetzstein at the Stanford Computational Imaging Lab. His research lies in 2D/3D content generation, with a focus on adding intuitive control handles to AI-generated content.


Title: Synthesize City Walk Videos from Street Maps

Authors: Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Gordon Wetzstein, Noah Snavely

Abstract: Given a map of a city, e.g., London, we aim to generate a street walk video conditioned on a style described by text, e.g., “New York in the Rain”. The viewpoint of each frame can then be controlled by the user according to the map, as if walking through the scene. With abundant camera imagery from Google Street View and aerial imagery from Google Earth, existing text-to-image generation approaches can generate high-quality single frames for such videos. Yet making the whole video consistent across different views for large city scenes is still an open problem, particularly when a training set of such videos is absent. In contrast to prior work that focuses on propagating information across different frames, we approach this problem from a novel search perspective. From all the diverse images we can generate for each frame, our algorithm seeks a sequence of images that is multi-view consistent. While the coarse and noisy geometry given by the street map and street height map is insufficient for transforming frames in a way that is geometrically accurate, this coarse geometry is very informative for scoring multi-view consistency across frames, which is vital to our search algorithm. Additionally, we derive geometry-aware sampling techniques to accelerate the search process. Our results show that our algorithm generates notably more consistent videos than prior video generation methods. Meanwhile, at the cost of imperfect multi-view consistency, our algorithm achieves higher per-frame quality than prior street view reconstruction or generation methods that use an actual 3D representation. We also showcase examples of creative video creation using our algorithm.
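
The sketch below is a schematic of such a search loop, written as a beam search; the consistency score is left as a stub, since the geometry-aware scoring and sampling are the paper’s contribution and are not reproduced here.

    # Schematic beam search over per-frame candidate images, keeping the
    # sequences that score highest on multi-view consistency. The scoring
    # function is a stub; the paper scores it using coarse map geometry.
    def consistency(prev_img, next_img) -> float:
        raise NotImplementedError  # geometry-aware score (paper's method)

    def beam_search(candidates_per_frame, beam_width=8):
        beams = [([img], 0.0) for img in candidates_per_frame[0]]
        for frame_candidates in candidates_per_frame[1:]:
            expanded = [
                (seq + [img], score + consistency(seq[-1], img))
                for seq, score in beams
                for img in frame_candidates
            ]
            expanded.sort(key=lambda b: b[1], reverse=True)
            beams = expanded[:beam_width]
        return beams[0][0]  # most consistent sequence found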

Bio: Boyang Deng is a second-year PhD student in CS at Stanford, jointly supervised by Prof. Gordon Wetzstein and Prof. Leonidas Guibas. He also works at Google Research as a part-time student researcher. Prior to Stanford, he worked as a research scientist at Waymo Research and Google Brain.


Title: Quantitative AFI (Autofluorescence Imaging) for Oral Cancer Screening

Authors: Xi Mou, Zhenyi Liu, Haomiao Jiang, Brian Wandell, Joyce Farrell

Abstract: Autofluorescence imaging (AFI) is a non-invasive, real-time imaging technique that has proven valuable for early detection and monitoring of oral cancers. In many cases, early detection can extend patients’ lives and improve their quality of life. In the case of oral cancer, AFI visualizes the fluorescence emitted by endogenous tissue fluorophores in the mouth, with no need for exogenous labels or dyes. Today’s AFI devices for detecting oral lesions rely on the subjective judgment of clinicians, which in turn depends on their training and visual abilities. We are designing an autofluorescence imaging system that obtains quantitative imaging data for oral cancer screening. The instrument design is guided by measurements and simulation. We employ an excitation light with a peak wavelength, spectral bandwidth, and beam angle that maximize tissue fluorescence without causing tissue damage. Our measurements indicate that the autofluorescence signal generated by the tongue is about four orders of magnitude less intense than the light reflected from the tongue. To detect this signal, we use a camera with a longpass filter that reduces the reflected light reaching a calibrated imaging sensor. The agreement between our simulations and measurements makes it possible to quantify tissue fluorescence, evaluate hypotheses about the underlying tissue fluorophores, and potentially develop lab tests for oral lesion diagnosis.
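
To make the filtering requirement concrete: since the reflected excitation light is roughly 10^4 times the fluorescence signal, the longpass filter needs an optical density of at least about

    \mathrm{OD} = \log_{10} \frac{I_{\text{in}}}{I_{\text{out}}} \approx \log_{10} 10^{4} = 4

at the excitation wavelength for the residual reflected light to fall to the level of the fluorescence signal; this back-of-envelope figure follows from the measurement quoted above, not from the instrument’s actual specification.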

Bio: Xi Mou is a postdoctoral scholar at Stanford University, advised by Prof. Brian Wandell and Dr. Joyce Farrell. Her research focuses on simulation and design of medical imaging devices. 


Title: Efficient Geometry-Aware 3D Generative Adversarial Networks

Authors: Eric Chan*, Connor Lin*, Matthew Chan*, Koki Nagano*, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, Gordon Wetzstein

Abstract: Recent advances in neural rendering and generative models have enabled efficient, near photorealistic generation of 3D objects. In this demonstration, we showcase real-time 3D avatar generation running on a consumer PC. Similar generative 3D technology may soon transform how we create 3D objects, avatars, and virtual worlds, with exciting applications in film, video games, the metaverse, and beyond.

Bio: Eric Chan is a Ph.D. student at Stanford, where he is currently working with Prof. Gordon Wetzstein’s Computational Imaging group. During his childhood in Oakland, CA, a family full of architects and many years spent in robotics competitions instilled an appreciation for design, robotic locomotion, and spatial understanding. After studying mechanical engineering and computer science at Yale, he began learning the basics of computer vision in the hope of teaching his robots and algorithms how to better understand the world around them. Over the last couple of years, his focus has shifted to the intersection of 3D graphics and vision, to generalization across 3D representations, and to 3D generative models. Find more at ericryanchan.github.io.


Title: Volumetric Reconstruction Resolves Off-Resonance Artifacts in Static and Dynamic PROPELLER MRI

Authors: Annesha Ghosh, Gordon Wetzstein, Mert Pilanci, Sara Fridovich-Keil

Abstract: Off-resonance artifacts in magnetic resonance imaging (MRI) are visual distortions that occur when the actual resonant frequencies of spins within the imaging volume differ from the expected frequencies used to encode spatial information. These discrepancies can be caused by a variety of factors, including magnetic field inhomogeneities, chemical shifts, or susceptibility differences within the tissues. Such artifacts can manifest as blurring, ghosting, or misregistration of the reconstructed image, and they often compromise its diagnostic quality. We propose to resolve these artifacts by lifting the 2D MRI reconstruction problem to 3D, introducing an additional “spectral” dimension to model this off-resonance. Our approach is inspired by recent progress in modeling radiance fields, and is capable of reconstructing both static and dynamic MR images. We demonstrate our approach in the context of PROPELLER (Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction) MRI acquisitions, which are popular for their robustness to motion artifacts. Our method operates in a few minutes on a single GPU, and to our knowledge is the first to correct for chemical shift in gradient echo PROPELLER MRI reconstruction without additional measurements or pretraining data.
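
Schematically, a standard off-resonance signal model (stated here for intuition; the paper’s exact formulation may differ) shows why adding a spectral coordinate f resolves the artifact:

    s(\mathbf{k}) = \int \rho(\mathbf{x}, f)\, e^{-i 2\pi \left( \mathbf{k} \cdot \mathbf{x} + f\, t(\mathbf{k}) \right)} \, d\mathbf{x}\, df,

where t(k) is the time at which sample k is acquired. Reconstructing the lifted volume ρ(x, f) jointly over space and frequency absorbs the off-resonance phase that a purely 2D reconstruction misattributes to blurring and ghosting.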

Bio: Sara Fridovich-Keil is a postdoctoral scholar at Stanford working with Professors Gordon Wetzstein and Mert Pilanci. She completed her PhD at UC Berkeley, where she was advised by Professor Ben Recht. Sara’s interests span machine learning, computer vision, and signal processing, with a particular interest in scene representations and medical applications.


Title: Saliency-guided Image Generation

Authors: Yunxiang Zhang, Connor Lin, Nan Wu, Qi Sun, Gordon Wetzstein

Abstract: The advent of models like Stable Diffusion marks a significant leap forward in image generation. However, existing models lack the capability to steer users’ focus within generated images, often resulting in saliency occurring at random regions. Visual saliency is critical in applications that demand targeted user attention, such as video games and advertising, where control of user focus can substantially enhance the gameplay experience and engagement with promotional content. There has been extensive research on controlling images generated by Stable Diffusion using ControlNet. However, existing ControlNet models, such as depth-conditioned ones, are limited in guiding viewers’ attention. They do not cater to the dynamic nature of visual saliency, which encapsulates not only depth but also factors such as color, contrast, and texture that collectively capture a viewer’s gaze. Our study introduces a fine-tuned ControlNet trained on a diverse set of saliency maps to modulate the image generation process. This results in images that are intentionally crafted to direct visual attention to specific areas. To validate our approach, we conducted user studies in which observers’ gaze patterns were monitored and compared to the intended saliency conditions.

Bio: Connor: I am a third-year Computer Science PhD Candidate at Stanford University, co-advised by Leonidas Guibas and Gordon Wetzstein. My research explores neural representations for 3D reconstruction, generation, and editing.
Nan: I am a second-year Electrical Engineering Master’s student at Stanford University. My interests include XR and IoT. I worked at Google as a software engineer on Google Home and Gmail before coming to Stanford.


Title: GPT-4V(ision) is a Versatile and Human-Aligned Evaluator for Text-to-3D Generation

Authors: Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, Gordon Wetzstein

Abstract: Despite recent advances in text-to-3D generative methods, there is a notable absence of reliable evaluation metrics. Existing metrics, such as CLIP similarity, focus on a single criterion, like text-asset alignment. These metrics lack the flexibility to generalize to different evaluation criteria and usually do not align well with human preference. Conducting user preference studies is an alternative that offers both adaptability and human-aligned results. User studies, however, can be very expensive to scale. This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models. To this end, we first develop a prompt generator using GPT-4V to generate evaluation prompts, which serve as input to compare text-to-3D models. We further design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria. Finally, we use the pairwise comparison results from GPT-4V to assign each text-to-3D model an Elo rating. Experimental results suggest that our metric consistently aligns strongly with human preference across different evaluation criteria.
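
The final step can be sketched as follows: converting pairwise comparison outcomes into Elo ratings. The K-factor and initial rating below are illustrative defaults, not the paper’s settings.

    # Sketch: Elo ratings from pairwise comparisons between models.
    # Each comparison is (winner, loser) as judged by GPT-4V; K and the
    # initial rating of 1000 are illustrative defaults.
    def elo_ratings(comparisons, k=32.0, init=1000.0):
        ratings = {}
        for winner, loser in comparisons:
            ra = ratings.setdefault(winner, init)
            rb = ratings.setdefault(loser, init)
            expected = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
            ratings[winner] = ra + k * (1.0 - expected)
            ratings[loser] = rb - k * (1.0 - expected)
        return ratings

    results = [("model_a", "model_b"), ("model_a", "model_c"),
               ("model_c", "model_b")]
    print(elo_ratings(results))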

Bio: Guandao is a postdoctoral researcher at Stanford University, advised by Prof. Leonidas Guibas and Prof. Gordon Wetzstein. He does research at the intersection of Computer Vision, Machine Learning, and Computer Graphics. He obtained his PhD at Cornell University, advised by Prof. Serge Belongie and Prof. Bharath Hariharan.


Title: Image Feature Consensus with Deep Functional Maps

Authors: Xinle Cheng, Congyue Deng, Adam Harley, Yixin Zhu, Leonidas Guibas

Abstract: Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed, and benchmarked, by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. Existing work has attempted to improve the quality of these correspondence maps by carefully mixing features from different sources, such as by combining the features of two networks. We point out that a better correspondence strategy is available, which directly imposes structure on the correspondence field: the functional map. Wielding this simple mathematical tool, we lift the correspondence problem from the pixel space to the function space and directly optimize for mappings that are globally coherent. We demonstrate that our technique yields correspondences that are not only smoother but more accurate, perhaps better reflecting the knowledge embedded in the large-scale vision models that we are studying. Our approach sets a new state-of-the-art on a variety of dense correspondence tasks. We also demonstrate a novel application, in transferring affordance maps from one tool to another.
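
In its simplest least-squares form (a sketch of the general tool, not the paper’s full objective), the functional map C between feature matrices F_A and F_B, expressed in a function basis on each image, is

    C^{*} = \arg\min_{C} \lVert C F_A - F_B \rVert_F^{2},

and dense correspondences are then read off from C* rather than from raw per-pixel nearest neighbors; constraining the map in this way is what imposes global coherence on the correspondence field.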

Bio: I am a fourth-year Ph.D. student in computer science at Stanford University advised by Leonidas Guibas. My research interests include 3D computer vision and geometric deep learning. I am particularly interested in developing and leveraging better data representations in 3D shape analysis and processing.


Title: Toward general neural surrogate PDE solvers with specialized neural accelerators

Authors: Chenkai Mao, Robert Lupoiu, Mingkun Chen, Tianxiang Dai, Jonathan Fan

Abstract: Surrogate neural network-based partial differential equation (PDE) solvers have the potential to solve PDEs in an accelerated manner, but they are largely limited to systems featuring fixed domain sizes, geometric layouts, and boundary conditions. We propose Specialized Neural Accelerator-Powered Domain Decomposition Methods (SNAP-DDM), a DDM-based approach to PDE solving in which subdomain problems containing arbitrary boundary conditions and geometric parameters are accurately solved using an ensemble of specialized neural operators. We tailor SNAP-DDM to 2D electromagnetics and fluidic flow problems and show how innovations in network architecture and loss function engineering can produce specialized surrogate subdomain solvers with near-unity accuracy. We utilize these solvers with standard DDM algorithms to accurately solve freeform electromagnetics and fluids problems featuring a wide range of domain sizes.
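
The sketch below is a schematic of one overlapping-Schwarz-style DDM pass with a neural subdomain solver in the loop; the solver call and the subdomain bookkeeping are stubs standing in for the trained neural operators and the actual partitioning.

    # Schematic domain-decomposition iteration with a neural subdomain
    # solver in the loop (overlapping-Schwarz style). The operator call
    # is a stub; SNAP-DDM's specialized solvers fill this role.
    def neural_subdomain_solve(local_params, boundary_values):
        raise NotImplementedError  # trained neural operator (stub)

    def ddm_solve(global_field, subdomains, n_iters=50):
        for _ in range(n_iters):
            for sub in subdomains:
                # Read current boundary values from neighboring subdomains.
                bc = global_field[sub.boundary_index]
                # Solve the local problem with a specialized neural operator.
                local = neural_subdomain_solve(sub.params, bc)
                # Write the interior solution back into the global field.
                global_field[sub.interior_index] = local
        return global_field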

Bio: Chenkai Mao is a 4th-year PhD student in Electrical Engineering advised by Prof. Jonathan Fan. His research focuses on two main aspects: (1) AI for science: using deep learning to accelerate scientific computing and improve design optimization; (2) inverse design and nanofabrication of nanophotonic devices.


Title: Editing Motion Graphics Video via Vectorization and Transformation

Authors: Sharon Zhang, Jiaju Ma, Jiajun Wu, Daniel Ritchie, Maneesh Agrawala

Abstract: Motion graphics videos are widely used in Web design, digital advertising, animated logos, and film title sequences to capture a viewer’s attention. But editing such video is challenging because the video provides a low-level sequence of pixels and frames rather than higher-level structure, such as the objects in the video with their corresponding motions and occlusions. We present a motion vectorization pipeline for converting motion graphics video into an SVG motion program that provides such structure. The resulting SVG program can be rendered using any SVG renderer (e.g., most Web browsers) and edited using any SVG editor. We also introduce a program transformation API that facilitates editing of an SVG motion program to create variations that adjust the timing, motions, and/or appearances of objects. We show how the API can be used to create a variety of effects, including retiming object motion to match a music beat, adding motion textures to objects, and collision-preserving appearance changes.
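
A purely hypothetical usage sketch of such a transformation API is given below; every function and method name here is invented for illustration and is not the released interface.

    # Hypothetical sketch of editing an SVG motion program; all names
    # below are invented for illustration, not the actual API.
    def edit_logo(program, beat_times):
        """Retime one object's motion to music beats and add a texture."""
        star = program.objects["star"]             # hypothetical accessor
        star.retime(keytimes=beat_times)           # snap motion to beats
        star.add_motion_texture("wobble", amplitude=3.0)   # hypothetical
        return program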

Bio: Sharon Zhang is a third-year PhD student advised by Maneesh Agrawala. Her research is in developing visual representations for content creation, particularly for video editing.


Title: DRGN-AI: Ab initio reconstruction of heterogeneous structural ensembles

Authors: Axel Levy, Frederic Poitevin, Gordon Wetzstein, Ellen Zhong

Abstract: Proteins and other biomolecules form dynamic macromolecular machines that are tightly orchestrated to move, bind, and perform chemistry. Experimental techniques such as cryo-electron microscopy (cryo-EM) and cryo-electron tomography (cryo-ET) can access the structure, motion, and interactions of macromolecular complexes. We introduce DRGN-AI, a unified framework for ab initio heterogeneous reconstruction of single particle cryo-EM and cryo-ET subtomogram data. By fusing the flexibility of implicit neural representations with a robust and scalable strategy for pose estimation, DRGN-AI circumvents the need for structural priors or input poses, enabling the discovery of new biological states and previously unresolved molecular motion. For the first time, we demonstrate ab initio heterogeneous subtomogram reconstruction of a cryo-ET dataset. Our method is released as part of the open-source cryoDRGN software.

Bio: Axel Levy is a fourth year PhD student in Electrical Engineering at Stanford University. He is advised by Prof. Mike Dunne (director of LCLS at SLAC National Lab) and Prof. Gordon Wetzstein (head of SCI). His research focuses on solving 3D reconstruction problems in unknown-view setups. Most of his work addresses the problem of 3D molecular reconstruction from cryo-electron microscopy images. Prior to his PhD, Axel graduated from the Ecole Polytechnique (France).


Title: Inferring Hybrid Neural Fluid Fields from Videos

Authors: Koven Yu, Yang Zheng, Yuan Gao, Yitong Deng, Bo Zhu, Jiajun Wu 

Abstract: We study recovering fluid density and velocity from sparse multiview videos. Existing neural dynamic reconstruction methods predominantly rely on optical flows; therefore, they cannot accurately estimate the density and uncover the underlying velocity due to the inherent visual ambiguities of fluid velocity, as fluids are often shapeless and lack stable visual features. The challenge is further pronounced by the turbulent nature of fluid flows, which calls for properly designed fluid velocity representations. To address these challenges, we propose hybrid neural fluid fields (HyFluid), a neural approach to jointly infer fluid density and velocity fields. Specifically, to deal with visual ambiguities of fluid velocity, we introduce a set of physics-based losses that enforce inferring a physically plausible velocity field, which is divergence-free and drives the transport of density. To deal with the turbulent nature of fluid velocity, we design a hybrid neural velocity representation that includes a base neural velocity field that captures most irrotational energy and a vortex particle-based velocity that models residual turbulent velocity. We show that our method enables recovering vortical flow details. Our approach opens up possibilities for various learning and reconstruction applications centered around 3D incompressible flow, including fluid re-simulation and editing, future prediction, and neural dynamic scene composition.
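
The two physics-based constraints described above can be written schematically (not with the paper’s exact weighting or discretization) as penalties on the velocity field u and density σ:

    \mathcal{L}_{\text{div}} = \big\lVert \nabla \cdot \mathbf{u} \big\rVert^{2}, \qquad \mathcal{L}_{\text{transport}} = \Big\lVert \frac{\partial \sigma}{\partial t} + \mathbf{u} \cdot \nabla \sigma \Big\rVert^{2},

with the first enforcing incompressibility and the second requiring that the inferred velocity actually transports the observed density.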

Bio: Yang Zheng is a second-year CS Ph.D. student at Stanford University, advised by Prof. Gordon Wetzstein and Prof. Leonidas Guibas. His research focuses on computer vision and graphics. 


Title: Thermal Radiance Fields

Authors: Yvette Lin*, Xin-Yi Pan*, Sara Fridovich-Keil, Gordon Wetzstein 

Abstract: Thermal infrared imaging has a variety of applications, from agricultural monitoring to building inspection to imaging under poor visibility, such as in low light, fog, and rain. Studying large objects or navigating in complex environments requires combining multiple thermal images into a spatially coherent 3D reconstruction, or radiance field. However, reconstructing infrared scenes poses several challenges due to the comparatively lower resolution, narrower field of view, and lower number of available features present in infrared images. To overcome these challenges, we propose a unified framework for scene reconstruction from a set of uncalibrated infrared and RGB images, using a Neural Radiance Field (NeRF) to represent a scene viewed by both visible and infrared cameras, thus leveraging information across both spectra. We calibrate the RGB and infrared cameras with respect to each other as a preprocessing step using a simple calibration image. We demonstrate our method on real-world sets of RGB and infrared photographs captured from a handheld thermal camera, showing the effectiveness of our method at scene representation across the visible and infrared spectrum.

Bio: Yvette Lin is a master’s student in computer science at Stanford University. Her research interests lie in computational imaging, machine learning, and inverse problems.
Xin-Yi Pan is a master’s student in electrical engineering at Stanford University. Her interests lie in the interaction of physics and computer science, applied to optics and imaging.
Sara Fridovich-Keil is a postdoc at Stanford working with Professors Gordon Wetzstein and Mert Pilanci. She completed her PhD at UC Berkeley, where she was advised by Professor Ben Recht.