My ICRA 2026 digest

ICRA 2026 happened in Vienna, June 1–5. As with my ICRA 2023 digest, this is a brain-dump of my notes, photos, and links collected throughout the week — lightly edited, so expect raw conference-note energy. If you want the bird’s-eye view of the proceedings instead, I also made an interactive topic map of all 3,028 papers.

Lots of text: Table of contents

Day 1 · Workshops: S2S — From Sea to Space
Day 2 · Keynotes: Milford, Bera, Wang
Day 3 · Posters, Carlone, Barfoot, and how to talk to humans
Day 4 · Robot learning, planning and foundation models
Day 5 · Workshops: Robots Meet Prior Maps
Posters
Our poster
Other links
Ideas I left with

Day 1. Workshops: S2S — From Sea to Space

I spent day 1 at the S2S: From Sea to Space workshop — perception for the domains where GPS doesn’t reach and everything is trying to corrode, freeze, or irradiate your robot. A great lineup of invited talks plus a poster session.

Tobias Fischer (QUT) — from sea to space, literally

A tour through the QUT Centre for Robotics work spanning both ends of the workshop title:

Melanie Wille et al. (QUT) — also in the poster session

They want to identify the effect that challenging imaging conditions have on object detection. They define three axes to characterize an underwater image, and quantify properties within each:
  • Axis 1 — image appearance: visibility, illumination, color.
  • Axis 2 — scene composition: scale, layout, background.
  • Axis 3 — acquisition geometry: orientation, perspective.
Some findings: more visibility → more objects detected. More blue coloring → more detections. More objects in the image → more detections, probably because they're better distinguished from the background. Code and project page on GitHub.

Gorry et al. (QUT)

Input RGB images from different sessions (think the same reef in 2016, 2017, 2018), SIFT features and COLMAP for SfM. SIFT feature matching for same-session images, LightGlue matching for the cross-session VPR pairs.

Vignesh Ramanathan, Michael Milford, Tobias Fischer (QUT) — presented at this conference

VPR from less than a millisecond of event data, encoding active pixel locations as binary frames and matching with bitwise operations — an 11× Recall@1 improvement over baselines.
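To make the "binary frames plus bitwise operations" idea concrete for myself, here is a minimal sketch. It assumes Hamming distance (XOR followed by a bit count) as the matching score, and the frame size, sparsity and noise level are made up for illustration; the paper's actual encoding and similarity measure may differ.

```python
import numpy as np

def pack_frame(active):
    """Pack a boolean (H, W) active-pixel mask into a flat uint8 bit array."""
    return np.packbits(active.ravel())

def hamming(query_bits, ref_bits):
    """Hamming distance between one packed query and each packed reference row."""
    xor = np.bitwise_xor(ref_bits, query_bits)       # (N, B) via broadcasting
    return np.unpackbits(xor, axis=1).sum(axis=1)    # bit count per reference

rng = np.random.default_rng(0)
H, W, n_places = 32, 32, 50                          # hypothetical sizes
refs = rng.random((n_places, H, W)) < 0.05           # sparse binary reference frames
ref_bits = np.stack([pack_frame(r) for r in refs])

# Query: place 17 with a few pixels flipped, mimicking sensor noise
query = refs[17] ^ (rng.random((H, W)) < 0.01)
best_match = int(np.argmin(hamming(pack_frame(query), ref_bits)))
```

Each 32×32 frame shrinks to 128 bytes, and the comparison is a single XOR pass over the database, which is why this kind of matching can be so cheap.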

They also organize a marine robotics seminar series — once a month, international speakers, wide range of topics: sgraine.github.io/marine-robotics-seminars. Recordings available on YouTube too.

Jungseok Hong (MIT) — underwater 3D reconstruction by interleaving multimodal SLAM and incremental Gaussian splatting

Daniel Yang, Jungseok Hong, John J. Leonard, Yogesh Girdhar (MIT / WHOI)

They run multimodal SLAM — visual SLAM plus odometry from IMU and DVL — and interleave it with incremental Gaussian splatting for the 3D reconstruction. Notably, they do this without bundle adjustment. 3DGS relies on accurate camera poses typically obtained from computationally intensive SfM, which makes it unsuitable for field robotics — so they replace SfM with pose-graph-optimization SLAM over the acoustic, inertial, pressure and visual sensors of the AUV.

Josh Mangelson (BYU FROSTLab) — towards robust multi-agent underwater localization and coastal semantic mapping

Kalliyan Velasco et al. (BYU FROSTLab)

A fleet of AUVs where each vehicle localizes locally, and they communicate to localize relative to each other — under the bandwidth constraints of acoustic comms. Locally, each AUV uses an inverted-USBL setup: every agent carries its own USBL hydrophone array (as opposed to a single centralized agent). A transmitting agent broadcasts its depth from an onboard pressure sensor; the acoustic signal is passively received by all neighbors and used to determine the depth difference, azimuth and elevation, from which each receiver triangulates the transmitter's position. The whole thing is formulated as two inter-related maximum likelihood estimation problems (local vehicle odometry + full-fleet cooperative localization), both running on-board each vehicle.
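The geometry of a single reception can be sketched in a few lines. This is only the noise-free trigonometry under my own assumptions (a level receiver frame with z pointing down, elevation measured from the horizontal plane); the function name is hypothetical, and the actual system wraps these measurements in maximum likelihood estimation rather than inverting them directly.

```python
import numpy as np

def transmitter_offset(azimuth, elevation, depth_diff):
    """Offset (x, y, z) of the transmitter in the receiver's level frame.

    azimuth: bearing in the horizontal plane [rad]
    elevation: angle from the horizontal plane, positive toward +z (down) [rad]
    depth_diff: transmitter depth minus receiver depth [m]
    """
    if abs(np.sin(elevation)) < 1e-6:
        raise ValueError("elevation too close to horizontal: range is unobservable")
    slant_range = depth_diff / np.sin(elevation)     # depth difference fixes the range
    horizontal = slant_range * np.cos(elevation)
    return np.array([horizontal * np.cos(azimuth),
                     horizontal * np.sin(azimuth),
                     depth_diff])

# Round trip on a made-up geometry
true_offset = np.array([30.0, 40.0, 10.0])
az = np.arctan2(true_offset[1], true_offset[0])
el = np.arctan2(true_offset[2], np.hypot(true_offset[0], true_offset[1]))
estimate = transmitter_offset(az, el, true_offset[2])
```

The guard on near-horizontal elevations shows the weak spot of the geometry: when two vehicles sit at almost the same depth, the depth difference carries no range information.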

FROSTLab also has a few papers in the main proceedings this year:

  • Weighted group-k consistent set maximization for outlier rejection of azimuth-elevation measurements,
  • Terra: hierarchical terrain-aware 3D scene graph for task-agnostic outdoor mapping, and
  • DreamSea: photorealistic 3D underwater terrain generation by latent fractal diffusion models.

Teresa Vidal-Calleja (UTS) — spatial perception in marine, orbital, and planetary domains

They use continuous spatial maps represented with Gaussian processes + linear operators. The motivation list is compelling: linear operators can be applied to them directly, they can be physics-driven through the kernel, they are computationally costly but can be made sparse for efficiency, their hyperparameters have intuitive meaning, and they run on CPU or GPU. The work builds on two threads:

Le Gentil et al., IROS 2021

Infer a continuous elevation map from submap point clouds with a GP, and compute the gradient analytically from the same GP — gradient images turn out to be much more robust for matching than raw elevation. They use them for loop-closure detection in unstructured planetary environments, where visual place recognition struggles, which grew into GPGM-SLAM.
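A minimal sketch of "elevation and gradient from the same GP", assuming a squared-exponential kernel and a toy terrain of my own choosing (the paper's kernel, hyperparameters and submap handling are not reproduced here). The point is that the gradient of the posterior mean is available in closed form by differentiating the kernel.

```python
import numpy as np

def rbf(A, B, ell):
    """Squared-exponential kernel between point sets A (n, 2) and B (m, 2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def fit_gp(X, z, ell=1.0, noise=1e-4):
    """Return the weights alpha = (K + noise * I)^-1 z."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    return np.linalg.solve(K, z)

def elevation_and_gradient(Xq, X, alpha, ell=1.0):
    """Posterior mean elevation and its analytic gradient at query points Xq."""
    Kq = rbf(Xq, X, ell)                             # (q, n)
    mean = Kq @ alpha
    diff = Xq[:, None, :] - X[None, :, :]            # (q, n, 2)
    # d/dx k(x, xi) = -(x - xi) / ell^2 * k(x, xi)
    grad = np.einsum("qn,qnd,n->qd", Kq, -diff / ell**2, alpha)
    return mean, grad

# Toy terrain sampled on a 9 x 9 grid (stand-in for a submap point cloud)
g = np.linspace(0.0, 4.0, 9)
X = np.array([[x, y] for x in g for y in g])
z = np.sin(X[:, 0]) * np.cos(X[:, 1])
alpha = fit_gp(X, z)
Xq = np.array([[2.0, 2.0]])
mean, grad = elevation_and_gradient(Xq, X, alpha)
```

Evaluating `grad` on a regular grid of queries would give the two-channel "gradient image" that is then used for matching.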

Wu et al., RA-L 2021

The trick is regressing the log of the field and reverting the kernel — applying a reverting function related to the kernel inverse — so the GP's latent scalar field becomes an accurate distance field with a principled uncertainty proxy. One representation that serves surface reconstruction, collision checking, and planning at once.
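Here is how I picture the log trick in one dimension. Everything concrete below is an assumption on my part: the exponential kernel, the value of `LAMBDA`, and the convention that the latent field equals 1 on the surface and is reverted with `-log(.) / LAMBDA`. It is a toy illustration of the principle, not the paper's formulation.

```python
import numpy as np

LAMBDA = 10.0   # assumed sharpness parameter; larger values track the true distance more closely

def kernel(a, b):
    """Exponential (Matern 1/2) kernel in 1D: k = exp(-LAMBDA * |a - b|)."""
    return np.exp(-LAMBDA * np.abs(a[:, None] - b[None, :]))

def distance_field(queries, surface_pts, jitter=1e-9):
    """Regress a latent field equal to 1 on the surface, then revert it to a distance."""
    K = kernel(surface_pts, surface_pts) + jitter * np.eye(len(surface_pts))
    alpha = np.linalg.solve(K, np.ones(len(surface_pts)))
    latent = kernel(queries, surface_pts) @ alpha    # decays away from the surface
    return -np.log(latent) / LAMBDA

surface = np.array([0.0, 1.0])                       # two "walls" of a 1D corridor
queries = np.array([0.0, 0.25, 0.5, 0.9, 1.5])
dist = distance_field(queries, surface)
```

In this toy the recovered distance is near-exact close to a wall and slightly underestimated at the midpoint between the two walls, because the log of a sum of exponentials behaves like a soft minimum.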

Wu et al., ICRA 2026

The latest instalment of that line, building exactly on the GPIS representation — surface normals serve as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty, from which the scene flow is computed.

Hanumant Singh (Northeastern) — NeuSLAM: dense visual SLAM on edge devices

Aniket Gupta (spotlight speaker), with Hanumant Singh and Huaizu Jiang among the co-authors (Northeastern)

A hybrid architecture for dense visual SLAM with stereo and RGB-D sensors on resource-constrained edge devices: recent learning-based dense SLAM achieves strong accuracy, but the learned backends keep dense correlation volumes and feature maps on the GPU throughout optimization — several gigabytes of memory, restricting them to desktop-grade hardware. NeuSLAM overcomes that bottleneck with a lightweight network extending NeuFlow-V2 that jointly predicts dense optical flow and stereo disparity, plus per-pixel confidence maps, from a shared feature encoder (for RGB-D, the disparity branch is simply bypassed in favor of sensor depth). The front-end keeps only the high-confidence sparse correspondences for pose estimation, and a classical back-end maintains a lightweight pose graph over keyframes — global consistency with minimal GPU memory.

Annette Stahl (NTNU) — resilient perception for field robotics in harsh maritime environments

Among other things:

NTNU

They use Blender to generate artificial lighting and insert marine snow — including a photogrammetric model of a real underwater archaeology site (Malta amphorae) relit in simulation. There is also a version with superimposed marine snow on Zenodo.

Fishnet anomaly detection

Underwater · Detection

Schellewald & Stahl (NTNU), IFAC PapersOnLine

They detect holes in fishnets using the Fourier transform — a regular net is a beautifully periodic signal, so holes stand out in frequency space.
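A toy version of the frequency-space idea, on a synthetic net. The patch-wise scoring, the known mesh period and the image sizes are my own simplifications, not the method from the paper: a patch whose energy at the net's fundamental frequency drops below its neighbours is flagged as a suspect.

```python
import numpy as np

def net_image(size=128, period=8):
    """Synthetic net: bright one-pixel threads every `period` pixels on both axes."""
    img = np.zeros((size, size))
    img[::period, :] = 1.0
    img[:, ::period] = 1.0
    return img

def periodic_energy(patch, period):
    """Spectral magnitude at the net's fundamental frequency along both axes."""
    spectrum = np.abs(np.fft.fft2(patch - patch.mean()))
    k = patch.shape[0] // period                     # FFT bin of the fundamental
    return spectrum[0, k] + spectrum[k, 0]

def hole_map(img, patch=32, period=8):
    """Per-patch periodic energy, normalised by the median over all patches."""
    n = img.shape[0] // patch
    energy = np.array([[periodic_energy(img[i * patch:(i + 1) * patch,
                                            j * patch:(j + 1) * patch], period)
                        for j in range(n)] for i in range(n)])
    return energy / np.median(energy)

img = net_image()
img[40:56, 72:88] = 0.0                              # tear a hole: threads missing
scores = hole_map(img)
suspect = tuple(int(v) for v in np.unravel_index(np.argmin(scores), scores.shape))
```

Intact patches score 1.0 and the torn one scores 0.75, so a simple threshold on `scores` is enough in this idealised setting. Real footage would need the mesh period estimated from the spectrum itself.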

NTNU

Cameras, radar and LiDAR mounted on a boat. They have a dataset from the Trondheim canal available, with short-baseline stereo, wide-baseline stereo, LiDAR, INS, GNSS, and a polarized camera. The dataset lives at the NIRD Research Data Archive — they recommend starting from the examples in the code repository before exploring the data files (there's also a single-sequence sample to try before downloading the whole thing).

Industry spotlight: EONSEA

EONSEA

Industry

Presented by Camila Rodrigues (CTO & co-founder)

ROVs for automated underwater inspection of vessels and offshore infrastructure, with AI for corrosion and biofouling detection. They work across ports, naval, energy and aquaculture, and run extensive research collaborations.

Hiro Ono (JPL, remote) — toward interplanetary foundation models

Can AI drive a Mars rover? — beamed in from JPL to close the day on the space end of the spectrum.

Day 2. Keynotes: Milford, Bera, Wang

Michael Milford — biology as engineering blueprint

A very inspiring one. His research philosophy is to draw on three layers of inspiration for robotics — biological, behavioural, and neural — closing the loop from real-world embodied testing back to scientific curiosity. But he was also honest about why bio-inspired robotics is hard, and why it’s still a niche keyword compared to the hot topics of the day. The core problem: robot sensors aren’t the sensors living beings have — you can’t simply copy biology when both your hardware and our scientific understanding of the original fall short.

Twenty years of this thread, from RatSLAM to visual place recognition and neuromorphic sensors, all the way to a GPS-free positioning system:

Michael Milford et al. (QUT)

In 2004 he published the initial RatSLAM system, a robot navigation system modelled on the rat hippocampus, at a time when biologically-inspired robotics was not in fashion — the paper was rejected before eventually winning a best paper award. It later matured into the final RatSLAM system, shown here — pose cells, local-view cells, and an experience map (the what, where, and where-what), with multiple packets rather than packet spread as the main mechanism for encoding uncertainty. It spawned a whole zoo of "X-SLAM" follow-ups.

Michael Milford (QUT)

He made the case for the ideal robotic system as a blend of three ingredients cycling into each other: conventional/learning-based techniques, biologically-inspired components, and modern AI / foundation-model / deep-learning methods. The trade-offs to balance across them: sensor differences (both limited and opportunistic), embodiment differences, different risk appetites, performance ceilings, provability, human factors (usability, collaborative potential), and the practical stuff like cost, uptime, generality, and product development cycle. A nice example of the cross-over: SeqSLAM came out of spiking-neural-network modelling.

The Local Positioning System (LPS)

Positioning · GPS-free

Michael Milford et al. (QUT) — current project

His current project: a positioning system that works without GPS, inspired by how animals navigate — no satellites, no infrastructure, just biology as engineering blueprint. The aim: a ubiquitous, low-cost, and societally acceptable positioning service without reliance on (or vulnerability to) satellites or communications.

Aniket Bera — safe navigation in unstructured, human-centered environments

“Learned models are very useful, but they should generate checkable outputs.”

Three take-home messages:

  1. Safety is not a module. It is the whole stack. Perception, prediction, planning, and control have to be coupled.
  2. Learning gives robots intuition; structure keeps them honest. Representations are powerful, but maps, logic, constraints, barriers, and solvers make them reliable.
  3. The real world does not give clean problem settings. Robots must operate in uncertain, cluttered, changing, and partially observable environments.

Two works from his lab that put this into practice:

Aniket Bera's lab

Tracking: (1) detect correspondence matches; (2) project matches to world coordinates using depth; (3) coarse-to-fine alignment — (a) initial pose from rigid transformation and point-cloud registration of the correspondences, (b) refinement via gradient-based optimization on rendering losses.

Matches the current frame against the active frames and does SE(3) alignment of the Gaussian splatting — real-time performance in unstructured environments like a forest.
  • Core idea: make pose a closed-form geometric registration problem, while keeping 3DGS optimization for map refinement.
  • Prior bottleneck: rendering-based 3DGS-SLAM optimizes pose through a photometric loss. The gradients are expensive and can be ill-conditioned when overlap is sparse or depth is noisy.
  • Their estimator: use learned correspondences from 2D-3D matches, back-project into point sets, solve SE(3) with Kabsch/SVD, then refine Gaussians and keyframes.
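The closed-form step in the last bullet is the classic Kabsch/SVD alignment. A minimal sketch follows, assuming the matches have already been back-projected into two (N, 3) point sets; outlier rejection and the subsequent Gaussian refinement are left out, and the function name is mine.

```python
import numpy as np

def kabsch_se3(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so that det(R) = +1
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic check with a random rotation and translation
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q * np.sign(np.linalg.det(Q))               # flip sign if needed: proper rotation
t_true = np.array([0.5, -1.0, 2.0])
src = rng.normal(size=(20, 3))
dst = src @ R_true.T + t_true
R_est, t_est = kabsch_se3(src, dst)
```

No gradients and no rendering are involved, which is what makes this attractive as the pose initialisation before any photometric refinement.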

He framed FlashSLAM as the fast local map, then walked through the other world-modeling problems his lab solved around it. Once the local map is fast, the remaining question is how to make it actionable: grounded, globally anchored, and logically usable.

Object grounding — Go-SLAM

Grounding · 3DGS

grounded object segmentation/localization with 3DGS SLAM

Go-SLAM stores open-vocabulary semantic features at the 3D primitive/object level, turning a reconstruction into a queryable map.

Global anchoring — TransLocNet

Cross-modal · Localization

cross-modal aerial–ground localization

TransLocNet registers ground BEV observations to overhead imagery with contrastive retrieval, followed by geometric SE(2) refinement.

Relational constraints — NaviWM

Scene Graph · Logic

logic-guided socially-aware world model

NaviWM lifts mapped agents/objects into a scene graph and checks social/navigation rules as first-order predicates.

The unifying world-modeling contribution: perception exports object masks, extents, global pose, relations, and uncertainty — the state that later planners can check.

Aniket Bera's lab — ICRA 2025 Best Paper finalist

Specification-constrained task planning with an LLM — a way to make it trustable and safe, instead of end-to-end plan generation. Unsafe actions are pruned during generation.
  • The failure mode: a language model can emit a plan that is syntactically fluent but violates ordering, reach-avoid, safety, or resource constraints.
  • The SELP move: translate the task requirement into a temporal-logic specification, track feasible prefixes during generation, and mask tokens that would make the plan unsatisfiable.
  • What changes conceptually: the LLM becomes a proposal mechanism inside constrained search, not an unchecked robot planner.
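The masking idea from the bullets above can be mimicked with a toy decoder. Everything here is hypothetical: the action names, the hand-written rules standing in for a temporal-logic monitor, and the fixed-preference function standing in for the LLM. It also only rejects prefixes that already violate a rule, which is weaker than tracking whether a prefix can still be completed.

```python
def prefix_ok(plan):
    """Toy stand-in for a temporal-logic monitor over plan prefixes.

    Hypothetical rules: never enter the kitchen; open the door before entering
    the room; place the cup only after picking it and entering the room;
    finish only once the cup has been placed.
    """
    seen = set()
    for action in plan:
        if action == "enter_kitchen":
            return False
        if action == "enter_room" and "open_door" not in seen:
            return False
        if action == "place_cup" and not {"pick_cup", "enter_room"} <= seen:
            return False
        if action == "done" and "place_cup" not in seen:
            return False
        seen.add(action)
    return True

def fake_llm_scores(plan):
    """Hypothetical proposal model: fixed preferences that ignore safety and ordering."""
    prefs = {"enter_kitchen": 5.0, "place_cup": 4.0, "done": 3.5,
             "enter_room": 3.0, "pick_cup": 2.0, "open_door": 1.0}
    return {a: s for a, s in prefs.items() if a not in plan}

def constrained_decode(max_len=8):
    """Greedy decoding where rule-violating actions are masked before the argmax."""
    plan = []
    while len(plan) < max_len:
        scores = fake_llm_scores(plan)
        allowed = {a: s for a, s in scores.items() if prefix_ok(plan + [a])}
        if not allowed:
            break
        plan.append(max(allowed, key=allowed.get))
        if plan[-1] == "done":
            break
    return plan

plan = constrained_decode()
```

Left unchecked, the proposal model would start with `enter_kitchen`. With the mask in place its top preference is simply never available, so the decoder falls back to the best admissible action at each step.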

He also discussed uncertainty-aware world modelling as a key component of his autonomy stack — one of the rising keywords at this conference, with AGIBOT dedicating an entire competition track to it here at ICRA 2026.

Hesheng Wang — learning to navigate: from scene understanding to decision making

In case you haven’t noticed yet, SLAM is a core technique in robot navigation :) A dense showcase of work from his lab at Shanghai Jiao Tong University:

End-to-end visual-LiDAR odometry bridging the structural inconsistency between dense image pixels and sparse point clouds.
4D dynamic SLAM using 2D optical flow, 3D scene flow, and diffusion for scene flow refinement. Tracks non-rigid objects while estimating camera pose, with Gaussians as the map representation.

T-PAMI 2026

Integrates geometry, appearance, and semantic features via cross-attention, with semantic constraints directly coupled into pose optimization.
A temporal deformation field and global deformable bundle adjustment. Canonical Gaussian representation augmented with deformable probability. Medical application.

code at dtc111111/vpgs-slam

Compact 3DGS SLAM with voxelized Gaussians and sliding-window BA, 2D-3D cross-modal localization, and NeRF-based exploration for learning-based planning.

Awards talks: SA-VLM v2 — VLMs that help people

I’ve seen quite a few works on VLMs this week, but shout-out to the very cool use case at the awards talks. We usually see visual SLAM as the base of autonomy for different robotic applications, but what about using it to help people?

SA-VLM v2: useful, comprehensive, and concise guidance for guide-dog robots assisting the visually impaired

VLM · Accessibility

Woo-han Yun et al.

SA-VLM v2 presented at the ICRA 2026 awards talks.

Generates structured walking guidance for visually impaired users. Not "there is an obstacle", but "Walking is difficult. A construction fence is ahead. Detour toward the 9 o'clock direction." Walking guidance should be useful, comprehensive, and concise so that instructions are both actionable and easy to follow — co-designed with professional guide dog trainers, and using a dataset designed to that end: SideGuide, a large-scale sidewalk dataset for guiding impaired people.

Day 3. Posters, Carlone, Barfoot, and how to talk to humans

Morning posters and Robot Perception I

A few things that caught my eye while wandering the poster hall:

GFreeDet2: Exploiting Gaussian Splatting and Foundation Models for RGB-based Model-free 2D and 6D Detection of Unseen Objects

Detection · 3DGS

Gu Wang et al.

Reconstructs 3D Gaussian object models from multi-view RGB references, enabling model-free detection without CAD models.

From the Robot Perception I session:

Sparse Variable Projection in Robot Perception: Exploiting Separable Structure for Efficient Nonlinear Optimization

Optimization

Alan Papalia, Nikolas Sanderson, Haoyu Han, Heng Yang, Hanumant Singh, Michael Everett — code at UMich-RobotExploration/variable-projection

While sparsity is well-exploited to scale nonlinear least-squares solvers, a complementary and underexploited structure is separability: some variables (e.g. visual landmarks) appear linearly in the residuals and, for any estimate of the remaining variables (e.g. poses), have a closed-form solution — variable projection (VarPro) eliminates them analytically. 2×–41× faster across SLAM, SfM, and SNL, with equal or better solution quality, and it drops in as preprocessing for existing solvers (GTSAM, Ceres).
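To see what "eliminating the linear variables analytically" means, here is a dense toy problem of my own (an exponential-decay fit, nothing to do with the paper's sparse solvers or its code). The model is y = c1 * exp(-theta * t) + c2: for any theta, the pair (c1, c2) has a closed-form least-squares solution, so the search only runs over theta.

```python
import numpy as np

def basis(theta, t):
    """Design matrix A(theta): columns are the functions multiplying the linear variables."""
    return np.column_stack([np.exp(-theta * t), np.ones_like(t)])

def solve_linear(theta, t, y):
    """Closed-form optimum of the linear variables for a fixed theta."""
    coeffs, *_ = np.linalg.lstsq(basis(theta, t), y, rcond=None)
    return coeffs

def reduced_cost(theta, t, y):
    """Squared residual after projecting the linear variables out."""
    r = basis(theta, t) @ solve_linear(theta, t, y) - y
    return float(r @ r)

def golden_section(f, lo, hi, iters=60):
    """Minimal 1D minimiser, assuming f is unimodal on [lo, hi]."""
    g = (np.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

t = np.linspace(0.0, 5.0, 50)
y = 2.0 * np.exp(-1.3 * t) + 0.5                     # noise-free synthetic data
theta_hat = golden_section(lambda th: reduced_cost(th, t, y), 0.2, 4.0)
c_hat = solve_linear(theta_hat, t, y)
```

A three-variable problem becomes a one-variable search. In SLAM terms, theta plays the role of the poses and (c1, c2) the role of the landmarks that enter the residuals linearly.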

RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Visual Contextual Adaptation

Navigation · Zero-Shot

K.1. Luca Carlone — from SLAM to spatial memories

As he said, “Classic SLAM is a huge success story”. SLAM algorithms are long established at this point in history. However, classic SLAM systems are complex and need fine-tuning. From that perspective, a key advancement that learning-based algorithms have allowed is much simpler, even uncalibrated setups. Luca’s talk was a walkthrough of their advances on SLAM, from adding learnt modules, to adding further layers of abstraction on top of it, to the point of turning it into an actionable element that allows the robot to navigate and have a “memory”. The talk was divided into three sections:

  1. Maps: from classical SLAM to geometric foundation models.
  2. Memories: from mapping to semantic episodic memories.
  3. Tasks: towards task-driven memory representations.

1. Maps — from classical SLAM to geometric foundation models

On VGGT-based SLAM:

VGGT-SLAM & VGGT-SLAM 2.0

SLAM · Foundation

MIT-SPARK — 2.0 at RSS 2026 — code at MIT-SPARK/VGGT-SLAM

VGGT can only process ~69 frames before running out of memory, and it's slow. So: split the trajectory into submaps, process each with VGGT, then align the maps with pose graph optimization. This leads to artifacts — coming from projective ambiguity: you cannot distinguish calibration errors from reconstruction errors (deformations), and different maps get different deformations → alignment artifacts. They fix the deformation by doing homography graph optimization. VGGT-SLAM 2.0 takes it further in real time. They also explored feeding VGGT tokens into VLAs to encode geometric cues (geometric foundation models beyond SLAM).
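If two submaps differ by a projective deformation, the transform relating them is a 4×4 homography (15 degrees of freedom once the scale is fixed) rather than a rigid or similarity transform. Here is a sketch of estimating one such homography from matched 3D points with a plain DLT. This is my own pairwise illustration under noise-free assumptions; the system described above optimises the homographies jointly in a graph.

```python
import numpy as np

def estimate_homography_3d(src, dst):
    """DLT estimate of a 4x4 projective transform H with dst ~ H @ src (homogeneous).

    src, dst: (N, 3) matched points, N >= 5, in general position.
    """
    n = len(src)
    X = np.hstack([src, np.ones((n, 1))])            # (N, 4) homogeneous coordinates
    rows = []
    for Xi, yi in zip(X, dst):
        for j in range(3):
            # (h_j . X) - y_j * (h_4 . X) = 0, where h_j is row j of H
            row = np.zeros(16)
            row[4 * j: 4 * j + 4] = Xi
            row[12:16] = -yi[j] * Xi
            rows.append(row)
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    H = Vt[-1].reshape(4, 4)                         # null vector of the stacked system
    return H / H[3, 3]                               # fix the free scale

def apply_homography(H, pts):
    """Apply H to (N, 3) points and de-homogenise."""
    Xh = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return Xh[:, :3] / Xh[:, 3:4]

rng = np.random.default_rng(1)
H_true = np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # mild projective deformation
H_true /= H_true[3, 3]
src = rng.random((20, 3))
dst = apply_homography(H_true, src)
H_est = estimate_homography_3d(src, dst)
```

Each correspondence contributes three equations, so five points in general position are the minimum for the 15 unknowns.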

Yang, Lin, Martín-Martín, Labrie, Gayska, Kuo, Carlone — with Amazon

The step beyond SLAM, towards actions: injecting VGGT tokens into Vision-Language-Action models (VLAs) improves geometric understanding and manipulation performance. On GR00T, the early-fusion VGGT variant (GR00T-VGGT) beats the GR00T-N1.5 baseline.

2. Memories — from mapping to semantic episodic memories

“Memory is the ability to encode, store, and retrieve general information about the environment, learned facts, and past experiences.”

Beyond mapping — from maps to memories:

code at MIT-SPARK/Hydra

3D scene graphs as a hierarchical organization of spatial information — multiple layers of abstraction, hierarchical models. Hydra was the first system to run this in real time.

MIT-SPARK — CVPR 2026 — code at MIT-SPARK/DAAAM

Augments 3D scene graphs with rich descriptions — captions for objects and images. The first real-time system to build scene graphs with detailed captions. Frames the captioning as an optimization problem to select the best frames to caption objects via DescribeAnything, with agentic retrieval on top.

Nicolas Gorlo, Derek K. Wise, Alberto Speranzon, Luca Carlone (MIT / Lockheed Martin)

Inspired by neuroscience, they mix semantic and episodic memory in a framework using Bayesian surprise. The choice of embedding is very consequential — they use V-JEPA 2. LLM agents can then use this episodic memory.

3. Tasks — towards task-driven memory representations

Clio & Found-it — task-driven memory

Task-driven

code at MIT-SPARK/Clio

There's a need for task-driven memory representations. Clio was the first approach for task-driven 3D scene graphs — it extracts objects with segmentation and clusters them in an informative way, so being aware of the task, the robot is better at choosing relevant information. Found-it is an upgrade of Clio using foundation models, built on VGGT-SLAM: it works on standard monocular videos and can receive tasks at runtime.

And to close, a quote: “The next generation of robots will require task-driven memory systems powered by foundation models.”

Amazing that so many of these resources are open — the mentioned works live in the MIT-SPARK GitHub org, and there’s also the SLAM Handbook!!

K.2. Tim Barfoot — why field robotics still matters

The talk was a showcase of the field experiments they carry out — like localization on an ice surface, which worked quite well, testing robots that one day might go through the Arctic. You can check out the resources at their lab website, such as a map of the places they’ve been and the datasets they’ve collected.

Science Communication Crash Course

A panel on how to communicate your science in a more engaging way, with Sabine Hauert (University of Bristol & Robohub), Ella Scallan (Robohub), Evan Ackerman (IEEE Spectrum) and Kohava Mendelsohn (IEEE Spectrum).

The strongest emphasis was on making it human. Sure, we’re talking about robots — but the developers behind them are people too. The story isn’t just the machine; it’s the people who built it, why they cared, and what they struggled with. That’s what an audience actually connects to.

A few takeaways:

  • Lead with the human, not the hardware. People remember people. Frame the work around the person and the motivation, then bring in the robot.
  • Drop the jargon. If your grandparent can’t follow it, neither can most of your audience.
  • IEEE Spectrum is approachable. It’s a more laid-back magazine for science and robotics, and they’re always open to collaborating with scientists to write an article — so if you have a story worth telling, reach out.

Tutorial: Building, Running and Deploying Modern Software Tools for Robotics

Many years ago, as a student at a university with not so many resources, the nonsense of the equations and numbers we were seeing in class turned into something tangible thanks to the Robotics Toolbox, and understandable thanks to the book behind it (I’m a practical learner, ok?). I can’t put into words what it felt like to be at this keynote tutorial from Peter Corke and Tobias Fischer, and it being a practical class just like the ones I had at my Bachelor’s.

Luca Carlone said at his keynote that this is one of the most exciting times to be a robotics researcher. I fully agree, and one of the reasons is exactly this: the number of open libraries and resources that make robotics approachable. Seeing the history of how all these tools evolved, and the new tools that exist today, is a great reminder of this privilege. Being here is such a privilege!

Anyway, whether you were here or not, all material is on GitHub.

Arts & Robotics

The photo above is disarming II by Emanuel Gollob (University of Arts Linz / Creative Robotics): a freely placed industrial robot arm durationally learning locomotion on a gym mat. The piece plays with the ambiguity of disarming as both physical detachment and emotional attachment — locomotion as a primal, post-birth instinct and ultimate act of independence, attempted by a limb that was never made for it. Its reinforcement learning is deliberately slowed down and partially deprived of efficiency, leaving space to watch your own projections at work: parallel to the familiar dystopian plot of technological autonomy, witnessing these first clumsy tries may awaken compassion, or even a certain emotional bond.

Day 4. Robot learning, planning and foundation models

Keynote session 5 was a four-speaker block on robot learning, planning, and foundation models.

K.3. David Hsu — scalable robot decision making in the open world

Planning and plan prediction with foundation models.

Open-world challenges: scalability and uncertainty — complex appliances for the robot to interact with. The robot can be thought of as a function mapping inputs from perception to outputs for actions.

  • The classic era is model-based: state estimation, planning and control. The challenge is acquiring good models.
  • The deep learning era is data-driven: we acquire data for the robot to learn a policy. Successful, but the challenge is acquiring data, and coping with anything that changes on the robot. How do we generalize?
  • The foundation model era: we have to lay out a strategy over a two-dimensional space. One axis is representation, the other is reasoning. We’re going to see the benefits of structured representation.

Papers mentioned:

CoRL

The robot reads the appliance's user manual to figure out how to operate it.

RSS (to be presented)

A service robot in an unknown environment has incomplete knowledge of the objects and actions around it. PDDL is deterministic, but our hypotheses about the world are uncertain — so the robot automatically generates, verifies, and updates hypotheses about its abstract world model. Foundation models seed the initial hypotheses about states and transitions; the planner then produces action sequences that handle both hypothesis verification and task execution, expanding the model under uncertainty and folding in feedback whenever a hypothesis turns out wrong.

K.4. Stefanie Tellex — robot programming

What is the way to specify a robot’s task? “Someone has to sit next to the robot and make it do the thing.” Learning from demonstrations is limited to the speed of the demonstration; reinforcement learning needs a lot of data. They combine both: extract the behaviours from demonstration, improve with RL.

Papers mentioned:

Benned Hedegaard et al. — at this conference

Integrates pre-existing, heterogeneous robot skills (learned, force-controlled, and black-box policies) into a hierarchical planner.
Formal language is limited and constrained; you can instead extend to a higher-dimensional embedding, but the tradeoff is slower cross-platform generalisation. The survey studies these tradeoffs.

code at SaulBatman/GEM_code

Improve performance via task-specific fine-tuning. To reduce the amount of data needed, they parameterize skills with formal language.

K.5. Noémie Jaquier — traveling the robot learning manifold: a tale of geometries and inductive biases

Deep learning is now everywhere, and we are very used to just plugging in a network (a CNN, RNN, transformer, whatever) and hoping it works. Even more so with the rise of foundation models. But networks are encoding information in some dimensionality, and whether that dimensionality fits your problem or not matters a lot. What they propose: use the geometry of the robot as an inductive bias to constrain the network to the right dimensionalities — from geometry, to physics, to control theory.

Instead of unconstrained diffusion policies, they use Riemannian flow matching — the robot state lives on a manifold, so the policy should too. This work and its follow-up Fast and robust visuomotor Riemannian flow matching policy show smoother trajectories and faster inference than diffusion baselines.
Data efficiency via symmetry: using an equivariant network, they compose symmetries — sagittal symmetry, planar rotation, scaling — through vector fields, lifting them to configuration space.
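For the Riemannian flow matching part, a small sketch helped me see what "the policy lives on the manifold" means in practice. I picked the unit sphere as the manifold purely for illustration (the manifolds used in the papers are not specified in my notes): samples are interpolated along geodesics, and the regression target for the network is a tangent vector rather than a straight-line difference.

```python
import numpy as np

def log_map(p, q):
    """Tangent vector at p pointing to q along the great circle (unit sphere)."""
    cos_theta = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros_like(p)
    v = q - cos_theta * p                            # component of q orthogonal to p
    return theta * v / np.linalg.norm(v)

def exp_map(p, v):
    """Follow the geodesic from p with initial velocity v (tangent at p)."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def flow_matching_target(x0, x1, t):
    """Point on the geodesic x0 -> x1 at time t, and the velocity a network would regress."""
    xt = exp_map(x0, t * log_map(x0, x1))
    ut = log_map(xt, x1) / (1.0 - t)
    return xt, ut

x0 = np.array([1.0, 0.0, 0.0])                       # noise sample
x1 = np.array([0.0, 1.0, 0.0])                       # data sample (e.g. a demonstrated state)
xt, ut = flow_matching_target(x0, x1, t=0.5)
```

The intermediate point never leaves the sphere and the target velocity is tangent to it, so a model trained on such pairs and integrated with the exponential map stays on the manifold by construction.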

ICRA 2026

Geometries of deformable objects: infinite DOF and complex dynamics, so they take inductive bias from geometry and from physics — predicting trajectories on a high-dimensional manifold with an encoder-decoder framework with Lagrangian and Hamiltonian structure. Learns finite-dimensional Lagrangian surrogate models of infinite-dimensional continuum dynamics, avoiding the discretizations that introduce bias and reduce generalization.

An interesting question from the audience: how many dimensions should the latent space have, and how do we beat the curse of dimensionality? More complexity needs more dimensions. How many? The answer is 42! Just kidding, it’s yet another hyperparameter to figure out 😜 Good luck!

On a side note, shout-out to the visuals in this talk — very pretty and very explanatory of complex concepts. My camera did its best at capturing them.

More papers mentioned:

Same flavour as the talk: bake physical consistency in as inductive bias to fight the curse of dimensionality. From a Riemannian view they jointly learn a structure-preserving latent space and its low-dimensional dynamics, so a high-dimensional rigid or deformable system is captured by a small, interpretable reduced Lagrangian model — accurate long-term predictions with far less data.
The Hamiltonian sibling of the one above. RO-HNN keeps the energy-conservation laws of Hamiltonian mechanics but adds model-order reduction so it scales: a geometrically-constrained symplectic autoencoder learns a low-dimensional structure-preserving submanifold, and a geometric Hamiltonian net models the dynamics on it — physically-consistent, stable predictions for high-dimensional systems.
Now put a controller on top of those learned reduced models. Model-based control wants accurate dynamics, which we don't have for deformable objects or soft robots — so they derive a reduced tracking law over the learned structure-preserving latent dynamics and, from a Riemannian view of the projection, get interpretable stability and convergence conditions by quantifying the modeling error. Extended to underactuated systems with learned actuation patterns, validated in sim and on real hardware.

And there is a book 📖!

K.6. Paolo Robuffo Giordano — intrinsic robustness

A journey from control-aware planning to robust robot learning.

Awards

Expo

Random expo notes: camera + LiDAR rigs everywhere; MimosaX.

Day 5. Workshops: Robots Meet Prior Maps

The day started with an intro talk by Skydio… because I was in the wrong room :P Before I noticed and relocated, I learned what it takes to deploy autonomous robots at scale: autonomy, hardware reliability and manufacturing, support and regulatory readiness, and software & simulation — the talk focused on the latter. They run automated testing in CI, and deliberately didn’t move their Unreal Engine simulation to the cloud because they want to push the simulator boundary and keep it all integrated in CI. They simulate the gimbal, built their own ray-traced rendering, and use Google 3D tiles for map data (Paraverse). They’re hiring in Zurich.

Then, on to the actual Robots Meet Prior Maps workshop.

Maurice Fallon (Oxford) — where’s my glasses: identifying change in scene graphs over time

  • Feed-forward reconstruction models give you perturbed scales — ScaRF-SLAM fixes them by combining classical visual SLAM with the foundation-model reconstruction, doing incremental map fusion and constantly refining the scale of the submaps. See also ov_secondary. Their preferred setup: VGGT with Depth Anything.
  • Understanding to what degree the scene has changed matters, because change makes odometry struggle (fast-moving sensors too). They use a prior laser scan of the building.
  • Upcoming paper (in writing): object-level change detection via semantic correspondence association in long-term multi-session mapping. For scenes where image detection isn’t precise, Gaussian splatting is not so successful — e.g. under significant lighting change. They found dense correspondence with DINO to work better there. LT-mapper is the previous work on this.
  • The system they want to build: step 1, fuse depth from an external sensor (or estimate) to build a local map, with object segmentation to look for local dense features; step 2, semantic and geometric change detection.

Jen Jen Chung (UQ) — exploring interactions with object-level maps

Object-level maps to make grasping missions more efficient. Papers along the way: Learning affordance landscapes for interaction exploration in 3D environments (using the AI2-THOR interactive simulator), Learning affordances from interactive exploration using an object-level map, and TSDF++, a multi-object formulation for dynamic object tracking and reconstruction.

Abhinav Valada (Freiburg) — open-world autonomy: representations, mapping, interaction

How do we make autonomy reliable in the open world?

  • Amodal perception: estimate the entire shape of objects regardless of occlusions — amodal panoptic segmentation.
  • Class-incremental panoptic segmentation: don’t retrain for new labels, extend the knowledge.
  • Open-vocabulary dynamic 3D scene graphs.
  • What happens when you encounter never-before-seen objects? Predict that it’s an unknown object: PoDS, panoptic out-of-distribution segmentation.
  • Rethinking lifelong SLAM — continual SLAM: instead of taking pretrained models and adapting them offline, adapt on the fly as the robot moves. Dual architecture: a generalizer and an expert, with an uncertainty-based sampling strategy (CL-SLAM). Train on Cityscapes, move to KITTI, then RobotCar, then back to KITTI — and check it still remembers. Follow-up at RSS 2023 adds monocular depth and panoptic segmentation (depth, semantic, panoptic error).
  • ArtiPoint: articulated object estimation in the wild, with the Arti4D dataset — 45 egocentric videos of humans performing articulations.
  • MoMa-LLM: mapping and structure for language-grounded mobile manipulation.

Luca Carlone (MIT) — from maps to memories: present and future of spatial AI

Foundation-model-first SLAM and 3D scene graphs — a longer version of his Day 3 keynote, with extra detail on VGGT-SLAM:

  • Every single SLAM library has 50 to 100 parameters to tune :P VGGT works on uncalibrated videos — but it’s far from scalable, with limited processing ability.
  • Two years ago they started simple: break the long trajectory into submaps (16 keyframes each), run VGGT on each, then align the submaps. Issue: artifacts. The mismatch comes from a fundamental reason — projective ambiguity. With calibrated cameras, the ambiguity is just the scale of the reconstruction; uncalibrated, it’s a full projective ambiguity, much more complicated: you confuse calibration errors for deformation of the scene, and each submap gets a different deformation.
  • So instead of aligning submaps rigidly, they attach a homography transform to each submap: take pairs of aligning submaps (sequential + loop closures), get the relative homography, map each submap to a global homography, and optimize over SL(4) with GTSAM. Paper: VGGT-SLAM: dense RGB SLAM optimized on the SL(4) manifold.
  • Next month, a new paper: assign a homography to each keyframe instead of each submap. Open question from the audience: how robust is VGGT-SLAM to scale inconsistencies in the depth prediction over time — perhaps the SL(4) alignment absorbs that error.
  • Found-it addresses two big problems: going from RGB-D to monocular cameras, and dynamic tasks. Hydra is closed-set semantics, a rigid understanding of the scene; Clio addressed task-driven mapping but requires pre-specifying the tasks. They want to change tasks at runtime, hence Found-it.
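
To make the SL(4) bullets above concrete for myself, here is a minimal NumPy sketch (mine, not the VGGT-SLAM code). A 4×4 homography is only defined up to scale, so dividing it by the fourth root of its determinant picks the representative with determinant 1, i.e. an element of SL(4). The function names and the convention that a global homography maps submap coordinates into the global frame are my assumptions.

```python
import numpy as np

def to_sl4(H):
    """Rescale a 4x4 homography so that det(H) = 1 (a representative in SL(4))."""
    H = np.asarray(H, dtype=float)
    det = np.linalg.det(H)
    if det <= 0:
        # A real rescaling cannot flip the sign of a 4x4 determinant.
        raise ValueError("expected a homography with positive determinant")
    return H / det ** 0.25

def apply_homography(H, points):
    """Apply a 4x4 homography to an (N, 3) array of 3D points."""
    points = np.asarray(points, dtype=float)
    homog = np.hstack([points, np.ones((len(points), 1))])
    mapped = homog @ H.T
    return mapped[:, :3] / mapped[:, 3:4]

def relative_homography(H_i, H_j):
    """Homography taking submap j coordinates into submap i coordinates.

    Assumes H_i and H_j map their submap into a shared global frame.
    """
    return to_sl4(np.linalg.inv(H_i) @ H_j)
```

Because points are de-homogenized after the mapping, `H` and any positive multiple of it move points identically; the determinant normalization just removes that redundant degree of freedom before optimization.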

David Hsu (NUS) — open scene graphs for open-world navigation

A personal journey of “maps” for open-world robot navigation.

José Luis Sánchez-López (Luxembourg) — tightly integrating semantic-relational priors into SLAM

What are scene priors? Previous knowledge, which he organized along four axes:

  • Representation axis: scene level / environment type (the talk focused on indoors), geometric priors, semantic priors, relational/structural priors.
  • Entity axis: the role of the scene — structure (floors, walls, etc.).
  • Granularity axis: level of abstraction — observable/tangible vs. higher level.
  • Source axis: implicit priors, previous-experience priors, BIM/CAD/architectural priors.

They have a survey on visual SLAM (MDPI). The core idea:

  • We need to connect maps with scene graphs. Most works connect the SLAM output to a scene graph; they go for a tightly coupled approach because the two influence each other — the situational graph (S-Graph), encoding different levels of abstraction with hierarchical-semantic optimization. Code: snt-arg/visual_sgraphs.
  • To encode it in a factor graph, they’re trying (ongoing) to replace the mathematical factors with graph neural networks → GNN S-Graphs 2.
  • Previous-experience priors in a multi-robot, distributed setting: each robot builds its own S-Graph. With vision → viS-Graphs.
  • Object entities in S-Graphs: they pretrain NeRFs with prior information — e.g. train on computers in general, then fine-tune to a specific computer: PRENOM.
  • Dynamic entities in S-Graphs: incorporate prior knowledge of dynamic entities to improve pose estimation (under review).

Huan Yin (Hunan University, online) — BIM as a prior semantic map

Humans can navigate with simple semantic maps — robots should manage with a BIM model. Global localization from scratch: global registration / place recognition + a particle filter, with the COMPASS descriptor. Dataset: SLABIM / LiBIM-UST, a SLAM-BIM coupled dataset.

Javier Civera (Zaragoza) — mapping inside the human body

Endoscopy as the ultimate prior-less environment:

  • Medical datasets either have privacy issues or are small.
  • LightDepth (ICCV 2023): depth self-supervision from illumination decline in real colonoscopy. The physics: light vanishes with distance, and also depends on the angle of the light, the normal of the surface, the albedo of the surface, and the camera parameters (gain and gamma correction). They train a network that predicts the albedo, depth and normals of the surface to reconstruct it — all differentiable, so all errors can be propagated. The C3VD dataset serves as a synthetic ablation, just to demonstrate that it works.
  • For localization, they synthesize sparse images and match against those. Feature extraction and matching was a problem back then — they used D2-Net (pre-SuperPoint era).
  • VGGT-Ω doesn’t need loop closure.
  • Map Anything.
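
A toy version of the illumination-decline idea from the LightDepth bullet, as I understood it. This is my own sketch, not the paper's model: I assume a point light co-located with the camera, Lambertian shading, inverse-square falloff, and gamma applied as a power of 1/gamma. All function and parameter names are hypothetical.

```python
import numpy as np

def rendered_intensity(albedo, depth, normal, light_dir, gain=1.0, gamma=2.2):
    """Pixel intensity of a surface point lit by a light at the camera.

    Assumptions: Lambertian surface, inverse-square falloff with depth,
    and camera gain followed by gamma encoding.
    """
    cos_theta = max(0.0, float(np.dot(normal, light_dir)))
    radiance = albedo * cos_theta / depth ** 2
    return (gain * radiance) ** (1.0 / gamma)

def depth_from_intensity(intensity, albedo, normal, light_dir, gain=1.0, gamma=2.2):
    """Invert the model above: recover depth from an observed intensity."""
    cos_theta = max(0.0, float(np.dot(normal, light_dir)))
    radiance = intensity ** gamma / gain  # undo gamma, then gain
    return float(np.sqrt(albedo * cos_theta / radiance))
```

Every step is a smooth function of albedo, depth and normal, which is what lets a photometric error be propagated back to a network predicting those three quantities.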

Links collected:

Also from the floor: the Sevensense camera.

MM-SpatialAI workshop

The MM-SpatialAI workshop (Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding) ran on day 1, and a keynote talk is up on YouTube. The keynote lineup:

  • Alex Wong (Yale) — Unsupervised extension of multimodal depth perception across scenes and sensors.
  • Hermann Blum (University of Bonn) — The future of mapping: beyond reconstruction.
  • Dezhen Song (MBZUAI) — Proprioceptive localization: when everything else fails.
  • Timothy D. Barfoot (University of Toronto) — Roads, forests, and lakes, oh my! New multi-modal datasets and some thoughts.
  • Margarita Chli (ETH Zurich / University of Cyprus) — Robust perception for single- and multi-robot systems: are we there yet?
  • Sebastian Scherer (CMU) — Multi-modal perception for resilient autonomy.
  • Andrew Davison (Imperial College London) — From SLAM to spatial AI.

Posters

Every poster I photographed across the week, regrouped by topic — loosely following my topic map. Titles link to my photos; the last few I only caught afterwards, via the authors’ own LinkedIn posts.

SLAM

SAGA-SLAM: scale-adaptive 3D Gaussian splatting for visual SLAM Kun Park, Seoul National University They use the Polyak step size — the step size being the learning rate in this context. They extract the pose from the Gaussians, and build up the Gaussians as they go.
MAD-BA: 3D LiDAR bundle adjustment — from uncertainty modeling to structure optimization Krzysztof Ćwian, Poznań University of Technology, with Sapienza Built on the newer version of g2o with custom factors — Giorgio Grisetti, one of the g2o developers, is a co-author. Code on GitHub.
Dr-PoGO: direct radar pose graph optimization Cedric Le Gentil, University of Toronto, with Timothy D. Barfoot
ivS-Graphs: BIM-informed visual SLAM for construction monitoring Asier Bikandi-Noya, SnT, University of Luxembourg
vS-Graphs: passage-aware structural mapping for RGB-D visual SLAM Ali Tourani, SnT, University of Luxembourg — accepted at RA-L Inspired by LiDAR S-Graphs, it integrates vision-based scene understanding directly into live, optimizable 3D scene graphs. Open source: code (ROS 2 Jazzy), project page, preprint.
Edged USLAM: edge-aware event-based SLAM with learning-based depth priors Şebnem Sarıözkan, Hürkan Şahin, Olaya Álvarez-Tuñón, Erdal Kayacan — Paderborn University / EIVA

From left to right: Erdal Kayacan, Şebnem Sarıözkan, myself, Hürkan Şahin, and Davide Scaramuzza.
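
Since the Polyak step size came up in the SAGA-SLAM poster, here is the textbook rule as a small sketch; it is not the authors' implementation. The step is (f(x) − f*) / ‖∇f(x)‖², where f* is the optimal value, which I assume is known (0 for a residual that can be driven to zero). The function name and defaults are mine.

```python
import numpy as np

def polyak_gradient_descent(f, grad, x0, f_star=0.0, iters=50, eps=1e-12):
    """Gradient descent where the learning rate is the Polyak step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        g_norm_sq = float(g @ g)
        if g_norm_sq < eps:
            break  # gradient vanished, nothing left to do
        step = (f(x) - f_star) / g_norm_sq  # Polyak step size
        x = x - step * g
    return x
```

The appeal is that there is no learning rate to tune by hand: the step shrinks automatically as the loss approaches its optimal value.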

Multi-robot & collaborative SLAM

Coko-SLAM: compact multi-agent Gaussian splatting SLAM Polytechnique Montréal / UW–Madison
3D foundation model-based loop closing for decentralized collaborative SLAM Pierre-Yves Lajoie, Polytechnique Montréal / Oxford Robotics Institute You can plug in different 3D foundation models; here they used MASt3R. End-to-end, and they do Sim(3) alignment rather than SE(3) to counteract the scale inconsistencies of the foundation-model reconstructions — each agent reconstructs at a different scale, so the Sim(3) pose-graph alignment absorbs the per-agent scale drift. Paper.
Distributed pose graph optimization via contractive belief sharing Xiangyu Liu, University of Cyprus, with Margarita Chli
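
To see why Sim(3) rather than SE(3) matters in the loop-closing poster above, a minimal sketch of the closed-form Umeyama alignment between two sets of corresponding 3D points. This is the standard algorithm, not the poster's pipeline, and the function name is mine.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ≈ s * R @ src + t.

    src and dst are (N, 3) arrays of corresponding points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)          # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = float(np.trace(np.diag(D) @ S) / var_src)
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Fixing `s = 1` turns this into the rigid SE(3) case; leaving it free is what lets the alignment absorb a per-agent scale difference.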

Localization & place recognition

On motion blur and deblurring in visual place recognition Timur Ismagilov, with Bruno Ferrarini, Michael Milford, Tan Viet Tuyen Nguyen They blur the images artificially by averaging a set of images, and deblur with an off-the-shelf network. They compare place recognition across different baselines — DINOv2-based, ResNet-based, geometry-based — and DINO surpasses all the rest.
A global localization pipeline (title escaped me, since I only photographed the detail panels) Spanish IGN aerial LiDAR raster maps as the prior, aligned with Umeyama + ICP on top of FAST-LIO2, validated on a Unitree Go2.
CFEAR-Teach-and-Repeat: fast and accurate radar-only localization Maximilian Hilger — code Radar is robust in visually degraded environments, but radar localization still lags LiDAR — particularly in heading estimation. This narrows the gap using a single spinning radar. (CFEAR really got around this week — see also InsSo3D on Day 3.)
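
The artificial blurring in the motion-blur poster above is easy to reproduce. A minimal sketch, assuming the averaged set is a short run of consecutive frames stored as uint8 arrays; the function name is mine and the deblurring network is not covered.

```python
import numpy as np

def synthetic_motion_blur(frames):
    """Average a stack of frames, shape (T, H, W) or (T, H, W, C), into one blurred uint8 image."""
    stack = np.asarray(frames, dtype=np.float64)  # avoid uint8 overflow while summing
    return np.clip(np.rint(stack.mean(axis=0)), 0, 255).astype(np.uint8)
```

Averaging more frames simulates a longer exposure, so the number of frames acts as the blur-strength knob.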

State estimation, calibration & optimization

Learning multiple initial solutions to optimization problems Elad Sharony, Technion / NVIDIA Research
MUSE: multimodal uncertainty quantification of state estimation University of Illinois / University of South Carolina
Frequency-weighted neural Kalman filters (FW-NKF) ETH SIPLab Real sensor noise is rarely white — it's often colored and concentrated in certain frequency bands, which hurts Kalman-based estimates. FW-NKF keeps the Kalman recursion intact, but filters the innovation with a learnable IIR filter before the correction step, with spectral supervision to match the clean signal spectrum. Strong results on Lorenz, pendulum, EuRoC MAV and UWB-IMU tracking; outperforms KalmanNet variants, RKN and AR-KF. Code, project page.
Unleashing the power of discrete-time state representation: ultrafast target-based IMU-camera spatial-temporal calibration Paper · code Is continuous-time state estimation superior to discrete-time? The common criticism of discrete-time (cf. the Kalibr debate) is that it's less accurate for temporal calibration, i.e. estimating the time offset between sensor clocks. The author argues both should be comparable if no measurement information is lost: the real weakness was IMU preintegration relying on Euler integration, and replacing it with higher-order integration fixes it. Meanwhile, discrete-time needs a much lower state dimension in the optimization, which brings big gains in efficiency and in the convergence basin.
Exploiting chordal sparsity for globally optimal estimation with factor graphs Frank Dellaert, with Avinash Subramanian, Connor Holmes, Timothy Barfoot, Frederike Dümbgen — presented at the Frontiers of Optimization for Robotics workshop Certifiable optimization is having a moment at ICRA. The core idea is making convex SDP relaxations respect the sparse structure of factor graphs: lift the problem to a QCQP, use GTSAM's Bayes tree machinery to expose chordal structure, and solve smaller clique-wise semidefinite problems rather than one monolithic SDP. Dellaert announced that GTSAM will soon support certifiable estimation in a big way, building on its recent QP/QCQP support and related work by David Rosen & co at Northeastern. Blog post.
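
Going back to the FW-NKF entry in this group: a toy scalar sketch of "filter the innovation before the correction step". It is only meant to show where the filter sits in the recursion. I assume a random-walk state and a fixed first-order IIR low-pass with coefficient `alpha`, whereas the poster learns the filter; all names and defaults are mine.

```python
def kalman_with_filtered_innovation(measurements, q=1e-3, r=0.1, alpha=0.5):
    """Scalar Kalman filter whose innovation passes through an IIR low-pass.

    With alpha = 0 the IIR filter is a pass-through and this reduces to a
    standard Kalman filter for a random-walk state.
    """
    x, p = 0.0, 1.0        # state estimate and its variance
    filtered = 0.0         # IIR filter memory
    estimates = []
    for z in measurements:
        p = p + q                                   # predict (random walk)
        innovation = z - x
        filtered = alpha * filtered + (1.0 - alpha) * innovation
        k = p / (p + r)                             # Kalman gain, unchanged
        x = x + k * filtered                        # correct with filtered innovation
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

The gain and covariance updates are untouched, which matches the "keeps the Kalman recursion intact" part; only the signal fed into the correction changes.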

3D reconstruction & Gaussian splatting

Pose-anchored and scale-consistent dense mapping with geometric priors Yuhao Zhang, University of Oxford, with Yifu Tao and Maurice Fallon
DepthMesh: a dual-end complementary online depth estimation and mesh reconstruction Jiaqi Yang et al. Tightly couples online multi-view depth estimation and TSDF reconstruction for fast online meshing.
TUN3D: towards real-world scene understanding from unposed images Anton Konushin, Lomonosov Moscow State University

Mapping & scene graphs

Marine & underwater

ALAR: customizable multimodal underwater scenarios for harsh domain perception Andrea Bedei, University of Bologna They extend HoloOcean to spawn arbitrary objects and allow domain randomization. Code on GitHub.
A sonar-visual dataset for cross-modal underwater robot perception Weitung Chen, MIT / SINTEF / NTNU, with Martin Ludvigsen among the co-authors The dataset is SOVIS: over 76,000 paired sonar-camera frames collected across 17 dives at six sites in the Trondheimfjord, with a proof-of-concept fish detection task on a small labeled subset. They show how camera-to-sonar correspondences can be learned.
Direct ping-level landmark detection for side-scan sonar SLAM Jinho Im, Seonghun Hong Feature-poor seabed environments — most side-scan sonar SLAM operates in the image domain; here they detect landmarks directly at the ping level.
Benchmarking 3D reconstruction for under-ice robotic perception Arctic environments Sub-surface ice mapping is dominated by multibeam sonar, which gives robust large-scale geometry but lacks the spectral and high-resolution textural information needed to record fine morphological features.
Why domain matters (poster form) Melanie Wille, QUT See Tobias Fischer's talk above.
InsSo3D: inertial navigation system and 3D sonar SLAM for turbid environment inspection Heriot-Watt University, with Yvan Petillot CFEAR comes from radar odometry; here they generalize it to 3D sonar — the point-to-distribution variant was the most robust one.

Autonomous driving

Other topics


Our poster: the building blocks of learning-based monocular visual odometry for underwater environments

I presented our poster at the S2S: From Sea to Space workshop on day 1 — and it won the Best Poster Award! 🏆 A big thank you to everyone who visited, and to the organizers. It was great to share our progress on building the foundations of an AI-based monocular visual odometry system, and to exchange ideas with all of you.

S2S award winners: Yujin Park (Best Talk Award + Travel Grant) and myself (Best Poster Award).


Other links

The Good Reviewer: shaping up peer-review in the robotics community

One workshop I sadly couldn’t attend (it clashed with S2S on day 1): The Good Reviewer, on fixing the peer-review process in robotics — organized by Alejandro Fontan, Javier Civera, Tobias Fischer, Michael Milford and others. Luckily, they compiled a great list of resources on their website, so we can all become better reviewers from home:

Guides, blogs and tutorials:

Official reviewer guidelines:


Ideas I left with

Notes-to-self scribbled in the margins during the week:

  • Rather than focusing so much on monocular vision: 3D scene graphs underwater?
  • Try VGGT for depth.

This post is a work in progress…



