Intellectually Curious

Google DeepMind Gemini ER 1.6 AI for Real-World Robotics

Mike Breault


We unpack DeepMind's Gemini ER 1.6, an embodied reasoning model that grounds language in physical space with precise pointing, multi-camera success checks, and agentic action. See how its 'frontal lobe' plans tools and tasks, writes on-the-fly code to measure dial angles, and coordinates with 'VLA' muscle models to safely operate in messy environments—from reading gauges to Spot inspections. We'll explore the architecture, grounding techniques, safety constraints, and what this means for the future of autonomous robots and AI training.


Note:  This podcast was AI-generated, and sometimes AI can make mistakes.  Please double-check any critical information.

Sponsored by Embersilk LLC

SPEAKER_00

You know, the other day I decided to be like super responsible and check my car's tire pressure.

SPEAKER_01

Oh no, I can already see where this is going.

SPEAKER_00

Yeah, I had grabbed this old analog gauge out of the glove box, right? And I'm staring at these tiny, completely impossible to read tick marks.

SPEAKER_01

They really are the worst.

SPEAKER_00

Right. So um I thought it read 40 PSI, and I tried to let a little air out to hit 35. And, well, long story short, I completely misread the dial and just deflated my tire right there in the driveway.

SPEAKER_01

Oh man. Yeah, human error meets a totally confusing physical interface.

SPEAKER_00

Exactly. And getting machines to navigate that exact kind of messy physical confusion is, you know, essentially the holy grail of AI. And DeepMind's new Gemini Robotics ER 1.6 model might have just cracked it.

SPEAKER_01

Yeah. Looking at DeepMind sources today, we're really focusing on this new embodied reasoning model or ER.

SPEAKER_00

But real quick, speaking of tackling tough problems, if you need help with AI training, automation, software development, or uncovering where agents could make the most impact for your business or personal life, check out Embersilk.com for your AI needs.

SPEAKER_01

Definitely check them out. But yeah, what we're seeing with ER 1.6 is how physical AI agents are evolving. I mean, they're going from just blindly following basic instructions to dynamically reasoning about complex environments.

SPEAKER_00

Really messy environments, right?

SPEAKER_01

Exactly. And in real time, it's just an incredibly optimistic leap forward for the future of robotics.

SPEAKER_00

Okay, let's unpack this. Because to understand how it works, we really need to look at the architecture. DeepMind has a few systems working together here.

SPEAKER_01

Right, they do.

SPEAKER_00

You've probably heard of VLA or vision language action models. Think of the VLA model as like the robot's muscle memory. It's the physical instinct that actually extends the arm and you know grasps the tool.

SPEAKER_01

Yeah, the brawn of the operation.

SPEAKER_00

Exactly. Yeah. But this new ER1.6 model, on the other hand, is the frontal lobe. It's looking at the entire workbench, deciding which tool to grab first, and actively checking if the task is progressing correctly.

SPEAKER_01

And what's fascinating here is that because it operates as that frontal lobe, ER 1.6 is highly agentic.

SPEAKER_00

Meaning it's acting on its own.

SPEAKER_01

Right. It doesn't just sit there processing text in isolation. It can natively call on tools to execute complex physical plans. Like if it needs context, it actually queries Google search.

SPEAKER_00

Oh wow. That's wild.

SPEAKER_01

Yeah. And once it decides on a strategy, it directs those VLA muscle models to physically move.

SPEAKER_00

I mean, I struggle to see how this works reliably in a truly messy space, though. Like my completely disorganized garage. Language models are notorious for hallucinating.

SPEAKER_01

True. That's been a big hurdle.

SPEAKER_00

So what stops this frontal lobe from uh looking at a shadow in the corner and hallucinating a wheelbarrow that just isn't actually there?

SPEAKER_01

Well, that is where a really cool new capability called pointing comes in. It's essentially the foundation of their spatial reasoning.

SPEAKER_00

Pointing, like literally pointing at something.

SPEAKER_01

Sort of, yeah. Instead of just generating a text description like uh there are pliers on the table, ER1.6 actually generates specific X and Y pixel coordinates.

SPEAKER_00

Oh, interesting.

SPEAKER_01

Right. It literally draws an invisible dot at the location of each tool. This forces the AI to ground its language in physical reality.

SPEAKER_00

So it has to actually prove it sees the object.

SPEAKER_01

Precisely. Like in their benchmark tests, it accurately identified exactly six pliers and two hammers in a highly cluttered image while completely ignoring requests to find objects that just weren't physically present.
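The grounding trick described here is that every claim has to come with coordinates. As a rough sketch of what consuming that kind of pointing output might look like (the JSON schema and field names below are assumptions for illustration, not the model's documented format):

```python
import json

# Hypothetical pointing response: each entry grounds a label in x/y image
# coordinates. The schema here is assumed for illustration only.
response = '''
[{"point": [412, 130], "label": "pliers"},
 {"point": [455, 610], "label": "hammer"},
 {"point": [390, 880], "label": "pliers"}]
'''

def count_grounded(points_json, label):
    """Count only objects the model actually localized with coordinates.

    An object with no point entry was never grounded in the image, so a
    request for something that isn't physically present returns zero.
    """
    points = json.loads(points_json)
    return sum(1 for p in points
               if p["label"] == label and len(p["point"]) == 2)

n_pliers = count_grounded(response, "pliers")      # two grounded pliers
n_wheelbarrows = count_grounded(response, "wheelbarrow")  # none: not present
```

The point of the sketch is the contract, not the schema: a text-only answer can hallucinate, but a coordinate-backed answer can be checked against the image.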

SPEAKER_00

That is huge. So the pointing keeps it totally grounded.

SPEAKER_01

Yeah. And it pairs that spatial awareness with something called success detection. Most modern robots have multiple cameras, right? Usually an overhead view and maybe a wrist-mounted feed.

SPEAKER_00

Right, to see what the hand is actually doing.

SPEAKER_01

Exactly. So ER 1.6 synthesizes those different camera streams simultaneously. It isn't just guessing based on a single obscured angle, you know?

SPEAKER_00

Yeah.

SPEAKER_01

It confirms a task is genuinely finished across multiple viewpoints.

SPEAKER_00

Here's where it gets really interesting. Because it can confirm success across multiple cameras and accurately pinpoint objects, it can finally be trusted in high-stakes, unpredictable environments.

SPEAKER_01

Yeah, the really dynamic stuff.

SPEAKER_00

Which is exactly why Boston Dynamics is putting this brain into Spot, their robot dog, for industrial facility inspections.

SPEAKER_01

That's right. They're using ER 1.6 so Spot can read analog dials and sight glasses using agentic vision.

SPEAKER_00

And it does this in such a brilliant way. It doesn't just try to visually guess the number on the dial like I did.

SPEAKER_01

Thankfully, no.

SPEAKER_00

Right. The model actually writes a mini Python script on the fly to calculate the geometric angle of the needle against the tick marks. It literally does the exact math I completely failed to do with my tire gauge, but it does it perfectly.
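The on-the-fly script described here could look roughly like this minimal sketch, assuming the model has already pointed at the dial center and needle tip and knows the angles and values of the gauge's end tick marks (all coordinates and calibration numbers below are made up for illustration):

```python
import math

def read_dial(center, tip, angle_min, angle_max, value_min, value_max):
    """Estimate a gauge reading from the needle's geometric angle.

    center, tip: (x, y) pixel coordinates of the dial center and needle tip.
    angle_min/angle_max: needle angles in degrees at the lowest and highest
    tick marks; value_min/value_max: the values printed at those ticks.
    """
    dx = tip[0] - center[0]
    dy = center[1] - tip[1]  # flip y: image coordinates grow downward
    angle = math.degrees(math.atan2(dy, dx))
    # Linearly interpolate the reading between the two calibrated ticks.
    frac = (angle - angle_min) / (angle_max - angle_min)
    return value_min + frac * (value_max - value_min)

# Needle pointing straight up (90 degrees) on a gauge whose scale sweeps
# from 225 degrees (0 PSI) down to -45 degrees (60 PSI): halfway = 30 PSI.
psi = read_dial(center=(100, 100), tip=(100, 40),
                angle_min=225, angle_max=-45,
                value_min=0, value_max=60)
```

That's the appeal of the approach: instead of eyeballing tiny tick marks, the reading reduces to one `atan2` call and a linear interpolation.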

SPEAKER_01

And importantly, because of that spatial grounding, it does all of this incredibly safely. This is actually their safest model yet because it naturally adheres to physical constraints.

SPEAKER_00

Like what kind of constraints?

SPEAKER_01

Well, you can give it instructions like uh don't handle liquids or don't pick up objects heavier than 20 kilograms, and it complies perfectly.
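In ER 1.6 those constraints are reportedly followed by the model itself, but the kind of check involved can be illustrated with a toy supervising filter (the function, field names, and rules here are entirely hypothetical):

```python
def allowed(action, constraints):
    """Reject a proposed action that violates declared physical constraints.

    action: dict describing what the robot wants to do, e.g. the object's
    weight or whether it contains liquid. constraints: the standing rules.
    """
    if constraints.get("no_liquids") and action.get("is_liquid", False):
        return False
    max_kg = constraints.get("max_weight_kg")
    if max_kg is not None and action.get("weight_kg", 0) > max_kg:
        return False
    return True

rules = {"no_liquids": True, "max_weight_kg": 20}
ok_light = allowed({"object": "wrench", "weight_kg": 1}, rules)    # True
ok_heavy = allowed({"object": "toolbox", "weight_kg": 25}, rules)  # False
```

A hard-coded filter like this is brittle; the claim in the sources is that the model honors such instructions natively, which is what makes open-ended constraint language practical.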

SPEAKER_00

Wow. So what does this all mean for you? We are looking at a deeply inspiring future where robots become incredibly capable, autonomous partners.

SPEAKER_01

Absolutely.

SPEAKER_00

They're getting ready to take over the tedious physical tasks we'd really rather not do alone.

SPEAKER_01

If we connect this to the bigger picture, you know, if robots can dynamically reason and interact with physical constraints natively, it makes you wonder at what point do we stop writing rigid software for industrial machines and just start asking them to figure out the factory floor themselves?

SPEAKER_00

I love that thought. Just letting them figure it out. Yeah. What everyday physical chore would you delegate to a robot that can dynamically reason just like you?

SPEAKER_01

That really is the dream.

SPEAKER_00

Totally. Well, if you enjoyed this podcast, please subscribe to the show. Hey, leave us a five star review if you can. It really does help get the word out. Thanks for tuning in.