AI Robot with OpenAI Realtime API

It’s been a little while without a new post on the little robot project.

I’m a bit sad about it, but I’ve moved away from the eyes from the previous design. I didn’t really set out to build a humanoid robot in the first place, and those eyes, with the 6 servos needed to animate them, ended up using a lot of resources, both compute and power.

So I have a brand new design for a little sci-fi-ish robot, and new code.

The goal of this post is to show off the conversational capability, and the Realtime API recently released by OpenAI is perfect for that: it’s multimodal and can take in and return audio directly. Previously this had to be handled with multiple separate requests, which added a lot of lag and really didn’t work well for conversation. It’s all much faster now, and actually usable.

This little robot is powered by a Raspberry Pi 5 plus a Hailo 8 board (the AI+ kit). I will probably look into moving to a Jetson Nano in the future, if I can get one.

There are some examples out there on how to use the Realtime API in Python, but there are a few gotchas:

  • It’s really intended for call centers, tech support and so on, which usually takes place over the phone. With a regular speaker and mic, modifications are needed so the robot doesn’t hear itself and interrupt itself. That can become confusing :)

  • Along the same lines, we need a trigger, since no one is going to call on the phone. We can’t have the robot just record and send to OpenAI all day long. For this, we’ll use OpenWakeWord (there’s a small sketch of this right after the list).

  • The LLM needs to receive audio at 24kHz (and OpenWakeWord needs its own sample rate as well). If you use another sample rate, the API will not understand you, and you’ll wonder why it doesn’t work. Not all mics and speakers can work at those frequencies directly.
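To give an idea of what the wake-word part can look like, here is a minimal sketch (not the robot’s actual code): it records at 48kHz, resamples down to the 16kHz OpenWakeWord expects, and blocks until the score from the pre-trained model crosses a threshold. The chunk size and threshold are just illustrative values.

```python
# Minimal wake-word gate: capture at 48 kHz, resample to 16 kHz for OpenWakeWord.
import numpy as np
import sounddevice as sd
import soxr
from openwakeword.model import Model  # may need openwakeword.utils.download_models() first

MIC_RATE = 48_000   # a rate the mic actually supports
OWW_RATE = 16_000   # what OpenWakeWord expects
CHUNK = 3_840       # 80 ms at 48 kHz -> 1280 samples at 16 kHz

oww = Model(wakeword_models=["hey_jarvis"])  # pre-trained "Hey Jarvis" model

def wait_for_wake_word(threshold: float = 0.5) -> None:
    """Block until the wake word is detected."""
    with sd.InputStream(samplerate=MIC_RATE, channels=1, dtype="int16") as stream:
        while True:
            frame, _ = stream.read(CHUNK)                        # (CHUNK, 1) int16
            frame_16k = soxr.resample(frame[:, 0], MIC_RATE, OWW_RATE)
            scores = oww.predict(frame_16k)
            if max(scores.values()) > threshold:                 # only one model loaded
                return

if __name__ == "__main__":
    wait_for_wake_word()
    print("Wake word detected, starting conversation loop")
```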

Code

The code below is not going to just work out of the box. A bunch of libraries and dependencies are needed, and it will also depend on the specific hardware, but it should be helpful as a guide.

The code and comments should be relatively easy to follow, but what it does is:

  • Run a main loop, which is effectively sleep mode as far as the voice is concerned. In this mode, it listens for a wake word with OpenWakeWord. I’m using the pre-trained “Hey Jarvis” model. I figured I would look into training it on something else later on, but it ended up sticking. It’s a good name for a robot.

  • Once the wake word has been detected, the main conversational app starts: it records, processes, and sends audio to the API (see the sketch after this list). We rely on the speech detection built into the model to figure out when the user stops talking, so it can start generating a response.

  • The audio response will be processed and played, as a stream, once the API starts responding.

  • When it is done responding and all audio has been played, it will be ready for the next voice input.

  • If 10 seconds have passed since the last response, the app will automatically exit and return to the main sleep loop, to be reawakened with the wake word. Again, we do not want to be constantly processing and sending audio.
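To make that flow more concrete, here is a stripped-down sketch of the conversational part. It talks to the Realtime API over a raw websocket rather than through the trimmed-down SDK code the robot actually runs, so treat the model name, chunk sizes and timeout handling as placeholders; it also leaves out the “don’t listen while talking” gate discussed further down.

```python
import asyncio, base64, json, os
import numpy as np
import sounddevice as sd
import soxr
import websockets

MIC_RATE = 48_000          # capture rate the mic supports
API_RATE = 24_000          # PCM16 rate the Realtime API expects
URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
           "OpenAI-Beta": "realtime=v1"}

async def converse(idle_timeout: float = 10.0) -> None:
    # Note: the header argument is additional_headers= on websockets >= 14.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Let the server-side VAD decide when the user has stopped talking.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        mic = sd.InputStream(samplerate=MIC_RATE, channels=1, dtype="int16")
        spk = sd.OutputStream(samplerate=API_RATE, channels=1, dtype="int16")
        mic.start(); spk.start()

        async def send_mic() -> None:
            while True:
                frame, _ = await asyncio.to_thread(mic.read, MIC_RATE // 10)  # ~100 ms
                pcm = soxr.resample(frame[:, 0], MIC_RATE, API_RATE)          # int16 in, int16 out
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm.tobytes()).decode(),
                }))

        sender = asyncio.create_task(send_mic())
        try:
            while True:
                # Drop back to the sleep loop if nothing happens for a while.
                event = json.loads(await asyncio.wait_for(ws.recv(), timeout=idle_timeout))
                if event["type"] == "response.audio.delta":
                    chunk = np.frombuffer(base64.b64decode(event["delta"]), dtype=np.int16)
                    await asyncio.to_thread(spk.write, chunk)
        except asyncio.TimeoutError:
            pass
        finally:
            sender.cancel()
            mic.stop(); spk.stop()

if __name__ == "__main__":
    asyncio.run(converse())
```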

A few more notes

  • This uses a bit of basic code from the OpenAI Python SDK and their Realtime example, much trimmed down and modified. It also uses part of the “async” audio player they provide, similarly trimmed and modified.

  • As mentioned before, audio needs to be at specific sample rates: OpenAI requires 24kHz, while OpenWakeWord needs 16kHz. Many microphones are limited in what rates they can record at, and USB audio cards are limited in what they can play; MEMS mics seem to support a bit more. It’s likely most mics will not be able to record below 44.1kHz, for example, and some might do 16kHz but not 24kHz. In this code, I’m recording everything at 48kHz and resampling on the fly to whatever is needed using the soxr library, as in the sketches above.

  • The robot shouldn’t be able to hear itself talk, or all hell will break loose (it just won’t work), so recording is stopped while a response is being played back (see the sketch after this list). The downside is that we can’t interrupt the robot while it’s talking and just have to wait until it’s done. I’ll eventually find a workaround for this, but it’s not a big deal for me at the moment.

  • I’m using a series of LEDs on the front that pulse while in recording mode and light up with the audio stream when the robot talks back (a sketch of that follows as well). They run in separate threads.

  • Some tweaks and fixes are still needed, as there are a few bugs left. For example, my 10s timeout is not ideal; it needs to take into account the moment a user starts talking as well.
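For the “robot shouldn’t hear itself” point, the gist is just a shared flag between the capture and playback paths. This is an illustrative sketch, not my exact implementation: frames from the mic are dropped for as long as playback is in progress.

```python
import threading
import sounddevice as sd

speaking = threading.Event()  # set for the duration of playback

def capture_loop(stream: sd.InputStream, handle_frame) -> None:
    """Read mic frames forever, but drop them while the robot is talking."""
    while True:
        frame, _ = stream.read(4800)      # ~100 ms at 48 kHz
        if speaking.is_set():
            continue                      # robot is talking: ignore the mic
        handle_frame(frame)

def play_response(stream: sd.OutputStream, pcm_chunks) -> None:
    """Play the response audio and block mic capture until it is done."""
    speaking.set()
    try:
        for chunk in pcm_chunks:
            stream.write(chunk)
    finally:
        speaking.clear()                  # listening resumes only after playback ends
```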
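And for the LEDs, something along these lines works with gpiozero; the GPIO pin and the loudness-to-brightness mapping are just placeholders, not how my robot is actually wired.

```python
import numpy as np
from gpiozero import PWMLED

led = PWMLED(17)  # whichever pin the front LEDs are wired to

def start_listening_pulse() -> None:
    # gpiozero runs the fade in a background thread, so this returns immediately.
    led.pulse(fade_in_time=0.5, fade_out_time=0.5)

def set_level_from_audio(chunk: np.ndarray) -> None:
    # Crude talk-back animation: brightness follows the chunk's loudness.
    level = float(np.abs(chunk).mean()) / 32768.0   # int16 -> 0..1
    led.value = min(1.0, level * 4)                 # boost so speech is visible

def leds_off() -> None:
    led.off()
```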

That’s about it!

While continuing to tweak this, I’ll be working next on autonomous motion and computer vision, and getting Jarvis to interact with the environment.

