Good move getting a nice robot! I'm doing something similar to you, but I went with a cheap robot, the "HIWONDER 6DOF Robotic Arm Kit". It was only $600, but wow is it bad. The precision and repeatability are both "are you drunk?" level. I can hear the gears grind when it moves. I suppose I should upgrade. But if my system can work with a terrible robot, I assume it would work even better with a nice one!
Cool stuff. At my previous (startup / research) job I had to set up a similar system (but with a Franka arm and multi-view cameras) alone because I was the only one with a robotics background.
>I do not intend to calibrate the camera’s extrinsics or intrinsics for now.
Sensible choice, although I'd suggest that in the long run it's worth doing at an early stage of your setup, especially if you intend to collect data for policy learning.
Debugging trained policies for visual manipulation tasks can be a headache, and having as much context as possible on the variables that can change is good practice.
My previous setup was in Japan, an earthquake-prone place, and I wasted some time before realizing the camera had been misaligned by an earthquake. A simple solution is to place an ArUco marker on the table to track the camera's relative extrinsic pose, and add it as metadata to the collected teleoperation dataset.
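A minimal sketch of the drift check, assuming the marker pose (rotation matrix `R`, translation `t` in the camera frame) has already been estimated by whatever detector you use. `camera_drift`, `rot_z`, and the metadata keys are hypothetical names for illustration, not part of any library:

```python
import math

def rotation_angle_deg(R_a, R_b):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_a^T @ R_b) equals the element-wise product summed over all entries
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against rounding
    return math.degrees(math.acos(cos_theta))

def camera_drift(baseline, current):
    """Compare two marker poses, each given as (R, t) in the camera frame."""
    (R0, t0), (R1, t1) = baseline, current
    return {
        "marker_shift_m": math.dist(t0, t1),
        "rotation_deg": rotation_angle_deg(R0, R1),
    }

def rot_z(deg):
    """Rotation about the z axis, used here only to build example poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Pose recorded right after mounting the camera vs. pose seen today
baseline = (rot_z(0.0), (0.10, 0.20, 0.80))
current = (rot_z(10.0), (0.13, 0.24, 0.80))
drift = camera_drift(baseline, current)

# Stored alongside each episode so misalignment can be spotted later
episode_metadata = {"marker_pose": current, "drift_vs_baseline": drift}
```

Thresholds for what counts as "misaligned" depend on the task, so the sketch only reports the numbers and leaves the decision to whoever inspects the dataset.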
Great points and I very much appreciate the input!
Right now the static camera is probably really bad: it's mounted on my desk, so it's very easy to bump into it and move it. So yeah, its position will for sure change over time. I think I need a better solution, maybe a rail system that's more rigidly attached to the robot arm so that at least the camera stays fixed relative to that point of reference.
Hey this is cool! I am doing something similar myself with the SO101 arm robot from Robot Studio using a patchwork of my own code and LeRobot. Would love to collaborate with you if you are open to it. You can find me on Discord as `.avilay`. https://www.linkedin.com/posts/avilay_lerobot-huggingface-ro...
Would like to know your reasoning on not going with LeRobot.
Looks very cool! I’m not a huge discord user but how about you shoot me an email and we can figure out how to share notes? (I don’t want to post it directly here but it’s easy to find on my personal website, just google my name)
Re why not SO-101: the article has a footnote about this; I actually bought the SO-101 as well! I want to integrate it into the same setup so I can switch depending on task.
Somewhat surprisingly the xarm was actually much faster to arrive; I got it within 2 days of ordering. I don’t have a 3D printer, and getting the SO-101 from the vendor took almost 4 weeks. So partially it just came down to what I had access to more quickly.
Second point is reliability: I think the SO-101 is cool but I’d be surprised if it doesn’t break more quickly than the xarm. I wanted something that’s going to last a long time without headaches. And these industrial arms are really mature hardware wise now.
Thanks for your response! I totally get your point about the delay in getting the robot, I ordered mine from PartaBot and they did take a couple of weeks to get here. But when they did, they worked great out-of-the-box :-)
I'm very interested in the SO101. I've never done any robotics and that seems a palatable entry level thing to try things out.
How have you found it?
(The author does explain his reasons for not using LeRobot in the post - although "I also use LeRobot for training and running baseline policies, and the vendor SDKs for the hardware.")
> (The author does explain his reasons for not using LeRobot in the post - although "I also use LeRobot for training and running baseline policies, and the vendor SDKs for the hardware.")
Ah! That is exactly what I use it for as well :-)
I am liking the SO101 - teleop and robot both work great. For sure, it is very easy to get started with. I was able to collect around 50 demos with them and train my first ACT policy within days of setting up the robots. Happy to share more detailed learnings from this if/when you get started with it. https://github.com/avilay/learn-robotics
And yeah I feel you re humanoid. I worked on the Rubik's cube project at OpenAI, which used a humanoid hand, and it was insanely painful and hard. Also fun anecdote: it was completely impossible to teleop the shadow hand. We had a data glove to capture hand movements but as soon as contact / haptics come in, you're lost. We could never even get a single rotation on the Rubik's cube via teleop.
I do think simpler hardware like the one described in my post works really well though, and it's so much easier to do something with it.
Very interesting article. I'm looking to get my first robot arm soon-ish, probably something in the SO-101 category.
Can someone get reasonably far using recorded sessions compared to training in a simulated environment¹?
Do you have any experience with third-party or DIY attachments for robot arms? I assume it's going to be more difficult for something like the Ufactory arm vs. the open models.
I'm only starting down this road but my sense is that ACT and Diffusion Policy both make it pretty feasible to start on real data only. LeRobot also makes it easier to train these. But that's the next step that I'm working on, so I don't know yet.
On attachments: during this project I really wanted a 3D printer several times. So that's probably next on the shopping list.
The Ufactory arm is actually quite extensible: it exposes digital input/output and you have a standard wrist mount where you can mount different end effectors or attachments.
Nice, I will be following your posts! I just bought a robot arm myself, the seeed studio B601DM (€1500 6+1 axis). It works great, is open source hardware as well, and is a bit more solid than the SO101. I also opted not to use ROS; I don't want to give up control by putting another framework in between. Is your plan to see what's possible right now or do you also have ideas on how to improve sota?
Oh very cool! Looks a bit like the TRLC-DK1 (I was looking at this one for a bit).
I think pushing the sota is quite hard to do solo but we'll see. Mostly I want to get back up to speed after having not done much robotics during the last 6 years. Best way for me to learn is to just do it, so here we are. We'll see how far I get (I suspect at some point compute will be the main bottleneck)
Love this, I’m playing around with the cheapo esp32+servos version of this, super fun.
Something I’m working on is a hardware CLI for agents to run experiments, with a “CICD” pipeline that validates everything and means I can delegate more of the experiments to the agents. I wonder if you have any thoughts on this?
The idea is to allow the coding agent to run the full loop of experiments and validations, with vision, audio, button pressing, speaking etc to interact in place of the human
I had a very similar setup. Really happy with the xarm 6 lite. I played around with the diffusion policy paper experiments and was thinking of buying a webcam as a top camera as well, but I ended up buying two Intel RealSense cameras because of the timestamp drift issues. How did you solve that? Or is camera feed syncing not necessary for your intended projects?
I timestamp everything twice: once with the hardware clock (if available, like for the realsense camera) and once within my robot stack once it gets read from the device (using `time.monotonic_ns()`). Both are stored and alignment can happen with either timestamp. I think the 2nd timestamp is actually more meaningful since ultimately I want to reconstruct the state that the policy would've seen; so if one modality is delayed I should actually include that effect during training.
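A minimal sketch of that double-timestamp scheme, not the actual stack: `Sample`, `stamp`, and `latest_before` are hypothetical names, and the lookup assumes samples are appended in the order they are read so that `host_ns` is non-decreasing.

```python
import bisect
import time
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Sample:
    payload: Any
    host_ns: int                       # time.monotonic_ns() when the stack read the data
    device_ns: Optional[int] = None    # hardware clock, if the device exposes one

def stamp(payload: Any, device_ns: Optional[int] = None) -> Sample:
    """Attach the host-side timestamp at read time; keep the device one if present."""
    return Sample(payload, time.monotonic_ns(), device_ns)

def latest_before(samples: List[Sample], query_ns: int) -> Optional[Sample]:
    """Most recent sample with host_ns <= query_ns, i.e. what a policy could have seen."""
    keys = [s.host_ns for s in samples]
    idx = bisect.bisect_right(keys, query_ns) - 1
    return samples[idx] if idx >= 0 else None

# Example: align a camera stream to a control tick at host time 250
frames = [Sample("frame0", 100, 90), Sample("frame1", 200, 185), Sample("frame2", 300, 291)]
seen = latest_before(frames, 250)
```

Looking up the latest sample at or before the query time (rather than the nearest one) keeps any delay in a modality visible in the aligned data, which matches the goal of reconstructing what the policy would have observed.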
That being said, I might switch to a realsense for the static tabletop camera as well; the realsense wrist is clearly much more reliable than the cheap Logitech C920 that I currently use.
As impressive as this setup may be, I'm still amazed at how slow this type of robot is, whether amateur or professional grade.
I have no expertise in this field, but as an observer, the apparent progress in this area seems very limited.
I guess my expectations are too high and my understanding of the problems to solve is too low.
It’s partially my fault: I currently clip the max speed _and_ I only input soft control changes when teleoperating to avoid crashing into things. The robot itself could definitely move more quickly than what you see in the video.
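For illustration, speed clipping plus gentle command changes could look roughly like this. It is a sketch under assumptions, not the code used in the video: `limit_command` and its parameters are hypothetical, and the limits are applied per axis to a velocity command.

```python
def limit_command(prev, target, max_speed, max_accel, dt):
    """Per-axis velocity limiting: cap the speed, then cap the change per tick."""
    out = []
    max_delta = max_accel * dt  # largest allowed velocity change in one control step
    for v_prev, v_target in zip(prev, target):
        v = max(-max_speed, min(max_speed, v_target))          # hard speed cap
        step = max(-max_delta, min(max_delta, v - v_prev))     # soft change
        out.append(v_prev + step)
    return out

# The operator slams the stick: the command ramps up over several ticks instead
cmd = [0.0, 0.0]
history = []
for _ in range(3):
    cmd = limit_command(cmd, [1.0, -0.05], max_speed=0.2, max_accel=1.0, dt=0.1)
    history.append(cmd)
```

Raising `max_speed` and `max_accel` makes the robot visibly faster, at the cost of less margin when a teleop input is wrong.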
It would be interesting to explore how RL can be applied on top of my (flawed) human demos to optimize beyond what I’m able to do.
Great article. I'll be following along. Would like to learn more about the robotics space.
- I've heard the advantage of ROS besides the architecture is the ecosystem (driver integrations, etc). Is that not an issue because the arm supports a Python SDK OOTB?
- Any issues you've been running into with this setup?
- How do you determine if a session recording is good enough for training? Is 50/100 samples really all you need?
- The driver situation turned out totally fine; I intentionally picked HW with good python sdk support so that was very painless.
- The static camera (the C920) is not super great; it drops frames and sometimes cuts out. We’ll see how that goes but it’s probably the first thing I want to swap right now. Another issue is the reach of the arm when forcing the wrist to be axis-parallel with the table; you cannot get very far away. The chess setup demo in the video gives an example: I can just reach the row of pawns and beyond that it’s out of reach.
- I don’t know yet! The 50-100 figure comes from the ACT and diffusion policy papers but it depends on the type of task. For fine-tuning my sense is that you only need a few hours’ worth of demos to get good results with pi0.5 etc. A big reason I’m doing this project is that I want to try all of this myself, so the next posts will definitely talk about that.
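On the dropped frames from the C920 mentioned above: one way to quantify the problem in a recording is to scan the frame timestamps for gaps. This is a rough sketch; `find_frame_gaps` and the `tolerance` factor are assumptions for illustration, and the timestamps are taken to be in nanoseconds.

```python
def find_frame_gaps(timestamps_ns, fps, tolerance=1.5):
    """Return (index, gap_ns, estimated_missing_frames) for each suspicious gap."""
    period_ns = 1e9 / fps
    gaps = []
    for i in range(1, len(timestamps_ns)):
        gap = timestamps_ns[i] - timestamps_ns[i - 1]
        if gap > tolerance * period_ns:
            # A gap of about k periods means roughly k - 1 frames never arrived
            gaps.append((i, gap, round(gap / period_ns) - 1))
    return gaps

# 30 fps stream where two frames went missing before index 3
stamps = [0, 33_333_333, 66_666_667, 166_666_667, 200_000_000]
gaps = find_frame_gaps(stamps, fps=30)
```

Counting gaps per episode gives a simple number to compare cameras with, or to decide whether an episode is too damaged to keep.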
For understanding: I think the level of understanding is much deeper if I write the code myself vs. reading someone else’s. The same applies to coding agents of course, which is why I wrote most of it myself and only delegated some tasks (for example codex was a great help at setting up telemetry dashboards or writing the custom glfw renderer).
On control: LeRobot will change all the time and I’ll be unaware of what changed. If something suddenly doesn’t work anymore, it’s a pain to find out. I can of course fork or pin but that defeats the purpose a bit.
At the end it’s also partially just preference: I wanted to write this layer myself, and I have opinions about how it should be architected, so I did.
My project is https://github.com/colinator/Ariel - basically, no VLAs - instead, "just write code". Or have the agents do it.
I don't have a writeup yet about applying Ariel to _this_ robot, but this is for a previous one: https://colinator.github.io/Ariel/post1.html.
Excited to follow your progress!
Hope this helps!
Will email you to compare notes.
Reminds me of https://rodneybrooks.com/why-todays-humanoids-wont-learn-dex... which is basically a stark warning against the hype.
¹https://blog.comma.ai/mlsim/
Have you seen the recent nvidia thing? They do this at scale for robotics manipulation: https://research.nvidia.com/labs/gear/enpire/
Re your questions:
I am not an official supporter of the library but am asking out of curiosity.