How we are training a PPO agent to play tennis

ArtificialSeed team

Intro

Humanoid simulation is one of the most demanding areas in robotics research. Even comparatively simple problems, such as robotic hand manipulation, already require extensive exploration and technical expertise to achieve consistent results. Against this backdrop, we set out to investigate a more ambitious challenge: training a humanoid agent to play tennis.

Initial challenges

No aspiring tennis player will be surprised to hear that it’s difficult to just start and play. We learned this the hard way, both in simulation and on the court.

Physics modelling - We wanted a proper physics simulation and decided on IsaacSim, which took much of the load off us as engineers. Still, the number of knobs and handles available makes your head spin.
Selection of an articulation (robot) - IsaacSim has a fairly extensive library of humanoid robots, which makes choosing an articulation particularly difficult, especially when trading off degrees of freedom (DOF) against training complexity.
Control - What are the benefits of torque control vs velocity control vs position control, and why do people prefer torque in humanoid tasks?
Observations - What exactly is the robot supposed to see? Does it need the same senses as us humans to play, or would fewer be enough?
Reward engineering - It’s easy for us humans to understand what it means to win in tennis: just score more points than the opponent. But how do we make an agent understand that, and how do we get it to solve the intermediate tasks?
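
To make the control question more concrete: under position control the policy outputs a joint target and a low-level PD loop turns it into torque, while under torque control the policy outputs that torque directly. The sketch below is only an illustration of that relationship. The function name, gains, and torque limit are hypothetical values chosen for the example, not IsaacSim's actuator API or settings from this project.

```python
def pd_torque(q_target, q, qd, kp=50.0, kd=2.0, max_torque=100.0):
    """Torque a PD position controller would apply to one joint.

    q_target: desired joint angle (rad), q: current angle (rad),
    qd: current joint velocity (rad/s).
    kp, kd and max_torque are hypothetical values for illustration.
    """
    # Spring-like pull toward the target, damped by the joint velocity.
    tau = kp * (q_target - q) - kd * qd
    # Real actuators saturate, so clamp to the torque limit.
    return max(-max_torque, min(max_torque, tau))


# Position control: the policy picks q_target, the PD loop picks the torque.
tau_from_position_policy = pd_torque(q_target=1.0, q=0.0, qd=0.0)

# Torque control: the policy's output is used as the torque itself
# (still clamped), with no PD loop in between.
policy_output = 35.0
tau_from_torque_policy = max(-100.0, min(100.0, policy_output))
```

With position control the gains shape every motion the agent can produce; with torque control the policy has to learn that shaping itself, which is harder to train but less constrained.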

Project progression

Step 1: Walking. A common view is that agents should be taught the way children are, especially when the goal is human-like behaviour. So we started from the beginning: walking. Humanoid walking is one of the better-known tasks in humanoid simulation, and it works out of the box in IsaacLab. Modifying the rewards so that the agent would walk to a physical object rather than toward some faraway point seemed like an easy enough task. All it required was normalization, so that rewards which had previously been unattainable, and thus a dangling carrot, could now actually be reached and were paid out in the same way.
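
As an illustration of what such a normalization can look like, here is a minimal sketch assuming a progress-style reward, where each step is rewarded by how much closer the agent got to the target. Dividing by the starting distance means that fully reaching the object is worth the same total reward no matter how far away it was placed. The function and parameter names are hypothetical and not taken from IsaacLab.

```python
import math


def progress_reward(prev_pos, pos, target, start_dist):
    """Fraction of the initial distance to the target covered this step.

    prev_pos, pos, target: (x, y) positions on the ground plane.
    start_dist: distance to the target at episode start (assumed > 0).
    """
    prev_d = math.dist(prev_pos, target)
    d = math.dist(pos, target)
    # Positive when the agent moved closer, negative when it moved away.
    return (prev_d - d) / max(start_dist, 1e-6)


# Walking 4 m to the target in 1 m steps: each step earns 0.25,
# and the whole approach sums to 1.0.
target = (4.0, 0.0)
path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
total = sum(
    progress_reward(a, b, target, start_dist=4.0)
    for a, b in zip(path, path[1:])
)
```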

Yet the walk lacked human grace, so we had to force the agent to be graceful.

Step one of our big task was solved: we are walking. On to step two.

Step 2: Hitting the ball. Interacting with an external object seems like an easy task as well. If a robotic hand can handle difficult manipulation, such as picking up objects and moving them around without dropping them, how hard could it be to just make a racket touch the ball?

Just applying new rewards to the same agent didn’t seem to motivate it. Even though the agent made contact with the ball, and sometimes even hit it, for some reason it couldn’t learn to hit it with power, or even with the racket. The solution was, as in most LLM projects, more layers.

Step 3: Running up to the flying ball and hitting it.

Do or die: our agent decided not to choose at all. Returning to the question of observations, after a couple of interactions with actual tennis players we understood that just having the ball’s velocity and position may not be enough for the agent to work out where it needs to stand.
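
One illustrative option for a richer observation, stated here as an assumption rather than a description of the environment used in this project, is to give the agent a predicted landing point for the ball. The sketch below uses a simple ballistic model with no drag and no spin, so it is only a rough estimate; the function name and the flat-ground assumption are hypothetical.

```python
import math

GRAVITY = 9.81  # m/s^2, acting along -z


def predict_landing(pos, vel, ground_z=0.0):
    """Estimate where and when the ball reaches ground height.

    pos: (x, y, z) ball position in metres.
    vel: (vx, vy, vz) ball velocity in m/s.
    Assumes drag-free, spin-free flight over flat ground.
    Returns (x, y, t), or None if the ball is already below the ground.
    """
    x, y, z = pos
    vx, vy, vz = vel
    height = z - ground_z
    if height < 0.0:
        return None
    # Positive root of: height + vz * t - 0.5 * g * t^2 = 0
    t = (vz + math.sqrt(vz * vz + 2.0 * GRAVITY * height)) / GRAVITY
    return (x + vx * t, y + vy * t, t)


# A ball 4.905 m up, moving horizontally at 2 m/s,
# lands after about 1 s, about 2 m further along x.
landing = predict_landing((0.0, 0.0, 4.905), (2.0, 0.0, 0.0))
```

Appending the landing point and time to the observation vector gives the policy a direct target for where to stand, instead of requiring it to learn the ballistics from raw position and velocity.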

Step 4: Forcing the agent to choose.

While our environment was improving step by step, it became clear that balancing while hitting the ball is a completely separate challenge, one that our team has yet to overcome. Ensuring that the agent can stay on its feet afterwards, that is, strike the ball without collapsing, seems to require too delicate a balance for reward modelling alone to achieve. The main difficulty is that the humanoid has to learn how to shift its weight, stabilize its torso, and recover from the swing so that one hit can lead to the next.
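
For reference, the kind of reward term this refers to is typically an uprightness bonus computed from the torso orientation. The sketch below shows one common form, assuming a unit quaternion in (w, x, y, z) order and a world frame with z pointing up. The function names and the weight are hypothetical, and a term like this on its own does not solve the weight-shifting and recovery problem described above.

```python
def torso_up_z(quat_wxyz):
    """z-component of the torso's local up axis in the world frame.

    quat_wxyz: unit quaternion (w, x, y, z) of the torso.
    Returns 1.0 when fully upright, 0.0 when horizontal, -1.0 when inverted.
    """
    w, x, y, z = quat_wxyz
    # Element R[2][2] of the rotation matrix built from the quaternion.
    return 1.0 - 2.0 * (x * x + y * y)


def balance_term(quat_wxyz, weight=0.5):
    """Uprightness bonus; weight is a hypothetical tuning value."""
    # No credit once the torso tips past horizontal.
    return weight * max(0.0, torso_up_z(quat_wxyz))


upright = balance_term((1.0, 0.0, 0.0, 0.0))   # standing straight
fallen = balance_term((0.0, 1.0, 0.0, 0.0))    # rotated 180 degrees about x
```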