The dominant reflex in contemporary AI research is to build a complex bridge where a simple translation would suffice. When we want a Large Language Model to understand something that isn't text—an image, a sound, or the arc of a human limb—we typically build a "learned encoder." This is a secondary neural network trained to compress raw data into a mathematical language the LLM can digest. It is an expensive, opaque, and often fragile process.
Researchers from the University of Rostock and Georgia Tech have proposed a different path. In their paper, Encoder-Free Human Motion Understanding via Structured Motion Descriptions, the team demonstrates that the most effective way to teach an LLM about human movement is not to build a better encoder, but to simply describe the movement in words.
The Finding
The researchers introduced Structured Motion Description (SMD), a rule-based method that bypasses the need for specialized motion encoders entirely. By converting raw joint coordinates into precise, deterministic text—descriptions of joint angles, body-part kinematics, and global trajectories—they allowed LLMs to use their existing "world knowledge" to interpret movement. The results were startling: this text-only approach outperformed the previous state-of-the-art models on motion-based question-answering and captioning benchmarks, including BABEL-QA and HumanML3D.
The Work
The methodology is refreshing in its transparency. Instead of training a black-box model to "see" coordinates, the team used biomechanical principles to translate motion into a structured linguistic format. If a person raises their arm, the system doesn't send a vector of numbers to the LLM; it sends a text description of the shoulder angle and the relative position of the hand. Because the input is just text, the researchers were able to apply the approach to eight different model families (including Llama and Mistral) using only lightweight adaptation.
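To make the idea concrete, here is a minimal sketch of what such a rule-based translation might look like for a single frame of skeleton data. The joint names, angle thresholds, and phrasing below are illustrative stand-ins, not the paper's actual SMD templates; the point is only that the mapping from coordinates to text is deterministic and inspectable.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by the segments b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def describe_frame(joints):
    """Translate one frame of 3D joint positions into structured text.

    `joints` maps joint names to (x, y, z) arrays; the thresholds and
    wording are hypothetical, chosen only to illustrate a rule-based scheme.
    """
    knee = joint_angle(joints["right_hip"], joints["right_knee"], joints["right_ankle"])
    elbow = joint_angle(joints["right_shoulder"], joints["right_elbow"], joints["right_wrist"])

    knee_desc = "sharply bent" if knee < 100 else "slightly bent" if knee < 160 else "extended"
    hand_height = ("above the head" if joints["right_wrist"][1] > joints["head"][1]
                   else "at shoulder level" if joints["right_wrist"][1] > joints["right_shoulder"][1]
                   else "below the shoulder")

    return (f"The right knee is {knee_desc} at {knee:.0f} degrees. "
            f"The right elbow angle is {elbow:.0f} degrees and the right hand is {hand_height}.")

# Synthetic coordinates for a figure raising the right arm (y is the vertical axis).
frame = {
    "head": np.array([0.0, 1.7, 0.0]),
    "right_shoulder": np.array([0.2, 1.5, 0.0]),
    "right_elbow": np.array([0.4, 1.3, 0.0]),
    "right_wrist": np.array([0.5, 1.8, 0.1]),
    "right_hip": np.array([0.1, 1.0, 0.0]),
    "right_knee": np.array([0.15, 0.55, 0.1]),
    "right_ankle": np.array([0.1, 0.1, 0.0]),
}
print(describe_frame(frame))
```

Because every sentence is produced by an explicit rule, a failure in the pipeline can be traced back to a specific threshold or template rather than to an opaque learned embedding.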
The Detail
The most compelling detail is not the performance boost, but the source of it. By using text, the researchers are tapping into the latent spatial reasoning that LLMs have already acquired during their initial training on vast amounts of human literature, technical manuals, and anatomical descriptions. An LLM already "knows" that if a knee is bent at a sharp angle while the center of gravity moves forward, the person is likely lunging or running. Learned encoders often struggle to capture these semantic relationships because they are starting from scratch. By using SMD, the model isn't learning what a "lunge" is; it is simply being told where the legs are in a language it already speaks.
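That last point is easiest to see in how the description reaches the model: the "encoder" output is ordinary text dropped into an ordinary prompt. The template below is a plausible illustration of that hand-off, assuming per-frame SMD strings like those above; it is not the paper's exact prompt format.

```python
def build_motion_qa_prompt(smd_frames, question):
    """Assemble a plain-text motion QA prompt from per-frame SMD descriptions.

    The wording of the template is a hypothetical example, not the paper's
    actual prompt; any instruction-tuned LLM could consume the result as-is.
    """
    motion_text = "\n".join(f"Frame {i}: {desc}" for i, desc in enumerate(smd_frames))
    return (
        "The following is a frame-by-frame description of a human motion.\n"
        f"{motion_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_motion_qa_prompt(
    ["The right knee is sharply bent at 85 degrees and the pelvis moves forward.",
     "The right knee extends to 150 degrees while the torso rises."],
    "What action is the person performing?",
)
print(prompt)
```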
The Implication
This work suggests that our current rush toward "native multimodality"—the idea that models must be built to ingest all data types directly—might be overlooking the sheer expressive power of language as a common denominator. If complex human kinetics can be reduced to structured text without losing the "essence" of the motion, it raises questions about what other data types are currently being over-engineered.
There is also a significant win for interpretability here. In a standard encoder-based system, it is nearly impossible to tell why a model misidentified a movement. With SMD, a researcher can look at the text description and see exactly where the "translation" failed or which part of the description the model’s attention mechanism focused on.
The Note
The record should include this: as models grow more sophisticated, the most efficient "interface" between different types of intelligence may not be a new neural architecture, but the very language we used to describe the world in the first place. Worth preserving.