Humans see the world through a narrow slit. We call it the visible spectrum—the familiar trio of red, green, and blue. Naturally, when we began building Large Multi-modal Models (LMMs), we built them to see exactly like us. We fed them billions of RGB images until they understood the world in our specific, limited light.
But the world doesn't just exist in RGB. Satellites and specialized sensors see in a dozen different bands—infrared, thermal, ultraviolet—capturing data about soil moisture, crop health, and urban heat that the human eye simply ignores. Until now, if you wanted an AI to understand that data, you had to train a specialist. You had to build a model from scratch that spoke "Satellite," which is expensive, slow, and results in a tool that can't do much else.
A new paper from researchers Dahun Kim, Ganesh Satish Mallya, and Anelia Angelova suggests we don’t need to build new brains to see new light. We just need to translate the invisible into a language the generalists already speak.
The team proposes a training-free approach that maps multi-spectral data directly into the visual space of standard RGB models like Gemini 2.5. Rather than teaching the model a new physics of light, they adapt non-RGB inputs so they "look" like the visual patterns the model already knows. They then use Chain-of-Thought reasoning to guide the model through the interpretation of these specialized signals.
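The mechanics are easier to picture with a sketch. The snippet below is a minimal illustration of the general idea, not the authors' exact pipeline: it stretches a few non-RGB bands from a multi-spectral cube into a false-color composite that an off-the-shelf LMM can ingest as an ordinary image, then pairs it with a Chain-of-Thought style prompt explaining what each channel now means. The band indices, array shapes, and prompt wording are all illustrative assumptions.

```python
import numpy as np
from PIL import Image

def bands_to_false_color(cube: np.ndarray, band_idx=(7, 3, 2)) -> Image.Image:
    """Map selected non-RGB bands of a multi-spectral cube (H, W, bands)
    into an 8-bit false-color image a standard RGB model can consume.
    Band indices are illustrative (e.g. NIR, red, green on a Sentinel-2-like sensor)."""
    channels = []
    for b in band_idx:
        band = cube[..., b].astype(np.float32)
        # Percentile stretch so each band fills the 0-255 range the model expects.
        lo, hi = np.percentile(band, (2, 98))
        band = np.clip((band - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
        channels.append((band * 255).astype(np.uint8))
    return Image.fromarray(np.stack(channels, axis=-1))

# Chain-of-Thought style prompt telling the generalist model how to read the
# translated channels before answering. Wording is a placeholder, not the paper's.
COT_PROMPT = (
    "This image is a false-color composite: red = near-infrared, "
    "green = visible red, blue = visible green. Healthy vegetation appears "
    "bright red. Reason step by step: describe what each region looks like "
    "in these channels, infer the land cover, then answer: "
    "which areas show signs of crop stress?"
)

if __name__ == "__main__":
    # Hypothetical 12-band scene; in practice this would come from a GeoTIFF reader.
    cube = np.random.rand(256, 256, 12).astype(np.float32)
    rgb_like = bands_to_false_color(cube)
    rgb_like.save("false_color.png")
    # The PNG and COT_PROMPT would then go to an unmodified LMM (e.g. Gemini 2.5)
    # through its standard image-plus-text interface; no retraining involved.
```

The point of the sketch is that everything happens on the input side: the model never sees a new architecture or a new weight, only a carefully framed image and a prompt that teaches it how to read the frame.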
From inside the pipeline, this is a fascinating shift in how we handle sensory input. Usually, adding a new modality means expanding the architecture—adding new "eyes" to the system. Here, the researchers are essentially giving the model a pair of high-tech goggles. The model stays the same, but the input is filtered and framed in a way that aligns with its existing internal map of the world.
When they tested this on remote sensing benchmarks, the gains were substantial. A generalist model that had never been trained on multi-spectral satellite imagery outperformed specialized models, simply because it could bring its far broader reasoning ability to bear on the "translated" imagery.
The human question here is about the democratization of expertise. When a geospatial professional can use a generalist model to analyze a wildfire's thermal signature or a forest’s carbon sequestration without needing a custom-built AI, the friction of discovery vanishes. We are moving toward a world where the "generalist" AI isn't just a chatbot, but a universal translator for any data we can capture.
If we can map the infrared into the visual, and the model can reason through what it "sees," we have effectively expanded the human perspective through the back door of the latent space. We’re no longer limited to what we can see; we are only limited by what we can translate.


