AI Scientist Fei-Fei Li's Latest Essay, Reposted in Full

Fei-Fei Li is an AI scientist who recently published an essay arguing that spatial intelligence is the next frontier. Rather than relying on pieces that have been reposted and reworked countless times, I went to Professor Li's original text to get the information first-hand. I'm sharing it here as well, in the hope that you'll find it worthwhile.

The English original is attached below; if you'd like to dig deeper, you can copy it and use a large language model to translate it or to help you understand it better.

Link: https://open.substack.com/pub/drfeifei/p/from-words-to-worlds-spatial-intelligence?r=6vinc5&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false


Dr. Fei-Fei Li

From Words to Worlds: Spatial Intelligence is AI’s Next Frontier

Fei-Fei Li

Nov 10, 2025

In 1950, when computing was little more than automated arithmetic and simple logic, Alan Turing asked a question that still reverberates today: can machines think? It took remarkable imagination to see what he saw: that intelligence might someday be built rather than born. That insight later launched a relentless scientific quest called Artificial Intelligence (AI). Twenty-five years into my own career in AI, I still find myself inspired by Turing’s vision. But how close are we? The answer isn’t simple.

Today, leading AI technologies such as large language models (LLMs) have begun to transform how we access and work with abstract knowledge. Yet they remain wordsmiths in the dark; eloquent but inexperienced, knowledgeable but ungrounded. Spatial intelligence will transform how we create and interact with real and virtual worlds—revolutionizing storytelling, creativity, robotics, scientific discovery, and beyond. This is AI’s next frontier.

The pursuit of visual and spatial intelligence has been the North Star guiding me since I entered the field. It’s why I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs). It’s why my academic lab at Stanford has spent the last decade combining computer vision with robotic learning. And it’s why my cofounders Justin Johnson, Christoph Lassner, Ben Mildenhall, and I created World Labs more than one year ago: to realize this possibility in full, for the first time.

In this essay, I’ll explain what spatial intelligence is, why it matters, and how we’re building the world models that will unlock it—with impact that will reshape creativity, embodied intelligence, and human progress.

Spatial Intelligence: The scaffolding of human cognition

AI has never been more exciting. Generative AI models such as LLMs have moved from research labs to everyday life, becoming tools of creativity, productivity, and communication for billions of people. They have demonstrated capabilities once thought impossible, producing coherent text, mountains of code, photorealistic images, and even short video clips with ease. It’s no longer a question of whether AI will change the world. By any reasonable definition, it already has.

Yet so much still lies beyond our reach. The vision of autonomous robots remains intriguing but speculative, far from the fixtures of daily life that futurists have long promised. The dream of massively accelerated research in fields like disease cures, new materials discovery, and particle physics remains largely unfulfilled. And the promise of AI that truly understands and empowers human creators—whether students learning intricate concepts in molecular chemistry, architects visualizing spaces, filmmakers building worlds, or anyone seeking fully immersive virtual experiences—remains beyond reach.

To learn why these capabilities remain elusive, we need to examine how spatial intelligence evolved, and how it shapes our understanding of the world.

Vision has long been a cornerstone of human intelligence, but its power emerged from something even more fundamental. Long before animals could nest, care for their young, communicate with language, or build civilizations, the simple act of sensing quietly sparked an evolutionary journey toward intelligence.

This seemingly isolated ability to glean information from the external world, whether a glimmer of light or the feeling of texture, created a bridge between perception and survival that only grew stronger and more elaborate as the generations passed. Layer upon layer of neurons grew from that bridge, forming nervous systems that interpret the world and coordinate interactions between an organism and its surroundings. Thus, many scientists have conjectured that perception and action became the core loop driving the evolution of intelligence, and the foundation on which nature created our species—the ultimate embodiment of perceiving, learning, thinking, and doing.

Spatial intelligence plays a fundamental role in defining how we interact with the physical world. Every day, we rely on it for the most ordinary acts: parking a car by imagining the narrowing gap between bumper and curb, catching a set of keys tossed across the room, navigating a crowded sidewalk without collision, or sleepily pouring coffee into a mug without looking. In more extreme circumstances, firefighters navigate collapsing buildings through shifting smoke, making split-second judgements about stability and survival, communicating through gestures, body language and a shared professional instinct for which there’s no linguistic substitute. And children spend the entirety of their pre-verbal months or years learning the world through playful interactions with their environments. All of this happens intuitively, automatically—a fluency machines have yet to achieve.

Spatial Intelligence is also foundational to our imagination and creativity. Storytellers create uniquely rich worlds in their minds and leverage many forms of visual media to bring them to others, from ancient cave painting to modern cinema to immersive video games. Whether it’s children building sandcastles on the beach or playing Minecraft on the computer, spatially-grounded imagination forms the basis for interactive experiences in real or virtual worlds. And in many industry applications, simulations of objects, scenes and dynamic interactive environments power countless numbers of critical business use cases from industrial design to digital twins to robotic training.

History is full of civilization-defining moments where spatial intelligence played central roles. In ancient Greece, Eratosthenes transformed shadows into geometry—measuring a 7-degree angle in Alexandria at the exact moment the sun cast no shadow in Syene—to calculate the Earth’s circumference. Hargreaves’s “Spinning Jenny” revolutionized textile manufacturing through a spatial insight: arranging multiple spindles side-by-side in a single frame allowed one worker to spin multiple threads simultaneously, increasing productivity eightfold. Watson and Crick discovered DNA’s structure by physically building 3D molecular models, manipulating metal plates and wire until the spatial arrangement of base pairs clicked into place. In each case, spatial intelligence drove civilization forward when scientists and inventors had to manipulate objects, visualize structures, and reason about physical spaces, none of which can be captured in text alone.
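The arithmetic behind Eratosthenes' measurement is simple enough to reproduce. A short sketch using the classical figures (the angle is usually quoted as about 7.2 degrees, i.e. 1/50 of a circle, and the Syene-to-Alexandria distance as roughly 5,000 stadia):

```python
# Eratosthenes' insight: the shadow angle in Alexandria equals the angular
# separation of the two cities along the Earth's circumference, because the
# sun's rays arrive essentially parallel.
shadow_angle_deg = 7.2    # classical figure: ~1/50 of a full circle
distance_stadia = 5000    # classical estimate of Syene-to-Alexandria distance

# The arc between the cities is (angle / 360) of the full circumference,
# so scale the known distance up accordingly.
circumference_stadia = distance_stadia * 360 / shadow_angle_deg
print(circumference_stadia)  # 250000.0
```

The result, 250,000 stadia, is the figure traditionally attributed to Eratosthenes and is within a few percent of the Earth's actual circumference, depending on which ancient stadion one assumes.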

Spatial Intelligence is the scaffolding upon which our cognition is built. It’s at work when we passively observe or actively seek to create. It drives our reasoning and planning, even on the most abstract topics. And it’s essential to the way we interact—verbally or physically, with our peers or with the environment itself. While most of us aren’t revealing new truths on the level of Eratosthenes most days, we routinely think in the same way—making sense of a complex world by perceiving it through our senses, then leveraging an intuitive understanding of how it works in physical, spatial terms.

Unfortunately, today’s AI doesn’t think like this yet.

Tremendous progress has indeed been made in the past few years. Multimodal LLMs (MLLMs), trained with voluminous multimedia data in addition to textual data, have introduced some basics of spatial awareness, and today’s AI can analyze pictures, answer questions about them, and generate hyperrealistic images and short videos. And through breakthroughs in sensors and haptics, our most advanced robots can begin to manipulate objects and tools in highly constrained environments.

Yet the candid truth is that AI’s spatial capabilities remain far from human level. And the limits reveal themselves quickly. State-of-the-art MLLMs rarely perform better than chance on estimating distance, orientation, and size—or “mentally” rotating objects by regenerating them from new angles. They can’t navigate mazes, recognize shortcuts, or predict basic physics. AI-generated videos—nascent and yes, very cool—often lose coherence after a few seconds.

While current state-of-the-art AI can excel at reading, writing, research, and pattern recognition in data, these same models bear fundamental limitations when representing or interacting with the physical world. Our view of the world is holistic—not just what we’re looking at, but how everything relates spatially, what it means, and why it matters. Understanding this through imagination, reasoning, creation, and interaction—not just descriptions—is the power of spatial intelligence. Without it, AI is disconnected from the physical reality it seeks to understand. It cannot effectively drive our cars, guide robots in our homes and hospitals, enable entirely new ways of immersive and interactive experiences for learning and recreation, or accelerate discovery in materials science and medicine.

The philosopher Wittgenstein once wrote that “the limits of my language mean the limits of my world.” I’m not a philosopher. But I know at least for AI, there is more than just words. Spatial intelligence represents the frontier beyond language—the capability that links imagination, perception and action, and opens possibilities for machines to truly enhance human life, from healthcare to creativity, from scientific discovery to everyday assistance.

The next decade of AI: Building truly spatially intelligent machines

So how do we build spatially-intelligent AI? What’s the path to models capable of reasoning with the vision of Eratosthenes, engineering with the precision of an industrial designer, creating with the imagination of a storyteller, and interacting with their environment with the fluency of a first responder?

Building spatially intelligent AI requires something even more ambitious than LLMs: world models, a new type of generative model whose capabilities for understanding, reasoning about, generating, and interacting with semantically, physically, geometrically, and dynamically complex worlds—virtual or real—are far beyond the reach of today’s LLMs. The field is nascent, with current methods ranging from abstract reasoning models to video generation systems. World Labs was founded in early 2024 on this conviction: that foundational approaches are still being established, making this the defining challenge of the next decade.

In this emerging field, what matters most is establishing the principles that guide development. For spatial intelligence, I define world models through three essential capabilities:

1. Generative: World models can generate worlds with perceptual, geometrical, and physical consistency

World models that unlock spatial understanding and reasoning must also generate simulated worlds of their own. They must be capable of spawning endlessly varied and diverse simulated worlds that follow semantic or perceptual instructions—while remaining geometrically, physically, and dynamically consistent—whether representing real or virtual spaces. The research community is actively exploring whether these worlds should be represented implicitly or explicitly in terms of the innate geometric structures. Furthermore, in addition to powerful latent representations, I believe the outputs of a universal world model must also allow the generation of an explicit, observable state of the worlds for many different use cases. In particular, its understanding of the present must be tied coherently to its past; to the previous states of the world that led to the current one.

2. Multimodal: World models are multimodal by design

Just as animals and humans do, a world model should be able to process inputs—known as “prompts” in the generative AI realm—in a wide range of forms. Given partial information—whether images, videos, depth maps, text instructions, gestures, or actions—world models should predict or generate world states as complete as possible. This requires processing visual inputs with the fidelity of real vision while interpreting semantic instructions with equal facility. This enables both agents and humans to communicate with the model about the world through diverse inputs and receive diverse outputs in return.

3. Interactive: World models can output the next states based on input actions

Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given only an action with or without a goal state as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that in the case of a given goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.
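Taken together, the three capabilities above can be caricatured as a minimal interface. This is purely an illustrative sketch: the class, the method names, and the toy "dynamics" are my own assumptions for exposition, not World Labs' actual models or API:

```python
from dataclasses import dataclass


@dataclass
class WorldState:
    """An explicit, observable snapshot of a toy world: named object positions."""
    objects: dict  # name -> (x, y, z)


class ToyWorldModel:
    """Illustrative stand-in for the three world-model capabilities."""

    def generate(self, prompt: str) -> WorldState:
        # 1. Generative: spawn a world state following the prompt. A real
        # model must stay geometrically and physically consistent; this toy
        # simply returns a fixed scene for any prompt.
        return WorldState(objects={"mug": (0.0, 0.0, 1.0)})

    def step(self, state: WorldState, action: tuple) -> WorldState:
        # 3. Interactive: predict the next world state from an action,
        # consistent with the previous state. Toy "dynamics": the action
        # rigidly displaces every object.
        dx, dy, dz = action
        return WorldState(objects={
            name: (x + dx, y + dy, z + dz)
            for name, (x, y, z) in state.objects.items()
        })


# 2. Multimodal: a full model would also accept images, depth maps, gestures,
# or actions as prompts; this toy accepts text only.
model = ToyWorldModel()
state = model.generate("a mug on a table")
state = model.step(state, (0.5, 0.0, 0.0))
print(state.objects["mug"])  # (0.5, 0.0, 1.0)
```

The point of the sketch is only the shape of the contract: a prompt in, an explicit world state out, and a next state for every action applied to it.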

The scope of this challenge exceeds anything AI has faced before.

While language is a purely generative phenomenon of human cognition, worlds play by much more complex rules. Here on Earth, for instance, gravity governs motion, atomic structures determine how light produces colors and brightness, and countless physical laws constrain every interaction. Even the most fanciful, creative worlds are composed of spatial objects and agents that obey the physical laws and dynamical behaviors that define them. Reconciling all of this consistently—the semantic, the geometric, the dynamic, and the physical—demands entirely new approaches. The dimensionality of representing a world is vastly more complex than that of a one-dimensional, sequential signal like language. Achieving world models that deliver the kind of universal capabilities we enjoy as humans will require overcoming several formidable technical barriers. At World Labs, our research teams are devoted to making fundamental progress toward that goal.

Here are some examples of our current research topics:

  • A new, universal task function for training: Defining a universal task function as simple and elegant as next-token prediction in LLMs has long been a central goal of world model research. The complexities of both their input and output spaces make such a function inherently more difficult to formulate. But while much remains to be explored, this objective function and corresponding representations must reflect the laws of geometry and physics, honoring the fundamental nature of world models as grounded representations of both imagination and reality.
  • Large-scale training data: Training world models requires far more complex data than text curation. The promising news: massive data sources already exist. Internet-scale collections of images and videos represent abundant, accessible training material—the challenge lies in developing algorithms that can extract deeper spatial information from these two-dimensional image or video frame-based signals (i.e. RGB). Research over the past decade has shown the power of scaling laws linking data volume and model size in language models; the key unlock for world models is building architectures that can leverage existing visual data at comparable scale. In addition, I would not underestimate the power of high-quality synthetic data and additional modalities like depth and tactile information. They supplement the internet scale data in critical steps of the training process. But the path forward depends on better sensor systems, more robust signal extraction algorithms, and far more powerful neural simulation methods.
  • New model architecture and representational learning: World model research will inevitably drive advances in model architecture and learning algorithms, particularly beyond the current MLLM and video diffusion paradigms. Both of these typically tokenize data into 1D or 2D sequences, which makes simple spatial tasks unnecessarily difficult, like counting unique chairs in a short video, or remembering what a room looked like an hour ago. Alternative architectures may help, such as 3D or 4D-aware methods for tokenization, context, and memory. For example, at World Labs, our recent work on a real-time generative frame-based model called RTFM has demonstrated this shift: it uses spatially-grounded frames as a form of spatial memory to achieve efficient real-time generation while maintaining persistence in the generated world.
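For contrast, the "simple and elegant" LLM objective that the first bullet refers to fits in a few lines; the open question the essay poses is what the analogous loss looks like when the target is a geometrically and physically consistent world state rather than the next token. A minimal, illustrative next-token cross-entropy in NumPy (not any particular model's training code):

```python
import numpy as np


def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Average cross-entropy of predicting token t+1 from tokens <= t.

    logits:  (sequence_length, vocab_size) raw model scores per position
    targets: (sequence_length,) the observed next token at each position
    """
    # Numerically stable softmax over the vocabulary at each position.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    # Negative log-likelihood of the observed next tokens.
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(nll.mean())


# A 3-step sequence over a toy 5-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])
print(next_token_loss(logits, targets))
```

Everything about this objective leans on the signal being a one-dimensional sequence over a finite vocabulary; a world-model objective has no such obvious factorization, which is exactly why the bullet calls formulating it a central open problem.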

Clearly, we are still facing daunting challenges before we can fully unlock spatial intelligence through world modeling. This research isn’t just a theoretical exercise. It is the core engine for a new class of creative and productivity tools. And the progress within World Labs has been encouraging. We recently shared with a limited number of users a glimpse of Marble, the first ever world model that can be prompted by multimodal inputs to generate and maintain consistent 3D environments for users and storytellers to explore, interact with, and build further in their creative workflow. And we are working hard to make it available to the public soon!

Marble is only our first step in creating a truly spatially intelligent world model. As the progress accelerates, researchers, engineers, users, and business leaders alike are beginning to recognize its extraordinary potential. The next generation of world models will enable machines to achieve spatial intelligence on an entirely new level—an achievement that will unlock essential capabilities still largely absent from today’s AI systems.

Using world models to build a better world for people

It matters what motivates the development of AI. As one of the scientists who helped usher in the era of modern AI, my motivation has always been clear: AI must augment human capability, not replace it. For years, I’ve worked to align AI development, deployment, and governance with human needs. Extreme narratives of techno-utopia and apocalypse are abundant these days, but I continue to hold a more pragmatic view: AI is developed by people, used by people, and governed by people. It must always respect the agency and dignity of people. Its magic lies in extending our capabilities; making us more creative, connected, productive, and fulfilled. Spatial intelligence represents this vision—AI that empowers human creators, caregivers, scientists, and dreamers to achieve what was once impossible. This belief is what drives my commitment to spatial intelligence as AI’s next great frontier.

The applications of spatial intelligence span varying timelines. Creative tools are emerging now—World Labs’ Marble already puts these capabilities in creators’ and storytellers’ hands. Robotics represents an ambitious mid-term horizon as we refine the loop between perception and action. The most transformative scientific applications will take longer but promise a profound impact on human flourishing.

Across all these timelines, several domains stand out for their potential to reshape human capability. It will take significant collective effort, more than a single team or a company can possibly achieve. It will require participation across the entire AI ecosystem—researchers, innovators, entrepreneurs, companies, and even policymakers—working toward a shared vision. But this vision is worth pursuing. Here’s what that future holds:

Creativity: Superpowering storytelling and immersive experiences

“Creativity is intelligence having fun.” This is one of my favorite quotes by my personal hero Albert Einstein. Long before written language, humans told stories—painted them on cave walls, passed them through generations, built entire cultures on shared narratives. Stories are how we make sense of the world, connect across distance and time, explore what it means to be human, and most importantly, find meaning in life and love within ourselves. Today, spatial intelligence has the potential to transform how we create and experience narratives in ways that honor their fundamental importance, and extend their impacts from entertainment to education, from design to construction.

World Labs’ Marble platform will be putting unprecedented spatial capabilities and editorial controllability in the hands of filmmakers, game designers, architects, and storytellers of all kinds, allowing them to rapidly create and iterate on fully explorable 3D worlds without the overhead of conventional 3D design software. The creative act remains as vital and human as ever; the AI tools simply amplify and accelerate what creators can achieve. This includes:

  • Narrative experiences in new dimensions: Filmmakers and game designers are using Marble to conjure entire worlds without the constraints of budget or geography, exploring varieties of scenes and perspectives that would have been intractable within a traditional production pipeline. As the lines between different forms of media and entertainment blur, we’re approaching fundamentally new kinds of interactive experiences that blend art, simulation, and play—personalized worlds where anyone, not just studios, can create and inhabit their own stories. With the rise of newer, more rapid ways to lift concepts and storyboards into full experiences, narratives will no longer be bound to a single medium, with creators free to build worlds with shared throughlines across myriad surfaces and platforms.
  • Spatial narratives through design: Essentially every manufactured object or constructed space must be designed in virtual 3D before its physical creation. This process is highly iterative and costly in terms of both time and money. With spatially intelligent models at their disposal, architects can quickly visualize structures before investing months into designs, walking through spaces that don’t yet exist—essentially telling stories about how we might live, work, and gather. Industrial and fashion designers can translate imagination into form instantly, exploring how objects interact with human bodies and spaces.
  • New immersive and interactive experiences: Experience itself is one of the deepest ways that we, as a species, create meaning. For the entirety of human history, there has been one singular 3D world: the physical one we all share. Only in recent decades, through gaming and early virtual reality (VR), have we begun to glimpse what it means to share alternate worlds of our own creation. Now, spatial intelligence combined with new form factors, like VR and extended reality (XR) headsets and immersive displays, elevates these experiences in unprecedented ways. We’re approaching a future where stepping into fully realized multi-dimensional worlds becomes as natural as opening a book. Spatial intelligence makes world-building accessible not just to studios with professional production teams but to individual creators, educators, and anyone with a vision to share.

Robotics: Embodied intelligence in action

Animals from insects to humans depend on spatial intelligence to understand, navigate and interact with their worlds. Robots will be no different. Spatially-aware machines have been the dream of the field since its inception, including my own work with my students and collaborators at my Stanford research lab. This is also why I’m so excited by the possibility of bringing them about using the kinds of models World Labs is building.

  • Scaling robotic learning via world models: The progress of robotic learning hinges on a scalable source of viable training data. Given the enormous state spaces of possibilities that robots have to learn to understand, reason about, plan in, and interact with, many have conjectured that a combination of internet data, synthetic simulation, and real-world capture of human demonstrations is required to truly create generalizable robots. But unlike for language models, training data for today’s robotics research is scarce. World models will play a defining role in this. As they increase in perceptual fidelity and computational efficiency, the outputs of world models can rapidly close the gap between simulation and reality. This will in turn help train robots across simulations of countless states, interactions, and environments.
  • Companions and collaborators: Robots as human collaborators, whether aiding scientists at the lab bench or assisting seniors living alone, can expand a workforce in dire need of more labor and productivity. But doing so demands spatial intelligence that perceives, reasons, plans, and acts while—and this is most important—staying empathetically aligned with human goals and behaviors. For instance, a lab robot might handle instruments so the scientist can focus on tasks needing dexterity or reasoning, while a home assistant might help an elderly person cook without diminishing their joy or autonomy. Truly spatially intelligent world models that can predict the next state, or possibly even the next actions, consistent with this expectation are critical for achieving this goal.
  • Expanding forms of embodiment: Humanoid robots play a role in the world we’ve built for ourselves. But the full benefit of innovation will come from a far more diverse range of designs: nanobots that deliver medicine, soft robots that navigate tight spaces, and machines built for the deep sea or outer space. Whatever their form, future spatial intelligence models must integrate both the environments these robots inhabit and their own embodied perception and movement. But a key challenge in developing these robots is the lack of training data in these wide varieties of embodied form factors. World models will play a critical role in simulation data, training environments, and benchmarking tasks for these efforts.

The Longer Horizon: Science, Healthcare, and Education

In addition to creative and robotics applications, spatial intelligence’s profound impact will also extend to fields where AI can enhance human capability in ways that save lives and accelerate discovery. I highlight below three areas of application that can be deeply transformative, though it goes without saying that the use cases of spatial intelligence extend across many more industries.

In scientific research, spatially intelligent systems can simulate experiments, test hypotheses in parallel, and explore environments inaccessible to humans—from deep oceans to distant planets. This technology can transform computational modeling in fields like climate science and materials research. By integrating multi-dimensional simulation with real-world data collection, these tools can lower compute barriers and extend what every laboratory can observe and understand.

In healthcare, spatial intelligence will reshape everything from laboratory to bedside. At Stanford, my students and collaborators have spent many years working with hospitals, elder care facilities, and patients at home. This experience has convinced me of spatial intelligence’s transformative potential here. AI can accelerate drug discovery by modeling molecular interactions in multiple dimensions, enhance diagnostics by helping radiologists spot patterns in medical imaging, and enable ambient monitoring systems that support patients and caregivers without replacing the human connection that healing requires, not to mention the potential of robots in helping our healthcare workers and patients in many different settings.

In education, spatial intelligence can enable immersive learning that makes abstract or complex concepts tangible, and create the iterative experiences so essential to how our brains and bodies are wired for learning. In the age of AI, the need for faster and more effective learning and reskilling is particularly important for both school-aged children and adults. Students can explore cellular machinery or walk through historical events in multiple dimensions. Teachers gain tools to personalize instruction through interactive environments. Professionals—from surgeons to engineers—can safely practice complex skills in realistic simulations.

Across all these domains, the possibilities are boundless, but the goal remains constant: AI that augments human expertise, accelerates human discovery, and amplifies human care—not replacing the judgment, creativity, and empathy that are central to being human.

Conclusion

The last decade has seen AI become a global phenomenon and an inflection point in technology, the economy, and even geopolitics. But as a researcher, educator, and now entrepreneur, it’s still the spirit behind Turing’s 75-year-old question that inspires me most. I still share his sense of wonder, and the challenge of spatial intelligence is what energizes me every day.

For the first time in history, we’re poised to build machines so in tune with the physical world that we can rely on them as true partners in the greatest challenges we face. Whether accelerating how we understand diseases in the lab, revolutionizing how we tell stories, or supporting us in our most vulnerable moments due to sickness, injury, or age, we’re on the cusp of technology that elevates the aspects of life we care about most. This is a vision of deeper, richer, more empowered lives.

Almost a half billion years after nature unleashed the first glimmers of spatial intelligence in the ancestral animals, we’re lucky enough to find ourselves among the generation of technologists who may soon endow machines with the same capability—and privileged enough to harness those capabilities for the benefit of people everywhere. Our dreams of truly intelligent machines will not be complete without spatial intelligence.

This quest is my North Star. Join me in pursuing it.

翻译:

从文字到世界:空间智能是人工智能的下一个前沿

1950 年,当时计算机不过是自动化算术和简单逻辑,艾伦·图灵提出了一个至今仍令人回响的问题:机器能思考吗?他需要非凡的想象力才能看出他所看到的:智慧或许有一天是被创造而非天生的。这一见解后来引发了一场名为人工智能(AI)的持续科学探索。在我自己从事人工智能工作已经二十五年了,我依然被图灵的愿景所激励。但我们有多近?答案并不简单。

如今,领先的人工智能技术,如大型语言模型(LLM)已开始改变我们获取和处理抽象知识的方式。然而他们依然是黑暗中的文字匠;口才流利但缺乏经验,知识丰富却缺乏根基。 空间智能将彻底改变我们创造和互动现实与虚拟世界的方式——彻底改变讲故事、创造力、机器人技术、科学发现及更多领域。这是人工智能的下一个前沿。

自进入该领域以来, 追求视觉和空间智能一直是指引我的北极星。这也是我花费多年时间构建 ImageNet 的原因,这是首个大规模视觉学习和基准测试数据集,也是促成现代人工智能诞生的三大关键要素之一,另外一是神经网络算法和现代类计算图形处理单元(GPU)。这也是为什么我在斯坦福的学术实验室在过去十年里一直将计算机视觉与机器人学习结合起来。这也是为什么我的联合创始人贾斯汀·约翰逊、克里斯托夫·拉斯纳、本·米尔登霍尔和我一年多前创立了世界实验室 :首次完全实现这一可能性。

在这篇文章中,我将解释什么是空间智能,为什么它重要,以及我们如何构建能够解锁空间智能的世界模型——这些模型将重塑创造力、具身智能和人类进步。

空间智能:人类认知的脚手架

人工智能从未如此令人兴奋。生成式 AI 模型如大型语言模型已从研究实验室走向日常生活,成为数十亿人创造力、生产力和沟通的工具。它们展示了曾经被认为不可能的能力,轻松生成连贯的文本、大量代码、逼真的图像,甚至是短视频片段。现在已经不再是人工智能是否会改变世界的问题。用任何合理的定义,它已经发生了。

然而,仍有许多事情超出我们的触及范围。自主机器人的愿景依然引人入胜,但仍然具有推测性,远非未来学家长期承诺的日常生活常设。疾病管理、新材料发现和粒子物理等领域的大幅加速研究梦想在很大程度上仍未实现。而真正理解并赋能人类创造者的人工智能的承诺——无论是学习分子化学复杂概念的学生、建筑师可视化空间、电影制作者构建世界,还是任何寻求完全沉浸式虚拟体验的人——依然遥不可及。

要了解为何这些能力依然难以捉摸,我们需要审视空间智能是如何演变的,以及它如何塑造我们对世界的理解。

Vision has long been a cornerstone of human intelligence, but its power stems from something more fundamental. Long before animals could build nests, care for their young, communicate with language, or build civilizations, the simple act of perception quietly set off an evolutionary journey toward intelligence.

This seemingly solitary ability to take in information from the outside world, whether a glimmer of light or a texture, built a bridge between perception and survival, one that grew stronger and more sophisticated with each generation. Layer upon layer of neurons grew from that bridge, forming nervous systems that could interpret the world and coordinate the interactions between organism and environment. Many scientists therefore speculate that perception and action became the core loop driving the evolution of intelligence, and the foundation on which nature created our species: the ultimate embodiment of perceiving, learning, thinking, and acting.

Spatial intelligence plays a fundamental role in how we interact with the physical world. Every day we rely on it for the most ordinary acts: parking a car by imagining the shrinking gap between bumper and curb, catching keys tossed from across the room, navigating a crowded sidewalk without collisions, or pouring coffee into a cup half-awake without looking. In more extreme cases, firefighters move through collapsing buildings amid shifting smoke, making split-second judgments about stability and survival and communicating through gestures, body language, and irreplaceable professional instinct. Children spend the months and years before language learning about the world through playful interaction with their environment. All of this happens intuitively and automatically, with a fluency machines have yet to achieve.

Spatial intelligence is also the foundation of our imagination and creativity. Storytellers conjure distinctive, rich worlds in their minds and use every kind of visual medium to bring them to others, from ancient cave paintings to modern cinema to immersive video games. Whether children build sandcastles on a beach or play Minecraft on a computer, spatialized imagination underlies interactive experiences in real and virtual worlds alike. In industry, the simulation of objects, scenes, and dynamic interactive environments drives countless critical business use cases, from industrial design to digital twins to robot training.

History is filled with civilization-defining moments in which spatial intelligence played a central role. In ancient Greece, Eratosthenes turned shadows into geometry, measuring a roughly 7-degree angle in Alexandria at the very moment the sun cast no shadow in Syene, to calculate the circumference of the Earth. Hargreaves's spinning jenny revolutionized textile manufacturing through a spatial insight: arranging multiple spindles side by side in a single frame let one worker spin many threads at once, multiplying productivity eightfold. Watson and Crick discovered the structure of DNA by physically building three-dimensional molecular models, manipulating metal plates and wire until the spatial arrangement of the base pairs came out right. In each case, spatial intelligence advanced civilization: scientists and inventors needed to manipulate objects, visualize structures, and reason about physical space in ways words alone could not express.
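Eratosthenes's reasoning reduces to a single proportion, sketched below in Python. The 7.2-degree angle and 800 km Alexandria-to-Syene distance are the commonly cited modern reconstructions (the "7 degrees" above is a rounding of the former), not figures from the essay itself.

```python
# Eratosthenes' estimate of Earth's circumference, as a quick sanity check.
# Assumptions (hedged): shadow angle in Alexandria ~7.2 degrees; north-south
# distance between Alexandria and Syene ~800 km (roughly 5,000 stadia).
angle_deg = 7.2          # shadow angle while the sun is directly overhead at Syene
distance_km = 800        # distance between the two cities

# The angle is the fraction of a full 360-degree circle subtended by that arc,
# so scaling the distance by 360/angle yields the full circumference.
circumference_km = distance_km * 360 / angle_deg
print(circumference_km)  # 40000.0, close to the modern value of ~40,075 km
```

With nothing but a shadow, a well, and a known distance, the proportion lands within about 0.2% of the modern figure.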

Spatial intelligence is the scaffolding of our cognition. It is at work whether we observe passively or actively seek to create. It drives our reasoning and planning, even on the most abstract topics. It is essential to how we interact, verbally and physically, with one another and with the environment itself. While most of us on most days are not uncovering new truths the way Eratosthenes did, our everyday thinking works the same way: we perceive a complex world through our senses, then draw on an intuitive understanding of how it works physically and spatially.

Unfortunately, today's AI does not yet think this way.

The past few years have indeed seen tremendous progress. Multimodal large language models (MLLMs), trained on large amounts of multimedia data in addition to text, have introduced rudiments of spatial awareness. Today's AI can analyze images, answer questions about them, and generate hyperrealistic images and short videos. Thanks to breakthroughs in sensors and haptics, our most advanced robots are beginning to manipulate objects and tools in highly constrained settings.

Yet the blunt truth is that AI's spatial capabilities are far from human level, and the limits show quickly. State-of-the-art MLLMs struggle to do much better than chance at estimating distance, orientation, and size, or at "mentally" rotating objects by regenerating them from new viewpoints. They cannot navigate mazes, recognize shortcuts, or predict basic physics. AI-generated video, nascent but undeniably cool, tends to lose coherence after a few seconds.

While today's most advanced AI excels at reading, writing, research, and pattern recognition in data, these models have fundamental limitations in representing or interacting with the physical world. Our understanding of the world is holistic: not just what we see, but how everything relates spatially, what it means, and why it matters. Grasping this through imagination, reasoning, creation, and interaction, not mere description, is the power of spatial intelligence. Without it, AI is disconnected from the physical reality it tries to understand. It cannot effectively drive our cars, guide robots in homes and hospitals, enable entirely new ways of immersive, interactive learning and entertainment, or accelerate discovery in materials science and medicine.

The philosopher Wittgenstein wrote, "The limits of my language mean the limits of my world." I am no philosopher. But I know that, at least for AI, there is more than words. Spatial intelligence represents the frontier beyond language: the capacity connecting imagination, perception, and action that opens possibilities for machines to truly improve human life, from healthcare to creativity, from scientific discovery to everyday assistance.

AI's Next Decade: Building Truly Spatially Intelligent Machines

So how do we build spatially intelligent AI? What would it take for models to reason with the perspective of an Eratosthenes, engineer with the precision of an industrial designer, create with the imagination of a storyteller, and interact with an environment with the fluency of a first responder?

Building spatially intelligent AI requires something more ambitious than an LLM: world models, a new type of generative model whose abilities to understand, reason, generate, and interact with semantically, physically, geometrically, and dynamically complex worlds, virtual or real, go far beyond the reach of today's LLMs. The field is still nascent, and current approaches range from abstract reasoning models to video generation systems. World Labs was founded in early 2024 on this conviction: the foundational methods are still being established, and this will be the defining challenge of the coming decade.

In such an emerging field, what matters most is establishing the principles that guide development. For spatial intelligence, I define world models through three essential capabilities:

1. Generative: world models can generate worlds that are perceptually, geometrically, and physically consistent

A world model that unlocks spatial understanding and reasoning must also generate its own simulated worlds. It must be able to produce endlessly diverse simulated worlds that follow semantic or perceptual instructions while remaining geometrically, physically, and dynamically consistent, whether the space is real or virtual. The research community is actively exploring whether these worlds should carry implicit or explicit representations of their underlying geometric structure. Moreover, beyond powerful latent representations, I believe the output of a general world model must also allow for explicit, observable world states generated for many different use cases. In particular, its understanding of the present must be tightly connected to the past, tracing back to the predecessor states that led to the world as it is now.

2. Multimodal: world models are multimodal by design

Like animals and humans, world models should be able to process inputs, called "prompts" in generative AI, in many forms. Given partial information, whether images, video, depth maps, text instructions, gestures, or actions, a world model should predict or generate as complete a world state as possible. This demands handling visual input with real-world fidelity while interpreting semantic instructions with equal facility. It lets both agents and humans communicate with the model about a world through diverse inputs and receive diverse outputs in return.

3. Interactive: world models can output the next world state in response to input actions

Finally, if actions and/or goals are part of a world model's prompt, its output must include the next state of the world, represented implicitly or explicitly. Given only an action as input, with or without a goal state, the world model should produce output consistent with the world's previous state, the intended goal state (if any), and its semantic meaning, physical laws, and dynamic behavior. As spatially intelligent world models become more powerful and robust in reasoning and generation, it is conceivable that, given a specific goal, a world model could itself predict not only the next state of the world but also the next action based on that new state.
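The three capabilities above can be summarized as a tiny interface sketch. Every name here (Prompt, WorldState, WorldModel and its methods) is a hypothetical illustration of the contract being described, not a World Labs API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, illustrative types: a "prompt" may mix modalities, and a
# "state" stands in for whatever implicit/explicit representation is used.
@dataclass
class Prompt:
    text: Optional[str] = None      # semantic instruction
    image: Optional[bytes] = None   # visual input
    action: Optional[str] = None    # agent action, e.g. "move_forward"

@dataclass
class WorldState:
    description: str                # explicit, observable summary of the state

class WorldModel:
    def generate(self, prompt: Prompt) -> WorldState:
        # 1. Generative: produce a consistent world from a prompt.
        # 2. Multimodal: any subset of prompt fields may be provided.
        parts = [p for p in (prompt.text, prompt.action) if p]
        return WorldState(description=" | ".join(parts) or "empty world")

    def step(self, state: WorldState, action: str) -> WorldState:
        # 3. Interactive: map (current state, action) -> next state.
        return WorldState(description=f"{state.description} -> {action}")

model = WorldModel()
s0 = model.generate(Prompt(text="a sunlit kitchen"))
s1 = model.step(s0, "open the fridge")
print(s1.description)  # a sunlit kitchen -> open the fridge
```

A real world model would of course replace these string manipulations with learned generation; the point is only the shape of the interface: generate from multimodal prompts, then step forward under actions.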

The scope of this challenge exceeds anything AI has faced before.

Whereas language is a purely generative phenomenon of human cognition, worlds obey far more complex rules. On Earth, for instance, gravity governs motion, atomic structure determines how light becomes color and brightness, and countless physical laws constrain every interaction. Even the most fantastical, creative worlds are made of spatial objects and agents that obey physical laws and dynamic behaviors. Reconciling all of this consistently (semantics, geometry, dynamics, and physics) demands entirely new approaches. Representing a world is far more complex than representing the one-dimensional, sequential signal of language. Achieving world models with the kind of universal capability we enjoy as humans will require overcoming several formidable technical obstacles. At World Labs, our research team is working to make fundamental progress toward this goal.

Here are a few examples of our current research topics:

  • A new universal training task function: Defining a universal task function as simple and elegant as next-token prediction in LLMs has long been a central goal of world model research. The complexity of their input and output spaces makes such a function inherently harder to formulate. But however much remains to be explored, this objective function and its corresponding representations must reflect the laws of geometry and physics, respecting the fundamental nature of world models as grounded representations of both imagination and reality.
  • Large-scale training data: Training world models requires data far more complex to curate than text. The good news: vast data sources already exist. Internet-scale collections of images and videos provide abundant, accessible training material; the challenge is developing algorithms that can extract deeper spatial information from these two-dimensional image or video-frame signals (i.e., RGB). The past decade of research has shown the power of the scaling laws relating data volume to model size in language models; the key for world models is building architectures that can exploit existing visual data at comparable scale. Nor would I underestimate the power of high-quality synthetic data and additional modalities such as depth and tactile information, which complement internet-scale data at critical points in training. But the way forward depends on better sensor systems, more robust signal-extraction algorithms, and more powerful neural simulation methods.
  • New model architectures and representation learning: World model research will inevitably push advances in model architectures and learning algorithms, especially beyond the current MLLM and video-diffusion paradigms. Both approaches typically flatten data into one- or two-dimensional sequences, making simple spatial tasks needlessly hard, such as counting the distinct chairs in a short video or remembering what a room looked like an hour ago. Alternative architectures may help, for example 3D- or 4D-aware methods for tokenization, context, and memory. At World Labs, our recent work on RTFM, a real-time frame-based generative model, demonstrates this shift: it uses spatially grounded frames as a form of spatial memory, enabling efficient real-time generation while maintaining persistence in the generated world.
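For contrast, the next-token objective that the first bullet above holds up as the benchmark of simplicity fits in a few lines. The toy "model" below is a generic illustration of that LLM loss, not anything specific to world models.

```python
import math

# Minimal sketch of the LLM training objective the text contrasts with:
# next-token prediction. Given a token sequence, the model assigns a
# probability to each next token; the loss is the average negative
# log-likelihood of the true next token.

def next_token_loss(tokens, prob_of):
    """Average cross-entropy of predicting tokens[t+1] from tokens[:t+1]."""
    losses = []
    for t in range(len(tokens) - 1):
        context, target = tokens[: t + 1], tokens[t + 1]
        p = prob_of(context, target)   # model's probability for the true next token
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# A toy "model" that spreads probability uniformly over a 4-token vocabulary:
uniform = lambda context, target: 0.25

loss = next_token_loss(["the", "cat", "sat", "down"], uniform)
print(round(loss, 4))  # 1.3863, i.e. log(4): no better than chance
```

One scalar loss over a one-dimensional sequence is the whole objective; formulating an equally universal target over geometrically and physically consistent world states is precisely the open problem the bullet describes.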

Clearly, serious challenges remain before world modeling can fully unlock spatial intelligence. This research is not merely a theoretical exercise; it is the core engine of a new generation of creative and productivity tools. Progress at World Labs is encouraging. We recently previewed Marble to a small group of users: the first world model that can generate and maintain consistent 3D environments from multimodal inputs, for users and storytellers to explore, interact with, and build upon in their creative workflows. We are working to make it publicly available as soon as possible!

Marble is only our first step toward world models with true spatial intelligence. As progress accelerates, researchers, engineers, users, and business leaders are beginning to recognize its extraordinary potential. The next generation of world models will give machines a whole new level of spatial intelligence, an achievement that will unlock critical capabilities still largely missing from today's AI systems.

Using World Models to Build a Better World

What matters is the motivation driving AI's development. As one of the scientists who helped usher in the modern era of AI, my motivation has always been clear: AI must augment human capability, not replace it. For years I have worked to align AI's development, deployment, and governance with human needs. Today, extreme techno-utopian and doomsday narratives abound, but I hold a more pragmatic view: AI is developed by people, used by people, and governed by people. It must always respect people's agency and dignity. Its magic lies in extending our capabilities, making us more creative, more connected, more productive, and more fulfilled. Spatial intelligence embodies this vision: AI that empowers human creators, caregivers, scientists, and dreamers to achieve what was once impossible. It is this conviction that drives my commitment to spatial intelligence as AI's next great frontier.

Applications of spatial intelligence span different time horizons. Creative tools are emerging now; World Labs's Marble is already putting these capabilities in the hands of creators and storytellers. Robotics represents an ambitious mid-term horizon as we close the loop between perception and action. The most transformative scientific applications will take longer but promise profound impact on human flourishing.

Across all these timelines, several domains stand out for their potential to reshape human capability. This will take an enormous collective effort, far beyond what any single team or company can accomplish. It will require the entire AI ecosystem (researchers, innovators, entrepreneurs, companies, even policymakers) working toward a shared vision. But it is a vision worth pursuing. Here is what that future looks like:

Creativity: Supercharged Storytelling and Immersive Experiences

"Creativity is intelligence having fun." That is one of my favorite quotes attributed to my personal hero, Albert Einstein. Long before writing, humans told stories, painting them on cave walls, passing them down through generations, and building entire cultures on shared narratives. Stories are how we make sense of the world, connect across distance and time, explore what it means to be human, and, above all, find meaning and love in our lives. Today, spatial intelligence has the potential to transform how we create and experience narrative, honoring its fundamental importance while extending its impact from entertainment to education, from design to architecture.

World Labs's Marble platform will give filmmakers, game designers, architects, and storytellers unprecedented spatial capability and editorial control, letting them rapidly create and iterate on fully explorable 3D worlds without the overhead of traditional 3D design software. The act of creation remains vibrantly human; AI tools simply amplify and accelerate what creators can achieve. This includes:

  • Storytelling in new dimensions: Filmmakers and game designers can use Marble to create entire worlds unconstrained by budget or geography, exploring varied scenes and perspectives that traditional production pipelines make impractical. As the boundaries between media forms and entertainment blur, we are approaching a fundamentally new kind of interactive experience fusing art, simulation, and gaming: personalized worlds in which anyone, not just studios, can create and inhabit their own stories. With new, faster ways of elevating concepts and storyboards into full experiences, storytelling will no longer be confined to a single medium; creators will be free to build worlds whose shared throughline spans countless surfaces and platforms.
  • Spatial storytelling through design: Essentially every manufactured object and constructed space must be designed in a virtual three-dimensional environment before it is physically produced. The process is highly iterative, consuming time and money. With spatially intelligent models, architects can rapidly visualize structures before spending months on design, walking into spaces that do not yet exist and, in essence, telling stories about how we live, work, and gather. Industrial and fashion designers can turn imagination into form in an instant, exploring how objects interact with bodies and spaces.
  • New immersive, interactive experiences: Experience itself is among the deepest ways our species creates meaning. For all of human history there has been one singular three-dimensional world: the physical one we share. Only in recent decades, through gaming and early virtual reality (VR), have we begun to glimpse what it means to share parallel worlds of our own making. Now, spatial intelligence combined with new form factors such as VR and extended reality (XR) headsets and immersive displays elevates those experiences in unprecedented ways. We are heading toward a future in which entering a full multidimensional world is as natural as opening a book. Spatial intelligence makes world-building accessible not only to studios with professional production teams but to individual creators, educators, and anyone with a vision.

Robotics: Embodied Intelligence in Practice

Animals from insects to humans rely on spatial intelligence to understand, navigate, and interact with their worlds. Robots will be no exception. Spatially aware machines have been a dream of mine since the field's early days, including in my work with students and collaborators at my Stanford research lab. It is also why I am so excited about advancing these projects with the models World Labs is building.

  • Scaling robot learning with world models: Progress in robot learning depends on scalable, viable solutions for training data. Given the enormous state space robots must learn to understand, reason about, plan in, and interact with, many speculate that creating truly generalizable robots will require a combination of internet data, synthetic simulation, and real-world human demonstration. Unlike for language models, though, training data is extremely scarce in today's robotics research. World models will play a decisive role here. As their perceptual fidelity and computational efficiency improve, the outputs of world models can rapidly shrink the sim-to-real gap, which in turn will let robots train in simulations spanning countless states, interactions, and environments.
  • Companions and collaborators: Whether assisting scientists in the lab or elderly people living alone, robots as human collaborators can expand a workforce badly in need of more labor and productivity. But this requires spatial intelligence that can perceive, reason, plan, and act while, critically, remaining empathetically aligned with human goals and behavior. A lab robot might operate instruments so scientists can focus on tasks requiring dexterity or judgment, while a home assistant might help an elderly person cook without diminishing their enjoyment or autonomy. Truly spatially intelligent world models, capable of predicting the next state and even actions consistent with that expectation, are essential to making this possible.
  • Expanded embodied forms: Humanoid robots have a role to play in a world we built for ourselves. But the full benefit of innovation will come from far more diverse designs: nanorobots that deliver medicine, soft robots that navigate tight spaces, machines built for the deep sea or outer space. Whatever the form, future spatially intelligent models must integrate each robot's environment along with its own perception and movement. A key challenge in developing such robots is the lack of training data across diverse embodiments; world models will play a crucial role in providing simulated data, training environments, and benchmark tasks.
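The "train in simulation via a world model" idea running through the bullets above can be sketched as a generic rollout loop. ToyWorldModel and GreedyPolicy are hypothetical stand-ins for illustration, not World Labs components, and the "learning" is omitted: the point is only that imagined experience is generated entirely inside the model.

```python
# Hypothetical stand-ins: a world model acting as a simulator and a trivial
# policy. Real systems would learn both; here the dynamics are hand-coded.

class ToyWorldModel:
    """Predicts the next state and a reward for (state, action)."""
    def step(self, state, action):
        next_state = state + (1 if action == "forward" else -1)
        reward = 1.0 if next_state > state else 0.0
        return next_state, reward

class GreedyPolicy:
    """A degenerate policy that always moves forward."""
    def act(self, state):
        return "forward"

def rollout(model, policy, state=0, horizon=5):
    """Generate imagined experience entirely inside the world model."""
    total_reward = 0.0
    for _ in range(horizon):
        action = policy.act(state)
        state, reward = model.step(state, action)
        total_reward += reward
    return state, total_reward

final_state, ret = rollout(ToyWorldModel(), GreedyPolicy())
print(final_state, ret)  # 5 5.0
```

In a real pipeline the rollouts would feed a policy optimizer, and the world model's fidelity is exactly what determines how well behavior learned in imagination transfers to the physical robot, i.e. how small the sim-to-real gap is.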

Longer Horizons: Science, Healthcare, and Education

Beyond creative and robotic applications, the deep impact of spatial intelligence will extend to domains where AI can elevate human capability, save lives, and accelerate discovery. I highlight three potentially transformative application areas below, though spatial intelligence's use cases genuinely span many more industries.

In scientific research, spatially intelligent systems could simulate experiments, test hypotheses in parallel, and explore environments humans cannot reach, from the deep ocean to distant planets. The technology could revolutionize computational modeling in fields such as climate science and materials research. By combining multidimensional simulation with real-world data collection, these tools could lower computational barriers and expand what every laboratory can observe and understand.

In healthcare, spatial intelligence will reshape everything from bench to bedside. At Stanford, my students and collaborators have worked for years in hospitals, senior-care facilities, and patients' homes. That experience convinced me of spatial intelligence's transformative potential here. AI could accelerate drug discovery by modeling multidimensional molecular interactions, strengthen diagnosis by helping radiologists spot patterns in medical images, and support ambient monitoring systems for patients and caregivers, all without displacing the human connection that healing requires, to say nothing of robots' potential to help healthcare workers and patients across many settings.

In education, spatial intelligence enables immersive learning that makes abstract or complex concepts concrete and tangible, and creates the iterative, hands-on experience so essential to how our brains and bodies learn. In the age of AI, faster and more effective learning and reskilling matter enormously, for schoolchildren and adults alike. Students could explore cellular machinery or walk through historical events in multisensory ways. Teachers gain tools for personalized instruction through interactive environments. Professionals from surgeons to engineers could safely practice complex skills in realistic simulations.

Across all these domains, the possibilities are boundless, but the goal remains the same: AI that amplifies human expertise, accelerates human discovery, and magnifies human care, without replacing the judgment, creativity, and empathy at the core of being human.

Conclusion

Over the past decade, AI has become a global phenomenon and an inflection point in technology, economics, even geopolitics. But as a researcher, an educator, and now an entrepreneur, what motivates me most is still the spirit behind the question Turing posed 75 years ago. I still share his curiosity, and it inspires me every day through the challenge of spatial intelligence.

For the first time in history, we are poised to build machines so attuned to the physical world that we can rely on them as true partners in tackling our greatest challenges. Whether accelerating our understanding of disease in the laboratory, revolutionizing how we tell stories, or supporting us at our most vulnerable moments of illness, injury, or old age, we stand at the threshold of technology that elevates the aspects of life we care about most. It is a vision of life made deeper, richer, and more empowered.

Almost half a billion years after nature unleashed the first glimmers of spatial intelligence in our animal ancestors, we are fortunate to be among the generation of technologists who may soon endow machines with the same capability, and privileged to harness it for the benefit of people everywhere. Our dream of truly intelligent machines will not be complete without spatial intelligence.

This quest is my North Star. Join me in pursuing it.

posted @ 2025-11-18 14:08  雪花AI学习笔记