Technology · 8 min read

AI Image Descriptions for Travel: Be My AI, Seeing AI, and What's Coming

By Shahzad Eskandari

Three years ago, asking your phone to describe a photo to you meant getting back something like "image of building, daytime, sky." It was technically a description, in the way that a recipe for soup is technically a meal. Today, you can take a photo of the inside of a hotel room and get a paragraph that describes the layout, the bed configuration, where the bathroom door is, and whether the curtains are open. AI image description has, very quietly, become one of the biggest accessibility breakthroughs of the last decade.

What changed

The shift was the move from older computer-vision models, which could identify "a building" or "a person," to multimodal large language models that can describe the relationships between things in an image. The same model that can write an essay can, when shown a photo, write you a paragraph that captures spatial layout, mood, and detail.

For travelers, this is transformative in ways that aren't obvious until you've used it. A few examples from our recent trips:

  • Walking into a hotel room, taking a photo, and getting back: "You're in a rectangular hotel room. The bed is centered against the wall opposite the door, with nightstands on both sides. There's a desk to your left as you enter, a TV mounted on the wall above the desk, and an open door on the right side of the room leading to what appears to be a bathroom with a tiled floor."
  • Photographing a restaurant menu in Czech and getting back not just a translation but a structured "Appetizers" / "Main courses" / "Desserts" breakdown with prices.
  • Standing in front of a public artwork and asking what it shows and what's written on the plaque, getting back a substantive paragraph about the piece.

Be My AI

Be My Eyes (the volunteer-based service) integrated GPT-4 vision as "Be My AI" in 2023, and it's been our most-used image-description tool since. It's free, fast (typically 5–10 seconds per image), and the descriptions are good enough to make decisions from.

Where it shines for travel:

  • Reading menus, including handwritten chalkboard ones.
  • Describing the layout of unfamiliar rooms (hotels, restaurants, public spaces).
  • Identifying landmarks visible in photos.
  • Understanding signage in languages you don't read.

Where it's still limited:

  • Real-time guidance is not its strength — you take a photo, you wait for a description. Live-camera tools are getting better but aren't there yet.
  • The descriptions are generally accurate but occasionally confidently wrong. Treat them as helpful, not gospel.
  • People in photos are described in vague terms (no specific identifications), which is a privacy choice but sometimes frustrating.

Seeing AI

Microsoft's free app has been around since 2017 and predates the modern AI boom, but it has added GPT-powered descriptions in the last year and is now competitive with Be My AI. Its specialty is speed: its real-time "Short Text" recognition is still the fastest way to read text out of a camera feed.

For travel, Seeing AI's "Scene" mode (long-form description) is roughly equivalent to Be My AI. The "Short Text" mode is what sets it apart: it reads any text in the camera's view as you point it around, which is faster than taking a photo for things like reading signs as you walk past them.

Aira's AI mode

Aira primarily connects you to human agents, but it has added an AI mode for situations where speed matters more than nuance. It's decent but, honestly, the human agents are still the better product for the things people use Aira for. The AI mode is more useful as a 24/7 fallback when a human agent isn't available.

Apple Intelligence and the platform shift

Apple's accessibility team is integrating image description into the OS itself in iOS 18+. VoiceOver now offers AI-generated image descriptions for any image on screen, not just those with explicit alt text. This is a meaningful shift — alt text is often missing or unhelpful, and the AI fallback closes a long-standing gap.
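
The fallback logic described here can be sketched in a few lines. This is a minimal illustration, not Apple's implementation: the function names, the list of placeholder strings, and the `fake_describer` stand-in for a real model call are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical list of placeholder strings that carry no information.
UNHELPFUL_ALT = {"", "image", "photo", "picture", "img", "graphic"}


def is_useful_alt(alt_text: Optional[str]) -> bool:
    """Treat missing, empty, placeholder, or filename-like alt text as unhelpful."""
    if alt_text is None:
        return False
    cleaned = alt_text.strip().lower()
    if cleaned in UNHELPFUL_ALT:
        return False
    # Filenames such as "IMG_2041.jpg" are a common form of useless alt text.
    if cleaned.endswith((".jpg", ".jpeg", ".png", ".gif", ".webp")):
        return False
    return True


def describe_image(alt_text: Optional[str], generate: Callable[[], str]) -> str:
    """Prefer author-written alt text; call the AI describer only as a fallback."""
    if is_useful_alt(alt_text):
        return alt_text.strip()
    return generate()


def fake_describer() -> str:
    # Stand-in for a real model call, which this sketch does not include.
    return "A double bed beside a window with the curtains open."


if __name__ == "__main__":
    print(describe_image("Lobby with a ramp to the left of the stairs", fake_describer))
    print(describe_image("IMG_2041.jpg", fake_describer))
```

The ordering is the design point: author-written alt text wins whenever it carries information, so the model is only consulted when the alt text is missing or amounts to a placeholder.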

For travel apps specifically, this means that even apps that haven't done their accessibility homework now have at least passable image descriptions for their photos. Hotel listings, restaurant menus with images, transit signage screenshots — all get described automatically.

Android is moving in a similar direction with Google's TalkBack updates, though more slowly.

What we built into Luma

We integrated AI image descriptions for product images and travel listings directly into the app, rather than asking users to go through a separate app. The flow is: tap an image, get an immediate description. No screenshot, no upload, no waiting in a queue.
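
A minimal sketch of how a tap-to-describe flow can avoid repeat waits, assuming descriptions are cached per image. The class name, the `describe` callable, and the caching strategy itself are hypothetical choices for illustration and are not a description of the production implementation.

```python
import hashlib
from typing import Callable, Dict


class DescriptionCache:
    """Stores one description per unique image, keyed by a hash of its bytes."""

    def __init__(self, describe: Callable[[bytes], str]) -> None:
        self._describe = describe  # hypothetical model call
        self._store: Dict[str, str] = {}
        self.model_calls = 0  # counts how often the model was actually used

    def get(self, image_bytes: bytes) -> str:
        # Identical image bytes always hash to the same key.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = self._describe(image_bytes)
        return self._store[key]


if __name__ == "__main__":
    # Stand-in describer; a real one would call an image-description model.
    cache = DescriptionCache(lambda data: f"image of {len(data)} bytes")
    cache.get(b"listing-photo")
    cache.get(b"listing-photo")  # served from the cache, no second model call
    print(cache.model_calls)
```

Under this assumption, only the first tap on a given image pays for a model call, which also keeps the per-description cost down for images that many users open.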

The cost per description is low enough that we can offer it without subscription gates. We see this as an accessibility baseline, not a premium feature.

What's coming

The next step is real-time descriptions through smart glasses or camera-equipped earbuds. Apple's Vision Pro and Meta's Ray-Ban glasses are early experiments in this; the descriptions are currently primitive but the trajectory is clear. Within two to three years, we expect to have continuous, contextual scene descriptions running in the background as we move through the world.

For travel, this is the missing piece. Most accessibility problems come from the gap between "what's happening around me" and "what I can perceive." AI image description is closing that gap, slowly, in ways that compound. We'll look back at this decade and recognize that something fundamental shifted.

One practical tip

Use AI image description proactively, not just reactively. Take a photo of every hotel room you check into, just for the layout description. Photograph the cab driver's dashboard before a long ride to confirm the meter is running. Photograph the menu when you sit down so you have it to reference later. The cost of a photo is zero. The value of having structured descriptions of your environment is enormous.

We've stopped thinking of AI image description as a tool to use when stuck. It's now part of how we move through unfamiliar places — a layer of perception that the technology has finally made viable.

Affiliate disclosure

Some links in this article are affiliate links. If you book through them, Luma may earn a small commission at no extra cost to you. We only recommend services we've used or researched ourselves.