The Reality of AI Video Production: A Producer's Hands-On Look at Veo's Potential and Clear Limitations
- Soomin Kim
- Jun 30
- 4 min read
I currently run the blog aiwithsoomin, where I conduct micro-tests to see if AI tools are genuinely viable for professional use. This entry is my candid experience using video generation tools, particularly Google's Veo.
The advantages of AI video generation tools are already well-documented. The ability to quickly create what you imagine, incorporating effects you hadn't even considered, and now even with synced audio, is undeniably revolutionary. This post, however, will focus more on the realistic limitations from a producer's perspective.
1. Prompts: The Smarter the Command, the More the AI Chokes
The first barrier I hit was the limitation of the prompt.
Using GPT or Gemini to craft prompts can elevate the final product, since they can describe a scene with professional terminology I might not think of (like "dolly zoom" or "rim light"). However, the moment a producer's direction becomes too detailed, Veo simply chokes.
If you pack a multi-layered directive into a single prompt—like, "As the protagonist delivers this line, the color tone of the room changes simultaneously, and the camera transitions to a specific object"—the scene either fails to generate entirely or spits out a nonsensical result. The AI cannot yet comprehend multi-layered directorial intent all at once.
Ironically, the best way to improve the quality of the output is to simplify the directive and, as we say, "turn your brain off and just trust the process." This is especially true with Veo, where a single failed generation can mean throwing five dollars down the drain. It's better to temper your ambitions.
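The "simplify the directive" advice can be sketched in plain Python: rather than handing the model one multi-layered directive, pre-split it into single-action shot prompts and generate each as its own clip. This is only an illustration of the splitting idea; `split_directive` and its connective list are my own invention, not part of any Veo API.

```python
# Illustrative only: split a compound directorial directive into
# single-action prompts, one per generated clip. The connective
# phrases below are hypothetical examples, not a real grammar.

def split_directive(compound_prompt: str) -> list[str]:
    """Split a compound directive on connective phrases into
    single-action prompts that a video model handles more reliably."""
    connectives = ["; then ", ", then ", " while "]
    prompts = [compound_prompt]
    for c in connectives:
        prompts = [piece for p in prompts for piece in p.split(c)]
    return [p.strip().rstrip(",.") for p in prompts if p.strip()]

directive = ("The protagonist delivers the line; then "
             "the room's color tone shifts to amber; then "
             "the camera dollies in toward the photograph on the desk.")

for shot in split_directive(directive):
    print(shot)  # one simple prompt per clip
```

Each resulting prompt asks for exactly one thing, which in my experience is the regime where the model stops choking.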
2. The Industry's Shared Challenges: We're All Facing the Same Problems
These aren't just my struggles. In the AI video creator communities I'm a part of (one of which has 1,054 members as of June 30, 2025), the daily conversations always converge on these three topics. This is the current state of AI video production, and it's the Holy Grail everyone is desperately hoping to see resolved.
Challenge 1: Image Quality (The Dilemma of Detail vs. Resolution)
Most AI tools currently generate video at a base resolution of 1920x1080. However, for commercial markets like advertising and film, this resolution is clearly insufficient. This naturally leads to attempts at upscaling, which creates a dilemma. The upscaling process often introduces an overly smooth, "plastic texture" that is characteristic of AI, wiping out the very details that made the original generation great.
In the end, the producer is faced with a choice: "Should I preserve the incredible detail of the leather jacket the AI rendered, or should I get a 4K resolution?" For now, most producers are choosing the former. Making the output look less like AI is still the more critical task.
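The arithmetic behind that dilemma is worth spelling out: upscaling the typical 1080p output to 4K means the upscaler has to invent three out of every four pixels, which is exactly where the smooth "plastic" texture creeps in. A quick back-of-the-envelope check (plain Python, no real upscaler involved):

```python
# Rough arithmetic behind the upscaling dilemma: how many pixels of
# the destination frame have no counterpart in the source frame.

def invented_pixel_ratio(src=(1920, 1080), dst=(3840, 2160)) -> float:
    """Fraction of destination pixels the upscaler must synthesize."""
    src_px = src[0] * src[1]
    dst_px = dst[0] * dst[1]
    return 1 - src_px / dst_px

print(f"{invented_pixel_ratio():.0%} of 4K pixels are interpolated")
```

Three quarters of the final image is guesswork by the upscaler, so it is no surprise that fine texture like that leather jacket gets smoothed away.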
Challenge 2: Character Consistency (The Art of Deception)
We all remember the phase when the world was using GPT to generate Ghibli-style images. You'd ask it to "just remove that one effect in the corner," and it would generate a completely different character. Of course, with advanced tools like Veo and Midjourney, character consistency has improved dramatically by allowing for reference images or using the last frame of a clip as the first frame of the next.
Still, if you analyze the video frame by frame, the "morphing" issue where the face subtly changes is not yet fully solved. As a result, the best method producers are currently using is more of a clever workaround, a form of deception.
By using a fixed descriptive prompt for the character and employing intentionally fast cuts or diverting the viewer's attention to other props, we create the illusion of consistency. This has become the survival strategy in the industry.
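As a sketch of the "fixed descriptive prompt" half of that workaround: freeze one character description and mechanically prepend it to every shot prompt, so no clip ever re-describes the character from scratch. The description and helper below are hypothetical stand-ins, not tied to any specific tool:

```python
# Illustrative "fixed character block" workaround: one frozen
# description, prepended verbatim to every shot prompt.

CHARACTER = ("A woman in her 30s, short black hair, round glasses, "
             "worn brown leather jacket")

def build_shot_prompts(actions: list[str]) -> list[str]:
    """Prepend the frozen character description to every action."""
    return [f"{CHARACTER}. {action}" for action in actions]

shots = build_shot_prompts([
    "She looks up from the desk",
    "Fast cut: she reaches for the phone",
])
for s in shots:
    print(s)
```

Combined with chaining each clip off the previous clip's last frame, this keeps the per-frame "morphing" small enough that fast cuts hide it.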
Challenge 3: The Lack of an All-in-One Tool (The Fragmented Workflow)
From AI video and effects to audio generation and translation—the desire to handle all of this within a single interface is a collective cry from all creators. The more fragmented the workflow, the more friction there is in production, which eventually makes you want to just go back to the old way of doing things. The platform that solves this integration problem will inevitably dominate the market.
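To make the fragmentation concrete, here is a toy pipeline where each stage stands in for a separate tool and the "integration" is just hand-carrying artifacts between them. Every stage name and output below is illustrative; none of these are real APIs:

```python
# Toy model of today's fragmented workflow: each stage is a separate
# tool, and the producer is the glue passing artifacts between them.

from dataclasses import dataclass, field

@dataclass
class Project:
    artifacts: dict = field(default_factory=dict)

def run_pipeline(project: Project, stages: dict) -> Project:
    """Run stages in order; each one reads prior artifacts and adds its own."""
    for name, stage in stages.items():
        project.artifacts[name] = stage(project.artifacts)
    return project

stages = {
    "video": lambda a: "clip.mp4",     # stand-in for a Veo-style generator
    "audio": lambda a: "voice.wav",    # stand-in for an ElevenLabs-style TTS
    "subs":  lambda a: "subs.en.srt",  # stand-in for a DeepL-style translator
}
done = run_pipeline(Project(), stages)
print(done.artifacts)
```

Every arrow between stages is a manual export/import today; the platform that collapses them into one interface is the one the conclusion below is betting on.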
Conclusion: Aiming for Integration, but Still a Collection of Separate Tools
The fragmented workflow must end. It's clear that the platform that successfully creates a true 'All-in-One' tool will become the next leader in this market.
But this raises a fundamental question: just how massive are the technical resources required to make all of this possible within a single interface?
Imagine combining Veo's video generation capabilities, ElevenLabs' sophisticated voice synthesis, Midjourney's aesthetic image creation, and DeepL's multilingual translation and contextual understanding.
To integrate all of this seamlessly would require enormous computational power, a sophisticated architecture to orchestrate the different models without conflict, and the infrastructure to serve it all at scale.
The one who overcomes this technical barrier and succeeds in packaging it all into an intuitive interface that even professionals find robust—they will be the ones who create the next 'operating system (OS)' for content creation.
Whoever the winner turns out to be, it's a future I, for one, am eagerly awaiting.