Prompts that work for image generation often fail badly for text-to-video. Motion, camera behavior, and pacing need to be specified explicitly, and the structure that works is different from what most image prompters expect.
Why image prompt habits break in video
An image prompt describes a single moment. A video prompt describes something that unfolds over time: the model has to commit to what happens, keep subjects consistent from frame to frame, and produce plausible motion. A prompt that produces a great single frame often produces incoherent or boring video because it says nothing about movement.
The 6-block prompt structure
Write one block at a time, then merge into a single clean prompt. This separation forces you to think about each dimension explicitly.
—
Block 1: Subject and action
What is in the scene and what it is doing. Be specific about motion verbs.
Weak: a woman in a cafe
Strong: a woman sitting at a cafe window, slowly stirring her coffee, glancing toward the street
The action phrases (“slowly stirring”, “glancing”) tell the model what moves and how fast.
—
Block 2: Camera and shot
Type of shot and camera position.
extreme close-up on hands
medium shot, slightly low angle
wide establishing shot
over-the-shoulder perspective
Without this, models tend to default to a compositionally safe framing, often a static medium shot.
—
Block 3: Motion and pacing
Camera movement and scene energy. These are separate from subject motion.
Camera: slow dolly forward, static locked-off shot, gentle handheld drift, crane shot rising slowly
Pacing: unhurried, frenetic, calm and meditative, building tension
—
Block 4: Lighting
Be specific. “Good lighting” means nothing.
golden hour backlighting
cool blue neon ambient, rain-wet reflections
harsh overhead fluorescent
soft window light from the left
—
Block 5: Lens and focal length
This is the most underused block and one of the highest-leverage ones.
85mm portrait lens, shallow depth of field, background bokeh
24mm wide, everything in focus
telephoto compression, 200mm
fisheye lens distortion
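Lens terms shift the feel of the output substantially: focal length changes how much of the scene is visible and how compressed the background looks, and depth-of-field terms control how much of the frame is in focus.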
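One way to see how much a lens term changes the result is to hold every other block fixed and swap only the lens phrase, then run each variant through your tool. Below is a minimal Python sketch of that idea. The template text and the `lens_variants` helper are hypothetical and only build prompt strings; they do not call any video tool.

```python
from typing import List

# Lens phrases taken from the Block 5 list above.
LENS_OPTIONS = [
    "85mm portrait lens, shallow depth of field, background bokeh",
    "24mm wide, everything in focus",
    "telephoto compression, 200mm",
    "fisheye lens distortion",
]

# Hypothetical template: every block is fixed except the {lens} slot.
TEMPLATE = (
    "A woman sitting at a cafe window, slowly stirring her coffee. "
    "Medium shot, static locked-off camera. "
    "{lens}. "
    "Cinematic film look, subtle grain."
)


def lens_variants(template: str, lens_options: List[str]) -> List[str]:
    """Return one prompt per lens phrase, leaving the rest of the template unchanged."""
    variants = []
    for lens in lens_options:
        # Normalize the phrase so it reads as a sentence inside the template.
        phrase = lens.strip().rstrip(".")
        phrase = phrase[0].upper() + phrase[1:]
        variants.append(template.format(lens=phrase))
    return variants


variants = lens_variants(TEMPLATE, LENS_OPTIONS)
for v in variants:
    print(v)
```

Because only one block changes between variants, any difference in the generated clips can reasonably be attributed to the lens term rather than to other wording.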
—
Block 6: Style and negative constraints
Visual style and what to exclude.
Style: cinematic film look, subtle grain, documentary style, naturalistic, hyper-real commercial aesthetic
Negatives: no text, no watermarks, no jump cuts, no camera shake (add whatever the specific tool responds to)
—
Assembled example
A woman sitting at a cafe window, slowly stirring her coffee, glancing toward the street. Medium shot, slightly low angle. Static locked-off camera, calm and unhurried pacing. Golden hour backlighting from the window. 85mm portrait lens, shallow depth of field, background bokeh. Cinematic film look, subtle grain. No text, no watermarks.
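If you build prompts in a script, the write-blocks-then-merge step can be sketched in a few lines of Python. This is only an illustration: the `VideoPrompt` class, its field names, and the sentence-joining rule are assumptions made for this example, not part of any tool's API.

```python
from dataclasses import dataclass, field
from typing import List


def _as_sentence(text: str) -> str:
    """Trim a block, capitalize its first character, and end it with a period."""
    text = text.strip().rstrip(".")
    if not text:
        return ""
    return text[0].upper() + text[1:] + "."


@dataclass
class VideoPrompt:
    """One field per block. Field names are illustrative only."""
    subject_action: str
    camera_shot: str
    motion_pacing: str
    lighting: str
    lens: str
    style: str
    negatives: List[str] = field(default_factory=list)

    def assemble(self) -> str:
        """Merge the blocks, in order, into a single prompt string."""
        blocks = [
            self.subject_action,
            self.camera_shot,
            self.motion_pacing,
            self.lighting,
            self.lens,
            self.style,
        ]
        sentences = [_as_sentence(b) for b in blocks]
        if self.negatives:
            # Negatives become one final sentence: "No text, no watermarks."
            sentences.append(_as_sentence(", ".join(f"no {n}" for n in self.negatives)))
        # Skip any block that was left empty.
        return " ".join(s for s in sentences if s)


example = VideoPrompt(
    subject_action="a woman sitting at a cafe window, slowly stirring her coffee, glancing toward the street",
    camera_shot="medium shot, slightly low angle",
    motion_pacing="static locked-off camera, calm and unhurried pacing",
    lighting="golden hour backlighting from the window",
    lens="85mm portrait lens, shallow depth of field, background bokeh",
    style="cinematic film look, subtle grain",
    negatives=["text", "watermarks"],
)
print(example.assemble())
```

Keeping each block in its own field makes it easy to change one dimension (say, the lighting) without retyping the rest, and a block left empty is simply dropped from the merged prompt.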
—
Tips per tool
- Runway: responds well to camera instruction terms, generous with style keywords
- Kling: rewards shorter prompts, benefits most from explicit negative constraints
- Veo: documentary and cinematic reference styles trigger its best output
What structures are working for you? Drop your go-to video prompt formats below.
Curated by Selendia AI 🎥