Lip Sync


Image mode: 6 cr/s (max 35 s)

Video mode: 3 cr/s (max 120 s)

Lip Sync has two modes. Image mode animates a static portrait photo to speak in sync with an audio clip — no camera required. Video mode takes an existing video clip and replaces the voice with new audio, useful for dubbing, re-voicing, or adding narration to footage you already have.

Lip Sync requires a paid plan (Starter, Pro, or Business). Free accounts will see an upgrade prompt.

Image mode — animate a portrait

1. Navigate to Create → Lip Sync and select the Image tab.
2. Choose a portrait: pick a saved avatar or upload a new photo (JPG, PNG, or WEBP, max 4 MB). The photo should show a clear, forward-facing face.
3. Upload an audio file (MP3 or WAV, max 4 MB). Audio must be 35 seconds or shorter.
4. The page automatically detects the audio duration and shows the credit cost (duration × 6 cr/s).
5. Optionally add a Style Prompt to guide subtle expression and body movement.
6. Click Generate. The job typically completes in 30–90 seconds.
7. The output video appears in the results panel and is saved to your Library.
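
The Image mode limits above can be checked locally before uploading. The following is a minimal sketch, not part of the product: the function name `check_image_mode_inputs` and its parameters are hypothetical, and it assumes "4 MB" means 4 × 1024 × 1024 bytes and that file type can be judged from the file extension.

```python
from pathlib import Path

# Limits copied from the Image mode steps; everything else here is illustrative.
MAX_FILE_BYTES = 4 * 1024 * 1024  # assumption: "4 MB" is read as 4 MiB
MAX_AUDIO_SECONDS = 35
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
AUDIO_EXTENSIONS = {".mp3", ".wav"}


def check_image_mode_inputs(photo_name, photo_bytes, audio_name, audio_bytes, audio_seconds):
    """Return a list of problems; an empty list means the inputs fit the documented limits."""
    problems = []
    if Path(photo_name).suffix.lower() not in PHOTO_EXTENSIONS:
        problems.append("photo must be JPG, PNG, or WEBP")
    if photo_bytes > MAX_FILE_BYTES:
        problems.append("photo exceeds 4 MB")
    if Path(audio_name).suffix.lower() not in AUDIO_EXTENSIONS:
        problems.append("audio must be MP3 or WAV")
    if audio_bytes > MAX_FILE_BYTES:
        problems.append("audio exceeds 4 MB")
    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append("audio is longer than 35 seconds")
    return problems


if __name__ == "__main__":
    # A 20-second MP3 with a PNG portrait passes every check.
    print(check_image_mode_inputs("face.png", 1_000_000, "voice.mp3", 2_000_000, 20.0))
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, which is closer to how an upload form usually behaves.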

Video mode — re-voice an existing clip

1. Navigate to Create → Lip Sync and select the Video tab.
2. Upload a source video (MP4 or MOV) that contains a face. This is the clip whose voice will be replaced.
3. Upload a replacement audio file (MP3 or WAV, max 120 seconds).
4. Optionally upload a Reference Face photo. This is only needed if the video contains multiple people and you want to target a specific face.
5. Choose Video Length: No Extension (output matches the shorter of the video or audio) or Extend to Audio (the video is extended to match the full audio length).
6. Click Generate. The job typically completes in 1–3 minutes depending on clip length.
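
The effect of the Video Length choice on output duration can be sketched as follows. This is an illustration only: `output_seconds` is a hypothetical helper, and the Extend to Audio branch assumes the source video is shorter than the audio, since the steps above do not describe what happens when the video is already longer.

```python
MAX_AUDIO_SECONDS = 120  # Video mode audio limit from the steps above


def output_seconds(video_seconds, audio_seconds, extend_to_audio):
    """Estimate the output length for the two Video Length options."""
    if audio_seconds > MAX_AUDIO_SECONDS:
        raise ValueError("replacement audio must be 120 seconds or shorter")
    if extend_to_audio:
        # Extend to Audio: the video is stretched to cover the full audio.
        # Assumption: the video is shorter than the audio in this branch.
        return audio_seconds
    # No Extension: the output matches the shorter of the two inputs.
    return min(video_seconds, audio_seconds)


if __name__ == "__main__":
    print(output_seconds(10.0, 25.0, extend_to_audio=False))
    print(output_seconds(10.0, 25.0, extend_to_audio=True))
```

With a 10-second clip and 25 seconds of audio, No Extension keeps the first 10 seconds, while Extend to Audio produces a 25-second result.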

Credit cost examples

Mode	Audio Duration	Credit Cost
Image	5 s	30 credits
Image	15 s	90 credits
Image	35 s (max)	210 credits
Video	30 s	90 credits
Video	60 s	180 credits
Video	120 s (max)	360 credits
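
The table rows follow directly from duration × rate. A minimal sketch of that arithmetic, assuming the cost is billed on audio duration as the table header suggests; `credit_cost` and `RATES` are hypothetical names, and how fractional seconds are rounded is not stated, so no rounding is applied here.

```python
# (credits per second, maximum audio seconds) for each mode, from the pricing above.
RATES = {
    "image": (6, 35),
    "video": (3, 120),
}


def credit_cost(mode, audio_seconds):
    """Return the credit cost for a clip: audio duration multiplied by the mode's rate."""
    rate, max_seconds = RATES[mode]
    if not 0 < audio_seconds <= max_seconds:
        raise ValueError(f"{mode} mode accepts audio up to {max_seconds} s")
    # Assumption: no rounding; the source only shows whole-second examples.
    return audio_seconds * rate


if __name__ == "__main__":
    # Reproduces the rows of the credit cost table.
    for mode, seconds in [("image", 5), ("image", 15), ("image", 35),
                          ("video", 30), ("video", 60), ("video", 120)]:
        print(mode, seconds, credit_cost(mode, seconds))
```

At the same duration, Video mode costs half as much as Image mode: a 30-second clip is 90 credits in Video mode, compared with 180 credits at the Image rate.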

Tips

Image mode: use a clean, well-lit headshot with a neutral expression for the most natural results
Image mode: avoid photos with heavy shadows across the face or extreme angles
Image mode: generate voiceover in Audio Studio, then upload it here for a full talking-head workflow
Video mode: the source video should have a single, clearly visible face for best accuracy
Video mode: use Reference Face when the video has multiple people and you only want to sync one of them
Keep clips concise — 5–15 seconds works best for short-form social content

Lip Sync works as a node inside the Storyboard canvas in both modes. Image mode: connect an Image Gen node to the portrait slot. Video mode: connect a Video Gen or Library node to the source video slot. Upload audio directly inside the node, then wire the output to a Video Combiner.