Spell and speak

This example is based on the work of Antonio Roberts (aka hellocatfood), who made spel aend spik, a generative video based on the phonemes used by a speech synthesizer. The sources for the project are published as a git repository. The project uses several command line tools: espeak, sox, python, ffmpeg and imagemagick.

Preliminary results by the students of the course

Speak the Audio to a File

espeak -v fr "bonjour je suis une ordinateur. . . tu peux m'addresser via la ligne de command. . . merci" -w bonjour.wav


sox bonjour.wav padded.wav pad 3 3


sox padded.wav -b 8 -e unsigned-integer -c 1 -r 4000 -t raw rawfile
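The sox command above writes a headerless file: one unsigned byte per sample, mono, at 4000 samples per second. As a rough sketch of what that means for the later steps (the helper names `center` and `duration_seconds` are made up for illustration, and the sample data is synthetic), silence sits around the value 128, and the byte count divided by 4000 gives the duration in seconds:

```python
SAMPLE_RATE = 4000  # matches "-r 4000" in the sox command


def center(raw):
    """Turn unsigned 8-bit samples (0..255) into signed values (-128..127)."""
    return [b - 128 for b in raw]


def duration_seconds(raw, sample_rate=SAMPLE_RATE):
    """One byte per sample, so the byte count is the sample count."""
    return len(raw) / sample_rate


# Synthetic stand-in for open("rawfile", "rb").read()
raw = bytes([128] * 4000 + [0, 255, 128, 200])
print(center(raw[-4:]))       # [-128, 127, 0, 72]
print(duration_seconds(raw))  # 1.001
```

At 10 frames per second, each video frame therefore covers 400 of these samples.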


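import os

framerate = 10
slice_len = 4000 // framerate  # samples per video frame at 4000 Hz
dat = open("rawfile", "rb").read()  # raw unsigned 8-bit mono samples

frames = []
for i in range(0, len(dat), slice_len):
    samples = [b - 128 for b in dat[i:i + slice_len]]
    # The original line is cut off here. As an assumption, the peak
    # absolute amplitude of the slice is used as its loudness.
    frames.append(max(abs(s) for s in samples))

pics = ["close.png", "semi.png", "open.png"]
max_mouthOpen = len(pics) - 1

# Loudness per mouth step; max(1, ...) avoids dividing by zero on silence
step = max(1, int(max(frames) / (max_mouthOpen * 2)))
for i in range(len(frames)):
    mouth = min(int(frames[i] / step), max_mouthOpen)
    if i:
        # Let the mouth move by at most one position per frame
        if mouth > frames[i-1] + 1:
            mouth = frames[i-1] + 1
        elif mouth < frames[i-1] - 1:
            mouth = frames[i-1] - 1
    else:
        mouth = 0
    frames[i] = mouth
    os.system("ln -s %s frame%09d.png" % (pics[mouth], i))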

NB: Change the names of the three images (close, semi, open) to match your own files. If you are using JPEG images, change every "png" to "jpg", including the frame name in the line near the end of the script that creates the links.

Sound to Image

python speak.py

Running the speak.py script maps the raw audio to frames, creating numbered links to one of the three original images corresponding to the loudness level in the audio file.
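The mapping from loudness to mouth image can be tried on its own, without any audio file. The sketch below is only an illustration: the function names are hypothetical, the input bytes are synthetic, and "loudness" is assumed to be the peak deviation from the midpoint 128 within each frame-sized chunk, which may differ from what the original script measures.

```python
def loudness_per_frame(raw, samples_per_frame):
    """Peak deviation from the midpoint 128 for each frame-sized chunk
    (an assumed loudness measure)."""
    return [
        max(abs(b - 128) for b in raw[i:i + samples_per_frame])
        for i in range(0, len(raw), samples_per_frame)
    ]


def mouth_positions(levels, n_pics=3):
    """Scale loudness to picture indices, moving at most one step per frame."""
    top = n_pics - 1
    step = max(1, max(levels) // (top * 2))  # guard against all-silent input
    positions = []
    for i, level in enumerate(levels):
        mouth = min(level // step, top)
        if i == 0:
            mouth = 0  # always start with the mouth closed
        else:
            prev = positions[-1]
            mouth = max(prev - 1, min(prev + 1, mouth))
        positions.append(mouth)
    return positions


# Five tiny "frames" of four samples each: silent, quiet, loud, loud, silent
raw = bytes([128] * 4 + [138] * 4 + [228] * 4 + [228] * 4 + [128] * 4)
levels = loudness_per_frame(raw, 4)
print(levels)                   # [0, 10, 100, 100, 0]
print(mouth_positions(levels))  # [0, 0, 1, 2, 1]
```

The one-step limit is what keeps the mouth from jumping straight from closed to open: the loud third frame only reaches the half-open picture, and the fully open one follows a frame later.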

frames + sound = video

ffmpeg -r 10 -i frame%09d.png -i padded.wav -y voice.mp4


If your frames are JPEG images, use this instead:

ffmpeg -r 10 -i frame%09d.jpg -i padded.wav -y voice.mp4

"animate" Script version 1

espeak "and now for something completely different" -v en -p 99 -s 50 -w hello.wav
sox hello.wav padded.wav pad 3 3
sox padded.wav -b 8 -e unsigned-integer -c 1 -r 4000 -t raw rawfile
rm frame*.jpg
python speak.py
ffmpeg -r 10 -i frame%09d.jpg -i padded.wav -y output.mp4
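If you want to vary the text, voice or image type without editing the script by hand, the same steps could be generated from Python. This is only a sketch: `build_commands` and its parameters are hypothetical names, it uses just the flags that appear in the script above, and it prints the commands rather than running them. Removing stale frames (the `rm` step) is left out.

```python
import shlex


def build_commands(text, voice="en", pitch=99, speed=50,
                   framerate=10, ext="jpg", out="output.mp4"):
    """Return the pipeline as a list of argument lists, in execution order."""
    return [
        ["espeak", text, "-v", voice, "-p", str(pitch), "-s", str(speed),
         "-w", "hello.wav"],
        ["sox", "hello.wav", "padded.wav", "pad", "3", "3"],
        ["sox", "padded.wav", "-b", "8", "-e", "unsigned-integer",
         "-c", "1", "-r", "4000", "-t", "raw", "rawfile"],
        ["python", "speak.py"],
        ["ffmpeg", "-r", str(framerate), "-i", "frame%09d." + ext,
         "-i", "padded.wav", "-y", out],
    ]


# Print the commands in a form that can be pasted into a shell
for cmd in build_commands("and now for something completely different"):
    print(shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)`, assuming espeak, sox and ffmpeg are installed. Note that the framerate given to ffmpeg has to match the `framerate` value inside speak.py.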