Create Photo-Realistic Talking Face

Changbo Hu
This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang.

Outline
  - Introduction to the talking face
  - Motivations
  - System overview
  - Techniques
  - Conclusions

Introduction
  - What is a talking face?
      - Face (lip) animation driven by voice
  - Applications
  - The process of building a talking face
      - Face model
      - Motion capture
      - Mapping between audio and video
      - Rendering
  - Photo-realistic?

Literature
  - Waters, 93: DECface, 2D wireframe model
  - Terzopoulos, 95: skin and muscle model
  - Bregler, 97: Video Rewrite, sample-image based
  - T. S. Huang, 98: mesh model from range data
  - Poggio, 98: MikeTalk, viseme morphing
  - Guenter, 99: Making Faces, 3D from multiple cameras
  - Zhengyou Zhang, 00: 3D face modeling from video through the epipolar constraint
  - Cosatto, 00: planar quads model

Some face models
  (figures)

Motivations
  - Aim: a graphical interface for a conversational agent
  - Photo-realistic
  - Driven by Chinese speech
  - Smooth connection between sentences
  - Extended from “Video Rewrite”

System overview: pipeline of the system (1)
  Diagram labels: video with sound; images; sound; pose tracking; phoneme segmentation; annotation; lip motion tracking; training database

System overview: pipeline of the system (2)
  Diagram labels: new text; wav sound; TTS system; triphone sequence; segmentation; synthesized triphone sequence; training database; lip motion sequence; rewrite to faces; background sequence

Techniques
  - Analysis
      - Audio processing
      - Image processing
  - Synthesis
      - Lip image
      - Background image
      - Stitch together

Audio part: sound segmentation
  - Given the wav file and the script
  - Use an HMM to train the segmentation system
  - Segment the wav file into a phoneme sequence
  - Example of a segmentation result:

      Phoneme   Start   End
      SILOPEN   0       23
      SILOPEN   24      42
      s         43      61
      if4       62      74
      j         75      80
      ia1       81      97
      sh        98      109
      ang1      110     121
      y         122     130
      e4        131     133
      y         134     145
      in2       146     154
      h         155     164
      ang2      165     194

Annotation with phonemes
  - Use the phonemes to annotate the video frames
  - Each phoneme in a sentence corresponds to a short span of the video sequence
  Diagram labels: training sentence; audio frames; video frames; phoneme sequence; Phoneme 1 and its frames; Phoneme 2 and its frames

Phoneme distance analysis
  - Phoneme and triphone basics
  - Chinese phonemes vs. English phonemes
  - Distance metric definitions
  - Results

Phoneme basics
  - Phonemes are the basic elements of speech. All possible speech can be represented by a combination of phonemes.
      - CH, JH, S, EH, EY, OY, AE, SIL
  - A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.
      - T-IY-P, IY-P-AA, P-AA-T

Chinese phonemes vs. English
  - Chinese phonemes fall into two basic groups: initials and finals.
      - Initials: B, P, M, F, ...
      - Finals: a3, o1, e2, eng3, iang4, ue5, ...
  - Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
      - Different tones: a1, a2, a3, a4, a5
  - Chinese finals are not actually basic elements of speech.
      - For example: iang1, iao1, uang1, iong1
  - The Chinese phoneme set is much larger than the English one.

Phoneme distance analysis
  - Define the distance between any two phonemes.
  - Since we synthesize only video, not sound, tone is ignored.
  - Lip shape motion is the core element of the distance metric.

Phoneme distance analysis (continued)
  Diagram: each phoneme (Phoneme 1, Phoneme 2) has several video clips (Video 1, Video 2, ...)
  - Time-align the videos to a uniform length
  - Average the videos to get an average video
  - By comparing the two aligned average videos, we generate the distance matrix for the whole phoneme set.

Image part: pose tracking
  - Assume a planar model for the face
  - Standard minimization method to find the transform matrix (affine transform) (Black, 95)
  - A mask is used to restrict tracking to the parts of the face of interest
  - Motion prediction using parameters with physical meaning
  Figures: template picture; mask image

Pose tracking
  - Some tracking results (figures)

Lip motion tracking
  - Using eigen-points (Covell, 91)
  - Feature points include the jaw, lips, and teeth
  - Training database labeled manually
  - Automatic tracking through all pose-tracked images
  Figures: training database (hand-labeled); automatic tracking results

Synthesizing new sentences
  - New text is converted to wav by the TTS system
  - The wav is segmented into a phoneme sequence
  - Use DP to find an optimal video sequence from the training database
  - Time-align the triphone videos and stitch them together
  - Transform the lip sequence and paste it onto the background faces

Lip sequence synthesis
  Diagram labels: new phoneme sequence; candidate triphones (Triphone 1 to 9, A, B, C); optimal phoneme sequence

Dynamic programming
  Diagram labels: Begin; Triphone 1 to Triphone 5; End

Edge cost definition
  - Two parts:
      - Phoneme distance: the three phoneme distances added together
      - Lip shape distance over the overlapping portion of the triphone videos
  - The two parts are combined as a weighted sum

Background video generation
  - The background is a video sequence in which the virtual character said something else
  - Similarity measurement of the background
      - Select a “standard frame”: the frame with the maximal number of frames similar to it
      - Filter out the frames with jerkiness

Stitching the time-aligned result onto the background faces
  - Write back with a mask
  - Transform the synthesized lip to the background face
  Figures: mask image for the write-back operation; original background frame; write-back result for the same frame

More video results

Conclusion and future work
  - Pose tracking and lip motion tracking
  - Size of the training database
  - Talking face with expression
  - Real-time generation?
  - Fast modeling for different people
  - Animation

Thank you
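
Sketch: from segmentation output to triphones. The phoneme basics and the example segmentation result can be tied together in a few lines of Python. This is only an illustration, not the tool used in the work: the names `strip_tone` and `triphones` are hypothetical, the `(phoneme, start, end)` layout is assumed from the example segmentation result, and treating a trailing digit 1 to 5 as the tone mark is an assumption based on the tone notation in the slides. Silence labels are treated like ordinary phonemes for simplicity.

```python
import re

# (phoneme, start, end) triples taken from the example segmentation result.
SEGMENTS = [
    ("SILOPEN", 0, 23), ("SILOPEN", 24, 42), ("s", 43, 61), ("if4", 62, 74),
    ("j", 75, 80), ("ia1", 81, 97), ("sh", 98, 109), ("ang1", 110, 121),
]


def strip_tone(phoneme):
    """Drop a trailing tone digit (1-5); tone is ignored when only video is synthesized."""
    return re.sub(r"[1-5]$", "", phoneme)


def triphones(phonemes):
    """Return every run of three consecutive phonemes as a tuple."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


# Toneless labels in time order, then the overlapping triphones built from them.
labels = [strip_tone(p) for p, _, _ in SEGMENTS]
for tri in triphones(labels):
    print("-".join(tri))
```

Consecutive triphones share two phonemes, which is what later gives the overlapping video portion used in the edge cost.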

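Sketch: phoneme distance matrix. The procedure from the phoneme distance analysis slides (time-align the clips of each phoneme, average them, compare the averages) might look like the following. Everything beyond those three steps is an assumption made for illustration: each video clip is represented as a list of per-frame lip-shape feature vectors, alignment uses linear interpolation, and the comparison is the mean per-frame Euclidean distance. The slides do not say which features or which metric were actually used.

```python
import math


def resample(seq, length):
    """Linearly interpolate a list of feature vectors to exactly `length` frames."""
    if len(seq) == 1 or length == 1:
        return [list(seq[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(seq) - 1) / (length - 1)   # fractional source frame index
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(seq) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out


def average_video(videos, length):
    """Time-align all clips of one phoneme and average them frame by frame."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [[sum(v[t][d] for v in aligned) / len(aligned) for d in range(dims)]
            for t in range(length)]


def phoneme_distance(avg_a, avg_b):
    """Mean per-frame Euclidean distance between two aligned average videos."""
    return sum(math.dist(a, b) for a, b in zip(avg_a, avg_b)) / len(avg_a)


def distance_matrix(clips_by_phoneme, length=5):
    """Distance between every pair of phonemes, keyed by (phoneme, phoneme)."""
    avgs = {p: average_video(v, length) for p, v in clips_by_phoneme.items()}
    return {(p, q): phoneme_distance(avgs[p], avgs[q]) for p in avgs for q in avgs}


# Toy data: one-dimensional "lip opening" per frame, clips of different lengths.
clips = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],
    "b": [[[1.0], [3.0]]],
}
matrix = distance_matrix(clips, length=3)
```

Because tones are ignored, clips of a1 to a5 would be pooled under one key before calling `distance_matrix`.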

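Sketch: dynamic programming over candidate triphone clips. The synthesis slides describe using DP to pick an optimal video sequence, with a cost made of a phoneme distance (three phoneme distances added together) and a lip shape distance over the overlapping portion, combined as a weighted sum. The code below is one possible reading of that description, not the authors' implementation: the function name `select_clips`, the clip dictionaries, the weights `w_phon` and `w_lip`, and the toy distance functions are all hypothetical.

```python
def select_clips(targets, candidates, phoneme_dist, lip_dist, w_phon=1.0, w_lip=1.0):
    """Viterbi-style DP: choose one database clip per target triphone.

    targets:    list of target triphones (3-tuples of phonemes)
    candidates: candidates[i] is a non-empty list of clips for targets[i];
                each clip is a dict with at least a "triphone" key
    Returns (total_cost, chosen_clips).
    """
    def match_cost(target, clip):
        # Three phoneme distances added together.
        return sum(phoneme_dist(a, b) for a, b in zip(target, clip["triphone"]))

    # best holds, for each candidate at the current position,
    # (cost of the cheapest path ending there, that path).
    best = [(w_phon * match_cost(targets[0], c), [c]) for c in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for clip in candidates[i]:
            node = w_phon * match_cost(targets[i], clip)
            cost, path = min(
                ((prev_cost + w_lip * lip_dist(prev_path[-1], clip), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + node, path + [clip]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Toy distances (assumptions): 0/1 phoneme mismatch, and the gap between the lip
# opening at the end of one clip and at the start of the next.
def toy_phoneme_dist(a, b):
    return 0.0 if a == b else 1.0


def toy_lip_dist(prev_clip, clip):
    return abs(prev_clip["end_lip"] - clip["start_lip"])


targets = [("s", "i", "j"), ("i", "j", "a")]
candidates = [
    [{"id": "A", "triphone": ("s", "i", "j"), "start_lip": 0.0, "end_lip": 5.0},
     {"id": "B", "triphone": ("s", "i", "q"), "start_lip": 0.0, "end_lip": 1.0}],
    [{"id": "C", "triphone": ("i", "j", "a"), "start_lip": 1.0, "end_lip": 2.0}],
]
cost, path = select_clips(targets, candidates, toy_phoneme_dist, toy_lip_dist)
```

With equal weights the sketch prefers the slightly mismatched clip B because it joins smoothly to C; lowering `w_lip` shifts the choice toward the exact phoneme match A, which shows how the weighting trades pronunciation accuracy against visual continuity.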

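Sketch: write-back with a mask. The final step transforms the synthesized lip image to the background face and writes it back through a mask. A minimal version of the two operations is shown below, under stated assumptions: images are small grayscale grids stored as nested lists, the mask holds blend weights between 0.0 and 1.0, and the affine transform is given as a 2x3 matrix. The names `write_back` and `affine_point` are hypothetical, and warping a whole image is left out to keep the example short.

```python
def write_back(background, lip, mask):
    """Blend `lip` into `background`.

    A mask value of 1.0 keeps the lip pixel, 0.0 keeps the background pixel,
    and values in between mix the two (useful for soft mask edges).
    """
    return [
        [m * l + (1.0 - m) * b for b, l, m in zip(b_row, l_row, m_row)]
        for b_row, l_row, m_row in zip(background, lip, mask)
    ]


def affine_point(matrix, x, y):
    """Apply a 2x3 affine matrix [[a, b, tx], [c, d, ty]] to the point (x, y)."""
    (a, b, tx), (c, d, ty) = matrix
    return a * x + b * y + tx, c * x + d * y + ty


background = [[10.0, 10.0], [10.0, 10.0]]
lip = [[0.0, 200.0], [200.0, 0.0]]
mask = [[0.0, 1.0], [0.5, 0.0]]
result = write_back(background, lip, mask)

# A pure translation by (2, 3), as a stand-in for the tracked pose transform.
moved = affine_point([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]], 1.0, 1.0)
```

A soft-edged mask avoids a visible seam where the synthesized lip region meets the original background frame.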