[43] and muscle-based anatomical representations combined with deformable skin [44]. The capture stage of performance-driven facial animation can be thought of as the extraction of relevant information from the input video, such that this information can then be applied to the underlying face representation to synthesise the animation. Capture can be achieved using active or passive approaches. Active approaches include marker-based capture, where physical markers are placed on the actor’s face and tracked throughout the performance [45]. Passive approaches utilise video of the actor’s face without any markers placed on it [43]. The aim of the retargeting stage is to adapt the parameters acquired from the capture step in order to animate the virtual target character. The parameters used to drive this character can differ from the captured parameters, and adapting them is not an easy task, especially when the target character has proportions different from the actor’s face. Fig. 7 shows an example of retargeting an actor’s facial expression onto multiple target characters.

Fig. 6: Example of mesh propagation being utilised for the underlying representation of the animation [40].

Fig. 7: Retargeting an actor’s facial expression onto multiple target characters [53].
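As an illustration of the passive (markerless) capture route described above, the following minimal sketch tracks 2D facial landmarks from an ordinary webcam stream. It assumes OpenCV and dlib are available and that dlib’s standard 68-point landmark model file has been downloaded separately; the camera index and file name are assumptions for the example, not details taken from the surveyed methods.

```python
import cv2
import dlib
import numpy as np

# Assumed asset: dlib's 68-point facial landmark model, obtained separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture(0)                      # markerless input: a plain webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray, 0):
        shape = predictor(gray, face)
        # 68 (x, y) landmark positions per frame: the raw capture parameters
        # that a retargeting stage would map onto the target character's rig.
        pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
        print(pts.mean(axis=0))                # e.g. follow the face centroid
cap.release()
```

In a full pipeline, the per-frame landmark positions (or marker positions, in the active case) would be filtered and converted into rig parameters before retargeting.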
A technique for computing suitable mesh deformations for simulating human facial expressions is the performance-driven strategy [38]. This method requires a 3D polygon mesh representing a human face; the mesh can be hand-crafted or constructed automatically using a 3D scanner. Facial gestures from a human speaker are then captured, and speech-related deformations of the mesh are learned by mapping these gestures onto the polygon mesh. The learned mesh deformations can then be utilised to animate the virtual speaker and synthesise new visual speech. Recently, hybrid models have combined physically-based methods with other approaches [46, 35]. The current industry-standard approaches are motion capture, either markerless [40] or marker-based [47], and blendshape animation, which relies on extreme poses named blend targets: each blend target encodes one action, e.g., raising an eyebrow, and multiple blend targets can be linearly combined, as sketched in the example below.
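The blendshape combination itself is just a weighted sum of per-target vertex offsets. The following minimal sketch (toy data and made-up target names; not taken from any of the cited systems) shows the operation on NumPy arrays.

```python
import numpy as np

def blend(neutral, targets, weights):
    """Linearly combine blend targets as vertex offsets from the neutral mesh.

    neutral : (V, 3) vertex positions of the neutral face.
    targets : (K, V, 3) one extreme pose (blend target) per facial action.
    weights : (K,) blend weights, typically in [0, 1].
    """
    offsets = targets - neutral[None, :, :]            # per-target displacement
    return neutral + np.tensordot(weights, offsets, axes=1)

# Toy example: 4 vertices, 2 blend targets (say, "brow_raise" and "jaw_open").
neutral = np.zeros((4, 3))
targets = np.stack([0.10 * np.eye(4, 3), -0.05 * np.eye(4, 3)])
pose = blend(neutral, targets, np.array([0.8, 0.3]))
print(pose)                                            # the blended facial pose
```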
Thies et al. [48] suggested a real-time photo-realistic monocular reenactment method. They track facial landmarks based on a dense photometric consistency measure and utilise a GPU-based iteratively reweighted least squares solver to achieve real-time frame rates. Recently, commercial facial performance capture software has been released, for example Apple’s iPhone X application, which animates a virtual character using the phone’s depth camera [49]. Barros et al. [50] introduce a method for real-time performance-driven facial animation from monocular videos. Facial expressions are transferred from 2D images to a 3D virtual character by determining the rigid head pose and the non-rigid face deformation from detected and tracked face landmarks; blendshape models are used to map the input face into the facial expression space of the 3D head model.
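The cited systems solve much richer optimisation problems than can be shown here; purely to illustrate the underlying idea of fitting blendshape weights to tracked landmarks with an iteratively reweighted least squares loop, the sketch below uses synthetic matrices as stand-ins (nothing here reproduces the actual formulations of [48] or [50]).

```python
import numpy as np

def fit_blend_weights(B, x, iters=10, eps=1e-3):
    """Fit weights w so that B @ w approximates the observed landmark offsets x.

    B : (2L, K) matrix; column k stacks the 2D landmark displacements that
        blend target k produces relative to the neutral face.
    x : (2L,) observed landmark displacements for the current frame.
    A robust 1/|residual| reweighting gives a simple IRLS loop.
    """
    w = np.zeros(B.shape[1])
    for _ in range(iters):
        r = x - B @ w                              # residual per coordinate
        d = 1.0 / np.maximum(np.abs(r), eps)       # robust reweighting
        sq = np.sqrt(d)
        w, *_ = np.linalg.lstsq(sq[:, None] * B, sq * x, rcond=None)
        w = np.clip(w, 0.0, 1.0)                   # keep weights in a valid range
    return w

# Toy problem: 34 landmarks (68 stacked values), 5 blend targets.
rng = np.random.default_rng(0)
B = rng.normal(size=(68, 5))
true_w = np.array([0.7, 0.0, 0.3, 0.1, 0.0])
x = B @ true_w + 0.01 * rng.normal(size=68)
print(fit_blend_weights(B, x).round(2))            # close to true_w
```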
Recently, deep learning approaches have shown promising results for high-quality facial performance capture. Olszewski et al. [51] utilised convolutional neural networks to recover blendshape weights corresponding to the mouth expressions of virtual reality headset users. Laine et al. [52] used deep learning to learn a mapping from an actor’s image to the corresponding performance-captured mesh, permitting the capture of additional high-quality data. These techniques can infer coherent data for lip contacts if such data was present in the training set.
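Neither [51] nor [52] is reproduced here; the snippet below is only a minimal PyTorch sketch of the general idea of regressing blendshape weights directly from a face image, with arbitrary layer sizes and an assumed weight count.

```python
import torch
import torch.nn as nn

class BlendshapeRegressor(nn.Module):
    """Tiny CNN mapping a face crop to K blendshape weights (illustrative only)."""

    def __init__(self, num_weights: int = 51):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, num_weights)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.head(h))        # constrain weights to [0, 1]

model = BlendshapeRegressor()
dummy = torch.randn(1, 3, 128, 128)               # one face crop
print(model(dummy).shape)                         # torch.Size([1, 51])
```

Training such a regressor requires paired images and weights, which is exactly the kind of performance-captured data the methods above are designed to produce.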
G. Visual Speech Animation

Visual speech animation can be described as the visible motion of the face while a person is talking. Generating realistic visual speech animations corresponding to new text or pre-recorded audio speech input has been a hard task for decades. This is because human languages generally have a large vocabulary and a large number of phonemes (the basic units of speech), and also because of speech co-articulation, which complicates the mapping between audio speech signals or phonemes and visual speech motions. Visual speech co-articulation can be defined as follows: the visual appearance of a phoneme depends on the phonemes that come before and after it.

Previous researchers classified visual speech animation synthesis into two categories: viseme-driven and data-driven approaches.
Viseme-driven methods require animators to design key mouth shapes for phonemes in order to synthesise new speech animations, whereas data-driven methods do not require pre-designed key shapes but generally rely on a pre-recorded facial motion database for synthesis. Viseme-driven methods typically utilise some form of hand-tuned dominance function to describe how visual parameters representing phone-level classes are blended to generate the animation [54]. This typically results in satisfactory animation; however, the need to hand-tune the blending functions for each character makes this approach impractical. A toy illustration of the dominance-blending mechanism is sketched below.
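Dominance models such as [54] define their blending functions much more carefully than space allows here; the sketch below only illustrates the mechanism, using an exponential dominance curve and made-up per-viseme target values for a single visual parameter.

```python
import numpy as np

def dominance(t, centre, alpha=1.0, theta=8.0):
    """Exponentially decaying dominance of a viseme centred at `centre` (seconds)."""
    return alpha * np.exp(-theta * np.abs(t - centre))

def blend_track(times, centres, targets):
    """Dominance-weighted blend of per-viseme targets for one visual parameter.

    The value at each time step is the weighted average of all viseme targets,
    so neighbouring phonemes influence each other (a crude co-articulation model).
    """
    D = np.stack([dominance(times, c) for c in centres])      # (num_visemes, T)
    return (D * np.asarray(targets)[:, None]).sum(0) / D.sum(0)

times = np.linspace(0.0, 1.0, 11)
centres = [0.2, 0.5, 0.8]        # hypothetical viseme centre times
targets = [0.0, 1.0, 0.1]        # hypothetical lip-opening targets per viseme
print(blend_track(times, centres, targets).round(2))
```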
Instead, sample-based methods concatenate segments of pre-recorded visual speech, where the segments may correspond to fixed-sized units [55, 56] or variable-length units [57, 58]. A limitation of sample-based methods is that good-quality animation needs a large corpus from which units can be selected, and some form of smoothing is needed at the concatenation boundaries to