
[43] and muscle-based anatomical representations combined with deformable skin [44]. The capture stage of performance-driven facial animation can be viewed as the extraction of relevant information from the input video, such that this information can then be applied to the underlying face representation to synthesise the animation. Capture can be achieved using active or passive approaches. Active approaches include marker-based capture, where physical markers are placed on the actor's face and tracked through the performance [45]. Passive approaches utilise video of the actor's face without any markers placed on it [43]. The aim of the retargeting stage is to adapt the parameters acquired from the capture step and animate the virtual target character. The parameters utilised to drive this character may differ from the captured parameters. This is not an easy task, especially when the target character has proportions different from the actor's face. Fig. 7 shows an example of retargeting an actor's facial expression onto multiple target characters.
                                                                 expression of virtual reality headset users. Laine et al. [52]
Fig. 6: Example of mesh propagation utilised as the underlying representation for the animation [40].
                                                                 for the appropriate capture of extra high-quality data. These
Fig. 7: Retargeting an actor's facial expression onto multiple target characters [53].
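The capture/retargeting split described above can be summarised in a deliberately simplified sketch; the landmark-displacement parameterisation, the `track_landmarks` callable, and the single proportion scale factor are assumptions of this illustration, not any specific system's design:

```python
import numpy as np

def face_scale(landmarks):
    """Crude proportion measure: bounding-box diagonal of a landmark set."""
    return np.linalg.norm(landmarks.max(axis=0) - landmarks.min(axis=0))

def capture(frame, track_landmarks, actor_neutral):
    """Capture stage: extract the actor-space deformation for one frame."""
    landmarks = track_landmarks(frame)        # (N, 2) or (N, 3), marker-based or passive
    return landmarks - actor_neutral          # displacement from the actor's neutral pose

def retarget(deformation, actor_neutral, target_neutral):
    """Retargeting stage: adapt the captured deformation to the target character.

    Proportion differences are handled with a single scale factor here;
    production retargeting uses much richer correspondences.
    """
    scale = face_scale(target_neutral) / face_scale(actor_neutral)
    return target_neutral + scale * deformation
```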

   A technique for computing suitable mesh deformations to simulate human facial expressions is the performance-driven strategy [38]. This method requires a 3D polygon mesh representing a human face, which can be hand-crafted or constructed automatically using a 3D scanner. Facial gestures from a human speaker are then captured, and the speech-related deformations of the mesh are learned by mapping these gestures onto the polygon mesh. These mesh deformations can then be utilised to animate the virtual speaker and synthesise new visual speech. Recently, hybrid models have combined physically-based methods with other approaches [46, 35]. The current industry-standard approaches are motion capture, either markerless [40] or marker-based [47], and blendshape animation, which relies on extreme poses named blend targets, where each blend target encodes one action, e.g., raising an eyebrow, and multiple blend targets can then be linearly combined.
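The linear combination of blend targets amounts to adding a weighted sum of per-target offsets to the neutral mesh; a minimal NumPy sketch (array shapes are illustrative assumptions) is:

```python
import numpy as np

def blendshape_mesh(neutral, blend_targets, weights):
    # neutral:       (V, 3) vertex positions of the rest pose
    # blend_targets: (K, V, 3) extreme poses, each encoding one action
    # weights:       (K,) activation per blend target, typically in [0, 1]
    deltas = blend_targets - neutral[None, :, :]            # per-target offsets
    return neutral + np.tensordot(weights, deltas, axes=1)  # weighted sum of offsets
```

With all weights at zero the neutral face is reproduced; setting a single weight to one reproduces the corresponding extreme pose, and intermediate values mix actions such as an eyebrow raise with other targets.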
   Thies et al. [48] suggested a real-time photo-realistic monocular facial reenactment method. They track facial landmarks based on a dense photometric consistency measure and utilise a GPU-based iteratively reweighted least squares solver to achieve real-time frame rates. Recently, commercial facial performance capture software has been released, for example Apple's iPhone X application, which animates a virtual character using the phone's depth camera [49]. Barros et al. [50] introduced a method for real-time performance-driven facial animation from monocular videos. Facial expressions are transferred from 2D images to a 3D virtual character by estimating the rigid head pose and the non-rigid face deformation from detected and tracked face landmarks. Blendshape models are used to map the input face into the facial expression space of the 3D head model.
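A rough sketch of this rigid/non-rigid factorisation is given below, assuming 3D landmarks and a delta-blendshape basis; it illustrates the general idea rather than the exact solvers of [48] or [50]:

```python
import numpy as np

def rigid_pose(src, dst):
    """Kabsch/Procrustes: rotation R and translation t aligning src to dst."""
    sc, dc = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (dst - dc))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dc - R @ sc

def fit_expression(landmarks3d, neutral, blend_deltas):
    """Factor tracked landmarks into a rigid head pose plus blendshape weights."""
    R, t = rigid_pose(neutral, landmarks3d)
    residual = (landmarks3d - t) @ R - neutral    # non-rigid deformation in head space
    A = blend_deltas.reshape(len(blend_deltas), -1).T   # (3N, K) delta basis
    w, *_ = np.linalg.lstsq(A, residual.ravel(), rcond=None)
    return R, t, np.clip(w, 0.0, 1.0)
```

In [48] the plain least-squares step sketched here is replaced by a dense photometric objective solved with a GPU-based iteratively reweighted least squares solver.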
   Recently, deep learning approaches have shown promising results for high-quality facial performance capture. Olszewski et al. [51] utilised convolutional neural networks to recover blendshape weights corresponding to the mouth expressions of virtual reality headset users. Laine et al. [52] used deep learning to learn a mapping from an actor's image to the corresponding performance-captured mesh, permitting the appropriate capture of additional high-quality data. These techniques can infer coherent data through lip contacts if such data was present in the training set.
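As a toy illustration of this idea (not the architecture of [51] or [52]), a small convolutional network can regress blendshape weights directly from a face crop; the layer sizes and the choice of 51 targets are arbitrary assumptions:

```python
import torch
import torch.nn as nn

class BlendshapeRegressor(nn.Module):
    """Minimal CNN sketch mapping a face crop to K blendshape weights."""

    def __init__(self, num_targets=51):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # global pooling to (B, 128, 1, 1)
        )
        self.head = nn.Linear(128, num_targets)

    def forward(self, image):                        # image: (B, 3, H, W)
        feats = self.features(image).flatten(1)      # (B, 128)
        return torch.sigmoid(self.head(feats))       # weights constrained to [0, 1]
```

Such a network would be trained with a regression loss against weights recovered by a conventional capture pipeline on the same frames, in line with the supervised setup described above.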
G. Visual Speech Animation

   Visual speech animation can be considered as the visible motion of the face when humans are talking. Generating realistic visual speech animations corresponding to new text or pre-recorded audio speech input has been a hard task for decades. This is because human languages generally have a large vocabulary and a large number of phonemes (the basic units of speech), and also because of the phenomenon of speech co-articulation, which complicates the mapping between audio speech signals or phonemes and visual speech motions. Visual speech co-articulation can be defined as follows: the visual appearance of a phoneme depends on the phonemes that come before and after it.
   Previous researchers classified visual speech animation synthesis into two categories: viseme-driven and data-driven approaches. Viseme-driven methods need animators to design key mouth shapes for phonemes in order to synthesise new speech animations, while data-driven methods do not require pre-designed key shapes but generally require a pre-recorded facial motion database for synthesis. Viseme-driven methods typically utilise some form of hand-tuned dominance function to describe how the visual parameters representing phone-level classes are blended to generate the animation [54]; a minimal sketch of this blending follows below. This typically results in satisfactory animation; however, the need to hand-tune the blending functions for each character makes this method impractical.
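A minimal sketch of dominance-function blending in the spirit of [54] is given below; the negative-exponential form and all constants are illustrative assumptions, and they are precisely the kind of parameters that must be hand-tuned per character:

```python
import numpy as np

def dominance(t, centre, magnitude=1.0, rate=0.08, power=1.0):
    """Dominance of one phone-level segment as a function of time (ms)."""
    return magnitude * np.exp(-rate * np.abs(t - centre) ** power)

def blend_track(times, segments):
    """Blend per-segment visual parameter targets into one animation track.

    segments: list of (centre_time_ms, target_value) pairs, one per phone instance.
    """
    num = np.zeros_like(times, dtype=float)
    den = np.zeros_like(times, dtype=float)
    for centre, target in segments:
        d = dominance(times, centre)
        num += d * target                       # dominance-weighted target
        den += d
    return num / np.maximum(den, 1e-8)          # normalised weighted average
```

For example, `blend_track(np.arange(0, 600, 10.0), [(100, 0.8), (250, 0.1), (400, 0.9)])` produces a lip-opening track that rises and falls smoothly around each phone centre, with neighbouring phones influencing each other as co-articulation requires.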
   Sample-based methods, by contrast, concatenate segments of pre-recorded visual speech, where the segments might correspond to fixed-sized units [55, 56] or variable-length units [57, 58]. A limitation of sample-based methods is that good quality animation needs a large corpus from which units can be selected, and some form of smoothing is needed at the concatenation boundaries to