
reduce the visual artifacts which result from discontinuities. The advantage of variable-length units is that there are fewer concatenation boundaries, but the search is usually more difficult. Moreover, the output animation is generally constrained to the talker and the environment of the original recording. Methods that rely on parametric statistical models include switching linear dynamical systems [59], shared Gaussian process latent variable models [60], artificial neural networks [61], and hidden Markov models [62–64]. One of the noteworthy early works, Voice Puppetry [65], proposed an HMM-based talking face synthesis driven by the speech signal. Xie et al. [66] proposed coupled HMMs (cHMMs) to model audio-visual asynchrony. Choi et al. [67] and Terissi et al. [68] utilised HMM inversion (HMMI) to infer the visual parameters from the speech signal. Zhang et al. [69] utilised a DNN to map speech features to HMM states, which are then further mapped to synthesised faces.
                                                                   The suggested algorithm improves both output quality and
Deep learning is a recent direction in artificial intelligence and machine learning research, and newly emerging deep learning frameworks continue to outperform state-of-the-art machine learning approaches. A few DNN-based methods have been investigated. Suwajanakorn et al. [70] designed an LSTM network to synthesise photo-realistic talking-head videos of a target identity directly from speech features. This system needs many hours of face video of the target identity, which limits its practical application.
                                                                   interpolation plays a principal role in geometric face
Chung et al. [71] introduced an encoder-decoder convolutional neural network (CNN) model to synthesise talking-face video from speech features and a single face image of the target identity. Reducing the data needed to learn the target identity from many hours of face video to a single face image is a great improvement. The main limitation of end-to-end synthesis is the lack of freedom for further manipulation of the synthesised face video. For instance, within a synthesised video, one might need to vary the gestures, facial expressions and lighting conditions, all of which can be independent of the content of the speech. End-to-end frameworks cannot provide such manipulations unless these factors are taken as extra inputs; however, that would significantly increase the amount and diversity of data required to train the systems. For such manipulations, a modular design which splits the generation of key parameters from the fine details of the synthesised face images is more flexible. Pham et al. [72] adopted a modular design: speech features are first mapped to 3D deformable shape and rotation parameters using an LSTM framework, and a 3D animated face is then synthesised in real time from the predicted parameters. This approach is improved in [73] by substituting raw waveforms for the speech features as input and a convolutional architecture for the LSTM framework. Chen et al. [74] first mapped the audio to facial landmarks and then synthesised video frames conditioned on the landmarks. Song et al. [75] proposed a conditional recurrent adversarial network that integrates audio and image features in recurrent units. However, the head pose generated by these 2D-based methods is almost fixed during talking. This drawback stems from a defect inherent in 2D-based methods, since it is difficult to model changes of pose naturally from 2D information alone. They therefore introduce 3D geometry information into the proposed system to simultaneously model personalised head pose, expression and lip synchronisation.
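
A minimal Python/PyTorch sketch of this modular split is given below; the layer sizes and feature dimensions (39-dimensional audio features, 46 shape coefficients) are illustrative assumptions, not the actual configuration of Pham et al. [72]:

    import torch
    import torch.nn as nn

    class SpeechToFaceParams(nn.Module):
        """LSTM regressor from per-frame speech features to 3D face parameters."""
        def __init__(self, n_audio=39, n_shape=46, n_rot=3, hidden=256):
            super().__init__()
            self.n_shape = n_shape
            self.lstm = nn.LSTM(n_audio, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, n_shape + n_rot)

        def forward(self, audio):                  # audio: (batch, frames, n_audio)
            h, _ = self.lstm(audio)                # (batch, frames, hidden)
            p = self.head(h)                       # (batch, frames, n_shape + n_rot)
            return p[..., :self.n_shape], p[..., self.n_shape:]

    model = SpeechToFaceParams()
    mfcc = torch.randn(1, 100, 39)                 # 100 frames of (assumed) audio features
    shape_coeffs, head_rotation = model(mfcc)      # a separate renderer turns these into frames

The value of the split is that the renderer consumes interpretable parameters, so pose or expression can be edited before the frames are generated.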

The recent advent of generative adversarial networks (GANs) [76] has shifted the attention of the machine learning community towards generative modelling. GANs consist of two competing networks: a generator and a discriminator. The generator's goal is to produce realistic samples, while the discriminator's goal is to distinguish between real and generated samples. This competition drives the generator to produce increasingly realistic samples. Vougioukas et al. [77–79] proposed an end-to-end model using temporal GANs for speech-driven facial animation, capable of generating a video of a talking head from an audio signal. Guo et al. [80] introduced a GAN-based, end-to-end TTS training algorithm that feeds the generated sequence into GAN training to avoid exposure bias in the autoregressive decoder. The proposed algorithm improves both the output quality and the generalisation of the model.
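
The adversarial objective can be made concrete with a toy training loop; the two-layer networks and random "data" below are placeholders, not any of the cited architectures:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))   # noise -> sample
    D = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))    # sample -> real/fake logit
    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

    for step in range(1000):
        real = torch.randn(32, 256)               # stand-in for a batch of real data
        fake = G(torch.randn(32, 64))             # generator output from latent noise

        # Discriminator step: label real samples 1 and generated samples 0.
        loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator step: try to make the discriminator label fakes as real.
        loss_g = bce(D(fake), torch.ones(32, 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()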

III. METHODS COMPARISON AND EVALUATION

Facial blendshapes are the usual choice for realistic face animation in the film industry. They have driven animated characters in Hollywood movies and attracted much research attention. Facial blendshapes can be classified into geometric and physics-based. Linear interpolation plays a principal role in geometric face blendshapes. Linear interpolation-based face blendshapes are widespread because they have the advantages of expressiveness, simplicity and interpretability. Despite this, the following limitations have been identified in [81]. Facial blendshapes can be viewed as samples from a hypothesised manifold of facial expressions, so producing a new face shape requires enough target face shapes to sample the manifold and define local linear interpolation functions. Generating sufficient target face shapes is typically an iterative and effort-intensive procedure. Physics-based face blendshapes augment facial blendshapes with physics, which has the prospect of tackling this issue. In particular, when physics-based simulations are combined with data-driven methods, faithful face animation can be generated.
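
As a concrete illustration of the linear-interpolation formulation, a delta blendshape model produces a new face as the neutral mesh plus a weighted sum of target displacements; the meshes and weights below are placeholders:

    import numpy as np

    def blend(neutral, targets, weights):
        """Delta blendshapes: face = neutral + sum_i w_i * (target_i - neutral)."""
        face = neutral.copy()
        for target, w in zip(targets, weights):
            face += w * (target - neutral)
        return face

    # Toy example: a 1000-vertex mesh with two sculpted target shapes.
    neutral = np.zeros((1000, 3))
    smile = np.random.rand(1000, 3) * 0.01        # placeholder target shapes
    jaw_open = np.random.rand(1000, 3) * 0.01
    face = blend(neutral, [smile, jaw_open], [0.7, 0.3])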

The weaknesses of physically-based methods stem from their complexity. Getting physics-based rigs configured can be a very hard and tedious task for artists. There has been work to automate some of the creation of these rigs for physics-based facial animation [81]; however, a noteworthy amount of effort still has to be done by hand to make the rigs look accurate and not fall into the uncanny valley. Besides the complexity making the rigs hard to set up, these complex rigs require enormous computation to produce the animations, which makes physically-based animation inappropriate for real-time applications. Physics-based solutions to face animation have given the entertainment industry impressive animations that are coming closer to being anatomically truthful, but the great amount of effort needed to complete these rigs will keep the method from being used more widely.

MPEG-4 could be seen as a formalisation of FACS; however, what makes it a different method is that, unlike
                                                                   however what makes it different method is that, unlike