Page 37 - IJEEE-2022-Vol18-ISSUE-1
Shakir & Al-Azza | 33
reduce the visual artifacts which result from discontinuities. The advantage of variable-length units is that there are fewer concatenation boundaries, but the search is usually more difficult. Moreover, the output animation is generally constrained to the talker and the environment of the original recording. Methods relying on parametric statistical models comprise switching linear dynamical systems [59], shared Gaussian process latent variable models [60], artificial neural networks [61], and hidden Markov models [62–64]. One noteworthy early work, Voice Puppetry [65], suggested an HMM-based talking face synthesis driven by the speech signal. Xie et al. [66] suggested coupled HMMs (cHMMs) to model auditory-visual asynchrony. Choi et al. [67] and Terissi et al. [68] utilised HMM inversion (HMMI) to infer the visual parameters from the speech signal. Zhang et al. [69] utilised a DNN to map speech features into HMM states, which are then further mapped to synthesised faces.

Deep learning is a recent direction in artificial intelligence and machine learning research. Lately, new deep learning frameworks have been emerging, outperforming state-of-the-art machine learning approaches. A few DNN-based methods have been investigated. Suwajanakorn et al. [70] designed an LSTM network to synthesise photo-realistic talking head videos of a target identity directly from speech features. This system needs several hours of face videos of the target identity, which limits its practical application.

Chung et al. [71] introduced an encoder-decoder convolutional neural network (CNN) model to synthesise talking face video from speech features and a single face image of the target identity. In this work, the reduction from several hours of face videos to a single face image for learning the target identity is a great improvement. The main limitation of end-to-end synthesis is the lack of freedom for further manipulation of the synthesised face video. For instance, within a synthesised video, one might need to vary the gestures, facial expressions, and lighting conditions, all of which could be independent of the content of the speech. These end-to-end frameworks could not provide such manipulations unless these factors were taken as extra inputs. However, that would significantly increase the amount and diversity of data required for training the systems. For such manipulations, a modular design which splits the generation of key parameters from the fine details of the synthesised face images is more flexible. Pham et al. [72] adopted a modular design: speech features are first mapped to 3D deformable shape and rotation parameters using an LSTM framework, and then a 3D animated face is synthesised in real time from the predicted parameters. This approach was improved in [73] by substituting raw waveforms for speech features as the input and substituting a convolutional architecture for the LSTM framework. Chen et al. [74] first mapped the audio to facial landmarks and then synthesised video frames conditioned on the landmarks. Song et al. [75] suggested a conditional recurrent adversarial network that integrates auditory and image features in recurrent units. However, the head pose generated by these 2D-based methods is almost fixed during talking. This drawback is caused by a defect inherent in 2D-based methods, since it is difficult to model the change of pose naturally using 2D information alone. They introduce 3D geometry information into the proposed system to simultaneously model personalised head pose, expression and lip synchronisation.

The recent emergence of generative adversarial networks (GANs) [76] has shifted the attention of the machine learning community to generative modelling. GANs contain two competing networks: a generative and a discriminative network. The generator's target is to generate realistic samples, while the discriminator's target is to discriminate between real and generated samples. This competition drives the generator to produce highly realistic samples. Vougioukas et al. [77–79] proposed an end-to-end model using temporal GANs for speech-driven facial animation, capable of generating a video of a talking head from an audio signal. Guo [80] introduced a GAN-based, end-to-end TTS training algorithm, which feeds the generated sequence into GAN training to avoid exposure bias in the autoregressive decoder. The suggested algorithm improves both the output quality and the generalisation of the model.

III. METHODS COMPARISON AND EVALUATION

Facial blendshapes are the common choice for realistic face animation in the film industry. They have driven animated characters in Hollywood movies and attracted much research attention. Facial blendshapes can be classified into geometric and physics-based. Linear interpolation plays a principal role in geometric face blendshapes. Linear interpolation-based face blendshapes are widespread because they have the advantages of expressiveness, simplicity and interpretability. Despite this, the following limitations have been identified in [81]. Facial blendshapes might be considered as samples from a hypothesised manifold of face expressions. Producing a new face shape needs enough target face shapes to sample the manifold and describe local linear interpolation functions. Generating sufficient target face shapes is typically an iterative and effort-intensive procedure. Physics-based face blendshapes augment facial blendshapes with physics, which has the prospect of tackling the above issue. Particularly, when physics-based simulations are combined with data-driven methods, faithful face animation can be generated.

The weaknesses of physically-based methods stem from their complexity. Getting physics-based rigs configured can be a very hard and tedious task for artists. There has been work to automate some of the creation of these rigs for physics-based facial animation [81]; however, a noteworthy amount of work still has to be done by hand to make the rigs look accurate and not fall into the uncanny valley. As well as the complexity making the rigs hard to set up, these complex rigs need enormous amounts of computation to produce the animations, which makes physically-based animation inappropriate for real-time applications. Physics-based solutions to face animation have given the entertainment industry great animations that are becoming closer to being anatomically accurate; however, the great amount of effort needed to complete these rigs will keep the method from being more widely utilised.

MPEG-4 could be seen as a formalisation of FACS; however, what makes it a different method is that, unlike
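As a rough illustration of the linear interpolation that geometric blendshapes rely on, the sketch below forms a new face shape as the neutral shape plus a weighted sum of offsets to sculpted target shapes. The three-vertex 2D "mesh", the target shapes and the weights are invented example values for illustration only, not taken from [81]:

```python
# Minimal sketch of geometric (delta) blendshape interpolation on a
# tiny face "mesh" of 3 vertices in 2D. All values are hypothetical.

def blend(neutral, targets, weights):
    """Linearly interpolate a face shape:
    result = neutral + sum_i w_i * (target_i - neutral)."""
    result = [list(v) for v in neutral]  # start from a copy of the neutral shape
    for target, w in zip(targets, weights):
        for vi, (nv, tv) in enumerate(zip(neutral, target)):
            for ci in range(len(nv)):
                result[vi][ci] += w * (tv[ci] - nv[ci])
    return result

# Hypothetical shapes: a neutral mouth and two sculpted targets
neutral = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
smile   = [[0.0, 0.2], [1.0, 0.0], [2.0, 0.2]]   # mouth corners up
open_m  = [[0.0, 0.0], [1.0, -0.4], [2.0, 0.0]]  # jaw down

# A half smile with a slightly open mouth
print(blend(neutral, [smile, open_m], [0.5, 0.25]))
# -> [[0.0, 0.1], [1.0, -0.1], [2.0, 0.1]]
```

Production rigs blend dozens or hundreds of high-resolution targets in exactly this way; the manifold-sampling limitation discussed above is that every new expression still requires enough sculpted target shapes around it for such local linear combinations to look plausible.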