
movement is called an Action Unit (AU). Each AU is identified by a number (AU1, AU2, AU20, etc.). Samples of these action units are presented in Table I. Facial expressions are produced by combining action units; for instance, combining AU6 (Cheek Raiser), AU12 (Lip Corner Puller), and AU25 (Lips Part) produces the Happiness expression.

Table I
Sample single facial action units.

AU    FACS Name
1     Inner Brow Raiser
5     Upper Lid Raiser
14    Dimpler
17    Chin Raiser
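As a minimal illustration of this coding scheme (not taken from [17]; the combinations below are the commonly cited prototypical ones and are given for illustration only), the mapping from AU sets to expressions can be written as a simple lookup in Python:

# Illustrative sketch: commonly cited AU combinations for prototypical expressions.
# AU numbers follow FACS, e.g. AU6 = Cheek Raiser, AU12 = Lip Corner Puller.
EXPRESSION_AUS = {
    "happiness": {6, 12},        # Cheek Raiser + Lip Corner Puller
    "sadness":   {1, 4, 15},     # Inner Brow Raiser + Brow Lowerer + Lip Corner Depressor
    "surprise":  {1, 2, 5, 26},  # Brow Raisers + Upper Lid Raiser + Jaw Drop
}

def expression_from_aus(active_aus):
    """Return the first prototypical expression whose required AUs are all active."""
    active = set(active_aus)
    for name, required in EXPRESSION_AUS.items():
        if required <= active:
            return name
    return "unknown"

print(expression_from_aus([6, 12, 25]))  # -> happiness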
Recently, interest in utilising the FACS for producing visual speech has declined, for two reasons. Firstly, most face models designed for visual speech synthesis are no longer based on human anatomy; they consist of highly detailed polygon meshes and textures that are generally computed automatically by 3D scanning approaches. Mesh deformations are learned by advanced 3D motion capture methods, which is faster and easier than manually specifying the numerous muscles of the face and their effect on facial appearance. Secondly, although FACS offers many Action Units that can be utilised to mimic certain expressions precisely, these Action Units are less appropriate for simulating all the detailed gestures of the face that accompany speech production. FACS is therefore not optimised for modelling visual speech.

Facial expression generation, or synthesis, has recently received increasing attention in the facial expression modelling domain. Ekman and Friesen [17] developed the FACS for describing facial expressions with a set of basic face action units (AUs), each of which represents a basic facial muscle movement or expression change.
Kumar and Sharma [18] suggested an improved Waters facial model, used as an avatar for the research published in [19], which discussed a facial animation system driven by the FACS in a low-bandwidth video streaming setting. To build facial expressions, FACS defines 32 single Action Units (AUs), each created by an underlying muscle action, which interact in various ways. In this work, enhancements were made to the Waters facial model by improving its UI, adding sheet muscles, providing an alternative implementation of the jaw rotation function, introducing a new sphincter muscle model that can be utilised around the eyes, and altering the operation of the sphincter muscle utilised around the mouth.
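For context, models in the Waters family deform the face mesh with parametric muscles rather than with blendshapes. The following is a rough sketch of a linear (vector) muscle pulling nearby vertices toward its attachment point; the falloff used here is a simplification for illustration and is not the exact formulation, nor the sheet or sphincter muscles, described in [18].

import numpy as np

def linear_muscle_displace(vertices, attachment, insertion, contraction, radius):
    """Pull mesh vertices toward the muscle attachment point.

    vertices: (N, 3) array of vertex positions.
    attachment, insertion: 3D bone-attachment and skin-insertion points of the muscle.
    contraction: activation level in [0, 1].
    radius: vertices farther than this from the insertion point are unaffected.
    """
    attachment = np.asarray(attachment, dtype=float)
    insertion = np.asarray(insertion, dtype=float)
    pull = attachment - vertices                       # per-vertex pull direction
    dist = np.linalg.norm(vertices - insertion, axis=1)
    falloff = np.clip(1.0 - dist / radius, 0.0, 1.0)   # simple linear falloff with distance
    return vertices + 0.1 * contraction * falloff[:, None] * pull  # small gain keeps motion subtle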
Zhou et al. [20] introduced a conditional difference adversarial autoencoder (CDAAE) to transfer AUs from absence to presence on the global face. This approach uses low-resolution images, which can lose facial details that are vital for AU generation. Pumarola et al. [21] proposed GANimation, which transfers AUs over the whole face and produces a co-generation phenomenon between different AUs; with this approach it is difficult to generate a single AU without affecting the other AUs.
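Both [20] and [21] condition image synthesis on a target AU activation vector. Below is a minimal sketch of that conditioning idea, assuming PyTorch; the class name, layer sizes, and 17-dimensional AU vector are placeholders rather than the published architectures.

import torch
import torch.nn as nn

class AUConditionedGenerator(nn.Module):
    """Toy encoder-decoder mapping (face image, target AU vector) -> edited image."""
    def __init__(self, num_aus=17):
        super().__init__()
        # The target AU vector is tiled over the spatial grid and concatenated with
        # the RGB channels, as in AU-conditioned GANs of this kind.
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_aus, 64, kernel_size=7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, image, target_aus):
        b, _, h, w = image.shape
        au_maps = target_aus.view(b, -1, 1, 1).expand(b, target_aus.shape[1], h, w)
        return self.net(torch.cat([image, au_maps], dim=1))

# Example: edit a batch of 128x128 face crops toward a desired AU configuration.
generator = AUConditionedGenerator()
edited = generator(torch.randn(2, 3, 128, 128), torch.rand(2, 17))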
With the recent rise of deep learning, CNNs have been widely used to extract AU features. Zhao et al. [22] suggested a deep region and multi-label learning (DRML) network that partitions face images into 8 x 8 blocks and utilises individual convolutional kernels to convolve each block. Although this method treats each face as a set of individual parts, it partitions the blocks uniformly and does not reflect FACS knowledge, which leads to poor performance.
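The distinctive step in [22] is this uniform partitioning of a feature map into an 8 x 8 grid with a separate convolution kernel per cell. A simplified sketch of such a region layer follows (assuming PyTorch and an input whose spatial size is divisible by 8; it illustrates the idea only and is not the published DRML layer).

import torch
import torch.nn as nn

class UniformRegionLayer(nn.Module):
    """Applies an independent 3x3 convolution to each cell of an 8x8 grid."""
    def __init__(self, channels, grid=8):
        super().__init__()
        self.grid = grid
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(grid * grid)]
        )

    def forward(self, x):
        b, c, h, w = x.shape
        ph, pw = h // self.grid, w // self.grid
        out = x.clone()
        for i in range(self.grid):
            for j in range(self.grid):
                rows = slice(i * ph, (i + 1) * ph)
                cols = slice(j * pw, (j + 1) * pw)
                # Each spatial cell gets its own convolution kernel.
                out[:, :, rows, cols] = self.convs[i * self.grid + j](x[:, :, rows, cols])
        return out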
Zhilei Liu [23] proposed an Action Unit (AU) level facial expression synthesis approach named Local Attentive Conditional Generative Adversarial Network (LAC-GAN), which relies on facial action unit annotations. The authors built a model for facial action unit synthesis with richer local texture details, in which local AU regions are integrated with a conditional generative adversarial network. The method manipulates AUs between various states by learning a mapping on a facial manifold related to AU manipulation. The key point of this approach is that the manipulation module concentrates only on generating the local AU region, without touching the remaining identity information or the other AUs.
The modelling of deep graph networks has recently attracted increasing attention. Zhilei Liu [24] introduced an end-to-end deep learning framework for facial AU detection that uses a graph convolutional network (GCN) [25] for AU relation modelling. AU-related areas are extracted and fed into AU-specific auto-encoders for deep representation extraction, and each latent representation is then passed into the GCN as a node.
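In other words, the per-AU latent codes become the nodes of a graph whose edges encode AU relations, and the GCN propagates information between related AUs. Below is a compact sketch of one such graph-convolution step with Kipf-and-Welling-style symmetric normalisation; the adjacency matrix, dimensions, and class name are illustrative assumptions, not the configuration used in [24].

import torch
import torch.nn as nn

class AURelationGCN(nn.Module):
    """One graph-convolution step over per-AU latent codes."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        # adjacency: (num_aus, num_aus) AU relation graph, e.g. from co-occurrence statistics.
        a_hat = adjacency + torch.eye(adjacency.shape[0])        # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats):
        # node_feats: (batch, num_aus, in_dim), one latent code per AU region.
        return torch.relu(self.linear(self.norm_adj @ node_feats))

# Example: 12 AU nodes with 64-dimensional latent codes from the per-AU auto-encoders.
adjacency = (torch.rand(12, 12) > 0.7).float()
gcn = AURelationGCN(64, 64, adjacency)
refined = gcn(torch.randn(8, 12, 64))   # refined per-AU representations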
D. Moving Picture Experts Group-4

Moving Picture Experts Group-4 (MPEG-4) is an object-based multimedia compression standard that permits the diverse audiovisual objects (AVO) of a scene to be encoded independently. MPEG-4 includes the facial definition parameter set (FDP) and the facial animation parameter set (FAP), which were designed to describe facial shape and texture and to reproduce facial animation such as speech pronunciation, expressions, and emotions. MPEG-4 facial animation describes the parameters of a talking face in a standardised way: it identifies and animates 3D face models through face definition parameters (FDPs) and facial animation parameters (FAPs). FDPs contain the information needed to build a particular 3D face geometry, while FAPs encode the motion parameters of key feature points on the face over time. In MPEG-4 the head is described by 84 feature points (FPs), each of which defines the shape of an area of the face. Fig. 3 demonstrates part of the MPEG-4 feature points. After excluding the feature points that are not animated by FAPs, 68 FAPs are classified into collections; samples of these collections are presented in Table II. The FAPs fall into two groups: one represents the facial expressions, consisting of the six basic emotions (surprise, anger, sadness, joy, disgust, and fear); the second concentrates on facial areas such as the left mouth corner, the bottom of the chin, and the right eyebrow. Refer to the MPEG-4 facial animation book [26] for more details about the MPEG-4 facial animation standard.
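To make the FDP/FAP split concrete, one common way of driving a compliant face model is to stream per-frame FAP amplitudes and apply them to the corresponding feature points. The sketch below illustrates only this data flow; the FAP identifiers, amplitudes, and the displace_feature_point call are hypothetical placeholders, not values or APIs taken from the MPEG-4 specification or from any particular player.

from dataclasses import dataclass

@dataclass
class FAPFrame:
    """One animation frame: a mapping from FAP id to amplitude (FAPU-like units)."""
    time_ms: int
    fap_values: dict  # illustrative ids/values only, e.g. {3: 120}

# A made-up two-frame "smile" stream; real FAP ids, groups, and units are defined
# by the MPEG-4 facial animation standard [26].
smile_stream = [
    FAPFrame(time_ms=0, fap_values={}),
    FAPFrame(time_ms=40, fap_values={12: 80, 13: 80}),  # illustrative lip-corner FAPs
]

def play(face_model, stream):
    for frame in stream:
        for fap_id, amplitude in frame.fap_values.items():
            # Hypothetical API on the face model that moves the feature point(s)
            # controlled by this FAP by the given amplitude.
            face_model.displace_feature_point(fap_id, amplitude)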
El Rhalibi et al. [27] presented a method based on 3D Homura that integrates MPEG-4 standards to realistic