movement is called an Action Unit (AU). Each AU is identified by a number (AU1, AU4, AU20, etc.). Samples of these action units are presented in Table I. Facial expressions are produced by combining action units; for instance, combining AU6 (Cheek Raiser), AU12 (Lip Corner Puller), and AU25 (Lips Part) produces the Happiness expression.
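To make the combination concrete, the following Python sketch maps an expression name to the AUs that must be activated. The intensity values and the set_action_unit() call on the face model are illustrative assumptions, not an API of any system cited in this section.

# Illustrative sketch: composing a facial expression from FACS action units.
EXPRESSION_TO_AUS = {
    "happiness": {6: 1.0, 12: 1.0, 25: 0.5},  # Cheek Raiser, Lip Corner Puller, Lips Part
    "sadness":   {1: 1.0, 4: 0.8, 15: 1.0},   # Inner Brow Raiser, Brow Lowerer, Lip Corner Depressor
}

def activate_expression(face_model, expression):
    # Drive a (hypothetical) face model by setting each AU of the expression.
    for au, intensity in EXPRESSION_TO_AUS[expression].items():
        face_model.set_action_unit(au, intensity)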
Recently, interest in utilising FACS for producing visual speech has declined, for two reasons. Firstly, most face models designed nowadays for visual speech synthesis do not rely on human anatomy but consist of highly detailed polygon meshes and textures that are generally computed automatically by 3D scanning approaches. The mesh deformations are learned by advanced 3D motion capture methods, which is faster and easier than manually specifying the numerous muscles of the face and their effect on facial appearance. Secondly, FACS offers many Action Units that can be used to mimic certain expressions precisely, but these Action Units are less appropriate for simulating all the detailed gestures of the face involved in speech production, so FACS is not optimised for modelling visual speech.
Facial expression generation or synthesis has recently received increasing attention in the facial expression modelling domain. Ekman and Friesen [17] developed the FACS for describing facial expressions with a set of basic face action units (AUs), each of which represents a basic facial muscle movement or expression change.
Kumar and Sharma [18] suggested an improved Waters facial model used as an avatar for the research published in [19], which discussed a facial animation system driven by FACS in a low-bandwidth video streaming setting. To build facial expressions, FACS defines 32 single Action Units (AUs), each created by an underlying muscle action, and these units interact in various ways. In this work, the Waters facial model was enhanced by improving its UI, adding sheet muscles, providing an alternative implementation of the jaw rotation function, introducing a new sphincter muscle model that can be used around the eyes, and altering the operation of the sphincter muscle used around the mouth. Zhou et al. [20] introduced a conditional difference adversarial autoencoder (CDAAE) to transfer AUs from absence to presence on the global face. This approach uses low-resolution images, which can lose facial details that are vital for AU production. Pumarola et al. [21] proposed GANimation, which transfers AUs on the whole face and produces a co-generation phenomenon between different AUs; with this approach it is difficult to generate a single AU without affecting the other AUs.
With the recent rise of deep learning, CNNs have been widely used to extract AU features. Zhao et al. [22] suggested a deep region and multi-label learning (DRML) network that partitions the face image into 8 x 8 blocks and uses individual convolutional kernels to convolve each block. Although this method treats each face as a set of individual parts, it partitions the blocks uniformly and does not reflect FACS knowledge, leading to poor performance. Zhilei Liu [23] proposes an Action Unit (AU) level facial expression synthesis approach named Local Attentive Conditional Generative Adversarial Network (LAC-GAN), which relies on facial action unit annotations. They build a model for facial action unit synthesis with more local texture detail; in this approach, local AU regions are integrated with a conditional generative adversarial network. The proposed method manipulates AUs between various states by learning a mapping on a facial manifold related to AU manipulation. The key point of this approach is that the manipulation module concentrates only on generating the local AU region without touching the remaining identity information and the other AUs. The development of deep graph network modelling has recently attracted increasing attention. Zhilei Liu [24] introduces an end-to-end deep learning framework for facial AU detection that uses a graph convolutional network (GCN) [25] for AU relation modelling to support facial AU detection. AU-related areas are extracted and fed into AU-specific auto-encoders for deep representation extraction; each latent representation is then fed into the GCN as a node.
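A minimal numerical sketch of this idea is given below. It is not the implementation of [24]; the AU relation graph, feature sizes, and the single propagation step H' = ReLU(A_norm H W) are illustrative assumptions.

import numpy as np

def gcn_layer(H, A, W):
    # One GCN propagation step over AU nodes.
    # H: (num_aus, d_in) latent features, one row per AU node (from the AU auto-encoders).
    # A: (num_aus, num_aus) binary AU relation graph (assumed or learned).
    # W: (d_in, d_out) trainable weight matrix.
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt        # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)          # ReLU activation

# Example: 4 AU nodes with 64-dim auto-encoder latents, projected to 32 dims.
H = np.random.randn(4, 64)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = np.random.randn(64, 32)
H_next = gcn_layer(H, A, W)   # (4, 32) relation-aware AU representations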
Table I
Sample single facial action units.

AU    FACS Name
1     Inner Brow Raiser
14    Dimpler
5     Upper Lid Raiser
17    Chin Raiser
D. Moving Picture Experts Group-4
Moving Picture Experts Group-4 (MPEG-4) is an object-based multimedia compression standard that permits the audiovisual objects (AVOs) of a scene to be encoded independently. MPEG-4 includes the facial definition parameter (FDP) set and the facial animation parameter (FAP) set, which were designed to describe facial shape and texture as well as to regenerate facial animation such as speech pronunciation, expressions, and emotions. MPEG-4 facial animation specifies many parameters of a talking face in a standardised way: it defines and animates 3D face models by describing face definition parameters (FDPs) and facial animation parameters (FAPs). FDPs carry the information needed to build a particular 3D face geometry, while FAPs encode motion parameters of key feature points on the face over time. In MPEG-4, the head is described by 84 feature points (FPs), each of which defines the shape of a facial area. Fig. 3 shows part of the MPEG-4 feature points. After excluding the feature points that are not affected by FAPs, the 68 FAPs are classified into groups; samples of these groups are presented in Table II. The FAPs fall into two categories: one represents facial expressions, consisting of the six basic emotions (surprise, anger, sadness, joy, disgust, and fear); the second concentrates on facial areas such as the left mouth corner, the bottom of the chin, and the right eyebrow. Refer to the MPEG-4 facial animation book [26] for more details about the MPEG-4 facial animation standard.
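As an illustration of the FAP stream described above, the sketch below stores per-frame FAP values. The class names, the choice of FAP indices 3 (jaw opening) and 6 (left lip-corner stretch), and the displacement magnitudes are assumptions for illustration only, not normative MPEG-4 syntax.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FAPFrame:
    timestamp_ms: int
    fap_values: Dict[int, float] = field(default_factory=dict)  # FAP index -> displacement in FAPUs

@dataclass
class FAPStream:
    frame_rate: float
    frames: List[FAPFrame] = field(default_factory=list)

    def add_frame(self, timestamp_ms, fap_values):
        # Append one animation frame holding the active low-level FAPs.
        self.frames.append(FAPFrame(timestamp_ms, dict(fap_values)))

# Example: open the jaw and stretch the left mouth corner over two frames
# of a hypothetical 25 fps stream.
stream = FAPStream(frame_rate=25.0)
stream.add_frame(0,  {3: 0.0,   6: 0.0})
stream.add_frame(40, {3: 120.0, 6: 35.0})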
El Rhalibi et al. [27] presented a method relying on 3D Homura that integrates MPEG-4 standards to realistic