(Source: Scharfsinn/Shutterstock.com)
When discussing the future of human society, the topic of smart communities cannot be neglected. In May 2020, China's State Council Government Work Report proposed focusing support on "two new and one major" construction.
The "two new and one major" refers to new infrastructure construction, new city construction, and major projects such as transportation and water conservancy. Smart communities are a focus of the first two: new infrastructure and new city construction.
Smart communities take advantage of a slew of novel technologies to improve and facilitate daily living. In addition to unmanned community supermarkets, typical applications include smart home systems and automated parking. Among these myriad applications, community security systems are the most crucial. From neighborhood and residential building access control systems to networks of cameras throughout the community, smart systems can replace security guards in performing identification, neighborhood watches, hazard alerts, and more.
In the US, the CBS television drama Person of Interest depicts a security system backed by advanced artificial intelligence. In the show, a network of cameras installed throughout the city records a full range of information, including identity, behavior, and even human relationships, and a central brain analyzes this information to identify threats and predict potential ones. Of course, the show's near-god-like AI system currently remains firmly within the realm of science fiction, but the intelligent security system it depicts is now slowly becoming a reality. In smart communities and smart cities, intelligent security systems can act as AI systems that combine facial recognition, behavior recognition, and human identification.
Research into computer-based facial recognition technology started in earnest in the mid-20th century. The earliest efforts grew out of pattern recognition, after which various algorithms for face detection, face alignment, facial attribute recognition, and facial verification and recognition were gradually developed and refined. These technologies are now widely used in everyday life, including facial capture software on mobile phones and cameras, automatic face recognition for clocking in and out at work, and access control systems equipped with facial recognition technology in newly built communities.
The first step in developing a facial recognition algorithm is determining whether a face is present in a given image or video and identifying the pixel range that corresponds to that face. In 2001, Paul Viola and Michael Jones co-invented the now-famous Viola-Jones object detection framework, which served as a basis for later face detection algorithms.
The Viola-Jones algorithm comprises two components: features and classifiers. The algorithm exploits the Haar features of the human face: features formed by black-and-white rectangles that capture the light-dark relationship between different parts of the target. These features pick out areas of contrast on a subject's face, such as the bridge of the nose, which is brighter than the eyes, and the mouth, which is generally darker than the surrounding areas. The features are used to match candidate windows in the target image, which are then passed through an AdaBoost classifier that outputs a Face or No Face label. It is worth noting that in the Viola-Jones algorithm, multiple classifiers are chained together into a cascade. This has the advantage of gradually reducing the number of candidate windows, which increases the algorithm's computational speed.
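As a rough illustration of this approach, the sketch below runs OpenCV's pretrained Haar-cascade face detector, which implements the Viola-Jones method; the input file name and the detection parameters are assumptions chosen only for demonstration.

```python
# Minimal sketch: face detection with OpenCV's pretrained Viola-Jones (Haar cascade)
# detector. The image path and tuning parameters are illustrative assumptions.
import cv2

# Load the frontal-face Haar cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("community_entrance.jpg")      # hypothetical input frame
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # Haar features operate on grayscale

# scaleFactor and minNeighbors trade off speed against false detections.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(40, 40))

for (x, y, w, h) in faces:                        # one rectangle per detected face
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces_detected.jpg", image)
```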
Subsequent studies have also tackled the problem from both the feature and the classifier sides. In terms of features, today's security systems use other, relatively complex features instead of Haar features. On the one hand, this improves the detection rate of the system; on the other hand, it better solves the problem of detection failures caused by subjects whose faces are not directly facing the camera. In terms of classifiers, non-maximum suppression (NMS) can be used to merge candidate windows of similar location and size, drastically reducing the number of candidates, while deep neural networks can offload most of the needed computation to graphics cards, greatly increasing the speed of computation.
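To make the NMS idea concrete, here is a minimal sketch of the procedure in plain NumPy; the box format, scores, and overlap threshold are illustrative assumptions rather than what any particular security system uses.

```python
# Minimal sketch of non-maximum suppression (NMS): candidate boxes that overlap a
# higher-scoring box beyond an IoU threshold are discarded.
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values."""
    order = np.argsort(scores)[::-1]          # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with each remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only boxes whose overlap with box i is below the threshold.
        order = order[1:][iou < iou_threshold]
    return keep
```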
Because the use of a standardized face makes the results of algorithms—including facial recognition—more stable, a key step is to algorithmically match faces with different angles and resolutions to a standard location in a process called facial alignment. From this point of view, all human faces can be seen as the result of a standard face’s affine transformation (scaling, rotation, and translation), and the goal of the facial alignment algorithm is to reverse this transformation process based on the feature points of the target face.
Computer scientists initially defined 68 feature points that can roughly capture the primary features of a human face. A typical approach to developing such algorithms is to have a computer learn how a standard face image is transformed step-by-step into a real image using these feature points. The mapping of a standard face image onto a real face image is achieved by training a series of regressors so that each one learns a portion of the transformed information.
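A minimal sketch of this alignment step is shown below, assuming landmarks for the eyes and mouth have already been detected; the canonical template coordinates and output size are illustrative values, not a standard.

```python
# Minimal sketch of facial alignment: estimate a similarity (scale + rotation +
# translation) transform that maps detected landmarks onto canonical positions in
# a "standard face" template, then warp the image accordingly.
import cv2
import numpy as np

def align_face(image, landmarks, out_size=(112, 112)):
    """landmarks: detected [left_eye, right_eye, mouth_center] pixel coordinates."""
    # Canonical positions of the same points in a 112x112 standard face (illustrative).
    template = np.float32([[38.0, 52.0], [74.0, 52.0], [56.0, 92.0]])
    detected = np.float32(landmarks)
    # Estimate the transform restricted to scale, rotation, and translation,
    # then apply it to "reverse" the deformation of the target face.
    matrix, _ = cv2.estimateAffinePartial2D(detected, template)
    return cv2.warpAffine(image, matrix, out_size)
```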
Facial attributes include gender, race, age, and expression, and accurately differentiating these attributes can help determine the preferences and psychological state of a subject. Once face detection and alignment are performed, face attribute recognition is relatively simple: it essentially reduces to image classification and regression with the help of big data.
In 2015, Microsoft developed an age prediction app (how-old.net) that gives age predictions for people pictured based on users’ images. In this system, faces are first circled. Then the extracted feature vectors are fed through a classifier to assign a gender label, after which an age regression analyzer is used to obtain the corresponding age data. When a deep neural network is utilized, feature extraction and classification regression can be integrated into a single algorithm to achieve the real-time prediction of multiple attributes simultaneously. Similarly, facial expressions can be classified and subject to regression. They can then be used in smart home control systems and security systems so that, in the event of danger, it becomes possible to issue an alert in the blink of an eye.
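As a hedged sketch of how such multi-attribute prediction might be wired up, the example below attaches a gender-classification head and an age-regression head to a shared backbone; the network choice, feature size, and input shapes are assumptions for illustration only, not the architecture of any particular product.

```python
# Minimal sketch of joint attribute prediction: one branch classifies gender and
# another regresses age from the same shared feature vector.
import torch
import torch.nn as nn
import torchvision.models as models

class AttributeNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)   # shared feature extractor (assumed)
        backbone.fc = nn.Identity()                # expose the 512-d feature vector
        self.backbone = backbone
        self.gender_head = nn.Linear(512, 2)       # classification: two gender labels
        self.age_head = nn.Linear(512, 1)          # regression: age in years

    def forward(self, x):
        features = self.backbone(x)
        return self.gender_head(features), self.age_head(features)

model = AttributeNet()
images = torch.randn(4, 3, 224, 224)               # dummy batch of aligned face crops
gender_logits, age_pred = model(images)
```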
Using the above algorithm, it is possible to determine whether two pictures are of the same person in a process called facial verification. By extension, for any facial image input, the computer can match a relevant person’s data in a database and output his or her identity information and attribute information in a process known as facial recognition (Figure 1).
Figure 1: Computers can use facial-recognition algorithms to match a relevant person’s data in a database and output their identity information and attribute information. (Source: metamorworks/Shutterstock.com)
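A minimal sketch of the verification and recognition steps, assuming a feature-extraction network has already produced an embedding vector per face, might look like the following; the similarity threshold and gallery structure are illustrative assumptions.

```python
# Minimal sketch: verification compares two embeddings against a threshold;
# recognition finds the closest identity in a stored gallery of embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(embedding_a, embedding_b, threshold=0.6):
    """Return True if the two face embeddings likely belong to the same person."""
    return cosine_similarity(embedding_a, embedding_b) >= threshold

def identify(query_embedding, gallery):
    """gallery: dict mapping identity name -> stored embedding. Returns best match."""
    best_name, best_score = None, -1.0
    for name, stored in gallery.items():
        score = cosine_similarity(query_embedding, stored)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```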
The algorithm’s speed is crucial to ensuring a smooth user experience because input images need to be compared with a large number of images contained in a database. One solution is to extract compact features from each target image. One such method is principal component analysis (PCA), whereby the principal characteristics of a face are extracted from the detected face bounding box, after which correlation analysis is used to find the closest match. Another important feature is the scale-invariant feature transform (SIFT), which can match feature points in an image with high accuracy even if the image has been subject to rotation, scaling, or a change in resolution when a different camera is used.
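The following sketch illustrates the PCA idea on stand-in data: the face gallery is projected into a low-dimensional subspace once, offline, and each new query is matched by nearest neighbor in that subspace. The data shapes and component count are assumptions, not values from a deployed system.

```python
# Minimal sketch of PCA ("eigenfaces") for fast matching: high-dimensional face
# crops are projected into a low-dimensional subspace before comparison.
import numpy as np
from sklearn.decomposition import PCA

# Suppose each stored face is a flattened 64x64 grayscale crop (stand-in data).
gallery = np.random.rand(1000, 64 * 64)

pca = PCA(n_components=100)                    # keep the 100 strongest components
gallery_low = pca.fit_transform(gallery)       # project the gallery once, offline

query = np.random.rand(1, 64 * 64)             # stand-in for a newly detected face
query_low = pca.transform(query)

# Nearest neighbour in the reduced space approximates the closest identity.
distances = np.linalg.norm(gallery_low - query_low, axis=1)
best_match = int(np.argmin(distances))
```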
A human face will look different under different lighting conditions or when presented through different media. Direct feature extraction might not be able to meet the needs of all facial-recognition scenarios, so it is crucial to decouple target facial features from lighting and other extraneous information. The well-established local binary patterns (LBP) algorithm can be used to remove lighting information. In LBP, each pixel is compared to its neighbors, so that the relative intensity relationships among pixels across the image are preserved while their absolute values are discarded. In this way, facial features are still preserved while pixel variations caused by lighting or textures are removed. Disentangled representation, a technique developed in recent years, uses a similar idea to divide the facial features extracted by deep neural networks into shape and appearance, better preserving the facial images’ features while improving recognition accuracy.
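To illustrate, the sketch below computes a uniform LBP map and its histogram with scikit-image; the input file name and parameter choices are assumptions for demonstration.

```python
# Minimal sketch of local binary patterns (LBP): each pixel is encoded by comparing
# it with its neighbours, preserving local structure while discarding absolute
# intensity values.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

gray = cv2.imread("aligned_face.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input

radius = 1                        # compare each pixel with neighbours 1 pixel away
n_points = 8 * radius             # number of sampling points on the circle
lbp = local_binary_pattern(gray, n_points, radius, method="uniform")

# A histogram of LBP codes serves as an illumination-robust face descriptor.
hist, _ = np.histogram(lbp.ravel(), bins=np.arange(0, n_points + 3),
                       range=(0, n_points + 2), density=True)
```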
In addition to facial recognition technology, behavior recognition and identity recognition also constitute an important component of smart security systems. Behavior recognition refers to the classification of behaviors carried out by a person in a video, whereas identity recognition refers to identifying the same person across a network of cameras so that the trajectory of their movement can be determined and their intent assessed as suspicious or not. Combining identity and behavior recognition allows us to better determine a subject’s action state in a given video.
Initially, behavior recognition was treated as a special case of image classification. The classification target was changed from an image to a video, and actions were subject to classification instead of objects and faces. As the main storage medium used in smart security systems, video can be viewed as a combination of multiple images, so image classification methods (such as deep-learning algorithms) can be directly used in behavior recognition (Figure 2). However, because of the inherently temporal nature of subject behavior, relevant temporal features can also improve accuracy. Optical flow is one such feature applied to video that marks the changing path of a corresponding point between two consecutive frames of an image. When corresponding points belonging to multiple consecutive frames and their surrounding pixel information are encoded into a single feature, a video trajectory is formed. The combination of multiple trajectories provides a good representation of behavioral information.
Figure 2: Image classification methods (such as deep-learning algorithms) can be directly used in behavior recognition. (Source: Scharfsinn/Shutterstock.com)
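As an illustration of the optical-flow feature mentioned above, the sketch below computes dense flow between two consecutive frames with OpenCV's Farneback method; the video path and parameter values are assumptions.

```python
# Minimal sketch of dense optical flow between two consecutive video frames; the
# resulting per-pixel motion vectors can be chained across frames to build
# trajectories describing a subject's movement.
import cv2

cap = cv2.VideoCapture("surveillance_clip.mp4")   # hypothetical input video
ok, prev = cap.read()
ok, curr = cap.read()

prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)

# flow[y, x] holds the (dx, dy) displacement of each pixel between the two frames.
flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
cap.release()
```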
Deep-learning algorithms have made great strides in behavior recognition in recent years. The temporal segment network (TSN) algorithm proposed by computer scientists at The Chinese University of Hong Kong improves behavior recognition accuracy. In the TSN algorithm, raw video and a corresponding optical flow map are used simultaneously to train a deep neural network, which allows a single model to encode both appearance information and dynamic information. Also, the same video is randomly sampled to build multiple combinations so that different speeds of the same action can be recognized. Complementing algorithms such as TSN, Nanyang Technological University in Singapore has released a large labeled behavior recognition dataset (NTU RGB+D), which contains several actions common to hospitals and nursing homes (such as sitting down, lying down, and falling down). Behavior recognition systems trained with these algorithms and data are well suited for performing surveillance of key people and areas.
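The following sketch captures the core TSN idea of segment sampling followed by consensus averaging, assuming a hypothetical per-frame scoring model; it is a simplification of the published algorithm, not a reproduction of it.

```python
# Minimal sketch of the TSN idea: sample one snippet from each temporal segment of
# a video, score each snippet independently, and average the predictions.
import random
import torch

def sample_snippets(num_frames, num_segments=3):
    """Pick one random frame index from each equal-length segment of the video."""
    segment_len = num_frames // num_segments
    return [random.randrange(i * segment_len, (i + 1) * segment_len)
            for i in range(num_segments)]

def tsn_predict(model, frames, num_segments=3):
    """frames: tensor of shape (num_frames, C, H, W); returns averaged class scores."""
    indices = sample_snippets(frames.shape[0], num_segments)
    snippet_scores = [model(frames[i].unsqueeze(0)) for i in indices]
    return torch.stack(snippet_scores).mean(dim=0)   # segmental consensus
```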
Features used for identification can be all-encompassing, including facial features, physical features, posture, movement, and clothing. Because of camera resolution limitations, facial features can only serve as an aid in identity recognition; larger-scale features such as posture, movement, and clothing serve as the primary features, with clothing carrying the greatest weight, much as the human eye recognizes people. Therefore, the key to constructing an identity recognition algorithm lies in how to best utilize multiple features.
Deep-learning algorithms still play an important role. They allow deep neural networks to automatically extract features and assign different weights to different features via the input of large amounts of data. At the same time, they also train multiple classifiers to make determinations along different dimensions. Specifically, an algorithm for identity recognition incorporates several combined objectives, including appearance classification (clothing, backpack, pendants, etc.), body shape classification (male/female, height, etc.), and component classification (arm, leg, torso, etc.), and the final result is a weighted combination of multiple classifiers. In recent years, to simultaneously amplify distinctions between different individuals and reduce distinctions between different scenarios for the same individual, the triplet loss function has been introduced into deep-learning algorithms to train and differentiate a set of three samples, yielding favorable results.
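A minimal sketch of how the triplet loss might be applied during re-identification training is given below, using PyTorch's built-in triplet margin loss on dummy embeddings; the embedding dimension, batch size, and margin are illustrative assumptions.

```python
# Minimal sketch of triplet-loss training for identity recognition: the anchor is
# pulled toward a positive sample (same person, different scenario) and pushed away
# from a negative sample (different person).
import torch
import torch.nn as nn

embedding_dim = 256
criterion = nn.TripletMarginLoss(margin=0.3)

# Dummy embeddings standing in for the output of a re-identification network.
anchor = torch.randn(32, embedding_dim, requires_grad=True)
positive = torch.randn(32, embedding_dim)    # same identity, different camera/scene
negative = torch.randn(32, embedding_dim)    # different identity

loss = criterion(anchor, positive, negative)
loss.backward()                              # gradients flow back into the network
```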
Facial recognition as well as identity and behavior recognition enjoy great advantages in security system applications. First, computers can maintain round-the-clock surveillance in a way that humans cannot, and the increased coverage improves the security of the overall system. Second, powerful computers can quickly process large amounts of data, greatly increasing the speed at which security hazards can be identified. Also, the information used is external (such as faces, actions, and clothing) and is not only easily accessible but also allows comprehensive monitoring and analysis to be performed without the subject's knowledge. However, while smart cameras with the above features are starting to see use in some public places and communities, certain technical challenges will still need to be addressed before large-scale deployment is realized.
In facial recognition, the face is often obscured by glasses, sunglasses, masks, etc. In behavior and identity recognition, limbs can sometimes be obscured. These issues pose significant challenges to the algorithm. Although certain illumination issues can be partially solved via a decoupling algorithm, conditions such as a dark environment or cameras with varying resolutions can still affect the algorithm's accuracy. Also, faces that look similar, people who dress and move similarly, and changes in facial and movement characteristics over time can result in inaccurate identification.
Theoretically, the larger the amount of data, the more comprehensively a computer can be trained. In reality, however, face, behavior, and identity recognition data sets can be massive, and only after they are manually tagged can they be used to train machine-learning algorithms. As a result, tagging alone requires a significant human investment. On the other hand, once a security system is deployed, the computer needs to process huge amounts of new data every second, which slows the feedback rate. In security systems, the computer also needs to extract key features and information from the data and synthesize it to obtain more complex results (Figure 3). Currently, algorithms are still only tasked with a specific function, such as face detection or behavior recognition. In the future, when data-set sizes and computing power reach a certain level, new algorithms will be needed to synthesize information from multiple angles and provide rapid feedback to security managers.
Figure 3: Security systems need to extract key features and information from data and synthesize it to obtain more complex results. (Source: MONOPOLY919/Shutterstock.com)
The security of the system itself is an important criterion for evaluating such systems. Yet, in the age of the internet, data security remains a tremendous challenge. Because of social media's popularity, almost everyone's facial data and identity information is available online. Once this information is combined with imaging technology and even 3D printing, facial-recognition systems might be compromised. For this reason, some researchers are now focused on how to incorporate discrimination between real and fake faces into facial recognition systems to guard against such potential security threats.
Other algorithms are continually being upgraded, presenting new challenges to existing recognition techniques. For example, in recent years, generative adversarial networks have been used to generate realistic face images, and even videos with automatic face-swapping have become commonplace. These generated faces can fool existing face-recognition systems. Additionally, a recent paper has shown that adversarial interference with an identity recognition system can alter the algorithm's matching results so that they no longer reflect reality. Criminals could even interfere with the algorithm in this way to elude tracking by the system.
We have shown above that the development of new algorithms remains a prerequisite for realizing smart security systems for smart communities. In addition to improving the robustness of existing algorithms for large-scale data processing, new data- and algorithm-protection mechanisms will need to be gradually introduced to address new challenges and demands. Computer scientists remain engaged in an ongoing struggle to overcome these difficulties. Facial-recognition systems built on sparse representation can recognize faces under different cover conditions, which improves the facial recognition algorithm's ability to process data originating from special environments. While training a recognition algorithm, learning mechanisms such as generative adversarial networks and transfer learning can be introduced, and container technology and federated learning can be leveraged at the time of deployment. This not only allows the algorithm to perform recognition tasks but also allows it to distinguish among different data sources and malicious attacks, which better protects the system's data and algorithms. In the future, as algorithms continue to undergo iterative improvements in these areas, more advanced automatic recognition technologies will become an integral part of smart communities and smart cities.
Wang Dongang is a PhD candidate at the University of Sydney. His research involves medical imaging, artificial intelligence, neuroscience, and video analysis, and he is dedicated to applying machine learning techniques to applications in daily life. He has published papers in top international conferences, including CVPR and ECCV, and he serves as a reviewer for journals including IEEE Transactions on Circuits and Systems for Video Technology and IEEE Transactions on Multimedia, as well as for conferences including AAAI and ICML. He is experienced in developing algorithms in machine learning and computer vision and has cooperated with companies and institutes in China, the US, and Australia on projects including multi-view action recognition, road management based on surveillance videos, and an automatic triage system for brain CT.