ATM Protection Using Embedded Machine Learning Solutions

Antonio Rizzo, Francesco Montefoschi, Alessandro Rossi, Maurizio Caporali
University of Siena
Siena, Italy
antonio.rizzo@unisi.it, francesco.montefoschi@unisi.it, alessandro.rossi2@unisi.it, maurizio.caporali@unisi.it


Antonio J. Peña, Marc Jorda
Barcelona Supercomputing Center (BSC), Barcelona, Spain
antonio.pena@bsc.es, marc.jorda@bsc.es


Gianluca Venere
SECO Srl, Arezzo, Italy



Carlo Festucci
Monte dei Paschi di Siena, Siena, Italy




Abstract: ATMs are an easy target for fraud attacks such as card skimming/trapping, cash trapping, malware and physical attacks. Attacks based on explosives are a rising problem in Europe and many other parts of the world. A report from the EAST association shows an 80% rise in such attacks between the first six months of 2015 and the first six months of 2016. This trend is particularly worrying, not only because of the stolen cash, but also because of the significant collateral damage to buildings and equipment [1].
We developed a video surveillance application based on Intel RealSense depth cameras that can run on SECO's A80 Single Board Computer. The camera can be embedded in the ATM’s chassis, focusing on the area under the screen, where explosive-based attacks begin. The use of depth cameras avoids privacy-related regulatory issues. The computer vision analysis rests on Machine Learning algorithms. We designed a model based on Convolutional Neural Networks that is able to discriminate between regular ATM usage and break-in attempts. The dataset was built by recording and tagging depth videos in which different people stage withdrawals and attacks on a retired ATM, replicating the actions that thieves perform, thanks to the knowledge of the Security Department of the Monte dei Paschi di Siena Bank.
The results show that the implemented architecture is able to classify depth data in real time on an embedded system, detecting all the test attacks in a few seconds.


Keywords— Bank Security; Machine Learning; Convolutional Neural Networks; Computer Vision; Intel RealSense; Single Board Computer







In recent years, global digitalisation and the consolidation of information technologies have significantly changed our daily life and the way we interact with one another, at both the local and the global level. This digital revolution is also changing how users access banks and financial services, turning a relationship based on personal trust into a mainly online service with sporadic human interactions. This shift, and the resulting change in the structure of bank branches, obviously affects the criminal behaviour related to this environment. International sector studies [2] show that, although the use of explosives and other physical attacks continues to spread, in the long term attacks will focus on cyber and logical approaches. In fact, ATM malware and logical security attacks were reported by seven countries in Europe during 2017.

Moreover, statistics from ABI (Italian Banking Association) show a marked increase in attacks on ATMs, in contrast to a reduction in bank branch robberies. This is due both to the juridical categorisation of the committed crime and to the lower amount of money that can be stolen in a robbery. Indeed, security systems are in general concentrated on the branch rather than on the ATM area, which is usually located outside the building. This also allows perpetrators to carry out the assaults at night. An important issue to consider about these acts is their collateral effects. In fact, the violence necessary in such attacks often leads to serious physical damage to buildings and objects in the neighbourhood of the targeted area, such as cars; and this is the best-case scenario, in which no people are involved.

Given these premises, it is clearly essential to develop technologies capable of preventing this kind of situation. Crucial features of such a system are a low rate of false alarms and prompt detection of the potential risk, both to alert the relevant control systems and, first of all, to try to automatically discourage the criminal action under way by means of deterrents.

In this paper we propose ATMSense, an automatic surveillance system based on video stream analysis of depth frames. This approach makes it possible to analyse in real time the actions performed in front of the ATM, while preserving the privacy of customers. Depth images are processed by a Machine Learning algorithm in order to predict the nature of the ongoing situation. Even though the tests are performed on data recorded in our laboratory, the quality of the results lays the groundwork for in-depth experimentation in the field.





A. Video Surveillance

Recent advances in Deep Learning techniques, and in particular in the approaches dedicated to Computer Vision [3][4], have led to substantial improvements in Image and Video Analysis algorithms. Methodologies for Video Surveillance and, more generally, for Action Recognition [5] based on different approaches have been investigated in the past and reach good results in restricted scenarios, but Deep Learning methods can provide state-of-the-art results, at least in the short term. Taking into account these results and the possibility of fast and portable prototyping of such algorithms, it seems reasonable to follow this direction and to move towards technologies that should become even more widespread and consolidated in the future. Moreover, such approaches should scale directly when facing new kinds of specific situations and types of attack.

B. ATMs Protection

As ATMs started to play a central role in customer services, many works have been developed to improve the security of these interactions. Several systems designed to deal with identity theft [6][7][8], interactions with forged documents and certificates [9] and the detection of specific dangerous situations [10][11] have been developed through the investigation and integration of various hardware devices. However, the most common approach is analysis by surveillance cameras, trying to recognise the actions that characterise a potentially critical scenario [13]. In other cases, more specific systems have been oriented towards face detection and tracking [14] or towards the recognition of partially occluded faces and bodies [15][16].




Fig. 1. ATMSense uses a depth camera connected to a Single Board Computer to analyse the surroundings of an ATM.




In our approach, we turn to a fairly new technology, image analysis by means of depth cameras, which is, to the best of our knowledge, unexplored in this context. This should allow us to combine the representation capabilities of video processing with the need for customer privacy protection, for both ethical and legal reasons.





ATMSense is intended to discriminate between the behaviours people exhibit in front of an ATM, in order to detect risky situations at an early stage. The sensor used to analyse the scene is the Intel RealSense depth camera. Using the depth image instead of the RGB one provides several advantages: we avoid dealing with personal data and privacy issues; the image is largely unaffected by lighting conditions; and, from a computational point of view, we gain a slight improvement by reducing the input channels from three to one. Depth images are processed on a Single Board Computer (SECO A80) with image processing techniques and Convolutional Neural Networks.

A. Intel RealSense

Intel RealSense is a family of depth cameras providing several video streams: RGB, depth and infrared.
ATMSense is compatible with two camera models. The short-range RealSense SR300 can be placed in the ATM chassis, focusing on the ATM keyboard area. The long-range RealSense R200 camera is intended to be placed above the ATM, focusing on the whole scene of interest. As stated in the Results section, the performance is similar for both cameras. The short-range camera should be embedded in new ATMs, while the long-range one fits better as an external add-on for already installed ATMs.
Whichever camera is used, the depth video stream is used to classify what is going on in the ATM area. For debugging purposes RGB streams can be collected, but they are not used either for training the system or at runtime.


In fact, relying on the RGB stream would create a dependency on factors that we do not want to depend on, such as lighting conditions. Moreover, dealing with faces and other personal images can be an issue under privacy laws. Having only a low-resolution shape of the person does not allow personal identification.


B. SECO A80

SECO A80 [17] (depicted in Figure 2) is a low-power Single Board Computer based on the Intel Braswell CPU family, up to the quad-core Intel Pentium N3710. RAM is modular, with two DDR3L SO-DIMM slots. The board offers standard desktop connectivity: USB3 ports, HDMI output, M.2 for SSDs, Gigabit Ethernet ports.


Fig. 2. SECO A80 Single Board Computer.


By providing a standard UEFI firmware, it runs mainstream x86 operating systems. Our tests were run on Ubuntu 16.04, although any modern Linux distribution providing Python 2.7 can be used.

C. Image Processing

Depth images collected from the cameras are preprocessed before classification. In this phase we want to remove both the noise and the background objects. The noise is intrinsic to the camera sensor and is reduced using a cascade of standard image processing filters (e.g. median filtering, erosion, depth clipping and so on). This technique generates one video frame from every 5 frames read from the depth camera. Although the frame rate of the system drops from 30 fps to 6 fps, the information necessary to classify the images is preserved. The background suppression is related to the environment in which the ATM is located, and includes the device itself. The background is subtracted (using kNN-based techniques), making the solution independent of the specific ATM and environment. Moreover, in order to improve the generalization capabilities of learning algorithms, it is better to provide only the necessary information.
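The preprocessing steps described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the filter cascade actually used in ATMSense: the function names, the depth range (`near`, `far`), the `tolerance` value and the use of a static background profile are all hypothetical, and the simple per-pixel background difference is a stand-in for the kNN-based subtraction mentioned in the text.

```python
from statistics import median

def temporal_median(frames):
    """Fuse equally sized depth frames into one frame, pixel by pixel."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]

def clip_depth(frame, near, far):
    """Set depth readings outside [near, far] to 0 (treated as 'no data')."""
    return [[d if near <= d <= far else 0 for d in row] for row in frame]

def subtract_background(frame, background, tolerance):
    """Keep only pixels that differ from a static background profile.

    Simplified stand-in for the kNN-based subtraction named in the text.
    """
    return [[d if abs(d - b) > tolerance else 0
             for d, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]

def preprocess(stream, background, group=5, near=200, far=1500, tolerance=30):
    """Turn every `group` raw frames into one cleaned frame (30 fps -> 6 fps)."""
    cleaned = []
    for start in range(0, len(stream) - group + 1, group):
        fused = temporal_median(stream[start:start + group])
        fused = clip_depth(fused, near, far)
        cleaned.append(subtract_background(fused, background, tolerance))
    return cleaned

# Toy 2x2 example: one second of video (30 frames) with a hand at depth 400
# in the top-left pixel and a single-frame sensor spike in the bottom-left.
background = [[1000, 1000], [1000, 1000]]
stream = [[[400, 1000], [1000, 1000]] for _ in range(30)]
stream[2] = [[400, 1000], [5000, 1000]]
cleaned = preprocess(stream, background)
```

In this toy example the 30 input frames yield 6 cleaned frames; the isolated spike is absorbed by the temporal median and only the foreground pixel survives the background subtraction.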


The difference between the original image read from the camera and the cleaned version is visible in Figure 3 (Intel R200) and Figure 4 (Intel SR300).




Fig. 3. On the left is shown a frame from Intel RealSense R200. On the right, the same frame is preprocessed reducing the noise and subtracting the background.


Fig. 4. Intel RealSense SR300 frames are less noisy. Background information is removed from the right image.



D. Convolutional Neural Networks


Once we get a cleaned stream from the camera, we need to perform the computations needed to predict the state of the current scene. As already said, the algorithmic approach relies on Deep Learning techniques. In particular, Convolutional Neural Networks (CNNs) represent the state of the art in almost all Computer Vision applications, such as Image Segmentation and Classification, Object Detection and Recognition. These architectures are biologically inspired by the human visual system [18], and their characterizing property is expressed through the concept of receptive field. Receptive fields act as pattern detectors, which are used to generate internal feature maps representing the presence of specific shapes in each region of the image. This process is repeated through several layers (see Figure 5), iteratively performing dimensionality reduction (Max-Pooling), to come up with a numerical 1-D vector that encodes the original image. The obtained representation can then be fed to a standard Artificial Neural Network (ANN) classifier, which performs the desired predictions. This composition allows a high representational capability, a relatively simple training procedure (derived straightforwardly from the standard Back-Propagation algorithm), and a weight-sharing policy between hidden units that reduces the computational cost.

However, the large number of parameters (of the order of tens of millions) of such algorithms requires a correspondingly large dataset to achieve effective training, leading to accurate and general prediction performance.



Fig. 5. Convolutional Neural Network architecture.





In order to collect the required data, we reproduced the real working environment in our laboratory by installing ATMSense on a decommissioned ATM provided by Monte dei Paschi di Siena Bank. As a prototype, we taped an Intel RealSense SR300 to the ATM frame, and we installed the R200 camera on top of a support above the ATM. With both cameras connected, we recorded 132 depth videos simulating both the withdrawal and the attack scenarios, which represent the two classes to be discriminated by the classifier. To improve variability and generalisation, these videos were staged by several actors in different sessions, under different lighting conditions (which only slightly affect the acquired images). Videos were manually labelled at the single-frame level. Background profiling was carried out by recording 25 videos without any kind of interaction with the ATM.


A. CNN Training


In the training phase, pre-processed videos (as stated in section III.C) are split into training and test sets as reported in Table 1. The dataset is then generated by separating and shuffling sequences of consecutive frames together with the corresponding labels. In this way we obtained about 250,000 and 30,000 labelled samples for training and test respectively.
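As an illustration of how labelled samples can be derived from frame-level annotations, the sketch below cuts each video into sequences of consecutive frames and shuffles them. It is a hedged sketch, not the procedure actually used: the function names are hypothetical, the use of overlapping windows is an assumption, and so is the choice of labelling each sequence with the label of its last frame (0 for withdrawal, 1 for attack).

```python
import random

def make_samples(frames, labels, seq_len):
    """Cut one labelled video into sequences of `seq_len` consecutive frames."""
    samples = []
    for end in range(seq_len, len(frames) + 1):
        window = frames[end - seq_len:end]
        # Assumption: a sequence inherits the label of its most recent frame.
        samples.append((window, labels[end - 1]))
    return samples

def build_dataset(videos, seq_len, seed=0):
    """Collect sequences video by video, then shuffle them together."""
    samples = []
    for frames, labels in videos:
        # Windows are built per video, so no sequence spans two recordings.
        samples.extend(make_samples(frames, labels, seq_len))
    random.Random(seed).shuffle(samples)
    return samples

# Toy example: integers stand in for depth frames.
videos = [
    (list(range(10)), [0] * 6 + [1] * 4),   # withdrawal turning into an attack
    (list(range(100, 108)), [0] * 8),       # plain withdrawal
]
dataset = build_dataset(videos, seq_len=5)
```

Because the split between training and test sets is made at the video level before this step, sequences from the same recording never end up on both sides of the split.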

The training phase was performed within the Keras framework using the TensorFlow backend. This enables an easy implementation capable of exploiting the multi-GPU cluster (provided by Barcelona Supercomputing Center). Since this process requires several hours to complete and considerable trial and error was necessary to find the best hyper-parameters and network configurations, we also investigated a few settings related to computational issues. In practice, a preliminary tuning of a few variables (such as the mini-batch size of the network forward step) nearly halved the execution time of the training phase: with this tuning, the overall train-validation-test process was accelerated by a factor of 1.86 while maintaining the same accuracy.












Different CNN architectures have been tested, but we only report the results of the best one, composed of 3 convolutional layers, with ReLU as non-linear activation and Max-Pooling to perform dimensionality reduction. The fully connected classification layer is composed of 256 hidden units. All the architectures have been tested on different datasets, generated using different lengths of the input sequences. The classification accuracies obtained by the best networks are reported in Table 2.
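To give a feel for the size of a network of this kind, the following sketch propagates tensor shapes through three convolution and Max-Pooling stages followed by a 256-unit fully connected layer, and counts the trainable parameters. Only the number of convolutional layers and the 256 hidden units come from the text; the 64x64 input size, the 3x3 kernels, the filter counts (32, 64, 128), the 2x2 pooling, the unpadded convolutions and the two-unit output layer are assumptions made purely for illustration.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def conv_out(size, kernel):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool):
    """Spatial size after non-overlapping max-pooling."""
    return size // pool

def describe_cnn(input_size, in_channels, filters,
                 kernel=3, pool=2, hidden=256, classes=2):
    """Return (length of the flattened encoding, total trainable parameters)."""
    size, channels, total = input_size, in_channels, 0
    for out_channels in filters:
        total += conv_params(kernel, channels, out_channels)
        size = pool_out(conv_out(size, kernel), pool)
        channels = out_channels
    flat = size * size * channels
    total += flat * hidden + hidden        # fully connected hidden layer
    total += hidden * classes + classes    # output layer (withdrawal / attack)
    return flat, total

# Hypothetical configuration: single-channel 64x64 depth frame.
flat_depth, params_depth = describe_cnn(64, 1, (32, 64, 128))
# Same network fed with a three-channel RGB frame, for comparison.
flat_rgb, params_rgb = describe_cnn(64, 3, (32, 64, 128))
```

With these assumed sizes the encoding has 4608 elements and the network about 1.27 million parameters, most of them in the fully connected layer; moving from three input channels to one removes only 576 weights from the first layer.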

Since the predictions are, in practice, not perfect, we added a further layer to refine the working performance. This layer determines whether to raise an alarm, based on majority voting over a buffer of recent network predictions (of length varying from 10 to 20 elements). In fact, an alarm is raised only if more than 95% of the last predictions are classified as attacks. This makes it possible to correctly classify each video from the test set in a more realistic scenario. We can find many configurations in which no false alarm is raised on withdrawal videos and, on the other hand, all the attacks are detected. In Table 3 we report statistics on the detection time w.r.t. the beginning of an assault. For brevity, we only report the best case for each sequence length.
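The voting layer can be sketched as a fixed-length buffer of recent predictions. This is a minimal sketch under assumptions: the class name `AlarmFilter` is hypothetical, and so is the choice of staying silent until the buffer is full. The buffer length and the 95% threshold follow the values given in the text.

```python
from collections import deque

class AlarmFilter:
    """Raise an alarm only when recent predictions agree on an attack."""

    def __init__(self, buffer_len=20, threshold_percent=95):
        self.buffer = deque(maxlen=buffer_len)
        self.threshold_percent = threshold_percent

    def update(self, is_attack):
        """Store the latest prediction and return True if an alarm is due."""
        self.buffer.append(bool(is_attack))
        if len(self.buffer) < self.buffer.maxlen:
            return False  # assumption: no alarm before the buffer is full
        attacks = sum(self.buffer)
        # Integer comparison avoids floating-point issues at the threshold.
        return attacks * 100 > self.threshold_percent * len(self.buffer)

# Toy stream: a withdrawal with one spurious attack prediction,
# followed by a sustained attack.
predictions = [False] * 9 + [True] + [False] * 5 + [True] * 10
alarm_filter = AlarmFilter(buffer_len=10)
alarms = [alarm_filter.update(p) for p in predictions]
first_alarm = alarms.index(True)
```

With a buffer of 10 or 20 elements, "more than 95%" means in practice that every prediction in the buffer must be an attack, so an isolated misclassification cannot trigger an alarm. At 6 classified frames per second, a 10-element buffer corresponds to under two seconds of video.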





As we can observe, the reported detection times are acceptable w.r.t. a real situation, since a potential attack can be detected in a few seconds, giving the Surveillance Control Room enough time to analyse the scene and, possibly, to take dissuasive actions or call security. From a practical point of view, we can see that the additional layer used to filter the network’s predictions by majority voting is fundamental to reach the final results. We can also observe that feeding the classifier with a sequence of frames (5 or 10 in our tests) instead of a single frame does not lead to a remarkable improvement and, in the end, this choice only makes the system less prompt. This may be due to the fact that the scene understanding task is collapsed into a two-class classification problem. However, from an external point of view, it also seems reasonable that a human could decide from a single picture of the scene whether an assault is taking place or not.


A. Real-Time Classification


After the training, we tested the real-time performance on a SECO A80 SBC. Since relatively little computing power is available, the application creates different threads to parallelize the computation. The first one handles the USB connection with the Intel RealSense camera and stores the incoming video frames in a buffer; another one preprocesses the incoming frames, subtracting the background and reducing the noise; the last one classifies the image.
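The three-thread organisation described above can be sketched with the Python standard library as a producer/consumer pipeline connected by queues. This is an illustrative sketch only: the function names, the sentinel-based shutdown and the stand-in callables for camera, preprocessing and classifier are all assumptions, not the actual ATMSense code.

```python
import itertools
import queue
import threading

STOP = object()  # sentinel telling the next stage to shut down

def capture(read_frame, n_frames, raw_q):
    """Thread 1: read frames from the camera and buffer them."""
    for _ in range(n_frames):
        raw_q.put(read_frame())
    raw_q.put(STOP)

def preprocess_stage(clean, raw_q, clean_q, group=5):
    """Thread 2: turn every `group` raw frames into one cleaned frame."""
    batch = []
    while True:
        frame = raw_q.get()
        if frame is STOP:
            clean_q.put(STOP)
            return
        batch.append(frame)
        if len(batch) == group:
            clean_q.put(clean(batch))
            batch = []

def classify_stage(classify, clean_q, results):
    """Thread 3: classify each cleaned frame."""
    while True:
        frame = clean_q.get()
        if frame is STOP:
            return
        results.append(classify(frame))

def run_pipeline(read_frame, clean, classify, n_frames):
    raw_q, clean_q, results = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=capture, args=(read_frame, n_frames, raw_q)),
        threading.Thread(target=preprocess_stage, args=(clean, raw_q, clean_q)),
        threading.Thread(target=classify_stage, args=(classify, clean_q, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

# Stand-ins: numbers play the role of frames, the mean of a batch plays the
# role of preprocessing, and a threshold plays the role of the CNN.
counter = itertools.count()
results = run_pipeline(
    read_frame=lambda: next(counter),
    clean=lambda batch: sum(batch) / len(batch),
    classify=lambda value: value > 12,
    n_frames=30,
)
```

Decoupling the stages with queues lets the capture thread keep up with the camera even when preprocessing or classification of an earlier frame is still in progress.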

The A80 SBC can execute all the computation in real time. The heaviest threads are the image preprocessor, which runs in 23 ms, and the CNN classifier, which runs in 17 ms. Considering that we need to classify 6 frames each second, the available computational power is more than enough for real-time operation.
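The margin can be made explicit with a short calculation. The sketch assumes that the two reported timings refer to the work needed for one classified frame and, pessimistically, that the two stages run one after the other rather than in parallel; the variable names are illustrative.

```python
FRAMES_PER_SECOND = 6   # classified frames per second (from the text)
PREPROCESS_MS = 23      # image preprocessor time (from the text)
CLASSIFY_MS = 17        # CNN classifier time (from the text)

# Time available for each classified frame.
budget_ms = 1000 / FRAMES_PER_SECOND

# Worst case: preprocessing and classification executed sequentially.
sequential_ms = PREPROCESS_MS + CLASSIFY_MS

# Fraction of the per-frame budget that is actually used.
utilisation = sequential_ms / budget_ms
```

Under these assumptions each frame has a budget of roughly 167 ms, of which the two heaviest stages together use 40 ms, that is, about a quarter of the available time.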





In this work we propose an application of Automatic Video Analysis to improve the surveillance and security of ATMs. In laboratory tests, the system can detect attacks very quickly, both when the depth camera is integrated into the ATM itself and when it is installed nearby. Moreover, the approach employs off-the-shelf technologies whose total cost is low compared with the cost of an ATM or with the potential financial and general damage. The software solution is general, even if additional data collection and a re-training phase will be necessary, depending on the particular needs of specific situations.

Although the current solution is customised for a single mode of assault, the results obtained have allowed us to schedule, in the short term, a more realistic experimentation phase in the field. Indeed, the very fast attack detection time will allow the Surveillance Control Room to intervene promptly. Moreover, the high accuracy reduces the possibility of false alarms.





Detection accuracy in a real-world scenario could be improved by collecting further data, statistically enlarging the set of events analysed by the system. In general, adding more training data helps the CNN to generalize better, rather than overfit the training examples.

The depth footage recorded for the training is focused on explosive-based attacks. New videos could be recorded with a view to detecting additional kinds of ATM assault, providing more complete surveillance equipment.

The downside of having more depth videos is the need to tag the frames manually. A complementary approach could be to introduce a Novelty Detection algorithm that runs in parallel with the CNN. As an example, the solution we proposed in [19] for bank branch Audio-Surveillance can be redesigned for this scenario. This algorithm would be totally unsupervised, and capable of detecting any kind of anomaly arising from unexpected user behaviour. An arbiter would take as input the outputs of both algorithms and make the final decision.





This work was supported by Monte dei Paschi Bank grant DISPOC017/6. We acknowledge the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266.





[1] European Association for Secure Transactions: ATM Explosive Attacks surge in Europe, https://www.association-secure-transactions.eu/atm-explosive-attacks-surge-in-europe/, 2016
[2] European Association for Secure Transactions: EAST Publishes European Fraud Update 3-2017, https://www.association-secure-transactions.eu/east-publishes-european-fraud-update-3-2017/, 2017
[3] J. Deng et al., “A large-scale hierarchical image database,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, 2009
[4] A. Krizhevsky et al., “Imagenet classification with deep convolutional neural networks”, Advances in neural information processing systems (NIPS), pp. 1097-1105, 2012
[5] S. Herath et al., “Going deeper into action recognition: A survey,” Image and Vision Computing, pp. 4-21, 2017
[6] F. Puente et al., “Improving online banking security with hardware devices,” 39th Annual International Carnahan Conference on Security Technology (CCST), pp. 174-177, 2005
[7] H. Lasisi and A.A. Ajisafe, “Development of stripe biometric based fingerprint authentications systems in Automated Teller Machines,” 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA), pp. 172-175, 2012
[8] R. AshokaRajan et al., “A novel approach for secure ATM transactions using fingerprint watermarking,” Fifth International Conference on Advanced Computing (ICoAC), pp. 547-552, 2013
[9] H. Sako et al., “Self-defense-technologies for automated teller machines,” International Machine Vision and Image Processing Conference (IMVIP), pp. 177-184, 2007
[10] M.M.E. Raj and A. Julian, “Design and implementation of anti-theft ATM machine using embedded systems,” International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1-5, 2015
[11] S. Shriram et al., “Smart ATM surveillance system,” International Conference on Circuit, Power and Computing Technologies (ICCPCT), pp. 1-6, 2016
[12] A. De Luca et al., “Towards understanding ATM security: a field study of real world ATM use,” Proceedings of the sixth symposium on usable privacy and security, 2010
[13] N. Ding et al., “Energy-based surveillance systems for ATM machines,” 8th World Congress on Intelligent Control and Automation (WCICA), pp. 2880-2887, 2010
[14] Y. Tang et al., “ATM intelligent surveillance based on omni-directional vision,” World Congress on Computer Science and Information Engineering (WRI), pp. 660-664, 2009
[15] I-P. Chen et al., “Image processing based burglarproof system using silhouette image,” International Conference on Multimedia Technology (ICMT), pp. 6394-6397, 2011
[16] X. Zhang, “A novel efficient method for abnormal face detection in ATM,” International Conference on Audio, Language and Image Processing (ICALIP), pp. 695-700, 2014
[17] SECO SBC A80, http://www.seco.com/prods/it/sbc-a80-enuc.html, 2017
[18] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, pp. 193-202, 1980
[19] A. Rossi et al., “Auto-Associative Recurrent Neural Networks and Long Term Dependencies in Novelty Detection for Audio Surveillance Applications,” IOP Conference Series: Materials Science and Engineering, 2017