Practical Applications of Deep Learning: Build a DIY Smart Security Camera Using the Intel® Movidius™ Neural Compute Stick

Have you ever wanted to deploy your deep neural network (DNN) models onto a weekend project that is not only fun but also creates a useful gadget to have around the house? Maybe Fido figured out the motion-triggered camera system and fills up your disk space with videos of him jumping around the room. Or maybe you don’t want to go through 10 hours of CCTV footage trying to figure out what happened to your delivery package. A friend of mine has been asking me to help him update his motion-software-based DIY CCTV camera for a while, so I decided to finally help him out. The end result was a smart camera proof of concept created using the Intel® Movidius™ Neural Compute Stick that improves upon traditional CCTV systems by intelligently detecting and recording activity in which the user is specifically interested.

In order to better understand the real-world problem that my project would solve, I did some online research on the need for home security cameras and was quite surprised by the results. As more consumers move towards online shopping and have their items delivered to their doorstep, the risk of having these packages stolen has increased. Some stats are listed below from an online survey [1]:

  • 11 million homeowners have had a package stolen in the last year
  • 53% of homeowners are worried about packages left outside their home being stolen
  • 74% of packages are stolen during the day when homeowners are at work
  • The average value of stolen packages is $50 – $100
  • Victims spend close to $200 to replace each stolen package
  • 70% of all homeowners expect to receive a package over the holiday season
  • FedEx and UPS delivered more than 30 million packages a day between Black Friday and Christmas Day 2016

Gathering the requirements

The best way to define requirements for a product is to first experience the problem(s) it is expected to solve. In order to do this, I replicated my friend’s DIY CCTV camera and used it for a few days. The camera performed well at detecting motion, recording activity, and serving live streams, but I strongly felt the need for more intelligence in its features and functionality. For example, despite quite a bit of trial and error tweaking the motion detection configuration, I ended up with video recordings of cars driving by, wind blowing through the trees, and of course my two-year-old son learning how to trigger motion detection on the camera. I needed a system that was not only easy to set up (no hand-tweaking the motion detection configuration), but also reliable at detecting the subjects, objects, and activity in which I am specifically interested.

In an attempt to add ‘smartness’ to the camera, I interviewed a couple of my friends about their perception of a smart security camera. Below is a visual representation of their collective requirements.


Figure 1: Camera detects specific objects or subjects, instead of just pixel changes.


Figure 2: Camera detects, but does not recognize the subject.

In order to set achievable goals for this project, I simplified the requirements into four main tasks. See Figure 3 for a visual representation of these simplified requirements.

Figure 3: Simplified requirements


Analyzing the requirements

Given the fixed set of requirements, I decided to break them down into simpler blocks and find suitable hardware and software components for this project.

Requirement #1: Small enough to fit on/near a door

Unless I live in the giant’s house from Jack and the Beanstalk (which I don’t), my development laptop won’t fit on or near my door. So I need something that is not only small, but light enough to physically hang on the door. I had built the CCTV camera using a Raspberry Pi 3 Model B (RPi) and a PiCamera, so I decided to leverage that hardware. Since I wanted to apply AI (deep learning) to this project, I paired the RPi with an Intel® Movidius™ Neural Compute Stick, which is designed to offload deep neural network inference from an application processor. Below is an illustration of my hardware setup.


Figure 4: Illustration of hardware setup

Requirement #2: Low-power consumption

Since the entire setup will potentially be powered by a battery pack, I had to ensure that none of my hardware components were power guzzlers. Fortunately, the Intel® Movidius™ Neural Compute Stick is a low-power device designed to run off a single USB 2.0 or 3.0 port. I plugged it into one of the four USB 2.0 ports on the RPi, which itself can be powered either through the micro-USB port or through the RPi stacking header. I used an off-the-shelf power bank (portable phone charger) to supply power via the micro-USB port.

Requirement #3: Low-cost hardware

I started with a budget of $120 for the entire project, but a quick dive into my e-dumpster basically gave me everything I needed, so I ended up building my smart camera at no additional cost. Below is a cost estimation for your reference.

Item                                   Cost
Intel® Movidius™ Neural Compute Stick  $79
Raspberry Pi Zero W                    $10
Raspberry Pi Camera V2                 $25
USB OTG cable                          $1
SD Card (min 8GB)
Total                                  $121

NOTE: Alternatively you can buy the OctoCam kit which bundles all the required parts, except for the SD Card. I eventually ended up investing in this kit, so I could replace my ‘duct-taped’ hardware setup with this cute octopus guarding my house.

Requirement #4: Detects a person in real-time

The most difficult part about building neural-network-based products is finding the relevant dataset and training a model based on the chosen dataset. Since I was building just a proof of concept, I decided to experiment with some of the freely downloadable pre-trained neural networks. I used “An Analysis of Deep Neural Network Models for Practical Applications” by Alfredo Canziani et al. as a guide to help pick the neural network that would meet my requirements; see Figure 5 for a comparison chart.

Figure 5: Top-1 accuracy vs operations [2]

Since the results in this chart are based on tests performed on different hardware, I had to re-run these networks on the Intel® Movidius™ Neural Compute Stick. Rather than running all of them, I picked one network from each extreme: Inception-v4, the most accurate, and AlexNet, the least complex (i.e. fastest). I also ran MobileNets, a class of efficient convolutional neural networks (CNNs) designed for mobile and embedded vision applications. Table 1 lists the performance results from my test case. There are two takeaways from this test:

  1. These networks are trained on the ImageNet dataset, which doesn’t have a class/category for ‘person’.
  2. These networks are image classifiers, which assume there is only one subject or object in the entire image; images with multiple subjects or objects either produce erroneous output or classify only the dominant subject.

Network                Inference time   Frames per second
AlexNet                91.33099 ms      10.9 fps
Inception-v4           645.0548 ms      1.55 fps
MobileNet (1.0 | 224)  39.26307 ms      25.4 fps

Table 1: Bench-test results for speed
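The frames-per-second column follows directly from the inference latency (one inference per frame, so fps ≈ 1000 / latency in ms). A quick sanity check in Python:

```python
def ms_to_fps(latency_ms):
    # One inference per frame: fps is the reciprocal of latency in seconds
    return 1000.0 / latency_ms

for name, ms in [("AlexNet", 91.33099),
                 ("Inception-v4", 645.0548),
                 ("MobileNet (1.0 | 224)", 39.26307)]:
    print("%-22s %.2f fps" % (name, ms_to_fps(ms)))
```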

In order to ensure that my proof of concept would work in real-world situations, I had to pick a network that not only runs fast on the Intel® Movidius™ Neural Compute Stick, but also has a ‘person’ category and is capable of dealing with multiple subjects or objects in a single image (or a single camera frame). Fortunately, MobileNet SSD meets all of these requirements, and a pre-trained model is readily available [3]. A quick test of running MobileNet SSD on the system yielded the following results:

Network        Inference time   Frames per second
MobileNet SSD  80.47414 ms      12.4 fps

NOTE: Having a strong microcontroller background, especially in automotive and access control systems, I am careful not to throw around the term real-time. In mission-critical systems such as adaptive cruise control or blind spot detection, a 5 ms delay in reading the radar data could prove dangerous. For our use case, I think we can get by with a detection performance of 10 frames per second (fps).

Developing the application

Thanks to the Intel® Movidius™ Neural Compute SDK’s comprehensive API framework, it was quite easy to develop the app for this project. The basic structure of any app featuring this hardware breaks down into 5 simple steps:

  • Step 1: Open the enumerated device and get a handle on it.
  • Step 2: Load a graph file onto the Intel® Movidius™ Neural Compute Stick.
  • Step 3: Pre-process the images. Ex. resize, crop, color-mode conversions, etc.
  • Step 4: Read and process inference results from the device.
  • Step 5: Unload the graph and close the device.
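With NCSDK 1.x, these five steps map almost one-to-one onto the `mvnc` Python API. Below is a minimal sketch rather than the project’s actual code: the graph path and frame are placeholders, the pre-processing constants are the ones commonly used with the Caffe MobileNet SSD model, and actually calling `infer()` of course requires a Neural Compute Stick plugged in.

```python
import numpy as np

def preprocess(frame, mean=127.5, scale=0.007843):
    # Step 3: scale pixel values to roughly [-1, 1]; the frame is assumed
    # to be already resized to the network's input size (e.g. 300x300)
    return (frame.astype(np.float16) - mean) * scale

def infer(graph_path, frame):
    from mvnc import mvncapi as mvnc       # NCSDK 1.x Python API

    # Step 1: open the first enumerated device and get a handle on it
    device = mvnc.Device(mvnc.EnumerateDevices()[0])
    device.OpenDevice()

    # Step 2: load a compiled graph file onto the stick
    with open(graph_path, 'rb') as f:
        graph = device.AllocateGraph(f.read())

    # Steps 3 & 4: pre-process the frame, then read the inference result
    graph.LoadTensor(preprocess(frame), None)
    output, _ = graph.GetResult()

    # Step 5: unload the graph and close the device
    graph.DeallocateGraph()
    device.CloseDevice()
    return output
```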

The Neural Compute App Zoo [4] is loaded with example apps, so I leveraged an existing app called live-image-classifier as the foundation for this project. Apart from writing a utility script to de-serialize the output into a Python dictionary, I only had to update steps 3 and 4 to create a working prototype of the application.
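The utility script itself isn’t reproduced here, but a de-serializer along those lines might look like the sketch below. It assumes the flat output layout used by the ncappzoo SSD samples: `output[0]` holds the detection count, and each detection occupies seven floats starting at index 7 ([image_id, label, confidence, x1, y1, x2, y2], with box coordinates normalized to 0..1).

```python
import numpy as np

# PASCAL VOC classes the pre-trained Caffe MobileNet SSD model was trained on
LABELS = ('background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
          'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog',
          'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
          'train', 'tvmonitor')

def deserialize_output(output, conf_threshold=0.6):
    # Map each confident detection to its class label:
    # {label: [(confidence, (x1, y1, x2, y2)), ...]}
    detections = {}
    for i in range(int(output[0])):
        base = 7 + i * 7
        conf = float(output[base + 2])
        if np.isnan(conf) or conf < conf_threshold:
            continue  # skip invalid or low-confidence boxes
        label = LABELS[int(output[base + 1])]
        box = tuple(float(v) for v in output[base + 3:base + 7])
        detections.setdefault(label, []).append((conf, box))
    return detections
```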

Putting it all together

There was a small hitch while migrating the app from my development laptop to the OctoCam’s Raspberry Pi Zero W. The code relied heavily on OpenCV [7] to capture and preprocess frames from the camera, but there is no pre-compiled OpenCV binary or Python wheel for the Raspberry Pi. On a Raspberry Pi 3 Model B, compiling OpenCV from source takes about 4 hours, but on my RPi Zero W it failed after 56 painful hours. I could have tried cross-compiling on a development machine, but I decided on a much more effective approach: using PiCamera [5] to capture camera frames and PIL [6] to pre-process images and visualize the output.
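As a sketch of that swap (the function names here are my own, not from the published code): PiCamera exposes each frame as an RGB NumPy array, and PIL handles the resize that OpenCV used to do. The `capture_frame()` helper naturally only runs on a Pi with the camera module attached.

```python
import numpy as np
from PIL import Image

def preprocess_pil(rgb_array, dim=(300, 300)):
    # PIL replaces cv2.resize for scaling frames to the network input size
    return np.asarray(Image.fromarray(rgb_array).resize(dim))

def capture_frame(resolution=(640, 480)):
    # Requires a Raspberry Pi with the camera module attached
    import picamera
    import picamera.array
    with picamera.PiCamera(resolution=resolution) as camera:
        with picamera.array.PiRGBArray(camera) as stream:
            camera.capture(stream, format='rgb')
            return stream.array
```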

Below is a video recording of my ‘DIY smart security camera’ in action. Although the MobileNet SSD model is capable of detecting twenty different classes, the code is designed to capture images (or record video snippets) only when a person is detected. Since the RPi Zero W has built-in WiFi, I can easily SSH into the device from my development laptop and tweak the trigger mechanism. For example, it only takes a 2-3 line code change in Step 4 to start recording only when both a dog and a person are detected.

Figure 6: This is me trying to steal a package from my own porch

NOTE: I had to change the trigger mechanism to ‘person + dog’ on my security cam, because I have an ongoing problem with someone not cleaning up after their dog.
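That trigger change really does boil down to one condition. Assuming detections are keyed by class label, as produced by the de-serialization utility mentioned earlier, a hypothetical check might look like:

```python
DEFAULT_TRIGGER = frozenset(('person', 'dog'))

def should_record(detections, required=DEFAULT_TRIGGER):
    # `detections` maps class labels to lists of (confidence, box) tuples;
    # record only when every required class appears in the current frame
    return required.issubset(detections)

print(should_record({'person': [(0.92, (0.1, 0.2, 0.5, 0.9))]}))  # False: no dog yet
```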


Replicating the project

If you are interested in replicating this project, you can access the source code for both your development machine and Raspberry Pi at


[1] The package guard –
[2] An Analysis of Deep Neural Network Models for Practical Applications by Alfredo Canziani, et al.
[3] Caffe* implementation of MobileNet SSD by Chuanqi305 –
[4] Neural Compute App Zoo –
[5] RPi camera library –
[6] Python Imaging Library (PIL) –
[7] OpenCV library –

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.  For more complete information about performance and benchmark results, visit

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at

Hardware: Laptop based on Quad core Intel Core i5-6600 CPU @ 3.3GHz, 32GB RAM
Software: Ubuntu 16.04 + NCSDK 1.12
Test code: FPS numbers were generated using sample codes in, which is released to the public under MIT license, and periodically goes through IP scans.

Intel, the Intel logo, and Movidius are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.