Identifying Moving Objects using OpenCV Background Subtraction

Frameworks and tools used: C++, Python, OpenCV, TensorFlow, Git

Link to GitHub repo

Introduction

Background subtraction enables the detection of moving objects in video frames and is therefore a critical pre-processing step in many computer vision applications, such as smart environments (e.g., room and parking occupancy monitoring, fall detection) and visual content analysis (e.g., action detection and recognition, post-event forensics), that rely on accurate motion detection to operate effectively. It should come as no surprise, then, that background subtraction methods have received a relatively high degree of academic interest, with various evaluation datasets made available as open-source benchmarks so that researchers can train and evaluate novel background subtraction algorithms. In this project we use the publicly available CDTNet-14 dataset to evaluate the performance of several background subtraction algorithms provided by the OpenCV library, in both C++ and Python, and test these models on video feeds with variable lighting to assess how reliably each one distinguishes moving objects from shadows and other surrounding noise. Please feel free to skip to the Results section to check out the final background subtraction outputs of these models.


CDTNet-14 Dataset

The CDTNet-14 dataset was originally developed for the 2014 Change Detection Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) and provides a total of 53 videos spanning ~140,000 frames, covering a wide range of detection applications in both indoor and outdoor environments. CDTNet-14 further categorizes its videos by the specific processing challenge each presents: the Baseline category, for instance, contains relatively low-difficulty videos with subtle background motion, while the Shadows category comprises videos in which shadows accompany the moving foreground objects. In this project we use the highway and indoor office foot-traffic video feeds below, taken from the Baseline and Shadows categories respectively, to compare the relative performance of our background subtraction models in both environment types:


Methodology

As a first step in our exploration, we leverage the OpenCV framework in C++ to implement both the k-Nearest Neighbors (kNN) and MOG2 adaptive Gaussian mixture background subtraction models, as originally proposed by Zivkovic and van der Heijden and by Zivkovic, respectively. The C++ script below takes as input a user-specified video path along with the type of background subtraction algorithm to use in our analysis, either kNN or MOG2:

#include <iostream>
#include <sstream>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/video.hpp>

using namespace cv;
using namespace std;

const char* params =
    "{ help h |                     | Print usage }"
    "{ input  | highway_traffic.mp4 | Path to a video or a sequence of images }"
    "{ algo   | KNN                 | Background subtraction method (KNN, MOG2) }";

With our char* params argument definition in place, we can implement our main function as shown below. It creates an OpenCV BackgroundSubtractor object, used to separate foreground objects from the background, and an OpenCV VideoCapture object that supplies the video frames to be analyzed. The main function then runs a while loop that passes each frame of the VideoCapture object to our pBackSub background subtraction model, which returns a predicted foreground mask highlighting moving objects in white against a dark background.

int main(int argc, char* argv[])
{
    CommandLineParser parser(argc, argv, params);
    parser.about("This program shows how to use background subtraction methods provided by "
        " OpenCV. You can process both videos and images.\n");
    if (parser.has("help"))
    {
        //print help information and exit
        parser.printMessage();
        return 0;
    }

    //! [create]
    //create Background Subtractor objects
    Ptr<BackgroundSubtractor> pBackSub;
    if (parser.get<String>("algo") == "MOG2")
        pBackSub = createBackgroundSubtractorMOG2();
    else
        pBackSub = createBackgroundSubtractorKNN();
    //! [create]

    //! [capture]
    VideoCapture capture(samples::findFile(parser.get<String>("input")));
    if (!capture.isOpened()) {
        //error in opening the video input, exit with failure
        cerr << "Unable to open: " << parser.get<String>("input") << endl;
        return -1;
    }
    //! [capture]

    Mat frame, fgMask;
    while (true) {
        capture >> frame;
        if (frame.empty())
            break;

        //! [apply]
        //update the background model
        pBackSub->apply(frame, fgMask);
        //! [apply]

        //! [display_frame_number]
        //get the frame number and write it on the current frame
        rectangle(frame, cv::Point(10, 2), cv::Point(100, 20),
            cv::Scalar(255, 255, 255), -1);
        stringstream ss;
        ss << capture.get(CAP_PROP_POS_FRAMES);
        string frameNumberString = ss.str();
        putText(frame, frameNumberString.c_str(), cv::Point(15, 15),
            FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));
        //! [display_frame_number]

        //! [show]
        //show the current frame and the fg masks
        imshow("Frame", frame);
        imshow("FG Mask", fgMask);
        //! [show]

        //get the input from the keyboard
        int keyboard = waitKey(30);
        if (keyboard == 'q' || keyboard == 27)
            break;
    }

    return 0;
}

Running this script with our input video and algorithm specified via the command line displays the two video feeds below: one showing the original, unaltered video and the other showing the background subtraction mask produced by our model:
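For reference, the same kNN/MOG2 processing loop can be expressed in a few lines of Python. The sketch below is an assumed minimal equivalent using the opencv-python bindings and the highway_traffic.mp4 clip that serves as the C++ script's default input:

import cv2

#choose either KNN or MOG2, mirroring the --algo flag of the C++ script
back_sub = cv2.createBackgroundSubtractorKNN()  #or cv2.createBackgroundSubtractorMOG2()

capture = cv2.VideoCapture("highway_traffic.mp4")  #assumed local test clip
if not capture.isOpened():
    raise IOError("Unable to open input video")

while True:
    ret, frame = capture.read()
    if not ret:
        break
    #update the background model and get the foreground mask for this frame
    fg_mask = back_sub.apply(frame)
    cv2.imshow("Frame", frame)
    cv2.imshow("FG Mask", fg_mask)
    if cv2.waitKey(30) in (ord('q'), 27):  #quit on 'q' or Esc
        break

capture.release()
cv2.destroyAllWindows()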

While both our kNN and MOG2 models accurately highlight the pixels corresponding to moving cars in our image, the output masks also capture the cars' shadows, which may introduce complications for later analysis. To investigate other model types that may perform better on video feeds containing shadows, we can leverage CDTNet-14's Shadows videos and their corresponding ground-truth masks, which provide pixel-level outlines of the moving objects and their shadows, to quantify the relative strengths and weaknesses of each model on this data. We will therefore use the two videos below, from the CDTNet-14 Baseline and Shadows datasets, to evaluate our models:
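As an aside, the kNN and MOG2 subtractors used above expose a detectShadows option that labels shadow pixels with a separate gray value in the output mask, which can then be thresholded away. The snippet below is a minimal sketch of that workaround in Python, again assuming the highway clip; for the remainder of this project, however, we instead compare dedicated bgsegm models on the shadow data:

import cv2

#detectShadows=True makes MOG2 mark shadow pixels as gray (value 127)
#while genuine foreground pixels are white (255) in the returned mask
back_sub = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

capture = cv2.VideoCapture("highway_traffic.mp4")  #assumed local test clip
while True:
    ret, frame = capture.read()
    if not ret:
        break
    fg_mask = back_sub.apply(frame)
    #drop the gray shadow label, keeping only confident foreground pixels
    _, fg_no_shadow = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)
    cv2.imshow("FG Mask without shadows", fg_no_shadow)
    if cv2.waitKey(30) in (ord('q'), 27):
        break
capture.release()
cv2.destroyAllWindows()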

Because the OpenCV library's bgsegm background subtraction module comes pre-loaded with a number of additional background subtraction algorithms, we will evaluate the following eight models (constructed as sketched after this list) to identify the one with the best performance on both non-shadow and shadow data, as measured by F1-score:

  • MOG
  • GMG
  • LSBP-vanilla
  • LSBP-speed
  • LSBP-quality
  • LSBP-comp
  • GSOC
  • GSOC-comp
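The snippet below is a minimal sketch of how these bgsegm models can be constructed in Python (assuming the opencv-contrib-python package). Note that the -vanilla, -speed, -quality and -comp variants in the list above are parameter presets of the same LSBP and GSOC constructors used by the opencv-contrib evaluation script rather than separate classes:

import cv2
import numpy as np

models = {
    "MOG":  cv2.bgsegm.createBackgroundSubtractorMOG(),
    "GMG":  cv2.bgsegm.createBackgroundSubtractorGMG(),
    "LSBP": cv2.bgsegm.createBackgroundSubtractorLSBP(),
    "GSOC": cv2.bgsegm.createBackgroundSubtractorGSOC(),
}

#every model shares the same apply() interface used with KNN and MOG2 earlier
frame = np.zeros((240, 320, 3), dtype=np.uint8)  #stand-in frame for illustration
for name, model in models.items():
    fg_mask = model.apply(frame)
    print(name, fg_mask.shape)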

To evaluate the performance of our models on both shadow and non-shadow video feeds, we can adapt the opencv-contrib package's background subtraction Python evaluation script into the following evaluation pipeline:

import argparse
import numpy as np
#note: find_relevant_dirs, evaluate_on_sequence and ALGORITHMS_TO_EVALUATE are
#defined earlier in the full evaluation script and are omitted here for brevity

def main():
    #parse command line arguments used later in our args variable
    parser = argparse.ArgumentParser(description='Evaluate all background subtractors using Change Detection 2014 dataset')
    parser.add_argument('--dataset_path', help='Path to the directory with dataset. It may contain multiple inner directories. It will be scanned recursively.', required=True)
    parser.add_argument('--algorithm', help='Test particular algorithm instead of all.')

    args = parser.parse_args()
    #get groundtruth and input data dirs
    dataset_dirs = find_relevant_dirs(args.dataset_path)
    assert len(dataset_dirs) > 0, ("Passed directory must contain at least one sequence from the Change Detection dataset. There is no relevant directories in %s. Check that this directory is correct." % (args.dataset_path))
    if args.algorithm is not None:
        global ALGORITHMS_TO_EVALUATE
        #defining OpenCV background subtraction algorithm to evaluate 
        ALGORITHMS_TO_EVALUATE = [algo_tuple for algo_tuple in ALGORITHMS_TO_EVALUATE if algo_tuple[1].lower() == args.algorithm.lower()]
    summary = {}
    #calculating pixel-level recall, precision and f1-score performance metrics of our model vs groundtruth 
    for seq in dataset_dirs:
        evaluate_on_sequence(seq, summary)
    
    #compiling performance metrics of our models 
    for category in summary:
        for algo_name in summary[category]:
            summary[category][algo_name] = np.mean(summary[category][algo_name], axis=0)
    #printing performance summaries of our models 
    for category in summary:
        print('=== SUMMARY for %s (Precision, Recall, F1, Accuracy) ===' % category)
        for algo_name in summary[category]:
            print('%05s: %.3f %.3f %.3f %.3f' % ((algo_name,) + tuple(summary[category][algo_name])))

if __name__ == '__main__':
    main()

As an overview of the above, this main() function performs the following actions to evaluate our models:

  1. Parse our --dataset_path and --algorithm command line arguments, which point to our dataset directory and specify the background subtraction model to use, respectively
  2. Verify and store our groundtruth and input dataset paths in the variable dataset_dirs
  3. Create our algorithm object as specified via the command line argument
  4. Calculate the average recall, precision and F1-score of the selected algorithm's predictions against the groundtruth directory's pixel-level masks (see the metric sketch after this list)
  5. Compile the per-category performance of each model into a summary dictionary and print the average precision, recall, F1-score and accuracy for all models tested
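For clarity on step 4, the helper below is a minimal sketch (a hypothetical illustration, not part of the opencv-contrib script) of how pixel-level precision, recall and F1-score can be computed from a binary predicted mask and its groundtruth mask of the same size:

import numpy as np

def mask_scores(pred_mask, gt_mask):
    pred = pred_mask > 0
    gt = gt_mask > 0
    tp = np.logical_and(pred, gt).sum()    #foreground pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   #background pixels flagged as foreground
    fn = np.logical_and(~pred, gt).sum()   #foreground pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1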

Running this pipeline across our eight models produces the results below, which show reduced F1-scores on the Shadows dataset relative to the non-shadow Baseline for all models except MOG and LSBP-vanilla:

Model           CDTNet-14 Baseline            CDTNet-14 Shadows
                Recall  Precision  F1         Recall  Precision  F1
MOG             0.35    0.87       0.50       0.47    0.66       0.55
GMG             0.99    0.92       0.95       0.54    0.30       0.39
LSBP-vanilla    0.59    0.12       0.20       0.90    0.17       0.29
LSBP-speed      0.97    0.25       0.40       0.97    0.17       0.29
LSBP-quality    0.86    0.27       0.41       0.97    0.19       0.32
LSBP-comp       0.86    0.28       0.42       0.97    0.17       0.29
GSOC            0.96    0.99       0.97       0.82    0.52       0.64
GSOC-comp       0.96    0.99       0.97       0.82    0.50       0.62

Our highest-performing GSOC model achieves near-perfect recall and precision of 0.96 and 0.99 on the no-shadow Baseline dataset, but drops to 0.82 recall and 0.52 precision on the Shadows dataset. This suggests that while these out-of-the-box models may be suitable for applications requiring high sensitivity, such as threat detection, in environments with few or no shadows, the same cannot be said for shadowed environments, where 82% recall and 52% precision would translate into an unacceptably high ~20% missed-detection rate and a ~50% false alarm rate among positive predictions.
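As a quick sanity check of those rates, a minimal snippet using the GSOC Shadows figures from the table above:

#back-of-the-envelope check of the quoted miss and false alarm rates
recall, precision = 0.82, 0.52
miss_rate = 1 - recall            #~0.18, i.e. roughly 20% of true objects missed
false_alarm_rate = 1 - precision  #~0.48, i.e. roughly half of positive detections are false
print(f"missed detections: {miss_rate:.0%}, false alarms among positives: {false_alarm_rate:.0%}")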

Visualizing the prediction masks each model produces on a single frame of our Shadows dataset gives a better sense of their relative strengths and weaknesses, and showcases GSOC's stronger performance (a sketch of how these single-frame masks can be generated follows the figures):

Figure 5. Shadows input frame #2450 image
Figure 6. Groundtruth mask on Shadows input frame #2450
Figure 7. GSOC prediction mask on Shadows input frame #2450
Figure 8. GSOC-comp prediction mask on Shadows input frame #2450
Figure 9. GMG prediction mask on Shadows input frame #2450
Figure 10. MOG prediction mask on Shadows input frame #2450
Figure 11. LSBP-vanilla prediction mask on Shadows input frame #2450
Figure 12. LSBP-comp prediction mask on Shadows input frame #2450
Figure 13. LSBP-quality prediction mask on Shadows input frame #2450
Figure 14. LSBP-speed prediction mask on Shadows input frame #2450
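The masks above can be reproduced with a short script that replays the sequence up to the frame of interest and saves the resulting prediction mask. The sketch below assumes a placeholder path to the Shadows image sequence and uses the GSOC model as an example:

import cv2

TARGET_FRAME = 2450  #frame number shown in the figures above
model = cv2.bgsegm.createBackgroundSubtractorGSOC()

#placeholder image-sequence path; OpenCV reads numbered frames via a printf-style pattern
capture = cv2.VideoCapture("shadows_sequence/in%06d.jpg")
frame_idx = 0
fg_mask = None
while frame_idx < TARGET_FRAME:
    ret, frame = capture.read()
    if not ret:
        break
    #the model needs the preceding frames to build its background estimate
    fg_mask = model.apply(frame)
    frame_idx += 1

if fg_mask is not None:
    cv2.imwrite("gsoc_frame_2450_mask.png", fg_mask)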

Conclusion

In this project we demonstrated OpenCV background subtraction implementations in both C++ and Python, showing that while GSOC-based background subtraction models can achieve suitable recall and precision in shadow-free environments, their reduced performance in environments with heavier shadows and other lighting variance makes them ill-suited out of the box for applications requiring high sensitivity, such as threat detection in crowded environments.

Thanks for reading!