Intro to AI for processing camera trap data

A primer on using machine learning to assist with species recognition

Individual camera traps can easily produce tens of thousands of images, many of which are empty, so using artificial intelligence (AI) - or, more specifically, machine learning (ML) - to help process and prioritize these data can lead to significant efficiency gains. That said, it's important to manage expectations about what machine learning can and cannot do in the context of your camera trap workflow, as well as understand the different types of machine learning "models" and what they're used for.

Machine learning life cycle

The following assumes some understanding of the typical ML model development life cycle - from data collection and annotation to training, deployment, and inference. If you are unfamiliar with the ML life cycle, a good overview can be found here.

ML offers increased efficiency, not automation

From a consumer's perspective, in just a few short years we've become accustomed to increasingly powerful AI, so it's understandable that many people assume predicting what kind of animal is in a camera trap image should be a relatively trivial computer science task. Unfortunately, for a variety of reasons, AI-assisted species classification in camera trap images is very difficult (more on that below), and it's important to set your expectations about what it can do for you accordingly.

For example, at the moment, there is no single, general-purpose model that can predict with reasonable accuracy any and all species you might encounter with your camera traps (the most advanced attempt to do this is Wildlife Insights, which performs well on common species but struggles with less-common ones).

And you shouldn't necessarily expect perfection from bespoke models trained on your own data and tailored to the environments and species you're interested in detecting, either. Although custom classifiers will perform much better on your unique datasets, for use cases with a low tolerance for false negatives/positives, even a model that is 95% accurate may require manual review of most, if not all, images.

But don't despair! The benefits of AI for camera trapping are real and growing; they should just be thought of in terms of efficiency gains - greatly speeding up manual review performed by "humans-in-the-loop" or quickly surfacing the most important/interesting images - rather than full-on, hands-off automation. We are in the "driver assist" stage, and it may be some time before we have a reliable and affordable self-driving car.

Why ML is hard with camera trap images

In pictures taken by humans, it's typically obvious what the photo is of: the subject is in focus, it's in the center of the frame, and it's probably not obscured by something in the foreground. If you are looking for training data to train a teapot classifier, you can do a Google Image search and find a million clear examples with different shapes, sizes, colors, angles, and backgrounds, and that sample diversity combined with subject clarity is going to help the model learn what makes a teapot a teapot pretty easily.

With camera trap images, on the other hand, the subject might be 2 inches away from the camera - completely out of focus and taking up the whole frame - or 20 feet away and 90% obscured by a bush. It might be blurry because it's moving, barely visible because it's out of flash range, or pixelated because of file corruption; the quality and composition of the images are just far less controlled. Not to mention, for rare species (which, for conservation purposes, are often the ones you care the most about), you likely have a very limited number of sample images (this is known as a long-tail distribution problem). Given the nature of the data, it's easy to see why teaching an algorithm to learn what a species looks like from camera trap images is so challenging. Garbage in, garbage out, as they say.

What's more, your training data might include two hundred thousand images (a lot of samples), but they might have been taken at just 12 locations, and of course the backgrounds of images from the same location don't change much. So what typically happens is that models get really good at detecting animals in very specific locations (i.e., those they were trained on), but as soon as you point the camera somewhere else, or use the model in a new environment, performance plummets because the models have learned too much about the background and foliage in the training data and can't generalize to other environments.

Types of ML models

For our purposes, you can group ML models into three buckets:

  • Object Detection models, in which the model is given an entire image and its job is to detect whether there are any "objects" of interest within it

  • Classification models, which take either a full or cropped image as input and try to predict what "class(es)" are present (in our case the classes would likely be species)

  • Re-identification models, which are trained to identify whether two animals are the same individual or not

These three types of models solve different problems and are useful for different use cases, but they can also be used together sequentially in different stages of a machine learning inference pipeline.
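
To make that flow concrete, here is a minimal sketch (in Python) of how the stages might be chained. The stage functions are placeholders that return dummy values rather than calling real models; they're only meant to illustrate what data passes between the detector, the classifier, and an optional re-identification step.

```python
# A minimal sketch of how the three model types might be chained in an
# inference pipeline. The stage functions below are placeholders standing in
# for real models, just to illustrate the flow of data between stages.

def detect_objects(image):
    """Stage 1: object detection -> list of (bbox, confidence) pairs."""
    return [((0.1, 0.2, 0.3, 0.4), 0.92)]  # dummy normalized bbox + confidence

def classify_species(image, bbox):
    """Stage 2: species classification on the cropped detection."""
    return ("red_fox", 0.87)  # dummy (label, confidence)

def reidentify(image, bbox, species):
    """Stage 3 (optional): match the individual against a known gallery."""
    return "individual_042"  # dummy individual ID

def run_pipeline(image):
    results = []
    for bbox, det_conf in detect_objects(image):
        label, cls_conf = classify_species(image, bbox)
        individual = reidentify(image, bbox, label)
        results.append({"bbox": bbox, "detection_conf": det_conf,
                        "species": label, "species_conf": cls_conf,
                        "individual": individual})
    return results

print(run_pipeline("example.jpg"))
```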

Object detection

Object Detection models perform a high-level but crucial service and are often the first model to be invoked in an image recognition pipeline. As mentioned above, their job is to identify objects of interest within an image and return a "bounding box" that describes where in the image each object is located. Object Detection models can also be trained to return "classes" along with their bounding boxes (e.g., this box contains a person, this box contains a car), but the classes are often intentionally very high-level.

Thanks to Microsoft's AI for Earth team, the challenge of detecting objects in camera trap images has essentially been solved. Megadetector, a highly accurate, open-source model that Microsoft developed and trained on millions of camera trap images from a wide variety of environments, is the gold standard for this stage of the pipeline.

Out of the box, it can detect and return bounding boxes for three classes (animal, person, or vehicle), but perhaps most importantly, if it doesn't return any objects you can assume the image is empty. Because camera traps produce so many empty images due to false triggers from moving foliage, shadows, etc. (often >60%), the value of separating empty images from images that contain something of interest is enormous.
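
If you run Megadetector in batch mode, the results typically land in a JSON file listing each image's detections and confidence scores. A minimal sketch for splitting that file into likely-empty images and images worth reviewing might look like the following; the JSON layout assumed here reflects the commonly documented Megadetector batch-output format, but verify it (and the filename) against the version you're running.

```python
# A minimal sketch for splitting a Megadetector batch-output JSON file into
# "empty" vs. "non-empty" images using a confidence threshold. The assumed
# layout (an "images" list with per-image "detections") should be checked
# against the Megadetector version you're actually using.
import json

CONF_THRESHOLD = 0.2  # tune for your tolerance of false negatives

with open("megadetector_output.json") as f:   # hypothetical output filename
    results = json.load(f)

empty, non_empty = [], []
for image in results["images"]:
    detections = image.get("detections") or []
    if any(d["conf"] >= CONF_THRESHOLD for d in detections):
        non_empty.append(image["file"])
    else:
        empty.append(image["file"])

print(f"{len(empty)} likely-empty images, {len(non_empty)} images to review")
```

Choosing the threshold is a trade-off: a lower value catches more faint or partially obscured animals, but fewer images get filtered out as empty.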

Megadetector is used by a large and growing list of researchers and conservation organizations, and in our experience we've been very impressed by its performance. That said, like all ML models it's not perfect, and how well it is able to detect animals depends on the species, how the cameras were set up, the complexity of the backgrounds, and many other variables.

Classification

Object detection models like Megadetector help you weed out empty images or tell you whether there's an animal in the photo, but if you need finer-grained classification - i.e., a species-level prediction - that's where classifiers come in.

Classifiers take an image as input and return a class as output (technically they return a list of classes with percentages indicating how confident the model is that the image contains each class). When trained on a lot of images of each species you want to classify, taken in an environment very similar to the one you'll ultimately be using the model in, classifiers can get quite good. If you have training data taken by the exact same cameras in the exact same locations the models will be used on, they can become excellent.
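
For a sense of what that "list of classes and percentages" looks like in practice, here is a hedged inference sketch using PyTorch and a ResNet-50 fine-tuned on a hypothetical species list; the weights file, class names, and image path are placeholders for your own trained model.

```python
# A sketch of classifier inference: the model returns a score for every class
# it was trained on, and softmax turns those scores into confidences.
# The weights file, class list, and image path are placeholders.
import torch
from torchvision import models, transforms
from PIL import Image

class_names = ["coyote", "mule_deer", "raccoon", "striped_skunk"]  # example classes

model = models.resnet50()
model.fc = torch.nn.Linear(model.fc.in_features, len(class_names))
model.load_state_dict(torch.load("my_classifier.pt", map_location="cpu"))  # hypothetical weights
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("crop.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1).squeeze(0)

# Print every class the model knows about, ranked by confidence
values, indices = probs.sort(descending=True)
for p, idx in zip(values.tolist(), indices.tolist()):
    print(f"{class_names[idx]}: {p:.2%}")
```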

The challenge is that as soon as the real data starts to look different from the training data the performance starts to suffer: even small changes like pointing a camera in a new direction can significantly impact accuracy (see note on "Why ML is hard with camera trap images" above). If you try to use the model in a totally new environment that may have the same animals but different flora, it will likely struggle even more. The problem is that the nature of camera trap data makes it hard for classifiers to generalize to new, unseen environments, and it's one of the reasons we don't have a single, general-purpose classifier that can predict every species in every environment under the sun.

Instead, if you need species-level classification with a high degree of accuracy, your best bet is to train your own classifier with your own data. Training a classifier is outside of the scope of this document, but once you develop your own training data set (labeling a lot of images), you essentially have two options:

  • manually train the classifier yourself (or with the help of a data scientist) - this typically requires at least intermediate experience with the Python programming language and familiarity with deep learning and convolutional neural net (CNN) concepts, as well as access to a computer with a GPU. If you're interested in learning more about manual classifier training, we recommend checking out Sara Beery's Wildlabs webinar on the topic as a starting point (a minimal fine-tuning sketch also follows this list).

  • use an "Auto ML" service like Google's AutoML or Zendo, which automate a lot of the training process and allow users with minimal technical background to train classifiers. The downside of this approach is that training models is part science, part art, and without a human actively participating in the training, evaluating where the model is struggling, and intervening to improve it, the ceiling for how well an auto ML-trained classifier can perform is generally lower than that of a manually trained one.
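
For the first option, the sketch below shows roughly what a manual fine-tuning (transfer-learning) loop looks like in PyTorch, assuming your labeled crops are organized into one folder per species (e.g., data/train/coyote/*.jpg). It deliberately omits a validation split, data augmentation, and hyperparameter tuning, all of which matter a great deal in practice; treat it as orientation, not a recipe.

```python
# A minimal fine-tuning sketch (transfer learning), assuming labeled crops are
# organized one folder per species, e.g. data/train/coyote/*.jpg (hypothetical path).
# Omits validation, augmentation, and tuning, which matter a lot in practice.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

# Start from ImageNet weights, then replace the final layer
# with one sized to your own species list.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: loss {running_loss / len(loader):.3f}")

torch.save(model.state_dict(), "my_classifier.pt")
```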

One last note on classifiers: classifiers are only capable of making predictions about classes that they have been trained on, and they don't know anything else about the world. If they encounter an animal that was not present in the training data, they won't be able to recognize that it is a novel animal that they've never seen before. In other words, they aren't capable of applying a generic "unknown" label to new species. In data science parlance this is called an “anomaly detection” or “open-set” classification problem and it's difficult to do with CNNs. What this means is that for a model to perform well, you ideally want a comprehensive training dataset that includes samples of all of the animals you might encounter, even if you really only care about classifying a subset of them.
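
A common (and admittedly crude) partial mitigation is to treat low-confidence predictions as "needs review" rather than trusting the top class. This doesn't solve the open-set problem, but it at least keeps many novel or unexpected animals from silently being assigned a wrong label. A minimal sketch, with an illustrative threshold:

```python
# A crude partial mitigation (not a solution to the open-set problem):
# flag any prediction whose top confidence falls below a threshold so a human
# reviews it instead of trusting the label. The threshold here is illustrative.
def label_or_flag(class_names, probs, threshold=0.75):
    """probs: per-class confidences (roughly summing to 1.0)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "needs_review"  # could be a novel or out-of-distribution species
    return class_names[best]

print(label_or_flag(["coyote", "mule_deer"], [0.55, 0.45]))  # -> "needs_review"
print(label_or_flag(["coyote", "mule_deer"], [0.93, 0.07]))  # -> "coyote"
```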

Re-identification (Re-ID)

The Re-ID training process involves sending the model three images at once - all three of which are the same species but two are the same individual - and rewarding it when it guesses correctly which two are the same. In effect, you’re training the model to understand what features can be used to distinguish between individual animals within an entire species. For species with visually distinctive "fingerprints" (zebras, whale sharks) this approach can help you determine if two animals are the same, which can be useful in population modeling.
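
The sketch below illustrates that triplet idea using PyTorch's built-in TripletMarginLoss: the embedding network is nudged so that the anchor image sits closer to the positive (same individual) than to the negative (different individual). The tiny embedding network and the random tensors are placeholders, not a real Re-ID model.

```python
# An illustrative triplet-loss training step: the embedder is pulled toward
# making the anchor and positive (same individual) more similar than the
# anchor and negative (different individual). Placeholder network and data.
import torch
from torch import nn

embedder = nn.Sequential(            # stand-in for a CNN backbone producing embeddings
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 128),
)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(embedder.parameters(), lr=1e-4)

# One training step with dummy 64x64 RGB crops
anchor   = torch.rand(8, 3, 64, 64)  # individual A
positive = torch.rand(8, 3, 64, 64)  # same individual A, different photo
negative = torch.rand(8, 3, 64, 64)  # different individual, same species

loss = triplet_loss(embedder(anchor), embedder(positive), embedder(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"triplet loss: {loss.item():.3f}")
```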
