As part of an effort to detect “deepfake” videos, engineers have developed software that improves a computer’s ability to track an object through a video clip by 11 percent on average.
The software, called BubbleNets, chooses the best frame for a human to annotate. In addition to helping train algorithms for spotting doctored clips, it could improve computer vision in many emerging areas such as driverless cars, drones, surveillance, and home robotics.
“The US government has a real concern about state-sponsored groups manipulating videos and releasing them on social media,” says Brent Griffin, assistant research scientist in electrical and computer engineering at the University of Michigan. “There are way too many videos for analysts to assess, so we need autonomous systems that can detect whether or not a video is authentic.”
Current software for parsing video clips relies on humans to mark up the objects—such as people, animals, and vehicles—in the video. “Video object segmentation” algorithms then follow the boundaries of the marked objects through the videos.
Today’s advanced “deep learning” programs require the human to mark only a single frame. The frame presented to the human is typically the first frame in the video, which is rarely the best choice. But until now, there was no automated way to choose a preferable frame.
When the Defense Advanced Research Projects Agency (DARPA) requested this capability, the research team was skeptical that it was possible at all. The software wouldn’t even know what in the video you were trying to track, so how could it recommend a frame?
But with deep learning techniques, Griffin and senior author Jason Corso, professor of electrical engineering and computer science, didn’t have to figure out how to choose the best annotation frame—the algorithm would do that. Their challenge was creating enough “training” data so that the algorithm could draw its own conclusions from a large set of examples.
Griffin and Corso started with 60 videos in which every frame had been annotated. If they posed the question the obvious way (“Which frame is the best annotation frame in each video?”), they would have only 60 training examples. Instead, they designed their “BubbleNets” software to compare two frames at a time. The software predicts which frame, if selected for a human to annotate, will enable the segmentation software to stay truer to the object’s boundaries. This gave them nearly 745,000 pairs of frames for training the algorithm.
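To see why comparing frames in pairs multiplies the training data, consider the rough Python sketch below. It is an illustration only, not the researchers' code. It assumes each frame already has a hypothetical quality score (how well segmentation would perform if that frame were the annotated one); the names `make_pairs`, `count_pairs`, `select_best`, and `prefer` are invented for this example. A video with n frames yields n(n-1)/2 unordered pairs, so the number of examples grows far faster than the number of videos.

```python
from itertools import combinations


def make_pairs(scores):
    """Turn per-frame scores into pairwise training examples.

    scores[i] is a hypothetical quality score: how well segmentation
    performs when frame i is the one a human annotates (an assumption
    made for this sketch).
    Returns (i, j, label) tuples, where label is 1 if frame i is the
    better annotation frame and 0 if frame j is. Ties are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # a tie gives no preference to learn from
        label = 1 if scores[i] > scores[j] else 0
        pairs.append((i, j, label))
    return pairs


def count_pairs(frame_counts):
    """Count unordered frame pairs across videos: n * (n - 1) / 2 each."""
    return sum(n * (n - 1) // 2 for n in frame_counts)


def select_best(num_frames, prefer):
    """Pick one frame using only pairwise comparisons.

    prefer(a, b) stands in for a trained comparator (hypothetical) and
    returns True when frame a is predicted to be the better choice.
    A single pass keeps the current winner and tests it against each
    remaining frame in turn.
    """
    best = 0
    for candidate in range(1, num_frames):
        if prefer(candidate, best):
            best = candidate
    return best
```

With scores of 0.5, 0.9, and 0.7 for a three-frame clip, `make_pairs` produces three labeled comparisons from one video, and `select_best` with a comparator built on the same scores returns frame 1. In a real system the comparator would be a learned model, since true scores are not available for new videos.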
It is hard to say exactly what BubbleNets looks for in an annotation frame, but testing showed it preferred frames that:
- Weren’t too near the beginning or end of the video.
- Looked most like other frames in the video.
- Showed a clear view of the objects in the video.
BubbleNets is already “a small cog” in DARPA’s multi-university media forensics program, Griffin says. In an effort to identify falsified propaganda videos, DARPA needs to train its own algorithms on manipulated videos. BubbleNets helps other software automatically erase objects from videos to create training data.
But BubbleNets could also be useful in other robotics and computer vision tasks. For instance, future home robots will need to learn the layout and contents of a house. The robot would be able to present its owner with a set of frames that contain unidentified objects.
“Think about a toddler. A toddler sort of knows what they know, and then at some point, they realize they don’t really know something. So they ask a question. And that’s what we want to enable the computer to do,” Corso says.
Computer vision algorithms that have to operate without human input, such as those for driverless cars or drones, could also benefit. In these cases, the software would sift through training video clips looking for things that it didn’t recognize. Then, when it found a problematic clip, BubbleNets would choose the best frame for a human to explain.
Source: University of Michigan