I wanted to share an example of what the AI research community can do with the data you’ve been contributing. Collectors like you have submitted over 400K videos of 37 simple activities that people perform in public places. These are activities like sitting, opening a door, or using a laptop that can be performed in a few seconds by one or more people. The Visual AI research community uses these videos to teach an AI system to recognize when a person performs one of these activities, even if the system has never seen that person before. We use videos from hundreds of different people in over fifty countries to teach an AI system how people from around the world perform these activities, so that we can accurately identify them in new videos.
Here is an example. Take a look at this video:
This is the output of our Visual AI system on a video that was collected by one of our team members. The video camera was set up on a shelf in a home office and the subject was asked to perform simple activities. Our Visual AI system finds the people (and lazy dogs) and draws boxes around them. When the subject performs one of the 37 activities our system knows about, the system outputs a caption over the box along with a confidence score. To identify these activities automatically in new videos, the system was trained for about two weeks on 8 GPUs using the videos you have been submitting. We call this task “activity detection”, and we’re pleased with the performance here.
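For the curious, here is a minimal sketch (not our production code) of how per-frame detections might be turned into the captions you see in the video. It assumes each detection is simply an activity label paired with a confidence score; the 0.5 threshold is illustrative, not the value our system actually uses.

```python
# Hypothetical cutoff: only detections at least this confident get a caption.
CONFIDENCE_THRESHOLD = 0.5

def captions(detections):
    """Format display captions for detections above the threshold.

    `detections` is a list of (label, score) pairs, where score is in [0, 1].
    """
    return [
        f"{label} ({score:.0%})"
        for label, score in detections
        if score >= CONFIDENCE_THRESHOLD
    ]

# Only the confident detection receives a caption.
print(captions([("person opens door", 0.91), ("person sits down", 0.32)]))
# → ['person opens door (91%)']
```

In the real system the box coordinates would be drawn on the frame as well; this sketch only shows the caption-and-threshold step.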
However, our job is not done yet. Contrast that with this video.
This video again shows decent activity detection at the beginning. However, at the end (>1:04) we show that a significant challenge still remains. Visual AI can easily get confused by activities that look very similar to the activities it knows about. For example:
- “Scratching head” is confidently misclassified as “person talks on phone”.
- “Rubbing knuckles” is misclassified as “person texts on phone”.
- “Squatting” is confidently misclassified as “person sits down”.
- “Tapping ground” is confidently misclassified as “person puts down object”.
This suggests that we still need more data: more difficult examples of the activities the system already knows about, and new activities that are closely related to them. This will challenge the AI system to learn a more robust representation of activities, which will reduce these mistakes. That will be the primary goal of the next collection sprint.
All Visual AI systems are built on a foundation of high-quality data. Your contributions are making an impact, and our research collaborators are all super excited to receive more data from you. I look forward to continuing the mission of ethical and privacy-preserving data collection for large-scale visual AI, and to sharing our progress along the way.