Facebook collects video 'in the wild' to help AI develop first-person perspective

As Facebook continues to face scrutiny over a recent whistleblower report, the social media giant on Thursday unveiled a long-term project that will use thousands of hours of first-person video captured “in the wild” to train AI to perceive the world in a more human-like, first-person way.

The Ego4D project by Facebook AI “aims to solve research challenges around egocentric perception: the ability for AI to understand and interact with the world like we do, from a first-person perspective,” a Facebook blog post stated.

As part of the project, the company said in the blog post that it worked with researchers around the world “who collected more than 2,200 hours of first-person video in the wild, featuring over 700 participants going about their daily lives.”

Facebook told CNBC that the parties collecting the video footage, which included members of universities and labs across nine different countries, were told to avoid recording personally identifying details, such as faces and audio of conversations. Data such as license plate numbers was also blurred in the resulting video.

This collection contains 20 times more footage than the amount of “egocentric” data previously available publicly to the research community, the company stated. AI training typically relies on photos and videos captured from a third-person perspective, but the post stated that “next-generation AI will need to learn from videos that show the world from the center of action.”

Facebook’s hope is that training AI to understand the world from a first-person perspective may help enable more immersive experiences for users of devices like augmented reality (AR) glasses, virtual reality (VR) headsets and other products.

The project also involved the creation of five benchmark challenges for developing smarter, more useful AI assistants, including:

  • Episodic memory: What happened when? (e.g., “Where did I leave my keys?”; a toy sketch of this kind of query appears after the list)
  • Forecasting: What am I likely to do next? (e.g., “Wait, you’ve already added salt to this recipe”)
  • Hand and object manipulation: What am I doing? (e.g., “Teach me how to play the drums”)
  • Audio-visual diarization: Who said what when? (e.g., “What was the main topic during class?”)
  • Social interaction: Who is interacting with whom? (e.g., “Help me better hear the person talking to me at this noisy restaurant”)
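
To make the first benchmark concrete, here is a minimal, purely illustrative sketch in Python. It does not use Facebook’s Ego4D data or tooling; it assumes a hypothetical set of first-person clips that have already been annotated with the objects they show, and reduces an “episodic memory” question like “Where did I leave my keys?” to finding the most recent clip in which the object appeared.

```python
# Toy illustration (not Facebook's Ego4D API): an "episodic memory" query over
# hypothetical first-person video annotations.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ClipAnnotation:
    timestamp_s: float    # seconds from the start of the recording
    location: str         # coarse place label, e.g. "kitchen"
    objects: List[str]    # objects assumed to be detected in the clip

def last_seen(clips: List[ClipAnnotation], obj: str) -> Optional[ClipAnnotation]:
    """Return the most recent clip in which `obj` was visible, i.e. an answer
    to an episodic-memory query such as "Where did I leave my keys?"."""
    matches = [c for c in clips if obj in c.objects]
    return max(matches, key=lambda c: c.timestamp_s) if matches else None

if __name__ == "__main__":
    clips = [
        ClipAnnotation(12.0, "hallway", ["keys", "jacket"]),
        ClipAnnotation(95.5, "kitchen", ["keys", "mug"]),
        ClipAnnotation(310.2, "living room", ["phone", "remote"]),
    ]
    hit = last_seen(clips, "keys")
    if hit:
        print(f"Keys last seen in the {hit.location} at t={hit.timestamp_s}s")
```

In practice, of course, a system tackling these benchmarks would have to detect objects and interpret spoken or written questions directly from raw egocentric video, which is exactly the kind of capability the challenges are meant to measure.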
