Adventures in Facial Capture: Using Kinect Data (Part 1)

April 7, 2017

In the middle of last year I started playing around with Microsoft’s Kinect sensor for the Xbox One as a tool for recording motion capture data for the character models for my Robyn HUD game. Recently, I’ve started playing with another aspect of the Kinect sensor: facial animation capture.

The Xbox One Kinect sensor is the second-generation Kinect sensor. It can record several kinds of data, including full body motion capture, facial data, high definition video, infrared data, and audio, all at a brisk 30 frames per second. All of this data is exposed to a programmer through the Kinect 2.0 SDK. The SDK provides two different ways of accessing facial data.

Facial Features

The first method of retrieving face data uses the IFaceFrameReader class provided by the SDK. With this class it’s possible to retrieve basic facial feature information from the Kinect sensor for a tracked face. The information is fairly limited, consisting of:

  • The overall position and orientation of the face in 3D space.
  • The positions of the eyes, nose, and mouth corners of the face.
  • Whether the face is happy, engaged, or looking away.
  • Whether the face is wearing glasses.
  • Whether the left eye, right eye, and mouth are opened or closed.

It’s interesting data, but too coarse to be much use for animating the face of a 3D model. An example of facial feature data capture is provided in the SDK’s Face Basics sample.

High Definition Face Data

The second method of retrieving face data uses the HighDefinitionFaceFrameReader class provided by the SDK. For each captured frame, this class gives access to the positions of the 1,347 vertices that make up the form of a face.

The androgynous, theoretical face provided by the Kinect.

It’s important to understand that the high definition data isn’t a direct scan of the person standing in front of the sensor. The sensor’s readings are first fitted to a generic, androgynous face, and it’s the vertices of that theoretical face that are returned.

For my purposes, the actual topology of the face returned by the Kinect HighDefinitionFaceFrameReader isn’t overly important. My intent is to take the proportional differences in the positions of the vertices from one frame to the next and apply those same proportional differences to the vertices I’m actually using in the faces of my 3D game characters (more on all that in a future blog post).
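
As a rough illustration of that idea, here is a minimal Python sketch. It is only one possible reading of “proportional differences”, not the approach the later post will necessarily take. It assumes a one-to-one correspondence between captured vertices and model vertices, and it scales movements by the ratio of the two faces’ heights. The function names (frame_deltas, face_height, apply_proportional) are hypothetical and are not part of the Kinect SDK.

```python
def frame_deltas(previous, current):
    """Per-vertex (dx, dy, dz) movement between two captured frames."""
    return [
        (cx - px, cy - py, cz - pz)
        for (px, py, pz), (cx, cy, cz) in zip(previous, current)
    ]


def face_height(vertices):
    """Vertical extent of a face, used here as a crude size measure."""
    ys = [y for (_, y, _) in vertices]
    return max(ys) - min(ys)


def apply_proportional(model_vertices, deltas, scale):
    """Move each model vertex by the captured delta, scaled to the model's size.

    Assumes model_vertices[i] corresponds to captured vertex i (an assumption
    made for illustration only).
    """
    return [
        (x + dx * scale, y + dy * scale, z + dz * scale)
        for (x, y, z), (dx, dy, dz) in zip(model_vertices, deltas)
    ]


# Tiny two-vertex "face": the top vertex rises by 0.5 between frames.
previous = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
current = [(0.0, 0.0, 0.0), (0.0, 2.5, 0.0)]

# The model face is twice as tall as the captured face.
model = [(0.0, 0.0, 0.0), (0.0, 4.0, 0.0)]

scale = face_height(model) / face_height(previous)
moved = apply_proportional(model, frame_deltas(previous, current), scale)
```

In this toy case the model is twice the size of the captured face, so a captured movement of 0.5 becomes a movement of 1.0 on the model. A real character mesh would need an explicit mapping between its vertices and the Kinect’s 1,347.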

The first step to using the Kinect high definition data is to get it written to disk, so it can be used as source information for other programs that will apply that data to the specific 3D models I intend to use. One thing of note here: each vertex in the face consists of three 8-byte floating point numbers for its X, Y, and Z location. With 1,347 vertices captured at 30 frames per second, this works out to a raw data rate just shy of 1 megabyte per second (1,347 × 3 × 8 × 30 = 969,840 bytes), so be careful with long recordings.
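
To make the numbers concrete, here is a small Python sketch that works through the size arithmetic and packs one frame into raw bytes. The file layout (little-endian doubles in x, y, z order with no header) is an assumption made for illustration, not a description of the format I actually write. The helper names pack_frame and unpack_frame are hypothetical.

```python
import struct

VERTEX_COUNT = 1347       # vertices in the Kinect HD face
FLOATS_PER_VERTEX = 3     # X, Y, Z
BYTES_PER_FLOAT = 8       # 8-byte floating point numbers
FRAMES_PER_SECOND = 30

BYTES_PER_FRAME = VERTEX_COUNT * FLOATS_PER_VERTEX * BYTES_PER_FLOAT
BYTES_PER_SECOND = BYTES_PER_FRAME * FRAMES_PER_SECOND

# Assumed layout: little-endian doubles, x0 y0 z0 x1 y1 z1 ...
FRAME_FORMAT = "<%dd" % (VERTEX_COUNT * FLOATS_PER_VERTEX)


def pack_frame(vertices):
    """Pack a list of (x, y, z) tuples into the raw bytes for one frame."""
    if len(vertices) != VERTEX_COUNT:
        raise ValueError("expected %d vertices" % VERTEX_COUNT)
    flat = [component for vertex in vertices for component in vertex]
    return struct.pack(FRAME_FORMAT, *flat)


def unpack_frame(data):
    """Inverse of pack_frame: raw bytes back to a list of (x, y, z) tuples."""
    flat = struct.unpack(FRAME_FORMAT, data)
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


# A dummy frame, just to exercise the round trip.
dummy = [(float(i), 0.5, -1.0) for i in range(VERTEX_COUNT)]
raw = pack_frame(dummy)

print(BYTES_PER_FRAME)        # 32328 bytes per frame
print(BYTES_PER_SECOND)       # 969840 bytes per second
print(BYTES_PER_SECOND * 60)  # 58190400 bytes per minute
```

At that rate a one-minute recording comes to roughly 58 megabytes of raw vertex data.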

An example of high definition face data capture is provided in the SDK’s HD Face Basics sample.

That’s it for this week. Watch for a future blog post where I discuss mapping the movements of the theoretical Kinect face to an actual model face.