Documentation
How do I use this?
- Create an instance of the `WebcamHeadTracker` class.
- Call `WebcamHeadTracker::initWebcam()` to initialize the webcam.
- Call `WebcamHeadTracker::initPoseEstimator()` to initialize the head pose estimator.
- While `WebcamHeadTracker::isReady()` returns true:
  - Acquire a new webcam frame with `WebcamHeadTracker::getNewFrame()`.
  - Compute a new head pose with `WebcamHeadTracker::computeHeadPose()`.
  - Get the latest known pose with `WebcamHeadTracker::getHeadPosition()` and `WebcamHeadTracker::getHeadOrientation()`.
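Put together, these steps form a short main loop. The following is only a minimal sketch: the header name, the boolean return values of the init functions, and the out-parameter conventions of `getHeadPosition()` / `getHeadOrientation()` are assumptions here, so check the actual header for the exact signatures.

```cpp
#include <cstdio>
#include "webcam-head-tracker.hpp"  // assumed header name

int main()
{
    WebcamHeadTracker tracker;
    if (!tracker.initWebcam()) {            // assumed to return false on failure
        std::fprintf(stderr, "No usable webcam found\n");
        return 1;
    }
    if (!tracker.initPoseEstimator()) {     // assumed to return false on failure
        std::fprintf(stderr, "Cannot initialize the pose estimator\n");
        return 1;
    }
    while (tracker.isReady()) {
        tracker.getNewFrame();              // acquire the next webcam frame
        tracker.computeHeadPose();          // estimate and filter the head pose
        float pos[3], quat[4];              // assumed out-parameter layout
        tracker.getHeadPosition(pos);       // head position (assumed: x, y, z relative to the camera)
        tracker.getHeadOrientation(quat);   // head orientation (assumed: quaternion)
        std::printf("head at %.3f %.3f %.3f\n", pos[0], pos[1], pos[2]);
    }
    return 0;
}
```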
How does it work?
We mainly use OpenCV and dlib functionality:
- Acquire a webcam video frame with OpenCV
- Detect the main face in it using the OpenCV Haar feature-based cascade classifier (this is much faster than the dlib face detector)
- Find face landmarks using the dlib implementation of One Millisecond Face Alignment
- Extract a subset of face landmarks that fulfill two requirements:
- They are reasonably robust with regard to changes in facial expressions
- We can estimate 3D positions of them on a model of the average human head
- Use the correspondences of the 3D average human head points and the extracted 2D image landmarks to estimate the head pose, using OpenCV solvePnP
- Filter the estimated head pose to remove jitter and predict the pose at a short time step into the future, to make the pose data usable in interactive systems. We offer both a Kalman filter for object tracking and a double exponential smoothing-based prediction filter for this purpose. The latter is the default.
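Pieced together, the detection and pose estimation steps look roughly like the sketch below. This is not the library's actual code: the landmark indices, the 3D model point values, and the intrinsics guess are placeholders chosen for illustration.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/calib3d.hpp>
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <vector>

// frame: the current webcam image (BGR)
void estimatePose(const cv::Mat& frame,
                  cv::CascadeClassifier& faceDetector,       // loaded from a Haar cascade XML file
                  dlib::shape_predictor& landmarkPredictor)  // loaded from a dlib landmark model file
{
    // 1. Detect the main face with the Haar cascade (faster than the dlib detector)
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    std::vector<cv::Rect> faces;
    faceDetector.detectMultiScale(gray, faces);
    if (faces.empty())
        return;

    // 2. Find the dlib face landmarks inside the detected face rectangle
    dlib::cv_image<unsigned char> dlibImg(gray);
    dlib::rectangle faceRect(faces[0].x, faces[0].y,
                             faces[0].x + faces[0].width, faces[0].y + faces[0].height);
    dlib::full_object_detection landmarks = landmarkPredictor(dlibImg, faceRect);

    // 3. Pick a subset of expression-robust landmarks (indices are placeholders)
    //    and pair them with 3D points on an average head model (values are placeholders)
    const int indices[] = { 30, 36, 45, 48, 54 };  // nose tip, eye corners, mouth corners
    std::vector<cv::Point3f> modelPoints = {
        {  0.0f,   0.0f,   0.0f  },   // nose tip
        { -0.045f, 0.035f, -0.03f },  // right eye outer corner
        {  0.045f, 0.035f, -0.03f },  // left eye outer corner
        { -0.03f, -0.04f,  -0.02f },  // right mouth corner
        {  0.03f, -0.04f,  -0.02f },  // left mouth corner
    };
    std::vector<cv::Point2f> imagePoints;
    for (int i : indices)
        imagePoints.push_back(cv::Point2f(
                static_cast<float>(landmarks.part(i).x()),
                static_cast<float>(landmarks.part(i).y())));

    // 4. Crude guess for the camera intrinsics, no distortion
    double f = frame.cols;  // focal length guessed from the image width
    cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) <<
            f, 0, frame.cols / 2.0,
            0, f, frame.rows / 2.0,
            0, 0, 1);
    cv::Mat distCoeffs = cv::Mat::zeros(4, 1, CV_64F);

    // 5. Estimate the head pose from the 2D/3D correspondences
    cv::Mat rvec, tvec;
    cv::solvePnP(modelPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);
    // rvec/tvec now hold the raw (unfiltered) head orientation and position
}
```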
These ideas were borrowed from various sources, including screenReality, eyeLike, gazr, this OpenCV tutorial, and this paper. We ended up using an approach similar to gazr, but faster, independent of ROS, and with better filtering.
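To illustrate the double exponential smoothing prediction mentioned above, here is a minimal predictor for a single scalar (e.g. one pose component). The smoothing factor and prediction step are example values, not the library's tuned parameters.

```cpp
// Minimal double exponential smoothing predictor for one scalar pose component.
class DoubleExpSmoother
{
public:
    explicit DoubleExpSmoother(float alpha = 0.5f) :
        _alpha(alpha), _initialized(false), _s(0.0f), _s2(0.0f) {}

    // Feed a new raw measurement and return a prediction 'tau' filter steps ahead.
    float update(float x, float tau = 1.0f)
    {
        if (!_initialized) {
            _s = _s2 = x;   // start both smoothing stages at the first measurement
            _initialized = true;
        } else {
            _s  = _alpha * x  + (1.0f - _alpha) * _s;   // first smoothing stage
            _s2 = _alpha * _s + (1.0f - _alpha) * _s2;  // second smoothing stage
        }
        // Predict tau steps into the future to compensate for the filter lag
        float k = _alpha * tau / (1.0f - _alpha);
        return (2.0f + k) * _s - (1.0f + k) * _s2;
    }

private:
    float _alpha;       // smoothing factor in (0, 1); higher means less smoothing
    bool _initialized;
    float _s, _s2;      // first and second smoothing stages
};
```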
Limitations
- Both the face detector and the face landmark detector work best for frontal faces; they fail quickly once you tilt your head too far.
- The pose estimation before filtering is very noisy, so extensive filtering is required, which leads to swimming artefacts. With a less noisy pose estimation, we could tune the filter parameters to reduce this effect.
- This is all just an approximation; do not expect the resulting values to come with reasonable error bounds.
- The library uses crude guesses for the camera intrinsic parameters and distortion coefficients. This seems to work surprisingly well most of the time. However, you can also properly calibrate your webcam and use the correct values (see the comments in `webcam-snapshot.cpp`).
- Under bad lighting, the webcam will have trouble delivering reasonably good images, and the detectors/estimators will have trouble with noisy data that is very unlike the data they were trained with.
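If you want to replace the crude intrinsics guesses with calibrated values, the usual route is OpenCV chessboard calibration. The sketch below shows the idea; the chessboard dimensions, square size, and how the resulting values are fed back into the library are assumptions here (see the comments in `webcam-snapshot.cpp` for the intended workflow).

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// snapshots: grayscale webcam images of a 9x6 chessboard (dimensions assumed)
void calibrate(const std::vector<cv::Mat>& snapshots)
{
    const cv::Size boardSize(9, 6);
    const float squareSize = 0.025f;  // 2.5 cm squares (assumed)

    // The known 3D corner layout of the chessboard, reused for every view
    std::vector<cv::Point3f> corners3d;
    for (int y = 0; y < boardSize.height; y++)
        for (int x = 0; x < boardSize.width; x++)
            corners3d.push_back(cv::Point3f(x * squareSize, y * squareSize, 0.0f));

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    for (const cv::Mat& img : snapshots) {
        std::vector<cv::Point2f> corners2d;
        if (cv::findChessboardCorners(img, boardSize, corners2d)) {
            cv::cornerSubPix(img, corners2d, cv::Size(11, 11), cv::Size(-1, -1),
                    cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            objectPoints.push_back(corners3d);
            imagePoints.push_back(corners2d);
        }
    }

    cv::Mat cameraMatrix, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, snapshots[0].size(),
                        cameraMatrix, distCoeffs, rvecs, tvecs);
    // cameraMatrix and distCoeffs now hold the calibrated values to use instead of the guesses
}
```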
Library reference