Documentation
How do I use this?
- Create an instance of the `WebcamHeadTracker` class.
- Call `WebcamHeadTracker::initWebcam()` to initialize the webcam.
- Call `WebcamHeadTracker::initPoseEstimator()` to initialize the head pose estimator.
- While `WebcamHeadTracker::isReady()` returns true:
  - Acquire a new webcam frame with `WebcamHeadTracker::getNewFrame()`.
  - Compute a new head pose with `WebcamHeadTracker::computeHeadPose()`.
  - Get the latest known pose with `WebcamHeadTracker::getHeadPosition()` and `WebcamHeadTracker::getHeadOrientation()`.
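Put together, these steps form a short main loop. The following is only a minimal sketch: the header name, the boolean return values of the init functions, and the out-parameter conventions of `getHeadPosition()` / `getHeadOrientation()` are assumptions here, so check the actual header for the exact signatures.

```cpp
#include <cstdio>
#include "webcam-head-tracker.hpp"  // assumed header name

int main()
{
    WebcamHeadTracker tracker;
    if (!tracker.initWebcam()) {            // assumed to return false on failure
        std::fprintf(stderr, "No usable webcam found\n");
        return 1;
    }
    if (!tracker.initPoseEstimator()) {     // assumed to return false on failure
        std::fprintf(stderr, "Cannot initialize the pose estimator\n");
        return 1;
    }
    while (tracker.isReady()) {
        tracker.getNewFrame();              // acquire the next webcam frame
        tracker.computeHeadPose();          // estimate and filter the head pose
        float pos[3], quat[4];              // assumed out-parameter layout
        tracker.getHeadPosition(pos);       // head position (assumed: x, y, z relative to the camera)
        tracker.getHeadOrientation(quat);   // head orientation (assumed: quaternion)
        std::printf("head at %.3f %.3f %.3f\n", pos[0], pos[1], pos[2]);
    }
    return 0;
}
```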
How does it work?
We mainly use OpenCV and dlib functionality:
- Acquire a webcam video frame with OpenCV
- Detect the main face in it using the OpenCV Haar feature-based cascade classifier (this is much faster than the dlib face detector)
- Find face landmarks using the dlib implementation of One Millisecond Face Alignment
- Extract a subset of face landmarks that fulfill two requirements:
- They are reasonably robust with regard to changes in facial expressions
- We can estimate 3D positions of them on a model of the average human head
- Use the correspondences of the 3D average human head points and the extracted 2D image landmarks to estimate the head pose, using OpenCV solvePnP
- Filter the estimated head pose to remove jitter and predict the pose at a short time step into the future, to make the pose data usable in interactive systems. We offer both a Kalman filter for object tracking and a double exponential smoothing-based prediction filter for this purpose. The latter is the default.
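Pieced together, the detection and pose estimation steps look roughly like the sketch below. This is not the library's actual code: the landmark indices, the 3D model point values, and the intrinsics guess are placeholders chosen for illustration.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/calib3d.hpp>
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <vector>

// frame: the current webcam image (BGR)
void estimatePose(const cv::Mat& frame,
                  cv::CascadeClassifier& faceDetector,       // loaded from a Haar cascade XML file
                  dlib::shape_predictor& landmarkPredictor)  // loaded from a dlib landmark model file
{
    // 1. Detect the main face with the Haar cascade (faster than the dlib detector)
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    std::vector<cv::Rect> faces;
    faceDetector.detectMultiScale(gray, faces);
    if (faces.empty())
        return;

    // 2. Find the dlib face landmarks inside the detected face rectangle
    dlib::cv_image<unsigned char> dlibImg(gray);
    dlib::rectangle faceRect(faces[0].x, faces[0].y,
                             faces[0].x + faces[0].width, faces[0].y + faces[0].height);
    dlib::full_object_detection landmarks = landmarkPredictor(dlibImg, faceRect);

    // 3. Pick a subset of expression-robust landmarks (indices are placeholders)
    //    and pair them with 3D points on an average head model (values are placeholders)
    const int indices[] = { 30, 36, 45, 48, 54 };  // nose tip, eye corners, mouth corners
    std::vector<cv::Point3f> modelPoints = {
        {  0.0f,   0.0f,   0.0f  },   // nose tip
        { -0.045f, 0.035f, -0.03f },  // right eye outer corner
        {  0.045f, 0.035f, -0.03f },  // left eye outer corner
        { -0.03f, -0.04f,  -0.02f },  // right mouth corner
        {  0.03f, -0.04f,  -0.02f },  // left mouth corner
    };
    std::vector<cv::Point2f> imagePoints;
    for (int i : indices)
        imagePoints.push_back(cv::Point2f(
                static_cast<float>(landmarks.part(i).x()),
                static_cast<float>(landmarks.part(i).y())));

    // 4. Crude guess for the camera intrinsics, no distortion
    double f = frame.cols;  // focal length guessed from the image width
    cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) <<
            f, 0, frame.cols / 2.0,
            0, f, frame.rows / 2.0,
            0, 0, 1);
    cv::Mat distCoeffs = cv::Mat::zeros(4, 1, CV_64F);

    // 5. Estimate the head pose from the 2D/3D correspondences
    cv::Mat rvec, tvec;
    cv::solvePnP(modelPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);
    // rvec/tvec now hold the raw (unfiltered) head orientation and position
}
```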
These ideas were borrowed from various sources, including screenReality, eyeLike, gazr, this OpenCV tutorial, and this paper. We ended up using an approach similar to gazr, but faster, independent of ROS, and with better filtering.
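To illustrate the double exponential smoothing prediction mentioned above, here is a minimal predictor for a single scalar (e.g. one pose component). The smoothing factor and prediction step are example values, not the library's tuned parameters.

```cpp
// Minimal double exponential smoothing predictor for one scalar pose component.
class DoubleExpSmoother
{
public:
    explicit DoubleExpSmoother(float alpha = 0.5f) :
        _alpha(alpha), _initialized(false), _s(0.0f), _s2(0.0f) {}

    // Feed a new raw measurement and return a prediction 'tau' filter steps ahead.
    float update(float x, float tau = 1.0f)
    {
        if (!_initialized) {
            _s = _s2 = x;   // start both smoothing stages at the first measurement
            _initialized = true;
        } else {
            _s  = _alpha * x  + (1.0f - _alpha) * _s;   // first smoothing stage
            _s2 = _alpha * _s + (1.0f - _alpha) * _s2;  // second smoothing stage
        }
        // Predict tau steps into the future to compensate for the filter lag
        float k = _alpha * tau / (1.0f - _alpha);
        return (2.0f + k) * _s - (1.0f + k) * _s2;
    }

private:
    float _alpha;       // smoothing factor in (0, 1); higher means less smoothing
    bool _initialized;
    float _s, _s2;      // first and second smoothing stages
};
```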
Limitations
- Both the face detector and the face landmark detector work best for frontal faces; they fail quickly once you tilt your head too far.
- The pose estimation before filtering is very noisy, so extensive filtering is required, which leads to swimming artefacts. With a less noisy pose estimation, we could tune the filter parameters to reduce this effect.
- This is all just an approximation; do not expect the resulting values to come with reasonable error bounds.
- The library uses crude guesses for the camera intrinsic parameters and distortion coefficients. This seems to work surprisingly well most of the time. However, you can also properly calibrate your webcam and use the correct values (see the comments in `webcam-snapshot.cpp`).
- Under bad lighting, the webcam will have trouble delivering reasonably good images, and the detectors/estimators will have trouble with noisy data that is very unlike the data they were trained with.
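If you want to replace the crude intrinsics guesses with calibrated values, the usual route is OpenCV chessboard calibration. The sketch below shows the idea; the chessboard dimensions, square size, and how the resulting values are fed back into the library are assumptions here (see the comments in `webcam-snapshot.cpp` for the intended workflow).

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// snapshots: grayscale webcam images of a 9x6 chessboard (dimensions assumed)
void calibrate(const std::vector<cv::Mat>& snapshots)
{
    const cv::Size boardSize(9, 6);
    const float squareSize = 0.025f;  // 2.5 cm squares (assumed)

    // The known 3D corner layout of the chessboard, reused for every view
    std::vector<cv::Point3f> corners3d;
    for (int y = 0; y < boardSize.height; y++)
        for (int x = 0; x < boardSize.width; x++)
            corners3d.push_back(cv::Point3f(x * squareSize, y * squareSize, 0.0f));

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    for (const cv::Mat& img : snapshots) {
        std::vector<cv::Point2f> corners2d;
        if (cv::findChessboardCorners(img, boardSize, corners2d)) {
            cv::cornerSubPix(img, corners2d, cv::Size(11, 11), cv::Size(-1, -1),
                    cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            objectPoints.push_back(corners3d);
            imagePoints.push_back(corners2d);
        }
    }

    cv::Mat cameraMatrix, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, snapshots[0].size(),
                        cameraMatrix, distCoeffs, rvecs, tvecs);
    // cameraMatrix and distCoeffs now hold the calibrated values to use instead of the guesses
}
```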
Library reference