An investigation in unsupervised 2D To 3D human pose estimation
Hardy, Peter Timothy David (2024) An investigation in unsupervised 2D To 3D human pose estimation. University of Southampton, Doctoral Thesis, 164pp.
Record type: Thesis (Doctoral)
Abstract
3D human pose estimation is a rapidly evolving area of research within computer vision, with numerous real-world applications such as animation, human-robot interaction and security. With 3D pose datasets being expensive to collect, lacking in pose diversity and typically limited to controlled scenes, unsupervised 2D-3D lifting has become a popular branch of 3D human pose estimation that attempts to lift an intermediate 2D pose into 3D without requiring 3D data. However, all 2D-3D lifting approaches simply assume that the 2D pose will be available during inference and that 2D detectors are accurate; no research investigates their accuracy or how 2D and 3D human pose estimation can be linked into a usable end-to-end approach. This research is designed to bridge this gap and investigate the end-to-end flow of unsupervised 2D-3D pose estimation, highlighting and addressing limitations at various points. In this thesis, five contributions are introduced and analysed to accomplish this objective.
Our first contribution was an investigation into the drop in performance of 2D pose detectors in low-resolution scenarios. In particular, we investigated whether super-resolution could be used to improve results. However, we found that while the detection of people within the images improved, their keypoint detection did not. This led to our novel thresholded Mask-RCNN approach, which achieved the highest performance on our low-resolution 2D pose datasets.
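To make the thresholding idea concrete, below is a minimal sketch in Python of how a detection-score threshold can be lowered for low-resolution inputs. It uses torchvision's Keypoint R-CNN as a stand-in detector, and the threshold values are purely illustrative; the thesis's actual thresholded Mask-RCNN configuration is not reproduced here.

import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

def build_low_res_detector(score_thresh=0.05):
    # A lower box_score_thresh keeps low-confidence person detections that a
    # default threshold would discard on small, blurry subjects (value assumed).
    model = keypointrcnn_resnet50_fpn(weights="DEFAULT", box_score_thresh=score_thresh)
    model.eval()
    return model

@torch.no_grad()
def detect_keypoints(model, image, min_score=0.05):
    # image: float tensor of shape (3, H, W) with values in [0, 1]
    output = model([image])[0]
    keep = output["scores"] >= min_score
    return output["keypoints"][keep], output["scores"][keep]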
Our second contribution investigated 2D pose representations within the field of unsupervised 2D-3D human pose estimation (HPE). All prior unsupervised 2D-3D HPE approaches provided the entire 2D kinematic skeleton to a model during training. We argued that this is sub-optimal and disruptive, as long-range correlations are induced between independent 2D keypoints and predicted 3D coordinates during training. With a maximum architecture capacity of six residual blocks, we evaluated the performance of seven lifting models, each of which represented a 2D pose differently during the unsupervised 2D-3D lifting process. Additionally, we showed the correlations induced between 2D keypoints when a full pose is lifted, highlighting the unintuitive correlations learned. Our results showed that the optimal representation of a 2D pose during the lifting stage is two independent segments, the torso and the legs, with no shared features between the two lifting networks. This approach decreased the average error by 20% on the Human3.6M dataset compared to a model with a near-identical parameter count trained on the entire 2D kinematic skeleton.
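As an illustration of the torso/legs split, the Python sketch below lifts the two segments with fully independent networks so that no features are shared between them. The joint indices, layer widths and depth-only output are assumptions for a 16-joint Human3.6M-style skeleton, not the thesis's exact architecture.

import torch
import torch.nn as nn

TORSO_JOINTS = [0, 7, 8, 9, 10, 11, 12, 13, 14, 15]  # pelvis, spine, head, arms (assumed indices)
LEG_JOINTS = [1, 2, 3, 4, 5, 6]                       # hips, knees, ankles (assumed indices)

class SegmentLifter(nn.Module):
    """Predicts a per-joint depth from the 2D coordinates of one body segment."""
    def __init__(self, num_joints, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints),  # one depth value per joint
        )

    def forward(self, pose_2d):  # pose_2d: (B, num_joints, 2)
        return self.net(pose_2d.flatten(1))

torso_lifter = SegmentLifter(len(TORSO_JOINTS))
leg_lifter = SegmentLifter(len(LEG_JOINTS))

def lift(pose_2d):  # pose_2d: (B, 16, 2) -> (B, 16, 3)
    depth = torch.zeros(pose_2d.shape[0], pose_2d.shape[1], device=pose_2d.device)
    depth[:, TORSO_JOINTS] = torso_lifter(pose_2d[:, TORSO_JOINTS])
    depth[:, LEG_JOINTS] = leg_lifter(pose_2d[:, LEG_JOINTS])
    return torch.cat([pose_2d, depth.unsqueeze(-1)], dim=-1)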
Our third contribution was our lifting network LInKs (Lifting Independent Keypoints), a novel unsupervised learning method that recovers 3D human poses from 2D poses obtained from a single image, even when certain forms of occlusion are present. Our approach followed a two-step process: first lifting the occluded 2D pose into the 3D domain, then filling in the occluded parts using the partially reconstructed 3D coordinates. This lift-then-fill approach led to significantly more accurate results than an identical model that completed the pose in 2D space before lifting it into 3D. Additionally, we improved the stability and likelihood estimation of normalising flows through a custom sampling function that replaces the PCA dimensionality reduction used in prior work.
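The ordering matters here, so a minimal Python sketch of the lift-then-fill idea follows: visible 2D keypoints are lifted first, and a separate network then fills the occluded joints from the partial 3D pose. The network shapes, visibility masking and joint count are assumptions for illustration, not the LInKs implementation.

import torch
import torch.nn as nn

NUM_JOINTS = 16

class Lifter(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: 2D coordinates plus a visibility flag per joint.
        self.net = nn.Sequential(
            nn.Linear(NUM_JOINTS * 3, 1024), nn.ReLU(),
            nn.Linear(1024, NUM_JOINTS),  # per-joint depth
        )

    def forward(self, pose_2d, visible):  # (B, J, 2), (B, J)
        x = torch.cat([pose_2d, visible.unsqueeze(-1)], dim=-1)
        depth = self.net(x.flatten(1))
        return torch.cat([pose_2d, depth.unsqueeze(-1)], dim=-1)  # (B, J, 3)

class Filler(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_JOINTS * 4, 1024), nn.ReLU(),
            nn.Linear(1024, NUM_JOINTS * 3),  # complete 3D pose
        )

    def forward(self, partial_3d, visible):  # (B, J, 3), (B, J)
        x = torch.cat([partial_3d, visible.unsqueeze(-1)], dim=-1)
        return self.net(x.flatten(1)).view(-1, NUM_JOINTS, 3)

def lift_then_fill(lifter, filler, pose_2d, visible):
    masked_2d = pose_2d * visible.unsqueeze(-1)   # zero out occluded keypoints
    partial_3d = lifter(masked_2d, visible)       # step 1: lift the visible pose
    filled_3d = filler(partial_3d * visible.unsqueeze(-1), visible)  # step 2: fill occlusions
    # Keep the lifted coordinates wherever the joint was actually visible.
    return torch.where(visible.unsqueeze(-1).bool(), partial_3d, filled_3d)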
Our fourth contribution is one of the first 3D multi-person human pose estimation systems that works in real time and can handle basic forms of occlusion. First, we combined our previous lifting network LInKs with an off-the-shelf 2D detector, used together with a 360° panoramic camera and three mmWave radar sensors. We then introduced an improved method for matching people between the image and the radar space. This system addressed both the depth and scale ambiguity problems by employing a lightweight 2D-3D pose lifting algorithm that runs in real time while remaining accurate in both indoor and outdoor environments, offering an affordable and scalable solution. Notably, the time complexity remains nearly constant irrespective of the number of detected individuals, achieving a frame rate of approximately 7-8 fps on a laptop with a commercial-grade GPU.
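Since the abstract only summarises the image-radar association step, here is a hypothetical Python sketch of one way to match panoramic detections to mmWave radar targets by azimuth using the Hungarian algorithm. The equirectangular azimuth mapping and the cost function are assumptions rather than the system's documented procedure.

import numpy as np
from scipy.optimize import linear_sum_assignment

def bbox_azimuth(bbox_center_x, image_width):
    # Equirectangular panorama: horizontal pixel position maps linearly to
    # azimuth in [-pi, pi).
    return (bbox_center_x / image_width) * 2.0 * np.pi - np.pi

def match_detections_to_radar(bbox_centers_x, image_width, radar_azimuths):
    """Returns (detection_index, radar_index) pairs minimising angular distance."""
    det_az = np.array([bbox_azimuth(x, image_width) for x in bbox_centers_x])
    rad_az = np.asarray(radar_azimuths)
    # Wrapped angular difference used as the assignment cost.
    diff = det_az[:, None] - rad_az[None, :]
    cost = np.abs((diff + np.pi) % (2.0 * np.pi) - np.pi)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Example: three people in a 3840-pixel-wide panorama, three radar targets.
pairs = match_detections_to_radar([400.0, 1900.0, 3500.0], 3840, [-2.5, 0.1, 2.4])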
Our final contribution tackled the fact that unsupervised 2D-3D human pose estimation (HPE) methods do not work in multi-person scenarios due to perspective ambiguity in monocular images. We therefore presented one of the first studies investigating the feasibility of unsupervised multi-person 2D-3D HPE from 2D poses alone, focusing on reconstructing human interactions. To address perspective ambiguity, we expanded upon prior work by predicting the camera's elevation angle relative to each subject's pelvis. This allowed us to rotate the predicted poses to be level with the ground plane while obtaining an estimate of the vertical offset in 3D between individuals. Our method involved independently lifting each subject's 2D pose to 3D before combining them in a shared 3D coordinate system, where the poses were rotated and offset by the predicted elevation angle before being scaled. This alone enabled us to retrieve an accurate 3D reconstruction of their poses. We presented our results on the CHI3D dataset, introducing its use within unsupervised 2D-3D pose estimation alongside three new quantitative metrics, and establishing a benchmark for future research.
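For the geometric step, the Python sketch below illustrates rotating camera-frame poses by a predicted elevation angle so that subjects sit level with the ground plane and their vertical offsets can be read off in a shared frame. The axis conventions, the assumption that each pose retains its camera-frame translation, and the pelvis index are all illustrative, not the thesis's exact formulation.

import numpy as np

def rotation_about_x(angle_rad):
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def level_with_ground(pose_3d, elevation_rad):
    """pose_3d: (J, 3) pose in camera coordinates, translation included (assumed)."""
    return pose_3d @ rotation_about_x(elevation_rad).T

def shared_scene(poses_3d, elevations_rad):
    # Rotate every subject's pose into a gravity-aligned frame, then compare
    # pelvis heights (joint 0 is assumed to be the pelvis, y-up convention).
    levelled = [level_with_ground(p, e) for p, e in zip(poses_3d, elevations_rad)]
    pelvis_heights = np.array([p[0, 1] for p in levelled])
    vertical_offsets = pelvis_heights - pelvis_heights[0]
    return levelled, vertical_offsets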
Text: An investigation in unsupervised 2D to 3D human pose estimation - Version of Record
Text: Final-thesis-submission-Examination-Mr-Peter-Hardy - Restricted to Repository staff only
More information
Published date: 13 August 2024
Identifiers
Local EPrints ID: 492791
URI: http://eprints.soton.ac.uk/id/eprint/492791
PURE UUID: 41f953ae-2b1c-44c3-b671-ad3b4ca1e642
Catalogue record
Date deposited: 14 Aug 2024 16:35
Last modified: 15 Aug 2024 02:14
Contributors
Author: Peter Timothy David Hardy
Thesis advisor: Hansung Kim
Thesis advisor: Srinandan Dasmahapatra