Integration of a skier-specific keypoint detection model in a hybrid 3D motion capture pipeline
Abstract
Introduction & Purpose
Alpine skiing, like many outdoor sports, presents significant challenges for motion capture due to its large capture volumes, high athlete speeds, variable environmental conditions, and occlusions, e.g., due to snow spray. While traditional marker-based motion capture systems offer the highest precision in the lab, they are usually unsuitable for outdoor settings. Sensor-based methods, such as inertial measurement units, may suffer from inaccuracies due to sensor noise and drift and provide only relative segment positions (Fasel et al., 2018). Therefore, recent studies in alpine skiing have preferably used video-based systems (Heinrich et al., 2023; Spörri, 2016). These methods rely on multi-camera setups that require synchronization and camera calibration. However, the extensive manual digitization required for both keypoints and reference points introduces a substantial workload in post-processing, particularly when cameras must pan, tilt, and zoom to cover large capture volumes (Spörri, 2016).
Recent advancements in computer vision have unveiled great potential for human motion capture, especially in automating much of the manual work required for video-based systems (Fang et al., 2017; Redmon et al., 2016; Zwölfer et al., 2023a). We therefore developed a novel, hybrid 3D motion capture approach that automates the detection of reference points using a reference point detection algorithm and the digitization of keypoints using a skier-specific keypoint detection algorithm. This reduces the reliance on manual digitization and enhances the scalability and practicality of large-scale motion capture in outdoor environments. The aim of this study was to a) evaluate the performance of the skier-specific keypoint detection against manual digitization and b) determine the impact of the skier-specific keypoint detection on the overall performance of the hybrid 3D motion capture pipeline.
Methods
The experimental setup involved a multi-camera system comprising eight Sony AX53 cameras uniformly distributed around a capture volume on a ski slope in Zürs am Arlberg, AT. The capture volume measured approximately 250 x 80 x 20 meters. To calibrate the cameras, approximately 300 reference points were placed within this area and surveyed geodetically. Each reference point was equipped with a 9 x 9 cm cube that displayed ArUco markers on all sides, enabling automated detection by our reference point detection algorithm. This arrangement allowed for continuous calibration of each camera in every frame, accommodating panning, tilting, and zooming. Camera calibration and 3D reconstruction were performed using the Direct Linear Transformation (DLT) method (Abdel-Aziz & Karara, 1971). In total, ten state-certified Austrian ski instructors performed eight runs according to the progression levels of the Austrian ski curriculum (Österreichischer Skischulverband, 2018).
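For illustration, the DLT steps named above can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation: it assumes the standard 11-parameter DLT formulation without lens distortion terms, solved by ordinary least squares with NumPy. The function names `dlt_calibrate`, `dlt_project`, and `dlt_reconstruct` are hypothetical.

```python
import numpy as np

def dlt_calibrate(xyz, uv):
    """Estimate the 11 DLT parameters of one camera for one frame.

    xyz: (n, 3) surveyed reference point coordinates, n >= 6, not coplanar.
    uv:  (n, 2) image coordinates of the same points.
    """
    xyz = np.asarray(xyz, dtype=float)
    uv = np.asarray(uv, dtype=float)
    n = len(xyz)
    homog = np.hstack([xyz, np.ones((n, 1))])
    A = np.zeros((2 * n, 11))
    A[0::2, 0:4] = homog                  # rows for the u equations
    A[0::2, 8:11] = -uv[:, [0]] * xyz
    A[1::2, 4:8] = homog                  # rows for the v equations
    A[1::2, 8:11] = -uv[:, [1]] * xyz
    b = uv.reshape(-1)                    # interleaved u0, v0, u1, v1, ...
    L, *_ = np.linalg.lstsq(A, b, rcond=None)
    return L

def dlt_project(L, xyz):
    """Project 3D points into the image using DLT parameters L."""
    xyz = np.asarray(xyz, dtype=float)
    den = xyz @ L[8:11] + 1.0
    u = (xyz @ L[0:3] + L[3]) / den
    v = (xyz @ L[4:7] + L[7]) / den
    return np.stack([u, v], axis=-1)

def dlt_reconstruct(Ls, uvs):
    """Reconstruct one 3D point from its image position in >= 2 cameras."""
    A, b = [], []
    for L, (u, v) in zip(Ls, uvs):
        A.append(L[0:3] - u * L[8:11])
        b.append(u - L[3])
        A.append(L[4:7] - v * L[8:11])
        b.append(v - L[7])
    point, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return point
```

In a setup with panning, tilting, and zooming cameras, `dlt_calibrate` would be called once per camera and frame, using the reference points detected in that frame, before `dlt_reconstruct` combines the keypoint detections from all cameras.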
To develop a keypoint detection model capable of detecting a skier, including equipment, e.g., skis and poles, we fine-tuned AlphaPose’s HALPE26 model (Fang et al., 2017), which was designed to estimate 26 body keypoints for general motions, on a skier-specific dataset. Training was done for 200 epochs with a learning rate of 10⁻³ and data augmentation enabled. AlphaPose was chosen for its proven performance in alpine skiing scenarios (Zwölfer et al., 2023a). For the skier-specific dataset, we manually digitized six runs, marking 24 keypoints, including 18 body keypoints, ski tips, ski tails, and poles, in each image. The six runs were selected to include slow, medium, and high-speed skiing of one male and one female subject. These digitized images complement the datasets built by Bachmann et al. (2019) and later by Zwölfer et al. (2023b) and Heinrich et al. (2023), all using the same set of keypoints. In total, our comprehensive skier-specific dataset comprised about 15,000 images, with approximately two-thirds of all images originating from the current measurement.
The accuracy of our model was evaluated in 2D image space by calculating the mean per joint position error (MPJPE), percentage of correct keypoints (PCK), and mean average precision (mAP) metrics on a test set of about 2,000 images that were excluded from training. In addition, we determined the impact of the keypoint detection algorithm on the hybrid motion capture pipeline. To this end, we processed one of the manually digitized runs with the skier-specific model as well as the HALPE26 model. Using the calibration matrices, we reconstructed the 3D motion of the skier for each method and calculated the mean lengths of eight body segments (upper arms, forearms, thighs, and shanks). We compared the measured physical lengths with the mean segment lengths reconstructed from manually digitized keypoints, keypoints detected by the HALPE26 model, and keypoints detected by our skier-specific model. We also quantified the variation of segment lengths by calculating their mean standard deviation across all 250 frames of the run. Results on segment lengths (mean values and variations) were calculated without any smoothing or filtering.
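The evaluation measures described above can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the study. The function names are hypothetical. The PCK threshold is passed in directly because the normalization used in the study is not stated here, and the sample standard deviation (`ddof=1`) is an assumption. The mAP metric is omitted because it depends on per-keypoint similarity constants that are not given.

```python
import numpy as np

def mpjpe(pred, ref):
    """Mean per joint position error.

    pred, ref: arrays of shape (frames, keypoints, dims), same units.
    """
    return float(np.mean(np.linalg.norm(pred - ref, axis=-1)))

def pck(pred, ref, threshold):
    """Fraction of keypoints lying within `threshold` of the reference.

    `threshold` is in the same units as the inputs (assumed, see lead-in).
    """
    dist = np.linalg.norm(pred - ref, axis=-1)
    return float(np.mean(dist <= threshold))

def segment_length_stats(joints_3d, segments):
    """Mean and standard deviation of segment lengths across frames.

    joints_3d: (frames, keypoints, 3) reconstructed 3D keypoints.
    segments:  dict mapping a segment name to (proximal, distal) indices.
    """
    stats = {}
    for name, (i, j) in segments.items():
        lengths = np.linalg.norm(joints_3d[:, i] - joints_3d[:, j], axis=-1)
        stats[name] = (float(lengths.mean()), float(lengths.std(ddof=1)))
    return stats
```

Averaging the per-segment standard deviations returned by `segment_length_stats` over the eight segments would give a single variation value per keypoint detection method, comparable in form to the values reported in the Results.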
Results
Our skier-specific keypoint detection model achieved a PCK of 98%, an mAP of 0.97, and an MPJPE of 10.32 pixels on the test set. Visual assessment of the detected keypoints supported these quantitative results, showing only a few flawed detections. Most inaccuracies involved ski tails or poles and were primarily due to occlusions. Representative images showcasing the model's performance are displayed in Figure 1 (left). The plausibility of the skier-specific model for 3D reconstruction is demonstrated in the 3D visualization of a sample run shown in Figure 1 (right). Differences between measured and reconstructed segment lengths (mean across all frames) ranged from a minimum of 0.2 cm to a maximum of 2.3 cm, with only small differences observed among the different keypoint detection methods investigated. The evaluation of the variations in segment lengths revealed a mean standard deviation of 4.6 cm for manually annotated frames, compared to 4.5 cm for frames processed by the HALPE26 model. The mean standard deviation for frames processed by our skier-specific model was reduced to 3.4 cm.
Discussion
Our results demonstrated that the accuracy and precision of our model are at least on par with manual digitization, as evidenced both visually and through quantitative evaluation. On the 2D detection level, the PCK, MPJPE, and mAP metrics reflected the model’s high performance, aligning with previous studies (Bachmann et al., 2019; Zwölfer et al., 2023a), which reported comparable MPJPE values when applying keypoint detection to regular skiing scenarios.
On the 3D reconstruction level, differences between measured and reconstructed segment lengths for all three keypoint detection methods were within the range of typical measurement errors of about 2 cm. However, on this single run, our new model outperformed manual digitization and plain AlphaPose detections in terms of segment length variation. This was especially surprising, as our 3D reconstruction did not enforce temporal smoothness or kinematic constraints such as segment length consistency. This suggests that our model may benefit synergistically from both the pretrained data and manual annotations, possibly averaging out the low precision in manual digitization. While these variations in segment lengths may appear large, no smoothing or filtering was applied. For biomechanical analysis, results can be significantly improved by smoothing the data, e.g., using splines.
By integrating our skier-specific keypoint detection model into our hybrid motion capture pipeline, we reduced the manual work required for digitizing all frames in all perspectives from dozens of hours for a single run to just a few minutes of computing time.
It is important to note that this evaluation was limited to a single run. Moreover, the accuracy of the 3D data relies heavily on accurate camera calibration and temporal synchronization. Therefore, ablation studies to evaluate the influence of the reference point detection algorithm and the temporal synchronization method will be conducted in a future study.
Conclusion
We implemented a skier-specific keypoint detection model capable of detecting a skier, including skis and poles, which showed good performance in both 2D image space and 3D reconstruction. By eliminating the manual digitization workload, the hybrid 3D motion capture pipeline facilitates large-scale motion capture in similar outdoor settings and enhances the scalability of biomechanical research in outdoor sports like alpine skiing. Additionally, this method allows us to automatically digitize the remaining runs recorded during this field study, resulting in an extensive 3D dataset crucial for the future development of fully computer vision-based motion capture methods.
References
Abdel-Aziz, Y. I., & Karara, H. M. (1971). Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogrammetric Engineering and Remote Sensing, 38(1), 49-55.
Bachmann, R., Spörri, J., Fua, P., & Rhodin, H. (2019). Motion capture from pan-tilt cameras with unknown orientation. arXiv, 1908.11676. https://doi.org/10.48550/arXiv.1908.11676
Fang, H. S., Xie, S., Tai, Y. W., & Lu, C. (2017). RMPE: Regional multi-person pose estimation. 2017 IEEE International Conference on Computer Vision (ICCV), 2353-2362. https://doi.ieeecomputersociety.org/10.1109/ICCV.2017.256
Fasel, B., Spörri, J., Chardonnens, J., Kröll, J., Müller, E., & Aminian, K. (2018). Joint inertial sensor orientation drift reduction for highly dynamic movements. IEEE Journal of Biomedical and Health Informatics, 22(1), 77-86. https://doi.org/10.1109/JBHI.2017.2659758
Heinrich, D., van den Bogert, A., Mössner, M., & Nachbauer, W. (2023). Model-based estimation of muscle and ACL forces during turning maneuvers in alpine skiing. Scientific Reports, 13, Article 9026. https://doi.org/10.1038/s41598-023-35775-4
Österreichischer Skischulverband. (2018). Snowsport Austria - Die Österreichische Skischule - Vom Einstieg zur Perfektion. In vier Stufen zum Erfolg (2nd ed.) [Snowsport Austria - The Austrian Ski School - From entry to perfection. Four steps to success]. Brüder Hollinek.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779-788. https://doi.ieeecomputersociety.org/10.1109/CVPR.2016.91
Spörri, J. (2016). Research dedicated to sports injury prevention – the ‘sequence of prevention’ on the example of alpine ski racing. University of Salzburg, Austria. http://dx.doi.org/10.13140/RG.2.2.28451.89126
Zwölfer, M., Heinrich, D., Wandt, B., Rhodin, H., Spörri, J., & Nachbauer, W. (2023a). Deep learning-based 2D keypoint detection in alpine skiing – A performance analysis of state-of-the-art algorithms applied to regular skiing and injury situations. JSAMS Plus, 2, Article 100034. https://doi.org/10.1016/j.jsampl.2023.100034
Zwölfer, M., Heinrich, D., Wandt, B., Rhodin, H., Spörri, J., & Nachbauer, W. (2023b). A graph-based approach can improve keypoint detection of complex poses: a proof-of-concept on injury occurrences in alpine ski racing. Scientific Reports, 13, Article 21465. https://doi.org/10.1038/s41598-023-47875-2
License
Copyright (c) 2024 Michael Zwölfer, Martin Mössner, Helge Rhodin, Werner Nachbauer, Dieter Heinrich
This work is licensed under a Creative Commons Attribution 4.0 International License.