I am working on fine-tuning PoseClassificationNet within the Pose Classification pipeline, and I need guidance on handling multi-person scenarios in video clips during dataset preparation.
Current Workflow:
For single-person action videos, my data processing steps are as follows:
- Extract clips from videos with diverse poses and viewpoints.
- Run the BodyPose3D model to generate JSON metadata.
- Convert 3D points to 2D keypoints.
- Convert JSON metadata to NumPy arrays (per video).
- Save `.pkl` files containing each video's keypoints and the corresponding action label (a minimal sketch of this step follows the list).
- Merge the arrays and split them into Train, Validation, and Test sets.
Concern:
For videos containing two or more persons performing the same action, I would appreciate clarification on the following:
- How should I handle the keypoints of each unique person in a video?
- What should the NumPy array format look like to support multiple persons? (A candidate layout I am considering is sketched after this list.)
- Should I create one combined `.npy` file per video containing all persons, or separate `.npy` files per person?
- How do I assign the correct action label if there are multiple people in a single clip?
- What is the best practice for splitting Train/Val/Test when multiple persons are present in one video?
- Are there any NVIDIA-recommended guidelines for handling multi-person action clips in the PoseClassificationNet dataset pipeline?
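To make the array-format question concrete, here is a rough sketch of one multi-person layout I am considering. It assumes an ST-GCN-style `(C, T, V, M)` tensor per clip, with `M` as the person axis; whether PoseClassificationNet actually expects this shape for `M > 1` is part of what I am asking. The helper name `stack_persons` and the zero-padding of missing persons are my own choices, not anything from the documentation.

```python
# Rough sketch of a candidate multi-person layout: stack per-person
# (T, V, C) keypoint arrays into one (C, T, V, M) clip array, zero-padding
# when fewer than max_persons people are tracked. Purely illustrative.
import numpy as np

def stack_persons(person_arrays, max_persons=2):
    """Combine per-person (T, V, C) arrays into one (C, T, V, M) array.

    person_arrays: list of (T, V, C) keypoint arrays, one per tracked person,
    all sharing the same frame count T.
    """
    T, V, C = person_arrays[0].shape
    out = np.zeros((C, T, V, max_persons), dtype=np.float32)
    for m, arr in enumerate(person_arrays[:max_persons]):
        out[..., m] = arr.transpose(2, 0, 1)  # (T, V, C) -> (C, T, V)
    return out
```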
Additional Context:
- I am currently following the dataset preparation documentation designed for single-person videos but would like to scale this to handle multi-person cases while preserving the action context.
- If there are any reference implementations, sample datasets, or scripts for multi-person handling in PoseClassificationNet, please do share.