PoseNormNet: Identity-Preserved Posture Normalization of 3-D Body Scans in Arbitrary Postures

Three-dimensional (3-D) human models accurately represent the shape of the subjects, which is key to many human-centric industrial applications, including fashion design, body biometrics extraction, and computer animation. These tasks usually require a high-fidelity human body mesh in a canonical posture (e.g., “A” pose or “T” pose). Although 3-D scanning technology is fast and popular for acquiring the subject's body shape, automatically normalizing the posture of scanned bodies is still under-researched. Existing methods highly rely on skeleton-driven animation technologies. However, these methods require carefully designed skeleton and skin weights, which is time-consuming and fails when the initial posture is complicated. In this article, a novel deep learning-based approach, dubbed PoseNormNet, is proposed to automatically normalize the postures of scanned bodies. The proposed algorithm provides strong operability since it does not require any rigging priors and works well for subjects in arbitrary postures. Extensive experimental results on both synthetic and real-world datasets demonstrate that the proposed method achieves state-of-the-art performance in both objective and subjective terms.


I. INTRODUCTION
ACCURATE human body models in canonical postures are necessary for many human-centric industrial applications. For instance, when designing high-end clothes, virtual mannequins representing the physical shapes of subjects are valuable tools for the fashion designer. Such virtual mannequins must be symmetric, as garments are usually symmetric [1]. When creating character animations, animators must insert a skeleton into the body model and paint proper skin weights on the body surface. In computer graphics, such an operation is called rigging. Once characters are rigged, dynamic characters are obtained by importing and applying motion data. However, rigging requires a body model in a canonical posture ("A" pose or "T" pose) and will fail if the character has a complicated posture [2]. Although human body shapes can be acquired in a few seconds by leveraging 3-D scanning technology [3], poor postures will result in bad results for the above applications.

Ran Zhao, Xinxin Dai, and Adrian Munteanu are with the Department of Electronics and Informatics, Vrije Universiteit Brussel, 1050 Brussels, Belgium (e-mail: Ran.Zhao@vub.be; xinxin.dai@vub.be; acmuntea@etrovub.be). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TII.2023.3245682. Digital Object Identifier 10.1109/TII.2023.3245682
Early studies have analyzed the effect of posture on body biometrics extraction [4], but how to normalize the posture of the scanned body is still an open problem. In practice, subjects are usually asked to keep the "A" pose or "T" pose for a few seconds to minutes during the 3-D scanning process. Holding such poses reduces self-occlusions and offers a proper posture for downstream applications. Although this solution works for most people, it is often difficult for patients, elderly people, or children to do so. Consequently, additional scanning sessions are required if the scanned body has an imperfect posture.
Statistical shape modeling is highly popular in 3-D anthropometric analysis for determining the variability of body shapes in specific populations [5]. Posture normalization is of critical importance in statistical shape modeling, as it imposes an automatic adjustment of the posture of the scanned bodies to a canonical body pose.
Existing methods for posture normalization can be classified into two main categories: skeleton-driven methods and parametric body-based methods. As their names indicate, the former require a well-defined skeleton as input, while the latter need a parametric body model. Once the surface of the body model has been attached to the bones of the skeleton, its posture can be easily adjusted by controlling the joint angles in a hierarchical structure. This kind of method highly relies on the quality of the rigging, namely the accuracy of the created skeleton and skin weights. Although a few automatic rigging methods have been proposed [2], [6], obtaining high-quality rigging still requires manual intervention, which is expensive and time-consuming. Moreover, neither automatic nor manual rigging works if the posture of the scanned body is complicated. In addition, optimizing the joint angles to adjust the posture to an "A" pose or "T" pose is yet another challenge.
In the second category, parametric body-based methods usually fit parametric body models to the scanned body in order to regress the shape and pose parameters. Next, the fitted body in a perfect canonical posture can be obtained by simply resetting its pose parameter [3]. Compared to skeleton-driven methods, parametric body-based methods avoid the challenge of optimizing the joint angles and are more robust to posture variations of the scanned body. However, identity details such as the face, hair, and skin wrinkles are missing when using parametric body-based methods. Furthermore, the process of replacing the input human body with a human template introduces inevitable errors and ignores the personal characteristics of the input human body.
In this work, we propose a novel method, termed PoseNormNet, for normalizing the postures of scanned 3-D human bodies. PoseNormNet takes raw point clouds of scanned bodies (or the vertices of body meshes) as input and outputs identity-preserved body models in the "T" pose. Compared to existing methods, our method has the following advantages: 1) it is fully automatic; 2) it works well for arbitrary postures; 3) it does not need rigging; and 4) it preserves the identity (e.g., faces and skin wrinkles) of the subject. The main contributions of this article include the following.
1) We propose, to the best of our knowledge, the first deep learning-based approach that normalizes the posture of scanned bodies while preserving their raw details, such as the face, hair, and skin wrinkles.
2) We propose a novel formulation for the task of posture normalization, inspired by the nonrigid point cloud registration problem.
3) We give a theoretical analysis of the error bounds for the proposed PointNet-based encoder-decoder framework.
4) Extensive experiments based on real-world as well as synthetic data demonstrate the efficiency and robustness of the proposed method compared to the state of the art.

II. RELATED WORK

A. Posture Normalization
Although it is convenient to obtain a static 3-D point cloud of the human body based on 3-D scanning technologies, it remains very difficult to automatically optimize the posture of the scanned body toward a desired posture, especially when dealing with complicated postures. As aforementioned, existing posture normalization methods can be mainly classified into two categories according to the type of output: skeleton-driven methods, which maintain the identity details of the scanned subject, and parametric body model-based methods, which do not preserve identity information.
The key point of skeleton-driven posture normalization is rigging. Once a scanned body is given a proper skeleton and skin weights, its posture can be controlled by changing the joint angles. The linear blend skinning (LBS)-based method [7] is the most popular for automatic rigging due to its simplicity and feasibility. However, when the joint angles are large or when a bone undergoes a twisting motion, LBS suffers from the "bow-tie" or "candy wrapper" artifacts. Thanks to work on human templates [6], transferring high-quality rigging from a well-aligned human template, such as SMPL [8], effectively avoids the joint problems caused by LBS. However, the alignment itself is not a simple task. In addition, all skeleton-driven posture normalization methods face the same challenge: they may not work for complex postures. Moreover, accurate rigging still requires manual effort, which is expensive and time-consuming.
Parametric body model-based methods are popular for applications in which the facial characteristics and other details of the person are less important. The main idea followed by these techniques is to fit a parametric body model to the input body and to reset the pose parameters in order to normalize the posture. Thanks to the development of human body templates, SMPL [8] is widely used to align the reconstructed input point cloud and to obtain a fitted template similar to the actual input human body; examples of such methods include the implicit part network (IP-Net) [9] and the impaired-to-high-fidelity human body network (I2H) [10]. IP-Net [9] first reconstructs the body surface via an implicit function, then fits the SMPL model to the predicted surface by optimizing the SMPL shape, pose, and translation parameters, followed by posture normalization obtained by setting the pose parameters to the origin. I2H [10] directly regresses the vertices of the SMPL body in T-pose.
Posture normalization based on parametric body models is efficient, but the person's identity is lost. In addition, given the limited resolution of the human body template, the result is too smooth and fails to capture specific characteristics of the scanned subject. Another interesting example, proposed in [5], is based on statistical shape models [11]: it uses the average body shape and average pose of the input human bodies as a standard and calculates the translation of each vertex from the different poses to the average pose to achieve posture normalization. However, this technique is only applicable to scanned bodies in near-standard poses.

B. Nonrigid Point Cloud Registration
Nonrigid point cloud registration aims at finding a spatial transformation (e.g., scaling, rotation, and translation) that deforms the source point cloud to be the same as the target point cloud. Inspired by nonrigid registration, we propose a novel formulation to address the posture normalization problem, whereby the input body scan is transformed in order to produce a normalized output point cloud. We note that pose normalization is not a registration problem due to the lack of a target point cloud, which is exactly what needs to be determined by the pose normalization process.
This section briefly reviews works on nonrigid point cloud registration. Traditional nonrigid point cloud registration methods are explainable and accurate under certain conditions. One popular solution is iterative closest point (ICP)-based nonrigid registration [12], extending techniques from rigid registration [13]. Rooted in the ICP family, thin plate spline (TPS)-based nonrigid registration [14] is a popular traditional parametric method that replaces the binary correspondence condition of ICP with soft assignment. Statistical model-based algorithms include popular nonparametric methods, such as the coherent point drift (CPD) algorithm [15] based on the Gaussian mixture model (GMM), as well as the Bayesian coherent point drift (BCPD) [16] and BCPD++ [17] methods, which formulate CPD in a Bayesian setting. A large amount of effort has been put into improving the computational efficiency and performance of traditional methods, resulting in algorithms that combine ICP with statistical models [18] and algorithms that combine TPS with statistical models [19]. Nevertheless, traditional methods are still time-consuming, and many of them are sensitive to the initial values of the parameters.
Compared to traditional methods, deep learning-based point cloud registration methods have recently attracted more research attention since they execute faster and operate directly on point clouds [20], [21], [22]. One main challenge for this family of methods is the conflict between the sparse, disordered, and irregular nature of point clouds and the regular input format required by standard deep neural networks.
To solve this problem, some researchers change the point cloud format. For example, PR-Net [23] used point cloud voxelization to combine disordered point clouds with a convolutional neural network. On the other hand, PointNet-based [24] methods change the network structure, making it possible to input the original point cloud directly. For example, CPD-Net [25] used the PointNet-based structure directly, and RMA-Net [26] combined a GRU-based [27] framework, a multiview loss, and a PointNet-based structure. However, voxel-based methods are not suitable for high-resolution tasks due to their large memory requirements, while multiview rendering loses 3-D information when performing the 3-D to 2-D conversion. Although the existing PointNet-based networks can obtain features directly from point clouds, they only work for limited deformations and fail for large ones.
Posture normalization takes a single body as input. Although it is not a nonrigid registration problem, we propose a novel formulation for posture normalization by introducing a virtual target shape and by training the neural network on synthetic data, inspired by the formulation of nonrigid registration.

III. PROPOSED METHOD

A. Problem Statement
As aforementioned, inspired by nonrigid registration, we propose a novel formulation for the posture normalization problem addressed in this study. Nonrigid point cloud registration takes as input pairs (S_i, T_i) of source and target point clouds, respectively, and finds the proper mapping from the source to the target. Unlike nonrigid point cloud registration, posture normalization only has access to sources (input point clouds) without targets. Let S = {x_i ∈ R^3 | i = 1, . . ., N} be an input point cloud with N points; S is a discrete representation of either an unordered body point cloud or the vertices of a body mesh. Our idea can be summarized in three steps.
1) Predict a virtual target, which is a T-pose body mesh denoted by (T̂, τ). T̂ represents a set of vertices, and τ is its topology. T̂ and S are two body point sets corresponding to the same body shape but with different postures.
2) Attach S to a body mesh denoted by (Ŝ, τ), which is seen as a virtual source. Ŝ denotes a set of vertices, and τ is its topology, shared with (T̂, τ). Ŝ and S are two body point sets corresponding to the same body shape and posture. For simplicity, we shall denote meshes by their vertex sets.
3) Normalize the posture of the input to output the final result T̃, which is the point cloud of the input in T-pose; this is obtained by leveraging the two obtained topology-shared meshes.
Given S, there exist two mappings Φ : R^{N×3} → R^{M×3} and Ψ : R^{N×3} → R^{M×3} that map S onto one SMPL model [8] in the input posture and in T-pose to obtain the two topology-shared meshes Ŝ and T̂, respectively. We observe that more accurate Ŝ and T̂ improve T̃. Here, the accuracy of T̂ refers to the similarity between T̂ and the input body shape, while for Ŝ, it refers to the similarity between Ŝ and S (both in terms of shape and posture).
It is challenging to obtain the topology-shared meshes Ŝ and T̂. We train a first neural network to estimate Φ and Ψ coarsely and simultaneously. This yields T̂_coarse and Ŝ_coarse, which are coarse estimates of T̂ and Ŝ, respectively. To further improve the performance, we exploit another neural network to adjust the shape of T̂_coarse generated by the first network and output the high-accuracy T̂; we also propose a two-step optimization to obtain Ŝ by refining Ŝ_coarse obtained from the first network. Finally, we achieve topology-driven posture normalization and change the posture through a homeomorphic mapping between the high-accuracy input-posture topological space and the canonical-posture topological space.
As shown in Fig. 1, the proposed coarse-to-fine method comprises the following building blocks: 1) the D2CC network, which produces Ŝ_coarse and T̂_coarse; 2) a refinement architecture that refines Ŝ_coarse and T̂_coarse; and 3) posture normalization based on the two high-accuracy topology-shared meshes Ŝ and T̂.

B. Dataset
To train the proposed model, a high-quality, large-scale dataset is necessary. It should contain posed bodies and corresponding paired ground-truth bodies in the canonical pose. The ideal solution is to collect such a dataset by scanning many subjects. However, doing so is costly. Furthermore, it is impossible to have all subjects keep exactly the same canonical pose. Thus, to resolve the dataset problem, we train the proposed model on a synthetic dataset.
Similar to the work of [10], we leverage the SMPL model [8] to synthesize 150 K pairs of male bodies and 150 K pairs of female bodies, each pair containing a body in a noncanonical pose and the same body in the "T" pose. Fig. 2 illustrates the proposed data generation pipeline. As the SMPL model is controlled by shape and pose parameters, we extract SMPL parameters from the SURREAL dataset [28] and perform random combinations to synthesize input bodies with random shapes and arbitrary postures. The T-pose bodies are obtained by setting the pose parameter in each combination to zero. To strengthen the robustness of our model, we add a slight random rotation to each pair of human bodies.
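The random-combination step above can be sketched as follows. Here, `smpl_forward` is a hypothetical stand-in for a real SMPL implementation (it only reproduces the interface, not the body model), and the rotation range is an assumption, since the paper states only that a "slight" random rotation is added:

```python
import numpy as np

def smpl_forward(shape_params, pose_params):
    """Hypothetical stub: a real SMPL layer maps shape and pose parameters
    to the 6890 vertices of the SMPL body mesh."""
    base = np.zeros((6890, 3))
    base[:, 1] = np.linspace(-1.0, 1.0, 6890)            # crude vertical "body" axis
    return base + 0.01 * shape_params.sum() + 0.01 * pose_params.sum()

def make_pair(shape_pool, pose_pool, rng):
    """Draw one (posed body, T-pose body) training pair by randomly combining
    SMPL shape and pose parameters extracted from SURREAL."""
    shape = shape_pool[rng.integers(len(shape_pool))]
    pose = pose_pool[rng.integers(len(pose_pool))]
    posed = smpl_forward(shape, pose)
    tpose = smpl_forward(shape, np.zeros_like(pose))     # zero pose -> "T" pose
    a = rng.uniform(-0.1, 0.1)                           # slight shared rotation (assumed range)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    return posed @ R.T, tpose @ R.T
```

Note that the same rotation is applied to both bodies of a pair, so the posed input and its T-pose ground truth stay consistently oriented.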
Preprocessing: For each point cloud S, we first compute its centroid c = (1/N) Σ_{i=1}^{N} x_i and its scale s. The normalized point cloud is then given by x̄_i = (x_i − c)/s, i = 1, . . ., N. To train and test the proposed model, we randomly split the preprocessed synthetic dataset into a training set, validation set, and testing set, accounting for 97%, 2%, and 1% of the data samples, respectively.
Postprocessing: The output needs to be converted back to its original scale and location as x_i = s x̄_i + c.
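A minimal sketch of the pre- and postprocessing steps; the unit-sphere scale definition used here is an assumption (the paper states only that a centroid and a scale are computed):

```python
import numpy as np

def normalize(points):
    """Preprocessing: center the cloud at its centroid and scale it into the
    unit sphere (the scale definition here is an assumed, common choice)."""
    c = points.mean(axis=0)
    s = np.linalg.norm(points - c, axis=1).max()
    return (points - c) / s, c, s

def denormalize(points, c, s):
    """Postprocessing: restore the original scale and location."""
    return points * s + c
```

Whatever scale definition is used, storing (c, s) per cloud makes the postprocessing an exact inverse of the preprocessing.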

C. D2CC Network
The first step of our PoseNormNet is to convert one discrete input point cloud into two continuous topology-shared meshes of this point cloud, in the input posture and in a T-pose, respectively. This is achieved through a novel PointNet-based encoder-decoder structure, which we call the D2CC network.
Encoder: An efficient encoder that is robust and can extract sufficient and effective geometric information from the input point cloud is necessary. Similar to PCN [29], we learn an encoder function φ : R^{N×3} → R^{1024} that extracts features from the input point cloud by cascading two PointNet architectures. The first PointNet φ_1 with weights w_1 consists of a multilayer perceptron MLP_1 with hidden sizes 128 and 256, followed by max pooling. MLP_1 maps each point x_i from S to a pointwise feature vector x̃_i = (x̃_{ij})_{j=1}^{256}; max pooling over the resulting feature matrix yields a global feature vector Ω_1. The input to the second PointNet φ_2 with weights w_2 is an augmented point feature matrix χ_1, obtained by concatenating Ω_1 to each x̃_i. φ_2 consists of MLP_2 with hidden sizes 512 and 1024, followed by pointwise max pooling; mapping χ_1 yields the final 1024-dimensional feature vector Ω. The feature extractor φ with parameters w is thus the composition of φ_1 with weights w_1 and φ_2 with weights w_2.

Decoder: From the feature vector Ω, we need to learn two functions, ϕ_input and ϕ_T : R^{1024} → R^{M×3}, that output the two continuous topology-shared meshes Ŝ_coarse and T̂_coarse simultaneously. The decoder therefore comprises two branches of identical structure but with different weights for each neural node. We call the branch outputting Ŝ_coarse the input-pose branch and the other the T-pose branch; they are formulated as ϕ_input with parameters w_input and ϕ_T with parameters w_T, respectively. We design a concurrent decoder in which each branch is a three-layer fully connected (FC) network with feature sizes 1024, 1024, and 6890 × 3 (denoted ϕ^1), followed by a reshaping procedure (denoted ϕ^2) that outputs the coordinates of 6890 points in the same order as the vertices of the SMPL model.
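The tensor shapes of the cascaded encoder and the two-branch decoder can be traced with the following numpy sketch; the layer sizes follow the text, while the random weights and ReLU activations are placeholders for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, sizes):
    """Shared (pointwise) MLP: the same weights are applied to every point."""
    for h in sizes:
        W = rng.standard_normal((x.shape[-1], h)) * 0.01
        x = np.maximum(x @ W, 0.0)                       # ReLU
    return x

def d2cc_forward(S):
    """Shape-level sketch of the D2CC forward pass (untrained placeholder)."""
    f1 = mlp(S, [128, 256])                              # pointwise features (N, 256)
    omega1 = f1.max(axis=0)                              # global feature Omega_1 (256,)
    chi1 = np.concatenate([f1, np.tile(omega1, (S.shape[0], 1))], axis=1)
    f2 = mlp(chi1, [512, 1024])
    omega = f2.max(axis=0)                               # final feature Omega (1024,)

    def branch(v):
        """FC 1024 -> 1024 -> 6890*3, then reshape; each call draws fresh
        random weights, mimicking the two branches' separate weights."""
        h = mlp(v[None, :], [1024, 1024])
        W = rng.standard_normal((1024, 6890 * 3)) * 0.01
        return (h @ W).reshape(6890, 3)

    return branch(omega), branch(omega)                  # S_coarse, T_coarse
```

Because the global feature is a max pool over points, the encoder output is invariant to the input point ordering, which is what allows D2CC to consume unordered scans.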

D. Refinement Architecture
In our design, we observed that the results of D2CC can be further improved. Ŝ_coarse can be refined by an additional optimization step minimizing the error between S and Ŝ_coarse, while T̂_coarse can be optimized by offset learning.

1) Refined Virtual Source Shape:
The theoretical analysis in Section IV shows that the proposed PointNet-based structure is effective, as it leads to bounded errors in the mapping process. Considering that these errors originate from the encoder and decoder, we introduce a two-step optimization process based on the Adam optimizer [30] (termed Algorithm 1), devised to minimize the errors induced by the encoder and decoder. The Adam parameters are set as follows: 1) learning rate = 0.001; 2) exponential decay rates for the moment estimates β_1 = 0.9 and β_2 = 0.999; and 3) iterations = 300. To this end, we adopt the Chamfer distance L(S, Ŝ_coarse) as the optimization objective.

2) Refined Virtual Target Shape: Different from the Ŝ_coarse refinement, we do not have a corresponding ground-truth target shape. We therefore design an Offset-Refine network to optimize the coarse mesh T̂_coarse; the refinement process is shown in Fig. 1. Given the coarse T-pose result T̂_coarse = (t_1, t_2, . . ., t_6890), we use a PointNet with hidden layers of feature sizes 16, 64, 128, 256, and 512 to map T̂_coarse to a local point feature matrix f_2. The global feature Ω_2 is obtained by max pooling f_2 and is concatenated with each point in T̂_coarse, yielding the updated input χ_2 to the next PointNet, which in turn produces a 6890 × 3 offset matrix ΔT̂ = (Δt_{ij})_{6890×3}. The refined T-pose is then given by T̂ = T̂_coarse + ΔT̂.
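The Chamfer-based refinement of the virtual source can be sketched as follows. Plain nearest-neighbor gradient steps stand in here for the Adam optimizer used in the paper (lr = 0.001, β_1 = 0.9, β_2 = 0.999, 300 iterations), and only the Ŝ → S direction of the objective is descended, so this is an illustration of the idea rather than Algorithm 1 itself:

```python
import numpy as np

def chamfer(A, B):
    """Symmetric Chamfer distance between two point sets."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def refine(S, S_coarse, steps=50, lr=0.1):
    """Move every coarse vertex toward its nearest input point, which is the
    gradient direction of the S_hat -> S Chamfer term."""
    X = S_coarse.copy()
    for _ in range(steps):
        d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=-1)
        nearest = S[d.argmin(axis=1)]
        X += lr * (nearest - X)
        # (a full implementation would also descend the S -> S_hat term)
    return X
```

In PoseNormNet this refinement is possible precisely because S itself serves as the "ground truth" for Ŝ_coarse, unlike T̂_coarse, which has no observed target.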

E. Posture Normalization
The essence of the high-accuracy topology-shared meshes produced by D2CC and refined by the refinement steps is to find an SMPL model with the same body size as S in two postures, the input posture and T-pose, and to attach each point in S to the most appropriate vertex on these two meshes, respectively.

Fig. 3. Illustration of the proposed topology-driven deformation: The red point v_i is a point from S, which is attached to the most appropriate vertex (black point) of Ŝ. The green points are the centroids of triangles j_1, j_2, . . ., j_n with the black point as their shared vertex. With v_i attached to this black point, nearest(i) consists of j_1, j_2, . . ., j_n. Similar to [31], the transformation of v_i is determined by the triangles in nearest(i).

As Fig. 3 shows, a point attached to one vertex is transformed with this vertex, while the transformation of this vertex depends on the triangles it belongs to. As the two topology-shared meshes Ŝ and T̂ have one-to-one triangle correspondences, the posture normalization corresponds to a topology-driven deformation (see Fig. 3), whereby the topology-shared mesh in the input posture is deformed to the T-pose. The mapping of one point v_i in S to its corresponding point ṽ_i attached to T̂ is mathematically expressed by

ṽ_i = Σ_{j ∈ nearest(i)} w_j (T_j v_i + D_j)

where nearest(i) denotes the triangles in Ŝ nearest to v_i, T_j and D_j are the rotation matrix and translation vector of triangle j, respectively, and w_j is a weight calculated as

w_j = r_j / Σ_{k ∈ nearest(i)} r_k

where r_j is the reciprocal of the distance between v_i and the centroid of triangle j.
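The per-point topology-driven deformation can be sketched for a single attached point; the weighted combination of triangle rotations and translations follows the formulation above, while the function name and argument layout are ours:

```python
import numpy as np

def deform_point(v, tri_rots, tri_trans, tri_centroids):
    """Deform one attached point v by the triangles in nearest(i).
    tri_rots[j] and tri_trans[j] are the rotation matrix T_j and translation
    vector D_j of triangle j; weights are normalized reciprocal distances
    to the triangle centroids."""
    r = 1.0 / np.maximum(np.linalg.norm(tri_centroids - v, axis=1), 1e-9)
    w = r / r.sum()                          # w_j = r_j / sum_k r_k
    out = np.zeros(3)
    for wj, Tj, Dj in zip(w, tri_rots, tri_trans):
        out += wj * (Tj @ v + Dj)            # sum_j w_j (T_j v + D_j)
    return out
```

Since the weights sum to one, a point whose neighboring triangles all share the same rigid motion is simply carried along by that motion, which keeps the deformation locally rigid.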

F. Loss Function
As shown in Fig. 1, PoseNormNet has two neural networks. We adopt the following training loss functions to measure the reconstruction errors.
D2CC: Its scope is to reconstruct the two topology-shared meshes Ŝ_coarse and T̂_coarse from S. We train D2CC in a supervised manner by setting the SMPL models in the input posture and in T-pose, denoted S̄ and T̄, as the ground truth for the two decoder branches, respectively. We define the composite reconstruction error for these two branches as

L_rec = (1/M) Σ_{i=1}^{M} ||α_i − β_i||_2^2 + (1/M) Σ_{i=1}^{M} ||η_i − ζ_i||_2^2

where α_i ∈ S̄ and β_i ∈ Ŝ_coarse are the ith corresponding point pair in S̄ and Ŝ_coarse, while η_i ∈ T̄ and ζ_i ∈ T̂_coarse are the ith corresponding point pair in T̄ and T̂_coarse.
Offset-refine network: To refine T̂_coarse, we train a network supervised by the offset vector calculated between T̂_coarse and T̄. We denote the ground-truth offset vector as ΔT̄ and define the offset loss function as

L_offset = (1/6890) Σ_{i=1}^{6890} ||Δη_i − Δζ_i||_2^2

where Δη_i ∈ ΔT̄ and Δζ_i ∈ ΔT̂ are the ith corresponding point pair in ΔT̄ and ΔT̂, respectively.
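The two supervised losses can be sketched directly, since the SMPL topology provides known point correspondences; the mean-squared form and the equal weighting of the two branches are assumptions about the exact equations:

```python
import numpy as np

def d2cc_loss(S_gt, S_coarse, T_gt, T_coarse):
    """Composite reconstruction loss over the two decoder branches:
    mean squared distance over known point correspondences."""
    e_in = np.mean(np.sum((S_gt - S_coarse) ** 2, axis=1))   # input-pose branch
    e_t = np.mean(np.sum((T_gt - T_coarse) ** 2, axis=1))    # T-pose branch
    return e_in + e_t

def offset_loss(dT_gt, dT_pred):
    """Offset-refine loss: mean squared error between ground-truth and
    predicted per-vertex offset vectors."""
    return np.mean(np.sum((dT_gt - dT_pred) ** 2, axis=1))
```

Because the correspondences are fixed by the shared topology, no nearest-neighbor search (as in the Chamfer distance) is needed at training time.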

IV. THEORETICAL ANALYSIS
An ideal Ŝ should satisfy two conditions: 1) it has the same topology as T̂; and 2) it has the same body size and posture as S. As the meshes obtained from D2CC are reconstructed onto the SMPL model, the first condition is satisfied by construction. Therefore, in this section, we give a theoretical analysis of the second condition, demonstrating that, for any S, there exists an encoder-decoder pair that leads to bounded errors between S and the reconstructed point set Ŝ when producing the topology-shared mesh.
Theorem 1: Let S = {x_1, x_2, . . ., x_n} be a disordered point set corresponding to a 3-D point cloud with n points. Assume that there exists at least one ordered point set of vertices on a human body mesh Ŝ = {y_1, y_2, . . ., y_m} such that one open cover of Ŝ covers S and one open cover of S covers Ŝ. Then, there exists a pair of encoder φ : S → R^{max{m,n}} and decoder ϕ_input : R^{max{m,n}} → Ŝ satisfying L(S, ϕ_input(φ(S))) = L(S, Ŝ) ≤ C, where L(·, ·) is the augmented Chamfer distance given by (12) and C is a constant depending on max{ max_{x∈S} min_{y∈Ŝ} ||x − y||_2, max_{y∈Ŝ} min_{x∈S} ||y − x||_2 }.

Proof: We prove this by creating different finite covers based on the different relationships between n and m, as shown in Fig. 4. Let

r = max{ max_{x∈S} min_{y∈Ŝ} ||x − y||_2, max_{y∈Ŝ} min_{x∈S} ||y − x||_2 } + ε

where ε is an arbitrarily small positive constant.
If n ≤ m, create n spheres B_i with x_i as center and r as radius; when n > m, create m spheres B_i with y_i as center and r as radius. Denote B = ∪_i B_i. Obviously, B covers both point sets S and Ŝ. We define a distance function between S and Ŝ as g(x, y) = ||x − y||_2, so that the distance matrix of S and Ŝ is g(S, Ŝ) = (g(x_i, y_j))_{n×m}. If n ≤ m, the encoder function φ is a composite function of max pooling and g(S, Ŝ). The decoder then maps each point x_i to the points within its sphere; x_i is the closest point in S for its corresponding points in Ŝ (one point in S may correspond to more than one point in Ŝ). If n > m, similarly, the encoder function φ is a composite function of max pooling and g(Ŝ, S) = g(S, Ŝ)^T. Therefore, for each point x_i, we can find the unique point y_j in Ŝ such that x_i lies in the sphere with y_j as centroid and r as radius, with y_j being the closest point in Ŝ to x_i. The decoding process then reconstructs x_i to this unique point y_j in Ŝ (one notes that more than one point in S may correspond to one point in Ŝ).
Because of the open cover we constructed, for any x in S, we can find at least one y in Ŝ satisfying ||x − y||_2 < r; at the same time, for any y in Ŝ, we can also find at least one point x in S satisfying ||y − x||_2 < r. Both directed terms of the Chamfer distance are therefore bounded by r, so L(S, Ŝ) ≤ C for a constant C built from r.

Corollary 1: For high-density input (n ≫ m), the error depends on the density of the mesh; improving the mesh density leads to better performance.

Proof: When n ≫ m, every mesh vertex y ∈ Ŝ has a very close scan point, so r is dominated by max_{x∈S} min_{y∈Ŝ} ||x − y||_2, which decreases as the density of the mesh increases. Hence, the higher the density of the mesh, the smaller the value of C, and the error L(S, Ŝ) ≤ C becomes smaller.
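The covering bound of Theorem 1 can be sanity-checked numerically; L below is a common symmetric Chamfer form (the paper's augmented variant in Eq. (12) may differ), and the constant C = 2r follows from bounding each directed term by the covering radius:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.random((300, 3))                    # "scan" points (n = 300)
Y = rng.random((120, 3))                    # "mesh vertex" points (m = 120)

d = np.linalg.norm(S[:, None, :] - Y[None, :, :], axis=-1)
term_s = d.min(axis=1)                      # each scan point's distance to the mesh
term_y = d.min(axis=0)                      # each vertex's distance to the scan

eps = 1e-6
r = max(term_s.max(), term_y.max()) + eps   # covering radius from the proof
L = term_s.mean() + term_y.mean()           # Chamfer-style error (a common form)
C = 2.0 * r                                 # constant built from the max-min distances
```

Since every summand in each directed mean is strictly below r, the inequality L ≤ 2r holds for any pair of finite point sets, mirroring the argument in the proof.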
Corollary 2: The human body mesh Ŝ described in Theorem 1 is not unique.
Therefore, we can find the most proper Ŝ by minimizing the error between the input and output.

V. EXPERIMENTS

A. Training Environment
The training is performed on a desktop PC with an Intel(R) Core i9-9900 CPU @ 3.10 GHz × 16, a GeForce RTX 2080 Ti GPU, 62.7 GiB of memory, and Ubuntu 16.04 LTS as the operating system. All training and testing are performed under CUDA 9.0, cudnn 7.0.5, Python 3.6, and TensorFlow-GPU 1.0.5.
The proposed D2CC was trained with the Adam optimizer [30] with an initial learning rate of 0.0001 for 300 K iterations and a batch size of 16. The learning rate is decayed by a factor of 0.7 every 50 K iterations. The proposed offset-refine network was trained using the Adam optimizer with an initial learning rate of 0.001 and exponential decay rates for the moment estimates of β_1 = 0.9 and β_2 = 0.99; its maximum number of steps is 100 K, and the batch size is 8. The training times of D2CC and the offset-refine network are 11 h and 1 h, respectively. In the inference phase, completing the posture normalization takes less than 10 s.
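The stated step-decay schedule for D2CC can be written as a one-line function:

```python
def d2cc_lr(step, base=1e-4, decay=0.7, every=50_000):
    """Learning-rate schedule from the paper: start at 1e-4 and multiply
    by 0.7 every 50 K iterations (300 K iterations in total)."""
    return base * decay ** (step // every)
```

Over the full 300 K iterations, the learning rate thus decays through six plateaus, ending at 1e-4 × 0.7^5.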

B. Evaluation Criteria
We compare our method against different related works based on both synthetic and real-world datasets. For the synthetic experiments, we have the ground truth for each input. Therefore, we employ the maximum error (29) and the Chamfer distance (12) to measure the maximum and average Euclidean distance between the closest pairs of points in the predicted body Ŝ and the ground truth S_GT, respectively.

TABLE I OVERALL COMPARISONS
In addition, we further perform an evaluation based on the earth mover's distance (EMD) given by

EMD(Ŝ, S_GT) = min_{G : Ŝ→S_GT} (1/|Ŝ|) Σ_{x∈Ŝ} ||x − G(x)||_2

where G is a bijection chosen to minimize the average distance between corresponding points in Ŝ and S_GT.
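The two synthetic-data metrics can be sketched as follows; the Hungarian solver is one exact way to obtain the minimizing bijection in the EMD (large point clouds typically use an approximation instead), and the function names are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_error(P, Q):
    """Maximum Euclidean distance between closest point pairs of the
    prediction P and ground truth Q."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def emd(P, Q):
    """Earth mover's distance for equal-size sets: the bijection minimizing
    the average pairwise distance, solved exactly with the Hungarian
    algorithm (O(n^3))."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    row, col = linear_sum_assignment(d)
    return d[row, col].mean()
```

Unlike the Chamfer distance, the EMD enforces a one-to-one matching, so it penalizes predictions that cluster many points onto a few ground-truth locations.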
For real-world data, we employ the ground-truth body measurements from [32] (denoted as V_true) due to the lack of ground-truth T-pose bodies. We denote the predicted body measurements as V_predict. Then, we can adopt the absolute error defined as

E = |V_predict − V_true|

to evaluate the accuracy of the body measurements.
C. Results on Synthetic Data
We use the synthetic test dataset mentioned in Section III-B. We randomly select 450 unseen female and 450 unseen male samples, which have T-pose models as ground truth and no intersection with the training and validation datasets. Therefore, for the synthetic data, we have access to the ground-truth T-pose body.
In Table I, we compare our method with related works in terms of input, processing details, and output. The processing efficiency of skeleton-based methods is low because they need rigging and further manual processing to accurately adjust the postures, which is time-consuming and fails for complex postures. Moreover, accurate automatic rigging is still an open problem. Compared to skeleton-driven methods, the proposed method avoids the expensive rigging, is fully automatic, and can deal with arbitrary postures. Unlike skeleton-based methods, template-based methods are more efficient as they avoid rigging operations. However, template-based methods cannot preserve the identity information because the deformed template is only a shape approximation of the input scan. Nonrigid registration (e.g., CPD-Net and RMA-Net) may be able to preserve the identity information of the raw scan, but it requires a T-pose target to run the nonrigid registration optimization, which is lacking in practice. Fig. 5 shows the point-to-point comparisons between our method and six other methods. As Fig. 5 shows, Automatic rigging [6] and Rig-Net [7] cannot embed proper joints in a body with a complex posture. For simple postures, the results of [6] and [7] exhibit obvious twists at the joints, such as the knees. RMA-Net [26] almost fails to output a human body, and the results of CPD-Net [25] merely "look like" a body. IP-Net [9] outputs human bodies that are smaller than the ground truth, while I2H [10] outputs larger bodies. In summary, skeleton-driven methods can handle neither complex postures nor joint twists, nonrigid point cloud registration methods struggle to output a realistic body, and the output body sizes of the template-based methods do not exactly match the input bodies.
Complementing the results in Fig. 5, Table II reports the average maximum errors, the average Chamfer distances, and the average EMD for each method. The average values obtained with our method are smaller than those of the reference methods. These results show that our method achieves the best performance, both by properly handling all postures and, in objective terms, by providing the smallest errors between the output and the ground truth.

D. Results on Real-World Data
Our model is trained on the synthetic dataset, but it generalizes well to real-world data. We employ the FAUST data [33] (100 male bodies and 100 female bodies) and the real scan dataset (23 male bodies and 25 female bodies) from [32], with ground-truth body measurement values collected by a professional tailor, to demonstrate the performance of our proposed method on real-world data.

TABLE III AVERAGE ABSOLUTE ERRORS ON THE DATASET FROM [32] (UNIT: CM)

1) Qualitative Comparisons: We qualitatively compare our method with skeleton-driven methods (Automatic rigging [6] and Rig-Net [7]) and human template-based methods (IP-Net [9] and I2H [10]) on the FAUST data, as shown in Fig. 6.
The joint twists of Automatic rigging [6] and Rig-Net [7] are obvious. Template-based methods avoid joint twists, but their output surfaces are smooth and lack personal physical traits. In contrast, our method both avoids joint problems and preserves personal details, such as the muscles on the legs, the skin on the knees and busts, and even facial details. As shown in Fig. 6 (rightmost column), more intuitive comparisons can be obtained by overlapping the results of IP-Net, I2H, and ours; the bodies reconstructed by IP-Net are shorter. Close-ups of each row in Fig. 6 are given in red rectangles. It can be noted that the results of IP-Net and I2H lose the identity information of the raw scan, while fine geometric details are well preserved in our results.
2) Quantitative Comparisons: We perform quantitative comparisons between our algorithm and four posture normalization methods: two skeleton-driven methods, namely, Automatic rigging [6] and Rig-Net [7], and two template-based methods, namely, IP-Net [9] and I2H [10]. The experiments are performed on the dataset from [32], as it provides ground-truth body measurements extracted by a professional tailor.
We measured the girths of the arm, torso, and leg for five different posture normalization methods on six males and six females [32]. Table III reports the average absolute errors between the tailor's measurements and the outputs of the five posture normalization methods. Compared with the body measurements of Automatic rigging [6], Rig-Net [7], IP-Net [9], and I2H [10], our outputs are closer to the input body. Over all measured sizes, the two skeleton-driven methods have average errors close to 3 cm (Automatic rigging) and 2 cm (Rig-Net), respectively, while the two human template-based methods have average errors of approximately 4 cm (IP-Net) and 3 cm (I2H), respectively. In contrast, all the errors of our method are within 1 cm, and even approach 0 for the hip, knee, and thigh measurements. These results demonstrate that the proposed method better preserves the biometric identity, that is, the biometric measurements, compared to the state of the art.

Fig. 6. Qualitative comparisons based on the FAUST data. By comparing the details on the body (legs, knees, faces, and busts) shown in the red rectangles, one notices that our method better preserves personal physical traits. In the last column, we overlap the results of IP-Net (orange meshes), I2H (green meshes), and ours (blue meshes) for more intuitive comparisons.
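The error statistic behind Table III is a per-measurement mean absolute error against the tailor's ground truth. A minimal sketch, using hypothetical girth values (not the paper's data):

```python
# Hypothetical girth values in cm; illustrative only, NOT the paper's data.
tailor    = {"hip": 98.0, "knee": 36.5, "thigh": 55.0}
predicted = {"hip": 98.4, "knee": 36.3, "thigh": 55.5}

def mean_abs_error(gt, pred):
    """Mean absolute error over a set of named body measurements."""
    return sum(abs(gt[k] - pred[k]) for k in gt) / len(gt)

print(round(mean_abs_error(tailor, predicted), 3))  # 0.367
```

In the actual evaluation, this average is taken per measurement over all twelve subjects, and a sub-centimeter value indicates that the normalized body keeps the subject's girths.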

E. Ablation Study
Without the two-step optimization in the input branch, the posture normalization processing time for one real-world scan can be less than one second, because all the steps are deep learning-based. To illustrate the value of our design and the influence of the two-step optimization in our framework, we conduct ablation experiments on the unseen dataset adopted in Section V-C.
1) Time and Point Sizes: First, we discuss the relationship between point size and processing time. We uniformly sample 50 unseen synthetic bodies (25 males and 25 females) at point sizes from 5 k to 150 k with a step of 1 k, generating 145 groups of samples. We compute the average processing time and Chamfer distance (12) for three different input branch designs on the 145 groups: the input branch without optimization, the input branch with one-step optimization (feature optimization), and the input branch with two-step optimization (feature optimization and point-to-point optimization). Here, (12) for each design is L(S, Ŝcoarse), L(S, Ŝref1), and L(S, Ŝ), respectively, where Ŝref1 denotes the virtual source obtained from the input branch with one-step optimization. Fig. 7 shows the comparison results in terms of processing time and errors. Without optimization, the processing time is shorter, but the errors are larger. The two optimization strategies make the processing time linear in the number of points while reducing the errors. In addition, when the number of points is large enough, the point size no longer influences the error.
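The timing protocol of this ablation can be sketched as a simple harness that sweeps point-cloud sizes and averages wall-clock time per design. The `centroid_shift` placeholder below is a dummy stand-in for an input-branch variant, and the sizes are toy values, not the 5 k–150 k sweep of the paper:

```python
import time
import numpy as np

def benchmark(run_fn, point_sizes, trials=3):
    """Average wall-clock time of `run_fn` per point-cloud size."""
    results = {}
    for n in point_sizes:
        cloud = np.random.rand(n, 3).astype(np.float32)
        start = time.perf_counter()
        for _ in range(trials):
            run_fn(cloud)
        results[n] = (time.perf_counter() - start) / trials
    return results

# Dummy stand-in for one of the three input-branch variants
# (no optimization / one-step / two-step optimization).
def centroid_shift(cloud):
    return cloud - cloud.mean(axis=0)

timings = benchmark(centroid_shift, point_sizes=[5_000, 10_000, 20_000])
```

Running the same harness once per design yields the time-versus-point-size curves of Fig. 7.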
2) Performance and Input Branch Designs: We further compare the influence of the input branch design on the posture normalization results. Fig. 8 illustrates that the two-step optimization reduces the mean error to approximately 0.05 mm. Combined with the time analysis, the two-step optimization framework takes only 10 s even when the input point size is 150 k. It is worth sacrificing a small amount of processing time for more accurate results.

VI. CONCLUSION
In this article, we proposed, to the best of our knowledge, the first deep learning-based method for automatically adjusting a posed scanned body into a canonical posture. Our method operated on raw human scans directly captured by 3-D scanners and deformed them to the target "T" pose while preserving the subject's identity. The proposed model was trained solely on the proposed synthetic dataset but generalized well to real-world data. The proposed theoretical analysis proved that the errors incurred by the encoder are bounded. Extensive experimental results on both unseen synthetic and real-world data demonstrated that the proposed method achieved state-of-the-art performance in both objective and subjective terms.
The proposed method also had limitations. For instance, our training dataset lacked clothes and hair, which made the method fail on bodies wearing wide clothes or having long hair. Additionally, our method depended on the predicted virtual source and virtual target; a nonideal topology of either could affect the posture normalization results. In future work, we are interested in extending the proposed method to deal with bodies wearing wide clothes and long hair. Proposing novel formulations of posture normalization is also an interesting direction.

Fig. 1. Overview: Given the input point cloud S, we use only one encoder to obtain a global feature Ω that serves as input for a two-stream concurrent decoder determining two coarse topology-shared meshes, (Ŝcoarse, τ) in the input posture and (T̂coarse, τ) in T-pose, respectively, from the corresponding SMPL model. We refine them separately and perform topology-driven point cloud registration based on the high-precision topology-shared meshes (Ŝ, τ) and (T̂, τ).

Fig. 2. Overview of data synthesis: We first sample the shape (β_i, i ∈ {1, 2, . . ., N}) and pose (θ_j, j ∈ {1, 2, . . ., M}) parameters of SMPL from SURREAL. Then we randomly combine these parameters as (β_i, θ_j) to synthesize the input body (the SMPL model in random shape and posture). Finally, we set θ_j to the canonical pose parameter θ (zero vector) to obtain the corresponding ground truth (the SMPL model with the input body shape in T-pose).
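The data-synthesis recipe of Fig. 2 can be sketched as follows. The `smpl` function below is a deterministic dummy standing in for the real SMPL model (which maps shape and pose parameters to mesh vertices), and the parameter pools are random placeholders for the SURREAL samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter pools standing in for SURREAL samples.
N, M = 5, 7                        # number of shape / pose samples
betas  = rng.normal(size=(N, 10))  # SMPL shape parameters beta_i
thetas = rng.normal(size=(M, 72))  # SMPL pose parameters theta_j

def smpl(beta, theta):
    """Dummy stand-in for the real SMPL function (which returns mesh
    vertices); deterministic so the sketch runs end to end."""
    return np.outer(theta, beta).ravel()

def synthesize_pair(beta, theta):
    """One training pair: posed input body and its T-pose ground truth."""
    posed  = smpl(beta, theta)                 # random shape and posture
    t_pose = smpl(beta, np.zeros_like(theta))  # same shape, canonical pose
    return posed, t_pose

# Randomly combine (beta_i, theta_j), as in the data-synthesis overview.
i, j = rng.integers(N), rng.integers(M)
posed, gt = synthesize_pair(betas[i], thetas[j])
```

Because the shape parameters are shared between the posed body and its T-pose counterpart, every synthesized pair is identity-consistent by construction.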

Fig. 4. Red points are the input point cloud S, and blue points are the corresponding mesh vertices Ŝ. An open cover B is constructed on the sparse point set. The left figure shows the constructed open cover covering both red and blue points when the number of red points (n) is larger than the number of blue points (m); the right figure illustrates the opposite case with n ≤ m. Dense points are reconstructed at the closest ball center.

Fig. 5. Quantitative comparisons between our method and the different reference methods based on unseen synthetic data. Points are colored by the point-to-point errors. Missing results indicate that the corresponding method failed to produce an output.

Fig. 7. For the three input branch designs, the left figure shows the relationship between point size and error, while the right figure shows the relationship between point size and processing time. The blue curve presents the input branch without optimization, the orange curve the input branch with one-step optimization (feature optimization), and the green curve the input branch with two-step optimization (feature optimization and point-to-point optimization).

Fig. 8. For the three input branch designs, the left boxplots show their error distributions. Here, the "coarse" boxplot presents the results of the input branch without optimization, the "ref1" boxplot the results with one-step optimization (feature optimization), and the "ref2" boxplot the results with two-step optimization (feature optimization and point-to-point optimization). The right figures give an example of an error comparison; T̂ref1 denotes the posture normalization result based on feature optimization.

TABLE III. MEAN OF ABSOLUTE ANTHROPOMETRIC MEASUREMENT (GIRTH) ERRORS ON THE DATASET FROM [32] (UNIT: CM)