Authors:
Junho Moon, Haejun Chung, Ikbeom Jang
Paper:
https://arxiv.org/abs/2408.10060
Introduction
Facial wrinkle detection is a critical aspect of cosmetic dermatology, serving as an indicator of aging and skin health. However, manual segmentation of facial wrinkles is a challenging and time-consuming task, often leading to inconsistent results due to subjectivity among graders. To address these issues, the study proposes two main solutions: the creation of a public facial wrinkle dataset and a novel training strategy for U-Net-like encoder-decoder models to automatically detect facial wrinkles.
Related Work
Deep Learning-Based Facial Wrinkle Segmentation
Deep learning methods have been increasingly applied to facial wrinkle segmentation. Kim et al. introduced a semi-automatic labeling strategy that combines texture maps with roughly labeled wrinkle masks using a U-Net architecture. They further improved segmentation accuracy with a weighted deep supervision technique. Yang et al. developed Striped WriNet, which uses a Striped Attention Module within a U-shaped network to segment both coarse and fine wrinkles effectively.
Weakly Supervised Learning
Weakly supervised learning trains models on data whose labels are incomplete or inaccurate. Xu et al. proposed CAMEL, a framework that uses a label expansion technique based on multiple instance learning (MIL) for histopathology image segmentation. Shen et al. trained a deep learning model using scribbles and global labels to segment brain tumors.
Research Methodology
Dataset Specifications
The study introduces the ‘FFHQ-Wrinkle’ dataset, an extension of the NVIDIA FFHQ dataset. It includes 1,000 images with human labels and 50,000 images with automatically generated weak labels. The dataset features diverse individuals of various ages, races, and skin conditions, making it suitable for training models to handle a wide range of clinical scenarios.
Ground Truth Wrinkle Annotation
Wrinkle annotation was performed by three experienced annotators, focusing on dynamic and static wrinkles. The annotation process involved synchronization sessions to minimize inter-rater variability. The final ground truth wrinkle masks were created using majority voting to reduce subjectivity.
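The majority-voting step can be illustrated with a minimal sketch. It assumes each annotator's output is a binary mask of the same shape; the function name `majority_vote` and the toy 2×2 masks are hypothetical and not taken from the paper.

```python
import numpy as np

def majority_vote(masks):
    """Combine binary masks from several annotators by per-pixel majority vote."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks], axis=0)
    votes = stacked.sum(axis=0)  # number of annotators marking each pixel
    # Strict majority: more than half of the annotators must mark the pixel.
    return votes * 2 > stacked.shape[0]

# Toy masks from three hypothetical annotators (1 = wrinkle, 0 = background).
a = np.array([[1, 0], [1, 1]])
b = np.array([[1, 0], [0, 1]])
c = np.array([[0, 0], [1, 1]])
consensus = majority_vote([a, b, c])
```

With three annotators, a pixel is kept as a wrinkle only when at least two of them marked it, so a stroke drawn by a single annotator does not reach the final mask.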
Experimental Design
Model Architecture
The study evaluated the proposed method using the U-Net and Swin UNETR architectures. The U-Net follows the standard design with four encoder blocks and four decoder blocks, while Swin UNETR employs an encoder with a window size of 16 and patches of size 4×4.
Training Strategy
The training strategy involves two stages: weakly supervised pretraining and supervised finetuning. In the pretraining stage, the model is trained on a large dataset with weak labels, using masked texture maps as ground truth. In the finetuning stage, the model is refined using a smaller set of manually labeled wrinkle masks.
Weakly Supervised Pretraining Stage
The pretraining stage uses weakly labeled wrinkle data extracted through computer vision techniques. The texture map is extracted from face images using a Gaussian kernel-based filter, and non-facial regions are masked using a BiSeNet architecture-based facial parsing model.
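As a rough illustration of how a masked texture map could be built, the sketch below subtracts a Gaussian-blurred copy of a grayscale image from the original and zeroes everything outside a face mask. This is an assumption-laden stand-in: the summary does not specify the exact filter, so the high-pass formulation, the `sigma` value, and the function names are hypothetical, and the boolean `face_mask` simply takes the place of the output of a face parsing model.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(gray, sigma=2.0):
    """Separable Gaussian blur of a 2-D float image with reflect padding."""
    radius = int(3 * sigma)
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(gray, radius, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def masked_texture_map(gray, face_mask, sigma=2.0):
    """High-frequency detail (assumed texture proxy), zeroed outside the face."""
    detail = np.abs(gray - gaussian_blur(gray, sigma))
    return np.where(face_mask, detail, 0.0)

# Toy example: a bright 16x16 "skin" patch with one dark horizontal line.
gray = np.full((16, 16), 0.8)
gray[8, :] = 0.2
face_mask = np.zeros((16, 16), dtype=bool)
face_mask[:, 4:12] = True  # hypothetical face region
texture = masked_texture_map(gray, face_mask, sigma=1.0)
```

In this toy case the response is strong along the dark line inside the face region, close to zero on flat skin, and exactly zero outside the mask, which is the kind of weak wrinkle cue the pretraining stage relies on.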
Supervised Finetuning Stage
In the finetuning stage, the model is refined using human-labeled wrinkle data. The model takes as input a 3-channel RGB face image and a 1-channel masked texture map, producing a 2-channel output indicating the presence of wrinkles and background.
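The input and output shapes described above can be sketched with plain arrays. The channels-first layout, the helper names, and the convention that channel 1 is the wrinkle class are assumptions made for illustration; the summary only states the channel counts.

```python
import numpy as np

def build_input(rgb, texture):
    """Stack a (3, H, W) RGB image and an (H, W) texture map into (4, H, W)."""
    return np.concatenate([rgb, texture[None, ...]], axis=0)

def logits_to_mask(logits, wrinkle_channel=1):
    """Turn (2, H, W) class scores into a boolean wrinkle mask via argmax."""
    return np.argmax(logits, axis=0) == wrinkle_channel

# Toy tensors standing in for a face image and its masked texture map.
rgb = np.zeros((3, 4, 4))
texture = np.ones((4, 4))
x = build_input(rgb, texture)

# Stand-in for a model output: background wins everywhere except one pixel.
logits = np.zeros((2, 4, 4))
logits[0] = 1.0
logits[1, 0, 0] = 2.0
mask = logits_to_mask(logits)
```

Feeding the texture map as a fourth channel lets the same encoder see both the raw colors and the precomputed texture cue without any architectural change beyond the first layer's input width.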
Results and Analysis
Implementation Details
The dataset was partitioned into 80% for training, 10% for validation, and 10% for testing. The AdamW optimizer and SGDR scheduler were used for training. Various augmentations were applied to maintain dataset diversity.
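SGDR (stochastic gradient descent with warm restarts) anneals the learning rate along a cosine curve and periodically resets it to its maximum. A minimal sketch of that schedule follows; the values of `lr_max`, `lr_min`, `t_0`, and `t_mult` are placeholders, not the settings used in the study, and whether a step is an epoch or an iteration is left open.

```python
import math

def sgdr_lr(step, lr_max=1e-3, lr_min=0.0, t_0=10, t_mult=2):
    """Cosine-annealed learning rate with warm restarts at a given step."""
    t_i, t_cur = t_0, step
    # Walk through completed cycles; each cycle is t_mult times longer.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cosine = 1 + math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine

# Learning rates over the first cycle and the start of the second one.
schedule = [sgdr_lr(s) for s in range(12)]
```

The rate starts at `lr_max`, decays towards `lr_min` over the first `t_0` steps, and jumps back to `lr_max` at step `t_0`, where the next, longer cycle begins.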
Evaluation Metrics
The performance of the final model was evaluated using the Jaccard Similarity Index (JSI), F1-score, and Accuracy (Acc). These metrics measure the overlap between predicted and ground truth wrinkle regions, the harmonic mean of precision and recall, and the proportion of correctly predicted pixels, respectively.
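The three metrics can be computed from per-pixel confusion counts, as in the sketch below. The function name and the convention of returning 1.0 for JSI and F1 when both masks are empty are assumptions; the paper's evaluation code may handle that edge case differently.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Return (JSI, F1, accuracy) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))    # wrinkle predicted and present
    fp = int(np.sum(pred & ~target))   # wrinkle predicted but absent
    fn = int(np.sum(~pred & target))   # wrinkle missed
    tn = int(np.sum(~pred & ~target))  # background correctly predicted
    union = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    jsi = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    acc = (tp + tn) / pred.size
    return jsi, f1, acc

# Toy masks: one true positive, one false positive, one miss, one true negative.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
jsi, f1, acc = segmentation_metrics(pred, target)
```

Because wrinkles occupy a small fraction of the pixels in a face image, accuracy tends to be dominated by the background, so JSI and F1 are usually the more informative numbers for this task.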
Quantitative Comparisons
The proposed method outperformed recent wrinkle segmentation methods as well as other pretraining techniques. The performance gap was larger in data-limited scenarios, which supports the effectiveness of the two-stage training strategy when few manual labels are available.
Ablation Study
Including the masked texture map as an additional input during the finetuning stage led to significant improvements in wrinkle segmentation compared with using the RGB image alone, which supports feeding the texture map alongside the image.
Discussion
The study achieved state-of-the-art performance in facial wrinkle segmentation, demonstrating the potential of the two-stage training strategy. The approach shows promise in achieving high performance with limited data, enhancing scalability and flexibility in clinical settings. However, challenges such as false positives and subjectivity in wrinkle annotation remain.
Overall Conclusion
The study proposes a novel two-stage learning strategy for facial wrinkle segmentation using deep learning. By leveraging weakly labeled data for pretraining and manually labeled data for finetuning, the approach significantly reduces the time and cost associated with manual labeling. The release of the ‘FFHQ-Wrinkle’ dataset aims to support ongoing research and enhance reproducibility. Future research will focus on addressing false positives and improving the reliability of ground truth wrinkles through collaboration with dermatologists.