A Comparative Evaluation of Interpolation and Generative Oversampling Techniques for Predictive Maintenance

Abdazeez Atere; Hasan Kivrak

Open Access

Abdazeez Atere¹*, Hasan Kivrak²
¹Northumbria University, Newcastle, UK
²Northumbria University, Newcastle, UK
* Corresponding author: abdazeez.atere@northumbria.ac.uk

Presented at the International Symposium on AI-Driven Engineering Systems (ISADES2025), Tokat, Turkiye, Jun 19, 2025

SETSCI Conference Proceedings, 2025, 22, Pages: 20-26, https://doi.org/10.36287/setsci.22.21.001

Published Date: 10 July 2025

Predictive maintenance (PdM) improves industrial operational efficiency by enabling timely detection of equipment failures using machine learning models trained on historical maintenance data. Real-world industrial datasets frequently exhibit severe class imbalance because failures are rare, and this imbalance substantially reduces predictive accuracy for the minority (failure) class. This study systematically evaluates three data augmentation techniques for addressing the problem, namely the Synthetic Minority Oversampling Technique (SMOTE), SMOTETomek, and Conditional Tabular Generative Adversarial Networks (CTGAN), using the AI4I 2020 Predictive Maintenance dataset. A Random Forest classifier was trained on the augmented data, and the augmentation methods were compared using precision, recall, F1-score, ROC-AUC, and PR-AUC. The findings indicate that both SMOTE and SMOTETomek significantly enhance failure detection performance, with F1-scores and recall rates surpassing 0.99. In contrast, CTGAN yields lower classification performance (F1-score ≈ 0.88) while generating realistic synthetic samples that preserve the original data distributions and inter-variable relationships. These results underscore the trade-off between interpolation-based oversampling and generative models: SMOTE-based approaches optimise raw predictive accuracy for rare failures, whereas CTGAN shows significant potential for improving model generalisation in complex industrial applications.

Keywords - Predictive Maintenance, Class Imbalance, SMOTE, SMOTETomek, Generative Adversarial Networks, Random Forest, Data Augmentation
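
The interpolation idea behind SMOTE, as described by Chawla et al. (2002), can be illustrated with a minimal sketch. This is not the authors' code or pipeline: the function name `smote_sketch`, its parameters, and the toy feature values are hypothetical and chosen only for illustration. A real experiment would more likely rely on a maintained implementation such as `imblearn.over_sampling.SMOTE` or `imblearn.combine.SMOTETomek` from the imbalanced-learn package.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Illustrative only: assumes purely numeric features and at least two minority rows.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (Euclidean distance)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: three "failure" rows with two made-up numeric features
failures = [[40.0, 200.0], [45.0, 210.0], [50.0, 190.0]]
new_rows = smote_sketch(failures, n_new=4, k=2)
```

Each synthetic row lies on a line segment between two existing failure rows, so the new rows stay inside the region already occupied by the minority class. A generative model such as CTGAN instead learns a distribution over rows and samples from it, which is the contrast the abstract draws between interpolation and generative oversampling.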

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.