Open Access

Comparative Review of Machine Learning Methods in Revenue Forecasting

Ali Taghiyev1*, Özkan İnik2
1Faculty of Computer Engineering, Tokat Gaziosmanpaşa University, Tokat, Turkiye
2Faculty of Computer Engineering, Tokat Gaziosmanpaşa University, Tokat, Turkiye
* Corresponding author: ali.taghiyev3424@gop.edu.tr

Presented at the International Symposium on AI-Driven Engineering Systems (ISADES2025), Tokat, Turkiye, Jun 19, 2025

SETSCI Conference Proceedings, 2025, 22, Page(s): 82-88, https://doi.org/10.36287/setsci.22.54.001

Published Date: 10 July 2025

This study addresses a binary classification problem: predicting whether an individual's annual income exceeds $50,000. The data source is the UCI Adult Income Dataset, which is derived from 1994 US Census data and is widely used in machine learning. The dataset contains 48,842 samples and 14 independent variables, covering categorical and numerical characteristics such as age, educational status, marital status, and occupation. Some categorical variables have missing values, and the classes of the target variable are unbalanced: individuals with incomes below $50K account for 76% of the samples and those above it for 24%. To interpret model performance more reliably under this imbalance, the ROC curve and AUC (Area Under the Curve) were considered alongside the accuracy rate. After data analysis and preprocessing, four classification algorithms were applied: Logistic Regression, Support Vector Machines (SVM), Random Forest, and Naive Bayes. Logistic Regression achieved an accuracy of 83.2%, SVM 80.3%, and Naive Bayes 78.7%, while Random Forest obtained the highest accuracy at 85.1%. These results indicate that, despite the unbalanced class distribution, Random Forest is an effective method for this kind of classification problem.
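To illustrate why accuracy alone can mislead on a 76% / 24% class split, the following minimal sketch compares accuracy and AUC for a trivial baseline that always predicts the majority class. It is not the authors' code: the labels are toy values that only mirror the reported class proportions, and the helper names `accuracy` and `roc_auc` are hypothetical, written here with the standard library only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels mirroring the reported 76% / 24% split (assumption: not the real data).
y_true = [0] * 76 + [1] * 24

# Baseline that always predicts the majority class (income below $50K)
# and therefore assigns every sample the same score.
baseline_pred = [0] * len(y_true)
baseline_scores = [0.0] * len(y_true)

baseline_acc = accuracy(y_true, baseline_pred)
baseline_auc = roc_auc(y_true, baseline_scores)
print(f"baseline accuracy = {baseline_acc:.2f}, AUC = {baseline_auc:.2f}")
```

Under these assumptions the baseline reaches 76% accuracy while its AUC stays at 0.5, the value of an uninformative ranking. The reported accuracies of 78.7% to 85.1% are therefore best read against this 76% floor, which is one reason to report AUC next to accuracy.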

Keywords - Machine Learning, Income Estimation, Income Classification, UCI Adult Dataset, Data Analysis, Logistic Regression, Support Vector Machine, Random Forest, Naive Bayes, Classification Algorithms, Model

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.