Enhancing Financial Market Sentiment Analysis Using Dataset Augmentation
Dante Contella1*, Johnson Kinyua2, Charles Mutigwe3
1The Pennsylvania State University, State College, USA
2The Pennsylvania State University, State College, USA
3Western New England University, Springfield, USA
* Corresponding author: dantecontella@gmail.com
Presented at the International Conference on Open Finance (ICOF2025), Springfield, USA, Aug 28, 2025
SETSCI Conference Proceedings, 2025, 24, Page(s): 39-44, https://doi.org/10.36287/setsci.24.5.039
Published Date: 08 September 2025
We developed a series of sentiment analysis models for financial social media, starting from PyFin-Sentiment, a recent state-of-the-art large language model specific to the financial domain. We focused on label accuracy and on improving both the quality and the quantity of the dataset by collecting more data. These changes raised model accuracy from 0.70 to 0.93, supporting the well-known maxim that models are only as good as the data they are trained on. The resulting model was consistently accurate and outperformed other existing financial sentiment models. We also used more nuanced approaches, such as separating social media data into short and long comments, which substantially improved the accuracy of our model.
Keywords - sentiment analysis, machine learning, financial market, social media, StockTwits
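The short/long separation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-count cutoff `WORD_THRESHOLD`, the function names `split_by_length` and `accuracy`, and the example labels are all assumptions made for illustration, since the abstract does not state how comment length was measured or where the boundary was drawn.

```python
from typing import Dict, List

# Hypothetical cutoff: the abstract does not give the real boundary
# between "short" and "long" comments, so 20 words is an assumption.
WORD_THRESHOLD = 20


def split_by_length(
    comments: List[str], threshold: int = WORD_THRESHOLD
) -> Dict[str, List[str]]:
    """Bucket comments into 'short' and 'long' by whitespace word count."""
    buckets: Dict[str, List[str]] = {"short": [], "long": []}
    for text in comments:
        key = "short" if len(text.split()) <= threshold else "long"
        buckets[key].append(text)
    return buckets


def accuracy(predicted: List[str], actual: List[str]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Under these assumptions, each bucket could be used to train or evaluate a separate model, with `accuracy` reported per bucket so that the effect of comment length is visible rather than averaged away.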
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
