Assessing pre-training bias in Health data and estimating its impact on machine learning algorithms
dc.contributor.advisor | Recamonde-Mendoza, Mariana | pt_BR |
dc.contributor.author | Rodrigues, Diego Dimer | pt_BR |
dc.date.accessioned | 2023-05-13T03:26:41Z | pt_BR |
dc.date.issued | 2023 | pt_BR |
dc.identifier.uri | http://hdl.handle.net/10183/258017 | pt_BR |
dc.description.abstract | Machine learning (ML) is a rapidly growing field of computer science that has found many fruitful applications in several domains, including Health. However, ML is also highly susceptible to bias, which raises concerns about its potential to inflict harm. Bias can come from various sources, such as the design of the algorithm, the selection of data, and the strategies underlying data collection. Thus, data scientists must be vigilant in ensuring that the developed models do not perpetuate social disparities based on gender, religion, sexual orientation, or ethnicity. This work explores pre-training bias metrics to investigate the existence of bias in Health data. These metrics also analyze how protected attributes and their correlated features are distributed for the predicted class against the target attributes, giving insight into how a trained model may produce biased predictions. Our goal is to evaluate pre-training bias metrics on three different health datasets and assess the impact of bias on the performance of ML algorithms. Our experiments involve artificially modified versions of each dataset, designed both to increase the values of the pre-training bias metrics in favor of privileged classes and to lower these values so as to reduce the discrepancy in the data and the risk of bias. We trained models using four supervised learning algorithms: Logistic Regression, Decision Tree, Random Forest, and K-Nearest Neighbors. Each algorithm was tested on six to ten different training sets, with a varying random seed used to split the data in each iteration. We evaluated the performance of the trained models on the same test sets for every dataset variation, reporting Accuracy and F1-Score. By analyzing pre-training bias metrics and the predictive performance of the models, this study demonstrates that performance can be significantly affected by skewed data distributions and that performance metrics may sometimes mask the bias incorporated by the algorithm. In some cases, classification errors may be more pronounced in one group (e.g., the disadvantaged group), accentuating specific errors such as false positives and false negatives, which may have different implications depending on the clinical prediction problem under analysis. | en |
dc.format.mimetype | application/pdf | pt_BR |
dc.language.iso | por | pt_BR |
dc.rights | Open Access | en |
dc.subject | Bias | en |
dc.subject | Aprendizado de máquina | pt_BR |
dc.subject | Saúde | pt_BR |
dc.subject | Bias metrics | en |
dc.subject | Dado | pt_BR |
dc.subject | Pre-training | en |
dc.subject | Model evaluation | en |
dc.title | Assessing pre-training bias in Health data and estimating its impact on machine learning algorithms | pt_BR |
dc.type | Trabalho de conclusão de graduação | pt_BR |
dc.identifier.nrb | 001168640 | pt_BR |
dc.degree.grantor | Universidade Federal do Rio Grande do Sul | pt_BR |
dc.degree.department | Instituto de Informática | pt_BR |
dc.degree.local | Porto Alegre, BR-RS | pt_BR |
dc.degree.date | 2023 | pt_BR |
dc.degree.graduation | Ciência da Computação: Ênfase em Ciência da Computação: Bacharelado | pt_BR |
dc.degree.level | graduação | pt_BR |
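For illustration only, and not the thesis code: the sketch below shows standard definitions of two pre-training bias metrics of the kind the abstract describes, Class Imbalance (CI) and Difference in Proportions of Labels (DPL), followed by a seed-varied train/evaluate loop over the four algorithms named in the abstract. The synthetic data, the column names ("sex", "label"), and the use of scikit-learn are assumptions made for this sketch.

```python
# Minimal sketch of pre-training bias metrics and a seed-varied training
# loop, under assumed data and column names (not the thesis code).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

def class_imbalance(df, protected, privileged):
    """CI = (n_privileged - n_disadvantaged) / n_total, in [-1, 1]."""
    n_p = (df[protected] == privileged).sum()
    n_d = (df[protected] != privileged).sum()
    return (n_p - n_d) / (n_p + n_d)

def diff_in_proportions_of_labels(df, protected, privileged, label, positive=1):
    """DPL = P(y = positive | privileged) - P(y = positive | disadvantaged)."""
    priv = df[df[protected] == privileged]
    dis = df[df[protected] != privileged]
    return (priv[label] == positive).mean() - (dis[label] == positive).mean()

# Hypothetical dataset: a skewed binary protected attribute and a label
# whose positive rate differs between the two groups.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "sex": rng.choice(["M", "F"], size=n, p=[0.7, 0.3]),
    "age": rng.integers(20, 80, size=n),
})
df["label"] = (rng.random(n) < np.where(df["sex"] == "M", 0.6, 0.4)).astype(int)

print(f"CI  = {class_imbalance(df, 'sex', 'M'):+.3f}")
print(f"DPL = {diff_in_proportions_of_labels(df, 'sex', 'M', 'label'):+.3f}")

# Seed-varied splits, as in the abstract: four supervised algorithms,
# performance reported as Accuracy and F1-Score.
X = pd.get_dummies(df.drop(columns="label"))
y = df["label"]
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "KNearestNeighbors": KNeighborsClassifier(),
}
for name, model in models.items():
    accs, f1s = [], []
    for seed in range(10):  # abstract: six to ten splits, varying the seed
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y)
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        accs.append(accuracy_score(y_te, pred))
        f1s.append(f1_score(y_te, pred))
    print(f"{name}: acc={np.mean(accs):.3f}  f1={np.mean(f1s):.3f}")
```

Comparing CI and DPL on an original dataset against artificially rebalanced variants, and then re-running the loop above per variant, mirrors the experimental design the abstract outlines: the aggregate Accuracy and F1-Score can stay stable even as errors concentrate in the disadvantaged group.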
This item is licensed under a Creative Commons License