An improved text classification modelling approach to identify security messages in heterogeneous projects

Oyetoyan, Tosin Daniel; Morrison, Patrick

dc.contributor.author	Oyetoyan, Tosin Daniel
dc.contributor.author	Morrison, Patrick
dc.date.accessioned	2021-10-20T13:53:40Z
dc.date.available	2021-10-20T13:53:40Z
dc.date.created	2021-06-07T16:28:22Z
dc.date.issued	2021
dc.identifier.citation	Oyetoyan, T. D., & Morrison, P. (2021). An improved text classification modelling approach to identify security messages in heterogeneous projects. Software Quality Journal, 29(2), 509-553.	en_US
dc.identifier.issn	0963-9314
dc.identifier.uri	https://hdl.handle.net/11250/2824205
dc.description.abstract	Security remains under-addressed in many organisations, illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software’s design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that can generalise across projects. We define a set of security keywords by extracting them from relevant security sources, dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model based on harvested security keywords as a feature set shows average recall from 55 to 86%, minimum recall from 43 to 71% and maximum recall from 60 to 100%. It also achieves an average f-score between 3.4 and 88%, an average g-measure of at least 66% across all the datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26–44% in recall, 22–50% in g-measure, 0.4–28% in f-score, and 15–19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases.
We find, using sound statistical and effect size tests, that (1) using harvested security keywords as features to train a text classification model significantly improves classification models and generalises to other projects; (2) including features in the training dataset before model construction significantly improves classification models; and (3) different security categories represent predictors for different projects. Finally, we introduce new and promising approaches to construct models that can generalise across different independent projects.	en_US
dc.language.iso	eng	en_US
dc.publisher	Springer	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/deed.no	*
dc.title	An improved text classification modelling approach to identify security messages in heterogeneous projects	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	publishedVersion	en_US
dc.rights.holder	© The Author(s) 2021	en_US
dc.subject.nsi	VDP::Mathematics and natural science: 400::Information and communication science: 420	en_US
dc.source.pagenumber	509–553	en_US
dc.source.volume	29	en_US
dc.source.journal	Software quality journal	en_US
dc.source.issue	2	en_US
dc.identifier.doi	10.1007/s11219-020-09546-7
dc.identifier.cristin	1914247
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	2

Files in this item

Name: Oyetoyan.pdf
Size: 1.848Mb
Format: PDF

This item appears in the following collection(s)

Import from CRIStin [3604]
Institutt for datateknologi, elektroteknologi og realfag [1163]

Except where otherwise noted, this item is licensed as Attribution 4.0 International