Abstract
The evolving field of natural language processing (NLP) is increasingly intersecting with bioinformatics, particularly in DNA/RNA analysis. Previous studies highlighted the potential for classifying sequences containing selected genes. By applying NLP encoding, nucleotide strings are transformed into numerical representations that serve as inputs for predictive models. This paper analyzes the effectiveness of 25 machine learning algorithms in such classification. Of these, 20 achieved a balanced accuracy (BACC) above 90%. The top performers, the MLP Classifier neural network and the SVC from the SVM family, surpassed 98% BACC and underwent hyperparameter tuning. Additionally, the performance of Automated Machine Learning (AutoML) was evaluated, allowing for the selection of an optimized predictive algorithm. The preferred Hist Gradient Boosting Classifier, after hyperparameter adjustment, achieved a positive predictive value (PPV) of approximately 97.30%, outperforming the neural network and SVM representatives. The classifiers with the best BACCs exhibited negative predictive values (NPVs) higher than their PPVs, with NPVs exceeding 99%. The results indicate that most models successfully handled sequence classification, with NLP encoding playing a significant role. This provides flexibility in classifier selection based on complexity, score and speed. It also confirms that AutoML can adapt to data, and that appropriately adjusting a model’s default values enhances classifier efficiency.
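To make the terms in the abstract concrete, the following is a minimal sketch of how a nucleotide string might be turned into a numerical vector and how BACC, PPV and NPV are computed from binary predictions. The abstract does not say which NLP encoding the paper uses; the k-mer count encoding here is an assumption chosen for illustration, and the function names `kmer_counts` and `binary_metrics` are hypothetical. The metric formulas are the standard definitions.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Encode a nucleotide string as a fixed-length vector of k-mer counts.

    Assumption: a bag-of-k-mers encoding stands in for the paper's
    unspecified NLP encoding. Overlapping windows of length k are counted.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a class or prediction is absent.
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Return BACC, PPV and NPV for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)  # true positive rate
    specificity = _ratio(tn, tn + fp)  # true negative rate
    return {
        "bacc": (sensitivity + specificity) / 2,  # balanced accuracy
        "ppv": _ratio(tp, tp + fp),               # positive predictive value
        "npv": _ratio(tn, tn + fn),               # negative predictive value
    }


if __name__ == "__main__":
    print(kmer_counts("ACGTAC", k=2))
    print(binary_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

In the toy labels above, negatives outnumber positives, so a single false positive lowers PPV more than a single false negative lowers NPV. This illustrates one way an NPV higher than the PPV can arise; it is not a claim about the paper's data set.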
| Original language | English |
|---|---|
| Title of host publication | Modelling and Simulation 2024 - 38th Annual European Simulation and Modelling Conference 2024, ESM 2024 |
| Editors | Jose David Nunez-Gonzalez, Manuel Grana Romay, Philippe Geril |
| Publisher | EUROSIS |
| Pages | 199-204 |
| Number of pages | 6 |
| ISBN (Electronic) | 9789492859334 |
| ISBN (Print) | 9789492859334 |
| Publication status | Published - 2024 |
| Event | 38th Annual European Simulation and Modelling Conference, ESM 2024 - San Sebastian, Spain; Duration: 23 Oct 2024 → 25 Oct 2024 |
Publication series
| Name | Modelling and Simulation 2024 - 38th Annual European Simulation and Modelling Conference 2024, ESM 2024 |
|---|---|
Conference
| Conference | 38th Annual European Simulation and Modelling Conference, ESM 2024 |
|---|---|
| Country/Territory | Spain |
| City | San Sebastian |
| Period | 23/10/24 → 25/10/24 |
Bibliographical note
Publisher Copyright: © 2024 Modelling and Simulation 2024 - 38th Annual European Simulation and Modelling Conference 2024, ESM 2024. All rights reserved.
Keywords
- classification
- machine learning
- natural language processing
- sequencing
- graph-based learning
- Energy communities
- graph centrality measures
Fingerprint
Dive into the research topics of 'IN SEARCH OF A SEQUENCE CLASSIFIER FOR A SYSTEM EMPLOYING NLP ENCODING'. Together they form a unique fingerprint.