IN SEARCH OF A SEQUENCE CLASSIFIER FOR A SYSTEM EMPLOYING NLP ENCODING

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The evolving field of natural language processing (NLP) is increasingly intersecting with bioinformatics, particularly in DNA/RNA analysis. Previous studies highlighted the potential for classifying sequences containing selected genes. By applying NLP encoding, nucleotide strings are transformed into numerical representations that serve as inputs for predictive models. This paper analyzes the effectiveness of 25 machine learning algorithms in such classification. Impressively, 20 of them achieved a balanced accuracy (BACC) above 90%. The top performers, the MLP Classifier neural network and the SVC from the SVM family, surpassed 98% BACC and underwent hyperparameter tuning. Additionally, the performance of Automated Machine Learning (AutoML) was evaluated, allowing for the selection of an optimized predictive algorithm. The preferred Hist Gradient Boosting Classifier, after hyperparameter adjustment, achieved a positive predictive value (PPV) of approximately 97.30%, outperforming the neural network and SVM representatives. The classifiers with the best BACCs exhibited negative predictive values (NPVs) higher than their PPVs, reaching over 99%. The results indicate that most models successfully handled sequence classification, with NLP encoding playing a significant role. This provides flexibility in classifier selection based on complexity, score and speed. It also confirms that AutoML can adapt to data, and that appropriately adjusting a model’s default values enhances classifier efficiency.
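
To make the terms in the abstract concrete, here is a minimal sketch of how a nucleotide string might be turned into a numerical vector and how BACC, PPV and NPV follow from binary predictions. The abstract does not say which NLP encoding the authors used, so the k-mer count ("bag of words") encoding below is an assumption chosen for illustration only. The function names `kmer_counts` and `binary_metrics` are hypothetical, and the metric formulas are the standard textbook definitions, not code from the paper.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Encode a nucleotide string as a fixed-length vector of k-mer counts.

    Assumption: a bag-of-k-mers encoding stands in for the unspecified
    "NLP encoding"; k-mers with characters outside ACGT are ignored.
    """
    # Fixed vocabulary of all 4**k k-mers, in lexicographic order over "ACGT".
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k over the sequence and count each "word".
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    """Compute BACC, PPV and NPV for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = _ratio(tp, tp + fn)  # true positive rate
    specificity = _ratio(tn, tn + fp)  # true negative rate
    return {
        "bacc": (sensitivity + specificity) / 2,  # balanced accuracy
        "ppv": _ratio(tp, tp + fp),               # positive predictive value
        "npv": _ratio(tn, tn + fn),               # negative predictive value
    }
```

For example, `kmer_counts("ACGT", k=2)` yields a 16-element vector with a count of 1 at the positions for `AC`, `CG` and `GT`. Vectors of this kind could then be passed to any classifier, and `binary_metrics` applied to its predictions on held-out data.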

Original language: English
Title of host publication: Modelling and Simulation 2024 - 38th Annual European Simulation and Modelling Conference 2024, ESM 2024
Editors: Jose David Nunez-Gonzalez, Manuel Grana Romay, Philippe Geril
Publisher: EUROSIS
Pages: 199-204
Number of pages: 6
ISBN (Electronic): 9789492859334
ISBN (Print): 9789492859334
Publication status: Published - 2024
Event: 38th Annual European Simulation and Modelling Conference, ESM 2024 - San Sebastian, Spain
Duration: 23 Oct 2024 → 25 Oct 2024

Publication series

Name: Modelling and Simulation 2024 - 38th Annual European Simulation and Modelling Conference 2024, ESM 2024

Conference

Conference: 38th Annual European Simulation and Modelling Conference, ESM 2024
Country/Territory: Spain
City: San Sebastian
Period: 23/10/24 → 25/10/24

Bibliographical note

Publisher Copyright:
© 2024 Modelling and Simulation 2024 - 38th Annual European Simulation and Modelling Conference 2024, ESM 2024. All rights reserved.

Keywords

  • classification
  • machine learning
  • natural language processing
  • sequencing
  • graph-based learning
  • Energy communities
  • graph centrality measures
