Training Deep Learning AI to Predict microRNA-Gene Interactions

Apr 3, 2025 | Life Sciences & Biology, Medical & Health Sciences

Non-coding microRNAs (miRNAs) have important regulatory functions but are also implicated in various diseases. Mr Seung-won Yoon, PhD candidate at Chungnam National University, Republic of Korea, is training deep learning AI models to predict miRNA-gene associations. His research has implications for understanding disease pathogenesis, particularly cancer, and repurposing drugs for untreatable diseases.

The Diverse miRNAome

Think of RNA, and you probably think of messenger RNA (mRNA), transfer RNA (tRNA), or perhaps even ribosomal RNA (rRNA). However, RNA’s scope extends far beyond protein synthesis. Cells contain a plethora of microRNA (miRNA) molecules – small, non-coding, single-stranded RNAs around 22 nucleotides long. miRNAs regulate gene expression by hybridising to mRNA gene transcripts, usually at the mRNA’s 3′ untranslated region (3′-UTR). This leads to gene silencing via various mechanisms: repressing translation, deadenylating the mRNA, or triggering mRNA cleavage.

The miRNAome is turning out to be more diverse than was ever thought possible. At least 2,000 human miRNAs have been identified as being important to survival. miRNAs have been found to be pivotal to a host of biological processes, including cell division, differentiation and death, nervous system development, immunity, and signal transduction. Conversely, miRNAs have been implicated in cellular dysfunction, leading to diseases such as cancers. Despite the physiological significance of miRNAs, their interactions with gene mRNA transcripts are not fully known. Hoping to elucidate these interactions is Mr Seoung-Won Yoon, a graduate student at Chungnam National University, South Korea.

Probing the miRNAome with Deep Learning AI

Mr Yoon is a PhD candidate in Professor Kyuchul Lee’s group at Chungnam National University’s Department of Computer Science and Engineering. The group has previously applied advanced computational approaches to various real-world applications, including databases and the internet-of-things. In recent years, they have turned their computational expertise to bioinformatics applications.

Because miRNAs and gene transcripts cannot be observed directly, wet lab experiments on their interactions are complicated, time-consuming, and expensive, which makes them inadequate for studying interactions at scale. Instead, Mr Yoon and the group are deploying deep learning models to predict human miRNA-gene associations. Given the diversity of the miRNAome and putative gene targets, data-heavy computational methods are needed. Deep learning lends itself well to this, as it can identify sequence and mechanistic features of miRNAs and genes, and predict relationships.

Machine learning (ML) is a branch of AI with enormous potential in the biological sciences. In ML, algorithms learn patterns from known data and generalise them to data they have not seen before. Artificial neural networks (NNs) are a type of ML model consisting of a network of connected nodes. In NNs, a signal is taken in via an input layer, passed through hidden layers, and then sent out through an output layer. A common analogy is that of the brain, with nodes representing neurons, and links between nodes representing synapses. Deep learning refers to an advanced type of NN with multiple hidden layers. Just like a brain, it’s capable of advanced learning.
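To make the flow from input layer to output layer concrete, here is a minimal sketch of a signal passing through a tiny network. The layer sizes, weights, and the choice of a sigmoid at every node are assumptions chosen purely for illustration; they are not taken from the group's model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per node, then a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass a signal through each layer in turn (hidden layers, then output)."""
    signal = inputs
    for weights, biases in layers:
        signal = dense(signal, weights, biases)
    return signal

# Hypothetical toy network: 3 inputs -> 2 hidden nodes -> 2 hidden nodes -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[0.7, -0.3]], [0.2]),
]
output = forward([1.0, 0.0, 1.0], layers)
print(output)  # a single value between 0 and 1
```

Training, which is not shown here, consists of adjusting the weights and biases until the outputs match known examples.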

Positive Training

Much like our own brains, AI must be trained for optimum ‘cognitive’ performance – no mean feat, as massive datasets are needed! To train their deep learning models, Mr Yoon and the group drew on three data sources. Firstly, they used a dataset of proven positive miRNA-gene associations. These were taken from miRTarBase, a curated database of over half a million miRNA-gene association pairs. After filtering out duplicates, 380,634 pairs remained. They also used two additional sources, miRBase and biomaRt. miRBase contains sequence and annotation information for miRNAs. biomaRt is an open dataset of gene sequences from the European Bioinformatics Institute. Of the 380,634 pairs, those whose miRNA or gene lacked sequence information in miRBase or biomaRt were excluded. This left 2,656 miRNA sequences and 14,319 gene sequences, giving a total of 38,031,264 possible miRNA-gene pairs (2,656 × 14,319), of which 358,864 were known positive relationships.
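The counts above follow from simple arithmetic, which the short sketch below reproduces. The three totals come from the article; the toy miRNA and gene names are hypothetical placeholders used only to show how all-against-all pairs are enumerated.

```python
from itertools import product

# Figures quoted in the article.
N_MIRNAS = 2_656
N_GENES = 14_319
N_POSITIVES = 358_864

all_pairs = N_MIRNAS * N_GENES
unlabelled_pairs = all_pairs - N_POSITIVES
print(all_pairs)         # 38031264
print(unlabelled_pairs)  # 37672400

# The same idea on a toy scale, with made-up identifiers.
toy_mirnas = ["miR-a", "miR-b"]
toy_genes = ["GENE1", "GENE2", "GENE3"]
toy_positives = {("miR-a", "GENE1"), ("miR-b", "GENE3")}

toy_pairs = list(product(toy_mirnas, toy_genes))  # every miRNA with every gene
toy_unlabelled = [pair for pair in toy_pairs if pair not in toy_positives]
print(len(toy_pairs), len(toy_unlabelled))  # 6 4
```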

The next step is ‘data embedding’. This involves extracting sequence features from the 2,656 miRNAs and 14,319 genes and converting them into numerical vectors. Each data element was embedded in 64 dimensions.
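The article does not say how the sequences were embedded, so the following is only one simple way to turn a sequence into a 64-dimensional vector, not the group's method. It assumes a 3-mer frequency encoding, chosen here because a four-letter alphabet has exactly 4 × 4 × 4 = 64 possible 3-mers. The function name `embed` and the example sequence are illustrative.

```python
from itertools import product

ALPHABET = "ACGT"
K = 3  # assumption: 3-mers, because 4**3 gives exactly 64 dimensions
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}

def embed(sequence):
    """Map a nucleotide sequence to a 64-dimensional 3-mer frequency vector."""
    seq = sequence.upper().replace("U", "T")  # treat RNA and DNA letters alike
    counts = [0.0] * len(KMER_INDEX)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip 3-mers containing ambiguous bases such as N
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

vector = embed("UGAGGUAGUAGGUUGUAUAGUU")  # an illustrative 22-nucleotide RNA sequence
print(len(vector))  # 64
```

Learned embeddings would normally capture more than raw 3-mer frequencies, but the output has the same shape: one fixed-length vector per sequence, which is what lets distances between sequences be measured later.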

Balancing the Positive with the Negative

When training deep learning models, it’s beneficial to have a balanced dataset – not only positive data but also negative data (pairs with no interaction). As confirmed negative data are not available, they must be constructed. Negative data may be generated randomly, but this is not the most robust way. It’s better to generate negative data methodically using sophisticated criteria. To generate the negative data, the 358,864 positive relationships were removed from the 38,031,264 candidate pairs. Further pairs were filtered out using ‘distance’ criteria (Euclidean distance, cosine similarity, and Mahalanobis distance), which is possible because the embedded data exist in vector space. After filtering, 4,932,554 negative candidates remained. From these, 358,864 negative pairs were randomly selected to exactly balance the positive pairs (1:1 ratio). This yielded 717,728 data elements (358,864 + 358,864), each representing 124 dimensions. According to the group, this is the largest sequence dataset ever constructed for miRNA-gene associations.
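A rough sketch of distance-based negative selection is shown below. How the group combined its criteria is not described here, so the rule used (keep a candidate only if it is far from every positive), the thresholds `min_dist` and `max_cos`, and the toy two-dimensional vectors are all assumptions. Mahalanobis distance is left out to keep the example short.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_negatives(positives, candidates, min_dist, max_cos, seed=0):
    """Keep candidates far from every positive, then sample a 1:1 balanced set."""
    far = [
        c for c in candidates
        if all(euclidean(c, p) >= min_dist and cosine_similarity(c, p) <= max_cos
               for p in positives)
    ]
    if len(far) < len(positives):
        raise ValueError("not enough negative candidates to balance the positives")
    # Random selection of exactly as many negatives as there are positives.
    return random.Random(seed).sample(far, len(positives))

# Hypothetical toy embeddings.
positives = [[1.0, 0.0], [0.9, 0.1]]
candidates = [[0.95, 0.05], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9], [0.8, 0.2]]
negatives = select_negatives(positives, candidates, min_dist=0.5, max_cos=0.5)
print(len(positives), len(negatives))  # 2 2
```

The two candidates that sit close to a known positive are discarded before sampling, which is the point of filtering by distance rather than picking negatives purely at random.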

Unidirectional and Bidirectional Deep Learning

Mr Yoon and the group investigated two different types of deep learning model – long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM). Both are recurrent NNs (RNNs), a type of NN suitable for sequential data. LSTMs have a feature known as cell state that allows them to ‘remember’ earlier parts of a sequence and use them to predict later data points. Traditional LSTMs process a sequence in one direction only, from start to end. In contrast, Bi-LSTMs process it both forwards and backwards.
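The difference between the two can be sketched with a deliberately tiny, single-unit LSTM. This is a teaching sketch under stated assumptions, not the group's three-layer LSTM: the inputs are plain numbers, the weights are arbitrary fixed values, and there is no training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One step of a single-unit LSTM: input x, hidden state h, cell state c."""
    f = sigmoid(w["fx"] * x + w["fh"] * h + w["fb"])    # forget gate
    i = sigmoid(w["ix"] * x + w["ih"] * h + w["ib"])    # input gate
    o = sigmoid(w["ox"] * x + w["oh"] * h + w["ob"])    # output gate
    g = math.tanh(w["gx"] * x + w["gh"] * h + w["gb"])  # candidate cell value
    c = f * c + i * g      # the cell state carries the 'memory' forward
    h = o * math.tanh(c)
    return h, c

def run_lstm(sequence, w):
    """Unidirectional LSTM: read the sequence from start to end."""
    h, c = 0.0, 0.0
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs

def run_bilstm(sequence, w_fwd, w_bwd):
    """Bi-LSTM: one pass forwards, one pass backwards, paired per position."""
    fwd = run_lstm(sequence, w_fwd)
    bwd = run_lstm(sequence[::-1], w_bwd)[::-1]
    return list(zip(fwd, bwd))

# Arbitrary illustrative weights.
GATE_KEYS = ["fx", "fh", "fb", "ix", "ih", "ib", "ox", "oh", "ob", "gx", "gh", "gb"]
w_forward = dict.fromkeys(GATE_KEYS, 0.5)
w_backward = dict.fromkeys(GATE_KEYS, -0.3)

sequence = [0.2, 0.9, 0.4, 0.7]
print(run_lstm(sequence, w_forward))
print(run_bilstm(sequence, w_forward, w_backward))
```

The Bi-LSTM runs two passes over every sequence and keeps two sets of weights, which is where its extra computational cost comes from.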

The group tested an LSTM with three layers and a Bi-LSTM with two layers. The 717,728 data elements were divided into training data and test data in an 8:2 ratio, and fed into both models. Which model performs better at predicting miRNA-gene associations? In principle, this should be the Bi-LSTM, as the bidirectional information flow provides a richer representation. However, this demands more computational power, and the group found that the Bi-LSTM took longer to train than the LSTM. They therefore judged the simpler, faster LSTM to be more appropriate for the miRNA-gene dataset and selected it as their deep learning model.
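An 8:2 split can be sketched in a few lines. The shuffle-then-cut approach, the seed, and the function name below are assumptions for illustration; the article gives only the ratio and the total of 717,728 elements.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items, then cut them into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(10))
print(len(train), len(test))  # 8 2

# Applied to the article's total, an 8:2 split gives roughly these sizes.
cut = int(717_728 * 0.8)
print(cut, 717_728 - cut)  # 574182 143546
```

Shuffling before cutting matters here because the positive and negative pairs would otherwise sit in separate blocks of the dataset.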

Assessing Model Performance

A metric known as the area under the receiver operating characteristic curve (AUC) may be used to assess the performance of deep learning models. It reflects how well a model identifies ‘true positives’ while avoiding ‘false positives’. Using a statistical method called K-fold cross-validation, in which the data are repeatedly split into training and validation portions, the group determined that the LSTM model’s average AUC was 0.98. This is close to 1.0 (the maximum), indicating very good generalisation performance. Finally, they validated the model, confirming that it is able to uncover novel miRNA-gene association pairs not present in their positive training dataset.
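Both ideas can be sketched briefly. The AUC below is computed through its ranking interpretation (the chance that a randomly chosen positive pair scores higher than a randomly chosen negative pair), and the fold assignment is a simple round-robin. The labels, scores, and fold scheme are illustrative assumptions, and the pairwise loop is far too slow for a dataset of the article's size.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0)  # ties count as half a win
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_items, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    for fold in range(k):
        validation = [i for i in range(n_items) if i % k == fold]
        training = [i for i in range(n_items) if i % k != fold]
        yield training, validation

# Hypothetical labels (1 = interaction) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores))  # 8 of the 9 positive-negative pairs are ranked correctly

for training, validation in k_fold_indices(n_items=6, k=3):
    print(training, validation)
```

In cross-validation, a model is trained on each training portion, scored on the matching validation portion, and the resulting AUC values are averaged.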

Predicting BRCA2-associated miRNAs

The influence of miRNAs on cancer is a lively field of research. Variants or mutations of certain genes are implicated in various cancers. How do miRNA-gene interactions contribute to this? This is an important research question for Mr Yoon. BRCA2 is a human gene encoding a protein that repairs DNA replication errors and regulates the cell cycle. BRCA2 mutations can lead to malignant cells with unrepaired DNA damage, and are implicated in breast, ovarian, and prostate cancers. The group used their deep learning model to predict miRNAs that associate with BRCA2. Notably, among the top 10 predicted candidates were miRNAs with known associations with prostate, ovarian, breast, and cervical cancers.
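Picking out the top candidates for a gene amounts to ranking every miRNA by the model's predicted score. The sketch below assumes the scores are already available as a dictionary; the miRNA names and score values are invented placeholders, not results from the study.

```python
import heapq

def top_candidates(scores, n=10):
    """Return the n (miRNA, score) pairs with the highest predicted scores."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

# Hypothetical predicted association scores for a single gene.
predicted = {f"miR-example-{i}": round(0.05 * i, 2) for i in range(1, 16)}

top10 = top_candidates(predicted)
for name, score in top10:
    print(name, score)
```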

Going forward, Mr Yoon and the group want to further train the deep learning model for better prediction of miRNA-gene associations, and understand how these translate to disease phenotypes. Beyond BRCA2, they will focus on incurable and intractable diseases, deploying deep learning approaches to elucidate the pathogenetic entities involved. They hope to apply their deep learning insights to repurposing drugs to treat currently incurable diseases.

REFERENCE

https://doi.org/10.33548/SCIENTIA1185

MEET THE RESEARCHERS

Mr Seoung-Won Yoon
Chungnam National University, Yuseong-Gu, Daejeon, Republic of Korea

Mr Seoung-Won Yoon is a PhD Candidate at Chungnam National University, Republic of Korea, under the supervision of Professor Kyuchul Lee in the Computer Science & Engineering Department. Mr Yoon’s research involves developing deep learning models, primarily for bioinformatics and drug repurposing. He has so far participated in more than 10 nationally funded projects. His work on deep learning has included model development for pancreatic cancer-related genes, miRNA and mRNA prediction, and predicting the relationships between bio-genetic data. Additionally, he has conducted research projects in risk prediction, evaluating the performance of user movement path predictions, and assessing similarity patterns in user data. Furthermore, he has worked on an internet-of-things-based wearable flu vaccine project aimed at preventing the spread of influenza.

CONTACT

E: yoonenoch11@gmail.com

Professor Kyuchul Lee
Chungnam National University, Yuseong-Gu, Daejeon, Republic of Korea

Professor Kyuchul Lee received his Bachelor’s degree in Computer Science from Seoul National University in 1984 and a PhD in the same field from the same university in 1990. He has been a faculty member at Chungnam National University since 1989 and has also held the positions of visiting researcher at IBM Almaden Center and visiting professor at Syracuse University. At the time of writing, he has led 144 national research projects, published 56 international and 125 domestic journal papers, given 97 international and 206 domestic conference presentations, and holds 46 patents and intellectual property records. Additionally, he has facilitated 12 technology transfers. As an advisor in the Data Artificial Intelligence Research Lab (formerly Database Systems Lab), he has mentored over 110 Master’s and PhD students, developing core technologies in the fields of databases and AI. His research spans multimedia data, XML, semantic web, IoT, AI, and big data. Through courses like File Processing, Database Systems, and Web Programming, he has equipped students with the knowledge to excel in their professional careers.

FUNDING

National Research Foundation of Korea (NRF)

Korean Government (MSIT)

FURTHER READING

S Yoon, I Hwang, J Cho, et al., miGAP: miRNA–Gene Association Prediction Method Based on Deep Learning Model, Applied Sciences, 2023, 13(22), 12349. DOI: https://doi.org/10.3390/app132212349

Creative Commons Licence (CC BY 4.0)

This work is licensed under a Creative Commons Attribution 4.0 International License.

What does this mean?

Share: You can copy and redistribute the material in any medium or format

Adapt: You can change, and build upon the material for any purpose, even commercially.

Credit: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
