Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G (2008) Support Vector Machines and Kernels for Computational Biology.
The widespread adoption of high-throughput sequencing machinery has produced an unprecedented amount of genomic data for biologists to analyse. Fully leveraging the patterns hidden in petabytes of DNA and RNA sequence information requires machine learning algorithms and specialised kernels that can capture the domain knowledge provided by biological scientists. A common problem in computational biology is binary classification. Support vector machines (SVMs) have achieved good results in this domain and have thus been eagerly adopted by computational biology researchers. Ben-Hur et al. provide a gentle introduction to support vector machines and kernels in the context of binary biological prediction problems.
To explain the concepts of large margin separation and kernel functions, Ben-Hur et al. use a computational biology problem known as splice-site recognition. In eukaryotic organisms, the process of gene expression involves transcribing a sequence of DNA into a molecule known as precursor mRNA (pre-mRNA). Pre-mRNA contains two types of regions: coding regions known as exons and 'junk' regions known as introns. The boundaries between these regions, the splice sites, are often marked by the specific dimers GT and AG. However, only 0.1%–1% of occurrences of these dimers in the genome represent true locations of splice sites, which leads to an interesting question: can we use support vector machines to classify candidate sites as splice sites or non-splice sites?
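To make the idea of a sequence kernel a little more concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names (`kmer_counts`, `spectrum_kernel`, `candidate_sites`) and the toy sequences are made up for illustration. It shows a simple spectrum-style kernel, which scores two DNA strings by the inner product of their k-mer counts, alongside a helper that lists every AG dimer as a candidate site. An SVM would then use kernel values like these to decide which candidates look like real splice sites.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Inner product of the k-mer count vectors of two sequences.

    Hypothetical helper for illustration; a larger value means the
    two sequences share more k-mers.
    """
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Counter returns 0 for k-mers that never occur in b.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def candidate_sites(seq, dimer="AG"):
    """Start positions of every occurrence of the dimer in seq.

    Most of these positions are not true splice sites, which is why
    a classifier is needed on top of this simple scan.
    """
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dimer]


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    print(spectrum_kernel("GATTACA", "GATTACA"))
    print(candidate_sites("CAGTAGG"))
```

Because the kernel only compares k-mer counts, it ignores where in the sequence each k-mer occurs. Position-aware kernels refine this idea.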
Ben-Hur et al. explain the principles of maximum margin separation, kernel functions and classifier performance by exploring various aspects of this question. The paper is a great and gentle introduction to the world of support vector machines and also gives insight into some cool applications of machine learning technology. Moreover, all of the data and code used in the paper are open source. In six words: this is a paper to love!
Camilla Montonen works at Prelert as a Data Engineer. She uses the Python data analysis stack to explore machine data in order to create unsupervised anomaly detection systems. In her spare time, she enjoys participating in PyLadies, contributing to Pandas and trying to understand the flow of commuters on the London Tube.