The online sequence search algorithms will come with three flavors: exact, approximate, and regular expression search algorithms. The exact sequence search literally means finding exactly matching positions of a query sequence in another sequence. The approximate search is similar to the exact search but allows up to a specified number of errors: mismatches, insertions, and deletions. The regular expression search accepts a query in regular expression, which enables flexible description of a query pattern like motifs. For these algorithms, there are already half-done pull requests I'm working on: #152, #153, #143.
After the last release of Bio.jl v0.1.0, the sequence data structure has been significantly rewritten to make biological sequence types coherent and extensible. But because we chose an encoding that requires 4 bits per base to represent DNA sequences, the DNA sequence type consumes too much memory than necessary to store a reference genome, which is usually composed of four kinds of DNA nucleotides (denoted by A/C/G/T) and (consecutive and relatively small number of) undetermined nucleotides (denoted by N). After trying some data structures, I found that memory space of N positions can be dramatically saved using IndexableBitVectors.jl, which is a package I created in JSoC 2015. I'm developing a separated package for reference genomes, ReferenceSequences.jl, and going to improve the functionality and performance to handle huge genomes like the human genome.
If you are a researcher or an engineer who handles high-throughput sequencing data, BAM and CRAM parsers would be the most longing feature addition in the list. BAM is the de facto standard file format to accommodate aligned sequences and most sequence mappers generate alignments in this format. CRAM is a storage-efficient alternative of BAM and is getting popular reflecting explosion of accumulated sequence data. Since these files contain massive amounts of DNA sequences from high-throughput sequencing machines, high-speed parsing is a practically desirable feature. I'm going to concentrate on the speed by careful tuning and multi-thread parallel computation which is planned to be introduced in the next Julia release.
Many databases distribute their data in some standardized file formats. As for genetic annotations and variants, GFF3 and VCF would be the most common formats. If you are using data from human or mouse, you should know various annotations are available from the GENCODE project. It offers data in GTF or GFF3 file formats. NCBI provides human variation sets in VCF file formats here. These file formats are text, so you may think it is trivial to write parsers when you need them. It is partially true — if you don't care about completeness and performance. Parsing a text file format in a naive way (for example,
split a line by a tab character) allocates many temporary objects and often leads to degrade performance, while careful tuning of a parser leads to complicated code that is hard to maintain. @dcjones challenged this problem and made a great work and made Julia support for Ragel, which generates Julia code that executes a finite state machine. Daniel's talk of the JuliaCon 2015 is helpful to know about the details if you are interested:
Sometimes you may need only a part of data provided by a database. In such a case, web-based APIs are handy to fetch necessary data on demand. BioMart Central Portal offers a unified access point to a range of biological databases that is programmatically accessible via REST and SOAP APIs. Julian wrapper to BioMart will make it much easier to access data by automatically converting response to Julia objects. In the R language, the biomaRt package is one of the most downloaded packages in Bioconductor packages: https://www.bioconductor.org/packages/stats/.
We need users and collaborators of our libraries. Feedbacks from users in the real world are the most precious thing to improve the quality of our libraries. We welcome feature requests and discussions that will make bioinformatics easier and faster. Tools for phylogenetics and structural biology, which I didn't mention in this post, are also under active development. You can post issues here: https://github.com/BioJulia/Bio.jl/issues; if you want to get in touch with us more casually, this Gitter room may be more convenient: https://gitter.im/BioJulia/Bio.jl.