Fragmenting the Genomic Wheel

Update: The paper titled ‘GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research’ has been accepted for presentation at the Conference on Semantics in Healthcare and Life Sciences (CSHALS), to be held in Boston, MA from February 26-28, 2014. View Presentation.


My recent contribution to the LinkedTCGA project, a visualization dashboard to foster serendipitous discovery in cancer research (which won the Big Data Prize in the Semantic Web Challenge held at ISWC 2013, Sydney), got me thinking about developing an integrated genomics visual analytics platform.
The prime motivation leading to this project was (more or less):

‘an interface which allows biomedical researchers to access the base-level genetic code (A,C,T and G) on different levels of abstraction similar to how computer scientists access byte code using higher-level assembly languages’

Traditional genome browsers limit users' capacity to analyze several genes simultaneously and informatively by drawing on previously extracted knowledge, for example genes which encode proteins functioning in the same pathway, or which are implicated together in a disease. In cancer, Genome-wide Association Studies (GWAS) have on numerous occasions isolated genomic loci which may be directly implicated in various forms of the disease. This knowledge has been catalogued in multiple knowledgebases and could be integrated into genomic analyses, helping cancer researchers isolate just the genomic segments of interest and visualize cancer datasets, such as DNA Methylation and exon expression, across them.

The approach that I had in mind combines the salient aspects of two methods of genomic visualization: linear genome browsers and circular plots. At the top level, users see the human genome laid out as a circle, with the individual chromosomes forming the arcs of this circle (screenshots above) and different genomic regions connected to each other by chords. The thickness of a chord reflects the similarity between the regions it connects, which depends on the co-occurrence of the genes they contain, i.e. involvement in the same pathways as protein inputs or catalysts, implication in a certain disease, or reference in the same publications. Co-occurrence is commonly used to cluster network users together based on similar interests or simultaneous participation in online forums or groups. Here, co-occurrence is modeled as data cubes, and similarities between different genomic regions can be calculated using a variation of Tversky’s feature-based similarity score. The ‘Genomic Wheel’ can be fragmented down to different levels of the genomic hierarchy, namely chromosomes, ideograms, genes, and point alterations. Genomic loci whose sequences bear somatic or germline mutations leading to cancer, as identified through GWAS, are shown in different shades of red to make them easy to discern.
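To make the chord-thickness idea concrete, here is a minimal sketch of the standard Tversky index applied to two genomic regions, each described by a set of features (pathways, diseases, publications). The exact variation used in the platform is not spelled out here, so the function name, the `alpha`/`beta` defaults, and the feature labels below are all assumptions made for illustration.

```python
from typing import Set


def tversky_similarity(a: Set[str], b: Set[str],
                       alpha: float = 0.5, beta: float = 0.5) -> float:
    """Standard Tversky index: |A∩B| / (|A∩B| + alpha*|A−B| + beta*|B−A|).

    alpha and beta weight the features unique to each side; the defaults
    here are an assumption, not the values used by the platform.
    """
    common = len(a & b)   # features shared by both regions
    only_a = len(a - b)   # features found only in region A
    only_b = len(b - a)   # features found only in region B
    denominator = common + alpha * only_a + beta * only_b
    if denominator == 0:
        return 0.0        # two empty feature sets: treat as no similarity
    return common / denominator


# Hypothetical feature sets for two genomic regions (labels are made up).
region_a = {"pathway:example_1", "pathway:example_2", "disease:example_1"}
region_b = {"pathway:example_2", "disease:example_1", "publication:example_1"}

score = tversky_similarity(region_a, region_b)
```

With `alpha = beta = 1` the index reduces to Jaccard similarity, and with `alpha = beta = 0.5` to the Dice coefficient; a score like this could then be mapped to a chord width.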

At the ‘gene’ level, the user can click any desired gene to launch a ‘Genomic Tracks Viewer’, which is designed along the lines of the linear genome browsers available today (and could later be replaced by one of these traditional browsers). The Viewer lets the user select any tumor and patient, and a SPARQL query (shown below) is then executed against the corresponding Linked TCGA Endpoint. The DNA Methylation and Exon Expression datasets are retrieved based on the start/stop co-ordinates of the gene on the Human Genome, and are visualized as bar charts (red and green respectively). Any number of patients can be selected, in which case the charts are stacked. Moreover, the Viewer allows simultaneous analysis of different genes through a tabbed display, and supports zooming and automatic scrolling.

PREFIX xsd: <>
PREFIX tcga: <>
SELECT ?start ?stop ?value
WHERE {
   <PatientID> tcga:result ?exonResult .
   ?exonResult tcga:chromosome "17" ; tcga:RPKM ?value ;
               tcga:start ?start ; tcga:stop ?stop .
   FILTER(xsd:double(?start) > 37844393 && xsd:double(?stop) < 37884915)
}
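The query above hard-codes one chromosome and one pair of gene co-ordinates. As a rough sketch of how such a query might be assembled for whichever gene and patient the user picks, the helper below builds the query text from parameters. The function name and its arguments are hypothetical; the `tcga` namespace IRI is not given here, so it is a required argument rather than a guessed default. Only the `xsd` namespace is filled in, since it is the standard W3C XML Schema IRI.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema#"  # standard W3C XML Schema namespace


def build_exon_query(tcga_ns: str, patient_uri: str, chromosome: str,
                     gene_start: int, gene_stop: int) -> str:
    """Return SPARQL text selecting exon expression values inside a gene.

    tcga_ns and patient_uri must be supplied by the caller; their real
    values are not stated in this post (assumption: both are full IRIs).
    """
    if gene_start >= gene_stop:
        raise ValueError("gene_start must be smaller than gene_stop")
    return "\n".join([
        f"PREFIX xsd: <{XSD_NS}>",
        f"PREFIX tcga: <{tcga_ns}>",
        "SELECT ?start ?stop ?value",
        "WHERE {",
        f"  <{patient_uri}> tcga:result ?exonResult .",
        f'  ?exonResult tcga:chromosome "{chromosome}" ; tcga:RPKM ?value ;',
        "              tcga:start ?start ; tcga:stop ?stop .",
        f"  FILTER(xsd:double(?start) > {gene_start}"
        f" && xsd:double(?stop) < {gene_stop})",
        "}",
    ])


# Placeholder IRIs (example.org) stand in for the real namespace and patient.
query = build_exon_query("http://example.org/tcga/",
                         "http://example.org/patient/1",
                         "17", 37844393, 37884915)
```

Keeping the co-ordinates as integers and validating their order before formatting avoids sending an obviously empty range to the endpoint.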

I used KineticJS, an HTML5 Canvas JavaScript framework that enables node nesting, layering, caching and event handling, to develop the entire platform, rather than popular libraries like D3JS, to which I had taken a liking last year. While working on my last project, the Linked TCGA Dashboard, I found that SVG is suitable for interactive visualizations of smaller datasets, but performance suffers badly with larger ones, since SVG stores every rendered object directly in the browser DOM (an observation also backed by my earlier experience developing the Reactome Pathway Visualization). The Linked TCGA Dashboard also got a major overhaul later, although its force-directed visualization was rendered using SigmaJS.