Visualizing Paths using Google Analytics

Query formulation is a complex task, especially when building a federated query that targets multiple heterogeneous and complex datasets. Granatum ReVeaLD aims to simplify this process by allowing any user to select the concepts whose instances he wishes to retrieve, as well as literal attributes of these instances. Moreover, he can set constraints on either the concepts or the literal attributes, for example, only those drugs which treat diseases, or molecules which have a molecular weight under 400. This selection is done through an intuitive interface that provides visual cues about the possible relationships and attributes of pre-defined concepts. Each step of the process entails a dynamic reload of the browser page with new URL parameters, enabling users to share their queries by simply exchanging URLs (currently a temporary solution).

To evaluate the intuitiveness and interactivity of the Granatum ReVeaLD platform, we gave the evaluators, who were mainly bio-medical researchers, five tasks. As ReVeaLD is a platform for knowledge discovery from the Linked Bio-medical Data Space, the evaluators were asked to formulate five real-life queries, for example,

“Diseases labelled Colon Cancer which have possible Drugs with Molecular weight less than 400” OR
“Chemopreventive Agents, derived from ‘Pomegranate’ Source, which affect Pathways titled ‘Estrogen’, and all the Toxicity details about these agents.”, etc.

We ran the evaluation as a game rather than a dull survey, to keep the process competitive and to encourage more users to actually participate. The users were scored according to the number of correctly selected nodes and constraints, with negative marking for any node or constraint that was not required. The final ranking was based on the score and the time taken to formulate the query. As the URL in the address bar changes at every step of the query formulation process, we decided to use Google Analytics, which logs almost everything, from the time spent on any particular page and the number of views of each page to the client system configuration of the users.
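The scoring and ranking rule described above could be sketched roughly as follows. The actual point values are not stated here, so this sketch assumes +1 for every required node or constraint a user selected and -1 for every selected item that was not required. The function names, the participant fields and the sample data are all hypothetical.

```javascript
// Sketch only: the +1 / -1 weights are an assumption, not the real scoring scheme.
// `selected` and `required` are arrays of node/constraint labels (no duplicates).
function scoreQuery(selected, required) {
  var score = 0;
  selected.forEach(function (item) {
    // Reward a required item, penalise an item that was not asked for.
    score += required.indexOf(item) !== -1 ? 1 : -1;
  });
  return score;
}

// Rank by score (highest first); ties are broken by time taken (fastest first).
function rankParticipants(participants) {
  return participants.slice().sort(function (a, b) {
    return (b.score - a.score) || (a.seconds - b.seconds);
  });
}

// Hypothetical usage: one required item missed nothing, one extra item selected.
var exampleScore = scoreQuery(['Disease', 'Drug', 'Pathway'], ['Disease', 'Drug']);
var exampleRanking = rankParticipants([
  { name: 'A', score: 3, seconds: 90 },
  { name: 'B', score: 3, seconds: 60 },
  { name: 'C', score: 5, seconds: 120 }
]);
```

Sorting on a copy (slice) leaves the original list of participants untouched, which is convenient if the raw order is needed again later.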

Initially, I only wanted to obtain the time taken for each step in each task given to the user. However, while experimenting with the Google Analytics DashBoard, I found two interesting features. One was the Navigation Summary of each page, which lists all the next pages visited from that page, as well as all the previous pages from which users arrived at it. This was quite a tool, because it gave me a clear idea of the path a user navigated to formulate a query. The Navigation Summary, however, shows this information for one page at a time only, so I could not effectively visualize an entire path at once, just a single node shared by different paths. This can be circumvented through another, recently introduced feature of the Google Analytics DashBoard, called Visitor Flow. Google Analytics uses its native technology, based on the Sankey Diagram, to create an awesome flow visualization. It gives a clear picture of the pages navigated most and the different paths taken to reach them. You can set an initial dimension on the visitor demographics or the traffic sources, as well as highlight any particular path. Unlike the JavaScript libraries (such as D3) that I found for rendering Sankey layouts, Google Analytics also shows the number of users who quit the application at any particular page (visualized as a ‘leak’ in the flow).

In Google Analytics, the Navigation Summary of one particular query step shows the previous step and indicates that, in the next step, the user included a new filter. However, these results were not in a presentable format, since the URL itself is not human-readable. The Visitor Flow also uses these URLs as labels instead of human-readable versions (such as "Included a new constraint: ‘Estrogen’"), and changing that would have required setting up complex conversion goals (I did not even know whether this was doable). So I ended up using D3JS to build my own JavaScript application for visualizing path navigation.

I then faced the challenge of retrieving the Navigation Summary for each query step. The evaluation involved 5 tasks, each with at least 9 query steps, plus the various other query steps made by the users themselves, so the total came to over 200 nodes. It did not seem feasible to manually export the Navigation Summary for each step (the Analytics DashBoard does allow CSV export). So I first exported, in CSV format, all the query steps involved in the evaluation (Content -> Site Content -> All Pages, filtered on URLs carrying a pre-designated parameter introduced to differentiate evaluation steps from normal query steps). I then deployed a small JavaScript which, after the initial authentication, iterates through each row of this data and makes a query to the Google Analytics Core Reporting API (guide). You can use this tool to decide upon the query dimensions and metrics.

  var apiQuery = gapi.client.analytics.data.ga.get({
          'ids': TABLE_ID,
          'start-date': '2012-11-12',
          'end-date': '2012-11-28',
          'metrics': 'ga:pageviews',
          'dimensions': 'ga:previousPagePath,ga:nextPagePath',
          'sort': '-ga:pageviews',
          'filters': 'ga:previousPagePath=='+QUERY_STEP_URL,
          'max-results': 50
  });

The TABLE_ID has the format ga:xxxx, where xxxx is the ID of your profile on Google Analytics. With this construct, I retrieve all the steps taken after a particular QUERY_STEP_URL within a given time span. It is worth noting that only 10 concurrent requests are allowed per profile, and 10,000 requests per profile per day. The first limitation might cause a problem if you are using a simple ‘for’ loop. Our approach was to use the window setInterval() method and issue the query from within it, so that each iteration runs after 3000 milliseconds or so.

 var i = 1;
 var myVar = setInterval(function(){
   //  Your apiQuery goes here
   apiQuery.execute(handleCoreReportingResults);
   // The CallBack can deal with inserting the results in an array OR
   // appending to an HTML <div>.
   i++;
   if(i > TOTAL_NO_OF_QUERY_STEPS){
     clearInterval(myVar);
     //  Compile final results here
   }
 },3000);
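
The callback handleCoreReportingResults is not shown above. A minimal sketch of what it might look like follows, assuming the response object carries an error property when the request fails and a rows array of string triples in the order of the requested dimensions and metric (previous page, next page, page views). The transitions array and the rowsToTransitions helper are names introduced here for illustration, and the sample response is hand-written, not real evaluation data.

```javascript
// Sketch only: collects the "previous page -> next page" counts across all query steps.
var transitions = [];

// Convert one response into plain records. Assumes each row looks like
// [previousPagePath, nextPagePath, pageviews], with every value a string.
function rowsToTransitions(results) {
  if (!results || results.error || !results.rows) {
    return []; // failed request, or a step with no recorded next pages
  }
  return results.rows.map(function (row) {
    return {
      from: row[0],
      to: row[1],
      pageviews: parseInt(row[2], 10)
    };
  });
}

// Callback passed to apiQuery.execute(): append this step's records to the total.
function handleCoreReportingResults(results) {
  Array.prototype.push.apply(transitions, rowsToTransitions(results));
}

// Hand-written response of the assumed shape; the 'filter' parameter is hypothetical.
handleCoreReportingResults({
  rows: [
    ['/explorer?task=1', '/explorer?task=1&filter=Estrogen', '12'],
    ['/explorer?task=1', '/explorer?task=2', '3']
  ]
});
```

Keeping the row conversion separate from the callback makes it possible to check the parsing against a saved response without calling the API again.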

Once we had the data on all the query steps taken in the evaluation and the number of users jumping from one query step to the next, the remaining steps were simple. We created a data structure in which each query step is rendered as a ‘circular node’ whose radius is the square root of the time taken on that step (in seconds), and the nodes are grouped by task (separate colors). Hovering over a node shows its description, obtained by running a simple parser on the URL parameters, along with the time taken on the step. Nodes are linked using the information we fetched previously about the next query steps reached from each step. The thickness of an ‘edge’ is proportional to the number of page views on the next step that follow directly from a particular step. These nodes can be arranged in a force-directed layout (as in our case) or rendered as a Sankey diagram (D3JS uses the same data structure for both layouts).

 "nodes": [
   {
         "number": 0,
         "name": "/explorer?task=1",
         "group": 1,
         "radius": 26,
         "type": "uri",
         "description": "Start of Task 1"
   }, // ... and so on
 ],
 "links": [
   {
         "name": "Migrated from task 1 to 5",
         "source": 0,
         "target": 4,
         "value": 30
   }, // ... and so on
 ]
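
As a rough illustration of how such a structure could be assembled, the sketch below builds the nodes and links from two hypothetical inputs: steps (one entry per query step, with its URL, task number and time in seconds) and transitions (one entry per observed jump, with its page view count). The function names, the input field names and the wording of the generated descriptions are assumptions made for this example; the real URL parser is not shown in this post.

```javascript
// Sketch only: turn "/explorer?task=1&x=y" into "task = 1, x = y".
function describeStep(url) {
  var query = url.split('?')[1] || '';
  if (!query) return url; // no parameters: fall back to the raw path
  return query.split('&').map(function (pair) {
    var parts = pair.split('=');
    return decodeURIComponent(parts[0]) + ' = ' + decodeURIComponent(parts[1] || '');
  }).join(', ');
}

// steps:       [{ url, task, seconds }]   (hypothetical input shape)
// transitions: [{ from, to, pageviews }]  (hypothetical input shape)
function buildGraph(steps, transitions) {
  var indexByUrl = Object.create(null); // URL -> position in the nodes array
  var nodes = steps.map(function (step, i) {
    indexByUrl[step.url] = i;
    return {
      number: i,
      name: step.url,
      group: step.task,                 // one color per task
      radius: Math.sqrt(step.seconds),  // square root of time spent on the step
      type: 'uri',
      description: describeStep(step.url)
    };
  });
  var links = [];
  transitions.forEach(function (t) {
    var source = indexByUrl[t.from];
    var target = indexByUrl[t.to];
    // Skip jumps to or from pages that are not evaluation steps.
    if (source === undefined || target === undefined) return;
    links.push({
      name: t.from + ' -> ' + t.to,
      source: source,
      target: target,
      value: t.pageviews                // drives the edge thickness
    });
  });
  return { nodes: nodes, links: links };
}

// Hypothetical usage with made-up numbers.
var graph = buildGraph(
  [
    { url: '/explorer?task=1', task: 1, seconds: 676 },
    { url: '/explorer?task=1&filter=Estrogen', task: 1, seconds: 100 }
  ],
  [
    { from: '/explorer?task=1', to: '/explorer?task=1&filter=Estrogen', pageviews: 30 },
    { from: '/explorer?task=1', to: '/about', pageviews: 5 }
  ]
);
```

Links refer to nodes by array index, which is why the lookup table from URL to index is built while the nodes are created.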

The final visualization, in a crude layout, is shown in the image below. You can easily make out the main paths taken to formulate each query, and the query steps at which users diverge. I will soon link the interactive version of this visualization, though it is just the default force-directed layout bundled with D3JS.

Although we never used this information when presenting the evaluation results, because it was overkill and beyond the previously decided scope of the evaluation, it gave me the unusual idea of developing interactive site-maps based on Google Analytics tracking data and a simple force-directed graph layout using D3JS (the Visitor Flow can only be accessed by users authenticated on the Analytics account). These site-maps would let users see which pages are ‘trending’, depending on the user’s region and various other parameters set by the moderators. They could also help a user decide which article to read next, based on popular user opinion. They could filter and group pages according to user requirements and customization, and clicking a node would take you to the page it represents. For the Analytics to generate accurate results, one would need to set a training set initially for the API to log. Also, as it takes time to build the data structure for the visualization, it would be feasible to rebuild it at regular intervals without manual intervention and save it to a local file, using server-side PHP or Java. This functionality could be packaged as a Drupal or WordPress plugin and integrated into sites with large content bases that are accessed from various regions of the world.