You are viewing the documentation for Interana version 2. For documentation on the most recent version of Interana, go to



Ingesting Git source control history into Interana

Jeff Falk


Extract event data from your Git source control repository so you can visualize it in Interana.


You will need:

  • access to your desired Git repository
  • admin access to Interana

Extract Event Data From Your Git Repository

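Prerequisite: Access to a Linux system (with Python and Git installed) where you have an up-to-date local copy of your repository (run "git pull" first). The script reads origin/master and the origin/<branch> refs, so these must be present in the local copy.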

Download the script below:

#!/usr/bin/env python
import argparse
import re  # only needed if cleanse() below is re-enabled
import subprocess
from subprocess import check_output

# Quote character wrapped around every JSON name and value in the git format string.
# Assumption: a plain double quote, so that "git log" emits JSON directly.
QUOTE = '"'


def find_branch_point(branch, git_location):
    # finds the git hash where the branch was created off master
    # universal_newlines=True makes the output text (not bytes) on Python 3
    branch_point = check_output(['git', 'merge-base', 'origin/master', 'origin/' + branch],
                                cwd=git_location, universal_newlines=True).rstrip()
    return branch_point


def find_ga_tag(version, git_location):
    # returns the first tag matching *<version>*GA*, or None if there is none
    proc = subprocess.Popen(['git', 'tag', '-l', '*' + version + '*GA*'],
                            stdout=subprocess.PIPE, cwd=git_location,
                            universal_newlines=True)
    ga_tags = proc.stdout.readlines()
    proc.wait()
    for line in ga_tags:
        return line.rstrip()
    return None


# insert quote characters into git format string
def nvp(name, value, comma=True):
    pair = QUOTE + name + QUOTE + ': ' + QUOTE + value + QUOTE
    if comma:
        pair += ', '
    return pair


#def cleanse(lines):
#    for line in lines:
#        line = re.sub(r'[\t"\\]', '', line)
#        line = re.sub(QUOTE, '"', line)
#        yield line


def commits(commit_range, tag, version, git_location):
    # returns a list of JSON-formatted commit events, one per line
    # note: %cn and %ce are inserted verbatim, so a double quote or backslash
    # in a committer name would produce an invalid JSON line
    format_string = '--format={'
    format_string += nvp('tag', tag)
    format_string += nvp('version', version)
    format_string += nvp('hash', '%h')
    format_string += nvp('ts', '%ci')
    format_string += nvp('committer', '%cn')
    format_string += nvp('committer_email', '%ce')
    format_string += nvp('subject', '%f', False)
    format_string += '}'
    proc = subprocess.Popen(['git', '--no-pager', 'log', format_string, commit_range],
                            stdout=subprocess.PIPE, cwd=git_location,
                            universal_newlines=True)
    result = proc.stdout.readlines()
    proc.wait()
    return result


def post_ga_commits(version, git_location):
    # finds all commits on a branch AFTER the listed GA tag
    ga_tag = find_ga_tag(version, git_location)
    if not ga_tag:
        # no GA tag for this version, so there are no post-GA commits
        return None
    print("post_ga_commits for (" + ga_tag + ")")
    tag = "post_ga"
    commit_range = ga_tag + "..origin/" + version
    return commits(commit_range, tag, version, git_location)


def post_branch_commits(version, git_location):
    # finds all commits on a branch AFTER it was branched from master
    tag = "post_branch"
    # if there's a GA tag, include only commits up to that point
    ga_tag = find_ga_tag(version, git_location)
    if ga_tag:
        print("post_branch_commits for (" + ga_tag + " <- " + version + ")")
        commit_range = find_branch_point(version, git_location) + ".." + ga_tag
    else:
        print("post_branch_commits for (" + version + ")")
        commit_range = find_branch_point(version, git_location) + "..origin/" + version
    return commits(commit_range, tag, version, git_location)


def pre_branch_commits(version, pre_version, git_location):
    # finds all commits on master in between two branch points
    print("pre_branch_commits for (" + version + " <- " + pre_version + ")")
    tag = "pre_branch"
    commit_range = find_branch_point(pre_version, git_location) + ".." + find_branch_point(version, git_location)
    return commits(commit_range, tag, version, git_location)


def get_commits_for_versions(file_out, git_location, version_list):
    # gets pre- and post- branch commits for the versions listed
    # P.S. the last version in the list is only used to help
    # compute the pre-branch transactions for the second-to-last
    # returns None on success, or the error message on failure
    try:
        with open(file_out, 'w') as fout:
            # pre-branch commits
            ver_list = list(version_list)
            del ver_list[-1]
            pre_ver_list = version_list[1:]
            for (version, pre_version) in zip(ver_list, pre_ver_list):
                result = pre_branch_commits(version, pre_version, git_location)
                for line in result:
                    fout.write(line)
            # post-branch commits
            for version in ver_list:
                result = post_branch_commits(version, git_location)
                for line in result:
                    fout.write(line)
            # post-ga commits
            for version in ver_list:
                result = post_ga_commits(version, git_location)
                if result:
                    for line in result:
                        fout.write(line)
    except Exception as e:
        print("Error encountered, exiting: {e}".format(e=e))
        return str(e)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Extract git commit history (by branch) as JSON')
    # required
    parser.add_argument('--output_file', '-o',
                        help="Output file",
                        required=True)
    parser.add_argument('--git_location', '-g',
                        help="Location (on disk) of the git repo",
                        required=True)
    parser.add_argument('--branches', '-b',
                        help="Branches to query in descending order; e.g. 2.20 2.19 2.18 2.17 2.15 2.12 2.11 2.10",
                        nargs='+',
                        required=True)
    args = parser.parse_args()
    get_commits_for_versions(args.output_file, args.git_location, args.branches)

Now use the script to extract event data from your Git repo:

./ -o /tmp/github/commit_list.json -g /home/ubuntu/interana/backend/ -b 2.23 2.22 2.21 2.20 2.19 2.18 2.17 2.16 2.15
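The order of the -b list matters, because the script pairs each branch with the next older one to find the commits made on master between their branch points. The following is a minimal sketch of that pairing, assuming the same zip-based logic the script uses; the helper name pre_branch_pairs is made up for this illustration and is not part of the script.

```python
def pre_branch_pairs(branches):
    """Pair each branch with the next older branch in the list.

    The list is expected in descending order. The last branch only
    serves as a baseline, so it never appears first in a pair.
    (pre_branch_pairs is a hypothetical helper for illustration.)
    """
    branches = list(branches)
    newer = branches[:-1]
    older = branches[1:]
    return list(zip(newer, older))


if __name__ == "__main__":
    for version, pre_version in pre_branch_pairs(["2.23", "2.22", "2.21"]):
        print(version + " <- " + pre_version)
```

In other words, if you want events for N releases, pass N+1 branches: the oldest one is only used to bound the second-to-oldest.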

At this point you will have a commit_list.json file containing the event data from your Git repo. It will contain one raw event on each line, and each event will be a JSON object:

{"tag": "pre_branch", "version": "2.23", "hash": "985cf1c", "ts": "2016-11-14 18:57:22 -0800", "committer": "Jeff", "committer_email": "", "subject": "the-most-amazing-commit-ever"}
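Before uploading, it can be worth checking that every line really is a valid JSON object, since a committer name containing a quote character could break a line. The sketch below is one possible check, not part of the original script: it parses each line and counts events per (version, tag). The function name summarize_events is hypothetical.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Parse one JSON object per line and count events by (version, tag).

    Raises ValueError naming the first line that is not valid JSON.
    (summarize_events is a hypothetical helper for illustration.)
    """
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            event = json.loads(line)
        except ValueError as err:
            raise ValueError("line {0} is not valid JSON: {1}".format(number, err))
        counts[(event["version"], event["tag"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"tag": "pre_branch", "version": "2.23", "hash": "985cf1c", '
        '"ts": "2016-11-14 18:57:22 -0800", "committer": "Jeff", '
        '"committer_email": "", "subject": "the-most-amazing-commit-ever"}',
    ]
    for (version, tag), count in sorted(summarize_events(sample).items()):
        print(version, tag, count)
```

To check a real file, pass an open file object instead of the sample list, for example summarize_events(open("/tmp/github/commit_list.json")).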

Load Your Event Data Into Interana

Prerequisite: Your Interana login must have the "admin" role in order to ingest new data into the system. If you're not sure, try accessing the https://<mycluster>/?users endpoint. As an admin, you'll be able to see all users and their roles, and your own user will show as being an admin.

To ingest the data file you generated above, open the ingest wizard at the https://<mycluster>/?wizard=import endpoint, select Upload Local File, and choose your data file.

Give your table a name (I called mine "git_commits") and select "committer" and "version" as your two shard keys. This will allow you to explore the behavior of individual developers on the team, as well as releases as a whole. 

Double-check that Interana correctly detects the data type of each column in your raw data file. In our case, Interana release numbers look like decimals (2.19, 2.20, 2.21), so Interana incorrectly infers a numeric type for them. Manually override the data type of the "version" column to be a string instead.

Notice how the preview now shows "2.23" instead of "2230000" as the value of the column.
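The reason a numeric type is a poor fit for version labels can be shown outside Interana. The sketch below uses plain Python floats as a stand-in for numeric parsing in general (it does not reproduce Interana's internal representation), and the version values are illustrative only. Once parsed as numbers, distinct labels such as "2.1" and "2.10" become indistinguishable.

```python
def distinct_as_strings(versions):
    """Number of distinct version labels when kept as strings."""
    return len(set(versions))


def distinct_as_numbers(versions):
    """Number of distinct values after parsing each label as a float."""
    return len(set(float(v) for v in versions))


if __name__ == "__main__":
    versions = ["2.1", "2.10", "2.19"]  # illustrative values only
    print(distinct_as_strings(versions))
    print(distinct_as_numbers(versions))
```

Keeping the column as a string preserves each release as its own group when you later group by version.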

Click NEXT to proceed with the import. You will see a confirmation screen, which tells you that ingest is underway (but might not be complete yet). Give it a minute and then navigate back to the Explorer. 

Explore Your Git Data

On the Explorer screen, select the git_commits table, group by version, choose a Stacked Area Time View, and you'll be able to see all your Git commits grouped by version. 

What's Next

Good next steps would be to visit the Settings page to see all columns available in this data set, and then use a Samples View query to take a look at some example values. Then you'll be ready to ask behavioral questions about the actual pace of commits, length of releases, etc.
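As a rough offline cross-check for questions such as release length, the same events can be summarized directly from the data file. The sketch below is an assumption-laden illustration rather than a substitute for an Interana query: it assumes Python 3, assumes the "ts" field uses the format shown in the sample event, and uses a hypothetical helper named release_spans.

```python
from datetime import datetime

# Matches timestamps such as "2016-11-14 18:57:22 -0800"
TS_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def release_spans(events):
    """Return {version: (first_commit, last_commit)} as aware datetimes.

    (release_spans is a hypothetical helper for illustration.)
    """
    spans = {}
    for event in events:
        when = datetime.strptime(event["ts"], TS_FORMAT)
        first, last = spans.get(event["version"], (when, when))
        spans[event["version"]] = (min(first, when), max(last, when))
    return spans


if __name__ == "__main__":
    events = [
        {"version": "2.23", "ts": "2016-11-14 18:57:22 -0800"},
        {"version": "2.23", "ts": "2016-11-16 18:57:22 -0800"},
    ]
    for version, (first, last) in release_spans(events).items():
        print(version, (last - first).days, "days")
```

Each event here is a dictionary as produced by json.loads on one line of the data file; only the "version" and "ts" fields are used.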
