GFA is, at the time of writing, the current standard for sharing variation graphs. It is human-readable, supported by many tools, and stores graph information in tab-separated fields. It nonetheless has two major shortcomings:
- the file format has no built-in random access;
- a file can contain billions of nodes, and may therefore be hard to visualize.

This work tries to address the second issue. I define here a sub-standard of GFA, called bGFA. A bGFA file is a fully compatible GFA 1.0, GFA 1.1 or even rGFA file with B-lines appended at its end. B-lines follow the GFA specification format, and allow for this syntax:
H VN:Z:1.1 RS:Z:seq0
S 1 ATTA
...
S 12 GGTC
W seq1 1 seq1 0 749948 >1>2>4>5>6>7>8>9>10>11>9>10>11>12
W seq2 2 seq2 0 748216 >1>3>4>5>7>9>10>12
W seq0 0 seq0 0 749379 >1>2>4>5>6>7>9>10>11>12
L 1 + 2 + 0M
...
L 11 + 12 + 0M
B b0 4,1,2,3
B b1 7,5,6
Each B-line describes a bubble in the graph and lists the names of the nodes it contains, for easy access and isolation. The key idea (not implemented yet) is a recursive definition, where bubbles can be composed of nodes and other bubbles, embedding a hierarchy that would allow for different levels of abstraction over the graph.
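To make the B-line layout concrete, here is a minimal Python sketch of how a reader could collect B-lines and isolate the segments and links of one bubble. It is not part of any existing tool: the function names `parse_b_lines`, `expand` and `isolate` are hypothetical, as are the small example graph and the nested bubble `b2`. The sketch assumes that bubble names never clash with segment names, and `expand` only illustrates how the not-yet-implemented recursive definition might be resolved.

```python
def parse_b_lines(lines):
    """Collect B-lines into {bubble_name: [member_name, ...]}."""
    bubbles = {}
    for line in lines:
        # GFA fields are tab-separated; split() also accepts spaces,
        # which is enough here because B-line fields contain no blanks.
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "B":
            bubbles[fields[1]] = fields[2].split(",")
    return bubbles


def expand(name, bubbles, _path=None):
    """Flatten a bubble into node names, resolving nested bubbles.

    Nested bubbles are the recursive idea described above, which is
    not implemented in the format yet; this is only an assumption.
    """
    path = set() if _path is None else _path
    if name in path:
        raise ValueError(f"bubble {name!r} contains itself")
    path.add(name)
    nodes = []
    for member in bubbles[name]:
        if member in bubbles:  # member is itself a bubble
            nodes.extend(expand(member, bubbles, path))
        else:                  # member is a plain node
            nodes.append(member)
    path.discard(name)
    return nodes


def isolate(lines, nodes):
    """Keep the S-lines of the given nodes and the L-lines between them."""
    keep = set(nodes)
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "S" and fields[1] in keep:
            kept.append(line)
        elif (len(fields) >= 4 and fields[0] == "L"
              and fields[1] in keep and fields[3] in keep):
            kept.append(line)
    return kept


# Hypothetical toy graph; sequences and the nested bubble b2 are made up.
EXAMPLE = [
    "H\tVN:Z:1.1",
    "S\t1\tATTA", "S\t2\tC", "S\t3\tG", "S\t4\tTT", "S\t5\tA",
    "L\t1\t+\t2\t+\t0M", "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M", "L\t3\t+\t4\t+\t0M",
    "L\t4\t+\t5\t+\t0M",
    "B\tb0\t4,1,2,3",
    "B\tb2\tb0,5",
]
```

Calling `isolate(EXAMPLE, expand("b0", parse_b_lines(EXAMPLE)))` returns only the S-lines of nodes 1 to 4 and the L-lines between them, which can then be handed to a visualizer as a small standalone graph.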
Other ideas I had while thinking about this format:
- define another type of record, the bubble chain, which describes the juxtaposition of at least two bubbles;
- precompute some values, such as the start and end points of a bubble, and store them as optional fields.