EMC home home

 
X-Hive/Diff Algorithm
The X-Hive/Diff algorithm for visualizing differences between two XML trees (the "source" and the "target" tree) includes three main phases:
The first phase calculates a matching between the source and the target tree. This means that for each node in the source tree the best matching node in the target tree is determined (and vice cersa). Ultimately, the matching is stored in a format where for each node there is a "meta-node" (class XhiveDiffNode), storing the node itself and information about the node it matches and how "good" this match is. For TextNodes (and CDataSections etc.), an LCS matching with the matched TextNode is also stored.
The next phase builds two XML documents (one for the source tree and one for the target tree), containing the original documents but in addition including markup to identify matchings, deletions, insertions, moves, element and attribute name changes, and changes in comments and processing instructions.
The final phase builds two HTML files (one for the source tree and one for the target tree), by applying a stylesheet to the XML documents from phase two. It then combines the two HTML files in a frames document.
Elaboration on Phase 1
The Phase one algorithm is complete in the sense that it will identify each node (and within each TextNode, each character) that has no match in the other document. In other words, it will not skip any differences. The question is, of course, whether it will correctly type them. For example, does a node in the target tree that exactly matches a node in the source tree, but appearing under a different parent correspond to a deletion followed by an insertion, or was it an actual move? Such a call cannot be unambiguously made without an edit log, which is, of course, not available. Therefore, it will come as no surprise that such decisions are based on heuristics, or "educated guesses", which can be slightly influenced by the user through the setting of configuration parameters, listed and documented in bin/xhivediff.properties.
The matching algorithm itself is performance-wise based on the thought that the number of changes is small compared to the number of nodes in the source and target trees. It starts out by quickly establishing an initial matching based on an LCS algorithm adapted for tree structures, which in the typical case will render most nodes matched. By an assignment algorithm, an attempt is then made to match remaining unmatched or not fully matched nodes in the source and target trees. Some of the original matchings may then be replaced by others. If two nodes are matched during this phase, their subtrees are enitrely re-evaluated using, again, the sequence of LCS mathing first, followed by assignment. This algorithm repeats itself until no additional matchings are found, at which point it terminates.
Elaboration on Phase 2
In phase two the matching structure built in phase 1 is traversed and unmatched nodes, matched nodes under different names, matched nodes under different parents, unmatched pieces of text within matched nodes, etc., are registered as deletions, insertions, renamings, moves, text changes, etc. This analysis is stored in the two XML documents created in this phase. They form the preparation for HTML output in phase three.
Every node in an original tree (i.e. source or target) is assigned a corresponding element node in the diff markup document, with additional attributes indicating the type of changes, the kind of node, etc. All required change information is contained in this structure somehow, albeit somewhat hidden in some cases. This means that the stylesheet in Phase 3 has some work to do as well.
Elaboration on phase 3
In phase three, a stylesheet is used to transform the XML trees built in Phase 2 into HTML documents. When "verbatim" XML output is required, XhiveDiff provides a pre-built stylesheet in which all changes between the source and target trees are visualized in horizontally or vertivally positioned frames (dependent on the value of the xhivediff.frame.layout [horizontal, vertical] configuration parameter). Clicking on a node in either frame will make the other frame scroll to the matching node. Jumping from one change to the next is also possible.
Xhive Three-way Diff Algorithm
The three-way diff algorithm handles a situation in which two separate branches exist (T1 and T2), both originating from a common source document (S). When T1 and T2 are compared, the user who is interested in the differences between the two documents, may be also interested in which differences between T2 and T1 are "new", and which are "old" in the sense that they already occurred as differences between T2 and S. The three-way merge algorithm solves this diff use case. After a full diff between each pair of documents, the algorithm compares the differences between T2 and T1 to those between T2 and S, and is thus able to qualify each difference between T2 and T1 as either "new" or "old".