|
|
|||||||||||||
|
|
||||||||||||||
|
DB
Resources
Downloads
Demos
Support
Diff
Advanced XML Diffing
|
X-Hive/Diff Algorithm
The X-Hive/Diff algorithm for visualizing differences between two XML trees
(the "source" and the "target" tree) includes three main phases:
The first phase calculates a matching between the source and the
target tree. This means that for each node in the source tree the
best matching node in the target tree is determined (and vice cersa).
Ultimately, the matching is stored in a format where for each node
there is a "meta-node" (class XhiveDiffNode), storing the node itself
and information about the node it matches and how "good" this
match is. For TextNodes (and CDataSections etc.), an LCS matching
with the matched TextNode is also stored.
The next phase builds two XML documents (one for the source tree
and one for the target tree), containing the original documents
but in addition including markup to identify matchings, deletions,
insertions, moves, element and attribute name changes, and changes in
comments and processing instructions.
The final phase builds two HTML files (one for the source tree and
one for the target tree), by applying a stylesheet to the XML documents
from phase two. It then combines the two HTML files in a frames
document.
Elaboration on Phase 1
The Phase one algorithm is complete in the sense that it will
identify each node (and within each TextNode, each character)
that has no match in the other document. In other words, it
will not skip any differences. The question is, of course, whether
it will correctly type them. For example, does a node in the
target tree that exactly matches a node in the source tree, but
appearing under a different parent correspond to a deletion
followed by an insertion, or was it an actual move? Such a call
cannot be unambiguously made without an edit log, which is,
of course, not available. Therefore, it will come as no surprise
that such decisions are based on heuristics, or "educated guesses",
which can be slightly influenced by the user through the setting
of configuration parameters, listed and documented in
bin/xhivediff.properties.
The matching algorithm itself is performance-wise based on the
thought that the number of changes is small compared to the
number of nodes in the source and target trees. It starts out by
quickly establishing an initial matching based on an LCS algorithm
adapted for tree structures, which in the typical case will render
most nodes matched. By an assignment algorithm, an attempt is then
made to match remaining unmatched or not fully matched nodes in the
source and target trees. Some of the original matchings may then be
replaced by others. If two nodes are matched during this phase, their
subtrees are enitrely re-evaluated using, again, the sequence of LCS
mathing first, followed by assignment. This algorithm repeats itself
until no additional matchings are found, at which point it terminates.
Elaboration on Phase 2
In phase two the matching structure built in phase 1 is traversed
and unmatched nodes, matched nodes under different names, matched nodes
under different parents, unmatched pieces of text within matched nodes,
etc., are registered as deletions, insertions, renamings, moves,
text changes, etc. This analysis is stored in the two XML documents created
in this phase. They form the preparation for HTML output in phase three.
Every node in an original tree (i.e. source or
target) is assigned a corresponding element node in the diff markup document,
with additional attributes indicating the type of changes, the kind
of node, etc. All required change information is contained in this structure
somehow, albeit somewhat hidden in some cases. This means that the
stylesheet in Phase 3 has some work to do as well.
Elaboration on phase 3
In phase three, a stylesheet is used to transform the XML trees built
in Phase 2 into HTML documents. When "verbatim" XML output is required,
XhiveDiff provides a pre-built stylesheet in which all changes between
the source and target trees are visualized in horizontally or
vertivally positioned frames (dependent on the value of the
xhivediff.frame.layout [horizontal, vertical] configuration parameter).
Clicking on a node in either frame will make the other frame scroll to the
matching node. Jumping from one change to the next is also
possible.
Xhive Three-way Diff Algorithm
The three-way diff algorithm handles a situation in which two separate
branches exist (T1 and T2), both originating from a common source document (S).
When T1 and T2 are compared, the user who is interested in the differences
between the two documents, may be also interested in which differences between
T2 and T1 are "new", and which are "old" in the sense that they already
occurred as differences between T2 and S. The three-way merge algorithm solves
this diff use case. After a full diff between each pair of documents, the
algorithm compares the differences between T2 and T1 to those between T2 and S,
and is thus able to qualify each difference between T2 and T1 as either "new"
or "old".
|