Ctcompare: Code clone detection using hashed token sequences
Date of this Version
There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees; others have used variations of longest common substring algorithms. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. The sequence is then broken into overlapping tuples of N consecutive tokens. Each tuple is hashed, and the hash values are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere, but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
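The pipeline described in the abstract (tokenize, form overlapping N-token tuples, hash each tuple, match equal hashes) can be illustrated with a minimal Python sketch. This is not ctcompare's implementation: the regular-expression lexer, the keyword set, the BLAKE2 hash, the in-memory index, and the names `tokenize`, `tuple_hashes`, and `find_clone_candidates` are all assumptions made for illustration. Replacing every identifier with `ID` and every number with `NUM` is only a rough stand-in for type-2 matching, which properly requires consistent renaming.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical toy lexer: identifiers/keywords, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Assumed, deliberately tiny keyword set; a real lexer would know the language.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def tokenize(source, normalize=False):
    """Split source text into tokens; optionally abstract names and numbers."""
    tokens = TOKEN_RE.findall(source)
    if not normalize:
        return tokens
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")   # any identifier
        elif tok[0].isdigit():
            out.append("NUM")  # any integer literal
        else:
            out.append(tok)
    return out


def tuple_hashes(tokens, n):
    """Yield (position, hash) for every overlapping run of n tokens."""
    for i in range(len(tokens) - n + 1):
        data = "\x00".join(tokens[i:i + n]).encode("utf-8")
        yield i, hashlib.blake2b(data, digest_size=8).hexdigest()


def find_clone_candidates(code_bases, n=4, normalize=False):
    """Group locations whose n-token tuples hash to the same value.

    code_bases maps a name to its source text. Equal hashes are only
    candidates; a real tool would confirm them against the tokens.
    """
    index = defaultdict(list)  # hash -> [(code base name, token position)]
    for name, source in code_bases.items():
        for pos, digest in tuple_hashes(tokenize(source, normalize), n):
            index[digest].append((name, pos))
    return [locs for locs in index.values() if len(locs) > 1]


if __name__ == "__main__":
    bases = {"a.c": "int x = a + b;", "b.c": "int y = c + d;"}
    print(find_clone_candidates(bases, n=4))                  # exact token matches
    print(find_clone_candidates(bases, n=4, normalize=True))  # names abstracted
```

Because all code bases feed one index keyed by hash, any number of them can be compared in a single pass. Keeping that index in memory rather than in a database is a choice made for this sketch only.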
This document has been peer reviewed.