Ctcompare: Code clone detection using hashed token sequences

Document Type

Conference Paper

Publication Details

Citation only

Ctcompare: Code clone detection using hashed token sequences. Paper presented at the 34th International Conference on Software Engineering (ICSE), 2-6 June 2012, Zurich, Switzerland.

2012 HERDC submission. FoR code: 080201

© Copyright IEEE, 2012




There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. This sequence is then broken into overlapping tuples of N consecutive tokens. The tuples are hashed, and the hash values are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
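
The pipeline described in the abstract (lex, form overlapping N-token tuples, hash, match hash values) can be illustrated with a minimal sketch. This is not ctcompare's implementation: it assumes Python source as input and uses the standard `tokenize` module as the lexer, an in-memory dictionary as the index, and SHA-1 as the hash. The function names `lex`, `tuple_hashes`, and `find_clone_candidates`, and the default N = 8, are hypothetical choices made for illustration. Identifiers and literals are replaced by placeholders so that renamed copies (type-2 clones) hash to the same values as exact copies (type-1 clones).

```python
import hashlib
import io
import keyword
import tokenize
from collections import defaultdict

# Token kinds that carry no lexical content for clone matching.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def lex(source):
    """Return a list of (token_text, line_number) pairs.

    Identifiers, numbers, and strings are normalised to placeholders
    (an assumption of this sketch) so renamed copies still match.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            text = "ID"
        elif tok.type == tokenize.NUMBER:
            text = "NUM"
        elif tok.type == tokenize.STRING:
            text = "STR"
        else:
            text = tok.string
        tokens.append((text, tok.start[0]))
    return tokens


def tuple_hashes(tokens, n):
    """Yield (hash, start_line) for each overlapping run of n tokens."""
    for i in range(len(tokens) - n + 1):
        window = "\x00".join(text for text, _ in tokens[i:i + n])
        digest = hashlib.sha1(window.encode("utf-8")).hexdigest()
        yield digest, tokens[i][1]


def find_clone_candidates(source_a, source_b, n=8):
    """Return sorted (line_in_a, line_in_b) pairs whose n-token tuples share a hash."""
    index = defaultdict(list)
    for digest, line in tuple_hashes(lex(source_a), n):
        index[digest].append(line)
    pairs = set()
    for digest, line_b in tuple_hashes(lex(source_b), n):
        for line_a in index.get(digest, ()):
            pairs.add((line_a, line_b))
    return sorted(pairs)


if __name__ == "__main__":
    a = "def add(x, y):\n    total = x + y\n    return total\n"
    b = "def plus(p, q):\n    s = p + q\n    return s\n"
    print(find_clone_candidates(a, b))
```

Each reported pair marks the start of one matching tuple. A practical tool would presumably merge adjacent matches into longer clone regions and would need to handle hash collisions, neither of which this sketch attempts.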



This document has been peer reviewed.