Citation enrichment improves deduplication of primary evidence
To automatically detect duplicate citations in a bibliographical database.
Citations retrieved from multiple search databases take different forms, making both manual and automatic detection of duplicates difficult. Existing methods rely on fuzzy-similarity measures, which are error-prone.
We analysed four pairs of original search results from MEDLINE and EMBASE that were used to create systematic reviews. An automatic tool deduplicated citations by first enriching them with Digital Object Identifiers (DOIs) and/or other unique identifiers. Duplicate records were then identified by comparing these unique identifiers. We compared our method with the duplicate detection function of a popular citation management desktop application in several configurations.
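The enrich-then-compare procedure described above can be illustrated with a minimal Python sketch. It is not the tool evaluated in the study: the record layout, the function names, and the `toy_lookup` resolver (which stands in for a query to an online index) are all assumptions made for illustration, and the DOIs are invented.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]

DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip common URL/scheme prefixes."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip() or None


def enrich(record: Record,
           lookup_doi: Callable[[Record], Optional[str]]) -> Record:
    """Return a copy of the record, adding a DOI if one can be resolved."""
    enriched = dict(record)
    if normalize_doi(enriched.get("doi")) is None:
        found = lookup_doi(enriched)  # hypothetical resolver
        if found:
            enriched["doi"] = found
    return enriched


def deduplicate(records: List[Record],
                lookup_doi: Callable[[Record], Optional[str]]
                ) -> Tuple[List[Record], List[Tuple[Record, Record]]]:
    """Keep the first record per DOI; report later ones as duplicates."""
    seen: Dict[str, Record] = {}
    unique: List[Record] = []
    duplicates: List[Tuple[Record, Record]] = []
    for record in (enrich(r, lookup_doi) for r in records):
        key = normalize_doi(record.get("doi"))
        if key is None:
            unique.append(record)  # no identifier: cannot be compared
        elif key in seen:
            duplicates.append((record, seen[key]))
        else:
            seen[key] = record
            unique.append(record)
    return unique, duplicates


# Illustrative data only; the titles and DOIs are invented.
KNOWN_DOIS = {"a trial of drug x": "10.1000/example.1"}


def toy_lookup(record: Record) -> Optional[str]:
    """Stand-in for an online index lookup keyed on the title."""
    title = record.get("title", "").strip().lower().rstrip(".")
    return KNOWN_DOIS.get(title)


medline = [{"source": "MEDLINE", "title": "A trial of drug X.", "doi": ""}]
embase = [
    {"source": "EMBASE", "title": "A Trial of Drug X",
     "doi": "https://doi.org/10.1000/EXAMPLE.1"},
    {"source": "EMBASE", "title": "Unindexed conference abstract", "doi": ""},
]
unique, duplicates = deduplicate(medline + embase, toy_lookup)
```

In this sketch, records that cannot be enriched are never flagged as duplicates, which mirrors the restriction of the reported figures to duplicates indexed online.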
Citation enrichment identified 93 % (range 86 %–100 %) of the duplicates indexed online and erroneously marked 3 % (range 0 %–6 %) of documents as duplicates. With its default settings, the citation management application found 68 % (range 64 %–72 %) of duplicates without error. When set for highest deduplication, the citation management application found 94 % of duplicates (range 77 %–100 %) with 4 % error (range 0 %–8 %).
Citation enrichment using unique identifiers enhances automatic deduplication. On its own, the approach seems slightly superior to tools that compare citations without enrichment. Methods that combine citation enrichment with existing fuzzy-matching may substantially reduce resource requirements of evidence synthesis.
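One way such a combined method might look is sketched below in Python: identifiers decide the outcome whenever both records carry one, and a fuzzy title comparison is used only as a fallback. This is an assumption about how the two approaches could be combined, not a method described in the study; the `is_duplicate` function, the record layout, and the 0.9 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Dict


def normalize_title(title: str) -> str:
    """Lower-case a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a: Dict[str, str], b: Dict[str, str],
                 threshold: float = 0.9) -> bool:
    """Compare by DOI when both records have one, else by title similarity.

    The threshold is an illustrative assumption, not a tuned value.
    """
    doi_a = (a.get("doi") or "").strip().lower()
    doi_b = (b.get("doi") or "").strip().lower()
    if doi_a and doi_b:
        # Both records are enriched: the identifiers are decisive.
        return doi_a == doi_b
    # At least one record lacks an identifier: fall back to fuzzy matching.
    ratio = SequenceMatcher(None,
                            normalize_title(a.get("title", "")),
                            normalize_title(b.get("title", ""))).ratio()
    return ratio >= threshold
```

Letting identifiers take precedence means that two records with different DOIs are never merged on title similarity alone, while records that could not be enriched are still compared.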
This document has been peer reviewed.