Near Duplicate Document Detection for Large Information Flows

Daniele Montanari; Piera Laura Puglisi

doi:10.1007/978-3-642-32498-7_16

Conference Papers
Year : 2012

Daniele Montanari

eni ICT [Bologna]

Piera Laura Puglisi

GESP - Geographic Information Systems [Bologna]

Abstract

Near-duplicate documents and their detection are studied to identify information items that convey the same (or very similar) content, possibly surrounded by diverse side information such as metadata, advertisements, timestamps, web presentation and navigation aids. Identifying near-duplicate information makes it possible to implement selection policies that optimize an information corpus and thereby improve its quality.

In this paper, we introduce a new method to find near-duplicate documents based on q-grams extracted from the text. The algorithm exploits three major features: a similarity measure that compares document q-gram occurrences to evaluate the syntactic similarity of the compared texts; an indexing method that maintains an inverted index of q-grams; and an efficient allocation of the bitmaps, using a window size of 24 hours, that supports the document comparison process.

The proposed algorithm has been tested in a multifeed news content management system to filter out duplicated news items coming from different information channels. The experimental evaluation shows the efficiency and accuracy of our solution compared with other existing techniques. The results on a real dataset report an F-measure of 9.53 with a similarity threshold of 0.8.
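The three ingredients named in the abstract (a q-gram similarity measure, an inverted index of q-grams, and a 24-hour comparison window) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the similarity formula, so a multiset Jaccard ratio is assumed here, plain Python sets stand in for the paper's bitmaps, and the names `qgrams`, `similarity` and `QGramIndex` are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(text, q=3):
    """Return the multiset of character q-grams of the whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return Counter(norm[i:i + q] for i in range(len(norm) - q + 1))


def similarity(a, b):
    """Assumed measure: multiset Jaccard over q-gram occurrence counts."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0


class QGramIndex:
    """Inverted index of q-grams restricted to a sliding time window."""

    def __init__(self, q=3, window_seconds=24 * 3600, threshold=0.8):
        self.q = q
        self.window = window_seconds
        self.threshold = threshold
        self.docs = {}                    # doc_id -> (timestamp, q-gram Counter)
        self.postings = defaultdict(set)  # q-gram -> ids of documents containing it

    def _evict(self, now):
        """Drop documents older than the window from the index."""
        expired = [d for d, (ts, _) in self.docs.items() if now - ts > self.window]
        for d in expired:
            _, grams = self.docs.pop(d)
            for g in grams:
                self.postings[g].discard(d)
                if not self.postings[g]:
                    del self.postings[g]

    def add(self, doc_id, text, timestamp):
        """Return ids of near duplicates still in the window, then index the document."""
        self._evict(timestamp)
        grams = qgrams(text, self.q)
        # Only documents sharing at least one q-gram are compared.
        candidates = set()
        for g in grams:
            candidates |= self.postings.get(g, set())
        dups = sorted(
            d for d in candidates
            if similarity(grams, self.docs[d][1]) >= self.threshold
        )
        self.docs[doc_id] = (timestamp, grams)
        for g in grams:
            self.postings[g].add(doc_id)
        return dups
```

In this sketch, a news item fed to `add` is compared only against items that share at least one q-gram and arrived within the last 24 hours; timestamps are assumed to be in seconds and to arrive in non-decreasing order.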

Main file

978-3-642-32498-7_16_Chapter.pdf (472.63 KB)

Origin : Files produced by the author(s)
Licence : CC BY 4.0 - Attribution

https://inria.hal.science/hal-01542467

Submitted on : Monday, June 19, 2017, 5:01:46 PM

Last modification on : Thursday, March 5, 2020, 4:47:37 PM

Long-term archiving on : Friday, December 15, 2017, 6:57:32 PM

Dates and versions

hal-01542467 , version 1 (19-06-2017)

Identifiers

HAL Id : hal-01542467 , version 1
DOI : 10.1007/978-3-642-32498-7_16

Cite

Daniele Montanari, Piera Laura Puglisi. Near Duplicate Document Detection for Large Information Flows. International Cross-Domain Conference and Workshop on Availability, Reliability, and Security (CD-ARES), Aug 2012, Prague, Czech Republic. pp.203-217, ⟨10.1007/978-3-642-32498-7_16⟩. ⟨hal-01542467⟩
