%0 Conference Proceedings
%T Near Duplicate Document Detection for Large Information Flows
%+ eni ICT [Bologna]
%+ Geographic Information Systems [Bologna] (GESP)
%A Montanari, Daniele
%A Puglisi, Piera Laura
%Z Part 1: Conference
%< peer-reviewed
%( Lecture Notes in Computer Science
%B International Cross-Domain Conference and Workshop on Availability, Reliability, and Security (CD-ARES)
%C Prague, Czech Republic
%Y Gerald Quirchmayr
%Y Josef Basl
%Y Ilsun You
%Y Lida Xu
%Y Edgar Weippl
%I Springer
%3 Multidisciplinary Research and Practice for Information Systems
%V LNCS-7465
%P 203-217
%8 2012-08-20
%D 2012
%R 10.1007/978-3-642-32498-7_16
%K duplicate
%K information flows
%K q-grams
%Z Computer Science [cs]
%Z Humanities and Social Sciences/Library and information sciences
%Z Conference papers
%X Near duplicate documents and their detection are studied to identify information items that convey the same (or very similar) content, possibly surrounded by diverse sets of side information such as metadata, advertisements, timestamps, web presentations, and navigation supports. Identification of near duplicate information allows the implementation of selection policies aiming to optimize an information corpus and therefore improve its quality. In this paper, we introduce a new method to find near duplicate documents based on q-grams extracted from the text. The algorithm exploits three major features: a similarity measure comparing document q-gram occurrences to evaluate the syntactic similarity of the compared texts; an indexing method maintaining an inverted index of q-grams; and an efficient allocation of the bitmaps using a window size of 24 hours supporting the document comparison process. The proposed algorithm has been tested in a multifeed news content management system to filter out duplicated news items coming from different information channels. The experimental evaluation shows the efficiency and the accuracy of our solution compared with other existing techniques. The results on a real dataset report an F-measure of 9.53 with a similarity threshold of 0.8.
%G English
%Z TC 5
%Z TC 8
%Z WG 8.4
%Z WG 8.9
%2 https://inria.hal.science/hal-01542467/document
%2 https://inria.hal.science/hal-01542467/file/978-3-642-32498-7_16_Chapter.pdf
%L hal-01542467
%U https://inria.hal.science/hal-01542467
%~ SHS
%~ IFIP-LNCS
%~ IFIP
%~ IFIP-TC
%~ IFIP-TC5
%~ IFIP-WG
%~ IFIP-TC8
%~ IFIP-CD-ARES
%~ IFIP-WG8-4
%~ IFIP-WG8-9
%~ IFIP-LNCS-7465