Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora

Kyle Porter; Slobodan Petrovic

doi:10.1007/978-3-319-99277-8_5

Conference paper, year: 2018

Author affiliation (both authors): NTNU - Norwegian University of Science and Technology [Gjøvik]

Abstract

Fuzzy search is often used in digital forensic investigations to find words that are stringologically similar to a chosen keyword. However, a common complaint is the high rate of false positives in big data environments. This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches. The unique flexibility of cedas facilitates fine-tuned control of precision-recall trade-offs. Specifically, searches can be constrained to the union of matches resulting from any exact edit combination of insertion, deletion and substitution operations performed on the search term. The flexibility is leveraged in experiments involving fuzzy searches of an inverted index of the Enron corpus, a large English email dataset, which reveal the specific edit operation constraints that should be applied to achieve valuable precision-recall trade-offs. The constraints that produce relatively high combinations of precision and recall are identified, along with the combinations of edit operations that cause precision to drop sharply and the combination of edit operation constraints that maximize recall without sacrificing precision substantially. These edit operation constraints are potentially valuable during the middle stages of a digital forensic investigation because precision has greater value in the early stages of an investigation while recall becomes more valuable in the later stages.
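The idea of constraining a search to exact combinations of edit operations can be illustrated with a small sketch. This is not the cedas algorithm itself, whose construction is not given in the abstract; it is a plain dynamic-programming stand-in, and the names `edit_profiles`, `matches`, `fuzzy_lookup` and the toy index are hypothetical. It assumes that a word matches when some alignment turns the keyword into the word using exactly one of the allowed (insertions, deletions, substitutions) triples.

```python
from functools import lru_cache


def edit_profiles(keyword, word, max_edits=2):
    """Return every (insertions, deletions, substitutions) triple, using at
    most max_edits operations in total, for which some alignment turns
    keyword into word. Illustrative stand-in, not the cedas algorithm."""

    @lru_cache(maxsize=None)
    def go(i, j, budget):
        # Triples that turn keyword[i:] into word[j:] with <= budget edits.
        if i == len(keyword) and j == len(word):
            return frozenset({(0, 0, 0)})
        found = set()
        both = i < len(keyword) and j < len(word)
        if both and keyword[i] == word[j]:
            found |= go(i + 1, j + 1, budget)  # characters agree, no edit
        if budget > 0:
            if both and keyword[i] != word[j]:  # substitution
                found |= {(a, b, c + 1) for a, b, c in go(i + 1, j + 1, budget - 1)}
            if j < len(word):  # insertion of word[j]
                found |= {(a + 1, b, c) for a, b, c in go(i, j + 1, budget - 1)}
            if i < len(keyword):  # deletion of keyword[i]
                found |= {(a, b + 1, c) for a, b, c in go(i + 1, j, budget - 1)}
        return frozenset(found)

    return set(go(0, 0, max_edits))


def matches(keyword, word, allowed):
    """True if word is reachable from keyword by one of the allowed triples."""
    budget = max((sum(t) for t in allowed), default=0)
    return bool(edit_profiles(keyword, word, budget) & set(allowed))


def fuzzy_lookup(keyword, index, allowed):
    """Union of index entries whose term matches under the allowed triples."""
    return {term: docs for term, docs in index.items()
            if matches(keyword, term, allowed)}


# Hypothetical inverted index: term -> list of document ids.
toy_index = {"enron": [1, 4], "enrom": [2], "enro": [3], "environ": [5]}

# Allow exact matches and single substitutions only.
hits = fuzzy_lookup("enron", toy_index, {(0, 0, 0), (0, 0, 1)})
```

With these constraints the lookup keeps "enron" and "enrom" but rejects "enro" (which needs a deletion) and "environ" (which needs two insertions). Widening the allowed set, for example by adding `(0, 1, 0)`, admits more terms, which is the kind of recall gain at a possible cost in precision that the abstract describes.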

Domains

Computer Science [cs]

Main file

472401_1_En_5_Chapter.pdf (264.66 KB)

Origin: Files produced by the author(s)
Licence: CC BY 4.0 - Attribution

https://inria.hal.science/hal-01988842

Submitted on: Tuesday, January 22, 2019, 9:44:41 AM

Last modification on: Thursday, January 15, 2026, 5:12:03 PM

Long-term archiving on: Tuesday, April 23, 2019, 2:07:11 PM

Dates and versions

hal-01988842 , version 1 (22-01-2019)

Identifiers

HAL Id : hal-01988842 , version 1
DOI : 10.1007/978-3-319-99277-8_5

Cite

Kyle Porter, Slobodan Petrovic. Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora. 14th IFIP International Conference on Digital Forensics (DigitalForensics), Jan 2018, New Delhi, India. pp.67-85, ⟨10.1007/978-3-319-99277-8_5⟩. ⟨hal-01988842⟩
