Fast Content-Based File Type Identification

Irfan Ahmed; Kyung-Suk Lhee; Hyun-Jung Shin; Man-Pyo Hong

doi:10.1007/978-3-642-24212-0_5

Conference paper, Year: 2011

Irfan Ahmed, Information Security Institute

Kyung-Suk Lhee, Ajou University

Hyun-Jung Shin, Ajou University

Man-Pyo Hong, Ajou University

Abstract

Digital forensic examiners often need to identify the type of a file or file fragment based on the content of the file. Content-based file type identification schemes typically use a byte frequency distribution with statistical machine learning to classify file types. Most algorithms analyze the entire file content to obtain the byte frequency distribution, a technique that is inefficient and time consuming. This paper proposes two techniques for reducing the classification time. The first technique selects a subset of features based on the frequency of occurrence. The second speeds up classification by randomly sampling file blocks. Experimental results demonstrate that up to a fifteen-fold reduction in computational time can be achieved with limited impact on accuracy.
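
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 512-byte block size, the number of sampled blocks, and the "highest total frequency" selection rule are all assumptions chosen for illustration, and the classifier itself is left out. The sketch estimates a byte frequency distribution from randomly sampled blocks instead of the whole file, then keeps only a subset of the 256 byte values as features.

```python
import random
from collections import Counter
from typing import List, Optional


def sampled_byte_frequencies(data: bytes, block_size: int = 512,
                             num_blocks: int = 8,
                             seed: Optional[int] = None) -> List[float]:
    """Estimate a normalized byte frequency distribution from random blocks.

    block_size and num_blocks are illustrative defaults, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    # Number of blocks in the input (ceiling division), at least one.
    total_blocks = max(1, -(-len(data) // block_size))
    # Sample distinct block indices without replacement.
    chosen = rng.sample(range(total_blocks), min(num_blocks, total_blocks))
    counts = Counter()
    for index in chosen:
        start = index * block_size
        counts.update(data[start:start + block_size])
    sampled = sum(counts.values())
    if sampled == 0:
        return [0.0] * 256
    return [counts[b] / sampled for b in range(256)]


def top_k_features(distributions: List[List[float]], k: int) -> List[int]:
    """Select the k byte values with the highest total frequency.

    This is one possible reading of "frequency of occurrence"; the
    paper's exact selection criterion may differ.
    """
    totals = [sum(d[b] for d in distributions) for b in range(256)]
    return sorted(range(256), key=lambda b: totals[b], reverse=True)[:k]


def project(distribution: List[float], features: List[int]) -> List[float]:
    """Reduce a 256-value distribution to the selected feature subset."""
    return [distribution[b] for b in features]
```

In such a setup, the reduced vector returned by `project` would be passed to whatever statistical classifier is in use. The cost saving comes from reading fewer bytes per file and from comparing fewer features per sample.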

Domains

Computer Science [cs]

Main file

978-3-642-24212-0_5_Chapter.pdf (740.39 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-01569553

Submitted on: Thursday, July 27, 2017, 8:22:27 AM

Last modification on: Thursday, March 5, 2020, 4:46:42 PM

Dates and versions

hal-01569553, version 1 (27-07-2017)

Licence

Attribution

Identifiers

HAL Id: hal-01569553, version 1
DOI: 10.1007/978-3-642-24212-0_5

Cite

Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin, Man-Pyo Hong. Fast Content-Based File Type Identification. 7th Digital Forensics (DF), Jan 2011, Orlando, FL, United States. pp.65-75, ⟨10.1007/978-3-642-24212-0_5⟩. ⟨hal-01569553⟩
