DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets

Minne Li; Dongsheng Li; Siqi Shen; Zhaoning Zhang; Xicheng Lu

doi:10.1007/978-3-319-47099-3_11

Conference paper. Year: 2016

Affiliation (all authors): National University of Defense Technology [China]

Abstract

Statistical analysis of aggregated records is widely used in domains such as market research, sociological investigation, and network analysis. Stratified sampling (SS), which divides the population into distinct groups and samples each group separately, is preferred in practice for its effectiveness and accuracy. In this paper, we propose DSS, a scalable and efficient SS algorithm for large datasets. DSS executes all sampling operations in parallel by calculating the exact subsample size for each partition according to the data distribution. We implement DSS on Spark, a big-data processing system, and show through large-scale experiments that it achieves lower data-transmission cost and higher efficiency than state-of-the-art methods while maintaining high sample representativeness.
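The idea of fixing an exact subsample size per partition before sampling can be illustrated with a small sketch. This is not the authors' DSS implementation: the abstract does not give the allocation rule, so the code below assumes proportional allocation with largest-remainder rounding, and it uses plain Python lists in place of Spark partitions. The names `allocate` and `stratified_sample` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def allocate(counts, total):
    """Split `total` across partitions in proportion to `counts`.

    Assumption: largest-remainder rounding, so the sizes sum to `total` exactly.
    """
    n = sum(counts)
    if n == 0:
        return [0] * len(counts)
    sizes = [c * total // n for c in counts]       # integer floor of each quota
    remainders = [c * total % n for c in counts]   # exact fractional parts (times n)
    leftover = total - sum(sizes)
    # Hand the leftover units to the partitions with the largest remainders.
    order = sorted(range(len(counts)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


def stratified_sample(partitions, fraction, seed=0):
    """Sample `fraction` of every stratum from a list of partitions.

    Each partition is a list of (stratum, record) pairs.
    """
    # Pass 1: each partition reports only its per-stratum record counts.
    local_counts = [Counter(stratum for stratum, _ in part) for part in partitions]
    strata = sorted(set().union(*local_counts))

    # Fix an exact per-partition subsample size for every stratum.
    plan = [dict() for _ in partitions]
    for s in strata:
        counts = [lc[s] for lc in local_counts]
        target = round(fraction * sum(counts))
        for p, size in enumerate(allocate(counts, target)):
            plan[p][s] = size

    # Pass 2: every partition samples on its own; no records are exchanged.
    sample = []
    for p, part in enumerate(partitions):
        rng = random.Random(seed + p)
        by_stratum = defaultdict(list)
        for stratum, record in part:
            by_stratum[stratum].append(record)
        for s, records in by_stratum.items():
            sample.extend((s, r) for r in rng.sample(records, plan[p][s]))
    return sample


partitions = [
    [("a", i) for i in range(60)] + [("b", i) for i in range(10)],
    [("a", i) for i in range(60, 100)] + [("b", i) for i in range(10, 40)],
]
sample = stratified_sample(partitions, fraction=0.1)
per_stratum = Counter(stratum for stratum, _ in sample)
```

In this sketch only the small count tables cross partition boundaries, and the second pass can run on every partition independently, which is the property the abstract attributes to computing exact subsample sizes up front.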

Domains

Computer Science [cs]

Main file

432484_1_En_11_Chapter.pdf (1.03 MB)

Origin: Files produced by the author(s)
Licence: CC BY 4.0 - Attribution

https://inria.hal.science/hal-01648006

Submitted on: Friday, November 24, 2017, 4:49:17 PM

Last modification on: Tuesday, September 3, 2019, 3:04:02 PM

Dates and versions

hal-01648006 , version 1 (24-11-2017)

Licence

CC BY 4.0 - Attribution

Identifiers

HAL Id : hal-01648006 , version 1
DOI : 10.1007/978-3-319-47099-3_11

Cite

Minne Li, Dongsheng Li, Siqi Shen, Zhaoning Zhang, Xicheng Lu. DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets. 13th IFIP International Conference on Network and Parallel Computing (NPC), Oct 2016, Xi'an, China. pp.133-146, ⟨10.1007/978-3-319-47099-3_11⟩. ⟨hal-01648006⟩
