%0 Conference Proceedings
%T SDAC: Porting Scientific Data to Spark RDDs
%+ Beihang University (BUAA)
%+ The University of Tokyo (UTokyo)
%A Yang, Tian
%A Taura, Kenjiro
%A Chao, Liu
%< peer-reviewed
%( Lecture Notes in Computer Science
%B 14th IFIP International Conference on Network and Parallel Computing (NPC)
%C Hefei, China
%Y Xuanhua Shi
%Y Hong An
%Y Chao Wang
%Y Mahmut Kandemir
%Y Hai Jin
%I Springer International Publishing
%3 Network and Parallel Computing
%V LNCS-10578
%P 127-130
%8 2017-10-20
%D 2017
%R 10.1007/978-3-319-68210-5_13
%K Scientific data
%K Spark
%K RDDs
%K HDF5
%Z Computer Science [cs]
%Z Conference papers
%X Scientific data processing has exposed a range of technical problems in industrial exploration and domain-specific applications due to its huge input volume and diversity of data formats, while Big Data analytics frameworks such as Hadoop and Spark lack native support for efficiently processing increasingly heterogeneous scientific data. In this paper, we introduce SDAC (Scientific Data Auto Chunk), which ports various scientific data to RDDs to support parallel processing and analytics in the Apache Spark framework. By integrating an auto-chunk method for specifying task granularity, a better-planned theoretical pipeline can be derived to guide data partitioning and parallel I/O. We compare performance with H5Spark on 6 benchmarks in both standalone and distributed modes. Experimental results show that the SDAC module achieves an overall improvement of 2.1 times over H5Spark in standalone mode and 1.34 times in distributed mode.
%G English
%Z TC 10
%Z WG 10.3
%2 https://inria.hal.science/hal-01705439/document
%2 https://inria.hal.science/hal-01705439/file/457609_1_En_13_Chapter.pdf
%L hal-01705439
%U https://inria.hal.science/hal-01705439
%~ IFIP-LNCS
%~ IFIP
%~ IFIP-TC
%~ IFIP-TC10
%~ IFIP-NPC
%~ IFIP-WG10-3
%~ IFIP-LNCS-10578