%0 Conference Proceedings
%T A Scalable Inline Cluster Deduplication Framework for Big Data Protection
%+ National University of Defense Technology [China]
%+ University of Nebraska–Lincoln
%A Fu, Yinjin
%A Jiang, Hong
%A Xiao, Nong
%Z Part 5: Big-Data and Cloud Computing
%< peer-reviewed
%( Lecture Notes in Computer Science
%B 13th International Middleware Conference (MIDDLEWARE)
%C Montreal, QC, Canada
%Y Priya Narasimhan
%Y Peter Triantafillou
%I Springer
%3 Middleware 2012
%V LNCS-7662
%P 354-373
%8 2012-12-03
%D 2012
%R 10.1007/978-3-642-35170-9_18
%K Big Data protection
%K cluster deduplication
%K data routing
%K superchunk
%K handprinting
%K similarity index
%K load balance
%Z Computer Science [cs]
%Z Computer Science [cs]/Networking and Internet Architecture [cs.NI]
%Z Conference papers
%X Cluster deduplication has become a widely deployed technology in data protection services for Big Data to satisfy the requirements of service-level agreements (SLAs). However, it remains a great challenge for cluster deduplication to strike a sensible tradeoff between the conflicting goals of scalable deduplication throughput and high duplicate elimination ratio in cluster systems with low-end individual secondary storage nodes. We propose ∑-Dedupe, a scalable inline cluster deduplication framework, as a middleware deployable in cloud data centers, to meet this challenge by exploiting data similarity and locality to optimize cluster deduplication in inter-node and intra-node scenarios, respectively. Governed by a similarity-based stateful data routing scheme, ∑-Dedupe assigns similar data to the same backup server at the super-chunk granularity using a handprinting technique to maintain high cluster-deduplication efficiency without cross-node deduplication, and balances the workload of servers from backup clients. Meanwhile, ∑-Dedupe builds a similarity index over the traditional locality-preserved caching design to alleviate the chunk index-lookup bottleneck in each node. Extensive evaluation of our ∑-Dedupe prototype against state-of-the-art schemes, driven by real-world datasets, demonstrates that ∑-Dedupe achieves a cluster-wide duplicate elimination ratio almost as high as the high-overhead and poorly scalable traditional stateful routing scheme but at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio stateless routing approaches.
%G English
%Z TC 6
%Z WG 6.1
%2 https://inria.hal.science/hal-01555548/document
%2 https://inria.hal.science/hal-01555548/file/978-3-642-35170-9_18_Chapter.pdf
%L hal-01555548
%U https://inria.hal.science/hal-01555548
%~ IFIP-LNCS
%~ IFIP
%~ IFIP-TC
%~ IFIP-WG
%~ IFIP-TC6
%~ IFIP-WG6-1
%~ IFIP-MIDDLEWARE
%~ IFIP-LNCS-7662