Design notes
============

There are three main abstractions in the design of dedup:

- The chunker interface
- The snapshot layer
- The block layer

The block layer
---------------

From the outside world, the block layer is just an abstraction for
dealing with variable-length blocks.  All blocks are referenced by
their hash.

The block layer is arranged into a stack of layers.  From top to
bottom these are as follows:

- The generic layer
- The compression layer
- The encryption layer
- The storage layer

The generic layer is the one that client code interfaces with.  It is
the top-level entry point to the block layer.  The generic layer
calculates the hash of the block and passes it down to the compression
layer.

The compression layer will prepend a compression descriptor to the
block and then compress the block using snappy or lz4.  It is possible
to disable compression, in which case a special descriptor is prepended
and the data is passed uncompressed to the encryption layer.

The encryption layer will prepend an encryption descriptor to the
block and then encrypt/authenticate the block using XChaCha20 and
Poly1305.  It is possible to disable encryption, in which case it acts
as a bypass with a special type of encryption descriptor.  The block
is then passed to the storage layer.

The storage layer will prepend a storage descriptor and append the
descriptor and the data to a single backing file.

The snapshot layer
------------------

The snapshot abstraction is currently very simplistic.  A snapshot is
a file under $repo/archive/<name>.  The contents of the file are the
block hashes of the data stored in the snapshot.

The chunker interface
---------------------

The chunker issues variable-length blocks.  The minimum block size is
512KB, the maximum block size is 8MB and the average block size is
2MB.  These configuration parameters can be modified by editing
config.h, but it can be tricky to tune them properly.

The buzhash[0] rolling hash algorithm is used to fingerprint the input
stream.

When encryption is enabled, a random seed is generated and stored
encrypted in the repository state file.  The seed is XOR-ed with the
buzhash initial state table to mitigate length fingerprinting
attacks.

[0] http://www.serve.net/buz/Notes.1st.year/HTML/C6/rand.012.html
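
As an illustration of the chunker described above, here is a minimal
Python sketch of buzhash-based content-defined chunking.  It is not the
dedup implementation.  The names (make_table, split_chunks), the
32-bit hash width, the 48-byte window, the "low bits are zero" cut
rule, the 2^21 mask used to approximate a 2MB average, and the way the
seed is XOR-ed into every table entry are all assumptions made for the
example.  Only the size limits and the use of a seeded buzhash table
come from the notes.

```python
import random

MASK32 = 0xFFFFFFFF

# Size limits taken from the notes above.
MIN_SIZE = 512 * 1024
MAX_SIZE = 8 * 1024 * 1024
# Assumption: cut when the low 21 bits of the hash are zero, so a cut
# is expected roughly every 2 MiB of eligible input.
AVG_MASK = (1 << 21) - 1
# Assumption: hypothetical rolling window size in bytes.
WINDOW = 48


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def make_table(seed=None):
    """Build a 256-entry buzhash table.

    The base table is a fixed pseudo-random stand-in for a real
    hard-coded table.  If a seed is given, it is XOR-ed into every
    entry (one possible reading of "XOR-ed with the table").
    """
    rng = random.Random(0)
    table = [rng.getrandbits(32) for _ in range(256)]
    if seed is not None:
        table = [t ^ (seed & MASK32) for t in table]
    return table


def split_chunks(data, table, min_size=MIN_SIZE, max_size=MAX_SIZE,
                 mask=AVG_MASK, window=WINDOW):
    """Split data into variable-length chunks at content-defined cuts."""
    if not window <= min_size <= max_size:
        raise ValueError("need window <= min_size <= max_size")
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)   # hard upper bound for this chunk
        cut = end                        # default: forced cut at the bound
        pos = start + min_size           # first position allowed to cut
        if pos < end:
            # Prime the hash over the window that ends at pos.
            h = 0
            for b in data[pos - window:pos]:
                h = rotl32(h, 1) ^ table[b]
            while pos < end:
                if (h & mask) == 0:
                    cut = pos
                    break
                # Roll: drop the oldest byte, take in the next one.
                h = (rotl32(h, 1)
                     ^ rotl32(table[data[pos - window]], window)
                     ^ table[data[pos]])
                pos += 1
        chunks.append(data[start:cut])
        start = cut
    return chunks
```

Because cut points depend only on the bytes inside the rolling window,
inserting data near the start of a stream usually shifts only the
nearby boundaries, and later chunks tend to hash to the same blocks as
before.  A caller would pass make_table(seed) when encryption is
enabled and make_table() otherwise, then hand each chunk to the block
layer.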