dedup

deduplicating backup program
git clone git://git.2f30.org/dedup

Design notes
============

There are three main abstractions in the design of dedup:

  - The chunker interface
  - The snapshot layer
  - The block layer

The block layer
---------------

Seen from the outside, the block layer is an abstraction for handling
variable-length blocks.  Every block is referenced by its hash.

The block layer is arranged as a stack of layers.  From top to bottom
these are:

  - The generic layer
  - The compression layer
  - The encryption layer
  - The storage layer

The generic layer is the one that client code interfaces with.  It is
the top-level entry point to the block layer.  The generic layer
calculates the hash of the block and passes the block down to the
compression layer.

The compression layer prepends a compression descriptor to the block
and compresses the block using snappy or lz4.  Compression can be
disabled, in which case a special descriptor is prepended and the
data is passed uncompressed to the encryption layer.

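As a rough illustration of the descriptor idea, the C sketch below
frames a block for the case where compression is disabled.  The
one-byte type tag, the 8-byte little-endian length field and the
names COMPR_NONE, COMPR_HDR_SIZE and compr_wrap_none are assumptions
made for this example, not dedup's actual format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout: 1 type byte + 8 length bytes. */
#define COMPR_NONE	0
#define COMPR_HDR_SIZE	((size_t)9)

/*
 * Prepend a "no compression" descriptor to buf and copy the data
 * after it unchanged.  Returns the number of bytes written to out,
 * or 0 if out is too small.
 */
size_t
compr_wrap_none(const uint8_t *buf, size_t n, uint8_t *out, size_t outcap)
{
	uint64_t len = n;
	int i;

	assert(buf != NULL && out != NULL);
	if (outcap < COMPR_HDR_SIZE || outcap - COMPR_HDR_SIZE < n)
		return 0;

	out[0] = COMPR_NONE;
	/* Store the payload length in little-endian byte order. */
	for (i = 0; i < 8; i++)
		out[1 + i] = (uint8_t)(len >> (8 * i));
	memcpy(out + COMPR_HDR_SIZE, buf, n);
	return COMPR_HDR_SIZE + n;
}
```

A layer further down can treat the result as an opaque buffer, and
the matching read path can inspect the first byte to decide whether
any decompression is needed.
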
The encryption layer prepends an encryption descriptor to the block
and encrypts and authenticates the block using XChaCha20 and
Poly1305.  Encryption can be disabled, in which case the layer acts
as a bypass and uses a special type of encryption descriptor.  The
block is then passed to the storage layer.

The storage layer prepends a storage descriptor and appends the
descriptor and the data to a single backing file.

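The append-only behaviour can be sketched in C as follows.  The
descriptor here is assumed to be nothing more than an 8-byte
little-endian length, and STORE_HDR_SIZE and store_append are
hypothetical names; the real descriptor may carry other fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define STORE_HDR_SIZE 8	/* hypothetical: 8-byte length field */

/*
 * Append one record (descriptor followed by data) to the end of the
 * backing file.  Returns the offset at which the record starts, or
 * -1 on error.
 */
long
store_append(FILE *fp, const uint8_t *buf, size_t n)
{
	uint8_t hdr[STORE_HDR_SIZE];
	uint64_t len = n;
	long off;
	int i;

	assert(fp != NULL && buf != NULL);
	if (fseek(fp, 0, SEEK_END) != 0)
		return -1;
	if ((off = ftell(fp)) < 0)
		return -1;

	/* Encode the data length in little-endian byte order. */
	for (i = 0; i < STORE_HDR_SIZE; i++)
		hdr[i] = (uint8_t)(len >> (8 * i));
	if (fwrite(hdr, 1, sizeof(hdr), fp) != sizeof(hdr))
		return -1;
	if (fwrite(buf, 1, n, fp) != n)
		return -1;
	return off;
}
```

Returning the record offset is a choice made for this sketch: it
gives the caller something to remember if it wants to find the
record again later.
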
The snapshot layer
------------------

The snapshot abstraction is currently very simple.  A snapshot is a
file under $repo/archive/<name>.  The file contains the block hashes
of the data stored in the snapshot.

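A snapshot reader could look roughly like the C sketch below.  It
assumes that the hashes are stored back to back as raw bytes and that
each hash is 32 bytes long; neither detail is stated in these notes,
and HASH_SIZE and snap_next are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HASH_SIZE 32	/* hypothetical hash length in bytes */

/*
 * Read the next block hash from a snapshot file.  Returns 1 if a
 * hash was read, 0 at end of file, and -1 on a short read or an
 * I/O error.
 */
int
snap_next(FILE *fp, uint8_t hash[HASH_SIZE])
{
	size_t n;

	assert(fp != NULL && hash != NULL);
	n = fread(hash, 1, HASH_SIZE, fp);
	if (n == HASH_SIZE)
		return 1;
	if (n == 0 && feof(fp))
		return 0;
	return -1;
}
```

Calling snap_next in a loop yields the hashes in stored order; each
one could then be handed to the block layer to fetch the matching
data.
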
The chunker interface
---------------------

The chunker emits variable-length blocks.  The minimum block size is
512 KB, the maximum block size is 8 MB and the average block size is
2 MB.  These parameters can be changed by editing config.h, but
tuning them properly can be tricky.

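One common way to combine such size limits with a rolling hash is
shown in the C sketch below.  The sizes come from the text; the macro
names, the function is_boundary and the mask test itself are
assumptions, and dedup's own cut condition may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes from the text; the macro names are hypothetical. */
#define BLKSIZE_MIN ((size_t)512 * 1024)
#define BLKSIZE_AVG ((size_t)2 * 1024 * 1024)
#define BLKSIZE_MAX ((size_t)8 * 1024 * 1024)

/*
 * Decide whether the current block should end after len bytes, given
 * the rolling hash fp of the most recent input window.
 */
int
is_boundary(uint32_t fp, size_t len)
{
	if (len < BLKSIZE_MIN)
		return 0;	/* too short to cut */
	if (len >= BLKSIZE_MAX)
		return 1;	/* forced cut */
	/* Cut when the low bits of the hash are all zero. */
	return (fp & (uint32_t)(BLKSIZE_AVG - 1)) == 0;
}
```

For a uniformly distributed hash the mask test fires about once every
BLKSIZE_AVG positions.  The minimum and maximum limits shift the
resulting mean block size, which is one reason tuning these values
is not straightforward.
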
The buzhash[0] rolling hash algorithm is used to fingerprint the input
stream.

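For readers unfamiliar with buzhash, a minimal 32-bit version is
sketched below in C.  The window size, the way the table is filled
and all function names are assumptions made for illustration; they
are not taken from dedup's chunker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINSIZE 48	/* hypothetical window size in bytes */

static uint32_t buz[256];	/* substitution table, one entry per byte */

/* Rotate x left by n bits; safe for any n, including multiples of 32. */
static uint32_t
rotl32(uint32_t x, unsigned n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}

/*
 * Fill the table with placeholder values from a xorshift generator.
 * A real table would hold fixed, well-mixed random constants.
 */
void
buz_init(void)
{
	uint32_t s = 2463534242u;
	int i;

	for (i = 0; i < 256; i++) {
		s ^= s << 13;
		s ^= s >> 17;
		s ^= s << 5;
		buz[i] = s;
	}
}

/* Hash a full window of WINSIZE bytes from scratch. */
uint32_t
buz_hash(const uint8_t *win)
{
	uint32_t h = 0;
	size_t i;

	for (i = 0; i < WINSIZE; i++)
		h = rotl32(h, 1) ^ buz[win[i]];
	return h;
}

/* Slide the window by one byte: remove out, add in. */
uint32_t
buz_roll(uint32_t h, uint8_t out, uint8_t in)
{
	return rotl32(h, 1) ^ rotl32(buz[out], WINSIZE) ^ buz[in];
}
```

After buz_init, calling buz_roll once per input byte gives the same
value as rehashing the shifted window with buz_hash, at constant cost
per byte.  That property is what makes the hash practical for
scanning a stream.
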
When encryption is enabled, a random seed is generated and stored
encrypted in the repository state file.  The seed is XOR-ed with the
buzhash initial state table to mitigate length-fingerprinting
attacks.

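The XOR step might be sketched in C as below.  It assumes a single
32-bit seed applied to every table entry; the seed width, the point
at which it is applied and the name buz_apply_seed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * XOR a secret seed into every entry of a buzhash table, in place.
 * Applying the same seed a second time restores the original table.
 */
void
buz_apply_seed(uint32_t *tab, size_t n, uint32_t seed)
{
	size_t i;

	assert(tab != NULL);
	for (i = 0; i < n; i++)
		tab[i] ^= seed;
}
```

With encryption disabled no seed is mentioned in the text, so in
this sketch that case would simply skip the call and leave the table
as it is.
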
[0] http://www.serve.net/buz/Notes.1st.year/HTML/C6/rand.012.html