mirror of https://github.com/netsurf-browser/netsurf
commit 05932293f6 (parent 4a49ff5266)
Add filesystem based backing store

Docs/source-object-backing-store (new file, 194 lines)
@@ -0,0 +1,194 @@
Source Object (low level) cache backing store
=============================================

Introduction
------------

The source object cache provides a system to extend the life of source
objects (HTML files, images, etc.) after they are no longer immediately
being used.

Only fetch types where we have well defined rules on caching are
considered; in practice this limits us to HTTP(S). The section on
caching in RFC 2616 [1] specifies these rules.

To further extend an object's lifetime it can be pushed into a backing
store, where objects are available for reuse less quickly than from
RAM but faster than retrieving them from the network again.

The backing store implementation provides a key:value infrastructure
with a simple store, retrieve and invalidate interface.

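Purely as an illustration of the shape of that interface (the names and
signatures below are hypothetical, not the actual NetSurf backing store
API), a pluggable implementation might be described by an operation
table such as:

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical operation table for a pluggable backing store.
   * Keys are source URLs; flags select object data or metadata.
   */
  enum hyp_store_flags {
          HYP_STORE_NONE = 0,
          HYP_STORE_META = 1  /* operate on the metadata for the URL */
  };

  struct hyp_backing_store {
          /* place an object (or its metadata) into the store */
          int (*store)(const char *url, enum hyp_store_flags flags,
                       const uint8_t *data, size_t datalen);
          /* retrieve a previously stored object */
          int (*fetch)(const char *url, enum hyp_store_flags flags,
                       uint8_t **data_out, size_t *datalen_out);
          /* drop all objects stored against a key */
          int (*invalidate)(const char *url);
  };
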
Generic filesystem backing store
--------------------------------

Although the backing store interface is fully pluggable, a generic
implementation is provided which stores objects on the filesystem in a
hierarchy of directories.

The option to alter the backing store format exists and is controlled
by a version field. It is implementation defined what happens if a
version mismatch occurs.

As the backing store only holds cache data, one should not expect a
great deal of effort to be expended converting formats (i.e. the cache
may simply be discarded).


Layout version 1
----------------

An object has an identifier value generated from the URL (the NetSurf
backing store uses the URL as the unique key). The value used is
obtained using nsurl_hash(), which is currently a 32 bit FNV, so is
directly usable.

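For reference, a 32 bit FNV-1 hash of a URL string can be computed as
in the sketch below; the exact FNV variant and the input bytes used by
nsurl_hash() may differ, so treat this as illustrative only:

  #include <stdint.h>

  /* 32 bit FNV-1 hash of a NUL terminated string */
  static uint32_t fnv_32(const char *s)
  {
          const unsigned char *p = (const unsigned char *)s;
          uint32_t h = 0x811c9dc5u;          /* FNV offset basis */

          while (*p != '\0') {
                  h *= 0x01000193u;          /* FNV prime */
                  h ^= (uint32_t)*p++;
          }
          return h;
  }
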
This identifier is adequate to ensure that the collision rate for the
hashed URL values (a collision for every 2^16 URLs added) is
sufficiently low that the overhead of returning the wrong object (which
backing stores are permitted to do) is not significant.

An entry list is maintained which contains all the metadata about a
given identifier. This list is limited in length to constrain the
resources necessary to maintain it. It is made persistent to avoid the
overhead of reconstructing it at initialisation and to keep the data
used to improve the eviction decisions.

Each object is stored and retrieved directly from the filesystem using
a filename generated from a base64url encoding of an address
value. The object's address is derived from the identifier by cropping
it to a shorter length.

A mapping between the object address and its entry is maintained, which
uses storage directly proportional to the size of the address length.

The cropping length is stored in the control file, with the default
values set at compile time. This allows existing backing stores to
continue operating with existing data independently of new default
settings. This setting gives some ability to tune the default cache
index size to values suitable for a specific host operating system.

E.g. Linux based systems can easily cope with several megabytes of
mmaped index but RISC OS might want to limit this to a few megabytes
of heap at most.

The files are stored on disc using their base64url address value.
By creating a directory for each character of the encoded filename
(except the last, which is of course the leafname) we create a
directory structure where no directory has more than 64 entries.

E.g. A 19 bit address of 0x1 would be base64url encoded into AAAB,
resulting in the data being stored in a file path of
"/store/prefix/data/B/A/A/BAAAAA".

An address of 0x00040001 encodes to BAAB and a file path of
"/store/prefix/meta/B/A/A/BAABAA".

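A sketch of that naming scheme, inferred from the two examples above
(it reproduces both paths, but the real fs_backing_store.c may differ
in detail, for instance in the leafname length or character ordering):

  #include <stdint.h>
  #include <stdio.h>

  /* base64url alphabet */
  static const char b64u[] =
          "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

  /* Build a storage path for an object address.  The leafname encodes
   * the full 32 bit address six bits at a time from the least
   * significant end; each character of the encoded cropped address,
   * except the last, becomes a directory level.
   *
   * e.g. address_to_path(0x1, 19, "/store/prefix", "data", buf, len)
   * yields "/store/prefix/data/B/A/A/BAAAAA".
   */
  static void address_to_path(uint32_t addr, unsigned address_bits,
                              const char *prefix, const char *elem,
                              char *path, size_t pathlen)
  {
          unsigned crop_chars = (address_bits + 5) / 6;
          char leaf[7];
          char dirs[16];
          unsigned i;
          size_t d = 0;

          /* leafname: all 32 bits, least significant six first */
          for (i = 0; i < 6; i++) {
                  leaf[i] = b64u[(addr >> (6 * i)) & 0x3f];
          }
          leaf[6] = '\0';

          /* one directory per cropped-address character bar the last */
          for (i = 0; i + 1 < crop_chars; i++) {
                  dirs[d++] = b64u[(addr >> (6 * i)) & 0x3f];
                  dirs[d++] = '/';
          }
          dirs[d] = '\0';

          snprintf(path, pathlen, "%s/%s/%s%s", prefix, elem, dirs, leaf);
  }
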
Control files
~~~~~~~~~~~~~

control
+++++++

A control file is used to hold a list of values describing how the
other files in the backing store should be used.

entries
+++++++

This file contains a table of entries describing the files held on the
filesystem.

Each control file table entry is 28 bytes and consists of the fields
below (see the struct sketch after the list):

- signed 64 bit value for last use time

- 32 bit full URL hash allowing for index reconstruction and
  additional collision detection. Also the possibility of increasing
  the ADDRESS_LENGTH although this would require renaming all the
  existing files in the cache and is not currently implemented.

- unsigned 32 bit length for data

- unsigned 32 bit length for metadata

- unsigned 16 bit value for number of times used

- unsigned 16 bit value for flags

- unsigned 16 bit value for data block index (unused)

- unsigned 16 bit value for metadata block index (unused)

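Expressed as a C structure this layout might look like the sketch
below (field names are illustrative; note that without explicit
packing a compiler will typically pad this structure to 32 bytes in
memory even though the on-disc entry is 28 bytes):

  #include <stdint.h>

  /* One entry in the entries table: 8+4+4+4+2+2+2+2 = 28 bytes on disc */
  struct store_entry {
          int64_t  last_used;   /* signed 64 bit last use time */
          uint32_t full_hash;   /* full 32 bit url hash */
          uint32_t data_len;    /* length of the stored data */
          uint32_t meta_len;    /* length of the stored metadata */
          uint16_t use_count;   /* number of times used */
          uint16_t flags;       /* entry flags */
          uint16_t data_block;  /* data block index (unused) */
          uint16_t meta_block;  /* metadata block index (unused) */
  };
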
Address to entry index
~~~~~~~~~~~~~~~~~~~~~~

An entry index is held in RAM which allows an address to be looked up
and mapped to an entry in the control file.

The index is the only data structure whose size is directly dependent
on the length of the hash, specifically:

    (2 ^ (ADDRESS_BITS - 3)) * ENTRY_BITS bytes

where ADDRESS_BITS is how long the address is in bits and ENTRY_BITS
determines how many entries (2 ^ ENTRY_BITS) the control file (and
hence the whole cache) may hold.

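As a worked example of that formula (the macro name is illustrative):

  /* bytes used by the index: an ENTRY_BITS wide slot for each of the
   * 2^ADDRESS_BITS possible addresses */
  #define INDEX_SIZE_BYTES(address_bits, entry_bits) \
          ((1u << ((address_bits) - 3)) * (entry_bits))

  /* ADDRESS_BITS = 22, ENTRY_BITS = 16:
   *   (1 << 19) * 16 = 8,388,608 bytes, the roughly 8 megabyte index
   *   mentioned for larger systems below. */
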
RISC OS values
++++++++++++++

By limiting the ENTRY_BITS size to 14 (16,384 entries) the entries
list is limited to 448 kilobytes.

The typical values for RISC OS would set ADDRESS_BITS to 18. This
spreads the entries over 262,144 hash values, which uses 512 kilobytes
for the index. Limiting the hash space like this reduces the
effectiveness of the cache.

A small ADDRESS_LENGTH causes a collision (two URLs with the same
address) to happen, by the usual birthday estimate, roughly once for
every 2 ^ (ADDRESS_BITS / 2) = 2 ^ 9 = 512 objects stored. This roughly
translates to a cache miss due to collision every ten pages navigated
to.

Larger systems
++++++++++++++

In general ENTRY_BITS is set to 16, as this limits the store to 65,536
objects, which given an average object size of 8 kilobytes yields half
a gigabyte of disc used, which is judged to be sufficient.

For larger systems, e.g. those using the GTK frontend, we would most
likely select ADDRESS_BITS as 22, resulting in a collision every 2,048
objects but the index using some 8 megabytes.

Typical values
--------------

Example 1
~~~~~~~~~

For a store with 1,034 objects generated from a random navigation of
pages linked from the about:welcome page:

Metadata total size is 593,608 bytes, an average of 574 bytes per
object. The majority of the storage is used to hold the URLs and
headers.

Data total size is 9,180,475 bytes, a mean of 8,879 bytes, with
1,648,726 bytes in the largest 10 entries; excluding those gives an
average size of 7,355 bytes.

Example 2
~~~~~~~~~

355 pages navigated in 80 minutes from the about:welcome page and a
handful of additional sites (google image search and reddit):

2,018 objects in cache at quit; 400 objects from news.bbc.co.uk alone.

Metadata total 987,439 bytes, a mean of 489 bytes.

Data total 33,127,831 bytes, a mean of 16,416 bytes,
with one single 5,000,811 byte gif.

Data totals without the gif are 28,127,020 bytes, mean 13,945 bytes.

[1] http://tools.ietf.org/html/rfc2616#section-13
@@ -86,6 +86,11 @@ NETSURF_HOMEPAGE := "about:welcome"
 # Valid options: YES, NO
 NETSURF_USE_LIBICONV_PLUG := YES
 
+# Enable building the source object cache filesystem based backing store.
+# implementation.
+# Valid options: YES, NO
+NETSURF_FS_BACKING_STORE := NO
+
 # Initial CFLAGS. Optimisation level etc. tend to be target specific.
 CFLAGS :=
 
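Assuming the usual NetSurf convention of overriding Makefile.defaults
from a local Makefile.config, the store could then be enabled with, for
example:

  # in Makefile.config (read after Makefile.defaults)
  NETSURF_FS_BACKING_STORE := YES

The content Makefile change below then compiles fs_backing_store.c
only when this option is set to YES.
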
@@ -3,4 +3,9 @@
 S_CONTENT := content.c content_factory.c dirlist.c fetch.c hlcache.c \
 	llcache.c mimesniff.c urldb.c no_backing_store.c
 
+# Make filesystem backing store available
+ifeq ($(NETSURF_FS_BACKING_STORE),YES)
+S_CONTENT += fs_backing_store.c
+endif
+
 S_CONTENT := $(addprefix content/,$(S_CONTENT))
content/fs_backing_store.c (new file, 1197 lines)
File diff suppressed because it is too large.