fix source object backing store documentation

fix formatting
add section on version 2 layout
Vincent Sanders 2020-03-26 23:38:01 +00:00
parent 51dc59ecc9
commit 6b0cb5479f
1 changed files with 73 additions and 47 deletions


@ -1,26 +1,36 @@
Source Object (low level) cache backing store
=============================================
Introduction
------------
[TOC]
The source object cache provides a system to extend the life of source
objects (HTML files, images etc.) after they are no longer immediately
being used.
# Introduction
Only fetch types where we have well defined rules on caching are
considered, in practice this limits us to HTTP(S). The section in
RFC2616 [1] on caching specifies these rules.
The source object (referred to as low level in the code) content cache
provides a unified API for the rest of the browser to retrieve objects
(HTML files, images etc.) from a URL.
The cache initially always fulfils these requests by using the fetcher
system to retrieve data according to the URL scheme (network for HTTP,
disc for file etc.) and storing the result in memory.
The cache also provides a system to extend the life of source objects
in memory when they are no longer immediately being used. Only fetch
types where there are well defined rules on caching are considered, in
practice this limits the cache to URLs with HTTP(S) schemes. The
section in RFC2616 [1] on caching specifies these rules.
To further extend the objects' lifetime they can be pushed into a
backing store where the objects are available for reuse less quickly
than from RAM but faster than retrieving from the network again.
than from memory but faster than retrieving from the network again.
The backing store implementation provides a key:value infrastructure
with a simple store, retrieve and invalidate interface.
Generic filesystem backing store
--------------------------------
The key is the object URL which by definition is unique for a source
object. The value is the source object data *and* the associated
metadata.
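
The store, retrieve and invalidate interface described above can be pictured with a toy in-memory stand-in. This is only a sketch: the class and method names are hypothetical and are not NetSurf's backing store API. It simply shows the URL acting as the key and the data plus metadata acting as the value.

```python
from typing import Dict, Optional, Tuple


class MemoryBackingStore:
    """Toy key:value store (hypothetical names, not the NetSurf API)."""

    def __init__(self) -> None:
        # key: object URL, value: (object data, associated metadata)
        self._entries: Dict[str, Tuple[bytes, bytes]] = {}

    def store(self, url: str, data: bytes, metadata: bytes) -> None:
        """Insert or overwrite the entry for this URL."""
        self._entries[url] = (data, metadata)

    def retrieve(self, url: str) -> Optional[Tuple[bytes, bytes]]:
        """Return (data, metadata), or None if the URL is not stored."""
        return self._entries.get(url)

    def invalidate(self, url: str) -> None:
        """Drop the entry for this URL; absent URLs are ignored."""
        self._entries.pop(url, None)
```

A real backing store would persist the entries rather than hold them in a dictionary, but the three operations keep the same shape.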
# Generic filesystem backing store
Although the backing store interface is fully pluggable a generic
implementation based on storing objects on the filesystem in a
@ -34,13 +44,45 @@ As the backing store only holds cache data one should not expect a
great deal of effort to be expended converting formats (i.e. the cache
may simply be discarded).
Layout version 1.1
------------------
## Layout version 2.02
An object has an identifier value generated from the URL (NetSurf
backing stores use the URL as the unique key). The value used is
obtained using nsurl_hash() which is currently a 32 bit FNV so is
directly usable.
The version 2 layout stores cache entries in a hash map thus only uses
memory proportional to the number of entries present removing the need
for large fixed size indexes.
The object identifier is generated from nsurl_hash() and data entries
are stored either in fixed size disc blocks or in separate files on disc.
The file path if stored on disc must conform to the limitations of all
the filesystems the cache can be placed upon.
From http://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits the relevant subset is:
- path elements no longer than 8 characters
- acceptable characters are A-Z, 0-9
- short total path lengths (255 or less)
- no more than 77 entries per directory (six bits, i.e. 64 entries, fits within this)
The short total path lengths mean the encoding must represent as much
data as possible in the least number of characters.
To achieve all these goals we use RFC4648 base32 encoding which packs
five bits into each character of the filename. By splitting the 32 bit
identifier using six bits per directory level, only five levels of
directory are required with a maximum of 64 entries per
directory. This requires a total path length of 22 bytes (including
directory separators), for example BA/BB/BC/BD/BE/ABCDEFG
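
A minimal sketch of how such a path could be built is given below. The bit order (low bits first), the use of two base32 characters per directory level and the function names are assumptions made for illustration; the actual implementation may slice the identifier differently. The sketch only demonstrates that five six-bit directory levels plus a seven character leaf name give a 22 character path.

```python
BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # RFC 4648 base32

DIR_LEVELS = 5       # five directory levels, per the text
BITS_PER_LEVEL = 6   # six bits per level, so at most 64 entries each


def encode_base32(value: int, chars: int) -> str:
    """Encode value into `chars` characters, five bits per character.

    Low bits are emitted first; this ordering is an assumption.
    """
    return "".join(
        BASE32_ALPHABET[(value >> (5 * i)) & 0x1F] for i in range(chars)
    )


def object_path(ident: int) -> str:
    """Build a relative path for a 32 bit identifier (hypothetical helper)."""
    ident &= 0xFFFFFFFF
    parts = []
    for level in range(DIR_LEVELS):
        six_bits = (ident >> (BITS_PER_LEVEL * level)) & 0x3F
        # Six bits need two base32 characters (the second carries one bit).
        parts.append(encode_base32(six_bits, 2))
    # The leaf name carries the whole identifier: 32 bits need 7 characters.
    parts.append(encode_base32(ident, 7))
    return "/".join(parts)
```

Under these assumptions `object_path(1)` yields `BA/AA/AA/AA/AA/BAAAAAA`, and every path uses only characters from the base32 alphabet.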
Files that are under 8KiB in size are stored in "small block files".
These are pre-allocated 8 Megabyte files on disc which remove the
need to have many, many small files stored on disc, at the expense of
some wasted space for files that are smaller than the 8K
block size.
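
The arithmetic behind the small block files can be sketched as follows. The constants come from the text, with "8 Megabyte" read as 8 MiB (an assumption); the mapping from a block index to a file number and byte offset is a hypothetical layout chosen for illustration, not a description of the real code.

```python
BLOCK_SIZE = 8 * 1024              # 8 KiB block size, per the text
BLOCK_FILE_SIZE = 8 * 1024 * 1024  # assumes "8 Megabyte" means 8 MiB
BLOCKS_PER_FILE = BLOCK_FILE_SIZE // BLOCK_SIZE


def uses_small_block(length: int) -> bool:
    """Objects under the block size go into a small block file."""
    return length < BLOCK_SIZE


def block_location(block_index: int) -> tuple:
    """Map a block index to (block file number, byte offset in that file).

    Hypothetical layout: blocks are numbered consecutively across files.
    """
    file_number, slot = divmod(block_index, BLOCKS_PER_FILE)
    return file_number, slot * BLOCK_SIZE


def wasted_bytes(length: int) -> int:
    """Space left unused in a block holding an object of this length."""
    if not uses_small_block(length):
        raise ValueError("object does not fit in a small block")
    return BLOCK_SIZE - length
```

With these values each block file holds 1024 blocks, so a 3000 byte object leaves 5192 bytes of its block unused.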
## Layout version 1.1
An object has an identifier value generated from the URL (the unique
key). The value used is obtained using nsurl_hash() which is currently
a 32 bit FNV so is directly usable.
This identifier is adequate to ensure the collision rate for the
hashed URL values (a collision for every 2^16 URLs added) is
@ -83,26 +125,23 @@ resulting in the data being stored in a file path of
An address of 0x00040001 encodes to BAAB and a file path of
"/store/prefix/m/B/A/A/BAABAAA"
Version 1.0
-----------
## Layout Version 1.0
The version 1 layout was identical to the 1.1 except base64url
The version 1.0 layout was identical to the 1.1 except base64url
encoding was used. This proved problematic as some systems' filesystems
were case insensitive so upper and lower case letters collided.
There is no upgrade provision from the previous version; simply delete
the cache directory.
Control files
~~~~~~~~~~~~~
## Control files
### control
control
+++++++
A control file is used to hold a list of values describing how the
other files in the backing store should be used.
entries
+++++++
### entries
This file contains a table of entries describing the files held on the
filesystem.
@ -110,26 +149,18 @@ filesystem.
Each control file table entry is 28 bytes and consists of
- signed 64 bit value for last use time
- 32bit full url hash allowing for index reconstruction and
additional collision detection. Also the possibility of increasing
the ADDRESS_LENGTH although this would require renaming all the
existing files in the cache and is not currently implemented.
- unsigned 32bit length for data
- unsigned 32bit length for metadata
- unsigned 16bit value for number of times used.
- unsigned 16bit value for flags
- unsigned 16bit value for data block index (unused)
- unsigned 16bit value for metadata block index (unused)
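
The field list above adds up to 28 bytes (8 + 4 + 4 + 4 + 2 + 2 + 2 + 2), which can be checked with a packed layout. The sketch below assumes the fields are stored in the listed order, little-endian and without padding; the byte order and the field names are assumptions, not taken from the implementation.

```python
import struct
from collections import namedtuple

# q: signed 64 bit, I: unsigned 32 bit, H: unsigned 16 bit.
# "<" selects little-endian with no padding (byte order is an assumption).
ENTRY_FORMAT = "<qIIIHHHH"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)

# Field names are illustrative only.
StoreEntry = namedtuple(
    "StoreEntry",
    "last_used url_hash data_len meta_len use_count flags "
    "data_block meta_block",
)


def pack_entry(entry: StoreEntry) -> bytes:
    """Serialise one table entry into its fixed-size record."""
    return struct.pack(ENTRY_FORMAT, *entry)


def unpack_entry(raw: bytes) -> StoreEntry:
    """Parse one fixed-size record back into its fields."""
    return StoreEntry(*struct.unpack(ENTRY_FORMAT, raw))
```

Because every record has the same size, entry number `n` would sit at byte offset `n * ENTRY_SIZE` within the table.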
Address to entry index
~~~~~~~~~~~~~~~~~~~~~~
### Address to entry index
An entry index is held in RAM that allows looking up the address to
map to an entry in the control file.
@ -137,14 +168,13 @@ map to an entry in the control file.
The index is the only data structure whose size is directly dependent
on the length of the hash specifically:
(2 ^ (ADDRESS_BITS - 3)) * ENTRY_BITS in bytes
(2 ^ (ADDRESS_BITS - 3)) * ENTRY_BITS in bytes
where ADDRESS_BITS is the length of the address in bits and ENTRY_BITS
is the number of bits in an entry index, so the control file (and hence
the whole cache) may hold 2 ^ ENTRY_BITS entries.
RISCOS values
+++++++++++++
## RISCOS values
By limiting the ENTRY_BITS size to 14 (16,384 entries) the entries
list is limited to 448 kilobytes.
@ -159,8 +189,7 @@ address) to happen roughly for every 2 ^ (ADDRESS_BITS / 2) = 2 ^ 9 =
512 objects stored. This roughly translates to a cache miss due to
collision every ten pages navigated to.
Larger systems
++++++++++++++
## Larger systems
In general ENTRY_BITS is set to 16 as this limits the store to 65536
objects which given the average size of an object at 8 kilobytes
@ -170,11 +199,9 @@ For larger systems e.g. those using GTK frontend we would most likely
select ADDRESS_BITS as 22 resulting in a collision every 2048 objects
but the index using some 8 Megabytes.
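
The sizing figures quoted in these sections can be reproduced with a few lines of arithmetic. The sketch below uses the index size formula and the 28 byte entry size given earlier; the function names are hypothetical, and the RISC OS ADDRESS_BITS value of 18 is inferred from the "2 ^ 9" collision figure rather than stated outright.

```python
ENTRY_SIZE_BYTES = 28  # size of one control file table entry, per the text


def index_size_bytes(address_bits: int, entry_bits: int) -> int:
    """In-memory index size: 2^ADDRESS_BITS slots of ENTRY_BITS bits each."""
    return (2 ** (address_bits - 3)) * entry_bits


def entries_size_bytes(entry_bits: int) -> int:
    """Size of the entries list holding 2^ENTRY_BITS fixed-size records."""
    return (2 ** entry_bits) * ENTRY_SIZE_BYTES


def objects_per_collision(address_bits: int) -> int:
    """Rough number of stored objects before an address collision."""
    return 2 ** (address_bits // 2)
```

For example `entries_size_bytes(14)` gives 458752 bytes (448 kilobytes), `objects_per_collision(22)` gives 2048, and `index_size_bytes(22, 16)` gives 8388608 bytes (8 Megabytes).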
Typical values
--------------
## Typical values
Example 1
~~~~~~~~~
### Example 1
For a store with 1034 objects generated from a random navigation of
pages linked from the about:welcome page.
@ -185,8 +212,7 @@ majority of the storage is used to hold the URLs and headers.
Data total size is 9180475 bytes, a mean of 8879 bytes. 1648726 bytes
are in the largest 10 entries, which if excluded gives a 7355 byte
average size.
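
As a quick check of the averages quoted for this example, the two means can be recomputed from the totals given in the text. Only figures from the text are used; the variable names are illustrative.

```python
OBJECT_COUNT = 1034          # objects in the store
DATA_TOTAL = 9180475         # total data size in bytes
LARGEST_TEN_TOTAL = 1648726  # bytes held by the ten largest entries

# Mean size over every object.
mean_all = DATA_TOTAL / OBJECT_COUNT

# Mean size once the ten largest entries are left out.
mean_without_largest = (DATA_TOTAL - LARGEST_TEN_TOTAL) / (OBJECT_COUNT - 10)

print(round(mean_all), round(mean_without_largest))
```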
Example 2
~~~~~~~~~
### Example 2
355 pages navigated in 80 minutes from about:welcome page and a
handful of additional sites (Google image search and Reddit)
@ -201,4 +227,4 @@ with one single 5,000,811 byte gif
Data total without the gif is 28,127,020, a mean of 13,945.
[1] http://tools.ietf.org/html/rfc2616#section-13
[1] http://tools.ietf.org/html/rfc2616#section-13