ECS metadata
ECS maintains its own metadata to track data locations and transaction history. This metadata is kept in logical tables and journals.
The tables hold key-value pairs that store information about the objects. A hash function enables fast lookups of the value associated with a key. These key-value pairs are stored in a B+ tree for fast indexing of data locations. Storing the key-value pairs in a balanced, searchable tree structure such as a B+ tree allows the locations of data and metadata to be accessed quickly. To further enhance query performance of these logical tables, ECS implements a two-level log-structured merge (LSM) tree: a smaller tree is held in memory (the memory table), and the main B+ tree resides on disk. A lookup of a key-value pair first queries the memory table; if the value is not in memory, the lookup falls back to the main B+ tree on disk.
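This two-level lookup can be pictured with a short sketch. The following Python snippet is illustrative only, not ECS source code: the memory table is a plain dictionary, the on-disk B+ tree is approximated with a sorted key list and binary search, and class names such as DiskTree and TwoLevelTable are hypothetical.

```python
import bisect


class DiskTree:
    """Stand-in for the on-disk B+ tree: sorted keys with parallel values."""

    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


class TwoLevelTable:
    """Memory table in front of the on-disk tree, queried in that order."""

    def __init__(self):
        self.memtable = {}          # recent key-value pairs still in memory
        self.disktree = DiskTree()  # bulk of the table on disk

    def get(self, key):
        if key in self.memtable:       # 1. query the memory table first
            return self.memtable[key]
        return self.disktree.get(key)  # 2. fall back to the B+ tree on disk


table = TwoLevelTable()
table.disktree.put("ImgA", "C1:offset:length")   # already dumped to disk
table.memtable["FileA"] = "C4:offset:length"     # still only in memory
print(table.get("ImgA"), table.get("FileA"))
```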
Transaction history is recorded in journal logs, which are written to disk. The journals track the index transactions that are not yet committed to the B+ tree. After a transaction is logged in the journal, the memory table is updated. When the memory table becomes full, or after a set period of time, the table is sorted, merged, and dumped to the B+ tree on disk, and a checkpoint is then recorded in the journal. The following figure illustrates this process.
Both the journals and B+ trees are written to chunks.
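The write path described above can be sketched in the same illustrative style. The snippet below is a minimal sketch with hypothetical names (IndexWriter, put, _dump); it shows only the order of operations: log to the journal, update the memory table, then sort, merge, and dump to the on-disk tree and record a checkpoint.

```python
class IndexWriter:
    def __init__(self, dump_threshold=4):
        self.journal = []        # append-only transaction log (written to chunks)
        self.memtable = {}       # in-memory table of recent updates
        self.disktree = {}       # stand-in for the B+ tree on disk
        self.dump_threshold = dump_threshold

    def put(self, key, value):
        self.journal.append(("put", key, value))   # 1. log the transaction
        self.memtable[key] = value                 # 2. update the memory table
        if len(self.memtable) >= self.dump_threshold:
            self._dump()                           # 3. dump when the table is full

    def _dump(self):
        for key in sorted(self.memtable):          # sort and merge into the tree
            self.disktree[key] = self.memtable[key]
        self.memtable.clear()
        self.journal.append(("checkpoint",))       # 4. record a checkpoint


writer = IndexWriter(dump_threshold=2)
writer.put("C1", "Node1:Disk1:File1:offset1:length")
writer.put("C2", "Node1:Disk3:File1:offset1:length")
print(writer.journal[-1])   # ('checkpoint',)
```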
ECS uses several different tables, and each table can grow quite large. To optimize the performance of table lookups, each table is divided into partitions that are distributed across the nodes in a VDC/site. The node where a partition is written becomes the owner, or authority, of that partition (section) of the table.
One such table is the chunk table, which tracks the physical location of chunk fragments and replica copies on disk. The following table shows a sample partition of the chunk table. Each chunk location is identified by the node, the disk within the node, the file within that disk, the offset within the file, and the length of the data. Chunk ID C1 is erasure coded, and chunk ID C2 is triple mirrored. For more details about triple mirroring and erasure coding, see Advanced data-protection methods.
| Chunk ID | Chunk location |
|---|---|
| C1 | Node1:Disk1:File1:offset1:length, Node2:Disk1:File1:offset1:length, Node3:Disk1:File1:offset1:length, Node4:Disk1:File1:offset1:length, Node5:Disk1:File1:offset1:length, Node6:Disk1:File1:offset1:length, Node7:Disk1:File1:offset1:length, Node8:Disk1:File1:offset1:length, Node1:Disk2:File1:offset1:length, Node2:Disk2:File1:offset1:length, Node3:Disk2:File1:offset1:length, Node4:Disk2:File1:offset1:length, Node5:Disk2:File1:offset1:length, Node6:Disk2:File1:offset1:length, Node7:Disk2:File1:offset1:length, Node8:Disk2:File1:offset1:length |
| C2 | Node1:Disk3:File1:offset1:length, Node2:Disk3:File1:offset1:length, Node3:Disk3:File1:offset1:length |
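In code form, each row of this table can be thought of as a chunk ID mapped to a list of physical locations. The snippet below is a hypothetical data model for illustration; the field and variable names are not ECS internals, and the offset and length values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ChunkLocation:
    """One physical location: node, disk, file, offset, and length."""
    node: str
    disk: str
    file: str
    offset: str
    length: str


# Partition of the chunk table: chunk ID -> physical locations on disk.
# A triple-mirrored chunk (C2) has three complete copies; an erasure-coded
# chunk (C1) would instead list one location per fragment.
chunk_table = {
    "C2": [
        ChunkLocation("Node1", "Disk3", "File1", "offset1", "length"),
        ChunkLocation("Node2", "Disk3", "File1", "offset1", "length"),
        ChunkLocation("Node3", "Disk3", "File1", "offset1", "length"),
    ],
}

print(len(chunk_table["C2"]))  # 3 mirrored copies
```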
Another example is the object table, which maps object names to chunks. The following table shows an example of a partition of an object table; it lists, for each object, the chunk or chunks in which the object resides and the object's location within them.
| Object name | Chunk ID |
|---|---|
| ImgA | C1:offset:length |
| FileA | C4:offset:length, C6:offset:length |
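Taken together, the object table and the chunk table describe how a read is resolved: the object name is looked up to find its chunk references, and each chunk is looked up to find its physical locations. The following sketch assumes placeholder contents for both tables (including locations for C4 and C6 that are not shown above) and a hypothetical helper named locate_object.

```python
# Partition of the object table: object name -> (chunk ID, offset, length).
object_table = {
    "ImgA": [("C1", "offset", "length")],
    "FileA": [("C4", "offset", "length"), ("C6", "offset", "length")],
}

# Partition of the chunk table: chunk ID -> physical locations (placeholders).
chunk_table = {
    "C1": ["Node1:Disk1:File1:offset1:length", "Node2:Disk1:File1:offset1:length"],
    "C4": ["Node1:Disk4:File1:offset1:length"],
    "C6": ["Node2:Disk5:File1:offset1:length"],
}


def locate_object(name):
    """Resolve an object name to the physical locations of its data."""
    locations = []
    for chunk_id, offset, length in object_table[name]:
        locations.append((chunk_id, offset, length, chunk_table.get(chunk_id, [])))
    return locations


print(locate_object("FileA"))
```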
A service called vnest, which runs on all nodes, maintains the mapping of table partition owners. The following table shows an example of a portion of a vnest mapping table.
| Table ID | Table partition owner |
|---|---|
| Table 0 P1 | Node 1 |
| Table 0 P2 | Node 2 |
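As a sketch, directing a lookup to the right node amounts to finding the partition for a key and reading that partition's owner from the vnest-style mapping. The partition-selection logic below (hashing the key modulo the partition count) and the function owner_for_key are assumptions for illustration, not the actual ECS scheme.

```python
# vnest-style mapping: (table, partition) -> owning node
partition_owner = {
    ("Table 0", "P1"): "Node 1",
    ("Table 0", "P2"): "Node 2",
}


def owner_for_key(table, key, num_partitions=2):
    """Pick a partition for the key (illustrative hash) and return its owner."""
    partition = "P{}".format(hash(key) % num_partitions + 1)
    return partition, partition_owner[(table, partition)]


print(owner_for_key("Table 0", "ImgA"))
```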