SynCat: representation strawmen
This document describes some possible representations.
Goals
Ultimately, this document will describe how to write down catalogue
entries in a Storage-Element neutral fashion.
Currently, it enumerates a number of possible encoding schemes for
people to consider. The encoding scheme should allow a complete
catalogue of files, contained within directories, to be represented
within a flat file. The format should be easy to parse and, ideally,
extensible.
Prior art
I don't know of any efforts to standardise catalogue representation.
If any such standards exist, they should be examined. If you know of
any, please let me know and I'll add them here.
Attributes
Each file entry has some attributes that describe it. Some are
required, others are optional. These are described below:
Required attributes
- filename (SURL or logical filename)
- status (online, nearline, deleted)
Suggested attributes
- when last accessed (general comment: it won't happen)
- owner (also group?) --- for accounting/quota (first define "owner")
- in which space (space token) is the file?
- size of file
- checksum
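As an illustration of the required/optional split above, an entry could be modelled in memory as follows. This is only a sketch: the field names, the use of `None` for an unpublished attribute, and the assumption that size is a byte count are not part of any agreed definition.

```python
from dataclasses import dataclass
from typing import Optional

# Status values listed under "Required attributes".
VALID_STATUS = {"online", "nearline", "deleted"}


@dataclass
class CatalogueEntry:
    # Required attributes.
    name: str                            # SURL or logical filename
    status: str                          # one of VALID_STATUS
    # Suggested attributes; None means "not published" (an assumption).
    last_accessed: Optional[str] = None  # representation not yet decided
    owner: Optional[str] = None
    space_token: Optional[str] = None
    size: Optional[int] = None           # assumed to be a byte count
    checksum: Optional[str] = None

    def __post_init__(self):
        if not self.name:
            raise ValueError("name is required")
        if self.status not in VALID_STATUS:
            raise ValueError(f"unknown status: {self.status!r}")


entry = CatalogueEntry("/dir1/dir2/higgs-1.raw", "online",
                       space_token="space-token-1", size=110252)
```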
Possible encoding schemes
The remainder of the document describes the different encoding
schemes.
1. Comma-separated lists
The representation is a list of files. Each file is represented by a
single line. Each line has one or more attributes separated by
commas and is terminated by a single new-line character (0x0a).
1a. Fixed order
The attributes are always listed in a specific order. Attributes are
"retired" by publishing an empty value at that position. New
attributes are added by publishing the additional attributes at the
end of the list.
/dir1/dir2/higgs-1.raw,online,space-token-1,110252
/dir1/dir2/higgs-2.raw,online,space-token-1,110230
If we want to support implementation-specific attributes, these are
published as a comma-separated list, after the agreed list with a
semicolon separating the two lists.
/dir1/dir2/higgs-1.raw,online,space-token-1,110252;castor-data1a,castor-data1b
/dir1/dir2/higgs-2.raw,online,space-token-1,110230;castor-data2a,castor-data2b
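A reader for the fixed-order scheme might look like the sketch below. The field order (`name`, `status`, `token`, `size`) is inferred from the example lines and is an assumption, as is the function name `parse_fixed_order`; escaping of special characters is not handled here.

```python
# Field order assumed from the example lines; the agreed order is not yet fixed.
FIELDS = ("name", "status", "token", "size")


def parse_fixed_order(line):
    """Split one line into (agreed attributes, implementation-specific values)."""
    line = line.rstrip("\n")
    agreed, _, extra = line.partition(";")
    values = agreed.split(",")
    # Empty positions are "retired" attributes; positions beyond FIELDS are
    # attributes this reader does not know about yet. Both are skipped.
    attrs = {key: value for key, value in zip(FIELDS, values) if value != ""}
    extras = extra.split(",") if extra else []
    return attrs, extras


attrs, extras = parse_fixed_order(
    "/dir1/dir2/higgs-1.raw,online,space-token-1,110252;castor-data1a,castor-data1b\n")
```

Because new attributes are only ever appended, an older reader written this way keeps working when a newer writer publishes extra trailing fields.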
1b. Keyword-value pairs
Each attribute type is assigned a unique keyword. Each line is a
comma-separated list of attributes. Each attribute is published as
the keyword, followed by an equals sign, followed by the value.
Attributes can be published in any order. Attributes are
"retired" simply by not publishing them. New attributes can
be added by publishing new keywords that don't conflict with
previously used keywords.
latency=online,name=/dir1/dir2/higgs-1.raw,token=space-token-1,size=110252
latency=online,name=/dir1/dir2/higgs-2.raw,token=space-token-1,size=110230
If we want to support implementation-specific attributes, these are
published as a comma-separated list of keyword-value pairs, after the
agreed list, with a semicolon separating the two lists.
name=/dir1/dir2/higgs-1.raw,token=space-token-1,latency=online,size=110252;pool=dpm-node-1
name=/dir1/dir2/higgs-2.raw,token=space-token-1,latency=online,size=110230;pool=dpm-node-2
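A minimal sketch of a reader for the keyword-value scheme follows. The function names are hypothetical, values are kept as strings, and escaping of special characters is not handled.

```python
def parse_pairs(text):
    """Turn 'k1=v1,k2=v2' into a dict; an empty string gives an empty dict."""
    pairs = {}
    for item in text.split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {item!r}")
        pairs[key] = value
    return pairs


def parse_keyword_line(line):
    """Return (agreed attributes, implementation-specific attributes)."""
    agreed, _, extra = line.rstrip("\n").partition(";")
    return parse_pairs(agreed), parse_pairs(extra)


agreed, specific = parse_keyword_line(
    "name=/dir1/dir2/higgs-1.raw,token=space-token-1,latency=online,size=110252;pool=dpm-node-1\n")
```

Since the result is a dictionary, the order in which a writer publishes the attributes makes no difference to the reader.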
Escaping special characters in attribute values
Various characters have special meaning; these include the comma and,
potentially, the semicolon and the equals sign. Attribute values that
include these special characters must be escaped to prevent ambiguity.
C-like escaping
The back-slash character (\) is used to escape special
characters in attribute values:
original text | markup text |
\ | \\ |
, | \, |
; | \; |
= | \= |
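The C-like scheme could be implemented along the following lines. This is a sketch with hypothetical function names; it assumes that exactly the four characters in the table are escaped and that a line is split into fields before the fields are unescaped.

```python
import re


def escape(value):
    """Prefix each special character (\\ , ; =) with a backslash."""
    return re.sub(r"([\\,;=])", r"\\\1", value)


def unescape(value):
    """Drop the backslash in front of each escaped character."""
    return re.sub(r"\\(.)", r"\1", value)


def split_unescaped(line, sep=","):
    """Split on sep, ignoring any separator that is escaped by a backslash."""
    parts, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


fields = ["/dir1/a,b=c.raw", "online"]
line = ",".join(escape(field) for field in fields)
decoded = [unescape(part) for part in split_unescaped(line)]
```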
XML-like escaping
An ampersand (&) is used to escape special characters in
attribute values:
original text | markup text |
& | &amp; |
, | &comma; |
; | &semicolon; |
= | &equals; |
Alternatively, numeric references to the characters' Unicode code
points could be used.
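The XML-like scheme can be sketched in the same way. The entity names used here (`&amp;`, `&comma;`, `&semicolon;`, `&equals;`) are illustrative assumptions rather than an agreed set, and the hexadecimal form `&#x2c;` is assumed for the Unicode alternative.

```python
import re

# Illustrative entity names; they are assumptions, not an agreed set.
ENTITIES = {"&": "&amp;", ",": "&comma;", ";": "&semicolon;", "=": "&equals;"}
DECODE = {entity: char for char, entity in ENTITIES.items()}


def escape(value):
    # A single pass, so the '&' and ';' of emitted entities are not re-escaped.
    return re.sub(r"[&,;=]", lambda m: ENTITIES[m.group(0)], value)


def unescape(value):
    # Named entities plus hexadecimal references such as "&#x2c;".
    def decode(match):
        ref = match.group(0)
        if ref.startswith("&#x"):
            return chr(int(ref[3:-1], 16))
        return DECODE[ref]
    return re.sub(r"&#x[0-9a-fA-F]+;|&(?:amp|comma|semicolon|equals);",
                  decode, value)


escaped = escape("a&b,c;d=e")
restored = unescape(escaped)
```

One point to consider with this scheme: every entity ends in a semicolon, so a reader that also uses the semicolon to separate the agreed and implementation-specific lists would need to recognise entities before splitting.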
2. XML
XML is designed to be an extensible format that allows text to be
"marked up", allowing published information to have additional
(semantic) meaning.
Elements vs Attributes
Attributes for an entry can be published either as child elements of
an entry element or as XML attributes. The following example
uses child elements:
<catalogue>
  <entry>
    <name>/dir1/dir2/higgs-1.raw</name>
    <token>space-token-1</token>
    <latency>online</latency>
    <size>110252</size>
  </entry>
  <entry>
    <name>/dir1/dir2/higgs-2.raw</name>
    <token>space-token-1</token>
    <latency>online</latency>
    <size>110230</size>
  </entry>
</catalogue>
The same entries as pure attributes:
<catalogue>
  <entry
    name="/dir1/dir2/higgs-1.raw"
    token="space-token-1"
    latency="online"
    size="110252"/>
  <entry
    name="/dir1/dir2/higgs-2.raw"
    token="space-token-1"
    latency="online"
    size="110230"/>
</catalogue>
With the entry's name as the element content, marked up by the other attributes:
<catalogue>
  <entry
    token="space-token-1"
    latency="online"
    size="110252">/dir1/dir2/higgs-1.raw</entry>
  <entry
    token="space-token-1"
    latency="online"
    size="110230">/dir1/dir2/higgs-2.raw</entry>
</catalogue>
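From a reader's point of view the three variants carry the same information. The sketch below, using Python's standard `xml.etree.ElementTree`, normalises any of them to a dictionary; the function names and the rule "element text is the name when no `name` is otherwise given" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def read_entry(entry):
    """Collect one <entry>'s properties from attributes, children and text."""
    props = dict(entry.attrib)
    for child in entry:
        props[child.tag] = (child.text or "").strip()
    text = (entry.text or "").strip()
    if text and "name" not in props:
        props["name"] = text  # third variant: the name is the element content
    return props


def read_catalogue(xml_text):
    return [read_entry(e) for e in ET.fromstring(xml_text).findall("entry")]


AS_ELEMENTS = ("<catalogue><entry><name>/dir1/dir2/higgs-1.raw</name>"
               "<token>space-token-1</token><latency>online</latency>"
               "<size>110252</size></entry></catalogue>")
AS_ATTRIBUTES = ('<catalogue><entry name="/dir1/dir2/higgs-1.raw" '
                 'token="space-token-1" latency="online" size="110252"/>'
                 '</catalogue>')
AS_CONTENT = ('<catalogue><entry token="space-token-1" latency="online" '
              'size="110252">/dir1/dir2/higgs-1.raw</entry></catalogue>')

results = [read_catalogue(doc) for doc in (AS_ELEMENTS, AS_ATTRIBUTES, AS_CONTENT)]
```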
Flat format vs Structured
Both XML and the catalogue are tree-like. Instead of using a flat
format for all entries, the directory structure can be expressed
within the XML file. First, the flat format:
<catalogue>
  <entry
    token="space-token-1"
    latency="online"
    size="110252">/dir1/dir2/higgs-1.raw</entry>
  <entry
    token="space-token-1"
    latency="online"
    size="110230">/dir1/dir2/higgs-2.raw</entry>
</catalogue>
And the same information in a structured form:
<catalogue>
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1"
             latency="online"
             size="110252">higgs-1.raw</entry>
      <entry token="space-token-1"
             latency="online"
             size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>
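The structured form can be converted back to flat, full-path entries with a short recursive walk. The following is a sketch using `xml.etree.ElementTree`; the function name `flatten` and the choice of "/" as the path separator are assumptions.

```python
import xml.etree.ElementTree as ET

STRUCTURED = """
<catalogue>
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online" size="110252">higgs-1.raw</entry>
      <entry token="space-token-1" latency="online" size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>
"""


def flatten(node, prefix=""):
    """Yield (full path, attributes) for every <entry> below node."""
    for child in node:
        if child.tag == "dir":
            # Descend, extending the path with this directory's name.
            yield from flatten(child, prefix + "/" + child.get("name"))
        elif child.tag == "entry":
            yield prefix + "/" + (child.text or "").strip(), dict(child.attrib)


flat = list(flatten(ET.fromstring(STRUCTURED.strip())))
```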
Implementation-specific attributes
XML namespaces can be used to separate implementation-specific
information. For example:
<catalogue xmlns="http://catsyn.example.com/2008_1">
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1"
             latency="online"
             size="110252">higgs-1.raw</entry>
      <entry token="space-token-1"
             latency="online"
             size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>
And with some fake Storm-specific information:
<catalogue xmlns="http://catsyn.example.com/2008_1"
           xmlns:storm="http://storm.example.com/2008_2">
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1"
             latency="online"
             storm:gpfsvol="1"
             size="110252">higgs-1.raw</entry>
      <entry token="space-token-1"
             latency="online"
             storm:gpfsvol="2"
             size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>
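A namespace-aware reader can then pick out or ignore the implementation-specific attributes. In the sketch below, `xml.etree.ElementTree` reports a prefixed attribute under a `{namespace-uri}local-name` key, while unprefixed attributes keep their plain names; the function name `split_attributes` is hypothetical.

```python
import xml.etree.ElementTree as ET

CAT_NS = "http://catsyn.example.com/2008_1"

DOC = """
<catalogue xmlns="http://catsyn.example.com/2008_1"
           xmlns:storm="http://storm.example.com/2008_2">
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online"
             storm:gpfsvol="1" size="110252">higgs-1.raw</entry>
    </dir>
  </dir>
</catalogue>
"""


def split_attributes(entry):
    """Separate plain (agreed) attributes from namespaced ones."""
    agreed, specific = {}, {}
    for key, value in entry.attrib.items():
        if key.startswith("{"):
            uri, _, local = key[1:].partition("}")
            specific.setdefault(uri, {})[local] = value
        else:
            agreed[key] = value
    return agreed, specific


root = ET.fromstring(DOC.strip())
# Elements inherit the default namespace, so the tag must be qualified.
entries = list(root.iter("{%s}entry" % CAT_NS))
agreed, specific = split_attributes(entries[0])
```

A reader that does not understand the Storm namespace can simply drop `specific` and still process the entry.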
3. YAML
From the YAML definition abstract: "YAML [...] is a human-friendly,
cross language, Unicode based data serialisation language designed
around the common native data structures of agile programming
languages. It is broadly useful for programming needs ranging from
configuration files to Internet messaging to object persistence to
data auditing."
The claim is that YAML is much easier to process than XML, whilst
preserving much of XML's flexibility; however, it is much younger than
XML. It's also more "human friendly".
Presenting entries as a mapping of mappings:
/dir1/dir2/higgs-1.raw:
  token: space-token-1
  latency: online
  size: 110252
/dir1/dir2/higgs-2.raw:
  token: space-token-1
  latency: online
  size: 110230
Or as a simple sequence of mappings:
-
  name: /dir1/dir2/higgs-1.raw
  token: space-token-1
  latency: online
  size: 110252
-
  name: /dir1/dir2/higgs-2.raw
  token: space-token-1
  latency: online
  size: 110230
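Writing the sequence-of-mappings form does not need a YAML library. The sketch below emits it with the standard library only, under some assumptions: keys are simple words that need no quoting, integers are written bare, and every other value is written as a double-quoted string using JSON string syntax (relying on YAML 1.2's intent to accept JSON strings as double-quoted scalars, for ordinary printable text). The function names are hypothetical.

```python
import json


def scalar(value):
    # Integers are written bare; everything else as a double-quoted string.
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    return json.dumps(str(value), ensure_ascii=False)


def dump_sequence(entries):
    """Write a list of dicts as a YAML sequence of mappings."""
    lines = []
    for entry in entries:
        lines.append("-")
        for key, value in entry.items():
            # The two-space indent places each key inside the sequence item.
            lines.append(f"  {key}: {scalar(value)}")
    return "\n".join(lines) + "\n"


text = dump_sequence([
    {"name": "/dir1/dir2/higgs-1.raw", "token": "space-token-1",
     "latency": "online", "size": 110252},
])
```

Quoting every string is more verbose than the hand-written examples, but it avoids having to decide which values are safe as plain scalars. Reading the result back needs a YAML parser, for example `yaml.safe_load` from the third-party PyYAML package.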
Others?
Are there other possibilities people want to consider?