SynCat: representation strawmen

This document presents some possible representations.

Goals
Prior art
Attributes
Possible encoding schemes:

Goals

Ultimately, this document will describe how to write down catalogue entries in a Storage-Element neutral fashion.

Currently, it enumerates a number of possible encoding schemes for people to consider. The encoding scheme should allow a complete catalogue of files, contained within directories, to be represented within a single flat file. The format should be easy to parse and, ideally, extensible.

Prior art

I don't know of any efforts to standardise catalogue representation. If any such standards exist, they should be examined. If you know of any, please let me know and I'll add them here.

Attributes

Each file entry has some attributes that describe it. Some are required, others are optional. These are described below:

Required attributes

filename (SURL or logical filename)
status (online, nearline, deleted)

Suggested attributes

when last accessed (general comment: it won't happen)
owner (also group?) --- for accounting/quota (first define "owner")
in which space (space token) is the file?
size of file
checksum
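
As a rough sketch of how the required and suggested attributes might be modelled in code, the following Python dataclass treats the two required attributes as mandatory fields and everything else as optional. The class name, field names and the `VALID_STATUSES` constant are illustrative assumptions; only the attribute list and the three status values come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Status values listed above; the constant name is an assumption.
VALID_STATUSES = {"online", "nearline", "deleted"}


@dataclass
class CatalogueEntry:
    # Required attributes
    name: str    # SURL or logical filename
    status: str  # one of VALID_STATUSES
    # Suggested attributes; None means "not published"
    last_accessed: Optional[str] = None
    owner: Optional[str] = None
    space_token: Optional[str] = None
    size: Optional[int] = None
    checksum: Optional[str] = None

    def __post_init__(self):
        # Reject entries that lack a required attribute or use an unknown status.
        if not self.name:
            raise ValueError("name is required")
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


entry = CatalogueEntry("/dir1/dir2/higgs-1.raw", "online",
                       space_token="space-token-1", size=110252)
```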

Possible encoding schemes

The remainder of the document describes the different encoding schemes.

1. Comma-separated lists

Representation is a list of files. Each file is represented by a single line. Each line has one or more attributes separated by commas and is terminated by a single newline character (0x0a).

1a. Fixed order

The attributes are always listed in a specific order. An attribute is "retired" by publishing an empty value at its position. New attributes are added by publishing them at the end of the list.

/dir1/dir2/higgs-1.raw,online,space-token-1,110252
/dir1/dir2/higgs-2.raw,online,space-token-1,110230

If we want to support implementation-specific attributes, these are published as a comma-separated list, after the agreed list with a semicolon separating the two lists.

/dir1/dir2/higgs-1.raw,online,space-token-1,110252;castor-data1a,castor-data1b
/dir1/dir2/higgs-2.raw,online,space-token-1,110230;castor-data2a,castor-data2b
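
To give a feel for how little code the fixed-order scheme needs, here is a minimal Python sketch of a line parser. The field names in `FIELDS` are assumptions (the scheme itself only fixes positions), and the sketch ignores escaping entirely, so it would mis-split values that contain a comma or semicolon.

```python
# Assumed names for the four positions used in the example lines.
FIELDS = ["name", "latency", "token", "size"]


def parse_fixed_order(line):
    """Parse one fixed-order line into (agreed, implementation_specific)."""
    line = line.rstrip("\n")
    # Text after the first semicolon is the implementation-specific list.
    agreed_part, sep, extra_part = line.partition(";")
    values = agreed_part.split(",")
    # An empty value marks a retired (or unpublished) attribute, so skip it.
    # Values beyond the known fields (newer attributes) are ignored by zip().
    agreed = {key: value for key, value in zip(FIELDS, values) if value != ""}
    extras = extra_part.split(",") if sep else []
    return agreed, extras
```

For the first example line with CASTOR extras, this returns the four agreed attributes as a dictionary and `["castor-data1a", "castor-data1b"]` as the implementation-specific list.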

1b. Keyword-value pairs

Each attribute type is assigned a unique keyword. Each line is a comma-separated list of attributes. Each attribute is published as the keyword, followed by an equals sign, followed by the value.

Attributes can be published in any order. Attributes are "retired" simply by not publishing them. New attributes can be added by publishing new keywords that don't conflict with previously used keywords.

latency=online,name=/dir1/dir2/higgs-1.raw,token=space-token-1,size=110252
latency=online,name=/dir1/dir2/higgs-2.raw,token=space-token-1,size=110230

If we want to support implementation-specific attributes, these are published as a comma-separated list, after the agreed list with a semicolon separating the two lists.

name=/dir1/dir2/higgs-1.raw,token=space-token-1,latency=online,size=110252;pool=dpm-node-1
name=/dir1/dir2/higgs-2.raw,token=space-token-1,latency=online,size=110230;pool=dpm-node-2
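
A keyword-value line can be parsed along the following lines. This is only a sketch: the function names are made up for illustration, escaping is not handled, and a repeated keyword simply overwrites the earlier value.

```python
def parse_pairs(text):
    """Turn 'k1=v1,k2=v2' into a dict; an empty string gives an empty dict."""
    pairs = {}
    for item in text.split(","):
        if not item:
            continue
        # Split on the first equals sign only; the rest is the value.
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_keyword_line(line):
    """Parse one line into (agreed, implementation_specific) dicts."""
    agreed_part, _, extra_part = line.rstrip("\n").partition(";")
    return parse_pairs(agreed_part), parse_pairs(extra_part)
```

Because the result is a dictionary, the order in which attributes appear on the line does not matter, and a retired attribute is simply a missing key.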

Escaping special characters in attribute values

Various characters have special meaning: the comma and, depending on the scheme, the semicolon and the equals sign. Attribute values that include these special characters must be escaped to prevent problems.

C-like escaping

The back-slash character (\) is used to escape special characters in attribute values:

original text	markup text
`\`	`\\`
`,`	`\,`
`;`	`\;`
`=`	`\=`
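
The table above could be implemented roughly as follows. The function names are illustrative assumptions. The point of `split_unescaped` is that a parser must skip over escape pairs when looking for separators, so a plain `str.split(",")` is no longer enough once escaping is in use.

```python
# Characters that must be escaped, per the table above.
SPECIAL = "\\,;="


def escape(value):
    """Prefix every special character in value with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)


def unescape(value):
    """Reverse escape(): drop each escaping backslash, keep what follows."""
    out, i = [], 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value):
            out.append(value[i + 1])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


def split_unescaped(text, separator):
    """Split text on separator, skipping separators that are escaped.

    Escape sequences are left intact in the returned pieces.
    """
    pieces, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair together
            i += 2
        elif ch == separator:
            pieces.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    pieces.append("".join(current))
    return pieces
```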

XML-like escaping

An ampersand (&) is used to escape special characters in attribute values:

original text	markup text
`&`	`&amp;`
`,`	`&comma;`
`;`	`&semicolon;`
`=`	`&equals;`

Alternatively, the Unicode values could be used.
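
A sketch of the XML-like scheme in Python follows. Only `&semicolon;` is taken from the text; the entity names `&amp;`, `&comma;` and `&equals;` are assumed by analogy, and the function names are made up. The escaping is done in a single pass over the characters because every entity itself contains `&` and `;`, so a chain of `str.replace` calls would re-escape its own output. For the same reason, a parser would presumably need to recognise entities before treating `;` as the separator between the two attribute lists.

```python
import re

# Assumed entity names, except "&semicolon;" which the text gives.
ENTITY_FOR = {"&": "&amp;", ",": "&comma;", ";": "&semicolon;", "=": "&equals;"}
CHAR_FOR = {entity: char for char, entity in ENTITY_FOR.items()}
ENTITY_RE = re.compile("|".join(re.escape(entity) for entity in CHAR_FOR))


def xml_like_escape(value):
    """Replace each special character with its entity, one character at a time."""
    return "".join(ENTITY_FOR.get(ch, ch) for ch in value)


def xml_like_unescape(value):
    """Replace each known entity with its character in a single left-to-right pass."""
    return ENTITY_RE.sub(lambda match: CHAR_FOR[match.group(0)], value)
```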

2. XML

XML is designed to be an extensible format that allows text to be "marked up", allowing published information to have additional (semantic) meaning.

Elements vs Attributes

Attributes for an entry can be published either as child elements of an entry element or as XML attributes. The following example shows them as child elements:

<catalogue>
  <entry>
    <name>/dir1/dir2/higgs-1.raw</name>
    <token>space-token-1</token>
    <latency>online</latency>
    <size>110252</size>
  </entry>
  <entry>
    <name>/dir1/dir2/higgs-2.raw</name>
    <token>space-token-1</token>
    <latency>online</latency>
    <size>110230</size>
  </entry>
</catalogue>

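as pure attributes:

<catalogue>
  <entry name="/dir1/dir2/higgs-1.raw" token="space-token-1" latency="online" size="110252"/>
  <entry name="/dir1/dir2/higgs-2.raw" token="space-token-1" latency="online" size="110230"/>
</catalogue>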

or with the catalogue entry's name marked up by the other attributes:

<catalogue>
  <entry token="space-token-1" latency="online" size="110252">/dir1/dir2/higgs-1.raw</entry>
  <entry token="space-token-1" latency="online" size="110230">/dir1/dir2/higgs-2.raw</entry>
</catalogue>
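
From a consumer's point of view the choice between these styles changes the parsing code only slightly. The sketch below uses Python's standard `xml.etree.ElementTree` to read the child-element form and the marked-up-name form into the same list of dictionaries. The function and constant names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ELEMENT_FORM = """<catalogue>
  <entry>
    <name>/dir1/dir2/higgs-1.raw</name>
    <token>space-token-1</token>
    <latency>online</latency>
    <size>110252</size>
  </entry>
</catalogue>"""

MARKUP_FORM = """<catalogue>
  <entry token="space-token-1" latency="online" size="110252">/dir1/dir2/higgs-1.raw</entry>
</catalogue>"""


def entries_from_elements(xml_text):
    """Each child element of <entry> becomes one key/value pair."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in entry}
            for entry in root.findall("entry")]


def entries_from_markup(xml_text):
    """The element text is the name; XML attributes supply the rest."""
    root = ET.fromstring(xml_text)
    return [{"name": entry.text, **entry.attrib}
            for entry in root.findall("entry")]
```

Both functions yield the same dictionaries for the two inputs above, with all values as strings.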

Flat format vs Structured

Both XML and the catalogue are tree-like. Instead of using a flat format for all entries, as in the examples above, the directory structure can be expressed within the XML file. The same information in a structured form:

<catalogue>
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online" size="110252">higgs-1.raw</entry>
      <entry token="space-token-1" latency="online" size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>
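
The structured form carries the same information as the flat one, which a short recursive walk can demonstrate. This is a sketch using the standard `xml.etree.ElementTree` module; the `flatten` helper is a made-up name, and it assumes every `dir` element has a `name` attribute.

```python
import xml.etree.ElementTree as ET

STRUCTURED = """<catalogue>
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online" size="110252">higgs-1.raw</entry>
      <entry token="space-token-1" latency="online" size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>"""


def flatten(node, prefix=""):
    """Yield (full_path, attributes) for every entry below node."""
    for child in node:
        if child.tag == "dir":
            # Descend, extending the path with this directory's name.
            yield from flatten(child, prefix + "/" + child.get("name"))
        elif child.tag == "entry":
            yield prefix + "/" + child.text, dict(child.attrib)


paths = [path for path, _ in flatten(ET.fromstring(STRUCTURED))]
```

Here `paths` ends up holding the two full names, `/dir1/dir2/higgs-1.raw` and `/dir1/dir2/higgs-2.raw`.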

Implementation-specific attributes

One can use XML namespaces to separate implementation-specific information. For example:

<catalogue xmlns="http://catsyn.example.com/2008_1">
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online" size="110252">higgs-1.raw</entry>
      <entry token="space-token-1" latency="online" size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>

and with some fake Storm-specific information:

<catalogue xmlns="http://catsyn.example.com/2008_1" xmlns:storm="http://storm.example.com/2008_2">
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online" storm:gpfsvol="1" size="110252">higgs-1.raw</entry>
      <entry token="space-token-1" latency="online" storm:gpfsvol="2" size="110230">higgs-2.raw</entry>
    </dir>
  </dir>
</catalogue>
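
With namespaces in place, a consumer can pick out or ignore the implementation-specific attributes mechanically. The following sketch uses Python's standard `xml.etree.ElementTree`, which exposes a namespaced attribute under a key of the form `{namespace-uri}localname` and leaves unprefixed attributes without a namespace. The `split_attributes` helper is a made-up name.

```python
import xml.etree.ElementTree as ET

CATSYN_NS = "http://catsyn.example.com/2008_1"
STORM_NS = "http://storm.example.com/2008_2"

DOC = """<catalogue xmlns="http://catsyn.example.com/2008_1" xmlns:storm="http://storm.example.com/2008_2">
  <dir name="dir1">
    <dir name="dir2">
      <entry token="space-token-1" latency="online" storm:gpfsvol="1" size="110252">higgs-1.raw</entry>
    </dir>
  </dir>
</catalogue>"""


def split_attributes(entry, namespace):
    """Return (common, specific) attribute dicts for one entry element."""
    prefix = "{" + namespace + "}"
    # Unprefixed attributes carry no namespace, so their keys have no "{".
    common = {k: v for k, v in entry.attrib.items() if not k.startswith("{")}
    # Attributes in the given namespace, with the "{uri}" prefix stripped.
    specific = {k[len(prefix):]: v for k, v in entry.attrib.items()
                if k.startswith(prefix)}
    return common, specific


root = ET.fromstring(DOC)
# Elements inherit the default namespace, so the tag must be qualified.
entry = root.find(".//{" + CATSYN_NS + "}entry")
common, storm = split_attributes(entry, STORM_NS)
```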

3. YAML

From the YAML definition abstract: "YAML [...] is a human-friendly, cross language, Unicode based data serialisation language designed around the common native data structures of agile programming languages. It is broadly useful for programming needs ranging from configuration files to Internet messaging to object persistence to data auditing."

The claim is that YAML is much easier to process than XML, whilst preserving much of XML's flexibility; however, it is much younger than XML. It's also more "human friendly".

Presenting entries as a mapping of mappings:

/dir1/dir2/higgs-1.raw:
  token: space-token-1
  latency: online
  size: 110252
/dir1/dir2/higgs-2.raw:
  token: space-token-1
  latency: online
  size: 110230

Or as a simple sequence of mappings:

- name: /dir1/dir2/higgs-1.raw
  token: space-token-1
  latency: online
  size: 110252
- name: /dir1/dir2/higgs-2.raw
  token: space-token-1
  latency: online
  size: 110230
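
The two YAML layouts correspond to two shapes of native data: a dictionary keyed by file name, or a list of dictionaries each carrying a `name` key. The sketch below converts the first shape into the second and writes it out by hand using only the Python standard library. Both function names are made up, and the emitter assumes every key and value is a plain scalar that needs no YAML quoting, which a real implementation could not take for granted.

```python
def mapping_to_sequence(mapping):
    """Convert {name: {attr: value}} into [{'name': name, attr: value}, ...]."""
    return [{"name": name, **attrs} for name, attrs in mapping.items()]


def to_yaml_sequence(entries):
    """Emit a list of flat dicts as a YAML sequence of mappings."""
    lines = []
    for entry in entries:
        for index, (key, value) in enumerate(entry.items()):
            # The first key of each entry carries the "- " sequence marker.
            indent = "- " if index == 0 else "  "
            lines.append(f"{indent}{key}: {value}")
    return "\n".join(lines) + "\n"


catalogue = {
    "/dir1/dir2/higgs-1.raw": {"token": "space-token-1", "latency": "online", "size": 110252},
}
text = to_yaml_sequence(mapping_to_sequence(catalogue))
```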

Others?

Are there other possibilities people want to consider?