Early Draft                                                   K. MacLeod
xml-serialization                                     The Casbah Project
                                                            October 1999

Self-Describing XML Data Representation

Abstract

This document describes a simple, plain-text, self-describing, Extensible Markup Language (XML) format for structured data. The base format defined here allows for nested values made up of dictionaries, lists, and atomic values without arbitrary size limits. This format is intended to be minimal, flexible, reusable, and targetable.



1. Introduction

The XML Serialization format provides a way to marshal data structures or application objects using a simple, fixed set of XML elements.

XML Serialization provides four elements for encoding objects: <dictionary>, <list>, <atom>, and <ref>. Multiple references to <dictionary>, <list>, and <atom> elements are supported using an `id' attribute that can be referred to using the `ref' attribute of the <ref> element. <dictionary>, <list>, and <atom> elements support a `type' attribute to declare the type or class of the data. <atom> elements support an `encoding' attribute to declare their encoding (such as BASE64). <dictionary>, <list>, and <atom> elements support a `length' attribute that gives the number of pairs in a dictionary, the number of elements in a list, or the parsed length of the text of an <atom>.

This format is being developed as part of Scarab, a framework for handling data and distributed computing. Scarab itself is being developed as part of Casbah, a language-independent environment for writing applications that access a wide variety of data sources. Scarab includes a binary format that is comparable to this format. Scarab supports storage or transfer using other XML DTDs and Schemas in addition to this serialization.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119[3].

Within this draft, text enclosed within brackets (`[text]') represents issues or incomplete specifications.

1.1 Examples of XML Serialization

An example serialization of the following value:

    record = ( month: 'April', day: 5, year: 1997 )
    encode(record, "a day in the life")

would be:

    <?xml version="1.0"?>
    <!DOCTYPE list SYSTEM "ldo-xml.dtd">
    <list>
      <dictionary>
        <atom>month</atom><atom>April</atom>
        <atom>day</atom><atom>5</atom>
        <atom>year</atom><atom>1997</atom>
      </dictionary>
      <atom>a day in the life</atom>
    </list>
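
The mapping used in the example above can be sketched in Python. This is a minimal illustration, not part of Scarab: the function names `to_element' and `encode' are hypothetical, Python dicts and lists stand in for dictionaries and lists, and every other value is assumed to be an atom carrying its string form. The XML declaration and DOCTYPE are omitted.

```python
import xml.etree.ElementTree as ET

def to_element(value):
    """Map a Python value to an <atom>, <list>, or <dictionary> element."""
    if isinstance(value, dict):
        elem = ET.Element("dictionary")
        for key, item in value.items():
            # Each pair is written as two consecutive child items.
            elem.append(to_element(key))
            elem.append(to_element(item))
        return elem
    if isinstance(value, (list, tuple)):
        elem = ET.Element("list")
        for item in value:
            elem.append(to_element(item))
        return elem
    # Assumption: anything else is an atom holding its string form.
    elem = ET.Element("atom")
    elem.text = str(value)
    return elem

def encode(*values):
    """Serialize the arguments as one top-level <list> (hypothetical helper)."""
    return ET.tostring(to_element(list(values)), encoding="unicode")

record = {"month": "April", "day": 5, "year": 1997}
xml_text = encode(record, "a day in the life")
```

Apart from whitespace, `xml_text' holds the same <list> element as the example above.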



2. XML Serialization

XML Serialization defines three generic datatypes, <atom>, <list>, and <dictionary>, that can be further specialized by higher-level protocols or marshaling. <atom>s are used for datatypes having values which are intrinsically indivisible. <list>s and <dictionary>s are used for datatypes having values which can be decomposed into two or more components. <dictionary>s, specifically, can be used to represent records with keyed fields or objects with member variable names.

<atom>s can only contain character data. <list>s and <dictionary>s can only contain other <list>s, <dictionary>s, <atom>s, and <ref>s. <ref>s are used to resolve multiple references to the same serialized data value.
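
The containment rules above suggest a simple recursive reader. The following Python sketch is illustrative only: `from_element' is a hypothetical name, all atoms are returned as strings, dictionary keys are assumed to be atoms so that they can serve as Python dict keys, and <ref> elements are not handled.

```python
import xml.etree.ElementTree as ET

def from_element(elem):
    """Convert an <atom>, <list>, or <dictionary> element to a Python value."""
    if elem.tag == "atom":
        # Atoms hold only character data; an empty atom is the empty string.
        return elem.text or ""
    if elem.tag == "list":
        return [from_element(child) for child in elem]
    if elem.tag == "dictionary":
        items = [from_element(child) for child in elem]
        if len(items) % 2 != 0:
            raise ValueError("dictionary must contain key, value pairs")
        # Assumption: keys are atoms, so they are hashable Python strings.
        return dict(zip(items[0::2], items[1::2]))
    raise ValueError(f"unsupported element <{elem.tag}>")

doc = ET.fromstring(
    "<list>"
    "<dictionary><atom>day</atom><atom>5</atom></dictionary>"
    "<atom>a day in the life</atom>"
    "</list>"
)
value = from_element(doc)
```

Here `value' is a list holding a dict and a string. The atom `5' stays a string because no `type' attribute guides its conversion.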

2.1 <atom>

<atom>s are used to encode values that are intrinsically indivisible, such as strings, numbers, symbols, enumerated values, and binary data. <atom>s can also be used to encode values that are divisible but have common string representations, such as dates or URIs.

Implementations may treat all atoms as octet sequences or strings. The `type' attribute can be used to guide conversion to more specialized types or to enforce data types.

The `length' attribute can be used to indicate the number of octets encoded in the atom.

The `encoding' attribute can be used to specify `base64' encoding for the atom, to support encoding of binary data. Base64 encoding is defined in RFC 1521[2].
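
As an illustration of the `encoding' attribute, the Python sketch below writes binary data into an <atom> as base64 and reads it back. The helper names `binary_atom' and `atom_bytes' are hypothetical, and treating unencoded atom text as UTF-8 is an assumption of this sketch, not something this draft specifies.

```python
import base64
import xml.etree.ElementTree as ET

def binary_atom(data):
    """Build an <atom encoding="base64"> element from raw octets."""
    atom = ET.Element("atom", {"encoding": "base64"})
    atom.text = base64.b64encode(data).decode("ascii")
    return atom

def atom_bytes(atom):
    """Return the octets carried by an <atom> element."""
    text = atom.text or ""
    if atom.get("encoding") == "base64":
        return base64.b64decode(text)
    # Assumption: atoms without an encoding are read as UTF-8 text.
    return text.encode("utf-8")

atom = binary_atom(b"\x00\x01\xfe\xff")
serialized = ET.tostring(atom, encoding="unicode")
round_trip = atom_bytes(ET.fromstring(serialized))
```

After the round trip, `round_trip' equals the original four octets.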

The `id' attribute can be used to provide an identifier for this atom so it can be referenced again later using <ref>.

2.2 <list>

<list>s are used to encode multipart values that can be represented as sequences. <list> elements may contain any number and any combination of <atom>s, <list>s, <dictionary>s, and <ref>s. Implementations can use the `type' attribute to guide conversion to more specialized types or to enforce data types.

The `length' attribute can be used to indicate the number of values in the list.

The `id' attribute can be used to provide an identifier for this list so it can be referenced again later using <ref>.

2.3 <dictionary>

<dictionary>s are used to encode multipart values that can be represented as key, value pairs. <dictionary> elements may contain any number of pairs, and each key and value may be any combination of <atom>s, <list>s, <dictionary>s, and <ref>s. Implementations can use the `type' attribute to guide conversion to more specialized types or to enforce data types.

The `length' attribute can be used to indicate the number of key, value pairs in the dictionary.

The `id' attribute can be used to provide an identifier for this dictionary so it can be referenced again later using <ref>.

2.4 <ref>

<ref>s are used to refer to a value that has already been serialized. [need a good definition here] The value of the `ref' attribute identifies the previously serialized atom, list, or dictionary with the corresponding `id' attribute.
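
One way a reader might resolve <ref> elements is to keep a table of values indexed by `id' while decoding. The Python sketch below is an assumption-laden illustration: `decode' and `seen' are hypothetical names, atoms are returned as strings, dictionary keys are assumed to be hashable, and a <ref> to an `id' that has not yet been seen raises KeyError.

```python
import xml.etree.ElementTree as ET

def decode(elem, seen=None):
    """Decode an element, resolving <ref>s to previously decoded values."""
    seen = {} if seen is None else seen
    if elem.tag == "ref":
        return seen[elem.get("ref")]
    if elem.tag == "atom":
        value = elem.text or ""
    elif elem.tag == "list":
        value = []
    elif elem.tag == "dictionary":
        value = {}
    else:
        raise ValueError(f"unexpected element <{elem.tag}>")
    # Register the value before decoding children so that a container
    # can be referenced from inside itself.
    if elem.get("id") is not None:
        seen[elem.get("id")] = value
    if elem.tag == "list":
        value.extend(decode(child, seen) for child in elem)
    elif elem.tag == "dictionary":
        items = [decode(child, seen) for child in elem]
        value.update(zip(items[0::2], items[1::2]))
    return value

doc = ET.fromstring(
    '<list>'
    '<list id="a"><atom>x</atom></list>'
    '<ref ref="a"/>'
    '</list>'
)
result = decode(doc)
```

In this example both entries of `result' are the same Python list object, so the sharing present in the original data is preserved.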

2.5 `type' attribute

The `type' attribute is intended to be used by implementations to specify data types that are pertinent to that implementation. Implementations may benefit by sharing a convention for declaring the type of data, even if they cannot provide natural or transparent access to that data. For example, an implementation may allow a user to access the contents of a generic dictionary, list, or atom directly and without reliance on the `type' attribute.

The convention presented here is patterned after URIs [RFC?]. Types are represented by a two-part identifier separated by a colon (`:'). The left-hand part is the scheme identifier (a computer language name, a standard for data interchange, or an alias declared externally, cf. XML Namespaces) and the right-hand part is an opaque string identifying the type of data in a syntax appropriate for and defined by the scheme. [copy syntax from URI RFC]

Example type attribute values are:

    perl:scalar
    xml-dcd:ui1
    corba:i4
    mime:text/html
    ldo:number

Type schemes used in LDO (including schemes for languages used in LDO) will be defined as separate specifications. See Scarab.
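
A reader could split a `type' value into its two parts as sketched below in Python. The name `split_type' is hypothetical, and because the exact syntax is still an open issue in this draft, the sketch simply splits on the first colon and leaves the right-hand part untouched.

```python
def split_type(type_attr):
    """Split a `type' value into (scheme, opaque type name)."""
    scheme, sep, name = type_attr.partition(":")
    if not sep or not scheme or not name:
        raise ValueError(f"not a scheme:name type value: {type_attr!r}")
    # The right-hand part is opaque and may itself contain colons.
    return scheme, name

parts = [split_type(t) for t in ("perl:scalar", "mime:text/html", "ldo:number")]
```

Splitting only on the first colon keeps scheme-specific syntax, such as the `/' in a MIME type, intact for the scheme to interpret.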



3. Conformance and Interoperability

[need to fill out this section]



4. Security Considerations

This data representation does not contain any features for initiating actions.

Several elements in this data representation have no defined upper limits on size or quantity. Conforming implementations SHOULD perform sanity checks on data received to avoid buffer overruns and out-of-resource (memory, disk, etc.) conditions.
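
One possible form of such a sanity check is to enforce limits while parsing incrementally, rather than after the whole document is in memory. The Python sketch below is only an illustration: `check_limits' and the limit values are hypothetical, and it covers nesting depth and element count only, not atom text size or other XML parser hazards.

```python
import io
import xml.etree.ElementTree as ET

MAX_DEPTH = 32         # hypothetical limit on nesting
MAX_ELEMENTS = 10_000  # hypothetical limit on total elements

def check_limits(stream, max_depth=MAX_DEPTH, max_elements=MAX_ELEMENTS):
    """Scan a document incrementally and return its element count.

    Raises ValueError as soon as a limit is exceeded.
    """
    depth = 0
    count = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            depth += 1
            count += 1
            if depth > max_depth:
                raise ValueError("nesting too deep")
            if count > max_elements:
                raise ValueError("too many elements")
        else:
            depth -= 1
            elem.clear()  # drop text and children that are no longer needed
    return count

total = check_limits(io.BytesIO(b"<list><atom>a</atom><list/></list>"))
```

Here `total' is 3, and the same call with `max_depth=1' would raise ValueError at the first nested element.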



References

[1] World Wide Web Consortium, "Extensible Markup Language (XML) 1.0", W3C XML, February 1998.
[2] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", RFC 1521, September 1993.
[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.


Author's Address

  Ken MacLeod
  The Casbah Project
  1330 66th Street
  Des Moines, IA 50311
  US
Phone:  +1 515 279 0319
EMail:  ken@bitsko.slc.ut.us
URI:  http://casbah.org/Scarab/


Appendix A. Document Type Definition

<!ENTITY % item        "(dictionary | list | atom | ref)">

<!ENTITY % item.attrib
       "type           CDATA           #IMPLIED
        length         CDATA           #IMPLIED
        id             ID              #IMPLIED">

<!ELEMENT dictionary   (%item;, %item;)* > 
<!ATTLIST dictionary
        %item.attrib;
>

<!ELEMENT list         (%item;)*         > 
<!ATTLIST list
        %item.attrib;
>

<!ELEMENT atom         (#PCDATA)         > 
<!ATTLIST atom
        %item.attrib;
        encoding       CDATA           #IMPLIED
>

<!ELEMENT ref          EMPTY             > 
<!ATTLIST ref
        ref            IDREF           #REQUIRED
>


Appendix B. Larger Examples

[Well, we know what should go _here_, don't we? ;-]



Appendix C. FAQ

This section will not appear in the final specification. This section is intended to provide history and commentary on the spec. When this document is finalized, this information will be historical and may be provided elsewhere.

C.1 Q: Where did the name "atom" come from?

"atom" or "atomic" is the term most commonly used by designers if and when they refer to a value or object that is not composed of or an aggregate of other values or objects. Many descriptions of computer languages, protocols, or formats don't distinguish between simple or complex objects or don't provide a specific term for non-aggregate types (they simply call them "objects", for example). The term "scalar" is a close second to "atom" in terms of common usage.

Alternatives included "value", "data", "datum", "scalar", and "primitive". "value", "data", and "datum" all seemed wrong because lists and dictionaries are also values and data. "scalar" seems too closely tied to numeric or quantity values. "primitive" is often associated with implementation in a lower-level language (like C or assembler).

C.2 Q: Why not include more common data types, like number, string, or JPEG?

Scarab's XML and binary reference serializations were designed with the assumption that we don't know what datatypes you really need. Rather than define a small set of datatypes that may or may not fit your needs, we've used some of the most basic data types available and then supported a "type" attribute that lets you declare a specific type. LDO-Types defines common values for type attributes supported by Scarab applications. If your application is used as both client and server, you can also define your own data types. Scarab can also use other serialization formats that provide more specific data types.