Rationale¶
The Philosophy of a Standard Data Format¶
A Standard Data Format (SDF) is a useful concept when large amounts of data have to be archived, converted between formats, or analysed by a number of different data analysis programs.
Software Independent File Formats¶
SDF files should be independent of the instrument and the instrument software by which the data were obtained.
SDF files should contain all the information that was available when the data was created.
SDF files should be independent of the operating system. One should be able to access the data on all operating systems without file conversions.
SDF files should be easily accessible without the need for specialized software. A good concept for achieving this goal is the use of a markup language like XML, which allows numbers to be combined with meaningful tags.
SDF files should even be readable by humans. From this it follows that the file is a pure ASCII text file. File compression can be performed by the operating system using zip, gz, or any other standard compression tool.
SDF files should be able to contain complex structures that consist of more than one dataset. Datasets and their parameters should be organized in a tree-like hierarchical structure.
In some situations SDF files should be able to contain binary rather than ASCII data. Examples are measurements whose results are images. In this case the data block of an SDF file could be, for example, a JPEG image. Remarks: (1) Conversion without information loss into open-source formats like, e.g., PNG should be a possible way of implementation. (2) Concepts like uuencode/uudecode for making these binary parts readable could be considered. Including binary datasets should definitely be restricted to the data block and must never occur in the meta-data of SDF files.
The previous topic must not open a backdoor to install proprietary, instrument- or software-dependent data formats in SDF. Data blocks containing binary Matlab .mat files are not allowed.
Tools for accessing and manipulating SDF files should be written in a programming language that is itself independent of the operating system. In this way not only the datasets but also the access tools are available on all operating systems.
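To make the tree-like structure concrete, here is a minimal, hypothetical SDF-style XML file, parsed with Python's standard library. The tag names (workspace, dataset, parameter, data) are illustrative assumptions for this sketch, not part of any official SDF definition:

```python
# A hypothetical SDF-style document; all tag and attribute names are
# illustrative assumptions, not an official SDF definition.
import xml.etree.ElementTree as ET

sdf_text = """\
<workspace name="my-experiment">
  <dataset name="my-data">
    <parameter name="delay-time" unit="ms">12.5</parameter>
    <data>
      0.0 1.02
      0.1 1.13
    </data>
  </dataset>
</workspace>
"""

root = ET.fromstring(sdf_text)
dataset = root.find("dataset")
param = dataset.find("parameter")
print(dataset.get("name"))  # name of the dataset
print(param.get("unit"))    # unit attribute of the parameter
```

Note that the same file remains perfectly readable in a plain text editor, which is exactly the point of the requirements above.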
XML/DTD as the file format language¶
XML allows the contents of a dataset to be structured with “meaningful” tags of a markup language. In this way it is, e.g., possible to say that the following number belongs to the dataset called my-data, is a parameter with the name delay-time, and has the unit ms.
Nowadays, XML is so widespread among general file formats (even Microsoft is using it) that we can consider XML to be supported and stable in the long-term future. Even if that were not the case, the open (and lightweight) definition of XML will make it easy to export XML into whatever the follow-up format would be.
DTDs are the XML concept for defining the set of allowed tags and their properties in an XML file.
Using XML for SDF allows the creation of data files that can easily be parsed by computer programs but are also readable by humans.
Using XML, one can have a look at the data files with one of the many convenient XML browsers that are available on all operating systems (Firefox, Internet Explorer, …).
XML files can be modified “by hand” with the help of standard text editors, most of which are able to highlight (colorize) XML syntax.
SDF/XML files can easily be created by self-written software or by the macro languages of proprietary third-party software like Mathcad or Mathematica.
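As an illustration of how little code such self-written software needs, here is a sketch in Python that generates an SDF-style file with the standard library. The tag and attribute names are again hypothetical:

```python
# Sketch: generating a hypothetical SDF-style XML fragment with the
# Python standard library; tag names are illustrative assumptions.
import xml.etree.ElementTree as ET

dataset = ET.Element("dataset", name="my-data")
param = ET.SubElement(dataset, "parameter", name="delay-time", unit="ms")
param.text = "12.5"
data = ET.SubElement(dataset, "data")
data.text = "0.0 1.02\n0.1 1.13"

# Serialize to bytes (includes an XML declaration)
xml_bytes = ET.tostring(dataset, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```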
Why SDF and not HDF5¶
SDF is highly inspired by the concepts of the HDF5 data format. The following citations are taken from the HDF5 homepage:
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
The HDF5 technology suite includes:¶
A versatile data model that can represent very complex data objects and a wide variety of metadata.
A completely portable file format with no limit on the number or size of data objects in the collection.
A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
A rich set of integrated performance features that allow for access time and storage space optimizations.
Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
The HDF5 data model, file format, API, library, and tools are open and distributed without charge.
Building on its 20 year history, The HDF Group offers personalized consulting, training, design, software development, and support services to help clients take full advantage of HDF5 capabilities in addressing their unique data management challenges.
So the question arises why one should not use HDF5 as a standard data format.
The short answer is: because HDF5 is binary and you cannot access its contents without an HDF5 library. But here are some more details:
Assume that you are trained to analyse your data with the program xy-plot. This program does not provide an import HDF5 option. Then you will not be able to read the HDF5 file.
With SDF on the other hand – even though the xy-plot program does not support import SDF either – you can take a simple editor, cut out the data block of the SDF file, store it, and import it as ASCII or CSV data. This will work on any machine, on any OS, and in 50 years.
As a programmer you are forced to use the HDF5 libraries in order to give your programs access to HDF5 data, and you do not have any real chance of accessing the data with self-written code.
With SDF on the other hand, you have the choice of using XML parsing libraries to read in the complete source tree of an SDF file in a single command. Or you can use the most elementary file I/O functions to read the file line by line and extract only the information in which you are interested. In the most extreme case, search for the data block, read the data word by word, and that’s it.
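A minimal sketch of this “elementary I/O” approach, assuming a hypothetical <data>…</data> block that delimits the numbers (no XML library involved):

```python
# Sketch: extract only the numbers of a hypothetical <data> block by
# scanning line by line, without any XML library. The tag layout is an
# illustrative assumption.
sdf_lines = """\
<dataset name="my-data">
  <parameter name="delay-time" unit="ms">12.5</parameter>
  <data>
    0.0 1.02
    0.1 1.13
  </data>
</dataset>
""".splitlines()

values = []
inside = False
for line in sdf_lines:
    stripped = line.strip()
    if stripped == "</data>":
        inside = False        # stop before the closing tag
    if inside:
        values.extend(float(word) for word in stripped.split())
    if stripped == "<data>":
        inside = True         # start after the opening tag

print(values)  # the bare numbers of the data block
```

The same loop would work in any language with basic string handling, which is the portability argument being made here.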
HDF5 needs HDF5 browsers to look at the contents of HDF5 files.
With SDF on the other hand, you can use Firefox or Internet Explorer to get a first impression. Even Emacs, gedit, or notepad are sufficient for this task. It is also easy for programmers to write an SDF browser of their own – with Qt, gtk, wxWindows, etc., one can easily write graphical interfaces for SDF browsers.
Backward and Upward Compatibility of SDF¶
Backward compatibility is an absolute necessity for an SDF. One must be able to read old SDF files even if the version of the SDF definition has changed.
From this it follows that we must avoid the excessive use of attributes in the definition of the SDF XML tags. We cannot introduce new attributes in a future version of SDF without making them optional.
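Reading code can honour this rule by always supplying a default for any attribute introduced later. A sketch in Python, where the version attribute is a hypothetical later addition:

```python
# Sketch: tolerant attribute access keeps readers backward compatible.
# The "version" attribute stands for a hypothetical attribute added in
# a later SDF revision; an old file lacking it must still be readable.
import xml.etree.ElementTree as ET

old_file = ET.fromstring('<dataset name="my-data"/>')
new_file = ET.fromstring('<dataset name="my-data" version="2"/>')

# .get() returns the supplied default instead of failing on old files
print(old_file.get("version", "1"))
print(new_file.get("version", "1"))
```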
SDF in File Format Conversions¶
SDF is a powerful tool when a larger number of data file formats have to be converted into each other. With \(N\) different file formats, converting each format directly into every other is, at first sight, a problem requiring \(N\times (N-1)\) conversion routines. However, if one first converts each file format into SDF and then converts the SDF file into the desired target format, the problem reduces to a set of \(2N\) conversion routines.
Fig. 1 Converting various file formats into each other via an intermediate SDF file format (right) saves a lot of conversion routines when compared to the direct conversion of each file into each other (left).¶
Furthermore, since one file format does not understand all the tags of another file format, direct conversion will always be accompanied by a loss of information. With a properly defined intermediate SDF file format this loss of information can be avoided, at least in the conversion step from third-party file formats into the SDF file.
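The counting argument above is easy to check; for example, with N = 10 formats:

```python
# Number of conversion routines: direct pairwise conversion vs.
# conversion via an intermediate format (the counting argument above).
N = 10
direct = N * (N - 1)   # each format converted into every other
via_sdf = 2 * N        # N importers into SDF + N exporters out of SDF
print(direct, via_sdf)  # → 90 20
```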
SDF Concepts¶
A dataset is considered to be much more than just a collection of \(x,y,z\)-data columns. It is typically accompanied by a list of parameters that describe the settings of the experiment and the instrument. It often has a name, a date of creation, and much more information that is recorded together with the data columns. SDF allows such additional information to be stored and retrieved in a very general way. As mentioned above, the conversion of third-party data formats into SDF preserves this information.
Frequently, the outcome of an experiment is more than a single data set. SDF introduces the concept of workspaces to group a number of data sets together. Workspaces again can contain additional information (parameters, name, dates, etc.) associated with the collection of data sets. The workspace parameters then act as global parameters, common to all datasets contained in the workspace.
If a workspace is allowed to contain other workspaces, the concept becomes quite powerful for creating tree-like hierarchical structures of data sets. A number of tiny utility programs can be created to handle the SDF data files. Searching data files by specific contents, retrieving part of the information of a data file, converting file formats into each other, and extracting specific data columns out of an SDF file are just examples.
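One such tiny utility might, for instance, pull a single column out of a data block. A sketch, where the whitespace-separated column layout of the block is a hypothetical assumption:

```python
# Sketch of a tiny SDF utility: extract one column from a whitespace-
# separated data block. The block layout is an illustrative assumption.
def extract_column(data_text, index):
    """Return column `index` of a whitespace-separated data block."""
    return [float(row.split()[index])
            for row in data_text.strip().splitlines()]

data_block = "0.0 1.02\n0.1 1.13\n0.2 1.21"
print(extract_column(data_block, 1))  # the y-column
```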
Embedding the SDF concepts into a high–level language (like Python or C++) even allows the user to write data analysis tools directly based on SDF.
In combination with a graphical front-end browser, SDF files are well suited to building up huge archives of experimental data.