RODE Guru defends the use of data archive format (July 1997)

Eric Booth of RODE Consultants takes issue with PDM's questioning of the benefits and costs involved in using the SEG RODE encapsulation standard.

I believe that the editorial in last month's PDM on formats misrepresents the costs and benefits of the SEG RODE encapsulation standard. RODE is based on the American Petroleum Institute's RP66 version 2. RP66 is a very flexible and powerful, media-independent way of storing and exchanging data and related metadata. It forms the basis of many new formats including RODE, DLIS, GEOSHARE, WITS and PEF (POSC exchange files). The RP66 base standard defines number formats, low-level media bindings (which mean that the user does not need to know how the hardware stores the data), data structures and some common objects such as the file header, the data source and any data dictionaries used. It is, admittedly, difficult to understand and work with, but it is really the domain of the technical programmers amongst us and need not concern the end user. The data management professional needs to understand some of the features and facilities; they do not need to understand all the details. The data exchange standards are defined by schemas that implement data models: the POSC exchange file implements the EPICENTRE model, GEOSHARE implements a broad-based exploration model, and DLIS implements a well log model capable of storing data for any field or processed digital well log.

Simple model

The RODE model is extremely simple: we assume that the data to be encapsulated consists of variable-length records separated by tape marks. The model allows tape marks within a data file and provides options to record all input tape status conditions. The status conditions allow old data to be encapsulated and recovered with the status values from the original media, so that seismic software (e.g. demultiplex programs) can attempt to recover errors. The only required object is the RODE-CONTEXT object, which requires that the creator of the data file records who they are, the internal format of the data, who wrote the software, which version they used, who the job was done for, where the data came from and so on. There is an ANCILLARY-INFORMATION object that allows users to save metadata by designation and value. It is not a required object, but it can be used to store any information about a data file (e.g. end-point co-ordinates, ensemble ranges and velocity fields - it's up to you). We also provided an indexing facility, which allows index files to be generated in RP66 format and saved on the media. So if the data model is so simple, why have there been problems, and why is RODE seen as difficult and expensive?
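As a rough sketch of how little the model demands, the C fragment below shows the sort of information a RODE-CONTEXT record carries and how a single encapsulated record might be described in memory. The field names are invented for illustration; they are not the attribute names defined by RP66 or the SEG RODE schema.

    #include <stddef.h>

    /* Illustration only: names are not the RP66 / SEG RODE attribute names. */

    /* The single required object: who made the file, with what, for whom. */
    struct rode_context {
        char creator[64];           /* organisation that wrote the file            */
        char internal_format[32];   /* format of the encapsulated data, e.g. SEG-D */
        char software[64];          /* program that produced the file              */
        char software_version[16];  /* version of that program                     */
        char client[64];            /* who the job was done for                    */
        char data_origin[128];      /* where the original data came from           */
    };

    /* One encapsulated input record: the payload plus the status of the
     * original media, so that a demultiplex program can attempt recovery. */
    struct encapsulated_record {
        size_t length;              /* length of the original variable-length record */
        int    tape_mark;           /* non-zero if this position was a tape mark     */
        int    read_status;         /* status condition reported by the input drive  */
        const unsigned char *data;  /* the encapsulated bytes themselves             */
    };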

There are at least three problems to discuss:

The early published examples and implementations were based on very early drafts of RP66 version 2 (we must use version 2 for large physical blocks on tape). As a seismic programmer I had no previous experience of RP66 or DLIS, and I made a simple error in the data channel. The RP66 experts did not notice the error, and it was propagated into some of the early implementations. I found this error late last year and submitted two papers to the SEG Technical Standards committee explaining the error, providing a simple generic coding fix and a corrected example. I have placed draft copies of these papers for review and comment on my web site (http://www.rodecon.demon.co.uk), as they appear to have gone into a black hole at the SEG. The incompatibility problem is, therefore, well understood and a simple fix is available.

RP66 provides efficient structures for bulk data. Most implementations are designed for handling complex data models and do not take advantage of the simplicity implicit in the RODE model. The physical records can be double buffered so that data is always available, and the logical records extracted using one or occasionally two data moves. The software then needs to identify the logical record type and, if it is a data record, determine its structure. A RODE-encapsulated file normally has only one structure, and this should be set up as pointer increments from the start of the logical record. The addresses of the encapsulated data and status values (including the data length) can then be returned directly to the user. Unfortunately, generic RP66 software tends to work back through the defining records for each instance of an encapsulated data record and re-determine the structure of the encapsulated record. RP66 structures and objects are complex (but then so is a video recorder) and implementation is not easy. RP66 is an accepted basis for many exchange formats, and implementers are getting better at creating files and at handling some of the more common errors. Performance issues are, therefore, a matter of careful coding - in many cases de-encapsulation should be faster than reading short raw records from the media.
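The reading strategy can be sketched in a few lines of C. Everything here is hypothetical - read_physical_record() stands in for whatever routine fills a buffer from the media, one logical record per physical record is assumed, and the offsets are placeholders - but it shows the two ideas: double buffering, and fixed pointer increments instead of re-deriving the structure for every record.

    #include <stdio.h>
    #include <string.h>

    #define BUFSZ (64 * 1024)

    /* Stand-in for the tape interface: returns one fabricated physical
     * record and then end-of-data, so that the sketch runs on its own. */
    static size_t read_physical_record(unsigned char *buf, size_t max)
    {
        static int calls = 0;
        unsigned long len = 32;              /* pretend 32 bytes were encapsulated */
        if (calls++ > 0 || max < 64) return 0;
        memset(buf, 0, 64);
        buf[4] = 0;                          /* status value: clean read */
        memcpy(buf + 8, &len, sizeof len);
        return 64;
    }

    /* Offsets set up once from the defining records of the single RODE
     * structure; the values are placeholders, not the real layout. */
    static const size_t status_offset = 4;
    static const size_t length_offset = 8;
    static const size_t data_offset   = 16;

    int main(void)
    {
        /* Two physical-record buffers: one can be filled while the
         * other is being consumed (the double buffering described above). */
        static unsigned char buffers[2][BUFSZ];
        int current = 0;
        size_t got;

        while ((got = read_physical_record(buffers[current], BUFSZ)) > 0) {
            unsigned char *rec = buffers[current];
            unsigned long length;

            /* One data move to pick up the encapsulated length ... */
            memcpy(&length, rec + length_offset, sizeof length);

            /* ... then hand back addresses, not copies: the status values
             * and the data are used in place by the caller. */
            printf("record of %lu bytes, status %u, data at offset %lu\n",
                   length, rec[status_offset], (unsigned long)data_offset);

            current = 1 - current;           /* switch to the other buffer */
        }
        return 0;
    }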

The last major issue is the use of metadata. If you are prepared to guarantee that your records are perfect and that they will never be compromised, then you do not need metadata stored with the data. In the real world, tape labels fall off, paper records are lost, staff leave, companies are taken over and databases can be corrupted. Storing metadata and index files on modern media is a good idea. Building large archives costs a lot of money, and that money is wasted if you cannot find the data. The selection of parameters to store, and the allocation of these parameters to RP66 objects and attributes, is non-trivial and there is no defined standard. RP66 allows users to define a local content standard that specifies the use of objects and attributes within a file and the quality of the data within the file. The data manager is responsible for deciding how data is archived and for ensuring that the data is correct and can be easily found. They may delegate the responsibility but they cannot shirk it.
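One way a site could police such a local content standard at archive time is sketched below. The designations and the checking routine are invented for the example - a real project would take its required list from its own content standard document - but the principle is simply to refuse to write the archive until the agreed metadata is present.

    #include <stdio.h>
    #include <string.h>

    /* A designation/value pair, in the spirit of ANCILLARY-INFORMATION.
     * The designations are invented for this example. */
    struct ancillary { const char *designation; const char *value; };

    /* What this (hypothetical) site's local content standard insists on. */
    static const char *required[] = {
        "LINE-END-COORDINATES", "ENSEMBLE-RANGE", "VELOCITY-FIELD-REFERENCE"
    };

    /* Return the number of required designations that are missing. */
    static int check_local_standard(const struct ancillary *items, int n)
    {
        int missing = 0;
        for (size_t r = 0; r < sizeof required / sizeof required[0]; r++) {
            int found = 0;
            for (int i = 0; i < n; i++)
                if (strcmp(items[i].designation, required[r]) == 0) found = 1;
            if (!found) {
                fprintf(stderr, "archive rejected: %s not supplied\n", required[r]);
                missing++;
            }
        }
        return missing;
    }

    int main(void)
    {
        struct ancillary meta[] = {
            { "LINE-END-COORDINATES", "N 60 01.5 / E 002 12.3" },
            { "ENSEMBLE-RANGE",       "101-4800" },
        };
        return check_local_standard(meta, 2) ? 1 : 0;
    }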

F-word

The flexibility and functionality of RP66 and RODE are a problem: you need to strike a balance between storing all relevant metadata and keeping the files simple. You can, however, ensure that the data is easy and efficient to read and still include all the metadata on a single medium. End users are not interested in the physical location of data; all they want is access to it. The system must be able to identify the correct volume, load it, confirm that it is correct, move to a specific file, open that file and confirm that it is the file the user wanted, and then extract the appropriate data. This should happen automatically, without user intervention, so volume and file identifiers are essential. Any large archive project should set standards for volume labels and file identifiers.
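A sketch of that sequence, with every identifier and routine invented for the purpose (load_volume, read_volume_label and read_file_header_id stand in for whatever the robot and catalogue actually provide), might look like this:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical archive interface: in practice these would drive the
     * tape robot and consult the project catalogue. */
    static int         load_volume(const char *id)   { printf("loading volume %s\n", id); return 0; }
    static const char *read_volume_label(void)       { return "ARCH97-0042"; }
    static const char *read_file_header_id(int file) { (void)file; return "LINE-12B-STACK"; }

    int main(void)
    {
        const char *wanted_volume = "ARCH97-0042";    /* from the catalogue      */
        const char *wanted_file   = "LINE-12B-STACK"; /* what the user asked for */
        int         file_number   = 3;                /* position on the volume  */

        /* 1. Identify and mount the volume the catalogue points to. */
        if (load_volume(wanted_volume) != 0) return 1;

        /* 2. Confirm that the mounted media really is that volume. */
        if (strcmp(read_volume_label(), wanted_volume) != 0) {
            fprintf(stderr, "wrong volume mounted\n");
            return 1;
        }

        /* 3. Move to the file, open it and confirm its identifier. */
        if (strcmp(read_file_header_id(file_number), wanted_file) != 0) {
            fprintf(stderr, "file identifier does not match the request\n");
            return 1;
        }

        /* 4. Only now extract the data - no user intervention required. */
        puts("identifiers confirmed, extraction can proceed");
        return 0;
    }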

I believe that you should always be able to recover the metadata from the same media as the data if necessary. Hopefully you should never need to read all of the objects, but they provide a high level of disaster insurance, provided they contain the relevant metadata.

Eric Booth has 28 years' technical experience in the oil industry, including field acquisition, seismic data processing, systems management, system design, capacity planning, professional society standards committees, and research and development. He currently chairs the SEG Technical Standards sub-committee for High-Density Media formats and is working with the American Petroleum Institute's RP66 committee on the implementation of RP66 on new media. His company, RODE Consultants Limited (eric@rodecon.demon.co.uk), is active in a number of RODE-related projects for a variety of clients and has also developed ReadRP66, a stand-alone package to verify compliance with the API's RP66 version 2.01 and the SEG RODE schema.


© Oil IT Journal - all rights reserved.