DIY data management - Part III

Garry Perrat from UK consultancy Geocon concludes our three-part UNIX primer with an introduction to some of the powerful ‘little languages’ available to UNIX users.

Shell scripts can contain any Unix command. Two of the most useful for data management purposes are sed (stream editor) and awk (a text-processing language named after its inventors: Aho, Weinberger and Kernighan). You can run them from the command line but it is better to put longer commands in an editable script.

Sed reads a text file, applying edits as it goes. For example, suppose we have a 2D seismic navigation file containing lines GEO97-123 and GEO97-123A which have been merged into one line in our interpretation software. We might want to change GEO97-123A in our nav. file to GEO97-123, for which purpose we can use sed’s substitution command:

sed 's/GEO97-123A/GEO97-123 /' nav.dat >nav.dat2

The command has the general form s/search/replace/ which means "substitute search with replace". Note the trailing space in the replacement string to ensure that any subsequent columns remain aligned and the quotes to hide any special characters (like the space). Suppose now that we have not only an A-line but B and C as well. We can extend the command to replace all three reshoot names with the base name:

sed -e 's/GEO97-123A/GEO97-123 /' -e 's/GEO97-123B/GEO97-123 /' -e 's/GEO97-123C/GEO97-123 /' nav.dat >nav.dat2

Now that we have three separate edit commands each one must be preceded with -e. Otherwise it’s much the same as the first example. However, we can do better than this by using a "wildcard":

sed 's/GEO97-123./GEO97-123 /' nav.dat >nav.dat2

The period after the first "123" matches any single character so we don't need a separate command for each seismic line. However, if we also have another line GEO97-1234 which we don't want changed to GEO97-123 we need to be more specific:

sed 's/GEO97-123[A-Z]/GEO97-123 /' nav.dat >nav.dat2

The "[A-Z]" means "match any single character between A and Z" so it will not match a number. If you want to match lowercase letters as well you can include them as another range within the square brackets:

sed 's/GEO97-123[A-Za-z]/GEO97-123 /' nav.dat >nav.dat2
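To see the difference between the wildcard and the bracketed ranges, both commands can be tried on a small test file. This is only a sketch: the file sample.dat, its column layout and its values are made up for illustration and do not represent a real navigation format.

```shell
# Build a tiny, made-up nav file (line name, shotpoint, x, y)
cat > sample.dat <<'EOF'
GEO97-123A  101 435067 6235678
GEO97-123b  102 435092 6235701
GEO97-1234  101 425698 6254302
EOF

# The "." wildcard renames all three lines, including GEO97-1234
sed 's/GEO97-123./GEO97-123 /' sample.dat > dot.out

# The bracketed ranges rename only the lettered reshoots
sed 's/GEO97-123[A-Za-z]/GEO97-123 /' sample.dat > range.out

cat dot.out range.out
```

Comparing the two output files shows that only the second command leaves GEO97-1234 alone, and that the replacement's trailing space keeps the remaining columns where they were.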

All of these examples are short enough to be run from the command line, but sed can do far more than there is room to cover here. Once the commands become longer, putting them into a script helps.
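As a rough sketch of what such a script might look like, the edit commands can be kept in a file of their own and applied with sed's -f option. The file names rename.sed and sample2.dat, the second line name GEO97-200 and the test data are all invented for this illustration.

```shell
# Put the edit commands in a file, one per line; lines starting with # are comments
cat > rename.sed <<'EOF'
# Merge lettered reshoots into their base line names
s/GEO97-123[A-Za-z]/GEO97-123 /
s/GEO97-200[A-Za-z]/GEO97-200 /
EOF

# Made-up test data
cat > sample2.dat <<'EOF'
GEO97-123A  101 435067 6235678
GEO97-200C  101 425698 6254302
GEO97-1234  101 430000 6240000
EOF

# -f tells sed to read its commands from the named file
sed -f rename.sed sample2.dat > sample2.out
cat sample2.out
```

Keeping the commands in a file means new line names can be added with a text editor rather than by retyping an ever longer command line.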

Awk is particularly useful for data manipulation and reformatting. The simplest use is to print only certain fields from a file. Suppose we have a file, horizons.dat, containing inline-xline-x-y-t1-v1-t2-v2-t3-v3, all separated with whitespace, and only want inline, xline and d1 (=t1*v1/2000):

   1   5 435067 6235678 523 1480 1546 2504 2804 3256
1256 245 425698 6254302 498 1480 1630 2396 3045 3502

We can use awk’s print command to select the fields we require:

awk '{print $1,$2,$5*$6/2000}' horizons.dat >horizons.dat2

In awk a dollar precedes a field number so this command prints fields 1, 2 and the product of the 5th and 6th divided by 2000, each separated by a single space. (Fields are, by default, any sequence of characters separated by whitespace.) Note that the body of the command is enclosed within curly braces and is also quoted to hide the special characters (e.g. preventing the shell from attempting variable substitution with the dollars). However, if our file has fields which vary in width (e.g. inline may range from "1" to "1256") our output file might look rather messy and be difficult to use:

1 5 387.02
1256 245 368.52

We need a formatted print command:

awk '{printf("%4d %4d %6.2f\n",$1,$2,$5*$6/2000)}' horizons.dat >horizons.dat2

which results in this output file:

   1    5 387.02
1256  245 368.52

printf looks a bit hairy but is well worth mastering as it increases awk’s power no end. The general syntax is:

printf("format", value1, value2, ...)

Each individual format begins with a % and there must be as many formats as there are values to print (which can be anything - field numbers, other variables, constants, strings, etc.)
The basic formats are:

%s	String
%-s	Left-justified string
%d	Integer
%f	Floating point

Strings and integers may be preceded with a minimum field width, although this will be increased if the value won't fit into the specified width (e.g. %4d in the example above specifies an integer at least four characters wide, %-10s specifies a left-justified 10-character string). Floating point numbers can be preceded with a decimal number specifying field width and precision (e.g. %6.2f in the example above forces the number to be printed with two decimal places in a total field width of six characters, including the decimal point). Note the \n at the end of the format, the newline character: don't forget it or you'll end up with one very long line in the output file! Anything not beginning with either % or \ is printed verbatim (including spaces) so:

printf("inline=%4d, xline=%4d, depth=%6.2f\n",$1,$2,$5*$6/2000)

prints:

inline=   1, xline=   5, depth=387.02
inline=1256, xline= 245, depth=368.52
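The effect of the different formats is easiest to judge by trying them. The following sketch feeds awk a single made-up record (the name top_chalk and both numbers are invented) and prints each field several ways. The square brackets are there only to make the padding visible.

```shell
# Feed one made-up record (name, integer, real number) to awk
echo "top_chalk 42 1234.5678" | awk '{
    printf("[%s]\n",     $1)   # plain string
    printf("[%12s]\n",   $1)   # right-justified in 12 characters
    printf("[%-12s]\n",  $1)   # left-justified in 12 characters
    printf("[%6d]\n",    $2)   # integer in 6 characters
    printf("[%8.2f]\n",  $3)   # 2 decimal places in 8 characters
    printf("[%2d]\n",    $3)   # width too small, so the field grows
}' > formats.out
cat formats.out
```

The last line shows the point made above about minimum widths: the value does not fit into two characters, so awk widens the field rather than truncating the number.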

Our last example can be re-written in full with variables, comments and more space to improve clarity:

awk '{ # Define variables
i=$1 # Inline is the first field
j=$2 # Xline is the second field
t1=$5 # T1 is the fifth field
v1=$6 # V1 is the sixth field
# Compute depth
depth=t1*v1/2000
# Print the results
printf("inline=%4d, xline=%4d, depth=%6.2f\n",i,j,depth) }' horizons.dat >horizons.dat2

Note that variables within awk are never preceded with a dollar but are just referenced by name. This is one of those confusing differences between awk and the shell (particularly so when there is awk code within a shell script, as in this case!). Note that quoted strings are never interpreted as variables so:

print "inline=" inline

prints the literal text inline= followed by the value of the awk variable inline.
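Because the shell and awk keep their variables separately, a value held in a shell variable has to be handed to awk explicitly. One common way is awk's -v option, sketched below. The names survey and name, the label GEO97 and the file name horizons_test.dat are assumptions made for this example; the two records are the sample records shown earlier.

```shell
# Recreate the two sample records in a scratch file
cat > horizons_test.dat <<'EOF'
   1    5 435067 6235678  523 1480 1546 2504 2804 3256
1256  245 425698 6254302  498 1480 1630 2396 3045 3502
EOF

survey="GEO97"    # a shell variable: set without a dollar, read back as $survey

# -v copies the shell value into the awk variable "name".
# Inside the single quotes, $1 and $5 are awk fields and name is an awk variable.
awk -v name="$survey" '{
    depth = $5*$6/2000
    printf("%s inline=%4d, xline=%4d, depth=%6.2f\n", name, $1, $2, depth)
}' horizons_test.dat > labelled.out
cat labelled.out
```

The dollar in "$survey" is expanded by the shell before awk starts, whereas the dollars inside the single quotes are left for awk to interpret as field numbers.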



More Examples

Scripts can be very complex. Many Landmark commands started from the OpenWorks launcher are actually wrapper scripts which do various checks and perform other tasks before spawning the actual application and other CAEX systems may work in a similar way. Some other examples include:
Looking for files that haven't been used for some time, perhaps owned by a particular user and greater than a certain size (e.g. find /disk* -user gcp -atime +30 -size +50000000c finds files under directories /disk* owned by gcp, last accessed more than 30 days ago and larger than (roughly) 50MB).
Reformatting ASCII data (e.g. horizons, well data, velocities, faults).
Computing interval velocities from stacking velocities and time picks.
Depth converting exported time horizons.
Note the common themes in these tasks: saving time on often-used commands (which you could write out in full each time but are easier to use in a script; phone fred is much easier than grep -i fred $HOME/docs/phonebook), reformatting data and performing computations that are difficult without appropriate software. That is the power of scripts.
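As one illustration of the interval velocity task listed above, the sketch below applies the Dix equation, Vint = sqrt((V2*V2*t2 - V1*V1*t1)/(t2 - t1)), to consecutive picks. It assumes a hypothetical input file, picks.dat, holding one two-way time (ms) and one stacking velocity (m/s) per line in order of increasing time. The file layout and the values are invented for the example, and treating stacking velocities as RMS velocities is itself an approximation.

```shell
# Made-up picks: two-way time (ms) and stacking velocity (m/s)
cat > picks.dat <<'EOF'
1000 2000
2000 2500
3000 3000
EOF

awk '{
    t = $1; v = $2
    if (NR == 1) {
        vint = v            # first layer: interval velocity equals the pick
    } else {
        # Dix equation between the previous pick (t0,v0) and this one
        vint = sqrt((v*v*t - v0*v0*t0) / (t - t0))
    }
    printf("%6d %6d %8.1f\n", t, v, vint)
    t0 = t; v0 = v          # remember this pick for the next line
}' picks.dat > vint.out
cat vint.out
```

Each output line repeats the time and stacking velocity and adds the interval velocity of the layer ending at that pick. A production script would also want to guard against repeated or decreasing times, which this sketch does not check.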
Scripts are not just for the command line. There are many scripting languages available of which some (e.g. Tcl and Perl) include GUI capabilities with which to build user-friendly front ends for all your scripts and really impress your users ... not to mention your managers!
Not Just For Unix
Scripting isn't just for Unix, either. Of course, you can install Linux, Solaris, etc. on PCs but there are also Windows versions of many Unix utilities, including Cygwin, the MKS Toolkit and WinXs.
Further Information
Online man pages should be available for most Unix commands on your system (e.g. man sed displays the page for sed) but they can sometimes be rather impenetrable! The classic awk reference is The AWK Programming Language by Aho, Kernighan and Weinberger, published by Addison-Wesley, ISBN 0-201-07981-X. The books published by O'Reilly and Associates are also good, including "Unix in a Nutshell", an excellent quick reference for most commands, and "sed & awk".
What Do You Want To Do Today?
The possibilities with scripts really are endless. So next time you think ‘I wish I could ...’ remember that with a script you probably can!


© Oil IT Journal - all rights reserved.