In Part 1 we started working with scripts, including variables and command line arguments, by building an example called filelist. We continue to improve it below, after a different example.
using all arguments
Remember that arguments given to a script can be referenced via $1, $2 and so on. In the same way that ls *.dat lists all .dat files together, $* means "all arguments". For example, consider this script called lsort:
ls -l | sort -n +4
There are three parts to this command:
- A long listing (ls -l)
- The vertical line "|" is a "pipe" (shift-backslash on my keyboard), which sends the output of the preceding command as input to the following one.
- A numeric sort on the fifth field (think of +4 as "ignore the first four fields for sort purposes") which happens to be file size in the listing. (This can vary between systems and will be the fourth field if group is omitted.)
So it generates a list of files in the current directory sorted by size. This is all very good but what if you only want to list certain files or want to list files from a number of different directories? Try this version:
ls -l $* | sort -n +4
If we run

lsort *.dat

all the files that match are passed to the ls command, which is exactly what we want. It gets better. If we run

lsort -a

the listing command becomes
ls -l -a
In other words, it lists all the files in the current directory. (You would normally write this as ls -la, but it works the same with separate arguments.) Of course, you can combine both ls arguments and multiple files (e.g. lsort -a *.dat) - just remember to specify the ls arguments before any filenames.
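As an aside, the +4 field syntax used here is an old one and has been removed from modern versions of sort; the POSIX replacement names the field with -k, counting fields from 1 rather than skipping them. A sketch of lsort in the newer syntax (field 5 is the size column when group is shown, as discussed above):

```shell
#!/bin/sh
# lsort using the POSIX key syntax: -k5,5 means "sort on field 5 only",
# the file size column in a typical "ls -l" listing.
ls -l $* | sort -n -k5,5
```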
looping through arguments
Going back to filelist now, we can add functionality to loop through any number of different file types given as arguments:
#!/bin/sh
PROJ=$1
shift
for FTYPE in $*
do
    ls -l /disk*/projects/$PROJ/*.$FTYPE /nobackup*/projects/$PROJ/*.$FTYPE
done
- The first argument is always the project name, so shift deletes it (after it has been captured in PROJ), moving the second argument into $1, the third into $2, and so on.
- For each element in the specified list, in this case $* (which means "all (shifted) command line arguments") the for loop sets the variable FTYPE to the element and executes the commands between do and done.
- You can run as many commands as you like within the do loop - this example only has a single listing.
- The result is a list for each filetype specified (e.g. filelist myproj sgy txt log lists .sgy files, then .txt files and finally .log files).
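The shift mechanics can be seen in isolation with a few lines of shell; set -- simulates the command line (myproj, sgy, txt and log are made-up arguments):

```shell
#!/bin/sh
# Simulate "filelist myproj sgy txt log" with set -- (made-up arguments).
set -- myproj sgy txt log
PROJ=$1          # capture the project name
shift            # drop it: $* is now "sgy txt log"
REMAINING=$*
```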
We could write a script to loop through a single filetype for many projects (all arguments except the last are projects) or, indeed, many filetypes in many projects (we would have to parse for a special argument which specified where the list of projects ended and that of filetypes began). These modifications are beyond the scope of this article but are included on the website.
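For the curious, here is a minimal sketch of the separator idea (not necessarily the website's version): a hypothetical "--" argument marks where the project list ends and the filetype list begins, and echo stands in for the real ls so the sketch runs anywhere:

```shell
#!/bin/sh
# Hypothetical sketch: arguments before "--" are projects, those after it
# are filetypes. set -- simulates: script projA projB -- sgy txt
set -- projA projB -- sgy txt
PROJECTS=
while [ $# -gt 0 ] && [ "$1" != "--" ]
do
    PROJECTS="$PROJECTS $1"
    shift
done
[ $# -gt 0 ] && shift          # discard the "--" itself
for PROJ in $PROJECTS
do
    for FTYPE in $*
    do
        # echo stands in for the real ls so the sketch runs anywhere
        echo "ls -l /disk*/projects/$PROJ/*.$FTYPE"
    done
done
```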
We can use arguments to apply a script to a data file, perhaps saving the output in another file. Say we have a number of files containing many logs for many wells and we want another set of files containing only those lines specifying well and log names. The file logs1.dat may look something like this (entirely fictitious example):
Well: 12/34-5a
Log: DT
Depth Value
1000 123
1010 234
1020 345
...
Log: POR
Depth Value
1000 26.7
1010 13.7
...
Well: 12/34-5b
...
We can generate our list in a new file with a script called listlogs:
egrep '(Well|Log)' logs1.dat >logs1.names
the output file looking like this:
Well: 12/34-5a
Log: DT
Log: POR
Well: 12/34-5b
...
egrep is "extended grep", one of its extensions being the ability to search for alternative strings, listed within quoted parentheses and separated by pipes (in this case "Well" or "Log").
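Note that on current systems egrep survives mainly as a compatibility alias (and may print a deprecation warning); grep -E is the portable spelling of the same command. Demonstrated here on a small made-up logs1.dat in a scratch directory:

```shell
#!/bin/sh
# grep -E is the modern, POSIX spelling of egrep; the pattern is unchanged.
cd "$(mktemp -d)" || exit 1
cat > logs1.dat <<'EOF'
Well: 12/34-5a
Log: DT
1000 123
Log: POR
1000 26.7
EOF
grep -E '(Well|Log)' logs1.dat >logs1.names
```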
But we can improve it to avoid having to edit it for the next file:
egrep '(Well|Log)' $1 >$2
which we run as
listlogs logs2.dat logs2.names
In this case, however, it is probably better not to hardwire the output redirection into $2 but instead to leave it off altogether:
egrep '(Well|Log)' $1
We run this as listlogs logs2.dat > logs2.names which follows the normal pattern for sending a command’s output into a file. This also permits us to write the results directly to the screen by omitting the output redirection (e.g. listlogs logs2.dat).
There is nothing to stop us running scripts from within other ones. Rather than manually run listlogs for every input file we could write another script to loop through all of them in turn:
#!/bin/sh
for FILE in $*
do
    listlogs $FILE >$FILE.names
done
This is run as something like listall logs*.dat and gives a series of output files, each with the same name as the input plus a ".names" suffix (e.g. logs5.dat gives logs5.dat.names).
If we wanted to write all the output into a single file we could use the output append operator ">>" within the loop:
do
    listlogs $FILE >>all.names
done
If we didn't use the append operator we would end up with only the last input file's output saved, since listlogs is run, and the output file written to, separately for each iteration of the for loop. Don't forget to empty all.names first if it already exists from a previous run and you don't want to keep that output. We could overwrite it automatically, but that's a bit dangerous unless we back up the existing file first, for which mv all.names all.names.old can be added before the for line. This will generate an error from mv if all.names doesn't exist, but we can put up with that. The website includes an example of working around this more cleanly.
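One cleaner approach (a sketch, not necessarily the website's version) is to test for the file before moving it, so mv is only run when there is actually something to back up:

```shell
#!/bin/sh
# listall variant: back up a previous all.names only if one exists,
# so mv never complains about a missing file.
if [ -f all.names ]
then
    mv all.names all.names.old
fi
for FILE in $*
do
    listlogs $FILE >>all.names
done
```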
Our lawyers insist that we disclaim any responsibility for the use of the code snippets provided here and on the oilIT.com website. All code is provided ‘as is’ and no guarantee for fitness for purpose is implied either by The Data Room or by Geocon. Make sure you back up any critical files before running any script on them.
© Oil IT Journal - all rights reserved.