cosine.blue

Blog by Gregory Chamberlain.

Rudimentary Parsing with Idiomatic POSIX Shell

Example-driven tutorial on the versatile read(1)-while loop.


Cover image: Photo of small mechanical keyboard and notebook with only #!/bin/sh written in it

POSIX shell idioms

You sit down before your obnoxiously clacky mechanical keyboard and bang out the familiar #!/bin/sh with that precious sense of optimism like writing the first words in a new notebook: the margins are aligned and you begin to think that your handwriting isn’t so bad after all. Soon enough, though, it’s riddled with scribbled out paragraphs and arrows that swing around the pages.

Maintaining safe and bug-free shell scripts can become troublesome as they grow in complexity, and limiting one’s craft to those few blessed features defined by POSIX only amplifies that.

However, what is lost in flexibility is gained in portability. With some casual experimentation and a handful of powerful idioms, perhaps the purely POSIX shell language can prove more elegant and expressive than it is given credit for.

What follows are three ways to use one such idiom: the read(1)-while loop.

The read(1) builtin

Built into any POSIX-compatible shell is the read(1) command.

read [-r] var...

In its most basic usage, it can be used to gather a line of input from the user at the terminal; that might be to answer a yes/no prompt, or to provide a text string. But more versatile is its ability to parse text files in a rudimentary but reliable way.
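As a rough sketch of the yes/no prompt mentioned above, something like the following could work. The function name confirm, the variable answer and the set of accepted replies are my own assumptions for illustration, not part of any standard.

```shell
#!/bin/sh
# Hypothetical helper: ask a yes/no question on stderr and succeed
# only if the reply is y or yes (in any letter case).
confirm() {
    printf '%s [y/N] ' "$1" >&2
    # Treat EOF (no reply at all) as a "no".
    IFS= read -r answer || return 1
    case $answer in
        [Yy]|[Yy][Ee][Ss]) return 0 ;;
        *) return 1 ;;
    esac
}

# Non-interactive usage: the reply is piped in instead of typed.
printf 'yes\n' | confirm 'Continue?' && printf 'Continuing.\n'
```

At a terminal you would simply call confirm 'Continue?' and type the reply.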

The read(1) command consumes one line of text at a time, taken from standard input or whatever you pipe or redirect into it. When it hits EOF (end-of-file, which marks the end of a file or data stream) it fails, that is, it returns a non-zero exit status. So, the general pattern is to call read(1) repeatedly until it fails.

Below is a simple example which prints each line read. The entire content of each line, except the terminating newline character, is assigned to the line variable. We set IFS to the empty string to prevent field splitting, which would otherwise strip leading and trailing white space. It’s conventional to call the variable line, but really you can call it anything you like.

while IFS= read -r line
do printf '%s\n' "$line"
done
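To see what IFS= and -r each contribute, here is a small experiment that reads the same line three ways. The helper names (read_plain, read_raw, read_exact) are made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helpers: each reads one line from stdin a different way
# and prints the result between brackets.
read_plain() { read line;         printf '[%s]\n' "$line"; }
read_raw()   { read -r line;      printf '[%s]\n' "$line"; }
read_exact() { IFS= read -r line; printf '[%s]\n' "$line"; }

# The input line is: two spaces, then a, a backslash, t, b.
input='  a\tb'

printf '%s\n' "$input" | read_plain   # prints [atb]
printf '%s\n' "$input" | read_raw     # prints [a\tb]
printf '%s\n' "$input" | read_exact   # prints [  a\tb]
```

Without -r the backslash is treated as an escape character and disappears; without IFS= the leading spaces are stripped. Only the third form hands back the line untouched.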

Unless you know what you’re doing, you probably want to pass the -r option, which stops read(1) from treating backslashes as escape characters; its effect is described in the POSIX Programmer’s Manual. Beware that read(1) still fails if a non-empty final line is missing its newline character before EOF, so the loop body will not run for that line even though its text has been assigned to the variable. The solution is to check whether line is non-empty in the event that read fails

while IFS= read -r line || [ "$line" ]
do …
done

but you may instead prefer to assume that the input is well formed—garbage in, garbage out.
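A quick way to convince yourself of the difference is to count lines with and without the guard. The function names below are hypothetical, chosen only for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin, ignoring a final line
# that lacks its terminating newline.
count_strict() {
    n=0
    while IFS= read -r line
    do n=$((n + 1))
    done
    printf '%s\n' "$n"
}

# Hypothetical helper: same, but a final unterminated line still counts.
count_lenient() {
    n=0
    while IFS= read -r line || [ "$line" ]
    do n=$((n + 1))
    done
    printf '%s\n' "$n"
}

# The input deliberately has no newline after "two".
printf 'one\ntwo' | count_strict     # prints 1
printf 'one\ntwo' | count_lenient    # prints 2
```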

Case study 1 of 3: Linewise filtering

Let’s say we want to remove lines that begin with a hash sign (#)—something akin to grep -v ^\#. This can be achieved with the #* glob pattern in a case statement.

while IFS= read -r line
do
    case $line in
        \#*) ;;
        *) printf '%s\n' "$line"
    esac
done

The backslash escape in \#* is necessary to protect the hash sign from becoming a shell comment. Pattern matching each line like this is a powerful way to interpret file contents. Let’s also ignore leading white space:

while IFS= read -r line
do
    case ${line#${line%%[![:space:]]*}} in
        \#*) ;;
        *) printf '%s\n' "$line"
    esac
done <<EOF
one
# two
  # three
  four
EOF

The precise workings of this parameter expansion wizardry are described in Dylan Araps’s excellent pure-sh-bible, but understanding them is not necessary going forward; just know that it expands to the value of line stripped of any leading white space. This time I fed it a here document, resulting in the following two lines being printed:

one
  four
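If you are curious anyway, the nested expansion can be unrolled into two steps. This is only a sketch; the function name ltrim and the variable leading are my own, and note that plain POSIX sh has no local variables, so leading is global.

```shell
#!/bin/sh
# Hypothetical helper: print $1 with leading white space removed.
ltrim() {
    # Step 1: delete the longest suffix that begins with a non-space
    # character. What remains is just the leading white space.
    leading=${1%%[![:space:]]*}
    # Step 2: delete that white space from the front of the original.
    printf '%s\n' "${1#"$leading"}"
}

ltrim '   # three'    # prints "# three"
```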

Case study 2 of 3: Key-value pairs

Let’s now try to parse a file consisting of lines of the form key=value where key contains no equals signs and value can be anything at all.

For this we set the IFS to the equals sign and pass two variable names as arguments to read(1).

while IFS='=' read -r key value
do printf 'Key ‘%s’ has value ‘%s’.\n' "$key" "$value"
done <<EOF
name=banana
type=fruit
colour=yellow
EOF

Setting IFS='=' tells read(1) to split the line into fields at every equals sign. Since we named only two variables (key and value) to assign fields to, any fields right of the second are merged into value. In other words, all characters after the first equals sign are considered part of the value, so no quoting or escaping is necessary.

Running the above code results in the following output:

Key ‘name’ has value ‘banana’.
Key ‘type’ has value ‘fruit’.
Key ‘colour’ has value ‘yellow’.
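To check the claim that everything after the first equals sign lands in value, here is a sketch of a lookup function fed a value that itself contains equals signs. The name lookup and the sample data are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first line on stdin
# whose key equals $1; fail if no such key is found.
lookup() {
    while IFS='=' read -r key value
    do
        if [ x"$key" = x"$1" ]
        then
            printf '%s\n' "$value"
            return 0
        fi
    done
    return 1
}

lookup query <<EOF
name=banana
query=a=1&b=2
EOF
```

This prints a=1&b=2, so the later equals signs survive intact.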

Case study 3 of 3: User’s full name

Since the original Unix the fifth field of a user’s passwd(5) record has been the GECOS field where users’ contact information is stored. Here’s mine:

$ grep ^$USER: /etc/passwd
greg:x:1000:1000:Gregory Chamberlain,,,:/home/greg:/bin/bash

Among the colon-separated fields, you can see that my full name is the first of four comma-separated values within the GECOS field. Many desktop operating systems ask for your full name during installation or when adding a new user and that’s where it’s stored. If not, try chfn(1) or just (carefully!) edit /etc/passwd directly as root.

Assigning the first field to user and the fifth to gecos, we use underscores in place of the other fields which we don’t care about.

while IFS=: read -r user _ _ _ gecos _
do [ x"$user" = x"$USER" ] && name="${gecos%%,*}"
done < /etc/passwd

Don’t be tempted to move the < /etc/passwd file redirection closer to the read statement; it needs to feed into the while loop itself so that it progresses over successive lines.
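Here is a sketch of what goes wrong if you do move it. The temporary file path and the function names are assumptions for this demo, and the counter exists only to stop the broken loop from running forever.

```shell
#!/bin/sh
# Hypothetical demo file; the path is an assumption for illustration.
file=${TMPDIR:-/tmp}/read-demo.$$
printf 'first\nsecond\nthird\n' > "$file"

# Wrong: the file is reopened for every call to read, so each call
# sees the first line again.
read_wrong() {
    n=0
    while IFS= read -r line < "$1"
    do
        printf 'wrong: %s\n' "$line"
        n=$((n + 1))
        [ "$n" -ge 3 ] && break    # stop, or this would loop forever
    done
}

# Right: the file is opened once for the whole loop, so successive
# calls to read advance through it.
read_right() {
    while IFS= read -r line
    do printf 'right: %s\n' "$line"
    done < "$1"
}

read_wrong "$file"    # prints "wrong: first" three times
read_right "$file"    # prints first, second and third in turn
rm -f "$file"
```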

After reading each line, the resulting values of user and USER are tested for equality. Prefixing each operand with any ol’ character, in this case x, is a defensive habit that keeps arbitrary strings from being mistaken for [(1) operators. POSIX defines the three-argument form unambiguously, but on older, pre-POSIX implementations $user expanding to something like -z could cause an error.

[ x"$user" = x"$USER" ]

If the two match, then we know we are looking at the right line and so we begin parsing the gecos string for a name:

name="${gecos%%,*}"

We can’t simply call read(1) again inside the loop, because it would consume the next line of /etc/passwd rather than re-split gecos. Anyway, we’re only interested in the first field so we can just use a greedy suffix pattern in the expansion of gecos. You can read about Parameter Expansion in dash(1) for the details, but in short %%,* removes the longest suffix matching ,* which means the first comma and everything after it.
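For comparison, here are the four prefix and suffix removal forms applied to one string. Apart from the name, the GECOS values are invented for this sketch.

```shell
#!/bin/sh
# Hypothetical GECOS string: only the name comes from the article.
gecos='Gregory Chamberlain,Room 1,555-0100,555-0199'

printf '%s\n' "${gecos%%,*}"   # longest suffix removed:  Gregory Chamberlain
printf '%s\n' "${gecos%,*}"    # shortest suffix removed: Gregory Chamberlain,Room 1,555-0100
printf '%s\n' "${gecos#*,}"    # shortest prefix removed: Room 1,555-0100,555-0199
printf '%s\n' "${gecos##*,}"   # longest prefix removed:  555-0199
```

The doubled operator is the greedy one in both directions, which is why %%,* keeps only the first field.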

Below is a short script that illustrates how this could be integrated into a larger program. I’ve also thrown in a read(1) command that prompts interactively for a name if one is not found, demonstrating how read(1) can be used outside of a loop as well.

#!/bin/sh
while IFS=: read -r user _ _ _ gecos _
do [ x"$user" = x"$USER" ] && name="${gecos%%,*}"
done < /etc/passwd

if [ "$name" ]
then
    printf 'Your real name is %s.\n' "$name"
else
    printf 'What’s your name? '
    IFS= read -r name
    printf 'Hello %s!\n' "$name"
fi

Reasons not to use Bash

Most prominently, Bash scripts are not portable. From Drew DeVault’s Introduction to POSIX shell:

Any shell that utilizes features specific to Bash are not portable, which means you cannot take them with you to any other system. […] This is bad if your users wish to utilize your software anywhere other than GNU/Linux. If your build tooling utilizes bashisms, your software will not build on anything but GNU/Linux. If you ship runtime scripts that use bashisms, your software will not run on anything but GNU/Linux.

He goes on to argue that

you should stick to POSIX shell for your personal scripts, too. You might not care now, but when you feel like flirting with other Unicies you’ll thank me when all of your scripts work.

Also, Bash is monstrously complex; even its man page confesses it’s too big and too slow. And let’s not forget Shellshock, the arbitrary code execution vulnerability in Bash responsible for millions of attacks on web-facing servers.


To leave a comment, please send a plaintext email to ~chambln/public-inbox@lists.sr.ht and it will show up in my public inbox.