Welcome to Linux Knowledge Base and Tutorial
"The place where you learn linux"
Linux Tutorial - Editing Files - Awk

Awk

Another language that Linux provides, and that is standard on most UNIX systems, is awk. The name awk is composed of the first letters of the last names of its developers: Alfred Aho, Peter Weinberger, and Brian Kernighan. Like sed, awk is an interpreted pattern-matching language, and like sed it can read stdin. It can also be passed the name of a file containing its instructions.

The most useful aspect of awk (at least useful for me and the many Linux scripts that use it) is its idea of a field. Like sed, awk will read whole lines, but unlike sed, awk can immediately break them into segments (fields) based on some criteria. Each field is separated by a field separator. By default, this separator is white space (spaces or tabs). By using the -F option on the command line or the FS variable within an awk program, you can specify a new field separator. For example, if you specified a colon (:) as a field separator, you could read in the lines from the /etc/passwd file and immediately break them into fields.
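As a minimal sketch of the -F option, the following splits colon-separated lines into fields. The two sample lines are hypothetical entries in /etc/passwd format (the user alice is made up); they are used instead of the real file so the result is predictable.

```shell
# Hypothetical sample lines in /etc/passwd format (not read from the real file).
passwd_sample='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice Example:/home/alice:/bin/sh'

# -F: makes the colon the field separator.
# $1 is the login name and $7 is the login shell.
result=$(printf '%s\n' "$passwd_sample" | awk -F: '{ print $1, $7 }')
printf '%s\n' "$result"
```

This should print the login name and shell of each sample entry, separated by a space.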

A programming language in its own right, awk has become a staple of UNIX systems. The basic purposes of the language are manipulating and processing text files. However, awk is also a useful tool when combined with output from other commands, and allows you to format that output in ways that might be easier to process further. One major advantage of awk is that it can accomplish in a few lines what would normally require dozens of lines in an sh or csh shell script, or might even require writing something in a lower-level language, like C.

The basic layout of an awk command is

pattern { action }

where the action to be performed is included within the curly braces ({}). Like sed, awk reads its input one line at a time, but awk sees each line as a record broken up into fields. Fields are separated by an input Field Separator (FS), which by default is a tab or space. The FS can be changed to something else, for example, a semi-colon (;), with FS=";". This is useful when you want to process text that contains blanks; for example, data of the following form:

Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24

Here we have name, address, city, state, zip code, and age. Without using ; as a field separator, "Blinn," and "David;42" would be two separate fields. Here, we want to treat each name, address, city, etc., as a single unit, rather than as multiple fields.
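The difference can be checked on the first record. This is only a sketch: it prints the first two fields and the field count (NF), separated by a | character chosen here purely to make the field boundaries visible.

```shell
line='Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33'

# Default separator: fields are split on white space, so the name is cut in two.
default_split=$(printf '%s\n' "$line" | awk '{ print $1 "|" $2 "|" NF }')

# Semi-colon separator: the whole name is field 1 and there are six fields.
semi_split=$(printf '%s\n' "$line" | awk -F';' '{ print $1 "|" $2 "|" NF }')

printf '%s\n%s\n' "$default_split" "$semi_split"
```

With the default separator the line falls apart into four white-space-delimited pieces; with the semi-colon it yields the six fields we actually want.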

The basic format of an awk program, or awk script as it is sometimes called, is a pattern followed by a particular action. Like sed, each line of the input is checked by awk to see if it matches that particular pattern. Both sed and awk do well when comparing string values. However, whereas checking numeric values is difficult with sed, this functionality is an integral part of awk.

If we wanted, we could use the data previously listed and output only the names and cities of those people under 30. First, we need an awk script, called awk.scr, that looks like this:

BEGIN { FS=";" }
$6 < 30 { print $1, $3 }

Next, assume that we have a data file containing the seven lines of data above, called awk.data. We could process the data file in one of two ways. First:

awk -f awk.scr awk.data

The -f option tells awk that it should read its instructions from the file that follows, in this case awk.scr. At the end, we have the file from which awk needs to read its data.
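For readers who want to try this, here is a self-contained sketch that creates both files in a temporary directory and runs the command. It assumes the script sets the field separator in a BEGIN block, and that the Giberson record carries the zip code 96221 that appears in the output later in this section.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > awk.data <<'EOF'
Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24
EOF

# The script: set the separator before any input is read,
# then print name and city for everyone under 30.
cat > awk.scr <<'EOF'
BEGIN { FS = ";" }
$6 < 30 { print $1, $3 }
EOF

under_30=$(awk -f awk.scr awk.data)
printf '%s\n' "$under_30"
```

Only the records for Giberson and Shaffer have an age below 30, so only their names and cities should be printed.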

Alternatively, because awk can also read stdin, we could start it like this:

cat awk.data | awk -f awk.scr

We can even make string comparisons, as in

$4 == "California" { print $1, $3 }

Although it may make little sense, we could make string comparisons on what would normally be numeric values, as in

$6 == "33" { print $1, $3 }

This prints out fields 1 and 3 from only those lines in which the sixth field equals the string 33.
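Both string comparisons can be tried directly on the command line. The sketch below assumes the seven sample records (with the zip code 96221 filled in for Giberson) and passes the semi-colon separator with -F.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > awk.data <<'EOF'
Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24
EOF

# String comparison on the state field.
californians=$(awk -F';' '$4 == "California" { print $1, $3 }' awk.data)

# String comparison on a field that holds a number.
age_33=$(awk -F';' '$6 == "33" { print $1, $3 }' awk.data)

printf '%s\n---\n%s\n' "$californians" "$age_33"
```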

Not to be outdone by sed, awk will also allow you to use regular expressions in your search criteria. A very simple example is one where we want to print every line containing the characters "on". (Note: The characters must be adjacent and in the appropriate case.) This line would look like this:

/on/ {print $0}

However, the regular expressions that awk uses can be as complicated as those used in sed. One example would be

/[^s]on[^;]/ {print $0}

This says to print every line containing the pattern on, but only if it is preceded by a character other than s and followed by a character other than a semi-colon. (Inside square brackets, a leading ^ negates the list.) The trailing [^;] eliminates the names that end in "on" at the end of a field (such as Boston and Beaverton), and the leading [^s] eliminates all the names ending in "son." When we run awk with this line, our output is

Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26

But doesn't the name "Giberson" end in "son"? Shouldn't it be ignored along with the others? Well, yes. However, that's not the case. The reason this line was printed out was because of the "on" in Ben Lomond, the city in which Giberson resides.
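A quick way to see which records each expression selects is to print only the name field. This is a sketch under the same assumptions as before: the seven sample records, with 96221 as Giberson's zip code, and -F';' so that $1 is the full name.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > awk.data <<'EOF'
Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24
EOF

# Every record that contains "on" anywhere in the line.
any_on=$(awk -F';' '/on/ { print $1 }' awk.data)

# "on" preceded by something other than s and followed by something other than ;
strict_on=$(awk -F';' '/[^s]on[^;]/ { print $1 }' awk.data)

printf '%s\n---\n%s\n' "$any_on" "$strict_on"
```

The simple pattern matches every record except the first one; the stricter pattern leaves only the Giberson record, because of the "on" inside Lomond.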

We can also use addresses as part of the search criteria. For example, assume that we need to print out only those lines in which the first field (i.e., the person's last name) is in the first half of the alphabet. Because this list is sorted, we could look for all the lines between those starting with "A" and those starting with "M." Therefore, we could use a line like this:

/^A/,/^M/ {print $0}

When we run it, we get

 

What happened? There are certainly several names in the first half of the alphabet. Why didn't this print anything? Well, it printed exactly what we told it to print. Like the addresses in both vi and sed, awk searches for a line that matches the criteria we specified. So, what we really said was "Find the first line that starts with an A and then print all the lines up to and including the next one starting with an M." Because there was no line starting with an "A," the start address never matched and nothing was printed. Instead, the code to get what we really want would look like this:

/^[A-M]/ {print $0}

This says to print all the lines whose first character is in the range A-M. Because this checks every line and isn't looking for starting and ending addresses, we could have even used an unsorted file and would have gotten all the lines we wanted. The output then looks like this:

Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42

If we wanted to use a starting and ending address, we would have to specify the starting letter of the name that actually existed in our file. For example:

/^B/,/^H/ {print $0}
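The three variants can be compared side by side. The following sketch again assumes the seven sample records (zip code 96221 for Giberson) and prints only the name field to keep the output short.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > awk.data <<'EOF'
Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24
EOF

# Range whose start never matches: nothing is printed.
range_a_m=$(awk -F';' '/^A/,/^M/ { print $1 }' awk.data)

# Character class: tests every line on its own, sorted or not.
class_a_m=$(awk -F';' '/^[A-M]/ { print $1 }' awk.data)

# Range with start and end lines that actually exist in the file.
range_b_h=$(awk -F';' '/^B/,/^H/ { print $1 }' awk.data)

printf 'A..M range: [%s]\n' "$range_a_m"
printf '%s\n---\n%s\n' "$class_a_m" "$range_b_h"
```

On this sorted file the character class and the B-to-H range select the same four records, while the A-to-M range selects none.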

Because printing is a very useful aspect of awk, it's nice to know that there are actually two ways of printing with awk. The first we just mentioned. However, if you use printf instead of print, you can specify the format of the output in greater detail. If you are familiar with the C programming language, you already have a head start, as the format of printf is essentially the same as in C. However, there are a couple of differences that you will see immediately if you are a C programmer.

For example, if we wanted to print both the name and age with this line

$6 > 30 { printf "%20s %5d\n", $1, $6 }

the output would look like this:

        Blinn, David    33
    Dickson, Tillman    34
      Holder, Wyliam    42
   Nathanson, Robert    33
      Richards, John    36

Each name is printed right-justified in a field 20 characters wide, followed by a single space and a field five characters wide for the age.
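To see the padding for yourself, the sketch below runs the printf line against the sample records and, for comparison, a variant with %-20s, which left-justifies the name instead. The left-justified variant is an addition for illustration and is not part of the original example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > awk.data <<'EOF'
Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24
EOF

# %20s pads the name on the left (right-justified) to 20 characters.
right=$(awk -F';' '$6 > 30 { printf "%20s %5d\n", $1, $6 }' awk.data)

# %-20s pads the name on the right (left-justified) instead.
left=$(awk -F';' '$6 > 30 { printf "%-20s %5d\n", $1, $6 }' awk.data)

printf '%s\n\n%s\n' "$right" "$left"
```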

Because awk reads each line as a single record and blocks of text in each record as fields, it needs to keep track of how many records there are and how many fields. These are denoted by the NR and NF variables, respectively.

Another way of using awk is at the end of a pipe. For example, you may have multiple-line output from one command or another but only want one or two fields from that line. To be more specific, you may only want the permissions and file names from an ls -l output. You would then pipe it through awk, like this

ls -l | awk '{ print $1" "$9 }'

and the output might look something like this:

-rw-r--r-- mike.letter
-rw-r--r-- pat.note
-rw-r--r-- steve.note
-rw-r--r-- zoli.letter

This brings up the concept of variables. Like other languages, awk enables you to define variables. A couple are already predefined and come in handy. For example, what if we didn't know off the tops of our heads that there were nine fields in the ls -l output? Because we know that we wanted the first and the last field, we can use the variable that specifies the number of fields. The line would then look like this:

ls -l | awk '{ print $1" "$NF }'

In this example, the space enclosed in quotes is necessary; otherwise, awk would print $1 and $NF right next to each other.
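The effect of the quoted space can be shown on a fixed sample rather than live ls -l output, which differs from system to system. The two listing lines below are hypothetical (the owner, group, sizes and dates are made up); the comma form is the alternative suggested in the user comments on this page.

```shell
# Hypothetical ls -l style lines with nine white-space-separated fields each.
listing='-rw-r--r-- 1 jimmo users 1204 Aug 17 2004 mike.letter
-rw-r--r-- 1 jimmo users  310 Aug 17 2004 pat.note'

# Quoted space between the two fields.
with_space=$(printf '%s\n' "$listing" | awk '{ print $1 " " $NF }')

# A comma makes print insert the output field separator (a space by default).
with_comma=$(printf '%s\n' "$listing" | awk '{ print $1, $NF }')

# Nothing between the fields: they are simply concatenated.
glued=$(printf '%s\n' "$listing" | awk '{ print $1 $NF }')

printf '%s\n---\n%s\n---\n%s\n' "$with_space" "$with_comma" "$glued"
```

Note that $NF is only the file name as long as the name itself contains no white space.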

Another variable that awk uses to keep track of the number of records read so far is NR. This can be useful, for example, if you only want to see a particular part of the text. Remember our example at the beginning of this section where we wanted to see lines 5-10 of a file (to look for an address in the header)? In the last section, I showed you how to do it with sed, and now I'll show you with awk.

We can use the fact that the NR variable keeps track of the number of records, and because each line is a record, the NR variable also keeps track of the number of lines. So, we'll tell awk that we want to print out each line between 5 and 10, like this:

cat datafile | awk 'NR >= 5 && NR <= 10'

This brings up four new issues. The first is the NR variable itself. The second is the use of the double ampersand (&&). As in C, this means a logical AND: both the left and the right sides of the expression must be true for the entire expression to be true. In this example, if we read a line and the value of NR is greater than or equal to 5 (i.e., we have read in at least five lines) and the number of lines read is not more than 10, the expression meets the logical AND criteria. The third issue is that there is no print statement. The default action of awk, when a pattern has no action attached to it, is to print out each line that matches the pattern. (You can find a list of other built-in variables in the table below.)

The last issue is the way the variable NR is written. Note that here, there is no dollar sign ($) in front of the variable because we are looking for the value of NR itself, not the field with that number. We do not need to prefix a variable with $ unless we want the field it refers to. Confused? Let's look at another example.

Let's say we wanted to print out only the lines where there were more than nine fields. We could do it like this:

cat datafile | awk 'NF > 9'

Compare this to

cat datafile | awk '{ print $NF }'

which prints out the last field in every line.
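The difference between a bare variable and a field reference can be sketched with a few lines of generated input. The input here is made up for the example; any text would do.

```shell
# Twelve numbered lines; the pattern alone selects lines 5 through 10,
# and the default action prints them.
lines_5_to_10=$(printf 'line %d\n' 1 2 3 4 5 6 7 8 9 10 11 12 |
    awk 'NR >= 5 && NR <= 10')

# NF is the number of fields; $NF is the contents of the last field.
nf_vs_last=$(printf 'a b c\nd e\n' | awk '{ print NF, $NF }')

# Used as a pattern, NF selects records by their field count.
more_than_two=$(printf 'a b c\nd e\n' | awk 'NF > 2')

printf '%s\n---\n%s\n---\n%s\n' "$lines_5_to_10" "$nf_vs_last" "$more_than_two"
```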

Up to now, we've been talking about one-line awk commands. These have all performed a single action on each line. However, awk has the ability to do multiple tasks on each line as well as a task before it begins reading and after it has finished reading.

We use the BEGIN and END pair as markers. These are treated like any other pattern. Therefore, anything appearing after the BEGIN pattern is done before the first line is read. Anything after the END pattern is done after the last line is read. Let's look at this script:

BEGIN { FS=";"}
{printf"%s\n", $1}
{printf"%s\n", $2}
{printf"%s, %s\n",$3,$4}
{printf"%s\n", $5}
END {print "Total Names:" NR}

Following the BEGIN pattern is a definition of the field separator. This is therefore done before the first line is read. Each line is processed four times, where we print a different set of fields each time. When we finish, our output looks like this:

Blinn, David
42 Clarke Street
Sunnyvale, California
95123
Dickson, Tillman
8250 Darryl Lane
San Jose, California
95032
Giberson, Suzanne
102 Truck Stop Road
Ben Lomond, California
96221
Holder, Wyliam
1932 Nuldev Street
Mount Hermon, California
95431
Nathanson, Robert
12 Peabody Lane
Beaverton, Oregon
97532
Richards, John
1232 Bromide Drive
Boston, Massachusetts
02134
Shaffer, Shannon
98 Whatever Way
Watsonville, California
95332
Total Names:7
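A shortened version of the same idea can be run directly from the command line. This sketch uses only two of the sample records and two of the four print actions, so the flow from BEGIN through the per-line actions to END is easy to follow.

```shell
records='Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33'

# BEGIN runs once before any input, each unlabeled action runs for every
# record in the order written, and END runs once after the last record.
report=$(printf '%s\n' "$records" | awk '
BEGIN { FS = ";" }
{ printf "%s\n", $1 }
{ printf "%s, %s\n", $3, $4 }
END { print "Total Names:" NR }')

printf '%s\n' "$report"
```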

Aside from having a pre-defined set of variables to use, awk allows us to define variables ourselves. If in the last awk script we had wanted to print out, let's say, the average age, we could add a line in the middle of the script that looked like this:

{total = total + $6 }

Because $6 denotes the age of each person, every time we run through the loop, it is added to the variable total. Unlike other languages, such as C, we don't have to initialize the variables; awk will do that for us. Strings are initialized to the null string and numeric variables are initialized to 0.

At the end of the script, we can include another END line to print out our average, like this:

END { print "Average age: " total/NR }
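Put together as a one-liner, the running total and the average might look like the sketch below. It assumes the seven sample records with 96221 as Giberson's zip code, so that the age is always field 6; the second command, which rounds with printf, is an addition for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > awk.data <<'EOF'
Blinn, David;42 Clarke Street;Sunnyvale;California;95123;33
Dickson, Tillman;8250 Darryl Lane;San Jose;California;95032;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam; 1932 Nuldev Street;Mount Hermon;California;95431;42
Nathanson, Robert;12 Peabody Lane;Beaverton;Oregon;97532;33
Richards, John;1232 Bromide Drive;Boston;Massachusetts;02134;36
Shaffer, Shannon;98 Whatever Way;Watsonville;California;95332;24
EOF

# total starts out as 0 without being initialized; the ages add up to 228.
average=$(awk -F';' '{ total = total + $6 }
    END { print "Average age: " total/NR }' awk.data)

# The same calculation, rounded to one decimal place with printf.
rounded=$(awk -F';' '{ total += $6 }
    END { printf "Average age: %.1f\n", total/NR }' awk.data)

printf '%s\n%s\n' "$average" "$rounded"
```

The unrounded value is printed with awk's default number-to-string format of six significant digits.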

Table awk Comparison Operators

Operator    Meaning
<           less than
<=          less than or equal to
==          equal to
!=          not equal to
>=          greater than or equal to
>           greater than

Table Default Values of awk Built-in Variables

Variable    Meaning                                   Default
ARGC        number of command-line arguments          -
ARGV        array of command-line arguments           -
FILENAME    name of current input file                -
FNR         record number in current file             -
FS          input field separator                     space or tab
NF          number of fields in the current record    -
NR          number of records read                    -
OFMT        numeric output format                     %.6g
OFS         output field separator                    space
ORS         output record separator                   new line
RS          input record separator                    new line
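A few of these variables can be seen at work in one command. In this sketch the two input files and their contents are made up for the example; it sets FS and OFS in a BEGIN block and prints FILENAME, FNR and NR for every record.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small hypothetical input files with semi-colon separated fields.
printf 'a;b\nc;d\n' > one.txt
printf 'e;f\n' > two.txt

# FNR restarts at 1 for each file, while NR keeps counting across files.
# OFS is what print puts between comma-separated items.
report=$(awk 'BEGIN { FS = ";"; OFS = " | " }
    { print FILENAME, FNR, NR, $2, $1 }' one.txt two.txt)

printf '%s\n' "$report"
```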

Is that all there is to it? No. In fact, we haven't even scratched the surface. awk is a very complex programming language and there are dozens more issues that we could have addressed. Built into the language are mathematical functions, if statements and while loops, the ability to create your own functions, string and array manipulation, and much more.

Unfortunately, this is not a book on UNIX programming languages. Some readers may be disappointed that I do not have the space to cover awk in more detail. I am also disappointed. However, I have given you a basic introduction to the constructs of the language to enable you to better understand the more than 100 scripts on your system that use awk in some way.

User Comments:


Posted by hrosen on August 17, 2004 08:37pm:

Hello Jim. I've been absent from here due to defective RAM that took my computer quite down. Now, my comments. I believe that awk requires single quotes around the curly braces, as in $ ls -l | awk '{ print $1 " " $NF }' and, additionally, placing a comma between $1 and $NF will also produce a space between the printed fields.


Posted by jimmo on October 23, 2004 04:07pm:

Interesting that you are the first to notice. This has been online for three years. The single-quotes seem to have been lost when I translated the files from MS-Word to HTML. It's corrected now.


Copyright 2002-2009 by James Mohr. Licensed under modified GNU Free Documentation License (Portions of this material originally published by Prentice Hall, Pearson Education, Inc). See here for details. All rights reserved.
  



