Regular Expressions and Metacharacters

Often, the arguments that you pass to commands are file names. For example, if you wanted to edit a file called letter, you could enter the command vi letter. In many cases, typing the entire name is not necessary. Built into the shell are special characters that it will use to expand the name. These are called metacharacters.

The most common metacharacter is *. The * is used to represent any number of characters, including zero. For example, if we have a file in our current directory called letter and we input

vi let*

the shell would expand this to

vi letter

Or, if we had a file simply called let, this would match as well.

Instead, what if we had several files called letter.chris, letter.daniel, and letter.david? The shell would expand vi let* to match them all and give us the command

vi letter.chris letter.daniel letter.david

We could also type in vi letter.da*, which would be expanded to

vi letter.daniel letter.david

If we only wanted to edit the letter to chris, we could type it in as vi *chris. However, if there were two files, letter.chris and note.chris, the command vi *chris would have the same results as if we typed in:

vi letter.chris note.chris

In other words, no matter where the asterisk appears, the shell expands it to match every name it finds. If my current directory contained files with matching names, the shell would expand them properly. However, if there were no matching names, file name expansion couldn't take place and the file name would be taken literally.

For example, if there were no file names in our current directory that began with letter, the command

vi letter*

could not be expanded and we would end up editing a new file called (literally) letter*, including the asterisk. This would not be what we wanted.
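The behavior described so far can be checked without opening an editor. The following is a minimal sketch, assuming a Bourne-style shell (sh or bash) with its default options; the scratch directory and the name memo* are made up for the illustration. It uses set -- to capture what the shell hands over after expansion.

```shell
# Scratch directory with the example files from the text.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch letter.chris letter.daniel letter.david note.chris

# "set --" stores the expanded words so that we can inspect them.
set -- letter.da*
da_matches="$*"

set -- *chris
chris_matches="$*"

# Nothing starts with "memo", so the pattern is passed on unchanged.
set -- memo*
no_match="$*"

echo "letter.da* -> $da_matches"
echo "*chris     -> $chris_matches"
echo "memo*      -> $no_match"
```

The last line shows the literal memo*, asterisk included, which is exactly what a command such as vi would receive.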

What if we had a subdirectory called letters? If it contained the three files letter.chris, letter.daniel, and letter.david, we could get to them by typing

vi letters/letter*

This would expand to be:

vi letters/letter.chris letters/letter.daniel letters/letter.david

The same rules for path names with commands also apply to file names. The command

vi letters/letter*

is the same as

vi ./letters/letter*

which is the same as (assuming our current directory is /home/jimmo)

vi /home/jimmo/letters/letter*

This is because the shell does the expansion before the arguments are passed to the command. Therefore, even directories are expanded. And the command

vi le*/letter.*

could be expanded to include both letters/letter.chris and lease/letter.joe, or any similar combination.
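Expansion of the directory part of a path can be tried out the same way. This is only a sketch, assuming a Bourne-style shell with default options; the scratch directory is made up, while the directory and file names follow the ones mentioned in the text.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
mkdir letters lease
touch letters/letter.chris letters/letter.daniel lease/letter.joe

# The wildcard in the directory part is expanded just like the one in the file part.
set -- le*/letter.*
dir_matches="$*"
echo "$dir_matches"
```

Both directories contribute names, because le* matches lease as well as letters.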

The next wildcard is ?. This is expanded by the shell as one, and only one, character. For example, the command vi letter.chri? is the same as vi letter.chris. However, if we were to type in vi letter.chris? (note that the "?" comes after the "s" in chris), the result would be that we would begin editing a new file called (literally) letter.chris?. Again, not what we wanted. This wildcard could be used if, for example, there were two files named letter.chris1 and letter.chris2. The command vi letter.chris? would be the same as

vi letter.chris1 letter.chris2

Another commonly used metacharacter is actually a pair of characters: [ ]. The square brackets are used to represent a list of possible characters. For example, if we were not sure whether our file was called letter.chris or letter.Chris, we could type in the command as: vi letter.[Cc]hris. So, no matter if the file was called letter.chris or letter.Chris, we would find it. What happens if both files exist? Just as with the other metacharacters, both are expanded and passed to vi. Note that in this example, vi letter.[Cc]hris appears to be the same as vi letter.?hris, but it is not always so.

The list that appears inside the square brackets does not have to be an upper- and lowercase combination of the same letter. The list can be made up of any letter, number, or even punctuation. (Note that some punctuation marks have special meaning, such as *, ?, and [ ], which we will cover shortly.) For example, if we had five files, letter.chris1-letter.chris5, we could edit all of them with vi letter.chris[12435].

A nice thing about this list is that if it is consecutive, we don't need to list all possibilities. Instead, we can use a dash (-) inside the brackets to indicate that we mean a range. So, the command

vi letter.chris[12345]

could be shortened to

vi letter.chris[1-5]

What if we only wanted the first three and the last one? No problem. We could specify it as

vi letter.chris[1-35]

This does not mean that we want files letter.chris1 through letter.chris35! Rather, we want letter.chris1, letter.chris2, letter.chris3, and letter.chris5. All entries in the list are seen as individual characters.

Inside the brackets, we are not limited to just numbers or just letters. We can use both. The command vi letter.chris[abc123] has the potential for editing six files: letter.chrisa, letter.chrisb, letter.chrisc, letter.chris1, letter.chris2, and letter.chris3.
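To see that ? stands for exactly one character and that a bracket list is read one character at a time, we can try something like the following. It is a sketch only, assuming a Bourne-style shell with default options and a made-up scratch directory; the file letter.chris35 is added on purpose to show that it is not matched.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
touch letter.chris letter.chris1 letter.chris2 letter.chris3 \
      letter.chris4 letter.chris5 letter.chris35

# Exactly one character must follow "chris".
set -- letter.chris?
one_char="$*"

# The range 1-3 plus the single character 5, not "1 through 35".
set -- letter.chris[1-35]
range_list="$*"

echo "letter.chris?      -> $one_char"
echo "letter.chris[1-35] -> $range_list"
```

Neither pattern picks up letter.chris (nothing after the s) or letter.chris35 (two characters after the s).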

If we are so inclined, we can mix and match any of these metacharacters any way we want. We can even use them multiple times in the same command. Let's take as an example the command

vi *.?hris[a-e1-5]

Should they exist in our current directory, this command would match all of the following:

letter.chrisa note.chrisa letter.chrisb note.chrisb letter.chrisc
note.chrisc letter.chrisd note.chrisd letter.chrise note.chrise
letter.chris1 note.chris1 letter.chris2 note.chris2 letter.chris3
note.chris3 letter.chris4 note.chris4 letter.chris5 note.chris5
letter.Chrisa note.Chrisa letter.Chrisb note.Chrisb letter.Chrisc
note.Chrisc letter.Chrisd note.Chrisd letter.Chrise note.Chrise
letter.Chris1 note.Chris1 letter.Chris2 note.Chris2 letter.Chris3
note.Chris3 letter.Chris4 note.Chris4 letter.Chris5 note.Chris5

The part before the dot does not have to be letter or note; any name with the same ending would match as well. Or, if we issued a command such as

vi *.d*

these would match

letter.daniel note.daniel letter.david note.david

Remember, I said that the shell expands the metacharacters only with respect to the name specified. This obviously works for file names as I described above. However, it works for command names as well.

If we were to type dat* and there was nothing in our current directory that started with dat, we would get a message like

dat*: not found

However, if we were to type /bin/dat*, the shell could successfully expand this to be /bin/date, which it would then execute. The same applies to relative paths. If we were in / and entered ./bin/dat* or bin/dat*, both would be expanded properly and the right command would be executed. If we entered the command /bin/dat[abcdef], we would get the right response as well because the shell tries all six letters listed and finds a match with /bin/date.

An important thing to note is that the shell expands as long as it can before it attempts to interpret a command. I was reminded of this fact by accident when I input /bin/l*. If you do an

ls -l /bin/l*

you should get the output:

-rwxr-xr-x 1 root root 22340 Sep 20 06:24 /bin/ln
-r-xr-xr-x 1 root root 25020 Sep 20 06:17 /bin/login
-rwxr-xr-x 1 root root 47584 Sep 20 06:24 /bin/ls

At first, I expected each one of the files in /bin that began with an "l" (ell) to be executed. Then I remembered that expansion takes place before the command is interpreted. Therefore, the command that I input, /bin/l*, was expanded to be

/bin/ln /bin/login /bin/ls

Because /bin/ln was the first command in the list, the system expected that I wanted to link the two files together (what /bin/ln is used for). I ended up with the error message:

/bin/ln: /bin/ls: File exists

This is because the system thought I was trying to link the file /bin/login to /bin/ls, which already existed. Hence the message.

The same thing happens when I input /bin/l? because the /bin/ln is expanded first. If I issue the command /bin/l[abcd], I get the message that there is no such file. If I type in

/bin/l[a-n]

I get:

/bin/ln: missing file argument

because the /bin/ln command expects two file names as arguments and the only thing that matched is /bin/ln.
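We can look at what the shell would try to run without actually running anything. The sketch below assumes a Bourne-style shell with default options; it uses a made-up scratch directory with empty stand-in files named after the ones in /bin, so nothing is executed or linked.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
mkdir bin
touch bin/ln bin/login bin/ls      # empty stand-ins, never executed

# After expansion, the first word is what the shell treats as the command,
# and the remaining words become its arguments.
set -- bin/l*
command_word=$1
shift
argument_words="$*"
echo "command:   $command_word"
echo "arguments: $argument_words"

# With ?, only the two-letter names match, and ln still comes first.
set -- bin/l?
two_letter="$*"
echo "bin/l?     -> $two_letter"
```

In both cases ln ends up as the first word, which is why the real /bin/ln was the program that ran.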

I first learned about this aspect of shell expansion after a couple of hours of trying to extract a specific subdirectory from a tape that I had made with the cpio command. Because I made the tape using absolute paths, I attempted to restore the files as /home/jimmo/data/*. Rather than restoring the entire directory as I expected, it did nothing. It worked its way through the tape until it got to the end and then rewound itself without extracting any files.

At first I assumed I made a typing error, so I started all over. The next time, I checked the command before I sent it on its way. After half an hour or so of whirring, the tape was back at the beginning. Still no files. Then it dawned on me that I hadn't told cpio to overwrite existing files unconditionally. So I started it all over again.

Now, those of you who know cpio realize that this wasn't the issue either. At least not entirely. When the tape got to the right spot, it started overwriting everything in the directory (as I told it to). However, the files that were missing (the ones that I really wanted to get back) were still not copied from the backup tape.

The next time, I decided to just get a listing of all the files on the tape. Maybe the files I wanted were not on this tape. After a while it reached the right directory and lo and behold, there were the files that I wanted. I could see them on the tape, I just couldn't extract them.

Well, the first idea that popped into my mind was to restore everything. That's sort of like fixing a flat tire by buying a new car. Then I thought about restoring the entire tape into a temporary directory where I could then get the files I wanted. Even if I had the space, this still seemed like the wrong way of doing things.

Then it hit me. I was going about it the wrong way. The solution was to go ask someone what I was doing wrong. I asked one of the more senior engineers (I had only been there less than a year at the time). When I mentioned that I was using wildcards, it was immediately obvious what I was doing wrong (obvious to him, not to me).

Let's think about it for a minute. It is the shell that does the expansion, not the command itself (like when I ran /bin/l*). The shell interprets the command as starting with /bin/l. Therefore, I get a listing of all the files in /bin that start with "l". With cpio, the situation is similar.

When I first ran it, the shell interpreted the files (/home/jimmo/data/*) before passing them to cpio. Because I hadn't told cpio to overwrite the files, it did nothing. When I told cpio to overwrite the files, it only did so for the files that it was told to. That is, only the files that the shell saw when it expanded /home/jimmo/data/*. In other words, cpio did what it was told. I just told it to do something that I hadn't expected.

The solution is to find a way to pass the wildcards to cpio. That is, the shell must ignore the special significance of the asterisk. Fortunately, there is a way to do this. By placing a back-slash (\) before the metacharacter, you remove its special significance. This is referred to as "escaping" that character.

So, in my situation with cpio, when I referred to the files I wanted as /home/jimmo/data/\*, the shell passed the arguments to cpio as /home/jimmo/data/*. It was then cpio that expanded the * to mean all the files in that directory. Once I did that, I got the files I wanted.

You can also protect the metacharacters from being expanded by enclosing the entire expression in single quotes. This works because it is the shell that first expands wildcards before passing them to the program. Note also that if the wildcard cannot be expanded, the entire expression (including the metacharacters) is passed as an argument to the program. Some programs are capable of expanding the metacharacters themselves.
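The difference between an expanded and a protected wildcard can be made visible like this. It is a minimal sketch, assuming a Bourne-style shell with default options; the scratch directory and the file names a.txt and b.txt are made up for the illustration.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
mkdir data
touch data/a.txt data/b.txt

# Unprotected: the shell replaces the pattern with the files it can see.
set -- data/*
expanded="$*"

# Back-slash: the asterisk loses its special meaning and is passed on as is.
set -- data/\*
escaped="$*"

# Single quotes have the same effect for the whole expression.
set -- 'data/*'
quoted="$*"

echo "data/*   -> $expanded"
echo "data/\\*  -> $escaped"
echo "'data/*' -> $quoted"
```

In the second and third cases a program like cpio receives the pattern itself and can apply it to whatever it finds, for example on a tape.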

As in other places, the exclamation mark (!) has a special meaning. (That is, it is also a metacharacter.) When creating a regular expression, the exclamation mark is used to negate a set of characters. For example, if we wanted to list all files that did not have a number at the end, we could do something like this

ls *[!0-9]

This is certainly faster than typing this

ls *[a-zA-Z]

However, this second example does not mean the same thing. In the first case, we are saying we do not want numbers. In the second case, we are saying we only want letters. There is a key difference because in the second case we do not include the punctuation marks and other symbols.
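The difference between "not a number" and "only a letter" shows up as soon as a name ends in punctuation. Here is a small sketch, assuming a Bourne-style shell with default options; the scratch directory and the file names are made up for the illustration.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
touch report1 report2 notes draft_

# Last character is anything except a digit.
set -- *[!0-9]
not_digit="$*"

# Last character must be a letter, so draft_ is left out.
set -- *[a-zA-Z]
letters_only="$*"

echo "*[!0-9]    -> $not_digit"
echo "*[a-zA-Z]  -> $letters_only"
```

The file draft_ appears only in the first result, because an underscore is neither a digit nor a letter.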

Another symbol with special meaning is the dollar sign ($). This is used as a marker to indicate that something is a variable. I mentioned earlier in this section that you could get access to your login name environment variable by typing:

echo $LOGNAME

The system stores your login name in the environment variable LOGNAME (note no "$"). The system needs some way of knowing that when you input this on the command line, you are talking about the variable LOGNAME and not the literal string LOGNAME. This is done with the "$". Several variables are set by the system. You can also set variables yourself and use them later on. I'll get into more detail about shell variables later.

So far, we have been talking about metacharacters used for searching the names of files. However, metacharacters can often be used in the arguments to certain commands. One example is the grep command, which is used to search for strings within files. The name grep comes from Global Regular Expression Print (or Parser). As its name implies, it has something to do with regular expressions. Let's assume we have a text file called documents, and we wish to see if the string "letter" exists in that text. The command might be

grep letter documents

This will search for and print out every line containing the string "letter." This includes such things as "letterbox," "lettercarrier," and even "love-letter." However, it will not find "Letterman," because we did not tell grep to ignore upper- and lowercase (using the -i option). To do so using regular expressions, the command might look like this

grep "[Ll]etter" documents

Now, because we specified to look for either "L" or "l" followed by "etter," we get both "letter" and "Letterman." We can also specify that we want to look for this string only when it appears at the beginning of a line using the caret (^) symbol. For example

grep "^[Ll]etter" documents

This searches for all strings that start with the "beginning-of-line," followed by either "L" or "l," followed by "etter." Or, if we want to search for the same string at the end of the line, we would use the dollar sign to indicate the end of the line. Note that at the beginning of a string, the dollar sign marks a variable, whereas at the end of a string, it indicates the end of the line. Confused? Let's look at an example. Let's define a string like this:

VAR="^[Ll]etter"

If we echo that string, we simply get ^[Ll]etter. Note that this includes the caret at the beginning of the string. When we do a search like this

grep "$VAR" documents

it is equivalent to

grep "^[Ll]etter" documents

Now, if we write the same command like this

grep "$VAR$" documents

the shell replaces $VAR with the string defined by the VAR variable (^[Ll]etter) and leaves the trailing dollar sign alone, so grep sees ^[Ll]etter$. Here we have an example where the dollar sign has both meanings: the first one marks a variable for the shell, and the second one marks the end of the line for grep.

Because VAR already begins with a caret, this says to find the string, but only if it takes up the entire line. In other words, the line consists only of the beginning of the line (^), either "letter" or "Letter", and the end of the line ($).
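The grep searches discussed here can be tried on a small test file. What follows is a sketch under a few assumptions: the contents of documents are made up, the variable VAR is set to ^[Ll]etter as in the text, and grep -c is used to count matching lines instead of printing them.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'letter' 'Letterman called' 'a love-letter' 'the letterbox' > documents

plain=$(grep -c 'letter' documents)          # lower-case "letter" anywhere in the line
either=$(grep -c '[Ll]etter' documents)      # "letter" or "Letter" anywhere in the line
at_start=$(grep -c '^[Ll]etter' documents)   # only at the beginning of the line
at_end=$(grep -c '[Ll]etter$' documents)     # only at the end of the line

# The shell expands $VAR; the trailing $ is left for grep as end-of-line.
VAR='^[Ll]etter'
whole_line=$(grep -c "$VAR$" documents)

echo "$plain $either $at_start $at_end $whole_line"
```

Only the first line of the file consists of nothing but the word itself, so the last count is the smallest.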

Here I want to side step a little. When you look at the variable $VAR$ it might be confusing to some people. Further, if you were to combine this variable with other characters, you may end up with something you do not expect because the shell decides to include them as part of the variable name. To prevent this, it is a good idea to include the variable name within curly-braces, like this:

${VAR}$

The curly-braces tell the shell what exactly belongs to the variable name. I try to always include the variable name within curly-braces to ensure that there is no confusion. Also, you need to use the curly-braces when the variable is followed directly by characters that could be part of a name. For example, ${VAR}s is the contents of VAR followed by an "s", whereas $VARs refers to a different variable called VARs.
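The effect of the curly-braces is easy to see with two variables whose names differ by one letter. This is a sketch only; the variable VARs and both values are hypothetical and exist just for the illustration.

```shell
VAR=letter
VARs=unrelated        # hypothetical second variable whose name starts with VAR

plain="$VARs"         # the shell reads the whole name VARs
braced="${VAR}s"      # the braces end the name after VAR, then a literal "s" follows

echo "\$VARs   -> $plain"
echo "\${VAR}s -> $braced"
```

Without the braces there is no way for the shell to know that the "s" was meant as ordinary text.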


Often you need to match a series of repeated characters, such as spaces, dashes and so forth. Although you could simply use the asterisk to specify any number of that particular character, you can run into problems on both ends. First, maybe you want to match a minimum number of that character. This could easily be solved by first repeating that character a certain number of times before you use the wildcard. For example, the expression

====*

would match at least three equal signs. Why three? Well, we have explicitly put in three equal signs and the wildcard follows the fourth. Since the asterisk can be zero or more, it could mean zero and therefore the expression would only match three.

The next problem occurs when we want to limit the maximum number of characters that are matched. If you know exactly how many to match, you could simply use that many characters. What do you do if you have a minimum and a maximum? For this, you enclose the range with curly-braces: {min,max}. For example, to specify at least 5 and at most 10, it would look like this: {5,10}. Keep in mind that in the basic regular expressions that grep uses by default, the curly braces only have this special meaning when each one is preceded by a back-slash. So, let's say we wanted to search a file for all number combinations between 5 and 10 digits long. We might have something like this:

grep "[0-9]\{5,10\}" FILENAME

This might seem a little complicated, but it would be far more complicated to write a regular expression that searches for each combination individually.

As we mentioned above, to define a specific number of a particular character you could simply input that character the desired number of times. However, try counting 17 periods on a line or 17 lower-case letters ([a-z]). Imagine trying to type in this combination 17 times! You could specify a range with a maximum of 17 and a minimum of 17, like this: {17,17}. Although this would work, you could save yourself a little typing by simply including just the single value. Therefore, to match exactly 17 lower-case letters, you might have something like this:

grep "[a-z]\{17\}" FILENAME

If we want to specify a minimum number of times, without a maximum, we simply leave off the maximum, like this:

grep "[a-z]\{17,\}" FILENAME

This would match a pattern of at least 17 lower-case letters.
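Repetition counts are worth trying on a small file, because an unanchored pattern also matches inside a longer run. The sketch below makes some assumptions: the file name sample and its contents are made up, grep -c counts matching lines, and the -x option (match the whole line) is used to show the difference.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' '==' '===' '=====' '1234' '12345' '1234567890123' > sample

# Three explicit equal signs, then zero or more further ones.
min_three=$(grep -c '====*' sample)

# 5 to 10 digits somewhere in the line. The 13-digit line matches too,
# because it contains a run of 5 to 10 digits.
five_to_ten=$(grep -c '[0-9]\{5,10\}' sample)

# With -x the whole line must match, so only the 5-digit line is left.
five_to_ten_line=$(grep -cx '[0-9]\{5,10\}' sample)

echo "$min_three $five_to_ten $five_to_ten_line"
```

So a maximum only limits what is matched when something else, such as -x or the ^ and $ anchors, fixes where the match has to begin and end.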

Another problem occurs when you are trying to parse data that is not in English. If you were looking for all letters in an English text, you could use something like this: [a-zA-Z]. However, this would not include German letters, like ä, ö, ü, and so forth. To do so, you would use the expressions [:lower:], [:upper:] or [:alpha:] for the lower-case letters, upper-case letters or all letters, respectively, regardless of the language. (Note this assumes that national language support (NLS) is configured on your system, which it normally is for newer Linux distributions.)

Other expressions include:

  • [:alnum:] - Alpha-numeric characters.
  • [:cntrl:] - Control characters.
  • [:digit:] - Digits.
  • [:graph:] - Graphics characters.
  • [:print:] - Printable characters.
  • [:punct:] - Punctuation.
  • [:space:] - White spaces.

One very important thing to note is that the brackets are part of the expression. Therefore, if you want to include more in a bracket expression you need to make sure you have the correct number of brackets. For example, if you wanted to match any number of alpha-numeric or punctuation characters, you might have an expression like this: [[:alnum:][:punct:]]*.
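The double brackets are easier to get right after trying them once. This is a sketch with made-up test data; it assumes grep, uses -c to count matching lines, and uses -x so that the whole line has to fit the expression.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'abc123' 'hello,world!' 'two words' > sample

# Whole line made up of alpha-numeric or punctuation characters only.
alnum_punct=$(grep -cx '[[:alnum:][:punct:]]*' sample)

# Lines containing at least one digit.
with_digit=$(grep -c '[[:digit:]]' sample)

# Lines containing at least one white space character.
with_space=$(grep -c '[[:space:]]' sample)

echo "$alnum_punct $with_digit $with_space"
```

The outer pair of brackets is the list itself; each inner [: :] pair names one class inside that list.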

Another thing to note is that in most cases, regular expressions match as much as possible. For example, let's assume I was parsing an HTML file and wanted to match the first tag on the line. You might think to try an expression like this: "<.*>". This says to match any number of characters between the angle brackets. This works if there is only one tag on the line. However, if you have more than one tag, this expression would match everything from the first opening angle bracket to the last closing angle bracket, with everything in between.
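To watch this happen, we can replace the matched text with a marker. The sketch below uses sed, which is not covered in this section, purely to make the match visible; the sample line is made up, and the second expression, which uses a negated bracket list, is one common workaround rather than the only one.

```shell
line='<b>bold</b> text'

# "<.*>" runs from the first "<" to the last ">" on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>/X/')

# "<[^>]*>" cannot run past the first ">", so only the first tag is matched.
first_tag=$(printf '%s\n' "$line" | sed 's/<[^>]*>/X/')

echo "greedy:    $greedy"
echo "first tag: $first_tag"
```

In the first result both tags and the word between them are gone; in the second only the opening tag was replaced.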

There are a number of rules that are defined for regular expressions, the understanding of which helps avoid confusion:

1. A non-special character is equivalent to that character.
2. When preceded by a backslash (\), every special character is equivalent to itself.
3. A period specifies any single character.
4. An asterisk specifies zero or more copies of the preceding character.
5. When used by itself, an asterisk specifies everything or nothing.
6. A range of characters is specified within square brackets ([ ]).
7. The beginning of the line is specified with a caret (^) and the end of the line with a dollar sign ($).
8. If included within square brackets, a caret (^) negates the set of characters.
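Several of these rules can be checked with a handful of short lines. This is only a sketch; the file sample and its contents are made up, and grep -c is used to count the lines that match each expression.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'a.c' 'abc' 'ac' 'abbc' > sample

dot=$(grep -c 'a.c' sample)         # rule 3: the period stands for any single character
escaped=$(grep -c 'a\.c' sample)    # rule 2: the escaped period matches only a real period
star=$(grep -c 'ab*c' sample)       # rule 4: zero or more copies of the preceding "b"
negated=$(grep -c 'a[^.]c' sample)  # rule 8: any single character except a period

echo "$dot $escaped $star $negated"
```

Comparing the first two counts shows how much difference a single backslash makes.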

    Copyright 2002-2009 by James Mohr. Licensed under modified GNU Free Documentation License (Portions of this material originally published by Prentice Hall, Pearson Education, Inc). See here for details. All rights reserved.
