Documentation Center

  • Trials
  • Product Updates

Tokens in Regular Expressions

Introduction

Parentheses used in a regular expression not only group elements of that expression together, but also designate any matches found for that group as tokens. You can use tokens to match other parts of the same string. One advantage of using tokens is that they remember what they matched, so you can recall and reuse matched text in the process of searching or replacing.

Each token in the expression is assigned a number, starting from 1, going from left to right. To make a reference to a token later in the expression, refer to it using a backslash followed by the token number. For example, when referencing a token generated by the third set of parentheses in the expression, use \3.

As a simple example, if you wanted to search for identical sequential letters in a string, you could capture the first letter as a token and then search for a matching character immediately afterwards. In the expression shown below, the (\S) phrase creates a token whenever regexp matches any nonwhitespace character in the string. The second part of the expression, '\1', looks for a second instance of the same character immediately following the first:

poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];

[mat,tok,ext] = regexp(poestr, '(\S)\1', 'match', ...
               'tokens', 'tokenExtents');
mat
mat = 
    'dd'    'pp'    'dd'    'pp'

The tokens returned in cell array tok are:

'd', 'p', 'd', 'p'

Starting and ending indices for each token in the input string poestr are:

11 11,  26 26,  35 35,  57 57

For another example, capture pairs of matching HTML tags (e.g., <a> and </a>) and the text between them. The expression used for this example is

expr = '<(\w+).*?>.*?</\1>';

The first part of the expression, '<(\w+)', matches an opening bracket (<) followed by one or more alphabetic, numeric, or underscore characters. The enclosing parentheses capture token characters following the opening bracket.

The second part of the expression, '.*?>.*?', matches the remainder of this HTML tag (characters up to the >), and any characters that may precede the next opening bracket.

The last part, '</\1>', matches all characters in the ending HTML tag. This tag is composed of the sequence </tag>, where tag is whatever characters were captured as a token.

hstr = '<!comment><a name="752507"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';

[mat,tok] = regexp(hstr, expr, 'match', 'tokens');
mat{:}
ans =
    <a name="752507"></a>
ans =
    <b>Default</b>
tok{:}
ans = 
    'a'
ans = 
    'b'

Multiple Tokens

Here is an example of how tokens are assigned values. Suppose that you are going to search the following text:

andy ted bob jim andrew andy ted mark

You choose to search the above text with the following search pattern:

and(y|rew)|(t)e(d)

This pattern has three parenthetical expressions that generate tokens. When you finally perform the search, the following tokens are generated for each match.

Match

Token 1

Token 2

andy

y

 

ted

t

d

andrew

rew

 

andy

y

 

ted

t

d

Only the highest level parentheses are used. For example, if the search pattern and(y|rew) finds the text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is used, token 1 is assigned the value andrew.

Unmatched Tokens

For those tokens specified in the regular expression that have no match in the string being evaluated, regexp and regexpi return an empty string ('') as the token output, and an extent that marks the position in the string where the token was expected.

The example shown here executes regexp on the path string str returned from the MATLAB® tempdir function. The regular expression expr includes six token specifiers, one for each piece of the path string. The third specifier [a-z]+ has no match in the string because this part of the path, Profiles, begins with an uppercase letter:

str = tempdir
str =
   C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\
expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
        '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];

[tok ext] = regexp(str, expr, 'tokens', 'tokenExtents');

When a token is not found in a string, MATLAB still returns a token string and token extent. The returned token string is an empty character string (''). The first number of the extent is the string index that marks where the token was expected, and the second number of the extent is equal to one less than the first.

In the case of this example, the empty token is the third specified in the expression, so the third token string returned is empty:

tok{:}
ans = 
    'C:'    'WINNT'     ''    'bpascal'    'LOCALS~1'    'Temp'

The third token extent returned in the variable ext has the starting index set to 10, which is where the nonmatching substring, Profiles, begins in the string. The ending extent index is set to one less than the starting index, or 9:

ext{:}
ans =
     1     2
     4     8
    10     9
    19    25
    27    34
    36    39

Tokens in Replacement Strings

When using tokens in a replacement string, reference them using $1, $2, etc. instead of \1, \2, etc. This example captures two tokens and reverses their order. The first, $1, is 'Norma Jean' and the second, $2, is 'Baker'. Note that regexprep returns the modified string, not a vector of starting indices.

regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')
ans =
    Baker, Norma Jean

Named Capture

If you use a lot of tokens in your expressions, it may be helpful to assign them names rather than having to keep track of which token number is assigned to which token.

When referencing a named token within the expression, use the syntax \k<name> instead of the numeric \1, \2, etc.:

poestr = ['While I nodded, nearly napping, ' ...
          'suddenly there came a tapping,'];

regexp(poestr, '(?<anychar>.)\k<anychar>', 'match')
ans = 
    'dd'    'pp'    'dd'    'pp'

Named tokens can also be useful in labeling the output from the MATLAB regular expression functions. This is especially true when you are processing numerous strings.

For example, parse different pieces of street addresses from several strings. A short name is assigned to each token in the expression string:

str1 = '134 Main Street, Boulder, CO, 14923';
str2 = '26 Walnut Road, Topeka, KA, 25384';
str3 = '847 Industrial Drive, Elizabeth, NJ, 73548';

p1 = '(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive))';
p2 = '(?<city>[A-Z][a-z]+)';
p3 = '(?<state>[A-Z]{2})';
p4 = '(?<zip>\d{5})';

expr = [p1 ', ' p2 ', ' p3 ', ' p4];

As the following results demonstrate, you can make your output easier to work with by using named tokens:

loc1 = regexp(str1, expr, 'names')
loc1 = 
     adrs: '134 Main Street'
     city: 'Boulder'
    state: 'CO'
      zip: '14923'
loc2 = regexp(str2, expr, 'names')
loc2 = 
     adrs: '26 Walnut Road'
     city: 'Topeka'
    state: 'KA'
      zip: '25384'
loc3 = regexp(str3, expr, 'names')
loc3 = 
     adrs: '847 Industrial Drive'
     city: 'Elizabeth'
    state: 'NJ'
      zip: '73548'

See Also

| |

More About

Was this topic helpful?