How to Match Strings in SystemVerilog Using Regular Expressions

Recently, I needed to filter out some instance paths from my UVM testbench hierarchy. I discovered that this can be done using regular expressions and that UVM already has a function called uvm_pkg::uvm_re_match(), which is a DPI-C function that makes use of the POSIX function regexec() to perform a string match.

The uvm_re_match() function returns 0 if the regular expression matches and 1 if it does NOT match. Note that this is the opposite of what you might expect when using the result directly as a condition.
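Because 0 means "match", it is easy to invert the test by accident. Here is a minimal sketch of a wrapper that hides this convention. The package name re_match_pkg and the function name re_matches are made up for this example and are not part of UVM:

```systemverilog
package re_match_pkg;
  import uvm_pkg::*;

  // Hypothetical helper (not part of UVM): returns 1 when `re` matches `str`.
  // uvm_re_match() itself returns 0 on a match and non-zero otherwise.
  function automatic bit re_matches(string re, string str);
    return (uvm_re_match(re, str) == 0);
  endfunction
endpackage

module re_matches_demo;
  import re_match_pkg::*;

  initial begin
    if (re_matches("ghij", "abcdef.ghij[2]"))
      $display("pattern found");
    if (!re_matches("xyz", "abcdef.ghij[2]"))
      $display("pattern not found");
  end
endmodule
```

With such a wrapper, conditions read the way the name suggests.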

This function is very easy to use. Here is an example which can be found on EDAPlayground:

module top;
  import uvm_pkg::*;
  
  bit match;
  string str = "abcdef.ghij[2]";
  string regex;

  initial begin
    // match - returns 0
    regex = "abcdef.ghij[[][2-7][]]";
    match = uvm_re_match(regex, str);
    printResult();

    // match - returns 0
    regex = "abcdef*";
    match = uvm_re_match(regex, str);
    printResult();

    // NO match - returns 1
    regex = "xyz";
    match = uvm_re_match(regex, str);
    printResult();
  end
  
  function void printResult();
    $display(" MATCH=", match, " when searching for regular expression:", regex, " inside string: ", str);
  endfunction
endmodule

OUTPUT:

MATCH=0 when searching for regular expression:abcdef.ghij[[][2-7][]] inside string: abcdef.ghij[2]
MATCH=0 when searching for regular expression:abcdef* inside string: abcdef.ghij[2]
MATCH=1 when searching for regular expression:xyz inside string: abcdef.ghij[2]
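
The second pattern matches because regexec() searches for the pattern anywhere in the string unless the pattern is anchored. Here is a short sketch of the difference, assuming the default POSIX (DPI) mode of the UVM library. The module name is made up for this example:

```systemverilog
module anchor_demo;
  import uvm_pkg::*;

  string str = "abcdef.ghij[2]";

  initial begin
    // Unanchored: "ghij" occurs in the middle of str, so this is a match (0).
    $display("ghij  -> %0d", uvm_re_match("ghij", str));
    // ^ anchors the pattern at the start of the string,
    // and str does not start with "ghij", so no match is expected.
    $display("^ghij -> %0d", uvm_re_match("^ghij", str));
    // $ anchors the pattern at the end of the string,
    // and str ends with "]", so no match is expected.
    $display("ghij$ -> %0d", uvm_re_match("ghij$", str));
  end
endmodule
```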

So I started to use the uvm_pkg::uvm_re_match() function to match my class instances.
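
To give an idea of what this looks like, here is a minimal sketch, assuming DPI mode. The package, the function and the instance paths are hypothetical and only serve as an illustration:

```systemverilog
package path_filter_pkg;
  import uvm_pkg::*;

  typedef string string_q[$];

  // Hypothetical helper: returns the entries of `all` that `re` matches,
  // i.e. the entries for which uvm_re_match() returns 0.
  function automatic string_q select_paths(string re, string_q all);
    string_q selected;
    foreach (all[i])
      if (uvm_re_match(re, all[i]) == 0)
        selected.push_back(all[i]);
    return selected;
  endfunction
endpackage

module filter_paths_demo;
  import path_filter_pkg::*;

  // Instance paths made up for this example.
  string_q paths = '{
    "uvm_test_top.env.agent[0].driver",
    "uvm_test_top.env.agent[1].driver",
    "uvm_test_top.env.scoreboard"
  };
  string_q hits;

  initial begin
    // [[] and []] match the literal brackets, [0-9]+ matches the index.
    hits = select_paths("agent[[][0-9]+[]][.]driver", paths);
    foreach (hits[i]) $display("selected: %s", hits[i]);
  end
endmodule
```

This should select the two driver paths and skip the scoreboard. To filter paths out instead, keep the entries for which uvm_re_match() returns a non-zero value.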

While playing with this function, I discovered some non-obvious behavior, which I thought I would share with you.

This is best illustrated using this example on EDAPlayground:

module top;
  import uvm_pkg::*;
   
  bit match;
  string str = "abcdef.ghij[2]";
  string regex;
  
  initial begin
 
    // case 1 - NO match: [2] is a bracket expression that matches only "2"
    regex = "abcdef.ghij[2]";
    $display("Case1:", regex);
    match = uvm_re_match(regex, str);
    $display(match);

    // case 2 - NO match: the string literal drops the single backslashes
    regex = "abcdef.ghij\[2\]";
    $display("Case2:", regex);
    match = uvm_re_match(regex, str);
    $display(match);

    // case 3 - MATCHES: each \\ puts one backslash into the string
    regex = "abcdef.ghij\\[2\\]";
    $display("Case3:", regex);
    match = uvm_re_match(regex, str);
    $display(match);

    // case 4 - MATCHES: [[] and []] match a literal "[" and "]"
    regex = "abcdef.ghij[[]2[]]";
    $display("Case4:", regex);
    match = uvm_re_match(regex, str);
    $display(match);
  end
endmodule

OUTPUT:

Case1:abcdef.ghij[2]
1
Case2:abcdef.ghij[2]
1
Case3:abcdef.ghij\[2\]
0
Case4:abcdef.ghij[[]2[]]
0

“Case 1” is clearly a mistake because, according to POSIX regex, [2] is a bracket expression that matches the single character found between the brackets, which is 2. No matching is performed for the bracket characters [ and ] themselves, so the pattern expects a 2 directly after ghij, while the string has a [ there. Here is a great website for testing the behavior of regular expressions on a sample text.

I expected “Case 2” to work because the bracket characters are escaped using \[ and \]. However, \ is also the escape character inside a SystemVerilog string literal (for more details see this stackoverflow question), and since \[ and \] are not recognized escape sequences, the backslashes are dropped before uvm_re_match() ever sees the string. See the output when printing the regex for “Case 2”: it is identical to “Case 1”. I therefore need to escape the escape character with another \ character, as in “Case 3”, so that the regex engine receives \[ and \].
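
The effect can be seen without UVM at all, simply by printing the strings and their lengths. This is a small sketch with made-up names. Some simulators may also issue a warning for the unrecognized escape sequences in the first literal:

```systemverilog
module escape_demo;
  // The string literal drops the single backslashes: the string is [2]
  string single_escaped = "\[2\]";
  // Each \\ becomes one backslash: the string is \[2\]
  string double_escaped = "\\[2\\]";

  initial begin
    $display("single: %s (len=%0d)", single_escaped, single_escaped.len());
    $display("double: %s (len=%0d)", double_escaped, double_escaped.len());
  end
endmodule
```

Consistent with the output of “Case 2” and “Case 3” above, the first string should have a length of 3 and the second a length of 5.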

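“Case 4” is also a solution, and it needs no backslashes at all, because it uses the character set (bracket expression) from regular expressions. Placing the opening or the closing bracket as the only character inside the character set operator [ ] gives [[] and []], which match a literal [ and a literal ] respectively.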
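
The same idea can be automated when an instance path has to be matched literally. Below is a sketch of such a helper, assuming the POSIX (DPI) mode. The package and function names are hypothetical and not part of UVM. The characters ^ and \ are escaped with a backslash instead, because [^] is not a valid character set:

```systemverilog
package re_escape_pkg;
  // Hypothetical helper (not part of UVM): turns a literal string into a
  // regular expression that matches exactly that text.
  function automatic string re_escape_literal(string s);
    string out = "";
    for (int i = 0; i < s.len(); i++) begin
      string c = s.substr(i, i);
      case (s[i])
        // wrap regex metacharacters in a character set, e.g. [ becomes [[]
        ".", "[", "]", "*", "+", "?", "(", ")", "{", "}", "|", "$":
          out = {out, "[", c, "]"};
        // ^ and \ get a backslash prefix instead
        "^", "\\":
          out = {out, "\\", c};
        default:
          out = {out, c};
      endcase
    end
    return out;
  endfunction
endpackage

module re_escape_demo;
  import uvm_pkg::*;
  import re_escape_pkg::*;

  string str = "abcdef.ghij[2]";
  string regex;

  initial begin
    regex = re_escape_literal(str);  // abcdef[.]ghij[[]2[]]
    $display("escaped regex: %s", regex);
    $display("match result : %0d", uvm_re_match(regex, str));
  end
endmodule
```

Note that the . is escaped as well, so it no longer matches any character, as it does in the four cases above.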

uvm_re_match inside UVM code

Note that the implementation of uvm_re_match() has two variants:

  • The POSIX regular expression (default)
  • The glob style

The implementation is chosen at compile time, based on the DPI mode of the UVM library. DPI mode is used unless UVM_NO_DPI is defined (the regex part can also be switched off on its own by defining UVM_REGEX_NO_DPI). In DPI mode, uvm_re_match() uses the POSIX implementation; otherwise it uses the glob-style implementation, as can be seen below:

`ifdef UVM_NO_DPI
  `define UVM_REGEX_NO_DPI
`endif

`ifndef UVM_REGEX_NO_DPI
  import "DPI-C" context function int uvm_re_match(string re, string str);
  import "DPI-C" context function void uvm_dump_re_cache();
  import "DPI-C" context function string uvm_glob_to_re(string glob);
`else
  // The Verilog only version does not match regular expressions,
  // it only does glob style matching.
  function int uvm_re_match(string re, string str);
    //...code
  endfunction

  function void uvm_dump_re_cache();
  endfunction

  function string uvm_glob_to_re(string glob);
    // code
  endfunction

`endif

If your code defines UVM_NO_DPI or UVM_REGEX_NO_DPI, then uvm_re_match() only performs glob-style matching. It will not be able to process POSIX regular expressions, so patterns like the ones in the examples above will not work as expected.
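
A quick way to find out which variant a testbench ends up with is to test the same defines in your own code. This is only a sketch: the module name is made up, and it assumes the defines are passed on the simulator command line (for example with +define+), so that they are also visible in this file:

```systemverilog
module regex_mode_check;
  initial begin
`ifdef UVM_NO_DPI
    // UVM_NO_DPI implies UVM_REGEX_NO_DPI inside the UVM library
    $display("UVM_NO_DPI defined: uvm_re_match() does glob-style matching only");
`elsif UVM_REGEX_NO_DPI
    $display("UVM_REGEX_NO_DPI defined: uvm_re_match() does glob-style matching only");
`else
    $display("DPI mode: uvm_re_match() uses POSIX regular expressions");
`endif
  end
endmodule
```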

Conclusion

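When using the escape character \ in a regular expression inside a SystemVerilog string, don’t forget that the string literal consumes one level of escaping, so you usually need to escape it once more like this \\. Otherwise, the backslash never reaches uvm_re_match() and the pattern might not do what you expect it to do.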

Have you always done this? Please share your experience of using regular expressions in SystemVerilog.

