Regular Expressions

Regular Expressions are a powerful tool widely used to perform searches and validations. In the context of a web application they are commonly used to perform input validation (e.g. Email address).

Regular expressions are a notation for describing sets of character strings. When a particular string is in the set described by a regular expression, we often say that the regular expression matches the string. (source)

It is well-known that Regular Expressions are hard to master. Sometimes, what seems to be a simple validation, may lead to a Denial-of-Service.

Go authors took it seriously and, unlike other programming languages, opted by a RE2 implementation for the regex standard package.

Why RE2

RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. (source)

With security in mind, RE2 also guarantees a linear-time performance and graceful failing: the memory available to the parser, the compiler and the execution engines is limited.

Regular Expression Denial of Service (ReDoS)

Regular Expression Denial of Service (ReDoS) is an algorithmic complexity attack that provokes a Denial of Service (DoS). ReDos attacks are caused by a regular expression that takes a very long time to be evaluated, exponentially related with the input size. This exceptionally long time in the evaluation process is due to the implementation of the regular expression in use, for example, recursive backtracking ones. (source)

You're better reading the full article "Diving Deep into Regular Expression Denial of Service (ReDoS) in Go" as it goes deep into the problem and also includes comparison between most popular programming languages. In this section we will focus on a real world use case.

For some reason you're looking for a Regular Expression to validate Email addresses provided on your signup form. After a quick search you found this RegEx for email validation at RegExLib.com

^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$

If you try to match [email protected] against this regular expression you may feel confident that it does what you're looking for. If you're developing using Go, you'll come up with something like

package main

import (
    "fmt"
    "regexp"
)

func main() {
    testString1 := "[email protected]"
    testString2 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"
    regex := regexp.MustCompile("^([a-zA-Z0-9])(([\\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$")

    fmt.Println(regex.MatchString(testString1))
    // expected output: true
    fmt.Println(regex.MatchString(testString2))
    // expected output: false
}

what is just fine

$ go run src/redos.go
true
false

What if you're developing with, for example, JavaScript?

const testString1 = '[email protected]';
const testString2 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!';
const regex = /^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$/;

console.log(regex.test(testString1));
// expected output: true
console.log(regex.test(testString2));
// expected output: hang/FATAL EXCEPTION

This time, execution will hang forever and your application will attend no further requests (at least this process), meaning no further signups until the application gets restarted, meaning business loss.

What's missing?

If you have background with other programming languages such as Perl, Python, PHP or JavaScript you should be aware of the differences regarding Regular Expression supported features.

RE2 does not support constructs for which only backtracking solutions are known to exist such as Backreferences and Lookaround.

Consider the following problem: validating whether an arbitrary string is a well-formed HTML tag: a) opening and closing tag names match and b) optionally there's some text in between.

Fulfill requirement b) is straightforward .*?, but a) is challenging as closing tag match depends on what was matched as opening tag. This is exactly what Backreferences allows us to do. Check the JavaScript implementation below

const testString1 = '<h1>Go Secure Coding Practices Guide</h1>';
const testString2 = '<p>Go Secure Coding Practices Guide</p>';
const testString3 = '<h1>Go Secure Coding Practices Guid</p>';
const regex = /<([a-z][a-z0-9]*)\b[^>]*>.*?<\/\1>/;

console.log(regex.test(testString1));
// expected output: true
console.log(regex.test(testString2));
// expected output: true
console.log(regex.test(testString3));
// expected output: false

\1 will hold the value previously captured by ([A-Z][A-Z0-9]*).

This is something you should not expect to do in Go

package main

import (
    "fmt"
    "regexp"
)

func main() {
    testString1 := "<h1>Go Secure Coding Practices Guide</h1>"
    testString2 := "<p>Go Secure Coding Practices Guide</p>"
    testString3 := "<h1>Go Secure Coding Practices Guid</p>"
    regex := regexp.MustCompile("<([a-z][a-z0-9]*)\b[^>]*>.*?<\/\1>")

    fmt.Println(regex.MatchString(testString1))
    fmt.Println(regex.MatchString(testString2))
    fmt.Println(regex.MatchString(testString3))
}

Running Go source code sample above should result in the following errors

$ go run src/backreference.go
# command-line-arguments
src/backreference.go:12:64: unknown escape sequence
src/backreference.go:12:67: non-octal character in escape sequence: >

You may feel tempted to fix these errors, coming up with the following regular expression

<([a-z][a-z0-9]*)\b[^>]*>.*?<\\/\\1>

Then, this is what you'll get

go run src/backreference.go
panic: regexp: Compile("<([a-z][a-z0-9]*)\b[^>]*>.*?<\\/\\1>"): error parsing regexp: invalid escape sequence: `\1`

goroutine 1 [running]:
regexp.MustCompile(0x4de780, 0x21, 0xc00000e1f0)
        /usr/local/go/src/regexp/regexp.go:245 +0x171
main.main()
        /go/src/backreference.go:12 +0x3a
exit status 2

While developing something from scratch you'll probably find a nice workaround to the lack of some features. On the other hand, porting existing software may make you look for full featured alternative to the standard Regular Expression package, and you'll find some (e.g. dlclark/regexp2). Keep in mind that, then you'll (probably) lose RE2 "safety features" such as the linear-time performance.

results matching ""

    No results matching ""