Regex word boundaries with Unicode

Satsuki Hashiba
2 min read · Sep 16, 2019


This article explains how to use regex word boundaries (\b) with Unicode text.

Photo by Dan Gold on Unsplash

If you want to extract the word “apple” from a sentence, the regex looks like this:

\bapple\b

\b matches a word boundary: the position at the beginning or end of a word, where a word character sits next to a non-word character. This regex would match apple in an apple pie, but wouldn’t match apple in pineapple, applecarts or bakeapples.
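To see this behaviour in practice, here is a minimal TypeScript sketch. The variable names are only for illustration, and the regex is built with the RegExp constructor, which is why \b is written as \\b.

```typescript
// \b must be written as \\b inside a string passed to the RegExp constructor.
const appleWord = new RegExp("\\bapple\\b");

// The four sample strings from the paragraph above.
const samples: string[] = ["an apple pie", "pineapple", "applecarts", "bakeapples"];

for (const sample of samples) {
  // Only "an apple pie" has a word boundary on both sides of "apple".
  console.log(sample, "->", appleWord.test(sample));
}
// an apple pie -> true
// pineapple -> false
// applecarts -> false
// bakeapples -> false
```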

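How about “café”? How can we extract the word “café” with a regex? In engines where \b is defined in terms of ASCII word characters (JavaScript is one of them), \bcafé\b doesn’t work. Why? Because “café” contains a non-ASCII character, é, which such an engine doesn’t count as a word character, so it sees no boundary between é and the space that follows. For the same reason, \b can’t simply be used with Unicode words such as समुद्र, 감사 and месяц, or with emoji such as 😉.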
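The failure is easy to reproduce. The sketch below assumes a JavaScript/TypeScript engine, where \b only recognises ASCII letters, digits and the underscore as word characters; other engines may behave differently. The variable names are made up for the example.

```typescript
// "caf\u00e9" spells café with a precomposed é (U+00E9), so the example
// does not depend on how the source file happens to be normalized.
const cafe = "caf\u00e9";
const naiveCafe = new RegExp("\\b" + cafe + "\\b");

// é and the following space are both non-word characters for this engine,
// so there is no boundary after é and the whole word is missed.
console.log(naiveCafe.test("a " + cafe + " nearby")); // false

// Worse, a boundary IS reported between é and t, so the pattern
// matches inside a longer word.
console.log(naiveCafe.test(cafe + "teria")); // true
```

So the naive pattern both misses the real word and matches a fragment of a longer one.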

When you want to extract Unicode words, you should explicitly define the characters that count as word boundaries.

The answer:

(?<=[\s,.:;"']|^)UNICODE_WORD(?=[\s,.:;"']|$)

Breakdown

(?<=PATTERN) is a positive lookbehind. It matches at a position if the pattern inside the lookbehind can be matched ending at that position. (?<=apple-)pie would match pie in apple-pie, but wouldn’t match pie in lime-pie. The text matched by the lookbehind isn’t included in the result.

(?=PATTERN) is positive lookahead. apple(?=-pie) would match apple in apple-pie, but wouldn’t match apple in apple-juice.
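Both lookarounds can be checked with the apple-pie examples. This is a small TypeScript sketch with illustrative variable names; it assumes an engine with lookbehind support (ES2018 or later).

```typescript
const afterApple = new RegExp("(?<=apple-)pie"); // positive lookbehind
const beforePie = new RegExp("apple(?=-pie)");   // positive lookahead

const pieMatch = afterApple.exec("apple-pie");
// The result is just "pie": the "apple-" prefix is checked but not captured.
console.log(pieMatch ? pieMatch[0] : null);   // "pie"
console.log(afterApple.test("lime-pie"));     // false

const appleMatch = beforePie.exec("apple-pie");
// Likewise, "-pie" is checked but left out of the result.
console.log(appleMatch ? appleMatch[0] : null); // "apple"
console.log(beforePie.test("apple-juice"));     // false
```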

[] is a character class and works as OR: it matches exactly one of the characters listed inside it. The characters that count as word boundaries are defined here. Please add other characters, such as ! or ?, depending on the situation.

| also means OR. We use | here because ^ and $ don’t work as anchors inside []: there, $ is a literal dollar sign and a leading ^ negates the class.
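A short TypeScript sketch shows why the anchors have to sit outside the brackets. The names are illustrative, and the boundary class is shortened to keep the example small.

```typescript
// Inside [], $ is a literal dollar sign, not "end of string".
const dollarInClass = new RegExp("[$]");
console.log(dollarInClass.test("price: $5")); // true
console.log(dollarInClass.test("no dollar")); // false

// A leading ^ inside [] negates the class: any character except "a".
const caretInClass = new RegExp("[^a]");
console.log(caretInClass.test("a")); // false
console.log(caretInClass.test("b")); // true

// With $ inside the class, a word at the very end of the string is missed,
// because the lookahead still demands one more character.
const endInsideClass = new RegExp("caf\u00e9(?=[\\s,.$])");
console.log(endInsideClass.test("caf\u00e9")); // false

// With | and $ outside the class, the end of the string counts as a boundary.
const endAsAlternative = new RegExp("caf\u00e9(?=[\\s,.]|$)");
console.log(endAsAlternative.test("caf\u00e9")); // true
```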

^ matches the position before the first character in the string, and $ matches the position right after the last character in the string.
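Putting the pieces together, the pattern can be wrapped in a small helper. This is only a sketch under a few assumptions: escapeRegExp and unicodeWordRegExp are hypothetical helper names, not part of any library, and the boundary characters are the ones listed in the pattern above, so extend them for your own text.

```typescript
// Characters treated as word boundaries, copied from the pattern above.
const BOUNDARY_CLASS = "[\\s,.:;\"']";

// Hypothetical helper: escapes regex metacharacters so the word is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wraps a word in the lookbehind/lookahead boundaries.
function unicodeWordRegExp(word: string): RegExp {
  return new RegExp(
    "(?<=" + BOUNDARY_CLASS + "|^)" + escapeRegExp(word) + "(?=" + BOUNDARY_CLASS + "|$)"
  );
}

const cafeWord = unicodeWordRegExp("caf\u00e9");
console.log(cafeWord.test("a caf\u00e9, please")); // true: space before, comma after
console.log(cafeWord.test("caf\u00e9"));           // true: start and end of string
console.log(cafeWord.test("caf\u00e9teria"));      // false: followed by a letter

const monthWord = unicodeWordRegExp("месяц");
console.log(monthWord.test("один месяц."));        // true: space before, period after
```

Escaping the word first is a design choice for safety: without it, a word containing characters like . or ? would be interpreted as regex syntax.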

Now you can extract any Unicode word from a sentence! Try it here 😎
