Compressing IPv6 Addresses with Regular Expressions

2021-03-06

This post was lying around for too long, time to finish it and get it off my todo list.

The Goal

Convert from a full IPv6 address like 2001:0db8:0000:0000:0000:0023:4200:0123 to a compressed one like 2001:db8::23:4200:123 with a regular expression. Why? Because I can. And because some prometheus exporters only give you the uncompressed addresses.

Details regarding IPv6 address compression can be found in RFC 5952 Section 4.

Preprocessing

The first step is to get from the full form to a form where the leading zeros are removed/reduced to just a single zero per block.

This could be done like this:

^0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*)$

and $1$2$3$4$5$6$7$8 as a replacement string. This results in 2001:db8:0:0:0:23:4200:123

The Common Case

Now the funny part begins. Basically we have to find the first longest sequence of zero-blocks that is longer than 1 block. If we make a group with everything to the left and right of that (including the : ) and combine the 2 groups we get the result. There are some corner cases, those will be handled later.

To find a sequence of length N we can build a very simple expression like

0:0:< in total N zeros >:0:0

Everything to the left of that must have less N consecutive zero blocks. A expression to match that could be:

((0:){0,$N-1}[1-9a-f][0-9a-f]{0,3}:)*

Everything to the right of the zero block sequence can have at most N consecutive zero blocks. The expression for that looks like this.

(:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*

Combined they result in this expression:

(((0:){0,$N-1}[1-9a-f][0-9a-f]{0,3}:)*)0:0:< in total N zeros >:0:0((:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*)

Combining the first and fourth group of that expression results in the compressed representation of the address, if the longest zero block sequence is N blocks long. We can build the expression for all possible block lengths.

The Corner Cases

As mentioned earlier there are several corner cases.

Compression at the left or right side

If the longest sequence is at the left or right end (e.g. 0:0:0:0:0:0:1:2 and 2:1:0:0:0:0:0:0) then we need a special expression. For the left side it looks like this:

0(:)0:< N-1 times 0 in total >:0:0((:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*)

This solves 2 problems:

The group that would be on the left side with the expression for the common case needs to be removed, because there is nothing to match there.
we have to find a : to build the :: in the compressed representation. This is done by taking one from the N zero blocks of the longest sequence.

the right side is constructed the same way.

Compressing 7 consecutive zero blocks

When there are 7 consecutive zero blocks then the compression will happen at either the left or right side, because there is only one non zero block left. for the 7 only the two expressions for the sides are needed, not the common one.

Compressing 8 zero blocks

All the other expressions don't work for the one case of 8 consecutive zeros. But we have to get 2 : for the :: from somewhere. This could look like this:

0(:)0(:)0:0:0:0:0:0

Combining it all

In total we get 3 expressions each for 2, 3, 4, 5 and 6 consecutive zero blocks, 2 for the 7 consecutive zero blocks and 1 for the 8 zero blocks.

Those 18 expressions can be combined like this:

^(($EXPR1)|($EXPR2)|...)$

Building all of this by hand is shitty and annoying. Here is some code to do it.

zero = "0"
non_zero = "[1-9a-f]"
all_chars = "[0-9a-f]"
non_zero_chunk = f"{non_zero}{all_chars}{{0,3}}"

def max_n_zero_block_left(n: int) -> str:
    return f"((({ zero }:){{0,{n}}}{ non_zero_chunk }:)*)"

def max_n_zero_block_right(n: int) -> str:
    return f"((:{ non_zero_chunk }(:{ zero }){{0,{n}}})*)"

def n_length_zero_block(n: int) -> str:
    return ":".join(zero*n)

left_zero_prefix = f"0(:)"
right_zero_suffix = f"(:)0"

patterns = list()
replacement_positions = list()
next_pattern_group_start = 1

for n in range(7,1,-1):
        # special case for the longest continuos zero string at the left
        pattern = f"({ left_zero_prefix }{ n_length_zero_block(n - 1) }({ max_n_zero_block_right(n) }))"
        patterns.append(pattern)
        replacement_positions.append(next_pattern_group_start + 2)
        replacement_positions.append(next_pattern_group_start + 3)
        next_pattern_group_start += 6

        # special case for ending with 0
        pattern = f"(({ max_n_zero_block_left(n-1) }){ n_length_zero_block(n - 1) }{ right_zero_suffix })"
        patterns.append(pattern)
        replacement_positions.append(next_pattern_group_start + 2)
        replacement_positions.append(next_pattern_group_start + 6)
        next_pattern_group_start += 6

        if n == 7:
            continue  # the regular case does not exist for n=7

        # regular case
        pattern = f"(({ max_n_zero_block_left(n-1) }){ n_length_zero_block(n) }({ max_n_zero_block_right(n) }))"
        patterns.append(pattern)
        replacement_positions.append(next_pattern_group_start + 2)
        replacement_positions.append(next_pattern_group_start + 6)
        next_pattern_group_start += 9

patterns.append("(0(:)0(:)0:0:0:0:0:0)")
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 3)

all_patterns = "|".join(patterns)
all_patterns = "^(" + all_patterns + ")$"
print("Expression:")
print(all_patterns)

replacement = "".join(f"${{{i}}}" for i in replacement_positions)
print("Replacement:")
print(replacement)

And the output of the script:

Expression:
^((0(:)0:0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,7})*)))|(((((0:){0,6}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0:0(:)0)|(0(:)0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,6})*)))|(((((0:){0,5}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0(:)0)|(((((0:){0,5}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,6})*)))|(0(:)0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,5})*)))|(((((0:){0,4}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0(:)0)|(((((0:){0,4}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,5})*)))|(0(:)0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,4})*)))|(((((0:){0,3}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0(:)0)|(((((0:){0,3}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,4})*)))|(0(:)0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,3})*)))|(((((0:){0,2}[1-9a-f][0-9a-f]{0,3}:)*))0:0(:)0)|(((((0:){0,2}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,3})*)))|(0(:)0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,2})*)))|(((((0:){0,1}[1-9a-f][0-9a-f]{0,3}:)*))0(:)0)|(((((0:){0,1}[1-9a-f][0-9a-f]{0,3}:)*))0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,2})*)))|(0(:)0(:)0:0:0:0:0:0))$
Replacement:
${3}${4}${9}${13}${15}${16}${21}${25}${27}${31}${36}${37}${42}${46}${48}${52}${57}${58}${63}${67}${69}${73}${78}${79}${84}${88}${90}${94}${99}${100}${105}${109}${111}${115}${120}${121}

Please keep in mind that I am just an idiot on the internet, don't use this expression to burn your production environment down.

swagspace.org