# Compressing IPv6 Addresses with Regular Expressions

2021-03-06

This post was lying around for too long, time to finish it and get it off my todo list.

## The Goal

Convert from a full IPv6 address like `2001:0db8:0000:0000:0000:0023:4200:0123`

to a compressed one like `2001:db8::23:4200:123`

with a regular expression. Why? Because I can. And because some prometheus exporters only give you the uncompressed addresses.

Details regarding IPv6 address compression can be found in RFC 5952 Section 5.

## Preprocessing

The first step is to get from the full form to a form where the leading zeros are removed/reduced to just a single zero per block.

This could be done like this:

`^0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*)$`

and `$1$2$3$4$5$6$7$8`

as a replacement string. This results in `2001:db8:0:0:0:23:4200:123`

## The Common Case

Now the funny part begins. Basically we have to find the first longest sequence of zero-blocks that is longer than 1 block. If we make a group with everything to the left and right of that (including the `:`

) and combine the 2 groups we get the result. There are some corner cases, those will be handled later.

To find a sequence of length `N`

we can build a very simple expression like

`0:0:< in total N zeros >:0:0`

Everything to the left of that must have less `N`

consecutive zero blocks. A expression to match that could be:

`((0:){0,$N-1}[1-9a-f][0-9a-f]{0,3}:)*`

Everything to the right of the zero block sequence can have at most `N`

consecutive zero blocks. The expression for that looks like this.

`(:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*`

Combined they result in this expression:

`(((0:){0,$N-1}[1-9a-f][0-9a-f]{0,3}:)*)0:0:< in total N zeros >:0:0((:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*)`

Combining the first and fourth group of that expression results in the compressed representation of the address, if the longest zero block sequence is `N`

blocks long. We can build the expression for all possible block lengths.

## The Corner Cases

As mentioned earlier there are several corner cases.

### Compression at the left or right side

If the longest sequence is at the left or right end (e.g. `0:0:0:0:0:0:1:2`

and `2:1:0:0:0:0:0:0`

) then we need a special expression. For the left side it looks like this:

`0(:)0:< N-1 times 0 in total >:0:0((:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*)`

This solves 2 problems:

- The group that would be on the left side with the expression for the common case needs to be removed, because there is nothing to match there.
- we have to find a
`:`

to build the`::`

in the compressed representation. This is done by taking one from the`N`

zero blocks of the longest sequence.

the right side is constructed the same way.

### Compressing 7 consecutive zero blocks

When there are 7 consecutive zero blocks then the compression will happen at either the left or right side, because there is only one non zero block left. for the 7 only the two expressions for the sides are needed, not the common one.

### Compressing 8 zero blocks

All the other expressions don't work for the one case of 8 consecutive zeros. But we have to get 2 `:`

for the `::`

from somewhere. This could look like this:

`0(:)0(:)0:0:0:0:0:0`

## Combining it all

In total we get 3 expressions each for 2, 3, 4, 5 and 6 consecutive zero blocks, 2 for the 7 consecutive zero blocks and 1 for the 8 zero blocks.

Those 18 expressions can be combined like this:

`^(($EXPR1)|($EXPR2)|...)$`

Building all of this by hand is shitty and annoying. Here is some code to do it.

```
zero = "0"
non_zero = "[1-9a-f]"
all_chars = "[0-9a-f]"
non_zero_chunk = f"{non_zero}{all_chars}{{0,3}}"
def max_n_zero_block_left(n: int) -> str:
return f"((({ zero }:){{0,{n}}}{ non_zero_chunk }:)*)"
def max_n_zero_block_right(n: int) -> str:
return f"((:{ non_zero_chunk }(:{ zero }){{0,{n}}})*)"
def n_length_zero_block(n: int) -> str:
return ":".join(zero*n)
left_zero_prefix = f"0(:)"
right_zero_suffix = f"(:)0"
patterns = list()
replacement_positions = list()
next_pattern_group_start = 1
for n in range(7,1,-1):
# special case for the longest continuos zero string at the left
pattern = f"({ left_zero_prefix }{ n_length_zero_block(n - 1) }({ max_n_zero_block_right(n) }))"
patterns.append(pattern)
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 3)
next_pattern_group_start += 6
# special case for ending with 0
pattern = f"(({ max_n_zero_block_left(n-1) }){ n_length_zero_block(n - 1) }{ right_zero_suffix })"
patterns.append(pattern)
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 6)
next_pattern_group_start += 6
if n == 7:
continue # the regular case does not exist for n=7
# regular case
pattern = f"(({ max_n_zero_block_left(n-1) }){ n_length_zero_block(n) }({ max_n_zero_block_right(n) }))"
patterns.append(pattern)
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 6)
next_pattern_group_start += 9
patterns.append("(0(:)0(:)0:0:0:0:0:0)")
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 3)
all_patterns = "|".join(patterns)
all_patterns = "^(" + all_patterns + ")$"
print("Expression:")
print(all_patterns)
replacement = "".join(f"${{{i}}}" for i in replacement_positions)
print("Replacement:")
print(replacement)
```

And the output of the script:

```
Expression:
^((0(:)0:0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,7})*)))|(((((0:){0,6}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0:0(:)0)|(0(:)0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,6})*)))|(((((0:){0,5}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0(:)0)|(((((0:){0,5}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,6})*)))|(0(:)0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,5})*)))|(((((0:){0,4}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0(:)0)|(((((0:){0,4}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,5})*)))|(0(:)0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,4})*)))|(((((0:){0,3}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0(:)0)|(((((0:){0,3}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,4})*)))|(0(:)0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,3})*)))|(((((0:){0,2}[1-9a-f][0-9a-f]{0,3}:)*))0:0(:)0)|(((((0:){0,2}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,3})*)))|(0(:)0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,2})*)))|(((((0:){0,1}[1-9a-f][0-9a-f]{0,3}:)*))0(:)0)|(((((0:){0,1}[1-9a-f][0-9a-f]{0,3}:)*))0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,2})*)))|(0(:)0(:)0:0:0:0:0:0))$
Replacement:
${3}${4}${9}${13}${15}${16}${21}${25}${27}${31}${36}${37}${42}${46}${48}${52}${57}${58}${63}${67}${69}${73}${78}${79}${84}${88}${90}${94}${99}${100}${105}${109}${111}${115}${120}${121}
```

Please keep in mind that I am just an idiot on the internet, don't use this expression to burn your production environment down.