Compressing IPv6 Addresses with Regular Expressions
2021-03-06
This post was lying around for too long, time to finish it and get it off my todo list.
The Goal
Convert from a full IPv6 address like
2001:0db8:0000:0000:0000:0023:4200:0123
to a compressed one like
2001:db8::23:4200:123
with a regular expression.
Why? Because I can. And because some prometheus exporters only give you the
uncompressed addresses.
Details regarding IPv6 address compression can be found in RFC 5952 Section 4.
Preprocessing
The first step is to get from the full form to a form where the leading zeros are removed/reduced to just a single zero per block.
This could be done like this:
^0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*:)0{0,3}(.*)$
and $1$2$3$4$5$6$7$8
as a replacement string.
This results in 2001:db8:0:0:0:23:4200:123
The Common Case
Now the funny part begins. Basically we have to find the first longest sequence
of zero-blocks that is longer than 1 block. If we make a group with everything to
the left and right of that (including the :
) and combine the 2
groups we get the result. There are some corner cases, those will be handled
later.
To find a sequence of length N
we can build a very simple expression like
0:0:< in total N zeros >:0:0
Everything to the left of that must have less N
consecutive zero blocks.
A expression to match that could be:
((0:){0,$N-1}[1-9a-f][0-9a-f]{0,3}:)*
Everything to the right of the zero block sequence can have at most N
consecutive zero blocks. The expression for that looks like this.
(:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*
Combined they result in this expression:
(((0:){0,$N-1}[1-9a-f][0-9a-f]{0,3}:)*)0:0:< in total N zeros >:0:0((:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*)
Combining the first and fourth group of that expression results in the
compressed representation of the address, if the longest zero block sequence is
N
blocks long. We can build the expression for all possible block lengths.
The Corner Cases
As mentioned earlier there are several corner cases.
Compression at the left or right side
If the longest sequence is at the left or right end
(e.g. 0:0:0:0:0:0:1:2
and 2:1:0:0:0:0:0:0
) then we need a
special expression. For the left side it looks like this:
0(:)0:< N-1 times 0 in total >:0:0((:[1-9a-f][0-9a-f]{0,3}(:0){0,$N})*)
This solves 2 problems:
- The group that would be on the left side with the expression for the common case needs to be removed, because there is nothing to match there.
- we have to find a
:
to build the::
in the compressed representation. This is done by taking one from theN
zero blocks of the longest sequence.
the right side is constructed the same way.
Compressing 7 consecutive zero blocks
When there are 7 consecutive zero blocks then the compression will happen at either the left or right side, because there is only one non zero block left. for the 7 only the two expressions for the sides are needed, not the common one.
Compressing 8 zero blocks
All the other expressions don't work for the one case of 8 consecutive zeros.
But we have to get 2 :
for the ::
from somewhere.
This could look like this:
0(:)0(:)0:0:0:0:0:0
Combining it all
In total we get 3 expressions each for 2, 3, 4, 5 and 6 consecutive zero blocks, 2 for the 7 consecutive zero blocks and 1 for the 8 zero blocks.
Those 18 expressions can be combined like this:
^(($EXPR1)|($EXPR2)|...)$
Building all of this by hand is shitty and annoying. Here is some code to do it.
zero = "0"
non_zero = "[1-9a-f]"
all_chars = "[0-9a-f]"
non_zero_chunk = f"{non_zero}{all_chars}{{0,3}}"
def max_n_zero_block_left(n: int) -> str:
return f"((({ zero }:){{0,{n}}}{ non_zero_chunk }:)*)"
def max_n_zero_block_right(n: int) -> str:
return f"((:{ non_zero_chunk }(:{ zero }){{0,{n}}})*)"
def n_length_zero_block(n: int) -> str:
return ":".join(zero*n)
left_zero_prefix = f"0(:)"
right_zero_suffix = f"(:)0"
patterns = list()
replacement_positions = list()
next_pattern_group_start = 1
for n in range(7,1,-1):
# special case for the longest continuos zero string at the left
pattern = f"({ left_zero_prefix }{ n_length_zero_block(n - 1) }({ max_n_zero_block_right(n) }))"
patterns.append(pattern)
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 3)
next_pattern_group_start += 6
# special case for ending with 0
pattern = f"(({ max_n_zero_block_left(n-1) }){ n_length_zero_block(n - 1) }{ right_zero_suffix })"
patterns.append(pattern)
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 6)
next_pattern_group_start += 6
if n == 7:
continue # the regular case does not exist for n=7
# regular case
pattern = f"(({ max_n_zero_block_left(n-1) }){ n_length_zero_block(n) }({ max_n_zero_block_right(n) }))"
patterns.append(pattern)
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 6)
next_pattern_group_start += 9
patterns.append("(0(:)0(:)0:0:0:0:0:0)")
replacement_positions.append(next_pattern_group_start + 2)
replacement_positions.append(next_pattern_group_start + 3)
all_patterns = "|".join(patterns)
all_patterns = "^(" + all_patterns + ")$"
print("Expression:")
print(all_patterns)
replacement = "".join(f"${{{i}}}" for i in replacement_positions)
print("Replacement:")
print(replacement)
And the output of the script:
Expression:
^((0(:)0:0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,7})*)))|(((((0:){0,6}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0:0(:)0)|(0(:)0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,6})*)))|(((((0:){0,5}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0(:)0)|(((((0:){0,5}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,6})*)))|(0(:)0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,5})*)))|(((((0:){0,4}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0(:)0)|(((((0:){0,4}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,5})*)))|(0(:)0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,4})*)))|(((((0:){0,3}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0(:)0)|(((((0:){0,3}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,4})*)))|(0(:)0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,3})*)))|(((((0:){0,2}[1-9a-f][0-9a-f]{0,3}:)*))0:0(:)0)|(((((0:){0,2}[1-9a-f][0-9a-f]{0,3}:)*))0:0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,3})*)))|(0(:)0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,2})*)))|(((((0:){0,1}[1-9a-f][0-9a-f]{0,3}:)*))0(:)0)|(((((0:){0,1}[1-9a-f][0-9a-f]{0,3}:)*))0:0(((:[1-9a-f][0-9a-f]{0,3}(:0){0,2})*)))|(0(:)0(:)0:0:0:0:0:0))$
Replacement:
${3}${4}${9}${13}${15}${16}${21}${25}${27}${31}${36}${37}${42}${46}${48}${52}${57}${58}${63}${67}${69}${73}${78}${79}${84}${88}${90}${94}${99}${100}${105}${109}${111}${115}${120}${121}
Please keep in mind that I am just an idiot on the internet, don't use this expression to burn your production environment down.