In Ruby, especially while writing low-level protocols, you may have encountered Array#pack() and/or String#unpack(). If you’ve ever experimented with them, they may seem mysterious; they take a cryptic string as a parameter (maybe something like C4 or w*) and seem to return gibberish (or, conversely, convert gibberish into something humans can understand). I’m not going to go into tons of detail, nor am I going to cover every format that these methods accept, but I’ll cover a few that I’ve used.
What led me to writing this post was actually some recent experimenting that I’ve been doing with Crystal. I’m trying my hand at writing a simple BER parser/encoder as a part of my journey to writing an LDAP library for Crystal. The lack of an LDAP library is honestly the biggest reason I haven’t used Crystal for more things. Since BER is a binary method of encoding, the Ruby LDAP BER code uses a ton of Array#pack() and String#unpack(). Unfortunately, Crystal doesn’t have analogous methods for its Array or String classes, so I’ve had to write my own.
Here, I’ll describe a few of the formats supported by #pack and write some compatible examples in Crystal.
What Are Pack and Unpack?
The Ruby Array#pack() method takes an Array of data (usually numbers) and “packs” the data together into a String. This could be used to take binary data and turn it back into something you can read, or it could be used to transport binary data over an ASCII protocol. The opposite of this is String#unpack(), which allows you to take apart (or “unpack”) a String into a set of numbers. In reality, this packing refers to arranging data into a specific format and is pretty low-level stuff. It converts objects in Ruby to a format that corresponds to how C handles data in memory.
Ruby isn’t alone in having this function; Perl and Python have it too (though Python refers to the C structures as structs and provides its pack and unpack functions in the struct module).
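As a quick illustration of the round trip described above, here is a minimal Ruby sketch. The helper names and sample values are mine, chosen only because the numbers map to readable ASCII letters.

```ruby
# Hypothetical helper names, used only for this illustration.
def bytes_to_string(numbers)
  numbers.pack("C*")   # each Integer becomes one byte of the String
end

def string_to_bytes(string)
  string.unpack("C*")  # each byte of the String becomes an Integer
end

packed = bytes_to_string([65, 66, 67])
puts packed                           # ABC
puts string_to_bytes(packed).inspect  # [65, 66, 67]

# A count instead of * limits how many values are read.
puts "ABCDE".unpack("C2").inspect     # [65, 66]
```

The format string is the only thing that changes between the formats covered below; the Array-in, String-out shape stays the same.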
Why Doesn’t Crystal Have Them?
I’m pretty sure the Crystal developers don’t want the Ruby approach. Based on this GitHub issue, at least some of the Crystal maintainers (like @straight-shoota and @jhass) don’t like the Ruby way of doing this. They instead advocate for using the features of IO, which is mostly what I’ll be demonstrating.
I think they offer a reasonable argument against using a String as a container for packed binary data, though I think having some kind of shortcut on IO would be helpful for porting code from other places (like Ruby).
Converting to Crystal
With all that context out of the way, let’s write some code!
First, we’ll need some example data. Thankfully, this part works in both Ruby and Crystal:
    @string1 = "Simple"
    @string2 = "Déjà vu"
Now we’re ready for some (un)packing.
Single-Byte Packs
Unpacking
First, let’s look at how this works in Ruby:
    @string1.unpack("C*")
    # => [83, 105, 109, 112, 108, 101]

    @string2.unpack("C*")
    # => [68, 195, 169, 106, 195, 160, 32, 118, 117]
For Ruby, the C* means that the data is (or should be) carved up into unsigned, 8-bit integers. The unsigned part means that it can’t encode negative numbers. The 8-bit part, well, means that the integer has 8 binary digits. Combined, these details mean the data is broken up into chunks that can each hold one of 2^8 = 256 values, meaning 0 through 255.
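To see the 0 through 255 range in practice, this small Ruby sketch pushes a few values through C and back. The helper name is made up for the example. Out-of-range values are not rejected; as far as I can tell only the low 8 bits are kept, so treat the wrap-around lines as behavior worth verifying on your own Ruby version.

```ruby
# Hypothetical helper: pack one Integer as an unsigned byte and read it back.
def through_unsigned_byte(number)
  [number].pack("C").unpack("C").first
end

puts through_unsigned_byte(0)    # 0
puts through_unsigned_byte(255)  # 255
puts through_unsigned_byte(256)  # 0, only the low 8 bits survive
puts through_unsigned_byte(-1)   # 255, the same bit pattern as a signed -1
```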
Let’s do the same with Crystal:
    @string1.bytes
    # => [83, 105, 109, 112, 108, 101]

    @string2.bytes
    # => [68, 195, 169, 106, 195, 160, 32, 118, 117]
As you can see, Crystal made this pretty easy. Instances of String have the #bytes method that returns exactly what we want, an Array of UInt8 as they’re called in Crystal. We’ll be using this approach as the basis for our other packing formats as well because of its simplicity.
Packing
Now let’s convert it back:
    [83, 105, 109, 112, 108, 101].pack("C*")
    # => "Simple"

    [68, 195, 169, 106, 195, 160, 32, 118, 117].pack("C*")
    # => "D\xC3\xA9j\xC3\xA0 vu"
Huh, that’s weird. Asking Ruby about the String’s encoding gives us a hint:
    [68, 195, 169, 106, 195, 160, 32, 118, 117].pack("C*").encoding
    # => #<Encoding:ASCII-8BIT>
The conversion back to a String in Ruby doesn’t include any encoding info, so Ruby defaulted to ASCII-8BIT, its encoding for raw binary data. We can tell Ruby to treat those same bytes as UTF-8:
    [68, 195, 169, 106, 195, 160, 32, 118, 117].pack("C*").force_encoding("UTF-8")
    # => "Déjà vu"
OK, that’s easy enough. Let’s do the same in Crystal:
    String.build do |io|
      [83, 105, 109, 112, 108, 101].each do |number|
        io.write_byte number.to_u8
      end
    end
    # => "Simple"

    String.build do |io|
      [68, 195, 169, 106, 195, 160, 32, 118, 117].each do |number|
        io.write_byte number.to_u8
      end
    end
    # => "Déjà vu"
Notice that Crystal didn’t have the same encoding problem. This is because Crystal, according to the documentation for String, treats all Strings as UTF-8 by definition.
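Back on the Ruby side, it helps to see that force_encoding only relabels the String and never touches the bytes themselves. The following is a small sketch with a made-up helper name; the accented text is written with \u escapes so the example doesn’t depend on the source file’s encoding.

```ruby
# Hypothetical helper: pack bytes, then label the result as UTF-8.
def pack_as_utf8(numbers)
  numbers.pack("C*").force_encoding("UTF-8")
end

bytes  = [68, 195, 169, 106, 195, 160, 32, 118, 117]
binary = bytes.pack("C*")
utf8   = pack_as_utf8(bytes)

puts binary.encoding == Encoding::BINARY  # true (BINARY is an alias of ASCII-8BIT)
puts utf8.encoding == Encoding::UTF_8     # true
puts binary.bytes == utf8.bytes           # true, the bytes were never changed
puts utf8.valid_encoding?                 # true
puts utf8.length                          # 7 characters
puts utf8.bytesize                        # 9 bytes
```

The two accented letters take two bytes each in UTF-8, which is why nine packed bytes come back as seven characters.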
Signed Bytes
With Ruby’s #pack and #unpack, the C* means use unsigned, 8-bit integers as much as possible (you can also use C to get just the first byte, or CC for the first two, etc.). But what about c*? The lowercase “c” means use signed, 8-bit integers. This means that the first bit is reserved for determining if the number is positive or negative. This still encodes the same number of possible values, but it shifts them over: it can now hold from -128 to 127. For the first example (@string1), there’s no difference:
    @string1.unpack("c*")
    # => [83, 105, 109, 112, 108, 101]
But, for the second example, the Array looks a little different:
    @string2.unpack("c*")
    # => [68, -61, -87, 106, -61, -96, 32, 118, 117]
Not too different, but Crystal’s #bytes doesn’t take any parameters to help us. Luckily, mathematically, this is a pretty easy fix. We just need to subtract 256 from the number if it is greater than 127. That said, Crystal is a statically typed language (though it infers most of those types at compile time), and #bytes returns an Array of a precise type: UInt8. A UInt8 can’t hold a negative number, so we can’t simply subtract 256 (an Int32 literal) from it and keep the result as a UInt8. This means we’ll need to do some converting:
    @string2.bytes.map { |n| (n < 128 ? n : n.to_i16 - 256).to_i8 }
    # => [68, -61, -87, 106, -61, -96, 32, 118, 117]
The last conversion (#to_i8) wasn’t strictly necessary, but it makes sure the Array ends up with the precise type (Int8) we’re expecting.
Converting it back is basically applying the same process in reverse (adding 256 if the number is less than 0):
    String.build do |io|
      [68, -61, -87, 106, -61, -96, 32, 118, 117].each do |number|
        io.write_byte((number < 0 ? number + 256 : number).to_u8)
      end
    end
    # => "Déjà vu"
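The subtract-256 and add-256 rules are easy to check against Ruby’s own c* format. Here is a sketch of that check; the two helper names are hypothetical and exist only for this example.

```ruby
# Hypothetical helpers mirroring the subtract/add 256 rule.
def to_signed(byte)
  byte < 128 ? byte : byte - 256
end

def to_unsigned(number)
  number < 0 ? number + 256 : number
end

text   = "D\u00E9j\u00E0 vu"  # "Déjà vu" written with escapes
signed = text.bytes.map { |b| to_signed(b) }

puts signed.inspect                                   # [68, -61, -87, 106, -61, -96, 32, 118, 117]
puts signed == text.unpack("c*")                      # true
puts signed.map { |n| to_unsigned(n) } == text.bytes  # true
```

Ruby’s Integers don’t have a fixed width, so none of the type conversions from the Crystal version are needed here.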
Four-Byte Packs
Packing and unpacking using unsigned, 4-byte (32-bit) “words” in big-endian (network) byte order is done with the N format option.
Unpacking
    @string1.unpack("N*")
    # => [1399418224]

    @string2.unpack("N*")
    # => [1153673578, 3282051190]
It may not be obvious yet, but we’ve actually lost some data in this conversion. This is because the purpose of #pack and #unpack actually has nothing to do with Strings. Strings here are just a container for holding the packed bits. The problem is that neither of our Strings holds a multiple of 32 bits, and incomplete “words” are invalid, so they aren’t unpacked. For this reason, if your goal is to convert a String to binary and back again, using 32-bit words is a bad choice. But, if your goal is to pack some binary data (and you can “zero fill” it so its length in bits is evenly divisible by 32) and convert it back to binary, this works great.
So let’s reproduce these results in Crystal. Since it takes a little work, I’ll make a function to do it and run it against both Strings:
    def unpack_to_u32(input : String)
      nums = [] of UInt32 # This is a temporary store

      input.bytes.each_slice(4) do |this_slice|
        if this_slice.size == 4 # only do this if we have a full 4 bytes
          bin_nums = this_slice.map do |num|
            to_bin = num.to_s(2) # convert the number to binary
            "#{"0" * (8 - to_bin.size)}#{to_bin}" # zero-fill the binary
          end
          nums << bin_nums.join.to_u32(base: 2) # join binary strings, convert to UInt32
        end
      end

      nums
    end

    unpack_to_u32(@string1)
    # => [1399418224]

    unpack_to_u32(@string2)
    # => [1153673578, 3282051190]
Note that when defining methods in Crystal, the space between input and : matters. With the space, input : String restricts the parameter to the String type. Without it, input: String reads like the name: value syntax used for named arguments rather than a type restriction, and as far as I know current Crystal compilers reject it in a method definition.
Let’s go over what exactly this function does:
- It takes a String as input.
- It creates a temporary store, called nums, for collecting the results of our algorithm.
- It iterates over the result of calling #bytes on our input String. How it iterates is important though: it operates on subsets (slices) of at most 4 Array items. Since N is supposed to operate on 4-byte words, this is exactly what we want. Notice though that I said at most 4 items; #each_slice(4) yields 4 items at a time until fewer than 4 remain, meaning if there’s a remainder (some count x where 0 < x < 4), it is yielded as the last subset.
- Since slices can be incomplete, we need the conditional wrapper around our conversion logic to only do the conversion if the slice is exactly 4 bytes in size.
- We then create a temporary variable that contains the result of calling #map to convert each number to a binary string.
  - We zero-fill (meaning we pad the front of the binary string with 0 if necessary) to make sure it is the full 8 bits in length. This is because we’ll be joining the 8-bit chunks before converting them to 32-bit numbers. Here’s a simple demonstration of why this is necessary:
    # Convert "word" to UInt32 then binary
    "word".unpack("N*")
    # => [2003792484]

    "word".unpack("N*").map { |b| b.to_s(2) } # convert to binary
    # => ["1110111011011110111001001100100"]

    "word".unpack("N*").map { |b| b.to_s(2) }.first
    # => "1110111011011110111001001100100"

    "word".unpack("N*").map { |b| b.to_s(2) }.first.size # how long?
    # => 31

    # Now convert "word" to UInt8 then binary
    "word".unpack("C*")
    # => [119, 111, 114, 100]

    "word".unpack("C*").map { |b| b.to_s(2) }
    # => ["1110111", "1101111", "1110010", "1100100"]

    "word".unpack("C*").map { |b| b.to_s(2) }.join
    # => "1110111110111111100101100100"

    "word".unpack("C*").map { |b| b.to_s(2) }.join.size
    # => 28
Those missing digits matter… that’s similar to the difference between 20001 and 201; you can’t just ignore digits in the middle, even if they’re zeros.
- Our final step is to convert from binary to a 32-bit integer.
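The steps above can also be expressed without the intermediate binary strings. This is a sketch in Ruby rather than Crystal (so it can be compared against unpack directly), and it swaps the zero-filled strings for bit shifting, which is my substitution and not what the Crystal function does. The method name is reused only for readability.

```ruby
# Hypothetical Ruby counterpart: rebuild what unpack("N*") returns using bit shifts.
def unpack_to_u32(input)
  input.bytes.each_slice(4)
       .select { |slice| slice.size == 4 }  # drop an incomplete trailing word
       .map { |slice| slice.reduce(0) { |acc, byte| (acc << 8) | byte } }
end

puts unpack_to_u32("Simple").inspect                   # [1399418224]
puts unpack_to_u32("word").inspect                     # [2003792484]
puts unpack_to_u32("Simple") == "Simple".unpack("N*")  # true
```

Shifting the accumulator left by 8 before adding each byte plays the same role as zero-filling every byte to 8 binary digits: it guarantees each byte occupies exactly 8 bits of the final word.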
Packing
Converting back is much easier. Here’s the Ruby version:
    [1399418224].pack("N*")
    # => "Simp"

    [1153673578, 3282051190].pack("N*").force_encoding("UTF-8")
    # => "Déjà v"
Again, the missing letters are expected. Let’s do the same in Crystal:
    String.build do |io|
      [1399418224].each do |number|
        io.write_bytes number.to_u32, IO::ByteFormat::BigEndian
      end
    end
    # => "Simp"

    String.build do |io|
      [1153673578, 3282051190].each do |number|
        io.write_bytes number.to_u32, IO::ByteFormat::BigEndian
      end
    end
    # => "Déjà v"
The Crystal version is basically the same as the single-byte example, except we convert to unsigned 32-bit integers (UInt32) and we have to specify that the byte format is big-endian, because N stores the most significant byte of each 32-bit word first.
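To make the byte-order point concrete, this Ruby sketch packs the same number with N and with V. V isn’t covered anywhere else in this post; it is Ruby’s little-endian counterpart to N, included here only for contrast. The helper name is made up.

```ruby
# Hypothetical helper: the four bytes of a 32-bit word in a given pack format.
def word_bytes(number, format)
  [number].pack(format).bytes
end

puts word_bytes(1399418224, "N").inspect  # [83, 105, 109, 112], the bytes of "Simp"
puts word_bytes(1399418224, "V").inspect  # [112, 109, 105, 83], the bytes of "pmiS"
```

Getting the byte order wrong doesn’t raise an error; it silently reverses each word, which is why the Crystal code has to name IO::ByteFormat::BigEndian explicitly.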
BER-Compressed Packs
The BER encoding format includes the ability to compress multi-byte words slightly. Ruby’s pack/unpack formats include w*, which implements BER compression.
Unpacking
Here are the Ruby examples:
    @string1.unpack("w*")
    # => [83, 105, 109, 112, 108, 101]

    @string2.unpack("w*")
    # => [68, 1103082, 1101856, 118, 117]
Notice that the first one is identical to C*; all the words are single-byte. The second string, however, includes some larger integers. Sure, the resulting Array has fewer elements, but is it really compressed? Let’s find out:
    # BER compressed version
    @string2.unpack("w*").map { |n| n.to_s(2) }.join
    # => "100010010000110101001110101010000110100000010000011101101110101"
    @string2.unpack("w*").map { |n| n.to_s(2) }.join.size
    # => 63

    # Straight binary
    @string2.unpack("C*").map { |n| n.to_s(2) }.join
    # => "100010011000011101010011101010110000111010000010000011101101110101"
    @string2.unpack("C*").map { |n| n.to_s(2) }.join.size
    # => 66
That’s 3 bits fewer, or a compression of about 4.5%. Sure, it isn’t much with such a small amount of data, but it’s something. So how does this magic work? Writing this in Crystal gives the answer. I’m certain that this could be optimized, but it gets the job done:
    require "big"

    def unpack_to_ber(input : String)
      to_combine = [] of String # A buffer for binary strings
      output = [] of (Int32 | BigInt) # What we'll be returning

      input.bytes.each do |num|
        if num < 128
          if to_combine.empty?
            output << num.to_i
          else
            to_combine << num.to_s(2)
            bin_nums = to_combine.map { |n| "#{"0" * (7 - n.size)}#{n}" }
            output << bin_nums.join.to_big_i(base: 2)
            to_combine.clear # empty the Array
          end
        else
          # Bytes of 128 or more are buffered (minus the high bit) to be combined
          to_combine << (num - 128).to_s(2)
        end
      end

      output
    end

    unpack_to_ber(@string1)
    # => [83, 105, 109, 112, 108, 101]

    unpack_to_ber(@string2)
    # => [68, 1103082, 1101856, 118, 117]
I had to use BigInt because, unlike the other formats, BER encoding can have arbitrarily large words, so writing it this way is safer.
The walkthrough of this code would be pretty lengthy, but in summary, numbers less than 128 are passed as-is (though they’re converted to Int32), unless they’re preceded by numbers greater than 127. In that case, the number is combined with however many consecutive numbers greater than 127 came before it (each with 128 subtracted first), then returned as a BigInt. Note that this “combining” only works when it is done with the binary forms of these numbers, and they must be zero-filled to 7 bits.
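The same combining rule can be sketched in Ruby with bit operations instead of zero-filled strings. This is my own restatement for comparison with unpack("w*"), not a port of the Crystal function; the method name is reused only for readability. Ruby’s Integers grow as needed, so nothing like BigInt is required.

```ruby
# Hypothetical Ruby counterpart: decode BER-compressed integers with bit operations.
def unpack_to_ber(input)
  output = []
  current = 0
  input.each_byte do |byte|
    # The low 7 bits carry data; shift what we have so far and append them.
    current = (current << 7) | (byte & 0x7F)
    if byte < 128
      # High bit clear: this byte ends the current number.
      output << current
      current = 0
    end
  end
  output
end

text = "D\u00E9j\u00E0 vu"  # "Déjà vu" written with escapes
puts unpack_to_ber("Simple").inspect          # [83, 105, 109, 112, 108, 101]
puts unpack_to_ber(text).inspect              # [68, 1103082, 1101856, 118, 117]
puts unpack_to_ber(text) == text.unpack("w*") # true
```

Masking with 0x7F is the bitwise version of subtracting 128, and shifting by 7 is the bitwise version of zero-filling each chunk to 7 bits before joining.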
Packing
Packing these back into our original String is essentially the reverse, though there’s more binary trickery involved:
    def pack_from_ber(input : Array(Int32 | BigInt))
      String.build do |io|
        input.each do |number|
          if number < 128
            # If the number is small, just write it
            io.write_byte number.to_u8
          else
            # Otherwise, convert to binary and zero-fill on the left to a
            # multiple of 7 bits so the 7-bit chunks line up from the right
            bits = number.to_s(2)
            bits = "0" * ((7 - bits.size % 7) % 7) + bits
            bin_nums = bits.scan(/.{7}/)

            bin_nums[0..-2].each do |x|
              # Scan provides MatchData objects, not strings
              # so we need to use `[0]` on them to get the string
              # then we can convert to an Int32 and do a bitwise OR
              num = x[0].to_i(2) | 0x80
              io.write_byte(num.to_u8)
            end

            # For the last number (the small one) just convert to UInt8
            io.write_byte(bin_nums.last[0].to_i(2).to_u8)
          end
        end
      end
    end

    pack_from_ber([83, 105, 109, 112, 108, 101])
    # => "Simple"

    pack_from_ber([68, 1103082, 1101856, 118, 117])
    # => "Déjà vu"
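For comparison, here is a Ruby sketch of the packing direction using shifts and masks, checked against pack("w"). It assumes non-negative Integers and forces the result to UTF-8 so it compares cleanly with ordinary String literals; both choices are mine, and the method name is reused only for readability.

```ruby
# Hypothetical Ruby counterpart: encode non-negative Integers as BER-compressed bytes.
def pack_from_ber(numbers)
  bytes = numbers.flat_map do |number|
    # Peel off 7-bit groups starting from the least significant end.
    groups = [number & 0x7F]  # the final byte keeps its high bit clear
    number >>= 7
    while number > 0
      groups.unshift((number & 0x7F) | 0x80)  # earlier bytes get the high bit set
      number >>= 7
    end
    groups
  end
  bytes.pack("C*").force_encoding("UTF-8")
end

puts pack_from_ber([83, 105, 109, 112, 108, 101])                 # Simple
puts pack_from_ber([128]).bytes.inspect                           # [129, 0]
puts pack_from_ber([128]).bytes == [128].pack("w").bytes          # true
```

The 128 case is the interesting one: its binary form is 8 digits long, so the 7-bit groups have to be aligned from the right for the result to match Ruby’s output.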
Phew! That was fun.
Conclusion
Obviously this barely scratches the surface for porting the very powerful #pack and #unpack methods to Crystal. I hope the work gets done because they’re extremely handy. Let me know if you have some other examples or see places where these can be improved.