Rewriting Ruby’s #pack and #unpack Methods in Crystal

In Ruby, especially while writing low-level protocols, you may have encountered Array#pack() and/or String#unpack(). If you’ve ever experimented with them, they may seem mysterious; they take a cryptic string as a parameter (maybe something like C4 or w*) and seem to return gibberish (or, conversely, convert gibberish into something humans can understand). I’m not going to go into tons of detail, nor am I going to cover every format that these methods accept, but I’ll cover a few that I’ve used.

What led me to write this post was some recent experimenting that I’ve been doing with Crystal. I’m trying my hand at writing a simple BER parser/encoder as a part of my journey to writing an LDAP library for Crystal. The lack of an LDAP library is honestly the biggest reason I haven’t used Crystal for more things. Since BER is a binary method of encoding, the Ruby LDAP BER code uses a ton of Array#pack() and String#unpack(). Unfortunately, Crystal doesn’t have analogous methods for its Array or String classes, so I’ve had to write my own.

Here, I’ll describe a few of the formats supported by #pack and write some compatible examples in Crystal.

What Are Pack and Unpack?

The Ruby Array#pack() method takes an Array of data (usually numbers) and “packs” the data together into a String. This could be used to take binary data and turn it back into something you can read, or it could be used to transport binary data over an ASCII protocol. The opposite of this is String#unpack(), which allows you to take apart (or “unpack”) a String into a set of numbers. In reality, this packing refers to arranging data into a specific format and is pretty low-level stuff. It converts objects in Ruby to a format that corresponds to how C handles data in memory.
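As a quick illustration (the byte values are arbitrary picks for this sketch, not taken from any real protocol), this Ruby snippet packs three small integers into a String and unpacks them again:

```ruby
# Three arbitrary byte values chosen for illustration.
bytes = [72, 105, 33]

# 'C*' treats every element as one unsigned 8-bit integer.
packed = bytes.pack('C*')
p packed                # the three bytes happen to be printable: "Hi!"

# Unpacking with the same format returns the original numbers.
unpacked = packed.unpack('C*')
p unpacked              # [72, 105, 33]
```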

Ruby isn’t alone in having this function; Perl and Python have it too (though Python refers to the C structures as structs and provides its pack and unpack functions through the struct module).

Why Doesn’t Crystal Have Them?

I’m pretty sure the Crystal developers don’t want the Ruby approach. Based on this GitHub issue, it looks like at least some of the Crystal maintainers (like @straight-shoota and @jhass) don’t like the Ruby way of doing this. They instead advocate for using the features of IO, which is mostly what I’ll be demonstrating.

I think they offer a reasonable argument against using a String as a container for packed binary data, though I think having some kind of shortcut on IO would be helpful for porting code from other places (like Ruby).

Converting to Crystal

With all that context out of the way, let’s write some code!

First, we’ll need some example data. Thankfully, this part works in both Ruby and Crystal:
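A minimal sketch of such sample data, shown here in Ruby only. Both strings are hypothetical stand-ins that I’m assuming for illustration: one is plain ASCII, the other contains two multi-byte UTF-8 characters, and neither is a multiple of 4 bytes long.

```ruby
# Hypothetical sample data (assumed for this sketch).
@string1 = "hello"                  # plain ASCII
@string2 = "h\u00e9llo w\u00f6rld"  # "héllo wörld", written with escapes so the source stays ASCII

# Characters and bytes differ once multi-byte characters are involved.
puts "#{@string1.length} characters, #{@string1.bytesize} bytes"
puts "#{@string2.length} characters, #{@string2.bytesize} bytes"
```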

Now we’re ready for some (un)packing.

Single-Byte Packs

Unpacking

First, let’s look at how this works in Ruby:

For Ruby, the C* means that the data is (or should be) carved up into unsigned, 8-bit integers. The unsigned part means that it can’t encode negative numbers. The 8-bit part, well, means that the integer has 8 binary digits. Combined, these details mean each chunk can take one of 2^8 = 256 values, meaning 0 through 255.
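Here is a sketch of the Ruby call, assuming the two hypothetical sample strings "hello" and "héllo wörld" (my stand-ins, not necessarily the original data):

```ruby
# Hypothetical sample data (assumed for this sketch).
@string1 = "hello"
@string2 = "h\u00e9llo w\u00f6rld"  # "héllo wörld"

bytes1 = @string1.unpack('C*')
bytes2 = @string2.unpack('C*')

p bytes1  # [104, 101, 108, 108, 111]
p bytes2  # [104, 195, 169, 108, 108, 111, 32, 119, 195, 182, 114, 108, 100]

# Ruby's String#bytes returns the same unsigned values.
p @string2.bytes == bytes2
```

Each accented character shows up as two values above 127, because UTF-8 encodes it in two bytes.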

Let’s do the same with Crystal:

As you can see, Crystal made this pretty easy. Instances of String have the #bytes method that returns exactly what we want, an Array of UInt8 as they’re called in Crystal. We’ll be using this approach as the basis for our other packing formats as well because of its simplicity.

Packing

Now let’s convert it back:

Huh, that’s weird. Asking Ruby about the String’s encoding gives us a hint:

The conversion back to a String in Ruby doesn’t include any encoding info, so Ruby defaulted to ASCII-8BIT, its binary encoding. We can have Ruby relabel the bytes with the right encoding:
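A sketch of that round trip in Ruby, assuming the bytes of the hypothetical sample string "héllo wörld":

```ruby
# Hypothetical bytes: "héllo wörld" encoded as UTF-8 (sample data assumed for this sketch).
bytes2 = [104, 195, 169, 108, 108, 111, 32, 119, 195, 182, 114, 108, 100]

packed = bytes2.pack('C*')
p packed            # the non-ASCII bytes show up as \x escapes
p packed.encoding   # Ruby's binary encoding, not UTF-8

# force_encoding relabels the same bytes; dup leaves `packed` untouched.
fixed = packed.dup.force_encoding(Encoding::UTF_8)
p fixed
p fixed.valid_encoding?
```

Note that force_encoding doesn’t transcode anything; it only changes how Ruby interprets the bytes that are already there.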

OK, that’s easy enough. Let’s do the same in Crystal:

Notice that Crystal didn’t have the same encoding problem. This is because Crystal, according to the documentation for String, treats all Strings as UTF-8 by definition.

Signed Bytes

With Ruby’s #pack and #unpack, the C* means use unsigned, 8-bit integers as much as possible (you can also use C to get just the first byte, or CC for the first two, etc.). But what about c*? The lowercase “c” means use signed, 8-bit integers. This means that the first bit is reserved for determining if the number is positive or negative. This still encodes the same number of possible values, but it shifts them over: it can now hold from -128 to 127. For the first example (@string1), there’s no difference:

But, for the second example, the Array looks a little different:
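Using the same hypothetical sample strings as a stand-in for the original data, a Ruby sketch of both cases looks like this:

```ruby
# Hypothetical sample data (assumed for this sketch).
@string1 = "hello"
@string2 = "h\u00e9llo w\u00f6rld"  # "héllo wörld"

signed1 = @string1.unpack('c*')
signed2 = @string2.unpack('c*')

p signed1  # same as C*, since every byte is below 128
p signed2  # [104, -61, -87, 108, 108, 111, 32, 119, -61, -74, 114, 108, 100]
```

Every byte that was above 127 in the unsigned version (195, 169, 182) comes back exactly 256 lower.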

Not too different, but Crystal’s #bytes doesn’t take any parameters to help us. Luckily, mathematically, this is a pretty easy fix. We just need to subtract 256 from the number if it is greater than 127. That said, Crystal is a statically typed language (though it infers most types at compile time), and #bytes returns an Array of a precise type: UInt8. A UInt8 can’t hold a negative number, so we can’t simply subtract 256 (an Int32) from it in place. This means we’ll need to do some converting:

The last conversion (#to_i8) wasn’t strictly necessary, but it makes sure the Array ends up with the precise type (Int8) we’re expecting.

Converting it back is basically applying the same process in reverse (adding 256 if the number is less than 0):
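The arithmetic itself can be sketched in Ruby. This mirrors the logic described above rather than the Crystal code; the helper names to_signed_bytes and from_signed_bytes are made up for this example, and Ruby’s integers don’t need the type conversions that Crystal does.

```ruby
# Hypothetical helper: unsigned bytes -> signed bytes (subtract 256 above 127).
def to_signed_bytes(input)
  input.bytes.map { |byte| byte > 127 ? byte - 256 : byte }
end

# Hypothetical helper: signed bytes -> UTF-8 String (add 256 below 0).
def from_signed_bytes(nums)
  unsigned = nums.map { |num| num < 0 ? num + 256 : num }
  unsigned.pack('C*').force_encoding(Encoding::UTF_8)
end

sample = "h\u00e9llo w\u00f6rld"  # hypothetical sample data
signed = to_signed_bytes(sample)
p signed == sample.unpack('c*')          # matches Ruby's own c* format
p from_signed_bytes(signed) == sample    # and round-trips back to the String
```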

Four-Byte Packs

Packing and unpacking using unsigned, 4-byte (32-bit) “words” is done with the N format option.

Unpacking

It may not be obvious yet, but we’ve actually lost some data in this conversion. This is because the purpose of #pack and #unpack actually has nothing to do with Strings. Strings here are just a container for holding the packed bits. The problem is that neither of our Strings holds a multiple of 32 bits, and incomplete “words” are invalid, so they aren’t unpacked. For this reason, if your goal is to convert a String to binary and back again, using 32-bit words is a bad choice. But, if your goal is to pack some binary data (and you can “zero fill” it so it is evenly divisible by 32) and convert it back to binary, this works great.
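To make the data loss concrete, here is a Ruby sketch with the hypothetical sample strings (5 and 13 bytes long, so neither is a multiple of 4 bytes):

```ruby
# Hypothetical sample data (assumed for this sketch).
@string1 = "hello"
@string2 = "h\u00e9llo w\u00f6rld"  # "héllo wörld"

words1 = @string1.unpack('N*')
words2 = @string2.unpack('N*')

p words1          # a single word built from the bytes of "hell"
p words2.length   # 3 words, covering only 12 of the 13 bytes

# The trailing byte of each String never made it into a word.
p @string1.bytesize - words1.length * 4
p @string2.bytesize - words2.length * 4
```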

So let’s reproduce these results in Crystal. Since it takes a little work, I’ll make a function to do it and run it against both Strings:

Note that when defining methods in Crystal, the space between input and : matters. With the space, it sets the expected type for that parameter; without it, input is treated as a keyword whose value is literally the String type itself.

Let’s go over what exactly this function does:

  • It takes a String as input
  • It creates a temporary store called nums for collecting the results of our algorithm
  • It iterates over the result of calling #bytes on our input String. How it iterates is important though: it operates on subsets (slices) of at most 4 Array items. Since N is supposed to operate on 4-byte words, this is exactly what we want. Notice though that I said at most 4 items; #each_slice(4) hands us 4 items at a time, and if there’s a remainder (between 1 and 3 items), it arrives as the last slice.
  • Since slices can be incomplete, we need the conditional wrapper around our conversion logic to only do the conversion if it is exactly 4 bytes in size.
  • We then create a temporary variable that contains the result of calling #map to convert each number to a binary string
    • We zero-fill (meaning we pad the front of the binary string with 0 if necessary) to make sure it is the full 8 bits in length. This is because we’ll be joining the 8-bit chunks before converting them to 32-bit numbers. Here’s a simple demonstration of why this is necessary:

Those missing digits matter… that’s similar to the difference between 20001 and 201; you can’t just ignore digits in the middle, even if they’re zeros.

  • Our final step is to convert from binary to a 32-bit integer.
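The steps above can be sketched in Ruby as well. This is a translation of the described algorithm, not the Crystal function itself, and unpack_n is a hypothetical name chosen for this example. The last few lines also show what goes wrong without zero-filling.

```ruby
# Ruby sketch of the steps above; unpack_n is a hypothetical helper name.
def unpack_n(input)
  nums = []
  input.bytes.each_slice(4) do |slice|
    next unless slice.size == 4   # ignore an incomplete trailing word
    # Zero-fill every byte to 8 binary digits before joining.
    bits = slice.map { |byte| byte.to_s(2).rjust(8, '0') }
    nums << bits.join.to_i(2)     # 32 binary digits -> one integer
  end
  nums
end

p unpack_n("hello") == "hello".unpack('N*')

# Why zero-filling matters: the bytes 1 and 1 should combine to 257.
unpadded = [1, 1].map { |byte| byte.to_s(2) }.join
padded   = [1, 1].map { |byte| byte.to_s(2).rjust(8, '0') }.join
p unpadded.to_i(2)   # 3, because "1" + "1" is just "11"
p padded.to_i(2)     # 257
```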

Packing

Converting back is much easier. Here’s the Ruby version:
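As a sketch, assuming the hypothetical 13-byte sample string "héllo wörld":

```ruby
# Hypothetical sample data (assumed for this sketch).
@string2 = "h\u00e9llo w\u00f6rld"  # "héllo wörld", 13 bytes

words2 = @string2.unpack('N*')   # 3 complete words, 12 bytes
packed = words2.pack('N*')

# Relabel the binary result as UTF-8 so it prints as text.
restored = packed.dup.force_encoding(Encoding::UTF_8)
p restored   # "héllo wörl", since the final "d" never made it into a word
```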

Again, the missing letters are expected. Let’s do the same in Crystal:

The Crystal version is basically the same as the Single-byte example, except we convert to unsigned 32-bit integers (UInt32) and we have to specify that the Byte format is Big Endian because of how the 32-bit words are joined.
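To show what big-endian means here, the following Ruby sketch splits each 32-bit word into four bytes by hand, most significant byte first. The helper name pack_n is hypothetical, and this is only an illustration of the byte order, not the Crystal code.

```ruby
# Hypothetical helper: split each 32-bit word into 4 bytes, big-endian.
def pack_n(nums)
  bytes = nums.flat_map do |num|
    # Shift by 24, 16, 8, 0 so the most significant byte comes first.
    [24, 16, 8, 0].map { |shift| (num >> shift) & 0xFF }
  end
  bytes.pack('C*')
end

word = 0x68656C6C        # the bytes of "hell" as one big-endian word
p pack_n([word])         # "hell"
p pack_n([word]) == [word].pack('N')
```

Reversing the shift order to 0, 8, 16, 24 would give the little-endian layout instead, which is why the byte format has to be stated explicitly.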

BER-Compressed Packs

The BER encoding format includes the ability to compress multi-byte words slightly. Ruby’s pack/unpack formats include w* which implements BER compression.

Unpacking

Here are the Ruby examples:
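A sketch of the Ruby calls, again assuming the hypothetical sample strings "hello" and "héllo wörld":

```ruby
# Hypothetical sample data (assumed for this sketch).
@string1 = "hello"
@string2 = "h\u00e9llo w\u00f6rld"  # "héllo wörld"

ber1 = @string1.unpack('w*')
ber2 = @string2.unpack('w*')

p ber1          # identical to the C* result
p ber2          # 9 integers instead of 13 bytes
p ber2.length
```

With this data, each accented character (two bytes above 127) is absorbed together with the ASCII byte that follows it into one larger integer.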

Notice that the first one is identical to C*; all the words are single-byte. The second string, however, includes some larger integers. Sure, the resulting Array has fewer elements, but is it really compressed? Let’s find out:

That’s 3 bits fewer, or a compression of about 4.5%. Sure, it isn’t much with such a small amount of data, but it’s something. So how does this magic work? Writing this in Crystal gives the answer. I’m certain that this could be optimized, but it gets the job done:

I had to use BigInt because, unlike the other formats, BER encoding can have arbitrarily large words, so writing it this way is safer.

The walkthrough of this code would be pretty lengthy, but in summary: numbers less than 128 are passed as-is (though they’re converted to Int32), unless they’re preceded by numbers greater than 127. In that case, the number is combined with however many consecutive numbers greater than 127 came immediately before it, and the result is returned as a BigInt. Note that this “combining” only works when it is done with the binary forms of these numbers, and each one must contribute exactly 7 bits (zero-filled where necessary).
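The same idea can be sketched in Ruby. Note the swap in technique: instead of joining zero-filled 7-bit binary strings, this version uses shifts and masks, which should produce the same result. The name unpack_ber is hypothetical, and since Ruby integers grow as needed there is no BigInt step.

```ruby
# Hypothetical helper: decode BER-compressed integers from a String's bytes.
def unpack_ber(input)
  nums = []
  pending = 0
  input.bytes.each do |byte|
    # Keep the low 7 bits of every byte and append them to the pending value.
    pending = (pending << 7) | (byte & 0x7F)
    # A byte below 128 has its high bit clear, which ends the current integer.
    if byte < 128
      nums << pending
      pending = 0
    end
  end
  nums   # trailing bytes without a terminating byte are dropped
end

sample = "h\u00e9llo w\u00f6rld"  # hypothetical sample data
p unpack_ber(sample)
p unpack_ber(sample) == sample.unpack('w*')
```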

Packing

Packing these back into our original String is essentially the reverse, though there’s more binary trickery involved:
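For comparison, here is a Ruby sketch of the packing direction using shifts and masks rather than binary strings. The helper name pack_ber is hypothetical, and it assumes non-negative integers only.

```ruby
# Hypothetical helper: encode non-negative integers as BER-compressed bytes.
def pack_ber(nums)
  bytes = nums.flat_map do |num|
    groups = [num & 0x7F]                    # last byte: high bit clear
    num >>= 7
    while num > 0
      groups.unshift((num & 0x7F) | 0x80)    # earlier bytes: high bit set
      num >>= 7
    end
    groups
  end
  bytes.pack('C*')
end

sample = "h\u00e9llo w\u00f6rld"  # hypothetical sample data
nums = sample.unpack('w*')
p pack_ber(nums) == nums.pack('w*')
p pack_ber(nums).force_encoding(Encoding::UTF_8) == sample
```

The high bit acts as a “more bytes follow” flag, which is why only 7 bits of each byte carry data.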

Phew! That was fun.

Conclusion

Obviously this barely scratches the surface for porting the very powerful #pack and #unpack methods to Crystal. I hope the work gets done because they’re extremely handy. Let me know if you have some other examples or see places where these can be improved.
