Code archaeology: Reading City of Heroes' .bin files
posted 07 Mar 2020 by Ruby
For the past month, I’ve been digging into the data files for a nearly twenty-year-old MMO. It’s a fascinating journey through not just the juicy details I’m looking for, but also how the original City of Heroes devs thought about structuring the very data for their game and the systems design behind the character data and abilities.
Why? Because it was fun! But it was also frustrating at times. The data structures are directly tied to how City of Heroes represents data in memory, to the point that it has pointer offsets encoded in the data. Without access to the source code, I probably would not have wanted to put in the time to reverse engineer everything to the degree I did, and I definitely would have second-guessed myself a lot. Even with the source, there was plenty that was downright confusing, which is why it’s taken me a month to get this far.
If you’re interested in learning more about City’s bin files and how to parse them, keep reading. If you just want the goods, however, go here. Let’s jump into it!
What are .bin files?
Pretty much all game data is stored in bin files, both on the client and the server. This allows for fast loading of the megabytes and megabytes of in-memory data that the game has to keep track of, everything from text to powers data to playable missions. If it isn’t a bin, it’s probably graphics: i.e., textures, geometry, and visual effects. On the client, all this data is compressed into archives called piggs. Piggs are another topic entirely, and I won’t cover them here. There are plenty of existing tools available that work on piggs.
Obviously, bin files don’t spring into existence out of nowhere. Originally, the developers of the game did most of their work in Excel spreadsheets so that they could work on character attributes and powers and the formulae that drive them in one place. A script was used to export the Excel data into def files, which are a text representation of the data that can be used directly by development builds of the game (I imagine so that devs could quickly tweak values while working without recompiling all of the bins). The def files are again another topic that I won’t get too far into here other than to say that they exist as an intermediary step.
When the game is actually packaged for release, all those def files get compiled into a binary format known as the bin file. This is a very efficient format that basically allows the game to re-assemble its state in memory at startup as quickly as possible. As such, several of the bin files, especially the ones I cover here, are interdependent, referencing data found in other bins. You can’t just read the game’s power tables; you also need to know how they relate to the archetypes, attributes, and other data. (I’ll detail that specifically below.)
What does a .bin file look like?
The bin files of course are a proprietary format. Each one starts with a signature that doesn’t change, plus a CRC that matches it to the client/server build to prevent accidentally trying to load the wrong version of a bin file. This is important since the data structures can change from build to build, so loading an incompatible bin would basically just crash everything.
Below is an example of the first few bytes of a bin file. The signature and filetype should be the same for all bin files. The CRC is just shown as a placeholder (*) because I completely ignore it when I’m parsing. It doesn’t mean anything if you’re not the game client/server.
Now’s a good time to talk about a few things with how bin data is stored. First, any string you encounter is going to be in a single-byte encoding (specifically the ISO 8859-1 code page, a superset of ASCII), which is annoying because you’ll likely need to convert it to UTF-8 to do anything in a contemporary programming environment. Second, all numbers are encoded little-endian. In the “Filetype length” above, the value 06 00 translates into 6, referring to the 6 bytes in the string “Parse7”.
“Parse7” refers to how the remainder of the data is stored. There are two formats I’m aware of: “Parse6” and “Parse7”. “Parse6” is the format used by the old live version of the game, and it is completely different from the format used from i25 forward. Because I’m working with Homecoming, which is an i25 server, everything in this article refers to “Parse7”. Apologies if you were looking for i24 information. The length at the front of the string “Parse7” is in fact an artifact of “Parse6”, which stored all strings Pascal-style, that is, prefixed with their length.
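As a rough sketch of reading that length-prefixed filetype string in Python: the function name `read_pascal_string` is made up for illustration, and the 16-bit width of the length is an assumption based only on the two bytes (06 00) shown above.

```python
import struct
from typing import Tuple


def read_pascal_string(buf: bytes, pos: int) -> Tuple[str, int]:
    """Read a length-prefixed string starting at pos.

    Assumption: the length is a 16-bit little-endian integer, inferred
    from the two bytes "06 00" in front of "Parse7".
    Returns the decoded text and the position just past the string.
    """
    (length,) = struct.unpack_from("<H", buf, pos)
    start = pos + 2
    raw = buf[start:start + length]
    if len(raw) != length:
        raise ValueError("string runs past the end of the buffer")
    # ISO 8859-1 maps every byte to a code point, so this never fails.
    return raw.decode("latin-1"), start + length


# The filetype bytes described in the text: 06 00 followed by "Parse7".
sample = bytes.fromhex("0600") + b"Parse7"
filetype, next_pos = read_pascal_string(sample, 0)
print(filetype, next_pos)
```

Decoding to a Python `str` right away sidesteps the UTF-8 conversion problem, since the string can be re-encoded however you like afterwards.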
Speaking of strings, that leads us into the next section of the bin file, what’s called the “string pool.” All of the strings referred to by later data structures in the file get packed at the beginning in one long set of data. The very first part of this data set is its length. This is an important concept for bins going forward; every structure is going to start with its length in bytes, so you know exactly how much data to read.
In this one instance, because the string data length is variable, padding is also added at the end so that the file stays aligned to 4-byte boundaries. (If you’re curious why this is done, I would start with this article… warning, it’s not a topic that’s for the faint of heart or particularly important for you to know for this discussion.)
Here’s an example of what it looks like:
| Data length | String data | Padding |
| 9B 50 03 00 | 217,243 bytes of NUL-terminated strings | 1 byte |
The data length comes first, as a 32-bit unsigned integer stored in little-endian byte order. Get used to seeing this, because most of the bin file is going to be either this or a 32-bit floating point. :) The bytes 9B 50 03 00 translate to 0x3509B, or 217,243 decimal. This is the length of the data section, minus the padding. Because this doesn’t divide into 4 evenly, a final padding byte is added to the end. You can figure out how many padding bytes there are with this simple formula:
(4 - (data_length % 4)) % 4
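To make the arithmetic concrete, here is a small Python sketch that decodes the example length and applies the padding formula. The helper names `padding_for` and `read_string_pool` are invented for this example, and the assumption is that `pos` points at the pool’s data length.

```python
import struct


def padding_for(data_length: int) -> int:
    """Number of padding bytes needed to reach the next 4-byte boundary."""
    return (4 - (data_length % 4)) % 4


def read_string_pool(buf: bytes, pos: int):
    """Return (pool_bytes, position_after_padding).

    Assumes pos points at the 32-bit little-endian data length.
    """
    (data_length,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    pool = buf[start:start + data_length]
    if len(pool) != data_length:
        raise ValueError("string pool runs past the end of the buffer")
    return pool, start + data_length + padding_for(data_length)


# The worked example from the text: 9B 50 03 00 -> 217,243 -> 1 padding byte.
(length,) = struct.unpack("<I", bytes.fromhex("9B500300"))
print(length, padding_for(length))

# A tiny synthetic pool: 5 bytes of data, so 3 bytes of padding follow.
tiny = struct.pack("<I", 5) + b"ab\x00c\x00" + b"\x00" * 3
pool, next_pos = read_string_pool(tiny, 0)
```

The outer `% 4` in the formula is what makes an already-aligned length produce 0 padding bytes instead of 4.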
So what is the string pool exactly? Well to be honest it’s one long run-on sentence of NUL-terminated strings. At this point, you may be tempted to go ahead and split it into a vector of individual strings or similar, but you need to keep them together like this. The reason for this is that any string data you read from the structures further into the file is actually an offset into the string pool. To extract the desired string, start at the offset and then read until you hit a 00 byte. Alternatively, if you do want to pre-split them, make sure you record the original offset where the string started.
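Both approaches can be sketched in a few lines of Python. The names `string_at` and `split_with_offsets` are hypothetical, and the sample pool is synthetic rather than taken from a real bin file.

```python
def string_at(pool: bytes, offset: int) -> str:
    """Read the NUL-terminated string that starts at offset in the pool."""
    if not 0 <= offset < len(pool):
        raise ValueError("offset is outside the string pool")
    end = pool.find(b"\x00", offset)
    if end == -1:
        # No terminator found: take everything up to the end of the pool.
        end = len(pool)
    return pool[offset:end].decode("latin-1")


def split_with_offsets(pool: bytes) -> dict:
    """Pre-split the pool, keyed by each string's original start offset."""
    strings = {}
    offset = 0
    while offset < len(pool):
        end = pool.find(b"\x00", offset)
        if end == -1:
            end = len(pool)
        strings[offset] = pool[offset:end].decode("latin-1")
        offset = end + 1
    return strings


# Synthetic pool for illustration only.
pool = b"alpha\x00beta\x00"
print(string_at(pool, 6))
print(split_with_offsets(pool))
```

Keying the pre-split strings by offset keeps lookups cheap while preserving the only identifier the rest of the file actually uses.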
Unfortunately, this is where most commonalities between bin files stop. As I mentioned earlier, bin files are direct representations of the game’s data structures, so reading any given bin file from this point forward is determined by those structures. The next section will discuss the powers data that I was specifically interested in.
Reassembling powers data
Re-assembling all of the game’s power data takes several different bin files. As I mentioned earlier in this post, the bin data is frequently hierarchical and interdependent. Here are the bins I had to disassemble:
- clientmessages-en.bin: Contains all of the localized strings used through the game client. I’ll talk about it more a bit later.
- attrib_names.bin: Contains names of some of the character attributes. (Things like defenses, damage types, statuses, etc.)
- classes.bin: Contains information about the archetypes. This is important to the discussion about powers because you have to combine some of this data with the powers data to get final values.
- powercats.bin: High level power categories, e.g. “blaster primary power sets.”
- powersets.bin: Power sets data.
- powers.bin: Individual powers.
Getting all the data I needed for my project required drawing a line from classes to power categories to power sets to powers. (Except I didn’t realize that at first and started with powers and worked my way backwards.) Luckily, in order for the game client itself to re-establish this link, each of those structures contains named references to the others, so when you scan an archetype, it’ll tell you what power categories to look at, and then the power category will tell you what power sets, and so forth.
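The chain of named references can be pictured as a few nested lookups. The following Python sketch uses plain dictionaries and placeholder names ("Class_A", "Set_A", and so on) that are invented for illustration; the real structures carry far more than a list of names.

```python
def powers_for_class(class_name, classes, categories, power_sets):
    """Follow named references: class -> categories -> power sets -> powers.

    Each argument after class_name is a hypothetical dict mapping a name to
    the list of names it references at the next level down.
    """
    powers = []
    for category_name in classes[class_name]:
        for set_name in categories[category_name]:
            powers.extend(power_sets[set_name])
    return powers


# Placeholder data, not real game content.
classes = {"Class_A": ["Category_A"]}
categories = {"Category_A": ["Set_A", "Set_B"]}
power_sets = {"Set_A": ["Power_1", "Power_2"], "Set_B": ["Power_3"]}

print(powers_for_class("Class_A", classes, categories, power_sets))
```

Starting from the class and walking down means you only ever touch powers that are actually reachable from an archetype.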
For the purpose of this discussion, I’m just going to focus on the powers.bin file, since it’s the most complex, and give you a few high-level examples. I could write entire articles on individual sections of this structure because it’s so complex. Why? Well because the powers are basically the entire game, more or less. (See “What’s in a character?” below.) On average each power is about 80 KB of data, and there are around 22,000 of them. It’s a huge, messy wad of data. Again, this is why the bin files are arranged the way they are; it’s all about how fast we can get them off disk and into the final state we need them in memory.
Picking up from the past section, just past the string pool in powers.bin is where the powers data starts. It’s one giant list of powers, and it starts with a data length just like the string pool. Right after that data length, however, is another 32-bit unsigned integer value. This is the size of the powers array in elements. So now you have to switch to iterating by count rather than bytes.
Pseudo-code looks something like this:
    ' Assume ReadU32() reads a 32-bit unsigned integer from the file.
    Let Data_Length = ReadU32()
    Let Powers_Size = ReadU32()

    ' Powers_Size is an element count, so the last index is Powers_Size - 1.
    For Index In 0 To Powers_Size - 1
        ' Read a Power ...
    Next Index
    ' At this point it's a good idea to verify you read Data_Length bytes, but not necessary.
Once you’re reading an individual power, it gets into the data you expect. You read in a data length for the whole struct, and then you can start reading in individual fields.
    Let Data_Length = ReadU32()
    Let Full_Name = ReadU32()
    Let CRC_Full_Name = ReadU32()
    Let Source_File = ReadU32()
    Let Name = ReadU32()
    Let Source_Name = ReadU32()
    Let System = ReadU32()
    ' ... more fields
These are literally the first six fields in a power (after the data length), but it continues on for a lot longer. At this point, you’ll see that every read is a 32-bit unsigned integer so far. This is one of the things that makes reading bin files easy, but makes deciphering them a bit harder. Every value you encounter is going to be a 4-byte value of some sort. Some of them are 32-bit unsigned integers, some of them are 32-bit floating points. Both are stored little-endian as I talked about earlier, and floating points are stored in IEEE 754 format. You probably don’t need to worry about either of those too much; any modern programming language will have standard ways of converting them, so check its documentation.
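In Python, for instance, the standard `struct` module handles both conversions. The `BinReader` class below is a hypothetical helper written for this example, not something taken from the game or any existing tool.

```python
import struct


class BinReader:
    """Minimal cursor over a bytes buffer (illustrative helper)."""

    def __init__(self, data: bytes, pos: int = 0):
        self.data = data
        self.pos = pos

    def read_u32(self) -> int:
        # "<I" = little-endian, unsigned 32-bit integer
        (value,) = struct.unpack_from("<I", self.data, self.pos)
        self.pos += 4
        return value

    def read_f32(self) -> float:
        # "<f" = little-endian, IEEE 754 single-precision float
        (value,) = struct.unpack_from("<f", self.data, self.pos)
        self.pos += 4
        return value


reader = BinReader(bytes.fromhex("9B500300") + struct.pack("<f", 1.5))
first = reader.read_u32()
second = reader.read_f32()
print(first, second)

# The same four bytes are valid either way, which is why deciphering is hard:
ambiguous = bytes.fromhex("0000803F")
as_int = struct.unpack("<I", ambiguous)[0]
as_float = struct.unpack("<f", ambiguous)[0]
print(as_int, as_float)
```

Nothing in the bytes themselves says which interpretation is right; you only know from the struct definition or from which value looks sane for the field.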
You may have noticed that things like “Name” are also stored as integers. As I mentioned in the section on string pools, this is the offset into the string pool you’re looking for, so it’s trivial to convert it to a string. How do you know if the data you just read is a string? Well if you’re referring to the source code like I did, you can just look. :) Or if you see a really odd value, try using it as an offset into the string pool and see if it returns something that makes sense for that field.
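That “try it and see” check can be partly automated. The Python sketch below rests on an assumption that is not confirmed by anything above: that a genuine offset always points at the start of a string, meaning offset 0 or the byte right after a NUL. The function name `plausible_string` is invented.

```python
def plausible_string(pool: bytes, value: int):
    """Return the string at `value` if it looks like a real pool offset, else None.

    Heuristic only. Assumes real offsets point at the start of a string and
    that real strings contain no control characters.
    """
    if not 0 <= value < len(pool):
        return None
    if value != 0 and pool[value - 1] != 0:
        # Lands in the middle of a string: probably not an offset.
        return None
    end = pool.find(b"\x00", value)
    if end == -1:
        end = len(pool)
    raw = pool[value:end]
    # Empty strings are treated as "not plausible" here to cut down on noise.
    if not raw or any(byte < 0x20 for byte in raw):
        return None
    return raw.decode("latin-1")


# Synthetic pool for illustration only.
pool = b"alpha\x00beta\x00"
print(plausible_string(pool, 6))    # start of a string
print(plausible_string(pool, 7))    # mid-string
print(plausible_string(pool, 999))  # out of range
```

A small integer will often pass this test by coincidence, so treat a hit as a hint to check against the field’s expected meaning, not as proof.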
Okay, now for the tricky part. As one would expect when dealing with a high-level programming language, data structures can contain other data structures and arrays. What does that look like?
For an array of primitive values, it’s a pattern you’ve already seen. It starts with the size of the vector.
    ' ... somewhere further down
    Let Array_Size = ReadU32()
    Let Attack_Types = Vector( Of U32 )

    ' Array_Size is an element count, so the last index is Array_Size - 1.
    For Index In 0 To Array_Size - 1
        Attack_Types.Push( ReadU32() )
    Next Index
Unfortunately there’s not any type of marker in the bin file that tells you something is an array rather than any other series of 32-bit values. You just have to know ahead of time.
For nested data structures, it’s important to know that every one is itself stored as an array, even if there’s only ever 1 in that spot. Then the first thing you read will actually be the data length of the structure before moving on to the structure itself.
    ' ... even further down
    Let Array_Size = ReadU32()
    Let Effect_Groups = Vector( Of Effect_Group )
    For Index In 0 To Array_Size - 1
        Effect_Groups.Push( ReadEffectGroup() )
    Next Index

    ' ...

    Function ReadEffectGroup()
        Let Data_Length = ReadU32()
        Let Effect_Group = New Effect_Group

        Let Array_Size = ReadU32()
        For Index In 0 To Array_Size - 1
            Effect_Group.Tags.Push( ReadU32() )
        Next Index

        Effect_Group.Chance = ReadFloat()

        ' Read a bunch more fields ...

        ' At the end, again it's a good idea to verify Data_Length

        Return Effect_Group
    End Function
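For something you can actually run, here is the same pattern as a Python sketch. It makes two loud assumptions: the data length counts the bytes that follow the length field itself, and the effect group is cut down to just the two fields shown above. Because the real struct has many more fields, the sketch uses the data length to skip whatever it did not read.

```python
import struct


class BinReader:
    """Minimal cursor over a bytes buffer (illustrative helper)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def read_u32(self) -> int:
        (value,) = struct.unpack_from("<I", self.data, self.pos)
        self.pos += 4
        return value

    def read_f32(self) -> float:
        (value,) = struct.unpack_from("<f", self.data, self.pos)
        self.pos += 4
        return value


def read_u32_array(reader: BinReader) -> list:
    """Array of primitives: an element count, then that many values."""
    count = reader.read_u32()
    return [reader.read_u32() for _ in range(count)]


def read_effect_group(reader: BinReader) -> dict:
    data_length = reader.read_u32()
    # Assumption: the length covers the bytes after the length field.
    end = reader.pos + data_length
    tags = read_u32_array(reader)
    chance = reader.read_f32()
    if reader.pos > end:
        raise ValueError("read past the end of the struct")
    # Skip the fields this simplified sketch does not decode.
    reader.pos = end
    return {"tags": tags, "chance": chance}


def read_effect_groups(reader: BinReader) -> list:
    """Nested structs: an element count, then each length-prefixed struct."""
    count = reader.read_u32()
    return [read_effect_group(reader) for _ in range(count)]


# Synthetic data: one group with tags [10, 20], chance 0.5, one unread field.
body = struct.pack("<III", 2, 10, 20) + struct.pack("<f", 0.5) + struct.pack("<I", 0xDEADBEEF)
data = struct.pack("<II", 1, len(body)) + body
reader = BinReader(data)
groups = read_effect_groups(reader)
print(groups)
```

Skipping by length is handy while you are still deciphering a struct: you can decode the first few fields you understand and still stay in sync with the rest of the file.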
So that last example got a bit more complex, but you can see how you can keep going down the rabbit hole of nested arrays and structs. Things get really complex, especially with powers. Most of the other bins I encountered weren’t anywhere near as complex.
Unfortunately it can be really difficult to tell how to read the bits exactly without looking at the game’s source code for the struct definitions. If you’re feeling adventurous, there are a few tips at the end of this post on how to go in blind.
If you want to see how I did it, see the “Powers parser in Rust” section below.
What’s in a character?
I didn’t realize this until I dug into them, but the powers are basically everything active going on in the game. Your character’s powers, obviously. But also boosts (called enhancements in the game UI), and inspirations, and applied buffs, and so on. The game is basically crunching all of these down into your character’s base attributes to determine how everything works. Amusingly, this means that things like damage are applied kind of in the same way a defense buff is—it’s just that one depletes one part of a table immediately and the other boosts a different part of the table over a period of time. As far as the game is concerned, you’re just a bag of numbers. This is where that pointer arithmetic I mentioned earlier comes into play; it’s running the same handful of calculations and just constantly shifting the window of where it’s looking at the tables.
The character data aside from a few very specific fields to track your customizations is just giant tables of character attributes. One might think that a lot of values are calculated at run-time, like how much damage your T1 attack does at level 23, but that’s not really how it works. While the original data was based on formulae stored in Excel spreadsheets as I described earlier, the game instead deals with huge tables that have all the possible values from level 1 to 50 for each potential attribute. Depending on the attribute, these are then added to or multiplied at runtime with whatever your current modifiers are.
For powers specifically, these are applied by the effect groups, which represent a potential set of attribute modifiers that can be applied to the target of a power. It’s important to note not all powers have their own effect groups. You need to also check the redirects, as some powers just reference other powers for their final effect.
Effect groups are the most complicated part of this puzzle, because each power behaves in specific ways described by one or more effect groups that have a fairly complex set of data stored in them. The effects of some do simple attribute math, but others are handled very specially by the game’s code. This is a part I’m still working on deciphering myself to try and boil them down into something that’s easier to understand without knowing how the game works internally.
What the heck is a P-string?
So, back to clientmessages-en.bin. (I promised I would explain it!) When you pull strings out of the string pool, you’ll notice a lot of them just say cryptic things like “P2041358”. These “p-strings,” not to be confused with the Pascal-style strings I talked about earlier, are how City of Heroes handled localization back when it was distributed in multiple languages. The number is a hash based on the contents of the string, so this also serves as a de-duper.
The table you want is stored in clientmessages-en.bin, and it’s just a big lookup table of keys to values. If localization were still enabled, there’d be complementary bins called things like clientmessages-fr.bin that you would look at instead.
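Once that table is loaded, resolving a p-string is a dictionary lookup. In this Python sketch the function names, the `P` plus digits pattern (generalized from the single example “P2041358”), and the table contents are all assumptions for illustration; how the table is actually laid out in the bin file is not covered here.

```python
import re

# Assumption: p-strings are a capital "P" followed only by digits.
PSTRING_PATTERN = re.compile(r"^P\d+$")


def is_pstring(text: str) -> bool:
    return PSTRING_PATTERN.match(text) is not None


def resolve_pstring(text: str, messages: dict) -> str:
    """Swap a p-string key for its localized text; leave anything else alone."""
    if is_pstring(text):
        # Fall back to the key itself if the table has no entry for it.
        return messages.get(text, text)
    return text


# Placeholder table; the value is not real game text.
messages = {"P2041358": "<localized text for P2041358>"}

print(resolve_pstring("P2041358", messages))
print(resolve_pstring("Not a key", messages))
```

Falling back to the key keeps missing entries visible in the output, which makes gaps in the lookup table easy to spot.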
If you want to see how to read this file, check this code.
Powers parser in Rust
Luckily, if you don’t want to start from scratch, I have an entire project written in Rust that is designed to read in the bin files for the powers specifically and then spit them back out as JSON. The output is not a full representation of the data; there are a lot of fields that really aren’t interesting unless you’re the game’s own internal code or want to draw the graphics. Also, don’t worry if you don’t know Rust. I tried to avoid making it too Rust-y so that others could follow along. If you have experience with any programming language, I think you should be able to understand it, and I left plenty of code comments behind.
It’s open source (MIT licensed) so you can even re-use the code yourself if you want. Additionally if you’re just interested in the output, you can find that over here.
If you’re specifically interested in the parsing part, that’s all in the bin_parse module. The rest of the program is mostly just doing the grunt work of assembling the data and re-writing it into JSON.
Hey, I want to try this
Okay, so the topic of reverse engineering in general is way too big to get into for this article. But I do have some tips if you really want to try reading bins yourself, and you don’t want to use the game’s source code as a reference. (Which, to be honest, is not as easy as it sounds… deciphering the token parsing code took me a fair bit of work.)
- Start with the header and string pool as I already described above.
- Everything is 4-byte aligned once you get past the string pool.
- When you first start, try to see if the first element is an array size. Most (but not all) of the bins I worked with had a single root array containing all the data.
- Whether or not it looks like an array, the next thing to try is a data length. There are a couple of very rare exceptions, such as fixed arrays where you have to know how many elements to read ahead of time.
- Once you get started with that first array or struct, look for more structs/arrays that are nested. You can basically figure out the entire container structure without knowing anything about the data, and then fill in the specific data fields after the fact.
- If you see a seemingly random set of bytes that’s not a data length, it could be a floating point value or an offset into the string pool.
- Look for very specific repeated values to act as markers. As an example, one thing that helped me was that one of the fields in the power struct had a very specific default value of 999. I used this as an anchor to make sure I was reading things around it consistently.
- Keep in mind, the devs can change the structures at any time when they do a new build. The powers in particular seem to change with every new issue/page, and your parser will break.
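The marker tip in particular is easy to automate. This Python sketch scans 4-byte-aligned positions for a chosen value. Since the tip above does not say whether the 999 default is stored as an integer or a float, the sketch checks both encodings; the function name `find_marker` is invented.

```python
import struct


def find_marker(buf: bytes, marker: int = 999, start: int = 0) -> list:
    """List (offset, kind) for every aligned 4-byte word equal to the marker.

    `start` should itself be 4-byte aligned, e.g. the first byte after the
    string pool padding. Both the u32 and the f32 encodings are tried.
    """
    as_u32 = struct.pack("<I", marker)
    as_f32 = struct.pack("<f", float(marker))
    hits = []
    for pos in range(start, len(buf) - 3, 4):
        word = buf[pos:pos + 4]
        if word == as_u32:
            hits.append((pos, "u32"))
        elif word == as_f32:
            hits.append((pos, "f32"))
    return hits


# Synthetic buffer: 1, then 999 as an integer, 999.0 as a float, then 7.
buf = struct.pack("<II", 1, 999) + struct.pack("<f", 999.0) + struct.pack("<I", 7)
print(find_marker(buf))
```

If the hits turn up at a regular-looking distance from the start of each struct you are reading, that is a good sign your field layout before the marker is right.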