Create an executable research paper

Today I want to talk to you about a research paper, that is, at the same time, a computer program. What does that mean?

Well, just because a file says, in its name, that it’s an EXE, or executable program, doesn’t mean that it necessarily is. It also doesn’t mean that it isn’t both an executable program and something else at the same time. Like, a text file, for example, containing an interesting research paper.

Now, I didn’t just find this research paper — which is also a program — I made it, for SIGBOVIK this year. So I want to talk about how I did that, because I think it’s interesting. We’re going to learn about compilers, we’re going to learn about some bits, and some low-level assembly, and hopefully have some fun. Um, let’s get started!

Every file on the computer is just a bunch of bytes. And we’ll talk more about bytes in a second because there’s a particular thing about them that’s important for this work. But I’m not going to belabor it, because first of all, I know there are a lot of people that already know everything about bytes.

And then there are many people watching that don’t live and breathe this stuff. Both will probably be bored by the details. But also, the details are in the paper. A computer program is identified by a specific pattern of bytes.

And the important thing is that it begins with a header, and we’ll talk about the header in a little bit, and then it contains some data, and it contains some code. And all that it means to be a program is that it looks like this. Okay? And then of course it should work; it should have some kind of functionality. The functionality is provided by this code. You may know about source code.

Like something written in the C programming language, or a Python script. That’s not what I’m talking about here. This kind of code is represented as bytes, and it is the native instruction set — the native language — of the computer.

The chip actually understands this part of the program directly. Okay? In fact the way it does that is: It’s going to just load this file into memory. And then it’s going to say — and roughly speaking, it just puts the file somewhere in memory. And then it’s going to know from something in the header that it should sort of start here. And this is called the instruction pointer.

The thing that says where we’re currently executing. Then it’s going to load up some instruction, from the bytes in the code section, and it’s going to execute that. Then it’s going to go to the next instruction.

The computer, when it looks inside a program and starts running it, interprets a byte as an instruction and it knows most of the bytes to mean something. The byte 40 equals “INC AX,” where AX is some temporary inside the processor called a register, and it contains a number and it basically says: Increase that number by one. Increment. Some instructions are not just one byte. For example, 6A followed by 32 means “PUSH” the value 32 (some byte) onto the stack. And the stack is just some other thing that the processor has in it.

Bytes also have meaning–there’s a standard way to interpret them–as characters in a text file. And when you look at them that way, 40 means the @ sign, 6A is a lowercase J; 32 is the digit 2. Now only some bytes are considered printable. And those are the ones we’re going to be interested in for this paper. Printable means that there’s an agreed-upon character–basically something that appears on your keyboard–that everyone on all computer systems agrees: That byte means that thing. Now the printable bytes start here, with a space, which I’m going to draw this way.

We’ve got an exclamation mark; quotes; “hashtag”. So here we have the printable subset of bytes. Outside of this region is a bunch of hooey.

This character for example deletes. This character beeps. And up here you could put some of the characters together to create emoji like the pile of poo.

If you look inside almost any program, you’ll see a combination of all these characters; you’ll see printable ones but you’ll see also beeping, and deleting characters, and maybe even an emoji. Here’s an example of what that looks like. Total gibberish.

Okay, now I think I can really explain what I did. So we’re going to take a program, written in the C programming language, to be kind of old school. And here I’m talking about source code. So this is a thing that a normal programmer would write. Sort of. And feed it to a compiler, called ABC, which I also wrote.

It’s particular to this project. And that is going to output another file, an executable file, which means it needs to have a header and data and code, but every byte in this entire file will be a printable byte, so I can only use those 95 bytes. This also is not to scale.

This paper ends up having to be really really big. About 420 kilobytes. The reason this file needs to be so large is that the header, which will describe sort of where everything is, and how big it is, needs to be made of printable bytes, so we’re not able to use those small bytes. And if you know anything about this problem, I’ll mention right now that I do this in “expert mode.”

What I mean by that is that the code will not be self-modifying. So it will only execute instructions that are within the printable subset. This beast here is Intel’s manual, volume 2, instruction set reference, for the x86 series of processors. And pretty much everything that your computer can do is by way of the instructions that are in this book. So here I’ve opened it up to a random page; this is the cosine routine on floating point numbers.

And the restriction to printable bytes in our project today is going to leave us with about this much of the book left. A very small fraction of all instructions available to us. So here are all the instructions in the printable subset of x86.

The blue is the opcode byte; the black is the name of the instruction, which is what you’d find in the book, or programmers often think of the instructions by these names; and the red means that it takes some argument, that describes for example in subtraction, which two things to subtract. Now those red argument bytes also have to be printable, and that limits what we can do. Suffice to say, it’s complicated, but this table in the book shows us that only one small segment of all addressing modes are available.

And in particular I can’t just subtract one register from another. I always have to involve exactly one register and exactly one weird memory indirection. Unfortunately we’re missing some really important instructions that are in most programs. For example, we don’t have the MOV instruction which just moves data from one place to another. And we also don’t have the interrupt instruction, which is how you’d normally communicate with the operating system to do things like print things to the screen, or even exit the program, which is going to turn out to be a difficult issue. And although we have a family of jump instructions, which is important, because that’s how we’re going to change the instruction pointer around to go to different instructions and have some kind of loops in our program, the argument that the jump takes needs to be a printable byte, and this is going to mean that we only can make jumps that are from 32 to 127 bytes forward.

If you can only jump forward, the instruction pointer will always just go further and further down the program, and you won’t ever be able to make a loop, and that makes programming very very hard. So, compiling to the printable subset of instructions, with printable arguments, requires solving some puzzles. And it’s kind of a miracle that they can even be solved; sometimes just barely! I think it’s kind of cool, so I want to walk you through some of those, just to give you a flavor of what it’s like, and maybe learn something.

And the first one is just the computation of the bitwise OR function. The OR function operates on bits. Let’s suppose that we have two numbers: A and B.

“A OR B” means to look at every pair of bits, and if A contains a 1 or B contains a 1, or both, then the output is 1. But if both of them are 0, then the output is zero. So, in this case–which covers all of the possibilities, right?–the output is 1110.

Now we don’t have this instruction, but we can simulate it using some instructions we do have. One of those is AND. For AND, the output is 1 only if both A and B have a 1 in that position. Otherwise it’s 0. Another instruction we have is XOR, or “exclusive or.” For exclusive or, the output is 1 if exactly one of the two bits is 1; 0 if both are 1, or if both are 0.

Now observe the relationship between these two things and this output that we want to achieve. We have ones in all of the right positions, so in fact if we could OR these two things together, we’d be done. But of course we don’t have OR. However, we now have the case that there’s at most one in any column; either 10, 01, or 00. Which means that we could add these two together, and we’ll never need to carry, and we’ll get the OR that we desire.

This basically relies on the fact that PLUS is the same as OR if you never have two 1 bits in the same position because then you’d need to carry that to the next position, just like in decimal arithmetic. Now, we don’t have a PLUS operation, but we can implement PLUS using subtraction. And basically what we do is subtract the negative of a number in order to simulate plus. In two’s complement binary, the negative of a number is obtained by flipping all of the bits, 0 to 1 and 1 to 0, and then adding 1. We can add 1 with the increment instruction, which we already know is printable, and we can flip all of the bits by doing an exclusive or operation with a constant that’s just all 1s. So by doing tricks like this, we can decompose operations that we don’t have, like OR, into instructions that we do, like AND and XOR and increment.

And this is really the essence of compilation. This is what all compilers do when they translate source code, which contains operations that the computer can’t do directly, through a variety of transformations into much simpler code, but longer code, that’s phrased in terms of instructions that the computer can perform. So in the ABC compiler we first take in the C programming language, and parse it to produce an abstract syntax tree language. Then we do something called elaboration to get rid of some of the high level constructs like “for” loops and turn those into GOTOs.

That results in the C Intermediate Language, which is bigger but simpler code. Then we convert that to a low-level language, LLVMNOP (ha ha), where we make the layout of data explicit, and then finally perform some more transformations to larger but simpler code, into the printable x86 language, and then output the paper. We also perform some optimizations within a language itself, optimizing CIL to itself, and performing temporary allocation within the LLVMNOP language, to itself. ABC is something like 13,000 lines of Standard ML, and a lot of this is normal for a compiler, despite the fact that we’re solving some abnormal problems because of the output language.

Now transforming complex things into simple things does require good simple building blocks. And one that I’ve already relied on is the ability to load a particular constant value, like 1111, into a register. Although most programmers take this completely for granted, this is surprisingly hard in the printable subset, because for one thing any byte that’s not printable can’t just be loaded or added into a register. So now let’s talk about the apparently simple task of loading a particular value into a register. Normally we would just load a value into a register using the MOV instruction.

This doesn’t work because the MOV instruction isn’t printable, and if the value we want to load isn’t printable, that won’t work as well. The first observation is that we can at least make this register have some known value. We do that with the AND instruction. This byte, which is space, has only one 1 bit in binary. That means if the A register contains something, and we don’t know what yet, and we AND it with that value, all of these values will become 0, because they’re ANDed with 0. Only this one slot will remain possibly 1.

Then if we perform another AND, but with the value 40–which also has one 1, but in a different position–that’s guaranteed to turn off this one remaining bit that might still be 1. After doing these two ANDs, we know that A contains the value 0. Now what I’ll do is build a big matrix.

And the matrix will tell me how to get from some byte that I have in a register (on the left here), to some byte that I want (which I have on the top here). And each cell will give me a series of instructions to execute to transform that value into the desired value. Now when the input equals the output–that’s the whole diagonal here–I don’t need to execute any instructions. So I can start by just filling in the diagonal with the empty instruction list, which I’ll write as a dash. And this matrix actually has 256 values on this side and 256 values on this side; I’m not going to be able to fit it on the paper.

Now we know that we have the increment and decrement instructions, which means that adjacent to this diagonal, I can get from 3 to 4 by just executing “increment AX”. This will be true for this entire thing, and similarly decrement. Now in fact I can get to any number from 2 by just incrementing 2 over and over again, so I’ll initialize this entire matrix with repeated applications of increment. This one will have three copies; this one four; and similarly for decrement. That demonstrates that I can get from any number to any other number through a very long series of increments or decrements. That totally sucks, but it gets us started.

Now the next step is to look at this matrix and explore possibilities for shortcuts. For example, if I have 6 in the register, and I apply AND with say, 20, because 6 and 20 don’t share any bits, that will give me zero. So I’m able to replace whatever was in here, which was DEC DEC DEC DEC DEC, with just AND 20. In fact most of this column I’ll be able to fill with AND 20 or AND 40. Now the neat thing is, that because this gives me a plan for getting from any number to any other number, if I’m trying to figure out how to get from 6 to 1, for example, one of the things I can do is say: Well, what if I go from 6 to 0?

That’s just one instruction. And then I go from 0 to 1. That’s just one more instruction.

And what that gives me is that this cell, which used to be “decrement decrement decrement decrement decrement”, can now be AND 20, yielding 0, and then increment one time, yielding 1. In addition to the AND instruction I can also have SUB, and XOR, and some other funny tricks I can play in here. So I can take this matrix and I can simplify it by applying shortcuts, and then composing operations where I go from one value to another value, and then use that value’s shortcut to where I’m trying to get, over and over and over again, which makes the matrix better and better by making these cells faster and faster. And I can just keep doing that until I get to a minimal matrix. And it turns out it takes no more than four instruction bytes to get from any one value to any other value, which is pretty good. What I really like about this is how beautiful the matrix looks when you just think about how many instructions it takes to get from one value to another and plot that in a graphic.

Here’s what that looks like. This kind of fractal structure shows up a lot in computer science, and mathematics, and Hyrule. For example a paper I wrote a few years ago about drinking games had a similar kind of pattern; I’ll just show you what that looks like in three dimensions.

This is only for eight-bit bytes, but you can generalize this to sixteen-bit words, and you can play tricks with the stack in order to load each individual byte into a word, and that all works out. Okay, two more puzzles and then we can execute this paper. I mentioned before about the jump problem, which is that all jumps that we have access to will jump forward in the program. Jumping forward is great, but we also need to be able to jump backwards, in order to create loops.

For example, in the C programming language, a “while” loop just keeps executing the same piece of code until some condition becomes false. At first blush this seems impossible, because if I’m only jumping forward, the instruction pointer’s going to only ever get bigger. But there is one condition under which the instruction pointer gets smaller.

The instruction pointer is the thing that tells us where in the current program we’re executing. And this is a thirty-two bit number; that’s a lot of bits. That means it can represent a number up to 2^32, which is approximately 4.2 billion different positions in the file. And our file is not 4.2 billion; it’s only 420kb, and in fact the usable code section is only 64 kilobytes large. If this is our program, the instruction point starts here and it jumps forward a few times; we’ll get to the end.

If I execute past the end of my program, I will just keep going into memory, into whatever’s there, executing all sorts of weird instructions and probably doing stuff that I don’t want. FOR EXAMPLE, executing non-printable instructions! So this is no good. But there is a strange fact, which is that if we’re only using the 16-bit portion of the instruction pointer, that is, if these are all 0s, and we issue a jump that happens to cross the boundary here (this boundary is address FFFF, that is, the largest 16-bit word, all 1s in binary), then execution will wrap around. It won’t continue into the 32-bit portion, and it goes back to the beginning, modulo 2^16. This is kind of like when your odometer rolls over, or when you shoot so many ducks in Duck Hunt that your score goes back to zero.

And weirdly it only happens when your instruction pointer is already in the 16-bit range. Which means we can use this trick but only if our code stays within the 16-bit region. This is where that’s documented in the manual.

I should say you shouldn’t necessarily trust this manual; it has a bunch of typos, like this is a mistake right here. Really Intel? Backslash? So our whole program will fit within an address space, where the beginning is 0000 and the very last instruction byte is FFFF.

Of course not to scale. And the program will consist of blocks. Blocks have to be of a certain size.

They can’t be too small and they can’t be too large, because what we’re going to do is jump between those blocks and every time we ever want to make a jump, we’re always going to jump to the next block. And we have to be careful to put some jump at the very end of this program, so that it jumps, which wraps around, and brings us back to the start–that’s the whole point–but if we do that, then from any place we can technically reach anywhere else we want to go by doing a series of jumps. This series of blocks I call the “ladder.” Each block starts with what I call a “rung”, because, you know, ladders, and the rung always decrements a register called SI, and then says “jump not zero” to the next block.

So basically what will happen is if I want to jump a certain number of blocks, I will load SI with that number, so if I want to go 5 I’ll start with 5, and then I’ll jump here, decrement SI so that it’ll say 4, and then I’ll jump not zero which means, 4 is not zero, so I’ll jump to the next one and then it’ll become 3, and then 2 and then 1 and maybe wrap around. When it finally becomes 0, I will not jump and I’ll instead enter the block and execute it. We can also put jumps inside this block. So there could be a “jump if greater than” some value, and this always targets the next block. So if we’re executing one of these jumps I’m going to load SI with the right value. This is very reliable but it’s an extreme pain in the butt to lay them out, because no block can be too long and no block can be two short, because I can’t even jump 10 bytes; I have to jump at least 32.

Moreover, this whole process has a dependency on itself because I need to know how far to jump, I need that to be printable, but I don’t yet know how big these blocks will be. And unlike a normal assembling process, even a basic thing like loading a value into a register–for example how many blocks to jump–depends on the specific value being loaded. (Because, recall that table that we just computed.) So I have to do this in an iterative loop. I have to do this over and over again, laying it out, hoping that it will work, and then making minute adjustments until it does.

The other bummer about this is any backwards jump has to go through the entire program, and so some operations, like multiplication, which are implemented with loops because I don’t have the multiplication operation, take time that’s proportional to the size of the program instead of just being a single operation. The last trick to talk about is communicating with the operating system. And basically the story is: We can’t do it. This is normally done via something called an interrupt, which is the instruction whose opcode is CD. And the most important interrupt to know is INT 21. In DOS, this interrupt basically says “Hey!

Check this out!” You load up the registers with something you want to do, DOS takes over, and it does something like write to the screen or open a file. This byte CD is not printable.

Now there are some close calls that I describe in the paper, that almost allow us to execute interrupts. But one trick that I did find out involves, the way that interrupts themselves actually work, which is kind of neat. This is a very low-level concept in the processor, and as a result, whenever an interrupt happens, what the processor actually does is it looks inside memory, and it looks inside a specific part of memory, which is the very beginning.

And based on the interrupt number, it’s going to pick one of these slots, and it’s going to read out of that slot an address that will be the new instruction pointer. Since each one of these is four bytes long–32 bits long–the INT 21 handler will be at position 0084 at the beginning of memory. This thing is called the interrupt vector table, because it gives you a table for each interrupt of where to go. Now what’s useful about this is that some interrupts aren’t caused by the interrupt instruction. For example the timer interrupt is this one, and this interrupt just automatically happens because of the system clock every few milliseconds.

So we could put something in here, like for example the interrupt 21 handler, by just copying this into wherever the timer interrupt handler is, and then the computer will periodically execute the interrupt 21 handler. That’s not very useful for us, because we don’t want this to just keep happening every few milliseconds, and there are some technical reasons why it won’t even work. But there’s one other way that interrupts happen, and that’s when we execute an illegal instruction. That’s interrupt number 6.

We can control when an illegal instruction happens as long as we have one available to us, by just trying to execute one. So what we do, at program startup, using some hand-written assembly that’s only printable, is copy the address from the INT 21 handler–that’s DOS’s system calls–over top of the illegal instruction handler. Now if we can execute an illegal instruction, it will execute the same code that would happen if we called INT 21 normally. And do we have an illegal instruction? Yes we do! This one here is Adjust RPL Field of Segment Selector, which is completely useless in user programs, because it’s always illegal, and what I love about this instruction is this is a lowercase letter C, which means–this actually takes an argument–we can encode this illegal instruction as lowercase letter C, lowercase letter Y, lowercase letter A, and just an exclamation mark for emphasis.

This turns out to be this oddity, which is illegal. Now for technical reasons we can actually only use this trick one time before it stops working. So what I’ll use it for is to invoke interrupt 21 with AH equals 4C, which means “exit the program.”

And that’s okay, because we only exit the program once. YOEO. And this is why “cya!” is such an appropriate encoding. Let’s actually look inside the paper for a second. Most of this paper is text, and it’s just text that’s part of the data, or header even, of the executable.

There’s some drawings in there too. Once you get far enough along, we find the code. It’s not something that you can read, but it’s all made of keyboard characters.

And if we look at the end here it says “LX4 cya!” and that is where the program will be executing when it exits. Sorry, I know this video’s getting pretty long, but speaking of boring, a program that doesn’t do anything but loop and then exit is not very interesting.

And since we’ve used up our trick to contact the operating system, which is how you’d normally do input and output, we need another way to do it. Fortunately there is a low-level way for us to do that. This is a pinout of the Intel 8086, and the Intel processors, like most microprocessors, have a way to communicate with peripherals that are on the motherboard, by using something called I/O ports. An I/O port is pretty simple. Normally you’re reading and writing from the main memory of your computer, which is what we’ve been talking about the entire time when we’re talking about addresses. We use these data pins here in order to specify the address, and we also read or write the data on those same pins.

This pin over here allows us to switch that behavior to instead read and write from peripheral addresses, which basically instructs these pins to refer to a different part of the motherboard. That’s basically all that I/O ports are, but it’s a very low-level concept in the processor. Now it just so happens that we do have a pair of slightly awkward instructions for reading and writing the I/O ports.

And we’ll use these to communicate with some peripheral. So this paper doesn’t print anything out, but what it does do, and I’m about to show it to you, is play music playing this ancient FM synthesis sound card called the AdLib. Let’s go take a listen. Okay, here we are, finally, in computer world. I’m using a DOS emulator called DOSBox, but this also works perfectly fine in real DOS, as long as you have the right peripherals.

I wanted to show you that the paper is all printable bytes, but unfortunately it’s too big to even load into the editor. But we can run it, and what this program does, is it takes on the command line a description of music in a format called ABC. I didn’t invent this format. So if I say “ABC” — (music plays) — It’ll play those notes and then exit. It accepts more complicated input, including the possibility of playing more than one track at the same time.

Now I’ll just leave you with the built-in tune, which plays if you don’t provide it with any music of your own devising. Perfect music.