Query-Based Compiler Architectures

2020-06-25 19:17 · ollef.github.io

What query-based compilers are and what they're good for.

Note: This is an old post, originally from the documentation of the Sixten programming language, that I've touched up and fleshed out. Since it was written I've found out about Salsa, a Rust library with goals very similar to those of my Rock library, which is definitely worth checking out as well!

Background

Compilers are no longer just black boxes that take a bunch of source files and produce assembly code. We expect them to:

  • Be incremental, meaning that if we recompile a project after having made a few changes we only recompile what is affected by the changes.
  • Provide editor tooling, e.g. through a language server, supporting functionality like going to definition, finding the type of the expression at a specific location, and showing error messages on the fly.

This is what Anders Hejlsberg talks about in his video on modern compiler construction that some of you might have seen.

In this post I will cover how this is achieved in Sixten by building the compiler around a query system.

For those of you that don't know, Sixten is an experimental functional programming language created to give the programmer more control over memory layout and boxing than most other high-level languages do. The most recent development of Sixten is being done in the Sixty repository, and is completely query-based. The original post embeds a little video giving a taste of what its language server can do, showing type-based completions.

Traditional pipeline-based compiler architectures

A traditional compiler pipeline might look a bit like this:

+-----------+            +-----+                +--------+               +--------+
|           |            |     |                |        |               |        |
|source text|---parse--->| AST |---typecheck-+->|core AST|---generate--->|assembly|
|           |            |     |             ^  |        |               |        |
+-----------+            +-----+             |  +--------+               +--------+
                                             |
                                      read and write
                                           types
                                             |
                                             v
                                       +----------+
                                       |          |
                                       |type table|
                                       |          |
                                       +----------+

There are many variations, and often more steps and intermediate representations than in the illustration, but the idea stays the same:

We push source text down a pipeline and run a fixed set of transformations until we finally output assembly code or some other target language. Along the way we often need to read and update some state. For example, we might update a type table during type checking so we can later look up the type of entities that the code refers to.
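
To make the contrast with the query-based approach concrete, here is a minimal sketch of that pipeline shape. Every name in it (TypeTable, Assembly, parse, typecheck, generate) is invented for illustration and not taken from Sixten:

import Data.IORef (IORef)

-- Hypothetical pipeline driver: parse, type check and generate, in a fixed order.
compileModule :: IORef TypeTable -> FilePath -> IO Assembly
compileModule typeTableRef filePath = do
  sourceCode <- readFile filePath
  let ast = parse sourceCode
  -- The type checker both reads and writes the shared type table.
  coreAst <- typecheck typeTableRef ast
  pure (generate coreAst)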

Traditional compiler pipelines are probably quite familiar to many of us, but how query-based compilers should be architected might not be as well-known. Here I will describe one way to do it.

Going from pipeline to queries

What does it take to get the type of a qualified name, such as "Data.List.map"? In a pipeline-based architecture we would just look it up in the type table. With queries, we have to think differently. Instead of relying on some previously updated piece of state, we compute the answer as if from scratch.

As a first iteration, we do it completely from scratch. It might look a little bit like this:

fetchType :: QualifiedName -> IO Type
fetchType (QualifiedName moduleName name) = do
  fileName <- moduleFileName moduleName
  sourceCode <- readFile fileName
  parsedModule <- parseModule sourceCode
  resolvedModule <- resolveNames parsedModule
  let definition = lookup name resolvedModule
  inferDefinitionType definition

We first find out which file the name comes from (Data/List.vix for Data.List, say), then read the contents of the file, parse it, do name resolution to find out what the names in the code refer to given what is imported, and finally look up the name-resolved definition and type check it, returning its type.

All this just to get the type of an identifier? It seems ridiculous, because looking up the type of a name is something we'll do loads of times during the type checking of a module. Luckily we're not done yet.

Let's first refactor the code into smaller functions:

fetchParsedModule :: ModuleName -> IO ParsedModule
fetchParsedModule moduleName = do
  fileName <- moduleFileName moduleName
  sourceCode <- readFile fileName
  parseModule sourceCode

fetchResolvedModule :: ModuleName -> IO ResolvedModule
fetchResolvedModule moduleName = do
  parsedModule <- fetchParsedModule moduleName
  resolveNames parsedModule

fetchType :: QualifiedName -> IO Type
fetchType (QualifiedName moduleName name) = do
  resolvedModule <- fetchResolvedModule moduleName
  let definition = lookup name resolvedModule
  inferDefinitionType definition

Note that each of the functions does everything from scratch on its own, i.e. each does a (longer and longer) prefix of the work you'd do in a pipeline. I've found this to be a common pattern in my query-based compilers.

One way to make this efficient would be to add a memoisation layer around each function. That way, we do some expensive work the first time we invoke a function with a specific argument, but subsequent calls are cheap as they can return the cached result.
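
For instance, a memoisation layer for fetchParsedModule alone could be a small sketch like the following, keeping its own IORef-based cache (this assumes an Ord instance for ModuleName):

import Data.IORef
import qualified Data.Map as Map

-- One cache just for this function: return the cached module, or parse on a miss.
memoisedFetchParsedModule
  :: IORef (Map.Map ModuleName ParsedModule)
  -> ModuleName
  -> IO ParsedModule
memoisedFetchParsedModule cacheRef moduleName = do
  cache <- readIORef cacheRef
  case Map.lookup moduleName cache of
    Just parsedModule ->
      pure parsedModule
    Nothing -> do
      parsedModule <- fetchParsedModule moduleName
      modifyIORef' cacheRef (Map.insert moduleName parsedModule)
      pure parsedModule

Doing this separately for every function quickly becomes repetitive, which is where the central cache below comes in.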

This is essentially what we'll do, except that instead of a separate cache per function we'll have a central cache indexed by the query. This functionality is provided by Rock, a library that packages up some of the pieces needed for creating query-based compilers.

The Rock library

Rock is an experimental library heavily inspired by Shake and the Build systems à la carte paper. It essentially implements a build system framework, like make.

Build systems have a lot in common with modern compilers since we want them to be incremental, i.e. to take advantage of previous build results when building anew with few changes. But there's also a difference: Most build systems don't care about the types of their queries since they work at the level of files and file systems.

Build systems à la carte is closer to what we want. There the user writes a bunch of computations, tasks, choosing a suitable type for keys and a type for values. The tasks are formulated assuming they're run in an environment with a function fetch of type Key -> Task Value, where Task is a type for describing build system rules; fetch can be used to retrieve the value of a dependency with a specific key. In our example above, the key type might look like this:

data Key
  = ParsedModuleKey ModuleName
  | ResolvedModuleKey ModuleName
  | TypeKey QualifiedName

The build system has control over what code runs when we do a fetch, so by varying that it can do fine-grained dependency tracking, memoisation, and incremental updates.

Build systems à la carte is also about exploring what kind of build systems we get when we vary what Task is allowed to do, e.g. if it's a Monad or Applicative. In Rock, we're not exploring that, so our Task is a thin layer on top of IO.

A problem that pops up now, however, is that there's no satisfactory type for Value. We want fetch (ParsedModuleKey "Data.List") to return a ParsedModule, while fetch (TypeKey "Data.List.map") should return something of type Type.

Indexed queries

Rock allows us to index the key type by the return type of the query. The Key type in our running example becomes the following GADT:

data Key a where
  ParsedModuleKey :: ModuleName -> Key ParsedModule
  ResolvedModuleKey :: ModuleName -> Key ResolvedModule
  TypeKey :: QualifiedName -> Key Type

The fetch function gets the type forall a. Key a -> Task a, so we get a ParsedModule when we run fetch (ParsedModuleKey "Data.List"), like we wanted, because the return type depends on the key we use.

Now that we know what fetch should look like, it's also worth revealing more concretely what the Task type looks like in Rock. As mentioned, it's a thin layer around IO, providing a way to fetch keys (like Key above):

newtype Task key a = Task { unTask :: ReaderT (Fetch key) IO a }
newtype Fetch key = Fetch (forall a. key a -> IO a)

The rules of our compiler, i.e. its "Makefile", then become the following function, reusing the functions from above:

rules :: Key a -> Task a
rules key = case key of
  ParsedModuleKey moduleName ->
    fetchParsedModule moduleName

  ResolvedModuleKey moduleName ->
    fetchResolvedModule moduleName

  TypeKey qualifiedName ->
    fetchType qualifiedName
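
For the dependency tracking described below to see anything, the helper functions themselves would also go through fetch rather than call each other directly. Here is a sketch of that rewrite, not Rock's exact API: it spells out the key parameter (Task Key a) that the text above abbreviates as Task a, and it assumes Monad and MonadIO instances for Task Key, which could be derived from the ReaderT definition.

import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Reader (ask)

-- Ask the build system for the value of a key.
fetch :: Key a -> Task Key a
fetch key = Task $ do
  Fetch doFetch <- ask
  liftIO (doFetch key)

fetchParsedModule :: ModuleName -> Task Key ParsedModule
fetchParsedModule moduleName = liftIO $ do
  fileName <- moduleFileName moduleName
  sourceCode <- readFile fileName
  parseModule sourceCode

fetchResolvedModule :: ModuleName -> Task Key ResolvedModule
fetchResolvedModule moduleName = do
  parsedModule <- fetch (ParsedModuleKey moduleName)
  liftIO (resolveNames parsedModule)

fetchType :: QualifiedName -> Task Key Type
fetchType (QualifiedName moduleName name) = do
  resolvedModule <- fetch (ResolvedModuleKey moduleName)
  let definition = lookup name resolvedModule
  liftIO (inferDefinitionType definition)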

Caching

The most basic way to run a Task in Rock is to directly call the rules function when a Task fetches a key. This results in an inefficient build system that recomputes every query from scratch.
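
As a sketch (not Rock's actual API), such a naive runner could look like this, answering every fetch with a fresh call to the rules function:

import Control.Monad.Trans.Reader (runReaderT)

-- Run a Task by recomputing every fetched key from scratch.
runTaskNaively :: (forall b. key b -> Task key b) -> Task key a -> IO a
runTaskNaively rules_ (Task task) =
  runReaderT task (Fetch (\key -> runTaskNaively rules_ (rules_ key)))

For example, runTaskNaively rules (fetch (TypeKey someName)) would compute the type of someName, re-running every parse and name resolution it needs along the way.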

But the Rock library lets us layer more functionality onto our rules function, and one thing that we can add is memoisation. If we do that, Rock caches the result of each fetched key by storing the key-value pairs of already performed fetches in a dependent hashmap. This way, we perform each query at most once during a single run of the compiler.
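
A sketch of such a memoisation layer, using dependent-map's DMap as a stand-in for the dependent hashmap Rock uses, and assuming a GCompare instance for Key (derivable with dependent-sum-template) so that keys with different result types can be compared:

import Control.Monad.IO.Class (liftIO)
import Data.Dependent.Map (DMap)
import qualified Data.Dependent.Map as DMap
import Data.Functor.Identity (Identity (..))
import Data.GADT.Compare (GCompare)
import Data.IORef

-- Wrap a rules function so that each key is computed at most once per run.
memoise
  :: GCompare key
  => IORef (DMap key Identity)
  -> (forall b. key b -> Task key b)
  -> key a
  -> Task key a
memoise cacheRef rules_ key = Task $ do
  cache <- liftIO (readIORef cacheRef)
  case DMap.lookup key cache of
    Just (Identity value) ->
      -- Cache hit: this query has already been performed during this run.
      pure value
    Nothing -> do
      value <- unTask (rules_ key)
      liftIO (atomicModifyIORef' cacheRef (\c -> (DMap.insert key (Identity value) c, ())))
      pure value

Running runTaskNaively (memoise cacheRef rules) then gives the "at most once per run" behaviour described above, modulo two threads racing to compute the same key, which Rock handles more carefully.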

Verifying dependencies and reusing state

Another kind of functionality that can be layered onto the rules function is incremental updates. When it's used, Rock keeps track, in a table, of what dependencies each task used when it was executed (much like Shake), i.e. what keys it fetched and what the values were. Using this information it can determine when it's safe to reuse the cache from a previous run of the compiler even though there might be changes in other parts of the dependency graph.
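
A rough sketch of the bookkeeping this needs, in the spirit of the verifying traces from Build systems à la carte rather than Rock's actual representation (in practice key would be an existential wrapper around the Key GADT, with Eq and Hashable instances):

import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap

-- Placeholder for a hash of a query result.
newtype Hash = Hash Int
  deriving (Eq)

-- What was recorded the last time a key was built.
data TraceEntry key = TraceEntry
  { traceValueHash :: Hash               -- hash of the value the key produced
  , traceDependencies :: [(key, Hash)]   -- keys it fetched, with the value hashes it saw
  }

-- A cached result can be reused if every recorded dependency still hashes to the
-- same value; currentHash would itself go through the (re)build machinery.
upToDate
  :: (Eq key, Hashable key)
  => HashMap key (TraceEntry key)
  -> (key -> IO Hash)
  -> key
  -> IO Bool
upToDate traces currentHash key =
  case HashMap.lookup key traces of
    Nothing -> pure False
    Just entry ->
      and <$> traverse
        (\(dep, recorded) -> (== recorded) <$> currentHash dep)
        (traceDependencies entry)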

This fine-grained dependency tracking also allows reusing the cache when a dependency of a task changes in a way that has no effect. For example, whitespace changes might trigger a re-parse, but since the AST is the same, the cache can be reused in queries that depend on the parse result.

Reverse dependency tracking

Verifying dependencies can be too slow for real-time tooling like language servers, because large parts of the dependency graph have to be traversed just to check that most of it is unchanged even for tiny changes.

For example, if we make changes to a source file with many large imports, we need to walk the dependency trees of all of the imports just to update the editor state for that single file. This is because dependency verification by itself needs to go all the way to the root queries for all the dependencies of a given query, which can often be a large proportion of the whole dependency tree.

To fix this, Rock can also be made to track reverse dependencies between queries. When e.g. a language server detects that a single file has changed, the reverse dependency tree is used to invalidate the cache just for the queries that depend on that file by walking the reverse dependencies starting from the changed file.
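
A sketch of that invalidation step, again with a plain key type standing in for an existential wrapper around Key: given a map from each key to the keys that fetched it, collect everything transitively reachable from the changed key and evict only those cache entries.

import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.HashSet (HashSet)
import qualified Data.HashSet as HashSet

-- Walk the reverse dependency graph breadth-first from the changed key.
affectedKeys
  :: (Eq key, Hashable key)
  => HashMap key (HashSet key)  -- reverse dependencies: key -> keys that fetched it
  -> key                        -- the key that changed, e.g. a file's contents
  -> HashSet key                -- everything whose cached value must be invalidated
affectedKeys reverseDeps changed = go (HashSet.singleton changed) [changed]
  where
    go seen [] = seen
    go seen (key : queue) =
      let dependents = HashMap.lookupDefault HashSet.empty key reverseDeps
          new = HashSet.difference dependents seen
      in go (HashSet.union seen new) (HashSet.toList new ++ queue)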

Since the imported modules don't depend on that file, they don't need to be re-checked, resulting in much snappier tooling!

Closing thoughts

Most modern languages need to have a strategy for tooling, and building compilers around query systems seems like an extremely promising approach to me.

With queries the compiler writer doesn't have to handle updates to and invalidation of a bunch of ad-hoc caches, which can be the result when adding incremental updates to a traditional compiler pipeline. In a query-based system it's all handled centrally once and for all, which means there's less of a chance it's wrong.

Queries are excellent for tooling because they allow us to ask for the value of any query at any time without worrying about order or temporal effects, just like a well-written Makefile. The system will compute or retrieve cached values for the query and its dependencies automatically in an incremental way.

Query-based compilers are also surprisingly easy to parallelise. Since we're allowed to make any query at any time, and they're memoised the first time they're run, we can fire off queries in parallel without having to think much. In Sixty, the default behaviour is for all input modules to be type checked in parallel.
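
A sketch of that idea using the pieces from the earlier sketches (async's forConcurrently_, the memoise wrapper and the naive runner); Sixty's actual driver is wired up differently, but the shape is the same:

import Control.Concurrent.Async (forConcurrently_)
import qualified Data.Dependent.Map as DMap
import Data.IORef (newIORef)

-- Type check a batch of definitions in parallel, sharing one query cache.
checkInParallel :: [QualifiedName] -> IO ()
checkInParallel names = do
  cacheRef <- newIORef DMap.empty
  forConcurrently_ names $ \name ->
    runTaskNaively (memoise cacheRef rules) (fetch (TypeKey name))

Because the workers share the memoised cache, a module's parsing and name resolution typically happen only once even when many of its definitions are checked at the same time.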

Lastly, I hope this post has inspired you to use a query-based compiler architecture, and given you an idea of how it can be done.


Comments

  • By sly010 2020-06-26 2:32 (8 replies)

    I have some negative feelings about this trend (of increased integration in compilers), but I can't quite put my finger on the reason.

    Before the language server idea came along, all compilers were pure functions. Dependency management and caching were the responsibility of the build system. A flexible build system could handle things that the language designers haven't thought of, like code generation, mixed language projects, linking between different languages etc. Things are very composable and extendable and everything can be modeled as a DAG.

    With language servers and incremental compilation becoming part of some long running compiler process the responsibilities are blurred, it all leads to increased integration, less flexibility and when things break, you will not be able to tell why.

    Aren't we giving up too much to get faster recommendations in IDEs?

    • By comex 2020-06-26 7:05

      Contrary to other replies, I think this is a reasonable concern. For example, the traditional C model allowed people to relatively easily write wrappers that do transparent caching (ccache) and parallelism (distcc), which is harder with a more integrated model.

      Except:

      > Aren't we giving up too much to get faster recommendations in IDEs?

      The goal of incremental compilation isn't just IDE recommendations. It's faster compile times. C's system of header files and file-level dependency tracking works for C, but modern C++ severely strains its limits, considering how templates often force you to put all your code in header files. Rust doesn't even have header files and uses a single compilation unit for an entire library, which is much worse… but Rust is working on fine-grained query-based incremental compilation, which might someday let it leapfrog C++ in rebuild times.

      Maybe.

    • By bcherny 2020-06-26 3:42

      I wonder if you could approach this like you’d approach designing any other data structure, by first nailing down what operations you are optimizing for.

      We could optimize for ease of iterating on the compiler, or for ease of integration into an ecosystem of various build systems. Or for correctness, consistency, compilation speed, etc.

      Or, maybe it turns out that saving developers’ time is more important than any of these, as long as we keep all of these within reasonable bounds, since it’s by a factor of 10,000 the most common operation you’re going to be performing.

    • By drivebycomment 2020-06-26 5:44 (1 reply)

      I honestly fail to see anything we're giving up.

      What makes you assume somehow compilers won't support the pipeline manner? What "less flexibility" have you actually suffered from? When was the last time you had to debug compiler problems yourself?

      • By sly010 2020-06-26 12:05 (1 reply)

        If the compiler and the build system is so coupled, I am now stuck with that build system.

        Specifically, compiling frontend code with a generic build tool like bazel is torture, because the entire javascript ecosystem consists of giant monolithic everything-in-a-box toolkits like webpack.

        • By drivebycomment 2020-06-26 20:23 (1 reply)

          > If the compiler and the build system is so coupled, I am now stuck with that build system.

          This seems like a strawman. Please give me an example of this.

          Your example has nothing to do with what most people consider a "compiler" (something that reads in code written in some programming language and transforms it into some lower-level form, either machine code or some other format). Bazel is not a compiler, and it does not have any compiler built in.

          • By sly010 2020-06-27 3:09 (1 reply)

            Webpack is a highly integrated build system that has very high integration with the language and the compilers it drives. e.g. You don't have an explicit dependency graph, your graph is implicitly defined by your source code imports together with your random magic build level configuration. You also don't have build steps, it's just a big ... thing. Once you start doing things the "webpack way", (e.g. using loaders) you lost all chance of using any other build system (e.g. bazel).

            Imho this integration between the build system (webpack) and the compilers (typescript/babel) is bad and accidental. It happened because they were all written in the same language and because it was possible for the build system (webpack) to use the compiler as a library.

            I am worried that if we start creating flexible query-based compilers for a language, it will make sense for the query language to be in the same language and for the build system to be in the same language, which will lead to coupling, which will lead to lack of flexibility.

            I hope this makes sense.

            As an analogy, try to contrast the unix philosophy with a Spark cluster for data processing workflows. You don't need to be a "unix expert" to extend the unix toolkit, but organizations have Spark teams...

            • By drivebycomment 2020-06-27 3:20

              I'm afraid this is the end of our meaningful discussion and we'll have to agree to disagree, since you seem to have a very different definition of what a compiler is than I do.

              Webpack doesn't "compile" anything in the traditional compiler sense. From the webpack documentation:

              > Note that webpack will not alter any code other than import and export statements.

              So, from my perspective, your example isn't relevant to the topic at all, as webpack is not and does not contain a compiler like LLVM or GCC or Java compiler / JVM or Haskell compiler/runtime or any such system, and has no relevance whatsoever to this discussion of "increased integration in compilers" w.r.t. compile servers and IDEs.

    • By ratmice 2020-06-26 3:39 (3 replies)

      The assertion "all compilers were pure functions" is a strange one, because it is almost entirely backwards.

      The purity of compilers was abandoned almost immediately (when they started creating a file a.out and writing to that instead of writing binaries to stdout, in the C preprocessor when #include was added, and in the assembler with the .incbin directive). If compilers were pure, there would be zero need for Makefile-style build systems which stat files to see if they have changed.

      While it's true that Makefiles and their ilk are modeled as a DAG, the only reason an external file/DAG is actually necessary is the impurity in the compilation process.

      There have been very few compilers which have even had a relatively pure core (TeX is the only one that I can actually think of). Language servers are, if anything, moving them to a purer model, simply because they send sources through some file descriptor rather than having to construct some graph out of filenames.

      Long story short, "purity" in this sense means a compiler is a function from source text -> binary text; "foo.c" is not source text, and a bunch of errors is not binary text.

      At least language servers take in source text as input.

      • By rumanator 2020-06-26 6:13 (1 reply)

        > the purity of compilers was abandoned almost immediately (when they started creating a file a.out and writing to that instead of writing binaries to stdout

        I don't understand your point. A function doesn't cease to be a function if it sends its output somewhere else.

        > and in the c-preprocessor when #include was added,

        The C preprocessor is not the compiler. It's a macro processor that expands all macros to generate the translation unit, which is the input that compilers use to generate their output.

        And that is clearly a function.

        > If compilers were pure, there would be zero need for Makefile style build systems which stat files to see if they have changed.

        That assertion makes no sense at all. Compilers take source code as input and output binaries. That's it. The feature you're mentioning is just a convenient trick to cut down build times by avoiding recompiling source files that haven't changed. That's not the responsibility of the compiler. That's a function whose input is the source files' attributes and whose output is a DAG of files that is used to run a workflow where in each step a compiler is invoked to take a specific source file as input in order to generate a binary.

        It's functions all the way down, but the compiler is just a layer in the middle.

        > while Makefiles and their ilk are modeled as a dag is true, The only reason an external file/dag is actually necessary is due to impurity in the compilation process.

        You have it entirely backwards: build systems exist because compilers are pure functions with specific and isolated responsibilities. Compilers take source code as input and generate binaries as output. That's it. And they are just a component in the whole build system, which is comprised of multiple tools that are designed as pure functions as well.

        • By ratmice 2020-06-26 7:49 (2 replies)

          > I don't understand your point. A function doesn't cease to be a function if it sends it's output somewhere else.

          I think here lies the miscommunication, I'm talking about pure functions, it doesn't cease to be a function, but it does cease to be a pure one if sending its output somewhere else is done by side-effect.

          • By haakonhr 2020-06-26 11:03

            I guess there is pure and pure. Pure in the sense of no side-effects at all, such as for example writing to a file, and pure in the sense of not relying on state.

          • By rumanator 2020-06-26 12:33

            > I think here lies the miscommunication, I'm talking about pure functions, it doesn't cease to be a function, but it does cease to be a pure one if sending its output somewhere else is done by side-effect.

            No, you're getting it entirely wrong. The input is the translation unit, the output is the binaries. If you understand what a compiler does and what its role in a build process is, then you'll understand it works as a pure function. There are no side-effects. Translation units in, binaries out.

            I suggest you spend some time with a tiny C or C++ project trying to arrive at an executable by performing all build steps by hand instead of using a Makefile or any form of high-level build system.

      • By sly010 2020-06-26 11:52 (1 reply)

        > The only reason an external file/dag is actually necessary is due to impurity in the compilation process.

        But files also make the various pieces of the compiler chain interoperable and allow me to define a DAG. That's exactly why make is so powerful and that's exactly what I'd hate to lose.

        Modern compilers do a lot, and understandably they are trying to avoid writing out some partially calculated state to disk (e.g. serializing an AST to disk between stages would be doing work twice). But moving everything into the process means your compiler becomes a walled garden.

        You can see this happening in the javascript world. Very few people actually know what WebPack does. It's a giant black box with infinite number of switches and everything is "magic".

        • By ratmice 2020-06-26 19:24

          I totally agree with this and think it's a valid concern. It's difficult to get the best of both worlds here within the confines of the unix process model.

          The query style compiler isn't by necessity a single process. You could imagine an implementation based on the actor model or almost any other message passing system where the queries are driven externally. That the expedient way to do this is by stuffing everything into a giant process is regrettable.

      • By Ace17 2020-06-26 5:31 (1 reply)

        > The only reason an external file/dag is actually necessary is due to impurity in the compilation process.

        No compilation process can know about the parts of your project written in another language.

        • By ratmice 2020-06-26 5:58 (1 reply)

          I didn't do a good job of explaining it, the thing is that if your compiler is truly pure, your source input is a node in a graph, the compiler is an edge, and the output is another node. Given 2 languages with pure compilers the entire compilation process inherently admits a DAG, rather than having to reconstruct the DAG in order to drive the compilation process.

          • By rumanator 2020-06-26 6:35 (1 reply)

            > I didn't do a good job of explaining it, the thing is that if your compiler is truly pure, your source input is a node in a graph, the compiler is an edge, and the output is another node.

            You've just described the role of a compiler in a build system.

            > Given 2 languages with pure compilers the entire compilation process inherently admits a DAG

            That's what you're getting wrong. Building is not the same thing as compiling. A software build is a workflow comprised of one or more jobs, where compiling is one of the many types of jobs that's performed. In fact, the process of generating a library or executable from source code is a multi-step process where compiling is only one of the many steps. You get a source code, which may be processed by a macro processor to generate the source code to be compiled, then the compiler generates binaries from that source code, which are then handed to the linker, etc etc etc.

            • By ratmice 2020-06-26 7:15 (1 reply)

              What build systems get wrong from a purity perspective is that the compiler has hidden inputs and outputs from the build system's perspective.

              You cannot infer from the inputs and outputs the DAG shape. And so you must construct an artificial mapping to those inputs and outputs.

              All I am saying is that in pure compilers this is unnecessary, because inputs and outputs and compilation steps readily form a graph shape.

              It's just hard to explain because our familiar approach (to operating systems in general) is not well suited to it.

              Couple of introspective questions: From a makefile can you reconstruct the input, send it over the network and successfully compile it without knowledge specific to the compilation step?

              Can you modify the input in memory, and recompile it without saving?

              Can you start the final step when the first step has finished?

              All of these things should be realistically possible in a pure build process where each step is pure. Some of these questions relate to the input, some whether the DAG accurately reflects the real DAG underlying the build steps, etc etc etc.

              Anyhow, I've said all that I really want to say; that our familiar systems don't readily admit it is a separate topic altogether.

              • By rumanator 2020-06-26 12:42

                > What build systems get wrong from a purity perspective

                Build systems are not getting anything wrong. You're just struggling with a bunch of misconceptions about what a build system is and does, and what role a compiler plays in the whole process.

                > is that the compiler has hidden inputs

                There are no hidden inputs. There are only the compiler settings and project configs, and even the choice of compiler, which are all managed and set by the build system, and then there is the source code.

                That's it.

    • By pjmlp 2020-06-26 8:22

      All UNIX compilers with their primitive toolchains; these kinds of ideas go all the way back to Xerox development environments.

      Lucid and IBM already had query based compiler architectures for their C++ toolchains (Energize C++ and Visual Age for C++ v4).

    • By Orphis 2020-06-26 10:19

      If you cache only at the file level, then you're going to miss a lot of opportunities and still repeat a lot of work.

      You could cache at the function level, or each template separately. You could cache includes automatically on files that change frequently to avoid parsing headers again and again, without having to manually specify precompiled headers (yes, sure, modules should help there soon).

    • By chriswarbo 2020-06-26 7:27

      "Traditional" compilers (e.g. GCC, GHC, javac, etc.) are essentially single-purpose black boxes: source code goes in, executable comes out.

      Usually that source code must be on disk, often it must be arranged in a certain directory structure (e.g. src/module/...), and sometimes it must be in files with particular names (e.g. javac). This forces programmatic use of the compiler to be more complicated, e.g. setting up temporary directories to appease these rules.

      That single-purpose is a common use-case, but certainly not the only one. Traditional compilers typically perform pre-processing, lexing, parsing, precedence resolution, name resolution, macro expansion, type inference, type checking, optimisation, code generation, linking, stripping, etc. all within the same binary (there are some exceptions, e.g. the C pre-processor can also be invoked separately).

      In my experience, this is the opposite of composable and extendable! Each of these steps is very useful in its own right, yet we typically have no way to invoke them independently, e.g. to parse code into a structured form; or to infer the type of an expression; or to resolve a name; or to optimise an AST; etc.

      To make this composable and extendable in the way you suggest, we would need to make these separate processes, piped together with a build tool (e.g. make, or a helper script). In practice this doesn't happen, but some projects have hooks into their code for extensibility; e.g. GCC can be run with different front- and back-ends, and the "middle-end" can be extended with new passes and plugins (finally!); GHC has some limited plugin functionality, and has a (very flaky!) Haskell API for invoking its different stages; etc.

      My point is that the "traditional" world was pretty awful for composability and extendability. From the outside, we had big opaque compiler processes invoked by Make. If we're willing to drop down to the compiler's implementation language, there were some limited facilities to make them do something other than their usual source files -> binary task.

      If we look at the post, we see that it's talking about "drop down to the compiler's implementation language" rather than standalone processes with stdio and Makefiles. However, the approach it's talking about is precisely one of pure functions (e.g. `fetchType`) and a flexible build system (Rock), providing composability and extendability. It even says this explicitly, e.g.

      > The rules of our compiler, i.e. its "Makefile", then becomes the following function, reusing the functions from above:

      Note that the post isn't specifically about LSP; it only mentions "providing editor tooling, e.g. through a language server". It doesn't even talk about long-running processes. As a counter-example, it would be pretty trivial to expose these 'tasks' as standalone commands, piping through stdio, if we really wanted to. So we're not "giving up too much"; we would be gaining composability and extendability!

      As for "faster recommendations in IDEs", that's a complete straw man. The post gives the example of querying for the type of a qualified function name, and a few others (e.g. resolving names). Sure, those would be useful for IDEs, but they would also be useful for many more systems. Some examples, off the top of my head:

      - Code search, e.g. searching by type (like Hayoo, but more general and less flaky); finding usages across a package repo (this relies on name resolution)

      - Chasing (resolved) name occurrences, e.g. to find downstream projects impacted by a breaking change; or to speed up delta-debugging by only checking commits which change code used by the test.

      - Documentation generators can benefit from looking up types.

      - Static analysers benefit from name resolution, type inference/lookup, etc.

      Personally I've spent years on projects which use compilers for things other than their usual source files -> executable task, and their lack of composability and extendability is painful (my comment history is full of rants about this, mostly regarding GHC!). The approach described in this post would be amazing to see in "real" languages (i.e. those with lots of users and code, where more tooling and automation would provide a lot of benefit). I've often thought about a similar approach to this 'query-based' design, and would love to see things go even further in this direction (e.g. to a Prolog-style database of code)

  • By hobo_mark 2020-06-25 22:29 (1 reply)

    Reminded me of this lecture from last year:

    Responsive compilers - Nicholas Matsakis - PLISS 2019

    https://youtube.com/watch?v=N6b44kMS6OM

    (Of course it's based on Rust, but the same principles would be applicable elsewhere)

  • By foobar_ 2020-06-26 0:00

    Can someone build an HN filter for these types of links? Sick of all the political bikeshedding.
