<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Arjun Narayan blog]]></title><description><![CDATA[I write about things, mostly databases and distributed systems but also other things in computer science.]]></description><link>https://www.arjunnarayan.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg</url><title>Arjun Narayan blog</title><link>https://www.arjunnarayan.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 23:12:39 GMT</lastBuildDate><atom:link href="https://www.arjunnarayan.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Arjun Narayan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[arjunnarayan@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[arjunnarayan@substack.com]]></itunes:email><itunes:name><![CDATA[Arjun Narayan]]></itunes:name></itunes:owner><itunes:author><![CDATA[Arjun Narayan]]></itunes:author><googleplay:owner><![CDATA[arjunnarayan@substack.com]]></googleplay:owner><googleplay:email><![CDATA[arjunnarayan@substack.com]]></googleplay:email><googleplay:author><![CDATA[Arjun Narayan]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Problem with Lying is Keeping Track of all the Lies]]></title><description><![CDATA[Over on the Materialize blog I wrote a piece on database isolation, and what goes wrong when you have weak isolation.]]></description><link>https://www.arjunnarayan.com/p/the-problem-with-lying-is-keeping</link><guid isPermaLink="false">https://www.arjunnarayan.com/p/the-problem-with-lying-is-keeping</guid><dc:creator><![CDATA[Arjun Narayan]]></dc:creator><pubDate>Wed, 05 Jun 2024 19:58:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over on the Materialize blog I wrote <a href="https://materialize.com/blog/keeping-track-lies/">a piece on database isolation</a>, and what goes wrong when you have weak isolation.</p>]]></content:encoded></item><item><title><![CDATA[The Uses and Abuses of Cloud Data Warehouses]]></title><description><![CDATA[Over on the Materialize blog, I wrote a piece about what often goes wrong in data infrastructure.]]></description><link>https://www.arjunnarayan.com/p/the-uses-and-abuses-of-cloud-data</link><guid isPermaLink="false">https://www.arjunnarayan.com/p/the-uses-and-abuses-of-cloud-data</guid><dc:creator><![CDATA[Arjun Narayan]]></dc:creator><pubDate>Thu, 27 Jul 2023 19:56:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>Over on the Materialize blog, I wrote <a href="https://materialize.com/blog/warehouse-abuse/">a piece</a> about what often goes wrong in data infrastructure. There&#8217;s <a href="https://news.ycombinator.com/item?id=37146532">a discussion</a> on Hacker News as well.</p>]]></content:encoded></item><item><title><![CDATA[Kafka is not a database]]></title><description><![CDATA[Over on the Materialize blog I wrote a post about what Kafka does and doesn't do.]]></description><link>https://www.arjunnarayan.com/p/kafka-is-not-a-database</link><guid isPermaLink="false">https://www.arjunnarayan.com/p/kafka-is-not-a-database</guid><dc:creator><![CDATA[Arjun Narayan]]></dc:creator><pubDate>Tue, 08 Dec 2020 20:53:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over on the Materialize blog I wrote a post about what Kafka <a href="https://materialize.com/blog/kafka-is-not-a-database/">does and doesn't do</a>.</p>]]></content:encoded></item><item><title><![CDATA[The Philosophy of Computational Complexity ]]></title><description><![CDATA[Has there been progress in philosophy? Yes, in theoretical computer science.]]></description><link>https://www.arjunnarayan.com/p/the-philosophy-of-computational-complexity</link><guid isPermaLink="false">https://www.arjunnarayan.com/p/the-philosophy-of-computational-complexity</guid><pubDate>Sun, 27 May 2018 20:40:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Tyler Cowen asks: <a href="https://archive.is/o/fjkSZ/https://marginalrevolution.com/marginalrevolution/2018/05/has-there-been-progress-in-philosophy.html">Has there been progress in philosophy</a>? He cheats a bit, pointing mainly to progress in science in his 16 answers. Perhaps that is the point.</p><p>Philosophy's successes rapidly create new fields, leaving them only to tackle the new frontier. So over time it appears they've done no work. I am sympathetic to this view, and some AI researcher friends of mine also occasionally feel similarly aggrieved as their successes spawn new fields (search, vision, natural language processing), but they are left hanging the bag marked 'problems on which we've made no progress'.</p><p>I like Tyler's 16 answers (some more than others), but I think there's a 17th significant omission, which is the tremendous advances we've made in our understanding of computational complexity. This is, in my estimation, the field in which we have made the most progress in the past 50 years (the field did not really exist 50 years ago, and it's been 46 years since <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems">Karp's seminal paper</a>). I would go so far as to say that we currently lie on the cusp of a potential scientific revolution, akin to where theoretical physics was in 1899.</p><p>First, computational complexity is not well-understood outside of computer science, and this is a pity. 
<a href="https://archive.is/o/fjkSZ/https://www.scottaaronson.com/papers/philos.pdf">Philosophers in particular should pay more attention to computational complexity</a>! Second, among its most natural audience - other computer scientists - I've frequently heard expressions of derision pointed in its direction. The sentiment I've heard most is that the basic structure that complexity research gives us is useful (for example we deeply care if we're using an <code>O(n log n)</code> algorithm or an <code>O(n\^2)</code> algorithm in practice), but the more complicated complexity classes are off the rails. The last significant advance we had was in classifying the NP-complete problems. But we KNOW that <code>P!=NP</code>. Why do we need a proof? The final crux of the argument comes from an appeal to that favorite jackass of armchair physicists, <a href="https://archive.is/o/fjkSZ/www.scottaaronson.com/talks/feynman.pptx">Richard Feynman</a>: Feynman apocryphally had trouble understanding why it was an open question at all! As Scott Aaronson has said multiple times, if we were physicists, we would have long declared <code>P!=NP</code> to be a "law of the universe" and moved on<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. But I care. First, because the theory is important as-is. But if you're not swayed by analogy to <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/The_Making_of_the_Atomic_Bomb">the practical consequences of advancements in theoretical physics</a>, I'll give you two solidly practical reason to care. First, all of cryptography depends crucially on our understanding of complexity. Second, <a href="https://archive.is/o/fjkSZ/https://people.csail.mit.edu/costis/simplified.pdf">the computational complexity of markets</a> are a crucial theoretical design criterion, one that has historically been underemphasized in market design, <a href="https://archive.is/o/fjkSZ/https://press.princeton.edu/titles/11222.html">but will increasingly become more relevant as our markets get more computationally complex</a>.</p><h3>P and NP in 2 minutes</h3><p>A brief introduction to the problem at hand: <code>P</code> is the class of problems that have efficient polynomial-time solutions. <code>NP</code> is the class of problems that have efficient polynomial time <em>verifications</em> of correct solutions.</p><p>I'll give you an example: multiplying two numbers together is in <code>P</code>. If you increase the number of digits in my input, I have to do more but not that much more work (strictly speaking, about <code>n\^1.425</code> work), where n is the number of digits in the larger input. <code>n\^1.425</code> is a polynomial, and thus this entire problem is "in polynomial time", <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication">the proof is by construction</a>.</p><p>Here's an example of a problem that is in <code>NP</code>: factoring a number. We don't have a constructive algorithm that will give us a number's factors in polynomial time (it is an open question if there is one), but if a genie were to present you with the factors, you can very easily verify if the purported factors are indeed correct, by multiplication. 
<p>Now there are some problems in <code>NP</code> that are "as hard" as all other problems in <code>NP</code>, because a solution can be used to bootstrap solutions to all other <code>NP</code> problems. The classic example is the <a href="https://archive.is/o/fjkSZ/https://simple.wikipedia.org/wiki/Travelling_salesman_problem">traveling salesman problem</a>. If you had a genie that could always solve traveling salesman problems, you could easily translate any other <code>NP</code> problem (such as factoring) into an instance of the traveling salesman problem, give it to your genie, take the resulting solution, and use that to get the factors. We thus call traveling salesman "<code>NP</code>-complete". There are a surprisingly <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems">large number of practical problems that are </a><code>NP</code><a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems">-complete</a>, and identifying a wide set of them basically kicked off this field.</p><p>The question now arises: are <code>NP</code>-complete problems solvable in polynomial time? This may seem like a surprising question, because the instinctive answer is "Of course not! Why on Earth would you think it would be so?" But a proof has been surprisingly elusive, and meanwhile, we've built more and more systems that implicitly rely on <code>P!=NP</code> and its various downstream consequences, so it would be nice to know for sure.</p><h3>The Computational Universe and the Physical Universe</h3><p><code>P!=NP</code> is the biggest open question today, and not just because of the clichéd joke that if <code>P=NP</code>, we'd solve the other 5 outstanding <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Millennium_Prize_Problems">Clay Millennium Prize Problems</a>. It's the biggest outstanding question because it has the chance to reveal very fundamental flaws in the entire scaffolding of our theoretical understanding of the computational universe. Many different---and currently very plausible---approaches to answering the <code>P=NP</code> question could very well be akin to the <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Michelson-Morley_experiment">Michelson-Morley experiment</a>, revealing gigantic flaws in our understanding of the computational universe, just as that experiment pretty much put the kibosh on luminiferous aether<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Anyone who confidently asserts that <code>P!=NP</code> is dangerously close to Lord Kelvin's assertion in 1900 that "There is nothing new to be discovered in physics now. All that remains is more and more precise measurement."</p><p>Let me elaborate on where we currently stand: over the past fifty years, we've built up our understanding of the computational universe on top of a house of cards, and each year we add another level of cards atop. At the base are a few fundamental open questions on which we have reasonable educated guesses (I will not resist this opportunity to snark back at physicists; these are what physicists would call "laws").
Many of those cards on the top are themselves attempts to collapse the house, and in a kind of meta-proof by contradiction, each failed attempt is yet another clue that perhaps the foundation is sound after all.</p><p>But to me, the important insight behind understanding the relevance of the <code>P!=NP</code> problem is that <em>it's not merely a binary question</em>. It's more a question of several different worlds that we could be living in, all of which are profoundly different from one another.</p><p>As Scott Aaronson lays out in <a href="https://archive.is/o/fjkSZ/https://www.scottaaronson.com/papers/npcomplete.pdf">NP-complete Problems and Physical Reality</a>, questions about the computational nature of our universe are fundamentally questions about the physical nature of our universe. In a sense, the <code>P!=NP</code> question is the heir (in magnitude) to the implied question posed by <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Lorentz_transformation">the Lorentz transformation</a>. An answer to the <code>P!=NP</code> question that resolved which of Impagliazzo's worlds we actually lived in would be as impactful as the theory of special relativity. Essentially, the core question of our era is: "<a href="https://archive.is/o/fjkSZ/https://twitter.com/pasiphae_goals/status/923224327247540226">is cryptography possible?</a>"</p><p>Understanding the physical structure of our universe matters because it determines the structure of how we will inhabit the universe. Liu Cixin's <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/The_Dark_Forest">The Dark Forest</a> posits that it is the physical reality of relativistic acceleration that explains the Fermi paradox: since kinetic payloads accelerated to relativistic speeds cannot be detected in advance, preemptive first strikes are the Nash equilibrium strategy. The only winning move is not to play, and to stay "dark" in the forest of the universe. I find this solution intriguing enough to be the basis of a science fiction universe, although I am skeptical that this is the universe we actually live in. Many uncertainties remain in the details of how decentralized one can make one's communication web<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. Nevertheless, these questions depend very deeply on the exact details of the technological limits of our physical universe (e.g. does relativistic acceleration necessarily leave energy signatures that can be traced ex post facto?).</p><p>What is the computational structure of our universe? Given how little we know of our universe, I've found the best way to approach this question is to try to understand five different parallel worlds, first introduced by Russell Impagliazzo in <a href="https://archive.is/o/fjkSZ/https://www.karlin.mff.cuni.cz/%7Ekrajicek/ri5svetu.pdf">A Personal View of Average Case Complexity</a>, each with different computational complexity properties. The core question then becomes: which of Impagliazzo's five worlds do we live in? He named them Algorithmica, Heuristica, Pessiland, Minicrypt, and Cryptomania.</p><h3>Algorithmica</h3><p>This is the <code>P=NP</code> world that people typically think of, the one where we have a constructive algorithm that efficiently solves <code>NP</code>-hard problems.
We use this algorithm to bootstrap answers to the other Clay problems, all of physics and mathematics, and then shortly after, ascend to Godhood as Arthur C. Clarke-esque beings of pure energy without corporeal form. I will not elaborate further on this world, except to note that nobody really believes that this is the world we live in<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Nevertheless, until proven otherwise, this is potentially the world we live in. If we were truly contemplating Algorithmica and Not Algorithmica as binary possibilities, I too would take the Feynman stance. So let's move on to more interesting possibilities.</p><h3>Heuristica</h3><p>The best start to describing this world is to begin by directly quoting Impagliazzo:</p><blockquote><p>Heuristica is the world where <code>NP</code> problems are intractable in the worst-case, but tractable on average for any samplable probability distribution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p></blockquote><p>In essence, you have to do an equivalently hard amount of work to <em>find</em> an instance of an <code>NP</code>-hard problem where solving requires super-polynomial time. Thus, <code>NP</code>-hardness is no longer useful to bootstrap cryptographic one-way functions. I like to think of this world as the one in which we realize that "<code>NP</code>-hardness" as a notion is practically useless. In this world, what <em>truly</em> matters is the scaffolding provided by the heuristic analyses themselves. "Is this problem <code>NP</code>-hard?" is replaced by "Is this problem <code>NP</code>-heuristicable?". Carrying forward Aaronson's analysis in <a href="https://archive.is/o/fjkSZ/https://www.scottaaronson.com/papers/npcomplete.pdf">NP-complete Problems and Physical Reality</a>, if you have a problem instance that is non-<code>NP</code>-heuristicable, <em>there was probably some underlying computational or physical process that did a "hard" amount of work to generate that problem instance!</em> It is almost certain that in this world, there would be no practical cryptography, as attacker and defender would simply be locked in a computational arms race. It is, of course, possible that cryptographic one-way functions exist in "some other" way than <code>NP</code>-hardness, but all of that depends on reifying precisely how the <code>NP</code>-heuristics work. For all practical purposes, in this world, cryptography would revert to the stone age, and we would start over. But there's a silver lining: we would get to efficiently solve all naturally-occurring <code>NP</code>-hard problems that weren't pre-constructed to be hard by well-resourced computational adversaries. This is a fun but dystopian universe: there are no hard problems, but there are also no secrets.</p><h3>Pessiland</h3><p>Impagliazzo names this world as such because it's truly the worst of all worlds. In this world, <code>NP</code>-hard problems exist, <code>NP</code>-hard instances are easy to find and naturally occur everywhere, but finding <em>matched pairs of </em><code>NP</code><em>-hard instances with known easy "keys" is hard</em>. Let's build some intuition using the traveling salesman problem: in this world, TSP is still hard to solve. But it is hard to <em>generate</em> hard instances of TSP with known solutions ("keys"). Whatever process is used to generate the instance while retaining the key ends up selecting for the subset of problems that are <em>not</em> hard to crack. Thus, you cannot bootstrap any cryptographic one-way function using this process<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p>
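<p>To pin down what a "matched pair with a known key" means, here is a toy Python sketch of the generation step (my own illustration, using factoring rather than TSP for brevity): the generator privately picks two primes (the easy key) and publishes only their product (the hard instance). Pessiland is the claim that no such procedure, for any <code>NP</code>-hard problem, produces instances that actually stay hard.</p><pre><code>import math, random

def is_prime(n):
    """Trial division; fine for odd, toy-sized candidates."""
    return n % 2 == 1 and all(n % i for i in range(3, math.isqrt(n) + 1, 2))

def random_prime(bits):
    while True:
        candidate = random.getrandbits(bits) | 1  # force odd
        if candidate.bit_length() == bits and is_prime(candidate):
            return candidate

def generate_keyed_instance(bits=24):
    """Pick the easy key (p, q) first, then publish the instance n = p*q.
    The generator knows the answer by construction; everyone else has to
    factor (assuming, of course, that factoring is actually hard)."""
    p, q = random_prime(bits), random_prime(bits)
    return p * q, (p, q)  # (public hard instance, private easy key)

instance, key = generate_keyed_instance()
print(instance, key)
</code></pre><p>In Cryptomania, procedures of exactly this shape are what bootstrap cryptography; in Pessiland, whatever you generate this way is secretly easy to crack.</p>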
<p>That is, the universe is full of naturally occurring <code>NP</code>-hard instances (e.g. all of math and physics, all computational modeling problems like traveling salesman, circuit minimization, etc.). But somehow, <em>generating</em> instances of <code>NP</code>-hard problems with known keys is hard. In a sense, the heuristics work very well for instances we generate from known keys, but not for any natural instances from the wild.</p><p>This is a garbage world. We don't get any benefit from solving <code>NP</code>-hard problems that we want to solve, and we don't get any cryptographic benefit from <code>NP</code>-hardness either. In this world the <a href="https://archive.is/o/fjkSZ/https://github.com/Z3Prover/z3">Z3</a> theorem prover still doesn't work well, but now neither does Amazon.com --- all ecommerce today relies upon public-key cryptography to keep our payments secure.</p><h3>Minicrypt</h3><p>Minicrypt is a world where a modest amount of cryptography is possible (hence the "mini"). Specifically, one-way functions exist, but public-key cryptography does not<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. One-way functions can be used to bootstrap secure channels, but the initial shared secret must be distributed over an already-secure channel a priori. This world is well described in Vernor Vinge's <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/A_Fire_Upon_the_Deep">A Fire Upon the Deep</a>. In that universe, the chief trade goods are physically transmitted <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/One-time_pad">one-time pads</a>. These one-time pads are small, but necessary to bootstrap cryptographically secure computation, as public-key cryptography does not exist.</p><p>You can also get a good intuition for this world because it is one we've already speculated on at length, thanks to quantum computers: the discrete logarithm problem (which underlies much of public-key cryptography) <em>is</em> efficiently solvable by quantum computers (which don't exist yet) via Shor's Algorithm. If you imagine a world where we've built quantum computers but we haven't made any advances in devising public-key cryptographic schemes that aren't Shor's Algorithmable, you're in Minicrypt.</p><p>In this world you could still use quantum cryptography to do non-public-key cryptography, but you'd have to physically ship the quantum bits to the destination (due to the <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/No-cloning_theorem">no-cloning theorem</a>). So there's a lot of physically shipping around quantum cryptography starter packs. There are a lot of sci-fi movies that are essentially set in Minicrypt, because you'd have to get the correct quantum bits into the secure mainframe before you can upload the plans for the Death Star.</p><h3>Cryptomania</h3><p>This is the classic <code>P!=NP</code> world, except we now have proofs that we live in this world, as opposed to just educated guesses. Our house of cards is stabilized, and all our existing theoretical scaffolding is sound. Importantly, we don't live in Minicrypt, Pessiland, or Heuristica.</p>
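<p>As a concrete reference point for what living in Cryptomania buys us, here is a toy Diffie-Hellman exchange in Python. The parameters are illustrative only and nowhere near secure; the point is the shape of the protocol: both parties speak only over the open channel, yet end up sharing a secret, on the assumption that discrete logarithms are hard.</p><pre><code>import random

# Toy public parameters; real deployments use much larger groups.
p = 4294967291  # a small public prime
g = 5           # a public base element

# Each party picks a private exponent and publishes g^x mod p.
alice_secret = random.randrange(2, p - 1)
bob_secret = random.randrange(2, p - 1)
alice_public = pow(g, alice_secret, p)  # sent over the open channel
bob_public = pow(g, bob_secret, p)      # sent over the open channel

# Both sides derive the same key. An eavesdropper sees only p, g, and
# the two public values, and must solve a discrete logarithm to catch up.
alice_key = pow(bob_public, alice_secret, p)
bob_key = pow(alice_public, bob_secret, p)
assert alice_key == bob_key
</code></pre><p>Note that Diffie-Hellman specifically falls to Shor's Algorithm in the quantum scenario above; Cryptomania proper only needs <em>some</em> public-key scheme to survive, not necessarily this one.</p>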
<p>That said, there are a couple of caveats. To quote Impagliazzo again,</p><blockquote><p>However, blind acceptance of the existence of public key cryptosystems as a de facto complexity axiom is unwarranted. Currently, all known secure public key cryptosystems are based on variants of RSA, Rabin, and Diffie-Hellman cryptosystems. If an efficient way of factoring integers and solving discrete logarithms became known, then not only would the popular public key cryptosystems be broken, but there would be no candidate for a secure public-key cryptosystem, or any real methodology for coming up with such a candidate. There is no theoretical reason why factoring or discrete log should be intractable problems.</p></blockquote><h3>Minicrypt or Cryptomania?</h3><p><a href="https://archive.is/o/fjkSZ/https://blog.computationalcomplexity.org/2004/06/impagliazzos-five-worlds.html">Lance Fortnow says</a> that "most computer scientists would say Cryptomania or Minicrypt". I'll go one further, and put my money where my mouth is: I'll bet you twenty-one million bitcoin that we live in Cryptomania or Minicrypt!</p><p>Answering this binary question alone would give us very different worlds. That said, there are other low-probability worlds we might live in. Here's one, which I name Polynomia:</p><p>Polynomia: In this world, we have a constructive polynomial-time algorithm that solves <code>NP</code>-complete problems, except the exponent is some ridiculously large number so as to render the algorithm practically useless. In this world the house of cards we've built up has collapsed, but in a way that leaves nothing but useless rubble to throw out. We begin the computer-scientific enterprise anew.</p><p>There's a good reason that Impagliazzo doesn't waste time on this world, because it's really not a serious possibility<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. I bring it up because it serves as a useful guidepost as to why some people give the Feynman answer of "just declare it a law of physics and move on". If you were to phrase the question as "do we live in Minicrypt/Cryptomania or Polynomia?", the derisive answer is a lot more sympathetic! But as I hope I've made clear today, that's not the actual question.</p><h3>Acknowledgements</h3><p>Thanks to <a href="https://archive.is/o/fjkSZ/justinjaffray.com/">Justin Jaffray</a>, <a href="https://archive.is/o/fjkSZ/https://twitter.com/jcreed">Jason Reed</a>, and <a href="https://archive.is/o/fjkSZ/jgillenw.com/">Jennifer Gillenwater</a> for helpful suggestions on this post.
Thanks also to <a href="https://archive.is/o/fjkSZ/https://davmre.github.io/">Dave Moore</a> for his suggestions.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Newsflash to my economist friends: physics envy is everywhere, even amongst the current masters of the universe, computer scientists.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The history of the Lorentz transformation is very interesting: we essentially <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/History_of_Lorentz_transformations%23Special_relativity">had the equations</a> that described the <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Aberration_of_light">strange behavior of light</a> without understanding why, until the theory of special relativity. The Michelson-Morley experiments were giant WTFs in 1887, which pretty conclusively shot down the existing physics theories about luminiferous aether. Special relativity only came around in 1905. For 18 long years, we had no idea what was going on, even after we had the equations to describe it precisely.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Details such as: if you onion-route your messages, can you communicate without revealing your true location? If you set up automated retaliation "submarines" that return fire, can you layer a level of mutually assured destruction that returns you to a stable non-first-strike equilibrium?</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In Bill Gasarch's <a href="https://archive.is/o/fjkSZ/www.cs.umd.edu/%7Egasarch/papers/poll.pdf">poll</a> of prominent members of the theoretical research community, we have 9 respondents (out of 100) who said that <code>P=NP</code> would be the result. Of all respondents, 62 left comments. I categorize the following 6 commenters as either supporting a Polynomia view, or positing that <em>if</em> <code>P=NP</code>, it would be because of Polynomia and not Algorithmica: Jamie Andrews, Yuri Gurevich, David Isles, Donald Knuth, Vladik Kreinovich (who in turn cites Levin as the true originator of his position), Clyde Kruskal. I do not have high confidence in my reading of these comments; there's a lot of in-jokes there.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>If your question is: why can't I have a uniform distribution over all of the problem inputs that give worst-case complexity?
The answer is that Impagliazzo's magical demon makes sampling from <em>this</em> probability distribution <code>NP</code>-hard, so you're back to putting in large amounts of work to generate instances that require large amounts of work to crack.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/One-way_function">"One way function"</a> is the standard term in cryptography, but it is poorly named for building the right intuition. We essentially talk about functions that are easy in both directions if you know a key, but only easy in one direction if you <em>don't</em> know the key. This is useful in cryptography, because you share the decryption key with your buddy, and then freely use the encryption scheme. If someone else gets hold of the encryption scheme, they cannot use it to decipher the decryption key.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Public key cryptography refers to algorithms that can bootstrap a shared secret key where the participants are only talking over open channels, and have not previously established the shared secret. The classic example is <a href="https://archive.is/o/fjkSZ/https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange">Diffie-Hellman key exchange</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>But maybe I'm wrong! Don Knuth <a href="https://archive.is/o/fjkSZ/www.informit.com/articles/article.aspx?p=2213858&amp;WT.mc_id=Author_Knuth_20Questions">(see question 17)</a> holds the clearest pro-Polynomia view. It's unclear how much of his view is genuinely pro-Polynomia versus a Polynomia-must-be-taken-more-seriously stance (you underestimate the demons of the shadow at your own peril, etc.).</p></div></div>]]></content:encoded></item><item><title><![CDATA[A History of Transaction Histories]]></title><description><![CDATA[I&#8217;ve been trying to understand database transactions for a long time, and recently spent some time researching this with Justin Jaffray. Here&#8217;s an attempt to summarize what we&#8217;ve learned.]]></description><link>https://www.arjunnarayan.com/p/a-history-of-transaction-histories</link><guid isPermaLink="false">https://www.arjunnarayan.com/p/a-history-of-transaction-histories</guid><dc:creator><![CDATA[Arjun Narayan]]></dc:creator><pubDate>Fri, 30 Mar 2018 17:13:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been trying to understand database transactions for a long time, and recently spent some time researching this with <a href="http://justinjaffray.com/">Justin Jaffray</a>.
Here&#8217;s an attempt to summarize what we&#8217;ve learned.</p><p>One recurring challenge is that understanding transactions has been a decades-long journey for the database and distributed infrastructure community. Reading the literature requires understanding this context. Otherwise it&#8217;s jarring to see contradictory terminology, and contradictory intuitions. To clarify this, I&#8217;ve written down how the discourse around database transactions has evolved.</p><p>In the beginning (1990?) database connections were nasty, brutish, and short. Under the law of the jungle, anarchy reigned and programmers had to deal with numerous anomalies. To deal with this, the ANSI SQL-92 standard was the first attempt to bring <a href="http://www.adp-gmbh.ch/ora/misc/isolation_level.html">a uniform definition to isolation levels</a>. They chose the following path:</p><ol><li><p>There are a few known anomalies, namely <a href="https://en.wikipedia.org/wiki/Isolation_(database_systems)%23Dirty_reads">dirty reads</a>, <a href="https://en.wikipedia.org/wiki/Isolation_(database_systems)%23Non-repeatable_reads">non-repeatable reads</a>, and <a href="https://en.wikipedia.org/wiki/Isolation_(database_systems)%23Phantom_reads">phantom reads</a>.</p></li><li><p>They recognized that there are some locking strategies that prohibit some or all of these anomalies. These they called &#8220;read uncommitted&#8221; (the jungle), &#8220;read committed&#8221; (no dirty reads), and &#8220;repeatable read&#8221; (no dirty reads or non-repeatable reads).</p></li><li><p>The final level of enlightenment was the isolation level that prohibited all three anomalies. They called it serializable, but like&#8230; let&#8217;s just say that was a flaw in this plan.</p></li></ol><p>But serializability is one of the older gods. The concept goes back to the 1960s, and is defined in terms of <em>transaction histories</em>. <a href="https://www.cs.purdue.edu/homes/bb/cs542-11Spr/SCDU-Papa-79.pdf">To quote Christos Papadimitriou</a> from one of the earlier papers that discussed transactions:</p><blockquote><p>A sequence of atomic user updates/retrievals is called serializable essentially if its overall effect is as though the users took turns, in some order, executing each their entire transaction indivisibly.</p></blockquote><p>I personally find reading papers from the era of scanned typewritten sheets hard, so I&#8217;m going to skip ahead and say that <a href="https://cs.brown.edu/%7Emph/HerlihyW90/p463-herlihy.pdf">Herlihy-Wing</a> is a decent place to understand serializability. But the idea is surprisingly simple: every transaction happens as if it happened at some instantaneous point in time, and there&#8217;s some order between them (it doesn&#8217;t even say that this order has to be clear to the users, or to the transactions themselves! It just has to exist ex post facto).</p><p>A few years later, folks from Microsoft&#8217;s database group wrote a great paper called <a href="https://www.cs.umb.edu/cs734/CritiqueANSI_Iso.pdf">A Critique of ANSI SQL Isolation Levels</a>, which points out that the ANSI SQL definition of serializable is&#8230; not serializable? Like the way ANSI thinks about it, if you eliminate dirty reads, non-repeatable reads, and phantom reads, you&#8217;re serializable. <strong>This is wrong</strong>. It turns out that if you eliminate these three anomalies, what you have is a new isolation level, which they called Snapshot Isolation.</p>
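<p>The gap is easiest to see with the classic write-skew anomaly, which snapshot isolation permits and serializability forbids. Here is a self-contained Python simulation (my own sketch, not from the Critique) of the textbook on-call-doctors example: each transaction reads from a consistent snapshot, exhibits none of the three ANSI anomalies, and yet the combined outcome corresponds to no serial order.</p><pre><code># Invariant: at least one doctor must remain on call. Each transaction
# validates the invariant against its own snapshot, then writes. Snapshot
# isolation only detects write-write conflicts, and these two transactions
# write to different rows, so both commit.
database = {"alice": "on_call", "bob": "on_call"}

def go_off_call(snapshot, me):
    """Return this transaction's write set, validated against the
    consistent snapshot that SI guarantees it."""
    others = [d for d, s in snapshot.items() if d != me and s == "on_call"]
    if others:               # invariant looks safe in my snapshot
        return {me: "off_call"}
    return {}                # refusing to break the invariant

# Both transactions take their snapshots before either commits.
snap_t1 = dict(database)
snap_t2 = dict(database)
writes_t1 = go_off_call(snap_t1, "alice")
writes_t2 = go_off_call(snap_t2, "bob")

# Disjoint write sets: first-committer-wins raises no conflict.
database.update(writes_t1)
database.update(writes_t2)
print(database)  # both doctors off call -- the invariant is broken
</code></pre><p>No dirty read, no non-repeatable read, no phantom: every read came from a consistent snapshot. The history is still not serializable, because in any serial order the second transaction would have seen the first one&#8217;s write.</p>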
<p>The other valuable contribution of the Critique is to point out that defining isolation in terms of &#8220;the absence of specific anomalous phenomena&#8221; is bad, and like, can we please start talking about transaction histories? The literature around this time period talks about isolation levels by defining <em>specific locking strategies</em> and then proving that those locking strategies can or cannot have various anomalies. This is confusing, because you can have multiple locking strategies with functionally equivalent isolation levels, but the names are associated with the strategy, and so you have multiple names for the same isolation level.</p><p>In comes a knight in shining armor to clear up this mess: Atul Adya, with his PhD thesis <a href="http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-786.pdf">Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions</a>. Adya starts by looking at transaction histories. His strategy is essentially, &#8220;here are some histories, let&#8217;s study the graphs&#8221;. Then he identifies various weird shapes in the graphs (honestly, mostly cycles of various lengths?), and says &#8220;these are bad&#8221;. Then he maps each badness back to the ANSI definitions, sort of retroactively giving some coherent mathematical meaning to the ANSI isolation levels. This is very valuable! We finally have precise definitions for isolation levels, ones that are independent of the locking strategy that happens to live at that level!</p><p>But besides just giving meaning to the four ANSI levels, he also identifies other, more subtle anomalies, e.g. <a href="https://github.com/cockroachdb/cockroach/issues/10030">G2</a>, or &#8220;anti-dependency cycles&#8221;, which live in the vast uncharted territory between Snapshot Isolation and Serializability.</p><p>Adya&#8217;s thesis is seminal in databases. It is the first time someone coherently gave mathematical definitions of isolation levels in terms of properties one can observe in a transaction history graph. The year is 1999. Infrastructure still sucks.</p><p>Now the next question comes up: what do we do with all these databases that exist in the wild (namely Oracle and Postgres; MySQL is still a toy at this point) that <em>claim</em> serializability but are actually just snapshot isolation, because they were built in a world of ANSI definitions, but now we know better?</p><p><a href="https://pdfs.semanticscholar.org/d658/2728e30011adfe27b329c35203dfb8d1e7a8.pdf">Alan Fekete et al.</a> have a great idea in 2005, which they call &#8220;Making Snapshot Isolation Serializable&#8221;. It essentially takes a vanilla snapshot isolation database, and layers on some checks in the SQL statements, to ensure that you have serializability. They use TPC-C as a running example, because it has already been cleverly engineered to always be serializable, even when running on a snapshot isolation database. So like you, the application programmer, even if you are forced to use a Snapshot Isolation database, can get serializability with this One Weird Trick (Anomalies hate him!).</p><p>Now this is a good idea, and Fekete extends this work in 2008 in a paper called <a href="https://cs.nyu.edu/courses/fall09/G22.2434-001/p729-cahill.pdf">Serializable Isolation for Snapshot Databases</a>. This paper basically runs with that idea, except instead of having those checks be written in application code, they do the checks at the database level. So you don&#8217;t have to rip out your transaction processing engine; you can just layer on another set of checks, and you can make your database serializable. The technique is called SSI (Serializable Snapshot Isolation) because apparently taking two technical definitions and combining them to name a completely new definition is not confusing at all.</p>
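<p>A minimal sketch of the core SSI check, with hypothetical names and none of the paper&#8217;s engineering: track read-write antidependencies (one transaction read something that a concurrent transaction later wrote), and abort any transaction that accumulates both an incoming and an outgoing rw-edge, since under snapshot isolation every cycle that breaks serializability must pass through such a &#8220;pivot&#8221; transaction.</p><pre><code>class Txn:
    def __init__(self, name):
        self.name = name
        self.reads = set()
        self.has_in_rw = False   # a concurrent txn overwrote something I read
        self.has_out_rw = False  # I overwrote something a concurrent txn read

def record_write(writer, key, concurrent):
    """On each write, look for readers of the same key among concurrent
    transactions; each one induces an rw-edge from reader to writer."""
    for reader in concurrent:
        if reader is not writer and key in reader.reads:
            reader.has_out_rw = True
            writer.has_in_rw = True
            for t in (reader, writer):
                if t.has_in_rw and t.has_out_rw:
                    raise RuntimeError("abort pivot " + t.name)

t1, t2, t3 = Txn("T1"), Txn("T2"), Txn("T3")
t1.reads.add("x")
t2.reads.add("y")
record_write(t2, "x", [t1, t2, t3])  # rw-edge T1 to T2: no pivot yet
record_write(t3, "y", [t1, t2, t3])  # rw-edge T2 to T3: T2 is a pivot, abort
</code></pre><p>The check is deliberately conservative: a pivot does not prove an actual cycle exists, so some safe transactions get aborted anyway; that false-positive rate is the price of never having to build the full serialization graph.</p>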
<p>This is such a good idea that some people decide to <a href="https://drkp.net/papers/ssi-vldb12.pdf">implement it in PostgreSQL</a>, and Postgres finally has serializability. The year is 2012. Note that this is not the default Postgres mode, because ANSI says snapshot isolation (labeled Serializable) is sufficient. This is a good point to note that <strong>the ANSI definitions have not been fixed</strong>.</p><p>This strategy is extended in a paper called <a href="https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing">A Critique of Snapshot Isolation</a>. This paper points out that the SSI bootstrapping algorithm can be simplified. To expand: the original algorithm checks for <code>write-write</code> conflicts; Yabandeh proposes detecting <code>read-write</code> conflicts instead. While this requires holding in-memory data structures that have information about the latest timestamp of every read in the database, it has far better <em>concurrency control</em> behavior (for some common workloads), as it never aborts reads. It only ever aborts writes.</p><p>This is also finally the year when people start to wake up and realize they care about serializability, <a href="https://research.google.com/archive/spanner.html">because Jeff Dean said it&#8217;s important</a>. Michael Stonebraker, meanwhile, has been shouting for about 20 years and is very frustrated that people apparently only care when The Very Important Googlers say it&#8217;s important.</p><p>At this point academics have a lot of angst, because now we&#8217;re in this weird zone where</p><ol><li><p>according to the theory, serializability is The Only Way</p></li><li><p>many people use garbage non-serializable databases and&#8230; are basically fine?</p></li><li><p>Therefore, are we just like wasting our time? Does serializability even matter? What are we doing with our lives?</p></li></ol><p>Peter Bailis summarizes this best in a very fine <a href="http://www.bailis.org/blog/understanding-weak-isolation-is-a-serious-problem/">blog post in 2014</a>. To quote him:</p><blockquote><p>Despite the ubiquity of weak isolation, I haven&#8217;t found a database architect, researcher, or user who&#8217;s been able to offer an explanation of when, and, probably more importantly, why isolation models such as Read Committed are sufficient for correct execution. It&#8217;s reasonably well known that these weak isolation models represent &#8220;ACID in practice,&#8221; but I don&#8217;t think we have any real understanding of how so many applications are seemingly (!?) okay running under them.</p></blockquote><p>The retort is basically: you&#8217;re not always fine, and you&#8217;re going to find out that you&#8217;re not fine when it&#8217;s too late, so don&#8217;t do that. This is not an easy argument to make.
It finally starts to catch on once people have been burned by NoSQL databases for about a decade.</p>]]></content:encoded></item><item><title><![CDATA[Arjun Narayan]]></title><description><![CDATA[I&#8217;m Arjun Narayan. I&#8217;m an investor, entrepreneur, and computer scientist.]]></description><link>https://www.arjunnarayan.com/p/coming-soon</link><guid isPermaLink="false">https://www.arjunnarayan.com/p/coming-soon</guid><dc:creator><![CDATA[Arjun Narayan]]></dc:creator><pubDate>Tue, 29 Aug 2006 18:50:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Mh2U!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F69625024-7d93-4e98-b399-cb5f4d2e398e_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Databases</strong></p><p>I co-founded <a href="http://www.materialize.com">Materialize</a>, a database startup, with <a href="https://en.wikipedia.org/wiki/Frank_McSherry">Frank McSherry</a>. I was the founding CEO.</p><p>I worked at Cockroach Labs as a software engineer, mainly on SQL query execution and performance engineering.</p><p><strong>Computer Science</strong></p><p>I received a PhD in computer science from the University of Pennsylvania. My PhD was on distributed systems that provided differentially private guarantees when processing data across federated security and privacy domains.</p><p>I knew quickly that I didn&#8217;t want to be an academic, nor did I care deeply about privacy, but I found the experience of getting a PhD extremely rewarding. The problems I was solving were contrived, but solving them required leveraging theorem provers, type systems, databases, deterministic runtimes, compilers, and distributed systems. Overall, a fantastic computer science education.</p><p><strong>Investing</strong></p><p>I invest in technical founders at <a href="https://www.amplifypartners.com">Amplify Partners</a>.</p><p><strong>Writing</strong></p><p>I&#8217;ve collected some of my past writings on this blog, and intend to continue such writing from time to time. I expect writing to be infrequent - only when I really feel like I have something unique to add - so the burden on your inbox should be occasional.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.arjunnarayan.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.arjunnarayan.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>