
Call for comments: “Introduction to Theoretical Computer Science”

May 17, 2018

As I mentioned before, I am teaching CS 121  at Harvard, and have written my own text, with the (not very original) title “Introduction to Theoretical Computer Science” . I am hoping for this text to turn into a published textbook in the next year or two. Toward this end, I would be grateful for any comments or suggestions on the text, as I will be working on it over the summer.

You can use the issues page on the GitHub repository to make any suggestions, corrections,  reports on bugs and typos, and so on.

I also hope that other instructors would consider using this text in their courses, and would welcome suggestions as to what I can do to help in that. In particular, over the summer I plan to add more exercises (including some solved ones) and more examples, as well as more explanations and “proof ideas” for many of the theorems.

The main differences between this text and “classical approaches” to intro theory courses such as Sipser’s are the following:

Circuits / straightline programs  as the basic model instead of automata

I do not start with finite automata as the basic computational model, but rather with straight-line programs (or, equivalently, Boolean circuits) in an extremely simple programming language, which I call the “NAND programming language” since its only operation is assigning to one variable the NAND of two others.
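For a flavor of what such straight-line programs look like, here is a minimal Python sketch of the idea (the concrete syntax in the text differs, and the variable names below are just illustrative): a program computing the XOR of two bits using only NAND assignments.

```python
# A minimal sketch of a straight-line "NAND program" (not the book's exact
# syntax): every line assigns to one variable the NAND of two others.
def NAND(a, b):
    return 1 - (a & b)

def xor(x0, x1):
    u = NAND(x0, x1)
    v = NAND(x0, u)
    w = NAND(x1, u)
    return NAND(v, w)   # equals x0 XOR x1

assert [xor(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```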

Automata are discussed later in the course, after Turing machines and undecidability, as an example of a restricted computational model in which problems such as halting are effectively solvable. This actually corresponds to the historical ordering: Boolean algebra goes back to Boole’s work in the 1850s, Turing machines and undecidability were of course discovered in the 1930s, while finite automata were introduced in the 1943 work of McCulloch and Pitts but only really understood in the seminal 1959 work of Rabin and Scott. More importantly, the main current practical motivations for restricted models such as regular and context-free languages (whether for parsing, for analyzing liveness and safety, or even for routing on software-defined networks) are precisely that these are tractable models in which semantic questions can be effectively answered. This motivation can be better appreciated after students see the undecidability of semantic properties of general computing models.

Moreover, the Boolean circuit / straight-line program model is extremely simple to both describe and analyze, and some of the main lessons of the theory of computation, including the duality between code and data and the idea of universality, can already be seen in this context. Nonuniform computation also gives essential background for some exciting topics I like to cover later in the course, such as cryptography (making sense of notions such as “256 bits of security” or the difficulty of bitcoin mining), pseudorandom generators and the conjecture that BPP=P, and quantum computing. Circuits of course also yield the simplest proof of the Cook-Levin theorem.

A programming language instead of Turing machines for modeling uniform computation

Instead of Turing machines, I introduce uniform computation using an equivalent model obtained by extending the straight-line NAND programming language mentioned above to include loops and arrays (I call the resulting programming language “NAND++”). I do define Turing machines and show their equivalence with NAND++ programs, so that students can connect this both to the historical context and to other literature they may encounter.

I believe that the programming-language formalism makes this model more concrete and familiar to the students than Turing machines. For example, a result such as the equivalence of Turing machines with two-dimensional tapes and one-dimensional tapes is now described as showing that a language with two-dimensional arrays can be “transpiled” into a language with one-dimensional arrays only (i.e., two-dimensional arrays can be implemented via “syntactic sugar”).
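Here is a toy Python sketch of that “syntactic sugar” idea (my own illustration, not the text’s construction): a two-dimensional access A[i][j] is compiled into a one-dimensional access at a computed index.

```python
# Toy illustration: two-dimensional arrays as "syntactic sugar" over a
# one-dimensional array, via the index computation i*width + j.  (Any other
# injective pairing of (i, j) into the integers would do just as well.)
width = 4
A2 = [[10 * i + j for j in range(width)] for i in range(3)]   # "source" 2D array
A1 = [A2[i][j] for i in range(3) for j in range(width)]       # "compiled" 1D array

def read2d(flat, i, j):
    return flat[i * width + j]

assert all(read2d(A1, i, j) == A2[i][j]
           for i in range(3) for j in range(width))
```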

Moreover, because the NAND++ programming language is extremely simple and fully specified, it is easy to show its formal equivalence with Turing machines. In fact, since NAND++ is essentially obtained by adding a “feedback loop” to Boolean circuits, this makes the Cook-Levin theorem easier to prove. I even implemented the Cook-Levin reduction in a Python notebook, so students can see how one can transform a NAND++ program P into a graph G and a number k such that G has a cut of size k if and only if the program has an input that makes it output 1. Alas, these graphs tend to be quite large:
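The notebook’s reduction goes all the way to a graph problem; as a (hedged) taste of why circuits make such reductions easy, here is a sketch of the standard first step, encoding each NAND line as three clauses so that the resulting formula is satisfiable exactly when some input makes the program output 1. The representation and variable numbering are my own illustration, not the notebook’s code.

```python
from itertools import product

# Sketch of encoding a straight-line NAND program as CNF clauses (literals are
# signed integers, DIMACS-style): z = NAND(x, y) holds if and only if the three
# clauses (x OR z), (y OR z), (NOT x OR NOT y OR NOT z) are all satisfied.
def nand_line_to_clauses(x, y, z):
    return [[x, z], [y, z], [-x, -y, -z]]

# Toy program on input variables 1, 2:   3 = NAND(1, 2);  4 = NAND(1, 3)
clauses = nand_line_to_clauses(1, 2, 3) + nand_line_to_clauses(1, 3, 4)
clauses.append([4])   # require that the designated output variable equals 1

def satisfiable(clauses, nvars):
    """Brute-force check: is there an assignment satisfying all clauses?"""
    for bits in product([False, True], repeat=nvars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False

assert satisfiable(clauses, 4)   # some input makes the program output 1
```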

[Figure: independent-set reduction]

[Figure: max-cut reduction]

I also introduce yet another extension of NAND++ that allows indirect access to arrays, and hence is essentially equivalent to the standard RAM machine model used (implicitly) in algorithms courses. (I call this language NAND<<.) Using a RAM based model makes the distinction between notions such as O(n) and O(n^2) time more meaningful, and makes the time complexity classes correspond to the informal definitions of linear and quadratic time that students encountered in their algorithms lectures (or their whiteboard coding interviews..).
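Roughly speaking, the feature NAND<< adds is indirection: an array can be accessed at an index that was itself computed and stored in a variable. Here is a toy Python sketch (my own illustration, not the language’s actual semantics) of why this matters for counting steps: without indirection the “head” must walk to a position, so reading position p costs about p steps, while a RAM reads it in one step.

```python
# Toy step-counting illustration (not the formal NAND++ / NAND<< semantics):
# sequential access pays ~p steps to reach position p, indirect (RAM) access pays 1.
def sequential_read(A, p):
    steps, i = 0, 0
    while i < p:          # walk the "head" one cell at a time
        i += 1
        steps += 1
    return A[i], steps + 1

def ram_read(A, p):
    return A[p], 1        # indirect addressing: one step, regardless of p

A = list(range(1000))
assert sequential_read(A, 900) == (900, 901)
assert ram_read(A, 900) == (900, 1)
```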

More advanced topics

Reducing the time dedicated to automata (and eliminating context-free languages, though I do touch on them in the text in an optional section) allows spending more time on topics such as randomness and computation, the interactions between proofs and programs (including Gödel’s incompleteness, interactive proof systems, and even a bit on the \lambda-calculus and the Curry-Howard correspondence), cryptography, and quantum computing.

The book is still a work in progress, but I think it is already quite usable as a textbook. Indeed, much of the feedback I got on the course was that people were happy with the text (though perhaps they merely preferred it to the lectures due to my teaching abilities 🙂 ). One student even got me this coffee mug:

[Photo: the coffee mug]

In any case, I hope to get more comments as I work on it over the summer. Regardless of whether or not it’s eventually published as a book, I intend to keep a version of it freely available online.

 

Short non-review of Caplan’s “Case Against Education”

May 2, 2018

There is an old joke that an economist would not pick up a $100 bill lying on the sidewalk, since in equilibrium it should not be there. But economist Bryan Caplan from George Mason University believes the world is leaving a trillion dollars or so on the sidewalk. The culprit in his mind is education, on which developed countries tend to spend about 5-6% of their GDP:

[Chart: public education expenditure as a share of GDP]

Correspondingly, the number of years people spend in school has been growing:

 

[Chart: school life expectancy from primary to tertiary education]

Most people would think this is a good thing. First of all, an educated population is a good in itself. Second, this seems correlated with growth in general:

 

[Chart: real GDP per capita (PWT, log scale)]

However, Caplan believes education in high school and above is mostly a waste of time and money. (He is also suspicious of K-8 education, which he calls a “daycare center for kids”.) So in his mind, the GDP growth of advanced countries has been in spite of their investment in education. Who knows how far the U.S. could have come if the average number of years in school had stayed at 8 or 10.

Caplan’s theory is that people’s productivity is just a function of innate skill, which is not changed at all by education. They only go to high school, college, and beyond due to an “arms race” of credentials aided by ease of access to education. If you own a business and believe Caplan’s theory, you would probably want to set up shop as far away from universities as possible, where you could get quality workers on the cheap who did not waste their time on education. Perhaps Amazon should hire Caplan as a consultant; they seem to be doing the exact opposite.

Caplan also wrote a book. Scott Aaronson is not impressed.

Some reading recommendations

April 26, 2018

Quanta magazine has an excellent article by Erica Klarreich on the recent progress on the 2 to 2 conjecture, which I have blogged about before. The article does not go into the technical details, but gives a good perspective on what has been done and what challenges lie ahead.

Eric Posner and Glen Weyl have a new book on “Radical Markets”  which I view as to some extent taking a “computer sciency approach” to some of the basic problems of society, economics and democracy. It’s “computer sciency” in the sense that their approach doesn’t place too high of a premium on “backwards compatibility” and also in the sense that, while they talk about transforming basic notions of property rights and even human rights in physical countries, their ideas might end up being more applicable to “digital countries” such as cryptocurrencies and online communities. Indeed, as Eli Ben-Sasson wrote here, many of the unanswered questions in cryptocurrencies are less about crypto and more about governance. (A third more technical sense is that it seems that some of their ideas, such as quadratic voting, correspond to replacing the 1 norm with the 2 norm, which is of course a theme we have seen before. )

Avi Wigderson posted a self-contained version of the epilogue of his book on mathematics and computation. This epilogue gives a panoramic view of TCS’s past, present, and future, as well as its connections with its near and not-so-near neighbors. Regardless of your background, you will learn something from this book.

Finally, I’ve recently become interested in connections between physics and algorithms, and in particular the relations between statistical physics and algorithmic difficulty. I highly recommend these lecture notes by Afonso Bandeira, Amelia Perry, and Alex Wein. In a tough year for our community, Amelia was another brilliant student whom we lost much, much too soon, and these notes are dedicated to her memory.

Childcare at STOC 2018 “TheoryFest”

April 16, 2018

(Announcement from Ilias Diakonikolas and David Kempe)

We are pleased to announce that we will provide pooled, subsidized child care at STOC 2018. The cost will be $40 per day per child for regular conference attendees, and $20 per day per child for students. For more detailed information, including how to register for STOC 2018 childcare, see http://acm-stoc.org/stoc2018/childcare.html

Ilias Diakonikolas and David Kempe (local arrangements chairs)

Lecture notes on DKKMS

April 15, 2018

Mitali Bafna, Chi-Ning Chou, and Zhao Song wrote scribe notes for my lectures on the Dinur et al. proof of the 2 to 2 conjecture (see the DKKMS and KMS papers, though this presentation follows a different, and in my view simpler, approach).

“Scribe notes” is really an understatement. In a heroic work, Mitali, Chi-Ning and Zhao supplied many more details and proofs (and much better exposition, including figures and comments) than I actually did in my talks.  (One proof is still missing, but hopefully it will be updated in a few weeks or so.)

The notes assume no background in prior works on PCP and unique games, and can be very useful for anyone interested in these recent breakthroughs. As I mentioned before, the reduction itself can be described in about 5 lines. The soundness analysis is unfortunately more complicated…

Consider signing a pledge for inclusiveness in TCS

April 2, 2018

Edith Cohen, Vitaly Feldman, Omer Reingold and Ronitt Rubinfeld wrote a pledge for inclusiveness in TCS. I think it is mostly common sense: saying that not just our universities, but also our conferences and workshops, are part of our workplace, and that we should strive for them to be free of harassment. But sometimes it is very useful to explicitly state common sense statements, as they set the right level of expectations.

So, if you agree with the pledge’s statement and are a member of the TCS community, regardless of whether you are a student, postdoc, faculty, in industry,  or any other member, please do consider signing it here.

Statistical physics dictionary

March 14, 2018

I’ve always been curious about the statistical physics approach to problems from computer science. The physics-inspired algorithm survey propagation is the current champion for random 3SAT instances, statistical-physics phase transitions have been suggested as explaining computational difficulty, and statistical physics has even been invoked to explain why deep learning algorithms seem to often converge to useful local minima.

Unfortunately, I have always found the terminology of statistical physics (“spin glasses”, “quenched averages”, “annealing”, “replica symmetry breaking”, “metastable states”, etc.) rather daunting. I’ve recently decided to start reading up on this, trying to figure out what these terms mean. Some sources I found useful are the fun survey by Cris Moore, the longer one by Zdeborová and Krzakala, the somewhat misnamed survey by Castellani and Cavagna, and the wonderful book by Mezard and Montanari. (So far, I’m still early in the process of looking at these sources. I would appreciate pointers to any others!)

This post is mainly for my own benefit, recording some of my thoughts. It probably contains many inaccuracies, and I’d be grateful if people point them out.

 

First of all, it seems that statistical physicists are very interested in the following problem: take a random instance J of a constraint satisfaction problem and some number 0 \leq \alpha \leq 1, and sample from the uniform distribution \mu(\alpha) over assignments x to the variables that satisfy at least an \alpha fraction of the constraints. More accurately, they are interested in the distribution \hat{\mu}(\beta) in which the probability of an assignment x is proportional to \exp(-\beta J(x)), where \beta is some parameter and J(x) is the number of constraints of J that x violates. However, for an appropriate relation between \beta and \alpha, this is more or less the same thing, and so you can pick whichever of them is easier to think about or to calculate based on the context.
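As a concrete (and very much toy) illustration of this distribution, here is a sketch of a Metropolis-style sampler for \hat{\mu}(\beta) on a random 3-SAT instance, where the weight of x is proportional to \exp(-\beta J(x)). This is just one standard way to sample from such a distribution, written by me for intuition; it is not taken from any of the sources above.

```python
import math
import random

def random_3sat(n, m, rng):
    """A random 3-SAT instance: m clauses, each on 3 distinct signed variables."""
    return [[v if rng.random() < 0.5 else -v
             for v in rng.sample(range(1, n + 1), 3)] for _ in range(m)]

def energy(clauses, x):
    """J(x): the number of clauses violated by the assignment x (dict var -> 0/1)."""
    return sum(all(x[abs(l)] != (l > 0) for l in c) for c in clauses)

def metropolis(clauses, n, beta, steps, rng):
    """Approximately sample from Pr[x] proportional to exp(-beta * J(x)) by single-bit flips."""
    x = {v: rng.randint(0, 1) for v in range(1, n + 1)}
    e = energy(clauses, x)
    for _ in range(steps):
        v = rng.randint(1, n)
        x[v] ^= 1                      # propose flipping one variable
        e_new = energy(clauses, x)
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            e = e_new                  # accept
        else:
            x[v] ^= 1                  # reject: undo the flip
    return x, e

rng = random.Random(0)
instance = random_3sat(n=50, m=200, rng=rng)
_, violated = metropolis(instance, n=50, beta=3.0, steps=20000, rng=rng)
print("violated clauses at beta = 3:", violated)
```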

Physicists call the distribution \hat{\mu}(\beta) the Boltzmann distribution, and they call J(x) the energy of x. Often they write J(x) as H(x;J), where H, known as the Hamiltonian, is the general shape of the constraint satisfaction problem and J specifies its parameters. The free energy is the typical value of \exp(\beta J(x)), which roughly corresponds to \exp(\beta \alpha |J|), where |J| is the total number of constraints in J and \alpha (as before) is the fraction of satisfied constraints.

 

Why would statistical physicists care about constraint satisfaction problems?

My understanding is that they typically study a system with a large number n of particles, and these particles have various forces acting on them. (These forces are local, and so each one involves a constant number of particles – just like in constraint satisfaction problems.) These forces push the particles to satisfy the constraints, while temperature (which you can think of as 1/\beta) pushes the particles to have as much entropy as possible. (It is entropy in a Bayesian sense, as far as I understand; it is not that the particles’ state is sampled from this distribution at any given point in time, but rather that our (or Maxwell’s demon’s) knowledge about the state of the particles can be captured by this distribution.) When the system is hot, \beta is close to zero and the distribution has maximal entropy; when the system is cold, \beta gets closer to infinity and the fraction \alpha of satisfied constraints grows closer to the maximum possible value. Sometimes there is a phase transition where an o(1) change in \beta can lead to an \Omega(1) change in \alpha. (These o(1) and \Omega(1) factors are with respect to the number of particles n, which we think of as tending to infinity.) That is, a small change in the temperature leads to a qualitatively different distribution. This happens, for example, when a system changes from solid to liquid.

Ideally (or maybe naively), we would think of the state of the system as being just a function of the temperature, not depending on the history of how we got to that temperature, but that’s not always the case. Just like an algorithm can get stuck in a local minimum, the system can get stuck in a so-called “metastable state”. In particular, even if we cool the system to absolute zero, its state might not satisfy the global maximum fraction of constraints.

A canonical example of a system stuck in a metastable state is glass. For example, a windowpane at room temperature is in a very different state than a bag of sand at the same temperature, due to the windowpane’s history of first being heated to very high temperatures and then cooled down. In statistical physics lingo, the static prediction of the state of the system is the calculation based on its temperature alone, while a dynamic analysis takes into account the history and the possibility of getting stuck in metastable states for a long time.

Why do statistical physicists care about random constraint satisfaction problems?

So far we can see why statistical physicists would want to understand a constraint satisfaction problem J that corresponds to some physical system, but why would they think that J is random?

As far as I can tell, there are two reasons for this. First, some physical systems are “disordered” in the sense that when they are produced, the coefficients of the various forces on the particles are random. A canonical example is a spin glass. Second, they hope that the insights from studying random CSP’s could potentially be relevant for understanding the non-random (“structured” as opposed to “disordered”) CSP’s that arise in practice.

In any case, a large body of literature is about computing various quantities of the state of a random CSP J at a given temperature, or after following certain dynamics.
For such a quantity X(J), we would want to know the typical value of X(J), the one that is obtained with high probability over J. Physicists call this value the “quenched” average (which is obtained by fixing J to a typical value). Often it is easier to compute the expectation of X(J) over J (called the “annealed” average in physicist lingo), which of course gives the right answer if the quantity X(J) is well concentrated – a property physicists call “self-averaging”.
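Here is a toy numerical illustration (my own, not from the sources above) of how different the two can be when X(J) is not self-averaging: take X = \exp(Z) with Z Gaussian, a stand-in for an exponentially fluctuating quantity. The quenched average \mathbb{E}[\log X] is 0, while the annealed average \log \mathbb{E}[X] equals \sigma^2/2 and is dominated by rare, atypically large values of X.

```python
import numpy as np

# Toy quenched vs annealed comparison: X = exp(Z), Z ~ N(0, sigma^2).
rng = np.random.default_rng(0)
sigma = 2.0
Z = sigma * rng.standard_normal(1_000_000)
X = np.exp(Z)

quenched = np.mean(np.log(X))     # E[log X] ~ 0.0: the typical exponent
annealed = np.log(np.mean(X))     # log E[X] ~ sigma^2 / 2 = 2.0: driven by rare large X
print(f"quenched E[log X] = {quenched:.3f},  annealed log E[X] = {annealed:.3f}")
```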

 

Often when a quantity X(J) is not concentrated, the quantity \log X(J) is concentrated, and so we would want to compute the expectation \mathbb{E}_J \log X(J). The “replica trick” uses the fact that \log X is the derivative at 0 of the function k \mapsto X^k. So, \mathbb{E}_J \log X(J) = \mathbb{E}_J \lim_{k \rightarrow 0}\tfrac{1}{k} (X(J)^k-1). (Physicists would usually use n instead of k, but I already used n for the number of variables, which they seem to often call N.) Now we can hope to exchange the limit and the expectation and write this as \lim_{k \rightarrow 0}\tfrac{1}{k} (\mathbb{E}_J X(J)^k - 1). For an integer k, it is often the case that we can compute the quantity f(k)=\mathbb{E}_J X(J)^k. The reason is that typically X(J) is itself the expectation of some random variable that is parameterized by J, and hence X(J)^k corresponds to taking the expectation of the product of k independent copies (or “replicas” in physicist-speak) of these random variables. So, the hope is that we get some nice analytical expression for f(k), and then simply plug in f'(0) and hope this gives us the right answer (which, surprisingly enough, always seems to work in the contexts they care about).
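As a sanity check on the identity underlying the trick (again my own toy example), for the lognormal X of the previous snippet we have \mathbb{E}[X^k] = \exp(k^2\sigma^2/2) in closed form, and \tfrac{1}{k}(\mathbb{E}[X^k]-1) indeed tends to \mathbb{E}[\log X] = 0 as k \rightarrow 0. What this toy check does not capture is the genuinely delicate step in the physics calculations: computing \mathbb{E}[X^k] only for integer k and then continuing it to k \rightarrow 0.

```python
import numpy as np

# (E[X^k] - 1) / k  ->  E[log X]  as k -> 0, for X = exp(Z) with Z ~ N(0, sigma^2),
# using the exact moment formula E[X^k] = exp(k^2 * sigma^2 / 2).
sigma = 2.0
for k in [1.0, 0.1, 0.01, 0.001]:
    moment = np.exp(k**2 * sigma**2 / 2)
    print(k, (moment - 1) / k)       # tends to E[log X] = 0
```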

 

What are the qualitative phase transitions that physicists are interested in?

Let us fix some “typical” J and consider the distribution \mu=\mu(J;\alpha) = \hat{\mu}(J;\beta) which is uniform over all assignments satisfying an \alpha fraction of J’s constraints (or, essentially equivalently, where the probability of x is proportional to \exp(-\beta J(x))). We will be particularly interested in the second moment matrix M of \mu, which is the n\times n matrix such that M_{i,j} = \mathbb{E} x_ix_j where x is sampled from \mu. (Let’s think of x as a vector in \{ \pm 1 \}^n.)

We can often compute the quantity \mathbb{E}_J Tr(M(J)^k) for every k. Expanding the product, this corresponds to computing \mathbb{E}_J \langle x_1 , x_2 \rangle \langle x_2, x_3 \rangle \cdots \langle x_{k-1}, x_k \rangle \langle x_k , x_1 \rangle for k independent samples x_1,\ldots,x_k from \mu(J) (i.e., “replicas”). (Physicists often think of the related “overlap matrix” Q, which is the k\times k matrix where Q_{a,b} = \langle x_a,x_b \rangle.) This means that we can completely determine the typical spectrum of M as a function of the parameters \alpha or \beta.
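A small numerical sanity check of the identity relating traces to replica overlaps (my own toy example, with an arbitrary product distribution standing in for \mu): Tr(M^k) equals the expectation of \langle x_1,x_2\rangle\langle x_2,x_3\rangle\cdots\langle x_k,x_1\rangle over k independent replicas.

```python
import numpy as np

rng = np.random.default_rng(1)
n, samples, k = 8, 200_000, 3

# Some arbitrary distribution over {-1,+1}^n (independent biased coordinates),
# standing in for mu; the identity holds for any distribution.
p = rng.uniform(0.2, 0.8, size=n)
def draw(size):
    return np.where(rng.random((size, n)) < p, 1.0, -1.0)

X = draw(samples)
M = X.T @ X / samples                       # second moment matrix M_ij = E[x_i x_j]

replicas = [draw(samples) for _ in range(k)]
overlap_product = np.ones(samples)
for a in range(k):                          # <x_1,x_2><x_2,x_3>...<x_k,x_1>
    overlap_product *= np.sum(replicas[a] * replicas[(a + 1) % k], axis=1)

print(np.trace(np.linalg.matrix_power(M, k)))   # these two agree up to sampling error
print(overlap_product.mean())
```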

 

The typical behavior when \beta is small (i.e., the system is “hot”) is that the solutions form one large “cluster” around a particular vector x^*. This is known as the “replica symmetric” phase, since all copies (“replicas”) chosen independently from \mu would have roughly the same behavior, and the matrix Q above would have n on the diagonal and the same value q off the diagonal. This means that M is close to the rank-one matrix x^*(x^*)^\top. This is for example the case in the “planted” or “hidden” model (related to the “Nishimori condition” in physics) if we plant a solution x^* in a sufficiently “easy” regime in terms of the relation between the number of satisfied constraints and the total number of variables. In such a case, it is easy to “read off” the planted solution from the second moment matrix, and it turns out that many algorithms succeed in recovering the solution.
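A toy illustration of that “reading off” (my own sketch, with noisy copies of a planted sign vector standing in for samples from \mu in the easy regime, rather than an actual CSP): the second moment matrix is approximately x^*(x^*)^\top plus a multiple of the identity, so its top eigenvector recovers x^* up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
n, samples, noise = 100, 5000, 0.2

x_star = rng.choice([-1.0, 1.0], size=n)        # planted solution
flips = rng.random((samples, n)) < noise        # flip each coordinate with prob. `noise`
X = np.where(flips, -x_star, x_star)            # one "cluster" of samples around x*

M = X.T @ X / samples                           # second moment matrix
eigvals, eigvecs = np.linalg.eigh(M)            # ascending eigenvalues
v = np.sign(eigvecs[:, -1])                     # top eigenvector, rounded to signs

print("agreement with planted solution:", np.abs(v @ x_star) / n)   # close to 1
```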

As we cool down the system (or, correspondingly, increase \alpha) the distribution breaks into exponentially many clusters. In this case, even if we planted a solution x^*, the second moment matrix will look like the identity matrix (since the centers of the clusters could very well be in isotropic positions). Since there are exponentially many clusters that dominate the distribution, if we sample from this distribution we are likely to fall into any one of them. For this reason we will not see the cluster behavior in a static analysis, but we will get stuck in the cluster we started from in any dynamic process. This phase is called “1-dRSB” for “one step dynamic replica symmetry breaking” in the physics literature. (The “one step” corresponds to the distribution being simply clustered, as opposed to having some hierarchy of sub-clusters within clusters. The “one step” behavior is the expected one for CSP’s, as far as I can tell.)

As far as I can tell, the statistical physics prediction would be that this is the parameter regime where sampling from the distribution \mu is hard, as well as the regime where recovering an appropriately planted assignment is hard. However, perhaps in some cases (e.g., in those arising from deep learning?) we are happy with any one of those clusters.

If we cool the system further, then one or a few clusters will dominate the distribution (this is known as condensation), in which case we enter the “static replica symmetry breaking” phase, where the second moment matrix changes and has rank corresponding roughly to the number of clusters. If we then continue cooling the system, the static distribution will concentrate on the global optimum (the planted assignment, if we had one), though dynamically we will probably still be stuck in one of those exponentially many clusters.

The figure below, from Zdeborová and Krzakala, illustrates these transitions for the 5-coloring constraint satisfaction problem (the red dot is the planted solution). The figure is plotted as a function of the degree c of the graph, where increasing the degree can be thought of as making the system “colder” (since we are adding more constraints the system needs to satisfy).

[Figure: phase transitions for 5-coloring as a function of the graph degree, from the Zdeborová–Krzakala survey]

What about algorithms?

I hope to read more to understand belief propagation, survey propagation, various Markov Chain sampling algorithms, and how they might or might not relate to the sum of squares algorithm and other convex programs. Will update when I do.