But if you want the short, in-person, better-weather version, you might want to sign up for the winter course David Steurer and I will teach on this topic January 3-7, 2017, at UC San Diego.
We hope to touch on how the SOS algorithm interacts with questions in computational complexity, approximation algorithms, machine learning, quantum information theory, extremal combinatorics, and more.
If you are interested in attending the course, or even following it remotely, please see the course web page, and also sign up for its Piazza page (people without a Harvard email can use this form). David and I will be trying to write fairly complete lecture notes, which we will post on the website, and I might also post some summaries on this blog.
Here is the introduction for this course:
The terms “Algebra” and “Algorithm” both originate from the same person. Muhammad ibn Musa al-Khwarizmi was a 9th century Persian mathematician, astronomer and geographer. The Latin translation of his books introduced the Hindu-Arabic decimal system to the western world. His book “The Compendious Book on Calculation by Completion and Balancing” also presented the first general solution for quadratic equations via the technique of “completing the square”. More than that, this book introduced the notion of solving general as opposed to specific equations by a sequence of manipulations such as subtracting or adding equal amounts. Al-Khwarizmi called the latter operation al-jabr (“restoration” or “completion”), and this term gave rise to the word Algebra. The word Algorithm is derived from the Latin form of al-Khwarizmi’s name. (See The equation that couldn’t be solved by Mario Livio for much of this history.)
Muhammad ibn Musa al-Khwarizmi (from a 1983 Soviet Union stamp commemorating his 1200th birthday).
However, the solution of equations of degree larger than two took much longer. Over the years, a great many ingenious people devoted significant effort to solving special cases of such equations. In the 14th century, the Italian mathematician Maestro Dardi of Pisa gave a classification of 198 types of cubic (i.e., degree 3) and quartic (i.e., degree 4) equations, but could not find a general solution for all of them. Indeed, in the 16th century, Italian mathematicians would often hold “Mathematical Duels” in which opposing mathematicians would present to each other equations to solve. These public competitions attracted many spectators, were the subject of bets, and winning such duels was often a condition for obtaining appointments or tenure at universities. It is in the context of these competitions, and through a story of intrigue, controversy and broken vows, that the general formula for cubic and quartic equations was finally discovered, and later published in the 1545 book of Cardano.
The solution of the quintic took another 250 years. Many great mathematicians, including Descartes, Leibniz, Lagrange, Euler, and Gauss, worked on the problem of solving equations of degree five and higher, finding solutions for special cases but without discovering a general formula. It took until the turn of the 19th century and the works of Ruffini, Galois and Abel to discover that in fact such a general formula for solving equations of degree 5 or higher via combinations of the basic arithmetic operations and taking roots does not exist. More than that, these works gave rise to a precise characterization of which equations are solvable, and thus led to the birth of group theory.
Today, solving an equation such as $x^{17} = 1$ (which amounts to constructing a 17-gon using a compass and straightedge, one of the achievements Gauss was most proud of) can be done in a few lines of routine calculations. Indeed, this is a story that repeats itself often in science: we move from special cases to a general theory, and in the process transform what once required creative genius into mere calculation. Thus often the sign of scientific success is that we eliminate the need for creativity and make boring what was once exciting.
Let us fast-forward to the present day, where the design of algorithms is another exciting field that requires a significant amount of creativity. The Algorithms textbook of Cormen et al. has 35 chapters, 156 sections, and 1312 pages, dwarfing even Dardi’s tome on the 198 types of cubic and quartic equations. The crux of the matter seems to be the nature of efficient computation. While there are some exceptions, typically when we ask whether a problem can be solved at all, the answer is much simpler and does not seem to require such a plethora of techniques as when we ask whether the problem can be solved efficiently. Is this state of affairs inherent, or is it just a matter of time until algorithm design becomes as boring as solving a single polynomial equation?
We will not answer this question in this course. However, it does motivate some of the questions we ask and the investigations we pursue. In particular, it motivates the study of general algorithmic frameworks as opposed to tailor-made algorithms for particular problems. There is a practical motivation for this as well: real-world problems often have their own kinks and features, and will rarely match up exactly to one of the problems in the textbook. A general algorithmic framework can be applied to a wider range of problems, even if they have not been studied before.
There are several such general frameworks, but we will focus on one example that arises from convex programming: the Sum of Squares (SOS) semidefinite programming hierarchy. It has the advantage that on the one hand it is general enough to capture many algorithmic techniques, and on the other hand it is specific enough that (if we are careful) we can avoid the “curse of completeness”. That is, we can actually prove impossibility results or lower bounds for this framework without inadvertently resolving questions such as $P$ vs. $NP$. The hope is that we can understand this framework well enough to be able to classify which problems it can and cannot solve. Moreover, as we will see, through such study we end up investigating issues that are of independent interest, including mathematical questions in geometry, analysis, and probability, as well as questions about modeling the beliefs and knowledge of computationally bounded observers.
Let us now take a step back from the pompous rhetoric and slowly start getting around to the mathematical contents of this course. It will mostly be focused on the Sum of Squares (SOS) semidefinite programming hierarchy. In a sign that perhaps we have not advanced so much since the middle ages, the SOS algorithm is also a method for solving polynomial equations, albeit systems of several equations in several variables. However, it turns out that this is a fairly general formalism. Not only is solving such equations NP-hard, even in degree two, but in fact one can often reduce directly to this task from problems of interest in a fairly straightforward manner. For example, given a 3SAT formula, we can easily translate the question of whether it has a satisfying assignment $x \in \{0,1\}^n$ (where $n$ is the number of variables) into the question of whether the equations $x_1^2 = x_1, \ldots, x_n^2 = x_n$ and $P(x) = m$ can be solved, where $m$ is the number of clauses and $P$ is the degree-3 polynomial obtained by summing for every clause $C$ the polynomial $P_C$ such that $P_C(x)$ equals $1$ if $x$ satisfies clause $C$ and $0$ otherwise.
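To make the translation concrete, here is a minimal Python sketch (the clause encoding and the toy formula are my own illustration, not taken from the course notes):

```python
import itertools

def clause_poly(clause, x):
    """Evaluate P_C(x): 1 if the 0/1 assignment x satisfies the clause, else 0.
    A clause is a list of (variable_index, is_positive) literals."""
    unsat = 1
    for i, positive in clause:
        lit = x[i] if positive else 1 - x[i]
        unsat *= 1 - lit          # the clause fails only if every literal is false
    return 1 - unsat              # degree <= 3 for a 3SAT clause

def formula_poly(clauses, x):
    """P(x) = sum of clause polynomials; P(x) == m iff x satisfies all m clauses."""
    return sum(clause_poly(c, x) for c in clauses)

# toy formula: (x0 or x1 or not x2) and (not x0 or x1 or x2)
clauses = [[(0, True), (1, True), (2, False)],
           [(0, False), (1, True), (2, True)]]
sat = [x for x in itertools.product([0, 1], repeat=3)
       if formula_poly(clauses, x) == len(clauses)]
```

The constraints $x_i^2 = x_i$ force each variable to be $0$ or $1$, and then $P(x) = m$ holds exactly on the satisfying assignments.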
We will be interested in solving such equations over the real numbers, and typically in settings where (a) the polynomials in question are low degree, and (b) obtaining an approximate solution is essentially as good as obtaining an exact solution, which helps avoid at least some issues of precision and numerical accuracy. Nevertheless, this is still a very challenging setting. In particular, whenever there is more than one equation, or the degree is higher than two, the task of solving polynomial equations becomes non-convex, and generally speaking, there can be exponentially many local minima for the “energy function” obtained by summing up the squared violations of the equations. This is problematic since many of the tools we use to solve such equations involve some form of local search, maintaining at each iteration a current solution and looking for directions of improvement. Such methods can and will get “stuck” at such local minima.
When faced with a non-convex problem, one approach that is used in both practice and theory is to enlarge the search space. Geometrically, we hope that by adding additional dimensions, one may find new ways to escape local minima. Algebraically, this often amounts to adding additional variables, with a standard example being the linearization technique, where we reduce, say, quadratic equations in $n$ variables to linear equations in roughly $n^2$ variables by letting a new variable $y_{i,j}$ correspond to the product $x_i x_j$. If the original system was sufficiently overdetermined, one could hope that we can still solve for $y$.
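As a toy illustration of linearization (the setup and the rank-one recovery step are my own example, not a prescribed algorithm): a consistent, overdetermined quadratic system becomes a linear system in the variables $y_{i,j}$, and when the solution matrix happens to be rank one we can read off $x$ up to a global sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
x_true = rng.normal(size=n)
pairs = [(i, j) for i in range(n) for j in range(i, n)]   # monomials x_i x_j, i <= j

# an overdetermined quadratic system: sum_{i<=j} A[k][(i,j)] * x_i x_j = b[k]
m = 20
A = rng.normal(size=(m, len(pairs)))
b = A @ np.array([x_true[i] * x_true[j] for i, j in pairs])

# linearization: treat each product x_i x_j as a fresh variable y_{ij}
y, *_ = np.linalg.lstsq(A, b, rcond=None)

# assemble the symmetric matrix Y and recover x from its top eigenvector
Y = np.zeros((n, n))
for (i, j), val in zip(pairs, y):
    Y[i, j] = Y[j, i] = val
w, V = np.linalg.eigh(Y)                  # eigenvalues in ascending order
x_rec = np.sqrt(w[-1]) * V[:, -1]         # equals x_true up to a global sign
```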
The SOS algorithm is a systematic way of enlarging the search space by adding variables in just such a manner. In the example above it adds the additional constraint that the matrix $Y$ with entries $Y_{i,j} = y_{i,j}$ should be positive semidefinite; that is, that it satisfies $z^\top Y z \geq 0$ for every column vector $z$. (Note that if $Y$ was in fact of the form $Y_{i,j} = x_i x_j$ then $z^\top Y z$ would equal $\langle x, z \rangle^2 \geq 0$.) More generally, the SOS algorithm is parameterized by a number $d$, known as its degree, and for every set of polynomial equations on $n$ variables yields a semidefinite program on $n^{O(d)}$ variables that becomes a tighter and tighter approximation of the original equations as $d$ grows. As the problem is NP-hard, we don’t expect this algorithm to solve polynomial equations efficiently (i.e., with small degree $d$) in the most general case, but understanding in which cases it does so is the focus of much research effort and the topic of this course.
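A quick numerical check of why positive semidefiniteness is a valid constraint to add: any “honest” matrix with $Y_{i,j} = x_i x_j$ is automatically PSD (a toy check, not part of the actual SOS machinery):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
Y = np.outer(x, x)                 # honest moment matrix: Y[i, j] = x_i * x_j
z = rng.normal(size=4)
quad_form = z @ Y @ z              # equals <x, z>**2, hence nonnegative
eigs = np.linalg.eigvalsh(Y)       # rank one and PSD: one eigenvalue ~ |x|^2, rest ~ 0
```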
The SOS algorithm has its roots in questions raised in the late 19th century by Minkowski and Hilbert of whether any non-negative polynomial can be represented as a sum of squares of other polynomials. Hilbert realized that, except for some special cases (most notably univariate polynomials and quadratic polynomials), the answer is negative, and that there are examples, which he showed to exist by non-constructive means, of non-negative polynomials that cannot be represented in this way. It was only in the 1960’s that Motzkin gave a concrete example of such a polynomial, namely $M(x,y) = x^4 y^2 + x^2 y^4 - 3x^2 y^2 + 1$.
By the arithmetic-mean geometric-mean inequality, $\frac{x^4 y^2 + x^2 y^4 + 1}{3} \geq \sqrt[3]{x^4 y^2 \cdot x^2 y^4 \cdot 1} = x^2 y^2$, and hence this polynomial is always non-negative. However, it is not hard, though a bit tedious, to show that it cannot be expressed as a sum of squares of polynomials.
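One can at least verify the non-negativity numerically; this grid check is of course no substitute for the AM-GM argument, just a sanity check:

```python
import numpy as np

def motzkin(x, y):
    # Motzkin's polynomial: nonnegative everywhere, yet not a sum of squares of polynomials
    return x**4 * y**2 + x**2 * y**4 - 3 * x**2 * y**2 + 1

xs = np.linspace(-5, 5, 201)
vals = np.array([[motzkin(a, b) for b in xs] for a in xs])
# the minimum value 0 is attained at |x| = |y| = 1
```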
In his famous 1900 address, Hilbert asked as his 17th problem whether any non-negative polynomial can be represented as a sum of squares of rational functions. (For example, Motzkin’s polynomial above can be shown to be a sum of squares of four rational functions of bounded degree.) This was answered positively by Artin in 1927. His approach can be summarized as follows: given a hypothetical polynomial $p$ that cannot be represented in this form, use the fact that the rational functions are a field to extend the reals into a “pseudo-real” field in which there would actually be an element $x$ such that $p(x) < 0$, and then use a “transfer principle” to show that there is an actual real $x$ such that $p(x) < 0$. (This description is not meant to be understandable but to make you curious enough to look it up…)
Later, in the 60’s and 70’s, Krivine and Stengle extended this result to show that any unsatisfiable system of polynomial equations $\{p_i(x) = 0\}$ can be certified to be unsatisfiable via a Sum of Squares (SOS) proof (i.e., by showing that it implies an equation of the form $-1 = s(x) + \sum_i q_i(x) p_i(x)$ for some polynomials $q_i$ and a sum-of-squares polynomial $s$). This result is known as the Positivstellensatz. The survey of Reznick and monograph of Murray are good sources for much of this material.
In the late 90’s / early 2000’s, there were two separate efforts on getting quantitative / algorithmic versions of this result. On one hand Grigoriev and Vorobjov (1999) asked the question of how large the degree of an SOS proof needs to be, and in particular Grigoriev (1999,2001) proved several lower bounds on this degree for some interesting polynomials. On the other hand Parrilo (2000) and Lasserre (2001) independently came up with hierarchies of algorithms for polynomial optimization based on the Positivstellensatz using semidefinite programming. (A less general version of this algorithm was also described by Naum Shor in a 1987 Russian paper, which was cited by Nesterov in 1999.)
It turns out that the SOS algorithm generalizes and encapsulates many other convex-programming-based algorithmic hierarchies, such as those proposed by Lovász and Schrijver and by Sherali and Adams, as well as more specific algorithmic techniques such as linear programming and spectral methods. As mentioned above, the SOS algorithm seems to achieve a “goldilocks” balance of being strong enough to capture interesting techniques but weak enough that we can actually prove lower bounds for it. One of the goals of this course (and line of research) is also to understand which algorithmic techniques cannot be captured by SOS, particularly in the settings (e.g., noisy low-degree polynomial optimization) for which it seems most appropriate.
SOS has applications to: equilibrium analysis of dynamics and control (robotics, flight controls, …), robust and stochastic optimization, statistics and machine learning, continuous games, software verification, filter design, quantum computation and information, automated theorem proving, packing problems, etc…
The SOS algorithm is intensively studied in several fields, but different communities emphasize different aspects of it. The main characteristics of the Theoretical Computer Science (TCS) viewpoint, as opposed to that of other communities, are:
In theoretical computer science we typically define a computational problem and then try to find the best (e.g., most time efficient, smallest approximation factor, etc..) algorithm for this problem. One can ask what is the point in restricting attention to a particular algorithmic framework such as SOS, as opposed to simply trying to find the best algorithm for the problem at hand. One answer is that we could hope that if a problem is solved via a general framework, then that solution would generalize better to different variants and cases (e.g., considering average-case variants of a worst-case problem, or measuring “goodness” of the solution in different ways).
This is a general phenomenon that occurs time and again in many fields, known under many names including the “bias-variance tradeoff”, the “stability-plasticity dilemma”, the “performance-robustness tradeoff” and many others. That is, there is an inherent tension between optimally solving a particular question (or optimally adapting to a particular environment) and being robust to changes in the question/environment (e.g., avoiding “overfitting”). For example, consider the following two species that roamed the earth a few hundred million years ago during the Mesozoic era. The dinosaurs were highly complex animals that were well adapted to their environment. In contrast, cockroaches have extremely simple reflexes, operating only on very general heuristics such as “run if you feel a brush of air”. As one can tell by the scarcity of “dinosaur spray” in stores today, it was the latter species that was more robust to changes in the environment. With that being said, we do hope that the SOS algorithm is at least approximately optimal in several interesting settings.
You are invited to submit your proposal of workshop or tutorial by August 31st; see details here.
In short: you just need to propose an exciting theme and arrange the speakers. We will take care of the logistical details like rooms, AV, coffee breaks, etc.
Note that this is only a half-day event (2:30pm-6pm) since in the morning there will be another not-to-be-missed event: A Celebration of Mathematics and Computer Science: Celebrating Avi Wigderson’s 60th birthday (which actually starts already on Thursday, October 5th). See Boaz’s announcement here.
If you have any questions about the FOCS workshops, feel free to get in touch with the coordinators: Aleksander Madry and Alexandr Andoni.
STOC Theory Fest 2017 (Montreal June 19-23)
Sanjeev Arora, Paul Beame, Avrim Blum, Ryan Williams
SIGACT Chair Michael Mitzenmacher announced at the STOC’16 business meeting that starting in 2017, STOC will turn into a 5-day event, a Theory Fest. This idea was discussed at some length in a special session at FOCS 2014 and the business meeting at STOC 2015. Now the event is being planned by a small group (Sanjeev Arora, SIGACT ex-chair Paul Beame, Avrim Blum, and Ryan Williams; we also get guidance from Michael Mitzenmacher and STOC’17 PC chair Valerie King). We’re setting up committees to oversee various aspects of the event.
Here are the major changes (caveat: subject to tweaking in coming years):
(i) STOC talks go into 3 parallel sessions instead of two. Slight increase in number of accepts to 100-110.
(ii) STOC papers also required to be presented in evening poster sessions (beer/snacks served).
(iii) About 9 hours of plenary sessions, which will include: (a) Three keynote 50-minute talks (usually prominent researchers from theory and nearby fields). (b) Plenary 20-minute talks selected from the STOC program by the STOC PC: best papers, and a few others. (c) Plenary 20-minute talks on notable papers from the broader theory world in the past year (including but not limited to FOCS, ICALP, SODA, CRYPTO, QIP, COMPLEXITY, SoCG, COLT, PODC, SPAA, KDD, SIGMOD/PODS, SIGMETRICS, WWW, ICML/NIPS), selected by a committee from a pool of nominations. (Many nominees may be invited instead to the poster session.)
(iv) 2-hour tutorials (three in parallel).
(v) Some community-building activities, including grad student activities, networking, career advice, funding, recruiting, etc.
(vi) A day of workshops; 3 running in parallel. (Total of 18 workshop-hours.)
Our hope is that workshop day(s) will over time develop into a separate eco-system of regular meetings and co-located conferences (short or long). In many other CS fields the workshop days generate as much energy as the main conference, and showcase innovative, edgy work.
Poster sessions have been largely missing at STOC, but they have advantages: (a) Attendees can quickly get a good idea of all the work presented at the conference. (b) Grads and young researchers get more opportunities to present their work and to interact/network, fueled by beer and snacks. (c) Attendees get an easy way to make up for having missed a talk during the day, or to ask followup questions. (d) Posters on work from other theory conferences broaden the intellectual scope of STOC.
We invite other theory conferences to consider co-locating with the Theory Fest. To allow such coordination, in the future the location and dates for the Theory Fest will be announced at least 18 months in advance, preferably 2 years. Even for 2017 it is not yet too late.
Finally, we see the Theory Fest as a work in progress. Feedback from attendees will be actively sought and used to refashion the event.
Happy families are all alike; every unhappy family is unhappy in its own way.
I am talking about the work Reed-Muller Codes Achieve Capacity on Erasure Channels by Shrinivas Kudekar, Santhosh Kumar, Marco Mondelli, Henry D. Pfister, Eren Sasoglu and Rudiger Urbanke. We are used to thinking of some error correcting codes as being “better” than others in the sense that they have fewer decoding errors. But it turns out that in some sense all codes of a given rate have the same average number of errors. The only difference is that “bad” codes (such as the repetition code) have a fairly “smooth” error profile, in the sense that the probability of decoding success decays essentially like a low-degree polynomial in the fraction of errors, while for “good” codes the decay is like a step function: one can succeed with probability close to one when the error is smaller than some threshold $\tau$, but this probability quickly decays to half when the error passes $\tau$.
Specifically, if $C \subseteq \{0,1\}^n$ is a linear code of dimension $k$ and $p \in [0,1]$, we let $Y$ be the random variable over $\{0,1,?\}^n$ that is obtained by sampling a random codeword $X$ in $C$ and erasing (i.e., replacing it with $?$) every coordinate independently with probability $p$. Then we define $h(p)$ to be the average over all $i$ of the conditional entropy of $X_i$ given $Y_{-i}$, the coordinates of $Y$ other than the $i$-th. Note that for linear codes, the coordinate $X_i$ is either completely fixed by $Y_{-i}$ or it is a completely uniform bit, and hence $n \cdot h(p)$ can be thought of as the expected number of coordinates that we won’t be able to decode with probability better than half from the roughly $(1-p)n$-sized random subset of the remaining coordinates that survives the erasures.
One formalization of this notion that all codes have the same average number of errors is known as the Area Law for EXIT functions, which states that for every code $C$ of dimension $k$ and block length $n$, the integral $\int_0^1 h(p)\,dp$ is a fixed constant depending only on $k$ and $n$ and not on the structure of $C$. In particular, note that if $C$ is the simple “repetition code” where we simply repeat a single symbol $n$ times, then the probability that we can’t decode some coordinate from the remaining ones (in which case the entropy is one) is exactly $p^{n-1}$, where $p$ is the erasure probability. Hence in this case we can easily compute the integral $\int_0^1 p^{n-1}\,dp = 1/n$, which is simply the rate of the code. In particular, this tells us that the average entropy $\int_0^1 h(p)\,dp$ is always equal to the rate $k/n$ of the code. A code is said to be capacity achieving if there is some function $\epsilon = \epsilon(n)$ that goes to zero with $n$ such that $h(p) \leq \epsilon$ whenever $p \leq 1 - k/n - \epsilon$. The area law immediately implies that in this case it must be that $h(p)$ is close to one when $p > 1 - k/n$ (since otherwise the total integral would be smaller than $k/n$), and hence a code is capacity achieving if and only if the function $h$ has a threshold behavior. (See figure below.)
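For the repetition code the area computation can be checked numerically (a sketch assuming, as above, an EXIT function of $h(p) = p^{n-1}$):

```python
import numpy as np

n = 5                                    # repetition code: one bit copied n times, rate 1/n
ps = np.linspace(0.0, 1.0, 10001)
h = ps ** (n - 1)                        # coordinate undecodable iff the other n-1 copies are all erased
area = np.sum((h[:-1] + h[1:]) / 2 * np.diff(ps))   # trapezoid rule for the EXIT integral
```

The area comes out to $1/n$, the rate, no matter how we pick $n$.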
The paper above uses this observation to show that the Reed-Muller code is capacity achieving for the binary erasure channel. The only property they use is the symmetry of this code (its automorphism group is transitive on the coordinates), which means that for this code we might as well have defined $h(p)$ using some fixed coordinate (e.g., the first one). In this case, using linearity, we can see that for every erasure pattern on the remaining coordinates, the entropy of $X_1$ given the unerased coordinates is a Boolean monotone function of the pattern. (Booleanity follows because in a linear subspace the entropy of the remaining coordinate is either zero or one; monotonicity follows because in the erasure channel erasing more coordinates cannot help you decode.) One can then use the sharp-threshold theorems of Friedgut or Friedgut-Kalai to establish such a threshold behavior. (The Reed-Muller code has an additional stronger property of double transitivity, which allows one to deduce that one can decode not just most coordinates but all coordinates with high probability when the fraction of errors is smaller than the capacity.)
How do you prove this area law? The idea is simple. Because of linearity, we can think of the following setting: suppose we take the all-zero codeword, permute its coordinates randomly, and reveal the first $t$ of them. Then the probability that the $(t+1)$-st coordinate is determined to be zero as well is $1 - h(1 - t/n)$. Another way to say this is that if we permute the columns of the generating matrix of $C$ randomly, then the probability that the $(t+1)$-st column is independent from the first $t$ columns is $h(1 - t/n)$. In other words, if we keep track of the rank of the first $t$ columns, then at step $t$ the probability that the rank will increase by one is $h(1 - t/n)$, but since we know that the rank of all $n$ columns is $k$, it follows that $\sum_{t=0}^{n-1} h(1 - t/n) = k$, and hence $\int_0^1 h(p)\,dp \approx k/n$, which is what we wanted to prove. QED
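The rank bookkeeping behind this argument can be simulated directly (a sketch with a random generating matrix over GF(2); the code and parameters are illustrative):

```python
import numpy as np

def rank_gf2(M):
    """Rank of a 0/1 matrix over GF(2) by Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    rows, cols = M.shape
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]      # move the pivot row into place
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]                  # clear column c in the other rows
        rank += 1
    return rank

rng = np.random.default_rng(0)
k, n = 3, 8
G = rng.integers(0, 2, size=(k, n))      # generating matrix; columns = code coordinates
perm = rng.permutation(n)

# rank of the first t randomly ordered columns, for t = 0..n
ranks = [rank_gf2(G[:, perm[:t]]) for t in range(n + 1)]
increments = [ranks[t + 1] - ranks[t] for t in range(n)]
# each step the rank goes up by 0 or 1, and the increments must sum to the total rank
```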
p.s. Thanks to Yuval Wigderson, whose senior thesis is a good source for these questions.
Attendance is free but registration is required. Also there are funds for travel support for students for which you should apply before August 1st.
Confirmed speakers are:
Incorporating differential privacy broadly into Apple’s technology is visionary, and positions Apple as the clear privacy leader among technology companies today.
Learning more about the underlying technology would benefit the research community and assure the public of validity of these statements. (We, at Research at Google, are trying to adhere to the highest standards of transparency by releasing Chrome’s front-end and back-end for differentially private telemetry.)
I am confident this moment will come. For now, our heartfelt congratulations to everyone, inside and outside Apple, whose work made today’s announcement possible!
Yesterday Hillary Clinton became the first woman to be (presumptively) nominated for president by a major party. But in the eyes of many, the Republican Party was first to make history this election season by breaking the “qualifications ceiling” (or perhaps floor) in their own (presumptive) nomination.
Though already predicted in 2000 by The Simpsons, the possibility of a Trump presidency has rattled enough people so that even mostly technical bloggers such as Terry Tao and Scott Aaronson felt compelled to voice their opinion.
We too have been itching for a while to weigh in and share our opinions, using every tool at our disposal, including this blog. We certainly think it’s very appropriate for scientists to be involved citizens and speak up about their views. But though we debated it, we felt that, this being a group (technical) blog, it’s best not to wade into politics (as long as it doesn’t directly touch on issues related to computer science, such as the Apple vs. FBI case). Hence we will refrain from future postings about the presidential election. For full disclosure, both of us personally support Hillary Clinton and have been donating to her campaign.
Among other things, Bobby showed the proof of the following result, that demonstrates much of those ideas:
Theorem: (Ellenberg and Gijswijt, building on Croot-Lev-Pach) There exists an absolute constant $\delta > 0$ such that for every $n$, if $A \subseteq \mathbb{F}_3^n$ satisfies $|A| \geq 3^{(1-\delta)n}$ then $A$ contains a 3-term arithmetic progression.
To put this in perspective, up till a few weeks ago the best known bounds were of the form $3^n/n^{1+\epsilon}$ and were shown using fairly complicated proofs, and it was very reasonable to assume that a bound of the form $3^n/\mathrm{poly}(n)$ is the best we can do. Indeed, an old construction of Behrend shows that this is the case in other groups, such as the integers modulo some large $N$, or $\mathbb{Z}_m^n$ where $m$ is some large value depending on $n$. The proof generalizes to $\mathbb{F}_p^n$ for every constant prime $p$ (and to cyclic groups of composite order as well).
The proof is extremely simple. It seems to me that it can be summarized in two observations:
Let’s now show the proof. Assume towards a contradiction that $A \subseteq \mathbb{F}_3^n$ satisfies $|A| \geq 3^{(1-\delta)n}$ (where $\delta$ is some sufficiently small constant) but there do not exist three distinct points $a, b, c \in A$ that form a 3-a.p. (i.e., such that $b - a = c - b$ or, equivalently, $a + b + c = 0$, since $2b = -b$ modulo $3$).
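For intuition, one can brute-force the smallest case (a toy check of my own; in $\mathbb{F}_3^2$ the largest 3-a.p.-free set, a "cap set", has size 4):

```python
import itertools

def points(n):
    return list(itertools.product(range(3), repeat=n))

def has_3ap(S):
    """Three distinct points of F_3^n form a 3-a.p. iff a + b + c = 0 (mod 3)."""
    S = set(S)
    for a, b in itertools.combinations(S, 2):
        c = tuple((-u - v) % 3 for u, v in zip(a, b))   # the third point of the line through a, b
        if c in S and c != a and c != b:
            return True
    return False

# in F_3^2: some 4-point sets avoid 3-a.p.s, but every 5-point set contains one
caps4 = [S for S in itertools.combinations(points(2), 4) if not has_3ap(S)]
all5_have_ap = all(has_3ap(S) for S in itertools.combinations(points(2), 5))
```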
Let $M_d$ be the number of $n$-variate monomials over $\mathbb{F}_3$ where each variable has individual degree at most $2$ (higher degrees can be ignored modulo $3$, since $x^3 = x$ for every $x \in \mathbb{F}_3$) and the total degree is at most $d$. Note that there are $3^n$ possible monomials where each individual degree is at most two, and their total degree ranges from $0$ to $2n$, where by concentration of measure most of them have degree roughly $n$. Indeed, using the Chernoff bound we can see that if $\delta$ is a sufficiently small constant, we can pick some $d$ such that $3^n - M_d \leq |A|/2$ but $M_{d/2} < |A|/4$ (to get optimal results, one sets $d$ to be roughly $4n/3$ and derives $\delta$ from this value).
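The count of such monomials is just the distribution of a sum of $n$ independent degrees in $\{0,1,2\}$, which we can tabulate exactly (a small sanity check of the concentration claim; the parameters are illustrative):

```python
def num_monomials(n, d):
    """Monomials in n variables with each individual degree in {0, 1, 2}
    and total degree at most d, counted by dynamic programming."""
    dp = [1] + [0] * (2 * n)            # dp[t] = #degree sequences so far summing to t
    for _ in range(n):
        new = [0] * (2 * n + 1)
        for t, ways in enumerate(dp):
            for e in (0, 1, 2):         # each variable contributes degree 0, 1, or 2
                if t + e <= 2 * n:
                    new[t + e] += ways
        dp = new
    return sum(dp[: d + 1])

n = 20
total = num_monomials(n, 2 * n)         # no degree bound: all 3^n monomials
below_mean = num_monomials(n, n - 1)    # total degree strictly below the mean n
```

The counts are symmetric around total degree $n$, so more than half the monomials have degree at most $n$ while fewer than half have degree strictly below $n$.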
Now, if we choose $d$ in that manner, then we can find a polynomial $P$ of degree at most $d$ that vanishes on $\mathbb{F}_3^n \setminus A$ but is nonzero on at least $|A|/2$ points. Indeed, finding such a polynomial amounts to solving a set of $3^n - |A|$ linear equations in $M_d$ variables.^{1} Define the $|A| \times |A|$ matrix $B$ such that $B_{a,b} = P(-a-b)$. Since our choice of $d$ implies that $2M_{d/2} < |A|/2$, the theorem follows immediately from the following two claims:
Claim 1: $rank(B) \geq |A|/2$.
Claim 2: $rank(B) \leq 2M_{d/2}$.
Claim 1 is fairly immediate. Since $A$ is 3-a.p. free, for every distinct $a, b \in A$, the point $-a-b$ is not in $A$ and hence $B$ is zero on all the off-diagonal elements. On the other hand, by the way we chose $P$ (and using $-2a = a$ modulo $3$), $B$ has at least $|A|/2$ nonzeroes on the diagonal.
For Claim 2, we expand $P(-a-b)$ as a polynomial of degree at most $d$ in the two variables $a$ and $b$, and write $B = B' + B''$, where $B'$ corresponds to the part of this polynomial where the degree in $a$ is at most $d/2$ and $B''$ corresponds to the part where the degree in $a$ is larger, and hence the degree in $b$ is at most $d/2$. We claim that both $rank(B')$ and $rank(B'')$ are at most $M_{d/2}$. Indeed, we can write $B'_{a,b} = \sum_m c_m\, m(a)\, q_m(b)$ for some coefficients $c_m$ and polynomials $q_m$, where $m$ ranges over the monomials in $a$ of degree at most $d/2$. But this shows that $B'$ is a sum of at most $M_{d/2}$ rank-one matrices and hence $rank(B') \leq M_{d/2}$. The same reasoning shows that $rank(B'') \leq M_{d/2}$, thus completing the proof of Claim 2 and the theorem itself.
More formally, we can argue that the set of degree-$d$ polynomials that vanish on $\mathbb{F}_3^n \setminus A$ has dimension at least $M_d - (3^n - |A|) \geq |A|/2$, and hence it contains a polynomial with at least this number of nonzero values.↩
Northwestern University held a workshop on semidefinite programming hierarchies and sum of squares. Videos of the talks by Prasad Raghavendra, David Steurer and myself are available from the link above. The content-to-unicorns ratio in Prasad and David’s talks is much higher ☺