==== Algorithms To Live By ==== ([[reading notes]] for Algorithms to Live By: The Computer Science of Human Decisions, by Brian Christian and Tom Griffiths)
The word “algorithm” comes from the name of Persian mathematician al-Khwārizmī, author of a ninth-century book of techniques for doing mathematics by hand. (His book was called al-Jabr wa’l-Muqābala—and the “al-jabr” of the title in turn provides the source of our word “algebra.”) The earliest known mathematical algorithms, however, predate even al-Khwārizmī’s work: a four-thousand-year-old Sumerian clay tablet found near Baghdad describes a scheme for long division.
But algorithms are not confined to mathematics alone. When you cook bread from a recipe, you’re following an algorithm. When you knit a sweater from a pattern, you’re following an algorithm. When you put a sharp edge on a piece of flint by executing a precise sequence of strikes with the end of an antler—a key step in making fine stone tools—you’re following an algorithm. Algorithms have been a part of human technology ever since the Stone Age.
Applying the lens of computer science to everyday life has consequences at many scales. Most immediately, it offers us practical, concrete suggestions for how to solve specific problems. Optimal stopping tells us when to look and when to leap. The explore/exploit tradeoff tells us how to find the balance between trying new things and enjoying our favorites. Sorting theory tells us how (and whether) to arrange our offices. Caching theory tells us how to fill our closets. Scheduling theory tells us how to fill our time.
At the next level, computer science gives us a vocabulary for understanding the deeper principles at play in each of these domains. As Carl Sagan put it, “Science is a way of thinking much more than it is a body of knowledge.” Even in cases where life is too messy for us to expect a strict numerical analysis or a ready answer, using intuitions and concepts honed on the simpler forms of these problems offers us a way to understand the key issues and make progress.
Most broadly, looking through the lens of computer science can teach us about the nature of the human mind, the meaning of rationality, and the oldest question of all: how to live. Examining cognition as a means of solving the fundamentally computational problems posed by our environment can utterly change the way we think about human rationality.
The algorithms say: Don’t always consider all your options. Don’t necessarily go for the outcome that seems best every time. Make a mess on occasion. Travel light. Let things wait. Trust your instincts and don’t think too long. Relax. Toss a coin. Forgive, but don’t forget. To thine own self be true.
“Someone at Michigan” was almost certainly someone named Merrill Flood. Though he is largely unheard of outside mathematics, Flood’s influence on computer science is almost impossible to avoid. He’s credited with popularizing the traveling salesman problem (which we discuss in more detail in chapter 8), devising the prisoner’s dilemma (which we discuss in chapter 11), and even with possibly coining the term “software.” It’s Flood who made the first known discovery of the 37% Rule, in 1958, and he claims to have been considering the problem since 1949—but he himself points back to several other mathematicians.
====The Secretary Problem====
In any optimal stopping problem, the crucial dilemma is not which option to pick, but how many options to even consider
In your search for a secretary, there are two ways you can fail: stopping early and stopping late. When you stop too early, you leave the best applicant undiscovered. When you stop too late, you hold out for a better applicant who doesn’t exist
Look-Then-Leap Rule: You set a predetermined amount of time for “looking”—that is, exploring your options, gathering data—in which you categorically don’t choose anyone, no matter how impressive. After that point, you enter the “leap” phase, prepared to instantly commit to anyone who outshines the best applicant you saw in the look phase.
As the applicant pool grows, the exact place to draw the line between looking and leaping settles to 37% of the pool, yielding the 37% Rule: look at the first 37% of the applicants,* choosing none, then be ready to leap for anyone better than all those you’ve seen so far.
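A minimal simulation of the Look-Then-Leap Rule (my own sketch, not from the book; the function names and the pool of 100 numbered applicants are purely illustrative). It assumes applicants arrive in uniformly random order and can be ranked exactly; run over many trials, it lands the single best applicant roughly 37% of the time.

<code python>
import random

def look_then_leap(applicants, look_fraction=0.37):
    """Reject the first look_fraction of applicants, then take the first
    one who beats everybody seen during the look phase."""
    n = len(applicants)
    cutoff = int(n * look_fraction)
    best_seen = max(applicants[:cutoff]) if cutoff > 0 else float("-inf")
    for score in applicants[cutoff:]:
        if score > best_seen:
            return score
    return applicants[-1]  # forced to settle for the last applicant

def success_rate(n=100, trials=100_000):
    """Fraction of trials in which the rule lands the single best applicant."""
    wins = 0
    for _ in range(trials):
        pool = list(range(n))
        random.shuffle(pool)
        if look_then_leap(pool) == n - 1:
            wins += 1
    return wins / trials

print(f"Chance of hiring the best applicant: {success_rate():.3f}")  # ~0.37
</code>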
If you can recall previous applicants, the optimal algorithm puts a twist on the familiar Look-Then-Leap Rule: a longer noncommittal period, and a fallback plan.
Then the math says you should keep looking noncommittally until you’ve seen 61% of applicants, and then only leap if someone in the remaining 39% of the pool proves to be the best yet.
Knowing a Good Thing When You See It: Full Information
Full information means that we don’t need to look before we leap. We can instead use the Threshold Rule, where we immediately accept an applicant if she is above a certain percentile. We don’t need to look at an initial group of candidates to set this threshold—but we do need to be keenly aware of how much looking remains available.
Any yardstick that provides full information on where an applicant stands relative to the population at large will change the solution from the Look-Then-Leap Rule to the Threshold Rule.
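A rough sketch of how such a threshold schedule can be computed by backward induction. One caveat: to keep the code short it maximizes the expected percentile of the hire rather than the book’s stricter goal of landing the very best applicant, so the numbers differ from the thresholds the book reports; the point is only that the bar falls as the chances to look run out.

<code python>
def percentile_thresholds(n):
    """Backward-induction acceptance thresholds for the full-information game,
    under the simplified objective of maximizing the expected percentile
    (0 to 1) of the applicant you accept."""
    v = 0.5             # with one applicant left you must take her: expected 0.5
    thresholds = [0.0]  # facing the last applicant, accept anyone
    for _ in range(2, n + 1):
        thresholds.append(v)   # accept if this applicant beats the value of continuing
        v = (1 + v * v) / 2    # E[max(uniform percentile, v)]
    return list(reversed(thresholds))

# With 5 applicants, the bar drops as options run out:
print([round(t, 3) for t in percentile_thresholds(5)])
# [0.742, 0.695, 0.625, 0.5, 0.0]
</code>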
====When to Sell====
The critical thing to note in this problem is that our threshold depends only on the cost of search. Since the chances of the next offer being a good one—and the cost of finding out—never change, our stopping price has no reason to ever get lower as the search goes on, regardless of our luck. We set it once, before we even begin, and then we quite simply hold fast.
Motorists feature in some of the earliest literature on the secretary problem, and the framework of constant forward motion makes almost every car-trip decision into a stopping problem: the search for a restaurant; the search for a bathroom; and, most acutely for urban drivers, the search for a parking space
====The High Cost of Free Parking====
The problem of quitting while you’re ahead has been analyzed under several different guises, but perhaps the most appropriate to the oligarch Boris Berezovsky’s case—with apologies to Russian oligarchs—is known as the “burglar problem.” In this problem, a burglar has the opportunity to carry out a sequence of robberies. Each robbery provides some reward, and there’s a chance of getting away with it each time. But if the burglar is caught, he gets arrested and loses all his accumulated gains. What algorithm should he follow to maximize his expected take?
the results are pretty intuitive: the number of robberies you should carry out is roughly equal to the chance you get away, divided by the chance you get caught. If you’re a skilled burglar and have a 90% chance of pulling off each robbery (and a 10% chance of losing it all), then retire after 90/10 = 9 robberies. A ham-fisted amateur with a 50/50 chance of success? The first time you have nothing to lose, but don’t push your luck more than once.
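A quick simulation (invented names, one unit of reward per job) confirms that the expected haul peaks near the chance of getting away divided by the chance of getting caught.

<code python>
import random

def expected_take(p_success, quit_after, trials=200_000, reward=1.0):
    """Average haul if the burglar plans to stop after quit_after jobs,
    losing everything the first time a job goes wrong."""
    total = 0.0
    for _ in range(trials):
        haul = 0.0
        for _ in range(quit_after):
            if random.random() < p_success:
                haul += reward
            else:
                haul = 0.0   # caught: all accumulated gains are lost
                break
        total += haul
    return total / trials

# For a 90% success rate, the take peaks around 90/10 = 9 planned jobs.
for n in (5, 9, 15):
    print(n, round(expected_take(0.9, n), 2))
</code>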
there are sequential decision-making problems for which there is no optimal stopping rule. A simple example is the game of “triple or nothing.”
The math shows that you should always keep playing. But if you follow this strategy, you will eventually lose everything. Some problems are better avoided than solved.
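A tiny sketch of the paradox, assuming a starting stake of 1 and a fair triple-or-nothing coin: the average outcome looks great, but it is propped up entirely by a vanishing handful of survivors.

<code python>
import random

def triple_or_nothing(rounds, stake=1.0):
    """Bet the entire stake every round: half the time it triples, half the time it's gone."""
    for _ in range(rounds):
        if random.random() < 0.5:
            stake *= 3
        else:
            return 0.0
    return stake

# The expected value after 10 rounds is 1.5**10, about 58 times the stake,
# so the math says keep playing; but only about 1 player in 1,000 ends with anything.
results = [triple_or_nothing(10) for _ in range(100_000)]
print("average ending stake:", sum(results) / len(results))
print("players left with anything:", sum(r > 0 for r in results))
</code>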
About a dozen studies have produced the same result: people tend to stop early, leaving better applicants unseen. To get a better sense for these findings, we talked to UC Riverside’s Amnon Rapoport, who has been running optimal stopping experiments in the laboratory for more than forty years.
The “endogenous” time costs of searching, which aren’t usually captured by optimal stopping models, might thus provide an explanation for why human decision-making routinely diverges from the prescriptions of those models.
Intuitively, we think that rational decision-making means exhaustively enumerating our options, weighing each one carefully, and then selecting the best. But in practice, when the clock—or the ticker—is ticking, few aspects of decision-making (or of thinking more generally) are as important as this one: when to stop.
====The Explore/Exploit Tradeoff====
exploration is gathering information, and exploitation is using the information you have to get a known good result.
In computer science, the tension between exploration and exploitation takes its most concrete form in a scenario called the “multi-armed bandit problem.”
People tend to treat decisions in isolation, focusing each time on finding the outcome with the highest expected value. But decisions are almost never isolated, and expected value isn’t the end of the story. If you’re thinking not just about the next decision, but about all the decisions you are going to make about the same options in the future, the explore/exploit tradeoff is crucial to the process.
When balancing favorite experiences and new ones, nothing matters as much as the interval over which we plan to enjoy them.
So explore when you will have time to use the resulting knowledge, exploit when you’re ready to cash in. The interval makes the strategy.
Interestingly, since the interval makes the strategy, then by observing the strategy we can also infer the interval.
Robbins specifically considered the case where there are exactly two slot machines, and proposed a solution called the Win-Stay, Lose-Shift algorithm: choose an arm at random, and keep pulling it as long as it keeps paying off. If the arm doesn’t pay off after a particular pull, then switch to the other one. Although this simple strategy is far from a complete solution, Robbins proved in 1952 that it performs reliably better than chance.
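A sketch of Win-Stay, Lose-Shift on two simulated machines (payoff rates of my own choosing): over many pulls it wins about 52% of the time here, better than the 50% of choosing at random, though short of the 60% an omniscient player would get.

<code python>
import random

def win_stay_lose_shift(true_probs, pulls=100_000, seed=1):
    """Robbins' rule for a two-armed bandit: keep pulling the same arm after
    a win, switch to the other arm after a loss."""
    rng = random.Random(seed)
    arm = rng.randrange(2)      # start on an arm chosen at random
    winnings = 0
    for _ in range(pulls):
        if rng.random() < true_probs[arm]:
            winnings += 1       # win: stay on this arm
        else:
            arm = 1 - arm       # lose: shift to the other arm
    return winnings / pulls

# Machines paying off 60% and 40% of the time:
print(win_stay_lose_shift([0.6, 0.4]))   # roughly 0.52
</code>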
The Gittins Index
As so often happens in mathematics, though, the particular is the gateway to the universal
Unlike previous researchers, Gittins approached the multi-armed bandit problem in those terms. He conceived the goal as maximizing payoffs not for a fixed interval of time, but for a future that is endless yet discounted
Gittins index values as a function of wins and losses, assuming that a payoff next time is worth 90% of a payoff now.
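The book presents those values as a table. The sketch below is my own rough approximation, not the book’s method: it treats the index as the guaranteed payout rate that would make you indifferent between a sure thing and the uncertain machine, truncates the endless-but-discounted future at an arbitrary horizon, and binary-searches for the indifference point. GAMMA, HORIZON, and the tolerance are assumptions.

<code python>
from functools import lru_cache

GAMMA = 0.9    # a payoff next time is worth 90% of a payoff now
HORIZON = 60   # truncate the endless future; deeper = slower but more accurate

def gittins_index(wins, losses, tol=1e-4):
    """Approximate Gittins index for a slot machine with `wins` and `losses`
    observed so far (uniform prior): the guaranteed payout rate that would
    make us indifferent between it and the uncertain machine."""

    def value_of_pulling(rate):
        @lru_cache(maxsize=None)
        def v(w, l):
            if (w - wins) + (l - losses) >= HORIZON:
                return rate / (1 - GAMMA)              # too deep: settle for the sure thing
            p = (w + 1) / (w + l + 2)                  # Laplace's Law estimate for this arm
            pull = p * (1 + GAMMA * v(w + 1, l)) + (1 - p) * GAMMA * v(w, l + 1)
            return max(rate / (1 - GAMMA), pull)       # pull again, or retire on the sure rate
        return v(wins, losses)

    lo, hi = 0.0, 1.0                                  # binary-search the indifference point
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value_of_pulling(mid) > mid / (1 - GAMMA) + 1e-9:
            lo = mid                                   # still worth a pull: index is higher
        else:
            hi = mid
    return (lo + hi) / 2

print(round(gittins_index(0, 0), 3))   # roughly 0.70 for a machine you've never tried
print(round(gittins_index(1, 1), 3))   # a 1-and-1 record still beats its 50-50 track record
</code>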
====Regret and Optimism====
regret minimization framework
assuming you’re not omniscient, your total amount of regret will probably never stop increasing, even if you pick the best possible strategy—because even the best strategy isn’t perfect every time.
Logarithmically increasing regret means that we’ll make as many mistakes in our first ten pulls as in the following ninety, and as many in our first year as in the rest of the decade combined. (The first decade’s mistakes, in turn, are as many as we’ll make for the rest of the century.) That’s some measure of consolation. In general we can’t realistically expect someday to never have any more regrets.
algorithms that offer the guarantee of minimal regret.
if we’re following a regret-minimizing algorithm, every year we can expect to have fewer new regrets than we did the year before.
Upper Confidence Bound algorithms implement a principle that has been dubbed “optimism in the face of uncertainty.” Optimism, they show, can be perfectly rational. By focusing on the best that an option could be, given the evidence obtained so far, these algorithms give a boost to possibilities we know less about. As a consequence, they naturally inject a dose of exploration into the decision-making process, leaping at new options with enthusiasm because any one of them could be the next big thing.
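A sketch of a UCB1-style strategy (a standard textbook formulation, with made-up payoff rates): each arm is scored by its observed average plus an uncertainty bonus that shrinks as the arm gets tried more, so poorly understood options get an optimistic boost.

<code python>
import math
import random

def ucb1(true_probs, pulls=10_000, seed=0):
    """Upper Confidence Bound play: always pull the arm whose optimistic score,
    observed average plus an uncertainty bonus, is currently the highest."""
    rng = random.Random(seed)
    arms = range(len(true_probs))
    counts = [0] * len(true_probs)
    wins = [0] * len(true_probs)
    for t in range(1, pulls + 1):
        if t <= len(true_probs):
            arm = t - 1        # try every arm once to start
        else:
            arm = max(arms, key=lambda a: wins[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        if rng.random() < true_probs[arm]:
            wins[arm] += 1
    return counts, sum(wins)

counts, total = ucb1([0.2, 0.5, 0.6])
print(counts)   # the 0.6 arm ends up with the lion's share of the pulls
print(total)    # well above the ~4,300 expected from pulling arms purely at random
</code>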
A/B testing
Data scientist Jeff Hammerbacher, former manager of the Data group at Facebook, once told Bloomberg Businessweek that “the best minds of my generation are thinking about how to make people click ads.”
One of the fundamental questions that has arisen in the decades since the Belmont Report is whether the standard approach to conducting clinical trials really does minimize risk to patients
In 1969, Marvin Zelen, a biostatistician who is now at Harvard, proposed conducting “adaptive” trials. One of the ideas he suggested was a randomized “play the winner” algorithm—a version of Win-Stay, Lose-Shift, in which the chance of using a given treatment is increased by each win and decreased by each loss.
In Zelen’s procedure, you start with a hat that contains one ball for each of the two treatment options being studied. The treatment for the first patient is selected by drawing a ball at random from the hat (the ball is put back afterward). If the chosen treatment is a success, you put another ball for that treatment into the hat—now you have three balls, two of which are for the successful treatment. If it fails, then you put another ball for the other treatment into the hat, making it more likely you’ll choose the alternative.
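A short simulation of Zelen’s ball-in-the-hat allocation, with invented success rates for the two treatments: the hat gradually tilts assignment toward whichever treatment keeps winning.

<code python>
import random

def zelen_trial(success_rates, patients=1000, seed=0):
    """Zelen's randomized 'play the winner' allocation: draw a ball from the hat
    to pick a treatment; a success adds a ball for that treatment, a failure
    adds a ball for the other one."""
    rng = random.Random(seed)
    hat = [0, 1]                     # one ball per treatment to start
    assigned = [0, 0]
    successes = 0
    for _ in range(patients):
        treatment = rng.choice(hat)  # the drawn ball is conceptually put back
        assigned[treatment] += 1
        if rng.random() < success_rates[treatment]:
            successes += 1
            hat.append(treatment)        # reward the winner
        else:
            hat.append(1 - treatment)    # tilt away from the loser
    return assigned, successes

# A trial where treatment 1 works 70% of the time and treatment 0 only 40%:
print(zelen_trial([0.4, 0.7]))   # most patients end up on the better treatment
</code>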
ECMO
The widespread difficulty with accepting results from adaptive clinical trials might seem incomprehensible. But consider that part of what the advent of statistics did for medicine, at the start of the twentieth century, was to transform it from a field in which doctors had to persuade each other in ad hoc ways about every new treatment into one where they had clear guidelines about what sorts of evidence were and were not persuasive. Changes to accepted standard statistical practice have the potential to upset this balance, at least temporarily.
In general, it seems that people tend to over-explore—to favor the new disproportionately over the best. In a simple demonstration of this phenomenon, published in 1966, Amos Tversky and Ward Edwards conducted experiments where people were shown a box with two lights on it and told that each light would turn on a fixed (but unknown) percentage of the time. They were then given 1,000 opportunities either to observe which light came on, or to place a bet on the outcome without getting to observe it.
In that case, people chose to observe 505 times, on average, placing bets the other 495 times. But the math says they should have started to bet after just 38 observations—leaving 962 chances to cash in.
The standard multi-armed bandit problem assumes that the probabilities with which the arms pay off are fixed over time. But that’s not necessarily true of airlines, restaurants, or other contexts in which people have to make repeated choices. If the probabilities of a payoff on the different arms change over time—what has been termed a “restless bandit”—the problem becomes much harder. (So much harder, in fact, that there’s no tractable algorithm for completely solving it, and it’s believed there never will be.) Part of this difficulty is that it is no longer simply a matter of exploring for a while and then exploiting: when the world can change, continuing to explore can be the right choice
To live in a restless world requires a certain restlessness in oneself. So long as things continue to change, you must never fully cease exploring
Still, the algorithmic techniques honed for the standard version of the multi-armed bandit problem are useful even in a restless world. Strategies like the Gittins index and Upper Confidence Bound provide reasonably good approximate solutions and rules of thumb, particularly if payoffs don’t change very much over time. And many of the world’s payoffs are arguably more static today than they’ve ever been
What Carstensen and her colleagues found is that the shrinking of social networks with aging is due primarily to “pruning” peripheral relationships and focusing attention instead on a core of close friends and family members. This process seems to be a deliberate choice: as people approach the end of their lives, they want to focus more on the connections that are the most meaningful.
As she puts it, these decreases are “the result of lifelong selection processes by which people strategically and adaptively cultivate their social networks to maximize social and emotional gains and minimize social and emotional risks.”
Being sensitive to how much time you have left is exactly what the computer science of the explore/exploit dilemma suggests. We think of the young as stereotypically fickle; the old, stereotypically set in their ways. In fact, both are behaving completely appropriately with respect to their intervals. The deliberate honing of a social network down to the most meaningful relationships is the rational response to having less time to enjoy them.
Now, just how socks should be sorted is a good way to get computer scientists talking at surprising length.
The truncated top of an immense, sorted list is in many ways the universal user interface.
Sorting involves steep diseconomies of scale, violating our normal intuitions about the virtues of doing things in bulk. Cooking for two is typically no harder than cooking for one, and it’s certainly easier than cooking for one person twice. But sorting, say, a shelf of a hundred books will take you longer than sorting two bookshelves of fifty apiece: you have twice as many things to organize, and there are twice as many places each of them could go. The more you take on, the worse it gets.
This is the first and most fundamental insight of sorting theory. Scale hurts.
one of the best preventives against the computational difficulty of sock sorting is just doing your laundry more often.
Worst-case analysis lets us make hard guarantees: that a critical process will finish in time, that deadlines won’t be blown
Computer science has developed a shorthand specifically for measuring algorithmic worst-case scenarios: it’s called “Big-O” notation. Big-O notation has a particular quirk, which is that it’s inexact by design. That is, rather than expressing an algorithm’s performance in minutes and seconds, Big-O notation provides a way to talk about the kind of relationship that holds between the size of the problem and the program’s running time. Because Big-O notation deliberately sheds fine details, what emerges is a schema for dividing problems into different broad classes.
“Mergesort is as important in the history of sorting as sorting in the history of computing.”
Mergesort also has real applications in small-scale domestic sorting problems. Part of the reason why it’s so widely used is that it can easily be parallelized. If you’re still strategizing about that bookshelf, the Mergesort solution would be to order a pizza and invite over a few friends. Divide the books evenly, and have each person sort their own stack. Then pair people up and have them merge their stacks. Repeat this process until there are just two stacks left, and merge them one last time onto the shelf
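For reference, a compact Mergesort (a standard textbook version, nothing specific to the book): split, recursively sort each half, then merge the two sorted halves, which is exactly the divide-and-merge structure that makes the bookshelf pizza party work.

<code python>
def mergesort(items):
    """Split, sort each half recursively, then merge the two sorted halves,
    guaranteeing O(n log n) comparisons."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = mergesort(items[:mid]), mergesort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(mergesort(["Woolf", "Tolstoy", "Austen", "Borges", "Eco"]))
</code>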
Bucket Sort
Sort Is Prophylaxis for Search
Computer science, as undergraduates are taught, is all about tradeoffs. We’ve already seen this in the tensions between looking and leaping, between exploring and exploiting. And one of the most central tradeoffs is between sorting and searching. The basic principle is this: the effort expended on sorting materials is just a preemptive strike against the effort it’ll take to search through them later. What the precise balance should be depends on the exact parameters of the situation, but thinking about sorting as valuable only to support future search tells us something surprising:
Err on the side of messiness.
Sorting something that you will never search is a complete waste; searching something you never sorted is merely inefficient.
The question, of course, becomes how to estimate ahead of time what your future usage will be.
Steve Whittaker is one of the world’s experts on how people handle their email. A research scientist at IBM and professor at UC Santa Cruz, Whittaker, for almost two decades, has been studying how people manage personal information. (He wrote a paper on “email overload” in 1996, before many people even had email.) In 2011, Whittaker led a study of the searching and sorting habits of email users, resulting in a paper titled “Am I Wasting My Time Organizing Email?” Spoiler alert: the conclusion was an emphatic Yes. “It’s empirical, but it’s also experiential,” Whittaker points out. “When I interview people about these kinds of organizational problems, that’s something that they characteristically talk about, is that they sort of wasted a part of their life.”
Computer science shows that the hazards of mess and the hazards of order are quantifiable and that their costs can be measured in the same currency: time
Once you allow for a “noisy comparator,” some of computer science’s most hallowed algorithms go out the window—and some of its most maligned have their day of redemption.
Comparison Counting Sort is a quadratic-time algorithm, like Bubble Sort. Thus it’s not a popular choice in traditional computer science applications, but it’s exceptionally fault-tolerant.
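A sketch of Comparison Counting Sort with a caller-supplied (possibly noisy) comparison; the names and example data are invented. Each item is ranked by how many pairwise “games” it wins, so one erratic comparison changes a win count by only one and mistakes tend to stay local.

<code python>
def comparison_counting_sort(items, better):
    """Rank each item by how many pairwise comparisons it wins against the others.
    Quadratic in comparisons, but robust to an occasional wrong answer."""
    wins = [sum(better(items[i], items[j]) for j in range(len(items)) if j != i)
            for i in range(len(items))]
    order = sorted(range(len(items)), key=lambda i: wins[i], reverse=True)
    return [items[i] for i in order]

players = ["ann", "bob", "cal", "dee"]
strength = {"ann": 4, "bob": 3, "cal": 2, "dee": 1}
print(comparison_counting_sort(players, lambda a, b: strength[a] > strength[b]))
# ['ann', 'bob', 'cal', 'dee']
</code>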
What does sorting look like when it emerges organically, from the bottom up?
It might look something like online poker.
Displacement happens when an animal uses its knowledge of the hierarchy to determine that a particular confrontation simply isn’t worth it. In many animal societies, resources and opportunities—food, mates, preferred spaces, and so forth—are scarce, and somehow it must be decided who gets what.
Though we may cringe when we see creatures turning their claws and beaks on each other, biologists tend to think of pecking orders as the violence that preempts violence.
Sound familiar? It’s the search-sort tradeoff.
The creation of a pecking order is a pugilistic solution to a fundamentally computational problem. For this reason, incidentally, debeaking chickens on farms may be a well-intentioned but counterproductive approach: it removes the authority of individual fights to resolve the order, and therefore makes it much harder for the flock to run any sorting procedure at all. So the amount of antagonism within the flock in many cases actually increases.
The key to thinking about decentralized sorting in nature, argues Jessica Flack, codirector of the Center for Complexity and Collective Computation at UW–Madison, is that dominance hierarchies are ultimately information hierarchies. There’s a significant computational burden to these decentralized sorting systems, Flack points out. The number of fights in, say, a group of macaques is minimized only to the extent that every monkey has a detailed—and similar—understanding of the hierarchy. Otherwise violence will ensue.
a race is fundamentally different from a fight.
Consider the difference between boxers and skiers, between fencers and runners. An Olympic boxer must risk concussion O(log n) times, usually from 4 to 6, to make it to the podium; allowing a greater number of athletes into the games would imperil the health of all. But a skeleton racer or ski jumper or halfpipe specialist needs to make only a constant number of gambles with gravity, no matter the size of the field. A fencer puts herself at her opponent’s mercy O(log n) times, but a marathoner must endure only one race. Being able to assign a simple numerical measure of performance results in a constant-time algorithm for status.
Having a benchmark—any benchmark—solves the computational problem of scaling up a sort.
When we think about the factors that make large-scale human societies possible, it’s easy to focus on technologies: agriculture, metals, machinery. But the cultural practice of measuring status with quantifiable metrics might be just as important. Money, of course, need not be the criterion; a rule like “respect your elders,” for instance, likewise settles questions of people’s status by reference to a common quantity. And the same principle is at work between nations as within them. It is often noted that a benchmark like national GDP—which underlies the invite lists to diplomatic summits such as the G20—is a crude, imperfect measurement. But the existence of any benchmark at all transforms the question of national status from one demanding at least a linearithmic number of tussles and resolutions into something with a single reference point that ranks all. Given that nation-to-nation status disputes often take military form, this saves not only time but lives.
Linearithmic numbers of fights might work fine for small-scale groups; they do in nature. But in a world where status is established through pairwise comparisons—whether they involve exchanging rhetoric or gunfire—the amount of confrontation quickly spirals out of control as society grows.
“The Most Organized Man in America”
====The Memory Hierarchy====
Eviction and Clairvoyance
These algorithms are known as “replacement policies” or “eviction policies,” or simply as caching algorithms.
Bélády’s 1966 paper on caching algorithms would become the most cited piece of computer science research for fifteen years
the goal of cache management is to minimize the number of times you can’t find what you’re looking for in the cache and must go to the slower main memory to find it
The hypothetical all-knowing, prescient algorithm that would look ahead and execute the optimal policy is known today in tribute as Bélády’s Algorithm. Bélády’s Algorithm is an instance of what computer scientists call a “clairvoyant” algorithm
Least Recently Used (LRU): evicting the item that’s gone the longest untouched
Unless we have good reason to think otherwise, it seems that our best guide to the future is a mirror image of the past. The nearest thing to clairvoyance is to assume that history repeats itself—backward
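A minimal LRU cache, as a sketch using Python’s OrderedDict (the two-slot example is contrived): every access moves an item to the “most recently used” end, and when the cache overflows, the longest-untouched item is evicted.

<code python>
from collections import OrderedDict

class LRUCache:
    """A fixed-size cache that evicts the Least Recently Used item,
    betting that the recent past is the best mirror of the near future."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                   # cache miss: caller fetches from slow storage
        self.items.move_to_end(key)       # touched: now the most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)    # evict the longest-untouched entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")              # "a" is now the most recently used
cache.put("c", 3)           # evicts "b", the longest untouched
print(list(cache.items))    # ['a', 'c']
</code>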
The dominant performance of the LRU algorithm in most tests that computer scientists have thrown at it leads to a simple suggestion: turn the library inside out. Put acquisitions in the back, for those who want to find them. And put the most recently returned items in the lobby, where they are ripe for the browsing.
Humans are social creatures, and presumably the undergraduate body would find it interesting to peruse its own reading habits. It would nudge the campus toward a more organic and free-form version of what colleges strive for when they assign “common books”: the facilitation of intellectual common points of reference. Here, the books being read on campus, whatever they happened to be, would become the books most likely to be serendipitously encountered by other students. A kind of grassroots, bottom-up analogue of the common book program.
The Cloud at the End of the Street
We often think of the Internet as a flat, independent, and loosely connected network. In fact, it’s none of those things. A quarter of all Internet traffic at present is handled by a single corporation, one that manages to stay almost entirely out of the headlines. This Massachusetts-based company is called Akamai, and they’re in the caching business.
Caching is just as useful when it’s proximity, rather than performance, that’s the scarce resource.
self-organizing lists
The definitive paper on self-organizing lists, published by Daniel Sleator and Robert Tarjan in 1985, examined (in classic computer science fashion) the worst-case performance of various ways to organize the list given all possible sequences of requests
Sleator and Tarjan’s results showed that some “very simple self-adjusting schemes, amazingly, come within a constant factor” of clairvoyance. Namely, if you follow the LRU principle—where you simply always put an item back at the very front of the list—then the total amount of time you spend searching will never be more than twice as long as if you’d known the future. That’s not a guarantee any other algorithm can make.
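The same principle as a self-organizing list: a small sketch of move-to-front searching over a pile of (invented) papers. Whatever gets requested migrates to the top, so frequently used items stay cheap to find.

<code python>
def search_move_to_front(items, target):
    """Self-organizing list with the move-to-front (LRU) heuristic: scan for the
    target, count the comparisons spent, then move it to the front."""
    for position, item in enumerate(items):
        if item == target:
            items.insert(0, items.pop(position))   # the found item becomes the new front
            return position + 1                    # comparisons spent on this lookup
    return len(items)                              # not found: scanned everything

papers = ["taxes", "grant", "syllabus", "receipts"]
for request in ["syllabus", "syllabus", "grant", "syllabus"]:
    search_move_to_front(papers, request)
print(papers)   # frequently requested papers have drifted to the top of the pile
</code>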
In short, the mathematics of self-organizing lists suggests something radical: the big pile of papers on your desk, far from being a guilt-inducing fester of chaos, is actually one of the most well-designed and efficient structures available. What might appear to others to be an unorganized mess is, in fact, a self-organizing mess.
The question was straightforward: what patterns characterize the way the world itself “forgets”—the way that events and references fade over time? Anderson and Schooler analyzed three human environments: headlines from the New York Times, recordings of parents talking to their children, and Anderson’s own email inbox. In all domains, they found that a word is most likely to appear again right after it had just been used, and that the likelihood of seeing it again falls off as time goes on.
In other words, reality itself has a statistical structure that mimics the Ebbinghaus curve.
The Forgetting Curve
Hermann Ebbinghaus.
His results mapped out a graph of how memory fades over time, known today by psychologists as “the forgetting curve.”
The Tyranny of Experience
“Many people hold the bias that human memory is anything but optimal,” wrote Anderson and Schooler. “They point to the many frustrating failures of memory. However, these criticisms fail to appreciate the task before human memory, which is to try to manage a huge stockpile of memories. In any system responsible for managing a vast data base there must be failures of retrieval. It is just too expensive to maintain access to an unbounded number of items.”
What’s surprising is not memory’s slowdown, but the fact that the mind can possibly stay afloat and responsive as so much data accumulates.
If the fundamental challenge of memory really is one of organization rather than storage, perhaps it should change how we think about the impact of aging on our mental abilities
It’s not that we’re forgetting; it’s that we’re remembering. We’re becoming archives.
First Things First
Though we always manage to find some way to order the things we do in our days, as a rule we don’t consider ourselves particularly good at it—hence the perennial bestseller status of time-management guides. Unfortunately, the guidance we find in them is frequently divergent and inconsistent. Getting Things Done advocates a policy of immediately doing any task of two minutes or less as soon as it comes to mind. Rival bestseller Eat That Frog! advises beginning with the most difficult task and moving toward easier and easier things. The Now Habit suggests first scheduling one’s social engagements and leisure time and then filling the gaps with work—rather than the other way around, as we so often do. William James, the “father of American psychology,” asserts that “there’s nothing so fatiguing as the eternal hanging on of an uncompleted task,” but Frank Partnoy, in Wait, makes the case for deliberately not doing things right away.
“We are what we repeatedly do”
So, the mathematician Selmer Johnson asked, if you have several loads of laundry to do on the same day, what’s the best way to do them?
His answer was that you should begin by finding the single step that takes the least amount of time—the load that will wash or dry the quickest. If that shortest step involves the washer, plan to do that load first. If it involves the dryer, plan to do it last. Repeat this process for the remaining loads, working from the two ends of the schedule toward the middle.
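A sketch of that procedure for the two-machine (washer-then-dryer) case; the load names and times are made up. Quick washes go as early as possible to get the dryer busy, quick dries go as late as possible.

<code python>
def johnsons_rule(loads):
    """Johnson's two-machine schedule (washer then dryer): repeatedly find the
    single shortest step among the unscheduled loads; if it's a wash time,
    place that load as early as possible, if a dry time, as late as possible."""
    remaining = dict(loads)       # name -> (wash_minutes, dry_minutes)
    front, back = [], []
    while remaining:
        name = min(remaining, key=lambda n: min(remaining[n]))
        wash, dry = remaining.pop(name)
        if wash <= dry:
            front.append(name)    # quick wash: feed the dryer pipeline early
        else:
            back.insert(0, name)  # quick dry: save it for the end
    return front + back

loads = {"towels": (30, 60), "darks": (45, 15), "delicates": (10, 25), "whites": (40, 40)}
print(johnsons_rule(loads))   # ['delicates', 'towels', 'whites', 'darks']
</code>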
In the case of single-machine scheduling, however, if we are going to do all the tasks assigned, then all schedules will take equally long to complete; the order is irrelevant.
This is a sufficiently fundamental and counterintuitive point that it’s worth repeating. If you have only a single machine, and you’re going to do all of your tasks, then any ordering of the tasks will take you the same amount of time.
This is something of a theme in computer science: before you can have a plan, you must first choose a metric. And as it turns out, which metric we pick here will directly affect which scheduling approaches fare best.
Moore’s Algorithm
Do the difficult things while they are easy and do the great things while they are small —LAO TZU
Scheduling theorists call this metric the “sum of completion times.”
Minimizing the sum of completion times leads to a very simple optimal algorithm called Shortest Processing Time: always do the quickest task you can.
a simple modification of Shortest Processing Time: divide the weight of each task by how long it will take to finish, and then work in order from the highest resulting importance-per-unit-time (call it “density” if you like, to continue the weight metaphor) to the lowest. And while it might be hard to assign a degree of importance to each one of your daily tasks, this strategy nonetheless offers a nice rule of thumb: only prioritize a task that takes twice as long if it’s twice as important.
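As a sketch, weighted Shortest Processing Time is just a sort on importance per hour (the tasks and weights here are invented).

<code python>
def weighted_spt(tasks):
    """Weighted Shortest Processing Time: work in order of importance per
    unit time (weight divided by duration), highest 'density' first."""
    return sorted(tasks, key=lambda t: t["importance"] / t["hours"], reverse=True)

tasks = [
    {"name": "quarterly report", "importance": 8, "hours": 4},    # density 2.0
    {"name": "expense form",     "importance": 1, "hours": 0.5},  # density 2.0
    {"name": "strategy memo",    "importance": 9, "hours": 2},    # density 4.5
]
for t in weighted_spt(tasks):
    print(t["name"])
# strategy memo first; the report and the form tie at the same importance-per-hour
</code>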
When applied to debts rather than incomes, the same principle yields a strategy for getting in the black that’s come to be called the “debt avalanche.” This debt-reduction strategy says to ignore the number and size of your debts entirely, and simply funnel your money toward the debt with the single highest interest rate. This corresponds rather neatly to working through jobs in order of importance per unit time. And it’s the strategy that will reduce the total burden of your debt as quickly as possible.
Computer science can offer us optimal algorithms for various metrics available in single-machine scheduling, but choosing the metric we want to follow is up to us. In many cases, we get to decide what problem we want to be solving.
This offers a radical way to rethink procrastination, the classic pathology of time management. We typically think of it as a faulty algorithm. What if it’s exactly the opposite? What if it’s an optimal solution to the wrong problem?
As the researchers write, “this seemingly irrational choice reflected a tendency to pre-crastinate, a term we introduce to refer to the hastening of subgoal completion, even at the expense of extra physical effort.” Putting off work on a major project by attending instead to various trivial matters can likewise be seen as “the hastening of subgoal completion”—which is another way of saying that procrastinators are acting (optimally!) to reduce as quickly as possible the number of outstanding tasks on their minds. It’s not that they have a bad strategy for getting things done; they have a great strategy for the wrong metric.
Working on a computer brings with it an additional hazard when it comes to being conscious and deliberate about our scheduling metrics: the user interface may subtly (or not so subtly) force its own metric upon us.
Live by the metric, die by the metric. If all tasks are indeed of equal weight, then that’s exactly what we should be doing. But if we don’t want to become slaves to minutiae, then we need to take measures toward that end. This starts with making sure that the single-machine problem we’re solving is the one we want to be solving.
====Priority Inversion and Precedence Constraints====
The culprit was a classic scheduling hazard called priority inversion.
Priority inheritance. If a low-priority task is found to be blocking a high-priority resource, well, then all of a sudden that low-priority task should momentarily become the highest-priority thing on the system, “inheriting” the priority of the thing it’s blocking.
As comedian Mitch Hedberg explains, “If you’re flammable and have legs, you are never blocking a fire exit.”
“Things which matter most must never be at the mercy of things which matter least,” Goethe allegedly proclaimed; but while that has the ring of wisdom about it, sometimes it’s just not true. Sometimes that which matters most cannot be done until that which matters least is finished, so there’s no choice but to treat that unimportant thing as being every bit as important as whatever it’s blocking
After scheduling theorist Eugene Lawler’s death in 1994, the Association for Computing Machinery established an award in his name, honoring people who demonstrate the humanitarian potential of computer science.
The Shortest Processing Time algorithm, as we saw, is the optimal policy if you want to cross off as many items as quickly as possible from your to-do list. But if some of your tasks have precedence constraints, there isn’t a simple or obvious tweak to Shortest Processing Time to adjust for that.
What the researchers found was that even the subtlest change to a scheduling problem often tips it over the fine and irregular line between tractable and intractable. For example, Moore’s Algorithm minimizes the number of late tasks (or rotten fruits) when they’re all of equal value—but if some are more important than others, the problem becomes intractable
The drawing of the borders of scheduling theory continues to this day. A recent survey showed that the status of about 7% of all problems is still unknown, scheduling’s terra incognita. Of the 93% of the problems that we do understand, however, the news isn’t great: only 9% can be solved efficiently, and the other 84% have been proven intractable.* In other words, most scheduling problems admit no ready solution. If trying to perfectly manage your calendar feels overwhelming, maybe that’s because it actually is. Nonetheless, the algorithms we have discussed are often the starting point for tackling those hard problems—if not perfectly, then at least as well as can be expected.
====Drop Everything: Preemption and Uncertainty====
Minimizing maximum lateness (for serving customers in a coffee shop) or the sum of completion times (for rapidly shortening your to-do list) both cross the line into intractability if some tasks can’t be started until a particular time. But they return to having efficient solutions once preemption is allowed. In both cases, the classic strategies—Earliest Due Date and Shortest Processing Time, respectively—remain the best, with a fairly straightforward modification. When a task’s starting time comes, compare that task to the one currently under way. If you’re working by Earliest Due Date and the new task is due even sooner than the current one, switch gears; otherwise stay the course. Likewise, if you’re working by Shortest Processing Time, and the new task can be finished faster than the current one, pause to take care of it first; otherwise, continue with what you were doing.
It turns out, though, that even if you don’t know when tasks will begin, Earliest Due Date and Shortest Processing Time are still optimal strategies, able to guarantee you (on average) the best possible performance in the face of uncertainty. If assignments get tossed on your desk at unpredictable moments, the optimal strategy for minimizing maximum lateness is still the preemptive version of Earliest Due Date—switching to the job that just came up if it’s due sooner than the one you’re currently doing, and otherwise ignoring it. Similarly, the preemptive version of Shortest Processing Time—compare the time left to finish the current task to the time it would take to complete the new one—is still optimal for minimizing the sum of completion times.
the weighted version of Shortest Processing Time is a pretty good candidate for best general-purpose scheduling strategy in the face of uncertainty. It offers a simple prescription for time management: each time a new piece of work comes in, divide its importance by the amount of time it will take to complete. If that figure is higher than for the task you’re currently doing, switch to the new one; otherwise stick with the current task.
Preemption Isn’t Free: The Context Switch
First of all, people and computer operating systems alike face a curious challenge: the machine that is doing the scheduling and the machine being scheduled are one and the same. Which makes straightening out your to-do list an item on your to-do list—needing, itself, to get prioritized and scheduled.
Second, preemption isn’t free. Every time you switch tasks, you pay a price, known in computer science as a context switch. When a computer processor shifts its attention away from a given program, there’s always a certain amount of necessary overhead. It needs to effectively bookmark its place and put aside all of its information related to that program. Then it needs to figure out which program to run next. Finally it must haul out all the relevant information for that program, find its place in the code, and get in gear.
None of this switching back and forth is “real work”—that is, none of it actually advances the state of any of the various programs the computer is switching between. It’s metawork. Every context switch is wasted time.
Humans clearly have context-switching costs too. We feel them when we move papers on and off our desk, close and open documents on our computer, walk into a room without remembering what had sent us there, or simply say out loud, “Now, where was I?” or “What was I saying?” Psychologists have shown that for us, the effects of switching tasks can include both delays and errors—at the scale of minutes rather than microseconds. To put that figure in perspective, anyone you interrupt more than a few times an hour is in danger of doing no work at all.
====Thrashing====
But if a task requires keeping track of so many things that they won’t all fit into memory, then you might well end up spending more time swapping information in and out of memory than doing the actual work.
This is thrashing: a system running full-tilt and accomplishing nothing at all. Denning first diagnosed this phenomenon in a memory-management context, but computer scientists now use the term “thrashing” to refer to pretty much any situation where the system grinds to a halt because it’s entirely preoccupied with metawork. A thrashing computer’s performance doesn’t bog down gradually. It falls off a cliff. “Real work” has dropped to effectively zero, which also means it’s going to be nearly impossible to get out.
When merely remembering everything we need to be doing occupies our full attention—or prioritizing every task consumes all the time we had to do them—or our train of thought is continually interrupted before those thoughts can translate to action—it feels like panic, like paralysis by way of hyperactivity. It’s thrashing, and computers know it well.
In a thrashing state, you’re making essentially no progress, so even doing tasks in the wrong order is better than doing nothing at all. Instead of answering the most important emails first—which requires an assessment of the whole picture that may take longer than the work itself—maybe you should sidestep that quadratic-time quicksand by just answering the emails in random order, or in whatever order they happen to appear on-screen.
====Interrupt Coalescing====
Part of what makes real-time scheduling so complex and interesting is that it is fundamentally a negotiation between two principles that aren’t fully compatible. These two principles are called responsiveness and throughput: how quickly you can respond to things, and how much you can get done overall
Establishing a minimum amount of time to spend on any one task helps to prevent a commitment to responsiveness from obliterating throughput entirely: if the minimum slice is longer than the time it takes to context-switch, then the system can never get into a state where context switching is the only thing it’s doing. It’s also a principle that is easy to translate into a recommendation for human lives. Methods such as “timeboxing” or “pomodoros,” where you literally set a kitchen timer and commit to doing a single task until it runs out, are one embodiment of this idea.
But what slice size should you aim for?
The moral is that you should try to stay on a single task as long as possible without decreasing your responsiveness below the minimum acceptable limit. Decide how responsive you need to be—and then, if you want to get things done, be no more responsive than that.
if none of your email correspondents require you to respond in less than twenty-four hours, you can limit yourself to checking your messages once a day.
"All human knowledge is uncertain, inexact, and partial."—BERTRAND RUSSELL
====Predicting the Future====
Our days are full of “small data.” In fact, like the astrophysicist J. Richard Gott standing at the Berlin Wall, we often have to make an inference from the smallest amount of data we could possibly have: a single observation.
This is the crux of Bayes’s argument. Reasoning forward from hypothetical pasts lays the foundation for us to then work backward to the most probable one.
In fact, for any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w+1)⁄(n+2).
This incredibly simple scheme for estimating probabilities is known as Laplace’s Law, and it is easy to apply in any situation where you need to assess the chances of an event based on its history.
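In code, Laplace’s Law is a one-liner (the function name is mine).

<code python>
def laplace(wins, attempts):
    """Laplace's Law: estimate the chance of success as (w + 1) / (n + 2)."""
    return (wins + 1) / (attempts + 2)

print(laplace(0, 0))    # 0.5   -- no data at all: call it fifty-fifty
print(laplace(1, 1))    # 0.667 -- one try, one success
print(laplace(3, 10))   # 0.333 -- a bit above the raw 30%, pulled toward fifty-fifty
</code>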
Philosophical Essay on Probabilities
Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
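A minimal worked example of that multiplication (the coin scenario and the numbers are invented for illustration): multiply each hypothesis’s prior by the probability it assigns to the evidence, then renormalize.

<code python>
def posterior(priors, likelihoods):
    """Bayes's Rule in one line of arithmetic: prior times likelihood,
    renormalized so the hypotheses' probabilities sum to one."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Two hypotheses about a coin: fair, or a two-headed trick coin that a friend
# slips into play 1% of the time. We then see it come up heads three times.
priors = [0.99, 0.01]
likelihoods = [0.5 ** 3, 1.0 ** 3]
print(posterior(priors, likelihoods))   # the trick coin jumps from 1% to about 7.5%
</code>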
The Copernican Principle
Laplace’s Law
More generally, unless we know better we can expect to have shown up precisely halfway into the duration of any given phenomenon.* And if we assume that we’re arriving precisely halfway into something’s duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it’s lasted already
To apply Bayes’s Rule, as we have seen, we first need to assign a prior probability to each of these durations. And it turns out that the Copernican Principle is exactly what results from applying Bayes’s Rule using what is known as an uninformative prior.
====Real-World Priors====
It’s often lamented that “the rich get richer,” and indeed the process of “preferential attachment” is one of the surest ways to produce a power-law distribution. The most popular websites are the most likely to get incoming links; the most followed online celebrities are the ones most likely to gain new fans; the most prestigious firms are the ones most likely to attract new clients; the biggest cities are the ones most likely to draw new residents. In every case, a power-law distribution will result
Good predictions thus begin with having good instincts about when we’re dealing with a normal distribution and when with a power-law distribution. As it turns out, Bayes’s Rule offers us a simple but dramatically different predictive rule of thumb for each.
for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you’re working with.
Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide.
Something normally distributed that’s gone on seemingly too long is bound to end shortly; but the longer something in a power-law distribution has gone on, the longer you can expect it to keep going
Erlang distribution
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
distributions that yield the same prediction, no matter their history or current state, are known to statisticians as “memoryless.”
These three very different patterns of optimal prediction—the Multiplicative, Average, and Additive Rules—all result directly from applying Bayes’s Rule to the power-law, normal, and Erlang distributions, respectively. And given the way those predictions come out, the three distributions offer us different guidance, too, on how surprised we should be by certain events.
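A crude paraphrase of the three rules as a single rule-of-thumb function; this is not the exact Bayesian calculation for any particular distribution. The power-law multiplier of 2 is just the uninformative Copernican case, and the “average” and “constant” values below are placeholders.

<code python>
def predict_total(observed_so_far, distribution, average=None, constant=None):
    """Rule-of-thumb predictions of a total duration or amount, given how much
    has been observed so far."""
    if distribution == "power-law":
        return 2 * observed_so_far               # Multiplicative Rule (Copernican multiplier)
    if distribution == "normal":
        return max(average, observed_so_far)     # Average Rule: expect the average,
                                                 # and "any minute now" once past it
    if distribution == "erlang":
        return observed_so_far + constant        # Additive Rule: always a constant amount more
    raise ValueError(distribution)

print(predict_total(6, "power-law"))               # a 6-year-old firm: guess ~12 years total
print(predict_total(70, "normal", average=76))     # a 70-year-old: guess roughly the average lifespan
print(predict_total(20, "erlang", constant=15))    # memoryless wait: always "about 15 more"
</code>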
In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on. So a power-law event is more surprising the longer we’ve been waiting for it—and maximally surprising right before it happens. A nation, corporation, or institution only grows more venerable with each passing year, so it’s always stunning when it collapses.
In a normal distribution, events are surprising when they’re early—since we expected them to reach the average—but not when they’re late. Indeed, by that point they seem overdue to happen, so the longer we wait, the more we expect them.
And in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur. Any state of affairs is always equally likely to end regardless of how long it’s lasted.
The three prediction rules—Multiplicative, Average, and Additive—are applicable in a wide range of everyday situations. And in those situations, people in general turn out to be remarkably good at using the right prediction rule.
In light of what we know about Bayes’s Rule, this remarkably good human performance suggests something critical that helps to understand how people make predictions. Small data is big data in disguise. The reason we can often make good predictions from a small number of observations—or just a single one—is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans. We don’t need to gather them explicitly; we absorb them from the world.
There’s a crucial caveat here, however. In cases where we don’t have good priors, our predictions aren’t good.
Good predictions require good priors.
Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot—about the world we live in, and about our own past.
What Our Predictions Tell Us About Ourselves
Decades after the original marshmallow experiments, Walter Mischel and his colleagues went back and looked at how the participants were faring in life. Astonishingly, they found that children who had waited for two treats grew into young adults who were more successful than the others, even measured by quantitative metrics like their SAT scores. If the marshmallow test is about willpower, this is a powerful testament to the impact that learning self-control can have on one’s life. But if the test is less about will than about expectations, then this tells a different, perhaps more poignant story
A team of researchers at the University of Rochester recently explored how prior experiences might affect behavior in the marshmallow test
And here, the children who had learned that the experimenter was unreliable were more likely to eat the marshmallow before she came back, losing the opportunity to earn a second treat.
Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
Priors in the Age of Mechanical Reproduction
Being a good Bayesian means representing the world in the correct proportions—having good priors, appropriately calibrated. By and large, for humans and other animals this happens naturally; as a rule, when something surprises us, it ought to surprise us, and when it doesn’t, it ought not to. Even when we accumulate biases that aren’t objectively correct, they still usually do a reasonable job of reflecting the specific part of the world we live in
Everything starts to break down, however, when a species gains language. What we talk about isn’t what we experience—we speak chiefly of interesting things, and those tend to be things that are uncommon
Simply put, the representation of events in the media does not track their frequency in the world. As sociologist Barry Glassner notes, the murder rate in the United States declined by 20% over the course of the 1990s, yet during that time period the presence of gun violence on American news increased by 600%.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
====When to Think Less====
The pro-and-con list was already a time-honored algorithm by Darwin’s time, being endorsed by Benjamin Franklin a century before. To get over “the Uncertainty that perplexes us,” Franklin recommended dividing a sheet of paper into pro and con columns, adding reasons to each side over several days of consideration, and then striking out reasons of equal weight on opposite sides until he could see where the balance lay, a procedure he called his “Moral or Prudential Algebra.”
When we think about thinking, it’s easy to assume that more is better: that you will make a better decision the more pros and cons you list, make a better prediction about the price of a stock the more relevant factors you identify, and write a better report the more time you spend working on it. This is certainly the premise behind Franklin’s system. In this sense, Darwin’s “algebraic” approach to matrimony, despite its obvious eccentricity, seems remarkably and maybe even laudably rational.
However, if Franklin or Darwin had lived into the era of machine-learning research—the science of teaching computers to make good judgments from experience—they’d have seen Moral Algebra shaken to its foundations. The question of how hard to think, and how many factors to consider, is at the heart of a knotty problem that statisticians and machine-learning researchers call “overfitting.” And dealing with that problem reveals that there’s a wisdom to deliberately thinking less.
The Case Against Complexity
The lesson is this: it is indeed true that including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction.
If the study were repeated with different people, producing slight variations on the same essential pattern, the one- and two-factor models would remain more or less steady—but the nine-factor model would gyrate wildly from one instance of the study to the next. This is what statisticians call overfitting.
So one of the deepest truths of machine learning is that, in fact, it’s not always better to use a more complex model, one that takes a greater number of factors into account. And the issue is not just that the extra factors might offer diminishing returns—performing better than a simpler model, but not enough to justify the added complexity. Rather, they might make our predictions dramatically worse.
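A small synthetic demonstration, loosely echoing the one-factor-versus-nine-factor example: the true relationship is a straight line plus noise, and the more flexible polynomial fits the data it was given more tightly while typically predicting fresh data worse. Everything here, including the noise level and the random seed, is invented.

<code python>
import numpy as np

rng = np.random.default_rng(0)

def noisy_samples(n):
    """The true relationship is a simple line, observed with noise."""
    x = rng.uniform(0, 1, n)
    return x, 2 * x + 1 + rng.normal(0, 0.3, n)

x_train, y_train = noisy_samples(12)     # the data we have
x_test, y_test = noisy_samples(200)      # fresh data the models never saw

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    seen = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    unseen = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: error on data seen {seen:.4f}, on new data {unseen:.4f}")
# The nine-factor polynomial hugs the twelve points it was given most tightly,
# and typically does far worse on the points it never saw.
</code>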
overfitting poses a danger any time we’re dealing with noise or mismeasurement—and we almost always are. There can be errors in how the data were collected, or in how they were reported. Sometimes the phenomena being investigated, such as human happiness, are hard to even define, let alone measure. Thanks to their flexibility, the most complex models available to us can fit any patterns that appear in the data, but this means that they will also do so even when those patterns are mere phantoms and mirages in the noise.
Fundamentally, overfitting is a kind of idolatry of data, a consequence of focusing on what we’ve been able to measure rather than what matters.
Perhaps nowhere, however, is overfitting as powerful and troublesome as in the world of business. “Incentive structures work,” as Steve Jobs put it. “So you have to be very careful of what you incent people to do, because various incentive structures create all sorts of consequences that you can’t anticipate.” Sam Altman, president of the startup incubator Y Combinator, echoes Jobs’s words of caution: “It really is true that the company will build whatever the CEO decides to measure.”
Such problems can’t simply be dismissed as a failure to achieve management goals. Rather, they are the opposite: the ruthless and clever optimization of the wrong thing.
In some cases, the difference between a model and the real world is literally a matter of life and death. In the military and in law enforcement, for example, repetitive, rote training is considered a key means for instilling line-of-fire skills. The goal is to drill certain motions and tactics to the point that they become totally automatic. But when overfitting creeps in, it can prove disastrous
Mistakes like these are known in law enforcement and the military as “training scars,” and they reflect the fact that it’s possible to overfit one’s own preparation. In one particularly dramatic case, an officer instinctively grabbed the gun out of the hands of an assailant and then instinctively handed it right back—just as he had done time and time again with his trainers in practice.
Detecting Overfitting: Cross-Validation
Simply put, Cross-Validation means assessing not only how well a model fits the data it’s given, but how well it generalizes to data it hasn’t seen. Paradoxically, this may involve using less data
Aside from withholding some of the available data points, it is also useful to consider testing the model with data derived from some other form of evaluation entirely
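A sketch of k-fold Cross-Validation on the same kind of synthetic data (the fold count, candidate degrees, and noise level are arbitrary choices): each candidate model is scored only on the slices of data it never saw.

<code python>
import numpy as np

def cross_validated_error(x, y, degree, folds=4):
    """k-fold Cross-Validation: hold out each slice of the data in turn,
    fit on the rest, and score the model on the held-out slice."""
    indices = np.arange(len(x))
    errors = []
    for fold in range(folds):
        held_out = indices % folds == fold
        coeffs = np.polyfit(x[~held_out], y[~held_out], degree)
        predictions = np.polyval(coeffs, x[held_out])
        errors.append(np.mean((predictions - y[held_out]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
y = 2 * x + 1 + rng.normal(0, 0.3, 40)
for degree in (1, 3, 9):
    print(degree, round(cross_validated_error(x, y, degree), 4))
# The simplest model usually comes out ahead once every candidate is judged
# on data it hasn't seen.
</code>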
How to Combat Overfitting: Penalizing Complexity
From a statistics viewpoint, overfitting is a symptom of being too sensitive to the actual data we’ve seen. The solution, then, is straightforward: we must balance our desire to find a good fit against the complexity of the models we use to do so
If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
So what do these complexity penalties look like? One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model
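A rough illustration of the principle rather than Tibshirani’s actual algorithm: score each candidate model by its error plus a penalty proportional to the total weight it gives its factors, so that extra factors have to earn their keep. The data, the true weights, and the penalty strength below are invented; with the penalty in place, the two-factor model can come out ahead even though the five-factor model fits the training data better.
```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 5
X = rng.normal(size=(n, k))
# Only the first two factors actually matter; the rest is noise a model can latch onto.
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.5, n)

def penalized_score(weights, lam=1.0):
    fit_error = np.mean((X @ weights - y) ** 2)
    return fit_error + lam * np.sum(np.abs(weights))   # Lasso-style penalty: total weight

full = np.linalg.lstsq(X, y, rcond=None)[0]            # uses all five factors
small = np.zeros(k)
small[:2] = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]  # uses only the first two

print("plain fit error:", np.mean((X @ full - y) ** 2), "vs", np.mean((X @ small - y) ** 2))
print("penalized score:", penalized_score(full), "vs", penalized_score(small))
```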
Techniques like the Lasso are now ubiquitous in machine learning, but the same kind of principle—a penalty for complexity—also appears in nature. Living organisms get a certain push toward simplicity almost automatically, thanks to the constraints of time, memory, energy, and attention. The burden of metabolism, for instance, acts as a brake on the complexity of organisms, introducing a caloric penalty for overly elaborate machinery
Language forms yet another natural Lasso: complexity is punished by the labor of speaking at greater length and the taxing of our listener’s attention span. Business plans get compressed to an elevator pitch; life advice becomes proverbial wisdom only if it is sufficiently concise and catchy. And anything that needs to be remembered has to pass through the inherent Lasso of memory
The Upside of Heuristics
When it comes to portfolio management, it turns out that unless you’re highly confident in the information you have about the markets, you may actually be better off ignoring that information altogether. Applying Markowitz’s optimal portfolio allocation scheme requires having good estimates of the statistical properties of different investments. An error in those estimates can result in very different asset allocations, potentially increasing risk. In contrast, splitting your money evenly across stocks and bonds is not affected at all by what data you’ve observed. This strategy doesn’t even try to fit itself to the historical performance of those investments.

Markowitz’s retirement savings
Gerd Gigerenzer and Henry Brighton have argued that the decision-making shortcuts people use in the real world are in many cases exactly the kind of thinking that makes for good decisions. “In contrast to the widely held view that less processing reduces accuracy,” they write, “the study of heuristics shows that less information, computation, and time can in fact improve accuracy.” A heuristic that favors simpler answers—with fewer factors, or less computation—offers precisely these “less is more” effects.
The concept of overfitting gives us a way of seeing the virtue in such evolutionary baggage. Though crossed-over nerve fibers and repurposed jawbones may seem like suboptimal arrangements, we don’t necessarily want evolution to fully optimize an organism to every shift in its environmental niche—or, at least, we should recognize that doing so would make it extremely sensitive to further environmental changes. Having to make use of existing materials, on the other hand, imposes a kind of useful restraint. It makes it harder to induce drastic changes in the structure of organisms, harder to overfit. As a species, being constrained by the past makes us less perfectly adjusted to the present we know but helps keep us robust for the future we don’t.
A similar insight might help us resist the quick-moving fads of human society. When it comes to culture, tradition plays the role of the evolutionary constraints. A bit of conservatism, a certain bias in favor of history, can buffer us against the boom-and-bust cycle of fads.
In machine learning, the advantages of moving slowly emerge most concretely in a regularization technique known as Early Stopping
This kind of setup—where more time means more complexity—characterizes a lot of human endeavors. Giving yourself more time to decide about something does not necessarily mean that you’ll make a better decision. But it does guarantee that you’ll end up considering more factors, more hypotheticals, more pros and cons, and thus risk overfitting.
The effectiveness of regularization in all kinds of machine-learning tasks suggests that we can make better decisions by deliberately thinking and doing less. If the factors we come up with first are likely to be the most important ones, then beyond a certain point thinking more about a problem is not only going to be a waste of time and effort—it will lead us to worse solutions
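A sketch of Early Stopping in its home territory, with invented data: fit a model by taking many small gradient steps, watch its error on held-out data, and stop as soon as that error stops improving rather than running the fit to completion.
```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 12))
# Only two of the twelve available factors actually matter.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.8, 40)
X_fit, y_fit, X_val, y_val = X[:30], y[:30], X[30:], y[30:]

w = np.zeros(12)
best_err, best_w, patience = np.inf, w.copy(), 0
for step in range(5000):
    grad = 2 * X_fit.T @ (X_fit @ w - y_fit) / len(y_fit)
    w = w - 0.01 * grad                  # each step fits the training data a little harder
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_err:
        best_err, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 50:               # held-out error has stopped improving: stop early
            break
print(f"stopped after {step + 1} steps; kept the weights that generalized best "
      f"(held-out error {best_err:.3f})")
```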
As with all issues involving overfitting, how early to stop depends on the gap between what you can measure and what really matters. If you have all the facts, they’re free of all error and uncertainty, and you can directly assess whatever is important to you, then don’t stop early. Think long and hard: the complexity and effort are appropriate.
But that’s almost never the case. If you have high uncertainty and limited data, then do stop early by all means. If you don’t have a clear read on how your work will be evaluated, and by whom, then it’s not worth the extra time to make it perfect with respect to your own (or anyone else’s) idiosyncratic guess at what perfection might be. The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting—that is, the more you should prefer simplicity, and the earlier you should stop
When to Think Less
When you’re truly in the dark, the best-laid plans will be the simplest. When our expectations are uncertain and the data are noisy, the best bet is to paint with a broad brush, to think in broad strokes
As McGill’s Henry Mintzberg puts it, “What would happen if we started from the premise that we can’t measure what matters and go from there? Then instead of measurement we’d have to use something very scary: it’s called judgment.”
The upshot of Early Stopping is that sometimes it’s not a matter of choosing between being rational and going with our first instinct. Going with our first instinct can be the rational solution. The more complex, unstable, and uncertain the decision, the more rational an approach that is
As entrepreneurs Jason Fried and David Heinemeier Hansson explain, the further ahead they need to brainstorm, the thicker the pen they use—a clever form of simplification by stroke size
Darwin made up his mind exactly when his notes reached the bottom of the diary sheet. He was regularizing to the page.
Let It Slide
One evening, as she stared at her seating charts, “I realized that there was literally a one-to-one correlation between the amino acids and proteins in my PhD thesis and people sitting at tables at my wedding.” Bellows called out to her fiancé for a piece of paper and began scribbling equations. Amino acids became guests, binding energies became relationships, and the molecules’ so-called nearest-neighbor interactions became—well—nearest-neighbor interactions. She could use the algorithms from her research to solve her own wedding.
Bellows worked out a way to numerically define the strength of the relationships among all the guests. If a particular pair of people didn’t know one another, they got a 0; if they did, a 1; and if they were a couple, a 50. (The sister of the bride got to give a score of 10 to all the people she wanted to sit with, as a special prerogative.) Bellows then specified a few constraints: the maximum table capacity, and a minimum score necessary for each table, so that no one table became the awkward “miscellaneous” group full of strangers. She also codified the program’s goal: to maximize the relationship scores between the guests and their tablemates.
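A toy rendering of that setup; the guests, scores, and table size below are invented, and Bellows’s real program was far more sophisticated.
```python
import itertools

# Invented guests and scores: 0 = strangers (the default), 1 = acquaintances, 50 = couples.
score = {("ana", "ben"): 50, ("dev", "eli"): 50,
         ("ana", "caro"): 1, ("ben", "caro"): 1, ("caro", "dev"): 1}
guests, table_size, minimum_table_score = ["ana", "ben", "caro", "dev", "eli"], 3, 2

def table_score(table):
    return sum(score.get((a, b), score.get((b, a), 0))
               for a, b in itertools.combinations(table, 2))

# At toy scale we can simply try every possible table; a real guest list is exactly
# the kind of combinatorial explosion the rest of this chapter is about.
candidates = [t for t in itertools.combinations(guests, table_size)
              if table_score(t) >= minimum_table_score]
best = max(candidates, key=table_score)
print(best, "scores", table_score(best))
```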
In fact, no one understands as well as a computer scientist that in the face of a seemingly unmanageable challenge, you should neither toil forever nor give up, but—as we’ll see—try a third thing entirely.
The Difficulty of Optimization
The problem of route planning didn’t get the attention of the mathematics community until the 1930s, but then it did so with a vengeance. Mathematician Karl Menger spoke of “the postal messenger problem” in 1930, noting that no easier solution was known than simply trying out every possibility in turn. Hassler Whitney posed the problem in a 1934 talk at Princeton, where it lodged firmly in the brain of fellow mathematician Merrill Flood (who, you might recall from chapter 1, is also credited with circulating the first solution to the secretary problem). When Flood moved to California in the 1940s he spread it in turn to his colleagues at the RAND Corporation, and the problem’s iconic name first appeared in print in a 1949 paper by mathematician Julia Robinson. As the problem swept through mathematical circles, it grew in notoriety. Many of the greatest minds of the time obsessed over it, and no one seemed able to make real headway.
In the traveling salesman problem, the question isn’t whether a computer (or a mathematician) could find the shortest route: theoretically, one can simply crank out a list of all the possibilities and measure each one. Rather, the issue is that as the number of towns grows, the list of possible routes connecting them explodes. A route is just an ordering of the towns, so trying them all by brute force is the dreaded O(n!) “factorial time”—the computational equivalent of sorting a deck of cards by throwing them in the air until they happen to land in order.
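The brute-force approach is easy to write down, which is exactly the problem. A sketch with five invented towns already checks 24 routes, and every added town multiplies that count again.
```python
import itertools, math

# Hypothetical town coordinates; the point is how fast the route count grows, not the map.
towns = {"A": (0, 0), "B": (1, 5), "C": (4, 2), "D": (6, 6), "E": (3, 7)}

def length(route):
    legs = zip(route, route[1:] + route[:1])            # close the loop back to the start
    return sum(math.dist(towns[a], towns[b]) for a, b in legs)

# Fix the starting town; the remaining orderings still number (n-1)!.
rest = [t for t in towns if t != "A"]
routes = [("A",) + p for p in itertools.permutations(rest)]
best = min(routes, key=length)
print(f"{len(routes)} routes checked; shortest is {best} at {length(best):.2f}")
```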
The question is: is there any hope of doing better?
Cobham–Edmonds thesis: an algorithm should be considered “efficient” if it runs in what’s called “polynomial time”—that is, O(n²), O(n³), or in fact n to the power of any number at all. A problem, in turn, is considered “tractable” if we know how to solve it using an efficient algorithm. A problem we don’t know how to solve in polynomial time, on the other hand, is considered “intractable.” And at anything but the smallest scales, intractable problems are beyond the reach of solution by computers, no matter how powerful.
One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had. Then, after they’ve made a certain amount of headway, they try to add the constraints back in. That is, they make the problem temporarily easier to handle before bringing it back to reality.
Introduction to Relaxation Methods or Discrete Relaxation Techniques
The idea behind such thought exercises is exactly that of Constraint Relaxation: to make the intractable tractable, to make progress in an idealized world that can be ported back to the real one. If you can’t solve the problem in front of you, solve an easier version of it—and then see if that solution offers you a starting point, or a beacon, in the full-blown problem
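One classic instance of this move, sketched on the toy towns from above: drop the requirement that the route be a single closed loop and the problem becomes finding a minimum spanning tree, which is easy. The tree then both bounds the best possible tour and suggests one. (Choosing spanning trees as the relaxed problem is the standard textbook example; the code below is an illustration, not a detail taken from these notes.)
```python
import math

towns = {"A": (0, 0), "B": (1, 5), "C": (4, 2), "D": (6, 6), "E": (3, 7)}
dist = lambda a, b: math.dist(towns[a], towns[b])

# Relaxed problem: connect every town as cheaply as possible, no loop required.
# That is a minimum spanning tree, built here with Prim's algorithm.
in_tree, edges = {"A"}, []
while len(in_tree) < len(towns):
    a, b = min(((a, b) for a in in_tree for b in towns if b not in in_tree),
               key=lambda e: dist(*e))
    in_tree.add(b)
    edges.append((a, b))
mst_cost = sum(dist(a, b) for a, b in edges)
print("tree cost (a lower bound on any full tour):", round(mst_cost, 2))

# Re-impose the constraint: walk the tree depth-first to recover an actual tour.
children = {t: [b for a, b in edges if a == t] for t in towns}
tour, stack = [], ["A"]
while stack:
    t = stack.pop()
    tour.append(t)
    stack.extend(reversed(children[t]))
tour_cost = sum(dist(a, b) for a, b in zip(tour, tour[1:] + tour[:1]))
print("tour suggested by the relaxed solution:", tour, round(tour_cost, 2))
```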
One day as a child, Brian was complaining to his mother about all the things he had to do: his homework, his chores.… “Technically, you don’t have to do anything,” his mother replied. “You don’t have to do what your teachers tell you. You don’t have to do what I tell you. You don’t even have to obey the law. There are consequences to everything, and you get to decide whether you want to face those consequences.”
The idea behind Lagrangian Relaxation is simple. An optimization problem has two parts: the rules and the scorekeeping. In Lagrangian Relaxation, we take some of the problem’s constraints and bake them into the scoring system instead. That is, we take the impossible and downgrade it to costly. (In a wedding seating optimization, for instance, we might relax the constraint that tables each hold ten people max, allowing overfull tables but with some kind of elbow-room penalty.)
Occasionally it takes a bit of diplomatic finesse, but a Lagrangian Relaxation—where some impossibilities are downgraded to penalties, the inconceivable to the undesirable—enables progress to be made
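Sketching the wedding example in that spirit, with the same invented guests and scores as before: the table-capacity rule stops being a wall and becomes an elbow-room charge. With a modest penalty the program happily squeezes everyone onto one table; crank the penalty up and the answer drifts back toward respecting the original constraint.
```python
import itertools

# Same invented guests and scores as the earlier seating sketch.
score = {("ana", "ben"): 50, ("dev", "eli"): 50,
         ("ana", "caro"): 1, ("ben", "caro"): 1, ("caro", "dev"): 1}
guests, capacity, elbow_room_penalty = ["ana", "ben", "caro", "dev", "eli"], 3, 20

def pair_score(a, b):
    return score.get((a, b), score.get((b, a), 0))

def relaxed_value(table):
    happiness = sum(pair_score(a, b) for a, b in itertools.combinations(table, 2))
    overflow = max(0, len(table) - capacity)
    return happiness - elbow_room_penalty * overflow   # the impossible, downgraded to costly

# Every subset of guests is now a legal candidate table; it is just scored differently.
tables = [t for r in range(2, len(guests) + 1)
          for t in itertools.combinations(guests, r)]
best = max(tables, key=relaxed_value)
print(best, "relaxed value", relaxed_value(best))
```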
The conservative British columnist Christopher Booker says that “when we embark on a course of action which is unconsciously driven by wishful thinking, all may seem to go well for a time”—but that because “this make-believe can never be reconciled with reality,” it will inevitably lead to what he describes as a multi-stage breakdown: “dream,” “frustration,” “nightmare,” “explosion.” Computer science paints a dramatically rosier view. Then again, as an optimization technique, relaxation is all about being consciously driven by wishful thinking. Perhaps that’s partly what makes the difference.
====When to Leave It to Chance====
There is a deep message in the fact that on certain problems, randomized approaches can outperform even the best deterministic ones. Sometimes the best solution to a problem is to turn to chance rather than trying to fully reason out an answer.
But merely knowing that randomness can be helpful isn’t good enough. You need to know when to rely on chance, in what way, and to what extent.
In 1777, Georges-Louis Leclerc, Comte de Buffon, published the results of an interesting probabilistic analysis. If we drop a needle onto a lined piece of paper, he asked, how likely is it to cross one of the lines? Buffon’s work showed that if the needle is shorter than the gap between the lines, the answer is 2⁄π times the needle’s length divided by the length of the gap. For Buffon, deriving this formula was enough. But in 1812, Pierre-Simon Laplace, one of the heroes of chapter 6, pointed out that this result has another implication: one could estimate the value of π simply by dropping needles onto paper
Metropolis named this approach—replacing exhaustive probability calculations with sample simulations—the Monte Carlo Method, after the Monte Carlo casino in Monaco, a place equally dependent on the vagaries of chance. The Los Alamos team was able to use it to solve key problems in nuclear physics. Today the Monte Carlo Method is one of the cornerstones of scientific computing.
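A minimal Monte Carlo version of Laplace’s observation: simulate needle drops and read π back out of the crossing frequency. (Using math.pi to pick the random angle is a small cheat that the physical experiment gets for free.)
```python
import math, random

random.seed(0)
needle, gap, drops, crossings = 1.0, 2.0, 1_000_000, 0

for _ in range(drops):
    center = random.uniform(0, gap / 2)        # distance from the needle's center to the nearest line
    angle = random.uniform(0, math.pi)         # the needle's orientation
    if center <= (needle / 2) * math.sin(angle):
        crossings += 1

# Buffon: P(cross) = (2/pi) * needle/gap, so pi is roughly 2*needle / (gap * observed frequency).
print("estimated pi:", 2 * needle * drops / (gap * crossings))
```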
The first person to demonstrate the surprisingly broad applications of randomness in computer science was Michael Rabin. Born in 1931 in Breslau, Germany (which became Wrocław, Poland, at the end of World War II), Rabin was the descendant of a long line of rabbis. His family left Germany for Palestine in 1935, and there he was diverted from the rabbinical path his father had laid down for him by the beauty of mathematics
Sieve of Eratosthenes
The Miller-Rabin primality test, as it’s now known, provides a way to quickly identify even gigantic prime numbers with an arbitrary degree of certainty.
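A compact sketch of the test, close to its textbook form rather than any particular production implementation: each random witness that fails to expose n as composite cuts the remaining doubt by at least a factor of four.
```python
import random

def probably_prime(n, rounds=20):
    """Miller-Rabin: a composite slips past each random witness with probability
    at most 1/4, so doubt shrinks to (1/4)**rounds."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1               # write n-1 as d * 2**s with d odd
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                   # a is a witness: n is definitely composite
    return True                            # no witness found: almost certainly prime

print(probably_prime(2**61 - 1))           # a large Mersenne prime: True
print(probably_prime(2**61 + 1))           # composite: False
```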
Since there is no known deterministic algorithm for efficiently testing polynomial identity, this randomized method—with multiple observations quickly giving rise to near-certainty—is the only practical one we have.
The polynomial identity test shows that sometimes our effort is better spent checking random values—sampling from the two expressions we want to know about—than trying to untangle their inner workings.
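The same spirit in a few lines, with invented expressions: to ask whether two formulas are the same polynomial, don’t expand and untangle them, just probe them at random points.
```python
import random

f = lambda x: (x + 1) * (x - 1) * (x + 2)       # one way of writing an expression
g = lambda x: x**3 + 2 * x**2 - x - 2           # another way; are they the same polynomial?
h = lambda x: x**3 + 2 * x**2 - x - 3           # an impostor that differs by a constant

def probably_identical(p, q, trials=10, span=10**9):
    # Two distinct degree-3 polynomials can agree on at most 3 points, so a random
    # probe over a huge range almost never fools us, and each extra probe shrinks
    # the remaining doubt further.
    return all(p(x) == q(x) for x in (random.randrange(span) for _ in range(trials)))

print(probably_identical(f, g))   # True: no disagreement found
print(probably_identical(f, h))   # False: a single probe exposes the difference
```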
Arguably the most important political philosopher of the twentieth century was Harvard’s John Rawls, who set for himself the ambitious task of reconciling two seemingly opposite key ideas in his field: liberty and equality. Is a society more “just” when it’s more free, or more equal? And do the two really have to be mutually exclusive? Rawls offered a way of approaching this set of questions that he called the “veil of ignorance.” Imagine, he said, that you were about to be born, but didn’t know as whom: male or female, rich or poor, urban or rural, sick or healthy. And before learning your status, you had to choose what kind of society you’d live in. What would you want? By evaluating various social arrangements from behind the veil of ignorance, argued Rawls, we’d more readily come to a consensus about what an ideal one would look like.
Should we be trying, for instance, to maximize mean happiness, median happiness, total happiness, or something else? Each of these approaches, famously, leaves itself open to pernicious dystopias—such as the civilization of Omelas imagined by writer Ursula K. Le Guin, in which prosperity and harmony abound but a single child is forced to live in abject misery.
Since neither sweeping statistics nor politicians’ favorite stories can truly guide us through thousands of pages of proposed legislation, a Monte Carlo–informed computer scientist would propose a different approach: sampling. A close examination of random samples can be one of the most effective means of making sense of something too complex to be comprehended directly. When it comes to handling a qualitatively unmanageable problem, something so thorny and complicated that it can’t be digested whole—solitaire or atomic fission, primality testing or public policy—sampling offers one of the simplest, and also the best, ways of cutting through the difficulties.
The Three-Part Tradeoff
Time and space are at the root of the most familiar tradeoffs in computer science, but recent work on randomized algorithms shows that there’s also another variable to consider: certainty. As Harvard’s Michael Mitzenmacher puts it, “What we’re going to do is come up with an answer which saves you in time and space and trades off this third dimension: error probability.” Asked for his favorite example of this tradeoff into uncertainty, he doesn’t hesitate. “A colleague just said that there should be a drinking game that every time this term appears on one of my slides, you have to take a drink. Have you ever heard of Bloom filters?”
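The notes leave Bloom filters as a teaser, so here is a minimal sketch of the idea (the hash construction and sizes are arbitrary choices): a small bit array that answers either “definitely not seen before” or “almost certainly seen before,” trading a sliver of certainty for a large saving in space.
```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter: a few hashed bit positions per item, no item storage at all."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.field = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.field[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.field[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://example.com/a")
print(seen.might_contain("http://example.com/a"))   # True
print(seen.might_contain("http://example.com/b"))   # False (almost certainly)
```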
the tactical use of randomness has emerged as an arguably even more important technique
This is an example of a so-called greedy algorithm, which you can also think of as a “myopic algorithm”: one that shortsightedly takes the best thing available every step of the way.
This is an algorithm known as Hill Climbing—since the search through a space of solutions, some better and some worse, is commonly thought of in terms of a landscape with hills and valleys, where your goal is to reach the highest peak.
It’s here that the hill climbing stops. Does this mean you’ve definitely found the single best possible itinerary, though? Sadly, no. You may have found only a so-called “local maximum,” not the global maximum of all the possibilities. The hill-climbing landscape is a misty one. You can know that you’re standing on a mountaintop because the ground falls away in all directions—but there might be a higher mountain just across the next valley, hidden behind clouds.
Another approach is to completely scramble our solution when we reach a local maximum, and start Hill Climbing anew from this random new starting point. This algorithm is known, appropriately enough, as “Random-Restart Hill Climbing”—or, more colorfully, as “Shotgun Hill Climbing.” It’s a strategy that proves very effective when there are lots of local maxima in a problem.
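A sketch of both ideas on a toy tour-planning problem with invented towns: climb by swapping pairs of towns until no swap helps (a local optimum, here a local minimum of distance), then restart from a fresh random itinerary and keep the best result.
```python
import itertools, math, random

random.seed(4)
towns = {"A": (0, 0), "B": (1, 5), "C": (4, 2), "D": (6, 6), "E": (3, 7), "F": (8, 1)}

def length(route):
    return sum(math.dist(towns[a], towns[b]) for a, b in zip(route, route[1:] + route[:1]))

def hill_climb(route):
    # Keep applying the best two-town swap until no swap improves things.
    while True:
        swaps = [route[:i] + [route[j]] + route[i+1:j] + [route[i]] + route[j+1:]
                 for i, j in itertools.combinations(range(len(route)), 2)]
        best = min(swaps, key=length)
        if length(best) >= length(route):
            return route
        route = best

# Shotgun Hill Climbing: restart from a fresh random itinerary each time.
best = None
for _ in range(20):
    candidate = hill_climb(random.sample(list(towns), len(towns)))
    if best is None or length(candidate) < length(best):
        best = candidate
print(best, round(length(best), 2))
```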
But there’s also a third approach: instead of turning to full-bore randomness when you’re stuck, use a little bit of randomness every time you make a decision. This technique, developed by the same Los Alamos team that came up with the Monte Carlo Method, is called the Metropolis Algorithm. The Metropolis Algorithm is like Hill Climbing, trying out different small-scale tweaks on a solution, but with one important difference: at any given point, it will potentially accept bad tweaks as well as good ones.
Whether it’s jitter, random restarts, or being open to occasional worsening, randomness is incredibly useful for avoiding local maxima. Chance is not just a viable way of dealing with tough optimization problems; in many cases, it’s essential. Some questions linger, however. How much randomness should you use? And when? And—given that strategies such as the Metropolis Algorithm can permute our itinerary pretty much ad infinitum—how do you ever know that you’re done?
So what would happen, Kirkpatrick wondered, if you treated an optimization problem like an annealing problem—if you “heated it up” and then slowly “cooled it off”?
Simulated Annealing
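A minimal sketch of the recipe on the same invented towns: start hot, where even bad swaps are often accepted under the Metropolis rule, and cool slowly until only improvements get through. The starting temperature and cooling rate are arbitrary choices.
```python
import math, random

random.seed(5)
towns = {"A": (0, 0), "B": (1, 5), "C": (4, 2), "D": (6, 6), "E": (3, 7), "F": (8, 1)}

def length(route):
    return sum(math.dist(towns[a], towns[b]) for a, b in zip(route, route[1:] + route[:1]))

route = list(towns)
temperature = 10.0
while temperature > 0.01:
    i, j = random.sample(range(len(route)), 2)
    tweaked = route[:]
    tweaked[i], tweaked[j] = tweaked[j], tweaked[i]           # a small random tweak
    worse_by = length(tweaked) - length(route)
    # Metropolis rule: always accept improvements; accept a worsening move with a
    # probability that shrinks as the move gets worse and as the system cools.
    if worse_by < 0 or random.random() < math.exp(-worse_by / temperature):
        route = tweaked
    temperature *= 0.999                                       # the cooling schedule
print(route, round(length(route), 2))
```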
in 1754, Horace Walpole coined the term “serendipity,” based on the fairy tale adventures of The Three Princes of Serendip (Serendip being the archaic name of Sri Lanka), who “were always making discoveries, by accidents and sagacity, of things they were not in quest of.”
William James
New conceptions, emotions, and active tendencies which evolve are originally produced in the shape of random images, fancies, accidental out-births of spontaneous variation in the functional activity of the excessively unstable human brain, which the outer environment simply confirms or refutes, adopts or rejects, preserves or destroys—selects, in short, just as it selects morphological and social variations due to molecular accidents of an analogous sort.
James thus viewed randomness as the heart of creativity. And he believed it was magnified in the most creative people. In their presence, he wrote, “we seem suddenly introduced into a seething caldron of ideas, where everything is fizzling and bobbing about in a state of bewildering activity, where partnerships can be joined or loosened in an instant, treadmill routine is unknown, and the unexpected seems the only law.” (Note here the same “annealing” intuition, rooted in metaphors of temperature, where wild permutation equals heat.)
Donald Campbell, a psychologist who lived a hundred years later. In 1960, Campbell published a paper called “Blind Variation and Selective Retention in Creative Thought as in Other Knowledge Processes.” Like James, he opened with his central thesis: “A blind-variation-and-selective-retention process is fundamental to all inductive achievements, to all genuine increases in knowledge, to all increases in fit of system to environment.”
You might worry that making every decision by flipping a coin could lead to trouble, not least with your boss, friends, and family. And it’s true that mainlining randomness into your life is not necessarily a recipe for success. The cult classic 1971 novel The Dice Man by Luke Rhinehart (real name: George Cockcroft) provides a cautionary tale. Its narrator, a man who replaces decision-making with dice rolling, quickly ends up in situations that most of us would probably like to avoid.
If the Dice Man had only had a deeper grasp of computer science, he’d have had some guidance. First, from Hill Climbing: even if you’re in the habit of sometimes acting on bad ideas, you should always act on good ones. Second, from the Metropolis Algorithm: your likelihood of following a bad idea should be inversely proportional to how bad an idea it is. Third, from Simulated Annealing: you should front-load randomness, rapidly cooling out of a totally random state, using ever less and less randomness as time goes on, lingering longest as you approach freezing. Temper yourself—literally.
The long-distance telegraph began with a portent—Samuel F. B. Morse, standing in the chambers of the US Supreme Court on May 24, 1844, wiring his assistant Alfred Vail in Baltimore a verse from the Old Testament: “WHAT HATH GOD WROUGHT.” The first thing we ask of any new connection is how it began, and from that origin can’t help trying to augur its future
The first telephone call in history, made by Alexander Graham Bell to his assistant on March 10, 1876, began with a bit of a paradox. “Mr. Watson, come here; I want to see you”—a simultaneous testament to its ability and inability to overcome physical distance.
The cell phone began with a boast—Motorola’s Martin Cooper walking down Sixth Avenue on April 3, 1973, as Manhattan pedestrians gawked, calling his rival Joel Engel at AT&T: “Joel, I’m calling you from a cellular phone. A real cellular phone: a handheld, portable, real cellular phone.” (“I don’t remember exactly what he said,” Cooper recalls, “but it was really quiet for a while. My assumption was that he was grinding his teeth.”)
The text message began, on December 3, 1992, with cheer: Neil Papworth at Sema Group Telecoms wishing Vodafone’s Richard Jarvis an early “Merry Christmas.”
The beginnings of the Internet were, somehow fittingly, much humbler and more inauspicious than all of that. It was October 29, 1969, and Charley Kline at UCLA sent to Bill Duvall at the Stanford Research Institute the first message ever transmitted from one computer to another via the ARPANET. The message was “login”—or would have been, had the receiving machine not crashed after “lo.”
Lo—verily, Kline managed to sound portentous and Old Testament despite himself.
The foundation of human connection is protocol—a shared convention of procedures and expectations, from handshakes and hellos to etiquette, politesse, and the full gamut of social norms
“Byzantine generals problem.”
The issues faced by the Byzantine generals, as he reminds us, “are not design complexities, they are impossibility results.”
Exponential Backoff: The Algorithm of Forgiveness
The world’s most difficult word to translate has been identified as “ilunga,” from the Tshiluba language spoken in south-eastern DR Congo.… Ilunga means “a person who is ready to forgive any abuse for the first time, to tolerate it a second time, but never a third time.”
—BBC NEWS
wireless networking began as a matter of necessity, in a place where no wires could do the job: Hawaii. In the late ’60s and early ’70s, Norman Abramson at the University of Hawaii in Honolulu was trying to link together the university’s seven campuses and many research institutes, spread across four islands and hundreds of miles. He hit upon the idea of implementing packet switching via radio rather than the phone system, connecting the islands with a loose chain of transmitters and receivers. This system would come to be known as the ALOHAnet.
The breakthrough turned out to be increasing the average delay after every successive failure—specifically, doubling the potential delay before trying to transmit again. So after an initial failure, a sender would randomly retransmit either one or two turns later; after a second failure, it would try again anywhere from one to four turns later; a third failure in a row would mean waiting somewhere between one and eight turns, and so on. This elegant approach allows the network to accommodate potentially any number of competing signals. Since the maximum delay length (2, 4, 8, 16…) forms an exponential progression, it’s become known as Exponential Backoff
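The scheme fits in a few lines; the flaky channel and the scaled-down sleep below are stand-ins for a real network.
```python
import random, time

def send_with_backoff(transmit, max_attempts=8):
    """Retry a failing operation, doubling the ceiling on the random wait each time."""
    for attempt in range(max_attempts):
        if transmit():
            return True
        wait = random.randint(1, 2 ** (attempt + 1))     # 1-2 turns, then 1-4, then 1-8, ...
        time.sleep(wait * 0.01)                          # scaled down so the demo runs quickly
    return False

random.seed(6)
flaky = lambda: random.random() < 0.3                    # a channel that drops most packets
print(send_with_backoff(flaky))
```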
Exponential Backoff was a huge part of the successful functioning of the ALOHAnet beginning in 1971, and in the 1980s it was baked into TCP, becoming a critical part of the Internet. All these decades later, it still is. As one influential paper puts it, “For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working—exponential backoff.”
Beyond just collision avoidance, Exponential Backoff has become the default way of handling almost all cases of networking failure or unreliability
Shortly after being sworn in to Hawaii’s First Circuit Court, Judge Steven Alm noticed a remarkable pattern. Probationers would repeatedly violate their probation terms, and circuit judges would routinely use their discretion to let them off with a warning. But at some point, perhaps after a dozen or more violations, the judge would decide to be strict, and assign the violator a prison sentence measured in years. Says Alm, “I thought, what a crazy way to try to change anybody’s behavior.” So Alm proposed almost exactly the opposite. In place of violation hearings scheduled a long time into the future, requiring uncertain judgment calls, and occasionally producing huge penalties, HOPE is based on immediate, predefined punishments that begin with just one day in jail and increase after each incident. A five-year study by the Department of Justice reported that HOPE probationers were half as likely as regular probationers to be arrested for a new crime or have their probation revoked. And they were 72% less likely to use drugs. Seventeen states have followed Hawaii’s lead and launched their own versions of HOPE.
====Flow Control and Congestion Avoidance====
At the heart of TCP congestion control is an algorithm called Additive Increase, Multiplicative Decrease, or AIMD. Before AIMD kicks in, a new connection will ramp up its transmission rate aggressively: if the first packet is received successfully it sends out two more, if both of those get through it sends out a batch of four, and so on. But as soon as any packet’s ACK does not come back to the sender, the AIMD algorithm takes over. Under AIMD, any fully received batch of packets causes the number of packets in flight not to double but merely to increase by 1, and dropped packets cause the transmission rate to cut back by half (hence the name Additive Increase, Multiplicative Decrease). Essentially, AIMD takes the form of someone saying, “A little more, a little more, a little more, whoa, too much, cut way back, okay a little more, a little more…” Thus it leads to a characteristic bandwidth shape known as the “TCP sawtooth”—steady upward climbs punctuated by steep drops.
Conservatism here is essential: a network can stabilize only if its users pull back at least as fast as the rate at which it is being overloaded. For the same reason, a merely additive increase helps stabilize things for everyone, preventing rapid overload-and-recovery cycles.
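The sawtooth is easy to reproduce; the drop times below are invented, standing in for whatever congestion the network happens to hit.
```python
def aimd(drop_points, steps=30, start=1):
    """Additive increase, multiplicative decrease: the window of packets in flight
    grows by one after each successful batch and is halved after each drop."""
    window, history = start, []
    for step in range(steps):
        window = max(1, window // 2) if step in drop_points else window + 1
        history.append(window)
    return history

# Drops at steps 12, 20, and 25 produce the characteristic TCP sawtooth.
print(aimd(drop_points={12, 20, 25}))
```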
Spanish philosopher José Ortega y Gasset in 1910 voiced the same sentiment. “Every public servant should be demoted to the immediately lower rank,” he wrote, “because they were advanced until they became incompetent.”
Some organizations have attempted to remediate the Peter Principle by simply firing employees who don’t advance. The so-called Cravath System, devised by leading law firm Cravath, Swaine & Moore, involves hiring almost exclusively recent graduates, placing them into the bottom ranks, and then routinely either promoting or firing them over the following years. In 1980, the US Armed Forces adopted a similar “up or out” policy with the Defense Officer Personnel Management Act. The United Kingdom has likewise pursued what they call “manning control,” to great controversy.
Is there any alternative, any middle path between the institutional stagnation of the Peter Principle and the draconian severity of the “up or out” system? The AIMD algorithm can offer just such an approach, since it is explicitly designed to handle the demands of a volatile environment. A computer network must manage its own maximum transmission capacity, plus the transmission rates of its clients, all of which may be fluctuating unpredictably. Likewise, in a business setting, a company has a limited pool of funds to pay for its operations, and each worker or vendor has a limited capacity for the amount of work they can do and the amount of responsibility they can handle. Everyone’s needs, capacities, and partnerships are always in flux.
The lesson of the TCP sawtooth is that in an unpredictable and changing environment, pushing things to the point of failure is indeed sometimes the best (or the only) way to use all the resources to their fullest. What matters is making sure that the response to failure is both sharp and resilient. Under AIMD, every connection that isn’t dropping the ball is accelerated until it is—and then it’s cut in half, and immediately begins accelerating again. And though it would violate almost every norm of current corporate culture, one can imagine a corporation in which, annually, every employee is always either promoted a single step up the org chart or sent part of the way back down.
As Laurence J. Peter saw it, the insidious Peter Principle arises in corporations because of “the first commandment of hierarchical life: the hierarchy must be preserved.” TCP, in contrast, teaches the virtues of flexibility. Companies speak of “flat” hierarchies and “tall” hierarchies, but they might consider speaking of dynamic ones. Under an AIMD system, no one is long anxious about being overtaxed, nor long resentful about a missed promotion; both are temporary and frequent correctives, and the system hovers near its equilibrium despite everything changing all the time. Perhaps one day we’ll speak not of the arc of one’s career, but rather of its sawtooth.
Backchannels: Flow Control in Linguistics
Curiously, the rising awareness of the critical role of feedback in the field of networking mirrored an almost identical set of developments going on around the same time in the linguistics community. In the middle of the twentieth century, linguistics was dominated by the theories of Noam Chomsky, which considered language in its most perfect and ideal state—perfectly fluent, grammatical, uninterrupted sentences, as if all communication were written text. But starting in the 1960s and ’70s, a surge of interest in the practical aspects of spoken language revealed just how elaborate and subtle the processes are that govern turn-taking, interruption, and composing a sentence or story on the fly while being attuned to a listener’s reactions every step of the way. What emerged was a vision of even ostensibly one-way communication as a collaborative act. As linguist Victor Yngve would write in 1970, “In fact, both the person who has the turn and his partner are simultaneously engaged in both speaking and listening. This is because of the existence of what I call the back channel, over which the person who has the turn receives short messages such as ‘yes’ and ‘uh-huh’ without relinquishing the turn.”
In one illustrative study, a team led by Janet Bavelas at the University of Victoria investigated what would happen when someone listening to a personal story got distracted: not what would happen to the listener’s comprehension, but what would happen to the story. With poor feedback, they discovered, the story falls apart.
Narrators who told close-call stories to distracted listeners … told them less well overall and particularly poorly at what should have been the dramatic conclusion. Their story endings were abrupt or choppy, or they circled around and retold the ending more than once, and they often justified their story by explaining the obvious close call.
We’ve all had the experience of talking to someone whose eyes drifted away—to their phone, perhaps—making us wonder whether our lackluster storytelling was to blame. In fact, it’s now clear that the cause and effect are often the reverse: a poor listener destroys the tale
The problem was everywhere. And the problem was bufferbloat.
When a networking buffer fills up, what typically happens is called Tail Drop: an unceremonious way of saying that every packet arriving after that point is simply rejected, and effectively deleted. (Turning new customers away from the crêpe stand once the line gets too long would be a version of Tail Drop in a human context.) Given the postal metaphor for packet switching, it might seem a bit odd to imagine a mail carrier who simply vaporizes every parcel that doesn’t fit onto the truck that morning. Yet it’s precisely such “packet drops” that lead a computer to notice that one of its packets hasn’t been acknowledged, prompting AIMD to start halving the bandwidth. Dropped packets are the Internet’s primary feedback mechanism
The body is its own flow control. We can’t be in more than one place at one time. At a crowded party we inevitably participate in less than 5% of the conversation, and cannot read up or catch up on the remainder. Photons that miss the retina aren’t queued for later viewing. In real life, packet loss is almost total.
We use the idiom of “dropped balls” almost exclusively in a derogatory sense, implying that the person in question was lazy, complacent, or forgetful. But the tactical dropping of balls is a critical part of getting things done under overload.
The most prevalent critique of modern communications is that we are “always connected.” But the problem isn’t that we’re always connected; we’re not. The problem is that we’re always buffered. The difference is enormous
the move from circuit switching to packet switching has played itself out across society. We used to request dedicated circuits with others; now we send them packets and wait expectantly for ACKs. We used to reject; now we defer.
The much-lamented “lack of idleness” one reads about is, perversely, the primary feature of buffers: to bring average throughput up to peak throughput. Preventing idleness is what they do. You check email from the road, from vacation, on the toilet, in the middle of the night. You are never, ever bored. This is the mixed blessing of buffers, operating as advertised.
“This is a long-term swamp,”
I’m an optimist in the sense that I believe humans are noble and honorable, and some of them are really smart.… I have a somewhat more pessimistic view of people in groups.
—STEVE JOBS
John Maynard Keynes once said that “successful investing is anticipating the anticipations of others.”
Professional investment may be likened to those newspaper competitions in which the competitors have to pick out the six prettiest faces from a hundred photographs, the prize being awarded to the competitor whose choice most nearly corresponds to the average preferences of the competitors as a whole; so that each competitor has to pick, not those faces which he himself finds prettiest, but those which he thinks likeliest to catch the fancy of the other competitors, all of whom are looking at the problem from the same point of view. It is not a case of choosing those which, to the best of one’s judgment, are really the prettiest, nor even those which average opinion genuinely thinks the prettiest. We have reached the third degree where we devote our intelligences to anticipating what average opinion expects the average opinion to be. And there are some, I believe, who practice the fourth, fifth, and higher degrees.
the “halting problem.” As Alan Turing proved in 1936, a computer program can never tell you for sure whether another program might end up calculating forever without end—except by simulating the operation of that program and thus potentially going off the deep end itself. (Accordingly, programmers will never have automated tools that can tell them whether their software will freeze.) This is one of the foundational results in all of computer science, on which many other proofs hang.* Simply put, any time a system—be it a machine or a mind—simulates the workings of something as complex as itself, it finds its resources totally maxed out, more or less by definition.
One of the most memorable bluffs in high-level poker occurred when Tom Dwan wagered $479,500 on Texas Hold ’Em’s absolute worst possible hand, the 2–7—while literally telling his opponent, Sammy George, that he was holding it. “You don’t have deuce-seven,” George replied. “You don’t have deuce-seven.” George folded, and Dwan—with, yes, deuce-seven—took the pot.
In poker, recursion is a dangerous game. You don’t want to get caught one step behind your opponent, of course—but there’s also an imperative not to get too far ahead of them either. “There’s a rule that you really only want to play one level above your opponent,” explains poker professional Vanessa Rousso. “If you play too far above your opponent, you’re going to think they have information that they don’t actually have—[and] they won’t be able to glean the information that you want them to glean from your actions.” Sometimes poker pros will deliberately bait their opponent into a convoluted recursion, meanwhile playing completely by-the-book, unpsychological poker themselves. This is known as luring them into “a leveling war against themselves.”
One of the most colorful, bizarre, and fascinating episodes in the history of man-vs.-machine chess came in a 2008 blitz showdown between American grandmaster Hikaru Nakamura and leading computer chess program Rybka.
Nakamura immediately gridlocked the board, and proceeded to make repetitive, meaningless moves as fast as he could click. Meanwhile, the computer wasted precious moments fruitlessly searching for winning variations that didn’t exist and doggedly trying to anticipate all the possible future moves by Nakamura, who himself was simply doing the chess equivalent of twiddling his thumbs. When the computer had nearly exhausted its time and began flailing so as not to lose by the clock, Nakamura finally opened the position and crashed through.
Reaching Equilibrium
In one of the seminal results in game theory, the mathematician John Nash proved in 1951 that every two-player game has at least one equilibrium
Put broadly, the object of study in mathematics is truth; the object of study in computer science is complexity. As we’ve seen, it’s not enough for a problem to have a solution if that problem is intractable.
More generally, the Nash equilibrium offers a prediction of the stable long-term outcome of any set of rules or incentives. As such, it provides an invaluable tool for both predicting and shaping economic policy, as well as social policy in general.
In a game-theory context, knowing that an equilibrium exists doesn’t actually tell us what it is—or how to get there
And so, the original field of game theory begat algorithmic game theory—that is, the study of theoretically ideal strategies for games became the study of how machines (and people) come up with strategies for games.
By the end of the twentieth century, determining whether a game has more than one equilibrium, or an equilibrium that gives a player a certain payoff, or an equilibrium that involves taking a particular action, had all been proved to be intractable problems. Then, from 2005 to 2008, Papadimitriou and his colleagues proved that simply finding Nash equilibria is intractable as well.
As Papadimitriou explains, “If an equilibrium concept is not efficiently computable, much of its credibility as a prediction of the behavior of rational agents is lost.” MIT’s Scott Aaronson agrees. “In my opinion,” he says, “if the theorem that Nash equilibria exist is considered relevant to debates about (say) free markets versus government intervention, then the theorem that finding those equilibria is [intractable] should be considered relevant also.” The predictive abilities of Nash equilibria only matter if those equilibria can actually be found by the players. To quote eBay’s former director of research, Kamal Jain, “If your laptop cannot find it, neither can the market.”
Dominant Strategies, for Better or Worse
A dominant strategy avoids recursion altogether, by being the best response to all of your opponent’s possible strategies—so you don’t even need to trouble yourself getting inside their head at all. A dominant strategy is a powerful thing.
This has emerged as one of the major insights of traditional game theory: the equilibrium for a set of players, all acting rationally in their own interest, may not be the outcome that is actually best for those players.
the price of anarchy.
The price of anarchy measures the gap between cooperation (a centrally designed or coordinated solution) and competition (where each participant is independently trying to maximize the outcome for themselves). In a game like the prisoner’s dilemma, this price is effectively infinite: increasing the amount of cash at stake and lengthening the jail sentences can make the gap between possible outcomes arbitrarily wide, even as the dominant strategy stays the same. There’s no limit to how painful things can get for the players if they don’t coordinate. But in other games, as algorithmic game theorists would discover, the price of anarchy is not nearly so bad.
Surprisingly, Tim Roughgarden and Cornell’s Éva Tardos proved in 2002 that the “selfish routing” approach has a price of anarchy that’s a mere 4/3. That is, a free-for-all is only 33% worse than perfect top-down coordination.
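The standard way to make that 4/3 concrete is Pigou’s two-road example (the classic textbook illustration, not something spelled out in these notes): one road always takes an hour, the other takes as long, in hours, as the fraction of the drivers using it.
```python
def average_cost(p):
    """Fraction p of the traffic takes the variable road (travel time p);
    the rest takes the fixed road (travel time 1)."""
    return p * p + (1 - p) * 1

# Selfish equilibrium: the variable road never costs more than the fixed one,
# so everyone piles onto it.
equilibrium = average_cost(1.0)

# Coordinated optimum: check a fine grid of possible splits (the best is half and half).
optimum = min(average_cost(p / 1000) for p in range(1001))

print(equilibrium, optimum, equilibrium / optimum)   # 1.0, 0.75, 1.333...
```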
Roughgarden and Tardos’s work has deep implications both for urban planning of physical traffic and for network infrastructure. Selfish routing’s low price of anarchy may explain, for instance, why the Internet works as well as it does without any central authority managing the routing of individual packets. Even if such coordination were possible, it wouldn’t add very much.
James Branch Cabell: “The optimist proclaims that we live in the best of all possible worlds; and the pessimist fears this is true.”
Quantifying the price of anarchy has given the field a concrete and rigorous way to assess the pros and cons of decentralized systems, which has broad implications across any number of domains where people find themselves involved in game-playing (whether they know it or not). A low price of anarchy means the system is, for better or worse, about as good on its own as it would be if it were carefully managed. A high price of anarchy, on the other hand, means that things have the potential to turn out fine if they’re carefully coordinated—but that without some form of intervention, we are courting disaster. The prisoner’s dilemma is clearly of this latter type. Unfortunately, so are many of the most critical games the world must play.
By and large we cannot shift the dominant strategies from within. But this doesn’t mean that bad equilibria can’t be fixed. It just means that the solution is going to have to come from somewhere else.
Ken Binmore sees at least some of that controversy as misguided. As he argues, it’s “just plain wrong that the Prisoner’s Dilemma captures what matters about human cooperation. On the contrary, it represents a situation in which the dice are as loaded against the emergence of cooperation as they could possibly be.”*
a branch of game theory known as “mechanism design.” While game theory asks what behavior will emerge given a set of rules, mechanism design (sometimes called “reverse game theory”) works in the other direction, asking: what rules will give us the behavior we want to see? And if game theory’s revelations—like the fact that an equilibrium strategy might be rational for each player yet bad for everyone—have proven counterintuitive, the revelations of mechanism design are even more so.
the prisoner’s dilemma, with one crucial addition: the Godfather.
The counterintuitive and powerful thing here is we can worsen every outcome—death on the one hand, taxes on the other—yet make everyone’s lives better by shifting the equilibrium.
adding divine force to injunctions against other kinds of antisocial behavior, such as murder, adultery, and theft, is likewise a way to solve some of the game-theoretic problems of living in a social group. God happens to be even better than government in this respect, since omniscience and omnipotence provide a particularly strong guarantee that taking bad actions will have dire consequences. It turns out there’s no Godfather quite like God the Father.
Religion seems like the kind of thing a computer scientist rarely talks about; in fact, it’s literally the subject of a book called Things a Computer Scientist Rarely Talks About. But by reducing the number of options that people have, behavioral constraints of the kind imposed by religion don’t just make certain kinds of decisions less computationally challenging—they can also yield better outcomes
Mechanism Design by Evolution
So what has acted up in these people, in the absence of an external authority, to make them buck the selfish equilibrium? Anger, for one thing. Whether prompted by a shoddy business or a petty thief, outrage can override rationality. And in these instances, it may be that the hand of evolution has done what it would otherwise have taken an authority outside the game to accomplish
Nature is full of examples of individuals being essentially hijacked to serve the goals of another species. The lancet liver fluke (Dicrocoelium dendriticum), for instance, is a parasite that makes ants deliberately climb to the tops of grass blades so that they’ll be eaten by sheep—the lancet fluke’s preferred host. Likewise, the parasite Toxoplasma gondii makes mice permanently lose their fear of cats, with similar results.
Precisely because feelings are involuntary, they enable contracts that need no outside enforcement. Revenge almost never works out in favor of the one who seeks it, and yet someone who will respond with “irrational” vehemence to being taken advantage of is for that very reason more likely to get a fair deal. As Cornell economist Robert Frank puts it, “If people expect us to respond irrationally to the theft of our property, we will seldom need to, because it will not be in their interests to steal it. Being predisposed to respond irrationally serves much better here than being guided only by material self-interest.”
Love is like organized crime. It changes the structure of the marriage game so that the equilibrium becomes the outcome that works best for everybody.
Information Cascades: The Tragic Rationality of Bubbles
Whenever you find yourself on the side of the majority, it is time to pause and reflect.
—MARK TWAIN
An enormously influential paper by the economists Sushil Bikhchandani, David Hirshleifer, and Ivo Welch has demonstrated that under the right circumstances, a group of agents who are all behaving perfectly rationally and perfectly appropriately can nonetheless fall prey to what is effectively infinite misinformation. This has come to be known as an “information cascade.”
Information cascades offer a rational theory not only of bubbles, but also of fads and herd behavior more generally. They offer an account of how it’s easily possible for any market to spike and collapse, even in the absence of irrationality, malevolence, or malfeasance. The takeaways are several. For one, be wary of cases where public information seems to exceed private information, where you know more about what people are doing than why they’re doing it, where you’re more concerned with your judgments fitting the consensus than fitting the facts. When you’re mostly looking to others to set a course, they may well be looking right back at you to do the same. Second, remember that actions are not beliefs; cascades get caused in part when we misinterpret what others think based on what they do. We should be especially hesitant to overrule our own doubts—and if we do, we might want to find some way to broadcast those doubts even as we move forward, lest others fail to distinguish the reluctance in our minds from the implied enthusiasm in our actions. Last, we should remember from the prisoner’s dilemma that sometimes a game can have irredeemably lousy rules. There may be nothing we can do once we’re in it, but the theory of information cascades may help us to avoid such a game in the first place.
To Thine Own Self Compute
the Vickrey auction, just like the first-price auction, is a “sealed bid” auction process. That is, every participant simply writes down a single number in secret, and the highest bidder wins. However, in a Vickrey auction, the winner ends up paying not the amount of their own bid, but that of the second-place bidder. That is to say, if you bid $25 and I bid $10, you win the item at my price: you only have to pay $10.
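The mechanism itself is tiny; what makes it special is that bidding your true value becomes the safe default, since shading your bid down can only cost you the item, and bidding above your value can only win you something at more than it is worth to you.
```python
def vickrey(bids):
    """Sealed-bid second-price auction: the highest bidder wins, but pays the runner-up's bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, (_, price) = ranked[0][0], ranked[1]
    return winner, price

print(vickrey({"you": 25, "me": 10}))   # ('you', 10): you win the item, at my price
```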
In a landmark finding called the “revelation principle,” Nobel laureate Roger Myerson proved that any game that requires strategically masking the truth can be transformed into a game that requires nothing but simple honesty.
The revelation principle may seem hard to accept on its face, but its proof is actually quite intuitive. Imagine that you have an agent or a lawyer who will be playing the game for you. If you trust them to represent your interests, you’re going to simply tell them exactly what you want, and let them handle all of the strategic bid-shading and the recursive strategizing on your behalf.
And the revelation principle just expands this idea: any game that can be played for you by agents to whom you’ll tell the truth, it says, will become an honesty-is-best game if the behavior you want from your agent is incorporated into the rules of the game itself. As Nisan puts it, “The basic thing is if you don’t want your clients to optimize against you, you’d better optimize for them. That’s the whole proof.… If I design an algorithm that already optimizes for you, there is nothing you can do.”
If changing strategies doesn’t help, you can try to change the game. And if that’s not possible, you can at least exercise some control about which games you choose to play. The road to hell is paved with intractable recursions, bad equilibria, and information cascades. Seek out games where honesty is the dominant strategy. Then just be yourself.
====Computational Kindness====
We can be “computationally kind” to others by framing issues in terms that make the underlying computational problem easier. This matters because many problems—especially social ones, as we’ve seen—are intrinsically and inextricably hard.
seemingly innocuous language like “Oh, I’m flexible” or “What do you want to do tonight?” has a dark computational underbelly that should make you think twice. It has the veneer of kindness about it, but it does two deeply alarming things. First, it passes the cognitive buck: “Here’s a problem, you handle it.” Second, by not stating your preferences, it invites the others to simulate or imagine them. And as we have seen, the simulation of the minds of others is one of the biggest computational challenges a mind (or machine) can ever face.
In such situations, computational kindness and conventional etiquette diverge.
Politely withholding your preferences puts the computational problem of inferring them on the rest of the group. In contrast, politely asserting your preferences (“Personally, I’m inclined toward x. What do you think?”) helps shoulder the cognitive load of moving the group toward resolution.
In the hard cases, the best algorithms are all about doing what makes the most sense in the least amount of time, which by no means involves giving careful consideration to every factor and pursuing every computation to the end. Life is just too complicated for that.
Any dynamic system subject to the constraints of space and time is up against a core set of fundamental and unavoidable problems. These problems are computational in nature, which makes computers not only our tools but also our comrades. From this come three simple pieces of wisdom