The birthday paradox is most commonly presented as a thought experiment: How many people should be invited to a party so that the odds of finding two people who share a birthday are better than even? The answer is 23, which is surprisingly low and counter-intuitive. Other than the unexpectedly small numeric value, there is nothing illogical or self-contradictory about the problem; calling it a “paradox” is somewhat of a misnomer that persists for its expressiveness and history.

The paradox is the basis of many cryptanalytic techniques, which is why the output lengths of hash functions are twice as long as the keys of symmetric-key encryption algorithms with equivalent security. It can also be used for constructive purposes, such as a micropayment scheme or the mark-and-recapture method for population size estimation.

The standard derivation of the birthday paradox computes the probability that ${n}$ i.i.d. samples from the uniform distribution on ${N}$ objects are all distinct. It turns out that this probability falls to about 1/2 when ${n=\lceil\sqrt{N\cdot2\ln 2}\rceil}$, which for ${N=365}$ results in ${n=23}$. More precisely, the probability that there is a shared birthday among 22 people (with ${N=365}$) is 47.57%, and for 23 people it becomes 50.73%, which is more than 1/2.
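The numbers above are easy to reproduce. The following is a minimal sketch, not code from the original analysis; the function names `p_shared_uniform` and `approx_threshold` are made up for illustration.

```python
import math


def p_shared_uniform(n: int, N: int = 365) -> float:
    """Probability that at least two of n uniform samples from N objects coincide."""
    p_distinct = 1.0
    for k in range(n):
        # The (k+1)-th sample must avoid the k values already taken.
        p_distinct *= (N - k) / N
    return 1.0 - p_distinct


def approx_threshold(N: int = 365) -> int:
    """The closed-form estimate ceil(sqrt(2 N ln 2)) quoted in the text."""
    return math.ceil(math.sqrt(2 * N * math.log(2)))


if __name__ == "__main__":
    for n in (22, 23):
        print(n, f"{p_shared_uniform(n):.4f}")
    print("approximate threshold:", approx_threshold())
```

The exact product and the square-root estimate agree on 23 for ${N=365}$; for other values of ${N}$ the estimate may be off by one, so the exact product is the safer check.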

A simple argument shows that the uniform distribution on ${N}$ objects is actually the “worst” case of the birthday paradox, i.e., if the distribution is non-uniform, then the even odds are achieved for smaller ${n}$.
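One way to see this, sketched here as a guess at what such an argument looks like rather than as the author's own proof: if object ${i}$ has probability ${p_i}$, the probability that ${n}$ i.i.d. samples are all distinct is

${\Pr[\text{all distinct}] = n!\sum_{i_1<\dots<i_n} p_{i_1}p_{i_2}\cdots p_{i_n}.}$

Fix all probabilities except two, ${p_a}$ and ${p_b}$. The sum then has the form

${A + B\,(p_a+p_b) + C\,p_a p_b,}$

with ${A, B, C \ge 0}$ independent of ${p_a}$ and ${p_b}$. Replacing both by their average ${(p_a+p_b)/2}$ leaves ${p_a+p_b}$ unchanged and cannot decrease ${p_a p_b}$, so it cannot decrease the probability of no collision. Repeated averaging moves any distribution toward the uniform one, which therefore maximizes the chance that all samples are distinct and minimizes the chance of a shared birthday for every ${n}$.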

We know that birthdays are not distributed uniformly due to the annual cycle of festivities, vacations, and religious customs (Lent and Ramadan call for abstinence; carnivals in Latin America and Kupala Night, the celebration of the summer solstice in Eastern Europe, emphatically do not). Is the non-uniformity of birthdays sufficient to bring down ${n}$ to a value smaller than 23?

The rest of the post is going to be US-centric because my sources are. We first look up the distribution of births by month from the National Vital Statistics Report released by the CDC for the year 2009, the latest available.

| Month | Total births | Births per day |
|---|---:|---:|
| January | 337,980 | 10,903 |
| February | 316,641 | 11,309 |
| March | 347,803 | 11,219 |
| April | 337,272 | 11,242 |
| May | 345,257 | 11,137 |
| June | 346,971 | 11,566 |
| July | 368,450 | 11,885 |
| August | 359,554 | 11,599 |
| September | 361,922 | 12,064 |
| October | 347,625 | 11,214 |
| November | 320,195 | 10,673 |
| December | 340,995 | 11,000 |

Births by month, 2009. Source: National Vital Statistics Report, vol. 60, no. 1.

Although more children were born in July than in any other month, on a per-day basis September comes out ahead. The least fertile month is November, followed by January; September sees about 13% more births per day than November.

Assuming a uniform distribution of births within each month, the probability of observing a collision when ${n=22}$ increases slightly, from 47.57% to about 47.6%.
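For a non-uniform distribution the simple product formula no longer applies, but the probability can still be computed exactly with a short dynamic program. The sketch below is one way to do it, not necessarily how the figure in the post was obtained; the names `p_shared` and `monthly_day_probs` are illustrative, and the birth counts are copied from the 2009 monthly table.

```python
# Total births by month, 2009 (January..December), from the table in the post.
BIRTHS_2009 = [337_980, 316_641, 347_803, 337_272, 345_257, 346_971,
               368_450, 359_554, 361_922, 347_625, 320_195, 340_995]
DAYS_2009 = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # 2009 is not a leap year


def p_shared(probs, n):
    """P(at least two of n i.i.d. samples coincide) for day probabilities `probs`.

    f[k] is the probability that k samples fall on k distinct days among the
    days processed so far (k! times the k-th elementary symmetric polynomial).
    """
    f = [1.0] + [0.0] * n
    for p in probs:
        # Go downward so each day is used at most once per update.
        for k in range(n, 0, -1):
            f[k] += k * p * f[k - 1]
    return 1.0 - f[n]


def monthly_day_probs():
    """Per-day probabilities, assuming births are uniform within each month."""
    total = sum(BIRTHS_2009)
    return [b / (d * total)
            for b, d in zip(BIRTHS_2009, DAYS_2009)
            for _ in range(d)]


if __name__ == "__main__":
    print(f"uniform, n=22: {p_shared([1 / 365] * 365, 22):.4f}")
    print(f"monthly, n=22: {p_shared(monthly_day_probs(), 22):.4f}")
```

The monthly variation is small, so the result moves only in the fourth decimal place relative to the uniform case.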

Less obvious but more significant are variations in the number of births falling on different days of the week. With more than 30% of births delivered through a C-section, and widespread use of labor induction, doctors and mothers are flexible in choosing the delivery date. Since no one wants to stay in the hospital over the weekend, Tuesdays and Wednesdays are the two most popular days for giving birth in the US. According to the table below, Tuesdays see over 80% more births than Sundays!

| Day of week | Births per day |
|---|---:|
| Sunday | 7,298 |
| Monday | 12,087 |
| Tuesday | 13,336 |
| Wednesday | 13,034 |
| Thursday | 12,765 |
| Friday | 12,364 |
| Saturday | 8,308 |

Births by day of the week, 2009. Source: National Vital Statistics Report, vol. 60, no. 1.

A more detailed exploration of the data, possible through the interactive query interface on the CDC website, shows that Thursday edges out Wednesday in September (and only in September) as the busiest day of the week.

If we encounter a group of people born in the same year (think of a freshman class), how does the birthday paradox change if we account for the day-of-week variations in the number of births? One possible way of answering this question is to fit a distribution with the marginals given by the monthly and day-of-week tables. We would definitely miss some effects due to holidays, several of which fall on Mondays. Another is to use actual microdata with information about all (or a representative sample of) births in the US. Unfortunately, I don’t have access to this information (if you do, please let me know). After briefly contemplating mining social networks for information about birthdays, I turned to another source released by the federal government, with the macabre name of the Death Master File.
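To get a feel for the first option, here is a crude stand-in for a properly fitted distribution: each date of 2009 is weighted by the product of its month's per-day rate and its weekday's per-day rate. This assumes the two effects are independent, reproduces the two marginals only approximately, and ignores holidays entirely. It is an illustration, not the analysis used later in the post, and the names `MONTH_RATE`, `DOW_RATE`, and `product_model_probs` are made up for the sketch.

```python
from datetime import date, timedelta

# Births per day by month, 2009 (January..December), from the monthly table.
MONTH_RATE = [10_903, 11_309, 11_219, 11_242, 11_137, 11_566,
              11_885, 11_599, 12_064, 11_214, 10_673, 11_000]
# Births per day by weekday, 2009, ordered Monday..Sunday to match date.weekday().
DOW_RATE = [12_087, 13_336, 13_034, 12_765, 12_364, 8_308, 7_298]


def product_model_probs(year=2009):
    """Per-date probabilities proportional to month rate times weekday rate (an assumption)."""
    d = date(year, 1, 1)
    weights = []
    while d.year == year:
        weights.append(MONTH_RATE[d.month - 1] * DOW_RATE[d.weekday()])
        d += timedelta(days=1)
    total = sum(weights)
    return [w / total for w in weights]


def p_shared(probs, n):
    """P(at least two of n i.i.d. samples coincide) for day probabilities `probs`."""
    # f[k]: probability that k samples fall on k distinct days among those processed.
    f = [1.0] + [0.0] * n
    for p in probs:
        for k in range(n, 0, -1):
            f[k] += k * p * f[k - 1]
    return 1.0 - f[n]


if __name__ == "__main__":
    probs = product_model_probs()
    for n in (22, 23):
        print(n, f"{p_shared(probs, n):.4f}")
```

Because the weekday pattern is tied to a specific calendar, this model only makes sense for a group born in a single year, which is the freshman-class scenario described above.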

The primary purpose of the file is to prevent identity fraud, specifically reuse of Social Security Numbers (SSNs) of the dead. The file contains records of virtually all deaths in the US in recent years, together with the date of birth and the SSN. The logic behind it is that transparency is an effective and cheap crime deterrent: making the SSNs of the deceased easily available (for example, they can be searched online on a number of genealogical sites) prevents a particularly nasty type of identity theft, since the dead cannot monitor their credit rating and report fraudulent transactions.

We take a version of the DMF from November 2011 and turn it around by using the date of birth of the deceased as a proxy for birth rates for the years when they were born. We exclude children who survived less than a year (they are strongly correlated with problem births, most likely resulting in cesarean deliveries), and look at the years 1980–2010. Tabulating statistics over nearly half a million records that satisfy these criteria, we find that the two weeks following the week of Labor Day have the highest number of births of the entire year.

The most recent leap year when Labor Day fell on September 3, as it did this year, was 1984 (not much has changed since then: a war is raging in Afghanistan, Apple is in the news, “Are you better off than four years ago?”). If we plug in the distribution of birthdays from that year extracted from the Death Master File, it is still not sufficient to push the birthday paradox threshold below 23 (the probability of seeing a collision with 22 people born in 1984 is slightly more than 48%).

What’s up with September 16, the most popular birthday according to the New York Times? It turns out that the data was aggregated over births between 1973 and 1999, and over this date range, September 16 had more Tuesdays, Wednesdays, and Thursdays and fewer weekends than adjacent days.
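The calendar side of this explanation can be checked without any birth data. The sketch below simply counts on which weekdays a given date fell in 1973–1999; the helper names `weekday_counts` and `midweek_and_weekend` are illustrative, and the counts say nothing about holidays or actual birth numbers.

```python
from collections import Counter
from datetime import date

NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # order of date.weekday()


def weekday_counts(month, day, first_year=1973, last_year=1999):
    """How often the given calendar date fell on each weekday, inclusive of both years."""
    return Counter(NAMES[date(year, month, day).weekday()]
                   for year in range(first_year, last_year + 1))


def midweek_and_weekend(month, day, first_year=1973, last_year=1999):
    """Return (number of Tue/Wed/Thu occurrences, number of Sat/Sun occurrences)."""
    c = weekday_counts(month, day, first_year, last_year)
    return c["Tue"] + c["Wed"] + c["Thu"], c["Sat"] + c["Sun"]


if __name__ == "__main__":
    for day in range(13, 20):
        midweek, weekend = midweek_and_weekend(9, day)
        print(f"September {day}: midweek={midweek}, weekend={weekend}")
```

With only 27 years in the range, neighboring dates differ by at most one occurrence per weekday, so this is best read as a sanity check on the mechanism rather than a reproduction of the aggregated data.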

The astute reader has surely noticed that Thursday of the week after Labor Day, a likely candidate for the highest number of births this year, is today. Happy birthday to more than 13,000 kids born on this date and congratulations to their parents! (And don’t forget your offspring’s birth certificate; you never know.)

1. September 15, 2012 10:28 pm

For whatever reason, I find this post quite funny! What prompted this post Ilya? 🙂

– Omkant

• September 16, 2012 4:34 am

It was meant to be funny in a way: a literal interpretation of a metaphor. Plus I’ve always found data exploration quite exciting; it’s like a box of chocolates, you never know what you are going to get.

September 17, 2012 2:56 pm

As an early September baby, this post touched close to home.

I would be curious to know if the skew of the distribution is heavier in countries with more Caesarians and which are more culturally homogeneous than the US. My understanding is that Brazil has an extremely high c-section rate. Maybe a good place to start…

• September 18, 2012 1:16 am

> As an early September baby, this post touched close to home.

Try not to think too hard about it 😉 But if you do, this is a link to a study on seasonality of births in Canada: http://www.canpopsoc.org/journal/CSPv20n1p1.pdf

It contains some discussion of annual birth patterns in other countries. As of 1993 it was still not a well-understood phenomenon.