‘Tis the Season for C-Section

The birthday paradox is most commonly presented as a thought experiment: how many people should be invited to a party so that the odds of finding two people who share a birthday are better than even? The answer is 23, which is surprisingly low and counter-intuitive. Apart from the unexpectedly small numeric value, there is nothing illogical or self-contradictory about the problem; calling it a “paradox” is something of a misnomer that persists for its expressiveness and history.

The paradox is the basis of many cryptanalytic techniques, which is why the output lengths of hash functions are twice as long as the keys of symmetric-key encryption algorithms with equivalent security. It can also be used for constructive purposes, such as a micropayment scheme or the mark-and-recapture method for population size estimation.

The standard derivation of the birthday paradox computes the probability that ${n}$ i.i.d. samples from the uniform distribution on ${N}$ objects are all distinct. It turns out that this probability falls to about 1/2 when ${n=\lceil\sqrt{N\cdot2\ln 2}\rceil}$ , which for ${N=365}$ results in ${n=23}$ . More precisely, the probability that there is a shared birthday among 22 people (with ${N=365}$ ) is 47.57%, and for 23 people it becomes 50.73%, which is more than 1/2.
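As a quick sanity check of these figures, here is a minimal Python sketch. The function names `p_shared_uniform` and `sqrt_threshold` are illustrative and do not come from the post.

```python
import math

def p_shared_uniform(n, N=365):
    """Probability that at least two of n uniform samples from N values coincide."""
    p_distinct = 1.0
    for i in range(n):
        # The (i+1)-th sample must avoid the i values already seen.
        p_distinct *= (N - i) / N
    return 1.0 - p_distinct

def sqrt_threshold(N):
    """The approximation n = ceil(sqrt(2 N ln 2)) quoted in the text."""
    return math.ceil(math.sqrt(2 * N * math.log(2)))

if __name__ == "__main__":
    print("threshold for N=365:", sqrt_threshold(365))
    for n in (22, 23):
        print(n, "people:", round(100 * p_shared_uniform(n), 2), "%")
```

Run as a script, this should reproduce the threshold of 23 and the 47.57% and 50.73% figures quoted above.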

A simple argument shows that the uniform distribution on ${N}$ objects is actually the “worst” case of the birthday paradox, i.e., if the distribution is non-uniform, then the even odds are achieved for smaller ${n}$ .
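One standard way to see this (a sketch, and not necessarily the argument alluded to above): write ${p_1,\dots,p_N}$ for the probabilities of the ${N}$ objects. The probability that ${n}$ i.i.d. samples are all distinct is

${\Pr[\text{all distinct}] = n!\cdot e_n(p_1,\dots,p_N),\qquad e_n(p_1,\dots,p_N)=\sum_{i_1<\cdots<i_n}p_{i_1}\cdots p_{i_n}.}$

Pick any two indices ${i\neq j}$ and hold all other probabilities fixed. Grouping the terms of ${e_n}$ by whether they contain ${p_i}$ , ${p_j}$ , both, or neither gives

${e_n = A + B\,(p_i+p_j) + C\,p_i p_j,}$

where ${A,B,C\geq 0}$ depend only on the remaining probabilities. With ${p_i+p_j}$ fixed, the product ${p_i p_j}$ is largest when ${p_i=p_j}$ , so replacing two unequal probabilities by their average never decreases the probability that all samples are distinct. That probability is therefore maximized, and the collision probability minimized, by the uniform distribution.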

We know that birthdays are not distributed uniformly due to the annual cycle of festivities, vacations, and religious customs (Lent and Ramadan call for abstinence; carnivals in Latin America and Kupala Night, the celebration of the summer solstice in Eastern Europe, emphatically do not). Is the non-uniformity of birthdays sufficient to bring down ${n}$ to a value smaller than 23?

The rest of the post is going to be US-centric because my sources are. We first look up the distribution of births by month in the National Vital Statistics Report released by the CDC for the year 2009, the latest available.

month	total births	births per day
January	337,980	10,903
February	316,641	11,309
March	347,803	11,219
April	337,272	11,242
May	345,257	11,137
June	346,971	11,566
July	368,450	11,885
August	359,554	11,599
September	361,922	12,064
October	347,625	11,214
November	320,195	10,673
December	340,995	11,000

Births by month, 2009. Source: National Vital Statistics Report, vol. 60, no. 1.

Although more children were born in July than in any other month, September comes out ahead on a per-day basis. The least fertile month is November, followed by January; November has about 11.5% fewer births per day than September (equivalently, September has about 13% more than November).

Assuming a uniform distribution of births within each month, the probability of observing a collision when ${n=22}$ increases slightly, to 47.6%.
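The calculation behind this number can be sketched as follows. The per-day rates are copied from the monthly table; the helper names (`day_weights_by_month`, `p_shared`) are illustrative, and the exact formula used here, ${n!\cdot e_n(p)}$ with ${e_n}$ the elementary symmetric polynomial, is one way to do the computation and not necessarily how the figure in the post was obtained.

```python
import math

# (month, births per day in 2009, days in the month); 2009 is not a leap year.
MONTHS = [
    ("January", 10903, 31), ("February", 11309, 28), ("March", 11219, 31),
    ("April", 11242, 30), ("May", 11137, 31), ("June", 11566, 30),
    ("July", 11885, 31), ("August", 11599, 31), ("September", 12064, 30),
    ("October", 11214, 31), ("November", 10673, 30), ("December", 11000, 31),
]

def day_weights_by_month():
    """One weight per calendar day, constant within each month (an assumption)."""
    weights = []
    for _name, per_day, days in MONTHS:
        weights.extend([per_day] * days)
    return weights

def p_shared(weights, n):
    """Exact probability that n i.i.d. draws with the given weights collide.

    P(all distinct) = n! * e_n(p), where e_n is the n-th elementary
    symmetric polynomial of the normalized probabilities p.
    """
    total = sum(weights)
    e = [1.0] + [0.0] * n  # e[k] accumulates e_k over the days seen so far
    for w in weights:
        p = w / total
        for k in range(n, 0, -1):  # descending, so each day is used at most once
            e[k] += e[k - 1] * p
    return 1.0 - math.factorial(n) * e[n]

if __name__ == "__main__":
    print("uniform :", p_shared([1] * 365, 22))
    print("monthly :", p_shared(day_weights_by_month(), 22))
```

The monthly figure should come out only marginally above the uniform one, in line with the slight increase described above.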

Less obvious but more significant are variations in the number of births falling on different days of the week. With more than 30% of births delivered through a C-section, and widespread use of labor induction, doctors and mothers are flexible in choosing the delivery date. Since no one wants to stay in the hospital over the weekend, Tuesdays and Wednesdays are the two most popular days for giving birth in the US. According to this table, the difference in the number of births between Sundays and Tuesdays is more than 80%!

day of week	births per day
Sunday	7,298
Monday	12,087
Tuesday	13,336
Wednesday	13,034
Thursday	12,765
Friday	12,364
Saturday	8,308

Births by day of the week, 2009. Source: National Vital Statistics Report, vol. 60, no. 1.

A more detailed exploration of the data, possible through the interactive query interface on the CDC website, shows that Thursday edges out Wednesday in September (and only in September) as the busiest day of the week.

If we encounter a group of people born in the same year (think of a freshman class), how does the birthday paradox change if we account for the day-of-week variations in the number of births? One possible way of answering this question is to fit a distribution whose marginals are given by the monthly and day-of-week tables, although we would definitely miss some effects due to holidays, several of which fall on Mondays. Another is to use actual microdata with information about all (or a representative sample of) births in the US. Unfortunately, I don’t have access to this information (if you do, please let me know). After briefly contemplating mining social networks for information about birthdays, I turned to another source released by the federal government, with the macabre name of the Death Master File.
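The first option, combining the monthly and day-of-week tables, can be sketched in Python. This is a rough illustration under a loudly stated assumption: each day's weight is simply the product of its month's rate and its weekday's rate, which treats the two effects as independent, only approximately reproduces the two marginals, and ignores holidays altogether. The names `day_weights` and `p_shared` are hypothetical, and this is not the computation performed in the post.

```python
import math
from datetime import date, timedelta

# Births per day by month, January..December 2009 (monthly table).
MONTH_RATE = [10903, 11309, 11219, 11242, 11137, 11566,
              11885, 11599, 12064, 11214, 10673, 11000]
# Births per day by weekday, indexed like date.weekday(): Monday=0 .. Sunday=6.
WEEKDAY_RATE = [12087, 13336, 13034, 12765, 12364, 8308, 7298]

def day_weights(year=2009):
    """Weight of each calendar day: month rate times weekday rate.

    ASSUMPTION: month and day-of-week effects are independent; holidays ignored.
    """
    weights = []
    d = date(year, 1, 1)
    while d.year == year:
        weights.append(MONTH_RATE[d.month - 1] * WEEKDAY_RATE[d.weekday()])
        d += timedelta(days=1)
    return weights

def p_shared(weights, n):
    """Exact collision probability: 1 - n! * e_n(p), with e_n the n-th
    elementary symmetric polynomial of the normalized weights."""
    total = sum(weights)
    e = [1.0] + [0.0] * n
    for w in weights:
        p = w / total
        for k in range(n, 0, -1):
            e[k] += e[k - 1] * p
    return 1.0 - math.factorial(n) * e[n]

if __name__ == "__main__":
    for n in (22, 23):
        print(n, "people:", p_shared(day_weights(), n))
```

Under this crude model the collision probability for 22 people is higher than in the uniform case but still below one half, so the threshold stays at 23.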

The primary purpose of the file is to prevent identity fraud, specifically the reuse of Social Security Numbers (SSNs) of the dead. The file contains records of virtually all deaths in the US in recent years, together with the date of birth and the SSN. The logic behind it is that transparency is an effective and cheap crime deterrent: making the SSNs of the deceased easily available (for example, they can be searched online on a number of genealogical sites) prevents a particularly nasty type of identity theft, since the dead cannot monitor their credit rating and report fraudulent transactions.

We take a version of the DMF from November 2011 and turn it around by using the dates of birth of the deceased as a proxy for birth rates in the years when they were born. We exclude children who survived less than a year (such deaths are strongly correlated with problem births, which most likely result in cesareans), and look at the years 1980–2010. Tabulating statistics over the nearly half a million records that satisfy these criteria, we find that the two weeks following the week of Labor Day have the highest number of births of the entire year.

The most recent leap year in which Labor Day fell on September 3, as it did this year, was 1984 (not much has changed since then: a war is raging in Afghanistan, Apple is in the news, “Are you better off than four years ago?”). If we plug in the distribution of birthdays from that year extracted from the Death Master File, it is still not sufficient to push the birthday paradox threshold below 23 (the probability of seeing a collision among 22 people born in 1984 is slightly more than 48%).

What’s up with September 16, the most popular birthday according to the New York Times? It turns out that the data was aggregated over births between 1973 and 1999, and over this date range, September 16 fell on more Tuesdays, Wednesdays, and Thursdays and on fewer weekend days than the adjacent dates.

The astute reader has surely noticed that Thursday of the week after Labor Day, a likely candidate for the highest number of births this year, is today. Happy birthday to the more than 13,000 kids born on this date, and congratulations to their parents! (And don’t forget your offspring’s birth certificate; you never know.)

6 thoughts on “‘Tis the Season for C-Section”

Omkant says:

September 15, 2012 at 10:28 pm

For whatever reason, I find this post quite funny! What prompted this post Ilya? 🙂

– Omkant

1. ilyamironov says:
  
  September 16, 2012 at 4:34 am
  
  It was meant to be funny of sorts—a literal interpretation of a metaphor. Plus I’ve always found data exploration quite exciting—it’s like a box of chocolate, you never know what you are going to get.
  
  1. Adam Smith says:
    
    September 17, 2012 at 2:56 pm
    
    As an early September baby, this post touched close to home.
    
    I would be curious to know if the skew of the distribution is heavier in countries with more Caesarians and which are more culturally homogeneous than the US. My understanding is that Brazil has an extremely high c-section rate. Maybe a good place to start…
  2. ilyamironov says:
    
    September 18, 2012 at 1:16 am
    
    > As an early September baby, this post touched close to home.
    
    Try not to think too hard about it 😉 But if you do, this is a link to a study on seasonality of births in Canada: http://www.canpopsoc.org/journal/CSPv20n1p1.pdf
    
    It contains some discussion of annual birth patterns in other countries. As of 1993 it was still not a well-understood phenomenon.
terdon says:

April 28, 2013 at 9:31 am

Does anybody know about any kind of empirical/statistical tests of numerology?

1. Ilya Mironov says:
  
  April 29, 2013 at 4:10 am
  
  If you mean astrology, i.e., relationship between one’s date of birth and fate or character, I don’t think you will find much science there. On the other hand, there are multiple studies exploring how the date of birth has long-term consequences due to factors such as school or sports eligibility rules (which explains why many professional hockey players were born in January or February).


Published by Ilya Mironov
