‘Tis the Season for C-Section

September 13, 2012

The birthday paradox is most commonly presented as a thought experiment: how many people must be invited to a party so that the odds of finding two who share a birthday are better than even? The answer is 23, which is surprisingly low and counter-intuitive. Apart from the unexpectedly small number, there is nothing illogical or self-contradictory about the problem; calling it a “paradox” is something of a misnomer that persists for its expressiveness and history.

The paradox is the basis of many cryptanalytic techniques, which is why the output lengths of hash functions are twice as long as the keys of symmetric-key encryption algorithms with equivalent security. It can also be used for constructive purposes, such as a micropayment scheme or the mark-and-recapture method for population size estimation.

The standard derivation of the birthday paradox computes the probability that {n} i.i.d. samples from the uniform distribution on {N} objects are all distinct. This probability drops to about 1/2 when {n\approx\sqrt{N\cdot2\ln 2}}; rounding up gives {n=\lceil\sqrt{N\cdot2\ln 2}\rceil}, which for {N=365} results in {n=23}. More precisely, the probability that there is a shared birthday among 22 people (with {N=365}) is 47.57%, and for 23 people it becomes 50.73%, which is more than 1/2.
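The numbers quoted above are easy to check directly. The following is a minimal sketch; the function names `collision_probability` and `threshold_estimate` are made up for this illustration.

```python
import math


def collision_probability(n: int, N: int = 365) -> float:
    """Probability that n uniform i.i.d. draws from N values contain a repeat."""
    p_distinct = 1.0
    for k in range(n):
        # The (k+1)-th draw must avoid the k values already seen.
        p_distinct *= (N - k) / N
    return 1.0 - p_distinct


def threshold_estimate(N: int) -> int:
    """The approximation ceil(sqrt(2 N ln 2)) for the even-odds group size."""
    return math.ceil(math.sqrt(2 * N * math.log(2)))


print(f"{collision_probability(22):.4f}")  # 0.4757
print(f"{collision_probability(23):.4f}")  # 0.5073
print(threshold_estimate(365))             # 23
```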

A simple argument shows that the uniform distribution on {N} objects is actually the “worst” case of the birthday paradox, i.e., if the distribution is non-uniform, then even odds are reached at the same or a smaller {n}.
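One way to see this is sketched below; it is not necessarily the argument alluded to above. The notation {p_1,\dots,p_N} for the probabilities of the individual days is introduced here only for illustration.

```latex
% Probability that n samples are all distinct, as an elementary symmetric polynomial:
\Pr[\text{all } n \text{ distinct}] = n!\, e_n(p_1,\dots,p_N),
\qquad
e_n(p_1,\dots,p_N) = \sum_{i_1 < \dots < i_n} p_{i_1} \cdots p_{i_n}.

% Single out two days i \neq j; q denotes the remaining N-2 probabilities.
e_n(p_1,\dots,p_N) = p_i p_j \, e_{n-2}(q) + (p_i + p_j)\, e_{n-1}(q) + e_n(q).
```

With {p_i+p_j} held fixed, only the first term changes, and the product {p_i p_j} is largest when {p_i=p_j}. Equalizing any two unequal probabilities therefore never lowers the chance that all samples are distinct, so that chance is maximized, and the collision probability minimized, by the uniform distribution.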

We know that birthdays are not distributed uniformly due to the annual cycle of festivities, vacations, and religious customs (Lent and Ramadan call for abstinence; carnivals in Latin America and Kupala Night, the celebration of the summer solstice in Eastern Europe, emphatically do not). Is the non-uniformity of birthdays sufficient to bring {n} down to a value smaller than 23?

The rest of the post is going to be US-centric because my sources are. We first look up the distribution of births by month in the National Vital Statistics Report released by the CDC for the year 2009, the latest available.

| Month | Total births | Births per day |
|---|---:|---:|
| January | 337,980 | 10,903 |
| February | 316,641 | 11,309 |
| March | 347,803 | 11,219 |
| April | 337,272 | 11,242 |
| May | 345,257 | 11,137 |
| June | 346,971 | 11,566 |
| July | 368,450 | 11,885 |
| August | 359,554 | 11,599 |
| September | 361,922 | 12,064 |
| October | 347,625 | 11,214 |
| November | 320,195 | 10,673 |
| December | 340,995 | 11,000 |

Births by month, 2009. Source: National Vital Statistics Report, vol. 60, no. 1.

Although more children were born in July than in any other month, on a per-day basis September comes out ahead. The least fertile month is November, followed by January; September sees about 13% more births per day than November.

Assuming a uniform distribution of births within each month, the probability of observing a collision when {n=22} increases slightly, from 47.57% to 47.6%.
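A figure like this can be checked with a short dynamic program. The sketch below is one way to do it and not necessarily how the number in the post was obtained: it spreads each month's 2009 total evenly over that month's days and then computes the exact probability that {n} samples are all distinct. The function names are hypothetical.

```python
# Total births per month in 2009, copied from the monthly table.
BIRTHS_2009 = [337980, 316641, 347803, 337272, 345257, 346971,
               368450, 359554, 361922, 347625, 320195, 340995]
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # 2009 is not a leap year


def day_probabilities():
    """Per-day birth probabilities, assuming births are uniform within each month."""
    total = sum(BIRTHS_2009)
    probs = []
    for births, days in zip(BIRTHS_2009, DAYS_IN_MONTH):
        probs.extend([births / (days * total)] * days)
    return probs


def collision_probability(probs, n):
    """Exact probability that n i.i.d. draws from `probs` contain a repeat.

    f[k] is the probability that k draws are all distinct and fall on the
    days processed so far; adding a day with probability p gives
    f[k] += k * p * f[k - 1].
    """
    f = [1.0] + [0.0] * n
    for p in probs:
        for k in range(n, 0, -1):  # descending, so f[k - 1] is still the old value
            f[k] += k * p * f[k - 1]
    return 1.0 - f[n]


probs = day_probabilities()
print(f"{collision_probability(probs, 22):.4f}")
print(f"{collision_probability(probs, 23):.4f}")
```

Passing `[1 / 365] * 365` as `probs` recovers the uniform figures of 47.57% and 50.73%, which is a convenient sanity check on the recurrence.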

Less obvious but more significant are variations in the number of births falling on different days of the week. With more than 30% of births delivered by C-section, and with widespread use of labor induction, doctors and mothers have flexibility in choosing the delivery date. Since no one wants to stay in the hospital over the weekend, Tuesdays and Wednesdays are the two most popular days for giving birth in the US. According to the table below, Tuesdays see over 80% more births than Sundays!

| Day of week | Births per day |
|---|---:|
| Sunday | 7,298 |
| Monday | 12,087 |
| Tuesday | 13,336 |
| Wednesday | 13,034 |
| Thursday | 12,765 |
| Friday | 12,364 |
| Saturday | 8,308 |

Births by day of the week, 2009. Source: National Vital Statistics Report, vol. 60, no. 1.

A more detailed exploration of the data, possible through the interactive query interface on the CDC website, shows that Thursday edges out Wednesday in September (and only in September) for the busiest day of the week.

If we encounter a group of people born in the same year (think of a freshman class), how does the birthday paradox change if we account for the day-of-week variations in the number of births? One possible way of answering this question is to fit a distribution with the marginals given by the monthly and day-of-week tables. This would definitely miss some effects due to holidays, several of which fall on Mondays. Another is to use actual microdata with information about all (or a representative sample of) births in the US. Unfortunately, I don’t have access to this information (if you do, please let me know). After briefly contemplating mining social networks for information about birthdays, I turned to another source released by the federal government, with the macabre name of the Death Master File.
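As a rough illustration of the first option, the sketch below weights every day of 2009 by the product of its month's and its weekday's births-per-day figures. This multiplicative model is a simplifying assumption made here: it only approximately matches the two marginals (an exact fit would need something like iterative proportional fitting), it ignores holidays, and it is not the computation done in this post. All names are hypothetical.

```python
from datetime import date, timedelta

# Births per day by month (January..December), from the monthly table.
MONTHLY_RATE = [10903, 11309, 11219, 11242, 11137, 11566,
                11885, 11599, 12064, 11214, 10673, 11000]
# Births per day by weekday, reordered Monday..Sunday to match date.weekday().
WEEKDAY_RATE = [12087, 13336, 13034, 12765, 12364, 8308, 7298]


def day_probabilities(year=2009):
    """Per-day probabilities under the assumed month x weekday product model."""
    weights = []
    d = date(year, 1, 1)
    while d.year == year:
        weights.append(MONTHLY_RATE[d.month - 1] * WEEKDAY_RATE[d.weekday()])
        d += timedelta(days=1)
    total = sum(weights)
    return [w / total for w in weights]


def collision_probability(probs, n):
    """Exact probability that n i.i.d. draws from `probs` contain a repeat."""
    f = [1.0] + [0.0] * n  # f[k]: probability that k draws are all distinct
    for p in probs:
        for k in range(n, 0, -1):  # descending, so f[k - 1] is still the old value
            f[k] += k * p * f[k - 1]
    return 1.0 - f[n]


probs = day_probabilities()
print(f"{collision_probability(probs, 22):.4f}")
print(f"{collision_probability(probs, 23):.4f}")
```

Under these assumptions the estimate for 22 people rises noticeably above the uniform 47.57% but stays below one half, so the threshold remains 23.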

The primary purpose of the file is to prevent identity fraud, specifically reuse of Social Security Numbers (SSNs) of the dead. The file contains records of virtually all deaths in the US in recent years, together with the date of birth and the SSN. The logic behind it is that transparency is an effective and cheap crime deterrent: by making the SSNs of the deceased easily available (for example, they can be searched online on a number of genealogical sites), it prevents a particularly nasty type of identity theft, since the dead cannot monitor their credit rating and report fraudulent transactions.

We take a version of the DMF from November 2011 and turn it around by using the date of birth of the deceased as a proxy for birth rates for the years when they were born. We exclude children who survived less than a year (such deaths are strongly correlated with problem births, most likely resulting in cesareans), and look at the years 1980–2010. Tabulating statistics over nearly half a million records that satisfy these criteria, we find that the two weeks following the week of Labor Day have the highest number of births of the entire year.

The most recent leap year in which Labor Day fell on September 3, as it did this year, was 1984 (not much has changed since then: a war is raging in Afghanistan, Apple is in the news, “Are you better off than four years ago?”). If we plug in the distribution of birthdays from that year extracted from the Death Master File, it is still not sufficient to push the birthday paradox threshold below 23 (the probability of seeing a collision among 22 people born in 1984 is slightly more than 48%).

What’s up with September 16, the most popular birthday according to the New York Times? It turns out that the data was aggregated over births between 1973 and 1999, and over this date range, September 16 had more Tuesdays, Wednesdays, and Thursdays and fewer weekends than adjacent days.

The astute reader has surely noticed that Thursday of the week after Labor Day, a likely candidate for the highest number of births this year, is today. Happy birthday to more than 13,000 kids born on this date and congratulations to their parents! (And don’t forget your offspring’s birth certificate; you never know.)

6 Comments
  1. September 15, 2012 10:28 pm

    For whatever reason, I find this post quite funny! What prompted this post Ilya? 🙂

    – Omkant

    • September 16, 2012 4:34 am

      It was meant to be funny of sorts: a literal interpretation of a metaphor. Plus I’ve always found data exploration quite exciting; it’s like a box of chocolates, you never know what you are going to get.

      • Adam Smith
        September 17, 2012 2:56 pm

        As an early September baby, this post touched close to home.

        I would be curious to know if the skew of the distribution is heavier in countries with more Caesarians and which are more culturally homogeneous than the US. My understanding is that Brazil has an extremely high c-section rate. Maybe a good place to start…

      • September 18, 2012 1:16 am

        > As an early September baby, this post touched close to home.

        Try not to think too hard about it 😉 But if you do, this is a link to a study on seasonality of births in Canada: http://www.canpopsoc.org/journal/CSPv20n1p1.pdf

        It contains some discussion of annual birth patterns in other countries. As of 1993 it was still not a well-understood phenomenon.

  2. terdon
    April 28, 2013 9:31 am

    Does anybody know of any kind of empirical/statistical tests of numerology?

    • April 29, 2013 4:10 am

      If you mean astrology, i.e., relationship between one’s date of birth and fate or character, I don’t think you will find much science there. On the other hand, there are multiple studies exploring how the date of birth has long-term consequences due to factors such as school or sports eligibility rules (which explains why many professional hockey players were born in January or February).
