What was your name again?
One of the thoughts which sometimes crosses my mind when I join a group of people is "What if there is someone with my name already? How likely is that?" In Slovenia, the name Lorand is very rare. Nevertheless it is interesting to think about how likely name collisions are for different names.
This article aims to answer two questions:
- How likely is it that there is a name collision in a group of people in Slovenia?
- Given a specific name, how likely is it that it will not be unique in a group of given size?
First of all, let me list a few simplifying assumptions that I make:
- the group of people I am looking at is a random sample of the population. Groups of people working together at a job or studying together at the universitity are far from random, I hope :)
- there cannot be a name collision with a person of the opposite gender, i.e. the set of girl and boy names are disjoint (with a few exceptions this is true; only Saša occurs as both a girl's and a boy's name in the 200 most frequent names in Slovenia)
The problem of computing the probability of a name collision is related to the birthday problem, where birthday collisions are computed. What makes it different is the distributions. Birthdays are uniformly distributed, while the frequency distribution of names is exponential.
The figure below shows the frequency of each of the most popular 200 Slovenian names, on the right, and the percentage of people accounted for by the names up to a certain popularity (on the left). Indeed, an exponential decrease of popularity can be observed, also, half of the population is covered by the most popular 50 names.
The top 200 names only cover about 80% of the population (819966 boys and 824442 girls). Using this data only would leave around 120000 boys and 115000 girls nameless.
Knowing that the popularity of names decreases exponentially, we can estimate the frequency of the boy names which didn't make it into the top 200 by tuning the parameters n and b in the equation below:
Doing some quick trial and error in WolframAlpha I came up with the values of n = 1800 and b = 0.006, which look reasonable. The tail of the boy name frequency distribution therefore looks like this:
For girls, the popularity of rare names was found in a similar way.
Knowing the frequency distributions of names, it is time to determine the probability of collisions. I am sure that there is a mathematical solution to this problem, which I would very much like to hear, however I decided to estimate the probabilities by simulation.
The simulation algorithm is very simple. For example, if we want to determine the probability of at least two girls having the same name in a group of 20 girls, we can sample 20 names from the distribution derscibed above, and see if there is a collision or not. If we do this one million times we get a pretty good estimate of the probability of collision by dividing the number of groups where there was a collision by one million. For our example, the probability of at least two of 20 girls having the same name turned out to be 77.53% (compare this with the birthday problem, where the probability of collision in a group of 20 is under 50%)
The above figure shows the probability of name collisions in groups of sizes 2 up to 200. Groups always have people all of the same gender. Dotted lines show probabilities for girls and dashed lines for boys. It looks like in a group of 14, the chances of at least two people having the same name are about 50%, both for girls and for boys. If a group is larger than 50, then it is practically certain that two people will have the same name, moreover ot is very likely that there will be three girls (80%) or three boys (70%) with the same name. As many as six people having the same name is almost certain if the group is large enough (200 people or more). In general, groups of girls are more likely to have collisions (especially more than two with the same name), because the most frequent girl's name, Marija, is more than twice as popular than the most popular boy's name.
Finally, let's look at the probability of collision for a specific name; how likely is a collision of Matjaz compared to a collision of Luka for instance.
Two random people having the same name is very unlikely, even for the most popular names (well under 1%). For groups of 50, any of the top 10 names has approx. 10% or more chance of not being unique, and for groups of 200 each of the top 100 names has at least a 10% chance of not being unique, while the top names are almost certainly not unique in the group.
In conclusion, name collisions are surprisingly frequent. Perhaps one could make some money by standing in front of a supermarket and taking bets about the names of people inside.