OkCupid Match Algorithm in Bourne Shell

Comment Code

#!/bin/sh -eu

To explain how the old OkCupid matching algorithm worked, I wrote a shell script that calculates the match percentage for two users. It is based on a Calculating Match Percentages article which OkCupid published in 2011. OkCupid has retired this algorithm in late 2017.

The article from OkCupid is available in the Internet Archive here: https://web.archive.org/web/20110222203148/https://www.okcupid.com/help/match-percentages


To determine match percentage for two users, one needs three values for each question a user answers: The answer, answers that the user finds acceptable and how important the topic should be for a match.

The columns for the tab-separated (TSV) data sets are the question, points a matching answer is worth, an answer given and answers that the user considers acceptable, separated by pipe characters (|).

sort <<EOF >dataset1
Do you like the taste of chili?	1	Yes	Yes
What is your favourite Pokémon?	50	Mew	Mew|Pikachu
EOF

sort <<EOF >dataset2
What is your favourite Pokémon?	1	Pikachu	Pikachu
Do you like the taste of chili?	10	No	Yes
EOF

Every piece of software that takes inputs must fully recognize them before processing any of it; this ensures both safety and security. Occupy Babel! provides an explanation: http://langsec.org/occupy/

Validate the input with a regular expression, to make sure that the lines can be processed. ! grep -v outputs lines that do not match and exits with a status code indicating failure after it finds any.

validate() (
 VALID_LINE_REGEX='^[^	]\+	[0-9]\+	[^	]\+	[^	]\+$'
 ! grep -v "${VALID_LINE_REGEX}" >&2
)

validate <dataset1
validate <dataset2

Users get points for each answer matching the stated preferences of others. It is therefore necessary to track points users accumulated and the maximum number of points they could possibly accumulate. In the beginning, values for current and maximum points are both zero.

POINTS_CUR_1=0
POINTS_CUR_2=0
POINTS_MAX_1=0
POINTS_MAX_2=0

Read in lines of both data sets, split them into fields and iterate over all. Skip to the next iteration if the questions do not match, so that code inside the loop processes only the matching questions.

The internal field separator $IFS is set to a tab character so that the shell can correctly split every tab-separated line into fields.

IFS='	'

while read -r QUESTION_1 POINTS_1 ANSWER_1 ACCEPTABLE_ANSWERS_1; do
 while read -r QUESTION_2 POINTS_2 ANSWER_2 ACCEPTABLE_ANSWERS_2; do
  if [ "${QUESTION_1}" != "${QUESTION_2}" ]; then
   continue
  fi

For all questions that both users answered, increase the maximum amount of points per user by the amount of points the other user assigned to this question.

  POINTS_MAX_1=$(( ${POINTS_MAX_1} + ${POINTS_2} ))
  POINTS_MAX_2=$(( ${POINTS_MAX_2} + ${POINTS_1} ))

For each answer of a user that fits the other user's preferences, increase the accumulated points per user by the amount of points that the other user assigned to this question.

The internal field separator $IFS for the acceptable answers list is the pipe character (|) – which does not clash with the field separator tab. $IFS is set to the tab character again afterwards.

   IFS='|'

   for ACCEPTABLE_ANSWER_2 in ${ACCEPTABLE_ANSWERS_2}; do
    if [ "${ANSWER_1}" = "${ACCEPTABLE_ANSWER_2}" ]; then
     POINTS_CUR_1=$(( ${POINTS_CUR_1} + ${POINTS_2} ))
    fi
   done

   for ACCEPTABLE_ANSWER_1 in ${ACCEPTABLE_ANSWERS_1}; do
    if [ "${ANSWER_2}" = "${ACCEPTABLE_ANSWER_1}" ]; then
     POINTS_CUR_2=$(( ${POINTS_CUR_2} + ${POINTS_1} ))
    fi
   done

   IFS='	'

In shell scripts, input to while-read loops is given at the end.

 done <dataset2
done <dataset1


For each user, satisfaction with the other users' answers is the percentage of points that the other user accumulated in relation to the maximum number of points the user could have accumulated.

percentage() {
 printf '4 k %s %s / p' ${1} ${2} | dc
}

SATISFACTION_1=$(
 percentage ${POINTS_CUR_1} ${POINTS_MAX_1}
)

SATISFACTION_2=$(
 percentage ${POINTS_CUR_2} ${POINTS_MAX_2}
)

The first user got 10 points of 11 possible – around 91 percent.

test ${POINTS_CUR_1} = 10
test ${POINTS_MAX_1} = 11
test ${SATISFACTION_1} = .9090

The second user got 50 points of 51 possible – around 98 percent.

test ${POINTS_CUR_2} = 50
test ${POINTS_MAX_2} = 51
test ${SATISFACTION_2} = .9803

To get a match percentage for two users, multiply satisfactions, then take the square root. OkCupid likes to think of each match percentage as the probability you'd get along and assumes those probabilities are independent.

product() {
 printf '%s %s * p' ${1} ${2} | dc
}

square_root() {
 printf '%s v p' ${1} | dc
}

CALCULATED_MATCH=$(
 square_root $(
  product ${SATISFACTION_1} ${SATISFACTION_2}
 )
)

The calculated match is around 94 percent.

test ${CALCULATED_MATCH} = .9439

The margin of error defined by OkCupid is one divided by the number of questions that both users share. This seems to have been decided arbitrarily: We’ve toyed with multiple formulas for confidence […] if we’re too aggressive, people with few questions answered will never show up in match results. If we’re too lenient, you might see too many matches who just got lucky on a few questions.

SHARED_QUESTION_COUNT=$( join -t'	' dataset1 dataset2 | wc -l )

reciprocal() {
 printf '4 k 1 %s / p' ${1} | dc
}

ERROR_MARGIN=$( reciprocal ${SHARED_QUESTION_COUNT} )

Both users have 2 questions in common; the margin of error is 0.5.

test ${SHARED_QUESTION_COUNT} = 2
test ${ERROR_MARGIN} = .5000

To give the most confidence in the match process, OkCupid published the lower boundary of the match range – i.e. the difference between the calculated match and the error margin.

difference() {
 printf '%s %s - p' ${1} ${2} | dc
}

TRUE_MATCH=$(
 difference ${CALCULATED_MATCH} ${ERROR_MARGIN}
)

The true match is around 44%.

test ${TRUE_MATCH} = .4439

Source code: okcupid-match-algorithm-in-bourne-shell.sh