How does the 'did you mean' feature works in git?

Published on: 25 November 2025

Whenever you make a typo when writing a git command you get something like this.

$ git stat
git: 'stat' is not a git command. See 'git --help'.

The most similar commands are
        status
        stage
        stash

And I have always wondered, what algorithm does git uses to figure out similar commands? So I looked at the source code and found the answer: the Damerau-Levenshtein algorithm. Wikipedia has a great article about it, but in short what this algorithm does is to measure the minimum number of operations (swap, substitution, addition and deletion) that are required to change one word into another. This algorithm is commonly used in various spell checking tools. If two strings are the same the algorithm will return "0" and the more the number of operations are required the higher the return value.

Instead of treating each operation equally, in the git implementation, they added the possibility to have weights for each operation, to give more or less importance to each operation. Git uses the following weights:

0 for swap
2 for substitution
1 for addition
3 for deletion

So, if you mistype the command "status" here are various examples with these weights.

"statsu" will have "0" as return value because the only operation needed is to swap the last two characters, and the swap operation weight is 0.
"statu" will have "1" as return value because the character 'a' needs to be added.
"statud" will have "2" as return value because the character 'd' needs to be substituted with 's'.
"statuss" will have "3" as return value because the last character 's' needs to be delete.

As you can see from the examples, these weights make a lot of sense for this case, because it's much more likely that you swap two characters while typing rather than adding a new character.

Here is the full implementation in git.

/*
 * This function implements the Damerau-Levenshtein algorithm to
 * calculate a distance between strings.
 *
 * Basically, it says how many letters need to be swapped, substituted,
 * deleted from, or added to string1, at least, to get string2.
 *
 * The idea is to build a distance matrix for the substrings of both
 * strings.  To avoid a large space complexity, only the last three rows
 * are kept in memory (if swaps had the same or higher cost as one deletion
 * plus one insertion, only two rows would be needed).
 *
 * At any stage, "i + 1" denotes the length of the current substring of
 * string1 that the distance is calculated for.
 *
 * row2 holds the current row, row1 the previous row (i.e. for the substring
 * of string1 of length "i"), and row0 the row before that.
 *
 * In other words, at the start of the big loop, row2[j + 1] contains the
 * Damerau-Levenshtein distance between the substring of string1 of length
 * "i" and the substring of string2 of length "j + 1".
 *
 * All the big loop does is determine the partial minimum-cost paths.
 *
 * It does so by calculating the costs of the path ending in characters
 * i (in string1) and j (in string2), respectively, given that the last
 * operation is a substitution, a swap, a deletion, or an insertion.
 *
 * This implementation allows the costs to be weighted:
 *
 * - w (as in "sWap")
 * - s (as in "Substitution")
 * - a (for insertion, AKA "Add")
 * - d (as in "Deletion")
 *
 * Note that this algorithm calculates a distance _iff_ d == a.
 */
int levenshtein(const char *string1, const char *string2,
                int w, int s, int a, int d)
{
        int len1 = strlen(string1), len2 = strlen(string2);
        int *row0, *row1, *row2;
        int i, j;

        ALLOC_ARRAY(row0, len2 + 1);
        ALLOC_ARRAY(row1, len2 + 1);
        ALLOC_ARRAY(row2, len2 + 1);

        for (j = 0; j <= len2; j++)
                row1[j] = j * a;
        for (i = 0; i < len1; i++) {
                int *dummy;

                row2[0] = (i + 1) * d;
                for (j = 0; j < len2; j++) {
                        /* substitution */
                        row2[j + 1] = row1[j] + s * (string1[i] != string2[j]);
                        /* swap */
                        if (i > 0 && j > 0 && string1[i - 1] == string2[j] &&
                                        string1[i] == string2[j - 1] &&
                                        row2[j + 1] > row0[j - 1] + w)
                                row2[j + 1] = row0[j - 1] + w;
                        /* deletion */
                        if (row2[j + 1] > row1[j + 1] + d)
                                row2[j + 1] = row1[j + 1] + d;
                        /* insertion */
                        if (row2[j + 1] > row2[j] + a)
                                row2[j + 1] = row2[j] + a;
                }

                dummy = row0;
                row0 = row1;
                row1 = row2;
                row2 = dummy;
        }


        i = row1[len2];
        free(row0);
        free(row1);
        free(row2);

        return i;
}

Now let's investigate more deeper how this distance is used.

Whenever git doesn't know the command that is provided as input, it calls the help_unknown_cmd() function, and below I'm going to explain the most important parts.

load_command_list("git-", &main_cmds, &other_cmds);

add_cmd_list(&main_cmds, &cfg.aliases);
add_cmd_list(&main_cmds, &other_cmds);
QSORT(main_cmds.names, main_cmds.cnt, cmdname_compare);
uniq(&main_cmds);

extract_cmds(&common_cmds, common_mask);

All the git commands (including the aliases) are loaded into an array called main_cmds, and then separately it creates a list of common_cmds which are only the most commonly used commands.

I found out that git actually stores a file called command-list.txt with the list of all commands and one or more category assigned to each one. During build time the file is converted into command-list.h by generate-cmdlist.sh, and this is how the extract_cmds() knows which ones are the most common commands (it filters by common_mask which contains the most common categories).

/* This abuses cmdname->len for levenshtein distance */
for (i = 0, n = 0; i < main_cmds.cnt; i++) {
        int cmp = 0; /* avoid compiler stupidity */
        const char *candidate = main_cmds.names[i]->name;

        /*
         * An exact match means we have the command, but
         * for some reason exec'ing it gave us ENOENT; probably
         * it's a bad interpreter in the #! line.
         */
        if (!strcmp(candidate, cmd))
                die(_(bad_interpreter_advice), cmd, cmd);

        /* Does the candidate appear in common_cmds list? */
        while (common_cmds[n].name &&
               (cmp = strcmp(common_cmds[n].name, candidate)) < 0)
                n++;
        if (common_cmds[n].name && !cmp) {
                /* Yes, this is one of the common commands */
                n++; /* use the entry from common_cmds[] */
                if (starts_with(candidate, cmd)) {
                        /* Give prefix match a very good score */
                        main_cmds.names[i]->len = 0;
                        continue;
                }
        }

        main_cmds.names[i]->len =
                levenshtein(cmd, candidate, 0, 2, 1, 3) + 1;
}
FREE_AND_NULL(common_cmds);

QSORT(main_cmds.names, main_cmds.cnt, levenshtein_compare);

Once the arrays are generated it loops through all the commands and calls levenshtein() to calculate the distance between the candidate and the cmd that was used, with the weights as explained above.

Additionally they also checks if the command provided has the same prefix of one of the common commands, and in that case it is going to set the distance to 0 because it's very likely that that is the command you wanted to type.

/* skip and count prefix matches */
for (n = 0; n < main_cmds.cnt && !main_cmds.names[n]->len; n++)
        ; /* still counting */

if (main_cmds.cnt <= n) {
        /* prefix matches with everything? that is too ambiguous */
        best_similarity = SIMILARITY_FLOOR + 1;
} else {
        /* count all the most similar ones */
        for (best_similarity = main_cmds.names[n++]->len;
             (n < main_cmds.cnt &&
              best_similarity == main_cmds.names[n]->len);
             n++)
                ; /* still counting */
}

/* ... other code ... */

fprintf_ln(stderr, _("git: '%s' is not a git command. See 'git --help'."), cmd);  

if (SIMILAR_ENOUGH(best_similarity)) {
        fprintf_ln(stderr,
                   Q_("\nThe most similar command is",
                      "\nThe most similar commands are",
                   n));

        for (i = 0; i < n; i++)
                fprintf(stderr, "\t%s\n", main_cmds.names[i]->name);
}

Once the distance calculation is done for all the commands, it is going to skip all the ones that has a prefix and then count all the most similar ones (the ones that have the same sortest distance) and finally the ones that are "similar enough" (which is calculated as having a distance less than 7, which is an "empirically derived magic number" (cit.)) are going to be printed out as options.

It is a very clean algorithm, and it can be used in many other ways, so it's good to keep it as reference for the future.

One last important thing to mention is that there is also the autocorrect parameter in git, which has various options that you can specify.

If git detects typos and can identify exactly one valid command
similar to the error, git will try to suggest the correct command or
even run the suggestion automatically. Possible config values are:

- 0 (default): show the suggested command.
- positive number: run the suggested command after specified
  deciseconds (0.1 sec).
- "immediate": run the suggested command immediately.
- "prompt": show the suggestion and prompt for confirmation to run the
  command.
- "never": don’t run or show any suggested command.

I didn't know about this option util I saw the code for it. I personally don't use it but it can be useful for some.