Why AI Sometimes Takes the Scenic Route with Manhattan Distance

Imagine you’re in a city like Manhattan, where the streets form a perfect grid. To get from your apartment to a coffee shop, you can’t just cut straight through the buildings. You have to walk along the streets, block by block, making a series of right-angle turns. The total distance you walk—the number of blocks east-west plus the number of blocks north-south—is your Manhattan distance. In the world of artificial intelligence, this simple, real-world idea of “taxicab geometry” provides a powerful alternative to the straight-line Euclidean distance, offering unique advantages in how machines perceive similarity, handle messy data, and even learn which features are most important.

This metric, also known as the L1 norm or city block distance, measures distance by summing the absolute differences of the coordinates of two data points. While Euclidean distance calculates the shortest path “as the crow flies,” Manhattan distance calculates the path a taxi would have to take. This seemingly small distinction has profound implications, making it the preferred tool for a wide range of AI tasks, from guiding robots through warehouses to helping a model decide which words in a sentence are the most meaningful.

From City Blocks to Abstract Spaces

The concept was formalized by the German mathematician Hermann Minkowski, a contemporary of Einstein who did pioneering work on the geometry of numbers in the late 19th and early 20th centuries. Minkowski developed a generalized framework for measuring distance, now called the Minkowski distance, which can be seen as a parent category that includes both Euclidean and Manhattan distance as special cases (DataCamp, 2024).

The formula for Manhattan distance is a straightforward summation. For two points, A and B, in an n-dimensional space, the distance is:

d = |a₁ − b₁| + |a₂ − b₂| + ... + |aₙ − bₙ|

Unlike the Euclidean formula with its squares and square roots, the Manhattan distance formula is built on simple addition and absolute values. This lack of squaring is the key to its most important properties. Because the differences are never squared, a large difference in a single dimension (an outlier) contributes only linearly to the total distance rather than quadratically. This makes the metric inherently more robust to outliers, a common problem in real-world datasets (Wang, Nie, & Huang, 2014).
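
The contrast is easy to see in code. Below is a minimal pure-Python comparison of the two metrics on a point with one outlying coordinate; the vectors are made up for illustration:

```python
import math

def manhattan(a, b):
    # L1: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2: square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

origin  = [0, 0, 0]
clean   = [1, 1, 1]
outlier = [1, 1, 10]   # one dimension contains an outlier

print(manhattan(origin, clean), euclidean(origin, clean))      # 3  vs ~1.73
print(manhattan(origin, outlier), euclidean(origin, outlier))  # 12 vs ~10.10
```

The outlying coordinate adds exactly 10 to the Manhattan distance, while under Euclidean distance its squared value (100) swamps the contributions of every other dimension.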

This formula is also referred to as the L1 norm. In mathematics, a “norm” is a function that assigns a non-negative length or size to each vector in a vector space, with only the zero vector having a length of zero. The L1 norm simply sums the absolute values of a vector’s components, just as the Manhattan distance sums the absolute differences between the components of two vectors. This connection to the L1 norm is what links Manhattan distance to one of the most powerful techniques in modern machine learning: sparsity and feature selection.

A comparison of common distance and similarity metrics used in AI, highlighting their core ideas, key strengths, and common geometric analogies.

| Metric | Core Idea | Key Strength | Geometric Analogy |
|---|---|---|---|
| Manhattan Distance (L1) | Sum of absolute differences | Robust to outliers, promotes sparsity | Walking along city blocks |
| Euclidean Distance (L2) | Square root of sum of squared differences | Intuitive, geometrically smooth | Flying in a straight line |
| Cosine Similarity | Angle between two vectors | Ignores magnitude, good for text | Comparing the direction of two arrows |

Achieving Sparsity with L1 Regularization and LASSO

Perhaps the most significant application of the L1 norm in AI is in a process called regularization, specifically L1 regularization. In machine learning, a model “learns” by adjusting its internal parameters, or weights, to minimize the difference between its predictions and the actual outcomes in the training data. However, a model with too many features or too much complexity can start to “memorize” the training data, including its noise and quirks. This is called overfitting, and it results in a model that performs poorly on new, unseen data.

Regularization is a technique used to prevent overfitting by adding a penalty to the model’s loss function for having large weights. L1 regularization adds a penalty proportional to the L1 norm of the weight vector—the sum of the absolute values of all the weights. This has a fascinating and useful side effect.

Because of the geometry of the L1 norm (its “diamond” shape compared to the L2 norm’s “circle”), as the model optimizes its weights to minimize the loss function plus the L1 penalty, it finds that the most efficient way to reduce the penalty is to shrink some of the weights to be exactly zero. In effect, the model learns that some features are not useful for making accurate predictions and completely eliminates them from the equation. This process of automatically driving irrelevant feature weights to zero is known as sparsity (Tibshirani, 1996).

This is the principle behind LASSO (Least Absolute Shrinkage and Selection Operator), a regression technique introduced by Robert Tibshirani in 1996 that uses L1 regularization to perform automatic feature selection. For a data scientist working with a dataset that has thousands or even millions of features—common in fields like genomics or finance—LASSO is an indispensable tool. It can sift through a vast number of potentially irrelevant inputs and identify the small subset that truly matters, leading to simpler, more interpretable, and often more accurate models.
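
A small sketch shows the mechanism at the level of a single weight. The soft-thresholding operator below is the proximal step that coordinate-descent LASSO solvers apply to each coefficient; the weights and penalty strength here are hypothetical values chosen for illustration:

```python
def soft_threshold(w, lam):
    # Proximal operator of the L1 penalty: shrinks w toward zero by lam,
    # and sets it to exactly zero whenever |w| <= lam.
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [2.5, -0.3, 0.05, -1.8, 0.0]
sparse = [soft_threshold(w, lam=0.5) for w in weights]
print(sparse)  # [2.0, 0.0, 0.0, -1.3, 0.0]
```

Small weights are zeroed outright rather than merely shrunk, which is exactly the sparsity behavior that distinguishes L1 from L2 regularization: an L2 penalty would rescale every weight but leave all of them nonzero.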

The geometric intuition for why this works is compelling. The L1 penalty creates a constraint region shaped like a diamond (in two dimensions) or a hyper-diamond in higher dimensions. The model seeks the point where the elliptical contours of the least-squares error function first touch this diamond. Because the diamond has sharp corners that lie on the axes, it is highly probable that the contact point will be at one of these corners. A corner, by definition, is a point where one of the feature weights is exactly zero. In contrast, the L2 norm (used in Ridge regression) creates a circular or spherical constraint region. Its smooth boundary means the contact point is unlikely to be exactly on an axis, so weights get close to zero but rarely become exactly zero. This seemingly small difference in geometric shape is what gives the L1 norm its powerful feature-selection capabilities (Ng, 2004). The ability to produce sparse solutions is a direct and profound consequence of the underlying mathematics of the Manhattan distance and the L1 norm.

Navigating Grids and Game Worlds

Manhattan distance finds its most intuitive application in problems where movement is constrained to a grid. This makes it a cornerstone of pathfinding algorithms, particularly the celebrated A* (A-star) search algorithm, which is widely used in robotics, logistics, and video games to find the shortest path between two points (Amit’s A* Pages, n.d.).

A* works by intelligently exploring a graph of possible paths. For every potential next step (a node on the grid), it calculates a score, f(n), which is the sum of two other scores:

  • g(n): The actual cost of the path from the starting point to the current node n.
  • h(n): The heuristic estimate of the cost from the current node n to the destination.

The algorithm always chooses to explore the node with the lowest f(n) score first. This balance between the known past cost (g(n)) and the estimated future cost (h(n)) is what makes A* so effective. To guarantee that A* finds the absolute shortest path, the heuristic h(n) must be “admissible,” meaning it can never overestimate the true cost.

On a grid where movement is restricted to horizontal and vertical steps, the Manhattan distance is a perfect admissible heuristic. It’s computationally cheap to calculate and it accurately represents the minimum number of moves required to reach the goal, assuming no obstacles. A robot navigating a warehouse, a character moving across a chessboard-like game map, or a GPS app routing a car through a city’s street grid are all operating in worlds where Manhattan distance is a more faithful representation of reality than Euclidean distance.

Consider a simple 5x5 grid. A robot starts at S (0,4) and needs to reach goal G (4,0). There is an obstacle blocking the direct path. The A* algorithm, using Manhattan distance as its heuristic, will calculate the f(n) score for each reachable square. For the square at (1,4), the cost would be:

  • g(n) = 1 (one move from the start)
  • h(n) = |4-1| + |0-4| = 3 + 4 = 7 (Manhattan distance to the goal)
  • f(n) = 1 + 7 = 8

By always expanding the node with the lowest f(n) score, the algorithm efficiently explores the grid, avoiding the obstacle and finding the optimal path without wasting time on less promising routes. The Manhattan distance heuristic provides a powerful and accurate guide, ensuring the search is directed squarely toward the goal (MIT OpenCourseWare, 2010). This prevents the algorithm from getting sidetracked by paths that, while appearing to move in the right general direction, are actually taking it further from the most efficient route.
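
The worked example above can be sketched as a compact A* search. This is a minimal illustration rather than a production pathfinder, and the obstacle position is a hypothetical choice:

```python
import heapq

def manhattan(a, b):
    # Admissible heuristic on a 4-connected grid
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(start, goal, obstacles, size=5):
    # Frontier ordered by f(n) = g(n) + h(n)
    frontier = [(manhattan(start, goal), 0, start, [start])]
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in obstacles:
                heapq.heappush(frontier,
                               (g + 1 + manhattan((nx, ny), goal),
                                g + 1, (nx, ny), path + [(nx, ny)]))
    return None

# The 5x5 example: S at (0, 4), G at (4, 0), with a hypothetical obstacle.
path = astar((0, 4), (4, 0), obstacles={(2, 2)})
print(len(path) - 1)  # 8 moves: the search routes around the obstacle
```

Because the heuristic never overestimates, the first time the goal is popped from the priority queue the path is guaranteed optimal.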

Applications in NLP and Image Processing

While its most obvious use is in grid-like spaces, the properties of Manhattan distance also make it valuable in the high-dimensional, abstract spaces of Natural Language Processing (NLP) and computer vision.

In NLP, particularly before the dominance of dense embeddings, Manhattan distance was a valuable tool for working with sparse vector representations like TF-IDF (Term Frequency-Inverse Document Frequency). In a TF-IDF matrix, most documents have a score of zero for the vast majority of words in the vocabulary. When comparing two documents, Euclidean distance would square the differences, potentially amplifying the effect of a few high-frequency words. Manhattan distance, by simply summing the absolute differences, provides a more linear and sometimes more intuitive measure of dissimilarity, especially when the presence or absence of many different words is more important than the high frequency of a few (The Programming Historian, 2020). For example, in document clustering for topic modeling, using Manhattan distance could lead to more robust clusters by being less sensitive to single words that appear with unusually high frequency in one document.
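
A toy example makes this concrete. The two hypothetical term-count vectors below differ from a reference document by the same Manhattan distance, but one differs in many words by a little and the other in a single word by a lot:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_sq(a, b):
    # Squared Euclidean distance (the square root does not change rankings)
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical term counts over a shared 10-word vocabulary.
doc_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # reference document
doc_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # differs in nine words by 1
doc_c = [1, 1, 1, 1, 1, 1, 1, 1, 1, 9]   # differs in one word by 9

print(manhattan(doc_a, doc_b), manhattan(doc_a, doc_c))        # 9 vs 9
print(euclidean_sq(doc_a, doc_b), euclidean_sq(doc_a, doc_c))  # 9 vs 81
```

Under L1 the two kinds of dissimilarity weigh equally; squaring lets the single high-frequency word dominate, which is exactly the sensitivity the paragraph above describes.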

In computer vision, the L1 distance serves as a fast and surprisingly effective metric for certain tasks. When comparing two images, you can treat them as flat vectors of pixel values and calculate the sum of the absolute differences for each corresponding pixel. This is sometimes called the Sum of Absolute Differences (SAD). While this simple approach is highly sensitive to shifts in alignment, lighting, and rotation, its computational efficiency makes it ideal for real-time applications like video compression. Motion estimation algorithms in video codecs often use SAD on small blocks of pixels to find where a block has moved between frames. Because it avoids multiplications and square roots, it is extremely fast to compute on hardware. Furthermore, its inherent robustness to outliers means that a few “sparkling” pixels from sensor noise or a reflection will not disproportionately affect the overall difference score, making it a reliable choice for quick image comparisons under controlled conditions.
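
A minimal SAD sketch, using hypothetical 2x2 grayscale blocks with 0-255 intensities:

```python
def sad(block_a, block_b):
    # Sum of Absolute Differences between two equally sized pixel blocks
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

ref     = [[10, 20], [30, 40]]
shifted = [[12, 19], [33, 40]]    # small per-pixel changes
noisy   = [[10, 20], [30, 255]]   # one "sparkling" pixel

print(sad(ref, shifted))  # 2 + 1 + 3 + 0 = 6
print(sad(ref, noisy))    # 215: large, but it grows only linearly
```

In a real codec the same computation runs over every candidate block position in the previous frame, and the position with the lowest SAD is taken as the motion vector.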

Choosing Your Path Wisely

Manhattan distance is not a universal replacement for Euclidean distance, but rather a specialized tool with a distinct set of strengths. Its main advantages—robustness to outliers and its connection to sparsity—make it the clear choice in specific scenarios.

  • When your data has significant outliers: If your dataset is noisy or contains extreme values that you don’t want to dominate your distance calculations, Manhattan distance is often a safer choice than Euclidean distance.
  • When you want to induce sparsity: If your goal is feature selection and building a simpler, more interpretable model, the L1 norm used in LASSO is the tool for the job.
  • When your features are on different scales: Feature scaling remains a best practice, but Manhattan distance tends to be less sensitive than Euclidean distance to unscaled features, because no single large-range feature gets squared into dominance.
  • When your problem space is inherently grid-like: For pathfinding in games, robotics, or urban planning, Manhattan distance is the most natural and efficient heuristic.

However, it also has limitations. The primary drawback is that it can be less intuitive geometrically. In unconstrained space, the shortest path between two points is a straight line, and Euclidean distance captures this fundamental geometric truth. In many low-dimensional, clean datasets where features are well-scaled and outliers are not a major concern, Euclidean distance will perform just as well or better, and its smooth, differentiable nature makes it more compatible with certain optimization algorithms like gradient descent.

Furthermore, while Manhattan distance is often cited as being more robust than Euclidean distance in high-dimensional spaces, this is a nuanced issue. All distance metrics, including Manhattan, can suffer from the curse of dimensionality. As the number of dimensions grows, the volume of the space increases exponentially, and data points tend to become sparse and roughly equidistant from each other. The concept of a “nearest” neighbor starts to lose its meaning. While some studies suggest that the L1 norm’s lower sensitivity to the sum of many small, irrelevant feature differences makes it marginally more robust in certain high-dimensional scenarios, it is not a complete solution to this fundamental problem. In the extremely high-dimensional spaces of modern deep learning embeddings (often with thousands of dimensions), neither metric is typically the first choice. In these cases, cosine similarity, which measures the angle between vectors rather than their distance, is often the most effective metric, as it focuses purely on the direction or orientation of the data points, ignoring magnitude entirely (Weaviate, 2023).
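
For illustration, a minimal cosine similarity function shows how magnitude drops out entirely; the vectors are made up:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different magnitudes: any distance metric separates
# these points, but cosine similarity treats them as identical.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
print(cosine_similarity(u, v))  # 1.0
```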

Ultimately, the choice of distance metric is a critical modeling decision that depends on the structure of your data and the goals of your task. Manhattan distance, with its city-block intuition and its deep connection to sparsity and robustness, is an essential and powerful tool in the modern AI practitioner’s toolkit.