
l0-Norm, l1-Norm, l2-Norm, … , l-infinity Norm

Posted on: May 12, 2012 at 07:26 PM

2020 remark: This article was first published in 2012 under the name l0-Norm, l1-Norm, l2-Norm, … , l-infinity Norm. It was - and still is - the most popular piece of writing in my life, with a few hundred visits every day. It was largely the reason I could not migrate my blog for so many years.

I’ve been working on things related to norms a lot lately, so it is time to talk about them. In this post we are going to discuss a whole family of norms.

What is a norm?

Mathematically, a norm is a measure of the total size or length of a vector in a vector space (or of a matrix). For simplicity, we can say that the higher the norm, the bigger the (values in the) vector or matrix. Norms come in many forms and under many names, including popular ones such as Euclidean distance and Mean-Squared Error.

Most of the time you will see a norm appear in an expression like this:

$\left \lVert x \right \rVert$, where $x$ can be a vector or a matrix.

For example, the Euclidean norm of the vector $a = \begin{bmatrix} 3 \\ -2 \\ 1 \end{bmatrix}$ is $\left \lVert a \right \rVert_2 = \sqrt{3^2 + (-2)^2 + 1^2} = 3.742$, which is the size of the vector $a$.
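
As a quick numerical check of this example (a minimal sketch, assuming NumPy is installed):

```python
import numpy as np

a = np.array([3, -2, 1])

# Euclidean (l2) norm computed directly from the definition ...
norm_from_definition = np.sqrt(np.sum(a ** 2))

# ... and via NumPy's built-in helper; both give about 3.742
norm_builtin = np.linalg.norm(a, ord=2)

print(norm_from_definition, norm_builtin)
```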

The above example shows how to compute the Euclidean norm, formally called the $l_2$-norm. There are many other types of norm beyond the scope of this post; in fact, for every single real number there is a norm corresponding to it (notice the emphasised words real number, meaning it is not limited to integers only).

Formally, the $l_p$-norm of $x$ is defined as:

$\left \lVert x \right \rVert_p = \sqrt[p]{ \sum_i \vert x_i \vert^p }$ where $p \in \mathbb{R}$.

That’s it! The $p$-th root of the sum of the absolute values of all elements raised to the $p$-th power is what we call a norm.
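
To make the definition concrete, here is a tiny hand-rolled helper (the name `lp_norm` is just for illustration, assuming NumPy) that follows the formula directly for any $p > 0$:

```python
import numpy as np

def lp_norm(x, p):
    # p-th root of the sum of |x_i|^p, exactly as in the definition above (p > 0)
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = [3.0, -2.0, 1.0]
print(lp_norm(x, 1))   # 6.0   -> the l1-norm
print(lp_norm(x, 2))   # 3.742 -> the l2-norm
print(lp_norm(x, 10))  # 3.005 -> already close to the largest magnitude
```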

The interesting point is that even though all $l_p$-norms look very similar to each other, their mathematical properties are very different, and thus their applications are dramatically different too. We are now going to look into some of these norms in detail.

l0-norm

The first norm we are going to discuss is the $l_0$-norm. By definition, the $l_0$-norm of $x$ is

$\left \lVert x \right \rVert_0 = \sqrt[0]{ \sum_i \vert x_i \vert^0 }$.

Strictly speaking, the $l_0$-norm is not actually a norm. It is a cardinality function whose definition takes the form of an $l_p$-norm, though many people call it a norm. It is a bit tricky to work with because of the presence of a zeroth power and a zeroth root. Obviously any non-zero $x_i$ becomes one, but the problems with defining the zeroth power, and especially the zeroth root, mess things up here. So in practice, most mathematicians and engineers use this definition of the $l_0$-norm instead:

$\left \lVert x \right \rVert_0 = \#(i \mid x_i \neq 0)$

that is, the total number of non-zero elements in a vector.
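
In code this is simply a count of the non-zero entries; a minimal sketch with NumPy (note that `np.linalg.norm` with `ord=0` returns the same count for vectors):

```python
import numpy as np

x = np.array([0.0, 5.0, 0.0, -2.0, 0.0])

# "l0-norm": count the non-zero entries
l0 = np.count_nonzero(x)

# NumPy exposes the same count through norm() with ord=0
l0_alt = np.linalg.norm(x, ord=0)

print(l0, l0_alt)  # 2 2.0
```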

Because it counts the number of non-zero elements, there are many applications that use the $l_0$-norm. Lately it has come even more into focus because of the rise of the Compressive Sensing scheme, which tries to find the sparsest solution of an under-determined linear system. The sparsest solution means the solution with the fewest non-zero entries, i.e. the lowest $l_0$-norm. This problem is usually regarded as an optimisation problem over the $l_0$-norm, or $l_0$-optimisation.

l0-optimisation

Many applications, including Compressive Sensing, try to minimise the $l_0$-norm of a vector subject to some constraints, hence the name “$l_0$-minimisation”. A standard minimisation problem is formulated as:

$\min \left \lVert x \right \rVert_0$ subject to $Ax = b$.

However, doing so is not an easy task. $l_0$-minimisation has been proven by computer scientists to be an NP-hard problem, which simply means it is too complex and practically impossible to solve exactly.

In many cases, the $l_0$-minimisation problem is therefore relaxed to a higher-order norm problem such as $l_1$-minimisation or $l_2$-minimisation.

l1-norm

Following the definition of a norm, the $l_1$-norm of $x$ is defined as

$\left \lVert x \right \rVert_1 = \sum_i \vert x_i \vert$.

This norm is quite common among the norm family. It has many names and many forms across various fields; Manhattan norm is its nickname. If the $l_1$-norm is computed for the difference between two vectors or matrices, that is

$SAD(x_1,x_2) = \left \Vert x_1-x_2 \right \Vert_1 = \sum_i \left \vert x_{1_i}-x_{2_i} \right \vert$,

it is called Sum of Absolute Difference (SAD) among computer vision scientists.

In the more general case of signal difference measurement, it may be scaled by the length of the vector:

$MAE(x_1,x_2) = \frac{1}{n} \left \Vert x_1-x_2 \right \Vert_1 = \frac{1}{n} \sum_i \left \vert x_{1_i} - x_{2_i} \right \vert$ where $n$ is the size of $x$,

which is known as Mean-Absolute Error (MAE).
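
Both measures are one line each in code; a small sketch with NumPy and made-up vectors:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.5, 1.0, 3.0, 2.0])

sad = np.sum(np.abs(x1 - x2))  # Sum of Absolute Difference = ||x1 - x2||_1
mae = sad / x1.size            # Mean-Absolute Error = SAD / n

print(sad, mae)  # 3.5 0.875
```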

l2-norm

The most popular norm of all is the $l_2$-norm. It is used in almost every field of engineering and science. Following the basic definition, the $l_2$-norm is defined as

$\left \Vert x \right \Vert_2 = \sqrt{\sum_{i} x_i^2}$.

The $l_2$-norm is well known as the Euclidean norm, which is used as a standard quantity for measuring a vector difference. As with the $l_1$-norm, if the Euclidean norm is computed for a vector difference, it is known as the Euclidean distance:

$\left \Vert x_1-x_2 \right \Vert_2 = \sqrt{\sum_i (x_{1_i}-x_{2_i})^2}$,

or in its squared form, known as a Sum of Squared Difference (SSD) among Computer Vision scientists:

$SSD(x_1,x_2) = \left \Vert x_1-x_2 \right \Vert_2^2 = \sum_i (x_{1_i}-x_{2_i})^2$.

Its most well-known application in the signal processing field is the Mean-Squared Error (MSE) measurement, which is used to compute the similarity, quality, or correlation between two signals. MSE is

$MSE(x_1,x_2) = \frac{1}{n} \left \Vert x_1-x_2 \right \Vert_2^2 = \frac{1}{n} \sum_i (x_{1_i}-x_{2_i})^2$.
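
The same two vectors from the $l_1$ example give the corresponding $l_2$-based measures; again a minimal NumPy sketch:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.5, 1.0, 3.0, 2.0])

dist = np.linalg.norm(x1 - x2)  # Euclidean distance = ||x1 - x2||_2
ssd = np.sum((x1 - x2) ** 2)    # Sum of Squared Difference = ||x1 - x2||_2^2
mse = ssd / x1.size             # Mean-Squared Error = SSD / n

print(dist, ssd, mse)  # 2.291... 5.25 1.3125
```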

As previously discussed in the $l_0$-optimisation section, because of many issues from both the computational and the mathematical point of view, many $l_0$-optimisation problems are relaxed to $l_1$- and $l_2$-optimisation instead. Because of this, we will now discuss the optimisation of $l_2$.

l2-optimisation

As in the $l_0$-optimisation case, the problem of minimising the $l_2$-norm is formulated as

$\min \left \Vert x \right \Vert_2$ subject to $Ax = b$.

Assume that the constraint matrix $A$ has full rank; this problem is then an underdetermined system with infinitely many solutions. The goal in this case is to draw out the best solution, i.e. the one with the lowest $l_2$-norm, from these infinitely many solutions. This could be very tedious work if it had to be computed directly. Luckily, there is a mathematical trick that can help us a lot here.

Using the trick of Lagrange multipliers, we can define a Lagrangian

$\mathfrak{L}(\boldsymbol{x}) = \left \Vert \boldsymbol{x} \right \Vert_2^2 + \lambda^{T}(\boldsymbol{Ax}-\boldsymbol{b})$

where $\lambda$ is the vector of introduced Lagrange multipliers. Take the derivative of this equation, set it equal to zero to find an optimal solution, and get

$\hat{\boldsymbol{x}}_{opt} = -\frac{1}{2} \boldsymbol{A}^{T} \lambda$

Plug this solution into the constraint to get

$\boldsymbol{A}\hat{\boldsymbol{x}}_{opt} = -\frac{1}{2}\boldsymbol{A}\boldsymbol{A}^{T}\lambda = \boldsymbol{b}$

$\lambda = -2(\boldsymbol{A}\boldsymbol{A}^{T})^{-1}\boldsymbol{b}$

and finally

$\hat{\boldsymbol{x}}_{opt} = \boldsymbol{A}^{T}(\boldsymbol{A}\boldsymbol{A}^{T})^{-1}\boldsymbol{b} = \boldsymbol{A}^{+}\boldsymbol{b}$

Using this equation, we can now instantly compute an optimal solution of the $l_2$-optimisation problem. The matrix $\boldsymbol{A}^{+} = \boldsymbol{A}^{T}(\boldsymbol{A}\boldsymbol{A}^{T})^{-1}$ is well known as the Moore-Penrose pseudoinverse, and the problem itself is usually known as the Least Squares problem, Least Squares regression, or Least Squares optimisation.
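
A short sketch of this closed-form solution with NumPy (the matrix `A` and vector `b` below are made-up illustration data, not from the text):

```python
import numpy as np

# An underdetermined system: 2 equations, 4 unknowns (made-up example data)
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 1.0]])
b = np.array([1.0, 2.0])

# Minimum-l2-norm solution via the Moore-Penrose pseudoinverse: x = A^+ b
x_opt = np.linalg.pinv(A) @ b

# The explicit formula A^T (A A^T)^{-1} b from the derivation above
x_explicit = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x_opt, b))       # True: the constraint Ax = b holds
print(np.allclose(x_opt, x_explicit))  # True: both formulas agree
```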

However, even though the solution of the Least Squares method is easy to compute, it is not necessarily the best solution. Because of the smooth nature of the $l_2$-norm itself, it is hard to obtain a single, best solution for the problem.

[Figure: L2 Norm]

In contrast, $l_1$-optimisation can provide a much better result than this solution.

l1-optimisation

As usual, the $l_1$-minimisation problem is formulated as

$\min \left \Vert x \right \Vert_1$ subject to $Ax = b$.

Because the $l_1$-norm is not smooth, unlike the $l_2$-norm, the solution of this problem is much better and more unique than that of the $l_2$-optimisation.

[Figure: L1 Norm]

However, even though the $l_1$-minimisation problem has almost the same form as the $l_2$-minimisation, it is much harder to solve. Because the objective is not a smooth function, the trick we used to solve the $l_2$ problem is no longer valid. The only way left to find its solution is to search for it directly. Searching for the solution means that we have to compute every single possible solution to find the best one from the pool of “infinitely many” possible solutions.

Since there was no easy way to find the solution to this problem analytically, the usefulness of $l_1$-optimisation was very limited for decades. Only recently has the advancement of computers with high computational power allowed us to “sweep” through the solutions. By using helpful algorithms, namely convex optimisation algorithms such as linear programming or non-linear programming, it is now possible to find the best solution to this question. Many applications that rely on $l_1$-optimisation, including Compressive Sensing, are now possible.

There are many toolboxes for $l_1$-optimisation available nowadays. These toolboxes usually use different approaches and/or algorithms to solve the same question. Examples of these toolboxes are l1-magic, SparseLab, and ISAL1.
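
For a bare-bones illustration without any of those toolboxes, the $l_1$ problem can be recast as a linear program by splitting $x$ into non-negative parts ($x = u - v$, with $u, v \geq 0$) and handed to SciPy's `linprog`. This is only a sketch under the assumption that SciPy is available, not the method used by the toolboxes above:

```python
import numpy as np
from scipy.optimize import linprog

# Underdetermined system (made-up example data): 2 equations, 4 unknowns
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 1.0]])
b = np.array([1.0, 2.0])
n = A.shape[1]

# min ||x||_1 s.t. Ax = b, with x = u - v and u, v >= 0,
# becomes: min sum(u) + sum(v) s.t. A u - A v = b
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")

x_l1 = res.x[:n] - res.x[n:]
print(x_l1)                      # tends to have only a few non-zero entries
print(np.allclose(A @ x_l1, b))  # True: the constraint is satisfied
```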

Now that we have discussed several members of the norm family, starting with the $l_0$-norm, $l_1$-norm, and $l_2$-norm, it is time to move on to the next one. As we discussed at the very beginning, there can be an $l$-whatever norm following the same basic definition, so it would take a lot of time to talk about all of them. Fortunately, apart from the $l_0$-, $l_1$-, and $l_2$-norms, the rest are usually uncommon and therefore do not have many interesting things to look at. So we are going to look at the extreme case of the norm, which is the $l_{\infty}$-norm (l-infinity norm).

l-infinity norm

As always, the definition of the $l_{\infty}$-norm is

$\left \Vert x \right \Vert_{\infty} = \sqrt[\infty]{\sum_i \vert x_i \vert^{\infty}}$.

Now this definition looks tricky again, but actually it is quite straightforward. Consider the vector $\boldsymbol{x}$, and let’s say $x_j$ is the entry with the highest magnitude in $\boldsymbol{x}$. By the property of infinity itself, we can say that

$\vert x_j \vert^{\infty} \gg \vert x_i \vert^{\infty} \quad \forall i \neq j$

then

$\sum_i \vert x_i \vert^{\infty} = \vert x_j \vert^{\infty}$

then

$\left \Vert x \right \Vert_{\infty} = \sqrt[\infty]{\sum_i \vert x_i \vert^{\infty}} = \sqrt[\infty]{\vert x_j \vert^{\infty}} = \left \vert x_j \right \vert$

Now we can simply say that the $l_{\infty}$-norm is

$\left \Vert x \right \Vert_{\infty} = \max_i(\left \vert x_i \right \vert)$

that is, the maximum magnitude among the entries of that vector. That surely demystifies the meaning of the $l_{\infty}$-norm.
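
In code it is literally the maximum absolute value; a quick NumPy check:

```python
import numpy as np

x = np.array([3.0, -7.0, 2.5])

linf = np.max(np.abs(x))                  # maximum magnitude = 7.0
linf_alt = np.linalg.norm(x, ord=np.inf)  # same value via NumPy's norm()

print(linf, linf_alt)  # 7.0 7.0
```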

Now that we have discussed the whole family of norms from $l_0$ to $l_{\infty}$, I hope that this discussion helps in understanding the meaning of a norm, its mathematical properties, and its real-world implications.


Edit (15/02/15): Corrected inaccuracies in the content.