\(\newcommand{\atantwo}{\text{atan2}} \) \(\newcommand{\R}{\mathbb{R}} \) \(\newcommand{\C}{\mathbb{C}} \) \(\newcommand{\Q}{\mathbb{Q}} \) \(\newcommand{\N}{\mathbb{N}} \) \(\newcommand{\vec}[1]{\boldsymbol{\mathbf{#1}}} \) \(\newcommand{\ver}[1]{\boldsymbol{\mathbf{\hat #1}}} \) \(\newcommand{\tensalg}[1]{\mathcal{T}(#1)} \)

Disclaimer: This is a really old post I reposted here because I think it might be interesting. I would definitely approach some things differently today! Also, I copied this here from one of my Quora answers, and Quora handles LaTeX differently, so I apologize for any typos!

In this post, I’ll tell the story of the “mathematical development landmarks”: each of the following represents a new, key concept along the journey of discovery, a new tool added to our abstract reasoning kit. I find it fascinating how we are sequentially enlightened while learning mathematics.

I’m only superficially acquainted with many of the topics and ideas mentioned, but as a beginner self-learner, I think I can provide some insight as to why these concepts feel mysterious and open up doors. So let’s get started!

Warning: This is going to be a long ride. You might wanna get some coffee.


1- The Formal Definition of a Limit

It has some cadence and exquisite intangibility to it. When I first read it, I had this uncomfortable feeling of not being able to find the limit \(L\) and couldn’t quite grasp how exactly this represented the concept I had in mind. The definition is:

Given \(f:A\subseteq \R \to \R\), and an \(a\in \R\), define:

$$ \lim_{x\to a} f(x) = L \iff \forall \epsilon > 0, \exists \delta > 0: 0 < |x-a|< \delta \Rightarrow |f(x)-L| < \epsilon $$

Ugh… OK I guess. Where is \(L\)? How do I find it?

And this is a conceptual leap from high school math: the definition is non-constructive. It captures the essence of a concept, but is not computational. Specifically, it captures the idea that the limit of a function as \(x\to a\) is the point to which it gets arbitrarily close. And this all has a nice graphical representation:

There it is. Choose any positive \(\epsilon\). I’ll find you a \(\delta\) such that if you restrict the input \(x\) of the function to be between \(a-\delta\) and \(a+\delta\), the output will always fall between \(L-\epsilon\) and \(L+\epsilon\). Even if \(\epsilon\) is extremely small: keep narrowing the range arbitrarily and if I always can find you a \(\delta\), then \(L\) is the limit. When all this starts sounding natural, you’ve advanced one step!
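The definition is non-constructive, but nothing stops us from probing the quantifier game numerically. Here is a minimal illustrative sketch (not part of the usual treatment) for \(\lim_{x\to 2} x^2 = 4\), using the standard choice \(\delta = \min(1, \epsilon/5)\): on \(|x-2|<1\) we have \(|x+2|<5\), so \(|x^2-4| = |x-2||x+2| < 5\delta \leq \epsilon\).

```python
# Numerical sanity check of the epsilon-delta definition for lim_{x->2} x^2 = 4.
# On |x - 2| < 1 we have |x + 2| < 5, so delta = min(1, eps/5) works.
def delta_for(eps):
    return min(1.0, eps / 5.0)

def check(eps, samples=10_000):
    """Sample x in (2 - delta, 2 + delta), excluding x = 2, and verify |x^2 - 4| < eps."""
    d = delta_for(eps)
    for i in range(1, samples):
        x = 2 - d + (2 * d) * i / samples
        if x == 2:
            continue
        if abs(x * x - 4) >= eps:
            return False
    return True

for eps in (1.0, 0.1, 1e-3, 1e-6):
    assert check(eps)
print("delta = min(1, eps/5) works for every tested eps")
```

Sampling proves nothing, of course; the point is just to make the game tangible: for each \(\epsilon\) you hand me, I exhibit a concrete \(\delta\).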

Hey, not so quickly! You still want to compute limits, don’t you? Yes, you do. How do you do that? You prove by definition some limits of basic functions which should be intuitively true. Else, your definition failed to capture your idea of what a limit should be. I’ll come back to the nature of definitions later, so keep this in mind!

Anyway, now you have to write things like this: \(f:A\subseteq \R \to \R\) is continuous at \(a\in A\) iff:

  1. \(a \in A\)
  2. The limit of \(f\) as \(x\to a\) exists. Namely, \(\exists L \in \R: \displaystyle \lim_{x \to a} f(x) = L \).
  3. \(L = f(a)\)

It’s downright disrespectful when people throw around things like “a function is continuous at a point if it is equal to the limit at that point”. No, you have to check existence before even daring to write something down. We will also come back to this later.

2- Algebraic Structures

Typical high school problem:

Solve \(x + 2 = 2x - 1\)

To someone who likes mathematics in all its magnificence, this looks absolutely horrible. I had one teacher when I was about 12 who very wisely said “things don’t move around in equations” and taught us what stuff actually meant. People complained and he got fired. But he was right, teaching children that “the \(x\) goes to the other side, but changes sign” should be a criminal offense.

Later on, you might get something like:

Find all \(x \in \R\) such that \(x^2-10x+1=0\)

This is way better, at least it tells us that the equation holds for some \(x\) we are supposed to find, and is not merely a statement like the first one. But why would the methods to solve them work?

The answer to that inevitably falls on algebraic structures. The thing about answering this question is that the question ceases to be relevant the second you introduce the first preliminary concepts. And it’s that the idea of sets endowed with operators with certain properties is damn powerful. Here, I’ll focus on only two of them: fields and vector spaces.

A field \((F,+,\cdot)\) is a set \(F\) with two operations — addition \(+\) and multiplication \(\cdot\) — such that the field axioms hold. This is one of the first encounters with the proof of known properties in a formal setting. So what things are fields? Good question:

  • \((\{0,1\},+,\cdot)\). This leads to what’s commonly known as Boolean Algebra: it is just algebra over a different field. Here, \(+\) is the xor operator and \(\cdot\) is the and operator. This is the smallest field we can construct.
  • Similarly, we can construct a finite field \(\mathbb{F}_q\) for every prime power \(q\). This is done by means of modular arithmetic. Here is \(\mathbb{F}_4\), as an example (note that \(a\) and \(b\) are now not variables or things you are supposed to find, but rather plain names for two elements of \(\mathbb{F}_4\): I could just as well have called them 2 and 3):

  • The Rational Numbers \(\Q\) are also a field.
  • And naturally, the Real Numbers \(\R\) are a field.
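Since \(\mathbb{F}_2\) is finite, the field axioms can even be checked exhaustively by machine. A small illustrative sketch, using Python’s `^` for xor and `&` for and:

```python
# A brute-force check that ({0, 1}, xor, and) satisfies the field axioms.
F2 = (0, 1)
add = lambda a, b: a ^ b   # xor plays the role of +
mul = lambda a, b: a & b   # and plays the role of *

for a in F2:
    for b in F2:
        assert add(a, b) == add(b, a) and mul(a, b) == mul(b, a)  # commutativity
        for c in F2:
            assert add(add(a, b), c) == add(a, add(b, c))          # associativity of +
            assert mul(mul(a, b), c) == mul(a, mul(b, c))          # associativity of *
            assert mul(a, add(b, c)) == add(mul(a, b), mul(a, c))  # distributivity
    assert add(a, 0) == a and mul(a, 1) == a                       # identities
    assert add(a, a) == 0                  # every element is its own additive inverse
assert mul(1, 1) == 1                      # 1 is its own multiplicative inverse
print("({0,1}, xor, and) passes the field axioms")
```

This brute-force strategy obviously dies for infinite fields like \(\Q\) or \(\R\); there, only proofs will do.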

The power here lies in the fact that once we prove all of these are fields, we only have to prove a lot of stuff once. What stuff?

  • There exists only one additive identity and only one multiplicative identity.
  • We can prove that if we allow the additive identity to have a multiplicative inverse, no field can exist. So, if something is a field, we know its zero does not have a multiplicative inverse. Namely, you cannot divide by zero if nice algebraic properties are to be retained.
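The uniqueness of identities, for instance, is a one-line consequence of the axioms; a sketch: suppose \(e\) and \(e'\) are both additive identities of a field \(F\). Then

```latex
% Apply the identity property of e' first, then that of e:
e \;=\; e + e' \;=\; e'
% so the two coincide. The multiplicative case is analogous.
```

so the additive identity is unique, and the same argument with \(\cdot\) in place of \(+\) handles the multiplicative identity.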

This is all cool. But now what? Is there something we can use fields for?

A vector space \((\mathbb{V}, +, \mathbb{F},\cdot)\) is a set \(\mathbb{V}\) with two operations — vector addition \(+\) and scalar multiplication \(\cdot\) — such that the vector space axioms hold.

This definition gives rise to the breathtakingly powerful machinery of linear algebra. It turns out many things can be thought of as vector spaces and linear transformations between them. Examples (with usual operations, unless otherwise specified):

  • \(\R^n\) and \(\C^n\) are vector spaces over \(\R\) and \(\C\) respectively.
  • \(\R\) is a vector space over \(\Q\).
  • The set of solutions to a linear differential equation is a vector space.
  • Quantum states form a vector space over \(\C\), which is what gives rise to the idea of superposition of states.
  • The derivative is a linear transformation from \(C^k[a,b]\) to \(C^{k-1}[a,b]\), which are two vector spaces. \(C^k[a,b]\) represents the set of all real-valued functions on \([a,b]\subseteq \R\) that are \(k\) times differentiable with a continuous \(k\)-th derivative.
  • Planes in \(\R^3\) going through the origin are vector spaces over \(\R\).
  • The set of all Real sequences \(a_n\) such that \(a_{n+1} - a_n \to 0\) is a vector space over \(\R\).
  • Polynomials with Complex coefficients of degree of at most \(n\) form a vector space over \(\C\).
  • The set of distributions forms a vector space (what these are is left for a future post).
  • In many encryption algorithms, the set of encrypted messages form a vector space.

There are a whole lot more vector spaces. We have now unified the study of the most varied bodies of data under one single framework. How many pieces of information do I need to specify a datapoint? [Answer: dimension] If I combine my data this way, will I get another valid piece of information? [Answer: closure and subspaces] Will this transformation of data which makes my analysis easier change the nature of the information contained within it? [Answer: isomorphisms] Those are all questions coming up frequently in data analysis. We are usually concerned with how the different structures map to each other. And here the maps are linear transformations.

Recall linear transformations are maps from a vector space \(V\) to another vector space \(W\) (both over a field \(F\)), such that for any \(\alpha , \beta \in F\) and for any \(v,u \in V\):

$$ T(\alpha v + \beta u) = \alpha T(v) + \beta T(u) $$

Now, we can see these things are maps. We saw the derivative was a map between two function spaces. Would that mean its behavior is akin to the behavior of a function? Could it make sense to, for example, ask if the differential operator is continuous?
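One concrete way to feel the linearity of the derivative is through a discrete stand-in: the forward-difference operator on a fixed grid, which is exactly linear. A small sketch (the particular grid and functions are illustrative assumptions):

```python
# The forward-difference operator on a fixed grid is an exactly linear map,
# mirroring the linearity of d/dx on C^1[a, b].
def forward_diff(values, h):
    """Discrete stand-in for the derivative: (f(x+h) - f(x)) / h."""
    return [(values[i + 1] - values[i]) / h for i in range(len(values) - 1)]

h = 0.01
xs = [i * h for i in range(101)]          # grid on [0, 1]
f = [x ** 2 for x in xs]
g = [3 * x for x in xs]
alpha, beta = 2.0, -5.0

combo = [alpha * fi + beta * gi for fi, gi in zip(f, g)]
lhs = forward_diff(combo, h)              # D(alpha*f + beta*g)
rhs = [alpha * a + beta * b
       for a, b in zip(forward_diff(f, h), forward_diff(g, h))]

# Linearity holds exactly, up to floating point rounding.
assert all(abs(l - r) < 1e-9 for l, r in zip(lhs, rhs))
```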

3- Topology, Completeness and Functional Analysis

The beauty about beginning to study topology resides in feeling completely naked again. You are back dealing with sets and only sets and having to build mathematical machinery from the ground. It is like being forced to go back to the definition of a limit: intuition and definitions become main characters. Back to sketching things like this:

And suddenly, after your drawing skills significantly improve with practice, you get it. You are ready to handle the set-theoretic description of weird geometrical properties. Here is continuity expressed in a rather strange way:

A map between topological spaces \(f:X\to Y\) is continuous iff the pre-image of every open set \(V \subseteq Y\) is open in \(X\).

Or pointwise:

A map between topological spaces \(f:X\to Y\) is continuous at \(x\in X\) iff for every neighborhood \(V\) of \(f(x)\), there exists a neighborhood \(U\) of \(x\) such that \(f(U) \subseteq V\).

Try to draw pictures and see why this is the case! Intuitively, an open set does not contain its boundary, and a neighborhood of a point \(x\) is well… just the set of points neighboring \(x\) (formally, it is a set containing \(x\) which includes an open set containing \(x\)).

Then you throw in metric spaces, and realize that the notion of distance was never assumed: you have to define it and there might be more than one way to do so. By the way, a metric on a set \(M\) is just a function \(\varphi : M \times M \to \R\) which measures the distance between two elements of \(M\) and has the properties you would expect.

Having new, generalized spaces known as topological spaces and metric spaces, we are wondering whether we can throw in some of what we learned in linear algebra and analysis. Ever wondered why we don’t do calculus over the Rationals? \(\Q\) has “holes” in it, and each irrational number stands for one of them: \(\sqrt{2}\) and \(\pi\) represent holes which the Rational Numbers, despite being dense, failed to cover.

So what’s this property of having “no holes”? It’s called completeness. Here is how we define it:

A metric space \((M, \varphi)\) is complete iff every Cauchy sequence in \(M\) converges to an element of \(M\).

What? I mean, what the hell?

OK, let’s follow through. What is a Cauchy sequence?

A sequence \(\{a_n\}\) in a metric space \((M, \varphi)\) is said to be Cauchy iff \(\forall \epsilon > 0, \exists N \in \N\) such that \(\varphi(a_n, a_m) < \epsilon \,\,\, \forall n,m \geq N\).

Oh… no. Please no. This looks familiar. Let’s rewrite the definition of completeness:

A metric space \((M, \varphi)\) is complete iff:
Given a sequence \(a_n\) in \(M\) for which, for any \(\epsilon > 0\), \(\exists N \in \N\) such that:

$$\varphi(a_n, a_m) < \epsilon \,\,\,\, \forall n,m \geq N $$

Then \(a_n\) has a limit in \(M\).

Think about it. The terms in \(a_n\) are getting closer and closer. As the distance between \(a_n\) and \(a_{n+1}\) becomes very small, what does \(a_n\) look like? It must be approaching somewhere. If that somewhere is not in \(M\), then it looks like the sequence “would like” to converge, but can’t. Hence, there would be a hole in \(M\). For example, take the following sequence in \(\Q\):

$$a_0 = 1$$ $$a_1 = 1 + 1 = 2$$ $$a_2 = 1 + 1 + \frac12$$

$$\dots$$

$$a_n = \displaystyle \sum_{k=0}^{n} \frac{1}{k!}$$

It is clear that each of the terms is a Rational Number. By closure (remember fields?), the sum is a Rational. But the sequence is known not to converge in \(\Q\): it converges to the well-known irrational \(e\). I leave it for the reader to prove this sequence is Cauchy (use the metric \(\varphi(a,b) = |a-b|\)). Therefore, there is something missing in \(\Q\), as visualized here:

You can see the distance between successive terms getting arbitrarily small, and the terms seem to align themselves ever closer to a horizontal line. That horizontal line lands somewhere on the \(y\)-axis. If that “somewhere” is an element of \(M\), then the sequence converges to an element of \(M\). If it is not, there is a “hole” and \(M\) is not complete.
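We can watch this happen numerically. The sketch below (illustrative only, using exact rational arithmetic via Python’s `fractions`) tracks the partial sums of \(\sum 1/k!\): every term is rational, consecutive distances shrink, yet the values creep toward the irrational \(e\):

```python
from fractions import Fraction
from math import e, factorial

# Partial sums a_n = sum_{k=0}^{n} 1/k!, kept as exact rationals.
a = [sum(Fraction(1, factorial(k)) for k in range(n + 1)) for n in range(15)]

# The distances between consecutive terms shrink: |a_{n+1} - a_n| = 1/(n+1)!.
gaps = [float(abs(a[n + 1] - a[n])) for n in range(14)]
assert all(gaps[n + 1] < gaps[n] for n in range(13))

# Every term is rational, yet the sequence heads toward the irrational e.
assert abs(float(a[14]) - e) < 1e-10
```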

So what kind of thing is a complete metric space?

  • \(\R^n\) and \(\C^n\) with the Euclidean norm are complete metric spaces.
  • The space of real-valued continuous functions on \([a,b] \subset \R\) under the uniform norm \(||\cdot||_{\infty}\) (i.e. \(||f||_{\infty}\) is the supremum of \(|f(x)|\)) is a complete metric space. You should check that a normed vector space is always a metric space.

Now, let’s answer the question. Why don’t we do calculus over the Rationals? Because it’s an utter mess: you can define the concepts, but the theory you get out of them is worth nothing. Continuity ceases to mean what we want it to: we can use some of the holes to perform “jumps” which continuity on \(\Q\) would fail to capture. Many limits cease to exist, and with them derivatives also fall. The nice geometric and visual intuition goes away. Ultimately, most theorems of analysis fail, since many of them are equivalent to, or dependent upon, the completeness of \(\R\).

This is interesting… now, I’m thinking: \(C[a,b]\), the space of real valued continuous functions on \([a,b]\), is complete. Well, is it? Let’s prove it!

We have to prove that if \(\{f_n \}\) is Cauchy, then \(\{f_n \}\) converges uniformly to a continuous function. Let \(\{f_n \}\) be Cauchy, and fix an \(x\in [a,b]\). Since \(|f_n(x) - f_m(x)| \leq ||f_n - f_m||_{\infty}\), the sequence \(\{f_n(x) \}\) is a Cauchy sequence in \(\R\). Since \(\R\) is complete, the sequence converges in \(\R\). Now, this happens for any fixed \(x\) in the interval. Therefore, we say:

$$f_n \to f \text{ (pointwise)}$$

Pointwise convergence then consists in manipulating the functional sequence as a family of number sequences:

Fix an \(x\) here and observe the sequence along that vertical line (in the picture, \(n\) increases top-down). Try to guess where each sequence converges. Can you see how the pointwise limit is a function which is 0 in \([0,1)\) and 1 at \(x=1\)?
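The picture corresponds to the classic sequence \(f_n(x) = x^n\) on \([0,1]\). A quick sketch of why its convergence is pointwise but not uniform: at \(x = (1/2)^{1/n}\) the gap to the pointwise limit is always \(1/2\), no matter how large \(n\) gets (the code is illustrative, not part of the proof):

```python
# f_n(x) = x^n on [0, 1] converges pointwise to the discontinuous function
# that is 0 on [0, 1) and 1 at x = 1 -- but the convergence is not uniform.
def f_n(n, x):
    return x ** n

def pointwise_limit(x):
    return 1.0 if x == 1.0 else 0.0

def sup_gap(n):
    # At x = (1/2)^(1/n) we have x^n = 1/2 while the limit there is 0,
    # so the sup-norm distance to the limit is at least 1/2 for every n.
    x = 0.5 ** (1.0 / n)
    return abs(f_n(n, x) - pointwise_limit(x))

for n in (1, 10, 100, 10_000):
    assert abs(sup_gap(n) - 0.5) < 1e-9   # the gap never shrinks
```

So \(||f_n - f||_{\infty}\) stays bounded away from 0, which is exactly the failure of uniform convergence.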

Great. Now, since \(\{f_n \}\) is Cauchy: \(\forall \epsilon > 0 , \exists n_0:\)

$$||f_n - f_m|| < \epsilon \,\,\,\, \forall n,m \geq n_0$$

Here, \(||f||\) is the uniform norm. Namely, it is the supremum of the function \(|f|\) (since we are on \([a,b]\), it is also the maximum of \(|f|\)). You will soon see why this is relevant. We know there is a function \(f\) to which \(\{f_n \}\) converges pointwise. Here is a little trick:

$$||f|| = ||f - f_{n_0} + f_{n_0}|| \leq ||f - f_{n_0}|| + ||f_{n_0}||$$

Where in the last step we applied the triangle inequality (which is an axiom of the norm). The function \(f_{n_0}\) is continuous, since it is in our sequence. By the extreme value theorem (which is equivalent to the completeness of \(\R\)) , or by the fact \(||\cdot||\) is a norm, \(\exists M\in \R: ||f_{n_0}|| \leq M\). Also, remember:

$$||f_n - f_{n_0}|| < \epsilon$$

And in the limit as \(n\to \infty\):

$$||f - f_{n_0}|| \leq \epsilon $$

Since we had \(||f|| \leq ||f - f_{n_0}|| + ||f_{n_0}||\):

$$||f|| \leq \epsilon + M $$

Hence, \(f\) is bounded.

We previously had:

$$ ||f_m - f_n|| < \epsilon \,\,\,\, \forall n,m \geq n_0 $$

Taking the limit:

$$\displaystyle \lim_{m\to \infty}||f_m - f_n|| = ||f-f_n|| \leq \epsilon \,\,\,\, \forall n\geq n_0$$

Which is the definition of uniform convergence. What does that mean? It means the bound does not depend on \(x\) as in pointwise convergence, but rather we found a uniform bound (i.e. “universal”). It’s nice to see it in pictures:

The black curve corresponds to \(f\), the red ones to \(f_n\) for \(n<n_0\), and the red/blue ones to \(f_k\) for \(k\geq n_0\). What uniform convergence says is that as you go about narrowing the range, all functions of the sequence from an \(n_0\) onward will stay within that range. That’s why the uniform norm is important: every continuous function \(f\) on \([a,b]\) is bounded by \(-||f||\) and \(||f||\). Hopefully you can visualize why, if a sequence of continuous functions converges uniformly, then it converges to a continuous function. Now go back to the previous image, and try to understand why that sequence does not converge uniformly.

Now we know \(f_n \to f\) uniformly every time \(\{f_n \}\) is Cauchy. We also know \(f\) is bounded. To prove it’s continuous, we need to go down a whole new avenue: the uniform limit theorem. The proof is rather simple, so I leave it to the reader to go find it. \(C[a,b]\) is a complete metric space (under the infinity norm). It is also a normed vector space. Consider the following linear transformation:

$$ T[f] = \int_a^b f(x)dx $$


You could consider \(T:C[a,b] \to \R\) or \(T:C[a,b]\to C[a,b]\). After all, constants can be taken to be functions on \([a,b]\) (type theorists or C programmers… please don’t!). My question is: what happens when I make a subtle change to \(f\), a “perturbation” \(f+\epsilon \mu\) (where \(\epsilon\) is a small Real and \(\mu\) is a continuous function)? This idea is what led to the calculus of variations and the infamous Principle of Least Action in physics.
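Because \(T\) is linear, its response to a perturbation is exactly linear: \(T[f+\epsilon\mu] = T[f] + \epsilon T[\mu]\). A numerical sketch using a trapezoid-rule stand-in for the integral (the specific \(f\), \(\mu\), and grid are illustrative assumptions):

```python
# Since T[f] = integral of f over [a, b] is linear, a perturbation f + eps*mu
# shifts the output by exactly eps * T[mu].
def T(f, a=0.0, b=1.0, n=1000):
    """Trapezoid-rule stand-in for the integral of f over [a, b]."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

f = lambda x: x ** 2
mu = lambda x: x          # the perturbation direction
eps = 1e-3

perturbed = T(lambda x: f(x) + eps * mu(x))
# Exact linearity, up to floating point rounding:
assert abs(perturbed - (T(f) + eps * T(mu))) < 1e-9
```

For nonlinear functionals this identity fails, and the first-order response in \(\epsilon\) is precisely what the calculus of variations studies.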

Now, the differential operator can be taken to be a linear transformation:

$$\frac{d}{dx} : C^1[a,b] \to C[a,b] $$

The notation \(C^1\) specifies the functions are not only continuous on \([a,b]\), but also continuously differentiable. Of course, the derivative of a function need not be differentiable. There is an alternative definition of continuity, based on sequences. Namely, a map \(f\) between metric spaces is continuous at \(a\) iff given any sequence \(\{a_n \}\) converging to \(a\), the sequence \(\{f(a_n) \}\) converges to \(f(a)\).

Assume we have a sequence \(\{f_n\} \subseteq C^1[a,b]\) converging uniformly to \(f\). Now, do these sequences always converge to a differentiable function? It turns out \(C^1[a,b]\) is not complete under the \(||\cdot ||_{\infty}\) norm! So with this norm, we stand in the same place we did when doing analysis on \(\Q\).
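A classic witness of this incompleteness: \(f_n(x) = \sqrt{x^2 + 1/n}\) is smooth, yet converges uniformly to \(|x|\), which has a corner at \(0\). A sketch checking this numerically on \([-1,1]\) (the grid is an illustrative assumption):

```python
from math import sqrt

# f_n(x) = sqrt(x^2 + 1/n) is smooth, yet converges uniformly on [-1, 1]
# to |x|, which is not differentiable at 0: so C^1 with the sup norm
# has a "hole", just like Q did.
def f(n, x):
    return sqrt(x * x + 1.0 / n)

xs = [i / 500 - 1.0 for i in range(1001)]   # grid on [-1, 1]

def sup_gap(n):
    return max(abs(f(n, x) - abs(x)) for x in xs)

# The uniform distance to |x| is at most sqrt(1/n) -> 0.
for n in (1, 10, 100, 10_000):
    assert sup_gap(n) <= sqrt(1.0 / n) + 1e-12

# But the limit has a corner: slopes just left/right of 0 disagree.
h = 1e-6
left = (abs(0.0) - abs(-h)) / h    # -1
right = (abs(h) - abs(0.0)) / h    # +1
assert left == -1.0 and right == 1.0
```

So a Cauchy sequence of \(C^1\) functions (in the sup norm) escaped \(C^1\); the hole it fell into lives in \(C[a,b]\).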

Challenge: Can you provide me with a better norm? How does it make sense?

4- Definitions and Intuition

Up to this point, we were always presented with weirdly convoluted definitions and tried to unwrap how they appropriately captured the intuitive understanding we have on mind. But when we get to asking questions such as the continuity of arbitrary linear transformations, we wonder whether our definition generalizes well to such hard-to-visualize cases. That’s exactly how we get things like:

[Insert mathematician’s name] [property]

A classical example is the property of Lebesgue integrability. There are lots of examples to be found.

Those usually agree in the simple intuitive cases. They can be shown to be completely equivalent in such realms. But once you go up the abstraction ladder, they show remarkable differences. This evidences more about the nature of human cognition rather than the nature of mathematical objects.

Cognition is hardwired to create patterns. Imagine distilling those patterns to their purest, most intrinsic form, in a way where the only thing left is true, naked beauty and the excitement of mystery. This deep oasis devoid of the mundane is mathematics.

Here is a bold statement. Think about it. Mathematics is the highest expression of our own incredulity concerning our existence.

5- Abstract Nonsense

Speaking of the nature of mathematical objects, could we say something about them in a generalized fashion? Category theory does exactly that: it deals with the mappings between objects of various classes. It’s a whole new level of abstraction and that’s why certain specific proofs happening at the category theoretic layer usually receive the name of “abstract nonsense”.

Learning to think in categories is quite enlightening. For excellent introductory content on this, I refer you to Senia Sheydvasser and his posts. They are a true journey of discovery and amazement. Have fun!

A commutative diagram.

What if you could define the Real Numbers by how the maps between them and other fields occur? Or even better, what if you could study various properties mathematicians in many areas look for in a unified fashion? What if I told you this endeavor even has practical applications?

Let’s see one of the simplest ones: a categorical definition of the product of two objects (e.g. sets, topological spaces, vector spaces).

Here is how the process works. We have lots of definitions of what look like different things, but feel kinda similar. In this case, we have lots of definitions of products of objects. Sets, groups, vector spaces, topological spaces: they all have a notion of product. And they all feel like the same underlying idea is lurking around. Even worse, they feel repetitive, tedious in the sense of “oh, this again”. Mathematicians hate that.

So what do they do? They hunt the underlying idea down. And in this case, it’s easy prey. The product of two objects means putting the two together in an independent way, so that you can recover them should you need to. Whatever structure was imposed on each of the two is reimposed on each of the independent components of the product. In algebraic categories, this means the structure of the product acts pointwise on the components: for \(C=A\times B\) we have \((a_1,b_1)\odot_C (a_2,b_2)=(a_1 \odot_A a_2, b_1 \odot_B b_2)\). Since we need to be able to recover our original two objects, the product needs to come equipped with projection maps \(\pi_1\) and \(\pi_2\) that take the product back to each of the original objects!

Say our product is \(X_1\times X_2\). Since we have the projection maps \(\pi_1:X_1\times X_2\to X_1\) and \(\pi_2:X_1\times X_2\to X_2\), that means that if we have some object \(Y\) and maps \(f_1:Y\to X_1\) and \(f_2:Y\to X_2\), then there will be a unique map \(f:Y\to X_1\times X_2\) that uses \(f_1\) to generate the \(X_1\) part of the image and \(f_2\) to generate the \(X_2\) part of the image. Notationally, and rather informally, the concept is: \(f(y)=(f_1(y),f_2(y))\).

This is still not a formal definition! Our “notational” remark was just a reminder of the concept we want to encode in categorical language, to see its full potential. We want \(f\) to be fully determined by \(f_1\) and \(f_2\). By that, we said we wanted \(f\) to use each \(f_i\) to generate each independent part of the product. That means that if we applied \(f\) and then recovered one of the independent parts, we should get the same effect as one of the \(f_i\), since we said \(f\) in an independent part must be \(f_i\). But wait! We have a way of recovering the independent parts: The two projections \(\pi_i\). So what we really want is \(\pi_i(f(y))=f_i(y)\).

You can think of that as follows. The map \(f\) takes \(y\) and uses \(f_1\) to generate the \(X_1\) part of the image and \(f_2\) to generate the \(X_2\) part of the image. So, if we put the image of \(f\) through a machine that gives us back one of the independent parts, (the projection map), we will get exclusively the effect of one of the \(f_i\). In old notation, this is telling us that \(\pi_1(f(y))=\pi_1(f_1(y),f_2(y))=f_1(y)\) should hold. But remember, we want to get rid of ordered pairs (a concept from set theory) and express the idea purely in categorical language. More concisely, we can write: \(\pi_i \circ f = f_i\).

Mathematicians, at this point, are almost ready to throw a party. Almost. Our definition is too long, wordy and is too easy to understand. Let’s turn it into something elegant and impenetrable. Given two objects \(X_1\) and \(X_2\), their product is an object \(X_1\times X_2\) together with projection maps \(\pi_1:X_1\times X_2 \to X_1\) and \(\pi_2:X_1\times X_2 \to X_2\) such that for every object \(Y\) with maps \(f_1:Y\to X_1\) and \(f_2:Y\to X_2\), there is a unique map \(f:Y\to X_1\times X_2\) such that the following diagram commutes:

Here “commutes” means that \(\pi_i \circ f = f_i\).
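In the category of sets, the whole story can be sketched in a few lines of code: the mediating map is forced to be \(y \mapsto (f_1(y), f_2(y))\), and “the diagram commutes” becomes a checkable equality (the particular \(f_1\) and \(f_2\) below are illustrative):

```python
# The categorical product in the category of sets, sketched with Python
# functions: the mediating map f is forced to be y -> (f1(y), f2(y)).
def product(f1, f2):
    """The unique map f : Y -> X1 x X2 with pi_i . f = f_i."""
    return lambda y: (f1(y), f2(y))

pi1 = lambda p: p[0]      # projection onto X1
pi2 = lambda p: p[1]      # projection onto X2

# Any test object Y with maps into X1 and X2:
f1 = lambda y: y + 1      # f1 : Y -> X1
f2 = lambda y: y * y      # f2 : Y -> X2
f = product(f1, f2)

# The diagram commutes: pi_i . f = f_i, checked on sample points of Y.
for y in range(-5, 6):
    assert pi1(f(y)) == f1(y)
    assert pi2(f(y)) == f2(y)
```

Of course, the categorical definition also demands that \(f\) be unique, which a sample check cannot establish; the code only illustrates commutativity.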

And we are done.

Keep dreaming!


I hope you enjoyed your way through this post, and found it somewhat valuable. It was definitely fun to write.

Again, I’m merely a learner who finds it helpful to explain things and wants to communicate part of his passion. Please, do point out any mistakes!

Thanks for reading!

Categories: Mathematics