# The integral of csc(x)

[NOTE: At the end of editing this, I found that the substitution used below is famous enough to have a name, and for Spivak to have called it the "world's sneakiest substitution".  Glad I'm not the only one who thought so.]

In the course of working through some (very good) material on neural networks (which I may try to work through here later), I noticed that it was beneficial for a so-called “activation function” to be able to be written as the solution of an “easy” differential equation.  Here by “easy” I mean something closer to “short to write” than “easy to solve”.

The [activation] function sigma.

In particular, two often used activation functions are
$\sigma (x) := \frac{1}{1+e^{-x}}$
and
$\tau (x) := \tanh{x} = \frac{e^{2x}-1}{e^{2x}+1}.$

One might observe that these satisfy the equations
$\sigma' = \sigma (1-\sigma),$
and
$\tau' = 1-\tau^2.$
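Both identities are easy to sanity-check numerically.  The following is only a rough sketch: the helper names (`central_difference`, `max_residuals`) and the sample points are my own choices for illustration, and a finite difference check is evidence, not a proof.

```python
import math

def sigma(x):
    """Logistic function 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def tau(x):
    """Hyperbolic tangent."""
    return math.tanh(x)

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def max_residuals(points):
    """Largest violation of sigma' = sigma(1 - sigma) and tau' = 1 - tau^2."""
    worst_sigma = max(
        abs(central_difference(sigma, x) - sigma(x) * (1.0 - sigma(x)))
        for x in points
    )
    worst_tau = max(
        abs(central_difference(tau, x) - (1.0 - tau(x) ** 2))
        for x in points
    )
    return worst_sigma, worst_tau

# Arbitrary sample points; both residuals should be tiny.
sample_points = [-3.0, -1.0, 0.0, 0.5, 2.0]
worst_sigma, worst_tau = max_residuals(sample_points)
print(worst_sigma, worst_tau)
```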

By invoking some theorems of Picard, Lindelöf, Cauchy and Lipschitz (I was only going to credit Picard until Wikipedia set me right), we recall that we could start from these (separable) differential equations and fix a single point to guarantee we would end up at the functions above.  In seeking to solve the second, I found after substituting cos(u) = τ that
$-\int\frac{du}{\sin{u}} = x+C,$
and shortly after that, I realized I had no idea how to integrate csc(u).  Obviously the internet knows (substitute v = cot(u) + csc(u) to find that the integral is -log|cot(u)+csc(u)|), which is a really terrible answer, since I would never have gotten there myself.

Not the right approach.

Instinctually, I might have tried the approach to the right, which gets you back to where we started, or tried changing the numerator to $\cos^2 u+\sin^2 u$, which leads to some amount of trouble, though intuitively, this feels like the right way to do it.  Indeed, eventually this might lead you to using half angles (and avoiding integrals of inverse trig functions).  We find
$I = \int \frac{du}{\sin{u}} = \int \frac{\cos^2{(u/2)} + \sin^2{(u/2)}}{2\cos{(u/2)}\sin{(u/2)}}\,du.$
Avoiding the overwhelming temptation to split this integral into summands (which would leave us with a cot(u/2)), we instead divide the numerator and denominator by $\cos^2(u/2)$ to find
$I=\int \frac{1+\tan^2{(u/2)}}{2\tan{(u/2)}}\, du.$
Now substituting v = tan(u/2), we find that $dv = \frac{1}{2}(1+\tan^2(u/2))\,du = \frac{1}{2}(1+v^2)\,du$, so making this substitution, and then undoing all our old substitutions:
$I = \int \frac{1+v^2}{2v}\cdot\frac{2}{1+v^2}\,dv = \int \frac{dv}{v} = \log{|v|} + C = \log{|\tan{\frac{u}{2}}|}+C = \log{|\tan{\frac{\cos^{-1}\tau}{2}}|}+C.$
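If you distrust the algebra (I did), the antiderivative log|tan(u/2)| can be compared against a brute-force numerical integral of csc(u).  This is a minimal sketch: the Simpson's rule helper and the interval [0.5, 2.5] are arbitrary choices of mine, picked only to stay away from the zeros of sin.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def csc(u):
    return 1.0 / math.sin(u)

def antiderivative(u):
    """Candidate antiderivative of csc: log|tan(u/2)|."""
    return math.log(abs(math.tan(u / 2.0)))

a, b = 0.5, 2.5  # arbitrary interval inside (0, pi)
numeric = simpson(csc, a, b)
exact = antiderivative(b) - antiderivative(a)
print(numeric, exact)
```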

The function tau we worry so much about. Looks pretty much like sigma.

Using the half angle formulae that everyone of course remembers and dropping the C (remember, there’s already a constant on the other side of this equation), this simplifies to (finally)
$I = \log{|\frac{\sqrt{1-\tau^2}}{1+\tau}|}.$  Subbing back in and solving for $\tau(x)$ gives, as desired,

$\tau(x) = \frac{e^{2x}-1}{1+e^{2x}}$.

Phew.

# Expectations II

A contour plot of the function. Pretty respectable looking hills- maybe somewhere in the Ozarks- if I say so myself.

As a further example of yesterday’s post, I was discussing multivariable calculus with a student who had never taken it, and mentioned the gradient.  Putting our discussion into the framework of this post, here is what he wanted out of such a high dimensional analogue of the derivative of a function $f: \mathbb{R}^2 \to \mathbb{R}$ (note to impressionable readers: the function defined below is not quite the gradient):
2. Describe the answer:  should be a function from $\mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^3$, which takes a point in the domain, a direction in the domain, and returns the vector in the range.  The idea being that if you had a map, knew where you are and in which direction you wished to travel, then the gradient should tell you what 3-dimensional direction you would head off in.

Certainly there is such a function, though in some sense we are making it too complicated.  As an example we have some pictures of the beautiful hills formed by the function

$f(x,y) = \sin{3y} + \sin{(4x + 5y)} - x^2 - y^2 + 4.$

The (actual) gradient of this function is

$\nabla f(x,y) = \left(4\cos (4x + 5y) - 2x, 3\cos(3y) - 2y + 5\cos(4x + 5y)\right)$.
Plugging in a point in the plane will give a single vector, and then taking the dot product of this vector with a direction will give a rate of change for f at that point, in that direction.  Specifically, if we start walking north at unit speed from the origin, the gradient will be (4,8), and I take the dot product of this with (0,1) to find that I will be climbing at 8m/s (depending on our units!)
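The numbers in this example can be checked in a few lines.  This is just a sketch: `directional_derivative` is a hypothetical helper that uses a symmetric difference quotient, so it only approximates the true rate of change.

```python
import math

def f(x, y):
    """The hills: sin(3y) + sin(4x + 5y) - x^2 - y^2 + 4."""
    return math.sin(3 * y) + math.sin(4 * x + 5 * y) - x ** 2 - y ** 2 + 4

def grad_f(x, y):
    """Gradient of f, computed by hand."""
    c = math.cos(4 * x + 5 * y)
    return (4 * c - 2 * x, 3 * math.cos(3 * y) + 5 * c - 2 * y)

def directional_derivative(x, y, v, h=1e-6):
    """Finite difference estimate of the rate of change of f along v."""
    ahead = f(x + h * v[0], y + h * v[1])
    behind = f(x - h * v[0], y - h * v[1])
    return (ahead - behind) / (2 * h)

north = (0.0, 1.0)
g = grad_f(0.0, 0.0)
slope_exact = g[0] * north[0] + g[1] * north[1]   # gradient dotted with direction
slope_numeric = directional_derivative(0.0, 0.0, north)
print(g, slope_exact, slope_numeric)
```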

Now the correct answer from my student’s point of view would be (0,1,8), since this is the direction in 3 dimensions that one would travel, and the correct definition for $Df$ would have
$Df(x,y) \cdot v = \left(v_1, v_2, \nabla f(x,y) \cdot v \right)$.

The graph of the indicated function, including the vector of the "pseudo-gradient" we discuss.

Of course there are more sophisticated examples of this.  Suppose a function $u: \mathbb{R}^n \to \mathbb{R}$ is harmonic.  That is to say, $\Delta u := \sum_{j = 1}^n \frac{\partial^2 u}{\partial x_j^2} = 0$.  Notice that in order to write down this equality, we already named our solution u.  But just working from this equation, we can deduce a number of qualities that any solution must have: u is infinitely differentiable and, restricted to any compact set, attains its maximum and minimum on the boundary of that set.  Such properties quickly allow us to narrow down the possibilities for solutions to problems.

A picture of some hills that might be shaped like the function we're looking at. In the Ozarks of all places!

# Expectations

Hard thinkin' being done today.

It is useful (for me!) to think about the importance of math as teaching us how to think about problems, rather than providing us with useful factoids (I’m looking at you, history class).  There are a lot of problems/puzzles/patterns in the world, and the chance of seeing the same problem twice is very low (and really, I’ve never seen Batman use the Pythagorean theorem even once, so what’s the point?), so we focus on solving problems in as broad of a context as possible.  In this way, I’d argue, mathematicians become very good problem solvers (“toot! toot!” <– my own horn)

One method of problem solving I would like to focus on today is to name and describe your answer before you have found it.  As a simple example, in order to answer the question “what number squared is equal to itself?”, we would:
1. Name the answer: Suppose x squared is equal to x.
2. Describe the answer: This is where the explicitly developed machinery comes in: We know that $x^2 = x$, so we deduce that x also has the property $x(x-1) = 0$, and conclude that either x = 0 or x = 1.

A geometric way of looking at the word problem. NOT TO SCALE.

As a second example, much of linear algebra is naming objects, describing them, and then realizing you accidentally completely described them.  For example, suppose we wanted to identify every matrix with a number, and make sure that every singular matrix has determinant 0:
1. Name the answer: Let’s call the answer the determinant, or det() for short.
2. Describe the answer: det() should be a function from matrices to numbers, and at least satisfy the following properties: (i) det(I) = 1, so that the identity matrix is associated with the number 1 (so at least some nonsingular matrices will not have determinant zero), (ii) if the matrix A has two equal rows, then det(A) = 0 (so that at least some singular matrices will have determinant zero), and (iii) the determinant is multilinear in the rows, which takes some motivation, but will definitely respect identifying singular matrices.

Well, it turns out that these three properties have already completely determined the object we are looking for!  If I had been greedy and asked (iv) each nonsingular matrix is associated with a unique number, then I would have deduced that no such map exists.  If I had not included property (iii), then I would have found there are many such maps.  It is a fairly enjoyable exercise to deduce the other properties of determinants starting from just these three rules.

More filler photos! This is from Cinque Terre in Italy, between some two of the towns.

# Another nice theorem

Trying to visualize the projection map using fibers. You'll have to take my word that the lines stop before getting to the origin.

Today’s Theorem of the Day (TM) is one I used to compute the Jacobian of a radial projection.  In particular, consider the map $F: \mathbb{R}^n \to \mathbb{R}^n$ where $x \mapsto x/|x|$ for all $|x| > 1$.  This projects all of n-space onto the surface of the unit ball, and leaves the interior untouched.  Then we may compute the derivative $\frac{\partial F_j}{\partial x_k} = \frac{\delta_{j,k}|x|^2 - x_jx_k}{|x|^3}$.

To calculate the Jacobian of F means we have to calculate the determinant of that matrix.  With a little figuring (and treating x as a row vector, so that $x^Tx$ is an n x n matrix), we can write that last sentence as $|JF(x)| = \det \left(\frac{1}{|x|} ( I - \frac{x^Tx}{|x|^2} ) \right) = \frac{1}{|x|^n} \det \left(I-\frac{x^Tx}{|x|^2}\right)$.

Now we apply The Theorem, which Terry Tao quoted Percy Deift as calling (half-jokingly) “the most important identity in mathematics”, and Wikipedia calls, less impressively, “Sylvester’s determinant formula”.  Its usefulness derives from turning the computation of a very large determinant into a much smaller determinant.  At the extreme, we apply the formula to row vectors u and v, and it says that $\det (I+u^Tv) = 1+vu^T$.  In our case, taking $u = -x/|x|^2$ and $v = x$ gives $1 - xx^T/|x|^2 = 1 - |x|^2/|x|^2 = 0$, so it yields $|JF(x)| = 0$.  Thus we turned the problem of calculating the determinant of an n x n matrix into calculating the determinant of a 1 x 1 matrix.
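Here is a small numerical check of both claims: the rank-one identity, and the vanishing Jacobian of the radial projection.  Treat it as a sketch under assumptions: vectors are plain Python lists playing the role of row vectors, the test vectors are arbitrary, and `determinant`, `rank_one_update` and `radial_projection_jacobian` are hypothetical helpers written for this post rather than anything from a library.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-14:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def rank_one_update(u, v):
    """The n x n matrix I + u^T v, for row vectors u and v."""
    n = len(u)
    return [[(1.0 if j == k else 0.0) + u[j] * v[k] for k in range(n)]
            for j in range(n)]

def radial_projection_jacobian(x):
    """Partials of F(x) = x/|x|: (delta_jk |x|^2 - x_j x_k) / |x|^3."""
    n = len(x)
    r = math.sqrt(sum(t * t for t in x))
    return [[((r * r if j == k else 0.0) - x[j] * x[k]) / r ** 3
             for k in range(n)] for j in range(n)]

u = [1.0, -2.0, 0.5, 3.0]
v = [0.25, 1.0, -1.5, 2.0]
lhs = determinant(rank_one_update(u, v))          # a 4 x 4 determinant
rhs = 1.0 + sum(vi * ui for vi, ui in zip(v, u))  # a 1 x 1 "determinant"
jac_det = determinant(radial_projection_jacobian([1.0, 2.0, 2.0]))
print(lhs, rhs, jac_det)
```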

Pretty nifty.

# Busy days.

Somehow spring break has turned into one of the busier weeks of my year.  Trying to keep up with real life work has not left a ton of time for writing anything thoughtful/reasonable, though at least for continuity I will try to keep a paragraph or so up here each day with my favorite thought of the day.  This also means I can reuse some old graphics!

Today I really enjoyed a particular fact about Sobolev functions.  Recall that these are actually equivalence classes of functions, as they are really defined under an integral sign, which “can’t see” sets of small measure.  However, the following quantifies exactly how small the bad set might have to be:

If $f \in W^{1,p}(\Omega)$ for $\Omega \subset \mathbb{R}^n$, then the limit $\lim_{r \to 0} \frac{1}{\alpha(n)r^n}\int_{B(x,r)}f(y)~dy$ exists for all x outside a set E with $\mathcal{H}^{n-p+\epsilon}(E) = 0$ for all $\epsilon > 0$.

Put another way, every Sobolev function may be “precisely defined” outside a set of small dimension, where the dimension gets smaller as p gets larger.  I suppose a given representative may be worse, but this allows you to require that the member of the equivalence class of Sobolev functions has some nice properties.

The fibers of two functions in a sequence. I was thinking the above argument might imply that the limit was not Sobolev, but the limit is precisely represented outside a set with positive 1-dimensional measure, so the result is silent on this issue.

# L’Hopital’s rule.

Two photos from a recent trip up north. Major bonus points for knowing which of New England's many trails this was taken on.

L’Hopital’s rule is really how every student of calculus (and I believe Leibniz, though I cannot find a reference) wishes the quotient rule worked.  Specifically, that

$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}.$

Of course, it can’t be that easy.  We also need that f and g are differentiable in a neighborhood of a, that both functions approach 0, or they both approach $\infty$, or they both approach $-\infty$ as x approaches this point a, and finally that the limit on the right hand side exists (though we all recall that if it does not work the first time, we may continue to apply L’Hopital until the limit does exist, which then justifies using L’Hopital in the first place).

I was thinking of this rule in relation to generating interesting examples of limits.  In particular, if we are in a situation where L’Hopital’s applies, then we can apply the rule in two ways:

$\lim_{x\to a}\frac{f'(x)}{g'(x)}=\lim_{x \to a}\frac{f(x)}{g(x)}=\lim_{x\to a}\frac{\left(\frac{1}{g(x)}\right)'}{\left(\frac{1}{f(x)}\right)'}.$

Proceeding informally (i.e., I’m not going to keep track of hypotheses), the right hand side of this evaluates to
$\lim_{x\to a}\frac{f(x)}{g(x)}=\lim_{x\to a}\frac{f(x)^2}{g(x)^2}\frac{g'(x)}{f'(x)}.$

This is all well and good- the right hand side looks appropriately ugly, but now the trick is picking f and g to get interesting limits.  I have worked out two reasonable examples:

1. Choosing f(x) = sin(x) and g(x) = x, we get

$\lim_{x \to 0} \frac{\sin{x}}{x^2}\tan{x} = 1.$

Also, moderate amounts of bonus points for naming (at least) two universities in the northeast with this mascot.

2. Choosing $f(x) = e^x-1$ and $g(x) = \log{x}$, and applying (hopefully correctly!) a number of logarithm rules, we can get

$\lim_{x \to 0^+} \frac{(e^x-1)^2}{\log{(x^{xe^x\log{x}})}} = 0.$

What would be interesting is to find an example where it is difficult/impossible to evaluate without recognizing that it was created using this process.  This second example might fit the “difficult” bill, as I would not want to take the derivative of the denominator directly, but factoring, you might recognize it as $xe^x (\log{x})^2$, and then be able to reverse engineer this process, somehow.
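The first example above, at least, is easy to sanity-check by plugging in small values of x.  A rough sketch (the function names are mine, and watching a few values settle down is of course not a proof of a limit):

```python
import math

def plain_quotient(x):
    """f/g with f = sin and g = x."""
    return math.sin(x) / x

def transformed_quotient(x):
    """f^2 g' / (g^2 f') for the same pair, i.e. sin(x) tan(x) / x^2."""
    return math.sin(x) / x ** 2 * math.tan(x)

# Both columns should creep toward the same value as x shrinks.
for x in (0.1, 0.01, 0.001):
    print(x, plain_quotient(x), transformed_quotient(x))
```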

As usual, just a thought I’ve been playing with.

# More with fibers of functions

I posted earlier on a way of visualizing the fibers of certain maps from high dimensions to low dimensions.  Specifically, if the range can be embedded in the domain so that f is the identity on the image of the range, then we can draw the inverse image at each point.  I had some images of functions whose inverse image was a torus, but had trouble making these sorts of images for maps $f: \Omega \subset \mathbb{R}^3 \to \mathbb{R}^2$, so that the inverse image of a point is a line.  Well, no more!  Here are two images, one is the projection of a cube onto a square, and the other is somewhat more complicated, and is the string hyperboloid map.  See the previous post for more details on these specific maps, but I just thought these were nice images!

Fibers of the projection map from the cube to the square.

Fibers of the "twisted cylinder", which are again straight lines.

# More geometry with inverse images

Yesterday’s post was on inverse images of functions as sets, and ways to visualize them.  Today, I realized that despite my early series of posts on the Jacobian derailing, I probably have enough background to describe the area and coarea formulae.  The two give a relationship between the “size” of the fibers and the derivative of the map.  The first thing I’ll need to do is define the Jacobian for maps $f: \mathbb{R}^m \to \mathbb{R}^n$.  The definition will be slightly different depending on whether m or n is larger, but if $n \geq m$, then

$|Jf(x)| := \sqrt{|Df(x)^T \cdot Df(x)|}$,

and if $m \geq n$, then

$|Jf(x)| := \sqrt{|Df(x) \cdot Df(x)^T|}$

where Df is the n x m matrix of partial derivatives of f, and we use the absolute value bars to indicate a determinant.  Notice that if m = n, then the definitions agree, and it is just the absolute value of the determinant of the matrix of partial derivatives.  If n = 1 so that f is a real-valued function, then the Jacobian is the length of the gradient of f.

Now then, the area formula says that for a Lipschitz $f:\mathbb{R}^m \to \mathbb{R}^n$ with $m \leq n$, and any Lebesgue measurable $U \subset \mathbb{R}^m$,

$\int_U |Jf(x)| d\mathcal{L}^m(x) = \int_{\mathbb{R}^n} \#(f^{-1}(y) \cap U)~d\mathcal{H}^m(y),$

A hyperboloid projecting onto a circle.

where, for a set S, $\#(S)$ denotes the cardinality of S, i.e., how many points are in S.  We expect this number to be finite (for most functions f I think of, each inverse image has cardinality either 1 or 0).  Indeed, notice that if f is a smooth embedding, then f is one-to-one, and the integrand on the right hand side is 1 at every point of f(U) and 0 elsewhere.  Hence the right hand side will be $\mathcal{H}^m(f(U))$, the area of the image of U under f.  This explains why it is called the area formula: it agrees with the classical area of parametrized surfaces.
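As a tiny worked instance (my own choice of example, with m = 1 and n = 2), take the unit circle parametrized once around.  The Jacobian is identically 1, and the formula returns the circumference:

```latex
\begin{align*}
\gamma(t) &= (\cos t, \sin t), \qquad U = [0, 2\pi), \\
D\gamma(t) &= \begin{pmatrix} -\sin t \\ \cos t \end{pmatrix}, \qquad
|J\gamma(t)| = \sqrt{D\gamma(t)^T D\gamma(t)} = \sqrt{\sin^2 t + \cos^2 t} = 1, \\
\int_U |J\gamma(t)| \, d\mathcal{L}^1(t) &= 2\pi = \mathcal{H}^1(\gamma(U)).
\end{align*}
```

Since the parametrization is one-to-one on U, the counting function on the right hand side is 1 on the circle and 0 off it.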

The coarea formula (the subject of my research) keeps all the conditions above, but now f maps from high dimensions into a lower one, so $m \geq n$.  We have

$\int_U |Jf(x)| d\mathcal{L}^m(x) = \int_{\mathbb{R}^n} \mathcal{H}^{m-n}(f^{-1}(y) \cap U)~d\mathcal{H}^n(y).$

In plain English, the integral of the Jacobian of f is equal to the integral of the length of the fibers of f.  [Technical sentences coming up!] One surprising fact is that this coarea formula was first proven in 1959 in Herbert Federer’s paper “Curvature Measures”, while the area formula was (is) a basic calculus fact, at least for smooth functions.  The formula has since played a role in image processing, as when f is a real valued function, the left hand side is usually referred to as the total variation of f.  De Giorgi showed that the fibers of functions which minimized the left hand side are actually minimal surfaces.
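A quick worked instance, again an example of my own choosing with m = 2 and n = 1: let f be the distance to the origin and U the unit disk.  The fibers are circles, and both sides come out to the area of the disk.

```latex
\begin{align*}
f(x,y) &= \sqrt{x^2 + y^2}, \qquad U = \{(x,y) : x^2 + y^2 < 1\}, \\
|Jf(x,y)| &= |\nabla f(x,y)| = 1 \quad \text{for } (x,y) \neq (0,0), \\
\int_U |Jf| \, d\mathcal{L}^2 &= \mathcal{L}^2(U) = \pi, \\
\int_{\mathbb{R}} \mathcal{H}^{1}(f^{-1}(r) \cap U) \, dr &= \int_0^1 2\pi r \, dr = \pi.
\end{align*}
```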

A string hyperboloid.

I’ve included a few illustrations of how the coarea formula might relate to “projections” of hyperboloids onto the circle.  The first shows such a hyperboloid, along with the fibers of the “projection”.  The coarea of this map will be the surface area of the hyperboloid.  Such a hyperboloid can be made with string from a cylinder, and twisting the top.  See the second figure.  The final .gif illustrates continuing to twist the top, and the resulting surfaces.  In each case, integrating the function whose level sets are these straight lines will return the surface area of the hyperboloid.

Twisting hyperboloids

# Fibers of functions

Inverses

Something that is easy to miss in early calculus classes is that the inverse of a function is typically not a function.  We go through this whole confusing notion first with the square root (because while it is true that if $x^2 = 16$ then $x = \pm 4$, we all know that we like +4 better), then with trig functions.  I would argue that it is helpful to think about the inverse of a function as a set, and then point out the wonderful fact that if all the inverses of individual points have only one or zero members, then there is a function g so that g(f(x)) = x.

Typically though, inverse images will have more than one point.  Indeed, for a map $f: \mathbb{R}^m \to \mathbb{R}^n$, you will expect $f^{-1}(y)$ to be m-n dimensional, if m is bigger than n, and a point otherwise.  Intuitively, this is because we have n equations and m unknowns, leaving us with m-n free variables.  This suggests a way of visualizing functions that I have actually never seen used (references to where it has been used are welcome).

What I have in mind is that, if you have a function $f: U \subset \mathbb{R}^m \to \mathbb{R}^n$, and it so happens that $f(U)$ can be isometrically embedded back into U by choosing well one point from each of the sets $f^{-1}(y)$, then we may plot the inverse images of f on the same graph as we draw the domain of f.

That last paragraph was confusing, so let me give an example right away.  We will look at the function f which maps from the solid torus (donut) to the real numbers, so that f(x) is the distance of x from the center of the solid torus.

The map of the torus that gives the radius of a point. The line in red is the range of the map. Notice it intersects with every shell exactly once.

Hence $f^{-1}(r)$ will be the (not solid) torus of radius r. I have made the graph I describe above for this map.  Notice that the image of the torus under f, a circle, is indicated in blue in the left of the graph.

This picture has a nice intuition: each surface will map down to one point (so our intuition earlier holds up- f maps a three dimensional object down onto one dimension, so the inverse images are all two dimensional), so we can easily look at this and see the domain, range and action of f on the domain.  Notice also that to plot this in a traditional manner it would take either 4 dimensions as a graph, or 1 overloaded dimension as a parametric plot.  This particular example *could* be displayed using a movie, though again we would be displaying fibers of the map.

The last image of this sort is where we instead map a torus (again, non-solid) to a circle.  Notice that now the map is from a 2-D surface to a 1-D curve, so we expect (and see) the fibers to be 1 dimensional.

The inverse images of the torus-radius map, as the radius goes from 0 to 1.

Inverse images of a projection-of-sorts of the torus onto a circle.

# A note on graphs

As a followup to my previous post on functions, I’d like to talk about graphs.  More specifically, ways of visualizing functions.  As before, we’ll go by cases, though this time we are somewhat limited by having only two dimensions to depict a function.

Real valued functions of one real variable: These are what we all started with in high school, and unfortunately all most people will ever see (though maybe also some parametric graphs).  When someone says “graph” this is typically what they mean.

This happens to be a graph of the function $f(x) = x^2 + \sin{4x}$, displayed from x = -2 to +2.  To make a graph like this by hand, one would go to x = 0, figure out that f(0) = 0, then put a single dot at the point (0,0).  Then we would move to, say, x = 1, see that f(1) = 1+sin(4), and put another dot there.  We repeat this process ad infinitum, and get a picture.  If we wanted to make this by computer, we ask MATLAB to make a vector with, in this case, 81 elements, looking like

x = (-2.00, -1.95, -1.90, … , 1.85, 1.90, 1.95, 2.00).

Then, since MATLAB enjoys doing arithmetic a vector at a time, we calculate a new vector y = x.^2+sin(4*x), and then plot x and y by typing plot(x,y).  The computer then plots all the points in the vectors, and connects them to create the smoothish line above.
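For anyone without MATLAB handy, the same two steps (build the vector of x values, then evaluate the function at each one) look roughly like this in plain Python.  This is a sketch, not a translation of any particular script: `sample_graph` is a hypothetical helper, and the final drawing step is left to whatever plotting library you like (matplotlib would be one assumption).

```python
import math

def sample_graph(f, start, stop, step):
    """Return evenly spaced x values on [start, stop] and the matching f(x)."""
    count = round((stop - start) / step) + 1
    xs = [start + i * step for i in range(count)]
    ys = [f(x) for x in xs]
    return xs, ys

def f(x):
    return x ** 2 + math.sin(4 * x)

xs, ys = sample_graph(f, -2.0, 2.0, 0.05)
print(len(xs), xs[0], xs[-1])
# Handing xs and ys to a plotting routine connects the dots into the curve.
```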

Real valued functions of two real variables: These are the objects of study in multivariable calculus, and I will stop explaining everything so much.  Suffice it to say that we plot the height of a function above the point (x,y), so we can start creating things that look like surfaces:

This guy looks pretty wild (if I do say so myself), but we’re just getting started. For interest’s sake, it is the graph of

$f(x,y) = x^2 + y^2 + \sin(4x) + \sin(4y).$

Notice that neither function we have plotted has had any self intersections, and each has been a proper function.  Each passes the “vertical line test”, to borrow a phrase no one uses outside of calc 1.  If we want spheres and donuts, we can’t let the vertical line test stop us.

Parameterizations of a line in the plane: Above, it cost us one dimension for each variable in the domain, and one more for the range, meaning we’ll have to be clever to draw the graph of a real valued function of 3 or more variables.  However, with parametric plots, we will use precisely as many dimensions as are in the range.  First, a parametrization of a line in the plane can be thought of intuitively as telling a line how to sit in the plane.

The above is the graph u(t) = (sin(3t),sin(8t)).  As mentioned, it won’t be possible to realize the above curves as a traditional graph, but I should say that it is easy to realize any traditional graph as a parametric graph.  For example, the graph of y = sin(x) is given parametrically by u(t) = (t,sin(t)).

Functions of one real variable into three space: This is very useful, and gets you used to the idea that we are really just telling a line where to perch.  I’ve animated this graph, rather than using shadows, to give a sense of depth.

The above is the plot u(t) = (cos(t)sin(.2t),sin(t)sin(.2t),cos(.2t)).  I chose this because the curve would sit on the surface of the sphere.  It is late, and the post is long, so I will give just one more example for now.
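The claim that this curve sits on the unit sphere can be checked by sampling it and measuring the distance of each sample from the origin.  A minimal sketch, where the sample count and the range of t are arbitrary choices of mine:

```python
import math

def sphere_curve(t):
    """u(t) = (cos(t) sin(.2t), sin(t) sin(.2t), cos(.2t))."""
    return (
        math.cos(t) * math.sin(0.2 * t),
        math.sin(t) * math.sin(0.2 * t),
        math.cos(0.2 * t),
    )

def max_radius_error(samples=1000, t_max=10 * math.pi):
    """Largest deviation of |u(t)| from 1 over evenly spaced samples."""
    worst = 0.0
    for i in range(samples + 1):
        x, y, z = sphere_curve(t_max * i / samples)
        worst = max(worst, abs(math.sqrt(x * x + y * y + z * z) - 1.0))
    return worst

print(max_radius_error())
```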

Functions of two variables into three space:  These are much more general surfaces than we saw earlier.  Again, we can think of the maps as telling the plane how to sit in three space.

The above is a torus, which is generated using the parameterization

$u(s,t) = (\cos(s)(2+\cos(t)),\sin(s)(2+\cos(t)),\sin(t)).$
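One way to convince yourself this really is a torus: every point should satisfy the implicit equation of a torus with major radius 2 and minor radius 1.  The following sketch samples a grid of parameters and reports the worst residual (the grid size and the helper name `torus_residual` are my own choices).

```python
import math

def torus(s, t):
    """Parametrization u(s, t) of the torus."""
    return (
        math.cos(s) * (2 + math.cos(t)),
        math.sin(s) * (2 + math.cos(t)),
        math.sin(t),
    )

def torus_residual(point, major=2.0, minor=1.0):
    """(sqrt(x^2 + y^2) - major)^2 + z^2 - minor^2, which is 0 on the torus."""
    x, y, z = point
    return (math.hypot(x, y) - major) ** 2 + z ** 2 - minor ** 2

grid = 24
worst = max(
    abs(torus_residual(torus(2 * math.pi * i / grid, 2 * math.pi * j / grid)))
    for i in range(grid)
    for j in range(grid)
)
print(worst)
```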

By adding sin(5s) to the z coordinate, you can get the following, rather more complicated looking graph:

Hopefully sometime in the future I’ll talk about more exotic techniques of graphing.