When five different smooths are compared, which one comes out on top?

Neil Cummings

The short answer is that it depends on how fast and how wiggly (read: non-smooth) a smooth your data need. If all you need is a summary line plotted through a cloud of dots, it's generally preferable to apply Ockham's razor and choose the simplest method: Bin Smoothing, Simple Moving Average, or Loess.
If you want something more intricate and/or wiggly, on the other hand, you should compare smooths based on their smoothness, accuracy, and speed. When it comes to the one smooth to rule them all, though, there is no free lunch.

Let's look at how a few of the smooths from my previous post fit Friedman's formula, which I chose because it's a smooth function with both a fast and a slow curve (plus there's literature on it, starting with Friedman & Stuetzle (1984) on Supersmoother). (Not to be confused with the Friedmann equation, which describes how fast the universe is expanding and is derived from Einstein's theory of general relativity.)

We'll start by fitting Friedman's formula with a simple and an exponential moving average, then Loess with varied spans, and finally a simple generalised additive model with a cubic spline. I finish the post with some general conclusions on the performance of a few prominent smoothing approaches, drawn from Breiman and Peters (1992).
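As a rough sketch of the first two smoothers (in plain Python rather than the R behind the plots, and not the code used for this post), a simple and an exponential moving average can be written as below. The function names, the centred window that shrinks at the ends, and the `alpha` weight are illustrative assumptions.

```python
def simple_moving_average(y, window):
    """Centred moving average over `window` points.

    Assumption: the window is symmetric (window // 2 points on each side)
    and simply shrinks near the ends instead of padding.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    out = []
    for i in range(len(y)):
        lo = max(0, i - half)
        hi = min(len(y), i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out


def exponential_moving_average(y, alpha):
    """Recursive EMA: s[0] = y[0], s[t] = alpha * y[t] + (1 - alpha) * s[t - 1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    out = []
    for value in y:
        prev = out[-1] if out else value  # seed the recursion with the first point
        out.append(alpha * value + (1 - alpha) * prev)
    return out
```

Both smoothers have a single knob (`window` or `alpha`) applied uniformly along x, which is one reason they struggle when a curve has a fast part and a slow part. The EMA also only looks backwards, so it lags behind any bend.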

That doesn't appear to be a good idea: the Simple and Exponential Moving Averages plainly don't follow the underlying Friedman function (the red line) very well. So let's try Loess with two different parameter settings: (i) the ggplot2 default span and (ii) a narrower span for a wigglier line. Keep in mind that in the real world, when you're at your desk and your boss is breathing down your neck, you'll have no idea what the red line is. I display it so you can see where the grey data points come from, because that's what we're trying to fit.


The Loess plot on the left (A) plainly matches Friedman's formula much better than either of the moving averages, but it has a hard time pulling up at the bends, especially around x = 0.2. The plot on the right (B) addresses this issue with a Loess smooth that uses a shorter span, allowing the line to flex at both the fast and slow curves.
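To make the effect of the span concrete, here is a minimal Loess-style sketch in plain Python. It is a simplification under stated assumptions, not R's loess: it fits a local straight line with tricube weights, `span` is the fraction of points in each window, and the names `tricube`, `loess_at` and `loess` are my own.

```python
import math


def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside it."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_at(x0, xs, ys, span):
    """Weighted local linear fit at x0 using the nearest span * n points."""
    n = len(xs)
    k = min(n, max(3, math.ceil(span * n)))
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # distance to the k-th neighbour
    if h > 0.0:
        w = [tricube((x - x0) / h) for x in xs]
    else:
        w = [0.0] * n
    if sum(w) == 0.0:
        # Degenerate window (tied x values): weight the nearest points equally.
        w = [1.0 if abs(x - x0) <= h else 0.0 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)


def loess(xs, ys, span=0.75):
    """Evaluate the local fit at every observed x."""
    return [loess_at(x0, xs, ys, span) for x0 in xs]


# A sharp bend at x = 0.5: the wide span rounds it off more than the narrow one.
xs = [i / 20 for i in range(21)]
kink = [abs(x - 0.5) for x in xs]
wide_error = abs(loess_at(0.5, xs, kink, span=0.75) - 0.0)
narrow_error = abs(loess_at(0.5, xs, kink, span=0.2) - 0.0)
```

On this toy bend `narrow_error` comes out smaller than `wide_error`, which is the same behaviour as panel (B) versus panel (A): a shorter span averages over fewer neighbours, so the line can turn more sharply, at the cost of following the noise more closely.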

Loess comes out on top.

Loess is O(n^2) in memory, so although it fits well here, it can be slow on large datasets. In fact, once n reaches 1,000, ggplot2::geom_smooth() changes its default smoothing method from Loess to a Generalized Additive Model (GAM) fit with formula = y ~ s(x, bs = "cs"), as illustrated here.
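A back-of-the-envelope sketch of why quadratic memory matters. The assumptions are mine: one dense n-by-n matrix of 8-byte doubles as a stand-in for the O(n^2) cost, and a hypothetical helper `default_smoother` that mirrors the 1,000-observation switch as I read the ggplot2 documentation. Neither function is part of ggplot2.

```python
def dense_matrix_bytes(n, bytes_per_value=8):
    """Bytes needed for one dense n x n matrix (assumed 8-byte doubles)."""
    return n * n * bytes_per_value


def default_smoother(n, threshold=1000):
    """Hypothetical helper: Loess below the threshold, GAM at or above it."""
    return "loess" if n < threshold else "gam"


for n in (100, 1_000, 10_000, 100_000):
    megabytes = dense_matrix_bytes(n) / 1e6
    print(f"n = {n:>7,}: about {megabytes:,.2f} MB -> {default_smoother(n)}")
```

Going from 1,000 to 100,000 points multiplies n by 100 but the matrix size by 10,000, which is why a method that is comfortable on a small scatterplot becomes impractical on a large one.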

Surprisingly, the GAM fits our data just as well as the Loess with the narrower span. The tradeoff is speed, which is why, as n grows larger and memory becomes more precious, we're compelled to use GAM or another approach.

Why don't we try a spline?

The story is the same.

If you're looking for a great paper that delves into the various density estimation strategies and examines the accuracy and speed of the R packages that implement them, Deng and Wickham (2011) is a great place to start.

For a more complete comparison of popular smoothers, Breiman and Peters (1992) ran simulations evaluating a few smoothers on five criteria (on a Sun 3/160 running at 16.67 MHz with just 16M, with an M!, of RAM). In general, they came to the following conclusions:

Although no single smoother excelled across all five criteria (Root Mean Square Error, Root Mean Square Bias, Maximum Deviation, Smoothness, and Bandwidth), distributions, and functions, delete-knot regression splines were the smoothest, most efficient, and most accurate on straight-line data.

Smoothing splines were the slowest and roughest on large samples (n >= 225).

Smoothing splines had the maximum accuracy, as evaluated by RMSE, on small samples (n = 25).
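For readers who want to score their own smooths against a known curve, here is a small sketch of three such measures in plain Python. RMSE and maximum deviation follow their usual definitions. The `roughness` function (sum of squared second differences) is my own stand-in for a smoothness score and is not claimed to be the definition Breiman and Peters used.

```python
import math


def rmse(fitted, truth):
    """Root mean square error between the fitted and true curves."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(fitted, truth)) / len(truth))


def max_deviation(fitted, truth):
    """Largest absolute gap between the fitted and true curves."""
    return max(abs(f - t) for f, t in zip(fitted, truth))


def roughness(fitted):
    """Sum of squared second differences (assumed proxy): 0 for a straight line."""
    return sum(
        (fitted[i + 1] - 2 * fitted[i] + fitted[i - 1]) ** 2
        for i in range(1, len(fitted) - 1)
    )
```

These only work when the true curve is known, as in a simulation. Root mean square bias additionally needs many simulated datasets, since it compares the average fit across replicates with the truth.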

I skipped over a lot of detail on each smoother I discussed, which I'd like to cover in more depth in a future post, especially given that I didn't even mention the differences between parametric and non-parametric approaches.


The R code for these plots is available on my GitHub page here.
