In defense of being premature
Premature optimization is one of those mantra phrases in the programming community that gets applied and repeated for a lot of situations. The term itself has been a part of the field longer than most of us, basically ever since Donald E. Knuth wrote:
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
This 1974 quote is something I wholeheartedly agree with. However, I think a lot of programmers zealously over-apply this whole “don’t worry about efficiency in this phase” notion. Back when the quote was made, programmers didn’t have the luxury to not think about performance for the most common of use cases. The hardware was leagues behind the algorithms of the day, compiler level optimization was in its infancy, and sharing libraries had a major logistical component to it. In short, the software development process was an entirely different beast, and worrying about efficiency could easily become a command by command struggle and therefore a major time sink.
This of course does not mean that the term does not apply to modern programming, it very much does, just that it should not be used as a cancellation stamp on every efficiency thought in the early stages of development. Thinking about big O time complexity, network payload sizes, read/write frequency, text search indexing… All of those are in a way efficiency concerns that need to be at least partially addressed before a single line of code is written. The decisions a developer makes about the importance of those issues can have lasting impacts on the overall architecture. Even if the software in question is “just a prototype”, there is a good chance that its skeleton will be a part of the end product, or that a lot of implementation decisions will be: “do it like it’s been done over there”. That being said, most of these concerns are respected and regarded as valid, at least by engineers (managers might disagree). Therefore, I’m not going to dwell on those use cases in this article. If someone calls you a premature optimizer because you’ve asked what’s the time complexity of a search implementation, that person seriously needs to be sent back to CS 101.
Let’s build the classical Hello World project for learning a new JS framework — the Todo App. Actually, to demonstrate performance impacts without getting into the nitty-gritty of an actual view library (and make this example completely useless for a lot of people), I’m gonna need a bit more complex example, so it’s gonna be a Trello clone. If you’ve never used Trello, it’s basically a highly customizable todo app with a bunch of plugin options, none of which are relevant for this example.
The feature set and requirements of our clone will be the following:
- todos are represented with cards
- cards can have users assigned to them
- cards can have labels (text + color)
- cards are a part of a list
- lists are a part of a board
- users have roles per board where they can either:
- only view the board and its content (GUEST)
- edit existing and create new cards (MEMBER)
- manage (create, edit, or delete) both cards and lists (ADMIN)
- each board has only one user as its owner
- boards can be grouped into workspaces
- workspaces also have only one owner
- boards not grouped into workspaces are regarded as the owner’s “personal workspace”
Initially, I planned to add a simple class diagram here of the described entities, but decided against it because I’d end up obsessing over line alignments in the diagram. All of the classes are pretty simple, one object has a collection of a bunch of other objects it should reference (1:N and N:M relationships). The code should be understandable even without this description and if anything seems unclear, don’t worry about it. Once we reach the performance part it will all be domain agnostic.
I’m going to fast forward a bit and assume that you’ve built this app (in your head, please close that editor tab) in your library/framework of choice. A new requirement just came in. The client wants an analytics screen, and the first data selection they want goes like this:
Find all users that are assigned to cards with the label “DESIGN” of my personal workspace boards; or have a “MEMBER” or “ADMIN” permission on any board in my workspace called “DESIGN”; or are owners of any of the boards in my workspace called “DESIGN”. Eliminate duplicates by id.
Ok, that was a bit of a mouthful, but here’s an implementation of that to get a better idea of what’s the requirement. The following code will rely only on
Array.prototype methods, if any of them are unfamiliar head over to MDN to check them out.
While at first glance it might look like a mess of arrow functions, the code itself is pretty straightforward. It goes:
- concatenate the following three lists:
1.1. users grabbed from
'DESIGN'cards of all the boards of the target user's personal workspace
1.2. users that have the
'ADMIN'role in the target user's
1.3. users that are owners of a board in the target user’s
- filter out duplicates by id by looking back if an element with the same id property exists
COUNT(personalWorkspaceBoards) * COUNT(lists) * COUNT(cards) * MAX(COUNT(labels), COUNT(users))[step 1.1]
COUNT(workspaces) * COUNT(boards) * COUNT(boardUsers)[step 1.2]
COUNT(users) * COUNT(users)[step 2]
For example, the first optimization idea that comes to my mind is combining the “find workspace” parts of steps 1.2 and 1.3 by extracting the find result into a variable above the return. This only relates to the second bullet in the list above and its execution remains the same. Another idea is combining sequential
map calls into a single
reduce method. This impacts two of the bullets, and impacts the innermost parts of the execution so it can make a lot of difference (spoiler alert, it did, but not for the reason you think). However, going back to the big O, this is still the same order of time complexity. The execution time is halved, but that is a constant factor, so from an algorithmic point of view, it's meaningless. A third idea is using
flatMap instead of this awkward
.concat(...list.map(/*...*/)) syntax. It removes extra objects and iterations caused by this constructing, spreading, and then reconstructing of arrays, and it just makes the code look MUCH nicer. The caveat is that it's an ES 2019 feature (proposal link) and might not be available in every users' environment. You know what? It's 2021, IE is dead, caniuse.com says 92% coverage and that's good enough for me, BAM, implemented it. And... it's the same type of optimization that
reduce ended up being, just a constant factor that multiplies the count of the list related to it.
All of this is not very surpassing when you think about it. After all, the structure of the data itself requires the function to iterate through all of the described elements. The most that could be done from an algorithmic point of view is to try to find a loop that can be skipped by precomputing a lookup (map) of results that that loop needs to compute. However, since the data described is tree-like, needs to be traversed root to leaves (i.e. constant
parent.children object accessing), and there aren't repeated computations (other than the first optimization idea), I'm afraid that I'm not well versed in dynamic programming to find a suitable optimization approach if it exists. Therefore, approving this code in a PR, with the notion that it isn't worth spending any extra time optimizing for minor gains, is a perfectly valid decision.
Some time passes and more analytics data selections similar to this one get added. The screen starts getting a bit janky on load, but that’s just on the initial load so the screen’s users don’t mind as much. What the users do mind is that they spend a lot of time on that screen, often time keeping it loaded in another tab and forgetting to refresh that tab to get new data. Now refreshing when they focus on the tab would be a good temporary fix for this, but it seems that they also keep the tab in focus for a long time while making some notes on the side. Also, one user (who is our biggest whale) keeps the screen on their office TV for those #leadership #motivation #entrepreneur #business #productivity #icanttakethisanymore pics, and is a “technologist” that doesn’t understand why the screen can’t be refreshed real-time because every modern app does real-time analytics. So yeah, we’re not gonna do real-time, but refreshing the data every few seconds (with caching) is a good enough compromise for our project manager.
The screen is expectedly somewhat unresponsive now, but not so bad that it needs a total refactor. It becomes a bit janky for a frame every 30 seconds. A few minor optimizations just to keep the calculations within a frame should do it. Good thing we’ve already written down those optimizations, here they are in all their glory:
This, and similar optimizations on the other analytics queries do make things better, but not enough. The janky frame now appears every 45 seconds on average (the numbers are fudged but they make sense, I swear). We walk up to the PM explaining that this just isn’t worth optimizing anymore, that we’d have to restructure the entire thing for one user to be happy. He gives out a sigh and says:
Make it every minute on average and ship it.
Okay, that’s a reasonable goal, but what’s the easiest way to achieve it?
Now I’ll give up a little secret I’ve been keeping. That jank isn’t caused by the function’s execution time. In fact, the average execution time is exactly the same as it was before. The jank is caused by the garbage collector sweeping dead objects when the heap reaches a certain limit. When we implemented this optimization, we got rid of some extra array objects created both by unnecessary double iterations (and their results) and those empty arrays used for
concat. This function still has a lot of unnecessary extra objects in the form of arrow functions.
Each time a function is defined inside a loop, it’s created anew, i.e. as a new function object. Therefore, every arrow function in our example, other than the outermost ones, is being constantly redefined. The same thing goes for any “constant” objects defined inside a loop (such as
Therefore, another route in optimizing this function is extracting all anonymous functions that don’t depend on variables in the outer scope. This “outer scope” part is the only thing we need to keep in mind, but the linter will warn you if you slip up there (or you’ll get a pretty obvious
cannot read property of undefined error). Let's apply that method to our v1 function and see how it holds up.
I don’t know about you, but I find this implementation much easier to read than the previous two. But how does this
v3 hold up to the
v2 optimization? Well now the junk appears every 50 seconds, so this is a slightly better optimization than
v2 is. Combining both approaches will make sure we hit the "always less frequent than one minute" mark (told you the numbers are fudged).
But where do these numbers come from? Well, I did some metrics on each of these versions of the
getDesigners function (plus a
v4 which is just the anonymous function optimization applied on
v2) over a number of iterations, and scaled the average memory impact on this garbage collector memory limit scenario. I'll spare you the details how the metric was done for now, they will be added to the example addendum because I wasted too much time on getting it as best as possible, but here are the results:
If you scale the average memory decrease per version to the number of seconds in this example, you’ll end up with roughly the same numbers. Note that
v2 becomes more impactful than
v3 as the number of iterations increases, but
v3 still averages out as a bit better in this dataset. That is fitting since the first row simulates the memory impact of the function described in the scenario, and the garbage collector really did fire around that point, but more on that in the addendum.
Now someone might say that this example or these measurements are a bit far-fetched, but I disagree. I can easily imagine a function similar to this one being called for a thousand users in a single request, and saving 5 MB of server memory per request is a lot. I’ve worked on data-heavy screens that required view models that had lists of children view models with lists of grandchildren view models and so on for at least a dozen layers and multiple paths in the hierarchy. A lot of those view model lists were initially implemented by doing:
this.children = parentData.children.map((_childData) =>
in the parent view model's constructor. This ended up being not only costly but not easily noticeable because every anonymous function was the "outermost one" in its file. When looking at in a review, you didn't have the context of the anonymous mapper function being defined in a loop inside another loop and so on. When the endpoint using that view model eventually came up for optimization and refactor, the team and I did some back-of-the-envelope calculations and figured out we were wasting around 5 MB just on all those mapping functions. It was by no means the biggest issue that needed to be optimized, but was something that could be done in half an hour while we figured out what to do next. After this situation, we adopted the practice of avoiding anonymous functions in VMs, especially the "simple" shared ones, because we don't know how deep they'll end up being used. Extracting and naming a black-box function only takes a few extra seconds, but it can noticeably impact performance and resource use when in the long run when done consistently.
The main issue I wanted to bring up is laziness, and laziness to think in particular. A lot of us grew up (as programmers) with the phrase “memory is cheap”, with Algorithms and Data Structure 101 courses that exclusively focus on big O function orders, and with the flawed notion that any line-level optimization just makes the code less readable.
First of all, memory is not cheap, that mentality got us in this situation where you need a high-end laptop if you want to have more than three tabs open without Chrome taking up your entire RAM. In the mobile world, it’s even worse, a two-year-old phone with Facebook installed requires the user to learn how to use device maintenance software to clear up background apps and memory. We’ve reached a point where developers behave so haphazardly with memory utilization, that OS memory management is the thing impacting most users’ day-to-day device experience.
Mid-conclusion rant over and back to the other points. The big O is the cornerstone of computing and has precedence in any performance analysis, but it’s not the only thing that exists. Analyzing big O complexity is just the first step in trying to find a better solution. The next step is of course finding ways to improve performance by those constant factors like two or three times because they matter on scale. After, or rather along with that, there is also going into the code and measuring how things hold up in the real world. It’s pain to do but necessary every now and then to get a better grasp of how each line-by-line decision impacts the app’s overall performance. Reality isn’t perfect, and the existence of elements out of your control like garbage collectors, optimizing compilers, various layers caching data, the entire OS with its services and process manager… All of that can drastically distort any approach that looks good on paper, so things need to be occasionally measured and remeasured before something can be concluded as optimal or just “enough”.
On the note of code readability, while that may be completely subjective, the
v3 in the example is far more readable than the
v1 implementation. It's a bit much, I agree. A midpoint would be great. However, comparing the two extremes, I prefer the one that has its helper functions named. When going through a bunch of code I want the function name and signature to tell me all I need to know, and trust my teammates that it's correctly implemented, and not getting bogged down reading the entire flow just to go "Yeah, I think I get what the result is gonna be".
An optimization based on extracting and naming code segments is an easy thing to point to as improving code readability, but I am not saying that optimizations lead to more readable code. I’m only saying that the readability vs optimization dichotomy is a false one. The two exist as separate attributes of a piece of code. They can go against one another, together, or be completely non-applicable, all on a case-by-case basis.
The point I want to hammer home with this article, which exploded far beyond its initial scale, is: don’t think you’re wasting time taking a few extra minutes to think. A minute “wasted” in advance can be a day saved in the future. Don’t get bogged down on every minute detail, yes, but don’t just code like there is no tomorrow. Every time you’re done with a file, class, function, or even just a block, take a moment to stretch (your back needs it) and have a look if something can be better with just a few last-minute tweaks.
Addendum: Example Methodology
For those of you just wanting to see the code, here you go. Word of caution, the code is ugly and full of (linter) errors.
I didn’t use any fancy performance tooling because I needed to repeat this experiment for multiple variations of a similar dataset. Therefore, I needed something that could give me results on memory usage within a script. At first, I used Chrome’s non-non standard memory extension of the Performance interface, but it didn’t fully suit my needs. Chrome tabs are not the stablest to do test runs in, and the memory extension itself didn’t seem detailed enough for my needs at first. Another issue I came across while building my test case is how to control the garbage collector, so I opted for moving the script to Node (the current version I have installed is
v12.16.3) with the hopes of maybe disabling garbage collection.
I quickly found out that Node, or rather V8 does not offer any garbage collector control (SO link 1, SO link 2), but it does expose V8 option flags when running the process, so I started experimenting with those. In the end
--trace-gc ended up being the only useful thing to include. Registering more or less memory for the process, changing GC sweep intervals, disabling background GC sweeps... all made little to no difference in how often the garbage collector ran its sweep.
However, while logging those GC sweeps to get some sense on how to exclude memory lost and time performance increases due to garbage collection, I noticed that if a sweep happened during a function’s execution, the heap used snapshot (as returned by
process.memoryUsage()) difference between the end and start of the function was negative (and usually by a substantial amount). Therefore, as a solution to my garbage collector problem, I decided to make that negative memory difference value a condition for re-running an iteration (noticed the
i-- in the linked code), and just summing the memory and time differences of each individual iteration, instead of the entire loop as I did initially (the
console.time calls in the code are a remnant of that).
With that out of the way, I started doing at least 10 measurements per a number of test iterations (number of iterations being how many calls of a function are done in a script’s run — 100, 250, 1000, 2500, 10000, or 25000). However, the results started to look wrong once a high number of iterations was reached (10000). On some test runs the
v1 code ended up being the optimal one in terms of speed (memory was more or less as expected, just a bit less pronounced), which just didn't add up with the logic. Not that I expected it to be the worst every time, processes are finicky and a lot of things can go wrong, that's why I wanted to do a minimum of 10 measurements per iteration number. However, the
v1 code was consistently 10-25% better than the
v2 is basically the same code but looping twice as less. Then it hit me, each iteration was calling the function on the same dataset. The engine was probably optimizing the code in runtime, and for some reason, it did that better for
To eliminate that issue as best as I can, I decided to create an array of datasets and run each iteration over its own dataset. This ended up being hard to achieve if I wanted to get results for my runs within minutes as the
createDatabase code is pretty heavy and I didn't want to invest too much time in that part anymore (I already did some tweaks before to get just enough needed for my example), so I limited the overall number of datasets to 100 and just loped over those. In the worst case (25000), each dataset is called 250 per function and with at least 100 different calls in-between. Even if the engine is optimizing that scenario in runtime, the end measurements will be displayed alongside the other ones that have less or no repeats of the same data. At this point, this was a compromise I was willing to take.
The results that I’ve shared have a more pronounced difference in memory utilization on the lower number of iteration, but if you ask me this is a feature of the measurements. In a real-world scenario, if you had such an extreme number of function calls in a short time frame, you’d also have runtime optimizations helping you. Therefore, this result is maybe better for making decisions than a one completely stripped away of hidden optimizations or with a disabled garbage collector.
For those interested in time performance, here are the average times in milliseconds:
These are of course the times with the garbage collection iterations excluded. If you put these values in relation to one another, you’ll get a 3% difference at most that is not significant by any means.
Originally I counted the garbage collection sweeps and wanted to include them in the results, but found that they were pretty unreliable and sometimes random. Because I lumped all the test cases one after another, sweeps didn’t make any sense on the low iteration numbers. Sometimes a single garbage collection sweep was made in the end case (after all other cases filled up the memory), sometimes there was one just at the beginning (after the dataset initialization), and sometimes it triggered in all four loops. It all depended highly on the number of results in a dataset. What I can say is that on the higher iteration numbers there is a consistent pattern of
v1 doing the most and
v4 doing the least sweeps, but how substantial are the differences depend on the number of results the dataset gives.
While writing this article I discovered I missed one anonymous function in the
v3 implementation. I corrected it in the code but was too lazy to rerun all the measurements.
Originally published at https://nuntius.hashnode.dev.