Joseph Yiasemides
Keeping It Simple
Working with a Large Code Base

The bigger a code base, the better the tooling we need. Think of Hack at Meta, Bazel and Protocol Buffers at Google, and Kafka at LinkedIn. Big tech is big on tooling. It helps them scale to giant code bases and giant teams.

Navigating Cities of Code puts forward the idea that a code base requires us to adopt more sophisticated techniques and practices at successively larger LOC, and it suggests what some of those techniques and practices could be. Documentation, testing, and expressive strong static types are some concrete examples.1 The idea is great,2 but some of the suggestions are ambiguous, and they reflect the author’s personal experience.

If we observe the industry, we can see that tooling is hugely important, and that tool-building only grows in importance as a code base grows. At one end of the scale, big tech has entire departments dedicated to internal tooling: a significant part of their organisation works on improving software development for everyone else. At the other end, it’s common for smaller shops to have at least a platform team.

Tool-building should be more common and widespread in day-to-day development. Most of the industry produces two artifacts: code and tests. Tools are a third important and valuable artifact.3 Just as expressive strong static types and tests help a team scale to a bigger code base, so does dedicated tooling.


  1. Expressive strong static typing is my own addition.

  2. That is, the idea that a bigger code base simply can not be navigated in the same way as a smaller code base, and that there is in fact a succession of techniques and practices that must be employed at larger and larger LOC.

  3. As I learnt working alongside the team at feenk. This is a big part of their work on the Glamorous Toolkit.

3 Ways To Fibonacci In Elixir

Introduction

About 10 years ago, when I was learning to code, I read a post listing five ways to code the Fibonacci Sequence in Python. I wrote more or less the same thing in Elixir a few years ago but never published it.1 Here’s a heavily revised version.

I consider the Fibonacci Sequence the Functional Programming analogue of the C classic, Hello World. It’s one of the first things I code up when I come to a new language. Here’s its definition:

F(n) = F(n-1) + F(n-2), with F(0) = 0, F(1) = 1, and n > 1

Of course, the formula isn’t what I want to show; it just helps to have it here for reference.

Plain & Simple

Our first version is as simple as we can get it.

defmodule Math.V1 do
  def f(0), do: 0
  def f(1), do: 1
  def f(n) when n > 1, do: f(n - 1) + f(n - 2)
end

I was going to say it’s similar to the formula above, but the defs and dos make me think again. One thing is for sure: out of all the examples, it’s the clearest about what it does, because it’s a direct translation of the formula into code. There’s no how it does it to get in the way of clarity or obscure what it does.

To calculate a term (i.e. f(n)) with V1 we have to compute the entire tree that grows out of f(n - 1) + f(n - 2). Let’s see an example. To calculate f(4) we have to calculate f(4 - 1) = f(3) and f(4 - 2) = f(2); to calculate f(3) we have to calculate f(3 - 1) = f(2); and so on (see below). In doing so, we repeat a lot of the calculation, paying the same term a visit at different parts of the tree. We end up calculating f(2) twice:

f(4)
┣ f(3)
┃ ┣ f(2) ## 1st
┃ ┃ ┣ f(1)
┃ ┃ ┗ f(0)
┃ ┗ f(1)
┗ f(2)   ## 2nd
  ┣ f(1)
  ┗ f(0)

The bigger the n we pass to f(n), the bigger the tree, and the more the repetition. How can we do better while showing off some neat code in Elixir?

Don’t Repeat

As per the section title, we’ll store our calculations, so that we don’t repeat them. It was more challenging, and more fun, to write this memoized code than I thought it’d be. Here’s my code after a few attempts:

defmodule Math.V2 do
  def f(n) when n >= 0 do
    memo = %{0 => 0, 1 => 1}

    memo
    |> f(n)
    |> Map.fetch!(n)
  end

  defp f(memo, n) when is_map_key(memo, n) do
    memo
  end

  defp f(memo, n) do
    memo
    |> f(n - 1)
    |> f(n - 2)
    |> then(&Map.put_new(&1, n, &1[n - 1] + &1[n - 2]))
  end
end

In V2 we store Fibonacci terms that we’d otherwise calculate again and again in memo. It still bears a lot of resemblance to the formula: (1) the initial memo lists the initial terms for n = 0 and n = 1, and (2) we can clearly see a variation of f(n - 1) + f(n - 2) in f/2.2

My first attempts at f/2 returned a tuple of {nth, memo} for the nth term and the memo, the result and the accumulator respectively, much like Elixir’s get_and_update/3s. Instead, returning just the memo means that we can pipe the result through, rather than juggle four more variables.3
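For comparison, here’s what that earlier tuple-returning style looks like. It’s a reconstruction from memory rather than the code I actually wrote, so treat it as a sketch:

```elixir
defmodule Math.V2Tuple do
  # Sketch of the earlier attempts: f/2 returns {nth, memo},
  # the result and the accumulator respectively.
  def f(n) when n >= 0 do
    {nth, _memo} = f(%{0 => 0, 1 => 1}, n)
    nth
  end

  defp f(memo, n) when is_map_key(memo, n), do: {memo[n], memo}

  defp f(memo, n) do
    {a, memo} = f(memo, n - 1)
    {b, memo} = f(memo, n - 2)
    {a + b, Map.put(memo, n, a + b)}
  end
end
```

Every call site has to destructure the tuple and rebind the memo by hand; returning just the memo, as V2 does, lets us pipe instead.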

Bottom-Up -v- Top-Down

The examples above work top-down. They start at the recursive case, f(n), and work their way through the definitions until they bottom out at a base case, f(0) or f(1); then the stack unwinds, adding up the terms on either side of the + in f(n - 1) + f(n - 2).

A different approach is to start with the base cases, f(0) and f(1), and make our way up to the term we want, f(n): a bottom-up approach. Let’s see what I mean in the trace below. We only ever need the two preceding Fibonacci terms to calculate the next:

0, 1, ... The first two terms.

0, 1, 1, ... 0 + 1 = 1.

0, 1, 1, 2, ... 1 + 1 = 2.

0, 1, 1, 2, 3, ... 1 + 2 = 3.

0, 1, 1, 2, 3, 5, ... 2 + 3 = 5.

So, starting with the base terms of 0 and 1, we can iterate our way to successive values, two terms at a time. It’s important to see that in this version evaluation starts at the base cases of f(0) and f(1) and builds up to f(n). That can be confusing, because our previous versions started evaluation at the recursive case f(n) and broke it down into f(n - 1)s and f(n - 2)s (even though their definitions started with the base cases).4

You can read more on Wikipedia. In Elixir:

defmodule Math.V3 do
  def f(n) do
    {0, 1}
    |> Stream.iterate(fn {a, b} -> {b, a + b} end)
    |> Enum.at(n)
    |> then(fn {a, _} -> a end)
  end
end

V3 is the most efficient but, in some ways, the most cryptic. It keeps the bare minimum of two terms in memory, but you can’t see much that resembles the formula. It’s all about the how, i.e. this particular method for generating the sequence, so the what is lost but the method is revealed instead.
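As a quick cross-check, the naive and the bottom-up versions can be run against each other. Here the two definitions from above are repeated under fresh names so the snippet stands on its own:

```elixir
defmodule Fib.Naive do
  # V1 from above: a direct translation of the formula.
  def f(0), do: 0
  def f(1), do: 1
  def f(n) when n > 1, do: f(n - 1) + f(n - 2)
end

defmodule Fib.Iterative do
  # V3 from above: bottom-up, keeping only the last two terms.
  def f(n) do
    {0, 1}
    |> Stream.iterate(fn {a, b} -> {b, a + b} end)
    |> Enum.at(n)
    |> then(fn {a, _} -> a end)
  end
end

# A pattern match raises if any pair of terms disagrees.
for n <- 0..19, do: true = Fib.Naive.f(n) == Fib.Iterative.f(n)
```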

Conclusion

Considering all the interesting twists and turns in these versions and others, this post could easily have been about two dozen variations on Fibonacci, or about all the iterations it took to get to the code here. It’s a shame that nice code takes a lot of work but we only see the end result in a post; we don’t experience the journey with the author. I’ve been tinkering with these on and off for what must total a few days over a few years. How might we better share the journey?


  1. I wanted to use it to teach functional programming and Elixir syntax.

  2. In Elixir the syntax f/2 means the function named f that consumes two arguments. The functions named f of arity 1 and 3, f/1 and f/3, are different functions.

  3. Dropping the result part and keeping just the accumulator could extend to other code too. Though higher-order functions that use an accumulator cover most of my day-to-day needs.

  4. I mean that the definitions for the previous versions start with the base cases. V1 lists def f(0) and def f(1) before def f(n). V2 lists the initial terms in the memo before calling f/2 for help. V3 has no such listing: it can’t be written in any order other than the one it evaluates in. V1, by contrast, could be listed differently, with def f(n) before def f(0) and def f(1), so that for n > 1 evaluation proceeds in the same order as the declarations. Of course, evaluation order doesn’t have to follow definition order, but I think it can be confusing.

What is Functional Programming?

What is Functional Programming (FP)? For a long time I didn’t have an answer I was happy with. Over the years I discovered that it meant different things to different people, but there was a clear common thread around which more sophisticated ideas revolved. So I asked colleagues and folks at the Recurse Center what it meant to them. I got lots of different answers, some of which surprised me. At the time I settled on “FP is about functions, their inputs, and their outputs”. That’s it. Functions consume their inputs and produce an output. That’s all a Functional Programming Language lets you do. Specifically, there is no extraneous data: only input and output.

The problem is that it’s hard to see how a bigger, more substantial program is coded that way. I’d liken it to saying Evolution is about random mutation. Yes, that’s part of it, but not the whole story; so much so that it misses the forest for the trees. Over long stretches of time populations have members with a mutation; some survive, some don’t; and if that mutation helped members survive then it’s more likely to live on, as descendants inherit it. That’s a much better description of Evolution, and it’s what I was looking for when it came to FP: a macroscopic definition or description that fits in with my day-to-day experience.

Then, back in 2020, I came across this Game of Life in JS. If you’re like me, over the years you’ve seen a few examples like these in Go, Smalltalk (last page), and APL. I thought the JS code was amazing! I’d never seriously considered a set of (x, y) points (living cells) as a representation for Life. I had always thought of it, and coded it, as a matrix of some sort.1

Here’s the JS code:

I got to work reproducing it in Elixir, Phoenix LiveView, and SVG. I started out from scratch, but I couldn’t get it as neat as the JS code. So, I copied it into my editor and turned it into Elixir code, construct by construct, so that it resembled the JS original syntactically as much as possible. Once I had it working I started to rework it into something that felt more like idiomatic functional code in Elixir (to be honest, that first working copy was pretty good). I’d never have thought it a worthwhile exercise to translate code, construct by construct, from one style to another, but it was. I reworked the code and a different program unfolded.

I saw that what was a nested block in JS became nested data in Elixir, nested for loops became nested lists, and nested control structure became nested data-structure. Variables became functions. By the end of it I had worked the program into the data. The difference between Structured Programming (the JS code) and Functional Programming (the Elixir code) became so clear to me. It’s easy to see this if you compare and contrast the JS above with the Elixir below construct-by-construct.
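To make that contrast concrete in miniature, here’s a sketch of my own of the points-as-data idea. It is an illustration, not the final code from this post: the board is nothing but a MapSet of live {x, y} points, and a generation is one transformation of that data into the next.

```elixir
defmodule Life do
  # Every cell has eight neighbouring offsets.
  @offsets for dx <- -1..1, dy <- -1..1, {dx, dy} != {0, 0}, do: {dx, dy}

  # One generation: count live neighbours per candidate cell, then keep
  # a cell if it has three, or two and it’s already alive.
  def step(live) do
    live
    |> Enum.flat_map(fn {x, y} -> Enum.map(@offsets, fn {dx, dy} -> {x + dx, y + dy} end) end)
    |> Enum.frequencies()
    |> Enum.filter(fn {cell, n} -> n == 3 or (n == 2 and cell in live) end)
    |> Enum.map(fn {cell, _n} -> cell end)
    |> MapSet.new()
  end
end

# A “blinker” flips between a horizontal and a vertical bar of three cells.
blinker = MapSet.new([{0, 1}, {1, 1}, {2, 1}])
Life.step(blinker)
# => the vertical bar: {1, 0}, {1, 1}, {1, 2}
```

There’s no grid anywhere: empty space is simply absent from the data.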

Here’s the final code in Elixir:

The imperative I like to use now is:

Manipulate program data; Don’t manipulate program text.2

The essence of FP isn’t just about functions, inputs, and outputs. FP is about representation; Specifically, one representation after the other. The succession of data; Not a succession of instructions. A succession of collections; Not a succession of statements. It’s your application snap-shot by snap-shot; Not step by step. Now with this idea in mind, it’s much easier to see how a bigger program is coded in a functional style, how it’s pieced together from plain old functions, and just their inputs, and outputs.

So that was the story of how I found my answer. In conclusion I want to highlight three learnings:

  • Representation, as we know, can make a big difference.3 So goes the adage about data structures being more important than functions, procedures, etc.

  • Even after years of looking at a problem, the elegance of the representation above eluded me. We can miss a great solution over and over again. Perhaps it’s worth entertaining even the seemingly strangest formulations, using all the tools at our disposal, when we start out.

  • Translation can be a valuable way to learn. I hadn’t thought so, assuming it would detract from getting into the right mindset. I was wrong.



  1. That is, an array of arrays, or a list of lists for example. What we want, in fact, is a sparse array (as my dad pointed out).

  2. Aditya Athalye kindly pointed out that in some languages, like Lisp, data can be code and code can be data and manipulated as such. That’s not what I mean here.

  3. I haven’t compared this code to any of my historical code in Elixir that uses a list of lists, or similar representations, but it did make a difference to how much fun this was!

Visualizing Amass Data Over Time

I want to share an idea I’m really excited about!

If you’re running Amass periodically and you’re currently tracking changes between just the two most recent enumerations, there is a lot more you can do with all the historical data you’re collecting.1

How about getting your data to look like this…

Figure 1. Domain Name → Host Count Over Time. The chart plots the Amass enumeration index along the x-axis, domain name along the y-axis, and the dot size and shade indicate the number of hosts behind that domain for that enumeration. The data has been shortened and domain names obfuscated for this post.

…instead of like this?

Found: about.example.com XXX.XXX.XXX.XXX
Found: analytics.example.com XXX.XXX.XXX.XXX
Found: api.example.com XXX.XXX.XXX.XXX,XXX.XXX.XXX.XXX
Found: app.example.com XXX.XXX.XXX.XXX
Found: blog.example.com XXX.XXX.XXX.XXX
Figure 2. A sample of Amass's `track` output.

AFAICT, the status quo is all about using a line-based diff to capture changes to an asset space over the two most recent enumerations, and then getting notified about those changes.2 The idea is that this keeps us up to date with assets that might be more vulnerable because they haven’t been exposed as long: any issues are yet to be discovered, and yet to be fixed.

Sticking to just the two most recent enumerations seems extremely limiting. Why only compare the most recent data with a simple line-by-line comparison when the Amass DB stores historical data stretching much further back, and much more of it than we can get with an `amass track`? Admittedly, making good sense of connected categorical data (domain names, IP addresses, etc.) over time is difficult, but it’s clear from the chart that there’s a lot more going on than a line-by-line comparison can show us. Not least that what looks like a new asset when comparing just two enumerations had in fact existed historically: it was there, it went, and it came back.

A line-based diff is excellent for looking at the details (e.g. exactly which domain and address at which point in time need closer examination), but it makes it much harder to get the big picture and discern patterns, like what’s normal and what’s not.3 This sort of framework might actually help us learn about an organization’s routines, habits, and historical mistakes.
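To make this concrete, the domain → host-count data behind a chart like Figure 1 can be pulled out of output shaped like the Figure 2 sample in a few lines. This is a sketch only: it assumes the `Found: <domain> <ip>[,<ip>...]` line shape shown above, and the notebook works off the Amass data directory rather than `track` output.

```elixir
defmodule TrackSample do
  # Sketch: turn "Found: <domain> <ips>" lines into a
  # %{domain => host count} map for one enumeration.
  def host_counts(output) do
    for "Found: " <> rest <- String.split(output, "\n", trim: true), into: %{} do
      [domain, ips] = String.split(rest, " ", parts: 2)
      {domain, ips |> String.split(",") |> length()}
    end
  end
end

# With made-up addresses standing in for the XXXs above:
output = """
Found: api.example.com 10.0.0.1,10.0.0.2
Found: blog.example.com 10.0.0.3
"""

TrackSample.host_counts(output)
# => %{"api.example.com" => 2, "blog.example.com" => 1}
```

Collect one such map per enumeration and you have the series the dots in Figure 1 are drawn from.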

Spotting a pattern in the past and learning to exploit it in the future means we can be proactive in predicting, or at least narrowing our focus. No doubt, notifications are great for keeping up to date, but as real-time as they might be, they will only ever be a reactive mechanism.

I have a few more ideas along this theme, but more than anything, I’d love your thoughts and feedback. I don’t have as much time as I’d like to validate this idea, but perhaps you do. If you have Amass set up and running periodically, you can point the code (and write-up) in this notebook at your Amass data directory to get the same chart as above.

Write to me at joseph@yiasemides.com.


  1. In short, Amass is a tool used by security professionals to gather information on a target from the Web, including domains, IP addresses, etc.

  2. I learnt about reconnaissance over time from Codingo’s video, sw33tLie’s post, and Hakluke’s post. I searched the Web (Google, Bing, and DuckDuckGo), consulted ChatGPT, searched on forums and chat servers, and took a look at every product I could find in the space. I couldn’t find anything providing an overview over time. Let me know if there’s something out there.

  3. The chart generated by the notebook below does in fact provide a diff when hovering over the dots.