The bigger a code base, the better the tooling we need. Think about Hack at Meta, Bazel or Protocol Buffers at Google, and Kafka at LinkedIn. Big tech is big on their tooling. It helps them scale to giant code bases and giant teams.
Navigating Cities of Code puts forward the idea that a code base requires us to adopt more sophisticated techniques and practices at successively larger LOC, and it suggests what some of those techniques and practices could be. Documentation, testing, and expressive strong static types are some concrete examples 1. The idea is great 2, but some of the suggestions are ambiguous, and others reflect the author’s personal experience.
If we just observe the industry, we can see that tooling is hugely important, and that the degree of tool-building is of utmost importance as a code base grows. At one end of the scale, big tech has entire departments dedicated to internal tooling: a significant part of the organisation works on improving software development for the rest of it. At the other end of the spectrum, it’s common for smaller shops to have at least a platform team.
Tool-building should be more common and widespread in day-to-day development. Most of the industry produces two artifacts: code and tests. Tools are a third important and valuable artifact 3. Just like expressive strong static types and tests help a team to scale to a bigger code base, so does dedicated tooling.
1. Expressive strong static typing is my own addition.
2. That is, the idea that a bigger code base simply cannot be navigated in the same way as a smaller code base, and that there is in fact a succession of techniques and practices that must be employed at larger and larger LOC.
3. As I learnt working alongside the team at feenk. This is a big part of their work on the Glamorous Toolkit.
I picked up an unassuming little book at the end of my master’s degree. It changed how I study. In the 10 years since, I’ve scored 95%-plus on everything from professional certificates to material as mundane as driving theory, with minimal study and no stress.
The practical technique I developed revolves around taking questions, not notes. In its entirety:
- It is essentially a flash-card technique1.
- Jot down a question as you are reading, watching a video, or in a lecture2.
- Jot down a task if it can’t be phrased as a question.
- Do not take notes blindly (for fear of missing something important); let yourself engage with the material. Phrase a concise question with enough substance for you to retain the material you’re learning. If a question, a point of note, or a thought comes to you, then use it: write it down.
- There is not enough time, it is not effective, and it is distracting to take complete notes3.
- If there are multiple points to an answer include a point count on the card, next to the question, so that you know you need that many points for a complete answer4.
- In batches (at regular intervals, between sections or videos, or at a break in a lecture), work through the questions and write down their answers.
- Review and revise these immediately! As above, read the question, write the answer, compare in batches, and revise any wording, phrasing, etc. Everyone has seen Ebbinghaus’s forgetting curve; well, this is a way to rehearse that is not going to bore you.
- Review and revise these at regular intervals moving forward5.
- Sit an exam to practice using what you’ve learnt or otherwise assess with an exercise sheet, a problem set, or an essay (or similar).
- Go back to find any more material you need in order to improve.
- Repeat.
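The “regular intervals” in the steps above can be sketched as simple bookkeeping. This is a minimal illustration in Python assuming a Leitner-style doubling schedule — that schedule is my assumption, not part of the technique, and apps like Anki use fancier ones:

```python
from datetime import date, timedelta

def next_interval(days, answered_correctly):
    # Double the gap after a correct answer; reset to one day after a miss.
    return days * 2 if answered_correctly else 1

# Walk one card through four review sessions.
day, interval = date(2024, 1, 1), 1
for correct in (True, True, False, True):
    interval = next_interval(interval, correct)
    day += timedelta(days=interval)
```

After two correct answers the gaps grow to two then four days, a miss pulls the card back to tomorrow, and from there the gaps start growing again.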
This has been extremely effective. I learnt my stuff, I enjoyed it, and with a bit of time management in an exam it has gotten me excellent results.
A bit on the origin of this technique. That unassuming little book I picked up was called “Learn How to Study” by Derek Rowntree, in its first edition. It was small, drastically smaller than all the other study books I had found in the library months prior (including its later editions), which made it tractable. Like I hinted at, later editions became bloated, and the first edition is the one that is really worth it (though you might like to compare). Not only was it small, but it was written in a style called programmed learning, a sort of Socratic question-and-answer style that I found familiar. At the same time, I was reading another book in a similar style: The Little Schemer, a CS classic. It became really clear that this style worked well for me, and it morphed into my own technique.
It’s worked for learning the details of research (including memorizing authors and dates), mathematics (my original field), dry driving theory, and programming languages and APIs. My retention is great and my study time is structured, predictable, and minimal. It hasn’t worked for foreign languages, and it hasn’t worked for everyone, but it might work for you. If this specific technique doesn’t work for you, pick up Rowntree’s book and develop your own6.
1. Pencil and paper are great, and they say retention is better, but at some point I like to have these electronically. Having them on my phone helps me study them on the go.
2. Don’t accept a teaching format that is stupid. Our lecturers would copy from paper to board, and we’d copy from board to paper, but that was 2010. There are better ways to deliver material today, so that you can learn more actively, not transcribe in class.
3. Seriously, they should be made available to you electronically, one way or another.
4. Borrowed from the (UK) exam system.
5. Again, having them on a phone helps to study them on a commute, for example. I like Cram notes but you could try Anki.
6. In case it isn’t clear, this isn’t a book review, and this technique isn’t spelled out in the book. I started to develop it while reading the book, as the title suggests.
Introduction
About 10 years ago, when I was learning to code, I read a post listing five ways to code the Fibonacci Sequence in Python. I wrote more or less the same thing in Elixir a few years ago but never published it.1 Here’s a heavily revised version.
I consider the Fibonacci Sequence an analogue to the C classic, Hello World, but for Functional Programming. It’s one of the first things I code up when I come to a new language. Here’s its definition:

f(0) = 0
f(1) = 1
f(n) = f(n - 1) + f(n - 2), for n > 1

Of course, the formula isn’t what I want to show; it just helps to have it here for reference.
Plain & Simple
Our first version is as simple as we can get it.
defmodule Math.V1 do
  def f(0), do: 0
  def f(1), do: 1
  def f(n) when n > 1, do: f(n - 1) + f(n - 2)
end
I was going to say it’s similar to the formula above, but the `def`s and `do`s make me think again. One thing is for sure: out of all the examples, it’s clearest what it does, because it’s a direct translation of the formula into code. There’s no how it does it to get in the way of clarity or obscure what it does.
To calculate a term (i.e. `f(n)`) with `V1` we have to compute the entire tree that grows out of `f(n - 1) + f(n - 2)`. Let’s see an example. To calculate `f(4)` we have to calculate `f(4 - 1) = f(3)` and `f(4 - 2) = f(2)`; to calculate `f(3)` we have to calculate `f(3 - 1) = f(2)`, and so on (see below). In doing so, we repeat lots of the calculation, paying the same term a visit at different parts of the tree. We end up calculating `f(2)` twice:
f(4)
┣ f(3)
┃ ┣ f(2) ## 1st
┃ ┃ ┣ f(1)
┃ ┃ ┗ f(0)
┃ ┗ f(1)
┗ f(2) ## 2nd
┣ f(1)
┗ f(0)
The bigger the `n` we pass to `f(n)`, the bigger the tree, and the more the repetition. How can we do better while showing off some neat code in Elixir?
Don’t Repeat
As per the section title, we’ll store our calculations, so that we don’t repeat them. It was more challenging, and more fun, to write this memoized code than I thought it’d be. Here’s my code after a few attempts:
defmodule Math.V2 do
  def f(n) when n >= 0 do
    memo = %{0 => 0, 1 => 1}

    memo
    |> f(n)
    |> Map.fetch!(n)
  end

  defp f(memo, n) when is_map_key(memo, n) do
    memo
  end

  defp f(memo, n) do
    memo
    |> f(n - 1)
    |> f(n - 2)
    |> then(&Map.put_new(&1, n, &1[n - 1] + &1[n - 2]))
  end
end
In `V2` we store Fibonacci terms that we’d calculate again and again in `memo`. It still bears a lot of resemblance to the formula: (1) the initial memo lists the initial terms for `n = 0` and `n = 1`, and (2) we can clearly see a variation of `f(n - 1) + f(n - 2)` in `f/2`.2 My first attempts at `f/2` returned a tuple of `{nth, memo}` for the nth term and the memo, the result and the accumulator respectively, much like Elixir’s `get_and_update/3`. Instead, returning just the memo means that we can pipe the result through, rather than juggle four more variables.3
Bottom-Up -v- Top-Down
The examples above work top-down. They start at `f(n)` and work their way through their definitions till they bottom out at `f(1)` and `f(0)`. That is, they start at a recursive case (`f(n)`) and work their way to a base case (either `f(0)` or `f(1)`), before the stack unwinds, adding up the terms either side of the `+` in `f(n - 1) + f(n - 2)`. A different approach is to start with a base case, `f(0)` and `f(1)`, and make our way to the recursive case we want, `f(n)`: a bottom-up approach.
Let’s see what I mean in the trace below. We only ever need the two preceding Fibonacci terms to calculate the next:

0, 1, ... The first two terms.
0, 1, 1, ... 0 + 1 = 1.
0, 1, 1, 2, ... 1 + 1 = 2.
0, 1, 1, 2, 3, ... 1 + 2 = 3.
0, 1, 1, 2, 3, 5, ... 2 + 3 = 5.

So, starting with the base terms of `0` and `1`, we can iterate our way to successive values, two terms at a time.
It’s important to see that in this version evaluation starts at the base cases of `f(0)` and `f(1)` and builds up to `f(n)`. This can be confusing, because our previous versions started evaluation at the recursive case `f(n)` and broke it down with `f(n - 1)`s and `f(n - 2)`s (even though their definitions started with the base case).4 You can read more on Wikipedia. In Elixir:
defmodule Math.V3 do
  def f(n) do
    {0, 1}
    |> Stream.iterate(fn {a, b} -> {b, a + b} end)
    |> Enum.at(n)
    |> then(fn {a, _} -> a end)
  end
end
`V3` is the most efficient but, in some ways, the most cryptic. It keeps the bare minimum of two terms in memory, but you can’t see much that resembles the formula. It’s all about the how, i.e. this particular method for generating the sequence, so the what is lost but the method is revealed instead.
Conclusion
Considering all the interesting twists and turns in these versions and others, this post could easily have been about two dozen variations on Fibonacci, or about all the iterations it took to get to the code here. It’s a shame that nice code takes a lot of work but we only see the end result in a post; we don’t experience the journey with the author. I’ve been tinkering with these on and off for what must total a few days over a few years. How can we better share the journey?
1. I wanted to use it to teach functional programming and Elixir syntax.
2. In Elixir the syntax `f/2` means the function named `f` that consumes two arguments. The functions named `f` of arity 1 and 3, `f/1` and `f/3`, are different functions.
3. Dropping the result part and keeping just the accumulator could extend to other code too. Though higher-order functions that use an accumulator cover most of my day-to-day needs.
4. I mean that the definitions for the previous versions start with the base cases. `V1` lists `def f(0)` and `def f(1)` before `def f(n)`. `V2` lists the initial terms in the memo before calling `f/2` for help. In `V3` we don’t have that: it can’t be written in any order other than the one it’s evaluated in. `V1` could be listed differently, with `def f(n)` before `def f(0)` and `def f(1)`, so that evaluation proceeds in the same order as it is declared for n > 1. Of course, evaluation order doesn’t have to follow definition order, but I think it can be confusing.
What is Functional Programming (FP)? I didn’t have an answer I was happy with for a long time. Over the years I discovered that it meant different things to different people, but there was a clear common thread, around which more sophisticated ideas revolved. So I asked colleagues and folks at the Recurse Center what it meant for them. I got lots of different answers. Some of which surprised me. At the time I settled on “FP is about functions, their inputs, and their outputs”. That’s it. Functions consume their inputs and produce an output. That’s all a Functional Programming Language lets you do. Specifically, there is no extraneous data. Only input and output.
The problem is that it’s hard to see how a bigger, more substantial program is coded that way. I’d liken it to saying Evolution is about random mutation. Yes, that’s part of it, but not the whole story; so much so that it misses the forest for the trees. Over long stretches of time populations have members with a mutation; some survive, some don’t; and if that mutation helped members survive then it’s more likely to live on, as descendants inherit it. That’s a much better description of Evolution. That’s what I was looking for when it came to FP: a macroscopic definition or description that fit in with my day-to-day experience.
Then, back in 2020, I came across this Game of Life in JS.
If you’re like me, over the years you’ve seen a few examples like these in Go, Smalltalk (last page), and APL. I thought the JS code was amazing! I’d never seriously considered a set of `(x, y)` points (living cells) as a representation for Life. I had always thought of it, and coded it, as a matrix of some sort. 1 Here’s the JS code:
I got to work reproducing it in Elixir, Phoenix LiveView, and SVG. I started out from scratch, but I couldn’t get it as neat as the JS code. So, I copied it into my editor and turned it into Elixir code, construct by construct, so that it resembled the JS original syntactically as much as possible. Once I had it working I started to rework it into something that felt more like idiomatic functional code in Elixir (to be honest - that first working copy was pretty good). I’d have never thought it a worthwhile exercise to translate code, construct by construct, from one style to another, but it was. I reworked the code and a different program unravelled.
I saw that what was a nested block in JS became nested data in Elixir: nested `for` loops became nested lists, and nested control structure became nested data structure. Variables became functions. By the end of it I had worked the program into the data. The difference between Structured Programming (the JS code) and Functional Programming (the Elixir code) became so clear to me. It’s easy to see this if you compare and contrast the JS above with the Elixir below, construct by construct.
Here’s the final code in Elixir:
The imperative I like to use now is:
Manipulate program data; Don’t manipulate program text. 2
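The set-of-living-cells representation can be sketched in a few lines. This is my own minimal illustration in Python, not the JS or the Elixir from this post:

```python
from collections import Counter

def neighbours(cell):
    # The eight cells surrounding (x, y).
    x, y = cell
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def step(living):
    # Count living neighbours for every cell adjacent to a living cell;
    # only those cells can possibly be alive in the next generation.
    counts = Counter(n for cell in living for n in neighbours(cell))
    return {cell for cell, k in counts.items()
            if k == 3 or (k == 2 and cell in living)}

# A blinker: a row of three cells oscillates with a column of three.
row = {(0, 0), (1, 0), (2, 0)}
column = step(row)  # {(1, -1), (1, 0), (1, 1)}
```

The whole state is one set of points, and each generation is a new set computed from the last: one representation after another, with no grid to mutate.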
The essence of FP isn’t just about functions, inputs, and outputs. FP is about representation; specifically, one representation after another. A succession of data, not a succession of instructions; a succession of collections, not a succession of statements. It’s your application snapshot by snapshot, not step by step. With this idea in mind, it’s much easier to see how a bigger program is coded in a functional style, pieced together from plain old functions and just their inputs and outputs.
So that was the story of how I found my answer. In conclusion I want to highlight three learnings:
- Representation, as we know, can make a big difference.3 So goes the adage about data structures being more important than functions, procedures, etc.
- Even after years of looking at a problem, the elegance of a representation like the one above can elude us. We can miss a great solution over and over again. Perhaps it’s worth entertaining the seemingly strangest formulations, using all the tools at our disposal, when we start out.
- Translation can be a valuable way to learn. I hadn’t thought so, because I assumed it would detract from getting into the right mindset. I was wrong.
Some other reading you might enjoy:
1. That is, an array of arrays, or a list of lists, for example. What we want, in fact, is a sparse array (as my dad pointed out).
2. Aditya Athalye kindly pointed out that in some languages, like Lisp, data can be code and code can be data, and manipulated as such. That’s not what I mean here.
3. I haven’t compared this code to any of my historical code in Elixir that uses a list of lists, or similar representations, but it did make a difference to how much fun this was!
I want to share an idea I’m really excited about!
There is a lot you can do with all the historical data you’re collecting, if you’re running Amass periodically, and you’re currently tracking changes between just the most recent enumerations.1
How about getting your data to look like this…
…instead of like this?
Found: about.example.com XXX.XXX.XXX.XXX
Found: analytics.example.com XXX.XXX.XXX.XXX
Found: api.example.com XXX.XXX.XXX.XXX,XXX.XXX.XXX.XXX
Found: app.example.com XXX.XXX.XXX.XXX
Found: blog.example.com XXX.XXX.XXX.XXX
AFAICT, the status quo is all about using a line-based diff to capture changes to an asset space over the two most recent enumerations, and then getting notified about those changes.2 The idea is that this will keep us up to date with assets that might be more vulnerable because they haven’t been exposed as long, any issues are yet to be discovered, and yet to be fixed.
Sticking to just the two most recent enumerations seems extremely limiting. Why only compare the most recent data with a simple line-by-line comparison, when the Amass DB stores historical data stretching further back, and much more of it than we can get with an `amass track`?
Admittedly, making good sense of connected categorical data (domain names, IP addresses, etc.) over time is difficult, but it’s clear from the chart that there’s a lot more going on than a line-by-line comparison can show us. Not least that what looks like a new asset when comparing just two enumerations had in fact existed historically: it was there, it went, and it came back. A line-based diff is excellent for looking at the details (e.g. exactly which domain and address at which point in time need closer examination), but it makes it much harder to get the big picture and discern patterns, like what’s normal and what’s not.3 This sort of framework might actually help us learn about an organization’s routines, habits, and historical mistakes.
Spotting a pattern in the past and learning to exploit it in the future means that we can be proactive in trying to predict or narrow our focus. No doubt, notifications are great for keeping up-to-date, but as real-time as they might be, they will only ever be a reactive mechanism.
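To make that concrete, here’s a rough sketch in Python of the kind of bookkeeping I mean. It assumes the enumerations have already been exported from the Amass DB into one set of assets per run; the layout and names are invented for illustration:

```python
from datetime import date

# Hypothetical input: one set of discovered assets per enumeration run.
snapshots = {
    date(2024, 1, 1): {"blog.example.com", "api.example.com"},
    date(2024, 2, 1): {"api.example.com"},
    date(2024, 3, 1): {"blog.example.com", "api.example.com", "app.example.com"},
}

def history(snapshots):
    # Map each asset to the ordered list of runs it appeared in.
    seen = {}
    for day in sorted(snapshots):
        for asset in snapshots[day]:
            seen.setdefault(asset, []).append(day)
    return seen

def returning(snapshots):
    # Assets a two-run diff would call "new", but that existed historically:
    # they were there, they went, and they came back.
    days = sorted(snapshots)
    new_now = snapshots[days[-1]] - snapshots[days[-2]]
    older = set().union(*(snapshots[d] for d in days[:-2]))
    return new_now & older
```

Here `returning(snapshots)` picks out blog.example.com: a diff of the two most recent runs would flag it as new, while the full history shows it is an old asset resurfacing.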
I have a few more ideas along this theme, but more than anything, I’d love your thoughts and feedback. I don’t have as much time as I’d like to validate this idea, but perhaps you do. If you have Amass set up and running periodically, you can point the code (and write-up) in this notebook at your Amass data directory to get the same chart as above.
Write to me at joseph@yiasemides.com.
1. In short, Amass is a tool used by security professionals to gather information on a target off of the Web, including domains, IP addresses, etc.
2. I learnt about reconnaissance over time from Codingo’s video, sw33tLie’s post, and Hakluke’s post. I searched the Web (Google, Bing, and DuckDuckGo), consulted ChatGPT, searched on forums and chat servers, and took a look at every product I could find in the space. I couldn’t find anything providing an overview over time. Let me know if there’s something out there.
3. The chart generated by the notebook below does in fact provide a diff when hovering over the dots.
There is one analogy that relates to my experience of software development that I like, and like to bring up again and again: the dishwasher in a shared space. Any shared space, any dishwasher: it’s always a mess. Take the kitchen at work. You say “it doesn’t really matter”; I say “it is so simple to keep it tidy”. I liken it to software development. Even when we start with the best of intentions and great code, it gets messed up, quickly. Like the dishwasher with the first person putting their stuff on the rack, and those who follow, at some point it becomes a wreck. A sort of tragedy of the commons1.
So why is this such a good analogy, and how is it illuminating? Most of us work with smart people, yet we seemingly cannot manage the simplest of shared activities: keeping the dishwasher functional. Why is that? We clearly have the brain power to tell a bad arrangement from a good one, do we not? We can see that, filled to the brim, it does not perform well, can we not? Is it because we don’t notice? Is it because it’s not a priority (we don’t care about that particular thing at the last minute, heading out the office to catch a bus, or we’re distracted by some other thing)? Is it because we’re not the ones doing the clean-up (not my problem)? We don’t see the consequences, so we don’t identify the cause. Is it because you think I don’t care, because I think you don’t care, because we might not notice each other’s good will? We chuck it in, run out, and the one person who cared enough to sort it out and write to everyone won’t be doing it for long. They’re not stupid.
It might be some of those, it might be all of those; either way, software development is the same. We tend to think that software is complex2, and that’s why quality deteriorates quickly, but to me it looks like it is actually something else: an emergent phenomenon of any shared activity, be the task simple or complex. Technical competency, it seems, is not the only thing that counts.
I’ve observed and experienced more diffusion of responsibility than I’ve seen distribution of responsibility (it takes a lot to step up). That is how it seems to work. It is not necessarily conscious, though the effort to fix it is. Team-work can be hard; it is hard even for a committed two-person team (as soon as it’s more than one person) to get it all right all the time. You essentially have a split brain. You forget to communicate this or that. You are not in sync. You make assumptions your peers cannot make. You have priors your peers cannot possibly have. It is mutual. Normally there are more than two of you. Things start to go a little wrong. You notice. You take corrective action.
To what extent? Enough to fix that one instance, or for all time? Should you have done it sooner? Probably, but better late than never. You see, we must identify these issues largely alone, but fix them as a team. Systematically crossing that boundary, from noticing a single instance to dealing with it together, is a difficult transition to make.
I think the analogy not only helps us understand the state of software but can also help fix it. How far can we take it, and how deeply analogous are any fixes? I had a colleague who explained that quality is a matter of process: quality in an artifact comes from quality in the process 3.
1. Strictly speaking, the Tragedy of the Commons is about overuse and destruction, not neglect and contamination.
2. I sometimes find myself arguing that Software Development is not special: it is not in fact more complex than other engineering disciplines, and it may be simpler.
3. Check out Gordon Guthrie, but find him elsewhere on the Web because he’s written widely.
Joel Spolsky wrote that he wouldn’t hire an engineer who couldn’t write1. That seems more pertinent today than ever! Of course, the industry has argued for a long time that communication is important for a software engineer, for everything from code comments to code review, documentation, and demos. That is, communication with people; but prose is clearly turning into communication with machines too.
So, as a programmer, how do I make the most of the GPTs?
My personal tips:
- Learn your stuff. You have to know your stuff better than ever if you are going to hand-hold a GPT. There’s high-level stuff it struggles with, and low-level stuff it will get wrong on occasion, even though the same model got it right on several prior occasions.
- Pay attention to scope. This is still a core skill in software development! What goes in is what comes out. Too much and it will lose you in verbosity, even if it manages to stay on track. Too little and you won’t get anything helpful, if it doesn’t miss the mark completely.
- You cannot count on a GPT to give you the right answer. You can count on it to generate text, so count on it to do just that. Need the right answer on your code base? Don’t ask directly. Ask for code that’ll give you what you need.
- Go step by step: prompt, edit, and repeat. Better to hand-hold than to try to do most of it at once.
- Use absolute, not relative, references. Name the thing; don’t use “it”, “this”, or “that”.
If you understand the tech under a GPT, you’re better placed to play to its strengths, and circumvent its weaknesses. You’ll be able to diagnose issues with how you’re using it and really leverage the tech.
To put things into perspective, too much of my social media is about people surprised at the novelty of writing prose to produce an artifact 2, more than its technical abilities or deficiencies 3. A while back, Paul Graham was interviewing two founders about their “learn to code by making games” start-up, when he asserted 4:
Why do they need to learn all that fundamental programming stuff? These days you can pull in a package to do this or that and just glue it all together.
The point then, and now, is that you need the fundamentals or precursors to take advantage of any industry development be it package management or GPTs in their early stages. At the time you had to choose packages, or decide when to write your own, and glue them together. You couldn’t do that blindly with no experience. Today you have to instruct a GPT step by step. At least for the time being. It’s changed day to day software development from solving some smaller problems to solving some bigger ones.
There was a generation that had to look up everything in a printed manual. Then there was Google, StackOverflow, and YouTube. Now we’ve got Generative AI, LLMs, and GPTs. Here’s the thing. I once had a colleague who knew our APIs inside out. He never had to look them up, not via Google and not via the documentation. He could just get on with work: random access into his own memory, code straight out of his fingertips 5. They say recent generations can Google better than older generations, but that kind of thing aside, who do you think is better equipped to use a GPT: someone who knows their craft inside out, or someone who doesn’t even want to?
1. It might have been in The Guerrilla Guide To Interviewing.
2. In a sense, code is just an intermediate representation.
3. Prose is not always going to be the best way to talk to a machine, even as this tech advances; we have code for a reason. Just try explaining a sorting algorithm or data structure, writing an essay on one, or better still ask a GPT. It’s verbose in prose: several paragraphs of text might distill into several lines of code, as I discovered a long time ago. And there is ambiguity, as is widely noted, with natural language. That is not to say that current programming languages or mathematical notation are optimal.
4. I paraphrase from memory, but you can look it up on YouTube.
5. He could type fast.