Paul Tagliamonte: Complex for Whom?

Complexity Culture
In a large enough organization, where the team is high functioning enough to
have and maintain trust amongst peers, members of the team will specialize.
People will begin to engage with subsets of the work to be done, and begin to
have their efficacy measured against that part of the organization's problems.
Incentives shift, and over time it becomes increasingly likely that two
engineers may have two very different priorities when working on the same
system together. Someone accountable for uptime and tasked with responding to
outages will begin to resist changes. Someone accountable for rapidly
delivering features will resist gates between them and their users. Companies
(either wittingly or unwittingly) will deal with this by tasking engineers with
both production (feature development) and operational tasks (maintenance), so
the difference in incentives isn't usually as bad as it could be.
When we get a bunch of folks from far-flung corners of an organization in a
room, fire up a slide deck and throw up some aspirational to-be architecture
diagram in order to get a sign-off to solve some problem (be it someone needs a
credible promotion packet, a new feature needs to get delivered, or the system
has begun to fail and needs fixing), the initial reaction will, more often than
I'd like, start to devolve into a discussion of "how this is going to introduce
a bunch of complexity, going to be hard to maintain, why can't you make it
less complex?"
Right around here is when I start to try and contextualize the conversation
happening around me: to understand what complexity is being discussed, and
to understand who is taking on that burden. I think about who should be owning
that problem, and work through the tradeoffs involved. Is it best solved here,
or left to consumers (be they other systems, developers, or users)? Should
something become an API call's optional param, taking on all the edge-cases and
maintenance, or should users have to implement the logic using the data you
return (leaving everyone else to take on all the edge-cases and maintenance)?
Should you process the data, or require the user to preprocess it for you?
Frequently it's right to make an active and explicit decision to simplify and
leave problems to be solved downstream, since they may not actually need to be
solved, or perhaps you expect consumers will want to own the specifics of
how the problem is solved, in which case you leave lots of documentation and
examples. Many other times, especially when it's something downstream consumers
are likely to hit, it's best solved internal to the system, since the only
things that can come of leaving it unsolved are bugs, frustration and
half-correct solutions. This is a grey-space of tradeoffs, not a clear decision
tree. No one wants the software manifestation of a katamari ball or a junk
drawer, nor does anyone want a half-baked service unable to handle the simplest
use-case.
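To make the optional-param question concrete, here is a minimal sketch in
Python. Everything in it is hypothetical and invented for illustration: the
`EVENTS` data, the `list_events` call and its `since` parameter, and the
`consumer_side_filter` helper. It only shows where the edge-cases (records with
no timestamp, timezone-naive input) end up living in each option.

```python
from datetime import datetime, timezone

# Hypothetical in-memory data standing in for whatever the service returns.
EVENTS = [
    {"id": 1, "at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"id": 3, "at": None},  # edge-case: a record with no timestamp
]


def list_events(since=None):
    """Option A: the service owns the filter, and its edge-cases."""
    if since is None:
        return list(EVENTS)
    if since.tzinfo is None:
        # Assumption for this sketch: naive datetimes are treated as UTC.
        # The service decides this once, on behalf of every caller.
        since = since.replace(tzinfo=timezone.utc)
    return [e for e in EVENTS if e["at"] is not None and e["at"] >= since]


def consumer_side_filter(events, since):
    """Option B: every consumer re-implements the filter on the raw data."""
    # Each consumer has to remember the missing timestamps on their own,
    # and has to pass a timezone-aware `since` or the comparison raises.
    return [e for e in events if e["at"] is not None and e["at"] >= since]


in_service = list_events(since=datetime(2024, 3, 1))
downstream = consumer_side_filter(
    list_events(), datetime(2024, 3, 1, tzinfo=timezone.utc)
)
```

Both calls return the same single event here. The difference is who carries
the two edge-cases: one maintainer in Option A, or every consumer in Option B.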
Head-in-sand as a Service
Popoffs about how complex something is, are, to a first approximation, best
understood as meaning "complicated for the person making comments". A lot of
the #thoughtleadership believe that an AWS hosted EKS k8s cluster running
images built by CI talking to an AWS hosted PostgreSQL RDS is not complex.
They're right. Mostly right. This is less complex: less complex for them.
It's not, however, without complexity and its own tradeoffs; it's just
complexity that they do not have to deal with. Now they don't have to
maintain machines that have pesky operating systems or hard drive failures.
They don't have to deal with updating the version of k8s, nor ensuring the
backups work. No one has to push some artifact to prod manually. Deployments
happen unattended. You click a button and get a cluster.
On the other hand, developers outside the ops function need to deal with
troubleshooting CI, debugging access control rules encoded in Turing-complete
YAML, and permissions issues inside the cluster due to whatever the fuck a
service mesh is. Everyone needs to learn how to use some k8s tools they only
actually use during a bad day, likely while doing some x.509 troubleshooting to
connect to the cluster (an internal only endpoint; just port forward it), not
to mention all sorts of rules to route packets to their project (a single
repo's binary being run in 3 containers on a single vm host).
Beyond that, there's the invisible complexity: complexity on the interior of
a service you depend on. I think about the dozens of teams maintaining the EKS
service (which is either run on EC2 instances, or alternately, EC2 instances in
a trench coat, moustache and even more shell scripts), the RDS service (also
EC2 and shell scripts, but this time accounting for redundancy, backups,
availability zones), scores of hypervisors pulled off the shelf (xen, kvm)
smashed together with the ones built in-house (firecracker, nitro, etc)
running on hardware that has to be refreshed and maintained continuously. Every
request processed by network ACL rules, AWS IAM rules, security group rules,
using IP space announced to the internet wired through IXPs directly into ISPs.
I don't even want to begin to think about the complexity inherent in how those
switches are designed. Shitloads of complexity to solve problems you may or
may not have, or even know you had.
What's more complex? An app running in an in-house 4u server racked in the
office's telco closet in the back running off the office Verizon line, or an
app running four hypervisors deep in an AWS datacenter? Which is more complex
to you? What about to your organization? In total? Which is more prone to
failure? Which is more secure? Is the complexity good or bad? What type of
complexity can you manage effectively? Which threaten the system? Which
threaten your users?
COMPLEXIVIBES
This extends beyond Engineering. Decisions regarding what tools we are able to
use (be they existing contracts with cloud providers, CIO mandated SaaS
products, or a list of the only permissible open source projects) will incur
costs in terms of expressed "complexity". Pinning open source projects to a
fixed set makes SBOM production "less complex". Using only one SaaS provider's
product suite (even if it's terrible, because it has all the types of tools you
need) makes accreditation "less complex". If all you have is a contract with
Pauly T's lowest price technically acceptable artisanal cloudary and
haberdashery, the way you pay for your compute is "less complex" for the CIO
shop, though you will find yourself building your own hosted database template,
mechanism to spin up a k8s cluster, and all the operational and technical
burden that comes with it. Or you won't, and make it everyone else's problem in
the organization. Nothing you can do will solve for the fact that you must
now deal with this problem somewhere because it was less complicated for the
business to put the workloads on the existing contract with a cut-rate vendor.
Suddenly, the decision to reduce complexity because of an existing contract
vehicle has resulted in a huge amount of technical risk and maintenance burden
being onboarded. Complexity you would otherwise externalize has now been taken
on internally. With large enough organizations (specifically, in this case,
I'm talking about you, bureaucracies), this is largely ignored or accepted as
normal since the personnel cost is understood to be free to everyone involved.
Doing it this way is more expensive, more work, less reliable and less
maintainable, and yet, somehow, is, in a lot of ways, "less complex" to the
organization. It's particularly bad with bureaucracies, since screwing up a
contract will get you into much more trouble than delivering a broken product,
leaving basically no reason for anyone to care to fix this.
I can't shake the feeling that for every story of technical mandates gone
awry, somewhere just out of sight there's a decisionmaker optimizing for what
they believe to be the least amount of complexity (least hassle, fewest unique
cases, most consistency) as they can. They freely offload complexity from their
accreditation and risk acceptance functions through mandates. They will never
have to deal with it. That does not change the fact that someone does.
TC;DR (TOO COMPLEX; DIDN'T REVIEW)
We wish to rid ourselves of systemic Complexity; after all, complexity is
bad, simplicity is good. Removing upper-bound own-goal complexity ("accidental
complexity" in Brooks's terms) is important, but once you hit the lower bound
complexity, the tradeoffs become zero-sum. Removing complexity from one part of
the system means that somewhere else (maybe outside your organization or in a
non-engineering function) must grow it back. Sometimes, the opposite is the
case, such as when a previously manual business process is automated. Maybe
that's a good idea. Maybe it's not. All I know is that what doesn't help the
situation is conflating complexity with everything we don't like: legacy code,
maintenance burden or toil, cost, delivery velocity.
- Complexity is not the same as proclivity to failure. The most reliable
systems I've interacted with are unimaginably complex, with layers of internal
protection to prevent complete failure. This has its own set of costs which
other people have written about extensively.
- Complexity is not cost. Sometimes the cost of taking all the complexity
in-house is less, for whatever value of cost you choose to use.
- Complexity is not absolute. Something simple from one perspective may
be wildly complex from another. The impulse to burn down complex sections of
code is helpful to have generally, but sometimes things are complicated for a
reason, even if that reason exists outside your codebase or organization.
- Complexity is not something you can remove without introducing complexity
elsewhere. Just as not making a decision is a decision itself, choosing to
require someone else to deal with a problem rather than dealing with it
internally is a choice that needs to be considered in its full context.
Next time you're sitting through a discussion and someone starts to talk about
all the complexity about to be introduced, I want to pop up in the back of your
head, politely asking what does complex mean in this context? Is it lower
bound complexity? Is this complexity desirable? Does what they're saying mean
something along the lines of "I don't understand the problems being solved", or
does it mean something along the lines of "this problem should be solved
elsewhere"? Do they believe this will result in more work for them in a way that
you don't see? Should this not be solved at all, by changing the bounds of what
we should accept or redefining the understood limits of this system? Is the
perceived complexity a result of a decision elsewhere? Who's taking this
complexity on, or more to the point, is failing to address complexity required
by the problem leaving it to others? Does it impact others? How specifically?
What are you not seeing?
What can change?
What should change?