This is the second blog in a series of three about the Velocity Conference 2019 in Berlin. Read the first blog here!
Thursday: Even more keynotes, sessions and the inevitable trip back home
How to deploy infrastructure in just 13.8 billion years Ingrid Burrington (Independent)
A talk about the history of the universe and how we got to computers. It was a very abstract talk, but in the end, I think the gist is that computing is still a very young field and we should work on the future of the technology instead of maintaining the status quo of the systems of today.
Ingrid told us that we, as distributed computing engineers, have more power than we might think, because we can see and understand the scale, which will probably shape the way we compute in the future. And we have to think about the impact it will have on future generations. It's hard to pin down a concrete takeaway from this talk.
The ultimate guide to complicated systems Jennifer Davis (Microsoft)
Jennifer starts with Hitchhiker's Guide to the Galaxy references, which is a win on its own. Companies have trouble figuring out how to move their infrastructure and software into the future and start talking about a bunch of popular buzzwords.
When building new projects, documentation is important. Planning is important, and checklists are too. We need a shared understanding of what we are building. We also need to prepare for trouble and accept that outages will happen in ways we didn't plan for. Jennifer also states that snowflakes are cool, and that they should be described and documented properly. Also, change is inevitable: you will probably end up with something other than what you planned for when you started building, and you will probably have scaling issues. It's important to understand that you need problem-solving skills on the way to your end goal.
It's also important to take care of your own mental and physical health. Watch out for burnout and anxiety, which seem to be a growing problem in our industry.
It's also important to keep learning: about new industry developments, but also from the systems you build. And don't assume you need to do what everyone else is doing.
5 things Go taught me about open source? Dave Cheney (VMware)
I looked forward to this one since I really like Go. I learned it over the summer and I am already applying it at work!
According to Dave, everyone in IT has been impacted by open source software. I know my career has. But since Velocity is not a software development conference, he takes a step back and looks at the circumstances behind why an open source project becomes popular, using Go as an example.
The languages that came before put a high burden on runtime management: you need to manage a language runtime like the JVM or the Python interpreter. Go simplifies this by producing a single binary that runs on a wide range of machines, and cross compilation is dead simple. This makes the first user experience a breeze, and that is really important.
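As an illustration (my own, not from the talk): even a trivial program ships as one self-contained binary, and targeting another OS and architecture is just a matter of setting two environment variables.

```go
// hello.go: a trivial program that still compiles to a single static binary.
//
// Build for the machine you are on:
//   go build -o hello hello.go
// Cross compile for, say, Linux on ARM64 without any extra toolchain:
//   GOOS=linux GOARCH=arm64 go build -o hello-linux-arm64 hello.go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Printf("Hello from %s/%s\n", runtime.GOOS, runtime.GOARCH)
}
```

No runtime or interpreter needs to exist on the target machine; you just copy the binary over and run it.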
Marketing is important too. The gopher mascot is a big recognition point for the language, and the logo is a nice example of how you can contribute to an open source project without writing a line of code, by using other skills. Those contributions should be acknowledged as well.
If you are going to invest time and money in a new technology, that implicitly means you are going to let your current skills lapse. If management asks people to learn new stuff, they also ask them to let their current skills go.
Go got lucky as well: Docker, Kubernetes, the HashiCorp software, all written in Go. It's the same as "writing a hit song": luck and coincidence.
The Go project also makes sure it is very inclusive and has a code of conduct, which is important because it sets a precedent for how your community will behave. Go waited too long to tackle this. Dave advises having a code of conduct in place in your projects from day one; it's required for a healthy, inclusive open source project.
Next to open source, you also need open governance, because otherwise the party holding the governance is a hard group to join, and that's not really inclusive.
Building high-performing engineering teams, 1 pixel at a time Lena Reinhard (CircleCI)
A talk about how building a distributed team is hard. It needs more attention than a local team because there is no watercooler banter, so the team forms less organically. A geographically spread team is the most complex distributed system we know. In order to succeed, you need equal opportunities to contribute, which is based on psychological safety, dependability, structure and clarity, meaning, and impact. After introducing this, Lena breaks it down and shows how we can implement it in our teams.
Trust is built by small actions and moments of connection. Small, "pixel sized" actions. It is important to make room for human interaction in your workspace. GitLab does this by having people schedule remote coffee moments to get to know their colleagues. It's important to define expectations, lift each other up, and express praise. A difficult thing to implement, but a very important one, is making sure feedback is shared. According to Lena, this is hard, but your feedback muscle can be trained.
To make more room for collaboration, it's important to stay humble. This way, you make space for others. Build relationships and make sure 'hero culture' is battled consistently: it puts pressure on individuals and weakens teams, and it's the result of an organizational failure, a statement I agree with. It's also important to show what you don't know, which creates an opportunity to learn and to improve documentation. It is also important to create a culture that encourages experimentation.
Communication is hard, and it's especially hard to communicate bad news. Learn to write better: use crisp, short sentences, be inclusive, avoid cultural references, and add emotion. Make sure you do some editing before you hit send: is the message clear, what expectation are you setting, and what reactions can you anticipate? By making sure you document things, you invest in your future.
The last piece of building great teams is continuity. It’s never done. It’s something you invest in every day. Using connection, collaboration and continuity. One pixel at a time. 😉
Controlled chaos: The inevitable marriage of DevOps and security Kelly Shortridge (Capsule8)
Kelly states that infosec can't remain in a silo and needs to be built into your software delivery pipelines. Starting out with chaos engineering, where we randomly break stuff in order to test our resilience, she argues we should test our security controls in the same way. This means we have to architect our security controls in a way that expects them to fail. This also goes for users causing security incidents.
According to Kelly, we should not try to avoid every security incident; we should hone our skills to respond to them. A good way to do this is to organize game days and use them like planned fire drills. To prevent security incidents, you need to raise the cost of exploiting your systems. We also need to align infosec with being distributed, immutable, and ephemeral.
Service meshes are a great way to raise the cost of an attack: they abstract the network away, and an attacker needs to break out of the mesh to get to the underlying iptables rules. Immutable infrastructure, infrastructure that does not change after it is deployed, provides a lot of security as well. Remove shell access. Patching is no longer a nightmare because your images are version controlled and complete systems get redeployed, and rollbacks are way easier this way too. Ephemeral means our infrastructure has a very short lifespan; infrastructure that can die at any moment makes it hard for an attacker to persist. This raises the cost of an attack by a huge factor.
Another fun exercise: shuffle your IP blocks regularly to confuse attackers and make your infra less predictable. Also, inject errors into your service mesh to test your authentication schemes. Immutable infra is like a phoenix rising from the ashes. The fact that infrastructure can be gone at any time forces attackers to stay in memory. Get rid of the state, get rid of the bugs.
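To make the fire-drill idea a bit more concrete, here is a sketch of my own (not something Kelly showed): a tiny "pod roulette" helper that deletes one random pod in a test namespace using client-go, so you can watch whether the system, and the people on call, recover the way you expect. The kubeconfig path and the "staging" namespace are assumptions.

```go
// podroulette.go: delete one random pod in a namespace as a small chaos drill.
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumes a kubeconfig in the default location; use in-cluster config
	// instead if you run this as a Job inside the cluster.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	namespace := "staging" // hypothetical target namespace, not production on day one
	pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if len(pods.Items) == 0 {
		log.Fatalf("no pods found in namespace %q", namespace)
	}

	// Pick a victim at random and delete it; a resilient system should recover on its own.
	victim := pods.Items[rand.Intn(len(pods.Items))]
	err = clientset.CoreV1().Pods(namespace).Delete(context.TODO(), victim.Name, metav1.DeleteOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("deleted pod %s/%s, now watch how the system (and the team) responds\n", namespace, victim.Name)
}
```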
This talk was an inspiration. I'm trying to move all the infrastructure projects I architect for in this direction. I'm going to watch this talk on YouTube once more. This was my favorite keynote.
Day 2 Sessions
Cultivating production excellence: Taming complex distributed systems Liz Fong-Jones (Honeycomb)
In previous lives, a server was either up or down. Because systems are getting increasingly complex, the state is now somewhere in between. We need to do reliability work, and we are adding complexity. Liz also makes the point that heroism isn't going to work, because the heroes are burning out. Production ownership needs to be defined in a better way. A lot of complexity is added by conference hype: do think about whether you really need Kubernetes! A complex environment is hard to operate, and people tend to forget who operates these systems. It's easy to understate the human factor and the importance of human happiness. Invest in people, culture, and process.
Systems need to be clear and friendly to the people operating them. All the people who have a stake in the system need to be involved, and you need to figure out together how to get there. Create a culture in which asking questions is encouraged and learn about the system together. Make sure you see the difference between essential complexity and unnecessary complexity; the latter is also known as technical debt.
A concept from site reliability engineering is service level indicators and service level objectives. You start by looking at an event in context, which you can map as a customer journey. We need to know whether an event is good or bad, so set thresholds. For example, successful is an HTTP 200 with a response time under 200ms. We also need to check whether the event is eligible: is it a real user, or a botnet?
Then set an SLO and check whether you meet it. The window and the target percentage are important here, so an SLO can be: 99.9% of good events within 200ms over the last 30 days. Then monitor whether the SLO is in danger. Decide how long it will take until you no longer meet the SLO to decide whether you need to page someone at 4 AM. You can also use this to decide whether it's feasible to do a feature release that might impact reliability, or to tell the PM that now is not the time to experiment.
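A minimal sketch of that classification logic, my own illustration rather than anything from the talk, assuming the events have already been filtered down to eligible ones:

```go
package main

import (
	"fmt"
	"time"
)

// Event is one eligible request observed during the SLO window.
type Event struct {
	StatusCode int
	Latency    time.Duration
}

// good implements the SLI: an HTTP 200 answered in under 200ms.
func good(e Event) bool {
	return e.StatusCode == 200 && e.Latency < 200*time.Millisecond
}

// sloMet reports the share of good events and whether it reaches the target (e.g. 0.999).
func sloMet(events []Event, target float64) (ratio float64, ok bool) {
	if len(events) == 0 {
		return 1.0, true // no eligible traffic, nothing to violate
	}
	goodCount := 0
	for _, e := range events {
		if good(e) {
			goodCount++
		}
	}
	ratio = float64(goodCount) / float64(len(events))
	return ratio, ratio >= target
}

func main() {
	events := []Event{
		{200, 120 * time.Millisecond},
		{200, 250 * time.Millisecond}, // too slow: a bad event
		{500, 90 * time.Millisecond},  // server error: a bad event
		{200, 80 * time.Millisecond},
	}
	ratio, ok := sloMet(events, 0.999)
	fmt.Printf("good ratio %.3f, SLO met: %v\n", ratio, ok)
}
```

In practice you would feed this from your observability pipeline over the full 30-day window and watch the remaining error budget, rather than a handful of hard-coded events.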
Iterate on this, and remember that perfect is the enemy of good. You can't predict future failures either, so you need to be able to debug cases in production. Observability goes much further than break/fix; add a human touch to it and include non-engineering departments as well.
Do game days, chaos engineering, train for outages. Lack of observability is a system risk. Lack of collaboration is also a system risk. Production Excellence is the key ingredient to success.
Knative: A Kubernetes framework to manage serverless workloads Nikhil Barthwal (Google)
Serverless is great because there is no hardware for you to manage, but I was always a bit skeptical because it also sounded like the new AWS/Azure/GCP vendor lock-in. Knative is a serverless platform running on Kubernetes, which is cloud agnostic. This all sounds very interesting, since you stay in control of which vendor gets your money while also abstracting the platform away from your developers.
The room was packed, people were sitting on the floor, and the time slot was too short for a proper demo. This talk really encouraged me to learn more about Knative and, above all, to start playing with it, as it looks like a useful abstraction on top of Kubernetes. It feels like it's still too much of an infrastructure platform and not enough of a developer platform, but I figure it is hard to find the right tradeoff between complexity and portability.
Stateful systems in the time of orchestrators Danielle Lancashire (HashiCorp)
Containers are ephemeral and can be destroyed at any moment, but sometimes you want to save your state somewhere. In Linux, you do this by mounting some kind of persistent storage into containers. Managing this is hard, but it's getting better. Danielle starts out with how Nomad and the CAP theorem work, after which she moves on to the Container Storage Interface, intended as an industry standard to connect orchestration platforms with their storage. This API is implemented by the storage provider and is driven from the controller service of the orchestration plane, basically to create, attach, or delete storage.
Secret management is a problem here, as the controller needs a long-lived secret to actually work. After the secret is provided, the storage provider will create, format, and expose the storage to be mounted. This unfortunately requires privileged containers. It will not solve all problems though, since the API does not specify how it should be used by applications.
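To give a feel for the shape of those controller-side operations, here is a deliberately simplified Go interface of my own. It is not the actual CSI gRPC definition, just an illustration of the create/attach/delete lifecycle described above.

```go
package storage

import "context"

// Volume describes a piece of persistent storage managed by a provider.
// This is an illustrative model, not the real CSI types.
type Volume struct {
	ID      string
	Name    string
	SizeGiB int64
}

// Controller is a simplified view of what an orchestrator's control plane
// asks a storage provider to do: create, attach/detach, and delete volumes.
type Controller interface {
	// CreateVolume provisions a new volume of the requested size.
	CreateVolume(ctx context.Context, name string, sizeGiB int64) (*Volume, error)

	// AttachVolume makes an existing volume available to a specific node,
	// so the node-side plugin can format and mount it for a workload.
	AttachVolume(ctx context.Context, volumeID, nodeID string) error

	// DetachVolume reverses AttachVolume when the workload moves or stops.
	DetachVolume(ctx context.Context, volumeID, nodeID string) error

	// DeleteVolume removes the volume and its data permanently.
	DeleteVolume(ctx context.Context, volumeID string) error
}
```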
Interesting: in Kubernetes, if you run a StatefulSet and the node goes away, that StatefulSet will not recover unless you manually (force?) delete the pod! I didn't know that yet.
A nice explanation of a problem space we are still in the process of solving.
Revolutionizing a bank: Introducing service mesh and a secure container platform Janna Brummel (ING Netherlands), Robin van Zijll (ING Netherlands)
I’ve been at a bank in the past. I know how good they are at adopting new technologies and moving away from old patterns. Yeah, not too good, but probably for good reasons. I wanted to attend this talk because I was really curious how they break out of the old paradigms.
It was an interesting summary of ING's journey from a separate dev and ops organization to devops, and after that to bizdevops, with the product managers joining the teams. IT risk and security are of extreme importance. As a result, their environment is largely composed of software they wrote themselves, because they really wanted to keep things under their own control.
ING wanted a secure container platform because the future of banks is unclear. New competitors like Google and Apple are working on payment solutions. ING needs to prove it can add value, and IT plays a big role in that. Views need to be real-time, and they need to deliver relevant functionality. Right now they waste a lot of time on things a customer does not directly benefit from.
ING built a container platform on the public cloud. This is an accelerator, which they used to consume services they didn't want to build themselves. Containers are used because they are vendor independent and very portable. They use a service mesh because it improves observability and security. The public cloud meets their internal SLOs, and they can learn from other parties by adopting more generic ways of solving their problems.
I was most interested in the service mesh. They use it because it's easy to change things like mutual TLS with an update instead of 10 weeks of manual replacement, and it centralizes control. Combined with SRE experience, it improves observability across the application landscape. It's also much easier to expand or introduce traffic or network policies. And IT does not need to spend as much on security, since the service mesh takes care of that. This talk sold me on the service mesh. I'm going to take some training on this.
They use Azure Kubernetes Service, with Envoy/Istio as their service mesh. The advantages are that they can use chaos engineering and zero trust, and that there is no direct access: everything goes through a pipeline.
This talk was reassuring. We’re on the right track at my current customer.
Test-driven development (TDD) for infrastructure Rosemary Wang (HashiCorp)
So many relevant talks! I was tired when attending this, but I might have looked forward to this session more than all the other ones. We want to adopt Terraform and really redo the way we do infrastructure at my current customer, and being able to TEST your actual infrastructure is vital.
In test-driven development, you normally write tests first, then write code for the feature, then refactor. Red, green, refactor. It helps you keep the scope of your feature clean, and it results in more modular, testable code. In infrastructure, by contrast, we do a lot of end-to-end tests with a bunch of manual testing.
I expected Rosemary to use Terratest, but she started out with the Open Policy Agent, validating the infrastructure against a policy. It's easier for operations teams, as they do not need to learn a programming language (like Go, which Terratest uses).
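For contrast, here is a minimal sketch of what the Terratest approach I had expected could look like. The module path, variables, output name, and expected CIDR are all made up for illustration; the workflow is the usual apply, assert on outputs, destroy.

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

// TestNetworkModule applies a (hypothetical) Terraform module, checks one of
// its outputs, and tears everything down again. Red, green, refactor, but for
// infrastructure.
func TestNetworkModule(t *testing.T) {
	opts := &terraform.Options{
		// Path to the module under test; made up for this example.
		TerraformDir: "../modules/network",
		Vars: map[string]interface{}{
			"environment": "test",
		},
	}

	// Always clean up, even if the assertions fail.
	defer terraform.Destroy(t, opts)

	terraform.InitAndApply(t, opts)

	// Assert on an output the module is expected to expose.
	vpcCIDR := terraform.Output(t, opts, "vpc_cidr")
	assert.Equal(t, "10.0.0.0/16", vpcCIDR)
}
```

The trade-off Rosemary pointed out is visible here: this gives you real apply-and-verify coverage, but it asks the operations team to write and maintain Go.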
In the end, testing infrastructure code is hard, but it's a good idea to invest in it so you get a feel for what the blast radius of your changes can be.
Takeaways
This was my first, and also my last, Velocity ever, because it was the last Velocity ever. I've learned a lot about my craft, even from the talks I only reviewed remotely through shared slide decks. The conference had a great amount of depth in architecture and site reliability engineering.
The new conference in 2020 will be "Infrastructure and Ops". I hope to attend that one as well, as I learned so much at this conference that I have trouble putting it into words.
For any remarks, or questions, feel free to reach out! bas.langenberg@syntouch.nl is probably the easiest way to reach me.