This is the second blog in a series of 3 blogs about the Velocity Conference 2019 in Berlin. Read the first blog here!
Day one Keynotes
The overall theme of today is a shared language between the different roles within your IT operations, something the writer of this blog struggles with on a daily basis.
My love letter to computer science is very short and I also forgot to mail it – James Mickens
A lighthearted opening about blockchain, cryptocurrencies and the problems they don’t solve, next to the old technology that became mature. It’s hard to capture this session in words, but it made me laugh a lot. It went on about morals in the tech industry. If I need to summarize this session in my own words, we should use our tech skills to make the world a better place.
Kubernetes at scale: The good, the bad, and the ugly – Karthik Gaekwad
A talk about the managed Kubernetes platform by Oracle. We recently moved to Azure Kubernetes Service at my current customer, so I was curious about the lessons learned by a team that has built a similar service.
OKE (Oracle Kubernetes Engine, as Karthik calls it) is one of the most popular services on the Oracle Cloud. When developing it, they started with one full-stack team doing everything around the service. They learned that k8s is too big a topic to be handled by one team. Individual developers were building deep experience in different areas of k8s, so the team was split into a Control Plane, a Data Plane and a Platform team.
When you have a small team, there is a constant balancing act between fixing issues and firefighting. Firefighting tends to end in apathy and burnout. This can be fixed by empowering the teams to fix issues, growing the team to lessen the operational burden, and prioritizing bug fixes and features equally. Also, rotate on-call duty between feature teams.
Also, Kubernetes is not always the answer to your problem. When you hold a hammer, everything tends to look like a nail. You should pick the solution that fits your problem, not what everyone else is doing.
Observability: Understanding production through your customers’ eyes – Christine Yen
This is a field that’s changing rapidly at the moment, if you ask me. Christine started with the big difference in priorities and views that ops and dev have in day-to-day operations. Developers care about shipping software quickly. Operations care about reliability. They use different sets of tools to look at application behavior. In 2019 a lot of companies are closing the gap between Dev and Ops. Developers are on call and need to support their application in production. Operations folks are starting to program.
Observability can be summarized as “What is my software doing, and why is it behaving this way?” It gives Dev and Ops a common language to communicate in. According to Christine, the next step is to make observability tooling useful for other roles as well (Sales, Finance, Product, Marketing and Support).
The power of good abstractions in systems design – Lorenzo Saine
A Soviet scientist found that 90% of the problems in engineering had been solved with 40 basic principles. An engineering principle called TRIZ, or “Theory of Inventive Problem Solving”, was mentioned. It is curious to see a lot of problems are already solved in a different discipline.
Often 2000 years ago.
This is a system problem, and not an individual problem. Abstract problems need abstract solutions that might be applicable for different concrete problems.
Lorenzo makes the case that the more broadly the problem space is abstracted, the more existing solutions can be reused. It makes me realize it is often a good idea to take a step back and look at the bigger picture. Keep in mind, though, that the constraints might not be the same for your specific problem.
Secure reliable systems – Ana Oprea
Security and reliability need to be core concepts in your designs. Although it is tempting to set the target for both at “100%”, this is undesirable. It’s very expensive and, most often, your customers don’t need it. Match the design of your product to the risk profile of your customers.
You manage security risks by understanding your adversaries. You need to map the actors and their motives to determine their actions and the target of the attack. Map out what harm these actors can cause. It’s undesirable to make security choices that are not in line with the attack patterns and that only make the operational burden bigger. The security design principles are the usual suspects: least privilege, zero trust, multi-party authentication, auditing and detection, and recovery. It’s interesting to see that reliability design looks a lot like security design. An important step in this is “Zero Touch”: your systems should be managed by automated tooling and not accessed directly by engineers.
It is also important to manage your ‘error budget’. This is the difference between the SLO set by your stakeholders and 100% reliability. You can then weigh the tradeoffs in your application against it. Be aware that insider threats exist as well, either intentional or accidental.
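To make the error budget arithmetic concrete, here is a minimal sketch in Python. It assumes an availability SLO expressed as a fraction and a 30-day window; the function names are my own and were not part of the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: the gap between the SLO and 100%."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("an SLO of 100% leaves no error budget at all")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After about 21.6 minutes of downtime, roughly half the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```

The point of the second function is the tradeoff mentioned above: while budget remains you can afford riskier changes, and once it is spent reliability work takes priority.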
Everything is a little bit broken; or, The illusion of control – Heidi Waterhouse
The more optimized something is, the less room there is for problems. These problems can be classified as your error budget. I’ve learned that planes have an error budget too.
In your daily work, a bunch of abstractions are used. We develop in languages like Java and Go, not assembler. Talking to the CPU directly has been abstracted away; we are standing on the shoulders of giants. Technical debt is inevitable as well. We need to address it and put it on some kind of backlog. Otherwise it’ll rot.
What we build stands upon the reliability of others. Your cloud provider, the writer of your compiler or runtime. We should be more aware of that. There is technical debt in that as well. Heartbleed is a good example of this.
We choose suppliers as well as we can, but in the end we rely on the quality of those suppliers. Heidi goes as far as calling control an illusion.
Let’s learn from this so that we build better systems. Make stuff fail in a correct way without too many side effects. Again: error budgets, SLOs, and harm mitigation. People make bad choices; let’s not let them make fatal choices. This is what circuit breakers are for. Put them in your network, but also in your software. Also, make sure you have a rollback plan that does not require an emergency deployment. Make sure there is layered access to prevent you from destroying your complete application.
A powerful quote: “Everyone’s backend is someone else’s frontend”. This goes on until we are back at the fundamental rocks we taught to think, which was a nice reference to the origin of computers.
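For readers who have not seen a circuit breaker in software, this is a minimal sketch of the pattern in Python. It is an illustration only, not something shown in the talk: the class name, the thresholds and the injectable `clock` parameter are assumptions of mine.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a dependency that is down.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a flaky dependency as `breaker.call(fetch_orders, customer_id)` (both names hypothetical) means that after a few consecutive failures callers get an immediate error they can handle, instead of piling up timeouts.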
Conference day one Sessions
Creating a scalable monitoring system that everyone will love – Molly Struve
An interesting talk! Molly starts by explaining how she started out in a mess of different systems monitoring several things, with a bunch of alerts sent to mail, Slack, SMS and phone calls. She describes how she picked Datadog to consolidate all the other tools, and reduced the alerts to a more manageable number. The benefits are easy onboarding of new employees and a clear idea of what to do for the developers who are on call, because every alert should require some sort of action.
The presentation hits home for me, because I’ve been in that monitoring mess myself, and it’s hard to overstate how much your life improves with a functioning monitoring system that gives a single view over the entire system without the alert fatigue.
eBPF-powered distributed Kubernetes performance analysis – Lorenzo Fontana
This one was a bit far from home to me. It is a talk on how you can use Kubernetes pods to interface directly with the underlying Linux Kernel. Although the support for Kubernetes is up and coming, it looks like it’s already usable to debug programs running in Kubernetes.
I don’t think I’ll be using this soon, but it is very interesting to learn about possibilities of Kubernetes and the Linux kernel which are a bit outside of my comfort zone.
Autoscaling in reality: Lessons learned from adaptively scaling Kubernetes – Andy Kwiatkowski
As I’m solving scalability issues with Kubernetes at my current customer now, I was happy to attend this talk. Andy makes a point that in a traditional datacenter setup, you need to provision for your peak load. In the cloud, this is a waste of resources. This is where auto scaling comes in.
At Shopify, they implemented a custom autoscaler. They do use the Horizontal Pod Autoscaler, which is “included” with Kubernetes. But because a new node needs to be spun up, it can take a couple of minutes before all the Kubernetes pods scale up. This does not match their sudden flash sales, which cause a large increase in requests. During these flash sales the scale-up takes longer, sometimes longer than the flash sale itself runs.
Way. Too. Slow. This is why Shopify wrote and operates their own scaler. They do know when the flash sales run, though, so product managers can schedule a scale-up in advance and the compute is there before the traffic hits the pods. A pretty smart way to do it without an engineer needing to run kubectl manually.
They can also set a target utilization. If the target is 70% busy and the servers are above or below that, the system scales up or down to get back to the target. This keeps working when the system gets more efficient or when conditions push utilization up. A cloud autoscaler for GKE or AKS most of the time scales much more aggressively, which is very expensive. Shopify wrote a tool which uses real-life load data to simulate the autoscaler behavior. They also use an override to scale differently when they run, for example, load tests.
At my current assignment, our load is pretty constant, so we only scale down on weekends and at night to save cost. If the load increases or becomes less predictable, we might need something like this as well, but the capabilities of vanilla Kubernetes will probably be enough.
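Scaling towards a target utilization can be sketched in a few lines of Python. This is not Shopify's implementation, which the talk did not show in code. It is the proportional rule that the Kubernetes documentation describes for the Horizontal Pod Autoscaler, and the function name, default bounds and tolerance value are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 0.70,
                     tolerance: float = 0.10,
                     min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    """Replica count that should bring utilization back to the target."""
    if current_replicas < 1 or target_utilization <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_utilization / target_utilization
    # Ignore small deviations so the system does not flap around the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

With 10 pods that are 90% busy and a 70% target, this asks for 13 pods. At 35% busy it asks for 5. Replaying recorded load through a function like this is, in spirit, what a simulation tool such as the one mentioned above would do.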
Deploying hybrid topologies with Kubernetes and Envoy: A look at service discovery – Lita Cho and Jose Nino
An interesting view from 2 engineers at Lyft about Kubernetes and how to manage a sudden load because people apparently need a lot of cabs.
Envoy is injected as a sidecar in all pods for easy inter-pod communication. It knows about the networking details so your pods don’t have to. All the Envoy processes together form a mesh network which abstracts the network away.
Service discovery is the pairing between logical services and physical network addresses. Before Kubernetes, Lyft used a Python service backed by DynamoDB. This is now done by Envoy. This way of working gives you the capabilities to transparently route traffic, do incremental rollouts and give yourself the capability to roll back. Envoy is able to use the Kubernetes Control Plane and create routes to the correct services it needs to use.
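The definition of service discovery above, logical names paired with physical addresses, fits in a toy registry. The Python sketch below is purely illustrative and is neither Lyft's old Python service nor how Envoy does it; the class, the method names and the addresses are made up.

```python
class ServiceRegistry:
    """Maps a logical service name to the addresses currently serving it."""

    def __init__(self):
        self._endpoints = {}  # service name -> list of "host:port" strings
        self._cursors = {}    # service name -> round-robin counter

    def register(self, service, address):
        endpoints = self._endpoints.setdefault(service, [])
        if address not in endpoints:
            endpoints.append(address)

    def deregister(self, service, address):
        endpoints = self._endpoints.get(service, [])
        if address in endpoints:
            endpoints.remove(address)

    def resolve(self, service):
        """Return one address for the service, rotating through the list."""
        endpoints = self._endpoints.get(service)
        if not endpoints:
            raise LookupError(f"no endpoints registered for {service!r}")
        index = self._cursors.get(service, 0) % len(endpoints)
        self._cursors[service] = index + 1
        return endpoints[index]


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("rides", "10.0.0.1:8080")
    registry.register("rides", "10.0.0.2:8080")
    print(registry.resolve("rides"))
    print(registry.resolve("rides"))
```

Because callers only ever ask for the logical name, the set of addresses behind it can change underneath them, which is what makes transparent routing, incremental rollouts and rollbacks possible.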
The service mesh is a technique I was very doubtful about during this conference, as it sounded unnecessarily complex to me. I’ve seen some more talks later on and my opinion shifted to a more favorable stance on service meshes. At the moment, I would state my opinion as “It Depends”. As most of my opinions are.
Practical case studies are often the clearest way to put a certain technology in perspective. I really appreciated this talk.
M3 and Prometheus: Monitoring at planet scale for everyone – Rob Skillington and Łukasz Szczęsny
We’re using Prometheus at the office, so I’m curious how to move to planet scale. 😉 I wasn’t sure what M3 was before I got into this session.
M3 is built by Uber as a datastore for Prometheus and is designed to scale horizontally. Prometheus is the well-known monitoring system a lot of DevOps professionals use. You can add M3 as a remote read and remote write target in Prometheus, which makes it a good fit as long-term storage for Prometheus metrics. The advantage comes from M3 being able to scale over multiple nodes, whereas Prometheus is bound to a single node storage-wise. I can see how this might solve scaling issues with Prometheus, which we have not run into yet.
First live demo of the day as well. I’d be lying if I said it wasn’t a bit refreshing.