1 DAY CONFERENCE

YOW! Hong Kong 2020

Topics covered at #yowhk

Friday, 4th September, Online Conference

7 experts spoke.
Overview

Since 2008, YOW! has brought 200+ International Software Experts from North America, Europe and countries around the world to over 30,000 software professionals in Australia. Now we're bringing them to Hong Kong.

YOW! Speakers are chosen based on their expertise; they provide excellent, technically rich content, completely independent of commercial concerns such as sponsorship or product. This means no advertising promotions, ever, just lots of case studies and stories from the trenches.

Come to this one-day conference to discover the latest trends and network with fellow developers. Hear international software experts share best practices in development and delivery.

Serious software professionals and IT leaders from all across the organisation will benefit from attending. Whether you’re a developer, architect, product owner, team lead, coach, or management, don’t miss this learning opportunity. Sign up for a workshop for an intense learning experience. Network with people who truly care about delivering great software. Meet your favourite authors and bloggers - our speakers have a wealth of experience they’re eager to share with you.

Programme

Linux Systems Performance

Systems performance studies the performance of computing systems, including all physical components and the full software stack to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies to see how it is applied. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud.

Q&A

Question: How does tools installation work in containerised env where runtime images are expected to be lean / distroless?

Answer: There are different ways to deal with the container problem:

One way is to install everything on the host and then debug the containers from there (which I currently do), but that doesn't help end users of containers, since they typically don't have host access.

Another way is to have a "debug container" image with all the tools, which you can spin up so that it shares the same namespaces as the target container, and then give users access to the debug container. This is the sidecar container approach.


Question: how do you get started with flame graphs and how do you compare them to jvm monitoring tools (like connecting to the JVM and profiling cpu?)

Answer: JVM tools that go via JVMTI typically show Java methods only. They make it easier to see the full un-inlined stack, but miss other code paths including GC, libraries, and the kernel.

So I prefer perf/BCC-based profiling so I can see all CPU consumers and code paths. As for getting started: that depends on which target language. I've posted instructions for targets like Java http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java

If it's just a compiled language like C, C++, or golang, then it's easy.

Things like Java get complex, since the JVM compiles methods into the heap without a standard symbol table (only JVMTI understands it), so you need a way to export the symbol table for regular profilers to see it.
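
To make the symbol-table point concrete, here is a minimal Python sketch of how a profiler might resolve an address in JIT-compiled code once the symbols have been exported. It assumes the exported table is a plain-text map with one `START SIZE NAME` entry per line (start and size in hex), which is the convention perf reads from `/tmp/perf-<pid>.map`; the answer above does not name a specific format, so treat that as an assumption. The addresses and method names in the sample are invented.

```python
import bisect


def parse_perf_map(text):
    """Parse lines of the form 'START SIZE NAME' (hex start and size)."""
    symbols = []
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, size, name = int(parts[0], 16), int(parts[1], 16), parts[2]
        symbols.append((start, size, name))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return the name of the symbol covering addr, or None if none does."""
    starts = [s[0] for s in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = symbols[i]
        if start <= addr < start + size:
            return name
    return None


# Hypothetical map contents for illustration only.
SAMPLE_MAP = """\
7f3a10000000 40 Lcom/example/Handler;::doRequest
7f3a10000040 120 Lcom/example/Codec;::decode
"""

symbols = parse_perf_map(SAMPLE_MAP)
print(resolve(symbols, 0x7F3A10000050))
```

Without such a map, a sampled address inside the JIT-compiled region has no name to attach to, which is why the stack frames would otherwise appear as unknown.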

IntelliJ added flame graphs. So did YourKit. And there's a way to add it to JMC. If you have a profiler UI today, check if it supports flame graphs: either it does now or it will. It's just a visualization that's easy to add.


Question: What's the biggest curve ball issue you've ever had to solve? I like war stories

Answer: The problem is a lot of my issues don't sound complex once I've found the answer. :) One ongoing one is one of our major microservices, where some instances will be slightly slower (10% at most). CPU flame graphs point to a particular Java method, but there's no reason why it's slow on some instances and not others. I've used the hsdis package to dump its assembly and found it gets compiled to be a massive method on the slow instances, and is a tiny method on the fast ones -- the hotspot compiler is emitting different instructions. But we still don't know why, and why it's slow. That's gotta be my worst because it's currently unsolved.

I suspect something triggers the hotspot compiler to massively inline methods when normally it doesn't, hence the big size. It might not be a curve ball. Might just be something dumb. Actual curve balls include getting a system where one PCI expansion bus was running at half speed due to a physical manufacturing error. I noticed storage cards had different max speeds based on which slot they were plugged into, then used PMCs to look at bus-level metrics. Another was a 1% performance issue that I debugged and found was due to a server being in a hotter part of a server room, so the CPUs couldn't turbo boost as much.

You can see a lot with PMCs and MSRs, except tooling for them has been lacking. I've published various shell scripts on github that we use at Netflix (pmc-cloud-tools, msr-cloud-tools), but they may only work on the Netflix instances (since I've never tested them for other processors; e.g., ARM).

Oh, and another curve ball was disks that sometimes had bad write performance, and I debugged it by yelling at the disks. You might have seen the video.


Question: have you tried running the short optimised bytecode generated by the fast ones in the slower ones? Maybe it won't run? Hotspot compiler might be detecting some different instruction set and assuming there are instructions missing and generating a bigger code that it believes will work

Answer: I don't know how it'd detect a different instruction set, since it's the same instance type... we get slow instances within the same ASG.


Question: Brendan, Flame Graph needs some patches in libc to reach its full potential, so is musl/Alpine in the plans?

Answer: Well, we're rolling out our own patched libc. Plus I'm hoping to get Canonical to provide an official libc-fp. And I'm also hoping to convince the gcc developers (GNU tool maintainers) to revert the -fomit-frame-pointer default in gcc for x86_64, fixing this for everything (but they want to see performance numbers to understand the regression).


Question: I understand musl is a completely different beast, right? With the increased usage of Alpine Linux, are there any plans to support it?

Answer: Getting flame graphs (or any profiler) to work with musl depends on how it's compiled. gcc's default is -fomit-frame-pointer, which breaks stacks. I don't know how musl is packaged, but one solution is to get the package maintainers to include -fno-omit-frame-pointer in the build process.

We did that for the Netflix libc build, and we're trying to get Canonical/Debian to adopt it for the standard libc package (since we'd rather not do our own builds of things if it can be avoided -- we'd rather upstream, then consume).


Question: Is there much difference in how Linux performance tools work vs Unix or are they pretty similar given the heritage?

Answer: vmstat/iostat/mpstat/top/sar are pretty much the same. The big difference is the newer tracing tools, based on BPF. But these differences don't come often -- so once you learn the basics it'll probably be the same for the next few decades.


Question: When you train new performance engineers - how do you do it and how long does it take?

Answer: I've developed performance classes before, typically 5-day classes. Sun Microsystems once allowed me to create a 10-day class that brought anyone up to speed with systems performance. It was great. But nowadays training isn't the same as it was decades ago, and it's hard to convince companies and people to find the time. My recent class, a BPF perf analysis class at Netflix, is 4 hours -- and that's the longest class in the catalog.

As for how to do it: it's a mix of theory and practice. The best results come from setting up simulated performance issues and having the students try to solve them, without the answers. Such hands-on work is when you really have to think about things and engage the brain. I have a suite of programs that I install on systems as binaries only, and students run them and try to debug their performance.


Question: Having off cpu flame graphs will be very useful for a full picture..?

Answer: Right. There's a bunch of challenges with it. Off-CPU analysis shows that most threads are sleeping most of the time. Imagine a 100-thread Java application and only 4 threads are doing work: your off-CPU flame graph is now 96% stuff you don't care about. So it's important to zoom into the threads doing work.

e.g., I'll find a Java method called something like "dorequest" and then filter on that.
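
As a rough illustration of that filtering step, the sketch below keeps only the stacks that pass through a request-handling frame. It assumes the off-CPU stacks are already in the "folded" text format used as flame graph input (frames joined by `;`, then a space and a count); the stack contents and the counts are invented for the example.

```python
def filter_folded(lines, needle):
    """Keep folded stacks that contain a frame matching needle (case-insensitive)."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; the rest is the stack.
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        if any(needle.lower() in frame.lower() for frame in frames):
            kept.append((stack, int(count)))
    return kept


# Hypothetical folded off-CPU stacks; counts are in arbitrary time units.
FOLDED = [
    "java;Thread.run;Worker.loop;Handler.doRequest;Socket.read 420",
    "java;Thread.run;Pool.idle;Object.wait 9600",
    "java;Thread.run;Worker.loop;Handler.doRequest;Lock.park 180",
]

kept = filter_folded(FOLDED, "dorequest")
total = sum(count for _, count in kept)
print(len(kept), total)
```

In this made-up sample the idle pool thread dominates the raw data, and filtering on the request frame leaves only the blocking that happens while a request is in flight.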


Question: There is lots of discussion of using ML to build autonomic infrastructure. What are your thoughts on this?

Answer: I've seen ML tried for performance analysis using the system metrics as input, and I think it assumes the metrics are good and complete to begin with, which they aren't. I'm worried about garbage in / garbage out. I'd first fix/add metrics to the system, including using new tracing sources, so we had a complete set of metrics (USE method complete) and then feed that to ML.



Brendan Gregg

Sr. Performance Architect
Netflix


Discontinuous Improvement

Continuous improvement is based on balancing what we desire with the practicalities of time and effort. Continuous improvement gives us control of our changes and our situation, guided by feedback and reflection. But change is not always ours to control and, if we are honest, some changes are better carried out discretely than discreetly.

Change from outside can arrive gradually and then all at once. It can be a surprise because we could not have known, or it can be a surprise because we chose not to know. Most changes described as disruptive were there all along and, in truth, are only disruptive because of attitude and entrenchment. But whether the change forced upon us is sudden and catastrophic because it is sudden and catastrophic or simply because we didn't know better, we may find that evolution is forced upon us.

How can we respond?

Q&A

Question: Do you think companies and the people inside of them don't really want to be agile? Change inspires fear and discomfort, are there ways to get people to embrace the uncertainty of true agile?

Answer: I think there is an element of truth here, but that truth is well hidden. While some people explicitly don't want to change, for many that resistance is not so obvious (even to them). Humans are messy, and we often find that we can both want something but at the same time work against ourselves getting it. Fear and discomfort tend to work at this level, so they're harder to see and account for.


Question: Any tips for ways to estimate business value (without breaking laws of physics)?

Answer: Like other estimations, a good tip is to look at the past: both for (i) what did we say? (ii) what did it turn out to be? and for what is the range/shape of variation. That'll give you the appropriate humility and scepticism to temper estimates of future value. And, as with any estimate, don't rely on a single point value on a linear scale to be your estimate. Estimates are probability distributions, so acknowledge that by estimating with a range, three points or a non-linear scale.
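
One way to put the "three points" idea into practice is sketched below. The weighting used here is the classic PERT-style formula, which is an assumption on my part: the answer only says to estimate with a range or three points and does not prescribe a formula. The figures are invented.

```python
def three_point(optimistic, most_likely, pessimistic):
    """Return (expected value, spread) for a PERT-style three-point estimate."""
    if not optimistic <= most_likely <= pessimistic:
        raise ValueError("expected optimistic <= most_likely <= pessimistic")
    # Weighted mean that leans on the most likely value.
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    # Conventional PERT approximation of one standard deviation.
    spread = (pessimistic - optimistic) / 6
    return expected, spread


# Hypothetical yearly value of a feature, in dollars.
expected, spread = three_point(20_000, 50_000, 200_000)
print(f"expected ~{expected:,.0f}, give or take {spread:,.0f}")
```

The long pessimistic tail pulls the expected value well above the most likely one, which a single-point estimate would hide.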


Question: Could we suggest that the reduction in company lifespans is related to missing a big step/radical change at the right time, instead of focussing on continuous everything?

Answer: As mentioned, continuous is often a way of saying discrete with smaller-than-what-we're-used-to steps. So, the steps are getting smaller with time, and the amount of what can happen in a period of time increases with increased connectivity, which is a fairly good description of many trends of the 20th and 21st centuries. So the amount of what can happen in a given time is going up, but our habits/complacency might be preventing exactly that awareness of the right time.


Question: Just wanted to pick up on the idea of 'travelling light' or having less stuff. I've been thinking about this recently, in terms of how do you avoid an ever growing pile of code. I wonder if companies die quicker because we can type quicker now, so they drown in technical debt that stifles their agility faster. Any practical tips on how to deprecate code / systems / features faster?

Answer: Well, typing is not the bottleneck in software development :) But our ability to create and connect more stuff and connect more people to it effectively increases the amount and pressure on the stuff that we create. We are more pressured to add in the short term, but neglect removal and retirement of code, dependencies, frameworks, systems, etc., which is often what is needed for the long term.

I think we have to increase our awareness of the code and the feature set through techniques such as code analytics (see Adam Tornhill's book Your Code as a Crime Scene), ADRs (Architecture Decision Records), marking features as experimental/deprecated/etc. (and meaning it), runtime monitoring (what actually is going on in our code, and what gets used?), dependency management (we need to reduce our depend-on-everything unconditionally approach, e.g., npm) and so on.



Kevlin Henney

Programming · Patterns · Practice · Process


Prioritizing Technical Debt as if Time and Money Matters

Many codebases contain code that is overly complicated, hard to understand and hence expensive to change and evolve. Prioritizing technical debt is a hard problem as modern systems might have millions of lines of code and multiple development teams — no-one has a holistic overview. In addition, there's always a trade-off between improving existing code versus adding new features so we need to use our time wisely. So what if we could mine the collective intelligence of all contributing programmers and start to make decisions based on information from how the organization actually works with the code?

In this talk, you'll see how easily obtained version-control data let you uncover the behaviour and patterns of the development organization. This language-neutral approach lets you prioritize the parts of your system that benefit the most from improvements so that you can balance short- and long-term goals guided by data. The specific examples are from real-world codebases like Android, the Linux Kernel, .Net Core Runtime and more. This new perspective on software development will change how you view code.
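
To give a feel for how version-control data can drive prioritization, here is a minimal Python sketch of a hotspot ranking. It assumes a hotspot score can be approximated as change frequency multiplied by lines of code (a crude stand-in for complexity), and that the input is a list of touched file names per commit, such as `git log --name-only --format=` prints. This is an illustration under those assumptions, not the algorithm used by the tools from the talk; the file names and sizes are invented.

```python
from collections import Counter


def change_frequency(name_only_log):
    """Count how many commits touched each file (one file name per line)."""
    return Counter(
        line.strip() for line in name_only_log.splitlines() if line.strip()
    )


def hotspots(frequency, lines_of_code):
    """Rank files by change frequency times size, highest score first."""
    scores = {
        path: frequency[path] * lines_of_code.get(path, 0) for path in frequency
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log output: three commits, separated by blank lines.
SAMPLE_LOG = """\
src/billing.py
src/api.py

src/billing.py

src/billing.py
README.md
"""
# Hypothetical file sizes in lines of code.
LOC = {"src/billing.py": 1200, "src/api.py": 300, "README.md": 40}

ranked = hotspots(change_frequency(SAMPLE_LOG), LOC)
for path, score in ranked:
    print(f"{score:6d}  {path}")
```

Large files that rarely change score low, and so do small files that change often; the ranking surfaces code that is both big and frequently worked on.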

Q&A

Question: What version control systems does this work on?

Answer: The tools I mention work on Git, but I have used the techniques on SVN, Mercurial, and TFS as well. In that case I simply convert the original VCS into a read-only Git repository that I point the analysis to.

It’s an automated conversion using tools like git-tfs, etc.


Question: Is it fair to say that this kind of code refactoring wouldn't help in architectural changes? E.g. Monolith to cloud native

Answer: I like to think that behavioral code analysis is important during legacy migrations and architectural change too. Two main use cases:

  1. Pull the risk forward: use hotspots to figure out what to migrate first.
  2. Supervise the new system: make sure the new system doesn’t end up with problematic hotspots from the beginning. Use behavioral code analysis as a safety net.

Question: I think a hotspot like this is a useful objective data point to help facilitate the refactoring / rearchitect to avoid the same again.

Answer: I’ve also found that hotspots are useful to communicate between tech and non-tech stakeholders like product managers. That way, we can make our case for refactorings and re-designs based on data and also show visible improvements.


Question: When you're looking at frequency of change, I'm assuming that's only historical. Do you ever use forecasting algorithms, or run model simulations depending on possible business objective changes?

Answer: Yes, I do predictions too. For example, I have an article that explains how CodeScene can predict a future code health decline: https://empear.com/blog/codescene-predict-future-code-quality-issues/

This gives an organization the ability to act early and prevent future issues.


Question: Why doesn't everyone use this stuff? It seems like such a handy tool

Answer: I’ve seen an increased awareness and interest in this space over the past years.

The main issue I have with software is its lack of physics; there’s no way of picking up a software system, turning it around, and inspecting it. I might be biased here, but I do think behavioral code analysis brings us a much needed visibility…within a context.

A tip: when running a behavioral code analysis, I use different history depths:

  1. Technical analyses like Hotspots and Change coupling: ~1 year or ~6 months back since I’m not interested in historical problems.
  2. Social analyses like System Mastery and Knowledge Loss: here I use the full history since I want an accurate map of the contribution history.

Question: This is forensically useful stuff. There’s nothing like a rabbit hole with pretty graphs to keep distracted.

Answer: I look at the visualizations from my own code on a weekly basis. It helps me build and maintain a mental model of what the code looks like.

Some additional distractions: https://codescene.io/showcase


Question: Any chance this will become a sonar plugin?

Answer: I’d like to see it the other way around: static analysis — like Sonar — is useful. But it’s most useful in a context. So for example, once I find a hotspot, I do like to view the static analysis findings for that hotspot. That can provide additional insights.


Question: Are there other hotspot “combinations” that you use to analyse technical debt? You mentioned code complexity/frequency and code complexity/active developer. Any more?

Answer: Yes, there are a bunch of other metrics that I use:

  1. Trends in Planned vs Unplanned work: an increase in unplanned work (e.g. defects, unexpected re-work) typically indicates a growing problem.
  2. Trends in Code Health: a declining code health in a hotspot is likely to indicate debt with high interest.

I have an article that explains the code health concept in more detail here: https://www.linkedin.com/pulse/measure-health-your-codebase-adam-tornhill/


Question: I didn’t quite catch it in the microservices example … was that analysis across multiple repos (one service per repo) or a monorepo? If the latter, is the former possible with the current tooling?

Answer: It was actually across multiple repos. The example had ~35 git repos.

Question: In that example was it that the code had sync/async call outs to the instances to determine coupling?

Answer: No, in CodeScene, we use ticket/issue references from the commit messages for this. If some commit in one repo references the same ticket/issue as commits in another, then there’s a logical connection. If it happens frequently, then there’s change coupling.
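
The idea of linking repositories through shared ticket references can be sketched in a few lines of Python. This is only an illustration of the principle described above, not CodeScene's implementation: the ticket pattern (Jira-style IDs such as `PAY-101`), the repository names, and the commit messages are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumed ticket format: uppercase project key, dash, number (e.g. PAY-101).
TICKET = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def change_coupling(commits):
    """Count, per pair of repos, how many tickets were referenced in both."""
    repos_by_ticket = defaultdict(set)
    for repo, message in commits:
        for ticket in TICKET.findall(message):
            repos_by_ticket[ticket].add(repo)
    pairs = Counter()
    for repos in repos_by_ticket.values():
        for a, b in combinations(sorted(repos), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical (repo, commit message) pairs.
COMMITS = [
    ("orders", "PAY-101 add refund endpoint"),
    ("payments", "PAY-101 support refunds"),
    ("orders", "PAY-102 fix rounding"),
    ("payments", "Fix rounding, refs PAY-102"),
    ("search", "SRCH-7 reindex nightly"),
]

coupling = change_coupling(COMMITS)
for (a, b), shared in coupling.most_common():
    print(f"{a} <-> {b}: {shared} shared tickets")
```

A real analysis would also normalize by how many tickets each repo sees in total, so that a busy repo does not look coupled to everything.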


Some Links:

Software Design X-Rays:  https://pragprog.com/titles/atevol/software-design-x-rays/

My personal blog:  https://adamtornhill.com/

My company blog on tech debt and behavioral code analysis:  https://empear.com/blog/

Code complexity in context:  https://empear.com/blog/bumpy-road-code-complexity-in-context/

Codescene.io and the public showcases on well-known open source projects: https://codescene.io/showcase



Grow To Where We’re Going

As developers, we don't deliver just code. We deliver change in the world.

In software, we want to go fast. We draw a road map, so we can move from here to there. When software is part of a larger system, we want it to go to new places and also support the customers it supports now. Growth is a better metaphor for software change than movement. When we grow to a new place, we still exist in the old place. We are still there for the systems that integrate with us.
 

  • The value of software is in its connections to the wider system: in its use.
  • Growth never stops until death; software is “done” when it’s out of production.
  • We don’t grow -in- an environment; we grow -with- an environment. Influence is healthier than control.
  • Teams are like forests. The whole forest is communicating underground through a network of roots and fungi. The essence of a team’s work is out of sight, in the knowledge we exchange with each other and embed into the software.
  • Developers are like trees in a forest. Sometimes management can’t see the “team” for the “resources.”
  • Trees are green because intensity isn’t everything. To go fast is less important than to keep going.

This is about more than software. As a person, I grow to where I’m going. As a culture, we grow to where we’re going. We don’t control our environment, but we do influence it.

Q&A

Question: I've often thought addressing tech debt in a system is like tending a garden. You'll have a more productive garden if you weed it regularly.

Answer: Yes! Eric Evans has a good talk about software gardening, too.


Question: Do you have an interest in biology that formed the seed (see what I did there) of this talk? Or did you look around for parallels to software development and found biology?

Answer: Oh, fun question. I have an interest in systems thinking, and that often leads to ecology. This one came out of a paper about using plant growth as a model for robotics.


Link: jessitron.com/plants



Jessica Kerr

Principal Developer Evangelist
Honeycomb.io


Who do You Trust? Beware of Your Brain

Cognitive scientists tell us that we are more productive and happier when our behavior matches our brain’s hardwiring—when what we do and why we do it matches the way we have evolved to survive over tens of thousands of years. One problematic behavior humans have is that we are hardwired to instantly decide who we trust. We generally aren't aware of these decisions—it just happens. Linda explains that this hardwired “trust evaluation” can get in the way of working well with others. Pairing, the daily stand-up, and close communication with the customer and others outside the team go a long way to overcome our instant evaluation of others. As Linda helps you gain a better understanding of this mechanism in your behavior and what agile processes can do to help, you are more likely to build better interpersonal relationships and create successful products.

Q&A

Question: Was the clogged water pipe so effective because it was a common threat to both groups equally? in the sense that “my enemy’s enemy is my friend”?

Answer: The clogged water pipe was not a common enemy. In fact, the idea that we would be united against a common enemy is proving to be false. Witness the current pandemic. It is a common enemy but it has not united us. Sad but true.


Question: how important is the language used when trying to reduce dividing lines between groups? "us" and "ours" vs "them" and "theirs", is the language important, or a trivial factor?

Answer: Words matter. Language is important. That's why females make such a big deal about the use of "mankind" :)


Question: Are “agile” workplaces really “better” environments in this regard? isn’t the whole agilist industry reliant on maintaining the “us vs them”/“business vs tech” divide in order to justify continued investment in specialist roles and career paths (scrum master, release train engineer, agile coach, trainers, book authors, conference speakers etc)?

Answer: I guess I mean "agile" in the best sense of the word. I'm an idealist. I also know that agile has been translated to a lot of settings that are definitely not agile.


Question: Thank you for your talk, Linda. I would like to know your views on possibility of using “us” vs “them” positively for healthy competition, without making it “us”.

Answer: It's really hard to have "healthy" competition. I can recall several retrospectives where I was called in to facilitate a contentious meeting about t-shirts! "They" got t-shirts and we didn't even though we worked just as hard as they did! T-shirts? Yep. T-shirts. I think you have to be careful with team identities, language, and symbols.


Question: I am not sure if this is worldwide or just in Australia but pairing is not as common as (I reckon) it should be - what do you think is the impact of dropping this practice on trust levels in "Agile" (Scrummy) workplaces?

Answer: As I said, I think pairing is the least appreciated but most powerful of the agile practices. It has been around for decades, so it's not new. It has measurable benefit and even though there is evidence (as opposed to many agile practices) of its effectiveness, it's not widely used. Sad :(


Question: "catch them doing something right" is a well used phrase in teaching as well. Linda, how much crossover did you find between teaching at a university and working in the business world in terms of peopleware?

Answer: My university teaching was at a university that worked closely with industry, so there was a lot of overlap. In many academic environments, this is not the case. I was lucky.


Question: Interesting that all the studies were on “boys”… Linda were there experiments that were girls or a mix of genders?

Answer: Sorry about only using experiments on young boys. I always get that comment ! The experiments have been done in several cultures with different genders and ages. I think I like these two because they are similar but so different.


Question: In all I do I’m learning more and more that, generally speaking, we are emotional creatures much more than cognitive ones. Do we need to reach higher for cognition rather than settle in with emotions?

Answer: We are not primarily rational creatures. Read "Thinking Fast and Slow" -- or just search for my talk on YouTube. Daniel Kahneman won the Nobel prize for his work in behavioral economics that illustrates how flawed our decision making is. This doesn't mean we are never rational -- it's just that the driver (System 1) is in control.

The evidence suggests that we need that emotional component to make any decision. Studies show that brain-damaged people who lack the areas of the brain tied to emotion cannot make even simple decisions. System 2 is good at laying out the logic but System 1 is in charge. And that, for the most part, works well. Every now and then System 2 has to override a System 1 decision. I really wish we could meet face to face. I never feel I'm doing a good job of anything by typing! Thanks again. Be well and stay safe!


Question: Have you ever encountered (or heard of) a situation where all the tools you could bring to bear, could not get the team to "unlearn" these habits of mind?

Answer: Oh yes. When I worked on the 777 airplane -- I was supposed to be the resident Ada guru and I struggled mightily to get people to move from FORTRAN and Pascal. Of course, that was long before Fearless Change. I wish I had known more when I was younger -- a common regret among older folks :(.


Question: what advice would now-Linda give to past-Linda?

Answer: The biggest piece of advice is: don't rush in and tell people stuff -- take time to listen, and don't believe that because they seem to be resistant they are stupid. I have made that mistake so many times and I'm making it now in my community, watching the political signs go up in my neighbors' yards.


Question: Not entirely related to the talk... but I'm interested on your thoughts about the tech environment these days for women and whether you think it's changed much since you started. Seems like there's still a lot of cases of "I'm the only woman on the team, and..."

Answer: The world is so much better for women -- not just in the tech environment. Soooooo much better! In my first job interview (1963) I was asked what kind of birth control I was using and what plans my husband had (he was then a sophomore medical student) after graduation. That would not happen today. We don't want the world to swing too far in the other direction. One of the worst teams I worked with recently was all women. Not good.

(Balance & Diversity lead to better outcomes.) That's what the evidence shows. Diversity -- people with different genders, races, educational background, experience, and, yes, political persuasions.

I hear many "agile organizations" say they are hiring for "culture fit" -- which I think is code for "just like us" -- when that should never be the goal. Unfortunately, we like people who are like us and believe that's the best way to work. Lots of evidence shows that homogeneous teams are not as innovative, creative, productive, or efficient!


Question: Sometimes those same companies hire for “diversity”. Can you have both at the same time?

Answer: That's good. Sometimes they do say that, but they mean something like a little bit of strawberry with the vanilla, not a radical Moose Tracks :) . And, really, this is hardwired. We spent thousands of years in small groups (no more than 150) of people we had known all our lives. We feel comfortable with that. These nice people we know well. Not too upsetting.

I hate it when my husband calls out some biased statement and says, "See, you do it, too!"! :). That's Kahneman's biggest argument for diversity. We can't see our own biases, but teams of diverse individuals have some hope of overcoming at least some of those distorted viewpoints.



Linda Rising

With a Ph.D. from Arizona State University in the field of object-based design metrics, Linda Rising’s background includes university teaching and industry work in telecommunications, avionics, and tactical weapons systems.


The Secrets of High Performing Organizations

How can we apply technology to drive business value? For years, we've been told that the performance of software delivery teams doesn't matter―that it can't provide a competitive advantage to our companies. Through six years of groundbreaking research, the DORA team set out to find a way to measure software delivery performance―and what drives it―using rigorous statistical methods. This talk presents the findings of that research, including how to measure the performance of software teams, what capabilities organizations should invest in to drive higher performance, and how software leaders can apply these findings in their own organizations.

Q&A

Link:

http://bit.ly/dora-bfd

https://cloud.google.com/solutions/devops/devops-culture-westrum-organizational-culture

bit.ly/hsbc-devops


Question: I’ve found it really hard with most of the technologies we use (Java/Typescript) to have (and maintain) a set of tests that are fast enough to run on every commit. Do you have any tips?

Answer: I recommend checking out this blog post from my former TW colleague Dan Bodart: http://dan.bodar.com/2012/02/28/crazy-fast-build-times-or-when-10-seconds-starts-to-make-you-nervous/

He did a talk too: https://www.youtube.com/watch?v=nRDlYvIbSBU


Question: For Lead Time -> do you consider time to get to a dev team? i.e time from conception of an idea to production? or just from commit to production?

Answer: Just commit to production. The "fuzzy front end" is a fundamentally different domain (product development) where high variability is a good thing. The delivery domain is supposed to be low variability.
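
To show what "commit to production" means as a measurement, here is a small Python sketch that computes lead time per change and summarizes it with a median. The timestamps are invented, and the choice of the median as the summary statistic is an assumption for the example rather than something stated in the answer.

```python
from datetime import datetime
from statistics import median


def lead_times_hours(changes):
    """Hours from commit to production deploy for each (committed, deployed) pair."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]


# Hypothetical (commit time, production deploy time) pairs.
CHANGES = [
    (datetime(2020, 9, 1, 9, 0), datetime(2020, 9, 1, 11, 0)),
    (datetime(2020, 9, 1, 14, 0), datetime(2020, 9, 2, 14, 0)),
    (datetime(2020, 9, 2, 10, 0), datetime(2020, 9, 2, 16, 0)),
]

hours = lead_times_hours(CHANGES)
print(f"median lead time: {median(hours):.1f} h")
```

Note that the clock starts at the commit, not at the idea: time spent in the product-discovery phase is deliberately left out of this measure.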


Question: Have you captured “age” as part of your “firmographics”? Age is sometimes equated to maturity but does it correlate with organizational performance in any way?

Answer: You mean age of the organization? No, we haven't looked at that. I have doubts as to its reliability as a measure because so much of performance is team level, and the "age" of a team is super fuzzy - how do you control for the impact of re-orgs, people leaving and joining, changes in management etc. Determining if there is any evidence that organizations get better or worse over time (when corrected for other factors) would require a longitudinal study, which is a good idea, but we don't do that.


Question: How well do you think your “secrets” could be applied to government? Do you think it could improve the management of pandemics?

Answer: We actually got a nice message from a team in the NIAID saying the changes they'd made as a result of the assessment we did there had helped them move more quickly when COVID hit. So yes, I think it can. And my personal experience working on the cloud.gov team is that these ideas definitely work in government. (Also, we find in our analysis that the results apply equally well for respondents who say they are in government).


Question: Will there be a DORA State of DevOps Survey and Report in 2020?

Answer: We have some things in the works but there's nothing I can talk about at the moment. Nicole is now at GitHub, and she released something related to your question back in May: https://github.blog/2020-05-06-octoverse-spotlight-an-analysis-of-developer-productivity-work-cadence-and-collaboration-in-the-early-days-of-covid-19/


Question: Is it possible to have highly aligned, loosely coupled teams working on the same delivery? Is scaled agile advisable? Does it come down to how you slice up the work so you aren't stepping on each other's toes?

Answer: It really depends what you mean by "same delivery". If you mean working towards the same outcome: yes, and Scaled Agile isn't necessary. If things are tightly coupled, that is going to be harder, and Scaled Agile can help as a countermeasure, but ultimately it can only get you so far: you really have to invest in decoupling organizational structure (small, cross-functional, autonomous teams) and enterprise architecture (making sure you can independently test and release your services - in other words, having a true service-oriented architecture).


Question: Is chaos engineering a core part of Google Cloud too? Just had that question when you shared the DiRT remarks.

Answer: With the methodology we use to do the research, we can only really look at practices that are reasonably widely adopted. Chaos engineering is still very emergent. It's definitely very closely related philosophically though. Oh sorry, you're asking about whether we do chaos engineering of the Netflix flavor at Google in particular - I am still very new to SRE so I don't actually know the answer to that, sorry.


Question: Asking for a friend: organisational transformation through building communities and seeding PoCs takes time for the gnarliest of problems. It doesn't look like anything's happening, whereas rolling out Best Practices from a Centre of Excellence looks and sounds like Great Strides are being taken. Even if the grassroots efforts are regularly communicating progress and success, if the problem is hard enough that it takes effort to understand the subtleties, then most decision makers in the organisation - who are busy with other things because of the trying-to-do-it-all-at-once problem - will buy into the easy, just-do-what-the-Centre-of-Excellence-tells-you-to-do option. Any advice on the most effective lever to attempt to pull here? Get the communities of practice shouting louder, tell the CoE to learn patience, or (somehow) reduce WIP across the org?

Answer: (Nick) Make the work visible. Then people can discuss the work in the CoPs and you get real synergy. CoEs are pretty much useless IMHO (see DORA metrics 2019). But you have to be careful not to let management co-opt the process.

My favourite Deming quote is "when a measure becomes a target then it stops being a measure". So, for example, if you make the number of passing/failing tests visible at each level, some bright spark will put a 90% coverage target on it or something. Every team will then hit 90%, no matter what it takes, rendering the measure useless.

(Jez) Targets of this kind can work provided they are set by the team. What leads to the behavior you describe - and which I have certainly seen - is targets set from above.

(Nick) Also on reducing WIP, I'd say the same - make the work visible. Often it isn't visible, so if you make it visible the burden becomes obvious and people start to talk about it. One way you can provoke this is to make batches smaller (which leads to faster cadence). In big batches it's easier to hide the WIP and the overhead it produces (see Don Reinertsen's "Flow"). It will also expose bottlenecks where the WIP piles up. In Lean they call this "lowering the water level to see the rocks".

(Jez) In addition to what Nick said, I think there's also a cultural change that needs to happen. Managers and execs need to see their role less as making stuff happen, but rather as creating a culture where the people doing the work can get stuff done, removing obstacles, helping people acquire the necessary skills to succeed, increasing alignment and transparency.

One way I've seen to demonstrate progress is for management and execs to invest in community events, like an internal devopsdays where everybody gets a day off and - crucially - the people doing the work get to set the agenda, including an open space where anyone can talk about what they've been doing.
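
Nick's point about WIP can be illustrated with Little's Law, which for a stable system relates long-run averages: average lead time = average WIP / average throughput. The sketch below is only a back-of-the-envelope calculation; the throughput of 5 items per week and the WIP levels are hypothetical numbers chosen for the example, not figures from the talk.

```python
def average_lead_time(wip: float, throughput: float) -> float:
    """Little's Law: average time in the system = average WIP / average throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


# Hypothetical team that finishes 5 work items per week on average.
for wip in (30, 10, 5):
    weeks = average_lead_time(wip, throughput=5)
    print(f"WIP={wip:>2} -> average lead time {weeks:.1f} weeks")
```

With throughput held constant, cutting WIP from 30 items to 5 cuts the average lead time from 6 weeks to 1, which is one reason a visible pile of WIP tends to start a conversation.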



Adam Tornhill

Author
Software Design X-Rays


A Humane Presentation about Graph Database Internals

Databases are everywhere, but did you ever wonder what goes on inside the box? In this talk we’ll dive into the internals of Neo4j - a popular graph database - and see how its designers deal with distributed systems challenges now and in the future. Borrowing heavily from the academic literature, we'll see why computers are far too easy to program and why, conversely, distributed systems are far too hard. We'll follow that with some approaches to making distributed systems safer and contrast that with conflicting approaches that make systems more scalable! If that doesn't sound nightmarish enough, we'll finish up by showing how we can build systems that are safe and scalable by borrowing and gluing together a bunch of ideas from folks who are smarter than me. Come experience the last 10 years of my harrowing day job in less than an hour. You might even enjoy it, or at least empathise!

Q&A

Question: How critical is clock sync?

Answer: Clock sync is not important for these classes of algorithms. If we had accurate distributed clocks (and they're coming) then we could make many simplifying assumptions about ordering and perhaps use simpler algorithms where coordination is reserved for worst-case scenarios (see: Google Spanner).

I understand hardware manufacturers will soon be able to put accurate GPS clocks on commodity servers. That was Google-level science fiction until recently.


Question: Awesome progress over the last few years - “amazing and terrible” - tell us the terrible bits?

Answer: Distributed algorithms are really sneaky. You think you designed one that works, but then there's some really, really subtle thing that means in reality the whole thing is unsafe.

Repeat that numerous times.

Of course now we think we have several that work, but that nagging doubt that you've missed something never leaves.


Question: How do you go about testing these?

Answer: We use proofs, which is OK. We'd like to move to using automated model checkers (e.g. TLA+) but we find the effort of doing that is about as high as building the software itself. Perhaps that's because we're unskilled at formal methods?


Question: Has this been used in any real-world application?

Answer: Yeah, the Linear Transactions protocol has been used in a KV store. But the mixture of LT and Raft I think is novel.

One other terrible thing is communicating with management around deadlines. It's hard to explain why getting two computers to agree is so hard and takes so long! Honestly, it does sound easy to get two computers to agree on a number, right?


Question: Do you think these Spanner-type databases will soon become a more mainstream choice because you no longer need to trade off consistency for scalability and availability?

Answer: I think for cloud providers, yes. They have the hardware, and while there definitely are trade-offs, ownership of the full stack means you can actively manage those trade-offs.


Question: Now we can only hope that Spanner becomes more economical for smaller use cases. But there are other hosted options like Yugabyte and Cockroach.

Answer: Yeah, those are interesting. Cockroach in particular has a novel transactional setup (Raft again) that obviates the need for clocks, but trades off Spanner performance in the edge cases.


Question: I love all this stuff... I'm crazy enough for my mind to idly contemplate coordination challenges while going through the motions of life but I'll never be smart enough to contribute to the field. Thank you very much Jim.

Answer: You have the smarts. It's no different from single-computer programming in a way: you have to wallow in the field, spend time being petrified, and then take tentative steps into building terrible algorithms. Over time, you get a bit better.


Question: Are the Raft capabilities available via projects like SpringData-Neo4j, or only when working directly with the drivers?

Answer: They're built into the database servers, so however you access the database, this infrastructure is activated (unless you explicitly use a single server). The causal clustering protocol is handled by drivers as you hint, but those drivers are used by Spring Data, so you're all good.


Question: General graph DB question (for someone who hasn't used them in anger) - have you ever had a situation where you thought - hmmm I can't do this in a graph DB I need to switch back to a common-or-garden-variety DB?

Answer: Yeah, graphs are pretty general purpose, but I wouldn't use one for bulk storage, blob storage, etc., because there aren't relationships in those kinds of data to be exploited.

Looking back at the times I used RDBMS when I worked for ThoughtWorks (hi, folks) it seems to me that most of those times a graph would have been a more humane and performant tool to use, if only they'd been around.


Question: I guess then conversely: what are the signs to look out for that would let you know that you should be considering a graph DB? The London Underground example perhaps obviously looks like something with nodes, but maybe people just don't know when or how to change their view of a problem?

Answer: E.g. if you have an RDB that is join-heavy, that's probably a graph.

We call this the "graph problem problem" at Neo4j, and we're trying to solve it through education. E.g. there's a new, free "For Dummies" book out on graph dbs that you can get from the Neo4j web site.
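
To make the "join-heavy" signal a little more concrete, here is a small sketch in plain Python rather than in Neo4j or SQL. The station names and connections are made up for illustration (they are not a real Underground map), and the breadth-first search only stands in for the kind of multi-hop traversal a graph database is built for; in a relational schema each extra hop would typically be another self-join on a connections table.

```python
from collections import deque

# Hypothetical network: station -> directly connected stations
connections = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def shortest_route(graph, start, goal):
    """Breadth-first search returning a route with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        station = path[-1]
        if station == goal:
            return path
        for neighbour in graph.get(station, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal is not reachable from start


print(shortest_route(connections, "A", "E"))
```

If the questions you keep asking of your data look like this one (how is X connected to Y, and through what?), that is a hint the data is graph-shaped.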


Question: I think the consensus algorithms you went through were general case … Is there any scope for consensus algorithms tailored to specific contexts to achieve better X,Y,Z? Don’t know if I have a specific example.

Answer: I don't know of any off-hand, but I suspect you might be able to take advantage of knowing the number of parties to coordinate, or timeliness (e.g. coming from compliance), etc. to simplify/streamline the protocol.


Question: I work on the client side of distributed simulation but have thought about what distributed systems concepts go into the simulation algorithm. Consensus around time is important, as is some knowledge about how many messages each connected simulation application has sent.

Answer: Oh very cool. I've never done any distributed simulations, but did some work over the last few years simulating what goes wrong when your consensus (actually your consistency model) is weak. Spoiler: it turns to poop.

In that setup, we used an (expensive) simulation to calibrate a (cheap) numerical approximation for figuring out how quickly your data would rot. Quite useful.



Jim Webber

Dr. Jim Webber is Chief Scientist with Neo Technology, the company behind the popular open source graph database Neo4j, where he works on R&D for highly scalable graph databases and writes open source software. His proven passion for microservices ecosystems and REST translates into highly engaging workshops that foster collaboration and discussion.

