Biology of Distributed Information Systems: Team Meshes and Agility at Scale

1. Introduction

Software development is a team sport. Whereas programming may be considered an individual practice or skill, developing a piece of software requires the strength and versatility of a team. This is one of the least disputed statements about software development: every book that talks about Agile, Lean Software, DevOps, etc. emphasizes the importance of cross-functional autonomous teams. I could have quoted any book that I have reviewed in the past 10 years in my two blogs; I decided to pick “Team of Teams” by General Stanley McChrystal as an illustration, because it is a foundational book about the importance of autonomous, self-organized teams. Teams should be “long lived”, with a stable membership that is critical to capitalize on collective learning, but they are living objects that also require constant adaptation, with members coming in and out, as is rightly pointed out by Heidi Helfand in “Dynamic Reteaming”.

Agility at scale is a necessity. This is a disputed claim, since some agilists believe that one should try to avoid scaling by enforcing strict modularity and keeping software development at the scale of a well-functioning agile team. Unfortunately, reality shows that this ideal of modularity is hard to reach in most cases. Even if the system to be built, grown or maintained is decomposed into smaller units – microservices, to pick a fashionable example – the orchestration or synchronization of teams remains an issue. This is precisely what “agile at scale” frameworks such as SAFe are attempting to address. These frameworks should not be seen as “lean/agile methodologies in a box”; they are toolboxes to address the pains of scale (as I will point out in the conclusion, no one disputes that agility principles are easier to maintain at small scale).

The common thread for this blog post is that teams should be organized into a mesh of teams. This means that the “team of teams” or the “community of teams” requires some form of organization, but a form that is dynamic, scalable, self-organized from principles and highly adaptive. The mesh metaphor is a classical pattern from networking: mesh networks are precisely distributed, scalable and self-organized local area networks that emerge from connecting principles. To quote Wikipedia, “connections are direct, dynamic, non-hierarchical”. In the context of software development at scale, the “mesh” metaphor is a tool to emphasize a few key ideas: governance is distributed (versus centralized), self-organization comes from principles that are applied locally (the global behavior emerges, versus top-down design), and the structure adapts continuously to the environment (it is dynamic as opposed to static or rigid). On the other hand, a mesh is much more than a community of teams, because we have a structure, topology principles and connecting rules.

This post is organized as follows. The next section is a short book review of “Team Topologies”, which gives an interesting blueprint for team orchestration. This book focuses on team interactions, to favor collaboration and reduce cognitive overload (which is the pressing issue if orchestration is obtained by maximizing communication). The authors define four types of teams and propose interaction patterns associated with the goals and the type of each team. The resulting mesh self-organizes to maximize flow and to optimize the management of change. Section 3 proposes a book review of “Data Mesh”, which addresses the question of running distributed data flows and processing at scale. A data mesh is both a distributed system mesh and a mesh of teams. The contribution of Zhamak Dehghani is to provide a framework (from mental model to tools and practices) to address the issue faced by large companies that run multiple flows between many sources and many consumer data platforms. I will return to agility at scale in the conclusion to show how the principle of team meshes defines a “sweet spot” between two impractical ideals: the ideal of modularity with autonomous teams that “do not need to talk much to each other” and the ideal of community where each team is perfectly aware of the goals/purpose of each other team.

2. Team Topologies

The book “Team Topologies: Organizing Business and Technology Teams for Fast Flow”, by Matthew Skelton and Manuel Pais (with a foreword by Ruth Malan), has quickly become a “must read” because it addresses the key question of team coordination and because it proposes a relevant toolbox that is both practical and applicable. The main purpose is very much aligned with the lean tradition that I promote in this blog: “optimize for fast flow across the whole organization, not just in small parts”. This is clearly a “systems thinking” book, with deep insights gathered both from multiple experiences with team management and from careful analysis. I appreciated the emphasis on feedback loops, where the structure for team interaction is guided by continual experimentation and learning (“sensing and feedback from every team interaction”). As a book about the management of teams, it logically pays a great deal of attention to job satisfaction, making sure that the management principles and the overall complexity of a large-scale development organization do not result in conditions that create job dissatisfaction, with a great reference to “Drive”: “This is not surprising if we consider Dan Pink’s three elements of intrinsic motivation: autonomy (quashed by constant juggling of requests and priorities from multiple teams), mastery (“jack of all trades, master of none”), and purpose (too many domains of responsibility)”. As a side note, it still surprises me to see companies that aspire to become major digital players and yet ignore the key findings from motivation science. This is a great book to read because it is thought-provoking. I did not agree with everything I read, especially the need for “radical departure from the past” or the idea that “Conway’s Law” is an absolute truth and that “large, up-front designs by software architects are doomed to fail unless the designs align with the way in which the teams communicate”. Reality is always more complex and richer than the abstractions found in books (including mine 🙂).

2.1 Principles Derived from Conway’s Law

Conway’s law states that the system produced by a set of teams is organized with an architecture that mimics the organization of those teams, because the organization dictates how easily they communicate. The purpose of the book is thus to derive organization principles about how the teams collaborate with each other to produce a better software system: “Team Topologies addresses the design of the software development organization, with Conway’s law in view”. By better, the authors mean more adaptable and produced faster (reducing lead time) while retaining quality (from safety to resilience): “Team Topologies focuses on how to set up dynamic team structures and interaction modes that can help teams adapt quickly to new conditions, and achieve fast and safe software delivery”. The “and” is important: “Businesses can no longer choose between optimizing for stability and optimizing for speed”. The authors often speak of “Reverse Conway” when they mean adapting the team topology to the intended system architecture: “the organization is set up to match the communication paths needed in the software and systems architecture”. The book starts, therefore, by investigating how teams collaborate and defines three patterns: (full) collaboration, which requires a lot of exchanges for full synchronization; “as a service”, which is asymmetrical and corresponds to the “agency model” (when one team performs a task on behalf of the other one); and facilitating (also asymmetrical), when a team is a facilitator/enabler of the other team’s activity: “The remit of the team undertaking the facilitation is to enable the other team(s) to be more effective, learn more quickly, understand a new technology better, and discover and remove common problems or impediments across the teams”.

Understanding the level of implied communication and the direction of the information flows helps to see the benefits and drawbacks of each interaction pattern. Innovation, for instance, is easier with more communication: “By design, innovation across the boundary happens more slowly than with collaboration, precisely because X-as-a-Service has a nice, clean API that has defined the service well”. The “(X-)as a service” interaction model requires a formalization of the “agency” (that is, of what is expected by the “principal” team): “The X-as-a-Service team interaction mode is suited to situations where there is a need for one or more teams to use a code library, component, API, or platform that “just works” without much effort, where a component or aspect of the system can be effectively provided “as a service” by a distinct team or group of teams”.

2.2 Team Orchestration and Complexity

The focus on communication and organization is dictated by the complexity produced by scale. Obviously, communication between team members is always required, useful and time-consuming, but when the size of the overall organization grows, communication becomes harder and hence more of a bottleneck (which is precisely the reason for Conway’s Law). The goal of the book is to address this issue and the cognitive load that results when the amount of communication required of a team member exceeds what is comfortably manageable: “As the complexity of the system increases, so, generally, do the cognitive demands on the organization building and evolving it. Managing cognitive load through teams with clear responsibilities and boundaries is a distinguishing focus of team design in the Team Topologies approach”.

This focus on cognitive load is critical to achieve speed and agility at the same time, while adapting constantly to new technology and business conditions (which is the rule of the game in the VUCA 21st century): “When cognitive load isn’t considered, teams are spread thin trying to cover an excessive amount of responsibilities and domains. Such a team lacks bandwidth to pursue mastery of their trade and struggles with the costs of switching contexts”.

2.3 Team Topologies in Four Patterns

The book “identifies four team patterns, describing their outcomes, form, and the forces they address and are shaped by”:

The “complicated-subsystem” team is a team that works mostly amongst itself, where most of the time is spent “internally” building a subsystem versus working on the interfaces. We could say that it fits a “legacy subsystem mode”.
The “platform team” builds a component that is used much more freely, and more easily, by other teams so that this component may be seen as a “platform”. A platform is defined by its clear interfaces, the APIs, that help the modularity of the decomposition (a complicated subsystem is also a sub-system, but without the benefits of decoupling implied by the platform label).
The “stream-aligned” team is somewhat similar to a feature team that develops an “end-to-end” value stream (or user experience) while relying on platforms produced by other teams. From a value generation point of view, these teams are at the top of the value chain. As stated by the authors: “The purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy”.
The “enabling” teams build the glue of the “team mesh”; they play a facilitator / enabler / assistance role to help the other teams work better. As noted by the authors, “the feature-team/product-team pattern is powerful but only works with a supportive surrounding environment”.

These four patterns are combined to produce a topology that favors change (adaptation) by better managing flow: “Overall, the Team Topologies approach advocates for organization design that optimizes for flow of change and feedback from running systems”. The level and the type of communication for these patterns are quite different, so they should drive the kind of work environment that is being proposed: “However, different people need different environments at different times to be productive. Some tasks (e.g., implementing and testing a complicated algorithm) might require full concentration and low levels of noise. Other tasks require a very collaborative approach (e.g., defining user stories and acceptance criteria)”. I have used the expression “mesh of teams” to emphasize the necessity of keeping a dynamic vision of the way teams collaborate with each other. This is stated explicitly by the authors: “The topologies became an effective reference of team structures for enterprise software delivery; however, they were never meant to be static structures, but rather a depiction of a moment in time influenced by multiple factors, like the type of products delivered, technical leadership, and operational experience. The implicit idea was that teams should evolve and morph over time”.
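The four team types and the three interaction modes described above can be captured in a small data model. The Python sketch below is only an illustration under my own assumptions: the class names, the example teams and the idea of counting collaboration links as a crude proxy for communication load are invented for this post and do not come from the book.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TeamType(Enum):
    STREAM_ALIGNED = "stream-aligned"
    PLATFORM = "platform"
    ENABLING = "enabling"
    COMPLICATED_SUBSYSTEM = "complicated-subsystem"


class InteractionMode(Enum):
    COLLABORATION = "collaboration"    # symmetrical, communication-heavy
    X_AS_A_SERVICE = "x-as-a-service"  # asymmetrical, through a defined API
    FACILITATING = "facilitating"      # asymmetrical, one team helps another


@dataclass(frozen=True)
class Interaction:
    source: str  # for asymmetrical modes: the providing or facilitating team
    target: str  # for asymmetrical modes: the consuming or assisted team
    mode: InteractionMode


# Hypothetical teams, invented for this example.
teams = {
    "checkout": TeamType.STREAM_ALIGNED,
    "search": TeamType.STREAM_ALIGNED,
    "payments-platform": TeamType.PLATFORM,
    "pricing-engine": TeamType.COMPLICATED_SUBSYSTEM,
    "devops-coaching": TeamType.ENABLING,
}

# The mesh is simply the list of current interactions; it is expected to change.
mesh = [
    Interaction("payments-platform", "checkout", InteractionMode.X_AS_A_SERVICE),
    Interaction("pricing-engine", "checkout", InteractionMode.COLLABORATION),
    Interaction("devops-coaching", "search", InteractionMode.FACILITATING),
    Interaction("payments-platform", "search", InteractionMode.X_AS_A_SERVICE),
]


def collaboration_count(interactions):
    """Count, per team, the communication-heavy (collaboration) links."""
    counts = Counter()
    for link in interactions:
        if link.mode is InteractionMode.COLLABORATION:
            counts[link.source] += 1
            counts[link.target] += 1
    return counts


print(collaboration_count(mesh))
```

Keeping the mesh as plain data makes the “moment in time” nature of a topology explicit: changing an interaction mode is an edit to the list, not a redesign.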

The book emphasizes, without surprise, the need for cross-functional autonomous teams as mentioned in the introduction: “The use of cross-functional, stream-aligned teams has a very useful side effect. Precisely because stream-aligned teams are composed of people with various skills, there is a strong drive to find the simplest, most user-friendly solution in any given situation. Solutions that require deep expertise in one area are likely to lose against simpler, easier-to-comprehend solutions that work for all members of the stream-aligned team”. “Cross-functional autonomous” does not mean that each team owns the totality of its required skill set. It should own most of it, and the exact balance is an art more than a science (hence the focus on experimentation and adaptation). For instance, the book quotes the example of DBAs: “database-administrator (DBA) teams can often be converted to enabling teams if they stop doing work at the software-application level and focus on spreading awareness of database performance, monitoring, etc. to stream-aligned teams”. This is good advice, but there are also situations where teams such as platform teams should have their own DBAs. Indeed, a platform team is organized to facilitate flow and maximize speed: “A digital platform is a foundation of self-service APIs, tools, services, knowledge and support which are arranged as a compelling internal product. Autonomous delivery teams can make use of the platform to deliver product features at a higher pace, with reduced coordination”.

2.4 Product Mode

The emphasis on the team’s autonomy and self-sufficiency comes from the move to “Product mode”, as stated by the authors: “The recent focus (at least within IT) on product and team centricity, as illustrated by Mik Kersten’s book on moving from Project to Product, is another major milestone”. The book quotes the example of Adidas, which has applied the product principles to re-organize its software development teams: “Adidas invested 80% of its engineering resources to creating in-house software delivery capabilities via cross-functional teams aligned with business needs. The other 20% were dedicated to a central-platform team taking care of engineering platforms and technical evolution, as well as consulting and onboarding new professionals”. As the title “Team Topologies” implies, it is not enough to optimize communication, organization, intra-team flows … it all starts with well-functioning teams, “a stable grouping of five to nine people who work toward a shared goal as a unit”. The emphasis on the low number of members comes from the necessity to “achieve predictable behavior and interactions inside the team”. The authors quote from “Team of Teams”: the best-performing teams “accomplish remarkable feats not simply because of the individual qualifications of their members but because those members coalesce into a single organism”. They also quote from my favorite book, “Accelerate”: “we must . . . ensure delivery teams are cross-functional, with all the skills necessary to design, develop, test, deploy, and operate the system on the same team.”

A key principle of team organization is that of ownership, responsibility and empowerment. Each subsystem, API or feature should have a clear owner: “The danger of allowing multiple teams to change the same system or subsystem is that no one owns either the changes made or the resulting mess”. The team becomes the unit of ownership. Ownership should not go down to the individual level, because this is not robust to fast change. The team is organized to maintain itself according to its purpose in a world of change, including team members: “The team takes responsibility for the code and cares for it, but individual team members should not feel like the code is theirs to the exclusion of others. Instead, teams should view themselves as stewards or caretakers as opposed to private owners. Think of code as gardening, not policing”.

2.5 Architectural Practices

As Conway’s law suggests, there are many similarities between software and organization design. Therefore, the proven “good practices” from software architecture have more general applicability and find their place in team topologies:

Loose coupling, trying to minimize dependencies between components and teams, is why we want to build “platforms” versus “complicated subsystems”. As mentioned by Jeff Bezos, in an ideal state, teams that build modular, loosely coupled units do not need to communicate much.
High cohesion, which requires components to have clearly bounded responsibilities and strongly related internal elements, is what will focus most of the communication flow within the team.
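One rough way to make these two principles observable is to count how many exchanges stay inside a team and how many cross a team boundary. The sketch below is a minimal illustration, assuming that we have a list of person-to-person exchanges (for instance extracted from reviews or chat threads); the function name, the people and the teams are all hypothetical.

```python
def cohesion_and_coupling(membership, exchanges):
    """Return {team: (internal_exchanges, external_exchanges)}.

    membership: dict mapping a person to a team
    exchanges:  iterable of (person_a, person_b) pairs
    """
    stats = {team: [0, 0] for team in set(membership.values())}
    for person_a, person_b in exchanges:
        team_a, team_b = membership[person_a], membership[person_b]
        if team_a == team_b:
            stats[team_a][0] += 1  # stays inside the team: cohesion
        else:
            stats[team_a][1] += 1  # crosses a boundary: coupling
            stats[team_b][1] += 1
    return {team: tuple(counts) for team, counts in stats.items()}


# Hypothetical data, invented for this example.
membership = {"ana": "platform", "bob": "platform", "cy": "checkout", "dee": "checkout"}
exchanges = [("ana", "bob"), ("ana", "bob"), ("cy", "dee"), ("bob", "cy")]

print(cohesion_and_coupling(membership, exchanges))
```

A team whose external count dominates its internal count is a candidate for a boundary change, although such a metric only supports judgement and does not replace it.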

The concept of platform, which I have debated at length in many of my blog posts, is indeed both a business and a software concept, and both an organization and an architecture pattern. In this book, the platform is the preferred pattern whenever possible: “Every software application and every software service is built on a platform. Often the platform is implicit or hidden, or perhaps not noticed much by the team that builds the software, but the platform is still there. As the philosophical expression goes: it’s turtles all the way down”. The authors notice something that we have also observed at Michelin: there is a strong link between growing a platform and working in a “product mode”: “how do we manage a live software system with well-defined users and hours of operation? By using software-product-management techniques. The platform, therefore, needs a roadmap curated by product-management practitioners, possibly co-created but at least influenced by the needs of users … the evolution of the platform “product” is not simply driven by feature requests from Dev teams; instead, it is curated and carefully shaped to meet their needs in the longer term”. A little further they summarize the platform ambition as follows : “A platform is not just a collection of features that Dev teams happened to ask for at specific points in the past, but a holistic, well-crafted, consistent thing that takes into account the direction of technology change in the industry as a whole and the changing needs of the organization”.

The concept of “platform team” does not apply only to software functional components; it also applies to enabling technologies. Infrastructure teams – think of Infrastructure as Code, critical for DevOps – can be organized with the “platform team” pattern. The “high cohesion” principle forbids defining the boundary of a team based on implementation technology, but finding the best way to use the four patterns requires judgement: “Existing teams based on a technology component should either be dissolved, with the work going into stream-aligned teams or converted into another team type: as part of the platform (if the component is a lower-level “platform” component), to an enabling team (if the component is easy enough for stream-aligned teams to work with), or to a complicated-subsystem team (if the subsystem really is needed and really is too complicated for stream-aligned teams to work with)”.

2.6 “API to reify modularity”

To reify is to make something into a practical object, a “first-class citizen” when speaking in OOP (object-oriented programming) lingo. What this subtitle says is that APIs are what make a system really modular: “A crucial role of a part-time, architecture-focused enabling team is to discover effective APIs between teams and shape the team-to-team interactions with Conway’s law in mind”. So, it is no surprise that the “Team Topologies” book talks a lot about APIs: “With stable, long-lived teams that own specific bits of the software systems, we can begin to build a stable team API: an API surrounding each team. An API (application programming interface) is a description and specification for how to interact programmatically with software, so we extend this idea to entire interactions with the team”. The book also insists on the importance of the developer UX (user experience) when consuming a platform’s API. The experience must be consistent (from one API to the other), intuitive and simple.
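To make the “team API” idea more tangible, here is a minimal Python sketch of what such a description could look like when it is treated as a first-class object. The field names, the team and the channel are my own assumptions for illustration; the book does not prescribe this structure.

```python
from dataclasses import dataclass, field


@dataclass
class TeamAPI:
    """What a team exposes to the rest of the mesh (all field names are hypothetical)."""
    team: str
    owned_components: list
    # Maps a service name to a short description of how to consume it.
    offered_services: dict = field(default_factory=dict)
    contact_channel: str = "unspecified"

    def how_to_consume(self, service: str) -> str:
        """Return usage guidance for a service, or explain whom to ask."""
        if service in self.offered_services:
            return self.offered_services[service]
        return f"'{service}' is not offered by {self.team}; ask on {self.contact_channel}"


# Hypothetical platform team, invented for this example.
payments = TeamAPI(
    team="payments-platform",
    owned_components=["payment-gateway", "invoice-store"],
    offered_services={"charge-card": "call the documented REST endpoint; no ticket needed"},
    contact_channel="#payments-help",
)

print(payments.how_to_consume("charge-card"))
print(payments.how_to_consume("refund"))
```

The point of the sketch is that ownership and the expected way of interacting are both written down in one place, which is what makes the interaction consistent from one team to the other.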

A key idea of the book is that the team topology, which derives both from how the global system (and the global organization) is modularized and decomposed and from the topology choices themselves (the types of teams and the nature of their interactions), should be optimized to increase flow and to decrease unnecessary interactions: “Flow is difficult to achieve when each team depends on a complicated web of interactions with many other teams. For a fast flow of change to software systems, we need to remove hand-offs and align most teams to the main streams of change within the organization”. A key factor to optimize flow is to keep the platform/components at a reasonable size (which is precisely the intuition behind the microservices architecture): “In all cases, we should aim for a thinnest viable platform (TVP) and avoid letting the platform dominate the discourse. As Allan Kelly says, “software developers love building platforms and, without strong product management input, will create a bigger platform than needed.””

3. Data Meshes

The book “Data Mesh – Delivering Data Value at Scale”, by Zhamak Dehghani, is the latest step of a journey that started with a few well-received articles and Thoughtworks podcasts. The concept of data mesh addresses a key problem for data-driven companies, which need to combine flow processing with long-term analysis that requires storage, what is often called hot and cold data processing (for instance in the lambda architecture). As soon as you combine distributed storage and distributed flows, there is a data governance question that is precisely addressed by the concept of data mesh. I developed this opinion in my keynote lecture at Dataquitaine 2022 (French-speaking readers may view the video here: L’approche du SI exponentiel au service d’une transformation digitale tirée par les données).

This is a great book because it explains the data mesh concepts thoroughly – including detailed examples – and addresses the questions that one may have after reading the original papers. The concept of a data mesh is what the name says: a mental model to look at the data flows produced by a data-driven company that builds a mesh of data components from source to consumer platforms, including “store and forward” platforms. The value of the book is not the recognition of the mesh; it is a set of governance tools and practices to address scalability and evolvability, as the author says: “I wish I could claim that data mesh principles were novel and new and I cleverly came up with them. On the contrary, the principles of data mesh are a generalization and adaptation of practices that have evolved over the last two decades and proved to solve our last complexity challenge: scale of software complexity led by the mass digitization of organizations”. The book is written at a time when the concept is still fresh (the term data mesh was coined in 2019) and the data-driven community lacks the long-term experience of running data meshes along the practices described in this book: “It’s worth considering that this book is being written at the time when data mesh is arguably still in the innovator and early adopter phase of an innovation adoption curve”. Here the analogy with SAFe, mentioned in the introduction, is worth noticing: “Data Mesh” is a “mental model” (a framework to see and describe what you already have in a data-driven company) and a set of governance patterns to solve a hard problem when running the data mesh at scale. This model does not apply to everything (transactional data and ACID constraints being one counterexample). As always, the summary that I propose in this blog post is both partial and too short to do justice to the content, so you should read the book.

3.1 Principles for Data Meshes

The principles of “Data Mesh” are designed to help a company scale its data operations, with a clear focus on advanced analytics, from business intelligence to advanced machine learning services. It follows the thread of this blog post, that is, its utility is related to the size of the problem: “Data mesh is a solution for an organization planning to get value from data at scale. It requires the commitment of product teams and business units to imagine using intelligent decision making and actions in their applications and services”. To say it more bluntly, this is not a technology or an architecture concept, nor is it useful for small organizations such as startups: “Data mesh is a decentralized sociotechnical approach to share, access, and manage analytical data in complex and large-scale environments—within or across organizations”.

The core of the data mesh approach is to organize the data-driven landscape into data domains – with a decomposition which is as modular as possible – and to recognize the flows of dependencies with a “data as a product” philosophy and a “self-serve data platform” practice. Because modularity and loose coupling are only an ideal, data mesh relies on federated governance to address the remaining dependencies at scale: “Organizationally, it shifts from centralized ownership of data by specialists who run the data platform technologies to a decentralized data ownership model pushing ownership and accountability of the data back to the business domains where data is produced from or is used”. As explained in the introduction, the purpose of the mesh is to support scalable growth by defining local policies for the mesh nodes that may be enforced as automatically as possible (versus returning to a centralized governance body): “it shifts data governance from a top-down centralized operational model with human interventions to a federated model with computational policies embedded in the nodes on the mesh”. Although automated policies favor distributed scalability, the strength of the data mesh relies first and foremost on the human side, the role of the teams, who are called to manage “data as a product”: “Remove the possibility of creating domain-oriented data silos by changing the relationship of teams with data. Data becomes a product that teams share rather than collect and silo. Create a data-driven innovation culture, by streamlining the experience of discovering and using high-quality data, peer-to-peer, without friction”.
Thinking of data as a product is not a new idea – data architects have been advocating for FAIR guiding principles for scientific data management and stewardship for a while – but the practices proposed in this book go further: “The baseline characteristics listed in this section are an addition to what has been known as FAIR data in the past—data that meets the principles of findability, accessibility, interoperability, and reusability”.

3.2 Data Domains

The fabric of the data mesh is the decomposition of data into data domains, following a domain-driven design principle. Data is organized into domains that reflect the business knowledge taxonomy. Domain-driven architecture has many benefits, and aims at producing agility through the tight coupling between business and systems, and, hopefully, the loose coupling between domains: “Domain-driven design, and the idea of breaking software modeling based on domains, has greatly influenced the software architecture of the last decade, for example with microservices. Microservices architecture decomposes large and complex systems into distributed services built around business domain capabilities. It delivers user journeys and complex business processes through loose integration of services”. Business data domains give the participants of the domain “shared awareness” about the meaning and the purpose of the data, which is represented in DDD as a “bounded context”: “A bounded context is “the delimited applicability of a particular model [that] gives team members a clear and shared understanding of what has to be consistent and what can develop independently.”” In the words of Zhamak Dehghani, data domains help to decentralize the ownership of analytical data to the business domains closest to the data (either the source or the main consumer). This decentralization of ownership is critical to get the agility and the scalability that are expected from data-driven companies: “Data mesh, at its core, is founded in decentralization and distribution of data responsibility to people who are closest to the data. This is to support a scale-out structure and continuous and rapid change cycles”.

We find here the discussion started in the previous section: domain decomposition should aim to be as modular as possible, but should also recognize that (1) modularity is an ideal and (2) the world evolves constantly, so the domains will evolve, including the relationships that they have with each other. This is why the Data Mesh approach introduces the concept of a “federation of data domains”, which is the equivalent of what we call “a federation of data models” at Michelin (for the same reason: a monolith approach is impractical, but true modularity is too difficult to achieve): “A federated group of domain representatives defines the policies and the data platform automates them. This is data mesh’s federated computational governance principle”. The term federation is a metaphor for two things: the participants are fairly autonomous, and there is a set of rules and policies to manage the interdependencies: “Organizationally, by design, data mesh is a federation. It has an organizational structure with smaller divisions, the domains, where each has a fair amount of internal autonomy”. Interdependencies include the inevitable data objects that are shared in multiple contexts, what is called a “polyseme” in the book and what I would call “a pivot business object”: “Following DDD, data mesh allows different domains’ analytical data to model a polyseme according to the bounded context of their domain. However, it allows mapping a polyseme from one domain to another, with a global identification scheme”.
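The polyseme idea can be illustrated with a few lines of Python. In this sketch, two hypothetical domains (sales and support) each keep their own model of a “customer”, and the only thing they share is a global identifier. The domain names, the attributes and the identifier format are assumptions made for this example, not taken from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesCustomer:
    """The customer as seen from the (hypothetical) sales bounded context."""
    global_id: str
    account_manager: str


@dataclass(frozen=True)
class SupportCustomer:
    """The same business object as seen from the (hypothetical) support context."""
    global_id: str
    open_tickets: int


# Each domain owns its own data, indexed by the shared global identifier.
sales_domain = {c.global_id: c for c in [
    SalesCustomer("cust-001", "Alice"),
    SalesCustomer("cust-002", "Bob"),
]}
support_domain = {c.global_id: c for c in [
    SupportCustomer("cust-001", 3),
]}


def views_of(global_id):
    """Collect each domain's view of the same business object."""
    return {
        "sales": sales_domain.get(global_id),
        "support": support_domain.get(global_id),
    }


print(views_of("cust-001"))
print(views_of("cust-002"))
```

Neither domain has to adopt the other’s model; the global identification scheme is the only contract needed to map one view to the other.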

3.3 Data as a Product

To think of data as a product means to understand that data that is shared with other teams needs to deliver operational qualities such as availability, performance and freshness in a reliable and consistent manner. Sharing data (original or processed) becomes a business-critical activity and the community of users must be addressed “as a market of consumers”. Measuring usage, volumes, performance and user satisfaction becomes a necessity for a mesh participant that shares its data “as a product”. This is the only way to avoid the risk of data silos and its counterpart, massive replication and the associated desynchronization: “The principle of data as a product is a response to the data siloing challenge that may arise from the distribution of data ownership to domains”. The need to think of “data as a product” also comes from the practical consideration that, in today’s state of operations, the job of most data science and data engineering teams is to clean up and re-organize the data obtained from other sources: “recent report from Anaconda, a data science platform company, “The State of Data Science 2020”, finds that nearly half of a data scientist’s time is spent on data preparation—data loading and cleansing”. The practice of “data as a product” is a “shift left” of data cleansing to the upstream nodes in the mesh.

The relationship between the topic of data mesh and team topologies, as expressed in the introduction, does not escape Zhamak Dehghani: “Domain data product teams as stream-aligned teams According to Team Topologies, a stream-aligned team is the primary team type in an organization. It is responsible for an end-to-end delivery of a single product, service, set of features, etc. In the case of data mesh, the (cross-functional) domain teams have one or multiple stream-aligned teams including the application development teams (app dev for short) as well as the data product delivery teams (data product for short). A domain team is a logical team, a team of teams”.

In a distributed scalable mesh, data products must be easy to find and to consume: “Data products are automatically accessible through the global data discovery tool. They share and guarantee a set of service-level objectives (SLOs), such as how often each playlist is refreshed, its accuracy, and timeliness”. Here I notice that SLO is becoming the new buzzword, but I believe that SLA (agreement) should be used in this context. Let me recall that in the SRE approach, the SLA is what you agree with the customer, while the SLO is the team’s own objective, which is more ambitious and opens the practice of error budgeting to continuously improve site reliability. As the producer’s manager, I care about SLOs, but as a consumer, I care about SLAs. The data product owner is accountable for meeting the SLA for the data users, for ensuring their satisfaction and for maintaining the life cycle of the data products. To put it in Zhamak’s terms: “It’s the responsibility of a data product to share the information needed to make itself discoverable, understandable, trustworthy, and explorable”.
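The difference between the SLA and the SLO can be illustrated with a small amount of arithmetic. The sketch below assumes a hypothetical data product whose freshness is measured as the share of refreshes delivered on time; the two targets (95% promised to consumers, 99% as the internal objective) and the function names are assumptions chosen for the example. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Hypothetical targets for the share of refreshes delivered on time.
SLA_TARGET = Fraction(95, 100)   # what is agreed with the data consumers
SLO_TARGET = Fraction(99, 100)   # stricter objective the team sets for itself


def allowed_misses(target: Fraction, total: int) -> Fraction:
    """Number of late refreshes that the target tolerates over `total` refreshes."""
    return (1 - target) * total


def freshness_status(total_refreshes: int, late_refreshes: int) -> dict:
    """Compare the achieved on-time ratio with both targets."""
    if total_refreshes <= 0:
        raise ValueError("total_refreshes must be positive")
    achieved = Fraction(total_refreshes - late_refreshes, total_refreshes)
    return {
        "achieved": achieved,
        "slo_met": achieved >= SLO_TARGET,
        "sla_met": achieved >= SLA_TARGET,
        # Error budget is computed against the SLO, not the SLA.
        "error_budget_left": allowed_misses(SLO_TARGET, total_refreshes) - late_refreshes,
    }


if __name__ == "__main__":
    print(freshness_status(total_refreshes=1000, late_refreshes=20))
```

With 20 late refreshes out of 1000, the team has overspent its error budget (the SLO tolerates 10) while the consumer-facing SLA (which tolerates 50) still holds. This is the gap that gives the producer time to react before consumers are affected.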

The Data Mesh approach relies on platforms and automation, as we shall see later. The use of self-serve platforms supports the implementation of policies and alleviates the load on the data owners: “All data products must implement global policies such as compliance, access control, access audit, and privacy. The platform is the key enabler in embedding these policies in all data products. Data products can define the policy configurations as code and test and execute them during their life cycle. The platform offers the underlying engine that implements the management of policies as code... The principle of a self-serve data platform essentially makes it feasible for domain teams to manage the life cycle of their data products with autonomy and utilize the skillsets of their generalist developer to do so”. This use of platforms is combined with the empowerment of the data owner to manage data governance locally (within the constraints of the federated mesh): “If I could leave you with one takeaway from this chapter, it would be to invert your perspective on whose responsibility it is to manage, govern, and observe data; shift the responsibility from an external party getting engaged after the fact to the data product itself”.
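As an illustration of “policies as code”, here is a minimal sketch in which each data product declares its policy configuration and a small platform-side engine checks it against global rules. The `DataProduct` structure, the two rules (masking of PII columns, declared retention period) and the configuration keys are hypothetical; the book does not prescribe this format.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    columns: dict                                # column name -> set of tags, e.g. {"pii"}
    policy: dict = field(default_factory=dict)   # configuration declared by the owner


def check_pii_masked(product: DataProduct) -> list:
    """Global rule (assumed): every column tagged 'pii' must be declared as masked."""
    masked = set(product.policy.get("masked_columns", []))
    return [
        f"{product.name}.{column}: PII column is not masked"
        for column, tags in product.columns.items()
        if "pii" in tags and column not in masked
    ]


def check_retention_declared(product: DataProduct) -> list:
    """Global rule (assumed): a retention period must be declared."""
    if "retention_days" not in product.policy:
        return [f"{product.name}: no retention period declared"]
    return []


# The federation decides on the rules; the platform runs them for every product.
GLOBAL_POLICIES = [check_pii_masked, check_retention_declared]


def evaluate(product: DataProduct) -> list:
    """Return all policy violations for one data product (empty list if compliant)."""
    return [violation for rule in GLOBAL_POLICIES for violation in rule(product)]


if __name__ == "__main__":
    orders = DataProduct(
        name="orders",
        columns={"order_id": set(), "email": {"pii"}},
        policy={"retention_days": 365},
    )
    print(evaluate(orders))
```

Because the rules are plain functions, a domain team can run the same checks in its own tests during the life cycle of the product, which is the point of shifting responsibility to the data product itself.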

3.4 Data Mesh Governance

Data Mesh governance follows the federated model that we described. A large part is done locally by the data domains themselves; what remains to be seen is how to manage the federation itself. The book gives a few principles and practices to run the mesh: “a data governance operating model based on a federated decision-making and accountability structure, with a team composed of domain representatives, data platform, and subject matter experts—legal, compliance, security, etc.”. The goal is precisely to scale the benefits of the domain-driven decomposition that we showed in the previous two sections: “Governance is the mechanism that assures that the mesh of independent data products, as a whole, is secure, trusted, and most importantly delivers value through the interconnection of its nodes”. An implicit benefit of the mesh is that it unifies different kinds of nodes (source nodes, aggregation or transformation nodes, consumer nodes), so that each node can play multiple roles. It helps to consider “analytical nodes” as “operational nodes”, and to insert advanced data producers (for instance using ML models) into operational business processes.

Governance is required because the data domains, data flows and data products evolve constantly, with a scope that goes beyond that of individual data products (data owners manage the evolvability of their own data products): “Data mesh is a dynamic system with a continuously changing topology. The shape of the mesh continuously changes”. Here is a good summary of the overall governance organization: “Data mesh governance in contrast embraces constant change to the data landscape. It delegates the responsibility of modeling and quality of the data to individual domains, and heavily automates the computational instructions that assure data is secure, compliant, of quality, and usable. Risk is managed early in the life cycle of data, and throughout, in an automated fashion. It embeds the computational policies in each and every domain and data product. Data mesh calls this model of governance a federated computational governance”. There are three key elements to facilitate this data mesh governance: automating and embedding policies as code (which we mentioned in the previous section), delegating central governance responsibilities to data product owners, and organizing “federated” bodies in which each data owner is represented and which support the sharing of needs, concerns and awareness. Raising this “shared awareness” through education is necessary to deliver the value of the data mesh: “Increasing participation of people in data sharing across the organization with different roles and levels of skills is a common goal of many organizations—referred to as data democratization”.

3.5 Data Platforms

The self-serve platform is the core concept for the implementation of the data mesh: “The platform makes data product compatibility possible. For example, platforms enable data product linking—when one data product uses data and data types (schema) from another data product. For this to be seamlessly possible, the platform provides a standardized and simple way of identifying data products, addressing data products, connecting to data products, reading data from data products, etc. Such simple platform functions create a mesh of heterogeneous domains with homogeneous interfaces”.

Thinking in terms of platforms also means thinking in terms of community. To grow a platform is to grow its user community: “An interesting lens on the data mesh platform is to view it as a multisided platform—one that creates value primarily by enabling direct interactions between two (or more) distinct parties. In the case of data mesh, those parties are data product developers, data product owners, and data product users”. This means that one must build experiences (UX) and not mechanisms, as Zhamak Dehghani warns: “I have come across numerous platform building/buying situations, where the articulation of the platform is anchored in mechanisms it includes, as opposed to experiences it enables. This approach in defining the platform often leads to bloated platform development and adoption of overambitious and overpriced technologies”. A platform is a complex system that must be grown according to the feedback provided by its users: “This mechanism is called a negative or balancing feedback loop. The intention of this feedback loop is self-correction, in this case reducing the number of duplicate, low-quality, and less usable data products”.

In the context of a data mesh, a platform is mostly defined by the APIs that it provides, which correspond to the data products: “If you take away one thing from this chapter, I wish it to be this: there is no single entity such as a platform. There are APIs, services, SDKs, and libraries that each satisfy a step in the journey of the platform users … The ability to use the platform capabilities through self-serve APIs is critical to enable autonomy”. The most obvious APIs are the output APIs, which are used to consume the data: “Output data APIs can receive and run remote queries on their data, for example an interface to run SQL queries on the underlying tables holding data. Output APIs share the data in multiple formats. For example, APIs read semi-structured files from blob storage or subscribe to an event stream”. The data mesh platform also proposes “input APIs”, which are used internally by the data product to configure and read data from the upstream sources, and “discovery and observability APIs”, which provide additional information about the data product and help with its discoverability and debugging.
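The three families of APIs can be sketched with a toy data product backed by an in-memory SQLite database. This is only an illustration under my own assumptions: the class name, the method names (`ingest`, `query`, `describe`) and the `playlists` table are hypothetical, and the check that restricts queries to `SELECT` is deliberately naive.

```python
import sqlite3


class PlaylistDataProduct:
    """Hypothetical data product with input, output and discovery interfaces."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE playlists (name TEXT, tracks INTEGER)")
        self._queries_served = 0

    def ingest(self, rows):
        """Input API: load rows coming from an upstream source."""
        self._db.executemany("INSERT INTO playlists VALUES (?, ?)", rows)
        self._db.commit()

    def query(self, sql, params=()):
        """Output API: run a read-only SQL query on the underlying table."""
        # Naive guard for the sake of the example, not a real security control.
        if not sql.lstrip().lower().startswith("select"):
            raise ValueError("only SELECT queries are accepted")
        self._queries_served += 1
        return self._db.execute(sql, params).fetchall()

    def describe(self):
        """Discovery and observability API: schema plus a simple usage metric."""
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in self._db.execute("PRAGMA table_info(playlists)")]
        return {
            "name": "playlists",
            "columns": columns,
            "queries_served": self._queries_served,
        }


if __name__ == "__main__":
    product = PlaylistDataProduct()
    product.ingest([("focus", 42), ("running", 18)])
    print(product.query("SELECT name FROM playlists WHERE tracks > ?", (20,)))
    print(product.describe())
```

A consumer only needs `query` and `describe`; `ingest` stays internal to the data product, which matches the split between output APIs and input APIs described above.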

4. Conclusion

The English version of my book, “The Lean Approach to Digital Transformation”, is now available. It draws on 20 years of experience at Bouygues Telecom and AXA. It focuses on the flow “from customer to code” (digital innovation following Lean Startup) and the flow “from code to customer” (agile and lean software factories). This experience drives a keen interest in “agile at scale”. I have witnessed and experienced a number of failures with large projects that failed to reach an agile state and to deliver the expected value. From what I have seen, failure starts when each part is late and too busy to look outside: each team focuses on delivering “their part” at the expense of the complete system. Everyone works very hard, yet the integration starts with a large stack of bugs that grows continuously with undetected regressions and feature interactions. Understanding the impact of size is straightforward with a lean mindset: it is too easy to focus on what you see and forget what you do not see. What you see is what you build (at team level), while what is harder to see is the inter-team landscape: integration issues, global performance issues, usability issues. It requires strength to look for problems that you do not see when you are late or overloaded with the tasks that you can see. Discussing and synchronizing takes time but is a sound investment. This is precisely why the PI (Program Increment) planning step of the SAFe methodology is so necessary and helpful. The only situation in which this effort is not needed is when development from precise requirements is possible, which is also when Waterfall works. The need for PI planning is a symptom of the complexity of the overall system, that is, of what should and should not happen because of multiple interactions. Obviously the quest for modularity and loose coupling is critical to fight this complexity but, most of the time, it is not enough.
As explained in “Team Topologies”, a well-crafted API makes the synchronization of PI planning unnecessary: the producer manages the API as a product and takes care of upward compatibility, SLA, documentation, etc., while the consumer consumes the API from a self-serve platform.

The following slide is taken from a presentation that I made last year for the “Club Urba-EA”. It represents the 12 “Lean Software Factory” principles that are presented in my book, drawn from 20 years of experience with Agile (from Scrum to Extreme Programming) and Lean software development. The first row is colored in green because these principles apply at all scales. The second row is colored in blue because, although applying them to a large set of teams is more difficult, there are ways and tools to adapt to the size. The last row, colored in red, is precisely why the topic of “Agile at scale” matters: each of these four principles is definitely harder to implement at scale.

To conclude, I will summarize what I believe about Agility at scale with four ideas:

Companies need to be agile “because we do not know the future”; this is irrespective of size.
Agile at scale is a necessity and a possibility, but it is hard, as expressed by the previous illustration.
The main challenge of agility at scale is about communication and synchronization. This is why the first book presented in this post, “Team Topologies”, is relevant to agility at scale. To deliver flow, one must organize, in a scalable way, and optimize the necessary amount of inter-team collaboration.
Organizing the distributed mesh of teams, which is obviously more challenging in the case of geographical and time zone distribution, is a matter of distributed team network governance. This was the reason for commenting on the second book, “Data Mesh”, in this blog post.

Sunday, May 22, 2022