What is Magma: An open-source mobile network project

Systems Approach: This month’s column was co-written by Amar Padmanabhan, a lead developer of Magma, the open-source project for building carrier-grade networks, and Bruce Davie, a member of the project’s technical advisory committee.

Discussions about mobile and wireless networks seem to attract buzzwords, especially with the transition to 5G. And so we see a wave of “cloudification” of mobile networks – think containers, microservices, and control and user plane separation.

But making an architecture cloud native is about more than adopting buzzwords: it involves a set of principles related to scale, fault tolerance, and operating models. And in fact, it doesn’t really matter what you call the architecture; what matters is how well it works in production.

In this post we try to articulate some of the defining principles that have guided the development of Magma, which aims to be a cost-effective and easy-to-operate solution for mobile networks in less-developed regions.

Commodity hardware

Most traditional networking equipment is proprietary, bundling software with precisely specified and configured hardware. Magma, like most cloud-native systems, instead takes advantage of low-cost commodity hardware. Performance is achieved through scale-out approaches, and reliability through software techniques that tolerate the failure of unreliable hardware. Scale-out and designing for failure are themselves key principles of cloud-native architecture, as discussed below.

Since its inception, Magma has been designed to be easy to operate in a variety of environments on commodity hardware. Any component can be replaced with minimal cost and network disruption.

Scale out rather than up

Cloud-native systems typically scale by adding more commodity devices horizontally, rather than by scaling up the capacity of individual monolithic systems. Magma is built on a distributed architecture that scales horizontally: capacity is increased by adding small devices throughout the network. For example, an operator can deploy hundreds of access gateways alongside radio towers, instead of adding a few large boxes in the core as is common in traditional EPC (Evolved Packet Core) designs. This distributed architecture also matters for our next point, designing for failure.

Small fault domains

In any cloud system, individual components are expected to fail, and failure is treated as a normal part of operation. Many of Magma’s design decisions stem from that premise. In traditional telecommunications architectures, by contrast, failures are assumed to be rare and are handled through dedicated exception paths, such as hot standbys and fully redundant equipment.

A failure should affect as few users as possible (that is, failure domains should be small) and should not cascade to other components. For example, the failure of a small gateway might affect only a few hundred subscribers. By contrast, if a network built on two large cores loses one of them, half of its customers can lose service.
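The difference in blast radius is easy to quantify. A minimal back-of-the-envelope sketch (the subscriber and element counts are hypothetical, chosen only to illustrate the contrast):

```python
# Fraction of subscribers who lose service when one network element fails,
# assuming subscribers are spread evenly across the elements.

def blast_radius(subscribers: int, elements: int) -> float:
    """Share of subscribers affected if one of `elements` fails."""
    return (subscribers / elements) / subscribers  # == 1 / elements

SUBSCRIBERS = 100_000

# Traditional design: two large cores, each serving half the network.
print(blast_radius(SUBSCRIBERS, elements=2))    # 0.5 -> half the network

# Distributed design: 500 access gateways near the radio towers.
print(blast_radius(SUBSCRIBERS, elements=500))  # 0.002 -> a few hundred users
```

The point is not the arithmetic but the design consequence: shrinking each element shrinks the worst-case impact of any single failure.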

It is not enough to divide a large monolith into smaller components; you also need to localize state within components to limit the impact of failures. Magma does this by localizing the state associated with any given User Equipment (UE) at a single access gateway. The impact of a component failure is therefore bounded: only the UEs served by that access gateway are affected. The access gateway holds the per-UE “runtime state”, which changes in response to events such as a UE powering up or moving into the coverage area of a new base station. By contrast, per-UE runtime state tends to be spread across components in traditional 3GPP implementations.

While runtime state is localized at the relevant access gateway, configuration state is stored centrally in the Magma Orchestrator, because configuration is a property of the network as a whole, provided via a central API. If an Orchestrator component fails, it only prevents configuration updates; it does not affect runtime state, so UEs continue to function even while the Orchestrator restarts.
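This split between per-UE runtime state and network-wide configuration state can be sketched in a few lines. The classes and fields below are illustrative, not Magma’s actual API:

```python
# Sketch of the state split: per-UE runtime state lives on exactly one access
# gateway, while network-wide configuration lives centrally in the orchestrator.

from dataclasses import dataclass, field

@dataclass
class AccessGateway:
    """Owns the runtime state for the UEs it serves -- and only those."""
    gateway_id: str
    ue_sessions: dict = field(default_factory=dict)  # per-UE runtime state

    def attach_ue(self, ue_id: str) -> None:
        # Runtime state is created locally, e.g. when a UE powers up here.
        self.ue_sessions[ue_id] = {"state": "attached"}

@dataclass
class Orchestrator:
    """Holds configuration for the whole network; no per-UE runtime state."""
    network_config: dict = field(default_factory=dict)

gw1, gw2 = AccessGateway("gw1"), AccessGateway("gw2")
orc = Orchestrator({"apn": "internet"})

gw1.attach_ue("ue-42")

# If gw2 fails, ue-42 (served by gw1) is unaffected; if the orchestrator
# restarts, existing sessions keep running -- only config updates pause.
assert "ue-42" in gw1.ue_sessions and "ue-42" not in gw2.ue_sessions
```

The design choice the sketch captures: no component outside the serving gateway needs to be consulted for a UE’s runtime state, so no other component’s failure can corrupt it.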

Simplified operations

The scalability of cloud-native systems applies to operations as well as performance. Centralized control planes, such as those found in software-defined networking (SDN), emerged as a way to simplify network operations; indeed, Magma was influenced by the experience of building Nicira’s SDN system. While centralized control was once considered an unacceptable single point of failure or a scaling bottleneck, it is now well understood that reliable, logically centralized controllers can be built from a collection of commodity servers. It is much easier to operate an entire network from a central point of control than to work out how to configure each network device individually.

[Figure: Magma’s cloud-native architecture – distributed access gateways and logically centralized control]

The logically centralized control point in Magma is the orchestrator, and it corresponds loosely to a controller in an SDN system. It is implemented on a set of machines (typically three), any one of which can fail without bringing down the orchestrator. This set of machines exposes a single API from which the network as a whole can be configured and monitored. As the size of the network increases and more access gateways are added, the operator maintains a single, centralized view of the network.

Access gateways represent a distributed data plane implementation and also contain local control plane components. The federation gateway implements a set of standard protocol interfaces to allow a Magma-enabled system to interoperate with standard cellular networks.

Like SDN systems, Magma separates the control plane from the data plane. This separation is essential to the simplified operating model, but it also improves reliability: the failure of a control-plane element does not bring down the data plane, although it may prevent new data-plane state from being created (e.g., bringing a new UE online). This is a more complete separation than that provided by the 3GPP CUPS specification, as we have discussed previously.

Magma’s data plane, which runs on the access gateways, is implemented in software and programmed through well-defined, stable interfaces (similar in spirit to OpenFlow) that are independent of the hardware. Again, this is the approach taken in SDN; it allows the data plane to run on commodity hardware and to evolve easily over time.

Desired state model

As in many cloud-native systems (such as Kubernetes), Magma uses a desired state configuration model. APIs allow users or other software systems to configure the desired state, while the control plane is responsible for ensuring that this state is realized. The control plane takes responsibility for mapping from the state intended by the operator to the actual implementation at the access gateways, reducing the operational complexity of managing a large network.

The desired state model has proven effective in other cloud-native settings because it allows components simply to compare the current state to the desired state and then make whatever adjustments are needed. For example, if the desired state calls for two active sessions, the control plane monitors the system to ensure that requirement is met and takes steps to activate a session if one is missing. (Joe Beda has our favorite description of how desired state works in Kubernetes.)
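The compare-and-correct loop described above can be sketched in a few lines. The session model here is invented for illustration; it is not Magma’s actual data model:

```python
# A minimal desired-state reconciliation step: diff desired against actual
# and emit corrective actions, regardless of how the divergence arose.

def reconcile(desired: dict, actual: dict) -> dict:
    """Compare desired state to actual state and return corrective actions."""
    actions = {}
    for session, cfg in desired.items():
        if actual.get(session) != cfg:
            actions[session] = ("activate", cfg)   # missing or stale -> (re)create
    for session in actual:
        if session not in desired:
            actions[session] = ("teardown", None)  # present but not desired -> remove
    return actions

desired = {"session-a": {"qos": "default"}, "session-b": {"qos": "default"}}
actual  = {"session-a": {"qos": "default"}}  # session-b lost, e.g. after a crash

print(reconcile(desired, actual))
# -> {'session-b': ('activate', {'qos': 'default'})}
```

Note that the loop needs no history: it does not matter whether session-b was never created or was lost in a crash; the same diff produces the same corrective action, which is exactly the property the CRUD model lacks.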

In contrast, traditional 3GPP implementations have used a “CRUD” (create, read, update, delete) model, in which the state is determined by a sequence of actions (create a new session or update a session, for example). After a lost message or a component failure, it is difficult to determine whether a component’s current state is correct.

While many of these design decisions may seem obvious, they are quite different from the principles of the standard 3GPP architecture. For example, although 3GPP has the concept of CUPS, control plane elements usually have some user plane (data plane) state. Proper separation of control and data planes leads to a more robust architecture with better upgradeability.

Ultimately, it doesn’t matter what we call the architecture; what matters is how well it scales, how it handles failures, and how easy it is to operate. By embracing cloud-native technologies and principles, Magma offers a mobile networking solution that is robust, leverages commodity hardware, scales gracefully from small to large deployments, and provides a central point of control for operational simplicity. We are gathering operational data as we write this; our goal is to quantify these claims in a later piece.
