1. Introduction
ZooKeeper is an open source distributed coordination service originally created at Yahoo, and an open source implementation of Google Chubby. On top of ZooKeeper, distributed applications can implement data publishing/subscription, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.
2. Basic concepts
This section introduces several core concepts of ZooKeeper. They come up repeatedly later in the article, so it is worth understanding them up front.
2.1 Cluster Roles
In ZooKeeper, there are three roles:
Leader
Follower
Observer
A ZooKeeper cluster has only one Leader at any given time; all other nodes are Followers or Observers.
ZooKeeper is very simple to configure. The configuration file (zoo.cfg) is identical on every node; only the myid file differs. The value in myid must match the {numeric} part of the corresponding server.{numeric} entry in zoo.cfg.
Example of zoo.cfg file content:
[Screenshot: example zoo.cfg file]
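For reference, a minimal illustrative zoo.cfg might look like the following; the host names and values here are placeholders for the example, not copied from the original screenshot:

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=node-20-103:2888:3888
    server.2=node-20-104:2888:3888
    server.3=node-20-105:2888:3888

With such a configuration, the myid file on node-20-103 would contain 1, the one on node-20-104 would contain 2, and so on.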
Run zookeeper-server status in a terminal on a machine running ZooKeeper to see which role the current node holds (Leader or Follower).
[Screenshot: output of zookeeper-server status on two nodes, showing Leader and Follower]
As shown above, node-20-104 is the Leader and node-20-103 is a Follower.
By default ZooKeeper has only the Leader and Follower roles and no Observers. To use Observer mode, add peerType=observer to the configuration file of every node that should become an Observer, and append :observer to that server's entry in the configuration file of all servers, for example:
server.1=localhost:2888:3888:observer
Through a Leader election process, the machines in a ZooKeeper cluster elect one machine as the Leader. The Leader server handles both read and write requests from clients.
Followers and Observers both serve read requests but not writes. The only difference between them is that an Observer does not take part in Leader election, nor in the "write succeeds once more than half acknowledge" voting for write operations, so Observers can improve the cluster's read performance without affecting write performance.
2.2 Session
Session refers to a client session. Before explaining client sessions, let's first look at client connections. In ZooKeeper, a client connection is a long-lived TCP connection between the client and a ZooKeeper server.
ZooKeeper's default service port is 2181. When the client starts, it establishes a TCP connection with a server, and the client session's life cycle begins with this first connection. Through this connection the client keeps the session alive with heartbeats, sends requests to the ZooKeeper server and receives responses, and also receives Watch event notifications from the server.
The session's SessionTimeout value sets the client session's timeout. When the connection is broken because of server overload, network failure, or the client actively disconnecting, the previously created session remains valid as long as the client can reconnect to any server in the cluster within the time specified by SessionTimeout.
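As a concrete sketch using the standard Java client (org.apache.zookeeper), the connect string, timeout value and class name below are made up for the example; the session is considered established once the SyncConnected state is reported:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // 15000 ms is the requested session timeout; the server may negotiate it.
            ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();   // the session is now established
            System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
            zk.close();          // closing the handle ends the session
        }
    }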
2.3 Data Node (ZNode)
In distributed systems, "node" usually refers to a machine in the cluster. A data node in ZooKeeper, however, refers to a unit of the data model, called a ZNode. ZooKeeper keeps all data in memory, organized as a tree (the ZNode tree); every slash-separated (/) path identifies a ZNode, for example /hbase/master, where both hbase and master are ZNodes. Each ZNode stores its own data content along with a series of attributes.
Note:
A ZNode can be understood as both a Unix file and a Unix directory: each ZNode can hold data itself (like a file) and can also have child nodes beneath it (like a directory).
In ZooKeeper, ZNodes fall into two categories: persistent nodes and temporary (ephemeral) nodes.
Persistent node
A persistent node, once created, is kept on ZooKeeper until it is explicitly removed.
Temporary nodes
The life cycle of a temporary node is bound to the client session: once the session ends, all temporary nodes created by that client are removed.
In addition, ZooKeeper allows users to mark a node with a special SEQUENTIAL attribute. When such a node is created, ZooKeeper automatically appends an integer to its name, an auto-incrementing counter maintained by the parent node.
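As an illustration with the Java client (the paths and data below are made up; zk is assumed to be an already-connected ZooKeeper handle as in the earlier session example, and the parent nodes are assumed to exist):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class ZNodeExamples {
        static void createExamples(ZooKeeper zk) throws Exception {
            // Persistent node: stays on ZooKeeper until it is explicitly deleted.
            zk.create("/app/config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Temporary (ephemeral) node: removed automatically when this session ends.
            zk.create("/app/workers/worker-1", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // Sequential node: ZooKeeper appends an auto-incrementing counter kept by
            // the parent, e.g. the returned path may be /app/tasks/task-0000000003.
            String actual = zk.create("/app/tasks/task-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            System.out.println("created " + actual);
        }
    }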
2.4 Versions
Every ZNode in ZooKeeper stores data, and for each ZNode ZooKeeper maintains a data structure called Stat. Stat records three versions of the ZNode: version (the data version of the ZNode itself), cversion (the version of its child nodes) and aversion (the version of its ACL).
2.5 Status information
Besides its data content, each ZNode also stores some status information about itself. The get command returns both the content and the status information of a ZNode, as follows:
[Screenshot: output of the get command, showing a ZNode's data and its Stat status information]
In ZooKeeper, the version attribute is used to implement the "pre-write check" of an optimistic locking mechanism, ensuring atomic updates to distributed data.
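A short sketch of this pattern with the Java client (the counter node, its contents and the method name are made up for the example): the write succeeds only if the node's version is still the one that was read, otherwise the client retries.

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class OptimisticUpdate {
        // Read-modify-write with optimistic concurrency control on a ZNode.
        static void incrementCounter(ZooKeeper zk, String path) throws Exception {
            while (true) {
                Stat stat = new Stat();
                byte[] raw = zk.getData(path, false, stat);   // fills stat with the current version
                long next = Long.parseLong(new String(raw)) + 1;
                try {
                    // Succeeds only if nobody changed the node since we read it.
                    zk.setData(path, Long.toString(next).getBytes(), stat.getVersion());
                    return;
                } catch (KeeperException.BadVersionException e) {
                    // Someone else wrote first; loop and retry with the fresh version.
                }
            }
        }
    }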
2.6 Transaction Operation
In ZooKeeper, an operation that changes the state of the ZooKeeper server is called a transaction operation. This generally includes creating and deleting data nodes, updating data content, and creating or expiring client sessions. For each transaction request, ZooKeeper assigns a globally unique transaction ID, the ZXID, usually a 64-bit number. Each ZXID corresponds to one update, and from the ZXIDs ZooKeeper can determine the global order in which transaction requests were processed.
2.7 Watcher
A Watcher (event listener) is a very important feature of ZooKeeper. ZooKeeper allows users to register Watchers on specified nodes, and when certain events are triggered the ZooKeeper server notifies the interested clients. This mechanism is key to how ZooKeeper implements distributed coordination.
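A minimal sketch of registering a Watcher with the Java client (the path and method name are made up; note that ZooKeeper watches are one-shot and must be re-registered after they fire):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    class WatchDemo {
        // Register interest in a node; the server delivers a single notification
        // when the node is created, deleted, or its data changes.
        static void watchNode(ZooKeeper zk, String path) throws Exception {
            zk.exists(path, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    System.out.println("event " + event.getType() + " on " + event.getPath());
                    // Watches fire once: re-register here if continued monitoring is needed.
                }
            });
        }
    }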
2.8 ACL
ZooKeeper uses ACLs (Access Control Lists) to control permissions. It defines the following five permissions:
CREATE: permission to create child nodes.
READ: permission to read a node's data and list its children.
WRITE: permission to update a node's data.
DELETE: permission to delete child nodes.
ADMIN: permission to set a node's ACL.
Note: CREATE and DELETE are both permission controls for child nodes.
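As a small illustration with the Java client (the digest user name, password and path are made up for the example), a client can authenticate itself and create a node whose ACL grants all five permissions to its creator only:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class AclDemo {
        static void createProtectedNode(ZooKeeper zk) throws Exception {
            // Authenticate this session with the digest scheme (user:password).
            zk.addAuthInfo("digest", "alice:secret".getBytes());
            // CREATOR_ALL_ACL: only the authenticated creator gets CREATE/READ/WRITE/DELETE/ADMIN.
            zk.create("/protected", "data".getBytes(),
                    ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);
        }
    }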
3. Typical application scenarios of ZooKeeper
ZooKeeper is a highly available framework for distributed data management and coordination. Built on its implementation of the ZAB protocol, it guarantees data consistency in a distributed environment, and it is this property that makes ZooKeeper a powerful tool for solving distributed consistency problems.
3.1 Data Publishing and Subscription (Configuration Center)
Data publishing and subscription, i.e. a configuration center, means, as the name suggests, that publishers write data to ZooKeeper nodes for subscribers to consume, so that data can be obtained dynamically and configuration information can be managed centrally and updated dynamically.
In everyday application development we often meet this requirement: the system needs some common configuration, such as machine lists or database connection settings. Such global configuration information usually has the following three characteristics.
The amount of data is usually relatively small.
The data content changes dynamically at runtime.
It is shared by all machines in the cluster, and the configuration must be consistent across them.
Such global configuration information can be published on ZooKeeper so that clients (the cluster machines) can subscribe to it.
Publish/subscribe systems generally follow one of two designs: push or pull.
Push: the server actively sends data updates to all subscribed clients.
Pull: the client actively requests the latest data, usually by polling on a timer.
ZooKeeper combines push and pull, as follows:
The client registers with the server the nodes it cares about. Once a node's data changes, the server sends a Watcher event notification to the corresponding clients; after receiving the notification, a client actively fetches the latest data from the server (push combined with pull).
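A sketch of this push-then-pull pattern with the Java client (the configuration path and method name are assumptions for the example): the server pushes only the change notification, and the client then pulls the new content and re-registers the watch.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.Watcher.Event.EventType;
    import org.apache.zookeeper.ZooKeeper;

    class ConfigSubscriber {
        static void subscribeConfig(ZooKeeper zk, String configPath) throws Exception {
            Watcher watcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getType() == EventType.NodeDataChanged) {
                        try {
                            subscribeConfig(zk, configPath);   // pull latest data, watch again
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            };
            // Read the current configuration and register the watch in one call.
            byte[] config = zk.getData(configPath, watcher, null);
            System.out.println("current config: " + new String(config));
        }
    }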
3.2 Naming Service
Naming services are another common scenario in distributed systems. With a naming service, client applications can obtain information such as the address or provider of a resource or service from a given name. Named entities are typically machines in the cluster, provided services, remote objects, and so on; we can collectively call them names (Name).
The most common case is the service address list in distributed service frameworks (such as RPC or RMI). By creating sequential nodes in ZooKeeper, it is easy to generate a globally unique path that can be used as a name.
ZooKeeper's naming service generates a globally unique ID.
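A small sketch with the Java client (the parent path /naming/jobs and the prefix job- are made up for the example): ZooKeeper appends the sequence number, and the returned path is unique cluster-wide.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class NamingService {
        // Generate a globally unique name by letting ZooKeeper append a sequence number.
        static String nextJobName(ZooKeeper zk) throws Exception {
            String path = zk.create("/naming/jobs/job-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            // e.g. /naming/jobs/job-0000000042 — unique among the parent's children
            return path.substring(path.lastIndexOf('/') + 1);
        }
    }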
3.3 Distributed Coordination/Notification
ZooKeeper's Watcher registration and asynchronous notification mechanism is well suited to notification and coordination between different machines, or even different systems, in a distributed environment, enabling real-time handling of data changes. Typically, different clients all watch the same ZNode on ZooKeeper (both its content and its children); when the ZNode changes, every subscribed client receives the corresponding Watcher notification and reacts accordingly.
ZK's distributed coordination/notification is a common way of communication between distributed systems and machines.
3.3.1 Heartbeat detection
Heartbeat detection between machines means that in a distributed environment, different machines (or processes) need to detect whether the others are running normally; for example, machine A needs to know whether machine B is alive. Traditionally we check whether hosts can PING each other, or, in more complex setups, establish long-lived connections between machines and rely on TCP's built-in heartbeat mechanism. These are all very common heartbeat approaches.
Let’s take a look at how to use ZK to implement heartbeat detection between distributed machines (processes).
Using the characteristics of ZooKeeper's temporary nodes, different processes can each create a temporary child node under an agreed node in ZooKeeper, and other processes can judge whether the corresponding process is alive simply by checking whether that temporary node exists. The detecting and detected systems then do not need to be directly connected; they are associated only through a node on ZooKeeper, which greatly reduces coupling.
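A sketch of the monitoring side using the Java client (the /heartbeat path and method name are assumptions; each monitored process is assumed to have created an ephemeral child under /heartbeat):

    import java.util.List;
    import org.apache.zookeeper.Watcher.Event.EventType;
    import org.apache.zookeeper.ZooKeeper;

    class HeartbeatMonitor {
        // List the ephemeral children under a well-known node and re-check
        // whenever the set of children changes (a process joined or died).
        static void watchLiveProcesses(ZooKeeper zk) throws Exception {
            List<String> alive = zk.getChildren("/heartbeat", event -> {
                if (event.getType() == EventType.NodeChildrenChanged) {
                    try {
                        watchLiveProcesses(zk);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            System.out.println("live processes: " + alive);
        }
    }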
3.3.2 Work progress report
In a typical task-distribution system, after tasks are dispatched to different machines for execution, each machine needs to report its execution progress back to the distribution system in real time. This, too, can be implemented with ZooKeeper: choose a node on ZooKeeper and have each task client create a temporary child node under it, which achieves two things (see the sketch after this list):
Whether a task machine is alive can be determined by checking whether its temporary node exists.
Each task machine writes its execution progress into its temporary node in real time, so the central system can read the progress in real time.
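A sketch of the task-machine side with the Java client (the /tasks parent path, the taskId naming and the progress format are assumptions for the example):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class ProgressReporter {
        // Announce liveness with an ephemeral child node and keep writing the
        // current progress into it; the monitor reads the children and their data.
        static void reportProgress(ZooKeeper zk, String taskId, int percent) throws Exception {
            String node = "/tasks/" + taskId;
            if (zk.exists(node, false) == null) {
                // Ephemeral: disappears automatically if this worker's session dies.
                zk.create(node, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            }
            zk.setData(node, Integer.toString(percent).getBytes(), -1);   // -1 = ignore version
        }
    }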
3.4 Master Election
Master election is arguably the most typical application scenario of ZooKeeper, for example the election of the Active NameNode in HDFS, the Active ResourceManager in YARN, and the Active HMaster in HBase.
To implement Master election, we could rely on the primary-key constraint of a common relational database: all machines competing to become Master insert a record with the same primary key into the database; the database enforces the primary-key constraint, so only one insert succeeds, and the machine whose insert succeeds is considered the Master.
Relying on the primary-key constraint of a relational database can indeed guarantee that a single Master is elected in the cluster.
But what if the elected Master dies? Who will tell us that the Master is dead? A relational database obviously cannot notify us of this event, but ZooKeeper can.
ZooKeeper's strong consistency guarantees that node creation is globally unique even under high distributed concurrency; that is, ZooKeeper ensures that a client cannot create a ZNode that already exists.
In other words, if multiple clients request to create the same temporary node at the same time, only one of them will succeed. Using this feature, Master election is easy to perform in a distributed environment.
The machine whose client successfully creates the node becomes the Master. Meanwhile, every client that failed to create the node registers a Watcher on the node to monitor whether the current Master is still alive; once the Master is found to be dead, the remaining clients re-run the election.
This enables the dynamic election of Master.
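A minimal sketch of this election with the Java client (the /master path and the id parameter are assumptions for the example; libraries such as Apache Curator provide ready-made election recipes for production use):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher.Event.EventType;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class MasterElection {
        // Each candidate tries to create the same ephemeral node; exactly one
        // succeeds and becomes Master, the rest watch the node and retry later.
        static void runForMaster(ZooKeeper zk, String id) throws Exception {
            try {
                zk.create("/master", id.getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                System.out.println(id + " is now the Master");
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is Master; watch the node so we can re-elect when it dies.
                Stat stat = zk.exists("/master", event -> {
                    if (event.getType() == EventType.NodeDeleted) {
                        try {
                            runForMaster(zk, id);
                        } catch (Exception ex) {
                            ex.printStackTrace();
                        }
                    }
                });
                if (stat == null) {
                    runForMaster(zk, id);   // the Master vanished before we could watch it
                } else {
                    System.out.println(id + " is Standby");
                }
            }
        }
    }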
3.5 Distributed lock
Distributed locks are a way to control synchronous access to shared resources between distributed systems.
Distributed locks are divided into exclusive locks and shared locks.
3.5.1 Exclusive lock
Exclusive locks (X locks for short) are also known as write locks.
If transaction T1 holds an exclusive lock on data object O1, then for the entire duration of the lock only T1 is allowed to read or update O1; no other transaction may perform any operation on O1, nor lock it in any way, until T1 releases the exclusive lock.
The core of an exclusive lock is therefore how to guarantee that exactly one transaction holds the lock at a time, and how to notify all waiting transactions once the lock is released.
How can an exclusive lock be implemented with ZooKeeper?
Define lock
A ZNode on ZooKeeper can represent a lock. For example, the /exclusive_lock/lock node can be defined as a lock.
Obtain the lock
As mentioned above, a ZNode on ZooKeeper represents the lock, and acquiring the lock means creating that ZNode. All clients try to create the temporary child node /exclusive_lock/lock under the /exclusive_lock node. ZooKeeper guarantees that only one client can create it successfully, and that client is considered to have acquired the lock. All other clients then register a Watcher for child-node changes on the /exclusive_lock node so that they can react to changes of the lock node in real time.
Release the lock
Because /exclusive_lock/lock is a temporary node, the lock can be released in either of two cases.
If the client machine currently obtaining the lock fails or restarts, the temporary node will be deleted and the lock will be released.
After the business logic is executed normally, the client will actively delete the temporary node it created and release the lock.
However the lock node is removed, ZooKeeper notifies all clients that registered a node-change Watcher on the /exclusive_lock node. After receiving the notification, these clients initiate lock acquisition again, i.e. they repeat the "acquiring the lock" step.
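A blocking sketch of this scheme with the Java client (for simplicity it watches the lock node itself rather than the parent's child list, which is equivalent for a single lock node; it also ignores the herd effect that the sequential-node recipe avoids):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher.Event.EventType;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class ExclusiveLock {
        static void acquireLock(ZooKeeper zk) throws Exception {
            while (true) {
                try {
                    // Only one client can create the ephemeral lock node.
                    zk.create("/exclusive_lock/lock", new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                    return;   // we now own the lock
                } catch (KeeperException.NodeExistsException e) {
                    CountDownLatch released = new CountDownLatch(1);
                    Stat stat = zk.exists("/exclusive_lock/lock", event -> {
                        if (event.getType() == EventType.NodeDeleted) {
                            released.countDown();
                        }
                    });
                    if (stat != null) {
                        released.await();   // block until the current holder releases it
                    }
                    // If stat == null the lock already vanished: loop and try again.
                }
            }
        }

        static void releaseLock(ZooKeeper zk) throws Exception {
            zk.delete("/exclusive_lock/lock", -1);   // or simply close the session
        }
    }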
3.5.2 Shared lock
Shared locks (S locks for short) are also known as read locks. If transaction T1 holds a shared lock on data object O1, then T1 may only read O1, and other transactions may also place shared locks (but not exclusive locks) on O1 at the same time. An exclusive lock cannot be placed on O1 until all shared locks on it are released.
In short: multiple transactions can hold a shared lock on the same object at the same time (concurrent reads), and while any shared lock exists, an exclusive lock cannot be acquired (because the exclusive lock is a write lock).
4. Application of ZooKeeper in large distributed systems
The typical application scenarios of ZooKeeper were introduced above. This section uses the common big-data products Hadoop and HBase as examples to show how ZooKeeper is applied in large distributed systems.
4.1 Application of ZooKeeper in Hadoop
In Hadoop, ZooKeeper is mainly used to implement HA (High Availability), including the HA of HDFS's NameNode and of YARN's ResourceManager. In addition, YARN uses ZooKeeper to store the running state of applications.
The principle of using ZooKeeper to implement HA is the same for HDFS's NameNode and YARN's ResourceManager, so this section uses YARN as the example.
[Figure: YARN architecture — ResourceManager, NodeManager, ApplicationMaster and Container]
As the figure shows, YARN consists mainly of four components: ResourceManager (RM), NodeManager (NM), ApplicationMaster (AM) and Container, of which the ResourceManager is the core.
The ResourceManager is responsible for unified management and allocation of all resources in the cluster. It receives resource reports from each NodeManager and allocates these resources to applications according to certain policies, and it maintains each application's ApplicationMaster information, NodeManager information, and resource usage.
To achieve HA, multiple ResourceManagers must coexist (usually two); only one is in the Active state while the others are Standby. When the Active node fails (for example, the machine goes down or restarts), the Standby nodes compete to produce a new Active node.
4.2 Active/Standby Switchover
Let's look at how YARN implements Active/Standby switchover among multiple ResourceManagers.
1. Create a lock node. On ZooKeeper there is a lock node /yarn-leader-election/appcluster-yarn. When the ResourceManagers start, they all compete to create a lock child node /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb under it; this node is a temporary node.
ZooKeeper guarantees that only one ResourceManager ultimately succeeds in creating the node. The one that succeeds switches to the Active state, and the ones that fail switch to Standby.
[Screenshot: contents of the ActiveBreadCrumb node, showing which ResourceManager is Active]
You can see that ResourceManager2 in the cluster is Active at this time.
2. Register a Watcher. All ResourceManagers in the Standby state register a node-change Watcher on the /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb node. Thanks to the characteristics of temporary nodes, they can quickly sense the state of the Active ResourceManager.
3. Active/Standby switchover. When the Active ResourceManager goes down or restarts, its client session with ZooKeeper becomes invalid, so the /yarn-leader-election/appcluster-yarn/ActiveBreadCrumb node is deleted. The Standby ResourceManagers then receive the Watcher event notification from the ZooKeeper server and repeat Step 1.
The above is how ZooKeeper is used to implement Active/Standby switchover of the ResourceManager, i.e. ResourceManager HA.
The NameNode HA in HDFS follows the same principle as the ResourceManager HA in YARN; its lock node is /hadoop-ha/mycluster/ActiveBreadCrumb.
4.3 ResourceManager status storage
Inside the ResourceManager, the RMStateStore stores some of the RM's internal state, including Applications and their Attempts, Delegation Tokens, Version Information, and so on. Note that most of the RM's runtime state does not need to be persisted, because it can easily be reconstructed from context, such as resource usage information. The storage design offers three possible implementations, as follows.
A memory-based implementation, generally used for everyday development and testing.
A file-system-based implementation, such as HDFS.
A ZooKeeper-based implementation.
Since the amount of state involved is not very large, Hadoop officially recommends the ZooKeeper-based implementation. On ZooKeeper, the ResourceManager's state is stored under the root node /rmstore.
[Figure: ZNode tree under /rmstore, including RMAppRoot and RMDTSecretManagerRoot]
The RMAppRoot node stores information about each Application, and RMDTSecretManagerRoot stores security-related information. When a ResourceManager becomes Active, it reads this state from ZooKeeper during initialization and continues processing based on it.
4.4 Summary:
The main uses of ZooKeeper in Hadoop are:
HA of the NameNode in HDFS and HA of the ResourceManager in YARN.
Storage of ResourceManager state information (RMStateStore).
5. Application of ZooKeeper in HBase
HBase mainly uses ZooKeeper for HMaster election and active/standby switchover, system fault tolerance, RootRegion management, Region state management, and distributed SplitWAL task management.
5.1 HMaster Election and Active/Standby Switchover
HMaster election and active/standby switchover follow the same principle as the NameNode HA in HDFS and the ResourceManager HA in YARN.
5.2 System Fault Tolerance
When HBase starts, each RegionServer creates an information node (hereafter the "rs status node") under ZooKeeper's /hbase/rs node, for example /hbase/rs/[Hostname], and the HMaster registers a listener on these nodes. When a RegionServer fails, ZooKeeper stops receiving its heartbeats, the session expires, and ZooKeeper deletes the rs status node corresponding to that RegionServer.
The HMaster then receives a NodeDeleted notification from ZooKeeper, senses that a node has gone offline, and immediately starts fault-tolerance work.
Why doesn't HBase simply let the HMaster monitor the RegionServers directly? If the HMaster managed RegionServer status itself through heartbeats, its management burden would grow heavier and heavier as the cluster grows; the HMaster itself may also fail, so the status data would need to be persisted elsewhere. In this situation ZooKeeper is the ideal choice.
5.3 RootRegion Management
In an HBase cluster, the location of the stored data is recorded in a metadata region, the RootRegion. Whenever a client issues a new request and needs to know where data is located, it queries the RootRegion, and the location of the RootRegion itself is recorded on ZooKeeper (by default in ZooKeeper's /hbase/meta-region-server node).
When the RootRegion changes, for example because of a manual Region move, load rebalancing, or a failure of the server hosting the RootRegion, HBase can sense the change through ZooKeeper and take the corresponding recovery measures, so that clients can always obtain correct RootRegion information.
5.4 Region Management
Regions in HBase change frequently because of system failures, load balancing, configuration changes, and Region splits and merges. Whenever a Region moves, it goes through an offline and then an online phase.
While a Region is offline its data cannot be accessed, and this state change must be known globally, otherwise transactional anomalies may occur.
In a large HBase cluster there may be as many as 100,000 Regions or more; handing Region state management at this scale to ZooKeeper is also a good choice.
5.5 Distributed SplitWAL Task Management
When a RegionServer crashes, some newly written data will not yet have been persisted to HFiles, so when migrating that RegionServer's service, an important task is to recover this in-memory data from the WAL. The most critical step is SplitWAL: the HMaster must traverse the RegionServer's WAL, split it into small pieces, move them to new locations, and replay the logs.
Because a single RegionServer's log volume is usually large (it may host thousands of Regions with gigabytes of logs), users want log recovery to finish quickly. A feasible solution is therefore to distribute this WAL-processing work across several RegionServers, which requires a persistent component to help the HMaster assign the tasks.
The current approach is that the HMaster creates a SplitWAL node on ZooKeeper (by default /hbase/SplitWAL), stores information such as "which RegionServer handles which Region" on it as a list, and each RegionServer then claims tasks from this node and updates the node when a task succeeds or fails, so that the HMaster can continue with the subsequent steps. Here ZooKeeper serves both as the notification channel between members of the distributed cluster and as persistent storage for the task information.
5.6 Summary:
The above are some typical scenarios in which HBase relies on ZooKeeper for distributed coordination, but HBase's dependence on ZooKeeper goes further: for example, the HMaster also relies on ZooKeeper to record the enable/disable state of tables, and almost all of HBase's metadata is stored on ZooKeeper.
Thanks to ZooKeeper's excellent distributed coordination capabilities and notification mechanism, HBase has added more ZooKeeper use cases with each version, and the two overlap more and more. All of HBase's ZooKeeper operations are encapsulated in the org.apache.hadoop.hbase.zookeeper package; interested readers can study it on their own.