Creating a Containerized Kafka Cluster on a GCP Compute Engine VM
Watching movies or videos on Netflix or YouTube, enjoying music on Spotify or Amazon Music, listening to audiobooks on Audible: everything I've just mentioned uses some form of data streaming to convey information to the end consumer. Even the weather app we check to see whether it will snow today or rain in the evening relies on data streamed in the background from different weather sensors. We can safely say streaming is now an integral part of our lives.
Apache Kafka is one of the prime technologies for handling streaming data from different sources through a distributed system. It is an open-source distributed platform, which means you can process streams in cluster mode, with node-failure tolerance provided automatically by replicating data across nodes. For testing purposes you can also set up a single-node Kafka service.
This article demonstrates how to set up a streaming cluster in Docker containers inside a cloud-based virtual machine. To be more precise, a Google Cloud Platform Compute Engine VM is used to host a containerized Kafka cluster.
Kafka Components
Before jumping into the setup and configuration, let's briefly discuss a few services, applications, and components related to Kafka.
Zookeeper
ZooKeeper is a coordination service that supports the coordination of distributed applications. It is high-performance and can be used off-the-shelf to support consensus, group management, leader election, presence protocols, and so on. ZooKeeper is itself replicated across servers and can be used to implement high-level synchronization, grouping, and naming, along with configuration management. As a detailed treatment of ZooKeeper is out of scope for this story, we won't go into it any further.
Kafka Broker
As a distributed service, Kafka may consist of one or more servers, and the servers that form the storage layer are called Kafka brokers. The other servers can take on different roles, such as handling communication with external applications.
Kafka Event
In a general sense, an event is something that happened or is happening, and the same concept applies to a Kafka event: an event in a Kafka stream means something happened. Kafka events are also known as messages or records. Writing data to and reading data from Kafka are both handled as events. A Kafka event has three major parts, the event key, the event value, and the event timestamp, plus an optional part, the event metadata.
Kafka Producer
Kafka producers are client applications that write data streams into a Kafka service.
Kafka Consumer
Like Kafka producers, Kafka consumers are also client applications, except that they read and process data streams from a Kafka service.
Docker container
Docker is, at its core, a technology that packages applications, whether interconnected or not, into containers. Sounds complex? Don't worry, Docker's logo is somewhat self-explanatory: the giant whale ship carries different containers, and that is exactly what Docker does. It creates containers for programs and code along with all their dependencies, and maintains the communication among those programs and services when required.
Docker containers set themselves up on a host operating system and don't need to worry about the underlying infrastructure. A container can carry its own OS image without the hypervisor that a VM setup requires. Here is another self-explanatory figure of containerized applications.
Docker is easy to install, deploy, and use. It helps you develop faster and produce complete builds that are easy to share, and the community is big enough that support arrives quickly.
In our design we wanted to reduce the overhead of creating different VMs for different purposes, so we adopted a Docker container based architecture. That not only reduced cost; it is also faster to deploy, cuts down the overhead of configuring ZooKeeper and Kafka parameters, and is easier to monitor.
Architecture in mind
During the design phase we had the following architecture in mind. We wanted applications to connect from within the Docker host, from outside the Docker host but within the GCP network, and from outside GCP altogether.
Step-by-step Solution
The first step was to identify a suitable Docker image for Kafka. We found several candidate images, each with different pros and cons. For example, the Confluent image has Kafka Connect and Kafka Schema Registry integrated, while the Spotify image does not deploy ZooKeeper in a separate container, and so on. We chose the Confluent image so that later on we can use the Schema Registry, which can detect schema changes in data and act accordingly.
Creating VM
We've created a VM from the GCP console with the following configuration (an equivalent gcloud command is sketched after the list). This will cost around 230 USD per month; you can shut the VM down when you are no longer using Kafka or only use it for testing.
- OS Image: Ubuntu 20.04 LTS
- Machine Type: n2-standard-8 (8 vCPU and 32GB Memory)
- Disk size: 30 GB
- Identity and API Access: Allow full access to all cloud APIs
- Network Tags: kafka
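If you prefer the command line, here's a hedged sketch of an equivalent gcloud command; the instance name and zone are placeholders I've picked, so adjust them to your project.

# sketch only: instance name and zone are placeholders, adjust to your project
gcloud compute instances create kafka-vm \
    --zone=us-central1-a \
    --machine-type=n2-standard-8 \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=30GB \
    --scopes=cloud-platform \
    --tags=kafka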
We have also created a firewall rule to open a few ports of the VM for communication from outside the GCP network.
As we will be testing the cluster from outside the GCP network, we can include our own IP addresses in the firewall rule. Put your IP address in the red-marked position in the console shown below, and allow Ingress for the TCP ports (9101, 9102, 9103, 9111, 9112, 9113) that we'll be using in the Kafka configuration.
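If you'd rather script the firewall rule too, a sketch like the following should work; the rule name is a placeholder of mine, and <YOUR_IP> stands for your own address.

# sketch only: replace <YOUR_IP> with your public IP address
gcloud compute firewall-rules create allow-kafka-external \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:9101-9103,tcp:9111-9113 \
    --source-ranges=<YOUR_IP>/32 \
    --target-tags=kafka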
Installing Docker
To install Docker on the VM, SSH into it from the GCP console. Use the following two commands to bring the apt package index up to date.
sudo apt-get update
sudo apt-get upgrade -y
Now install a few other packages needed to support the Docker installation.
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common -y
Add the Docker apt key.
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Add the Docker repository to the apt sources list.
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
Install Docker Engine, the Docker CLI, and containerd.io.
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io -y
Now download docker-compose to the path /usr/local/bin/docker-compose.
sudo curl -L "https://github.com/docker/compose/releases/download/1.27.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
And make docker-compose executable.
sudo chmod +x /usr/local/bin/docker-compose
This completes the Docker and docker-compose installation.
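As a quick sanity check, you can confirm both binaries respond before moving on.

docker --version
docker-compose --version
sudo docker run --rm hello-world   # optional smoke test of the Docker daemon

Next we'll create a docker-compose file with the necessary configuration for ZooKeeper and Kafka.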
Create docker-compose.yaml
Create a file named docker-compose.yaml in the home directory of the current user:
sudo nano docker-compose.yaml
Now we'll add content to the docker-compose.yaml file piece by piece. The file describes the different services or applications we want to containerize, and it follows the structure below.
version: 'version_number'
services:
  service_name_1:
    image: name_of_the_image
    hostname: host_name_for_the_service
    ports:
      - "HOST:CONTAINER"
    environment:
      ENVIRONMENT_VAR_1: value
      ENVIRONMENT_VAR_2: value
      ENVIRONMENT_VAR_3: value
  service_name_2:
    image: name_of_the_image
    hostname: host_name_for_the_service
    ports:
      - "HOST:CONTAINER"
    environment:
      ENVIRONMENT_VAR_1: value
      ENVIRONMENT_VAR_2: value
      ENVIRONMENT_VAR_3: value
However, before we start filling in docker-compose.yaml, we want two environment variables set so we can use them in the file later. The first one holds the ephemeral external IP address of the GCP Compute Engine VM, and the second one holds its private IP address.
export EXTERNAL_IP=$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip)
export INTERNAL_IP=$(curl -H Metadata-Flavor:Google http://metadata/computeMetadata/v1/instance/network-interfaces/0/ip)
And if you want to persist these environment variables, you can add the two export lines to your ~/.bashrc file and source it.
nano ~/.bashrc
source ~/.bashrc
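A quick echo confirms both variables are populated before docker-compose reads them.

echo "INTERNAL_IP=$INTERNAL_IP EXTERNAL_IP=$EXTERNAL_IP"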
Now let's produce the docker-compose.yaml file. I'll explain the whole configuration line by line afterward.
version: '3.5'
services:
  zookeeper11:
    image: confluentinc/cp-zookeeper:latest
    hostname: zookeeper11
    ports:
      - "2182:12182"
    environment:
      ZOOKEEPER_SERVER_ID: 1
      ZOOKEEPER_CLIENT_PORT: 12182
      ZOOKEEPER_SERVERS: zookeeper11:2888:3888;zookeeper21:2888:3888;zookeeper31:2888:3888
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
      ZOOKEEPER_MAX_CLIENT_CNXNS: 60
  zookeeper21:
    image: confluentinc/cp-zookeeper:latest
    hostname: zookeeper21
    ports:
      - "2183:22182"
    environment:
      ZOOKEEPER_SERVER_ID: 2
      ZOOKEEPER_CLIENT_PORT: 22182
      ZOOKEEPER_SERVERS: zookeeper11:2888:3888;zookeeper21:2888:3888;zookeeper31:2888:3888
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
      ZOOKEEPER_MAX_CLIENT_CNXNS: 60
  zookeeper31:
    image: confluentinc/cp-zookeeper:latest
    hostname: zookeeper31
    ports:
      - "2184:32182"
    environment:
      ZOOKEEPER_SERVER_ID: 3
      ZOOKEEPER_CLIENT_PORT: 32182
      ZOOKEEPER_SERVERS: zookeeper11:2888:3888;zookeeper21:2888:3888;zookeeper31:2888:3888
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
      ZOOKEEPER_MAX_CLIENT_CNXNS: 60
  kafka11:
    image: confluentinc/cp-kafka:latest
    ports:
      - "9101:9101"
      - "9111:9111"
    environment:
      KAFKA_LISTENERS: LISTENER_DOCKER_INTERNAL://0.0.0.0:19101,LISTENER_DOCKER_HOST_INTERNAL://0.0.0.0:9101,LISTENER_DOCKER_HOST_EXTERNAL://0.0.0.0:9111
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka11:19101,LISTENER_DOCKER_HOST_INTERNAL://$INTERNAL_IP:9101,LISTENER_DOCKER_HOST_EXTERNAL://$EXTERNAL_IP:9111
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_HOST_INTERNAL:PLAINTEXT,LISTENER_DOCKER_HOST_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: zookeeper11:12182,zookeeper21:22182,zookeeper31:32182
      KAFKA_BROKER_ID: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper11
      - zookeeper21
      - zookeeper31
  kafka21:
    image: confluentinc/cp-kafka:latest
    ports:
      - "9102:9102"
      - "9112:9112"
    environment:
      KAFKA_LISTENERS: LISTENER_DOCKER_INTERNAL://0.0.0.0:19102,LISTENER_DOCKER_HOST_INTERNAL://0.0.0.0:9102,LISTENER_DOCKER_HOST_EXTERNAL://0.0.0.0:9112
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka21:19102,LISTENER_DOCKER_HOST_INTERNAL://$INTERNAL_IP:9102,LISTENER_DOCKER_HOST_EXTERNAL://$EXTERNAL_IP:9112
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_HOST_INTERNAL:PLAINTEXT,LISTENER_DOCKER_HOST_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: zookeeper11:12182,zookeeper21:22182,zookeeper31:32182
      KAFKA_BROKER_ID: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper11
      - zookeeper21
      - zookeeper31
  kafka31:
    image: confluentinc/cp-kafka:latest
    ports:
      - "9103:9103"
      - "9113:9113"
    environment:
      KAFKA_LISTENERS: LISTENER_DOCKER_INTERNAL://0.0.0.0:19103,LISTENER_DOCKER_HOST_INTERNAL://0.0.0.0:9103,LISTENER_DOCKER_HOST_EXTERNAL://0.0.0.0:9113
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka31:19103,LISTENER_DOCKER_HOST_INTERNAL://$INTERNAL_IP:9103,LISTENER_DOCKER_HOST_EXTERNAL://$EXTERNAL_IP:9113
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_HOST_INTERNAL:PLAINTEXT,LISTENER_DOCKER_HOST_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: zookeeper11:12182,zookeeper21:22182,zookeeper31:32182
      KAFKA_BROKER_ID: 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper11
      - zookeeper21
      - zookeeper31
Let's now talk about what those lines mean. I'll explain two parts: first the ZooKeeper configuration and then, obviously, the Kafka configuration.
version: Version of the docker-compose file format
services: The services that are introduced in the subsequent configuration lines
Zookeeper
zookeeper11: Name of the Docker service/application
image: Name of the ZooKeeper image, 'confluentinc/cp-zookeeper:latest' in our case
hostname: Given hostname for the zookeeper11 service
ports: Host-to-container port mapping for the zookeeper11 service
environment: Environment variables for the ZooKeeper service
ZOOKEEPER_SERVER_ID: Numeric ID of this server within the ensemble
ZOOKEEPER_CLIENT_PORT: Instructs ZooKeeper where to listen for connections from clients such as Kafka
ZOOKEEPER_SERVERS: List of the ZooKeeper servers in the ensemble, as semicolon-separated host:peer_port:leader_election_port entries
ZOOKEEPER_TICK_TIME: The basic time unit, in milliseconds, used to regulate heartbeats; 2000 is typical
ZOOKEEPER_INIT_LIMIT: Amount of time, in ticks (see tick time), to allow followers to connect and sync to a leader; 5 is typical
ZOOKEEPER_SYNC_LIMIT: Amount of time, in ticks, to allow followers to sync with ZooKeeper; 2 is typical
ZOOKEEPER_MAX_CLIENT_CNXNS: Limits the number of concurrent connections (at the socket level) that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble; the default is 60
Kafka
kafka11: Name of the Kafka Docker service/application
image: Name of the Kafka image, 'confluentinc/cp-kafka:latest' in our case
ports: Published ports for Kafka communication
- "9101:9101" (used for internal/Docker host communication)
- "9111:9111" (used for communication from outside Docker and the Docker host)
environment: Kafka environment variables
KAFKA_LISTENERS: Comma-separated name, host, and port of each listener, in the form Name_Of_The_Listener://HostNameOrIPOfTheHost:ListenerPort
KAFKA_ADVERTISED_LISTENERS: The addresses the broker advertises to clients. This is equivalent to the advertised.listeners parameter in Kafka's server.properties file, and in a multi-node (production) environment you must set it to the external host/IP address that clients will actually use. You may have noticed the two environment variables used in place of hostnames ($EXTERNAL_IP and $INTERNAL_IP); these are the IP addresses that enable communication with the VM from outside GCP and from within GCP, respectively.
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: Specifies the security protocol for each listener
KAFKA_INTER_BROKER_LISTENER_NAME: Name of the listener used for broker-to-broker (internal) communication
KAFKA_ZOOKEEPER_CONNECT: Comma-separated ZooKeeper service names with their client ports (defined by ZOOKEEPER_CLIENT_PORT)
KAFKA_BROKER_ID: Numeric ID of this Kafka broker
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: Replication factor for Kafka's internal offsets topic
depends_on: Names of the services this service depends on; docker-compose will start those services before this one
This configuration produces connectivity like that shown in the following diagram.
Test the solution
So now we are all set. Run the following commands and check whether the Docker services are running; just look at the 'State' column of the output shown in the next figure.
sudo docker-compose -f docker-compose.yaml up -d
sudo docker-compose ps
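To exercise the cluster a bit further, you can use the console tools bundled in the cp-kafka image. The following is a sketch rather than part of the original setup: it targets the internal listener kafka11:19101 from the configuration above, and the topic name is a placeholder of mine. If your image ships a Kafka version older than 2.5, the console producer may need --broker-list instead of --bootstrap-server.

# confirm all three brokers registered in ZooKeeper (should print [1, 2, 3])
sudo docker-compose exec kafka11 zookeeper-shell zookeeper11:12182 ls /brokers/ids

# create a topic replicated across all three brokers
sudo docker-compose exec kafka11 kafka-topics --create \
    --bootstrap-server kafka11:19101 \
    --topic test-topic --partitions 3 --replication-factor 3

# write one message, then read it back
echo "hello kafka" | sudo docker-compose exec -T kafka11 \
    kafka-console-producer --bootstrap-server kafka11:19101 --topic test-topic
sudo docker-compose exec kafka11 kafka-console-consumer \
    --bootstrap-server kafka11:19101 --topic test-topic \
    --from-beginning --max-messages 1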
Conclusion
Now that you have Kafka with ZooKeeper set up in a Docker environment, you can write code in Java or Python to utilize the power of Kafka; the small snippets above give you a starting point for testing the services.
Confluent provides various tools that enrich Kafka and its monitoring. You can use kafkacat to test and debug the deployment you have done, and for visual monitoring you can also use Control Center.
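For example, assuming kafkacat is installed on your local machine and the firewall rule above admits your IP, a sketch like this should reach the external listeners (replace <EXTERNAL_IP> with the VM's ephemeral IP).

# list brokers and topics through the external listener
kafkacat -b <EXTERNAL_IP>:9111 -L

# produce one message from outside GCP, then consume it back
echo "hello from outside" | kafkacat -b <EXTERNAL_IP>:9111 -t test-topic -P
kafkacat -b <EXTERNAL_IP>:9111 -t test-topic -C -o beginning -e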
Confluent's Kafka distribution also comes with another important tool, known as Schema Registry, that helps track changes in the structure of data. You can add it as another service in the same docker-compose.yaml to see how it works.
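As a starting point, here is a hedged, untested sketch of such a service entry; the service name and host port are placeholders of mine, and the bootstrap address reuses the internal listener from the configuration above.

# sketch only: add under 'services:' in the same docker-compose.yaml
  schema-registry:
    image: confluentinc/cp-schema-registry:latest
    hostname: schema-registry
    ports:
      - "8081:8081"   # Schema Registry's conventional default port
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_LISTENERS: http://0.0.0.0:8081
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka11:19101
    depends_on:
      - kafka11
      - kafka21
      - kafka31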
I've not included any of the aforementioned services or components in the main setup for the sake of keeping it simple. However, I'll try to publish a Schema Registry and Control Center based solution, and hope to share a more code-focused demonstration, in the future.
I hope you've enjoyed the article. Either way, don't hesitate to leave a comment, a suggestion, or a criticism.