Service Discovery in System Design: Registry, Health Checks & Client vs Server-Side Discovery (Visualized)
In dynamic microservice environments, service instances start and stop constantly, so their IPs are never static. Service discovery solves this by giving every service a way to find healthy peers at runtime without hardcoding addresses. This guide covers registries, health checks, DNS-based discovery, and client- vs server-side patterns, with live animations.
Service discovery is the mechanism by which services in a distributed system automatically locate the network addresses of other services they need to communicate with, without relying on hardcoded IPs or manual configuration. In modern cloud and microservice architectures, instances are ephemeral: they scale up and down, crash and restart, and migrate across hosts, so any address you record today may be stale in minutes. Service discovery keeps the mapping from service names to live addresses accurate and up to date.
The problem is deceptively simple on a single server but explodes in complexity at scale. Imagine 50 microservices, each running 3 to 10 replicas, deployed across an auto-scaling cluster. Every replica has a different IP. When order-service needs to call inventory-service, which IP should it use? What if the only healthy instance just moved to a different node? Without service discovery, every deploy becomes a manual DNS update: fragile, error-prone, and impossible to automate.
The Service Registry: The Central Source of Truth
At the heart of every service discovery system is a service registry: a database that maps service names to the IP addresses and ports of their live instances. When a new service instance boots, it registers itself with the registry, advertising its name, address, and port. When it shuts down gracefully, it deregisters. If it crashes, the registry detects the absence via health checks and removes the entry automatically.
A service registry entry typically stores: the service name (e.g., inventory-service), the host IP and port, optional metadata tags (version, region, environment), and a health status field. Well-known registry implementations include Consul (HashiCorp), etcd (CNCF, also powers Kubernetes), Apache ZooKeeper, and Netflix Eureka. Kubernetes builds its own registry abstraction on top of etcd using Service objects, which CoreDNS exposes as DNS names.
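The register, deregister, and lookup cycle described above can be sketched as a small in-memory structure. This is an illustrative toy, not how Consul, etcd, or Eureka are implemented; the class names `Instance` and `ServiceRegistry` and the sample addresses are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One registry entry: where a single service instance can be reached."""
    service: str
    host: str
    port: int
    tags: tuple = ()  # optional metadata such as version or region


class ServiceRegistry:
    """Toy registry mapping service names to instances and a health flag."""

    def __init__(self):
        self._entries = {}  # service name -> {Instance: is_healthy}

    def register(self, instance):
        self._entries.setdefault(instance.service, {})[instance] = True

    def deregister(self, instance):
        self._entries.get(instance.service, {}).pop(instance, None)

    def set_health(self, instance, healthy):
        if instance in self._entries.get(instance.service, {}):
            self._entries[instance.service][instance] = healthy

    def lookup(self, service):
        """Return only the instances currently marked healthy."""
        return [i for i, ok in self._entries.get(service, {}).items() if ok]


registry = ServiceRegistry()
a = Instance("inventory-service", "10.0.0.5", 8080, ("v2", "eu-west"))
b = Instance("inventory-service", "10.0.0.6", 8080)
registry.register(a)
registry.register(b)
registry.set_health(b, False)    # a failed probe hides b from lookups
healthy = registry.lookup("inventory-service")
registry.deregister(a)           # graceful shutdown removes the entry
```

A production registry adds replication, persistence, and watch or notification APIs on top of this basic name-to-instances map.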
Health Checks and Automatic Deregistration
A service registry is only as useful as the accuracy of its data. An instance can die without sending a deregistration message, for example when the process crashes, the network partitions, or the host goes offline. To handle this, registries use health checks: periodic probes sent by the registry (or by the instance itself, via heartbeats) to verify an instance is still alive. Common probe types include:
HTTP health endpoint: the registry calls GET /health and expects a 200 OK.
TCP check: the registry opens a TCP socket; if it connects, the instance is alive.
Heartbeat / TTL: the instance sends a keep-alive ping to the registry every N seconds; if the registry does not receive a ping within the TTL window, it marks the instance dead and removes it. This is the model used by Consul (with TTL checks) and Eureka (with 30-second heartbeats and a 90-second eviction timeout).
The key insight is that health checks create a self-healing registry. Engineers do not need to manually remove failed instances from a load balancer config; the registry does it automatically when the instance stops passing probes. In Consul, you configure a health check interval (e.g., every 10 seconds) and a deregister_critical_service_after threshold (e.g., 90 seconds; Consul enforces a minimum of 1 minute). In Kubernetes, readiness probes on Pods serve an equivalent function: the control plane removes unready Pods from a Service's Endpoints, so kube-proxy and CoreDNS stop routing to them. Liveness probes complement this by restarting containers that stop responding.
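The heartbeat / TTL model can be sketched in a few lines. The example below is a simplified illustration, assuming a hypothetical `HeartbeatRegistry` class and an injectable clock so the timeline can be simulated without sleeping; the 90-second TTL mirrors the Eureka eviction timeout mentioned above, but nothing here reflects Eureka's or Consul's actual code.

```python
import time


class HeartbeatRegistry:
    """Toy TTL registry: instances must keep sending heartbeats or be evicted."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so the example needs no sleeping
        self._last_seen = {}      # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Registers the instance on first call and renews its lease afterwards."""
        self._last_seen[(service, address)] = self.clock()

    def evict_expired(self):
        """Remove every instance whose last heartbeat is older than the TTL."""
        now = self.clock()
        dead = [key for key, seen in self._last_seen.items() if now - seen > self.ttl]
        for key in dead:
            del self._last_seen[key]
        return dead

    def lookup(self, service):
        self.evict_expired()
        return sorted(addr for (svc, addr) in self._last_seen if svc == service)


# Simulated clock so the timeline is deterministic.
now = [0.0]
registry = HeartbeatRegistry(ttl_seconds=90, clock=lambda: now[0])
registry.heartbeat("inventory-service", "10.0.0.5:8080")
registry.heartbeat("inventory-service", "10.0.0.6:8080")

now[0] = 60.0
registry.heartbeat("inventory-service", "10.0.0.5:8080")  # .5 renews, .6 stays silent

now[0] = 100.0                                            # .6 was last seen 100 s ago
alive = registry.lookup("inventory-service")
```

Here eviction runs lazily on lookup; real registries typically run it on a background timer so that dead entries disappear even when nobody is querying.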
Client-Side vs Server-Side Discovery
There are two fundamental patterns for how a calling service finds the right instance to contact. The choice affects where intelligence lives: in the client or in the infrastructure.
Client-Side Discovery
In client-side discovery, the calling service (client) queries the registry directly, retrieves the full list of healthy instances, and picks one itself using a load-balancing algorithm embedded in the client, usually round-robin or random. This is the model used by Netflix Ribbon (with Eureka) and by service mesh sidecars like Envoy when operating in client-side mode. The client caches the instance list and refreshes it periodically to avoid hammering the registry on every call.
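A minimal sketch of this pattern follows: the client fetches the instance list, caches it, and rotates through it round-robin. `DiscoveryClient` and the `fetch_instances` callable are hypothetical names standing in for a real registry client library such as Ribbon; the refresh interval is an assumed value.

```python
import itertools
import time


class DiscoveryClient:
    """Client-side discovery: cache the instance list, pick one per call."""

    def __init__(self, fetch_instances, refresh_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch_instances   # hypothetical callable: service -> list of addresses
        self._refresh = refresh_seconds
        self._clock = clock
        self._cache = {}                # service -> (fetched_at, round-robin iterator)

    def choose(self, service):
        now = self._clock()
        cached = self._cache.get(service)
        if cached is None or now - cached[0] >= self._refresh:
            instances = self._fetch(service)
            if not instances:
                raise LookupError(f"no healthy instances for {service}")
            cached = (now, itertools.cycle(instances))
            self._cache[service] = cached
        return next(cached[1])


registry_calls = []

def fake_registry(service):
    # Stand-in for a registry query; records how often it is hit.
    registry_calls.append(service)
    return ["10.0.0.5:8080", "10.0.0.6:8080"]

client = DiscoveryClient(fake_registry, refresh_seconds=30.0, clock=lambda: 0.0)
picks = [client.choose("inventory-service") for _ in range(4)]
```

Four calls alternate between the two instances while the registry is queried only once, which is the point of the cache. The cost is that the list can be up to `refresh_seconds` out of date.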
Server-Side Discovery
In server-side discovery, the client sends a request to a well-known stable address (a load balancer or a DNS name), and the infrastructure picks which instance to route to. The client does not know or care about individual instance IPs. This is the pattern used by Kubernetes Services (which give every service a stable cluster IP backed by kube-proxy) and by AWS Elastic Load Balancers integrated with ECS service discovery. The routing intelligence lives in the infrastructure, not the client library.
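For contrast, here is a rough sketch of the server-side arrangement, in which the caller only names the service and a balancer owns the instance list. The `LoadBalancer` class, its `update` and `forward` methods, and the random selection policy are assumptions for illustration; kube-proxy and AWS ELB work quite differently internally.

```python
import random


class LoadBalancer:
    """Server-side discovery: callers see one stable name, never instance IPs."""

    def __init__(self, backends, rng=None):
        self._backends = backends          # service name -> list of healthy addresses
        self._rng = rng or random.Random()

    def update(self, service, addresses):
        """Called by the infrastructure (not by clients) as instances come and go."""
        self._backends[service] = list(addresses)

    def forward(self, service, request):
        pool = self._backends.get(service)
        if not pool:
            return (503, "no healthy backend")
        target = self._rng.choice(pool)
        # A real balancer would proxy the request; here we only report the choice.
        return (200, f"{request} -> {target}")


lb = LoadBalancer({"inventory-service": ["10.0.0.5:8080", "10.0.0.6:8080"]},
                  rng=random.Random(0))
status, body = lb.forward("inventory-service", "GET /stock/42")
lb.update("inventory-service", [])     # every instance failed its health check
down_status, _ = lb.forward("inventory-service", "GET /stock/42")
```

The client code never changes when the pool does, but every request now depends on the balancer being reachable.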
Comparing Client-Side vs Server-Side Discovery
| Aspect | Client-Side Discovery | Server-Side Discovery |
|---|---|---|
| Who picks the instance | Client library (e.g., Ribbon) | Load balancer / kube-proxy / DNS |
| Registry awareness | Client queries registry directly | Client knows only a stable VIP or DNS name |
| Load balancing logic | In the client | In the infrastructure |
| Flexibility | High: client can apply custom algorithms | Lower: LB algorithm is centrally managed |
| Operational complexity | Each language/framework needs a registry client | Simple clients; complexity in infra |
| Examples | Netflix Ribbon + Eureka, Envoy sidecar | Kubernetes Services + CoreDNS, AWS ELB + ECS |
| Failure surface | Client cache may be stale briefly | LB is a central choke point (must be HA) |
DNS-Based Service Discovery
DNS-based discovery leverages the Domain Name System to resolve a service name to one or more instance IPs. Instead of a proprietary registry API, services simply do a standard DNS lookup. The registry (or control plane) keeps the DNS records updated as instances come and go. Low TTLs (often 5 to 30 seconds) ensure stale records are not cached too long.
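From the caller's side, DNS-based discovery is just a name lookup. The sketch below wraps Python's standard `socket.getaddrinfo`; the `resolve_instances` helper, the injectable `resolver` parameter, the hostname, and the IP addresses are all assumptions made so the example runs without a real DNS server.

```python
import socket


def resolve_instances(hostname, port, resolver=socket.getaddrinfo):
    """Return the distinct IP addresses that DNS currently lists for hostname."""
    results = resolver(hostname, port, type=socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for *_, sockaddr in results})


def fake_resolver(hostname, port, type=0):
    # Stand-in for a DNS server returning two A records (hypothetical addresses).
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.5", port)),
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.6", port)),
    ]


ips = resolve_instances("inventory-service.example.internal", 8080,
                        resolver=fake_resolver)
```

With the default resolver the same function performs a real lookup, and the answer is only as fresh as the record TTL and any caching layers between the client and the DNS server.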
In Kubernetes, CoreDNS is the cluster DNS server. Every Service object automatically gets a DNS entry: inventory-service.default.svc.cluster.local resolves to the Service's stable cluster IP. For headless services (clusterIP: None), CoreDNS returns A records for every ready Pod IP directly, giving clients the full instance list for client-side selection. Consul also supports DNS-based discovery, letting any service call inventory-service.service.consul to get a healthy instance.
# Kubernetes: standard Service (server-side: CoreDNS resolves to stable ClusterIP)
apiVersion: v1
kind: Service
metadata:
  name: inventory-service
spec:
  selector:
    app: inventory
  ports:
    - port: 80
      targetPort: 8080
---
# Kubernetes: headless Service (client-side: DNS returns all Pod IPs)
apiVersion: v1
kind: Service
metadata:
  name: inventory-service-headless
spec:
  clusterIP: None   # no VIP; returns Pod A records
  selector:
    app: inventory
  ports:
    - port: 8080

Named Implementations at a Glance
| Tool | Discovery Style | Health Checks | Notable Feature |
|---|---|---|---|
| Consul | Client-side (API) + DNS | HTTP, TCP, script, TTL heartbeat | Service mesh, KV store, multi-datacenter |
| etcd | Client-side (via lease/watch API) | TTL-based leases (self-managed) | Strong consistency (Raft), powers Kubernetes |
| Eureka (Netflix) | Client-side (Ribbon) | Heartbeat every 30 s; eviction at 90 s | Java-first; peer-to-peer registry replication |
| Kubernetes Services + CoreDNS | Server-side (kube-proxy/iptables) + DNS | Readiness probes on Pods | Native k8s; headless mode for client-side |
| AWS Cloud Map | Client-side (API) + Route 53 DNS | Route 53 health checks | Native AWS; integrates with ECS, EKS, Lambda |
| ZooKeeper | Client-side (Curator recipes) | Ephemeral nodes (session-based) | Strong consistency; used by Kafka, Hadoop |
Self-Registration vs Third-Party Registration
A subtle but important design choice is who registers a service instance. In self-registration, the instance calls the registry on startup (e.g., via a Consul agent sidecar or a Spring Cloud Netflix library). This is simple but couples the application code to the registry. In third-party registration, an external orchestration layer (the Kubernetes controller, a Nomad scheduler, or an AWS ECS agent) registers and deregisters instances on behalf of the service, so the application code is registry-agnostic. Kubernetes uses third-party registration: when you create a Deployment, the control plane manages the Endpoints object, not the Pod itself.
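Self-registration can be sketched as a context manager wrapped around the service's lifetime. `InMemoryRegistry` and `self_registered` are hypothetical names; a real service would call its registry's client library at the same two points.

```python
from contextlib import contextmanager


class InMemoryRegistry:
    """Hypothetical stand-in for a real registry client."""

    def __init__(self):
        self.instances = {}   # service -> set of addresses

    def register(self, service, address):
        self.instances.setdefault(service, set()).add(address)

    def deregister(self, service, address):
        self.instances.get(service, set()).discard(address)


@contextmanager
def self_registered(registry, service, address):
    """Self-registration: the instance announces itself and cleans up on exit."""
    registry.register(service, address)
    try:
        yield
    finally:
        # Runs on graceful shutdown or on an exception, but not on a hard kill.
        registry.deregister(service, address)


registry = InMemoryRegistry()
with self_registered(registry, "inventory-service", "10.0.0.5:8080"):
    during = set(registry.instances["inventory-service"])
after = set(registry.instances["inventory-service"])
```

The `finally` block never runs if the process is killed outright, which is why self-registration still needs health checks as a backstop. With third-party registration, neither call would appear in application code at all.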
Common Pitfalls in Service Discovery
Stale instance cache: clients that cache registry results too aggressively will route to dead instances after a crash. Always set a short cache TTL and implement retry-with-jitter on connection failure.
Registry as a single point of failure: the registry must itself be highly available (Consul runs a 3- or 5-node Raft cluster; etcd runs a similar quorum). A registry outage does not mean immediate service disruption, because clients can serve from cache, but no new registrations or deregistrations will propagate.
Health check amplification: in large clusters, every registry node probing every service instance can create significant network overhead; use agent-based checks (each host's local Consul agent probes its own services) rather than centralised checks.
DNS caching and negative TTLs: JVM-based services are notorious for aggressive DNS caching; always set networkaddress.cache.ttl=5 in JVM deployments talking to CoreDNS or Consul DNS.
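The retry-with-jitter advice for stale caches might look like the sketch below. `call_with_retry`, its parameters, and the "full jitter" backoff policy are illustrative assumptions rather than a prescribed algorithm; `choose_instance` and `send` are hypothetical callables supplied by the caller.

```python
import random
import time


def call_with_retry(choose_instance, send, attempts=3, base_delay=0.1,
                    sleep=time.sleep, rng=random.random):
    """Try up to `attempts` instances, backing off with full jitter between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        address = choose_instance()      # ask discovery for a (possibly different) instance
        try:
            return send(address)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Full jitter: wait a random time in [0, base_delay * 2**attempt).
                sleep(rng() * base_delay * (2 ** attempt))
    raise last_error


pool = iter(["10.0.0.9:8080", "10.0.0.5:8080"])   # first entry is a stale, dead instance
delays = []

def send(address):
    if address == "10.0.0.9:8080":
        raise ConnectionRefusedError(address)
    return f"ok from {address}"

result = call_with_retry(lambda: next(pool), send, sleep=delays.append)
```

The first attempt hits the dead instance, the client waits a short random delay, and the second attempt succeeds against a live one. Randomising the delay keeps many clients from retrying in lockstep after a shared failure.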
Frequently Asked Questions
What is the difference between service discovery and load balancing?
Service discovery answers the question "where are the healthy instances of service X right now?" and returns a list of addresses. Load balancing answers a follow-on question: "which one of those instances should I send this request to?" The two are complementary: discovery populates the pool, load balancing selects from it. In server-side discovery they are often bundled together (a load balancer both consults the registry and picks an instance), while in client-side discovery they are separate concerns handled by the client library.
How does Kubernetes handle service discovery without Consul or Eureka?
Kubernetes uses etcd as its backing store for all cluster state, including which Pods are healthy. The control plane maintains Endpoints objects that list the IPs of ready Pods for each Service. CoreDNS watches these Endpoints and serves DNS queries for <svc>.<ns>.svc.cluster.local. kube-proxy programs iptables or IPVS rules on each node to NAT traffic from the Service's stable virtual ClusterIP to one of the ready Pod IPs. From the application's perspective, a simple DNS lookup or a call to the ClusterIP is all that is needed; the entire discovery and routing mechanism is invisible to the application code.
Should I use client-side or server-side discovery for a new microservice project?
For greenfield projects running on Kubernetes, server-side discovery via Kubernetes Services and CoreDNS is the right default: it requires zero application code changes, works with any language, and is already battle-tested at massive scale. Client-side discovery is worth considering when you need custom load-balancing logic (e.g., latency-aware routing, circuit-breaking per instance) or when you are not on Kubernetes and want fine-grained control. If you want that control without burdening your application code, a service mesh like Istio or Linkerd provides client-side routing via sidecar proxies (Envoy in Istio, linkerd2-proxy in Linkerd).
A service registry is the phone book of your microservice fleet: keep it accurate, keep it highly available, and let health checks do the dirty work of evicting dead entries automatically.
- alokknight Engineering
