Kafka Connect and the Wider Ecosystem
Connect architecture with workers, connectors, tasks, and converters, source vs sink connectors, standalone vs distributed mode, common connectors, and where ksqlDB fits.
A lot of Kafka work is moving data between Kafka and other systems: pulling changes out of a database, pushing records into S3 or a search index. You could write a producer or consumer for each integration, but that is repetitive and easy to get wrong. Kafka Connect is a framework that does this with configuration instead of code, and it is the runtime that tools like Debezium (covered in Transactional Outbox and CDC) run on.
What you’ll be able to do after this module
- Describe Connect’s architecture: workers, connectors, tasks, and converters.
- Distinguish source connectors from sink connectors.
- Choose between standalone and distributed mode.
- Name common connectors and decide when to use Connect versus a hand-written consumer.
- Explain where ksqlDB fits.
1. What Connect is for
Kafka Connect is a separate runtime dedicated to integration. You give it a JSON configuration naming a connector class and its settings, and it streams data in or out, handling offsets, restarts, scaling, and retries for you.
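As a rough illustration of what such a configuration looks like, the sketch below builds the JSON body for a file sink. The connector name `demo-file-sink`, the topic `demo-topic`, and the output path are made up for this example. `FileStreamSinkConnector` is the demo file sink that ships with Apache Kafka; depending on the Kafka version, it may need to be added to the worker's plugin path before it can be used.

```python
import json

def build_connector_payload(name, connector_class, settings, tasks_max=1):
    """Return a connector definition: a name plus a flat map of string settings."""
    config = {"connector.class": connector_class, "tasks.max": str(tasks_max)}
    # Connector config values are strings, so stringify whatever the caller passed.
    config.update({key: str(value) for key, value in settings.items()})
    return {"name": name, "config": config}

# Hypothetical example values; adjust them to your own lab.
payload = build_connector_payload(
    name="demo-file-sink",
    connector_class="org.apache.kafka.connect.file.FileStreamSinkConnector",
    settings={"topics": "demo-topic", "file": "/tmp/demo-sink.txt"},
)
print(json.dumps(payload, indent=2))
```

Everything specific to the integration lives in the `config` map; there is no producer or consumer loop to write.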
flowchart LR
src[("Source system<br/>e.g. Postgres")] --> sc["Source connector"]
sc --> k["Kafka topics"]
k --> kc["Sink connector"]
kc --> dst[("Sink system<br/>e.g. S3")]
The point is to avoid writing and operating bespoke producer/consumer apps for standard integrations that thousands of teams need.
2. Architecture: workers, connectors, tasks, converters
Four concepts make up Connect.
| Concept | Role |
|---|---|
| Worker | A JVM process that runs connectors and tasks; the unit you deploy and scale |
| Connector | A configured integration to one external system; splits work into tasks |
| Task | The unit of parallelism that actually copies data (for example one per DB table or partition) |
| Converter | Serializes/deserializes record keys and values (JSON, Avro via Schema Registry, and so on) |
A connector is configuration; tasks are the running units that do the copying, and workers are the processes that host them. Converters plug Connect into the same schema and serialization world as the rest of the course, so a source connector can write Avro registered in Schema Registry.
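To make the connector-to-task split concrete, here is a minimal sketch of one plausible strategy: spreading a list of tables round-robin across at most `tasks.max` tasks. It is an illustration only, not Connect's actual code; each real connector decides for itself how to divide work, and the table names are hypothetical.

```python
def group_into_tasks(items, max_tasks):
    """Spread items round-robin across at most max_tasks groups, with no empty groups."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more groups than there are items to hand out.
    num_groups = min(max_tasks, len(items))
    groups = [[] for _ in range(num_groups)]
    for index, item in enumerate(items):
        groups[index % num_groups].append(item)
    return groups

# Hypothetical tables a source connector has been asked to copy.
tables = ["customers", "orders", "payments", "shipments", "refunds"]
task_assignments = group_into_tasks(tables, max_tasks=2)
for task_id, assigned in enumerate(task_assignments):
    print(f"task {task_id}: {assigned}")
```

Raising `max_tasks` only helps while there is still work to divide: with one table, this sketch produces one task no matter how high the limit is set.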
3. Source vs sink connectors
- Source connector: reads from an external system and writes into Kafka. Example: Debezium reads a database’s transaction log and produces change events. This is the ingest side.
- Sink connector: reads from Kafka and writes to an external system. Example: an S3 sink archives a topic, or an Elasticsearch sink indexes it. This is the export side.
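A sink connector's tasks read through a managed consumer group, so by default the connector inherits at-least-once delivery: records can be redelivered after a restart or rebalance. The target therefore needs to be idempotent or upsert-style, the same concern as in Idempotent Consumers, Ordering, and Duplicates.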
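The sketch below shows why an upsert-style target tolerates redelivery. `UpsertTarget` is a hypothetical in-memory stand-in for a keyed store (for example a table with a primary key), and the records are invented for the example.

```python
class UpsertTarget:
    """In-memory stand-in for a keyed store such as a table with a primary key."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def upsert(self, key, value):
        # Writing the same key again overwrites the row instead of appending a duplicate.
        self.rows[key] = value
        self.writes += 1

def deliver(target, records):
    """Write a batch of (key, value) records to the target."""
    for key, value in records:
        target.upsert(key, value)

batch = [("order-1", {"status": "paid"}), ("order-2", {"status": "shipped"})]
target = UpsertTarget()
deliver(target, batch)
deliver(target, batch)  # simulated redelivery of the same batch after a restart
print(len(target.rows), target.writes)
```

The batch is written twice, yet the target still holds one row per key. An append-only target given the same redelivery would end up with duplicates.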
4. Standalone vs distributed mode
Connect runs in one of two modes.
| Mode | Use for |
|---|---|
| Standalone | A single worker, config from a file. Local dev and simple one-host setups |
| Distributed | Multiple workers forming a cluster, config via REST, automatic task rebalancing and failover. Production |
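Distributed mode stores its connector configs, source connector offsets, and status in internal Kafka topics (sink connectors track their progress through ordinary consumer group offsets). Because that state lives in Kafka rather than on any one host, when a worker fails its tasks are reassigned to the surviving workers, much like a consumer group rebalance.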
flowchart TD
subgraph cluster [Distributed Connect cluster]
w1["worker 1<br/>tasks"]
w2["worker 2<br/>tasks"]
w3["worker 3<br/>tasks"]
end
rest["REST API"] --> cluster
cluster --> topics["internal topics:<br/>configs, offsets, status"]
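For reference, a distributed worker is pointed at its internal topics through a handful of worker properties. The sketch below renders a minimal properties file as text; the property keys are standard Connect worker settings, while the broker address, the group id `connect-lab`, and the topic names derived from it are assumptions for a local lab.

```python
def render_worker_properties(bootstrap_servers, group_id):
    """Render a minimal distributed-worker properties file as text."""
    settings = {
        "bootstrap.servers": bootstrap_servers,
        # Workers that share a group.id form one Connect cluster.
        "group.id": group_id,
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        # Internal topics holding cluster state (these names are hypothetical).
        "config.storage.topic": f"{group_id}-configs",
        "offset.storage.topic": f"{group_id}-offsets",
        "status.storage.topic": f"{group_id}-status",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"

properties_text = render_worker_properties("localhost:9092", "connect-lab")
print(properties_text)
```

Starting a second worker with the same `group.id` and the same three topic names is what makes it join the existing cluster rather than form a new one.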
5. Common connectors and when to use Connect
A large connector ecosystem exists. A few you will meet:
- Debezium: change data capture from Postgres, MySQL, MongoDB, and others (the CDC engine behind the outbox in Transactional Outbox and CDC).
- JDBC source/sink: read from or write to relational databases via SQL.
- S3 sink: archive topics to object storage.
Reach for Connect when the integration is a standard system-to-Kafka move. Write your own consumer when the logic is genuinely application-specific, such as business processing, enrichment, or anything that belongs inside your service.
6. Where ksqlDB fits
ksqlDB lets you express stream processing as SQL over Kafka topics, running continuous queries without writing Java. It sits alongside Kafka Streams (and is built on it): Streams gives you a full programming model, while ksqlDB trades flexibility for the accessibility of SQL. Reach for ksqlDB when analysts or SQL-comfortable engineers need to filter, join, and aggregate streams without a JVM build, and reach for Kafka Streams, covered in the Kafka Streams module, when you need the full power of code.
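As a hedged illustration of what "SQL over Kafka topics" means, the sketch below prepares a request for a ksqlDB server's `/ksql` REST endpoint. It assumes a server at `http://localhost:8088` and an existing `orders` topic with JSON values; the stream, table, and column names are invented for the example. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical statements: declare a stream over a topic, then aggregate it continuously.
STATEMENTS = """
CREATE STREAM orders (order_id VARCHAR KEY, region VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
CREATE TABLE revenue_by_region AS
  SELECT region, SUM(amount) AS revenue
  FROM orders
  GROUP BY region
  EMIT CHANGES;
"""

def build_ksql_request(server_url, statements):
    """Build (but do not send) a POST request carrying ksqlDB statements."""
    body = json.dumps({"ksql": statements, "streamsProperties": {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server_url}/ksql",
        data=body,
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )

request = build_ksql_request("http://localhost:8088", STATEMENTS)
print(request.get_method(), request.full_url)
# Against a live server, send it with: urllib.request.urlopen(request)
```

The second statement is a persistent query: once created, it keeps `revenue_by_region` up to date as new records arrive, which is the kind of aggregation you would otherwise write as a Kafka Streams application.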
7. Guided practical
Run this against the local lab.
- Start a Connect worker in distributed mode pointed at the broker (a confluentinc/cp-kafka-connect container works well).
- Confirm the REST API responds at http://localhost:8083/connector-plugins.
- Configure a simple sink connector (for example a file or S3-compatible sink) via a POST to /connectors.
- Produce records to the topic and confirm they appear in the sink target.
- Inspect connector and task status via GET /connectors/<name>/status.
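The REST steps above can be scripted. The following is a minimal sketch using only the Python standard library, assuming the worker listens on `http://localhost:8083` as in the lab. The helper names `call_connect`, `failed_tasks`, and `run_lab` are made up for this example, and the sample status values are invented to show the response shape.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumed address of the lab's Connect worker

def call_connect(method, path, payload=None):
    """Send one request to the Connect REST API and return the decoded JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url=f"{CONNECT_URL}{path}",
        data=data,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

def failed_tasks(status):
    """Return the ids of tasks reported as FAILED in a status response."""
    return [task["id"] for task in status.get("tasks", []) if task.get("state") == "FAILED"]

def run_lab(connector_payload):
    """Walk the REST steps against a live worker; call this yourself once the lab is up."""
    plugins = call_connect("GET", "/connector-plugins")
    print("plugins:", [plugin["class"] for plugin in plugins])
    call_connect("POST", "/connectors", connector_payload)
    status = call_connect("GET", f"/connectors/{connector_payload['name']}/status")
    print("connector state:", status["connector"]["state"])
    print("failed tasks:", failed_tasks(status))

# Made-up status response showing the fields failed_tasks reads.
sample_status = {
    "name": "demo-file-sink",
    "connector": {"state": "RUNNING", "worker_id": "connect-1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "connect-1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "connect-2:8083"},
    ],
}
print(failed_tasks(sample_status))
```

A connector can report `RUNNING` while one of its tasks has failed, so checking task states separately, as `failed_tasks` does, is worth the extra call.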
Next: Section 7, Security: TLS, SASL, ACLs, and MSK IAM, where the course turns to hardening Kafka for production.