Kafka Connect: Source and Sink Connectors and the Ecosystem

A lot of Kafka work is moving data between Kafka and other systems: pulling changes out of a database, pushing records into S3 or a search index. You could write a producer or consumer for each, but that is repetitive and easy to get wrong. Kafka Connect is a framework that handles these integrations through configuration instead of code, and it is the runtime that typically hosts tools like Debezium, introduced in Transactional Outbox and CDC.

What you’ll be able to do after this module

Describe Connect’s architecture: workers, connectors, tasks, and converters.
Distinguish source connectors from sink connectors.
Choose between standalone and distributed mode.
Name common connectors and when to use Connect versus a consumer.
Explain where ksqlDB fits.

1. What Connect is for

Kafka Connect is a separate runtime dedicated to integration. You give it a configuration (JSON over the REST API, or a properties file in standalone mode) naming a connector class and its settings, and it streams data in or out, managing offsets, restarts, task distribution, and retries for you.
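As a rough illustration of what such a configuration looks like, the sketch below builds the JSON body for a simple file sink. It uses the FileStreamSinkConnector class that ships with Apache Kafka as the example connector; the connector name, topic, and file path are made-up values, and newer Kafka releases may need the file connector added to the worker's plugin path before it can be used.

```python
import json


def file_sink_config(name: str, topic: str, path: str, tasks_max: int = 1) -> dict:
    """Build the name/config body that describes one connector."""
    return {
        "name": name,
        "config": {
            # The connector class tells Connect which plugin to run.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            # Upper bound on the number of tasks the connector may create.
            "tasks.max": str(tasks_max),
            # Sink connectors name the topics they read from.
            "topics": topic,
            # Setting specific to the file sink: where records are written.
            "file": path,
        },
    }


# Hypothetical values for illustration only.
payload = file_sink_config("demo-file-sink", "orders", "/tmp/orders.txt")
print(json.dumps(payload, indent=2))
```

Everything the integration needs is data: there is no poll loop, offset handling, or retry logic in the configuration itself.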

flowchart LR
    src[("Source system<br/>e.g. Postgres")] --> sc["Source connector"]
    sc --> k["Kafka topics"]
    k --> kc["Sink connector"]
    kc --> dst[("Sink system<br/>e.g. S3")]

The point is to avoid writing and operating bespoke producer/consumer apps for standard integrations that thousands of teams need.

2. Architecture: workers, connectors, tasks, converters

Four concepts make up Connect.

Concept	Role
Worker	A JVM process that runs connectors and tasks; the unit you deploy and scale
Connector	A configured integration to one external system; splits work into tasks
Task	The unit of parallelism that actually copies data (for example one per DB table or partition)
Converter	Serializes/deserializes record keys and values (JSON, Avro via Schema Registry, and so on)

A connector is configuration; tasks are the running units that do the copying, and they execute inside worker processes. Converters plug Connect into the same schema and serialization world as the rest of the course, so a source connector can write Avro registered in Schema Registry.
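The split between connector, tasks, and converters can be sketched in a few lines. This is a simplified illustration, not Connect's own code: real connectors decide how to divide work in their own task configuration logic, and the round-robin split, the table names, and the Schema Registry URL below are all assumptions. The converter class names are the ones commonly used for string keys and Avro values.

```python
def split_into_tasks(tables: list, tasks_max: int) -> list:
    """Spread tables round-robin over at most tasks_max task configs."""
    if tasks_max < 1:
        raise ValueError("tasks_max must be at least 1")
    # A connector never needs more tasks than it has units of work.
    task_count = min(tasks_max, len(tables))
    groups = [[] for _ in range(task_count)]
    for index, table in enumerate(tables):
        groups[index % task_count].append(table)
    # Each task receives its own small config describing its share.
    return [{"tables": ",".join(group)} for group in groups]


# Converter settings: string keys, Avro values registered in Schema Registry.
# The registry URL is an assumed local address.
converter_settings = {
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081",
}

task_configs = split_into_tasks(["orders", "customers", "payments"], tasks_max=2)
for task_id, config in enumerate(task_configs):
    print(f"task {task_id}: {config}")
```

With three tables and a limit of two tasks, one task copies two tables and the other copies one. Raising the limit above three would still produce only three tasks.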

3. Source vs sink connectors

Source connector: reads from an external system and writes into Kafka. Example: Debezium reads a database’s transaction log and produces change events. This is the ingest side.
Sink connector: reads from Kafka and writes to an external system. Example: an S3 sink archives a topic, or an Elasticsearch sink indexes it. This is the export side.
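The two directions show up directly in configuration. The sketch below pairs a Debezium Postgres source with an S3 sink, assuming a recent Debezium release (which uses topic.prefix) and the Confluent S3 sink connector. Host names, credentials, bucket, region, and table names are placeholders, and a real deployment would need further settings.

```python
# Source: reads the database log and writes change events INTO Kafka.
# It names tables, not topics; topics are derived as <prefix>.<schema>.<table>.
source_connector = {
    "name": "inventory-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "change-me",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",
        "table.include.list": "public.orders",
    },
}

# Sink: reads FROM Kafka, so it must say which topics to consume.
sink_connector = {
    "name": "orders-archive",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "inventory.public.orders",
        "s3.bucket.name": "example-archive-bucket",
        "s3.region": "eu-west-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}


def reads_from_kafka(connector: dict) -> bool:
    """Sink configs carry a topics (or topics.regex) setting; sources do not."""
    config = connector["config"]
    return "topics" in config or "topics.regex" in config


for connector in (source_connector, sink_connector):
    kind = "sink" if reads_from_kafka(connector) else "source"
    print(f"{connector['name']}: {kind}")
```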

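A sink connector's tasks read through an ordinary consumer group that Connect manages, so by default it inherits at-least-once delivery: after a failure or rebalance, some records may be written again. It therefore needs an idempotent or upsert-style target, the same concern as in Idempotent Consumers, Ordering, and Duplicates.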
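To make the duplicate concern concrete, here is a minimal in-memory sketch comparing an append-only target with an upsert target when the same batch is delivered twice. The Record type and both write functions are hypothetical stand-ins for a real sink and its target system.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: dict


def write_append(target: list, batch: list) -> None:
    """Append-only target: every delivery adds rows, duplicates included."""
    for record in batch:
        target.append(record.value)


def write_upsert(target: dict, batch: list) -> None:
    """Upsert target: the record key decides the row, so replays overwrite."""
    for record in batch:
        target[record.key] = record.value


batch = [
    Record("order-1", {"status": "paid"}),
    Record("order-2", {"status": "shipped"}),
]

# Simulate a task that wrote the batch, failed before its offsets were
# committed, and received the same batch again after restart.
deliveries = [batch, batch]

append_target = []
upsert_target = {}
for delivery in deliveries:
    write_append(append_target, delivery)
    write_upsert(upsert_target, delivery)

print(len(append_target))  # 4 rows: each order was written twice
print(len(upsert_target))  # 2 rows: the replay overwrote the same keys
```

The upsert target ends in the same state however many times the batch is replayed, which is what makes at-least-once delivery safe.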

4. Standalone vs distributed mode

Connect runs in one of two modes.

Mode	Use for
Standalone	A single worker, config from a file. Local dev and simple one-host setups
Distributed	Multiple workers forming a cluster, config via REST, automatic task rebalancing and failover. Production

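Distributed mode stores its connector configs, source connector offsets, and status in internal Kafka topics (sink connectors track progress through ordinary consumer group offsets). Because no state lives only on a worker, a worker can fail and its tasks are reassigned to the survivors, much like a consumer group rebalance.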
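A distributed worker's settings make the internal topics visible. The sketch below renders a minimal worker properties file. The property keys are standard Connect worker settings; the broker address, group id, topic names, and replication factor of 1 are assumptions suited to a single-broker local lab, not production values.

```python
def render_properties(settings: dict) -> str:
    """Render key=value lines in the style of a Java properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


distributed_worker = {
    "bootstrap.servers": "localhost:9092",
    # Workers that share a group.id form one Connect cluster.
    "group.id": "connect-cluster",
    # Internal topics that hold the cluster's shared state.
    "config.storage.topic": "connect-configs",
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
    # Replication factor 1 only makes sense with a single local broker.
    "config.storage.replication.factor": "1",
    "offset.storage.replication.factor": "1",
    "status.storage.replication.factor": "1",
    # Default converters; individual connectors can override them.
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}

properties_text = render_properties(distributed_worker)
print(properties_text)
```

Starting a second worker with the same group.id and storage topics is what adds it to the cluster.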

flowchart TD
    subgraph cluster [Distributed Connect cluster]
        w1["worker 1<br/>tasks"]
        w2["worker 2<br/>tasks"]
        w3["worker 3<br/>tasks"]
    end
    rest["REST API"] --> cluster
    cluster --> topics["internal topics:<br/>configs, offsets, status"]

5. Common connectors and when to use Connect

A large connector ecosystem exists. A few you will meet:

Debezium: change data capture from Postgres, MySQL, MongoDB, and others (the CDC engine behind the outbox in Transactional Outbox and CDC).
JDBC source/sink: read from or write to relational databases via SQL.
S3 sink: archive topics to object storage.

Reach for Connect when the integration is a standard move between Kafka and an external system, in either direction. Write your own consumer when the logic is genuinely application-specific, such as business processing, enrichment, or anything that belongs inside your service.

6. Where ksqlDB fits

ksqlDB lets you express stream processing as SQL over Kafka topics, running continuous queries without writing Java. It sits alongside Kafka Streams (and is built on it): Streams gives you a full programming model, while ksqlDB trades flexibility for the accessibility of SQL. Reach for ksqlDB when analysts or SQL-comfortable engineers need to filter, join, and aggregate streams without a JVM build, and reach for Kafka Streams when you need the full power of code, as covered in the Kafka Streams module.
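For a feel of what a continuous query looks like, the sketch below wraps two ksqlDB statements in the JSON body that a ksqlDB server's /ksql REST endpoint is expected to accept. The stream names, columns, and the 100 threshold are invented for illustration, and the snippet only builds the request body rather than contacting a server.

```python
import json

# Declare a stream over an existing topic, then derive a filtered stream
# that keeps running as new records arrive. All names are hypothetical.
statements = """
CREATE STREAM orders (order_id VARCHAR KEY, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
CREATE STREAM large_orders AS
  SELECT order_id, amount FROM orders WHERE amount > 100
  EMIT CHANGES;
"""


def ksql_request_body(ksql: str) -> str:
    """Wrap SQL text in the JSON envelope used for statement requests."""
    return json.dumps({"ksql": ksql.strip(), "streamsProperties": {}})


body = ksql_request_body(statements)
print(body)
```

The equivalent Kafka Streams application would need a project, a build, and a deployment, which is the trade-off this section describes.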

7. Guided practical

Run this against the local lab.

Start a Connect worker in distributed mode pointed at the broker (a confluentinc/cp-kafka-connect container works well).
Confirm the REST API responds at http://localhost:8083/connector-plugins.
Configure a simple sink connector (for example a file or S3-compatible sink) via a POST to /connectors.
Produce records to the topic and confirm they appear in the sink target.
Inspect connector and task status via GET /connectors/<name>/status.
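The REST steps above can be scripted with the standard library. This is a sketch under stated assumptions: the worker listens on http://localhost:8083 as in the steps, and the connector name and sample status payload are illustrative. The script only builds the requests and checks a sample response; call send(...) yourself once the lab worker is running.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"


def list_plugins_request(base_url: str = CONNECT_URL) -> urllib.request.Request:
    """GET /connector-plugins: confirms the API is up and lists plugins."""
    return urllib.request.Request(f"{base_url}/connector-plugins", method="GET")


def create_connector_request(name: str, config: dict,
                             base_url: str = CONNECT_URL) -> urllib.request.Request:
    """POST /connectors with a name/config JSON body."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_request(name: str, base_url: str = CONNECT_URL) -> urllib.request.Request:
    """GET /connectors/<name>/status."""
    return urllib.request.Request(f"{base_url}/connectors/{name}/status", method="GET")


def send(request: urllib.request.Request):
    """Send a request to a running worker and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def all_running(status: dict) -> bool:
    """True when the connector and every one of its tasks report RUNNING."""
    states = [status["connector"]["state"]]
    states += [task["state"] for task in status["tasks"]]
    return all(state == "RUNNING" for state in states)


# Illustrative status payload in the shape the status endpoint returns.
sample_status = {
    "name": "demo-file-sink",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "connect:8083"}],
    "type": "sink",
}

for request in (
    list_plugins_request(),
    create_connector_request("demo-file-sink", {"tasks.max": "1"}),
    status_request("demo-file-sink"),
):
    print(request.get_method(), request.full_url)
print(all_running(sample_status))
```

A task in the FAILED state makes all_running return False, which is the usual cue to read the trace field that the status response includes for failed tasks.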

Next: Section 7, Security: TLS, SASL, ACLs, and MSK IAM, where the course turns to hardening Kafka for production.