A Tap is an application that takes a configuration file and an optional state file as input and produces an ordered stream of record, state, and schema messages as output. A record is JSON-encoded data of any kind. A state message is used to persist information between invocations of a Tap. A schema message describes the datatypes of the records in the stream. A Tap may be implemented in any programming language.
Taps are designed to produce a stream of data from sources like databases and web service APIs for use in a data integration or ETL pipeline.
tap --config CONFIG [--state STATE] [--catalog CATALOG]
CONFIG is a required argument that points to a JSON file containing any
configuration parameters the Tap needs.
STATE is an optional argument pointing to a JSON file that the
Tap can use to remember information from the previous invocation,
like, for example, the point where it left off.
CATALOG is an optional argument pointing to a JSON file that the
Tap can use to filter which streams should be synced.
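The argument handling described above can be sketched with Python's standard `argparse` module. This is a minimal illustration, not a reference implementation; the function name `parse_args` is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the three Tap arguments: --config (required), --state, --catalog."""
    parser = argparse.ArgumentParser(prog="tap")
    parser.add_argument("--config", required=True,
                        help="path to the JSON configuration file")
    parser.add_argument("--state",
                        help="optional path to a JSON state file")
    parser.add_argument("--catalog",
                        help="optional path to a JSON catalog file")
    return parser.parse_args(argv)


# Omitted optional arguments come back as None.
args = parse_args(["--config", "config.json"])
print(args.config, args.state, args.catalog)
```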
The configuration contains whatever parameters the Tap needs in order to pull data from the source. Typically this will include the credentials for the API or data source. The format of the configuration will vary by Tap, but it must be JSON-encoded and the root of the configuration must be an object.
For Taps whose configuration needs to change during a run, the Tap should write the changes back to the supplied config file so that they are available on subsequent runs.
See the Config File section for more information.
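As a rough sketch of the two rules above (the root must be a JSON object, and changes are written back to the same file), a Tap might handle its config as follows. The helper names and the `api_key` field in the usage line are hypothetical.

```python
import json


def parse_config(text):
    """Decode config JSON and insist that its root is an object."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("config root must be a JSON object")
    return config


def load_config(path):
    """Read and validate the file named by --config."""
    with open(path, encoding="utf-8") as f:
        return parse_config(f.read())


def save_config(path, config):
    """Write changed values back so that the next run can use them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


print(parse_config('{"api_key": "abc"}'))
```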
The JSON-encoded state is used to persist information between invocations of a Tap.
A Tap that wishes to persist state should periodically write STATE
messages to stdout as it processes the stream, and should expect the file
named by the --state STATE argument to have the same format as the value
of the STATE messages it emits.
A common use case of state is to record the spot in the stream where the
last invocation left off. If the Tap is invoked without a --state STATE
argument, it should start at the beginning of the stream or at some
appropriate default position. If it is invoked with a --state STATE
argument it should read in the state file and start from the corresponding
position in the stream.
See the State File section for more information.
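The resume behaviour described above might look like the sketch below. It assumes, purely for illustration, a state value that maps each stream name to the last id processed; the specification leaves the actual shape of the state to each Tap.

```python
import json


def load_state(path=None):
    """Return the saved state, or an empty dict when no --state file was given."""
    if path is None:
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def start_position(state, stream, default=0):
    """Pick where to resume a stream; `default` stands for the beginning."""
    return state.get(stream, default)


# Without a state file every stream starts at the default position.
print(start_position(load_state(), "users"))
# With a stored bookmark the Tap resumes from it.
print(start_position({"users": 2}, "users"))
```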
The catalog is a JSON-encoded file that lists the available streams and their schemas. The top level is an object with a single key called "streams" that points to an array of stream objects. The metadata for each stream object can be modified to select whether a stream and/or its fields should be replicated, and how the data should be replicated (FULL TABLE vs INCREMENTAL).
See the Catalog section for more information.
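A Tap could filter streams along the following lines. The layout used here is an assumption that this section does not spell out: each stream object is taken to carry a "stream" name and a "metadata" list whose entries hold a "breadcrumb" and a "metadata" map with a "selected" flag. The Catalog section is the authority on the real format.

```python
def selected_streams(catalog):
    """Return the names of streams whose stream-level metadata marks them selected."""
    names = []
    for stream in catalog.get("streams", []):
        for entry in stream.get("metadata", []):
            # Assumption: an empty breadcrumb refers to the stream itself,
            # as opposed to one of its fields.
            if entry.get("breadcrumb") == [] and entry.get("metadata", {}).get("selected"):
                names.append(stream["stream"])
                break
    return names


catalog = {
    "streams": [
        {"stream": "users",
         "metadata": [{"breadcrumb": [], "metadata": {"selected": True}}]},
        {"stream": "locations",
         "metadata": [{"breadcrumb": [], "metadata": {"selected": False}}]},
    ]
}
print(selected_streams(catalog))
```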
Sync from the beginning without catalog
$ tap --config config.json
Sync starting from a stored state with catalog
$ tap --config config.json --state state.json --catalog catalog.json
A Tap outputs structured messages to stdout in JSON format, one
message per line. Logs and other information can be emitted to stderr
to aid debugging. A Tap exits with a zero exit code on success and a
non-zero exit code on failure.
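The output conventions above (one JSON message per line on stdout, diagnostics on stderr, exit code signalling success or failure) can be sketched as follows. The helper names `write_message`, `log`, and `run` are hypothetical.

```python
import json
import sys


def write_message(message, out=None):
    """Serialize one message as a single line of JSON and flush it."""
    out = sys.stdout if out is None else out
    # json.dumps escapes newlines inside strings, so each message stays on one line.
    out.write(json.dumps(message) + "\n")
    out.flush()


def log(text):
    """Send diagnostics to stderr so they never mix with the message stream."""
    print(text, file=sys.stderr)


def run(sync):
    """Call `sync` and translate success or failure into an exit code."""
    try:
        sync()
    except Exception as exc:
        log(f"tap failed: {exc}")
        return 1
    return 0
```

A Tap's entry point could then end with `sys.exit(run(my_sync))`, where `my_sync` is whatever function performs the extraction.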
The body contains messages encoded as a JSON map, one message per
line. Each message must contain a type attribute. Any message type
is permitted, and types are interpreted case-insensitively. The
following types have specific meaning:
RECORD messages contain the data from the data stream. They have the following properties:
- record: Required. A JSON map containing a streamed data point.
- stream: Required. The string name of the stream.
- time_extracted: Optional. The time this record was observed in the source. This should be an RFC3339 formatted date-time, like "2017-11-20T16:45:33.000Z".
A single Tap may output RECORD messages with different stream names. A single RECORD message may only contain a record for a single stream.
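One way to build a RECORD message, including the optional RFC3339 timestamp, is sketched below. The function name `record_message` is hypothetical, and the sketch assumes the caller passes a timezone-aware `datetime`.

```python
from datetime import datetime, timezone


def record_message(stream, record, time_extracted=None):
    """Build a RECORD message; `time_extracted` must be timezone-aware if given."""
    message = {"type": "RECORD", "stream": stream, "record": record}
    if time_extracted is not None:
        utc = time_extracted.astimezone(timezone.utc)
        # Millisecond precision with a trailing "Z", as in the property description.
        message["time_extracted"] = (
            utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"
        )
    return message


seen = datetime(2017, 11, 20, 16, 45, 33, tzinfo=timezone.utc)
print(record_message("users", {"id": 0, "name": "Chris"}, seen))
```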
Example:
Note: Every message must be on its own line, but the examples here use multiple lines for readability.
{
  "type": "RECORD",
  "stream": "users",
  "time_extracted": "2017-11-20T16:45:33.000Z",
  "record": {
    "id": 0,
    "name": "Chris"
  }
}
SCHEMA messages describe the datatypes of data in the stream. They have the following properties:
- schema: Required. A JSON Schema describing the record property of RECORDs from the same stream.
- stream: Required. The string name of the stream that this schema describes.
- key_properties: Required. A list of strings indicating which properties make up the primary key for this stream. Each item in the list must be the name of a top-level property defined in the schema. A value for key_properties must be provided, but it may be an empty list to indicate that there is no primary key.
- bookmark_properties: Optional. A list of strings indicating which properties the Tap is using as bookmarks. Each item in the list must be the name of a top-level property defined in the schema.
A single Tap may output SCHEMA messages with different stream
names. If a RECORD message from a stream is not preceded by a
SCHEMA message for that stream, it is assumed to be schema-less.
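The property rules above lend themselves to a small sanity check. The sketch below only verifies that the required properties are present and that key and bookmark properties name top-level schema properties; it does not validate the JSON Schema itself. The function name `check_schema_message` is hypothetical.

```python
def check_schema_message(message):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for name in ("schema", "stream", "key_properties"):
        if name not in message:
            problems.append(f"missing required property: {name}")
    top_level = message.get("schema", {}).get("properties", {})
    for field in ("key_properties", "bookmark_properties"):
        for prop in message.get(field, []):
            if prop not in top_level:
                problems.append(f"{field} names unknown property: {prop}")
    return problems


message = {
    "type": "SCHEMA",
    "stream": "users",
    "schema": {"properties": {"id": {"type": "integer"}}},
    "key_properties": ["uuid"],
}
print(check_schema_message(message))
```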
Example:
Note: Every message must be on its own line, but the examples here use multiple lines for readability.
{
  "type": "SCHEMA",
  "stream": "users",
  "schema": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "string"
      },
      "updated_at": {
        "type": "string",
        "format": "date-time"
      }
    }
  },
  "key_properties": ["id"],
  "bookmark_properties": ["updated_at"]
}
STATE messages contain the state that the Tap wishes to persist. STATE messages have the following properties:
- value: Required. The JSON formatted state value.
The semantics of a STATE value are not part of the specification, and should be determined independently by each Tap.
Example:
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}
{"type": "SCHEMA", "stream": "locations", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}
{"type": "STATE", "value": {"users": 2, "locations": 1}}
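A toy Tap that produces messages in the same order and shape as the output above could be sketched as follows. The in-memory `source`, the fixed schema, and the choice of "last id seen" as the state value are all assumptions made for illustration.

```python
import json


def sync(source):
    """Yield SCHEMA, RECORD and STATE messages for an in-memory `source`.

    `source` maps a stream name to a list of rows, each with an integer "id".
    """
    bookmarks = {}
    for stream, rows in source.items():
        # The SCHEMA message precedes the RECORD messages of its stream.
        yield {"type": "SCHEMA", "stream": stream, "key_properties": ["id"],
               "schema": {"required": ["id"], "type": "object",
                          "properties": {"id": {"type": "integer"}}}}
        for row in rows:
            yield {"type": "RECORD", "stream": stream, "record": row}
            bookmarks[stream] = row["id"]
    # A final STATE message records where each stream left off.
    yield {"type": "STATE", "value": bookmarks}


source = {
    "users": [{"id": 1, "name": "Chris"}, {"id": 2, "name": "Mike"}],
    "locations": [{"id": 1, "name": "Philadelphia"}],
}
for message in sync(source):
    print(json.dumps(message))
```

A long-running Tap would normally emit STATE messages periodically rather than only once at the end, so that an interrupted run can resume close to where it stopped.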
A Tap's API encompasses its input and output - including its configuration, how it interprets state, and how the data it produces is structured and interpreted. Taps should follow Semantic Versioning, meaning that breaking changes to any of these should be a new MAJOR version, and backwards-compatible changes should be a new MINOR version.