StreamSets Data Collector

Main interface

This is the main entry point for interacting with SDC instances.

class streamsets.sdk.DataCollector(server_url, username=None, password=None, authentication_method='form', accounts_authentication_token=None, accounts_server_url=None, control_hub=None, dump_log_on_error=False, **kwargs)[source]

Class to interact with StreamSets Data Collector.

If connecting to a StreamSets Control Hub-registered instance of Data Collector, create an instance of streamsets.sdk.ControlHub instead of instantiating with a username and password.

Parameters
  • server_url (str) – URL of an existing SDC deployment with which to interact.

  • username (str, optional) – SDC username. Default: streamsets.sdk.sdc.DEFAULT_SDC_USERNAME.

  • password (str, optional) – SDC password. Default: streamsets.sdk.sdc.DEFAULT_SDC_PASSWORD.

  • authentication_method (str, optional) – StreamSets Data Collector authentication method. Default: streamsets.sdk.constants.ENGINE_AUTHENTICATION_METHOD_FORM.

  • accounts_authentication_token (str, optional) – StreamSets Accounts authentication token. Default: None

  • accounts_server_url (str, optional) – StreamSets Accounts server base URL. Default: None

  • control_hub (streamsets.sdk.ControlHub, optional) – A StreamSets Control Hub instance to use for SCH-registered Data Collectors. Default: None.

  • dump_log_on_error (bool) – Whether to output Data Collector logs when exceptions are raised by certain methods. Default: False
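A minimal sketch of connecting with the constructor above. The `connect` helper and the URL `http://localhost:18630` are assumptions for illustration, and the import is deferred so the function can be defined without the SDK installed.

```python
def connect(server_url='http://localhost:18630', username='admin', password='admin'):
    """Return a DataCollector handle for an SDC that is not registered with Control Hub."""
    # Imported lazily so this module can be loaded without the SDK present.
    from streamsets.sdk import DataCollector

    # 'admin'/'admin' mirror DEFAULT_SDC_USERNAME and DEFAULT_SDC_PASSWORD.
    return DataCollector(server_url, username=username, password=password)
```

For a Control Hub-registered Data Collector, the note above applies: create a streamsets.sdk.ControlHub instance instead.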

VERIFY_SSL_CERTIFICATES = True
add_pipeline(*pipelines, **kwargs)[source]

Add one or more pipelines to the DataCollector instance.

Parameters

*pipelines – One or more instances of streamsets.sdk.sdc_models.Pipeline.

capture_snapshot(pipeline, snapshot_name=None, start_pipeline=False, runtime_parameters=None, batches=1, batch_size=10, **kwargs)[source]

Capture a snapshot for the given pipeline.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • snapshot_name (str, optional) – Name for the generated snapshot. If set to None, an auto-generated UUID (which can be recovered from the returned SnapshotCommand object’s snapshot_name attribute) will be used when calling the REST API. Default: None.

  • start_pipeline (bool, optional) – If set to true, then the pipeline will be started and its first batch will be captured. Otherwise, the pipeline must be running, in which case one of the next batches will be captured. Default: False.

  • runtime_parameters (dict, optional) – Runtime parameters that override the pipeline's parameter values. Default: None.

  • wait (bool, optional) – Wait for capture snapshot to finish. Default: True.

  • wait_for_statuses (list, optional) – Pipeline statuses to wait on. Default: ['RUNNING', 'FINISHED'].

  • timeout_sec (int) – Timeout to wait for snapshot, in seconds. Default: streamsets.sdk.sdc.DEFAULT_SNAPSHOT_TIMEOUT.

Returns

An instance of streamsets.sdk.sdc_api.SnapshotCommand.
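As a sketch of how the parameters above fit together, the hypothetical helper below starts a stopped pipeline, captures its first batch, and stops it again. It relies only on the `snapshot_name` attribute mentioned above; `sdc` is assumed to be a DataCollector instance.

```python
def capture_first_batch(sdc, pipeline, batch_size=10):
    """Start `pipeline`, snapshot its first batch, stop it, and return the snapshot name."""
    command = sdc.capture_snapshot(pipeline,
                                   start_pipeline=True,  # the pipeline is not running yet
                                   batch_size=batch_size)
    try:
        # With snapshot_name=None the name is an auto-generated UUID.
        return command.snapshot_name
    finally:
        # capture_snapshot started the pipeline, so stop it again.
        sdc.stop_pipeline(pipeline)
```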

change_password(old_password, new_password)[source]

Change password for the current user.

Parameters
  • old_password (str) – old password.

  • new_password (str) – new password.

Returns

An instance of streamsets.sdk.sdc_api.Command.

property current_user

Get currently logged-in user and its groups and roles.

Returns

An instance of streamsets.sdk.sdc_models.User.

property definitions

Get an SDC instance’s definitions.

Will return a cached instance of the definitions if called more than once.

Returns

The definitions as a dict parsed from JSON.

delete_snapshot(snapshot)[source]

Delete a snapshot from the specified pipeline.

Parameters

snapshot (streamsets.sdk.sdc_models.Snapshot) – The Snapshot object to be deleted.

Returns

An instance of streamsets.sdk.sdc_api.Command.

export_pipeline(pipeline, include_library_definitions=False, include_plain_text_credentials=False)[source]

Export a single pipeline to JSON.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline instance.

  • include_library_definitions (boolean) – Set to True to export for Control Hub. Default: False.

  • include_plain_text_credentials (boolean) – Default: False.

Returns

A dict object containing the contents of pipeline.

export_pipelines(pipelines, include_library_definitions=False, include_plain_text_credentials=False)[source]

Export pipelines.

Parameters
  • pipelines (list) – A list of streamsets.sdk.sdc_models.Pipeline instances.

  • include_library_definitions (boolean) – Set to True to export for Control Hub. Default: False.

  • include_plain_text_credentials (boolean) – Default: False.

Returns

A bytes object containing the zip archive of the pipelines' JSON files.
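Because export_pipelines returns the raw bytes of a zip archive, the result can be written straight to disk. The helper below is a hypothetical sketch; `archive_path` and the function name are illustrative, and `sdc` is assumed to be a DataCollector instance.

```python
import io
import zipfile


def backup_pipelines(sdc, pipelines, archive_path):
    """Write the exported pipelines to `archive_path` and return the archived file names."""
    archive_bytes = sdc.export_pipelines(list(pipelines))
    with open(archive_path, 'wb') as archive_file:
        archive_file.write(archive_bytes)
    # Re-open the bytes in memory to list what was exported.
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.namelist()
```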

get_alerts()[source]

Get pipeline alerts.

Returns

An instance of streamsets.sdk.sdc_models.Alerts.

get_bundle(generators=None)[source]

Generate a new support bundle.

Returns

An instance of zipfile.ZipFile.

get_bundle_generators()[source]

Get available support bundle generators.

Returns

An instance of streamsets.sdk.sdc_models.BundleGenerators.

get_jmx_metrics()[source]

Get SDC JMX metrics.

Returns

An instance of streamsets.sdk.sdc_models.JmxMetrics.

get_logs(ending_offset=-1, extra_message=None, pipeline=None, severity=None)[source]

Get logs.

Parameters
  • ending_offset (int, optional) – Ending offset. Default: -1.

  • extra_message (str, optional) – Extra message. Default: None.

  • pipeline (streamsets.sdk.sdc_models.Pipeline, optional) – The pipeline instance. Default: None.

  • severity (str, optional) – Severity. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Log.

get_pipeline_acl(pipeline)[source]

Get pipeline ACL.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

Returns

An instance of streamsets.sdk.sdc_models.PipelineAcl.

get_pipeline_builder(**kwargs)[source]

Get a pipeline builder instance with which a pipeline can be created.

Returns

An instance of streamsets.sdk.sdc_models.PipelineBuilder.

get_pipeline_committed_offsets(pipeline)[source]

Get a pipeline’s committed offsets.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline object.

Returns

A dict containing committed offsets (or an empty dict if no offsets exist).

get_pipeline_history(pipeline)[source]

Get a pipeline’s history.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

Returns

An instance of streamsets.sdk.sdc_models.History.

get_pipeline_metrics(pipeline)[source]

Get a pipeline’s metrics.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

Returns

An instance of streamsets.sdk.sdc_models.Metrics.

get_pipeline_permissions(pipeline)[source]

Return pipeline permissions for a given pipeline.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

Returns

An instance of streamsets.sdk.sdc_models.PipelinePermissions.

get_pipeline_status(pipeline)[source]

Get status of a pipeline.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

get_snapshots(pipeline)[source]

Get information about stored snapshots.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline, optional) – The pipeline instance.

Returns

A list of streamsets.sdk.sdc_models.SnapshotInfo instances.

get_stage_errors(pipeline, stage)[source]

Get stage errors.

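Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • stage (streamsets.sdk.sdc_models.Stage) – The stage instance.

Returns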

A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.StageError instances.

get_stage_library_version(stage)[source]

Get the stage library version.

Parameters

stage (streamsets.sdk.sdc_models.Stage) – stage object

Returns

An instance of str

property id

Return id for StreamSets Data Collector.

Returns

A str SDC ID.

import_pipeline(pipeline, title=None)[source]

Import pipeline from json file.

Parameters
  • pipeline (dict) – JSON data loaded from file. Example usage: json.load(open(filename, 'r')).

  • title (str, optional) – Title of the pipeline. If left out, pipeline title from JSON object will be used. Default None.

Returns

An instance of streamsets.sdk.sdc_models.Pipeline.
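A small sketch of the import flow described above, using a context manager so the file is closed after loading. The helper name and `filename` are illustrative; `sdc` is assumed to be a DataCollector instance.

```python
import json


def import_pipeline_from_file(sdc, filename, title=None):
    """Load an exported pipeline from `filename` and import it into `sdc`."""
    # A context manager closes the file even if parsing fails.
    with open(filename, 'r') as pipeline_file:
        pipeline_json = json.load(pipeline_file)
    # When title is None, the title stored in the JSON is used.
    return sdc.import_pipeline(pipeline_json, title=title)
```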

import_pipelines_from_archive(archive)[source]

Import pipelines from archived zip directory.

Parameters

archive (file) – file containing the pipelines.

Returns

A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.Pipeline.

property pipelines

Get all pipelines in the pipeline store.

Returns

A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.Pipeline instances.

remove_pipeline(*pipelines)[source]

Remove one or more pipelines from the DataCollector instance.

Parameters

*pipelines – One or more instances of streamsets.sdk.sdc_models.Pipeline.

reset_origin(pipeline)[source]

Reset origin offset.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline object.

Returns

An instance of streamsets.sdk.sdc_api.Command.

run_dynamic_pipeline_preview(type, parameters={}, batches=1, batch_size=1, skip_targets=True, skip_lifecycle_events=True, end_stage=None, timeout=10000, test_origin=False, stage_outputs_to_override_json_text=None, stage_outputs_to_override_json=[], **kwargs)[source]

Run dynamic pipeline preview.

Parameters
  • type (str) – Dynamic preview request type. Either CLASSIFICATION_CATALOG or PROTECTION_POLICY.

  • parameters (dict, optional) – Dynamic preview request parameters. Default: {}

  • batches (int, optional) – Number of batches. Default: 1.

  • batch_size (int, optional) – Batch size. Default: 1.

  • skip_targets (bool, optional) – Skip targets. Default: True.

  • skip_lifecycle_events (bool, optional) – Skip life cycle events. Default: True.

  • end_stage (str, optional) – End stage. Default: None.

  • timeout (int, optional) – Timeout. Default: 10000.

  • test_origin (bool, optional) – Test origin. Default: False.

  • stage_outputs_to_override_json_text (str, optional) – Stage outputs to override text. Default: None.

  • stage_outputs_to_override_json (list, optional) – Stage outputs to override. Default: [].

  • wait (bool, optional) – Wait for pipeline preview to finish. Default: True.

Returns

An instance of streamsets.sdk.sdc_api.PreviewCommand.

run_pipeline_preview(pipeline, rev=0, batches=1, batch_size=10, skip_targets=True, end_stage=None, timeout=2000, test_origin=False, stage_outputs_to_override_json=None, **kwargs)[source]

Run pipeline preview.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • rev (int, optional) – Pipeline revision. Default: 0.

  • batches (int, optional) – Number of batches. Default: 1.

  • batch_size (int, optional) – Batch size. Default: 10.

  • skip_targets (bool, optional) – Skip targets. Default: True.

  • end_stage (str, optional) – End stage. Default: None.

  • timeout (int, optional) – Timeout. Default: 2000.

  • test_origin (bool, optional) – Test origin. Default: False

  • stage_outputs_to_override_json (str, optional) – Stage outputs to override. Default: None.

  • wait (bool, optional) – Wait for pipeline preview to finish. Default: True.

Returns

An instance of streamsets.sdk.sdc_api.PreviewCommand.
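One possible use of a preview is to check a pipeline for issues before starting it. The sketch below assumes the returned PreviewCommand exposes the finished Preview as a `preview` attribute, which this reference does not document; `issues` and `issues_count` come from the Preview and Issues models described under Models.

```python
def preview_issue_count(sdc, pipeline, batch_size=10):
    """Run a preview of `pipeline` and return the number of issues it reported."""
    command = sdc.run_pipeline_preview(pipeline, batch_size=batch_size)
    # Assumption: the PreviewCommand exposes the finished Preview as `.preview`.
    preview = command.preview
    return preview.issues.issues_count
```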

property sample_pipelines

Get all sample pipelines in the pipeline store.

Returns

A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.Pipeline instances.

property sdc_configuration

Return all configurations for StreamSets Data Collector.

Returns

A dict with property names as keys and property values as values.

set_pipeline_acl(pipeline, pipeline_acl)[source]

Update pipeline ACL.

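Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • pipeline_acl (streamsets.sdk.sdc_models.PipelineAcl) – The pipeline ACL instance.

Returns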

An instance of streamsets.sdk.sdc_api.Command.

set_user(username, password=None)[source]

Set the user with which to interact with SDC.

Parameters
  • username (str) – Username of user.

  • password (str, optional) – Password for user. Default: same as username.

property stage_libraries

Get all stage libraries.

Returns

A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.StageLibrary instances.

start_pipeline(pipeline, runtime_parameters=None, **kwargs)[source]

Start a pipeline.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • runtime_parameters (dict, optional) – Collection of runtime parameters. Default: None.

  • wait (bool, optional) – Wait for pipeline to start. Default: True.

  • wait_for_statuses (list, optional) – Pipeline statuses to wait on. Default: ['RUNNING', 'FINISHED'].

  • timeout_sec (int) – Timeout to wait for pipeline statuses, in seconds. Default: streamsets.sdk.sdc.DEFAULT_START_TIMEOUT.

Returns

An instance of streamsets.sdk.sdc_api.PipelineCommand.
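The `runtime_parameters` argument can be guarded against typos by comparing it with the pipeline's `parameters` property first. This is a hypothetical helper, not part of the SDK; the check itself is an added convenience.

```python
def start_with_overrides(sdc, pipeline, **overrides):
    """Start `pipeline`, overriding only parameters that it already defines."""
    unknown = sorted(set(overrides) - set(pipeline.parameters))
    if unknown:
        raise KeyError('Unknown pipeline parameters: ' + ', '.join(unknown))
    # Pass None when there is nothing to override (the documented default).
    return sdc.start_pipeline(pipeline, runtime_parameters=overrides or None)
```

A call might look like `start_with_overrides(sdc, pipeline, BATCH_SIZE='50')`, where `BATCH_SIZE` is an illustrative parameter name.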

stop_pipeline(pipeline, **kwargs)[source]

Stop a pipeline.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • force (bool, optional) – Force pipeline to stop. Default: False.

  • wait (bool, optional) – Wait for pipeline to stop. Default: True.

  • timeout_sec (int) – Timeout to wait for pipeline stop, in seconds. Default: streamsets.sdk.sdc.DEFAULT_STOP_TIMEOUT.

Returns

An instance of streamsets.sdk.sdc_api.StopPipelineCommand.

update_pipeline(*pipelines)[source]

Update one or more pipelines in the DataCollector instance.

Parameters

*pipelines – One or more instances of streamsets.sdk.sdc_models.Pipeline.

validate_pipeline(pipeline)[source]

Validate a pipeline.

Parameters

pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

property version

Return the version of the Data Collector.

Returns

The version string.

Return type

str

wait_for_pipeline_metric(pipeline, metric, value, timeout_sec=30)[source]

Block until a pipeline metric reaches the desired value.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • metric (str) – The desired metric (e.g. 'output_record_count' or 'data_batch_count').

  • value – The desired value to wait for.

  • timeout_sec (int, optional) – Timeout to wait for metric to reach value, in seconds. Default: streamsets.sdk.sdc.DEFAULT_WAIT_FOR_METRIC_TIMEOUT.

Raises

TimeoutError – If timeout_sec passes without metric reaching value.
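A common pattern is to run a pipeline only until it has processed enough records. The sketch below assumes the TimeoutError named above is Python's built-in exception; the helper name and default values are illustrative.

```python
def run_until_metric(sdc, pipeline, metric='output_record_count', value=100, timeout_sec=30):
    """Run `pipeline` until `metric` reaches `value`; return False if the wait timed out."""
    sdc.start_pipeline(pipeline)
    try:
        sdc.wait_for_pipeline_metric(pipeline, metric, value, timeout_sec=timeout_sec)
        return True
    except TimeoutError:
        return False
    finally:
        # Stop the pipeline whether or not the metric was reached.
        sdc.stop_pipeline(pipeline)
```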

wait_for_pipeline_status(pipeline, status, timeout_sec=30)[source]

Block until a pipeline reaches the desired status.

Parameters
  • pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.

  • status (str) – The desired status to wait for.

  • timeout_sec (int, optional) – Timeout to wait for pipeline to reach status, in seconds. Default: streamsets.sdk.sdc.DEFAULT_WAIT_FOR_STATUS_TIMEOUT.

Raises

TimeoutError – If timeout_sec passes without pipeline reaching status.

sdc.DEFAULT_SDC_USERNAME = 'admin'
sdc.DEFAULT_SDC_PASSWORD = 'admin'

Models

These models wrap and provide useful functionality for interacting with common SDC abstractions.

Alerts

class streamsets.sdk.sdc_models.Alert(alert)[source]

Pipeline alert.

Parameters

alert (dict) – Python object representation of a pipeline alert.

property alert_texts

Get alert’s alert texts.

Returns

The alert’s alert texts as a str.

property label

Get alert’s label.

Returns

The alert’s label as a str.

property pipeline_id

Get alert’s pipeline ID.

Returns

The pipeline ID as a str.

class streamsets.sdk.sdc_models.Alerts(alerts)[source]

Container for list of alerts with filtering capabilities.

Parameters

alerts (dict) – Python object representation of alerts.

alerts

A list of streamsets.sdk.sdc_models.Alert instances.

Type

list

for_pipeline(pipeline)[source]

Get alerts for the specified pipeline.

Parameters

pipeline (str) – The pipeline for which to get alerts.

Returns

An instance of streamsets.sdk.sdc_models.Alerts.
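As an illustration of the filtering described above, the hypothetical helper below lists alert labels for one pipeline. The `pipeline` argument is passed through unchanged to for_pipeline, which this reference documents as a str.

```python
def alert_labels(sdc, pipeline):
    """Return the labels of all alerts raised for one pipeline."""
    pipeline_alerts = sdc.get_alerts().for_pipeline(pipeline)
    return [alert.label for alert in pipeline_alerts.alerts]
```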

Data Rules

class streamsets.sdk.sdc_models.DataDriftRule(stream, label, condition=None, sampling_percentage=5, sampling_records_to_retain=10, enable_meter=True, enable_alert=True, alert_text='${alert:info()}', send_email=False, active=False)[source]

Pipeline data drift rule.

Parameters
  • stream (str) – Stream to use for data rule. An entry from a Stage instance’s output_lanes list is typically used here.

  • label (str) – Rule label.

  • condition (str, optional) – Data rule condition. Default: None.

  • sampling_percentage (int, optional) – Default: 5.

  • sampling_records_to_retain (int, optional) – Default: 10.

  • enable_meter (bool, optional) – Default: True.

  • enable_alert (bool, optional) – Default: True.

  • alert_text (str, optional) – Default: '${alert:info()}'.

  • send_email (bool, optional) – Default: False.

  • active (bool, optional) – Enable the data rule. Default: False.

property active

The rule is active.

Returns

A bool.

class streamsets.sdk.sdc_models.DataRule(stream, label, condition=None, sampling_percentage=5, sampling_records_to_retain=10, enable_meter=True, enable_alert=True, alert_text=None, threshold_type='count', threshold_value=100, min_volume=1000, send_email=False, active=False)[source]

Pipeline data rule.

Parameters
  • stream (str) – Stream to use for data rule. An entry from a Stage instance’s output_lanes list is typically used here.

  • label (str) – Rule label.

  • condition (str, optional) – Data rule condition. Default: None.

  • sampling_percentage (int, optional) – Default: 5.

  • sampling_records_to_retain (int, optional) – Default: 10.

  • enable_meter (bool, optional) – Default: True.

  • enable_alert (bool, optional) – Default: True.

  • alert_text (str, optional) – Default: None.

  • threshold_type (str, optional) – One of count or percentage. Default: 'count'.

  • threshold_value (int, optional) – Default: 100.

  • min_volume (int, optional) – Only set if threshold_type is percentage. Default: 1000.

  • send_email (bool, optional) – Default: False.

  • active (bool, optional) – Enable the data rule. Default: False.

property active

Whether the rule is active.

Returns

A bool.
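A sketch of building a count-based DataRule and attaching it with PipelineBuilder.add_data_rule. The label, alert text, and the `/status` field in the condition are made up for illustration; the import is deferred so the function can be defined without the SDK installed.

```python
def add_error_count_rule(pipeline_builder, stage, threshold=100):
    """Attach an active, count-based data rule to the first output lane of `stage`."""
    # Imported lazily so this module can be loaded without the SDK present.
    from streamsets.sdk.sdc_models import DataRule

    rule = DataRule(stream=stage.output_lanes[0],
                    label='Too many error records',                     # illustrative
                    condition="${record:value('/status') == 'ERROR'}",  # hypothetical field
                    threshold_type='count',
                    threshold_value=threshold,
                    alert_text='Error records exceeded the threshold',  # illustrative
                    active=True)
    pipeline_builder.add_data_rule(rule)
    return rule
```

With `threshold_type='percentage'`, `min_volume` would also apply, as noted in the parameter list.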

History

class streamsets.sdk.sdc_models.History(history)[source]

Pipeline history.

Parameters

history (dict) – Python object representation of the pipeline history.

entries

A list of streamsets.sdk.sdc_models.HistoryEntry instances.

Type

list

property latest

Get pipeline history’s latest entry.

Returns

The most recent pipeline history entry as an instance of streamsets.sdk.sdc_models.HistoryEntry.

class streamsets.sdk.sdc_models.HistoryEntry(entry)[source]

Pipeline history entry.

Parameters

entry (dict) – Python object representation of the history entry.

property metrics

Get pipeline history entry’s metrics.

Returns

The pipeline history entry’s metrics as an instance of streamsets.sdk.sdc_models.Metrics.
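The history models chain together with the Metrics model described under Metrics. The sketch below reads one counter from the most recent history entry; the default counter name is an assumption and may differ between SDC versions.

```python
def latest_counter_value(sdc, pipeline, counter_name='pipeline.batchOutputRecords.counter'):
    """Return the count of `counter_name` from the pipeline's latest history entry."""
    history = sdc.get_pipeline_history(pipeline)
    # History.latest -> HistoryEntry.metrics -> Metrics.counter(name) -> MetricCounter.count
    return history.latest.metrics.counter(counter_name).count
```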

Issues

class streamsets.sdk.sdc_models.Issue(issue)[source]

Issue encountered for a pipeline or a stage.

Parameters

issue (dict) – Python object representation of the issue.

class streamsets.sdk.sdc_models.Issues(issues)[source]

Issues encountered for pipelines as well as stages.

Parameters

issues (dict) – Python object representation of the issues.

issues_count

The number of issues.

Type

int

pipeline_issues

A list of streamsets.sdk.sdc_models.Issue instances.

Type

list

stage_issues

A dictionary mapping stage names to instances of streamsets.sdk.sdc_models.Issue.

Type

dict

Logs

class streamsets.sdk.sdc_models.Log(log)[source]

Model for SDC logs.

Parameters

log (list) – A list of dictionaries (JSON representation) of the log.

after_time(timestamp)[source]

Return the log entries recorded after the specified time.

Parameters

timestamp (str) – Timestamp in the form '2017-04-10 17:53:55,244'.

Returns

The formatted log as a str.

before_time(timestamp)[source]

Return the log entries recorded before the specified time.

Parameters

timestamp (str) – Timestamp in the form '2017-04-10 17:53:55,244'.

Returns

The formatted log as a str.
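The timestamp format above puts milliseconds after a comma, which `strftime` cannot produce directly. The helpers below are a sketch of one way to build it; their names are illustrative, and the code assumes the client clock and the SDC log use the same time zone.

```python
from datetime import datetime, timedelta


def to_log_timestamp(moment):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS,mmm' (milliseconds after the comma)."""
    return moment.strftime('%Y-%m-%d %H:%M:%S') + ',{:03d}'.format(moment.microsecond // 1000)


def logs_since(sdc, minutes=5, now=None):
    """Return the formatted log text for the last `minutes` minutes."""
    now = now or datetime.now()
    cutoff = now - timedelta(minutes=minutes)
    return sdc.get_logs().after_time(to_log_timestamp(cutoff))
```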

Metrics

class streamsets.sdk.sdc_models.MetricCounter(counter)[source]

Metric counter.

Parameters

counter (dict) – Python object representation of a metric counter.

property count

Get the metric counter’s count.

Returns

The metric counter’s count as an int.

class streamsets.sdk.sdc_models.MetricGauge(gauge)[source]

Metric gauge.

Parameters

gauge (dict) – Python object representation of a metric gauge.

property value

Get the metric gauge’s value.

Returns

The metric gauge’s value as a str.

class streamsets.sdk.sdc_models.MetricHistogram(histogram)[source]

Metric histogram.

Parameters

histogram (dict) – Python object representation of a metric histogram.

class streamsets.sdk.sdc_models.MetricTimer(timer)[source]

Metric timer.

Parameters

timer (dict) – Python object representation of a metric timer.

property count

Get the metric timer’s count.

Returns

The metric timer’s count as an int.

class streamsets.sdk.sdc_models.Metrics(metrics)[source]

Metrics.

Parameters

metrics (dict) – Python object representation of metrics.

counter(name)[source]

Get the metric counter from metrics.

Parameters

name (str) – Counter name.

Returns

The metric counter as an instance of streamsets.sdk.sdc_models.MetricCounter.

gauge(name)[source]

Get the metric gauge from metrics.

Parameters

name (str) – Gauge name.

Returns

The metric gauge as an instance of streamsets.sdk.sdc_models.MetricGauge.

histogram(name)[source]

Get the metric histogram from metrics.

Parameters

name (str) – Histogram name.

Returns

The metric histogram as an instance of streamsets.sdk.sdc_models.MetricHistogram.

meter(name)[source]

Get the metric meter from metrics.

Parameters

name (str) – Meter name.

Returns

The metric meter as an instance of streamsets.sdk.sdc_models.MetricMeter.

property pipeline

Get pipeline-level metrics.

Returns

An instance of streamsets.sdk.sdc_models.PipelineMetrics.

timer(name)[source]

Get the metric timer from metrics.

Parameters

name (str) – Timer name.

Returns

The metric timer as an instance of streamsets.sdk.sdc_models.MetricTimer.

Pipelines

class streamsets.sdk.sdc_models.PipelineBuilder(pipeline, definitions, fragment=False, data_collector=None)[source]

Class with which to build SDC pipelines.

This class allows a user to programmatically generate an SDC pipeline. Instead of instantiating this class directly, most users should use streamsets.sdk.DataCollector.get_pipeline_builder().

Parameters
  • pipeline (dict) – Python object representing an empty pipeline. If created manually, this would come from creating a new pipeline in SDC and then exporting it before doing any configuration.

  • definitions (dict) – The output of SDC’s definitions endpoint.

  • fragment (boolean, optional) – Specify if a fragment builder. Default: False.

add_data_drift_rule(*data_drift_rules)[source]

Add one or more data drift rules to the pipeline.

Parameters

*data_drift_rules – One or more instances of streamsets.sdk.sdc_models.DataDriftRule.

add_data_rule(*data_rules)[source]

Add one or more data rules to the pipeline.

Parameters

*data_rules – One or more instances of streamsets.sdk.sdc_models.DataRule.

add_error_stage(label=None, name=None, library=None)[source]

Add an error stage to the pipeline.

When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.

Parameters
  • label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.

  • name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.

  • library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

add_fragment(fragment, parameter_name_prefix=None)[source]

Add a fragment to the pipeline.

Parameters
  • fragment (streamsets.sdk.sch_models.Pipeline) – Fragment to add.

  • parameter_name_prefix (str, optional) – Prefix name for the parameters of fragment. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

add_metric_rule(*metric_rules)[source]

Add one or more metric rules to the pipeline.

Parameters

*metric_rules – One or more instances of streamsets.sdk.sdc_models.MetricRule.

add_stage(label=None, name=None, type=None, library=None)[source]

Add a stage to the pipeline.

When specifying a stage, either label or name must be used. type and library may also be used to select a particular stage if ambiguities exist. If type and/or library are omitted, the first stage definition matching the given label or name will be used.

Parameters
  • label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.

  • name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.

  • type (str, optional) – SDC stage type to use when selecting stage from definitions (e.g. origin, destination, processor, executor). Default: None.

  • library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

add_start_event_stage(label=None, name=None, library=None)[source]

Add a start event stage to the pipeline.

When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.

Parameters
  • label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.

  • name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.

  • library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

add_stats_aggregator_stage(label=None, name=None, library=None)[source]

Add a stats aggregator stage to the pipeline.

When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.

Parameters
  • label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.

  • name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.

  • library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

add_stop_event_stage(label=None, name=None, library=None)[source]

Add a stop event stage to the pipeline.

When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.

Parameters
  • label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.

  • name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.

  • library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

add_test_origin_stage(label=None, name=None, library=None)[source]

Add a test origin stage to the pipeline.

When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.

Parameters
  • label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.

  • name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.

  • library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

build(title='Pipeline', **kwargs)[source]

Build the pipeline.

Parameters

title (str, optional) – Pipeline title to use. Default: 'Pipeline'.

Returns

An instance of streamsets.sdk.sdc_models.Pipeline.
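Putting the builder methods together, a minimal two-stage pipeline could be created as sketched below. The stage labels 'Dev Raw Data Source' and 'Trash' are assumed to be available in the target SDC's stage libraries, and `sdc` is assumed to be a DataCollector instance.

```python
def build_demo_pipeline(sdc, title='Demo pipeline'):
    """Build a two-stage pipeline, add it to `sdc`, and return it."""
    builder = sdc.get_pipeline_builder()
    origin = builder.add_stage('Dev Raw Data Source')  # first positional argument is the label
    destination = builder.add_stage('Trash')
    origin >> destination  # invokes add_output, as described for Stage
    pipeline = builder.build(title=title)
    sdc.add_pipeline(pipeline)
    return pipeline
```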

import_pipeline(pipeline, **kwargs)[source]

Import a pipeline into the PipelineBuilder.

Parameters

pipeline (dict) – Exported pipeline.

Returns

An instance of streamsets.sdk.sdc_models.PipelineBuilder.

class streamsets.sdk.sdc_models.Pipeline(pipeline, all_stages=None, fragment=False)[source]

SDC pipeline.

This class provides abstractions to make it easier to interact with a pipeline before it’s imported into SDC.

Parameters
  • pipeline (dict) – A Python object representing the serialized pipeline.

  • all_stages (dict, optional) – A dictionary mapping stage names to streamsets.sdk.sdc_models.Stage instances. Default: None.

  • fragment (boolean, optional) – Specify if a fragment. Default: False.

add_parameters(**parameters)[source]

Add pipeline parameters.

Parameters

**parameters – Keyword arguments to add.

property configuration

Get pipeline’s configuration.

Returns

An instance of streamsets.sdk.models.Configuration.

property delivery_guarantee

Get the delivery guarantee.

Returns

The delivery guarantee as a str.

property id

Get the pipeline id.

Returns

The pipeline id as a str.

property metadata

Get the pipeline metadata.

Returns

Pipeline metadata as a Python object.

property origin_stage

Get the pipeline’s origin stage.

Returns

An instance of streamsets.sdk.sdc_models.Stage.

property parameters

Get the pipeline parameters.

Returns

A dict of parameter key-value pairs.

pprint()[source]

Pretty-print the pipeline’s JSON representation.

property rate_limit

Get the rate limit (records/sec).

Returns

The rate limit as a str.

property title

Get the pipeline title.

Returns

The pipeline title as a str.

class streamsets.sdk.sdc_models.Stage(stage, label=None)[source]

Pipeline stage.

Parameters
  • stage – JSON representation of the pipeline stage.

  • label (str, optional) – Human-readable stage label. Default: None.

configuration

The stage configuration.

Type

streamsets.sdk.models.Configuration

services

If supported by the stage, a dictionary mapping a service name to an instance of streamsets.sdk.models.Configuration.

Type

dict

add_output(*other_stages, event_lane=False)[source]

Connect output of this stage to another stage.

The __rshift__ operator (>>) has been overloaded to invoke this method.

Parameters

*other_stages – One or more instances of streamsets.sdk.sdc_models.Stage.

Returns

This stage as an instance of streamsets.sdk.sdc_models.Stage.
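Since add_output returns the calling stage rather than the downstream one, wiring a linear chain takes one call per link. The helper below is a hypothetical sketch of that pattern and uses only add_output.

```python
def connect_in_sequence(*stages):
    """Wire stages[0] -> stages[1] -> ... and return the last stage (None if empty)."""
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.add_output(downstream)
    return stages[-1] if stages else None
```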

property description

The stage’s description.

Type

str

property event_lanes

Get the stage’s list of event lanes.

Returns

A list of event lanes.

property label

The stage’s label.

Type

str

property library

Get the stage’s library.

Returns

The stage library as a str.

property output_lanes

Get the stage’s list of output lanes.

Returns

A list of output lanes.

set_attributes(**attributes)[source]

Set one or more stage attributes.

Parameters

**attributes – Attributes to set.

Returns

This stage as an instance of streamsets.sdk.sdc_models.Stage.

property stage_on_record_error

The stage’s on record error configuration value.

property stage_record_preconditions

The stage’s record preconditions configuration value.

property stage_required_fields

The stage’s required fields configuration value.

Pipeline ACLs

class streamsets.sdk.sdc_models.PipelineAcl(pipeline_acl)[source]

Represents a pipeline ACL.

Parameters

pipeline_acl (dict) – JSON representation of a pipeline ACL.

permissions

Pipeline Permissions object.

Type

streamsets.sdk.sdc_models.PipelinePermissions

Pipeline Permissions

class streamsets.sdk.sdc_models.PipelinePermission(pipeline_permission)[source]

A container for a pipeline permission.

Parameters

pipeline_permission (dict) – A Python object representation of a pipeline permission.

class streamsets.sdk.sdc_models.PipelinePermissions(pipeline_permissions)[source]

Container for list of permissions for a pipeline.

Parameters

pipeline_permissions (dict) – A Python object representation of pipeline permissions.

permissions

A list of streamsets.sdk.sdc_models.PipelinePermission instances.

Type

list

Previews

class streamsets.sdk.sdc_models.Preview(pipeline_id, previewer_id, preview)[source]

Preview.

Parameters
  • pipeline_id (str) – Pipeline ID.

  • previewer_id (str) – Previewer ID.

  • preview (dict) – Python object representation of the preview.

issues

An instance of streamsets.sdk.sdc_models.Issues.

Type

streamsets.sdk.sdc_models.Issues

preview_batches

A list of streamsets.sdk.sdc_models.Batch instances.

Type

list

Snapshots

class streamsets.sdk.sdc_models.Batch(batch)[source]

Snapshot batch.

Parameters

batch – Python object representation of the snapshot batch.

class streamsets.sdk.sdc_models.Record(record)[source]

Record.

Parameters

record (dict) – Python object representation of the record.

header

An instance of streamsets.sdk.sdc_models.RecordHeader.

Type

streamsets.sdk.sdc_models.RecordHeader

value

Python object representation of the record value.

Type

dict

value2

A typed representation of the record value.

get_field_attributes(path)[source]

Given a field path string (similar to XPath), get streamsets.sdk.sdc_models.Field attributes.

Example

get_field_attributes(path='[2]/east/HR/employeeName')

Parameters

path (str) – field path string.

Returns

streamsets.sdk.sdc_models.Field attributes.

get_field_data(path)[source]

Given a field path string (similar to XPath), get streamsets.sdk.sdc_models.Field.

Example

get_field_data(path='[2]/east/HR/employeeName')

Parameters

path (str) – field path string.

Returns

An instance of streamsets.sdk.sdc_models.Field.

class streamsets.sdk.sdc_models.RecordHeader(header)[source]

Record Header.

Parameters

header (dict) – Python object representation of the record header.

class streamsets.sdk.sdc_models.Snapshot(api_client, snapshot)[source]

Snapshot.

Parameters

snapshot (dict) – Python object representation of the snapshot.

batch_number

The number of the batch that the snapshot captured.

Type

int

batches

A list of streamsets.sdk.sdc_models.Batch instances.

Type

list

id

The snapshot’s ID.

Type

str

name

The snapshot’s name.

Type

str

pipeline_id

The ID of the pipeline that the snapshot belongs to.

Type

str

time_stamp

The creation date of the snapshot as a unix timestamp.

Type

int

class streamsets.sdk.sdc_models.StageOutput(stage_output)[source]

Snapshot batch’s stage output.

Parameters

stage_output – Python object representation of the stage output.

property output

Gets the stage output’s output.

If the stage contains multiple lanes, use streamsets.sdk.sdc_models.StageOutput.output_lanes.

Raises

Exception – If the stage contains multiple lanes.

Returns

An instance of streamsets.sdk.sdc_models.Record.

Users

class streamsets.sdk.sdc_models.User(user)[source]

User.

Parameters

user (dict) – Python object representation of the user.

property groups

Get user’s groups.

Returns

User groups as a str.

property name

Get user’s name.

Returns

User name as a str.

property roles

Get user’s roles.

Returns

User roles as a str.