StreamSets Data Collector
StreamSets Data Collector#
Main interface#
This is the main entry point used by users when interacting with SDC instances.
- class streamsets.sdk.DataCollector(server_url, username=None, password=None, authentication_method='form', accounts_authentication_token=None, accounts_server_url=None, control_hub=None, dump_log_on_error=False, **kwargs)[source]#
Class to interact with StreamSets Data Collector.
If connecting to a StreamSets Control Hub-registered instance of Data Collector, create an instance of streamsets.sdk.ControlHub instead of instantiating with a username and password.
- Parameters
server_url (str) – URL of an existing SDC deployment with which to interact.
username (str, optional) – SDC username. Default: streamsets.sdk.sdc.DEFAULT_SDC_USERNAME.
password (str, optional) – SDC password. Default: streamsets.sdk.sdc.DEFAULT_SDC_PASSWORD.
authentication_method (str, optional) – StreamSets Data Collector authentication method. Default: streamsets.sdk.constants.ENGINE_AUTHENTICATION_METHOD_FORM.
accounts_authentication_token (str, optional) – StreamSets Accounts authentication token. Default: None.
accounts_server_url (str, optional) – StreamSets Accounts server base URL. Default: None.
control_hub (streamsets.sdk.ControlHub, optional) – A StreamSets Control Hub instance to use for SCH-registered Data Collectors. Default: None.
dump_log_on_error (bool, optional) – Whether to output Data Collector logs when exceptions are raised by certain methods. Default: False.
- VERIFY_SSL_CERTIFICATES = True#
- add_pipeline(*pipelines, **kwargs)[source]#
Add one or more pipelines to the DataCollector instance.
- Parameters
*pipelines – One or more instances of
streamsets.sdk.sdc_models.Pipeline
.
- capture_snapshot(pipeline, snapshot_name=None, start_pipeline=False, runtime_parameters=None, batches=1, batch_size=10, **kwargs)[source]#
Capture a snapshot for given pipeline.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
snapshot_name (str, optional) – Name for the generated snapshot. If set to None, an auto-generated UUID (which can be recovered from the returned SnapshotCommand object’s snapshot_name attribute) will be used when calling the REST API. Default: None.
start_pipeline (bool, optional) – If set to True, the pipeline will be started and its first batch will be captured. Otherwise, the pipeline must be running, in which case one of the next batches will be captured. Default: False.
runtime_parameters (dict, optional) – Runtime parameters to override Pipeline Parameters value. Default: None.
wait (bool, optional) – Wait for capture snapshot to finish. Default: True.
wait_for_statuses (list, optional) – Pipeline statuses to wait on. Default: ['RUNNING', 'FINISHED'].
timeout_sec (int) – Timeout to wait for snapshot, in seconds. Default: streamsets.sdk.sdc.DEFAULT_SNAPSHOT_TIMEOUT.
- Returns
An instance of
streamsets.sdk.sdc_api.SnapshotCommand
.
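As a minimal sketch of the parameters above, the helper below starts a pipeline and captures its first batch. The function name capture_first_batch is hypothetical; sdc is assumed to be a connected streamsets.sdk.DataCollector and pipeline a pipeline already added to it. Only the snapshot_name attribute mentioned above is read from the returned command.

```python
def capture_first_batch(sdc, pipeline, batch_size=10):
    """Start `pipeline` and capture a snapshot of its first batch.

    `sdc` is assumed to be a connected DataCollector instance; this helper
    is illustrative and not part of the SDK.
    """
    command = sdc.capture_snapshot(
        pipeline,
        start_pipeline=True,    # start the pipeline and capture its first batch
        batches=1,
        batch_size=batch_size,
        wait=True,              # block until the capture finishes
    )
    # snapshot_name was left unset, so the auto-generated name is read
    # back from the returned command object.
    return command.snapshot_name
```

Passing start_pipeline=False instead would require the pipeline to be running already, as described for that parameter.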
- change_password(old_password, new_password)[source]#
Change password for the current user.
- Parameters
old_password (str) – Old password.
new_password (str) – New password.
- Returns
An instance of
streamsets.sdk.sdc_api.Command
.
- property current_user#
Get currently logged-in user and its groups and roles.
- Returns
An instance of
streamsets.sdk.sdc_models.User
.
- property definitions#
Get an SDC instance’s definitions.
Will return a cached instance of the definitions if called more than once.
- Returns
An instance of
json
.
- delete_snapshot(snapshot)[source]#
Delete a snapshot from the specified pipeline.
- Parameters
snapshot (streamsets.sdk.sdc_models.Snapshot) – The Snapshot object to be deleted.
- Returns
An instance of
streamsets.sdk.sdc_api.Command
.
- export_pipeline(pipeline, include_library_definitions=False, include_plain_text_credentials=False)[source]#
Export a single pipeline as JSON.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline instance.
include_library_definitions (boolean) – Set to true to export for Control Hub. Default: False.
include_plain_text_credentials (boolean) – Default: False.
- Returns
A
dict
object containing the contents of pipeline.
- export_pipelines(pipelines, include_library_definitions=False, include_plain_text_credentials=False)[source]#
Export pipelines.
- Parameters
pipelines (list) – A list of streamsets.sdk.sdc_models.Pipeline instances.
include_library_definitions (boolean) – Set to true to export for Control Hub. Default: False.
include_plain_text_credentials (boolean) – Default: False.
- Returns
An instance of type
bytes
indicating the content of zip file with pipeline json files.
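Because export_pipelines returns the raw bytes of a zip file, the result can be written to disk or inspected in memory. The sketch below does both; save_exported_pipelines and archive_path are hypothetical names, and sdc is assumed to be a connected DataCollector.

```python
import io
import zipfile


def save_exported_pipelines(sdc, pipelines, archive_path):
    """Export `pipelines` and write the zip archive to `archive_path`.

    Returns the names of the files inside the archive. Illustrative helper,
    not part of the SDK.
    """
    content = sdc.export_pipelines(pipelines)  # bytes of a zip file
    with open(archive_path, "wb") as archive_file:
        archive_file.write(content)
    # Open the same bytes in memory to list what was exported.
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        return archive.namelist()
```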
- get_alerts()[source]#
Get pipeline alerts.
- Returns
An instance of
streamsets.sdk.sdc_models.Alerts
.
- get_bundle(generators=None)[source]#
Generate new support bundle.
- Returns
An instance of
zipfile.ZipFile
.
- get_bundle_generators()[source]#
Get available support bundle generators.
- Returns
An instance of
streamsets.sdk.sdc_models.BundleGenerators
.
- get_jmx_metrics()[source]#
Get SDC JMX metrics.
- Returns
An instance of
streamsets.sdk.sdc_models.JmxMetrics
.
- get_logs(ending_offset=-1, extra_message=None, pipeline=None, severity=None)[source]#
Get logs.
- Parameters
ending_offset (int) – Ending offset. Default: -1.
extra_message (str) – Extra message. Default: None.
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance. Default: None.
severity (str) – Severity. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Log
.
- get_pipeline_acl(pipeline)[source]#
Get pipeline ACL.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
- Returns
An instance of
streamsets.sdk.sdc_models.PipelineAcl
.
- get_pipeline_builder(**kwargs)[source]#
Get a pipeline builder instance with which a pipeline can be created.
- Returns
An instance of
streamsets.sdk.sdc_models.PipelineBuilder
.
- get_pipeline_committed_offsets(pipeline)[source]#
Get a pipeline’s committed offsets.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline object.
- Returns
A dict containing committed offsets (or an empty dict if no offsets exist).
- get_pipeline_history(pipeline)[source]#
Get a pipeline’s history.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
- Returns
An instance of
streamsets.sdk.sdc_models.History
.
- get_pipeline_metrics(pipeline)[source]#
Get a pipeline’s metrics.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
- Returns
An instance of
streamsets.sdk.sdc_models.Metrics
.
- get_pipeline_permissions(pipeline)[source]#
Return pipeline permissions for a given pipeline.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
- Returns
An instance of
streamsets.sdk.sdc_models.PipelinePermissions
.
- get_pipeline_status(pipeline)[source]#
Get status of a pipeline.
- Parameters
pipeline (
streamsets.sdk.sdc_models.Pipeline
) – The pipeline instance.
- get_snapshots(pipeline)[source]#
Get information about stored snapshots.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline, optional) – The pipeline instance.
- Returns
A list of
streamsets.sdk.sdc_models.SnapshotInfo
instances.
- get_stage_errors(pipeline, stage)[source]#
Get stage errors.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline.
stage (streamsets.sdk.sdc_models.Stage) – Stage.
- Returns
A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.StageError instances.
- get_stage_library_version(stage)[source]#
Get the stage library version.
- Parameters
stage (streamsets.sdk.sdc_models.Stage) – Stage object.
- Returns
An instance of str.
- property id#
Return id for StreamSets Data Collector.
- Returns
A
str
SDC ID.
- import_pipeline(pipeline, title=None)[source]#
Import pipeline from json file.
- Parameters
pipeline (dict) – JSON data loaded from file. Example usage: json.load(open(filename, 'r')).
title (str, optional) – Title of the pipeline. If left out, pipeline title from JSON object will be used. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Pipeline
.
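Since import_pipeline expects the already-parsed JSON rather than a file path, a small wrapper can load the file first. This is a sketch only: import_pipeline_from_file is a hypothetical helper, and sdc is assumed to be a connected DataCollector. A with block is used so the file handle is closed after loading.

```python
import json


def import_pipeline_from_file(sdc, path, title=None):
    """Load an exported pipeline JSON file and import it into `sdc`.

    Illustrative helper, not part of the SDK.
    """
    with open(path, "r") as pipeline_file:
        pipeline_json = json.load(pipeline_file)  # dict expected by import_pipeline
    # When title is None, the title stored in the JSON object is used.
    return sdc.import_pipeline(pipeline_json, title=title)
```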
- import_pipelines_from_archive(archive)[source]#
Import pipelines from archived zip directory.
- Parameters
archive (file) – File containing the pipelines.
- Returns
A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.Pipeline instances.
- property pipelines#
Get all pipelines in the pipeline store.
- Returns
A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.Pipeline instances.
- remove_pipeline(*pipelines)[source]#
Remove one or more pipelines from the DataCollector instance.
- Parameters
*pipelines – One or more instances of
streamsets.sdk.sdc_models.Pipeline
.
- reset_origin(pipeline)[source]#
Reset origin offset.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – Pipeline object.
- Returns
An instance of
streamsets.sdk.sdc_api.Command
.
- run_dynamic_pipeline_preview(type, parameters={}, batches=1, batch_size=1, skip_targets=True, skip_lifecycle_events=True, end_stage=None, timeout=10000, test_origin=False, stage_outputs_to_override_json_text=None, stage_outputs_to_override_json=[], **kwargs)[source]#
Run dynamic pipeline preview.
- Parameters
type (str) – Dynamic preview request type. One of CLASSIFICATION_CATALOG or PROTECTION_POLICY.
parameters (dict, optional) – Dynamic preview request parameters. Default: {}.
batches (int, optional) – Number of batches. Default: 1.
batch_size (int, optional) – Batch size. Default: 1.
skip_targets (bool, optional) – Skip targets. Default: True.
skip_lifecycle_events (bool, optional) – Skip life cycle events. Default: True.
end_stage (str, optional) – End stage. Default: None.
timeout (int, optional) – Timeout. Default: 10000.
test_origin (bool, optional) – Test origin. Default: False.
stage_outputs_to_override_json_text (str, optional) – Stage outputs to override text. Default: None.
stage_outputs_to_override_json (list, optional) – Stage outputs to override. Default: [].
wait (bool, optional) – Wait for pipeline preview to finish. Default: True.
- Returns
An instance of
streamsets.sdk.sdc_api.PreviewCommand
.
- run_pipeline_preview(pipeline, rev=0, batches=1, batch_size=10, skip_targets=True, end_stage=None, timeout=2000, test_origin=False, stage_outputs_to_override_json=None, **kwargs)[source]#
Run pipeline preview.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
rev (int, optional) – Pipeline revision. Default: 0.
batches (int, optional) – Number of batches. Default: 1.
batch_size (int, optional) – Batch size. Default: 10.
skip_targets (bool, optional) – Skip targets. Default: True.
end_stage (str, optional) – End stage. Default: None.
timeout (int, optional) – Timeout. Default: 2000.
test_origin (bool, optional) – Test origin. Default: False.
stage_outputs_to_override_json (str, optional) – Stage outputs to override. Default: None.
wait (bool, optional) – Wait for pipeline preview to finish. Default: True.
- Returns
An instance of
streamsets.sdk.sdc_api.PreviewCommand
.
- property sample_pipelines#
Get all sample pipelines in the pipeline store.
- Returns
A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.Pipeline instances.
- property sdc_configuration#
Return all configurations for StreamSets Data Collector.
- Returns
A
dict
with property names as keys and property values as values.
- set_pipeline_acl(pipeline, pipeline_acl)[source]#
Update pipeline ACL.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
pipeline_acl (streamsets.sdk.sdc_models.PipelineAcl) – The pipeline ACL instance.
- Returns
An instance of
streamsets.sdk.sdc_api.Command
.
- set_user(username, password=None)[source]#
Set the user with which to interact with SDC.
- Parameters
username (str) – Username of user.
password (str, optional) – Password for user. Default: same as username.
- property stage_libraries#
Get all stage libraries.
- Returns
A streamsets.sdk.utils.SeekableList of streamsets.sdk.sdc_models.StageLibrary instances.
- start_pipeline(pipeline, runtime_parameters=None, **kwargs)[source]#
Start a pipeline.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
runtime_parameters (dict, optional) – Collection of runtime parameters. Default: None.
wait (bool, optional) – Wait for pipeline to start. Default: True.
wait_for_statuses (list, optional) – Pipeline statuses to wait on. Default: ['RUNNING', 'FINISHED'].
timeout_sec (int) – Timeout to wait for pipeline statuses, in seconds. Default: streamsets.sdk.sdc.DEFAULT_START_TIMEOUT.
- Returns
An instance of
streamsets.sdk.sdc_api.PipelineCommand
.
- stop_pipeline(pipeline, **kwargs)[source]#
Stop a pipeline.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
force (bool, optional) – Force pipeline to stop. Default: False.
wait (bool, optional) – Wait for pipeline to stop. Default: True.
timeout_sec (int) – Timeout to wait for pipeline stop, in seconds. Default: streamsets.sdk.sdc.DEFAULT_STOP_TIMEOUT.
- Returns
An instance of
streamsets.sdk.sdc_api.StopPipelineCommand
.
- update_pipeline(*pipelines)[source]#
Update one or more pipelines in the DataCollector instance.
- Parameters
*pipelines – One or more instances of
streamsets.sdk.sdc_models.Pipeline
.
- validate_pipeline(pipeline)[source]#
Validate a pipeline.
- Parameters
pipeline (
streamsets.sdk.sdc_models.Pipeline
) – The pipeline instance.
- property version#
Return the version of the Data Collector.
- Returns
The version string.
- Return type
str
- wait_for_pipeline_metric(pipeline, metric, value, timeout_sec=30)[source]#
Block until a pipeline metric reaches the desired value.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
metric (str) – The desired metric (e.g. 'output_record_count' or 'data_batch_count').
value – The desired value to wait for.
timeout_sec (int, optional) – Timeout to wait for metric to reach value, in seconds. Default: streamsets.sdk.sdc.DEFAULT_WAIT_FOR_METRIC_TIMEOUT.
- Raises
TimeoutError – If timeout_sec passes without metric reaching value.
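A common pattern is to start a pipeline, block until it has produced a given number of records, and then stop it. The sketch below combines start_pipeline, wait_for_pipeline_metric and stop_pipeline; run_until_record_count is a hypothetical helper name and sdc is assumed to be a connected DataCollector.

```python
def run_until_record_count(sdc, pipeline, record_count, timeout_sec=30):
    """Run `pipeline` until it has output `record_count` records, then stop it.

    Illustrative helper, not part of the SDK. Any error raised while waiting
    (including a timeout) propagates to the caller after the pipeline is stopped.
    """
    sdc.start_pipeline(pipeline)
    try:
        sdc.wait_for_pipeline_metric(
            pipeline, "output_record_count", record_count, timeout_sec=timeout_sec
        )
    finally:
        # Stop the pipeline whether or not the metric was reached.
        sdc.stop_pipeline(pipeline)
```

Starting the pipeline outside the try block means stop_pipeline is only attempted once the start call has returned.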
- wait_for_pipeline_status(pipeline, status, timeout_sec=30)[source]#
Block until a pipeline reaches the desired status.
- Parameters
pipeline (streamsets.sdk.sdc_models.Pipeline) – The pipeline instance.
status (str) – The desired status to wait for.
timeout_sec (int, optional) – Timeout to wait for pipeline to reach status, in seconds. Default: streamsets.sdk.sdc.DEFAULT_WAIT_FOR_STATUS_TIMEOUT.
- Raises
TimeoutError – If timeout_sec passes without pipeline reaching status.
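When a timeout should be treated as an ordinary outcome rather than an error, the call can be wrapped to return a boolean. This sketch assumes the TimeoutError named above is Python's built-in exception; reaches_status is a hypothetical helper name.

```python
def reaches_status(sdc, pipeline, status, timeout_sec=30):
    """Return True if `pipeline` reaches `status` within `timeout_sec` seconds.

    Illustrative helper, not part of the SDK. Assumes the SDK raises the
    built-in TimeoutError when the wait expires.
    """
    try:
        sdc.wait_for_pipeline_status(pipeline, status, timeout_sec=timeout_sec)
    except TimeoutError:
        return False
    return True
```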
- sdc.DEFAULT_SDC_USERNAME = 'admin'#
- sdc.DEFAULT_SDC_PASSWORD = 'admin'#
Models#
These models wrap and provide useful functionality for interacting with common SDC abstractions.
Alerts#
- class streamsets.sdk.sdc_models.Alert(alert)[source]#
Pipeline alert.
- Parameters
alert (
dict
) – Python object representation of a pipeline alert.
- property alert_texts#
Get alert’s alert texts.
- Returns
The alert’s alert texts as a
str
.
- property label#
Get alert’s label.
- Returns
The alert’s label as a
str
.
- property pipeline_id#
Get alert’s pipeline ID.
- Returns
The pipeline ID as a
str
.
- class streamsets.sdk.sdc_models.Alerts(alerts)[source]#
Container for list of alerts with filtering capabilities.
- Parameters
alerts (
dict
) – Python object representation of alerts.
- alerts#
A list of streamsets.sdk.sdc_models.Alert instances.
- Type
list
- for_pipeline(pipeline)[source]#
Get alerts for the specified pipeline.
- Parameters
pipeline (str) – The pipeline for which to get alerts.
- Returns
An instance of
streamsets.sdk.sdc_models.Alerts
.
Data Rules#
- class streamsets.sdk.sdc_models.DataDriftRule(stream, label, condition=None, sampling_percentage=5, sampling_records_to_retain=10, enable_meter=True, enable_alert=True, alert_text='${alert:info()}', send_email=False, active=False)[source]#
Pipeline data drift rule.
- Parameters
stream (str) – Stream to use for data rule. An entry from a Stage instance’s output_lanes list is typically used here.
label (str) – Rule label.
condition (str, optional) – Data rule condition. Default: None.
sampling_percentage (int, optional) – Default: 5.
sampling_records_to_retain (int, optional) – Default: 10.
enable_meter (bool, optional) – Default: True.
enable_alert (bool, optional) – Default: True.
alert_text (str, optional) – Default: '${alert:info()}'.
send_email (bool, optional) – Default: False.
active (bool, optional) – Enable the data rule. Default: False.
- property active#
The rule is active.
- Returns
A
bool
.
- class streamsets.sdk.sdc_models.DataRule(stream, label, condition=None, sampling_percentage=5, sampling_records_to_retain=10, enable_meter=True, enable_alert=True, alert_text=None, threshold_type='count', threshold_value=100, min_volume=1000, send_email=False, active=False)[source]#
Pipeline data rule.
- Parameters
stream (str) – Stream to use for data rule. An entry from a Stage instance’s output_lanes list is typically used here.
label (str) – Rule label.
condition (str, optional) – Data rule condition. Default: None.
sampling_percentage (int, optional) – Default: 5.
sampling_records_to_retain (int, optional) – Default: 10.
enable_meter (bool, optional) – Default: True.
enable_alert (bool, optional) – Default: True.
alert_text (str, optional) – Default: None.
threshold_type (str, optional) – One of count or percentage. Default: 'count'.
threshold_value (int, optional) – Default: 100.
min_volume (int, optional) – Only set if threshold_type is percentage. Default: 1000.
send_email (bool, optional) – Default: False.
active (bool, optional) – Enable the data rule. Default: False.
- property active#
Returns if the rule is active or not.
- Returns
A
bool
.
History#
- class streamsets.sdk.sdc_models.History(history)[source]#
Pipeline history.
- Parameters
history (
dict
) – Python object representation of the pipeline history.
- entries#
A list of streamsets.sdk.sdc_models.HistoryEntry instances.
- Type
list
- property latest#
Get pipeline history’s latest entry.
- Returns
The most recent pipeline history entry as an instance of
streamsets.sdk.sdc_models.HistoryEntry
.
- class streamsets.sdk.sdc_models.HistoryEntry(entry)[source]#
Pipeline history entry.
- Parameters
entry (
dict
) – Python object representation of the history entry.
- property metrics#
Get pipeline history entry’s metrics.
- Returns
The pipeline history entry’s metrics as an instance of
streamsets.sdk.sdc_models.Metrics
.
Issues#
- class streamsets.sdk.sdc_models.Issue(issue)[source]#
Issue encountered for a pipeline or a stage.
- Parameters
issue (
dict
) – Python object representation of the issue.
- class streamsets.sdk.sdc_models.Issues(issues)[source]#
Issues encountered for pipelines as well as stages.
- Parameters
issues (
dict
) – Python object representation of the issues.
- issues_count#
The number of issues.
- Type
int
- pipeline_issues#
A list of streamsets.sdk.sdc_models.Issue instances.
- Type
list
- stage_issues#
A dictionary mapping stage names to instances of streamsets.sdk.sdc_models.Issue.
- Type
dict
Logs#
- class streamsets.sdk.sdc_models.Log(log)[source]#
Model for SDC logs.
- Parameters
log (
list
) – A list of dictionaries (JSON representation) of the log.
Metrics#
- class streamsets.sdk.sdc_models.MetricCounter(counter)[source]#
Metric counter.
- Parameters
counter (
dict
) – Python object representation of a metric counter.
- property count#
Get the metric counter’s count.
- Returns
The metric counter’s count as an
int
.
- class streamsets.sdk.sdc_models.MetricGauge(gauge)[source]#
Metric gauge.
- Parameters
gauge (
dict
) – Python object representation of a metric gauge.
- property value#
Get the metric gauge’s value.
- Returns
The metric gauge’s value as a
str
.
- class streamsets.sdk.sdc_models.MetricHistogram(histogram)[source]#
Metric histogram.
- Parameters
histogram (
dict
) – Python object representation of a metric histogram.
- class streamsets.sdk.sdc_models.MetricTimer(timer)[source]#
Metric timer.
- Parameters
timer (
dict
) – Python object representation of a metric timer.
- property count#
Get the metric timer’s count.
- Returns
The metric timer’s count as an
int
.
- class streamsets.sdk.sdc_models.Metrics(metrics)[source]#
Metrics.
- Parameters
metrics (
dict
) – Python object representation of metrics.
- counter(name)[source]#
Get the metric counter from metrics.
- Parameters
name (str) – Counter name.
- Returns
The metric counter as an instance of
streamsets.sdk.sdc_models.MetricCounter
.
- gauge(name)[source]#
Get the metric gauge from metrics.
- Parameters
name (str) – Gauge name.
- Returns
The metric gauge as an instance of
streamsets.sdk.sdc_models.MetricGauge
.
- histogram(name)[source]#
Get the metric histogram from metrics.
- Parameters
name (str) – Histogram name.
- Returns
The metric histogram as an instance of
streamsets.sdk.sdc_models.MetricHistogram
.
- meter(name)[source]#
Get the metric meter from metrics.
- Parameters
name (str) – Meter name.
- Returns
The metric meter as an instance of
streamsets.sdk.sdc_models.MetricMeter
.
- property pipeline#
Get pipeline-level metrics.
- Returns
An instance of
streamsets.sdk.sdc_models.PipelineMetrics
.
- timer(name)[source]#
Get the metric timer from metrics.
- Parameters
name (str) – Timer name.
- Returns
The metric timer as an instance of
streamsets.sdk.sdc_models.MetricTimer
.
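The Metrics accessors chain naturally from DataCollector.get_pipeline_metrics. As a rough sketch, the helper below reads the count of a named counter; read_counter is a hypothetical name, and the counter name itself depends on the pipeline, so it is left as a caller-supplied argument rather than guessed here.

```python
def read_counter(sdc, pipeline, counter_name):
    """Return the current count of the metric counter `counter_name`.

    Illustrative helper, not part of the SDK. `sdc` is assumed to be a
    connected DataCollector and `counter_name` a counter that exists in the
    pipeline's metrics.
    """
    metrics = sdc.get_pipeline_metrics(pipeline)   # a Metrics instance
    return metrics.counter(counter_name).count     # MetricCounter.count
```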
Pipelines#
- class streamsets.sdk.sdc_models.PipelineBuilder(pipeline, definitions, fragment=False, data_collector=None)[source]#
Class with which to build SDC pipelines.
This class allows a user to programmatically generate an SDC pipeline. Instead of instantiating this class directly, most users should use streamsets.sdk.DataCollector.get_pipeline_builder().
- Parameters
pipeline (dict) – Python object representing an empty pipeline. If created manually, this would come from creating a new pipeline in SDC and then exporting it before doing any configuration.
definitions (dict) – The output of SDC’s definitions endpoint.
fragment (boolean, optional) – Specify if a fragment builder. Default: False.
- add_data_drift_rule(*data_drift_rules)[source]#
Add one or more data drift rules to the pipeline.
- Parameters
*data_drift_rules – One or more instances of
streamsets.sdk.sdc_models.DataDriftRule
.
- add_data_rule(*data_rules)[source]#
Add one or more data rules to the pipeline.
- Parameters
*data_rules – One or more instances of
streamsets.sdk.sdc_models.DataRule
.
- add_error_stage(label=None, name=None, library=None)[source]#
Add an error stage to the pipeline.
When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.
- Parameters
label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.
name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.
library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- add_fragment(fragment, parameter_name_prefix=None)[source]#
Add a fragment to the pipeline.
- Parameters
fragment (streamsets.sdk.sch_models.Pipeline) – Fragment to add.
parameter_name_prefix (str, optional) – Prefix name for the parameters of fragment. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- add_metric_rule(*metric_rules)[source]#
Add one or more metric rules to the pipeline.
- Parameters
*metric_rules – One or more instances of streamsets.sdk.sdc_models.MetricRule.
- add_stage(label=None, name=None, type=None, library=None)[source]#
Add a stage to the pipeline.
When specifying a stage, either label or name must be used. type and library may also be used to select a particular stage if ambiguities exist. If type and/or library are omitted, the first stage definition matching the given label or name will be used.
- Parameters
label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.
name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.
type (str, optional) – SDC stage type to use when selecting stage from definitions (e.g. origin, destination, processor, executor). Default: None.
library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- add_start_event_stage(label=None, name=None, library=None)[source]#
Add start event stage to the pipeline.
When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.
- Parameters
label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.
name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.
library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- add_stats_aggregator_stage(label=None, name=None, library=None)[source]#
Add a stats aggregator stage to the pipeline.
When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.
- Parameters
label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.
name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.
library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- add_stop_event_stage(label=None, name=None, library=None)[source]#
Add stop event stage to the pipeline.
When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.
- Parameters
label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.
name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.
library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- add_test_origin_stage(label=None, name=None, library=None)[source]#
Add test origin stage to the pipeline.
When specifying a stage, either label or name must be used. If library is omitted, the first stage definition matching the given label or name will be used.
- Parameters
label (str, optional) – SDC stage label to use when selecting stage from definitions. Default: None.
name (str, optional) – SDC stage name to use when selecting stage from definitions. Default: None.
library (str, optional) – SDC stage library to use when selecting stage from definitions. Default: None.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- build(title='Pipeline', **kwargs)[source]#
Build the pipeline.
- Parameters
title (str, optional) – Pipeline title to use. Default: 'Pipeline'.
- Returns
An instance of
streamsets.sdk.sdc_models.Pipeline
.
- import_pipeline(pipeline, **kwargs)[source]#
Import a pipeline into the PipelineBuilder.
- Parameters
pipeline (dict) – Exported pipeline.
- Returns
An instance of
streamsets.sdk.sdc_models.PipelineBuilder
.
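Putting the builder methods together, a minimal two-stage pipeline might be assembled as sketched below. build_two_stage_pipeline is a hypothetical helper; sdc is assumed to be a connected DataCollector, and the stage labels are supplied by the caller because the available labels depend on the stage libraries installed on the instance.

```python
def build_two_stage_pipeline(sdc, origin_label, destination_label, title="Pipeline"):
    """Build an origin -> destination pipeline and add it to `sdc`.

    Illustrative helper, not part of the SDK.
    """
    builder = sdc.get_pipeline_builder()
    # type narrows the lookup in case a label is ambiguous across stage types.
    origin = builder.add_stage(label=origin_label, type="origin")
    destination = builder.add_stage(label=destination_label, type="destination")
    # Connect the stages; the >> operator is documented as invoking add_output.
    origin.add_output(destination)
    pipeline = builder.build(title=title)
    sdc.add_pipeline(pipeline)
    return pipeline
```

A call such as build_two_stage_pipeline(sdc, 'Dev Raw Data Source', 'Trash') would be one way to use it, assuming stages with those labels exist in the instance's definitions.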
- class streamsets.sdk.sdc_models.Pipeline(pipeline, all_stages=None, fragment=False)[source]#
SDC pipeline.
This class provides abstractions to make it easier to interact with a pipeline before it’s imported into SDC.
- Parameters
pipeline (dict) – A Python object representing the serialized pipeline.
all_stages (dict, optional) – A dictionary mapping stage names to streamsets.sdk.sdc_models.Stage instances. Default: None.
fragment (boolean, optional) – Specify if a fragment. Default: False.
- add_parameters(**parameters)[source]#
Add pipeline parameters.
- Parameters
**parameters – Keyword arguments to add.
- property configuration#
Get pipeline’s configuration.
- Returns
An instance of
streamsets.sdk.models.Configuration
.
- property delivery_guarantee#
Get the delivery guarantee.
- Returns
The delivery guarantee as a
str
.
- property id#
Get the pipeline id.
- Returns
The pipeline id as a
str
.
- property metadata#
Get the pipeline metadata.
- Returns
Pipeline metadata as a Python object.
- property origin_stage#
Get the pipeline’s origin stage.
- Returns
An instance of
streamsets.sdk.sdc_models.Stage
.
- property parameters#
Get the pipeline parameters.
- Returns
A
dict
of parameter key-value pairs.
- property rate_limit#
Get the rate limit (records/sec).
- Returns
The rate limit as a
str
.
- property title#
Get the pipeline title.
- Returns
The pipeline title as a
str
.
- class streamsets.sdk.sdc_models.Stage(stage, label=None)[source]#
Pipeline stage.
- Parameters
stage – JSON representation of the pipeline stage.
label (
str
, optional) – Human-readable stage label. Default:None
.
- configuration#
The stage configuration.
- services#
If supported by the stage, a dictionary mapping a service name to an instance of streamsets.sdk.models.Configuration.
- Type
dict
- add_output(*other_stages, event_lane=False)[source]#
Connect output of this stage to another stage.
The __rshift__ operator (>>) has been overloaded to invoke this method.
- Parameters
*other_stages (streamsets.sdk.sdc_models.Stage) – One or more Stage objects.
- Returns
This stage as an instance of streamsets.sdk.sdc_models.Stage.
- property description#
The stage’s description.
- Type
str
- property event_lanes#
Get the stage’s list of event lanes.
- Returns
A
list
of event lanes.
- property label#
The stage’s label.
- Type
str
- property library#
Get the stage’s library.
- Returns
The stage library as a
str
.
- property output_lanes#
Get the stage’s list of output lanes.
- Returns
A
list
of output lanes.
- set_attributes(**attributes)[source]#
Set one or more stage attributes.
- Parameters
**attributes – Attributes to set.
- Returns
This stage as an instance of
streamsets.sdk.sdc_models.Stage
.
- property stage_on_record_error#
The stage’s on record error configuration value.
- property stage_record_preconditions#
The stage’s record preconditions configuration value.
- property stage_required_fields#
The stage’s required fields configuration value.
Pipeline ACLs#
Pipeline Permissions#
- class streamsets.sdk.sdc_models.PipelinePermission(pipeline_permission)[source]#
A container for a pipeline permission.
- Parameters
pipeline_permission (
dict
) – A Python object representation of a pipeline permission.
- class streamsets.sdk.sdc_models.PipelinePermissions(pipeline_permissions)[source]#
Container for list of permissions for a pipeline.
- Parameters
pipeline_permissions (
dict
) – A Python object representation of pipeline permissions.
- permissions#
A list of streamsets.sdk.sdc_models.PipelinePermission instances.
- Type
list
Previews#
- class streamsets.sdk.sdc_models.Preview(pipeline_id, previewer_id, preview)[source]#
Preview.
- Parameters
pipeline_id (str) – Pipeline ID.
previewer_id (str) – Previewer ID.
preview (dict) – Python object representation of the preview.
- issues#
An instance of streamsets.sdk.sdc_models.Issues.
- Type
dict
- preview_batches#
A list of streamsets.sdk.sdc_models.Batch instances.
- Type
list
Snapshots#
- class streamsets.sdk.sdc_models.Batch(batch)[source]#
Snapshot batch.
- Parameters
batch – Python object representation of the snapshot batch.
- class streamsets.sdk.sdc_models.Record(record)[source]#
Record.
- Parameters
record (
dict
) – Python object representation of the record.
- header#
An instance of streamsets.sdk.sdc_models.RecordHeader.
- Type
dict
- value#
Python object representation of the record value.
- Type
dict
- value2#
A typed representation of the record value.
- class streamsets.sdk.sdc_models.RecordHeader(header)[source]#
Record Header.
- Parameters
header (
dict
) – Python object representation of the record header.
- class streamsets.sdk.sdc_models.Snapshot(api_client, snapshot)[source]#
Snapshot.
- Parameters
snapshot (
dict
) – Python object representation of the snapshot.
- batch_number#
The number of the batch that the snapshot captured.
- Type
int
- batches#
A list of streamsets.sdk.sdc_models.Batch instances.
- Type
list
- id#
The snapshot’s ID.
- Type
str
- name#
The snapshot’s name.
- Type
str
- pipeline_id#
The ID of the pipeline that the snapshot belongs to.
- Type
str
- time_stamp#
The creation date of the snapshot as a unix timestamp.
- Type
int
- class streamsets.sdk.sdc_models.StageOutput(stage_output)[source]#
Snapshot batch’s stage output.
- Parameters
stage_output – Python object representation of the stage output.
- property output#
Gets the stage output’s output.
If the stage contains multiple lanes, use streamsets.sdk.sdc_models.StageOutput.output_lanes.
- Raises
Exception – If the stage contains multiple lanes.
- Returns
An instance of
streamsets.sdk.sdc_models.Record
.
Users#
- class streamsets.sdk.sdc_models.User(user)[source]#
User.
- Parameters
user (
dict
) – Python object representation of the user.
- property groups#
Get user’s groups.
- Returns
User groups as a
str
.
- property name#
Get user’s name.
- Returns
User name as a
str
.
- property roles#
Get user’s roles.
- Returns
User roles as a
str
.