Importing and Exporting Pipelines#


The SDK also supports importing and exporting pipelines programmatically, either between multiple SDC instances or within the same SDC instance.

Simple Data Collector to Data Collector Import-Export Operation#

To export a pipeline from a Data Collector instance, use the streamsets.sdk.DataCollector.export_pipeline() method, which returns a dict containing the JSON representation of the pipeline. See the API reference for this method for details on the arguments it accepts.

import json

# Export the pipeline and write its JSON representation to a file
pipeline_json = sdc.export_pipeline(pipeline=sdc.pipelines.get(title='pipeline name'),
                                    include_plain_text_credentials=True)
with open('./from_sdc_for_sdc.json', 'w') as f:
    json.dump(pipeline_json, f)

You can import a pipeline from a JSON file into a Data Collector instance in two ways:

1. Import the JSON file into streamsets.sdk.sdc_models.PipelineBuilder and add the pipeline:

import json

# Load the exported JSON and rebuild the pipeline with a PipelineBuilder
with open('./from_sdc_for_sdc.json', 'r') as input_file:
    pipeline_json = json.load(input_file)

sdc_pipeline_builder = sdc.get_pipeline_builder()
sdc_pipeline_builder.import_pipeline(pipeline=pipeline_json)
pipeline = sdc_pipeline_builder.build(title='built from imported json file from sdc')
sdc.add_pipeline(pipeline)
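
The builder approach is useful when you want to modify the pipeline before adding it to the instance, for example, giving it a new title via build() as shown above.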

2. Directly import the dict object that was generated by the streamsets.sdk.DataCollector.export_pipeline() method:

pipeline = sdc.import_pipeline(pipeline=pipeline_json)
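
Putting the two halves together, you can copy a pipeline between two Data Collector instances without an intermediate file. The following is a minimal sketch; the instance URLs are hypothetical, and it assumes both instances are reachable from where the script runs:

from streamsets.sdk import DataCollector

# Hypothetical source and destination Data Collector instances
source_sdc = DataCollector('http://source-sdc.example.com:18630')
destination_sdc = DataCollector('http://destination-sdc.example.com:18630')

# Export the pipeline's JSON representation from the source instance ...
pipeline_json = source_sdc.export_pipeline(pipeline=source_sdc.pipelines.get(title='pipeline name'),
                                           include_plain_text_credentials=True)

# ... and import the resulting dict directly into the destination instance
pipeline = destination_sdc.import_pipeline(pipeline=pipeline_json)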

Exporting pipelines from Data Collector for Control Hub#

To export a Data Collector pipeline for use in Control Hub, set the optional include_library_definitions argument to True.

pipeline_json = sdc.export_pipeline(pipeline=sdc.pipelines.get(title='pipeline name'),
                                    include_library_definitions=True,
                                    include_plain_text_credentials=True)
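
As in the SDC-to-SDC example above, the returned dict can be written out to a file for later import into Control Hub; the filename below is only illustrative:

import json

# Write the Control Hub-ready export (with library definitions) to a file
with open('./from_sdc_for_sch.json', 'w') as f:
    json.dump(pipeline_json, f)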

Exporting and Importing multiple Pipelines at once#

To export multiple pipelines from a Data Collector into a zip archive, use the streamsets.sdk.DataCollector.export_pipelines() method and pass in a list of streamsets.sdk.sdc_models.Pipeline objects:

# Returns a list of all pipelines on the given SDC instance
pipelines = sdc.pipelines

# Show the pipelines to be exported
pipelines

Output:

[<Pipeline (id=apiprocesdff151fe-1f1b-42e3-8920-895de370d607, title=sample_one)>,
 <Pipeline (id=httpclienbea409f3-7cd6-4001-96bc-f065eb255430, title=sample_two)>,
 <Pipeline (id=schapi7774665d-6d90-4e79-ad97-303fefcf1822, title=sample_three)>,
 <Pipeline (id=snapshot187a8311-ee25-4543-894e-ab0f0a73b255, title=sample_four)>]

# Export the pipelines into a zip archive and write it to disk
pipelines_zip_data = sdc.export_pipelines(pipelines, include_library_definitions=True)
with open('./sdc_exports_for_sch.zip', 'wb') as output_file:
    output_file.write(pipelines_zip_data)
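
You don't have to export every pipeline on the instance; export_pipelines() accepts any list of Pipeline objects, so a filtered subset works just as well. A hypothetical filter by title:

# Hypothetical subset: only pipelines whose titles start with 'sample'
sample_pipelines = [pipeline for pipeline in sdc.pipelines
                    if pipeline.title.startswith('sample')]
sample_zip_data = sdc.export_pipelines(sample_pipelines, include_library_definitions=True)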

Similarly, you can import multiple pipelines into a Data Collector instance by using streamsets.sdk.DataCollector.import_pipelines_from_archive():

with open('./sdc_exports_for_sch.zip', 'rb') as input_file:
    pipelines_zip_data = input_file.read()
pipelines = sdc.import_pipelines_from_archive(pipelines_file=pipelines_zip_data)
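
The assignment above suggests the call returns the imported pipelines; assuming it does, you can inspect them like any other streamsets.sdk.sdc_models.Pipeline objects:

# Confirm the titles of the pipelines that were just imported
for pipeline in pipelines:
    print(pipeline.title)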