Importing and Exporting Pipelines
The SDK also allows for importing and exporting pipelines programmatically, either between multiple SDC instances or within the same SDC instance.

Simple Data Collector to Data Collector Import-Export Operation

To export a pipeline from a Data Collector instance, use the streamsets.sdk.DataCollector.export_pipeline() method, which returns a dict containing the JSON representation of the pipeline. See the API reference for this method for details on the arguments it accepts.

import json

pipeline_json = sdc.export_pipeline(pipeline=sdc.pipelines.get(title='pipeline name'),
                                    include_plain_text_credentials=True)
with open('./from_sdc_for_sdc.json', 'w') as f:
    json.dump(pipeline_json, f)
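Because the export is a plain dict of JSON-serializable data, it survives a file round-trip unchanged. A minimal, self-contained sketch of that round-trip, using a stand-in dict in place of a live export (the key names shown are illustrative only):

```python
import json
import os
import tempfile

# Stand-in for the dict returned by export_pipeline(); key names are illustrative
pipeline_json = {'pipelineConfig': {'title': 'pipeline name'}}

path = os.path.join(tempfile.mkdtemp(), 'from_sdc_for_sdc.json')
with open(path, 'w') as f:
    json.dump(pipeline_json, f)

# Reading the file back yields an equal dict, ready to hand to an import
with open(path, 'r') as f:
    restored = json.load(f)

print(restored == pipeline_json)  # True
```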

You can import a pipeline from a JSON file into a Data Collector instance in two ways:

  1. Import the JSON file into streamsets.sdk.sdc_models.PipelineBuilder and add the pipeline:

import json

with open('./from_sdc_for_sdc.json', 'r') as input_file:
    pipeline_json = json.load(input_file)

sdc_pipeline_builder = sdc.get_pipeline_builder()
sdc_pipeline_builder.import_pipeline(pipeline=pipeline_json)
pipeline = sdc_pipeline_builder.build(title='built from imported json file from sdc')
sdc.add_pipeline(pipeline)

  2. Directly import the dict object returned by the streamsets.sdk.DataCollector.export_pipeline() method:

pipeline = sdc.import_pipeline(pipeline=pipeline_json)

Exporting pipelines from Data Collector for Control Hub

To export a Data Collector pipeline to use in Control Hub, the optional argument include_library_definitions must be set to True.

pipeline_json = sdc.export_pipeline(pipeline=sdc.pipelines.get(title='pipeline name'),
                                    include_library_definitions=True,
                                    include_plain_text_credentials=True)

Exporting and Importing multiple Pipelines at once

To export multiple pipelines from a Data Collector into a zip archive, you can use the streamsets.sdk.DataCollector.export_pipelines() method and pass in a list of streamsets.sdk.sdc_models.Pipeline objects:

# Returns a list of all pipelines on the given SDC instance
pipelines = sdc.pipelines

# Show the pipelines to be exported
pipelines

Output:

[<Pipeline (id=apiprocesdff151fe-1f1b-42e3-8920-895de370d607, title=sample_one)>,
 <Pipeline (id=httpclienbea409f3-7cd6-4001-96bc-f065eb255430, title=sample_two)>,
 <Pipeline (id=schapi7774665d-6d90-4e79-ad97-303fefcf1822, title=sample_three)>,
 <Pipeline (id=snapshot187a8311-ee25-4543-894e-ab0f0a73b255, title=sample_four)>]

pipelines_zip_data = sdc.export_pipelines(pipelines, include_library_definitions=True)
with open('./sdc_exports_for_sch.zip', 'wb') as output_file:
    output_file.write(pipelines_zip_data)
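The multi-pipeline export is returned as raw zip-archive bytes. A self-contained sketch of inspecting such an archive with the standard zipfile module, using stand-in bytes built in memory (a real export's entry names and contents will differ):

```python
import io
import json
import zipfile

# Build stand-in archive bytes in place of a real export result;
# entry names and JSON contents here are illustrative only
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample_one.json', json.dumps({'title': 'sample_one'}))
    archive.writestr('sample_two.json', json.dumps({'title': 'sample_two'}))
pipelines_zip_data = buffer.getvalue()

# List the archive entries before writing the zip to disk or importing it
with zipfile.ZipFile(io.BytesIO(pipelines_zip_data)) as archive:
    entries = archive.namelist()

print(entries)  # ['sample_one.json', 'sample_two.json']
```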

Similarly, you can import multiple pipelines into a Data Collector instance by using streamsets.sdk.DataCollector.import_pipelines_from_archive():

with open('./sdc_exports_for_sch.zip', 'rb') as input_file:
    pipelines_zip_data = input_file.read()
pipelines = sdc.import_pipelines_from_archive(pipelines_file=pipelines_zip_data)