Performance Tuning Properties

By default, Transformer adds several Spark configuration properties, with suggested values, to each pipeline. These properties override the default Spark values in the cluster.

The defaults for these properties should work in most cases. If you are an advanced user, you can tune the performance of a specific pipeline by modifying these properties or by adding other Spark configuration properties.

For more information about these configuration properties, see the Spark Configuration documentation.
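
As a point of reference, the following is a minimal PySpark sketch of what the equivalent resource properties look like when set directly on a Spark session outside Transformer. The application name and all values are illustrative, not Transformer's suggested defaults.

    from pyspark.sql import SparkSession

    # Minimal sketch: executor resource properties set directly on a Spark
    # session outside Transformer. Values are illustrative only. Driver
    # properties such as spark.driver.memory must be set before the driver
    # JVM starts, so in practice they are passed to spark-submit instead:
    #   spark-submit --conf spark.driver.memory=2g --conf spark.driver.cores=1 ...
    spark = (
        SparkSession.builder
        .appName("tuning-sketch")               # hypothetical application name
        .config("spark.executor.memory", "4g")  # heap available to each executor
        .config("spark.executor.cores", "2")    # cores used by each executor
        .getOrCreate()
    )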

Transformer adds the following Spark configuration properties to each pipeline:

spark.driver.memory
  Maximum amount of memory that the Spark driver uses to run the pipeline.

spark.driver.cores
  Number of cores that the Spark driver uses to run the pipeline.

spark.executor.memory
  Maximum amount of memory that each Spark executor uses to run the pipeline.

spark.executor.cores
  Number of cores that each Spark executor uses to run the pipeline.
  Note: Databricks and Dataproc do not allow overrides of this configuration property. It is ignored when the pipeline runs on a Databricks or Dataproc cluster.

spark.dynamicAllocation.enabled
  Enables dynamic resource allocation, so that Spark uses as many executors as required to run the pipeline.
  Note: Local pipelines always run on one Spark executor.

spark.shuffle.service.enabled
  Enables the external shuffle service. Must be set to true when dynamic allocation is enabled.

spark.dynamicAllocation.minExecutors
  Minimum number of Spark executors that the pipeline runs on when dynamic allocation is enabled.

spark.dynamicAllocation.maxExecutors
  Maximum number of Spark executors that the pipeline runs on when dynamic allocation is enabled.
  The maximum number of Spark executors allowed for each pipeline is determined by your StreamSets account. You can decrease this number to limit executor usage in the cluster, but you cannot increase it.
  Note: When dynamic allocation is disabled, the spark.executor.instances configuration property determines the number of Spark executors used for the pipeline. Its maximum value is also determined by your StreamSets account. A sketch of these allocation settings appears below.
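
To illustrate how the allocation properties fit together, the following sketch enables dynamic allocation on a Spark session outside Transformer, again with illustrative values rather than Transformer's suggested defaults.

    from pyspark.sql import SparkSession

    # Minimal sketch of dynamic allocation settings, with illustrative
    # values. spark.shuffle.service.enabled must be true when dynamic
    # allocation is enabled, and the cluster must actually run an external
    # shuffle service for executors to be released safely.
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-sketch")  # hypothetical application name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "10")
        .getOrCreate()
    )

    # With dynamic allocation disabled, a fixed executor count applies instead:
    #   .config("spark.dynamicAllocation.enabled", "false")
    #   .config("spark.executor.instances", "4")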