Extra Spark Configuration
When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.
You can add any additional Spark configuration property, as described in the Spark configuration documentation.
Configuration Property | Description |
---|---|
spark.home | Overrides the SPARK_HOME environment variable set on the Transformer machine. For example, if multiple Spark versions are installed locally on the Transformer machine, you can add the spark.home property to specify the Spark installation that the pipeline should use. |
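As an illustration only, the following sketch shows how such extra properties would look if you set them on a plain SparkConf yourself; Transformer does this for you when it launches the Spark application. The property values shown are placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf

// Sketch only: extra pipeline properties are ordinary Spark configuration entries.
// Transformer sets them at launch time; outside of Transformer they could be
// expressed like this. The values below are placeholders, not recommendations.
val conf = new SparkConf()
  .set("spark.home", "/opt/spark-3.4.1")       // hypothetical local Spark installation
  .set("spark.sql.shuffle.partitions", "200")  // any documented Spark property can be added the same way
```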
Performance Tuning Properties
By default, Transformer adds several Spark configuration properties to each pipeline with suggested values. These properties override the default Spark values in the cluster.
The defaults for these properties should work in most cases. If you are an advanced user, you can tune the performance of a specific pipeline by modifying these properties or by adding other Spark configuration properties.
For more information about these configuration properties, see the Spark configuration documentation.
Transformer adds the following Spark configuration properties to each pipeline:
Spark Configuration Property | Description |
---|---|
spark.driver.memory | Maximum amount of memory that the Spark driver uses to run the pipeline. |
spark.driver.cores | Number of cores that the Spark driver uses to run the pipeline. |
spark.executor.memory | Maximum amount of memory that each Spark executor uses to run the pipeline. |
spark.executor.cores | Number of cores that each Spark executor uses to run the pipeline. This property is ignored when the pipeline runs on a Databricks or Dataproc cluster, because those clusters do not allow it to be overridden. |
spark.dynamicAllocation.enabled | Enables dynamic resource allocation, so that Spark uses as many executors as required to run the pipeline. Note: Local pipelines always run on one Spark executor. |
spark.shuffle.service.enabled | Enables the external shuffle service. Must be true when dynamic allocation is enabled. |
spark.dynamicAllocation.minExecutors | Minimum number of Spark executors that the pipeline runs on when dynamic allocation is enabled. |
spark.dynamicAllocation.maxExecutors | Maximum number of Spark executors that the pipeline runs on when dynamic allocation is enabled. The maximum number of Spark executors allowed for each pipeline is determined by your account type. You can decrease this number to limit executor usage in the cluster, but you cannot increase it. Note: When dynamic allocation is disabled, the spark.executor.instances configuration property determines the number of Spark executors used for the pipeline. The maximum value of spark.executor.instances is also determined by your account type. |
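To confirm which values a running pipeline actually received, a minimal sketch like the following (assuming code that runs inside the launched Spark application, such as a custom Scala processor) reads the effective values back from the runtime configuration. The property names match the table above; the fallback values are placeholders, not Transformer defaults.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: read back the tuning properties that were passed to the Spark
// application at launch. Fallback values are placeholders, not Transformer defaults.
val spark = SparkSession.builder().getOrCreate()

val driverMemory   = spark.conf.get("spark.driver.memory", "unset")
val executorMemory = spark.conf.get("spark.executor.memory", "unset")
val dynamicAlloc   = spark.conf.get("spark.dynamicAllocation.enabled", "false").toBoolean

// With dynamic allocation enabled, maxExecutors caps the executor count;
// otherwise spark.executor.instances determines it (see the note above).
val executorCap =
  if (dynamicAlloc) spark.conf.get("spark.dynamicAllocation.maxExecutors", "unset")
  else spark.conf.get("spark.executor.instances", "unset")

println(s"driver memory: $driverMemory, executor memory: $executorMemory, executor cap: $executorCap")
```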