Provisioned Cluster
You can configure a pipeline to run on a provisioned cluster. When you use a provisioned cluster, Transformer creates a new EMR Spark cluster the first time the pipeline runs. You can optionally have Transformer terminate the cluster after the pipeline stops.
To provision a cluster for the pipeline, select the Provision a New Cluster property on the Cluster tab of the pipeline properties. Then, define the cluster configuration properties.
When provisioning a cluster, you specify cluster details such as the EMR version, the instance types to create, and the ID of the subnet to create the cluster in. You can define bootstrap actions to execute before processing data. You also indicate whether to terminate the cluster after the pipeline stops.
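To make the grouping of these details concrete, here is a minimal Python sketch that collects them in one object and lays them out in the shape of an Amazon EMR RunJobFlow request. This is an illustration only: Transformer provisions the cluster itself, and the class name, field names, and example values below are hypothetical and do not correspond to Transformer property names. The request is also deliberately incomplete (a real request needs IAM roles, among other settings).

```python
from dataclasses import dataclass, field


@dataclass
class ProvisionedClusterSettings:
    """Hypothetical container for the cluster details named in the text."""

    emr_version: str            # EMR release label, e.g. "emr-6.10.0" (example value)
    master_instance_type: str   # EC2 instance type for the master node
    worker_instance_type: str   # EC2 instance type for the worker nodes
    subnet_id: str              # ID of the subnet to create the cluster in
    instance_count: int = 2     # the text gives 2 as the minimum
    bootstrap_script_paths: list = field(default_factory=list)
    terminate_after_stop: bool = True  # handled by Transformer, not part of the request

    def __post_init__(self):
        if self.instance_count < 2:
            raise ValueError("a provisioned cluster needs at least 2 instances")

    def to_run_job_flow_request(self, name):
        """Lay the settings out like an EMR RunJobFlow request (partial sketch)."""
        return {
            "Name": name,
            "ReleaseLabel": self.emr_version,
            "Applications": [{"Name": "Spark"}],
            "Instances": {
                "MasterInstanceType": self.master_instance_type,
                "SlaveInstanceType": self.worker_instance_type,
                "InstanceCount": self.instance_count,
                "Ec2SubnetId": self.subnet_id,
            },
            # Bootstrap actions run on the cluster nodes before data is processed.
            "BootstrapActions": [
                {
                    "Name": f"bootstrap-{i}",
                    "ScriptBootstrapAction": {"Path": path, "Args": []},
                }
                for i, path in enumerate(self.bootstrap_script_paths, start=1)
            ],
        }


# Example values are placeholders, not recommendations.
settings = ProvisionedClusterSettings(
    emr_version="emr-6.10.0",
    master_instance_type="m5.xlarge",
    worker_instance_type="m5.xlarge",
    subnet_id="subnet-0123456789abcdef0",
    bootstrap_script_paths=["s3://example-bucket/setup.sh"],
)
request = settings.to_run_job_flow_request("transformer-demo")
```

Keeping the termination flag outside the request reflects the description above: termination happens after the pipeline stops, so it is a decision made at that point rather than a property of the cluster at creation time.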
You can define the number of EC2 instances that the cluster uses to process data. The minimum is 2. To improve performance, you might increase the number of instances based on the number of partitions that the pipeline uses. You can also configure the pipeline to save log data to a different location so that the data is not lost when the cluster terminates.
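As a rough illustration of sizing the cluster from the partition count, the following Python sketch derives an instance count while respecting the minimum of 2. The function name, the partitions-per-instance ratio, and the upper cap are all assumptions made for this example; neither Transformer nor EMR prescribes these values, so treat the numbers as placeholders to tune for your workload.

```python
import math

MIN_INSTANCES = 2  # minimum number of EC2 instances stated in the text


def suggest_instance_count(num_partitions, partitions_per_instance=4, max_instances=20):
    """Suggest an instance count from the number of partitions.

    partitions_per_instance and max_instances are hypothetical tuning knobs,
    not Transformer or EMR settings.
    """
    if num_partitions < 0:
        raise ValueError("num_partitions must not be negative")
    if partitions_per_instance < 1:
        raise ValueError("partitions_per_instance must be at least 1")
    # Round up so that every partition is covered by some instance.
    needed = math.ceil(num_partitions / partitions_per_instance)
    # Never go below the documented minimum or above the chosen cap.
    return max(MIN_INSTANCES, min(needed, max_instances))


small = suggest_instance_count(1)      # falls back to the minimum of 2
medium = suggest_instance_count(40)    # 40 partitions / 4 per instance = 10
large = suggest_instance_count(1000)   # limited by the cap of 20
```

The cap is included because instance count drives cost directly; a linear rule without an upper bound could request far more instances than the workload justifies.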
For a full list of provisioning properties, see Configuring a Pipeline.
For best practices for configuring a cluster, see the Amazon EMR documentation.