Index Terms - StreamSets Platform - Transformer Engine Guide

A
- ADLS Gen1 destination
  - configuring[1]
  - data formats[1]
  - overview[1]
  - prerequisites[1]
  - retrieve authentication information[1]
  - write mode[1]
- ADLS Gen1 origin
  - configuring[1]
  - data formats[1]
  - overview[1]
  - partitions[1]
  - prerequisites[1]
  - retrieve authentication information[1]
  - schema requirement[1]
- ADLS Gen2 destination
  - configuring[1]
  - data formats[1]
  - overview[1]
  - prerequisites[1]
  - retrieve configuration details[1]
  - write mode[1]
- ADLS Gen2 origin
  - configuring[1]
  - data formats[1]
  - overview[1]
  - partitions[1]
  - prerequisites[1]
  - retrieve configuration details[1]
  - schema requirement[1]
- ADLS stages
  - local pipeline prerequisites[1]
- Aggregate processor
  - aggregate functions[1]
  - configuring[1]
  - default output fields[1]
  - example[1]
  - overview[1]
  - shuffling of data[1]
- Amazon EMR, see EMR[1]
- Amazon Redshift destination
  - AWS credentials and write requirements[1]
  - configuring[1]
  - installing the JDBC driver[1]
  - partitions[1]
  - server-side encryption[1]
  - write mode[1]
- Amazon S3 destination
  - authentication method[1]
  - AWS credentials[1]
  - data formats[1]
  - overview[1]
  - server-side encryption[1]
  - write mode[1]
- Amazon S3 origin
  - authentication method[1]
  - AWS credentials[1]
  - data formats[1]
  - overview[1]
  - partitions[1]
- Amazon S3 stages
  - local pipeline prerequisites[1]
- Append Data write mode
  - Delta Lake destination[1]
- authentication method
  - Amazon S3[1][2]
- AWS credentials
  - Amazon S3[1][2]
- AWS Secrets Manager
  - credential store[1]
  - properties file[1]
  - stage library[1]
- AWS Secrets Manager access
  - overview[1]
- Azure Event Hubs destination
  - configuring[1]
  - data formats[1]
  - overview[1]
  - prerequisites[1]
- Azure Event Hubs origin
  - configuring[1]
  - default and specific offsets[1]
  - overview[1]
  - prerequisites[1]
- Azure Key Vault
  - credential store[1]
  - credential store, prerequisites[1]
  - properties file[1]
  - stage library[1][2]
- Azure Key Vault access
  - overview[1]
  - prerequisites[1]
- Azure SQL destination
  - partitions[1]
  - write mode[1]
B
- Base64 functions
  - description[1]
- basic syntax
  - for expressions[1]
- batch pipelines
  - case study[1]
  - description[1]
- bootstrap actions
  - EMR provisioned clusters[1]
- bulk edit mode
  - description[1]
C
- caching
  - for origins and processors[1]
  - ludicrous mode[1]
- case study
  - batch pipelines[1]
  - streaming pipelines[1]
- CDC writes
  - Delta Lake destination[1]
- client deployment mode
  - Hadoop YARN cluster[1]
- cluster
  - callback URL[1]
  - Dataproc[1]
  - EMR[1]
  - Hadoop YARN[1]
  - running pipelines[1]
  - SQL Server 2019 BDC[1]
- cluster configuration
  - Databricks instance pool[1]
  - Databricks pipelines[1]
- cluster deployment mode
  - Hadoop YARN cluster[1]
- command line interface
  - jks-credentialstore command[1]
  - stagelib-cli command[1]
- conditions
  - Delta Lake destination[1]
  - Filter processor[1]
  - Join processor[1]
  - Stream Selector processor[1]
  - Window processor[1]
- configuring
  - Snowflake origin[1]
- constants
  - in the StreamSets expression language[1]
- credential stores
  - AWS Secrets Manager[1]
  - Azure Key Vault[1]
  - CyberArk[1]
  - enabling[1]
  - functions to access[1]
  - Java keystore[1]
  - overview[1]
- cross join
  - Join processor[1]
- custom schemas
  - application to JSON and delimited data[1]
  - DDL schema format[1][2]
  - error handling[1]
  - JSON schema format[1][2]
  - origins[1]
- CyberArk
  - credential store[1]
  - properties file[1]
- CyberArk access
  - overview[1]
D
- Databricks
  - init scripts for provisioned clusters[1]
  - provisioned cluster configuration[1]
  - provisioned cluster with instance pool[1]
  - uninstalling old Transformer libraries[1]
- Databricks init scripts
  - access keys for ABFSS[1]
- Databricks pipelines
  - existing cluster[1]
  - job details[1]
  - provisioned cluster[1][2]
- data formats
  - ADLS Gen1 destination[1]
  - ADLS Gen1 origin[1]
  - ADLS Gen2 destination[1]
  - ADLS Gen2 origin[1]
  - Amazon S3 destination[1]
  - Amazon S3 origin[1]
  - Azure Event Hubs destination[1]
  - File destination[1]
  - File origin[1]
  - Whole Directory origin[1]
- Dataproc
  - cluster[1]
  - credentials[1]
  - credentials in a file[1]
  - credentials in a property[1]
  - default credentials[1]
- Dataproc pipelines
  - existing cluster[1]
- data types
  - in preview[1]
  - Transformer[1]
- datetime variables
  - in the StreamSets expression language[1]
- Deduplicate processor
  - configuring[1]
  - overview[1]
- default output fields
  - Aggregate processor[1]
- default stream
  - Stream Selector[1]
- Delete from Table write mode
  - Delta Lake destination[1]
- delivery guarantee
  - pipelines[1]
- Delta Lake destination
  - ADLS Gen1 prerequisites[1]
  - ADLS Gen2 prerequisites[1]
  - Amazon S3 credential mode[1]
  - Append Data write mode[1]
  - CDC example[1]
  - configuring[1]
  - creating a managed table[1]
  - creating a table[1]
  - creating a table or managed table[1]
  - Delete from Table write mode[1]
  - overview[1]
  - overwrite condition[1]
  - Overwrite Data write mode[1]
  - partitions[1]
  - retrieve ADLS Gen1 authentication information[1]
  - retrieve ADLS Gen2 authentication information[1]
  - Update Table write mode[1]
  - Upsert Using Merge write mode[1]
  - write mode[1]
  - writing to a local file system[1]
- Delta Lake Lookup processor
  - ADLS Gen2 prerequisites[1]
  - Amazon S3 credential mode[1]
  - configuring[1]
  - overview[1]
  - retrieve ADLS Gen1 authentication information[1]
  - retrieve ADLS Gen2 authentication information[1]
  - storage systems[1][2]
  - using from a local file system[1]
- Delta Lake origin
  - ADLS Gen1 prerequisites[1][2]
  - ADLS Gen2 prerequisites[1]
  - Amazon S3 credential mode[1]
  - overview[1][2]
  - reading from a local file system[1]
  - retrieve ADLS Gen1 authentication information[1]
  - retrieve ADLS Gen2 authentication information[1]
  - storage systems[1]
- deployment mode
  - Hadoop YARN cluster[1]
- destinations
  - ADLS Gen1[1]
  - ADLS Gen2[1]
  - Amazon S3[1]
  - Azure Event Hubs[1]
  - Delta Lake[1]
  - Elasticsearch[1]
  - File[1]
  - JDBC[1]
  - Snowflake[1]
- directories
  - internal[1]
  - protected[1]
  - Transformer[1]
- directory path
  - File destination[1]
  - File origin[1]
- drivers
  - JDBC destination[1]
  - JDBC Lookup processor[1]
  - JDBC origin[1]
  - JDBC Table origin[1]
  - MySQL JDBC Table origin[1]
  - Oracle JDBC Table origin[1]
E
- Elasticsearch destination
  - configuring[1]
  - overview[1]
  - overwrite partition prerequisite[1]
  - partitions[1]
  - write mode[1]
- EMR
  - authentication method[1]
  - base URI and staging directory[1]
  - bootstrap actions for provisioned clusters[1]
  - cluster[1]
  - Kerberos stage limitation[1]
  - provisioned cluster[1]
  - server-side encryption[1]
  - SSE Key Management Service (KMS) requirement[1]
  - Transformer installation location[1]
- EMR jobs
  - force stop[1]
- encryption zones
  - using KMS to access HDFS encryption zones[1]
- execution engines
  - Transformer[1]
- execution mode
  - pipelines[1]
- executors
  - Spark[1]
- expression language
  - constants[1]
  - datetime variables[1]
  - functions[1]
  - literals[1]
  - operator precedence[1]
  - operators[1]
  - reserved words[1]
- expressions
  - in pipeline and stage properties[1]
F
- Field Flattener processor
  - configuring[1]
- Field Order processor
  - configuring[1]
  - overview[1]
- Field Remover processor
  - configuring[1]
  - overview[1]
- Field Renamer processor
  - configuring[1]
  - overview[1]
  - rename methods[1]
- fields
  - referencing[1]
- file descriptors
  - increasing[1]
- File destination
  - configuring[1]
  - data formats[1]
  - directory path[1]
  - overview[1]
  - write mode[1]
- file functions
  - description[1]
- File origin
  - configuring[1]
  - custom schema[1]
  - data formats[1]
  - directory path[1]
  - overview[1]
  - partitions[1]
  - schema requirement[1]
- Filter processor
  - configuring[1]
  - filter condition[1]
  - overview[1]
- force stop
  - EMR jobs[1]
- full outer join
  - Join processor[1]
- full read
  - Snowflake origin[1]
- functions
  - Base64 functions[1]
  - credential[1]
  - file functions[1]
  - in the StreamSets expression language[1]
  - job functions[1]
  - math functions[1]
  - miscellaneous functions[1]
  - pipeline functions[1]
  - string functions[1]
  - time functions[1]
G
- garbage collection
  - Java[1]
- Google Big Query destination
  - merge properties[1]
  - prerequisite[1]
  - write mode[1]
- Google Big Query origin
  - incremental and full query mode[1]
  - offset column and supported types[1]
  - supported data types[1]
H
- Hadoop impersonation mode
  - configuring KMS for encryption zones[1]
  - lowercasing user names[1]
  - overview[1]
- Hadoop YARN
  - cluster[1]
  - deployment mode[1]
  - directory requirements[1]
  - driver requirement[1]
  - impersonation[1]
  - Kerberos authentication[1]
- heap size
  - configuring[1]
- Hive destination
  - additional Hive configuration properties[1]
  - configuring[1]
  - data drift column order[1]
- Hive origin
  - reading Delta Lake managed tables[1]
- HTTPS protocol
  - enabling[1]
I
- impersonation mode
  - Hadoop[1]
- incremental read
  - Snowflake origin[1]
- init scripts
  - Databricks provisioned clusters[1]
- inner join
  - Join processor[1]
- inputs variable
  - PySpark processor[1]
  - Scala processor[1][2]
- installation
  - overview[1]
  - requirements[1]
  - Scala, Spark, and Java JDK requirements[1]
  - Spark shuffle service requirement[1]
- installation package
  - choosing Scala version[1]
- installation requirements
  - system[1]
J
- Java
  - garbage collection[1]
- Java configuration options
  - heap size[1]
- Java keystore
  - credential store[1]
  - properties file[1]
- JDBC destination
  - configuring[1]
  - driver installation[1]
  - overview[1]
  - partitions[1]
  - tested versions and drivers[1]
  - write mode[1]
- JDBC Lookup processor
  - configuring[1]
  - driver installation[1]
  - overview[1]
  - tested versions and drivers[1]
- JDBC Query origin
  - configuring[1]
  - driver installation[1]
  - overview[1]
  - tested versions and drivers[1]
- JDBC Table origin
  - configuring[1]
  - driver installation[1]
  - offset column[1]
  - overview[1]
  - partitions[1]
  - supported offset data types[1]
  - tested versions and drivers[1]
- job functions
  - description[1]
- Join processor
  - condition[1]
  - configuring[1]
  - criteria[1]
  - cross join[1]
  - full outer join[1]
  - inner join[1]
  - join types[1]
  - left anti join[1]
  - left outer join[1]
  - left semi join[1]
  - matching fields[1]
  - overview[1]
  - right anti join[1]
  - right outer join[1]
  - shuffling of data[1]
- join types
  - Join processor[1]
- JSON Parser processor
  - configuring[1]
  - custom schema[1]
  - error handling[1]
  - overview[1]
  - schema inference[1]
K
- Kafka destination
  - Kerberos authentication[1]
  - security[1]
  - SSL/TLS encryption[1]
- Kafka origin
  - custom schemas[1]
  - Kerberos authentication[1]
  - overview[1]
  - security[1]
  - SSL/TLS encryption[1]
- Kafka stages
  - enabling SASL[1]
  - enabling SASL on SSL/TLS[1]
  - enabling security[1]
  - enabling SSL/TLS security[1]
  - providing Kerberos credentials[1]
  - security prerequisite tasks[1]
- Kerberos
  - credentials for Kafka stages[1]
  - enabling[1]
- Kerberos authentication
  - Hadoop YARN cluster[1]
  - Kafka destination[1]
  - Kafka origin[1]
- Kerberos keytab
  - configuring in pipelines[1]
- Kudu origin
  - configuring[1]
  - overview[1]
L
- left anti join
  - Join processor[1]
- left outer join
  - Join processor[1]
- left semi join
  - Join processor[1]
- literals
  - in the StreamSets expression language[1]
- log files
  - viewing and downloading[1][2]
- logs
  - pipelines[1]
  - Spark driver[1]
  - Transformer[1]
- lookups
  - streaming example[1]
- ludicrous mode
  - caching[1]
  - optimizing pipeline performance[1]
  - pipeline statistics[1]
M
- MapR cluster
  - dynamic allocation requirement[1]
- MapR clusters
  - Hadoop impersonation prerequisite[1]
  - pipeline start prerequisite[1]
- master instance
  - retrieving details[1]
- math functions
  - description[1]
- miscellaneous functions
  - description[1]
- monitoring
  - Spark web UI[1]
- MySQL JDBC Table origin
  - custom offset queries[1]
  - default offset queries[1]
  - driver installation[1]
  - MySQL data types[1]
  - null offset value handling[1]
  - supported offset data types[1]
O
- offset column
  - Google Big Query origin[1]
  - JDBC Table[1]
- offsets
  - overview[1]
  - resetting for the pipeline[1]
  - skipping tracking[1]
- open file limit
  - configuring[1]
- operators
  - in the StreamSets expression language[1]
  - precedence[1]
- Oracle JDBC Table origin
  - custom offset queries[1]
  - default offset queries[1]
  - driver installation[1]
  - null offset value handling[1]
  - Oracle data types[1]
  - supported offset data types[1]
- origins
  - ADLS Gen1[1]
  - ADLS Gen2[1]
  - Amazon S3[1]
  - Azure Event Hubs[1]
  - caching[1]
  - Delta Lake[1]
  - Delta Lake origin[1]
  - File[1]
  - JDBC Query[1]
  - JDBC Table[1]
  - Kafka[1]
  - Kudu[1]
  - Kudu origin[1]
  - multiple[1]
  - overview[1]
  - Snowflake[1]
  - Whole Directory[1]
- output variable
  - PySpark processor[1]
  - Scala processor[1][2]
- Overwrite Data write mode
  - Delta Lake destination[1]
P
- partitioning
  - overview[1]
- partitions
  - ADLS Gen1 origin[1]
  - ADLS Gen2 origin[1]
  - Amazon Redshift destination[1]
  - Amazon S3 origin[1]
  - Azure SQL destination[1]
  - based on origins[1]
  - changing[1]
  - Delta Lake destination[1]
  - Elasticsearch destination[1]
  - File origin[1]
  - initial[1]
  - initial number[1]
  - JDBC destination[1]
  - JDBC Table origin[1]
  - Rank processor[1]
- pipeline functions
  - description[1]
- pipeline offsets, see offsets[1]
- pipeline properties
  - using expressions[1]
- pipelines
  - delivery guarantee[1]
  - logs[1]
  - Spark configuration[1]
  - Spark executors[1]
  - stage library match requirement[1]
- ports
  - default[1]
- PostgreSQL JDBC Table origin
  - custom offset queries[1]
  - default offset queries[1]
  - null offset value handling[1]
  - PostgreSQL JDBC driver[1]
  - supported data types[1]
  - supported offset data types[1]
- post-upgrade tasks
  - access Databricks job details[1]
  - update ADLS stages in HDInsight pipelines[1]
  - update keystore and truststore location[1]
- preprocessing script
  - pipeline[1]
  - prerequisites[1]
  - requirements[1]
  - Spark-Scala prerequisites[1]
- prerequisites
  - ADLS and Amazon S3 stages[1]
  - Azure Event Hubs destination[1]
  - Azure Event Hubs origin[1]
  - for the Scala processor and preprocessing script[1]
  - PySpark processor[1]
  - stage-related[1]
- processing mode
  - ludicrous mode versus standard[1]
- processors
  - Aggregate[1]
  - caching[1]
  - Deduplicate[1]
  - Delta Lake Lookup[1]
  - Field Order[1]
  - Field Remover[1]
  - Field Renamer[1]
  - Filter[1]
  - JDBC Lookup[1]
  - Join[1]
  - JSON Parser[1]
  - Profile[1]
  - PySpark[1]
  - Rank[1]
  - referencing fields[1]
  - Repartition[1]
  - Scala[1]
  - shuffling of data[1]
  - Snowflake Lookup[1]
  - Sort[1]
  - Spark SQL Expression[1]
  - Spark SQL Query[1]
  - Stream Selector[1]
  - Type Converter[1]
  - Union[1]
  - Window[1]
- Profile processor
  - configuring[1]
  - output records[1]
  - overview[1]
  - statistics[1]
- proxy server
  - Transformer[1]
- proxy users
  - Transformer[1]
- PySpark processor
  - configuring[1]
  - custom code[1]
  - Databricks prerequisites[1]
  - EMR prerequisites[1]
  - examples[1]
  - input and output variables[1]
  - other cluster and local pipeline prerequisites[1]
  - overview[1]
  - prerequisites[1][2]
  - referencing fields[1]
- PySpark processor requirements for provisioned Databricks clusters[1]
Q
- query mode
  - Google Big Query origin[1]
R
- Rank processor
  - configuring[1]
  - example[1]
  - order by[1]
  - overview[1]
  - partition by[1]
  - rank functions[1]
  - shuffling of data[1]
- read mode
  - Snowflake origin[1]
- release notes 4.0.x[1]
- release notes 4.1.x[1]
- remote debugging
  - Transformer[1]
- repartitioning
  - methods[1]
  - overview[1]
- Repartition processor
  - coalesce by number repartition method[1]
  - configuring[1]
  - methods[1]
  - overview[1]
  - repartition by field range repartition method[1]
  - repartition by number repartition method[1]
  - shuffling of data[1]
  - use cases[1]
- reserved words
  - in the StreamSets expression language[1]
- right anti join
  - Join processor[1]
- right outer join
  - Join processor[1]
- runtime parameters
  - calling from scripting processors[1]
- runtime properties
  - calling from a pipeline[1]
  - defining[1]
  - overview[1]
- runtime resources
  - calling from a pipeline[1]
  - defining[1]
- runtime values
  - overview[1]
S
- Scala
  - choosing a Transformer engine version[1]
- Scala, Spark, and Java JDK requirements
  - installation[1]
- Scala processor
  - configuring[1]
  - custom code[1]
  - examples[1]
  - input and output variables[1]
  - inputs variable[1]
  - output variable[1]
  - overview[1]
  - prerequisites[1]
  - requirements[1]
  - Spark-Scala prerequisite[1]
  - Spark SQL queries[1]
- scripting processors
  - calling runtime values[1]
- scripts
  - preprocessing[1]
- security
  - Kafka destination[1]
  - Kafka origin[1]
- server-side encryption
  - Amazon Redshift destination[1]
  - Amazon S3 destination[1]
  - EMR clusters[1]
- shuffling
  - overview[1]
- simple edit mode
  - description[1]
- Slowly Changing Dimension processor
  - configuring[1]
  - pipeline[1]
  - pipeline processing[1]
- Snowflake destination
  - configuring[1]
  - merge properties[1]
  - overview[1]
  - required privileges[1]
  - role[1]
  - write mode[1]
- Snowflake Lookup processor
  - configuring[1]
  - overview[1]
  - pushdown optimization[1]
  - required privileges[1]
  - role[1]
- Snowflake origin
  - configuring[1]
  - full query guidelines[1]
  - incremental or full read[1]
  - incremental query guidelines[1]
  - overview[1]
  - pushdown optimization[1]
  - read mode[1]
  - required privileges[1]
  - role[1]
  - SQL query guidelines[1]
- sorting
  - multiple fields[1]
- Sort processor
  - configuring[1]
  - multiple fields[1]
  - overview[1]
- Spark cluster
  - callback URL[1]
  - Transformer URL[1]
- Spark configuration
  - pipelines[1]
- Spark executors
  - maximum[1]
- Spark processing
  - description[1]
- Spark SQL Expression processor
  - overview[1]
- Spark SQL processor
  - configuring[1]
- Spark SQL query
  - syntax[1]
- Spark SQL Query processor
  - configuring[1]
  - examples[1]
  - overview[1]
  - query syntax[1]
  - referencing fields[1]
- Spark web UI
  - monitoring[1]
- SQL query
  - guidelines for the Snowflake origin[1]
- SQL Server 2019 BDC
  - cluster[1]
  - JDBC connection information[1]
  - master instance details for JDBC[1]
  - retrieving information[1]
- SQL Server JDBC Table origin
  - configuring[1]
  - custom offset queries[1]
  - default offset queries[1]
  - null offset value handling[1]
  - SQL Server JDBC driver[1]
  - supported data types[1]
  - supported offset data types[1]
- SSL/TLS encryption
  - Kafka destination[1]
  - Kafka origin[1]
- stage libraries
  - AWS Secrets Manager Credentials Store[1]
  - Azure Key Vault Credentials Store[1][2]
- stage library match requirement
  - in a pipeline[1]
- stage properties
  - using expressions[1]
- staging directory
  - EMR pipelines[1]
- statistics
  - Profile processor[1]
- streaming pipelines
  - case study[1]
  - description[1]
- Stream Selector processor
  - conditions[1]
  - configuring[1]
  - default stream[1]
  - overview[1]
- string functions
  - description[1]
T
- time functions
  - description[1]
- Transformer
  - architecture[1]
  - description[1]
  - directories[1]
  - execution engine[1]
  - Java configuration options[1]
  - proxy server[1]
  - proxy users[1]
  - release notes[1]
  - remote debugging[1]
  - spark-submit[1]
  - starting manually[1]
  - viewing and downloading log data[1][2]
- Transformer libraries
  - removing from Databricks[1]
- troubleshooting
  - origin errors[1]
  - pipeline errors[1]
- Type Converter processor
  - configuring[1]
  - field type conversion[1]
  - overview[1]
U
- ulimit
  - configuring[1]
- Union processor
  - overview[1]
- Update Table write mode
  - Delta Lake destination[1]
- Upsert Using Merge write mode
  - Delta Lake destination[1]
- URL
  - cluster callback[1]
W
- Whole Directory origin
  - data formats[1]
  - overview[1]
- Window processor
  - conditions[1]
  - configuring[1]
  - overview[1]
  - window types[1]
- window types
  - Window processor[1]
- write mode
  - Azure SQL destination[1]
  - Delta Lake destination[1]
  - Google Big Query destination[1]
  - JDBC destination[1]
  - Snowflake destination[1]