Troubleshooting

Accessing Error Messages

Informational and error messages display in different locations based on the type of information:

Pipeline configuration issues
The pipeline canvas provides guidance and error details as follows:
  • Issues found by implicit validation display in the Issues list.
  • An error icon displays at the stage where the problem occurs or on the canvas for pipeline configuration issues.
  • Issues discovered by explicit validation display in a warning message on the canvas.
Runtime error information
You can view error information when you monitor a running pipeline. In the canvas, the pipeline displays error record counts for each stage generating error records.
On the Errors tab, you can view error record statistics and the latest set of error records with error messages. If the error was produced by an exception, you can click View Stack Trace to view the full stack trace.
Note: This information becomes unavailable when you stop the pipeline. To preserve information about error records, use the Error Records pipeline property to save error records.
Error record information
You can use the Error Records pipeline properties to write error records and related details to another system for review. The information in the following record header attributes can help you determine the problem that occurred. For more information, see Internal Attributes.
For more information about error records and error record handling, see Error Record Handling.
Data Collector errors
You can view information and errors related to the general Data Collector functionality in the Data Collector log. You can view or download the logs from the Data Collector UI. For details, see Viewing Data Collector Logs.
By default, Data Collector logs messages at the INFO severity level. You can modify the log level to display messages at another severity level. For details, see Modifying the Log Level.

Pipeline Basics

Use the following tips for help with pipeline basics:

When I go to the Data Collector UI, I get a "Webpage not available" error message.
The Data Collector is not running. Start the Data Collector.
Why isn't the Start icon enabled?
You can start a pipeline when it is valid. Use the Issues icon to review the list of issues in your pipeline. When you resolve the issues, the Start icon becomes enabled.
Why doesn't the Select Fields with Preview Data option work? No preview data displays.
Select Fields with Preview Data works when the pipeline is valid for data preview and when Data Collector is configured to run preview in the background. Make sure all stages are connected and required properties are configured. Also verify that preview is running in the background by clicking Help > Settings.
Sometimes I get a list of available fields and sometimes I don't. What's up with that?
The pipeline can display a list of available fields when the pipeline is valid for data preview and when Data Collector is configured to run preview in the background. Make sure all stages are connected and required properties are configured. Also verify that preview is running in the background by clicking Help > Settings.
The data reaching the destination is not what I expect - what do I do?
If the pipeline is still running, take a couple of snapshots of the data being processed. Then stop the pipeline, enter data preview, and use a snapshot as the source data. In data preview, you can step through the pipeline and see how each stage alters the data.
If you already stopped the pipeline, perform data preview using the origin data. You can step through the pipeline to review how each stage processes the data and update the pipeline configuration as necessary.
You can also edit the data to test for cases that do not occur in the preview data set.

Data Preview

Use the following tips for help with data preview:
Why isn't the Preview icon enabled?
You can preview data after you connect all stages in the pipeline and configure required properties. You can use any valid value as a placeholder for required properties.
Why doesn't the data preview show any data?
If data preview doesn't show any data, one of the following issues might have occurred:
  • The origin might not be configured correctly.

    In the Preview panel, check the Configuration tab for the origin for related issues. For some origins, you can use Raw Preview to see if the configuration information is correct.

  • The origin might not have any data at the moment.

    Some origins, such as Directory and File Tail, can display processed data for data preview. However, most origins require incoming data to enable data preview.

Why am I only getting 10 records to preview when I'm asking for more?
The Data Collector maximum preview batch size overrides the data preview batch size. The Data Collector default is 10 records.
When you request data preview, you can request up to the Data Collector preview batch size, or you can increase the preview.maxBatchSize property in the Data Collector configuration file.
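As a rough illustration of this rule, the number of records that preview returns can be modeled as the smaller of the requested batch size and the Data Collector maximum. The function name below is hypothetical, and the sketch only models the behavior described here, not Data Collector's own code.

```python
def effective_preview_batch_size(requested, max_batch_size=10):
    """Return the number of records that data preview can return.

    max_batch_size models the preview.maxBatchSize property (default 10).
    """
    if requested < 1 or max_batch_size < 1:
        raise ValueError("batch sizes must be positive")
    # The Data Collector maximum overrides a larger requested size.
    return min(requested, max_batch_size)


print(effective_preview_batch_size(50))        # capped by the default maximum
print(effective_preview_batch_size(50, 100))   # after raising preview.maxBatchSize
```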
In data preview, I edited stage configuration and clicked Run with Changes, but I don't see any change in the data.
This might happen if the configuration change is in the origin. Run with Changes uses the existing preview data. To see how changes to origin configuration affect preview data, use Refresh Preview.

General Validation Errors

Use the following tips for help with general pipeline validation errors:
The pipeline has the following set of validation errors for a stage:
CONTAINER_0901 - Could not find stage definition for <stage library name>:<stage name>.
CREATION_006 - Stage definition not found. Library <stage library name>. Stage <stage name>. 
Version <version>
VALIDATION_0006 - Stage definition does not exist, library <stage library name>, 
name <stage name>, version <version>
The pipeline uses a stage that is not installed on the Data Collector. This might happen if you imported a pipeline from a different version of the Data Collector and the current Data Collector is not enabled to use the stage.
If the Data Collector uses a different version of the stage, you might delete the invalid version and replace it with a local valid version. For example, if the pipeline uses an older version of the Hadoop FS destination, you might replace it with a version used by this Data Collector.
If you need to use a stage that is not installed on the Data Collector, install the related stage library. For information about installing additional drivers, see Install External Libraries.

Origins

Use the following tips for help with origin stages and systems.

Directory

Why isn't the Directory origin reading all of my files?
The Directory origin reads a set of files based on the configured file name pattern, read order, and first file to process. If new files arrive after the Directory origin has passed their position in the read order, the Directory origin does not read the files unless you reset the origin.
When using the last-modified timestamp read order, arriving files should have timestamps that are later than the files in the directory.
Similarly, when using the lexicographically ascending file name read order, make sure the naming convention for the files is lexicographically ascending. For example, filename-1.log, filename-2.log, etc., works fine until filename-10.log. If filename-10.log arrives after the Directory origin completes reading filename-2.log, then the Directory origin does not read filename-10.log since it is lexicographically earlier than filename-2.log.
For more information, see Read Order.
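The ordering problem can be reproduced with a plain string sort. The sketch below uses the file names from the example. Zero-padding the number is one common naming convention that keeps string order and numeric order aligned; the zero_pad helper and the width of 4 are assumptions for illustration, not a Data Collector feature.

```python
import re

names = ["filename-1.log", "filename-2.log", "filename-10.log"]

# Plain string comparison, as in a lexicographically ascending read order.
lexicographic = sorted(names)


def zero_pad(name, width=4):
    """Left-pad the first run of digits so string order matches numeric order."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), name, count=1)


padded = sorted(zero_pad(n) for n in names)

print(lexicographic)  # filename-10.log sorts before filename-2.log
print(padded)         # padded names sort in numeric order
```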

Elasticsearch

A pipeline with an Elasticsearch origin fails to start with an SSL/TLS error, such as the following:
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
This message can display due to many different SSL/TLS issues. Review the details of related messages to determine the corrective measures to take.
Here are some examples of when a version of the message might display:
  • If you configure the stage to use SSL/TLS, but do not specify an HTTPS-enabled port.

    To resolve the issue, specify an HTTPS-enabled port.

  • If you configure the stage to use SSL/TLS and specify an HTTPS-enabled port, but configure the URL incorrectly, such as http://<host>:<port>.

    To resolve this issue, update the URL to use HTTPS, as follows: https://<host>:<port>.

  • If you configure the stage to use SSL/TLS, but do not have a certificate in the specified truststore.

    To resolve the issue, place a valid certificate in the truststore.
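The URL mismatch in the second case can be caught before starting the pipeline. The following is a minimal sketch with a hypothetical helper name. It checks only the URL scheme; it cannot tell whether the port is HTTPS-enabled or whether the truststore contains a valid certificate.

```python
from urllib.parse import urlparse


def check_tls_url(url, use_tls):
    """Return a list of problems found in the URL scheme (hypothetical helper)."""
    problems = []
    scheme = urlparse(url).scheme
    if use_tls and scheme != "https":
        problems.append("SSL/TLS is enabled but the URL scheme is '%s', not 'https'" % scheme)
    return problems


print(check_tls_url("http://es.example.com:9200", use_tls=True))
print(check_tls_url("https://es.example.com:9200", use_tls=True))
```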

Hadoop FS

In the pipeline, the Hadoop FS origin has an error icon with the following message:
VALIDATION_0071 - Stage '<stage id>' does not support 'Standalone' execution mode
You're using the Hadoop FS origin in a pipeline configured for standalone execution mode. Use the Hadoop FS origin in cluster mode pipelines.
Workaround: In the pipeline properties, set the Execution Mode to Cluster. Or if you want to run the pipeline in standalone mode, use the Directory or File Tail origins to process file data.

JDBC Origins

My MySQL JDBC Driver 5.0 fails to validate the query in my JDBC Query Consumer origin.
This can occur when you use a LIMIT clause in your query.
Workaround: Upgrade to version 5.1.
I'm using a JDBC origin to read MySQL data. Why are datetime values set to zero being treated like error records?
MySQL treats invalid dates as an exception, so both the JDBC Query Consumer and the JDBC Multitable Consumer create error records for invalid dates.
You can override this behavior by setting a JDBC configuration property in the origin. Add the zeroDateTimeBehavior property and set the value to "convertToNull".
For more information about this and other MySQL-specific JDBC configuration properties, see http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html.
A pipeline using the JDBC Query Consumer origin keeps stopping with the following error:
JDBC_77 <db error message> attempting to execute query '<query>'. Giving up 
after <error count> errors as per stage configuration. First error: <first db error>.
This occurs when the origin cannot successfully execute a query. To handle transient connection or network errors, try increasing the value for the Number of Retries Upon Query Error property on the JDBC tab of the origin.
My pipeline using a JDBC origin generates an out-of-memory error when reading a large table.
When the Auto Commit property is enabled in a JDBC origin, some drivers ignore the fetch-size restriction, configured by the Max Batch Size property in the origin. This can lead to an out-of-memory error when reading a large table that cannot entirely fit in memory.
To resolve, disable the Auto Commit property on the Advanced tab of the origin.

Kafka Consumer

Why isn't my pipeline reading existing data from my Kafka topic?
The Kafka Consumer determines the first message to read based on the value of the Auto Offset Reset property. With the default value, Earliest, the origin reads messages starting with the first message in the topic.

If you already started the pipeline or ran a preview with a different setting, the offset has already been committed. To read the oldest unread data in a topic, set Auto Offset Reset to Earliest and then temporarily change the consumer group name to a different value. Run data preview. Then, change the consumer group back to the correct value and start the pipeline.

How can I reset the offset for a Kafka Consumer?
Since the offset for a Kafka Consumer is stored with the ZooKeeper for the Kafka cluster, you cannot reset the offset through the Data Collector. For information about resetting an offset through Kafka, see the Apache Kafka documentation.
The Kafka Consumer with Kerberos enabled cannot connect to an HDP 2.3 distribution of Kafka.

When enabling Kerberos, by default, HDP 2.3 sets the security.inter.broker.protocol Kafka broker configuration property to PLAINTEXTSASL, which is not supported.

To correct the issue, set security.inter.broker.protocol to PLAINTEXT.

Oracle CDC Client

Data preview continually times out for my Oracle CDC Client pipeline.
Pipelines that use the Oracle CDC Client can take longer than expected to initiate for data preview. If preview times out, try increasing the Preview Timeout property incrementally.

For more information about using preview with this origin, see Data Preview with Oracle CDC Client.

My Oracle CDC Client pipeline has paused processing during a daylight saving time change.
If the origin is configured to use a database time zone that uses daylight saving time, then the pipeline pauses processing during the time change window to ensure that all data is correctly processed. After the time change completes, the pipeline resumes processing at the last-saved offset.
For more information, see Database Time Zone.

PostgreSQL CDC Client

A PostgreSQL CDC Client pipeline generates the following error:
com.streamsets.pipeline.api.StageException: JDBC_606 - Wal Sender is not active
This can occur when the Status Interval property configured for the origin is larger than the wal_sender_timeout property in the PostgreSQL postgresql.conf configuration file.
The Status Interval property should be less than the wal_sender_timeout property. Ideally, it should be set to half of the value of the wal_sender_timeout property.
For example, you can use the default status interval of 30 seconds with the default wal_sender_timeout value of 60000 milliseconds, or 1 minute.
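The unit conversion is easy to get wrong because the status interval is expressed in seconds and wal_sender_timeout in milliseconds. The sketch below, with a hypothetical function name, checks the relationship described above and computes the suggested half-of-timeout value.

```python
def check_status_interval(status_interval_seconds, wal_sender_timeout_ms):
    """Compare the origin's Status Interval with PostgreSQL's wal_sender_timeout."""
    timeout_seconds = wal_sender_timeout_ms / 1000.0
    return {
        # The status interval should be less than the timeout.
        "valid": status_interval_seconds < timeout_seconds,
        # Ideally the interval is half of the timeout.
        "recommended_seconds": timeout_seconds / 2,
    }


print(check_status_interval(30, 60000))   # the defaults described above
print(check_status_interval(90, 60000))   # interval larger than the timeout
```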

Salesforce

A pipeline generates a buffering capacity error
When pipelines with a Salesforce origin fail due to a buffering capacity error, such as Buffering capacity 1048576 exceeded, increase the buffer size by editing the Streaming Buffer Size property on the Subscribe tab.

Scripting Origins

A pipeline fails to stop when users click the Stop icon
Scripts must include code that stops the script when users stop the pipeline. In the script, use the sdc.isStopped method to check whether the pipeline has been stopped.
A Jython script does not proceed beyond import lock
Pipelines freeze if Jython scripts do not release the import lock upon a failure or error. When a script does not release an import lock, you must restart Data Collector to release the lock. To avoid the problem, use a try statement with a finally block in the Jython script. For more information, see Thread Safety in Jython Scripts.
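The two patterns above can be sketched as follows. Only sdc.isStopped comes from the text. StubSdc is a hypothetical stand-in for the sdc object so that the example runs outside Data Collector, and the cleanup callback stands in for whatever the script must release. The actual calls for releasing the Jython import lock are described in Thread Safety in Jython Scripts and are not reproduced here.

```python
class StubSdc(object):
    """Hypothetical stand-in for the sdc object available to scripts."""

    def __init__(self, stop_after):
        self._checks = 0
        self._stop_after = stop_after

    def isStopped(self):
        # Report "stopped" once the loop has checked more than stop_after times.
        self._checks += 1
        return self._checks > self._stop_after


def run_origin(sdc, cleanup):
    produced = []
    try:
        # Check the stop flag on every iteration so that Stop takes effect.
        while not sdc.isStopped():
            produced.append(len(produced))
    finally:
        # Runs on normal exit and on errors, so resources are always released.
        cleanup()
    return produced


released = []
records = run_origin(StubSdc(stop_after=3), lambda: released.append(True))
print(records, released)
```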

SQL Server CDC Client

A pipeline with the SQL Server CDC Client origin cannot establish a connection. The pipeline fails with the following error:
java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30004ms.
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:213)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:163)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:85)
at com.streamsets.pipeline.lib.jdbc.multithread.ConnectionManager.getNewConnection(ConnectionManager.java:45)
at com.streamsets.pipeline.lib.jdbc.multithread.ConnectionManager.getConnection(ConnectionManager.java:57)
at com.streamsets.pipeline.stage.origin.jdbc.cdc.sqlserver.SQLServerCDCSource.getCDCTables(SQLServerCDCSource.java:181)

This can occur when the origin is configured to use a certain number of threads, but the thread pool is not set high enough. On the Advanced tab, check the settings for the Maximum Pool Size and Minimum Idle Connections properties.

When using multithreaded processing, you want these properties to be set to greater than or equal to the value of the Number of Threads property on the JDBC tab.

Also, allowing late table processing requires the origin to use an additional background thread. When you enable processing late tables, you should set the Maximum Pool Size and Minimum Idle Connections properties to one thread more than the Number of Threads property.
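The sizing rule above amounts to simple arithmetic. The sketch below uses a hypothetical function name and returns the smallest value to use for both Maximum Pool Size and Minimum Idle Connections.

```python
def minimum_pool_size(number_of_threads, late_tables_enabled=False):
    """Return the smallest pool size that covers all threads the origin uses."""
    if number_of_threads < 1:
        raise ValueError("number_of_threads must be at least 1")
    # Late table processing needs one additional background thread.
    return number_of_threads + (1 if late_tables_enabled else 0)


print(minimum_pool_size(4))
print(minimum_pool_size(4, late_tables_enabled=True))
```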

After dropping and recreating a table, the origin won't seem to read the data in the table. What's the problem?
The SQL Server CDC Client origin stores the offset for every table that it processes to track its progress. If you drop a table and recreate it using the same name, the origin assumes it is the same table and uses the last-saved offset for the table.
If you need the origin to process data earlier than the last-saved offset, you might need to reset the origin.
Note that after you reset the origin, the origin drops all stored offsets. And when you restart the pipeline, the origin processes all available data in the specified tables. You cannot reset the origin for a particular table.
Previewing data does not show any values.
When you set the Maximum Transaction Length property, the origin fetches data in multiple time windows. The property determines the size of each time window. Previewing data only shows data from the first time window, but the origin might need to process multiple time windows before finding changed values to show in the preview.

To see values when previewing data, increase Maximum Transaction Length or set it to -1 to fetch data in one time window.

A no-more-data event is generated before reading all changes
When you set the Maximum Transaction Length property, the origin fetches data in multiple time windows. The property determines the size of each time window. After processing all available rows in each time window, the origin generates a no-more-data event, even when subsequent time windows remain for processing.

Processors

Use the following tip for help with processors.

Encrypt and Decrypt Fields

The following error message displays in the log after I start the pipeline:
CONTAINER_0701 - Stage 'EncryptandDecryptFields_01' initialization error: java.lang.IllegalArgumentException: Input byte array has incorrect ending byte at 44
When the processor uses a user-supplied key, the length of the Base64 encoded key that you provide must match the length of the key expected by the selected cipher suite. For example, if the processor uses a 256-bit (32 byte) cipher suite, the Base64 encoded key must be 32 bytes in length.
You can receive this message when the length of the Base64 encoded key is not the expected length.
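A key can be checked before it is pasted into the stage. The sketch below assumes that the 32-byte requirement refers to the key bytes before Base64 encoding; a 32-byte key encodes to 44 Base64 characters. The helper name is hypothetical, and the key is random data generated only for illustration.

```python
import base64
import binascii
import os


def decoded_key_length(encoded_key):
    """Return the number of raw key bytes, or None if the input is not valid Base64."""
    try:
        return len(base64.b64decode(encoded_key, validate=True))
    except (binascii.Error, ValueError):
        return None


# Generate 32 random bytes and encode them, as for a 256-bit cipher suite.
key = base64.b64encode(os.urandom(32)).decode("ascii")

print(len(key), decoded_key_length(key))   # encoded length and raw key length
print(decoded_key_length(key[:-1]))        # a truncated key is rejected
```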

Destinations

Use the following tips for help with destination stages and systems.

Azure Data Lake Storage

An Azure Data Lake Storage destination seems to be causing out of memory errors, with the following object using all available memory:
com.streamsets.pipeline.stage.destination.hdfs.writer.ActiveRecordWriters
This can occur due to a Hadoop known issue, which can affect both the Azure Data Lake Storage Gen1 and Gen2 destinations.
For a description of a workaround, see the documentation for the Gen1 or Gen2 destination.

Cassandra

Why is the pipeline failing entire batches when only a few records have a problem?
Due to Cassandra requirements, when you write to a Cassandra cluster, batches are atomic. This means that an error in one or more records causes the entire batch to fail.
Why is all of my data being sent to error? Every batch is failing.
When every batch fails, you might have a data type mismatch. Cassandra requires the data type of the data to exactly match the data type of the Cassandra column.
To determine the issue, check the error messages associated with the error records. If you see a message like the following, you have a data type mismatch. The following error message indicates that the data type mismatch is for Integer data being unsuccessfully written to a Varchar column:
CASSANDRA_06 - Could not prepare record 'sdk:':
Invalid type for value 0 of CQL type varchar, expecting class java.lang.String but class java.lang.Integer provided
To correct the problem, you might use a Field Type Converter processor to convert field data types. In this case, you would convert the integer data to string.

Elasticsearch

A pipeline with an Elasticsearch destination fails to start with an SSL/TLS error, such as the following:
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
This message can display due to many different SSL/TLS issues. Review the details of related messages to determine the corrective measures to take.
Here are some examples of when a version of the message might display:
  • If you configure the stage to use SSL/TLS, but do not specify an HTTPS-enabled port.

    To resolve the issue, specify an HTTPS-enabled port.

  • If you configure the stage to use SSL/TLS and specify an HTTPS-enabled port, but configure the URL incorrectly, such as http://<host>:<port>.

    To resolve this issue, update the URL to use HTTPS, as follows: https://<host>:<port>.

  • If you configure the stage to use SSL/TLS, but do not have a certificate in the specified truststore.

    To resolve the issue, place a valid certificate in the truststore.

Hadoop FS

I'm writing text data to HDFS. Why are my files all empty?
You might not have the pipeline or Hadoop FS destination configured correctly.
The Hadoop FS destination uses a single field to write text data to HDFS.
The pipeline should collapse all data to a single field. And the Hadoop FS destination must be configured to use that field. By default, Hadoop FS uses a field named /text.

HBase

I get the following error when validating or starting a pipeline with an HBase destination:
HBASE_06 - Cannot connect to cluster: org.apache.hadoop.hbase.MasterNotRunningException: 
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: 
Call to node00.local/<IP_address>:60000 failed on local exception: 
org.apache.hadoop.hbase.exceptions.ConnectionClosingException: 
Connection to node00.local/<IP_address>:60000 is closing. Call id=0, waitTime=58
Is your HBase master running? If so, then you might be trying to connect to a secure HBase cluster without configuring the HBase destination to use Kerberos authentication. In the HBase destination properties, select Kerberos Authentication and try again.

Kafka Producer

Can the Kafka Producer create topics?
The Kafka Producer can create a topic when all of the following are true:
  • You configure the Kafka Producer to write to a topic name that does not exist.
  • At least one of the Kafka brokers defined for the Kafka Producer has the auto.create.topics.enable property enabled.
  • The broker with the enabled property is up and available when the Kafka Producer looks for the topic.
A pipeline that writes to Kafka keeps failing and restarting in an endless cycle.
This can happen when the pipeline tries to write a message to Kafka 0.8 that is longer than the Kafka maximum message size.
Workaround: Reconfigure Kafka brokers to allow larger messages or ensure that incoming records are within the configured limit.
The Kafka Producer with Kerberos enabled cannot connect to the HDP 2.3 distribution of Kafka.

When enabling Kerberos, by default, HDP 2.3 sets the security.inter.broker.protocol Kafka broker configuration property to PLAINTEXTSASL, which is not supported.

To correct the issue, set security.inter.broker.protocol to PLAINTEXT.

MemSQL Fast Loader

A pipeline stops and returns the following error:
JDBC_14 - Error processing batch. SQLState: 42000 Error Code: 1148 Message: The used command is not allowed with this MySQL version

To use MemSQL Fast Loader with a MySQL database, you must enable local data loading in MySQL. See the MySQL topic, Security Issues with LOAD DATA LOCAL.

A pipeline passes a CDC record to the MemSQL Fast Loader destination and returns the following error:
JDBC_70 - Unsupported operation in record header: 1

The MemSQL Fast Loader destination cannot process CDC records. Use the JDBC Producer destination to process these records.

SDC RPC

A pipeline fails to start with the following validation error:
IPC_DEST_15 Could not connect to any SDC RPC destination : [<host name>: 
java.net.ConnectException: Connection refused]
You configured the pipeline to write error records to a pipeline, but the configuration information for the error records pipeline is invalid.
To write error records to a pipeline, you need a valid destination pipeline that includes an RPC origin.

Executors

Use the following tips for help with executors.

Hive Query

When I enter the name of the Impala JDBC driver in the stage, I receive an error saying that the driver is not present in the class path:
HIVE_15 - Hive JDBC Driver <driver name> not present in the class path.

To use an Impala JDBC driver with the Hive Query executor, the driver must be installed as an external library. And it must be installed for the stage library that the Hive Query executor uses.

If you already installed the driver, verify that you have it installed for the correct stage library. For more information, see Installing the Impala Driver.

JDBC Connections

Use the following tips for help with stages that use JDBC connections to connect to databases. For some stages, Data Collector includes the necessary JDBC driver to connect to the database. For other stages, you must install a JDBC driver.

The following stages require you to install a JDBC driver:
  • JDBC Multitable Consumer origin
  • JDBC Query Consumer origin
  • MySQL Binary Log origin
  • Oracle Bulkload origin
  • Oracle CDC origin
  • Oracle CDC Client origin
  • SAP HANA Query Consumer origin
  • Teradata Consumer origin
  • JDBC Lookup processor
  • JDBC Tee processor
  • SQL Parser processor, when using the database to resolve the schema
  • JDBC Producer destination
  • MemSQL Fast Loader destination
  • JDBC Query executor

No Suitable Driver

When Data Collector cannot find the JDBC driver for a stage, Data Collector might generate one of the following error messages:
JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException:
JDBC_06 - Failed to initialize connection pool: java.sql.SQLException: No suitable driver

Verify that you have followed the instructions to install additional drivers, as explained in Install External Libraries.

You can also use these additional tips to help resolve the issue:

The JDBC connection string is not correct.
The JDBC Connection String property for the stage must include the jdbc: prefix. For example, a PostgreSQL connection string might be jdbc:postgresql://<database host>/<database name>.
Check your database documentation for the required connection string format. For example, if you are using a non-standard port, you must specify it in the connection string.
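A basic format check can catch a missing prefix early. The sketch below uses a hypothetical helper that verifies only the jdbc: prefix and the presence of a subprotocol such as postgresql. It does not validate the database-specific remainder of the string, which depends on the driver.

```python
def check_connection_string(connection_string):
    """Return a list of basic format problems in a JDBC connection string."""
    if not connection_string.startswith("jdbc:"):
        return ["missing jdbc: prefix"]
    # Expected shape: jdbc:<subprotocol>:<database-specific part>
    parts = connection_string.split(":", 2)
    if len(parts) < 3 or not parts[1]:
        return ["missing subprotocol after jdbc:"]
    return []


print(check_connection_string("jdbc:postgresql://db.example.com/sales"))
print(check_connection_string("postgresql://db.example.com/sales"))
```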
The JDBC driver is not stored in the correct directory.
You must store the JDBC driver in the following directory: <external directory>/streamsets-datacollector-jdbc-lib/lib/.

For example, assuming that you defined the external directory as /opt/sdc-extras, store the JDBC JAR files in /opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/.

STREAMSETS_LIBRARIES_EXTRA_DIR is not set correctly.
You must set the STREAMSETS_LIBRARIES_EXTRA_DIR environment variable to tell Data Collector where the JDBC drivers and other additional libraries are located. The location should be external to the Data Collector installation directory.
For example, to use /opt/sdc-extras as the external directory for additional libraries, set STREAMSETS_LIBRARIES_EXTRA_DIR as follows:
export STREAMSETS_LIBRARIES_EXTRA_DIR="/opt/sdc-extras/"

Modify environment variables using the method required by your installation type.
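To confirm where drivers are expected for a given setting, the directory can be derived from the variable. This is a minimal sketch with a hypothetical helper name. Note that the environment seen by a script might differ from the environment of the Data Collector process, so pass the value explicitly when in doubt.

```python
import os
import posixpath


def jdbc_driver_dir(environ=None):
    """Return the directory where JDBC driver JARs are expected, or None if unset."""
    environ = os.environ if environ is None else environ
    extra_dir = environ.get("STREAMSETS_LIBRARIES_EXTRA_DIR", "").strip()
    if not extra_dir:
        return None
    # posixpath keeps forward slashes regardless of the platform running the check.
    return posixpath.join(extra_dir, "streamsets-datacollector-jdbc-lib", "lib")


print(jdbc_driver_dir({"STREAMSETS_LIBRARIES_EXTRA_DIR": "/opt/sdc-extras/"}))
```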

The security policy is not set.
You must grant permission for code in the external directory. Ensure that the $SDC_CONF/sdc-security.policy file contains the following lines:
// user-defined external directory
grant codebase "file://<external directory>-" {
  permission java.security.AllPermission;
};
For example:
// user-defined external directory
grant codebase "file:///opt/sdc-extras/-" {
  permission java.security.AllPermission;
};
JDBC drivers do not load or register correctly.
Sometimes JDBC drivers that a pipeline requires do not load or register correctly. For example, a JDBC driver might not correctly support JDBC 4.0 auto-loading, resulting in a "No suitable driver" error message.
Two approaches can resolve this issue:
  • Add the class name for the driver in the JDBC Class Driver Name property on the Legacy Drivers tab for the stage.
  • Configure Data Collector to automatically load specific drivers. In the Data Collector configuration file, uncomment the stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load property and set it to a comma-separated list of the JDBC drivers required by stages in your pipelines.
The sdc user does not have correct permissions on the JDBC driver.
When you run Data Collector as a service, the default system user named sdc is used to start the service. The user must have read access to the JDBC driver and all directories in its path.
To verify the permissions, run the following command:
sudo -u sdc file <external directory>/streamsets-datacollector-jdbc-lib/lib/<driver jar file>
For example, let's assume that you are using an external directory of /opt/sdc-extras and the MySQL JDBC driver. If you receive the following output when you run the command, then the sdc user does not have read or execute access on one or more of the directories in the path:
/opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/mysql-connector-java-5.1.40-bin.jar: cannot open `/opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/mysql-connector-java-5.1.40-bin.jar' (Permission denied)
To resolve this issue, identify the relevant directories and grant the sdc user read and execute access on those directories. For example, run the following command to grant the user access on the root of the external directory:
chmod 755 /opt/sdc-extras
If you receive the following output when you run the command, then the sdc user does not have read permission on the JDBC driver:
/opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/mysql-connector-java-5.1.40-bin.jar: regular file, no read permission
To resolve this issue, run the following command to grant the user read access to the driver:
chmod 644 /opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/mysql-connector-java-5.1.40-bin.jar
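When the path is long, it can help to find which directory or file blocks access. The sketch below uses a hypothetical helper that reports problems for the user running the script, so run it as the sdc user (for example, with sudo -u sdc). It checks execute permission on each directory, which is what traversal requires, and read permission on the driver file.

```python
import os


def find_permission_problems(driver_path):
    """List access problems along the path for the user running this script."""
    problems = []
    path = os.path.abspath(driver_path)

    # Collect every parent directory, from the driver's directory up to the root.
    parents = []
    current = os.path.dirname(path)
    while True:
        parents.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent

    for directory in reversed(parents):
        # Traversing a directory requires execute permission on it.
        if not os.access(directory, os.X_OK):
            problems.append("no execute access on directory: " + directory)

    if not os.path.isfile(path):
        problems.append("driver file not found: " + path)
    elif not os.access(path, os.R_OK):
        problems.append("no read access on file: " + path)
    return problems


example = "/opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/mysql-connector-java-5.1.40-bin.jar"
for problem in find_permission_problems(example):
    print(problem)
```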

Cannot Connect to Database

When Data Collector cannot connect to the database, an error message like the following displays - the exact message can vary depending on the driver:

JDBC_00 - Cannot connect to specified database: com.zaxxer.hikari.pool.PoolInitializationException:
Exception during pool initialization: The TCP/IP connection to the host 1.2.3.4, port 1234 has failed
In this case, verify that the Data Collector machine can access the database machine on the relevant port. You can use tools such as ping and netcat (nc) for this purpose. For example, to verify that the host 1.2.3.4 is accessible:
$ ping 1.2.3.4 
PING 1.2.3.4 (1.2.3.4): 56 data bytes 
64 bytes from 1.2.3.4: icmp_seq=0 ttl=57 time=12.063 ms 
64 bytes from 1.2.3.4: icmp_seq=1 ttl=57 time=11.356 ms 
64 bytes from 1.2.3.4: icmp_seq=2 ttl=57 time=11.626 ms 
^C
--- 1.2.3.4 ping statistics --- 
3 packets transmitted, 3 packets received, 0.0% packet loss 
round-trip min/avg/max/stddev = 11.356/11.682/12.063/0.291 ms
Then to verify that port 1234 can be reached:
$ nc -v -z -w2 1.2.3.4 1234 
nc: connectx to 1.2.3.4 port 1234 (tcp) failed: Connection refused

If the host or port is not accessible, check the routing and firewall configuration.
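If netcat is not installed on the Data Collector machine, a similar port check can be scripted. The following is a minimal sketch using the Python standard library. The function name and the default timeout of 2 seconds are assumptions that mirror the -w2 option in the nc example.

```python
import socket


def port_is_reachable(host, port, timeout_seconds=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_is_reachable("1.2.3.4", 1234) returns False when the connection is refused, as in the nc output above.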

MySQL JDBC Driver and Time Values

Due to a MySQL JDBC driver issue, the driver cannot return time values to the millisecond. Instead, the driver returns the values to the second.

For example, if a column has a value of 20:12:50.581, the driver reads the value as 20:12:50.000.

Performance

Use the following tips for help with performance:
How can I decrease the delay between reads from the origin system?
A long delay can occur between reads from the origin system when a pipeline reads records faster than it can process them or write them to the destination system. Because a pipeline processes one batch at a time, the pipeline must wait until a batch is committed to the destination system before reading the next batch, preventing the pipeline from reading at a steady rate. Reading data at a steady rate provides better performance than reading sporadically.
If you cannot increase the throughput for the processors or destination, limit the rate at which the pipeline reads records from the origin system. Configure the Rate Limit property for the pipeline to define the maximum number of records that the pipeline can read per second.
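One way to choose a starting value for the Rate Limit property is to estimate the throughput that the processors and destination can sustain. The following is a rough sketch, assuming you have measured how long the pipeline takes to process and write one batch. The function name and the calculation are illustrative assumptions, not a Data Collector feature:

```python
def suggested_rate_limit(batch_size: int, seconds_per_batch: float) -> int:
    """Estimate the records per second that the downstream stages can sustain."""
    if batch_size <= 0 or seconds_per_batch <= 0:
        raise ValueError("batch_size and seconds_per_batch must be positive")
    # Round down so the origin does not read faster than the measured throughput.
    return int(batch_size / seconds_per_batch)
```

For example, if a batch of 1000 records takes about 4 seconds to process and write, suggested_rate_limit(1000, 4.0) returns 250. Treat the result as a starting point and adjust it based on what you observe in Monitor mode.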
When I try to start one or more pipelines, I receive an error that not enough threads are available.
By default, Data Collector can run approximately 22 standalone pipelines at the same time. If you run a larger number of standalone pipelines at the same time, you might receive the following error:
CONTAINER_0166 - Cannot start pipeline '<pipeline name>' as there are not enough threads available
To resolve this error, increase the value of the runner.thread.pool.size property in the Data Collector configuration file.
For more information, see Running Multiple Concurrent Pipelines.
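As a rough sizing aid, the following sketch assumes about 2.2 threads for each running standalone pipeline, a ratio that is consistent with a pool of 50 threads supporting approximately 22 pipelines. Both the ratio and the pool size of 50 are assumptions here, and the function name is hypothetical. Confirm the recommended calculation in Running Multiple Concurrent Pipelines before changing the property:

```python
def runner_thread_pool_size(pipelines: int) -> int:
    """Estimate runner.thread.pool.size for a number of concurrent standalone pipelines."""
    if pipelines < 0:
        raise ValueError("pipelines must not be negative")
    # Assumed ratio of 2.2 threads per pipeline, written as 22/10 so the
    # calculation uses integers only, and rounded up to the next whole thread.
    return (pipelines * 22 + 9) // 10
```

Under this assumption, running 100 standalone pipelines at the same time would call for a value of 220.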
How can I tell what's slowing down my pipeline?
Review the information available in the Data Collector UI in Monitor mode. Charts provide information about the record count, record throughput, and batch throughput for the pipeline. To determine where processing slows, you can click each stage to view the count and throughput details for the stage.
If the origin causes the problem, you might tune the batch size or batch wait time properties or adjust related properties in the origin system. If a destination causes the problem, try adjusting any performance-related properties in the destination or related properties in the destination system.
If a processor causes the problem, you might take a snapshot of the pipeline to review how data passes through the pipeline and consider options for streamlining processing.
How can I improve the general pipeline performance?
You might improve performance by adjusting the batch size used by the pipeline. The batch size determines how much data passes through the pipeline at one time. By default, the batch size is 1000 records.
You might adjust the batch size based on the size of the records or the speed of their arrival. For example, if your records are extremely large, you might reduce the batch size to increase the processing speed. Or if the records are small and arrive quickly, you might increase the batch size.
Experiment with the batch size and review the results in Monitor mode.
To change the batch size, configure the production.maxBatchSize property in the Data Collector configuration file.
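Before you experiment, it can help to confirm the value that is currently set. The following is a minimal sketch that reads a property from the text of a properties file. It handles only simple key=value lines, the function name is hypothetical, and the sample text is an illustrative excerpt rather than a real configuration file:

```python
def read_property(properties_text: str, key: str):
    """Return the value of key from simple key=value properties text, or None if unset."""
    value = None
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, raw = line.partition("=")
        if sep and name.strip() == key:
            value = raw.strip()  # a later definition overrides an earlier one
    return value


# Hypothetical excerpt that uses the default batch size of 1000 records.
sample = """
# maximum batch size
production.maxBatchSize=1000
"""
print(read_property(sample, "production.maxBatchSize"))
```

To check a real installation, pass the contents of the Data Collector configuration file instead of the sample text.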

Cluster Execution Mode

Use the following tips for help with pipelines in cluster mode:
I got the following validation error when configuring a cluster pipeline. What does it mean?
Validation_0071 - Stage '<stage id>' does not support 'Cluster' execution mode
This message can appear when you include a non-cluster origin in a cluster pipeline. You can use the cluster version of the Kafka Consumer and the Hadoop FS origin in a cluster pipeline.
The message can also appear if you choose the Write to File option for pipeline error handling. Write to File is not supported for cluster mode.
Why isn't Data Collector reading data from my new Kafka partition?
If you create a new partition in the Kafka topic, restart the pipeline. Restarting the pipeline launches a new Data Collector worker to read from the new partition.
My pipeline fails to start with the following error:
Pipeline Status: START_ERROR: Unexpected error starting pipeline:java.lang.IllegalStateException: 
Timed out after waiting 121 seconds for cluster application to start. Submit command is not alive.
Check the Data Collector log for more information. It's possible that the Spark on YARN client configuration is not in place, the installation is out of date, or the node being used is not a gateway node.
My pipeline fails to start with the following error:
Pipeline Status: START_ERROR: IO Error while trying to start the pipeline: java.io.IOException: 
Kerberos Error: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
It's likely that the cluster is configured to use Kerberos, but Data Collector is not. For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication.
My pipeline stopped unexpectedly.
Check the Spark Application Master logs in the YARN Resource Manager UI for more information about the problem.
Why does my pipeline take so long to start?
The start time for a pipeline can vary based on how busy the YARN cluster is. Typically, a cluster pipeline should start in 30-90 seconds.