Troubleshooting
Accessing Error Messages
Informational and error messages display in different locations based on the type of information:
- Pipeline configuration issues
- The pipeline canvas provides guidance and error details as follows:
- Issues found by implicit validation display in the Issues list.
- An error icon displays at the stage where the problem occurs or on the canvas for pipeline configuration issues.
- Issues discovered by explicit validation display in a warning message on the canvas.
- Runtime error information
- You can view error information when you monitor a running pipeline. In the canvas, the pipeline displays error record counts for each stage generating error records.
- Error record information
- You can use the Error Records pipeline properties to write error records and related details to another system for review. The information in the record header attributes can help you determine the problem that occurred. For more information, see Internal Attributes.
- Data Collector errors
- You can view information and errors related to the general Data Collector functionality in the Data Collector log. You can view or download the logs from the Data Collector UI. For details, see Viewing Data Collector Logs.
Pipeline Basics
Use the following tips for help with pipeline basics:
- When I go to the Data Collector UI, I get a "Webpage not available" error message.
- The Data Collector is not running. Start the Data Collector.
- Why isn't the Start icon enabled?
- You can start a pipeline when it is valid. Use the Issues icon to review the list of issues in your pipeline. When you resolve the issues, the Start icon becomes enabled.
- Why doesn't the Select Fields with Preview Data option work? No preview data displays.
- Select Fields with Preview Data works when the pipeline is valid for data preview and when Data Collector is configured to run preview in the background. Make sure all stages are connected and required properties are configured. Also verify that preview is running in the background by clicking .
- Sometimes I get a list of available fields and sometimes I don't. What's up with that?
- The pipeline can display a list of available fields when the pipeline is valid for data preview and when Data Collector is configured to run preview in the background. Make sure all stages are connected and required properties are configured. Also verify that preview is running in the background by clicking .
- The data reaching the destination is not what I expect - what do I do?
- If the pipeline is still running, take a couple of snapshots of the data being processed. Then stop the pipeline, enter data preview, and use a snapshot as the source data. In data preview, you can step through the pipeline and see how each stage alters the data.
Data Preview
- Why isn't the Preview icon enabled?
- You can preview data after you connect all stages in the pipeline and configure required properties. You can use any valid value as a placeholder for required properties.
- Why doesn't the data preview show any data?
- If data preview doesn't show any data, one of the following issues might have occurred:
- The origin might not be configured correctly.
In the Preview panel, check the Configuration tab for the origin for related issues. For some origins, you can use Raw Preview to see if the configuration information is correct.
- The origin might not have any data at the moment.
Some origins, such as Directory and File Tail, can display processed data for data preview. However, most origins require incoming data to enable data preview.
- Why am I only getting 10 records to preview when I'm asking for more?
- The Data Collector maximum preview batch size overrides the data preview batch size. The Data Collector default is 10 records.
- In data preview, I edited stage configuration and clicked Run with Changes, but I don't see any change in the data.
- This might happen if the configuration change is in the origin. Run with Changes uses the existing preview data. To see how changes to origin configuration affect preview data, use Refresh Preview.
General Validation Errors
- The pipeline has the following set of validation errors for a stage:
-
CONTAINER_0901 - Could not find stage definition for <stage library name>:<stage name>.
CREATION_006 - Stage definition not found. Library <stage library name>. Stage <stage name>. Version <version>
VALIDATION_0006 - Stage definition does not exist, library <stage library name>, name <stage name>, version <version>
Origins
Use the following tips for help with origin stages and systems.
Directory
- Why isn't the Directory origin reading all of my files?
- The Directory origin reads a set of files based on the configured file name pattern, read order, and first file to process. If new files arrive after the Directory origin has passed their position in the read order, the Directory origin does not read the files unless you reset the origin.
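The read-order behavior described above can be illustrated with a small model. This is a sketch, not Data Collector code: it assumes a lexicographic read order on file names, and the names `files_to_read` and `last_read` are hypothetical. It shows why a file that arrives late but sorts before the last file read is skipped.

```python
from fnmatch import fnmatch


def files_to_read(available, pattern, last_read=None):
    """Return the files a simplified origin would still read.

    Assumption: files are read in lexicographic order of their names, and
    any name that sorts at or before `last_read` counts as already passed.
    """
    matching = sorted(name for name in available if fnmatch(name, pattern))
    if last_read is None:
        return matching
    return [name for name in matching if name > last_read]


# The origin has already read up to log-003.txt when log-002b.txt arrives.
pending = files_to_read(
    ["log-001.txt", "log-002b.txt", "log-003.txt", "log-004.txt", "notes.md"],
    pattern="log-*.txt",
    last_read="log-003.txt",
)
print(pending)
```

In this model, only `log-004.txt` remains to be read. The late file `log-002b.txt` is picked up only when `last_read` is cleared, which corresponds to resetting the origin.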
Elasticsearch
- A pipeline with an Elasticsearch origin fails to start with an SSL/TLS error, such as the following:
-
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
Hadoop FS
- In the pipeline, the Hadoop FS origin has an error icon with the following message:
-
Validation_0071 - Stage '<stage id>' does not support 'Standalone' execution mode
JDBC Origins
- My MySQL JDBC Driver 5.0 fails to validate the query in my JDBC Query Consumer origin.
- This can occur when you use a LIMIT clause in your query.
- I'm using a JDBC origin to read MySQL data. Why are datetime values set to zero being treated as error records?
- MySQL treats invalid dates as an exception, so both the JDBC Query Consumer and the JDBC Multitable Consumer create error records for invalid dates.
- A pipeline using the JDBC Query Consumer origin keeps stopping with the following error:
-
JDBC_77 <db error message> attempting to execute query '<query>'. Giving up after <error count> errors as per stage configuration. First error: <first db error>.
- My pipeline using a JDBC origin generates an out-of-memory error when reading a large table.
- When the Auto Commit property is enabled in a JDBC origin, some drivers ignore the fetch-size restriction, configured by the Max Batch Size property in the origin. This can lead to an out-of-memory error when reading a large table that cannot entirely fit in memory.
Kafka Consumer
- Why isn't my pipeline reading existing data from my Kafka topic?
- The Kafka Consumer determines the first message to read based on the value of the Auto Offset Reset property. With the default value, Earliest, the origin reads messages starting with the first message in the topic.
If you already started the pipeline or ran a preview with a different setting, the offset has already been committed. To read the oldest unread data in a topic, set Auto Offset Reset to Earliest and then temporarily change the consumer group name to a different value. Run data preview. Then, change the consumer group back to the correct value and start the pipeline.
- How can I reset the offset for a Kafka Consumer?
- Because the offset for a Kafka Consumer is stored in ZooKeeper for the Kafka cluster, you cannot reset the offset through Data Collector. For information about resetting an offset through Kafka, see the Apache Kafka documentation.
- The Kafka Consumer with Kerberos enabled cannot connect to an HDP 2.3 distribution of Kafka.
- When enabling Kerberos, by default, HDP 2.3 sets the security.inter.broker.protocol Kafka broker configuration property to PLAINTEXTSASL, which is not supported. To correct the issue, set security.inter.broker.protocol to PLAINTEXT.
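For the Auto Offset Reset question earlier in this list, the following sketch models why a committed offset takes precedence over the reset setting. It is a simplified, single-partition illustration and not Kafka or Data Collector code. The function name, the group names, and the offsets are hypothetical.

```python
def starting_offset(committed, group, auto_offset_reset, log_start, log_end):
    """Pick the offset a consumer group starts from in a simplified model.

    A committed offset for the group always wins. The reset policy is only
    consulted when the group has no committed offset yet.
    """
    if group in committed:
        return committed[group]
    return log_start if auto_offset_reset == "earliest" else log_end


committed = {"sdc-group": 120}  # hypothetical group that already ran once

# Same group: the committed offset is used even with "earliest".
print(starting_offset(committed, "sdc-group", "earliest", 0, 150))
# New group name: no committed offset, so "earliest" starts at the beginning.
print(starting_offset(committed, "sdc-group-tmp", "earliest", 0, 150))
```

This is why temporarily changing the consumer group name makes the Earliest setting take effect again.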
Oracle CDC Client
- Data preview continually times out for my Oracle CDC Client pipeline.
- Pipelines that use the Oracle CDC Client can take longer than expected to initiate for data preview. If preview times out, try increasing the Preview Timeout property incrementally.
For more information about using preview with this origin, see Data Preview with Oracle CDC Client.
- My Oracle CDC Client pipeline has paused processing during a daylight saving time change.
- If the origin is configured to use a database time zone that uses daylight saving time, then the pipeline pauses processing during the time change window to ensure that all data is correctly processed. After the time change completes, the pipeline resumes processing at the last-saved offset.
PostgreSQL CDC Client
- A PostgreSQL CDC Client pipeline generates the following error:
-
com.streamsets.pipeline.api.StageException: JDBC_606 - Wal Sender is not active
Salesforce
- A pipeline generates a buffering capacity error
- When pipelines with a Salesforce origin fail due to a buffering capacity error, such as Buffering capacity 1048576 exceeded, increase the buffer size by editing the Streaming Buffer Size property on the Subscribe tab.
Scripting Origins
- A pipeline fails to stop when users click the Stop icon
- Scripts must include code that stops the script when users stop the pipeline. In the script, use the sdc.isStopped method to check whether the pipeline has been stopped.
- A Jython script does not proceed beyond import lock
- Pipelines freeze if Jython scripts do not release the import lock upon a failure or error. When a script does not release an import lock, you must restart Data Collector to release the lock. To avoid the problem, use a try statement with a finally block in the Jython script. For more information, see Thread Safety in Jython Scripts.
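The two scripting tips above can be sketched together. The example below is a minimal illustration that runs outside Data Collector. `FakeSdc` is a hypothetical stand-in for the `sdc` object, and `threading.Lock` stands in for the import lock; the actual locking calls are described in Thread Safety in Jython Scripts.

```python
import threading


class FakeSdc(object):
    """Hypothetical stand-in for the `sdc` object, for illustration only."""

    def __init__(self, stop_after):
        self._remaining = stop_after

    def isStopped(self):
        # Report "stopped" once the configured number of checks has passed.
        self._remaining -= 1
        return self._remaining < 0


def run_script(sdc, import_lock):
    batches = 0
    # Check the stop flag on every iteration so that stopping takes effect.
    while not sdc.isStopped():
        import_lock.acquire()
        try:
            import json  # imports happen while the lock is held
        finally:
            # Always release the lock, even if the import raises.
            import_lock.release()
        batches += 1
    return batches


lock = threading.Lock()
print(run_script(FakeSdc(stop_after=3), lock))
```

The loop condition is what lets the script end when the pipeline stops, and the `finally` block is what guarantees the lock is released on failure.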
SQL Server CDC Client
- A pipeline with the SQL Server CDC Client origin cannot establish a connection. The pipeline fails with the following error:
-
java.sql.SQLTransientConnectionException: HikariPool-3 - Connection is not available, request timed out after 30004ms.
    at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:213)
    at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:163)
    at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:85)
    at com.streamsets.pipeline.lib.jdbc.multithread.ConnectionManager.getNewConnection(ConnectionManager.java:45)
    at com.streamsets.pipeline.lib.jdbc.multithread.ConnectionManager.getConnection(ConnectionManager.java:57)
    at com.streamsets.pipeline.stage.origin.jdbc.cdc.sqlserver.SQLServerCDCSource.getCDCTables(SQLServerCDCSource.java:181)
- After dropping and recreating a table, the origin doesn't seem to read the data in the table. What's the problem?
- The SQL Server CDC Client origin stores the offset for every table that it processes to track its progress. If you drop a table and recreate it using the same name, the origin assumes it is the same table and uses the last-saved offset for the table.
- Previewing data does not show any values.
- When you set the Maximum Transaction Length property, the origin fetches data in multiple time windows. The property determines the size of each time window. Previewing data only shows data from the first time window, but the origin might need to process multiple time windows before finding changed values to show in the preview.
To see values when previewing data, increase Maximum Transaction Length or set it to -1 to fetch data in one time window.
- A no-more-data event is generated before reading all changes
- When you set the Maximum Transaction Length property, the origin fetches data in multiple time windows. The property determines the size of each time window. After processing all available rows in each time window, the origin generates a no-more-data event, even when subsequent time windows remain for processing.
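The time-window behavior in the last two answers can be pictured with a short sketch. It is an illustrative model only, not the origin's implementation. It assumes that times and the window size are plain numbers of seconds, and the function name `time_windows` is hypothetical.

```python
def time_windows(start, end, max_transaction_length):
    """Split [start, end) into fetch windows, all values in seconds.

    Assumption: -1 means "no limit", so the whole range is one window.
    """
    if max_transaction_length == -1:
        return [(start, end)]
    if max_transaction_length <= 0:
        raise ValueError("window size must be positive or -1")
    windows = []
    lower = start
    while lower < end:
        upper = min(lower + max_transaction_length, end)
        windows.append((lower, upper))
        lower = upper
    return windows


print(time_windows(0, 10, 4))
print(time_windows(0, 10, -1))
```

In this model, preview sees only the first tuple in the list, and a no-more-data event would follow each tuple rather than only the last one.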
Processors
Use the following tip for help with processors.
Encrypt and Decrypt Fields
- The following error message displays in the log after I start the pipeline:
-
CONTAINER_0701 - Stage 'EncryptandDecryptFields_01' initialization error: java.lang.IllegalArgumentException: Input byte array has incorrect ending byte at 44
Destinations
Use the following tips for help with destination stages and systems.
Azure Data Lake Storage
- An Azure Data Lake Storage destination seems to be causing out of memory errors, with the following object using all available memory:
-
com.streamsets.pipeline.stage.destination.hdfs.writer.ActiveRecordWriters
Cassandra
- Why is the pipeline failing entire batches when only a few records have a problem?
- Due to Cassandra requirements, when you write to a Cassandra cluster, batches are atomic. This means that an error in one or more records causes the entire batch to fail.
- Why is all of my data being sent to error? Every batch is failing.
- When every batch fails, you might have a data type mismatch. Cassandra requires the data type of the data to exactly match the data type of the Cassandra column.
Elasticsearch
- A pipeline with an Elasticsearch destination fails to start with an SSL/TLS error, such as the following:
-
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
Hadoop FS
- I'm writing text data to HDFS. Why are my files all empty?
- You might not have the pipeline or Hadoop FS destination configured correctly.
HBase
- I get the following error when validating or starting a pipeline with an HBase destination:
-
HBASE_06 - Cannot connect to cluster: org.apache.hadoop.hbase.MasterNotRunningException: com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to node00.local/<IP_address>:60000 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to node00.local/<IP_address>:60000 is closing. Call id=0, waitTime=58
Kafka Producer
- Can the Kafka Producer create topics?
- The Kafka Producer can create a topic when all of the following are true:
- You configure the Kafka Producer to write to a topic name that does not exist.
- At least one of the Kafka brokers defined for the Kafka Producer has the auto.create.topics.enable property enabled.
- The broker with the enabled property is up and available when the Kafka Producer looks for the topic.
- A pipeline that writes to Kafka keeps failing and restarting in an endless cycle.
- This can happen when the pipeline tries to write a message to Kafka 0.8 that is longer than the Kafka maximum message size.
- The Kafka Producer with Kerberos enabled cannot connect to the HDP 2.3 distribution of Kafka.
- When enabling Kerberos, by default, HDP 2.3 sets the security.inter.broker.protocol Kafka broker configuration property to PLAINTEXTSASL, which is not supported. To correct the issue, set security.inter.broker.protocol to PLAINTEXT.
MemSQL Fast Loader
- A pipeline stops and returns the following error:
-
JDBC_14 - Error processing batch. SQLState: 42000 Error Code: 1148 Message: The used command is not allowed with this MySQL version
To use MemSQL Fast Loader with a MySQL database, you must enable local data loading in MySQL. See the MySQL topic, Security Issues with LOAD DATA LOCAL.
- A pipeline passes a CDC record to the MemSQL Fast Loader destination and returns the following error:
-
JDBC_70 - Unsupported operation in record header: 1
The MemSQL Fast Loader destination cannot process CDC records. Use the JDBC Producer destination to process these records.
SDC RPC
- A pipeline fails to start with the following validation error:
-
IPC_DEST_15 Could not connect to any SDC RPC destination : [<host name>: java.net.ConnectException: Connection refused]
Executors
Use the following tips for help with executors.
Hive Query
- When I enter the name of the Impala JDBC driver in the stage, I receive an error saying that the driver is not present in the class path:
-
HIVE_15 - Hive JDBC Driver <driver name> not present in the class path.
To use an Impala JDBC driver with the Hive Query executor, you must install the driver as an external library for the stage library that the Hive Query executor uses.
If you already installed the driver, verify that you have it installed for the correct stage library. For more information, see Installing the Impala Driver.
JDBC Connections
Use the following tips for help with stages that use JDBC connections to connect to databases. For some stages, Data Collector includes the necessary JDBC driver to connect to the database. For other stages, you must install a JDBC driver.
- JDBC Multitable Consumer origin
- JDBC Query Consumer origin
- MySQL Binary Log origin
- Oracle Bulkload origin
- Oracle CDC origin
- Oracle CDC Client origin
- SAP HANA Query Consumer origin
- Teradata Consumer origin
- JDBC Lookup processor
- JDBC Tee processor
- SQL Parser processor, when using the database to resolve the schema
- JDBC Producer destination
- MemSQL Fast Loader destination
- JDBC Query executor
No Suitable Driver
JDBC_00 - Cannot connect to specified database: com.streamsets.pipeline.api.StageException:
JDBC_06 - Failed to initialize connection pool: java.sql.SQLException: No suitable driver
Verify that you have followed the instructions to install additional drivers, as explained in Install External Libraries.
You can also use these additional tips to help resolve the issue:
- The JDBC connection string is not correct.
- The JDBC Connection String property for the stage must include the jdbc: prefix. For example, a PostgreSQL connection string might be jdbc:postgresql://<database host>/<database name>.
- The JDBC driver is not stored in the correct directory.
- You must store the JDBC driver in the following directory: <external directory>/streamsets-datacollector-jdbc-lib/lib/.
- STREAMSETS_LIBRARIES_EXTRA_DIR is not set correctly.
- You must set the STREAMSETS_LIBRARIES_EXTRA_DIR environment variable to tell Data Collector where the JDBC drivers and other additional libraries are located. The location should be external to the Data Collector installation directory.
- The security policy is not set.
- You must grant permission for code in the external directory. Ensure that the $SDC_CONF/sdc-security.policy file contains the following lines:
// user-defined external directory
grant codebase "file://<external directory>-" {
  permission java.security.AllPermission;
};
- JDBC drivers do not load or register correctly.
- Sometimes JDBC drivers that a pipeline requires do not load or register correctly. For example, a JDBC driver might not correctly support JDBC 4.0 auto-loading, resulting in a "No suitable driver" error message. Two approaches can resolve this issue:
- Add the class name for the driver in the JDBC Class Driver Name property on the Legacy Drivers tab for the stage.
- Configure Data Collector to automatically load specific drivers. In the Data Collector configuration file, uncomment the stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load property and set it to a comma-separated list of the JDBC drivers required by stages in your pipelines.
- The sdc user does not have correct permissions on the JDBC driver.
- When you run Data Collector as a service, the default system user named sdc is used to start the service. The user must have read access to the JDBC driver and all directories in its path.
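Several of the checks in this list can be scripted. The sketch below is a hypothetical helper, not part of Data Collector. It assumes the directory layout described above and checks the connection string prefix, the STREAMSETS_LIBRARIES_EXTRA_DIR variable, the driver location, and read access for the user running the script.

```python
import os


def check_jdbc_setup(connection_string, driver_jar, environ=None):
    """Return a list of likely causes for a "No suitable driver" error."""
    environ = os.environ if environ is None else environ
    problems = []
    if not connection_string.startswith("jdbc:"):
        problems.append("connection string is missing the jdbc: prefix")
    extra_dir = environ.get("STREAMSETS_LIBRARIES_EXTRA_DIR")
    if not extra_dir:
        problems.append("STREAMSETS_LIBRARIES_EXTRA_DIR is not set")
    else:
        expected = os.path.abspath(
            os.path.join(extra_dir, "streamsets-datacollector-jdbc-lib", "lib")
        )
        if os.path.dirname(os.path.abspath(driver_jar)) != expected:
            problems.append("driver is not in " + expected)
    if not os.access(driver_jar, os.R_OK):
        problems.append("driver file is missing or not readable")
    return problems


# Hypothetical values: no jdbc: prefix, no variable set, no such file.
print(check_jdbc_setup("postgresql://db.example.com/sales",
                       "/no/such/dir/driver.jar", environ={}))
```

Note that `os.access` tests the permissions of the user running the script. To check access for the sdc user, the script would need to run as that user.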
Cannot Connect to Database
When Data Collector cannot connect to the database, an error message like the following displays - the exact message can vary depending on the driver:
JDBC_00 - Cannot connect to specified database: com.zaxxer.hikari.pool.PoolInitializationException:
Exception during pool initialization: The TCP/IP connection to the host 1.2.3.4, port 1234 has failed
To diagnose the problem, verify that the Data Collector machine can reach the database host, for example with ping:
$ ping 1.2.3.4
PING 1.2.3.4 (1.2.3.4): 56 data bytes
64 bytes from 1.2.3.4: icmp_seq=0 ttl=57 time=12.063 ms
64 bytes from 1.2.3.4: icmp_seq=1 ttl=57 time=11.356 ms
64 bytes from 1.2.3.4: icmp_seq=2 ttl=57 time=11.626 ms
^C
--- 1.2.3.4 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.356/11.682/12.063/0.291 ms
You can also check whether the database port accepts connections, for example with netcat (nc):
$ nc -v -z -w2 1.2.3.4 1234
nc: connectx to 1.2.3.4 port 1234 (tcp) failed: Connection refused
If the host or port is not accessible, check the routing and firewall configuration.
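Where nc is not available, the same port check can be approximated with a few lines of Python. This is a minimal sketch using only the standard library. The function name `port_is_open` is hypothetical, and the demo uses a throwaway local listener rather than a real database.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Demo against a throwaway local listener instead of a real database.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]
print(port_is_open("127.0.0.1", demo_port))
server.close()
```

For the error shown above, the equivalent call would be `port_is_open("1.2.3.4", 1234)`, run from the Data Collector machine.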
MySQL JDBC Driver and Time Values
Due to a MySQL JDBC driver issue, the driver cannot return time values to the millisecond. Instead, the driver returns the values to the second.
For example, if a column has a value of 20:12:50.581, the driver reads the value as 20:12:50.000.
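The loss of precision can be reproduced with a short worked example. This sketch only mimics the effect described above using Python's `datetime` module. It does not call the MySQL driver.

```python
from datetime import time

stored = time(20, 12, 50, 581000)        # 20:12:50.581 as stored in the column
as_read = stored.replace(microsecond=0)  # the value after the fraction is dropped

print(stored.isoformat(timespec="milliseconds"))
print(as_read.isoformat(timespec="milliseconds"))
```

The second value prints as 20:12:50.000, matching the behavior described for the driver.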
Performance
- How can I decrease the delay between reads from the origin system?
- A long delay can occur between reads from the origin system when a pipeline reads records faster than it can process them or write them to the destination system. Because a pipeline processes one batch at a time, the pipeline must wait until a batch is committed to the destination system before reading the next batch, preventing the pipeline from reading at a steady rate. Reading data at a steady rate provides better performance than reading sporadically.
- When I try to start one or more pipelines, I receive an error that not enough threads are available.
- By default, Data Collector can run approximately 22 standalone pipelines at the same time. If you run a larger number of standalone pipelines at the same time, you might receive the following error:
CONTAINER_0166 - Cannot start pipeline '<pipeline name>' as there are not enough threads available
- How can I tell what's slowing down my pipeline?
- Review the information available in the Data Collector UI in Monitor mode. Charts provide information about the record count, record throughput, and batch throughput for the pipeline. To determine where processing slows, you can click each stage to view the count and throughput details for the stage.
- How can I improve the general pipeline performance?
- You might improve performance by adjusting the batch size used by the pipeline. The batch size determines how much data passes through the pipeline at one time. By default, the batch size is 1000 records.
Cluster Execution Mode
- I got the following validation error when configuring a cluster pipeline. What does it mean?
-
Validation_0071 - Stage '<stage id>' does not support 'Standalone' execution mode
- Why isn't Data Collector reading data from my new Kafka partition?
- When you create a new partition in the Kafka topic, you need to restart the pipeline so that a new Data Collector worker is launched to read from the partition.
- My pipeline fails to start with the following error:
-
Pipeline Status: START_ERROR: Unexpected error starting pipeline:java.lang.IllegalStateException: Timed out after waiting 121 seconds for cluster application to start. Submit command is not alive.
Check the Data Collector log for more information. It's possible that the Spark on YARN client configuration is not in place, the installation is out of date, or the node being used is not a gateway node.
- My pipeline fails to start with the following error:
-
Pipeline Status: START_ERROR: IO Error while trying to start the pipeline: java.io.IOException: Kerberos Error: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
- My pipeline stopped unexpectedly.
- Check the Spark Application Master logs in the YARN Resource Manager UI for more information about the problem.
- Why does my pipeline take so long to start?
- The start time for a pipeline can vary based on how busy the YARN cluster is. Typically, a cluster pipeline should start in 30-90 seconds.