Administration
Providing an Activation Code
Users with an enterprise account need to provide an activation code only when using a Docker Data Collector image that is not registered with Control Hub.
Other users do not need to provide an activation code.
To request an activation code, submit a request through the StreamSets Support portal.
After you receive an email with the activation code, log in to Data Collector. On the registration page, click Enter a Code, then paste the code into the Activation window and click Activate.
Users with the Admin role can view activation details by clicking .
Viewing Data Collector Configuration Properties
For details about the configuration properties or to edit the configuration file, see Configuring Data Collector.
Viewing Data Collector Directories
You can view the directories that Data Collector uses. For example, you might check the directories to locate a file or to increase the amount of available space for a directory.
Data Collector directories are defined in environment variables. For more information, see Data Collector Environment Configuration.
To view Data Collector directories, click .
Directory | Includes | Environment Variable |
---|---|---|
Runtime | Base directory for Data Collector executables and related files. | SDC_DIST |
Configuration | The Data Collector configuration file, sdc.properties, and related realm properties files and keystore files. Also includes the Log4j properties file. | SDC_CONF |
Data | Pipeline configuration and run details. | SDC_DATA |
Log | Data Collector log file, sdc.log. | SDC_LOG |
Resources | Directory for runtime resource files. | SDC_RESOURCES |
SDC Libraries Extra Directory | Directory to store external libraries. | STREAMSETS_LIBRARIES_EXTRA_DIR |
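As a minimal sketch, a script could report the same directories by reading the environment variables listed in the table. The function name and the choice to return None for unset variables are illustrative assumptions. The variables are only visible to a process that inherits the Data Collector environment.

```python
import os

# Environment variables from the table above, keyed by directory label.
SDC_DIRECTORY_VARS = {
    "Runtime": "SDC_DIST",
    "Configuration": "SDC_CONF",
    "Data": "SDC_DATA",
    "Log": "SDC_LOG",
    "Resources": "SDC_RESOURCES",
    "SDC Libraries Extra Directory": "STREAMSETS_LIBRARIES_EXTRA_DIR",
}


def resolve_sdc_directories(environ=None):
    """Return {directory label: path or None} for each variable in the table.

    `environ` defaults to the current process environment; pass a dict to
    inspect a different set of values.
    """
    env = os.environ if environ is None else environ
    return {label: env.get(var) for label, var in SDC_DIRECTORY_VARS.items()}


if __name__ == "__main__":
    for label, path in resolve_sdc_directories().items():
        print(f"{label}: {path or '(not set)'}")
```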
Viewing Data Collector Metrics
You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
- To view Data Collector metrics, click .
The Data Collector Metrics page displays all metrics by default.
- To modify the metrics that display on the page, click the More icon, and then click Settings.
- Remove any metric charts that you don't want to display, and then click Save.
Viewing Data Collector Logs
You can view and download log data. When you download log data, you can select the file to download.
- To view log data for the Data Collector, click .
The Data Collector UI displays roughly 50,000 characters of the most recent log information.
- To stop the automatic refresh of log data, click Stop Auto Refresh.
Or, click Start Auto Refresh to view the latest data.
- To view earlier events, click Load Previous Logs.
- To download the latest log file, click Download. To download a specific log file, click .
The most recent information is in the file with the highest number.
Data Collector Log Format
Data Collector uses the Apache Log4j library to write log data. Each log entry includes a timestamp and message along with additional information relevant for the message.
- Timestamp
- Pipeline
- Severity
- Message
- Category
- User
- Runner
- Thread
In the downloaded log file, each entry includes the following information in this order:
- Timestamp
- User
- Pipeline
- Runner
- Thread
- Stage
- Severity
- Category
- Message
For example:
2019-03-19 09:34:26,236 [user:admin] [pipeline:Test/TestPipeline65f67dde-faad-426d-ac47-8a2cd707f224] [runner:] [thread:webserver-430] [stage:] INFO StandaloneAndClusterRunnerProviderImpl - Pipeline execution mode is: STANDALONE
For this message, the stage and runner are not relevant, and therefore not included in the log entry.
The information included in the downloaded file is set by the appender.streamsets.layout.pattern property in the log configuration file, $SDC_CONF/sdc-log4j2.properties. The default configuration sets this property to:
%d{ISO8601} [user:%X{s-user}] [pipeline:%X{s-entity}] [runner:%X{s-runner}] [thread:%t] [stage:%X{s-stage}] %-5p %c{1} - %m%n
- %X{s-entity} - Pipeline name and ID
- %X{s-runner} - Runner ID
- %X{s-stage} - Stage name
- %X{s-user} - User who initiated the operation
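To illustrate how the default pattern maps onto a log line, the following Python sketch splits one entry into named fields. The regular expression and the function name are assumptions written for the default pattern only. A customized pattern, or continuation lines such as stack traces, will not match and return None.

```python
import re

# Regular expression mirroring the default layout pattern:
# %d{ISO8601} [user:..] [pipeline:..] [runner:..] [thread:..] [stage:..] %-5p %c{1} - %m
LOG_ENTRY = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[user:(?P<user>[^\]]*)\] "
    r"\[pipeline:(?P<pipeline>[^\]]*)\] "
    r"\[runner:(?P<runner>[^\]]*)\] "
    r"\[thread:(?P<thread>[^\]]*)\] "
    r"\[stage:(?P<stage>[^\]]*)\] "
    r"(?P<severity>[A-Z]+)\s+"   # %-5p pads short levels with spaces
    r"(?P<category>\S+) - "
    r"(?P<message>.*)$"
)


def parse_log_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_ENTRY.match(line.rstrip("\n"))
    return match.groupdict() if match else None
```

Fields that are not relevant for a message, such as the runner and stage in the example entry, come back as empty strings.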
Modifying the Log Level
If the Data Collector logs do not provide enough troubleshooting information, you can modify the log level to display messages at another severity level.
- TRACE
- DEBUG
- INFO (Default)
- WARN
- ERROR
- FATAL
- Click .
- Click Log Config.
Data Collector displays the contents of the log configuration file, $SDC_CONF/sdc-log4j2.properties.
- Change the default value of INFO for the following line in the file:
logger.l1.level=INFO
For example, to set the log level to DEBUG, modify the line as follows:
logger.l1.level=DEBUG
- Click Save.
The changes that you make to the log level take effect immediately; you do not need to restart Data Collector. You can also change the log level by directly editing the log configuration file, $SDC_CONF/sdc-log4j2.properties.
Note: For a Cloudera Manager installation, use Cloudera Manager to modify the log level. In Cloudera Manager, select the StreamSets service, then click Configuration. Click , and then modify the value of the Data Collector Logging Threshold property.
When you’ve finished troubleshooting, set the log level back to INFO to avoid having verbose log files.
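If you edit the log configuration file directly, the change to the logger.l1.level line can be scripted. The following Python sketch rewrites that one line in the text of a properties file. The function name is hypothetical, and reading and writing $SDC_CONF/sdc-log4j2.properties is left to the caller.

```python
import re

# Severity levels listed in this section.
VALID_LEVELS = ("TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL")

# Matches the whole "logger.l1.level=<value>" line; group 1 keeps the key and "=".
LEVEL_LINE = re.compile(r"^(logger\.l1\.level[ \t]*=[ \t]*)\S+[ \t]*$", re.MULTILINE)


def set_log_level(properties_text, level):
    """Return properties_text with logger.l1.level set to `level`."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    new_text, count = LEVEL_LINE.subn(lambda m: m.group(1) + level, properties_text)
    if count == 0:
        raise ValueError("logger.l1.level line not found")
    return new_text
```

The same function can set the level back to INFO after troubleshooting.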
Shutting Down Data Collector
Use one of the following methods to shut down Data Collector:
- User interface
- To use the Data Collector UI for shutdown:
- Click .
- When a confirmation dialog box appears, click Yes.
- Command line when started as a service
- To use the command line for shutdown when Data Collector is started as a service, use the required command for your operating system:
- For CentOS 6, Oracle Linux 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, use:
service sdc stop
- For CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, use:
systemctl stop sdc
- Command line when started manually
- To use the command line for shutdown when Data Collector is started manually, use the Data Collector process ID in the following command:
kill -15 <process ID>
Restarting Data Collector
- Started manually
If you changed or added an environment variable in the sdc-env.sh file, then you must restart Data Collector from the command prompt. Press Ctrl+C to shut down Data Collector and then enter bin/streamsets dc to restart Data Collector.
If you did not change or add an environment variable, then you can restart Data Collector from the command prompt or from the user interface. To restart from the user interface, click , expand StreamSets Data Collector was started manually, and then click Restart Data Collector.
- Started as a service
Run the appropriate command for your operating system:
- For CentOS 6, Oracle Linux 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, use:
service sdc start
- For CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, use:
systemctl start sdc
- Started from Cloudera Manager
Use Cloudera Manager to restart Data Collector. For information about how to restart a service through Cloudera Manager, see the Cloudera documentation.
- Started from Docker
Run the following Docker command:
docker restart <containerID>
Viewing Users and Groups
If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
To view users and groups, click . Data Collector displays a read-only view of the users, groups, and roles.
You configure users, groups, and roles for file-based authentication in the associated realm.properties file located in the Data Collector configuration directory, $SDC_CONF. For more information, see Configuring File-Based Authentication.
Managing Usage Statistics Collection
You can help to improve Data Collector by allowing StreamSets to collect usage statistics about Data Collector system performance and the features that you use. This telemetry data helps StreamSets to improve product performance and to make feature development decisions.
You can configure whether to allow usage statistics collection.
- Click .
- Select the Share usage data with StreamSets checkbox to enable usage statistics collection.
Clear the checkbox if you prefer not to share usage statistics.
- Click Save.
Support Bundles
You can use Data Collector to generate a support bundle. A support bundle is a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and other details to help troubleshoot issues. You upload the generated file to a StreamSets Support ticket, and the Support team can use the information to help resolve your tickets. Alternatively, you can send the file to another StreamSets community member.
Data Collector uses several generators to create a support bundle. Each generator bundles different types of information. You can choose to use all or some of the generators.
Each generator automatically redacts all passwords entered in pipelines, configuration files, or resource files. The generators replace all passwords with the text REDACTED in the generated files. You can customize the generators to redact other sensitive information, such as machine names or user names.
Before uploading a generated ZIP file to a support ticket, we recommend verifying that the file does not include any sensitive information that you do not want to share.
Generators
Data Collector can use the following generators to create a support bundle:
Generator | Description |
---|---|
SDC Info | Includes the following information: |
Pipelines | Includes the following JSON files for each pipeline: By default, all Data Collector pipelines are included in the bundle. |
Blob Store | Internal blob store containing information provided by Control Hub. |
Logs | Includes the most recent content of the following log files: |
In addition, Data Collector always generates the following files when you create a support bundle:
- metadata.properties - ID and version of the Data Collector that generated the bundle.
- generators.properties - List of generators used for the bundle.
Generating a Support Bundle
When you generate a support bundle, you choose the information to include in the bundle. Only users with the Admin role can generate support bundles.
You can download the bundle, and then verify its contents and upload it to a StreamSets Support ticket.
- Click the Help icon, and then click Support Bundle.
- Select the generators that you want to use.
- Click Download.
Data Collector generates the support bundle and saves it to a ZIP file in your default downloads directory.
You can manually upload the file to a StreamSets Support ticket.
Before sharing the file, verify that the file does not include sensitive information that you do not want to share. For example, you might want to remove the pipelines not associated with your support ticket. By default, the bundle includes all Data Collector pipelines.
Customizing Generators
By default, the generators redact all passwords entered in pipelines, configuration files, or resource files. You can customize the generators to redact other sensitive information, such as machine names or user names.
To customize the generators, modify the support bundle redactor file, $SDC_CONF/support-bundle-redactor.json. The file contains rules that the generators use to redact sensitive information. Each rule contains the following information:
- description - Description of the rule.
- trigger - String constant that triggers a redaction. If a line contains this trigger string, then the redaction continues by applying the regular expression specified in the search property.
- search - Regular expression that defines the sub-string to redact.
- replace - String to replace the redacted information with.
{
  "description": "Custom domain names",
  "trigger": ".streamsets.com",
  "search": "[a-z_-]+.streamsets.com",
  "replace": "REDACTED.streamsets.com"
}
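The trigger-then-search behavior described above can be sketched in Python to try out a rule before adding it to the redactor file. This is an approximation, not the generators' actual implementation: the function name is hypothetical, Python's regular expression syntax may differ from the one Data Collector uses, and the replacement is treated as literal text.

```python
import re


def redact_line(line, rules):
    """Apply each redaction rule to one line of text and return the result.

    A rule is a dict with "trigger", "search", and "replace" keys, as in
    the redactor file. The search regex runs only when the line contains
    the trigger string.
    """
    for rule in rules:
        if rule["trigger"] in line:
            replacement = rule["replace"]
            # A callable keeps the replacement literal (no group references).
            line = re.sub(rule["search"], lambda _m: replacement, line)
    return line


# The example rule from this section.
domain_rule = {
    "description": "Custom domain names",
    "trigger": ".streamsets.com",
    "search": "[a-z_-]+.streamsets.com",
    "replace": "REDACTED.streamsets.com",
}
```

Checking the trigger first is a cheap substring test, so the regular expression runs only on lines that can contain a match.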
Health Inspector
The Data Collector Health Inspector provides a snapshot of how Data Collector is functioning. When you run Health Inspector, it performs checks for common misconfigurations and errors. You can use the Health Inspector to quickly check the health of your Data Collector.
Health Inspector provides only Data Collector-level details. For pipeline-level details, monitor the pipeline or review the Data Collector log.
- Data Collector configuration - Displays the settings for certain Data Collector configuration properties, such as the maximum number of pipeline errors allowed in production.
- Java Virtual Machine (JVM) process - Displays the settings for certain JVM configuration properties, such as the maximum amount of memory allotted to the JVM. Also generates related usage statistics, such as the percentage of the JVM memory currently used by Data Collector.
- Machine - Displays important details about available resources on the Data Collector machine, such as the available space in the runtime directory.
- Networking - Verifies that the internet is accessible by pinging the StreamSets website.
Viewing the Health Inspector
Data Collector generates Health Inspector details each time you open the Health Inspector page.
- To view the Data Collector Health Inspector, click the Help icon, and then click Health Inspector.
- To view all available information, click the Expand All link.
Green indicates that values are within expected range. Red indicates that values fall beyond the expected range.
Some details, such as JVM Child Processes, provide additional information. To view that information, click Show Output.
- To refresh a category of information, click the Rerun link for the category.
- To refresh all Health Inspector details, navigate away from the page, and then return.
REST Response
You can view REST response JSON data for different aspects of the Data Collector, such as pipeline configuration information or monitoring details.
You can use the REST response information to provide Data Collector details to a REST-based monitoring system. Or you might use the information in conjunction with the Data Collector REST API.
- Pipeline Configuration - Provides information about the pipeline and each stage in the pipeline.
- Pipeline Rules - Provides information about metric and data rules and alerts.
- Definitions - Provides information about all available Data Collector stages.
- Preview Data - Provides information about the preview data moving through the pipeline. Also includes monitoring information that is not used in preview.
- Pipeline Monitoring - Provides monitoring information for the pipeline.
- Pipeline Status - Provides the current status of the pipeline.
- Data Collector Metrics - Provides metrics about Data Collector.
- Thread Dump - Lists all active Java threads used by Data Collector.
Viewing REST Response Data
You can view REST response data from the location where the relevant information displays. For example, you can view Data Collector Metrics REST response data from the Data Collector Metrics page.
- Edit mode
- From the Properties panel, you can use the More icon to view the following REST response data:
- Pipeline Configuration
- Pipeline Rules
- Pipeline Status
- Definitions
- Preview mode
- From the Preview panel, you can use the More icon to view the Preview Data REST response data.
- Monitor mode
- From the Monitor panel, you can use the More icon to view the following REST response data:
- Pipeline Monitoring
- Pipeline Configuration
- Pipeline Rules
- Pipeline Status
- Definitions
- Data Collector Metrics page
- From the Data Collector Metrics page, you can use the More icon to view the following REST response data:
- Data Collector Metrics
- Thread Dump
Disabling the REST Response Menu
You can configure the Data Collector to disable the display of REST responses.
- To disable the REST Response menus, click the Help icon, and then click Settings.
- In the Settings window, select Hide the REST Response Menu.
Command Line Interface
Data Collector provides a command line interface that includes a basic cli command. Use the command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.
You can use the following commands with the cli command:
- help
- Provides information about each command or subcommand.
- manager
- Provides the following subcommands:
- start - Starts a pipeline.
- status - Returns the status of a pipeline.
- stop - Stops a pipeline.
- reset-origin - Resets the origin when possible.
- get-committed-offsets - Returns the last-saved offset for pipeline failover.
- update-committed-offsets - Updates the last-saved offset for pipeline failover.
- store
- Provides the following subcommands:
- import - Imports a pipeline.
- list - Lists information for all available pipelines.
- system
- Provides the following subcommands:
- enableDPM - Registers the Data Collector with StreamSets Control Hub.
- disableDPM - Unregisters the Data Collector from Control Hub.
Java Configuration Options for the Cli Command
Use the SDC_CLI_JAVA_OPTS environment variable to modify Java configuration options for the cli command.
For example, to set the -Djavax.net.ssl.trustStore option for the cli command when using Data Collector with HTTPS, run the following command:
export SDC_CLI_JAVA_OPTS="-Djavax.net.ssl.trustStore=<path to truststore file> ${SDC_CLI_JAVA_OPTS}"
Using the Cli Command
Call the cli command from the $SDC_DIST directory.
Use the following syntax for all cli commands:
bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
<command> <subcommand> [<args>]
The usage of the basic command options depends on whether or not the Data Collector is registered with Control Hub.
Not Registered with Control Hub
Option | Description |
---|---|
-U <sdcURL> or --url <sdcURL> | Required. URL of the Data Collector. The default URL is |
-a <sdcAuthType> or --auth-type <sdcAuthType> | Optional. HTTP authentication type used by the Data Collector. |
-u <sdcUser> or --user <sdcUser> | Optional. User name to use to log in. The roles assigned to the user account determine the tasks that you can perform. If you omit this option, the Data Collector allows admin access. |
-p <sdcPassword> or --password <sdcPassword> | Optional. Required when you enter a user name. Password for the user account. |
-D <dpmURL> or --dpmURL <dpmURL> | Not applicable. Do not use when the Data Collector is not registered with Control Hub. |
<command> | Required. Command to perform. |
<subcommand> | Required for all commands except help. Subcommand to perform. |
<args> | Optional. Include arguments and options as needed. |
Registered with Control Hub
Option | Description |
---|---|
-U <sdcURL> or --url <sdcURL> | Required. URL of the Data Collector. The default URL is |
-a <sdcAuthType> or --auth-type <sdcAuthType> | Required. Authentication type used by the Data Collector. Set to dpm. If you omit this option, Data Collector uses the Form authentication type, which causes the command to fail. |
-u <sdcUser> or --user <sdcUser> | Required. User account to log in. Enter your Control Hub user ID using the following format: The roles assigned to the Control Hub user account determine the tasks that you can perform. If you omit this option, Data Collector uses the admin user account, which causes the command to fail. |
-p <sdcPassword> or --password <sdcPassword> | Required. Enter the password for your Control Hub user account. |
-D <dpmURL> or --dpmURL <dpmURL> | Required. Set to: https://cloud.streamsets.com. |
<command> | Required. Command to perform. |
<subcommand> | Required for all commands except help. Subcommand to perform. |
<args> | Optional. Include arguments and options as needed. |
Help Command
Use the help command to view additional information for the specified command.
bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
help <command> [<subcommand>]
For example, to view help for the manager command:
bin/streamsets cli -U http://localhost:18630 help manager
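When calling the cli command from a script, it can help to assemble the argument list in one place. The following Python sketch builds the list from the basic options documented above. The function and parameter names are hypothetical, and the sketch only builds the arguments; it does not run the command.

```python
def build_cli_command(sdc_url, command, subcommand=None, args=(),
                      auth_type=None, user=None, password=None, dpm_url=None):
    """Return the argument list for a `bin/streamsets cli` call.

    Only the basic options from the syntax above are used. The list is
    meant to be run from the $SDC_DIST directory.
    """
    if user and not password:
        # The option table states that a password is required with a user name.
        raise ValueError("a password is required when a user name is given")
    argv = ["bin/streamsets", "cli", "-U", sdc_url]
    if auth_type:
        argv += ["-a", auth_type]
    if user:
        argv += ["-u", user, "-p", password]
    if dpm_url:
        argv += ["-D", dpm_url]
    argv.append(command)
    if subcommand:
        argv.append(subcommand)
    argv.extend(args)
    return argv


if __name__ == "__main__":
    # Prints: bin/streamsets cli -U http://localhost:18630 help manager
    print(" ".join(build_cli_command("http://localhost:18630", "help", "manager")))
```

Passing the list to subprocess.run with cwd set to the $SDC_DIST directory avoids shell quoting issues with passwords and URLs.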
Manager Command
The manager command provides subcommands to start and stop a pipeline, view the status of all pipelines, and reset the origin for a pipeline. It can also be used to get the last-saved offset and to update the last-saved offset for a pipeline.
The manager command returns the pipeline status object after it successfully completes the specified subcommand. The following is a sample of the pipeline status object:
{
  "user" : "admin",
  "name" : "MyPipelinejf45e1f1-dfc1-402c-8587-918bc6e831db",
  "pipelineID" : "MyPipelinejf45e1f1-dfc1-402c-8587-918bc6e831db",
  "rev" : "0",
  "status" : "STOPPING",
  "message" : null,
  "timeStamp" : 1447116703147,
  "attributes" : { },
  "executionMode" : "STANDALONE",
  "metrics" : null,
  "retryAttempt" : 0,
  "nextRetryTimeStamp" : 0
}
Note that the timestamp is in the Long data format.
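To read the timestamp as a date, a script can convert the Long value. The following Python sketch assumes that the value counts milliseconds since the Unix epoch, which is consistent with the sample value but is not stated in this section. The function name is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def status_timestamp_utc(status_json):
    """Return the timeStamp of a pipeline status object as a UTC datetime.

    Assumes timeStamp holds milliseconds since the Unix epoch.
    """
    status = json.loads(status_json)
    # Integer arithmetic through timedelta avoids float rounding.
    return EPOCH + timedelta(milliseconds=status["timeStamp"])
```

Under that assumption, the sample value 1447116703147 corresponds to 2015-11-10 00:51:43.147 UTC.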
You can use the following manager subcommands:
- start
- Starts a pipeline. Returns the pipeline status when successful.
- stop
- Stops a pipeline. Returns the pipeline status when successful.
- status
- Returns the status of a pipeline. Returns the pipeline status when successful.
- reset-origin
- Resets the origin of a pipeline. Use for pipeline origins that can be reset. Some pipeline origins cannot be reset. Returns the pipeline status when successful.
- get-committed-offsets
- Returns the last-saved offset for a pipeline with an origin that saves offsets. Some origins, such as the HTTP Server, have no need to save offsets.
- update-committed-offsets
- Updates the last-saved offset for a pipeline with an origin that saves offsets. Some origins, such as the HTTP Server, have no need to save offsets.
Store Command
The store command provides subcommands to view a list of all pipelines and to import a pipeline.
You can use the following subcommands with the store command:
- list
- Lists all available pipelines. The list subcommand uses the following syntax:
store list
- import
- Imports a pipeline. Use to import a pipeline JSON file, typically exported from a Data Collector. Returns a message when the import is successful.
System Command
The system command provides subcommands to register and unregister the Data Collector with Control Hub.
You can use the following subcommands with the system command:
- enableDPM
- Registers the Data Collector with Control Hub. For a description of the syntax, see Registering from the Command Line Interface.
- disableDPM
- Unregisters the Data Collector from Control Hub. For a description of the syntax, see Unregistering from the Command Line Interface.