14.3 Processing Data

Data processing involves streaming event data and raw data to HBase tables and indexing data into Elasticsearch.

You must start or restart data processing in the following scenarios:

When you configure scalable storage for the first time.
When you modify the IP address or port number of Elasticsearch or any of the CDH components.
When data processing fails.
When you restart the CDH or YARN cluster.
When you reboot the CDH or YARN machine.
When you update the event criteria for Correlation Engine or delete the Correlation Engine.

To start or restart data processing:

Log in to the SSDM server as the novell user and copy the following files to the Spark history server where the HDFS NameNode is installed:

cd /etc/opt/novell/sentinel/scalablestore

scp SparkApp-*.jar avroevent-*.avsc avrorawdata-*.avsc spark.properties log4j.properties manage_spark_jobs.sh root@<hdfs_node>:<destination_directory>

where <hdfs_node> is the Spark history server host and <destination_directory> is the directory in which you want to place the copied files. Ensure that the hdfs user has full permissions on this directory.
Log in to the <hdfs_node> server as the root user and change the ownership of the copied files to the hdfs user:

cd <destination_directory>

chown hdfs SparkApp-*.jar avroevent-*.avsc avrorawdata-*.avsc spark.properties log4j.properties manage_spark_jobs.sh

Assign execute permission to the manage_spark_jobs.sh script.
(Conditional) To restart data processing, stop the currently running Spark jobs by running the following command:

./manage_spark_jobs.sh stop
Run the following command to start data processing:

./manage_spark_jobs.sh start

This command can take a while to complete.
(Optional) Run the following command to verify the data processing status:

./manage_spark_jobs.sh status
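
The steps performed on <hdfs_node> can be sketched as a small shell wrapper. This is a minimal sketch, not part of the product. The function names prepare_files and manage_data_processing, the run helper, and the DRY_RUN variable are hypothetical. The sketch assumes that manage_spark_jobs.sh returns a non-zero exit status on failure, which this section does not state, and it uses chmod +x as one common way to assign execute permission.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual steps; not shipped with the product.

# Print the command instead of executing it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# prepare_files DIR
# Run as root on <hdfs_node>: give the hdfs user ownership of the copied
# files and make the management script executable.
prepare_files() {
    dir=$1
    run chown hdfs "$dir"/SparkApp-*.jar "$dir"/avroevent-*.avsc \
        "$dir"/avrorawdata-*.avsc "$dir"/spark.properties \
        "$dir"/log4j.properties "$dir"/manage_spark_jobs.sh || return 1
    run chmod +x "$dir"/manage_spark_jobs.sh
}

# manage_data_processing start|restart [SCRIPT]
# SCRIPT defaults to ./manage_spark_jobs.sh, so call this from the
# destination directory. For a restart, the running Spark jobs are
# stopped first. Assumption: the script exits non-zero on failure.
manage_data_processing() {
    mode=$1
    script=${2:-./manage_spark_jobs.sh}
    case "$mode" in
        start) ;;
        restart) run "$script" stop || return 1 ;;
        *) echo "usage: manage_data_processing start|restart [script]" >&2
           return 2 ;;
    esac
    run "$script" start || return 1
    run "$script" status
}
```

For example, after copying the files, you could call prepare_files <destination_directory>, change to that directory, and then call manage_data_processing restart. Setting DRY_RUN=1 beforehand prints the commands without running them, which helps to review the sequence before touching the cluster.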