Monday, 8 December 2014

Firing up a Spark node on my Cassandra Dev Cluster


From the previous post, I've now got 2 data nodes on my local datacenter. I need to fire up a Spark node in its own virtual DC. 

I'm following the DataStax guide from here


Configure these:
  1. Make the Spark data dirs in the install dir:
    alteredcarbon:spark3 neil$ more mkDataDir.sh
    #!/bin/bash
    mkdir cassandra-data; cd cassandra-data
    mkdir data saved_caches commitlog
    mkdir spark
    mkdir spark/rdd spark/tmp spark/work

  2. Edit the data dirs in resources/spark/conf/spark-env.sh:
    export SPARK_TMP_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark/tmp"

    # Directory where RDDs will be cached
    export SPARK_RDD_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark/rdd"

    # The directory for storing master.log and worker.log files
    export SPARK_LOG_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark"


    export SPARK_WORKER_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark/work"
  3. Spark uses a local Cassandra node, so let's configure that
  4. Allocate the NIC: sudo ifconfig lo0 alias 127.0.0.4 up
  5. Configure JMX in resources/cassandra/conf/cassandra-env.sh:
    JMX_PORT="7188"

  6. Cassandra endpoints: resources/cassandra/conf/cassandra.yaml
    listen_address: 127.0.0.4
    rpc_address: 127.0.0.4

  7. Configure logging: vi resources/cassandra/conf/log4j-server.properties
  8. Configure /etc/hosts:
    127.0.0.4       localhost4 alteredcarbon4 alteredcarbon4.local
  9. Configure Cassandra data dirs (remember, spark runs on Cassandra nodes)
    /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data
    $ vi ../resources/cassandra/conf/cassandra.yaml
    # the configured compaction strategy.
    data_file_directories:
        - /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/data
    # commit log
    commitlog_directory: /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/commitlog
    # saved caches
    saved_caches_directory: /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/saved_caches


  10. Fire up Spark with a Cassandra node: $ bin/dse cassandra -f -k
  11. If all is good then you should see:
     INFO 15:00:04,865 SparkWorker: Starting remoting
     INFO 15:00:05,073 SparkWorker: Remoting started; listening on addresses :[akka.tcp://sparkWorker@127.0.0.4:54942]
     INFO 15:00:05,077 SparkWorker: Remoting now listens on addresses: [akka.tcp://sparkWorker@127.0.0.4:54942]
     INFO 15:00:05,379 SparkWorker: Starting Spark worker 127.0.0.4:54942 with 6 cores, 9.5 GB RAM
     INFO 15:00:05,380 SparkWorker: Spark home: /Volumes/BACKUP/DEV/TEMP/spark3/resources/spark
     INFO 15:00:05,643 SparkWorker: Started Worker web UI at http://192.168.228.1:7081
     INFO 15:00:05,645 SparkWorker: Connecting to master spark://127.0.0.4:7077...
     INFO 15:00:05,950 SparkMaster: Registering worker 127.0.0.4:54942 with 6 cores, 9.5 GB RAM
     INFO 15:00:05,953 SparkMaster: Adding worker 127.0.0.4
     INFO 15:00:06,046 SparkMaster: New Cassandra host /127.0.0.2:9042 added
     INFO 15:00:06,047 SparkMaster: New Cassandra host /127.0.0.1:9042 added
     INFO 15:00:06,047 SparkMaster: Connected to Cassandra cluster: Test Cluster
     INFO 15:00:06,047 SparkMaster: New Cassandra host /127.0.0.4:9042 added
     INFO 15:00:06,105 SparkWorker: Successfully registered with master spark://127.0.0.4:7077
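
    Steps 1, 2 and 9 above boil down to one directory layout plus four exports. A minimal sketch, with BASE standing in for the install dir (/Volumes/BACKUP/DEV/TEMP/spark3 in the post; a relative default is used here purely for illustration):

    #!/bin/bash
    # Sketch of steps 1, 2 and 9: create the data dirs, then export the
    # values that go into resources/spark/conf/spark-env.sh.
    BASE="${BASE:-$PWD/spark3}"

    # Cassandra data dirs (step 9)
    mkdir -p "$BASE/cassandra-data/data" "$BASE/cassandra-data/saved_caches" "$BASE/cassandra-data/commitlog"
    # Spark scratch dirs (step 1)
    mkdir -p "$BASE/cassandra-data/spark/rdd" "$BASE/cassandra-data/spark/tmp" "$BASE/cassandra-data/spark/work"

    # The four spark-env.sh settings (step 2)
    export SPARK_TMP_DIR="$BASE/cassandra-data/spark/tmp"
    export SPARK_RDD_DIR="$BASE/cassandra-data/spark/rdd"
    export SPARK_LOG_DIR="$BASE/cassandra-data/spark"
    export SPARK_WORKER_DIR="$BASE/cassandra-data/spark/work"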


    OpsCenter with 1 Analytics node: BOOM!






Wednesday, 3 December 2014

MultiNode Cassandra on Single Server (Data and Spark nodes)


Borrowed from this old post: One Man Clapping

A normal Cassandra installation requires root access as it writes to OS root perm'd directories.

I'm trying to fire up some Cassandra demos in a dev environment. Ideally this setup will give me a standalone dev environment to tinker about in.

As such we don't want to do a full install that writes to the standard OS system locations. Instead each node writes inside its own install directory, under /install/cassandra-data

I need the ability to run the following nodes:
  • Data node type (storage)
  • Analytics node type (spark)
I'm going to create multiple directories (Node1, Node2, etc.) and run each install without root permissions.

Do this:
  1. Configure a set of local IP aliases:
    sudo ifconfig lo0 alias 127.0.0.2 up
    sudo ifconfig lo0 alias 127.0.0.3 up
    
    
  2. Edit /etc/hosts to create entries: 
    127.0.0.1       localhost alteredcarbon alteredcarbon.local
    127.0.0.2       localhost2 alteredcarbon2 alteredcarbon2.local
  3. Extract dse4.5.3.tar => node1
  4. CD into node1
  5. Change the JMX port in the Cassandra env: vi resources/cassandra/conf/cassandra-env.sh
  6. JMX PORT: JMX_PORT="17199"
  7. Configure local output (we don't want to write logs to /etc on the dev machine)  (source)
$ mkdir cassandra-data; cd cassandra-data
$ mkdir data saved_caches commitlog

Then edit $ vi resources/cassandra/conf/cassandra.yaml


cluster_name: 'Dev Cluster'
initial_token: 0
data_file_directories:
    - path_to_install/cassandra-data/data
commitlog_directory: path_to_install/cassandra-data/commitlog
saved_caches_directory: path_to_install/cassandra-data/saved_caches


You don't need to change the default data ports: 7000, 7001; they are running on the aliased IPs, 127.0.0.1, 127.0.0.2 etc
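
The alias and /etc/hosts steps scale mechanically with node count, so they can be generated. A sketch that prints the commands and hosts lines rather than applying them (the ifconfig calls need sudo; "alteredcarbon" is the hostname from this post):

#!/bin/bash
# Print the loopback-alias commands and /etc/hosts entries for nodes 2..N.
# Nothing is applied; copy the output into a root shell / /etc/hosts yourself.
N=3
ALIASES=""
HOSTS=""
i=2
while [ "$i" -le "$N" ]; do
  ALIASES="${ALIASES}sudo ifconfig lo0 alias 127.0.0.$i up
"
  HOSTS="${HOSTS}127.0.0.$i       localhost$i alteredcarbon$i alteredcarbon$i.local
"
  i=$((i + 1))
done
printf '%s' "$ALIASES"
printf '%s' "$HOSTS"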

7. Configure the log output: 


$ vi ./resources/cassandra/conf/log4j-server.properties
log4j.appender.R.File=path_to_install/cassandra-data/system.log
log4j.appender.V.File=path_to_install/cassandra-data/solrvalidation.log
8. Now fire up a real-time data node (link):
$ ./dse cassandra -f
9. Or start a different node type (link):
  • dse -c: Enable the Cassandra File System (CFS) but not the integrated DSE jobtrackers and tasktrackers. Use to start nodes for running an external Hadoop system.
  • dse -v: Send the DSE version number to standard output.
  • dse cassandra: Start up a real-time Cassandra node in the background.
  • dse cassandra -f: Start up a real-time Cassandra node in the foreground. Can be used with the -k, -t, or -s options.
  • dse cassandra -k: Start up an analytics node in Spark mode in the background.
  • dse cassandra -k -t: Start up an analytics node in Spark and DSE Hadoop mode.
  • dse cassandra -s: Start up a DSE Search/Solr node in the background.
  • dse cassandra -s -Ddse.solr.data.dir=path: Use path to store Solr data.
  • dse cassandra -t: Start up an analytics node in the background.
  • dse cassandra -t -j: Start up an analytics node as the job tracker.
  • dse cassandra-stop -p pid: Stop the DataStax Enterprise process number (pid). If -p and the pid are omitted, the command stops the local node.
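
As a quick way to keep the start modes straight, here's a throwaway helper that maps a node role to its start command. The function name and role labels are mine, not DSE's; the flag mappings come from the table above:

#!/bin/bash
# Map a node role to its dse start command (per the option table).
# dse_cmd and the role names are made up for illustration.
dse_cmd() {
  case "$1" in
    realtime)  echo "dse cassandra" ;;
    spark)     echo "dse cassandra -k" ;;
    analytics) echo "dse cassandra -t" ;;
    search)    echo "dse cassandra -s" ;;
    *)         echo "unknown role: $1" >&2; return 1 ;;
  esac
}

dse_cmd spark      # prints: dse cassandra -k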
10. Fire up OpsCenter and we should see the node:
alteredcarbon:OPSCENTER neil$ opscenter-5.0.1/bin/opscenter
OpsCenter is at : http://localhost:8080
Note: The first time I tried this it didn't work, after a reboot it was successful. 
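
Since the lo0 aliases from step 1 are not persistent, it's worth checking they're actually present before starting nodes. A hedged sketch (it assumes macOS-style ifconfig output, where an alias shows up as an "inet" line under lo0):

#!/bin/bash
# Report whether a loopback alias is present. If ifconfig is missing or
# lo0 doesn't exist, this simply reports the address as missing.
check_alias() {
  if ifconfig lo0 2>/dev/null | grep -q "inet $1 "; then
    echo "$1 up"
  else
    echo "$1 missing"
  fi
}
check_alias 127.0.0.2
check_alias 127.0.0.3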
Want to do this without DSE? Then try: CCM