Monday, 8 December 2014

Firing up a Spark node on my Cassandra Dev Cluster


From the previous post, I've now got 2 data nodes on my local datacenter. I need to fire up a Spark node in its own virtual DC. 

I'm following the DataStax guide from here


Configure these:
  1. Make the Spark data dirs in the install dir:
    alteredcarbon:spark3 neil$ more mkDataDir.sh
    #!/bin/bash
    mkdir cassandra-data; cd cassandra-data
    mkdir data saved_caches commitlog
    mkdir spark
    mkdir spark/rdd spark/tmp spark/work

  2. Edit the data dirs in resources/spark/conf/spark-env.sh:
    export SPARK_TMP_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark/tmp"

    # Directory where RDDs will be cached
    export SPARK_RDD_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark/rdd"

    # The directory for storing master.log and worker.log files
    export SPARK_LOG_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark"


    export SPARK_WORKER_DIR="/Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/spark/work"
  3. Spark uses a local Cassandra node, so let's configure that
  4. Allocate the NIC: sudo ifconfig lo0 alias 127.0.0.4 up
  5. Configure JMX in resources/cassandra/conf/cassandra-env.sh:
    JMX_PORT="7188"

  6. Cassandra endpoints: resources/cassandra/conf/cassandra.yaml
    listen_address: 127.0.0.4
    rpc_address: 127.0.0.4

  7. Configure logging: vi resources/cassandra/conf/log4j-server.properties
  8. Configure /etc/hosts:
    127.0.0.4       localhost4 alteredcarbon4 alteredcarbon4.local
  9. Configure Cassandra data dirs (remember, spark runs on Cassandra nodes)
    /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data
    $ vi ../resources/cassandra/conf/cassandra.yaml
    # the configured compaction strategy.
    data_file_directories:
        - /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/data
    # commit log
    commitlog_directory: /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/commitlog
    # saved caches
    saved_caches_directory: /Volumes/BACKUP/DEV/TEMP/spark3/cassandra-data/saved_caches


  10. Fire up Spark with a Cassandra node: $ bin/dse cassandra -f -k
  11. If all is good then you should see:
     INFO 15:00:04,865 SparkWorker: Starting remoting
     INFO 15:00:05,073 SparkWorker: Remoting started; listening on addresses :[akka.tcp://sparkWorker@127.0.0.4:54942]
     INFO 15:00:05,077 SparkWorker: Remoting now listens on addresses: [akka.tcp://sparkWorker@127.0.0.4:54942]
     INFO 15:00:05,379 SparkWorker: Starting Spark worker 127.0.0.4:54942 with 6 cores, 9.5 GB RAM
     INFO 15:00:05,380 SparkWorker: Spark home: /Volumes/BACKUP/DEV/TEMP/spark3/resources/spark
     INFO 15:00:05,643 SparkWorker: Started Worker web UI at http://192.168.228.1:7081
     INFO 15:00:05,645 SparkWorker: Connecting to master spark://127.0.0.4:7077...
     INFO 15:00:05,950 SparkMaster: Registering worker 127.0.0.4:54942 with 6 cores, 9.5 GB RAM
     INFO 15:00:05,953 SparkMaster: Adding worker 127.0.0.4
     INFO 15:00:06,046 SparkMaster: New Cassandra host /127.0.0.2:9042 added
     INFO 15:00:06,047 SparkMaster: New Cassandra host /127.0.0.1:9042 added
     INFO 15:00:06,047 SparkMaster: Connected to Cassandra cluster: Test Cluster
     INFO 15:00:06,047 SparkMaster: New Cassandra host /127.0.0.4:9042 added
     INFO 15:00:06,105 SparkWorker: Successfully registered with master spark://127.0.0.4:7077
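
    Steps 1, 2 and 9 above boil down to one directory layout plus four exports. A minimal sketch, with BASE standing in for the install dir (/Volumes/BACKUP/DEV/TEMP/spark3 in the post; a relative default is used here purely for illustration):

    #!/bin/bash
    # Sketch of steps 1, 2 and 9: create the data dirs, then export the
    # values that go into resources/spark/conf/spark-env.sh.
    BASE="${BASE:-$PWD/spark3}"

    # Cassandra data dirs (step 9)
    mkdir -p "$BASE/cassandra-data/data" "$BASE/cassandra-data/saved_caches" "$BASE/cassandra-data/commitlog"
    # Spark scratch dirs (step 1)
    mkdir -p "$BASE/cassandra-data/spark/rdd" "$BASE/cassandra-data/spark/tmp" "$BASE/cassandra-data/spark/work"

    # The four spark-env.sh settings (step 2)
    export SPARK_TMP_DIR="$BASE/cassandra-data/spark/tmp"
    export SPARK_RDD_DIR="$BASE/cassandra-data/spark/rdd"
    export SPARK_LOG_DIR="$BASE/cassandra-data/spark"
    export SPARK_WORKER_DIR="$BASE/cassandra-data/spark/work"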


    OpsCenter with 1 Analytics node: BOOM!






Wednesday, 3 December 2014

MultiNode Cassandra on Single Server (Data and Spark nodes)


Borrowed from this old post: One Man Clapping

A normal Cassandra installation requires root access as it writes to OS root perm'd directories.

I'm trying to fire up some Cassandra demos in a dev environment. Ideally this setup will give me a standalone dev environment to tinker about in.

As such we don't want to do a full install that writes to the standard OS system locations. Instead each node writes inside its own install directory, under /install/cassandra-data

I need the ability to run the following nodes:
  • Data node type (storage)
  • Analytics node type (spark)
I'm going to create multiple directories (Node1, Node2, etc.) and run each install without root permissions.

Do this:
  1. Configure a set of local IP aliases:
    sudo ifconfig lo0 alias 127.0.0.2 up
    sudo ifconfig lo0 alias 127.0.0.3 up
    
    
  2. Edit /etc/hosts to create entries: 
    127.0.0.1       localhost alteredcarbon alteredcarbon.local
    127.0.0.2       localhost2 alteredcarbon2 alteredcarbon2.local
  3. Extract dse4.5.3.tar => node1
  4. CD into node1
  5. Change the JMX port in the Cassandra env: vi resources/cassandra/conf/cassandra-env.sh
  6. JMX PORT: JMX_PORT="17199"
  7. Configure local output (we don't want to write logs to /etc on the dev machine)  (source)
$ mkdir cassandra-data; cd cassandra-data
$ mkdir data saved_caches commitlog

Then edit $ vi resources/cassandra/conf/cassandra.yaml


cluster_name: 'Dev Cluster'
initial_token: 0
data_file_directories:
    - path_to_install/cassandra-data/data
commitlog_directory: path_to_install/cassandra-data/commitlog
saved_caches_directory: path_to_install/cassandra-data/saved_caches


You don't need to change the default data ports: 7000, 7001; they are running on the aliased IPs, 127.0.0.1, 127.0.0.2 etc
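
The alias and /etc/hosts steps scale mechanically with node count, so they can be generated. A sketch that prints the commands and hosts lines rather than applying them (the ifconfig calls need sudo; "alteredcarbon" is the hostname from this post):

#!/bin/bash
# Print the loopback-alias commands and /etc/hosts entries for nodes 2..N.
# Nothing is applied; copy the output into a root shell / /etc/hosts yourself.
N=3
ALIASES=""
HOSTS=""
i=2
while [ "$i" -le "$N" ]; do
  ALIASES="${ALIASES}sudo ifconfig lo0 alias 127.0.0.$i up
"
  HOSTS="${HOSTS}127.0.0.$i       localhost$i alteredcarbon$i alteredcarbon$i.local
"
  i=$((i + 1))
done
printf '%s' "$ALIASES"
printf '%s' "$HOSTS"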

7. Configure the log output: 


$ vi ./resources/cassandra/conf/log4j-server.properties
log4j.appender.R.File=path_to_install/cassandra-data/system.log
log4j.appender.V.File=path_to_install/cassandra-data/solrvalidation.log
8. Now fire up a real-time data node (link):
$ ./dse cassandra -f
9. Or start a different node type (link):
  • dse -c: Enable the Cassandra File System (CFS) but not the integrated DSE jobtrackers and tasktrackers. Use to start nodes for running an external Hadoop system.
  • dse -v: Send the DSE version number to standard output.
  • dse cassandra: Start up a real-time Cassandra node in the background.
  • dse cassandra -f: Start up a real-time Cassandra node in the foreground. Can be used with the -k, -t, or -s options.
  • dse cassandra -k: Start up an analytics node in Spark mode in the background.
  • dse cassandra -k -t: Start up an analytics node in Spark and DSE Hadoop mode.
  • dse cassandra -s: Start up a DSE Search/Solr node in the background.
  • dse cassandra -s -Ddse.solr.data.dir=path: Use path to store Solr data.
  • dse cassandra -t: Start up an analytics node in the background.
  • dse cassandra -t -j: Start up an analytics node as the job tracker.
  • dse cassandra-stop -p pid: Stop the DataStax Enterprise process number (pid). If -p and the pid are omitted, the command stops the local node.
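
As a quick way to keep the start modes straight, here's a throwaway helper that maps a node role to its start command. The function name and role labels are mine, not DSE's; the flag mappings come from the table above:

#!/bin/bash
# Map a node role to its dse start command (per the option table).
# dse_cmd and the role names are made up for illustration.
dse_cmd() {
  case "$1" in
    realtime)  echo "dse cassandra" ;;
    spark)     echo "dse cassandra -k" ;;
    analytics) echo "dse cassandra -t" ;;
    search)    echo "dse cassandra -s" ;;
    *)         echo "unknown role: $1" >&2; return 1 ;;
  esac
}

dse_cmd spark      # prints: dse cassandra -k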
10. Fire up OpsCenter and we should see the node:
alteredcarbon:OPSCENTER neil$ opscenter-5.0.1/bin/opscenter
OpsCenter is at : http://localhost:8080
Note: The first time I tried this it didn't work, after a reboot it was successful. 
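
Since the lo0 aliases from step 1 are not persistent, it's worth checking they're actually present before starting nodes. A hedged sketch (it assumes macOS-style ifconfig output, where an alias shows up as an "inet" line under lo0):

#!/bin/bash
# Report whether a loopback alias is present. If ifconfig is missing or
# lo0 doesn't exist, this simply reports the address as missing.
check_alias() {
  if ifconfig lo0 2>/dev/null | grep -q "inet $1 "; then
    echo "$1 up"
  else
    echo "$1 missing"
  fi
}
check_alias 127.0.0.2
check_alias 127.0.0.3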
Want to do this without DSE? Then try: CCM