Cluster Performance Monitoring

Linux Tools for Performance Analysis

Command: top

  • The first tool linux sys-admins reach for

    • Gives a brief summary of the system resources
    • Also shows resource utilization and info on a per-process basis

  • Use top to make sure the Cassandra process is the main consumer of resources

Command: top Example

top - 21:50:11 up 2 days,  1:00,  2 users,  load average: 0.00, 0.01, 0.34
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:   3838800 total,  3282784 used,   556016 free,   118144 buffers
KiB Swap:        0 total,        0 used,        0 free.  1413044 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 7719 root      20   0  413216   2172   1348 S  0.7  0.1   4:52.92 collectd
11800 root      20   0 3074524 1.438g  58340 S  0.7 39.3  93:52.83 java
   27 root      20   0       0      0      0 S  0.3  0.0   0:11.36 kworker/0:1
    1 root      20   0   33364   2684   1492 S  0.0  0.1   0:01.51 init
  • Illustrates the two sections:

    • Summary section at the top
    • Per-process section at the bottom

Command: top Example - Summary Section

top - 21:50:11 up 2 days,  1:00,  2 users,  load average: 0.00, 0.01, 0.34
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:   3838800 total,  3282784 used,   556016 free,   118144 buffers
KiB Swap:        0 total,        0 used,        0 free.  1413044 cached Mem
  • First line (top):

    • Current time
    • How long the system has been running
    • Average number of users
    • CPU average load for 1 minute, 5 minutes and 15 minutes

Command: top Example - Summary Section

top - 21:50:11 up 2 days,  1:00,  2 users,  load average: 0.00, 0.01, 0.34
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:   3838800 total,  3282784 used,   556016 free,   118144 buffers
KiB Swap:        0 total,        0 used,        0 free.  1413044 cached Mem
  • Second line (Tasks):

    • Total number of proceeses
    • Count of running processes
    • Count of sleeping processes
    • Count of stopped proceeses
    • Count of zombie processes

Command: top Example - Summary Section

top - 21:50:11 up 2 days,  1:00,  2 users,  load average: 0.00, 0.01, 0.34
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:   3838800 total,  3282784 used,   556016 free,   118144 buffers
KiB Swap:        0 total,        0 used,        0 free.  1413044 cached Mem
  • Third line (%Cpu(s)):

    • User space CPU utilization
    • System space CPU utilization
    • Low priority process CPU utilization
    • Idle CPU percentage
    • CPU waiting percentage
    • Hardware interrupt percentage CPU utilization
    • Software interrupt percentage CPU utilization
    • Steal percentage CPU utilization

Command: top Example - Summary Section

top - 21:50:11 up 2 days,  1:00,  2 users,  load average: 0.00, 0.01, 0.34
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:   3838800 total,  3282784 used,   556016 free,   118144 buffers
KiB Swap:        0 total,        0 used,        0 free.  1413044 cached Mem
  • Fourth line (KiB Mem):

    • Total memory (KB)
    • Amount of used memory (KB)
    • Amount of unallocated memory (KB)
    • Amount of memory the system uses for file metadata (KB)

Command: top Example - Summary Section

top - 21:50:11 up 2 days,  1:00,  2 users,  load average: 0.00, 0.01, 0.34
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
KiB Mem:   3838800 total,  3282784 used,   556016 free,   118144 buffers
KiB Swap:        0 total,        0 used,        0 free.  1413044 cached Mem
  • Fifth line (KiB Swap):

    • Total allocated swap space (KB)
    • Used swap space (KB)
    • Unused swap space (KB)
    • System space used for caching file data (KB)

Command: top Example - Per-process Section

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 7719 root      20   0  413216   2172   1348 S  0.7  0.1   4:52.92 collectd
11800 root      20   0 3074524 1.438g  58340 S  0.7 39.3  93:52.83 java
   27 root      20   0       0      0      0 S  0.3  0.0   0:11.36 kworker/0:1
    1 root      20   0   33364   2684   1492 S  0.0  0.1   0:01.51 init
  • Each row represents a process

    • Process ID
    • Process owner
    • Priority
    • Nice level
    • Amount of virtual memory used
    • Amount of physical memory used
    • Amount of shared memory used
    • Percentage os processor utilization by this process
    • Percentage of memory used by this process
    • The length of time the process has been running
    • The process command

Command: dstat

  • Shows all system resources in a single table
  • One-stop tool includes stats like:

    • iostat
    • vmstat
    • ifstat

  • Configurable output (see man page)
  • Default shows CPU, Disk IO, Network, page and system stats

Example Usage

  • The -a option gives all the default stats
  • So -am gives all the defaults plus memory stats
dstat -am

Reading the Output

dstat -am
----total-cpu-usage---- -dsk/total- -net/total- -paging- --system-- ----memory-usage-----
usr sys idl wai hiq siq| read  writ| recv  send| in out | int  csw | used buff  cach free
 74  25   0   0   0   1|   0     0 | 456k  102k|  0   0 |1819 5742 |1712M 114M 1631M 291M
 92   7   0   0   0   1|   0     0 |  97k   27k|  0   0 | 753 2377 |1712M 114M 1632M 291M
 97   3   0   0   0   0|   0     0 |  12k   10k|  0   0 | 272  278 |1712M 114M 1633M 290M
 76  24   0   0   0   0|   0  2008k| 414k   84k|  0   0 |1931 7739 |1713M 114M 1633M 289M
  • Notice the default categories:

    • CPU
    • Disk
    • Network
    • Paging
    • System

  • Also the addition of Memory at the end

CPU Stats

----total-cpu-usage----
usr sys idl wai hiq siq
 74  25   0   0   0   1
 92   7   0   0   0   1
 97   3   0   0   0   0
 76  24   0   0   0   0
  • usr indicates percent of time spent in user mode execution
  • sys indicates percent of time spent in system (i.e., kernel) mode execution
  • idl indicates percent of time the CPU was idle
  • wai indicates percent of time waiting for IO
  • hiq indicates percent of time processing hardware interrupts
  • siq indicates percent of time processing software interrupts

Disk Stats

-dsk/total-
 read  writ
   0     0
   0     0
   0     0
   0  2008k
  • read indicates the amount of data read
  • writ indicates the amount of data written
  • To get a feel for disk utilization, know disk transfer rates

    • HDDs can transfer tens or MBS
    • SSDs can transfer hundreds of MBS

Network Stats

-net/total-
 recv  send
 456k  102k
  97k   27k
  12k   10k
 414k   84k
  • recv indicates the amount of network data received
  • send indicates the amount of network data sent
  • Network speeds vary, but Gigabit ethernet is about 100 MBS

Paging and System Stats

---paging-- ---system--
  in   out | int   csw
   0     0 |1819  5742
   0     0 | 753  2377
   0     0 | 272   278
   0     0 |1931  7739
  • Paging stats indicate the amount of memory being paged

    • If the paging stats are not zero, you have a memory problem that you need to investigate

  • System stats can be an indication of process contention

    • int indicates the number of interrupts serviced
    • csw indicates the number of process context switches

Memory Stats

------memory-usage-----
 used  buff  cach  free
1712M  114M 1631M  291M
1712M  114M 1632M  291M
1712M  114M 1633M  290M
1713M  114M 1633M  289M
  • used is the total amount of memory that is allocated
  • buff is the amount of memory used by the operating systems for file metadata
  • cach is the amount of memory used by the operating system for file data
  • free is the amount of unallocated memory

nodetool for Performance Analysis

nodetool is Cassandra’s Swiss Army Knife

  • This section focuses on using nodetool for performance analysis
  • We will look at the following sub-commands:

    • info
    • compactionhistory
    • gcstats
    • gossipinfo
    • ring
    • tablestats
    • tablehistograms
    • tpstats

nodetool info

  • Node-specific info
  • Output is self-explanatory
  • Performance points of interest include:

    • Load - amount of space used for SSTables
    • Heap Memory/Off Heap Memory - amount of RAM used by Cassandra for memtables, etc.
    • Key Cache - check out the hit rate

nodetool info

Example output:

ID                     : 57679f16-da27-4804-95bb-76c6bd36da72
Gossip active          : true
Thrift active          : true
Native Transport active: true
Load                   : 48.98 MB
Generation No          : 1482181943
Uptime (seconds)       : 245146
Heap Memory (MB)       : 594.93 / 1024.00
Off Heap Memory (MB)   : 1.15
Data Center            : Cassandra
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 148994, size 12.23 MB, capacity 51 MB, 2080730 hits, 2541295 requests, 0.819 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 25 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token                  : (invoke with -T/--tokens to see all 128 tokens)

Row cache entries are 0. This is good - the kernel caches file blocks, so the kernel is already caching rows for Cassandra. If Cassandra also caches the rows, the used space is redundant and affects performance negatively.

nodetool compactionhistory

  • Node-specific compaction history
  • Lists and describes the compaction actions that have taken place
  • The columns explain

    • What was compacted (first 3 columns)
    • When the compaction took place
    • The input and output size of the SSTables
    • The number of tables/rows compacted into a new SSTable: {# tables: # rows}

nodetool compactionhistory

Example output (first 3 columns):

Compaction History:
id                                       keyspace_name      columnfamily_name
716cecc0-c76f-11e6-8610-09423bb027f3     system             size_estimates
aa88e810-c693-11e6-8610-09423bb027f3     system             size_estimates
d8ca3a70-c63f-11e6-8610-09423bb027f3     system             local
...

Example output (last 4 columns):

Compaction History:
compacted_at              bytes_in       bytes_out      rows_merged
1482319213964             3530           553            {4:8}
1482224820497             3546           533            {4:8}
1482188820375             3192           2898           {4:1}
...

nodetool gcstats

  • Java’s garbage collection can impact performance - gcstats tells about the impact
  • Node-specific info on the JVM garbage collection
  • Measurements accumulate since previous invocation of gcstats
  • Reports:

    • Length of time for measurements
    • Amount of time spent garbage collecting
    • Amount of memory collected
    • Number of collection sessions

nodetool gcstats

Example output (first 4 columns)

Interval (ms) Max GC Elapsed (ms)Total GC Elapsed (ms)Stdev GC Elapsed (ms)
       207123                 289                1059                  14

Example output (last 3 columns)

GC Reclaimed (MB)         Collections      Direct Memory Bytes
     1887436296                   4                       -1

GC Reclaimed is mislabeled - it is actaully bytes.

nodetool gossipinfo

  • Cluster-level info
  • Shows the current state of gossip entries according to the target node
  • Entries have:

    • Label for the entry
    • Version number (except generation and hearbeat)
    • The value for the entry

nodetool gossipinfo

Example output:

/172.31.11.213
  generation:1482181943
  heartbeat:780755
  STATUS:27:NORMAL,-1085930816098221712
  LOAD:780740:5.3671796E7
  SCHEMA:518274:848dcb3f-63a5-3b43-8a3e-906016012448
  DC:63:Cassandra
  RACK:12:rack1
  RELEASE_VERSION:4:3.0.9.1346
  RPC_ADDRESS:3:172.31.11.213
  X_11_PADDING:264722:{"dse_version":"5.0.3","workload":"Cassandra","server_id":"0A-DF-99-13-F0-65","graph":false,"active":"true","health":0.9}
  SEVERITY:780757:0.0
  NET_VERSION:1:10
  HOST_ID:2:57679f16-da27-4804-95bb-76c6bd36da72
  RPC_READY:152:true
  TOKENS:26:<hidden>

nodetool ring

  • Shows info about tokens in the ring
  • Cluster-level, table-level from the perspective of the queried node
  • Use to determine the balance of tokens in the cluster
  • Compare the load column to determine the balance

nodetool ring

Example Execution:

root@node0:~# node/bin/nodetool ring -- killr_video

Datacenter: Cassandra
==========
Address        Rack        Status State   Load            Owns                Token
                                                                              9154262618015766378
172.31.11.213  rack1       Up     Normal  51.21 MB        100.00%             -8848252007366263074
172.31.11.213  rack1       Up     Normal  51.21 MB        100.00%             -8838326952028015898
172.31.11.213  rack1       Up     Normal  51.21 MB        100.00%             -8636663704484743815
...

nodetool tablestats

  • Shows info about keyspaces/tables flushed to disk
  • Node/Keyspace-level info
  • Use the -H option to get human-readable output
  • Use this command to get a sense of resources used by tables

nodetool tablestats

Example Execution:

root@node0:~# node/bin/nodetool tablestats -H -- killr_video
Keyspace: killr_video
	Read Count: 1519021
	Read Latency: 1.7196091792016042 ms.
	Write Count: 540301
	Write Latency: 0.35517111017747516 ms.
	Pending Flushes: 0
		Table: user_by_email
		SSTable count: 2
		Space used (live): 29.38 MB
		Space used (total): 29.38 MB
		Space used by snapshots (total): 0 bytes
		Off heap memory used (total): 476.69 KB
		SSTable Compression Ratio: 0.8082841406710903
		Number of keys (estimate): 229417
    ...
    (Much more output - not shown on this slide)

nodetool tablehistograms

  • Provides statistics about a specific table
  • Table-focused command (must supply keyspace and table)
  • Use this command to understand table latencies

Example execution:

root@node0:~# /root/node/bin/nodetool tablehistograms killr_video user_by_email
killr_video/user_by_email histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size     Cell Count
                              (micros)          (micros)           (bytes)
50%             0.00             17.08            219.34                86              2
75%             0.00             17.08            219.34               103              2
95%             0.00             20.50          20924.30               103              2
98%             0.00             29.52         223875.79               103              2
99%             0.00             88.15         464228.84               103              2
Min             0.00              8.24             17.09                51              2
Max             0.00        1155149.91        3449259.15               124              2

nodetool tpstats

  • Gives info about thread pools
  • Thread pools handle requests for various activities

    • Each row represents a class of activities

  • Accepted requests can be active, pending or completed
  • If the queue is full, the request is blocked

    • blocked requests indicate a resource problem

  • Requests can time out - these messages are dropped

nodetool tpstats

  • An explanation of pool request states:

    • Active - the pool is working on the request
    • Pending - the request is queued for the pool
    • Completed - the request is finished
    • Blocked - the queue was full so the request was rejected

nodetool tpstats

Example output:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     9         0       13352418         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0        8754502         0                 0
RequestResponseStage              0         0              0         0                 0
ReadRepairStage                   0         0              0         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0         187110         0                 0
MemtableReclaimMemory             0         0            461         0                 0
PendingRangeCalculator            0         0              1         0                 0
GossipStage                       0         0              0         0                 0
... (more stats not shown on this slide)
Native-Transport-Requests         5         0       25854489         0            101210

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                   199
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

System & Output Logs

Sometimes nodetool is not enough

  • For investigating in-depth problems, use logging

    • Cassandra’s system.log for monitoring and diagnosing node behavior
    • Java’s GC logging for tracking down garbage collection related issues

Cassandra’s Logging

  • By default, the log file is in /var/log/cassandra/system.log
  • Also check debug.log in the same directory

    • system.log logs INFO messages and above
    • debug.log logs all messages

  • Change the location by adding the following line to cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.logdir=<LOG_DIRECTORY_GOES_HERE>"

Cassandra’s Logging

  • There are 7 logging levels:

    • OFF
    • ERROR
    • WARN
    • INFO (Default)
    • DEBUG
    • TRACE
    • ALL

Each level includes its messages and all those at higher levels

Cassandra’s Logging

  • More is not always better - too much logging may NOT be helpful

    • Increased logging affects system performance
    • Lots of log entries may be overwhelming to understand

  • To keep logging manageable, set different levels for different packages or classes

Cassandra’s Logging

  • Two ways to configure logging:

    • Using the configuration files logback-test.xml or logback.xml
    • Using nodetool setlogginglevel (only sets logging levels)

Cassandra’s Logging

  • Logging configuration files

    • In the same directory as cassandra.yaml
    • Looks first for logback-test.xml otherwise logback.xml
    • Restart node for changes to the file to take affect

Cassandra’s Logging

Configure log management (rolling) and output pattern

Example logback.xml:

<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${cassandra.logdir}/system.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <fileNamePattern>${cassandra.logdir}/system.log.%i.zip</fileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>20</maxIndex>
    </rollingPolicy>

    <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <maxFileSize>20MB</maxFileSize>
    </triggeringPolicy>
    <encoder>
      <pattern>%-5level [%thread] %date{ISO8601} %F:%L - %msg%n</pattern>
    </encoder>
  </appender>
  ...
  • RollingFileAppender rolls files over based on:

Cassandra’s Logging

Enable specific logging levels for certain packages or classes

Example logback.xml:

...
<!--Enable for debugging of Lease problems -->
<logger name="com.datastax.bdp.leasemanager" level="TRACE"/>

<!--Enable for debugging of StorageProxy problems -->
<logger name="org.apache.cassandra.service.StorageProxy" level="DEBUG"/>
...

Cassandra’s Logging

  • nodetool setlogginglevel talks directly to the process

    • No need to restart the node for changes to take affect
    • Changes will not persist if node is restarted

Example:

nodetool setlogginglevel org.apache.cassandra.service.StorageProxy DEBUG

Cassandra’s Logging

  • Also get the current logging levels

Example:

root@node0:~# node/bin/nodetool getlogginglevels

Logger Name                                        Log Level
ROOT                                                    INFO
DroppedAuditEventLogger                                 INFO
SLF4JAuditWriter                                        INFO
SolrValidationErrorLogger                              ERROR
com.cryptsoft                                            OFF
org.apache.cassandra                                   DEBUG
...

JVM Garbage Collection Logging

  • GC logging helps you know exactly what the JVM is doing
  • Find out:

    • When GC occurred
    • How much memory GC reclaimed
    • How much memory is available in the heap

JVM Garbage Collection Logging

  • Turn on GC logging

    • Statically by editing cassandra-env.sh
    • Dynamically by using jinfo

  • In either case, edit cassandra-env.sh to specify the GC log file
JVM_OPTS="$JVM_OPTS -Xloggc:<LOG_FILE_NAME_GOES_HERE>"

JVM Garbage Collection Logging

  • For static configuration:
  • Edit cassandra-env.sh
  • You can use these options:

    • -XX:+PrintGC - Simple, prints a line for every GC and every full GC
    • -XX:+PrintGCDetails - Detailed, young generation as well as old and perm gen
    • -XX:+PrintGCTimeStamps - Adds time to a simple or detailed GC log
    • -XX:+PrintGCDateStamps - Adds date information to a simple or detailed GC log

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -Xloggc:/root/gc.log"

JVM Garbage Collection Logging

For dynamic configuration, use jinfo

jinfo -flag +PrintGC <NODE_PROCESS_ID_GOES_HERE>
jinfo -flag +PrintGCTimeStamps <NODE_PROCESS_ID_GOES_HERE>
jinfo -flag +PrintGCDateStamps <NODE_PROCESS_ID_GOES_HERE>

JVM Garbage Collection Logging

Example output:

8109.128: [GC [PSYoungGen: 109884K->14201K(139904K)] 691015K->595332K(1119040K), 0.0454530 secs]
  • 8109.128 - the number of seconds since the JVM started
  • PSYoungGen (young GC):

    • 109884K→14201K - heap before and after sizes
    • (139904K) - Amount of young GC space available

  • 0.0454530 secs - time required for GC

Exercise 3—​Monitor, Measure and Diagnose Cluster Performance