VMware vROPS 6.x Cluster Having Poor Performance

Overview:
In VMware vROPS 6.x, the Cassandra database load on each clustered node can sometimes grow very high, as shown in the screenshot below. VMware states that this issue is fixed in version 6.6. The problem can also cause a “Failed to Disable” HA error on the admin UI page.
Prerequisites:
- Make sure snapshots exist for all nodes of the cluster.
- Make sure there is a recent, successful image-level backup for every node of the cluster.
Procedure:
- In the Admin UI, take all nodes offline by clicking “Take Offline” under “Cluster Status”.
- If this button is greyed out or not available, select each node and click “Take Node Offline”.
- If neither option works, use the following steps as an alternative.
- Log in to each node of the cluster as the root user and run the commands below.
- cloudpandavrops1:~ # service vmware-casa stop
- cloudpandavrops1:~ # service vmware-vcops stop
- Take the nodes offline in this order: data nodes, master replica, then master node.
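When the Admin UI cannot take the nodes offline, the manual stop described above can be scripted so that the order is not left to memory. The sketch below is only an illustration: the host names are hypothetical placeholders, it assumes root SSH access to every node, and it defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical host names; replace them with the nodes of your own cluster.
DATA_NODES="vrops-data1 vrops-data2"
REPLICA_NODE="vrops-replica"
MASTER_NODE="vrops-master"

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Run a command on one node over SSH, or just print it in dry-run mode.
run_on() {
    target="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@$target $*"
    else
        ssh "root@$target" "$@"
    fi
}

# Stop casa and vcops on every node: data nodes, then replica, then master.
take_cluster_offline() {
    for node in $DATA_NODES $REPLICA_NODE $MASTER_NODE; do
        run_on "$node" service vmware-casa stop
        run_on "$node" service vmware-vcops stop
    done
}

take_cluster_offline
```

Reviewing the dry-run output first makes it easy to confirm that the master node is the last one to be stopped.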
- Force the Cassandra DB online so that we can work with it without reads/writes taking place.
- cloudpandavrops1:~ # service vmware-vcops start cassandra force
- Once the Cassandra DB is online, run commands against it to truncate the following three tables:
- globalpersistence.activity_2_tbl
- globalpersistence.activityresults_tbl
- globalpersistence.queueid_tbl
- Before executing the DB commands, check the load of each vROPS Cassandra DB node.
- cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cass*/bin/nodetool -p 9008 status
- cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
- cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activity_2_tbl" &
- cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.activityresults_tbl" &
- cloudpandavrops1:~ # nohup $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh --ssl --cqlshrc $VCOPS_BASE/user/conf/cassandra/cqlshrc -e "consistency quorum; truncate globalpersistence.queueid_tbl" &
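Since the three truncate commands above differ only in the table name, they can be generated from a list. The following is a minimal sketch, not part of the original procedure: the helper name `truncate_cmd` is made up for illustration, the cqlsh path and flags are copied from the commands above, the fallback value for `VCOPS_BASE` is an assumption, and the script only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Assumption: VCOPS_BASE is already set on the appliance. The fallback is the
# directory that appears in the jmxremote.password path used in this article.
VCOPS_BASE="${VCOPS_BASE:-/usr/lib/vmware-vcops}"
CQLSH="$VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/cqlsh"
CQLSHRC="$VCOPS_BASE/user/conf/cassandra/cqlshrc"

# Print (do not run) the cqlsh command that truncates one table.
truncate_cmd() {
    printf '%s --ssl --cqlshrc %s -e "consistency quorum; truncate %s"\n' \
        "$CQLSH" "$CQLSHRC" "$1"
}

for table in globalpersistence.activity_2_tbl \
             globalpersistence.activityresults_tbl \
             globalpersistence.queueid_tbl; do
    truncate_cmd "$table"
done
```

Printing rather than executing keeps the sketch safe to try anywhere; the printed lines match the manual commands apart from the `nohup ... &` wrapper.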
- Once these tables are truncated, run a repair operation against the DB to ensure all nodes are in sync.
- cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password repair -par
- Once everything is in sync, confirm that the load on the Cassandra DB has dropped (in this environment it went from 18 GB to 1 GB).
- cloudpandavrops1:~ # $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl -u maintenanceAdmin --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password status
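To compare the load before and after the truncate and repair, it can help to reduce the `nodetool status` output to one line per node. The sketch below assumes the Cassandra 2.1 layout, where each node line starts with a two-letter status/state code (for example `UN`) followed by the address and a load value with a separate unit. `node_loads` is a hypothetical helper name, and the column positions may differ in other Cassandra versions.

```shell
#!/bin/sh
# Read `nodetool status` output on stdin and print "<address> <load> <unit>"
# for every node line. Node lines are recognised by their first column:
# U/D (up/down) followed by N/L/J/M (normal/leaving/joining/moving).
node_loads() {
    awk '$1 ~ /^[UD][NLJM]$/ { print $2, $3, $4 }'
}

# Example usage on a vROPS node (command taken from the step above):
#   $VCOPS_BASE/cassandra/apache-cassandra-2.1.8/bin/nodetool -p 9008 --ssl \
#       -u maintenanceAdmin \
#       --password-file /usr/lib/vmware-vcops/user/conf/jmxremote.password \
#       status | node_loads
```

Saving the output once before the truncate and once after the repair gives a simple per-node comparison.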
- Bring the cluster back online and, after some time, check that all objects are back in a “Collecting” or “Data Receiving” state.
- If the cluster won’t come online, try forcing the Cassandra DB offline (this step is optional).
- cloudpandavrops1:~ # service vmware-vcops stop cassandra force
- Once that completes, bring the nodes online in the reverse order: master node, master replica, then data nodes.
- cloudpandavrops1:~ # service vmware-casa start
- cloudpandavrops1:~ # service vmware-vcops start
- Then click around in the environment to confirm that the UI is as responsive as expected.
- Once the environment is confirmed to be back online and behaving as expected, check whether the admin UI shows an HA error such as “Failed to disable HA”.
- If this error appears, follow the steps below to correct it. The error itself does not affect cluster functionality.
- This requires taking the casa and vROPS services offline so that we can edit a file that casa reads at startup and thereby clear the error on this page.
- cloudpandavrops1:~ # service vmware-casa stop
- cloudpandavrops1:~ # service vmware-vcops stop
- cloudpanda01:~ # vi /storage/db/casa/webapp/hsqldb/casa.db.script
- Change "is_ha_enabled":failed to disable to "is_ha_enabled":true
- Change "initialization_state":"failed to disable" to "initialization_state":"NONE"
- After the change, the line should look something like this sample:
INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"ONLINE","cluster_name":"vROPS-Prod","is_ha_enabled":true,"ha_transition_state":"NONE","initialization_state":"NONE","remove_node_state":"NONE","document_version":84,"document_time":1515169871248,"online_state":"ONLINE","online_state_time":1515169871242,"online_state_reason":"","cluster_members":[],"admin_slices":[],"installation_state":"DONE","slices":{"a436f79c-dc0c-40ec-a915-b7e256ba6ef6":{"slice_uuid":"a436f79c-dc0c-40ec-a915-b7e256ba6ef6","is_admin_node":true,"ip_address":"cloudpandavrops1.ce.corp.com","preferred_addresses":{},"slice_name":"cloudpandavrops1","membership_state":null},"0cdd8bc1-1610-411e-9c8b-fae36b46857a":{"slice_uuid":"0cdd8bc1-1610-411e-9c8b-fae36b46857a","is_admin_node":false,"ip_address":"cloudpandavrops2.ce.corp.com","preferred_addresses":{},"slice_name":"cloudpandavrops2","membership_state":null}}}')
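Before and after editing `casa.db.script`, the two fields can be pulled out of the long `clusterMembership` line so they can be checked without reading the whole JSON by eye. This is a sketch under assumptions: `show_ha_fields` is a hypothetical helper, it relies on `grep -o` (available in GNU and BSD grep), and it assumes the field names appear exactly as in the sample line above. Taking a copy of the file first is a sensible precaution before any manual edit; the `.bak` name below is just an example.

```shell
#!/bin/sh
# Print the is_ha_enabled and initialization_state fields found in a file.
# The pattern stops at the next comma or closing brace, so it matches both
# quoted and unquoted values.
show_ha_fields() {
    grep -o -E '"(is_ha_enabled|initialization_state)":[^,}]*' "$1"
}

# Example usage on a node, with casa stopped:
#   cp -p /storage/db/casa/webapp/hsqldb/casa.db.script \
#         /storage/db/casa/webapp/hsqldb/casa.db.script.bak
#   show_ha_fields /storage/db/casa/webapp/hsqldb/casa.db.script
```

Each matching field is printed on its own line, so a leftover “failed to disable” value stands out immediately.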
- Once casa and vROPS are back online, verify that HA is reported as “Enabled”, as expected.
- cloudpandavrops1:~ # service vmware-casa start
- cloudpandavrops1:~ # service vmware-vcops start
- At this point, let the environment run as is for some time and keep monitoring it, for example by checking the following logs:
- # cd /storage/log/vcops/log/casa
- # tail pakManager.actions.log
- # tail casa-gc.log
- # tail casa-performance.log
- # tail casa-rest-calls.log
- # tail casa.log
- # tail casa_cassandra.log
- # tail catalina.out
- # tail pakManager.query.log
- # cd /var/log/vcops_logs/ or # cd /var/log/vmware/vcops
- # tail vcops-services-startup.log
- # tail vcops-firstboot.log
- # tail vcops-upgrade.log
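Rather than tailing each file by hand, a small loop can print the last lines of every log in a directory. This is a convenience sketch only: `tail_logs` is a hypothetical helper, and the default of 20 lines is an arbitrary choice.

```shell
#!/bin/sh
# Print the last N lines (default 20) of every *.log and *.out file
# in the given directory, each preceded by a header with the file name.
tail_logs() {
    dir="$1"
    lines="${2:-20}"
    for f in "$dir"/*.log "$dir"/*.out; do
        [ -f "$f" ] || continue   # skip glob patterns that matched nothing
        echo "==> $f <=="
        tail -n "$lines" "$f"
    done
}

# Example usage on a node:
#   tail_logs /storage/log/vcops/log/casa 50
```

The header line per file keeps the combined output readable when several logs are checked in one pass.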