HBase Resources

Production HBase

HBase Grafana dashboard

HBase master status page

Master status page

Note: You must be on the VPN, and you must update your hosts file (see below), for these links to work.

To load each regionserver's status page, scroll down and click the corresponding link.

Production HBase hosts

You will need to insert these values into your /etc/hosts file:

10.211.12.12    LKSBNMASTER02
10.211.12.22    LKSBNDATANODE02
10.211.12.23    LKSBNDATANODE03
10.211.12.24    LKSBNDATANODE04
10.211.12.25    LKSBNDATANODE05
10.211.12.26    LKSBNDATANODE06
10.211.12.27    LKSBNDATANODE07
10.211.12.28    LKSBNDATANODE08
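A quick way to verify the entries are present is to grep your hosts file for each hostname. A minimal sketch (checking a sample string here rather than the real /etc/hosts, and listing only a few of the hosts):

```shell
# Check a hosts file for the required production HBase entries.
# hosts_sample stands in for the contents of /etc/hosts.
hosts_sample='10.211.12.12    LKSBNMASTER02
10.211.12.22    LKSBNDATANODE02'

missing=""
for h in LKSBNMASTER02 LKSBNDATANODE02 LKSBNDATANODE03; do
  # -w matches the whole word, so a hostname cannot match a longer one
  printf '%s\n' "$hosts_sample" | grep -qw "$h" || missing="$missing $h"
done
echo "missing:$missing"
```

Point the same loop at /etc/hosts (and list all eight hosts) to check a real workstation.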

Staging and UAT HBase

Both staging and thunderdome are currently deployed as stand-alone HBase instances, which means that the regionserver and master reside on the same node.

Staging HBase - hbase0-lks.staging-2.banno-internal.com: staging master status, staging regionserver status

UAT HBase - http://hbase0-lks.uat-2.banno-internal.com: UAT master status, UAT regionserver status

Restarting HBase

$ ssh hbase0-lks.staging-2.banno-internal.com
adam@staginghbase1:~$ sudo -s
[sudo] password for adam:
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-master stop
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-regionserver stop
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-master start
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-regionserver start

If the HBase services do not stop cleanly with the commands above, you can kill them by user:

root@staginghbase1:/home/adam# pkill -u hbase

Restarting an HBase regionserver gracefully

On the node:

  1. /usr/lib/hbase/bin/graceful_stop.sh `hostname`
  2. If a region gets stuck during the drain, manually assign it in an HBase shell (in a different tab, leaving graceful_stop.sh running): assign 'banno_transaction| .....|01234558.0123abcdf8'
  3. Ctrl-C the script when it tries to ssh back in after the regions are assigned and the regionserver has stopped.
  4. Start the regionserver again if necessary
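The steps above can be sketched as a script; the DRY_RUN guard is my addition so the commands can be previewed before anything is actually run on a node:

```shell
# Graceful regionserver restart, as a preview-able sketch.
# With DRY_RUN=1 the commands are only echoed, not executed.
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# step 1: drain regions off this node (script ships with HBase)
run /usr/lib/hbase/bin/graceful_stop.sh "$(hostname)"
# step 4: bring the regionserver back if graceful_stop left it down
run /etc/init.d/hadoop-hbase-regionserver start
```

Steps 2 and 3 (assigning stuck regions, Ctrl-C) are interactive and stay manual.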

Troubleshooting Checklist

  1. Check the Scala errors we are seeing. HBase-related errors are most prevalent in siphon and api but can occur in Che as well; they usually show up as NotServingRegionExceptions or as timeouts when making HBase calls.
  2. Check the Grafana graphs to ensure that the HBase cluster is receiving requests and that compaction times are OK; these graphs also show whether the regions need rebalancing. If you are looking at the data-services graphs, spikes in the Account Context timing are most often the result of problems with HBase calls, and you will also see a corresponding increase in siphon errors.
  3. Check the master status page for dead regionservers and/or regions stuck in transition; each has its own section on the page. Note: regions will occasionally appear under Regions in Transition while the cluster is working properly, but they should not stay there for long.
  4. Check the regionserver status pages to ensure that the regionservers are alive and do not have any issues.
  5. Check for HBase inconsistencies by running hbase hbck on any node in the cluster.
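For step 5, hbase hbck ends its report with a status summary line, which to my knowledge reads Status: OK on a healthy cluster, so a scripted check (sketched here against canned output rather than a live run) can just look for that line:

```shell
# Sketch: decide whether hbck found problems, using canned output in
# place of a real `hbase hbck` run.
hbck_summary='0 inconsistencies detected.
Status: OK'

if printf '%s\n' "$hbck_summary" | grep -q '^Status: OK'; then
  result="consistent"
else
  result="inconsistent - investigate"
fi
echo "$result"
```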

Common Issues

IN THE PRODUCTION ENVIRONMENT, THESE FIXES MUST BE DONE BY THE OPS/INFRA FIREFIGHTER

Regions Stuck in Transition

Occasionally a region will get stuck while transitioning between regionservers. You can see this on the master status page: at the bottom there will be Regions in Transition that do not go away within a couple of minutes. You will also see errors like the following in the logs of apps that use HBase (api, siphon, history):

org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: pending_transaction,c6c46790-7338-11e3-8ae1-005056a30036
|9223370645101975807|InstitutionPending                  |4f6293e0-900a-11e3-831b-005056a30032,1431623112470.ba71c4fb13dfbd5d687edf20cbf92431.

To fix stuck regions:

  • SSH onto an hbase node (see above)
  • Enter the shell with hbase shell
  • Manually assign that region to a regionserver by running assign '<full-region-name>' in the shell. An example of a full region name is shown in the log line above; it is everything after the final NotServingRegionException.
    • It should look something like: id_to_hbasekey,080184b3afa71516c5e62d246cd23981e79bb6eb,1443190671719.a82dc39a1e28084653f4617e0c9c147c.
    • You can also get it from Right Click -> “View Source” on the HBase master status page
  • Repeat the process until all the stuck regions have been assigned to a regionserver.
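The full region name can also be pulled out of the log line mechanically. A sketch, assuming the exception format shown above, where the region name is everything after the last NotServingRegionException: marker, trailing dot included:

```shell
# Extract the full region name from a NotServingRegionException log line.
line='org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: id_to_hbasekey,080184b3afa71516c5e62d246cd23981e79bb6eb,1443190671719.a82dc39a1e28084653f4617e0c9c147c.'

# sed's greedy .* strips everything up to the *last* occurrence of the marker
region=$(printf '%s\n' "$line" | sed 's/.*NotServingRegionException: //')
echo "assign '$region'"
```

The echoed assign line can then be pasted straight into the hbase shell.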

Regionserver not responding to queries

On occasion a regionserver will enter a deadlocked state in which it appears alive to the master node but does not respond to any messages sent to it. In this state the regionserver queues all incoming db transactions in its write-ahead log so that nothing is lost, but queries for data on that node do not return.

This problem will show itself by causing the regionserver status pages to fail to load properly.

You need sudo to run any of the following commands on the HBase node. If you see an error like the following when using sudo <command>, first start a root shell with sudo -s, then run the command on its own:

+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| HBase requires Java 1.6 or later.                                    |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+

To fix the deadlocked regionserver:

  • Restart the regionserver process on the offending node: /etc/init.d/hadoop-hbase-regionserver restart. Note: if it does not respond to the restart command, you may need to force the process to stop (see pkill above) and then start it with /etc/init.d/hadoop-hbase-regionserver start.
  • Rebalance the cluster after the server is up and has completed its start up process.

Dead Region Server

You can see that a regionserver is dead by checking the master's status page. Toward the bottom of the page there is a section labeled Dead Region Servers, which lists any regionservers that are dead.

This scenario is the most likely issue if you are looking into staging hbase issues.

To fix the dead regionserver you will have to:

  • Restart the process by running /etc/init.d/hadoop-hbase-regionserver start on the server that has died.
  • Rebalance the cluster after the server is no longer in the dead servers list and has completed its start up process.

Rebalancing the cluster

A “cluster rebalance” is needed when the number of regions on each regionserver is not roughly even. A rebalance will usually have to be done after a service has been restarted, in order to bring that node fully back into the cluster.

To Rebalance The Cluster:

  • Enter the hbase shell on any server in the cluster with hbase shell
  • Run balancer.
  • If the regions do not start balancing, or the balancer command returns false, you will need to turn on balancing. To do this run balance_switch true in the hbase shell, and then repeat the balancer command.
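Scripting step three amounts to checking the reply from balancer: as the steps above note, it returns false when balancing could not start. A sketch of that decision, with the shell's answer canned rather than captured from a live cluster:

```shell
# Decide the next action from the `balancer` command's reply.
balancer_reply="false"   # canned; a live run would capture hbase shell output

if [ "$balancer_reply" = "true" ]; then
  next_step="wait for regions to finish moving"
else
  next_step="run 'balance_switch true', then 'balancer' again"
fi
echo "$next_step"
```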

Truncating an HBase table

Often we truncate a few mobile data-services tables because their data is ephemeral. To do this, follow these steps:

On 10.211.12.12

$ sudo -i
$ su - hbase
$ hbase shell

> disable 'pending_id_to_hbasekey'
> drop 'pending_id_to_hbasekey'

> disable 'pending_transaction'
> drop 'pending_transaction'

Siphon will migrate/create the tables on boot, so on siphon0-lks.production-2.banno-internal.com:

$ sudo -i
$ sv restart banno-siphon-beta