Posts

Showing posts from December, 2012

Impala and Kerberos

First, Impala is beta software and has some limitations. Stay tuned and test this, you'll see it can be change your BI world dramatically. What is Impala?   Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries. ( https://ccp.cloudera.com/display/IMPALA10BETADOC/Introducing+Cloudera+Impala ) You can build Impala by source ( https://github.com/cloudera/impala ) or you can grab them by using yum on a RHEL / CentOS 6x server. Imapla doesn't support RHEL / CentOS prior 6, since the most part of Impala is written in C++. I choose the rpm-version for this article, but the compiled version will work in the same manner. To grab impala directly per yum setup a new rep

Using Hive's HBase handler

Hive supports per HIVE-705 HBase integration for SELECT and write INSERT both and is well described Hive's wiki . Note, as of Hive 0.9x the integration requires HBase 0.92x. In this article I'll show how to use existing HBase tables with Hive. To use Hive in conjunction with HBase, a storage-handler is needed. Per default, the storage handler comes along with your Hive installation and should be available in Hive's lib directory ( $HIVE_HOME/lib/hive-hbase-handler* ). The handler requires hadoop-0.20x and later as well as zookeeper 3.3.4 and up. To get Hive and HBase working together, add HBase's config directory into hive-site.xml : <property> <name>hive.aux.jars.path</name> <value>file:///etc/hbase/conf</value> </property> and sync the configs ( hbase-site.xml as well as hive-site.xml ) to your clients. Add a table in Hive using the HBase handler: CREATE TABLE hbase_test ( key1 string, col1 string ) STORED BY &

Hive "drop table" hangs (Postgres Metastore)

By using postgres as a metastore database it could be happen that " drop table xyz " hangs and Postgres is showing LOCKS with UPDATE. This happen since some tables are missing and can be fixed by using: create index "IDXS_FK1" on "IDXS" using btree ("SD_ID"); create index "IDXS_FK2" on "IDXS" using btree ("INDEX_TBL_ID"); create index "IDXS_FK3" on "IDXS" using btree ("ORIG_TBL_ID"); CREATE TABLE "ROLES" ( "ROLE_ID" bigint NOT NULL, "CREATE_TIME" int NOT NULL, "OWNER_NAME" varchar(128) DEFAULT NULL, "ROLE_NAME" varchar(128) DEFAULT NULL, PRIMARY KEY ("ROLE_ID"), CONSTRAINT "ROLEENTITYINDEX" UNIQUE ("ROLE_NAME") ) ; CREATE TABLE "ROLE_MAP" ( "ROLE_GRANT_ID" bigint NOT NULL, "ADD_TIME" int NOT NULL, "GRANT_OPTION" smallint NOT NULL, "GRANTOR"

Hive query shows ERROR "too many counters"

A hive job face the odd " Too many counters:"  like Ended Job = job_xxxxxx with exception 'org.apache.hadoop.mapreduce.counters.LimitExceededException(Too many counters: 201 max=200)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask Intercepting System.exit(1) These happens when operators are used in queries ( Hive Operators ). Hive creates 4 counters per operator, max upto 1000, plus a few additional counters like file read/write, partitions and tables. Hence the number of counter required is going to be dependent upon the query.  To avoid such exception, configure " mapreduce.job.counters.max " in mapreduce-site.xml to a value above 1000. Hive will fail when he is hitting the 1k counts, but other MR jobs not. A number around 1120 should be a good choice. Using " EXPLAIN EXTENDED " and " grep -ri operators | wc -l " print out the used numbers of operators. Use this value to tweak the MR s