Wednesday, July 16, 2014

Switch to HiveServer2 and Beeline

HiveServer2 [2] was introduced in Hive 0.11, so it's time to switch from the old Hive CLI to the modern version. Why?
First, security [1]. The Hive CLI bypasses HiveServer2 and launches MapReduce jobs directly, which undermines security projects like Apache Sentry [3]. With HiveServer2, Kerberos impersonation brings fine-grained security down to HiveQL. It's possible to build a strong security layer with Kerberos, Apache Sentry [3] and Apache HDFS ACLs [4], as other DWHs have.
Second, HiveServer2 brings connection concurrency to Hive. Multiple users and clients can connect at the same time via JDBC (remote clients as well as Beeline) over Thrift.
Third, the Hive CLI could be deprecated in the future; this is being discussed within the Hive developer community.

As a first step, a Beeline connection can be established with:

beeline -u jdbc:hive2://<SERVER>:<PORT>/<DB> -n USERNAME -p PASSWORD

The URI is the JDBC connection string, ending with the database the user wants to query. The same string can be used for remote JDBC connections, too. Additionally, the connection parameters are easy to default in a Kerberos-enabled environment via .bashrc, for example:

alias hive2='beeline -u jdbc:hive2://HOST:PORT/DB -n $USER'

(Use of the hive command should be prohibited, for example with chmod 700, to avoid bypassing HiveServer2.)
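
The connection string above can also be assembled by a small helper, which keeps an alias readable when host, port or database change. This is only a minimal sketch: the function name hs2_url, the example host name and the fallback database "default" are assumptions for illustration, and 10000 is used as the fallback because it is HiveServer2's usual default port.

```shell
#!/bin/bash
# Build a HiveServer2 JDBC URL from host, port and database.
# hs2_url is a hypothetical helper name, not part of Beeline.
hs2_url() {
    local host="$1"
    local port="${2:-10000}"   # fallback: HiveServer2's usual default port
    local db="${3:-default}"   # fallback: Hive's default database
    printf 'jdbc:hive2://%s:%s/%s\n' "$host" "$port" "$db"
}

# Print the URL for an assumed host and database.
hs2_url hs2.example.com 10000 sales
```

The result can be handed to Beeline as beeline -u "$(hs2_url hs2.example.com)" -n "$USER".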

All leading distributions include HiveServer2, and the use of Beeline is well documented and pretty easy. Cloudera wrote a great blog post [5] about migrating from the Hive CLI to Beeline, and additional client information is available in the Beeline wiki [7]. Beeline and HS2 also work in a multi-tenant Tez environment [8].

Snippets

 

Use Beeline in background [6]:
export HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal"
nohup beeline -u jdbc:hive2://<HOST>:<PORT>/DB -n <USER> -p <PASS> -d org.apache.hive.jdbc.HiveDriver -f hql_script &


Query a table from the command line:
beeline -u jdbc:hive2://<HOST>:<PORT>/DB -n <USER> -p <PASS> -e "select count(*) from (select a.sender, a.recipient, b.recipient as c from transactions a join transactions b on a.recipient = b.sender where a.time < b.time and b.time - a.time < 5) i;"

Tuesday, July 8, 2014

XAttr are coming to HDFS

HDFS-2006 [1] describes the use of extended attributes. XAttrs, known from *NIX operating systems, attach descriptive metadata to physically stored data, beyond the attributes strictly defined by the filesystem. They are mostly used for additional information such as a hash, checksum or encoding, or for security-relevant information such as a signature or author / creator.
According to the source code [2], the xattr limits can be configured with dfs.namenode.fs-limits.max-xattrs-per-inode and dfs.namenode.fs-limits.max-xattr-size in hdfs-default.xml. The default for dfs.namenode.fs-limits.max-xattrs-per-inode is 32; for dfs.namenode.fs-limits.max-xattr-size it is 16384.

Within HDFS, an extended attribute name carries a namespace prefix. There are four namespaces, as in the Linux kernel filesystem implementation: security, system, trusted and user. User-defined attributes are stored in the user namespace. Access to the other namespaces is restricted: trusted is accessible only to the superuser, and system and security are reserved for internal use.
The xattr definitions are free-form and can be interpreted by additional tools such as security frameworks or backup systems, via the API or similar. Attribute names are case-sensitive, while the namespace prefix is case-insensitive.
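
To illustrate the naming rule, here is a small sketch that splits a full attribute name into namespace and name, folds the namespace to lower case and leaves the name untouched. The helper name xattr_split is hypothetical and not an HDFS command; it only mirrors the rule described above and does not talk to HDFS.

```shell
#!/bin/bash
# Split "<namespace>.<name>" and normalise the namespace to lower case.
# xattr_split is a hypothetical helper, not an HDFS command.
xattr_split() {
    local full="$1"
    local ns name
    # A valid name must contain a dot separating namespace and name.
    case "$full" in
        *.*) ;;
        *) echo "missing namespace prefix: $full" >&2; return 1 ;;
    esac
    ns=$(printf '%s' "${full%%.*}" | tr '[:upper:]' '[:lower:]')
    name="${full#*.}"
    case "$ns" in
        user|trusted|system|security) printf '%s %s\n' "$ns" "$name" ;;
        *) echo "unknown namespace: $ns" >&2; return 1 ;;
    esac
}

# The namespace is case-insensitive, the attribute name keeps its case.
xattr_split 'USER.enc_default'
```

A name without one of the four namespace prefixes makes the helper return a non-zero status.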

An attribute can be set with the dfs command like this:

hadoop dfs -setfattr -n 'user.enc_default' -v UTF8 /user/alo/definition_table.txt

and read back with:

hadoop dfs -getfattr -d /user/alo/definition_table.txt

# file: /user/alo/definition_table.txt
user.enc_default='UTF8'


HDFS-2006 is already committed [3] and will be available in HDFS 2.5.x. XAttrs are enabled by default and have no impact on performance if you don't use them.

[1] https://issues.apache.org/jira/browse/HDFS-2006

Friday, July 4, 2014

Cloudera + Intel + Dell = ?

As Cloudera announced in a press release [1], the Intel investment [2] is now followed by a close alliance with Dell. Here is my opinion on it.

For years, analysts have been promising growth rates in the high double-digit percentage range through 2020 [3]. It is therefore only logical that Intel invests in the "Big Data business" through the current top dog Cloudera, after its own distribution was apparently not as successful as hoped. Intel is also buying itself significant influence over the Hadoop project: alongside Hortonworks, Cloudera is one of the most important committers in the entire ecosystem.
Intel's influence starts with cryptography (Rhino) [4]. Further possibilities would be optimized bytecode for Intel CPUs in Impala / Spark, advanced networking features in the Hadoop core (IPv6), or support for proprietary Intel solutions that will be available only in CDH. Since Cloudera is represented in nearly all relevant projects of the Apache Hadoop ecosystem, this voting power could well be used to steer Apache Hadoop in a direction that both companies want.

A slow farewell to open source?
The two distributions CDH (Cloudera) and HDP (Hortonworks) show increasing fragmentation. This is very visible in the latest acquisitions: Cloudera buys Gazzang, Hortonworks buys XA Secure. As soon as the respective distribution-specific proprietary encryption is used, the distributions are no longer compatible with each other. The discrepancy will surely become obvious once Intel's crypto cards come into play and Gazzang is optimized for them [5].
The strategy is also visible in the Hadoop core: Cloudera backs the Parquet file format, Hortonworks backs ORCFile. But the differences go further. Hortonworks relies on open source tools such as Ambari, Storm, Shark and Falcon. Cloudera, in the lucrative enterprise segment, relies on "closed open code" (source code public, but no community-based development) such as Impala, and on closed source for management (Cloudera Manager and enterprise add-ons), encryption (Gazzang) and data lineage (Navigator).
Since both Hortonworks and Cloudera ("100% of the top 5 US intelligence agencies run Cloudera") serve the public and intelligence sector in the USA / UK, one may ask whether closed source is a clever strategy in the high-revenue German market (NSA inquiry committee). At least with HDP, a complete audit is possible.

Cooperation with Dell
Dell had been weakening since 2012, caused by the slump in the PC market and its previous focus on office hardware. In 2013 Michael Dell succeeded in taking Dell private again, together with the financial investor Silver Lake. The Dell stock disappeared from the exchange, and the way was clear to restructure the company drastically and trim it for sales.
Dell's sales model is volume-driven; only the hardware sold counts. Since Dell makes no significant revenue with services and outsources them to partners, Dell is naturally the ideal partner for Intel and Cloudera. There is no cannibalization of existing business, quite the opposite. Since the business model of all Hadoop distributors is based on recurring subscriptions, this deal is nearly unbeatable. Dell gets a unique selling point and sells more servers, Intel earns on CPUs, memory, SSDs and networking, and Cloudera on subscriptions [1]:
"Driven by collaboration with the open source community and efforts across Dell, Intel and Cloudera, the Dell appliance is a best of breed big data solution stack. Designed from the silicon up, this appliance can enable certain applications to run up to 100x faster, is easy to use and deploy, and is compatible with existing solutions. [...] The Dell In-Memory Appliances for Cloudera Enterprise will be available in pre-sized, pre-configured options so that enterprise customers can choose and quickly deploy the version that is right for their applications. Value-added consulting services for custom configurations are also available through Dell."
In other (casual) words: if a customer wants to run a modern Hadoop CDH analytics cluster, the only sensible and future-proof solution is one from Dell / Intel / Cloudera, because this offering is the best on the market and will remain so. Intel / Cloudera / Dell will make sure of that. And if the customer wants a hardware solution from $commodity_hardware_vendor, that's fine too. He just won't get the performance that would be possible, at least if he bets on CDH:
"Together, the partners said they are attempting to build a “big data ecosystem” that combines data analytics hardware and software to move advanced data analytics to mainstream applications." [6]
What could this strategy bring in the future?
The claim that some applications will run up to 100x faster with this solution (note: "up to 100x faster" seems to be a popular phrase in American marketing) suggests that, with very high probability, server and software will be deployed tuned to each other. This considerably restricts the customer's flexibility in choosing a vendor. It looks a bit like vendor lock-in through the back door, which from everyone's point of view (except the customer's) is the best way to negate the existing dependency on recurring subscriptions and services. Cloudera's CEO, Tom Reilly, makes clear in an interview where the journey is headed [7]:
Intel is working on a chip that’s going to ship in five years from now. They’re sharing those designs with us and we’re collaborating with them on how we can write and take advantage of instructions in the chip to actually make them perform better for analytic workloads. So if a customer is going to build a scale-out grid, and they were planning to have a thousand nodes driving it, the work with Intel might say they can do it with 600, which is significant cost savings in the long run. That’s huge. That’s a five-year roadmap.
Whether this really is the path to "Big Data King", the future will show.

[1] http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2014/06/24/cloudera-dell-and-intel-advance-enterprise-deployments-of-hadoop.html
[2] http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2014/06/02/cloudera-names-intel-cio-to-board-of-directors-and-announces-the.html
[3] http://www.datanami.com/2014/05/29/hadoop-market-grow-58-2020-report-says/
[4] http://blog.cloudera.com/blog/2014/06/project-rhino-goal-at-rest-encryption/
[5] http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2014/06/03/cloudera-strengthens-hadoop-security-with-acquisition-of-gazzang.html
[6] http://www.datanami.com/2014/06/24/cloudera-dell-intel-target-big-data-ecosystem/
[7] http://www.information-age.com/it-management/strategy-and-innovation/123458167/one-year-ceo-clouderas-tom-reilly-just-cant-wait-be-big-data-king

Thursday, June 12, 2014

Remove HDP and Ambari completely

It's a bit hard to remove HDP and Ambari completely, so I'm sharing my removal script here. It works perfectly for me; just adjust the HDFS directory. In my case it was /hadoop.
#!/bin/bash
echo "==> Stop Ambari and Hue"
ambari-server stop && ambari-agent stop
/etc/init.d/hue stop
sleep 10
echo "==> Erase HDP and Ambari completely"
yum -y erase ambari-agent ambari-server ambari-log4j hadoop libconfuse nagios ganglia sqoop hcatalog\* hive\* hbase\* zookeeper\* oozie\* pig\* snappy\* hadoop-lzo\* knox\* hadoop\* storm\* hue\*
# remove configs
rm -rf /var/lib/ambari-*/keys /etc/hadoop/ /etc/hive /etc/hbase/ /etc/oozie/ /etc/zookeeper/ /etc/falcon/ /etc/ambari-* /etc/hue/
# remove ambaris default hdfs dir
rm -rf /hadoop
# remove the repos
echo "==> Remove HDP and Ambari Repo"
rm -rf /etc/yum.repos.d/HDP.repo /etc/yum.repos.d/ambari.repo
# delete all HDP related users
echo "==> Delete the user accounts"
# loop instead of an && chain, so one missing account does not stop the remaining deletions
for u in hdfs sqoop hue yarn hbase hive oozie hcat puppet storm ambari-qa ambari_qa tez flume hadoop_deploy hcatalog zookeeper falcon rrdcached; do userdel -f "$u"; done
# remove the unwanted sockets
echo "==> remove the HDFS socket and logs"
rm -rf /var/run/hdfs-sockets
rm -rf /var/log/sqoop2 /var/log/hdfs* /var/log/hadoop-* /var/log/hbase* /var/log/hue* /var/log/nagios /var/log/oozie /var/log/storm /var/log/zookeeper /var/log/falcon /var/log/flume* /var/run/flume-ng/ /var/run/hadoop* /var/run/hbase/ /var/run/hue/ /var/run/nagios/ /var/run/oozie/ /var/run/solr/ /var/run/spark/ /var/run/sqoop2/ /var/run/storm/ /var/run/zookeeper/ /var/lib/oozie/
For CDH just follow the guidance here:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_uninstall_cm.html

And MapR here:


Tuesday, May 20, 2014

Facebook's Presto

In November 2013 Facebook published their Presto engine as open source, available on GitHub. Presto is a distributed interactive SQL query engine that can run over dozens of modern big data stores based on Apache Hive or Cassandra. Presto comes with a limited JDBC connector and supports Hive 0.13 with Parquet and views.

Installation

Just a few particularities. Presto runs only with Java 7, does not support Kerberos and has no built-in user authentication either. To protect data a user should not be able to read, consider using HDFS ACLs / POSIX permissions. The setup of Presto is pretty easy and well documented. Just follow the documentation, use "uuidgen" to generate a unique ID for your Presto node (node.id in node.properties) and add "hive" as a datasource (config.properties: datasources=jmx,hive). I started the server as user "hive" with:
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && presto-server-0.68/bin/launcher start
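
The node.properties step mentioned above could be scripted roughly as follows. This is a sketch under assumptions: the helper name write_node_properties is hypothetical, and the environment name and data directory are example values that need to be adapted to your installation.

```shell
#!/bin/bash
# Write a minimal Presto node.properties file.
# write_node_properties is a hypothetical helper; the environment name
# and the data directory below are example values, not required settings.
write_node_properties() {
    local file="$1"
    local node_id="$2"
    cat > "$file" <<EOF
node.environment=production
node.id=${node_id}
node.data-dir=/var/presto/data
EOF
}

# Usage (assumes uuidgen is installed and the etc/ directory exists):
#   write_node_properties etc/node.properties "$(uuidgen)"
```

Passing the ID as an argument keeps the helper independent of uuidgen, so an existing node can keep its ID when the file is regenerated.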

After a successful start you should be able to connect to Presto's web interface (discovery.uri in config.properties). The UI is pretty simple, but it is a good place to see what happens with your queries, how many splits are created and how long each step takes.

The CLI is a stand-alone, self-executing jar file and can be placed on any computer that has Java 7 installed and can connect to the Presto instance. To make sure the client uses the correct Java version, it may make sense to set the PATH explicitly:
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && /software/presto --server [your-presto-server]:[port] --catalog hive --schema default

presto:default> show tables;
    Table
--------------
 building
 hvac
 sample_07
 sample_08
 transactions

Now let's test whether Presto is really fast and can compete with Impala. To simplify the tests I wrote a small script which uses MR to generate sample data. It's available in my git repo. Just run it as the user you want to be; make it executable or run it with "sh". With that script I created a table called transactions, which is the table we want to query. I post only two exemplary queries here, but the script has a few more.

1. Finding highest gainers

select id, sum(amount) as amount from (select sender as id, amount * -1 as amount from transactions union all select recipient as id, amount from transactions) unionResult group by id order by amount desc limit 10;

Results
Hive: 39.078 seconds, Fetched: 10 row(s)
Tez: 18.227 seconds, Fetched: 10 row(s)
Presto: 0:02 [1.2M rows, 38.2MB] [720K rows/s, 22.9MB/s]


2. Finding fraudsters

select count(*) from (select a.sender, a.recipient, b.recipient as c from transactions a join transactions b on a.recipient = b.sender where a.time < b.time and b.time - a.time < 5) i;

Results
Hive: 208.065 seconds, Fetched: 1 row(s)
Tez: 101.758 seconds, Fetched: 1 row(s)
Presto: 1:02 [600K rows, 19.1MB] [9.7K rows/s, 317KB/s]
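
To put the timings into relation, the ratios can be computed with a few lines of shell. This is only a rough sketch: the helper name speedup is hypothetical, and Presto reports its wall time as minutes:seconds, so reading 1:02 as 62 seconds is an approximation.

```shell
#!/bin/bash
# Print how many times faster "other" is than "base" (both in seconds),
# rounded to one decimal place. speedup is a hypothetical helper name.
speedup() {
    LC_ALL=C awk -v base="$1" -v other="$2" 'BEGIN { printf "%.1f\n", base / other }'
}

speedup 39.078 18.227    # query 1: Hive vs. Tez
speedup 208.065 101.758  # query 2: Hive vs. Tez
speedup 208.065 62       # query 2: Hive vs. Presto (1:02 read as 62 seconds)
```

In both queries Tez roughly halves Hive's runtime; the Presto ratio is coarser because of the rounded wall time.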

Conclusion

While Tez brings significantly better performance than plain Hive, Presto brings light speed into Hadoop-based SQL and can compete with Impala. The advantage of Presto is the flexibility of its connectors: the Presto team will add more connectors for Oracle, MySQL, PostgreSQL and HBase very soon. Authentication (Kerberos), authorization and SQL grants will also be supported within the next month [1].