Thursday, December 8, 2016

Hue 3.11 with HDP 2.5

Works fine with CentOS / RHEL, I used 6.8 in that case. Epel has to be available, if not, install the repo.
And I ask me why Hortonworks didn't integrated Hue v3 in their HDP release - I mean, Hue v2 is older as old and lacks dramatically on functionality.
Anyhow, lets get to work.

sudo wget -O /etc/yum.repos.d/epel-apache-maven.repo

sudo yum install ant gcc krb5-devel mysql mysql-devel openssl-devel cyrus-sasl-devel cyrus-sasl-gssapi sqlite-devel libtidy libxml2-devel libxslt-devel openldap-devel python-devel python-simplejson python-setuptools rsync gcc-c++ saslwrapper-devel libffi-devel gmp-devel apache-maven

sudo mkdir /software; sudo chown hue: /software && cd /software
wget -O && unzip; cd hue-master; sudo mkdir -p /usr/local/hue && chown -R hue: /usr/local/hue && make install

HDP config changes:

Oozie => Custom oozie-site
oozie.service.ProxyUserService.proxyuser.hue.groups *
oozie.service.ProxyUserService.proxyuser.hue.hosts *

Hive => Custom webhcat-site *
webhcat.proxyuser.hue.groups *

HDFS => Custom core-site
hadoop.proxyuser.hue.hosts *
hadoop.proxyuser.hue.groups *

At the end, hue.ini needs to be configured to fit the installation, here's an example - I use 8899 as HueUI port:


At least a new security rule for port 8899 has to be created, as well as the hbase thrift service has to be started per:
nohup hbase thrift start &

Configure Hue:
/usr/local/hue/build/env/bin/hue syncdb
/usr/local/hue/build/env/bin/hue migrate

Start Hue:
/usr/local/hue/build/env/bin/supervisor -d

Login per http://your_hue3_host:8899

I strongly recommend to use MySQL as an backend DB, but for first test the integrated SQLite instance is fine, too.

HUE-4701 - recreate the saved queries from sample notebook

Tuesday, November 29, 2016

Erase HDP 2.x and Ambari

Since I hack now often with Hortonworks HDP, I also often need to completely clean out my lab environments to get fresh boxes. I figured to write a ugly shell script is more comfortable as bothering my infra guys to reset the VM's in Azure - which also reset all my modifications. Bad!
Anyhow, here's the script in the case anyone has some use, too.

As usual, first stop all Ambari managed services. I remove Postgres too, since the setup of a new db done by the installer of Ambari is much more faster than dealing with inconsistencies later.
Side Note: The script is made for RHEL based distributions ;)

Monday, October 24, 2016

FreeIPA and Hadoop Distributions (HDP / CDH)

FreeIPA is the tool of choice when it comes to implement a security architecture from the scratch today. I don't need to praise the advantages of FreeIPA, it speaks for himself. It's the Swiss knife of user authentication, authorization and compliance.

To implement FreeIPA into Hadoop distributions like Hortonwork's HDP and Cloudera's CDH some tweaks are necessary, but the outcome is it worth. I assume that the FreeIPA server setup is done and the client tools are distributed. If not, the guide from Hortonworks has those steps included, too.

For Hortonworks, nothing more as the link to the documentation is necessary:

Ambari 2.4x has FreeIPA (Ambari-6432) support (experimental, but it works as promised) included. The setup and rollout is pretty simple and runs smoothly per Wizard.

For Cloudera it takes a bit more handwork, but it works at the end also perfect and well integrated, but not at the same UI level as Ambari. These steps are necessary to get Cloudera Manager working with FreeIPA:

1. create the CM principal in FreeIPA (example: cdh@ALO.ALT)
2. retrieve the keytab:
 ipa-getkeytab -r -s freeipa.alo.alt -p cdh -k cdh.keytab
3. install ipa-admintools on the Cloudera Manager server 
 yum install ipa-admintools -y
4. place the retrieval-script (from my GitHub) in /opt/cloudera/security/ (or another path accessible by cloudera manager), make it executable and owned by cloudera-scm
 chmod 775 /opt/cloudera/security/ && chown cloudera-scm: /opt/cloudera/security/
5. Start the Kerberos wizard, but stop after verifying the cdh user
6. Set the configuration [1] for "Custom Kerberos Keytab Retrieval Script" to "/opt/cloudera/security/"
7. resume the Kerberos wizard and follow the steps until its finished and restart the cluster.

The FreeIPA client from RHEL7 / CentOS 7 uses now memory based keytabs, but Java doesn't support them (yet). To switch back to the file based ticket cache, the config file (/etc/krb5.conf) needs to be altered by commenting default_ccache_name out, which let the client use the default file based ticket cache;

cat /etc/krb5.conf
# default_ccache_name = KEYRING:persistent:%{uid}

Wednesday, October 12, 2016

Shifting paradigms in the world of BigData

In building the next generation of applications, companies and stakeholders need to adopt new paradigms. The need for this shift is predicated on the fundamental belief that building a new application at scale requires tailored solutions to that application’s unique challenges, business model and ROI. Some things change, and I’d like to point to some of that changes.

Event Driven vs. CRUD
Software development traditionally is driven by entity-relation modeling and CRUD operations on that data. The modern world isn’t about data at rest, it’s about being responsive to events in flight. This doesn’t mean that you don’t have data at rest, but that this data shouldn’t be organized in silos.
The traditional CRUD model is neither expressive nor responsive, given by the amount of uncountable available data sources. Since all data is structured somehow, an RDBMS isn't able to store and work with data when the schema isn't known (schema on write). That makes the use of additional free available data more like an adventure than a valid business model, given that the schema isn't known and can change rapidly. Event driven approaches are much more dynamical, open and make the data valuable for other processes and applications. The view to the data is defined by the use of the data (schema on read). This views can be created manually (Data Scientist), automatically (Hive and Avro for example) or explorative (R, AI, NNW).

Centralized vs Siloed Data Stores
BigData projects often fail by not using a centralized data store, often refereed as Data Lake or Data Hub. It’s essential to understand the idea of a Data Lake and the need for it. Siloed solutions (aka data warehouse solutions) have only data which match the schema and nothing else. Every schema is different, and often it’s impossible to use them in new analytic applications. In a Data Lake the data is stored as it is - originally, untouched, uncleaned, disaggregated. That makes the entry (or low hanging fruit) mostly easy - just start to catch all data you can get. Offload RDBMS and DWs to your Hadoop cluster and start the journey by playing with that data, even by using 3rd party tools instead to develop own tailored apps. Even when this data comes from different DWH's, mining and correlating them often brings treasures to light.

Scaled vs. Monolith Development
Custom processing at scale involves tailored algorithms, be they custom Hadoop jobs, in-memory approaches for matching and augmentation or 3rd party applications. Hadoop is nothing more (or less) than a framework which allows the user to work within a distributed system, splitting workloads into smaller tasks and let those tasks run on different nodes. The interface to that system are reusable API's and Libraries. That makes the use of Hadoop so convenient - the user doesn't need to take care about the distribution of tasks nor to know exactly how the framework works. Additionally, every piece of written code can be reused by others without having large code depts.
On the other hand Hadoop gives the user an interface to configure the framework to match the application needs dynamically on runtime, instead of having static configurations like traditional processing systems.

Having this principles in mind by planning and architecting new applications, based on Hadoop or similar technologies doesn’t guarantee success, but it lowers the risk to get lost. Worth to note that every success has had many failures before. Not trying to create something new is the biggest mistake we can made, and will result sooner or later in a total loss.

Thursday, September 15, 2016

Cloudera Manager and Slack

The most of us are getting bored by receiving hundreds of monitoring emails every day. To master the flood, rules are getting in play - and with that rules the interest into email communication are reduced.
To master the internal information flood, business messaging networks like Slack are taking more and more place.

To make CM work with Slack a custom alert script from my Github will do the trick:

The use is pretty straight forward - create a channel in Slack, enable Webhooks, place the token into the script, store the script on your Cloudera Manager host, make it executable for cloudera-scm : and enable outgoing firewall / proxy rules to let the script chat with Slack's API. The script can handle proxy connections, too.

In Cloudera Manager, the script path needs to be added into Cloudera-Management-Service => Configuration => Alert Publisher => Custom Script.