Posts

When to Choose ETL vs. ELT for Maximum Efficiency

ETL has been the traditional approach: data is extracted, transformed, and then loaded into the target database. ELT flips this process, extracting data and loading it into the target system first, then transforming it there. While ETL has been the go-to for many years, ELT is emerging as the preferred choice for modern data pipelines. This is largely due to ELT's speed, scalability, and suitability for the large, diverse datasets generated by many different tools and systems: CRM and ERP data, log files, edge computing, IoT, and so on.

Data Engineering Landscape

Data engineering is the new kind of DevOps. With the exponential growth in data volume and sources, efficient and scalable data pipelines, and therefore data engineers, have become the new standard. In the past, limitations in compute power, storage capacity, and network bandwidth made the famous three-word "let's move data around" phrase, Extract, Transform, Load (ETL), the…
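To make the difference concrete, here is a minimal, hypothetical sketch of the same pipeline in both styles. The table names (events, raw_events, events_elt), the file name, and the connection string are made up for illustration:

    # ETL vs. ELT, sketched with pandas and SQLAlchemy (hypothetical names).
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@warehouse/analytics")

    # ETL: transform inside the pipeline, load only the finished result.
    df = pd.read_csv("events.csv")                    # extract
    df["day"] = pd.to_datetime(df["ts"]).dt.date      # transform
    df.to_sql("events", engine, if_exists="append")   # load

    # ELT: load the raw data as-is, transform later inside the warehouse.
    pd.read_csv("events.csv").to_sql("raw_events", engine, if_exists="append")
    with engine.begin() as conn:
        conn.exec_driver_sql(
            "CREATE TABLE IF NOT EXISTS events_elt AS "
            "SELECT *, ts::date AS day FROM raw_events"
        )

In the ELT variant the heavy lifting happens in the warehouse's own SQL engine, which is exactly why the pattern scales so well on modern cloud warehouses.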

Life hacks for your startup with OpenAI and Bard prompts

OpenAI and Bard are the most used GenAI tools today; the first has a massive Microsoft investment behind it, the other is an experiment from Google. But did you know that you can also use them to optimize and hack your startup? Even creating pitch scripts, sales emails, and elevator pitches with one (or both) of them helps you not only save time but also validate your marketing and wording. Curious? Here are a few prompt hacks for startups to create, improve, and validate buyer personas, your startup's mission and vision statements, and USP definitions.

Introduce yourself and your startup

Introduce yourself, your startup, your website, your idea, your position, and in a few words what you are doing, to the chatbot:

Prompt: I'm NAME and our startup NAME, with website URL, is doing WHATEVER. With PRODUCT NAME, we aim to change or disrupt INDUSTRY.

Bard is able to pull information from your website. I'm not sure if ChatGPT can do that, though. But nevertheless, now you have laid a great…
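If you prefer to script such prompts instead of pasting them into the chat UI, a minimal sketch with OpenAI's Python SDK could look like the following. The model name is a placeholder, the client assumes an OPENAI_API_KEY environment variable, and Bard had no comparable public API at the time:

    # Hedged sketch: send the introduction prompt via OpenAI's Python SDK (v1 style).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "I'm NAME and our startup NAME, with website URL, is doing WHATEVER. "
        "With PRODUCT NAME, we aim to change or disrupt INDUSTRY."
    )

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; use whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)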

Indexing PostgreSQL with Apache Solr

Searching and filtering large IP address datasets within PostgreSQL can be challenging. Why? Databases excel at data storage and structured queries, but often struggle with full-text search and complex analysis. Apache Solr, a high-performance search engine built on top of Lucene, is designed to handle these tasks with remarkable speed and flexibility.

What do we need? A running PostgreSQL database with a table containing IP address information (named "ip_loc" in our example), and a basic installation of Apache Solr.

Setting up Apache Solr

Create a Solr core:

    solr create -c ip_data -d /path/to/solr/configsets/

Define the schema (schema.xml):

    <field name="start_ip" type="ip" indexed="true" stored="true" />
    <field name="end_ip" type="ip" indexed="true" stored="true" />
    <field name="iso2" type="string" indexed="true"…
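The excerpt stops before the indexing step, but a hedged sketch of moving rows from PostgreSQL into the new core could look like this. The psycopg2/pysolr libraries, the connection strings, and the database name are assumptions on my side; the column names follow the ip_loc example above:

    # Sketch: batch-copy rows from the ip_loc table into the ip_data Solr core.
    import psycopg2
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/ip_data", always_commit=True)

    conn = psycopg2.connect("dbname=geoip user=postgres")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute("SELECT start_ip, end_ip, iso2 FROM ip_loc")

    docs = [
        {"start_ip": start, "end_ip": end, "iso2": iso2}
        for start, end, iso2 in cur
    ]
    solr.add(docs)  # index in one batch; chunk this for very large tables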

Some fun with Apache Wayang and Spark / Tensorflow

Apache Wayang is an open-source Federated Learning (FL) framework developed by the Apache Software Foundation. It provides a platform for distributed machine learning, with a focus on ease of use and flexibility. It supports multiple FL scenarios and provides a variety of tools and components for building FL systems. It also includes support for various communication protocols and data formats, as well as integration with other Apache projects such as Apache Kafka and Apache Pulsar for data streaming. The project aims to make it easier to develop and deploy machine learning models in decentralized environments. It's important to note that these are just examples, and they may not be the right way for your project to interact with Apache Wayang; check the documentation of the Apache Wayang project ( https://wayang.apache.org ) to see how to interact with it. I just want to point out how easy it is to use different languages to interact between Wayang and Spark. Also, you need to make…

Get Apache Wayang ready to test within 5 minutes

Hey followers, I often get asked how to get Apache Wayang ( https://wayang.apache.org ) up and running without a full big data processing system behind it. We heard you: we built a full-fledged Docker container, called BDE (Blossom Development Environment), which is basically Wayang. Here's the repo: https://github.com/databloom-ai/BDE

I made a short screencast showing how to get it running with Docker on OSX, and we also made two hands-on videos to explain the first steps. Let's start with the basics - Docker. Get the whole platform with:

    docker pull ghcr.io/databloom-ai/bde:main

At the end, the Jupyter notebook address is shown; control-click on it (OS X) and the browser should open and log you in automatically. Voila - done. You now have a fully working Wayang environment, and we prepared three notebooks to make it easier to dive in. Watch our development tutorial video (part 1) to get a better understanding of what Wayang can do, and what it cannot. Click the video below:
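One note: the pull only downloads the image. A run command along these lines should bring the container up (the published Jupyter port is an assumption on my side; check the BDE README for the exact flags):

    docker run -it -p 8888:8888 ghcr.io/databloom-ai/bde:main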

Combined Federated Data Services with Blossom and Flower

When it comes to Federated Learning frameworks, we typically find two leading open source projects - Apache Wayang [2] (maintained by databloom ) and Flower [3] (maintained by Adap ). At first glance, both frameworks seem to do the same. But, as usual, the second look tells another story.

How does Flower differ from Wayang? Flower is a federated learning system, written in Python, that supports a large number of training and AI frameworks. The beauty of Flower is the strategy concept [4]: the data scientist can define which dedicated framework is used, and how. Flower delivers the model to the desired framework, watches the execution, gets the calculations back, and starts the next cycle. That makes Federated Learning in Python easy, but at the same time limits its use to platforms supported by Python. Flower has, as far as I could see, no data query optimizer; an optimizer understands the code and splits the model into smaller pieces to use multiple frameworks at the same ti…
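To illustrate the training cycle described above, here is a minimal, hedged sketch of a Flower client using the flwr package (Flower's Python SDK). The "model" is a stand-in NumPy array and the server address is a placeholder; a real project would wrap an actual training framework here:

    # Minimal Flower client sketch; real projects would wrap an actual model.
    import flwr as fl
    import numpy as np

    class SketchClient(fl.client.NumPyClient):
        def __init__(self):
            self.weights = np.zeros(10)  # stand-in for real model parameters

        def get_parameters(self, config):
            return [self.weights]

        def fit(self, parameters, config):
            # Flower delivers the global parameters; we "train" locally and
            # return the update, which starts the next cycle on the server.
            self.weights = parameters[0] + 0.1
            return [self.weights], 1, {}

        def evaluate(self, parameters, config):
            return 0.0, 1, {}  # loss, number of examples, metrics

    fl.client.start_numpy_client(
        server_address="127.0.0.1:8080", client=SketchClient()
    )

This also shows the limitation mentioned above: everything the client touches has to be reachable from Python.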

Compile Apache Wayang on Mac M1

We will release Apache Wayang v0.6.0 in the next few days, and during the release testing I was wondering whether we could get Wayang running on M1 (ARM). And yes, a few small changes - voila!

Install maven, scala, sqlite and groovy:

    brew install maven scala groovy sqlite

Download OpenJDK 8 for M1 from https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk and install the pkg.

Get Apache Wayang either from https://dist.apache.org/repos/dist/dev/wayang/ , or git-clone it directly:

    git clone https://github.com/apache/incubator-wayang.git

Start the build process:

    cd incubator-wayang
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home
    mvn clean install

Ready to go:

    [INFO] Reactor Summary for Apache Wayang 0.6.0-SNAPSHOT:
    ...
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  06:24 min

After the build is done, the binaries are located in Maven's home: ~/.m2/repository/o…