Brian Flüg, Qubole: On the benefits of data lakes for machine learning

Ryan Daws is a senior editor at TechForge Media with over a decade of experience in crafting compelling narratives and making complex topics accessible. His articles and interviews with industry leaders have earned him recognition as a key influencer by organisations like Onalytica. Under his leadership, publications have been praised by analyst firms such as Forrester for their excellence and performance. Connect with him on X (@gadget_ry).

Data lakes offer a number of advantages for machine learning, but it takes an experienced partner to unlock their full benefit.

AI News caught up with Brian Flüg, Solutions Architect at Qubole, to find out how the company is helping data scientists with their workloads.

What are the advantages of using a data lake for machine learning?

The advantages of using a secure and open data lake for machine learning are numerous. It is simple to deploy and companies can reduce risk while decreasing costs.

Data scientists can build, deploy, and iterate on their models faster with experiment tracking and out-of-the-box integrations for front-end tools such as RStudio and DataRobot. They can also benefit from end-to-end workflows anchored by schedulers and Airflow. Managed notebooks also offer serverless (offline) editing.

In addition, developers can be more productive by skipping steps and building applications with code auto-complete, code compare, code-free visualisations (QVIZ), version control, hands-free dependency management, and easy access to cloud storage and the data catalog.

You can also automate infrastructure provisioning for machine learning, minimising costs while supporting concurrent user growth without a performance impact. Management overhead stays near zero regardless of the number of users or model versions, and the platform scales up or down automatically to support all workloads at any point in time.

How can businesses use automation to limit the impact of disruptive and rapidly evolving situations like those we saw during the pandemic?

As enterprises look to navigate fast-changing conditions brought about by the pandemic, data leaders are being tasked with harnessing massive volumes of data across the organisation and leveraging streaming analytics, machine learning, and artificial intelligence to help organisations make smarter decisions and adapt to the new surroundings. It has been more crucial than ever to unlock the potential of data lakes and automation for unmatched success.

Through conversations with our customers and partners, it became clear that data lakes supported the analytics capabilities that businesses needed to see them through this crisis, including real-time data pipelines, machine learning, and artificial intelligence. Data lakes are at the cutting edge of analytics and data science today, and optimising them is critical to business success.

What new capabilities does Qubole provide for data scientists?

Qubole caters to data scientists wherever their skills and experience lie. Regardless of whether you are a rookie or a machine learning wizard, Qubole has the capabilities to support your activities. These capabilities include machine learning, artificial intelligence, analytical automation, streaming, and ad-hoc analytics.

Qubole provides your data science teams with the best tool for every task in the data science life cycle, all in a single, cloud-native platform. Data scientists can prepare data with end-to-end visibility of the entire pipeline. They can explore, query, and visualise data through Qubole’s SQL Workbench, or connect the BI tool of their choice through JDBC and ODBC connectors.

Another capability is building and training models with rapid prototyping, flawless execution, and broad support for machine learning ecosystems such as Spark, MLlib, MXNet, TensorFlow, Keras, scikit-learn, Python, and R, with an integrated notebook service for ease of use and collaboration.

Finally, data scientists can collaborate to deploy trained models, schedule production jobs (monitoring end-to-end data science workflows with complete visibility into the data pipeline), and take advantage of Qubole’s hosted Airflow service to create production workflows.

How can a streaming data pipeline unlock the benefits of real-time data for machine learning?

You can build streaming data pipelines to capture the benefits of real-time data for machine learning and ad-hoc analytics. Qubole Pipelines Service is a stream processing service that addresses real-time ingestion, decision-making, machine learning, and reporting use cases.

A streaming data pipeline also enables an accelerated development cycle. You can develop a pipeline within minutes without writing a single line of code and deploy it instantly. A built-in test framework lets you test-run and debug new pipelines to check connectivity and business logic, and you experience near-zero downtime and no data loss.

In addition, simplified data lake operations provide data management and better data consistency. You can set alerts to detect invalid records and schema mismatches, and prevent data loss by storing those records in a configurable cloud storage location so they can be cleansed and reprocessed. Importantly, comprehensive operational management and deep insights help companies to keep costs in check.
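As a rough illustration of the invalid-record handling described above, the following Python sketch checks each record against an expected schema, sets aside anything that fails, and raises an alert message for it. This is not Qubole's implementation or API: the schema, the `validate_batch` and `schema_errors` helpers, and the in-memory `quarantined` list (standing in for a configurable cloud storage location) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {"user_id": int, "event": str, "amount": float}


@dataclass
class ValidationResult:
    valid: list[dict[str, Any]] = field(default_factory=list)
    # Stand-in for a configurable cloud storage location for bad records.
    quarantined: list[dict[str, Any]] = field(default_factory=list)
    alerts: list[str] = field(default_factory=list)


def schema_errors(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of human-readable schema problems for one record."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field '{name}'")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"field '{name}' expected {expected.__name__}, got {actual}")
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field '{name}'")
    return errors


def validate_batch(
    records: list[dict[str, Any]],
    schema: dict[str, type] = EXPECTED_SCHEMA,
) -> ValidationResult:
    """Split a batch into valid records and quarantined ones, with alerts."""
    result = ValidationResult()
    for record in records:
        errors = schema_errors(record, schema)
        if errors:
            # Keep the original record so it can be cleansed and reprocessed later.
            result.quarantined.append({"record": record, "errors": errors})
            result.alerts.append("invalid record: " + "; ".join(errors))
        else:
            result.valid.append(record)
    return result
```

Because the failing records are kept alongside the reasons they failed, nothing is dropped silently: a later job could fix the quarantined records and feed them back through `validate_batch`.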

(Photo by Tim Foster on Unsplash)

Qubole will be sharing their invaluable insights during this year’s AI & Big Data Expo Europe, which runs from 23-24 November 2021. Qubole’s booth number is 309. Brian Flüg will also be speaking at the virtual edition of this year’s event on 1 December 2021. You can find out more about his sessions and register to attend here.
