Featured
Dive Deep into Apache Iceberg on AWS
Veena Vasudevan, Sr. Partner Solutions Architect, Amazon Web Services
Watch this video to learn about the common challenges of using traditional file formats on premises, and how Apache Iceberg on AWS helps you overcome them. You will also learn about Apache Iceberg's comprehensive, advanced feature set through detailed demos that showcase its unique capabilities on AWS.
All episodes
-
The Year of the Data Lakehouse
Tomer Shiran, Anushka Anand, Deepika Duggirala, Tamas Kerekjarto
Data and Analytics organizations have worked to balance improving access and self-service for the business with achieving security and governance at scale. Data lakehouses are ideally suited to help organizations provide both agility and governance. Deepika Duggirala, SVP of Global Technology Platforms at TransUnion, and Tamas Kerekjarto, Head of Engineering, Renewables and Energy Solutions at Shell, will share their journeys to deliver governed self-service.
Apache Iceberg development and adoption accelerated significantly this year, enabling modern data lakes to deliver data warehouse functionality and become data lakehouses. With Apache Iceberg, the industry has consolidated around a vendor-agnostic table format, and innovation from tech companies (Apple, Netflix, etc.) and service providers (AWS, Dremio, GCP, Snowflake, etc.) is creating a world in which data and compute are independent. In this new world, companies can enjoy advancements in data processing, thanks to engine freedom, and data management, thanks to new paradigms such as Data-as-Code. Tomer Shiran, CPO and Founder of Dremio, will deliver Subsurface’s keynote address.
-
Data Governance at Scale with Microsoft
Mike Flasko, Read Maloney
Scaling data governance practices while continuing to provide efficient access to the data lake is a challenge all businesses face. No one knows this better than Microsoft, whose exabyte-scale data lake platform supports analytics that empower multiple billion-dollar lines of business. Join this session to hear Mike Flasko, VP & GM, Data Governance & Privacy at Microsoft, chat with Read Maloney, CMO of Dremio, about their journey and lessons learned, as they built one of the largest governed data lake platforms in the world.
-
Managing Data Files In Apache Iceberg
Russell Spitzer
Everything was going great: your data was in your data lake, queries were fast, and the SREs were happy. But then things started to slow down. Queries took longer; even queries that used to be fast started taking a long time. The culprit? Small and unorganized files. The solution? Apache Iceberg’s RewriteDataFiles action. This talk will dive into how RewriteDataFiles can 1) right-size your files, merging small files and splitting large ones, ensuring that no time is wasted in query planning or in opening files; and 2) reorganize the data within your files, supporting hierarchical sort and multidimensional z-ordering algorithms, so you can make sure your data is optimally laid out for your queries. With these two capabilities, any table can be kept at peak performance regardless of ingestion patterns and table size.
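For readers who want a feel for the mechanics, here is a minimal plain-Python sketch of the right-sizing idea. It is an illustrative model only, not Iceberg's implementation; the 128 MB target size and the first-fit grouping are assumptions for the demo.

```python
TARGET_MB = 128  # assumed target file size for the sketch

def right_size(file_sizes_mb):
    """Plan rewrite groups: merge small files, split oversized ones.

    Returns a list of groups, where each group of sizes would be
    rewritten into a single output file close to TARGET_MB.
    """
    groups = []            # each group becomes one rewritten output file
    current, total = [], 0
    for size in sorted(file_sizes_mb):
        if size >= TARGET_MB:
            # Split an oversized file into TARGET_MB-sized pieces.
            while size > TARGET_MB:
                groups.append([TARGET_MB])
                size -= TARGET_MB
            if size:
                groups.append([size])
            continue
        if total + size > TARGET_MB and current:
            groups.append(current)   # close the current merge group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups
```

In the real action, each group of small files would be rewritten as one file and each oversized file split across several, with the data optionally sorted or z-ordered as it is rewritten.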
-
Data Mesh in Practice Panel
Zhamak Dehghani, Raja Perumalsamy, Ugo Ciraci, Tomer Shiran, Ben Hudson
This panel will discuss the practical considerations of data mesh, tips for addressing the people and cultural aspects of data mesh, and real-world lessons learned when implementing a data mesh or data mesh-like approach. We will hear from experts and people with real-world experience implementing data mesh, including the creator of Data Mesh herself, Zhamak Dehghani. We will discuss topics such as the challenges faced during the implementation of data mesh, balancing data mesh with existing enterprise architectures and solutions, success metrics when implementing data mesh, and strategies for overcoming the organizational resistance to change. Join us to learn more about data mesh and how to implement it successfully in your organization.
-
Modern Data Lakehouse at Shell
Natarajan Kalidoss (Nata), Raja Perumalsamy
In this video, we will present how the Dremio data lakehouse came to the rescue when a large-scale machine learning use case in the electricity retail domain, power consumption forecasting, imposed significant challenges on Shell's data engineering and data science teams.
We will describe how we:
1.) addressed the large variety of data sources that needed to be ingested, processed and served to the Data Science team
2.) streamlined some of the Data Science workflows around data exploration, feature engineering and model testing
3.) operationalized and scaled ML training and inferencing
-
Lakehouse: Smart Iceberg Table Optimizer
Rajasekhar Konda, Steve Zhang
The new lakehouse architectural design pattern provides many technical benefits like ACID support, time travel for machine learning, and better query performance. Apache Iceberg implements this pattern and provides the flexibility to enhance it further based on real-world needs. Using the Apache Iceberg table format requires maintenance operations such as snapshot expiration and orphan-file removal for data governance, plus metadata and data compaction for fast, efficient access to data.
This video will discuss how to handle these table operations at very large scale, keeping cost in mind, without compromising data engineering, ML, analytical, and BI use cases. Automating these operations makes life easier for engineers using the platform, who no longer have to worry about how Iceberg internals work. We will share lessons learned optimizing streaming and batch data sets in a cost-effective and efficient way.
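As a concrete, hedged illustration of the snapshot-expiration policy mentioned above, here is a plain-Python sketch. The `Snapshot` shape, the keep-last-3 rule, and the age threshold are assumptions for the demo, not Iceberg's actual defaults or implementation.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int

def snapshots_to_expire(snapshots, now_ms, max_age_ms, keep_last=3):
    """Select snapshots older than max_age_ms for expiration,
    but always protect the newest keep_last snapshots."""
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms,
                          reverse=True)
    protected = {s.snapshot_id for s in newest_first[:keep_last]}
    return [s for s in newest_first
            if s.snapshot_id not in protected
            and now_ms - s.timestamp_ms > max_age_ms]
```

An automated optimizer would run a policy like this on a schedule, then delete the expired snapshots' unreferenced metadata and data files.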
-
Lakes and Lakehouses: The Evolution of Analytics in the Cloud with AWS
Rajesh Sampath, Rahim Bhojani
More organizations than ever before are adopting data lakes to drive business outcomes and innovate faster. As data lakes grow in size and scope, data lake architectures have evolved to balance agility, innovation and governance. Amazon Web Services (AWS) is a pioneer in cloud data lakes, to the point where Amazon S3 is now the de facto storage for data lakes. In this session, Rajesh Sampath, General Manager for Amazon S3 API Experience, and Rahim Bhojani, Dremio’s SVP and Head of Engineering, discuss the evolution of data lakes, capabilities required to build a modern data lake architecture, emerging trends, and how organizations turn data into strategic assets.
-
Scaling Row Level Deletions at Pinterest
Ashish Singh
With close to exabyte-scale data at Pinterest and evolving business needs, the ability to efficiently perform row-level deletions on petabytes of data is important. This talk shares how we use Apache Iceberg to achieve this goal at Pinterest. We will discuss challenges specific to row-level deletion, the solutions we considered, and their trade-offs. Furthermore, we will share some bottlenecks that row-level deletions run into and the optimizations we added to resolve them. Given how important data deletion requirements are today, we hope the learnings and solutions shared in this session will help your business save money while improving reliability.
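To make the row-level deletion mechanics concrete, here is a small sketch in the spirit of Iceberg v2 position-delete files; it is illustrative only, not Pinterest's implementation. A position-delete file records (data_file_path, row_position) pairs, and readers skip those rows at scan time instead of rewriting the data files.

```python
def scan_with_deletes(data_files, position_deletes):
    """Merge-on-read scan sketch.

    data_files: {file_path: [rows]} in file order.
    position_deletes: iterable of (file_path, row_position) pairs.
    Yields every row not covered by a position delete.
    """
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)
    for path, rows in data_files.items():
        skip = deleted.get(path, set())
        for pos, row in enumerate(rows):
            if pos not in skip:
                yield row
```

The trade-off discussed in the talk follows directly from this shape: writes are cheap (append a small delete file), but every read pays the cost of merging deletes until the data files are eventually compacted.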
-
Tame Small Files and Optimize Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Apple & Gang Ye, Apple
In modern data architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: the small files problem that can hurt read performance, and poor data clustering that can make file pruning less effective.
In this video, we will discuss how data teams can address those problems by adding a shuffling stage to the Flink Iceberg streaming writer to intelligently group data via bin packing or range partitioning, reduce the number of concurrent files that every task writes, and improve data clustering. We will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
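The range-partitioning flavor of the shuffle can be sketched in a few lines. This is a hedged illustration, not Flink or Iceberg code, and the key boundaries are assumed inputs: records are routed to writer subtasks by key range, so each writer produces a small number of well-clustered files instead of every writer touching every key range.

```python
import bisect

def route(record_key, boundaries):
    """Return the index of the writer subtask responsible for record_key,
    using sorted range boundaries (len(boundaries) + 1 writers)."""
    return bisect.bisect_left(boundaries, record_key)

def shuffle(records, key_fn, boundaries):
    """Group records by key range, one bucket per writer subtask."""
    writers = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        writers[route(key_fn(rec), boundaries)].append(rec)
    return writers
```

In a real pipeline the boundaries would be derived from observed key statistics, so the ranges stay balanced as the traffic distribution shifts.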
-
Apache Iceberg's Best Secret: A Guide to Metadata Tables
Szehon Ho, Software Engineer, Apple
Apache Iceberg’s rich metadata is its secret sauce, powering core features like time travel, query optimizations, and optimistic concurrency handling. But did you know that this metadata is accessible to all, via easy-to-use system tables?
This talk will walk through real-life examples of using metadata tables to get even more out of Iceberg and address questions such as:
- What is the last partition updated and when?
- Why are there too many small files?
- What Iceberg maintenance procedures can give us better query performance?
- Can we start building more advanced systems like data audit and data quality?
- How many null values are being added per hour?
- What is the latency of data ingest over time?
We will also cover metadata table performance tips and tricks, and ongoing improvements in the community. Whether you are already using Iceberg metadata tables or are interested in getting started, watch this talk to learn how this under-utilized feature can help you manage data tables more effectively than ever before.
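As a hedged illustration of the first question above, here is a plain-Python sketch of finding the most recently updated partition from file-level metadata, in the spirit of querying Iceberg's files and snapshots metadata tables. The entry fields used here are assumptions for the demo, not the exact metadata-table schema.

```python
def last_updated_partition(file_entries):
    """file_entries: dicts with 'partition' and 'committed_at_ms' keys.

    Returns (partition, timestamp_ms) for the partition whose most
    recent file commit is the latest overall."""
    latest = {}
    for entry in file_entries:
        part = entry["partition"]
        ts = entry["committed_at_ms"]
        if part not in latest or ts > latest[part]:
            latest[part] = ts
    return max(latest.items(), key=lambda kv: kv[1])
```

The same aggregation shape (group file entries by partition, then reduce) answers several of the other questions too, such as counting small files per partition.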
-
Fast Data Processing with Apache Arrow
Andrei Ionescu, Senior Software Engineer, Adobe
Using Rust, Apache Arrow, and table formats, data can be processed efficiently, close to the hardware, and without garbage-collection pauses. In this video, we will explain the pros and cons of Apache Arrow for data processing and compare its performance with Apache Spark, the "standard" for distributed big data processing.
We will discuss the advantages of the Rust language, including Rust Arrow and the tools available, the missing pieces, and performance comparisons.
-
Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers
Dipankar Mazumdar, Developer Advocate, Dremio
Network protocols for transferring data generally have one of two problems: they are slow for large data transfers but have simple APIs (e.g., JDBC), or they are fast for large data transfers but have complex, system-specific APIs. Apache Arrow Flight addresses the first problem by providing high-performance data transfers, and half of the second by defining a standard API that is independent of any particular system. However, while the Arrow Flight API is performant and an open standard, it can be more complex to use than simpler APIs like JDBC.
Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API.
In this video, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.
-
DataOps in action with Nessie, Iceberg and Great Expectations
Antonio Murgia, Data Architect, Agile Lab
This talk will present how Nessie, Iceberg, and Great Expectations are used to build a DataOps pipeline that ensures data quality and avoids “datastrophes.”