Featured
Dive Deep into Apache Iceberg on AWS
Veena Vasudevan, Sr. Partner Solutions Architect, Amazon Web Services
Watch this video to learn about the common challenges of using traditional file formats on premises, and how Apache Iceberg on AWS helps you overcome them. You will also learn about Apache Iceberg's comprehensive, advanced feature set through detailed demos that showcase its unique capabilities on AWS.
All episodes
-
The Year of the Data Lakehouse
Tomer Shiran, Anushka Anand, Deepika Duggirala, Tamas Kerekjarto
Data and Analytics organizations have worked to balance improving access and self-service for the business with achieving security and governance at scale. Data lakehouses are ideally suited to help organizations provide both agility and governance. Deepika Duggirala, SVP of Global Technology Platforms at TransUnion, and Tamas Kerekjarto, Head of Engineering, Renewables and Energy Solutions at Shell, will share their journeys to deliver governed self-service.
Apache Iceberg development and adoption accelerated significantly this year, enabling modern data lakes to deliver data warehouse functionality and become data lakehouses. With Apache Iceberg, the industry has consolidated around a vendor-agnostic table format, and innovation from tech companies (Apple, Netflix, etc.) and service providers (AWS, Dremio, GCP, Snowflake, etc.) is creating a world in which data and compute are independent. In this new world, companies can enjoy advancements in data processing, thanks to engine freedom, and data management, thanks to new paradigms such as Data-as-Code. Tomer Shiran, CPO and Founder of Dremio, will deliver Subsurface’s keynote address.
-
Data Governance at Scale with Microsoft
Mike Flasko, Read Maloney
Scaling data governance practices while continuing to provide efficient access to the data lake is a challenge all businesses face. No one knows this better than Microsoft, whose exabyte-scale data lake platform supports analytics that empower multiple billion-dollar lines of business. Join this session to hear Mike Flasko, VP & GM, Data Governance & Privacy at Microsoft, chat with Read Maloney, CMO of Dremio, about their journey and lessons learned, as they built one of the largest governed data lake platforms in the world.
-
Managing Data Files In Apache Iceberg
Russell Spitzer
Everything was going great: your data was in your data lake, queries were fast, and the SREs were happy. But then things started to slow down. Queries took longer; even queries that used to be fast started taking a long time. The culprit? Small and unorganized files. The solution? Apache Iceberg’s RewriteDataFiles action. This talk will dive into how RewriteDataFiles can 1) right-size your files, merging small files and splitting large ones, ensuring that no time is wasted in query planning or in opening files; and 2) reorganize the data within your files, supporting hierarchical sort and multidimensional z-ordering algorithms, so you can make sure your data is optimally laid out for your queries. With these two capabilities, any table can be kept at peak performance regardless of ingestion patterns and table size.
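For readers who want a feel for the mechanics, here is a minimal plain-Python sketch of the right-sizing idea. It is an illustrative model only, not Iceberg's implementation; the 128 MB target size and the first-fit grouping are assumptions for the demo.

```python
TARGET_MB = 128  # assumed target file size for the sketch

def right_size(file_sizes_mb):
    """Plan rewrite groups: merge small files, split oversized ones.

    Returns a list of groups, where each group of sizes would be
    rewritten into a single output file close to TARGET_MB.
    """
    groups = []            # each group becomes one rewritten output file
    current, total = [], 0
    for size in sorted(file_sizes_mb):
        if size >= TARGET_MB:
            # Split an oversized file into TARGET_MB-sized pieces.
            while size > TARGET_MB:
                groups.append([TARGET_MB])
                size -= TARGET_MB
            if size:
                groups.append([size])
            continue
        if total + size > TARGET_MB and current:
            groups.append(current)   # close the current merge group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups
```

In the real action, each group of small files would be rewritten as one file and each oversized file split across several, with the data optionally sorted or z-ordered as it is rewritten.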
-
Data Mesh in Practice Panel
Zhamak Dehghani, Raja Perumalsamy, Ugo Ciraci, Tomer Shiran, Ben Hudson
This panel will discuss the practical considerations of data mesh, tips for addressing the people and cultural aspects of data mesh, and real-world lessons learned when implementing a data mesh or data mesh-like approach. We will hear from experts and people with real-world experience implementing data mesh, including the creator of Data Mesh herself, Zhamak Dehghani. We will discuss topics such as the challenges faced during the implementation of data mesh, balancing data mesh with existing enterprise architectures and solutions, success metrics when implementing data mesh, and strategies for overcoming the organizational resistance to change. Join us to learn more about data mesh and how to implement it successfully in your organization.
-
Modern Data Lakehouse at Shell
Natarajan Kalidoss (Nata), Raja Perumalsamy
In this video, we will present how the Dremio data lakehouse came to the rescue when a large-scale machine learning use case in the electricity retail domain, power consumption forecasting, imposed significant challenges on Shell's data engineering and data science teams.
We will describe how we:
1.) addressed the large variety of data sources that needed to be ingested, processed and served to the Data Science team
2.) streamlined some of the Data Science workflows around data exploration, feature engineering and model testing
3.) operationalized and scaled ML training and inferencing
-
Lakehouse: Smart Iceberg Table Optimizer
Rajasekhar Konda, Steve Zhang
The new lakehouse architectural design pattern provides many technical benefits like ACID support, time travel for machine learning, and better query performance. Apache Iceberg implements this pattern and provides the flexibility to enhance it further based on real-world needs. Using the Apache Iceberg table format requires maintenance operations such as snapshot expiration and orphan-file removal for data governance, plus metadata and data compaction for fast, efficient access to data.
This video will discuss how to handle these table operations at very large scale, keeping cost in mind, without compromising data engineering, ML, analytical, and BI use cases. Automating these operations makes life easier for engineers using the platform, who no longer have to worry about how Iceberg internals work. We will share lessons learned optimizing streaming and batch data sets in a cost-effective and efficient way.
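As a concrete, hedged illustration of the snapshot-expiration policy mentioned above, here is a plain-Python sketch. The `Snapshot` shape, the keep-last-3 rule, and the age threshold are assumptions for the demo, not Iceberg's actual defaults or implementation.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int

def snapshots_to_expire(snapshots, now_ms, max_age_ms, keep_last=3):
    """Select snapshots older than max_age_ms for expiration,
    but always protect the newest keep_last snapshots."""
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms,
                          reverse=True)
    protected = {s.snapshot_id for s in newest_first[:keep_last]}
    return [s for s in newest_first
            if s.snapshot_id not in protected
            and now_ms - s.timestamp_ms > max_age_ms]
```

An automated optimizer would run a policy like this on a schedule, then delete the expired snapshots' unreferenced metadata and data files.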
-
Lakes and Lakehouses: The Evolution of Analytics in the Cloud with AWS
Rajesh Sampath, Rahim Bhojani
More organizations than ever before are adopting data lakes to drive business outcomes and innovate faster. As data lakes grow in size and scope, data lake architectures have evolved to balance agility, innovation and governance. Amazon Web Services (AWS) is a pioneer in cloud data lakes, to the point where Amazon S3 is now the de facto storage for data lakes. In this session, Rajesh Sampath, General Manager for Amazon S3 API Experience, and Rahim Bhojani, Dremio’s SVP and Head of Engineering, discuss the evolution of data lakes, capabilities required to build a modern data lake architecture, emerging trends, and how organizations turn data into strategic assets.
-
Scaling Row Level Deletions at Pinterest
Ashish Singh
With close to exabyte-scale data at Pinterest and evolving business needs, the ability to efficiently perform row-level deletions on petabytes of data is important. This talk shares how we use Apache Iceberg to achieve this goal at Pinterest. We will discuss challenges specific to row-level deletion, the solutions we considered, and their trade-offs. Furthermore, we will share some bottlenecks that row-level deletions run into and the optimizations we added to resolve them. Given how important data deletion requirements are today, we hope the learnings and solutions shared in this session will help your business save money while improving reliability.
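To make the row-level deletion mechanics concrete, here is a small sketch in the spirit of Iceberg v2 position-delete files; it is illustrative only, not Pinterest's implementation. A position-delete file records (data_file_path, row_position) pairs, and readers skip those rows at scan time instead of rewriting the data files.

```python
def scan_with_deletes(data_files, position_deletes):
    """Merge-on-read scan sketch.

    data_files: {file_path: [rows]} in file order.
    position_deletes: iterable of (file_path, row_position) pairs.
    Yields every row not covered by a position delete.
    """
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)
    for path, rows in data_files.items():
        skip = deleted.get(path, set())
        for pos, row in enumerate(rows):
            if pos not in skip:
                yield row
```

The trade-off discussed in the talk follows directly from this shape: writes are cheap (append a small delete file), but every read pays the cost of merging deletes until the data files are eventually compacted.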
-
Tame Small Files and Optimize Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Apple & Gang Ye, Apple
In modern data architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: the small files problem that can hurt read performance, and poor data clustering that can make file pruning less effective.
In this video, we will discuss how data teams can address those problems by adding a shuffling stage to the Flink Iceberg streaming writer to intelligently group data via bin packing or range partitioning, reduce the number of concurrent files that every task writes, and improve data clustering. We will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
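The range-partitioning flavor of the shuffle can be sketched in a few lines. This is a hedged illustration, not Flink or Iceberg code, and the key boundaries are assumed inputs: records are routed to writer subtasks by key range, so each writer produces a small number of well-clustered files instead of every writer touching every key range.

```python
import bisect

def route(record_key, boundaries):
    """Return the index of the writer subtask responsible for record_key,
    using sorted range boundaries (len(boundaries) + 1 writers)."""
    return bisect.bisect_left(boundaries, record_key)

def shuffle(records, key_fn, boundaries):
    """Group records by key range, one bucket per writer subtask."""
    writers = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        writers[route(key_fn(rec), boundaries)].append(rec)
    return writers
```

In a real pipeline the boundaries would be derived from observed key statistics, so the ranges stay balanced as the traffic distribution shifts.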
-
Apache Iceberg's Best Secret: A Guide to Metadata Tables
Szehon Ho, Software Engineer, Apple
Apache Iceberg’s rich metadata is its secret sauce, powering core features like time travel, query optimizations, and optimistic concurrency handling. But did you know that this metadata is accessible to all, via easy-to-use system tables?
This talk will walk through real-life examples of using metadata tables to get even more out of Iceberg and address questions such as:
- What is the last partition updated and when?
- Why are there too many small files?
- What Iceberg maintenance procedures can give us better query performance?
- Can we start building more advanced systems like data audit and data quality?
- How many null values are being added per hour?
- What is the latency of data ingest over time?
We will also cover metadata table performance tips and tricks, and ongoing improvements in the community. Whether you are already using Iceberg metadata tables or are interested in getting started, watch this talk to learn how this under-utilized feature can help you manage data tables more effectively than ever before.
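As a hedged illustration of the first question above, here is a plain-Python sketch of finding the most recently updated partition from file-level metadata, in the spirit of querying Iceberg's files and snapshots metadata tables. The entry fields used here are assumptions for the demo, not the exact metadata-table schema.

```python
def last_updated_partition(file_entries):
    """file_entries: dicts with 'partition' and 'committed_at_ms' keys.

    Returns (partition, timestamp_ms) for the partition whose most
    recent file commit is the latest overall."""
    latest = {}
    for entry in file_entries:
        part = entry["partition"]
        ts = entry["committed_at_ms"]
        if part not in latest or ts > latest[part]:
            latest[part] = ts
    return max(latest.items(), key=lambda kv: kv[1])
```

The same aggregation shape (group file entries by partition, then reduce) answers several of the other questions too, such as counting small files per partition.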
-
Fast Data Processing with Apache Arrow
Andrei Ionescu, Senior Software Engineer, Adobe
Using Rust, Apache Arrow, and table formats, data can be processed efficiently, close to the hardware, and without garbage-collection pauses. In this video, we will explain the pros and cons of Apache Arrow for data processing and compare its performance with Apache Spark, the "standard" for distributed big data processing.
We will discuss the advantages of the Rust language, including Rust Arrow and the tools available, the missing pieces, and performance comparisons.
-
Arrow Flight SQL: High Performance, Simplicity, and Interoperability for Data Transfers
Dipankar Mazumdar, Developer Advocate, Dremio
Network protocols for transferring data generally have one of two problems: they are slow for large data transfers but have simple APIs (e.g., JDBC), or they are fast for large data transfers but have complex, system-specific APIs. Apache Arrow Flight addresses the first problem by providing high-performance data transfers, and half of the second by defining a standard API that is independent of any particular system. However, while the Arrow Flight API is performant and an open standard, it can be more complex to use than simpler APIs like JDBC.
Arrow Flight SQL rounds out the solution, providing both great performance and a simple universal API.
In this video, we’ll show the performance benefits of Arrow Flight, the client difference between interacting with Arrow Flight and Arrow Flight SQL, and an overview of a JDBC driver built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.
-
DataOps in action with Nessie, Iceberg and Great Expectations
Antonio Murgia, Data Architect, Agile Lab
This talk will present how Nessie, Iceberg, and Great Expectations are used to build a DataOps pipeline that ensures data quality and avoids “datastrophes.”