Tame Small Files and Optimize Data Layout for Streaming Ingestion to Iceberg

Presented by

Steven Wu, Apple & Gang Ye, Apple

About this talk

In modern data architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: the small-files problem, which hurts read performance, and poor data clustering, which makes file pruning less effective. In this video, we will discuss how data teams can address those problems by adding a shuffling stage to the Flink Iceberg streaming writer that intelligently groups data via bin packing or range partitioning, reducing the number of concurrent files each task writes and improving data clustering. We will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
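To give a flavor of the range-partitioning idea described above, here is a minimal standalone sketch (not the talk's actual Flink implementation; function names and the traffic-statistics input are hypothetical). Given per-key traffic weights, it splits the sorted key space into contiguous ranges of roughly equal total weight, so each writer subtask receives a similar load and writes to a small, well-clustered set of files:

```python
# Simplified sketch of statistics-driven range partitioning for a
# streaming writer. Assumes per-key traffic weights are already
# collected; all names here are illustrative, not a real Flink API.
from bisect import bisect_left
from itertools import accumulate


def build_range_bounds(key_weights, num_writers):
    """Split sorted keys into num_writers contiguous ranges of
    roughly equal total weight. Returns the inclusive upper-bound
    key of each range except the last."""
    keys = sorted(key_weights)
    weights = [key_weights[k] for k in keys]
    total = sum(weights)
    cumulative = list(accumulate(weights))
    bounds = []
    for i in range(1, num_writers):
        target = total * i / num_writers
        # First key whose cumulative weight reaches the target.
        idx = min(bisect_left(cumulative, target), len(keys) - 1)
        bounds.append(keys[idx])
    return bounds


def assign_writer(key, bounds):
    """Route a record's key to a writer subtask via binary search
    over the range bounds."""
    return bisect_left(bounds, key)
```

With a skewed distribution such as `{'a': 8, 'b': 1, 'c': 1}` and two writers, the hot key `'a'` gets a writer to itself while `'b'` and `'c'` share the other, which is the kind of skew-aware grouping a naive hash shuffle cannot provide.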
