Review Guidelines on How to Conquer the Most Common Spark Problems You are Likely to Encounter
Apache Spark is a full-fledged, data engineering toolkit that enables you to operate on large data sets without worrying about the underlying infrastructure. Spark is known for its speed, which is a result of improved implementation of MapReduce that focuses on keeping data in memory instead of persisting data on disk. However, in addition to its great benefits, Spark has its issues including complex deployment and scaling. How best to deal with these and other challenges and maximize the value you are getting from Spark?
Drawing on experiences across dozens of production deployments, Pepperdata Field Engineer Alexander Pierce explores issues observed in a cluster environment with Apache Spark and offers guidelines on how to overcome the most common Spark problems you are likely to encounter. Alex will also accompany his presentation with demonstrations and examples. Attendees can use this information to improve the usability and supportability of Spark in their projects and successfully overcome common challenges. During this webinar, attendees will learn about:
– Serialization and its role in Spark performance
– Partition recommendations and sizing
– Executor resource sizing and heap utilization
– Driver-side vs. executor-side processing: reducing idle executor time
– Using shading to manage library conflicts