CI Pathways: HPC Data Science with Apache Spark I

04/09/25 - 02:00 PM - 04:00 PM EDT

This session will introduce essential tools and techniques for manipulating very large datasets. It will cover common challenges and pitfalls encountered when migrating from more traditional databases, along with ways to mitigate them. The session will feature Apache Spark, an open-source unified analytics engine designed for large-scale data processing, and introduce the basics of PySpark, the Python API for Apache Spark. By the end of the session, attendees will have a solid foundation in managing large datasets and be prepared to tackle complex data processing tasks.

Location

Virtual

Speakers

Bryon Gill, PSC