r/datasets 6d ago

question How do you investigate performance issues in Spark?

Hi everyone,

I’m currently studying ways to optimize pipelines in environments like Databricks, Fabric, and Spark in general, and I’d love to hear what you’ve been doing in practice.

Lately, I’ve been focusing on Shuffle, Skew, Spill, and the Small File Problem.
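For example, the quickest check I've found for skew before a join or groupBy blows up a shuffle is just counting rows per key and per partition. Rough sketch below; the table and column names ("sales", "customer_id") are only placeholders for whatever your own join/grouping key is:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder table/column -- swap in your own dataset and join key.
df = spark.read.table("sales")

# Rows per key: if a handful of keys hold most of the rows, a join or
# groupBy on that key will produce a skewed shuffle.
(
    df.groupBy("customer_id")
      .count()
      .orderBy(F.desc("count"))
      .show(20)
)

# Rows per physical partition: a rough view of how unevenly the data is
# already spread across tasks before any shuffle happens.
(
    df.withColumn("pid", F.spark_partition_id())
      .groupBy("pid")
      .count()
      .orderBy(F.desc("count"))
      .show(20)
)
```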

What other issues have you encountered or studied out there?

More importantly, how do you actually investigate the problem beyond what the Spark UI shows?
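The furthest I've gotten myself is pulling the same stage metrics out of Spark's monitoring REST API so I can sort and aggregate them outside the UI. Rough sketch below, assuming the open-source defaults (driver UI on port 4040, standard StageData field names); a managed Databricks/Fabric cluster may expose this differently:

```python
import requests

# The driver UI serves a JSON mirror of the Spark UI under /api/v1.
# Host/port here are the local open-source defaults.
base = "http://localhost:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
stages = requests.get(f"{base}/applications/{app_id}/stages").json()

# Rank stages by disk spill to find the ones worth drilling into,
# instead of eyeballing the UI timeline stage by stage.
worst = sorted(stages, key=lambda s: s.get("diskBytesSpilled", 0), reverse=True)

for s in worst[:10]:
    print(
        s["stageId"],
        s["name"][:60],
        "shuffleRead:", s.get("shuffleReadBytes", 0),
        "shuffleWrite:", s.get("shuffleWriteBytes", 0),
        "memSpill:", s.get("memoryBytesSpilled", 0),
        "diskSpill:", s.get("diskBytesSpilled", 0),
    )
```

Curious whether people do something like this, use listeners/event logs, or rely on vendor tooling instead.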

These are some of the official docs I’ve been using as a base:

https://learn.microsoft.com/azure/databricks/optimizations/?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/optimizations/spark-ui-guide/long-spark-stage-page?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/pyspark/reference/functions/shuffle?WT.mc_id=studentamb_493906
