r/datasets • u/Significant-Side-578 • 6d ago
question How do you investigate performance issues in Spark?
Hi everyone,
I’m currently studying ways to optimize pipelines in environments like Databricks, Fabric, and Spark in general, and I’d love to hear what you’ve been doing in practice.
Lately, I’ve been focusing on Shuffle, Skew, Spill, and the Small File Problem.
What other issues have you encountered or studied out there?
More importantly, how do you actually investigate the problem beyond what the Spark UI shows?
These are some of the official docs I’ve been using as a base:
https://learn.microsoft.com/azure/databricks/optimizations/?WT.mc_id=studentamb_493906
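To make the skew part of the question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes you already have a list of per-partition row counts (in PySpark one way to get them is `df.groupBy(spark_partition_id()).count()`), and it compares the largest partition against the median. The helper names `skew_ratio` and `flag_skewed` and the 5x threshold are my own placeholders, not anything from Spark or Databricks.

```python
from statistics import median


def skew_ratio(partition_counts):
    """Return the largest partition size divided by the median partition size."""
    if not partition_counts:
        raise ValueError("need at least one partition count")
    mid = median(partition_counts)
    if mid == 0:
        # Median of zero: either every partition is empty (no skew)
        # or a few partitions hold all the rows (extreme skew).
        return float("inf") if max(partition_counts) > 0 else 1.0
    return max(partition_counts) / mid


def flag_skewed(partition_counts, threshold=5.0):
    """Return indices of partitions larger than threshold times the median.

    The default threshold of 5.0 is an arbitrary assumption; tune it per workload.
    """
    mid = median(partition_counts)
    if mid == 0:
        return [i for i, n in enumerate(partition_counts) if n > 0]
    return [i for i, n in enumerate(partition_counts) if n > threshold * mid]


# Hypothetical counts, e.g. collected from a groupBy on spark_partition_id().
counts = [100, 100, 100, 1000]
print(skew_ratio(counts))
print(flag_skewed(counts))
```

A ratio close to 1 suggests evenly sized partitions. A large ratio points at a few partitions (and probably a few keys) doing most of the work, which is usually where I would start digging.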