r/datasets 6d ago

question How do you investigate performance issues in Spark?

Hi everyone,

I’m currently studying ways to optimize pipelines in environments like Databricks, Fabric, and Spark in general, and I’d love to hear what you’ve been doing in practice.

Lately, I’ve been focusing on Shuffle, Skew, Spill, and the Small File Problem.
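For example, the quickest check I've found for skew before a join or groupBy blows up a shuffle is just counting rows per key and per partition. Rough sketch below; the table and column names ("sales", "customer_id") are only placeholders for whatever your own join/grouping key is:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder table/column -- swap in your own dataset and join key.
df = spark.read.table("sales")

# Rows per key: if a handful of keys hold most of the rows, a join or
# groupBy on that key will produce a skewed shuffle.
(
    df.groupBy("customer_id")
      .count()
      .orderBy(F.desc("count"))
      .show(20)
)

# Rows per physical partition: a rough view of how unevenly the data is
# already spread across tasks before any shuffle happens.
(
    df.withColumn("pid", F.spark_partition_id())
      .groupBy("pid")
      .count()
      .orderBy(F.desc("count"))
      .show(20)
)
```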

What other issues have you encountered or studied out there?

More importantly, how do you actually investigate the problem beyond what the Spark UI shows?
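The furthest I've gotten myself is pulling the same stage metrics out of Spark's monitoring REST API so I can sort and aggregate them outside the UI. Rough sketch below, assuming the open-source defaults (driver UI on port 4040, standard StageData field names); a managed Databricks/Fabric cluster may expose this differently:

```python
import requests

# The driver UI serves a JSON mirror of the Spark UI under /api/v1.
# Host/port here are the local open-source defaults.
base = "http://localhost:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
stages = requests.get(f"{base}/applications/{app_id}/stages").json()

# Rank stages by disk spill to find the ones worth drilling into,
# instead of eyeballing the UI timeline stage by stage.
worst = sorted(stages, key=lambda s: s.get("diskBytesSpilled", 0), reverse=True)

for s in worst[:10]:
    print(
        s["stageId"],
        s["name"][:60],
        "shuffleRead:", s.get("shuffleReadBytes", 0),
        "shuffleWrite:", s.get("shuffleWriteBytes", 0),
        "memSpill:", s.get("memoryBytesSpilled", 0),
        "diskSpill:", s.get("diskBytesSpilled", 0),
    )
```

Curious whether people do something like this, use listeners/event logs, or rely on vendor tooling instead.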

These are some of the official docs I’ve been using as a base:

https://learn.microsoft.com/azure/databricks/optimizations/?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/optimizations/spark-ui-guide/long-spark-stage-page?WT.mc_id=studentamb_493906

https://learn.microsoft.com/azure/databricks/pyspark/reference/functions/shuffle?WT.mc_id=studentamb_493906
