Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Control Number of Partitions of a DataFrame in Spark

Tips and Traps

  1. DataFrame.repartition repartitions the DataFrame by hash code of each row. If you specify a (multiple) column(s) (instead of number of partitions) to the method DataFrame.repartition, then hash code of the column(s) are calculated for repartition. In some situations, there are lots of hash conflictions even if the total number of rows is small (e.g., a few thousand), which means that partitions generated might be skewed