Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Control Number of Partitions of a DataFrame in Spark

Tips and Traps

  1. When given only a number of partitions, DataFrame.repartition distributes rows across partitions in a round-robin fashion. If you instead pass one or more columns to DataFrame.repartition, rows are assigned to partitions by the hash code of the values in those columns. In some situations there are many hash collisions even when the total number of rows is small (e.g., a few thousand), which means that the generated partitions might be skewed.
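
To make the skew concrete, here is a minimal sketch in plain Python (not Spark). It assumes a simplified model in which each row goes to partition hash(key) % num_partitions. Python's built-in hash stands in for Spark's own hash function, and the helper name partition_sizes is hypothetical, so the exact numbers will differ from what Spark produces.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition when a row goes to hash(key) % num_partitions."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# 3,000 rows with 3,000 distinct integer keys, all multiples of 8.
# Python hashes small non-negative ints to themselves, so every key
# collides on the same partition index when there are 8 partitions.
keys = [8 * i for i in range(3000)]
sizes = partition_sizes(keys, 8)
print(sizes)  # [3000, 0, 0, 0, 0, 0, 0, 0]
```

Even though every key is distinct, all rows land in a single partition in this toy model. In Spark, one way to check for the same effect is to count rows per partition after repartitioning.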

Resizing Hard Disk of Guest Machine in Virtualbox

Suppose you have a virtual hard disk in VirtualBox called xp.vdi. You can resize it with the following command, where the new size is given in megabytes.

VBoxManage modifyhd xp.vdi --resize 40960
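
The value 40960 above corresponds to 40 GB expressed in megabytes (40 × 1024). As a small sketch, the same command can be assembled in Python from a size in gigabytes. The helper name resize_command is hypothetical and only builds the argument list; it does not run VBoxManage.

```python
def resize_command(disk_path, size_gb):
    """Build the VBoxManage argument list for resizing a disk to size_gb gigabytes."""
    size_mb = size_gb * 1024  # --resize expects the new size in megabytes
    return ["VBoxManage", "modifyhd", disk_path, "--resize", str(size_mb)]


cmd = resize_command("xp.vdi", 40)
print(" ".join(cmd))  # VBoxManage modifyhd xp.vdi --resize 40960
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True) on a machine where VBoxManage is installed.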

The command currently doesn't support VMDK virtual disks, so if you have a virtual disk called xp.vmdk, you have to first convert it …