Manipulate Videos Using MoviePy in Python
Installation
Process Big Data Using PySpark
- PySpark 2.4 and older do not support Python 3.8. Use Python 3.7 or earlier with PySpark 2.4 or older.
- It can be extremely helpful to run a PySpark application locally to detect possible issues before submitting it to the Spark cluster.
#!/usr/bin/env bash …
Get CentOS Version
You can get the version of CentOS using the following command.
rpm -q centos-release
This trick can be used to get the version of the CentOS distribution on a Spark cluster. Basically, you run this command in the driver or workers to print the versions and then parse the log …
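The "run the command, then parse the log" step could look roughly like the sketch below. It assumes the output of rpm -q centos-release has the usual form centos-release-7-9.2009.1.el7.centos.x86_64. The function names and the sample log lines are made up for illustration, and the regular expression may need adjusting for other releases.

```python
import re
import subprocess

# Assumed package-name format, e.g. centos-release-7-9.2009.1.el7.centos.x86_64
RELEASE_PATTERN = re.compile(r"centos-release-(\d+(?:[.-]\d+)*)")


def query_centos_release():
    """Run `rpm -q centos-release` and return its raw output (stdout + stderr)."""
    proc = subprocess.run(
        ["rpm", "-q", "centos-release"], capture_output=True, text=True
    )
    return proc.stdout + proc.stderr


def extract_centos_versions(log_text):
    """Return the sorted, de-duplicated CentOS versions found in a log."""
    return sorted(set(RELEASE_PATTERN.findall(log_text)))


# Hypothetical log lines, as the driver and a worker might print them.
sample_log = (
    "driver: centos-release-7-9.2009.1.el7.centos.x86_64\n"
    "worker-1: centos-release-7-9.2009.1.el7.centos.x86_64\n"
)
print(extract_centos_versions(sample_log))
```

On a cluster, each node would print the result of query_centos_release(), and extract_centos_versions would then be applied to the collected log.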
Control Number of Partitions of a DataFrame in Spark
Tips and Traps
DataFrame.repartition redistributes the rows of a DataFrame across partitions. If you pass one or more columns (instead of a number of partitions) to DataFrame.repartition, the hash code of those columns determines the partition of each row. In some situations there are many hash collisions even if the total number of rows is small (e.g., a few thousand), which means that the generated partitions might be skewed.
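The skew can be illustrated without a Spark cluster. The sketch below is a pure-Python simulation of hash partitioning, not Spark code: zlib.crc32 stands in for the hash function Spark uses internally, and the column values and the partition count of 200 are made up. It shows one way skew can arise: when the partitioning column has few distinct values, at most that many partitions can receive any rows.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Assign each key to hash(key) % num_partitions and count rows per partition."""
    counts = Counter(
        zlib.crc32(str(key).encode("utf-8")) % num_partitions for key in keys
    )
    return [counts.get(i, 0) for i in range(num_partitions)]


# 3000 rows whose partitioning column takes only 5 distinct (hypothetical) values.
rows = [f"country_{i % 5}" for i in range(3000)]
sizes = partition_sizes(rows, 200)

non_empty = sum(1 for size in sizes if size > 0)
print("non-empty partitions:", non_empty)
print("largest partition:", max(sizes))
```

In an actual DataFrame, comparing the row counts per partition before and after repartitioning is a quick way to check whether the same effect is present.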
Spark Issue: Shell Related
Symptom 1
/bin/sh: hdfs: command not found
Possible Causes of Symptom 1
The command hdfs is not on the search path.
Possible Solutions to Symptom 1
- Use the full path to the command.
- Configure the environment variable PATH before you use the command.
- Find other alternatives to the command …
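The first two solutions can be combined in a small Python helper, sketched here under assumptions: the helper names are made up, and /opt/hadoop/bin is only a placeholder for wherever hdfs is installed on your nodes. The idea is to search PATH plus some extra directories, then invoke the command by its full path so the call no longer depends on the shell's search path.

```python
import os
import shutil
import subprocess


def find_command(name, extra_dirs=()):
    """Return the full path of `name`, searching extra_dirs before PATH, or None."""
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)


def run_command(args, extra_dirs=()):
    """Resolve args[0] to a full path, then run the command with it."""
    executable = find_command(args[0], extra_dirs)
    if executable is None:
        raise FileNotFoundError(f"{args[0]}: command not found on the search path")
    return subprocess.run([executable, *args[1:]], capture_output=True, text=True)


# "/opt/hadoop/bin" is a hypothetical install location; prints None if hdfs is absent.
print(find_command("hdfs", ["/opt/hadoop/bin"]))
```

Checking find_command first gives a clear error on the driver or worker before the job reaches the point where the shell would report "command not found".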