Tips and Traps¶
TABLESAMPLEmust be immedidately after a table name.The
WHEREclause in the following SQL query runs afterTABLESAMPLE.SELECT * FROM table_name TABLESAMPLE (10 PERCENT) WHERE id = 1If you want to run a
WHEREclause first and then doTABLESAMPLE, you have to a subquery instead.SELECT * FROM ( SELECT * FROM table_name WHERE id = 1 ) A TABLESAMPLE (10 PERCENT)Avoid using
TABLESAMPLE (k rows)as it is not simple random sample but instead implemented usingLIMIT. Always useTABLESAMPLE (p PERCENT)if randomness is important.
Be Careful with Subsampling¶
If you don’t persist the data frame, it’s recalculated every time! This is really dangerous for any random associated data processing, e.g., subsampling.
Do NOT Sampling a Specific Number of Rows¶
Sampling a specific number of rows in Spark does not performance a simple random sampling,
it is implemented as LIMIT
It is suggested that you always sample a fraction instead of sampling a specific number of rows in Spark
if randomness is important.
# avoid
select * from table_name TABLESAMPLE (100 ROWS)
# use the following instead
select * from table_name TABLESAMPLE (1 PCT) SQL API¶
SELECT * FROM some_table
TABLESAMPLE (100 ROWS)SELECT * FROM some_table
TABLESAMPLE (50 PERCENT)import pandas as pd
import findspark
findspark.init("/opt/spark-3.1.2-bin-hadoop3.2/")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType, StructType
spark = SparkSession.builder.appName("PySpark Sample").enableHiveSupport().getOrCreate()21/10/04 10:22:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
!curl -sSL https://www.legendu.net/media/data/iris.csv -o iris.csvdf = spark.read.option("header", "true").csv("iris.csv")
df.show()+---+---------------+--------------+---------------+--------------+-----------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| species|
+---+---------------+--------------+---------------+--------------+-----------+
| 1| 5.1| 3.5| 1.4| 0.2|Iris-setosa|
| 2| 4.9| 3.0| 1.4| 0.2|Iris-setosa|
| 3| 4.7| 3.2| 1.3| 0.2|Iris-setosa|
| 4| 4.6| 3.1| 1.5| 0.2|Iris-setosa|
| 5| 5.0| 3.6| 1.4| 0.2|Iris-setosa|
| 6| 5.4| 3.9| 1.7| 0.4|Iris-setosa|
| 7| 4.6| 3.4| 1.4| 0.3|Iris-setosa|
| 8| 5.0| 3.4| 1.5| 0.2|Iris-setosa|
| 9| 4.4| 2.9| 1.4| 0.2|Iris-setosa|
| 10| 4.9| 3.1| 1.5| 0.1|Iris-setosa|
| 11| 5.4| 3.7| 1.5| 0.2|Iris-setosa|
| 12| 4.8| 3.4| 1.6| 0.2|Iris-setosa|
| 13| 4.8| 3.0| 1.4| 0.1|Iris-setosa|
| 14| 4.3| 3.0| 1.1| 0.1|Iris-setosa|
| 15| 5.8| 4.0| 1.2| 0.2|Iris-setosa|
| 16| 5.7| 4.4| 1.5| 0.4|Iris-setosa|
| 17| 5.4| 3.9| 1.3| 0.4|Iris-setosa|
| 18| 5.1| 3.5| 1.4| 0.3|Iris-setosa|
| 19| 5.7| 3.8| 1.7| 0.3|Iris-setosa|
| 20| 5.1| 3.8| 1.5| 0.3|Iris-setosa|
+---+---------------+--------------+---------------+--------------+-----------+
only showing top 20 rows
Sample with Replacement¶
df.sample(True, 0.1).show()+---+---------------+--------------+---------------+--------------+---------------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| species|
+---+---------------+--------------+---------------+--------------+---------------+
| 3| 4.7| 3.2| 1.3| 0.2| Iris-setosa|
| 7| 4.6| 3.4| 1.4| 0.3| Iris-setosa|
| 11| 5.4| 3.7| 1.5| 0.2| Iris-setosa|
| 16| 5.7| 4.4| 1.5| 0.4| Iris-setosa|
| 24| 5.1| 3.3| 1.7| 0.5| Iris-setosa|
| 40| 5.1| 3.4| 1.5| 0.2| Iris-setosa|
| 47| 5.1| 3.8| 1.6| 0.2| Iris-setosa|
| 51| 7.0| 3.2| 4.7| 1.4|Iris-versicolor|
| 92| 6.1| 3.0| 4.6| 1.4|Iris-versicolor|
| 99| 5.1| 2.5| 3.0| 1.1|Iris-versicolor|
|113| 6.8| 3.0| 5.5| 2.1| Iris-virginica|
|119| 7.7| 2.6| 6.9| 2.3| Iris-virginica|
|130| 7.2| 3.0| 5.8| 1.6| Iris-virginica|
|149| 6.2| 3.4| 5.4| 2.3| Iris-virginica|
+---+---------------+--------------+---------------+--------------+---------------+
Sample without Replacement¶
df.sample(False, 0.9).show()+---+---------------+--------------+---------------+--------------+-----------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| species|
+---+---------------+--------------+---------------+--------------+-----------+
| 2| 4.9| 3.0| 1.4| 0.2|Iris-setosa|
| 3| 4.7| 3.2| 1.3| 0.2|Iris-setosa|
| 4| 4.6| 3.1| 1.5| 0.2|Iris-setosa|
| 5| 5.0| 3.6| 1.4| 0.2|Iris-setosa|
| 6| 5.4| 3.9| 1.7| 0.4|Iris-setosa|
| 8| 5.0| 3.4| 1.5| 0.2|Iris-setosa|
| 9| 4.4| 2.9| 1.4| 0.2|Iris-setosa|
| 10| 4.9| 3.1| 1.5| 0.1|Iris-setosa|
| 11| 5.4| 3.7| 1.5| 0.2|Iris-setosa|
| 12| 4.8| 3.4| 1.6| 0.2|Iris-setosa|
| 13| 4.8| 3.0| 1.4| 0.1|Iris-setosa|
| 14| 4.3| 3.0| 1.1| 0.1|Iris-setosa|
| 16| 5.7| 4.4| 1.5| 0.4|Iris-setosa|
| 17| 5.4| 3.9| 1.3| 0.4|Iris-setosa|
| 18| 5.1| 3.5| 1.4| 0.3|Iris-setosa|
| 19| 5.7| 3.8| 1.7| 0.3|Iris-setosa|
| 20| 5.1| 3.8| 1.5| 0.3|Iris-setosa|
| 22| 5.1| 3.7| 1.5| 0.4|Iris-setosa|
| 23| 4.6| 3.6| 1.0| 0.2|Iris-setosa|
| 24| 5.1| 3.3| 1.7| 0.5|Iris-setosa|
+---+---------------+--------------+---------------+--------------+-----------+
only showing top 20 rows
df.sample(False, 0.5).show()+---+---------------+--------------+---------------+--------------+-----------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| species|
+---+---------------+--------------+---------------+--------------+-----------+
| 5| 5.0| 3.6| 1.4| 0.2|Iris-setosa|
| 7| 4.6| 3.4| 1.4| 0.3|Iris-setosa|
| 8| 5.0| 3.4| 1.5| 0.2|Iris-setosa|
| 9| 4.4| 2.9| 1.4| 0.2|Iris-setosa|
| 10| 4.9| 3.1| 1.5| 0.1|Iris-setosa|
| 13| 4.8| 3.0| 1.4| 0.1|Iris-setosa|
| 14| 4.3| 3.0| 1.1| 0.1|Iris-setosa|
| 15| 5.8| 4.0| 1.2| 0.2|Iris-setosa|
| 16| 5.7| 4.4| 1.5| 0.4|Iris-setosa|
| 18| 5.1| 3.5| 1.4| 0.3|Iris-setosa|
| 19| 5.7| 3.8| 1.7| 0.3|Iris-setosa|
| 23| 4.6| 3.6| 1.0| 0.2|Iris-setosa|
| 24| 5.1| 3.3| 1.7| 0.5|Iris-setosa|
| 28| 5.2| 3.5| 1.5| 0.2|Iris-setosa|
| 30| 4.7| 3.2| 1.6| 0.2|Iris-setosa|
| 34| 5.5| 4.2| 1.4| 0.2|Iris-setosa|
| 35| 4.9| 3.1| 1.5| 0.1|Iris-setosa|
| 37| 5.5| 3.5| 1.3| 0.2|Iris-setosa|
| 46| 4.8| 3.0| 1.4| 0.3|Iris-setosa|
| 50| 5.0| 3.3| 1.4| 0.2|Iris-setosa|
+---+---------------+--------------+---------------+--------------+-----------+
only showing top 20 rows