Tips and Traps¶
`TABLESAMPLE` must appear immediately after a table name. The `WHERE` clause in the following SQL query runs after `TABLESAMPLE`.

SELECT * FROM table_name TABLESAMPLE (10 PERCENT) WHERE id = 1

If you want to run a `WHERE` clause first and then do `TABLESAMPLE`, you have to use a subquery instead.

SELECT * FROM (
    SELECT * FROM table_name WHERE id = 1
) A TABLESAMPLE (10 PERCENT)

Avoid using `TABLESAMPLE (k ROWS)` as it is not a simple random sample but is implemented using `LIMIT`. Always use `TABLESAMPLE (p PERCENT)` if randomness is important.
Be Careful with Subsampling¶
If you don't persist a DataFrame, it is recomputed every time an action touches it! This is really dangerous for any data processing involving randomness, e.g., subsampling.
Do NOT Sample a Specific Number of Rows¶
Sampling a specific number of rows in Spark does not perform a simple random sampling;
it is implemented as `LIMIT`.
It is suggested that you always sample a fraction instead of sampling a specific number of rows in Spark
if randomness is important.
# avoid
select * from table_name TABLESAMPLE (100 ROWS)
# use the following instead
select * from table_name TABLESAMPLE (1 PERCENT)
SQL API¶
SELECT * FROM some_table
TABLESAMPLE (100 ROWS)
SELECT * FROM some_table
TABLESAMPLE (50 PERCENT)
import pandas as pd
import findspark
findspark.init("/opt/spark-3.1.2-bin-hadoop3.2/")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType, StructType
spark = SparkSession.builder.appName("PySpark Sample").enableHiveSupport().getOrCreate()
!curl -sSL https://www.legendu.net/media/data/iris.csv -o iris.csv
df = spark.read.option("header", "true").csv("iris.csv")
df.show()
Sample with Replacement¶
df.sample(True, 0.1).show()
Sample without Replacement¶
df.sample(False, 0.9).show()
df.sample(False, 0.5).show()