Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Sample Rows from a Spark DataFrame

Tips and Traps

  1. TABLESAMPLE must be immedidately after a table name.

  2. The WHERE clause in the following SQL query runs after TABLESAMPLE.

     SELECT 
         *
     FROM 
         table_name 
     TABLESAMPLE (10 PERCENT) 
     WHERE 
         id = 1
    
    

    If you want to run a WHERE clause first and then do TABLESAMPLE, you have to a subquery instead.

     SELECT 
         *
     FROM (
         SELECT * FROM table_name
         WHERE id = 1
     ) A
     TABLESAMPLE (10 PERCENT)
  3. Avoid using TABLESAMPLE (k rows) as it is not simple random sample but instead implemented using LIMIT. Always use TABLESAMPLE (p PERCENT) if randomness is important.

Be Careful with Subsampling

If you don't persist the data frame, it's recalculated every time! This is really dangerous for any random associated data processing, e.g., subsampling.

Do NOT Sampling a Specific Number of Rows

Sampling a specific number of rows in Spark does not performance a simple random sampling, it is implemented as LIMIT It is suggested that you always sample a fraction instead of sampling a specific number of rows in Spark if randomness is important.

# avoid 
select * from table_name TABLESAMPLE (100 ROWS) 
# use the following instead
select * from table_name TABLESAMPLE (1 PCT) 

SQL API

SELECT * FROM some_table
TABLESAMPLE (100 ROWS)
SELECT * FROM some_table
TABLESAMPLE (50 PERCENT)
In [1]:
import pandas as pd
import findspark

findspark.init("/opt/spark-3.1.2-bin-hadoop3.2/")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType, StructType

spark = SparkSession.builder.appName("PySpark Sample").enableHiveSupport().getOrCreate()
21/10/04 10:22:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
In [9]:
!curl -sSL https://www.legendu.net/media/data/iris.csv -o iris.csv
In [12]:
df = spark.read.option("header", "true").csv("iris.csv")
df.show()
+---+---------------+--------------+---------------+--------------+-----------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|    species|
+---+---------------+--------------+---------------+--------------+-----------+
|  1|            5.1|           3.5|            1.4|           0.2|Iris-setosa|
|  2|            4.9|           3.0|            1.4|           0.2|Iris-setosa|
|  3|            4.7|           3.2|            1.3|           0.2|Iris-setosa|
|  4|            4.6|           3.1|            1.5|           0.2|Iris-setosa|
|  5|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
|  6|            5.4|           3.9|            1.7|           0.4|Iris-setosa|
|  7|            4.6|           3.4|            1.4|           0.3|Iris-setosa|
|  8|            5.0|           3.4|            1.5|           0.2|Iris-setosa|
|  9|            4.4|           2.9|            1.4|           0.2|Iris-setosa|
| 10|            4.9|           3.1|            1.5|           0.1|Iris-setosa|
| 11|            5.4|           3.7|            1.5|           0.2|Iris-setosa|
| 12|            4.8|           3.4|            1.6|           0.2|Iris-setosa|
| 13|            4.8|           3.0|            1.4|           0.1|Iris-setosa|
| 14|            4.3|           3.0|            1.1|           0.1|Iris-setosa|
| 15|            5.8|           4.0|            1.2|           0.2|Iris-setosa|
| 16|            5.7|           4.4|            1.5|           0.4|Iris-setosa|
| 17|            5.4|           3.9|            1.3|           0.4|Iris-setosa|
| 18|            5.1|           3.5|            1.4|           0.3|Iris-setosa|
| 19|            5.7|           3.8|            1.7|           0.3|Iris-setosa|
| 20|            5.1|           3.8|            1.5|           0.3|Iris-setosa|
+---+---------------+--------------+---------------+--------------+-----------+
only showing top 20 rows

Sample with Replacement

In [16]:
df.sample(True, 0.1).show()
+---+---------------+--------------+---------------+--------------+---------------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|        species|
+---+---------------+--------------+---------------+--------------+---------------+
|  3|            4.7|           3.2|            1.3|           0.2|    Iris-setosa|
|  7|            4.6|           3.4|            1.4|           0.3|    Iris-setosa|
| 11|            5.4|           3.7|            1.5|           0.2|    Iris-setosa|
| 16|            5.7|           4.4|            1.5|           0.4|    Iris-setosa|
| 24|            5.1|           3.3|            1.7|           0.5|    Iris-setosa|
| 40|            5.1|           3.4|            1.5|           0.2|    Iris-setosa|
| 47|            5.1|           3.8|            1.6|           0.2|    Iris-setosa|
| 51|            7.0|           3.2|            4.7|           1.4|Iris-versicolor|
| 92|            6.1|           3.0|            4.6|           1.4|Iris-versicolor|
| 99|            5.1|           2.5|            3.0|           1.1|Iris-versicolor|
|113|            6.8|           3.0|            5.5|           2.1| Iris-virginica|
|119|            7.7|           2.6|            6.9|           2.3| Iris-virginica|
|130|            7.2|           3.0|            5.8|           1.6| Iris-virginica|
|149|            6.2|           3.4|            5.4|           2.3| Iris-virginica|
+---+---------------+--------------+---------------+--------------+---------------+

Sample without Replacement

In [20]:
df.sample(False, 0.9).show()
+---+---------------+--------------+---------------+--------------+-----------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|    species|
+---+---------------+--------------+---------------+--------------+-----------+
|  2|            4.9|           3.0|            1.4|           0.2|Iris-setosa|
|  3|            4.7|           3.2|            1.3|           0.2|Iris-setosa|
|  4|            4.6|           3.1|            1.5|           0.2|Iris-setosa|
|  5|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
|  6|            5.4|           3.9|            1.7|           0.4|Iris-setosa|
|  8|            5.0|           3.4|            1.5|           0.2|Iris-setosa|
|  9|            4.4|           2.9|            1.4|           0.2|Iris-setosa|
| 10|            4.9|           3.1|            1.5|           0.1|Iris-setosa|
| 11|            5.4|           3.7|            1.5|           0.2|Iris-setosa|
| 12|            4.8|           3.4|            1.6|           0.2|Iris-setosa|
| 13|            4.8|           3.0|            1.4|           0.1|Iris-setosa|
| 14|            4.3|           3.0|            1.1|           0.1|Iris-setosa|
| 16|            5.7|           4.4|            1.5|           0.4|Iris-setosa|
| 17|            5.4|           3.9|            1.3|           0.4|Iris-setosa|
| 18|            5.1|           3.5|            1.4|           0.3|Iris-setosa|
| 19|            5.7|           3.8|            1.7|           0.3|Iris-setosa|
| 20|            5.1|           3.8|            1.5|           0.3|Iris-setosa|
| 22|            5.1|           3.7|            1.5|           0.4|Iris-setosa|
| 23|            4.6|           3.6|            1.0|           0.2|Iris-setosa|
| 24|            5.1|           3.3|            1.7|           0.5|Iris-setosa|
+---+---------------+--------------+---------------+--------------+-----------+
only showing top 20 rows

In [22]:
df.sample(False, 0.5).show()
+---+---------------+--------------+---------------+--------------+-----------+
| id|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|    species|
+---+---------------+--------------+---------------+--------------+-----------+
|  5|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
|  7|            4.6|           3.4|            1.4|           0.3|Iris-setosa|
|  8|            5.0|           3.4|            1.5|           0.2|Iris-setosa|
|  9|            4.4|           2.9|            1.4|           0.2|Iris-setosa|
| 10|            4.9|           3.1|            1.5|           0.1|Iris-setosa|
| 13|            4.8|           3.0|            1.4|           0.1|Iris-setosa|
| 14|            4.3|           3.0|            1.1|           0.1|Iris-setosa|
| 15|            5.8|           4.0|            1.2|           0.2|Iris-setosa|
| 16|            5.7|           4.4|            1.5|           0.4|Iris-setosa|
| 18|            5.1|           3.5|            1.4|           0.3|Iris-setosa|
| 19|            5.7|           3.8|            1.7|           0.3|Iris-setosa|
| 23|            4.6|           3.6|            1.0|           0.2|Iris-setosa|
| 24|            5.1|           3.3|            1.7|           0.5|Iris-setosa|
| 28|            5.2|           3.5|            1.5|           0.2|Iris-setosa|
| 30|            4.7|           3.2|            1.6|           0.2|Iris-setosa|
| 34|            5.5|           4.2|            1.4|           0.2|Iris-setosa|
| 35|            4.9|           3.1|            1.5|           0.1|Iris-setosa|
| 37|            5.5|           3.5|            1.3|           0.2|Iris-setosa|
| 46|            4.8|           3.0|            1.4|           0.3|Iris-setosa|
| 50|            5.0|           3.3|            1.4|           0.2|Iris-setosa|
+---+---------------+--------------+---------------+--------------+-----------+
only showing top 20 rows

Comments