Tips and Traps¶
`TABLESAMPLE` must appear immediately after a table name. The `WHERE` clause in the following SQL query runs after `TABLESAMPLE`.

SELECT * FROM table_name TABLESAMPLE (10 PERCENT) WHERE id = 1

If you want to run a `WHERE` clause first and then do `TABLESAMPLE`, you have to use a subquery instead.

SELECT * FROM (
    SELECT * FROM table_name WHERE id = 1
) A TABLESAMPLE (10 PERCENT)

Avoid using `TABLESAMPLE (k ROWS)` as it is not a simple random sample but is implemented using `LIMIT`. Always use `TABLESAMPLE (p PERCENT)` if randomness is important.
Be Careful with Subsampling¶
If you don't persist a DataFrame, it is recomputed every time an action touches it! This is really dangerous for any data processing involving randomness, e.g., subsampling.
Do NOT Sample a Specific Number of Rows¶
Sampling a specific number of rows in Spark does not perform a simple random sampling;
it is implemented as `LIMIT`.
It is suggested that you always sample a fraction instead of sampling a specific number of rows in Spark
if randomness is important.
# avoid
select * from table_name TABLESAMPLE (100 ROWS)
# use the following instead
select * from table_name TABLESAMPLE (1 PERCENT)
SQL API¶
SELECT * FROM some_table
TABLESAMPLE (100 ROWS)
SELECT * FROM some_table
TABLESAMPLE (50 PERCENT)
import pandas as pd
import findspark
findspark.init("/opt/spark-3.1.2-bin-hadoop3.2/")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType, StructType
spark = SparkSession.builder.appName("PySpark Sample").enableHiveSupport().getOrCreate()
!curl -sSL https://www.legendu.net/media/data/iris.csv -o iris.csv
df = spark.read.option("header", "true").csv("iris.csv")
df.show()
Sample with Replacement¶
df.sample(True, 0.1).show()
Sample without Replacement¶
df.sample(False, 0.9).show()
df.sample(False, 0.5).show()