Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Sample Rows from a Spark DataFrame

Tips and Traps

  1. TABLESAMPLE must come immediately after a table name.

  2. The WHERE clause in the following SQL query runs after TABLESAMPLE, so the filter is applied only to the sampled rows.

     SELECT 
         *
     FROM 
         table_name 
     TABLESAMPLE (10 PERCENT) 
     WHERE 
         id = 1
    
    

    If you want to run a WHERE

Sample Lines from a File Using Command Line

NOTE: this article talks about sampling "lines" rather than "records". If a record can span multiple lines, e.g., if any field contains a newline (\n), the following tutorial does not work and you have to fall back to more powerful tools such as Python or R.
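
For the multi-line case mentioned in the note, here is a minimal Python sketch. It assumes the file is CSV with quoted fields (an assumption, since the article does not specify a format), and the function name sample_records and the toy data are hypothetical. It relies on the standard csv module to reassemble records that span several physical lines, and uses reservoir sampling so the whole file never has to fit in memory.

```python
import csv
import io
import random


def sample_records(lines, k, seed=None):
    """Return up to k records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    # csv.reader joins quoted fields that span several physical lines
    # into a single record, so we sample records rather than lines.
    for i, record in enumerate(csv.reader(lines)):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical data: the second record has a newline inside a quoted field.
data = io.StringIO('1,"single line"\n2,"two\nlines"\n3,"another"\n')
sample = sample_records(data, k=2, seed=0)
```

To run this on a real file, pass a file object opened with newline="" (as the csv module documentation recommends) in place of the StringIO object.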

Let's say …