Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Sample Rows from a Spark DataFrame

Tips and Traps

  1. TABLESAMPLE must come immediately after a table name.

  2. The WHERE clause in the following SQL query runs after TABLESAMPLE.

     SELECT 
         *
     FROM 
         table_name 
     TABLESAMPLE (10 PERCENT) 
     WHERE 
         id = 1
    
    

    If you want to run a WHERE

Union RDDs in Spark

For efficiency, Spark 2.1.0+ does not deduplicate rows when unioning RDDs/DataFrames. Call distinct() on the result if you need unique rows.

  1. Union 2 RDDs.

    df1.union(df2)
    # or for old-fashioned RDD
    rdd1.union(rdd2)
    
  2. Union multiple RDDs.

    from functools import reduce
    from pyspark.sql import DataFrame
    df = reduce(DataFrame.union, [df1, df2, df3])  # SparkSession has no union method
    # or for old-fashioned RDD
    rdd …
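As a minimal sketch of unioning several inputs, the helper below folds `union` over a list from left to right. The name `union_all` is hypothetical (it is not part of Spark), and the helper only assumes that each element has a `union` method, as PySpark DataFrames and RDDs do.

```python
from functools import reduce


def union_all(frames):
    """Union any number of DataFrames (or RDDs), left to right.

    `union_all` is an illustrative helper, not a Spark API. It relies only
    on each element exposing a `union(other)` method. Duplicates are kept,
    matching the behavior of `union` itself.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("union_all needs at least one DataFrame or RDD")
    # reduce applies a.union(b) pairwise: ((f1 ∪ f2) ∪ f3) ∪ ...
    return reduce(lambda left, right: left.union(right), frames)
```

With PySpark this would be called as `df = union_all([df1, df2, df3])`, followed by `.distinct()` if duplicates should be dropped.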

Union DataFrames in Spark

Comment

  1. union relies on column order rather than column names. This is the same as in SQL. For columns whose types don't match, the super type is used. However, this is really dangerous if you are not careful. It is suggested that you define a function called unionByName to handle this.
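Spark 2.3+ ships a built-in `DataFrame.unionByName`. For older versions, a by-name union can be sketched as below: reorder the right-hand columns to match the left before calling `union`. The name `union_by_name` is hypothetical, and the sketch assumes both inputs have exactly the same set of column names.

```python
def union_by_name(left, right):
    """Union two DataFrames by column name instead of column position.

    Illustrative helper, not a Spark API. It assumes both inputs expose
    `columns`, `select(*names)` and `union(other)`, as PySpark DataFrames do.
    """
    if sorted(left.columns) != sorted(right.columns):
        raise ValueError(
            f"column names differ: {left.columns} vs {right.columns}"
        )
    # Reorder the right side so positions line up with the left side,
    # then let the position-based union do the rest.
    return left.union(right.select(*left.columns))
```

Failing loudly on mismatched column names is a deliberate choice here, since a silent position-based union is exactly the trap described above.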

Static Type Checking of Python Scripts Using pytype

Configuration

There are 3 ways to control the behavior of pytype.

  1. Pass command-line options to pytype.

  2. Specify a configuration file using pytype --config /path/to/config/file .... You can generate an example configuration file using the command pytype --generate-config pytype.cfg.

  3. If no configuration file is found, pytype uses the …

Call Java Using PyJNIus from Python

PyJNIus is a simple-to-use Java interface for Python. However, JPype is a better alternative.

Installation

pip install Cython
pip install pyjnius

Example with Imported Jar

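import os
# Set CLASSPATH before importing jnius, since the import starts the JVM
os.environ["CLASSPATH"] = "/path/to/your.jar"
from jnius import autoclass
# autoclass takes the fully qualified class name as a string
YourClass = autoclass("path.to.YourClass")
yourObj = YourClass()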
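When more than one jar is needed, the same pattern can be wrapped as in the sketch below. The helper names `build_classpath` and `load_java_class` are hypothetical, and the sketch assumes the JVM has not been started yet, since changes to CLASSPATH made after jnius is imported are unlikely to take effect.

```python
import os


def build_classpath(jars, existing=None):
    """Join jar paths with the platform separator, dropping duplicates.

    Entries already present in `existing` (a CLASSPATH string) come first.
    """
    entries = []
    if existing:
        entries.extend(p for p in existing.split(os.pathsep) if p)
    entries.extend(str(j) for j in jars)
    unique = []
    for entry in entries:
        if entry not in unique:  # keep first occurrence, preserve order
            unique.append(entry)
    return os.pathsep.join(unique)


def load_java_class(class_name, jars):
    """Set CLASSPATH, then import jnius and return the Java class proxy."""
    os.environ["CLASSPATH"] = build_classpath(jars, os.environ.get("CLASSPATH"))
    # Imported late on purpose so the JVM sees the updated CLASSPATH.
    from jnius import autoclass
    return autoclass(class_name)
```

A call would look like `YourClass = load_java_class("path.to.YourClass", ["/path/to/your.jar"])`, using the same placeholder names as the example above.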

Note: Avoid using the same …