Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Sample Rows from a Spark DataFrame

Tips and Traps

  1. TABLESAMPLE must be placed immediately after a table name.

  2. The WHERE clause in the following SQL query runs after TABLESAMPLE.

     SELECT 
         *
     FROM 
         table_name 
     TABLESAMPLE (10 PERCENT) 
     WHERE 
         id = 1
    
    

    If you want the WHERE clause to run before sampling, filter the rows in a subquery first (or use the DataFrame API, as in the sketch below).
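
A hedged sketch of the same idea with the DataFrame API (table_name, the id filter, and the fraction are just placeholders); sample lets you choose explicitly whether the filter runs before or after sampling:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample roughly 10% of the rows first, then filter,
    # which mirrors the SQL above (WHERE runs after TABLESAMPLE).
    df1 = spark.table("table_name").sample(fraction=0.1).where("id = 1")

    # Filter first, then sample, if the WHERE clause should run before sampling.
    df2 = spark.table("table_name").where("id = 1").sample(fraction=0.1)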

Union RDDs in Spark

No deduplication is done (for efficiency) when unioning RDDs/DataFrames in Spark 2.1.0+; that is, union behaves like SQL's UNION ALL (see the sketch after the examples below).

  1. Union 2 RDDs.

    df1.union(df2)
    # or for old-fashioned RDDs
    rdd1.union(rdd2)
    
  2. Union multiple RDDs.

    # fold union over a list of DataFrames
    from functools import reduce
    from pyspark.sql import DataFrame
    df = reduce(DataFrame.union, [df1, df2, df3])
    # or for old-fashioned RDDs (spark is a SparkSession object)
    rdd = spark.sparkContext.union([rdd1, rdd2, rdd3])
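
Since union keeps duplicates, call distinct() afterwards if you want SQL UNION semantics. A quick sketch with made-up DataFrames (spark is a SparkSession object):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Made-up DataFrames with one overlapping row.
    df1 = spark.createDataFrame([(1,), (2,)], ["id"])
    df2 = spark.createDataFrame([(2,), (3,)], ["id"])

    df1.union(df2).count()             # 4 rows: duplicates are kept (UNION ALL semantics)
    df1.union(df2).distinct().count()  # 3 rows: distinct() emulates SQL's UNION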

Union DataFrames in Spark

Comment

  1. union relies on column order rather than column names, which is the same as in SQL. For columns whose types don't match, the common super type is used. However, this is really dangerous if you are not careful. It is suggested that you use unionByName to handle this; see the sketch below.
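
A minimal sketch of the difference, with made-up DataFrames whose columns appear in different orders (assuming Spark 2.3+, where unionByName is available):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([("Ben", "Du")], ["first_name", "last_name"])
    df2 = spark.createDataFrame([("Doe", "John")], ["last_name", "first_name"])

    # union matches columns by position, so the names get swapped silently.
    df1.union(df2).show()

    # unionByName matches columns by name and gives the expected result.
    df1.unionByName(df2).show()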

Java Interfaces for Python

JPype, py4j, and PyJNIus are all good options for interfacing with Java from Python. JPype is easy to use and widely adopted. PyJNIus is an even easier solution compared to JPype. py4j is more complicated to use than JPype and PyJNIus; however, generally speaking, it has better performance.

JPype …
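
For illustration, a minimal JPype sketch (it assumes a JVM is installed and discoverable; pass a classpath argument to startJVM to use your own jars):

    import jpype

    # Start the default JVM.
    jpype.startJVM()

    # Use a standard Java class from Python.
    ArrayList = jpype.JClass("java.util.ArrayList")
    lst = ArrayList()
    lst.add("hello")
    print(lst.size())  # 1

    jpype.shutdownJVM()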

Work with Long Strings in Python

This article discusses different ways to write long strings in Python.

Long String in One Line

A long string can be put on the same line, which is of course ugly.
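
As a rough sketch (the sentences are made up), a long string on one line versus the common alternative of implicit concatenation of adjacent string literals:

    # Everything crammed onto a single line quickly becomes hard to read.
    s1 = "It is a truth universally acknowledged that a very long string on a single line is ugly."

    # Adjacent string literals inside parentheses are concatenated,
    # so the source stays readable while the value is still a single string.
    s2 = (
        "It is a truth universally acknowledged that "
        "a very long string on a single line is ugly."
    )

    assert s1 == s2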

The eval Function in Python

The function eval takes a single Python expression as a string, evaluates it, and returns its value. Notice that names used in the evaluated expression must be present in the current scope; otherwise, an exception (e.g., NameError) is raised.
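
A minimal sketch of both points:

    x = 10

    # eval evaluates an expression string and returns its value.
    print(eval("x + 1"))  # 11

    # Names in the expression must exist in the current scope;
    # otherwise a NameError is raised.
    try:
        eval("y + 1")
    except NameError as error:
        print(error)  # name 'y' is not defined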

Even though eval (together with exec) might be useful in some situations, e.g., when implementing a REPL, it is strongly suggested that you avoid using eval.