Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Union RDDs in Spark

Unioning RDDs/DataFrames in Spark 2.1.0+ does not deduplicate rows, which keeps the operation efficient.

  1. Union 2 RDDs.

    df1.union(df2)
    # or for old-fashioned RDD
    rdd1.union(rdd2)
    
  2. Union multiple RDDs.

    from functools import reduce
    from pyspark.sql import DataFrame
    df = reduce(DataFrame.union, [df1, df2, df3])  # fold union over the list
    # or for old-fashioned RDD
    rdd …
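
Since union keeps duplicates, deduplication has to be requested explicitly. Below is a minimal sketch of a helper that unions any number of DataFrames or RDDs and optionally drops duplicates afterwards. The name `union_all` and its `dedup` flag are hypothetical and not part of Spark; the helper only assumes that each object has `.union()` and `.distinct()` methods, which PySpark DataFrames and RDDs both provide.

```python
from functools import reduce


def union_all(frames, dedup=False):
    """Union a non-empty sequence of objects exposing .union() and .distinct().

    Hypothetical helper (not a Spark API); PySpark DataFrames and RDDs
    both fit this shape.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame/RDD to union")
    # Fold pairwise: (f1 union f2) union f3 ...; duplicates are kept here.
    combined = reduce(lambda left, right: left.union(right), frames)
    # Deduplication is opt-in because distinct() is the expensive step.
    return combined.distinct() if dedup else combined
```

Usage would look like `union_all([df1, df2, df3], dedup=True)`. Keeping `dedup` off by default mirrors Spark's own behavior.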

Static Type Checking of Python Scripts Using pytype

Configuration

There are 3 ways to control the behavior of pytype.

  1. Pass command-line options to pytype.

  2. Specify a configuration file using pytype --config /path/to/config/file .... You can generate an example configuration file using the command pytype --generate-config pytype.cfg.

  3. If no configuration file is found, pytype uses the …

Call Java Using PyJNIus from Python

PyJNIus is a simple-to-use Java interface for Python. However, JPype is a better alternative.

Installation

pip install Cython
pip install pyjnius

Example with Imported Jar

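import os
# set CLASSPATH before importing jnius
os.environ["CLASSPATH"] = "/path/to/your.jar"
from jnius import autoclass
# autoclass takes the fully qualified class name as a string
YourClass = autoclass("path.to.YourClass")
yourObj = YourClass()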
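
When several jars are involved, the ordering in the example above is easy to get wrong, so it can help to wrap it in a function. The following is a rough sketch, not PyJNIus' own API: `set_classpath` and `load_java_class` are hypothetical helper names, and the jar paths are placeholders. It assumes, as the PyJNIus documentation describes, that JVM settings have to be in place before `jnius` is imported.

```python
import os


def set_classpath(jars):
    """Join jar paths with the platform separator and export them as CLASSPATH."""
    classpath = os.pathsep.join(jars)
    os.environ["CLASSPATH"] = classpath
    return classpath


def load_java_class(jars, class_name):
    """Set CLASSPATH, then import jnius and return a proxy for class_name."""
    set_classpath(jars)
    # Imported here on purpose: CLASSPATH has to be set before jnius is
    # imported, so the import is delayed until after set_classpath().
    from jnius import autoclass
    return autoclass(class_name)
```

For example, `load_java_class(["/path/to/a.jar", "/path/to/b.jar"], "path.to.YourClass")` would return the class proxy. This only helps if it is the first place in the process that imports `jnius`.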

Note: Avoid using the same …

Java Interfaces for Python

JPype, py4j and PyJNIus are all good options for calling Java from Python. JPype is easy to use and widely adopted. PyJNIus is even easier to use than JPype. py4j is more complicated to use than JPype and PyJNIus; however, it generally has better performance.

JPype …

Avoid Database Lock in SQLite3

  1. According to https://www.sqlite.org/lockingv3.html, POSIX advisory locking is known to be buggy or even unimplemented on many NFS implementations (including recent versions of Mac OS X), and there are reports of locking problems on network filesystems under Windows. So, the rule of thumb is to …
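
Separately from the filesystem issue above, ordinary lock contention between connections can be softened with a busy timeout plus a small retry loop. Here is a minimal sketch using only the standard `sqlite3` module, assuming the database file sits on a local disk. The helper name `execute_with_retry` and its default values are illustrative choices, not recommendations from the SQLite documentation.

```python
import sqlite3
import time


def execute_with_retry(db_path, sql, params=(), retries=5, timeout=10.0):
    """Run one write statement, retrying if SQLite reports a lock."""
    for attempt in range(retries):
        # timeout: seconds the connection waits for a lock to clear
        # before raising sqlite3.OperationalError.
        conn = sqlite3.connect(db_path, timeout=timeout)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as err:
            # Retry only lock errors; re-raise anything else, or the last failure.
            if "locked" not in str(err) or attempt == retries - 1:
                raise
            time.sleep(0.1 * (attempt + 1))  # simple linear backoff
        finally:
            conn.close()
```

A call such as `execute_with_retry("/path/to/app.db", "INSERT INTO t VALUES (?)", (42,))` opens a short-lived connection per attempt, so no connection holds a lock longer than one statement.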