Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Collections and Iterators in C++

Collections

  1. Prefer std::deque to std::vector when the size of the collection is unknown in advance.

  2. Suppose sets A and B have the same type, and set C has the same value type but a different comparison function; then it is still valid to insert …

Python Logging

General Tips

  1. logging is the logging module that ships with Python's standard library, while loguru is a popular 3rd-party logging library. Unless you want your Python package/script to avoid depending on 3rd-party libraries, loguru is preferred to logging for multiple reasons.

    • loguru is easy and fun to …
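For comparison, here is a minimal sketch of the standard-library route from the tip above, for the case where 3rd-party dependencies are not an option. The helper `make_logger`, the logger name `demo`, and the format string are hypothetical choices for illustration only; the log goes to an in-memory buffer so the output is easy to inspect.

```python
import io
import logging


def make_logger(name, stream, level=logging.INFO):
    """Return a logger that writes formatted records to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep records away from the root logger
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger("demo", buffer)
log.debug("not shown, below the INFO level")
log.info("loaded %d rows", 42)
print(buffer.getvalue(), end="")
```

Passing `sys.stderr` instead of the buffer gives the usual console logging.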

Special Characters to Avoid in Strings

This article covers special characters to avoid in various places. Violating the rules stated below will not cause issues most of the time, but you never know when things will break. Treat them as good-practice suggestions.

String for Shell

  1. When you pass parameters from one program …
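To illustrate why special characters in shell strings are risky, here is a small sketch that uses `shlex.quote` from the Python standard library to escape each argument before joining them into a command line. The helper name `build_command` and the sample arguments are hypothetical; this only shows how much quoting is needed once spaces, quotes, or `$` show up in a parameter, not a recommendation to rely on such characters.

```python
import shlex


def build_command(program, *args):
    """Join a program and its arguments into one shell-safe command string."""
    return " ".join(shlex.quote(part) for part in (program, *args))


# A plain argument needs no quoting, one with a space gets wrapped in quotes.
print(build_command("echo", "hello"))
print(build_command("echo", "a b"))

# Quotes and dollar signs survive a round trip through shell-style parsing.
command = build_command("printf", "it's $x")
print(shlex.split(command))
```

Sticking to letters, digits, `_`, `-`, and `.` in such strings avoids the need for this escaping altogether.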

Union RDDs in Spark

No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+.

  1. Union 2 RDDs.

    df1.union(df2)
    // or for old-fashioned RDD
    rdd1.union(rdd2)
    
  2. Union multiple RDDs.

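    // PySpark: SparkSession has no union method, so fold DataFrame.union over the list
    // (needs `from functools import reduce` and `from pyspark.sql import DataFrame`)
    df = reduce(DataFrame.union, [df1, df2, df3])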
    // or for old-fashioned RDD
    rdd …

Java Interfaces for Python

JPype, py4j and PyJNIus are all good options for calling Java from Python. JPype is easy to use and widely adopted. PyJNIus is an even easier solution compared to JPype. py4j is more complicated to use than JPype and PyJNIus, but it generally performs better.

JPype …

Concurrency and Parallel Computing in Python

The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend …
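The point about blocking operations happening outside the GIL can be seen in a small sketch. Here `time.sleep` stands in for a blocking I/O call (it releases the GIL while waiting), and the names `blocking_task` and `run_in_threads` are hypothetical. Four tasks of 0.3 seconds each finish in roughly 0.3 seconds on a thread pool rather than 1.2 seconds, because the threads wait concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(seconds):
    """Stand-in for a blocking I/O call; sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


def run_in_threads(n_tasks=4, seconds=0.3):
    """Run n_tasks blocking tasks on a thread pool; return (results, elapsed)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(blocking_task, [seconds] * n_tasks))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_in_threads()
    print(f"{len(results)} tasks finished in {elapsed:.2f}s")
```

A pure-Python CPU-bound loop would not speed up this way under CPython's GIL; for that kind of work a process pool (`concurrent.futures.ProcessPoolExecutor`) is the usual alternative.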