Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Date Functions in Spark

Tips and Traps

  1. An HDFS table might contain data that is invalid with respect to its column types (e.g., Date and Timestamp); I'm not clear about the reasons at this time. This causes issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.
  2. datetime.datetime or datetime.date

Count Number of Fields in Each Line

Sometimes, a structured text file might be malformed. A simple way to check for this is to count the number of fields in each line.

Using awk

You can count the number of fields in each line using the following awk command. Unfortunately, awk does not take escaped characters into consideration …
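As an alternative to awk, the field count can be sketched in Python with the standard csv module, which treats a quoted delimiter as part of the field rather than as a separator. This is a minimal sketch, assuming a comma-delimited file with standard double-quote quoting; the helper name count_fields and the sample data are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_fields(lines, delimiter=","):
    """Map each field count to the number of records that have it.

    `lines` is any iterable of text lines, e.g. an open file object.
    (Hypothetical helper, not part of any library.)
    """
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(len(row) for row in reader)


# Made-up sample: the second record contains a quoted comma,
# the third record is short by one field.
sample = io.StringIO('a,b,c\n"x,1",y,z\nonly,two\n')
counts = count_fields(sample)
print(counts)
```

A well-formed file yields a single key in the result; any additional key points to records with a different number of fields. Files that use backslash escapes instead of quoting would need the escapechar argument of csv.reader.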

Quickly Create a Scala Project Using Gradle in IntelliJ IDEA

Easy Way

  1. Create a directory (e.g., demo_proj) for your project.

  2. Run gradle init --type scala-library in a terminal in the above directory.

  3. Import the directory as a Gradle project in IntelliJ IDEA. Alternatively, you can add apply plugin: 'idea' to build.gradle and then run the command ./gradlew openIdea to …

Install Python Packages Behind Firewall

It is recommended that you use pip to install Python packages.

  1. If you don't already know the proxy in use (in your company), read the post Find out Proxy in Use to figure it out.

  2. Set proxy environment variables.

    set http_proxy=http://user:password@proxy_ip:port
    set https_proxy=https://user …
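The same proxy settings can also be supplied for a single installation from Python. The sketch below only builds the command and the environment; it uses pip's --proxy option and the http_proxy and https_proxy variables shown above. The helper names, the package name, and the use of one URL for both variables are assumptions for illustration, and the proxy URL is a placeholder to be replaced with your own.

```python
import os
import sys


def pip_install_command(package, proxy_url):
    """Build the argument list for `pip install` through a proxy.

    Hypothetical helper; `--proxy` is pip's own command-line option.
    """
    return [sys.executable, "-m", "pip", "install", "--proxy", proxy_url, package]


def proxy_env(proxy_url, base_env=None):
    """Return a copy of the environment with the proxy variables set.

    Assumption: the same proxy URL serves both HTTP and HTTPS traffic.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    env["https_proxy"] = proxy_url
    return env


# Placeholder values; substitute your real credentials, host, and port.
proxy = "http://user:password@proxy_ip:port"
cmd = pip_install_command("requests", proxy)
env = proxy_env(proxy, base_env={})
print(cmd[1:])
```

To actually run the installation, the list and environment can be passed to subprocess.run(cmd, env=env, check=True). Passing --proxy or setting the environment variables is each sufficient on its own; both are shown for completeness.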

Visualize Nvidia GPU Usage

You can use the tool nvtop (Linux only) to visualize the usage of Nvidia GPUs. However, it is not suitable for tracking and visualizing GPU usage over a long period of time. Another simple approach to track and visualize the GPU usage is to dump GPU usage statistics into a CSV file using the following command