Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Date Functions in Spark

Tips and Traps

  1. HDFS tables might contain data that is invalid with respect to the column types (e.g., Date and Timestamp); I'm not clear about the reasons at this time. This causes issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.
  2. datetime.datetime or datetime.date

Understand Execution of SQL Statements

Execution Order

A SQL statement selects rows and columns from a big (rectangular) table. You put the columns that you want to select after SELECT, name the table after FROM, and put the conditions on the rows you want to select after WHERE. A SQL statement is executed as follows. First, the (INNER|LEFT|RIGHT|FULL) JOIN (ON) is executed if any (see more explanation later). Second, the WHERE

Select Columns from Structured Text Files

Python pandas

My first choice is pandas in Python. However, below are some tools for quick and dirty solutions.
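For comparison with the command-line tools, here is a minimal sketch of the pandas route. The in-memory text and the column names c1, c2, c3 are made up for illustration and stand in for a real tab-separated file with a header row.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for file.txt.
text = "c1\tc2\tc3\n1\ta\tx\n2\tb\ty\n"

# sep="\t" marks the input as tab-delimited;
# usecols keeps only the named columns while parsing.
df = pd.read_csv(io.StringIO(text), sep="\t", usecols=["c1", "c3"])

# Write the selected columns back out as tab-separated text.
print(df.to_csv(sep="\t", index=False), end="")
```

With a real file, the io.StringIO(text) argument would simply be replaced by the file path.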

q

q -t -H 'select c1, c3 from file.txt'

cut

cut -d$'\t' -f1,3 file.txt

awk

awk -F'\t' '{print $1 "\t" $3}' file.tsv 

Note: neither cut …

Sample Lines from a File Using Command Line

NOTE: this article talks about sampling "lines" rather than "records". If a record can occupy multiple lines, e.g., if any field contains a newline (\n), the following tutorial does not work and you have to fall back to more powerful tools such as Python or R.
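For the multi-line-record case mentioned in the note, a minimal Python sketch might look like the following. It assumes the data is CSV with a header row and quoted fields, and that the whole file fits in memory; the function name sample_records and the sample data are hypothetical.

```python
import csv
import io
import random


def sample_records(stream, k, seed=None):
    """Return the header and up to k randomly chosen records from a CSV stream."""
    reader = csv.reader(stream)
    header = next(reader)
    # The csv module parses quoted fields, so a record that spans
    # several physical lines is still read as one record.
    records = list(reader)
    rng = random.Random(seed)
    return header, rng.sample(records, min(k, len(records)))


# Hypothetical data: the second record has a newline inside a quoted field.
data = 'id,comment\n1,plain\n2,"spans\ntwo lines"\n3,another\n'
header, picked = sample_records(io.StringIO(data), k=2, seed=0)
print(header)
print(picked)
```

When reading from a real file, open it with newline="" as the csv module documentation recommends, so that embedded newlines are passed through unchanged.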

Let's say …