Tips and Traps
- An HDFS table might contain data that is invalid with respect to the column types (e.g., Date and Timestamp); the reasons are not clear to me at this time. This causes issues when Spark tries to load the data. For more discussion, please refer to Unrecognized column type:TIMESTAMP_TYP.
datetime.datetime or datetime.date
Understand Execution of SQL Statements
Execution Order
A SQL statement selects rows and columns from a big (rectangular) table.
You put the columns that you want to select after SELECT
and the table (possibly built from several joined tables) to select from after FROM; the rows to keep are then filtered with WHERE.
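As a minimal sketch of the SELECT/FROM idea, the snippet below uses Python's built-in sqlite3 module on an in-memory database. The table name `people`, its columns, and the sample rows are made up for illustration and do not come from the original post.

```python
import sqlite3

# In-memory database holding a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", 31, "Seattle"), ("Bob", 25, "Boston"), ("Cat", 42, "Seattle")],
)

# SELECT lists the columns, FROM names the table, WHERE keeps matching rows.
rows = conn.execute(
    "SELECT name, age FROM people WHERE city = 'Seattle' ORDER BY name"
).fetchall()
print(rows)
conn.close()
```

Only the two named columns come back, and only for the rows that satisfy the condition.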
A SQL statement is executed as follows.
First, the (INNER|LEFT|RIGHT|FULL) JOIN (ON) is executed if any (see more explanation later).
Second, the WHERE
Select Columns from Structured Text Files
Python pandas
My first choice is pandas in Python. However, below are some tools for quick and dirty solutions.
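For comparison with the command-line tools, a pandas version might look like the sketch below. It assumes a tab-separated file with a header row and columns named c1, c2, c3; an in-memory string stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for file.txt.
text = "c1\tc2\tc3\n1\ta\tx\n2\tb\ty\n"

# usecols restricts parsing to the columns of interest.
df = pd.read_csv(io.StringIO(text), sep="\t", usecols=["c1", "c3"])
print(df)
```

With a real file, the path would be passed to read_csv in place of the StringIO object.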
q
q -t -H 'select c1, c3 from file.txt'
cut
cut -d$'\t' -f1,3 file.txt
awk
awk -F'\t' '{print $1 "\t" $3}' file.tsv
Note: neither cut …
Sample Lines from a File Using Command Line
NOTE: the article talks about sampling "lines" rather than "records".
If a record can occupy multiple lines,
e.g., if any field contains a newline (\n),
the following tutorial does not work
and you have to fall back to more powerful tools such as Python or R.
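As a rough sketch of the Python fallback, the snippet below samples records rather than lines. It assumes the data is CSV with quoted fields, so that the standard csv module can keep a record with an embedded newline in one piece; the sample data and the sample size are made up for illustration.

```python
import csv
import io
import random

# Hypothetical CSV in which one field contains an embedded newline.
text = 'id,comment\n1,"first line\nsecond line"\n2,short\n3,another\n'

# csv.reader parses quoted fields, so a record spanning two lines stays whole.
reader = csv.reader(io.StringIO(text, newline=""))
header = next(reader)
records = list(reader)

random.seed(0)  # fixed seed only to make the sketch repeatable
sample = random.sample(records, k=2)
print(header)
print(sample)
```

This loads all records into memory, which is fine for small files; for large files a streaming approach such as reservoir sampling would be a better fit.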
Let's say …