Tips and Traps
TABLESAMPLE must come immediately after a table name. The WHERE clause in the following SQL query runs after TABLESAMPLE.
SELECT * FROM table_name TABLESAMPLE (10 PERCENT) WHERE id = 1
If you want to run a WHERE …
Union RDDs in Spark
No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+.
- Union 2 RDDs.
  df1.union(df2) // or for old-fashioned RDD rdd1.union(rdd2)
- Union multiple RDDs.
  df = functools.reduce(DataFrame.union, [df1, df2, df3]) // SparkSession has no union method, so fold the list pairwise // or for old-fashioned RDD rdd …
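Unioning a whole list of DataFrames can be sketched as a pairwise fold. The helper name `union_all` is hypothetical (it is not part of Spark); it relies only on the `union` method that DataFrames and RDDs both provide, and like `union` itself it does no deduplication.

```python
from functools import reduce


def union_all(dfs):
    """Union an iterable of DataFrames (or RDDs) pairwise, left to right.

    Hypothetical helper: it only assumes each element has a
    `union(other)` method, as Spark DataFrames and RDDs do.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("union_all needs at least one DataFrame")
    # reduce computes ((df1.union(df2)).union(df3))... in order.
    return reduce(lambda left, right: left.union(right), dfs)
```

A call would look like `union_all([df1, df2, df3])`. For plain RDDs, `SparkContext.union` accepts a list of RDDs directly.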
Union DataFrames in Spark
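Comment
union relies on column order rather than column names, the same as UNION in SQL. When the types of two aligned columns do not match, their common super type is used. This is really dangerous if you are not careful, because columns that sit in a different order can be combined silently. Spark 2.3+ provides DataFrame.unionByName for this; on older versions it is suggested that you define a function called unionByName to handle it.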
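A minimal sketch of such a by-name union for Spark versions without a built-in one. The function name `union_by_name` and the strict "same set of columns" check are assumptions made for illustration; the only DataFrame members used are `columns`, `select` and `union`.

```python
def union_by_name(df1, df2):
    """Union two DataFrames by column name instead of column position.

    Hypothetical helper: df2's columns are reordered to match df1
    before the (positional) union is applied.
    """
    if set(df1.columns) != set(df2.columns):
        raise ValueError(
            "column names differ: %s vs %s" % (df1.columns, df2.columns)
        )
    # select(*names) returns df2 with its columns in df1's order.
    return df1.union(df2.select(*df1.columns))
```

Raising on mismatched column sets is a design choice here: it turns a silent mix-up into an immediate error.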
Comparing Similarity of Two Different Clusterings
The paper Comparing Clusterings - An Overview gives a good overview of different metrics for comparing the similarity of 2 clusterings. Overall, Normalized Mutual Information sounds like a good one. It is implemented in sklearn as sklearn.metrics.normalized_mutual_info_score. Of course, there are many more metrics for measuring similarity of 2 …
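To make the metric concrete, here is a from-scratch sketch of Normalized Mutual Information using only the standard library. It is not sklearn's implementation; the function names are illustrative, and normalizing by the arithmetic mean of the two entropies is an assumption (other normalizations exist). In practice the sklearn function named above is the one to call.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a cluster labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        # Both labelings put everything in one cluster; treated
        # as identical here (a convention chosen for this sketch).
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)
```

The score depends only on how items are grouped, not on the label values, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` compares two identical clusterings.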
Static Type Checking of Python Scripts Using pytype
Configuration
There are 3 ways to control the behavior of pytype.
- Pass command-line options to pytype.
- Specify a configuration file using pytype --config /path/to/config/file ... . You can generate an example configuration file using the command pytype --generate-config pytype.cfg.
- If no configuration file is found, pytype uses the …
Call Java Using PyJNIus from Python
PyJNIus is a simple-to-use Java interface for Python. However, JPype is a better alternative.
Installation
pip install Cython
pip install pyjnius
Example with Imported Jar
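import os
# Set CLASSPATH before importing jnius so that the JVM can find the jar.
os.environ["CLASSPATH"] = "/path/to/your.jar"
from jnius import autoclass
# autoclass takes the fully qualified class name as a string.
YourClass = autoclass("path.to.YourClass")
yourObj = YourClass()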
Note: Avoid using the same …