Using Optimus for Data Profiling in PySpark

This page contains fragmentary and immature notes and thoughts of the author. Please read it with your own judgement!

Tips & Traps

Optimus requires Python 3.6+.
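The version requirement above can be checked before anything else is imported. This is only a minimal sketch: the helper name `python_is_supported` and the constant `MIN_PYTHON` are hypothetical and are not part of Optimus.

```python
import sys

# Minimum interpreter version, taken from the note above (assumed still current).
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Fail early with a readable message instead of an obscure import error later.
if not python_is_supported():
    raise RuntimeError(
        "Optimus requires Python {}.{}+, found {}.{}".format(
            MIN_PYTHON[0], MIN_PYTHON[1], sys.version_info[0], sys.version_info[1]
        )
    )
```

Comparing tuples rather than version strings avoids the classic trap where "3.10" sorts before "3.6" as text.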

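import pandas as pd
import findspark

# Point findspark at the Spark installation before importing pyspark.
# A symbolic link to the Spark home is made at /opt/spark for convenience.
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame

# A namespaced import avoids shadowing Python builtins such as sum and max.
import pyspark.sql.functions as F
from pyspark.sql.types import StructType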

spark = (
    SparkSession.builder.appName("PySpark Example").enableHiveSupport().getOrCreate()
)

from optimus import Optimus

# Create an Optimus instance that runs against a local Spark master.
ops = Optimus(master="local")

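# Build a small example DataFrame: the first list holds (column name, type) pairs,
# the second list holds the rows.
df = ops.create.df(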
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ],
)
# Display the DataFrame as a table.
df.table()

# Run the Optimus profiler on the DataFrame.
ops.profiler.run(df)

<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
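To make concrete what kind of per-column summary a data profile reports, here is a plain-Python sketch over the same sample rows. It is not Optimus's implementation and does not call its API; the function `profile_rows` and the chosen statistics (count, missing, distinct, most common value, and min/max/mean for numeric columns) are assumptions made for illustration.

```python
from collections import Counter

# Same sample data as in the Optimus example above.
columns = ["names", "height", "function", "rank"]
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]


def profile_rows(columns, rows):
    """Compute simple per-column statistics (hypothetical helper, not Optimus)."""
    profile = {}
    for index, name in enumerate(columns):
        values = [row[index] for row in rows]
        # Only a real None counts as missing; the string "None" does not.
        present = [value for value in values if value is not None]
        stats = {
            "count": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1)[0] if present else None,
        }
        # Add numeric statistics only when every present value is a number.
        if present and all(
            isinstance(value, (int, float)) and not isinstance(value, bool)
            for value in present
        ):
            stats["min"] = min(present)
            stats["max"] = max(present)
            stats["mean"] = sum(present) / len(present)
        profile[name] = stats
    return profile


profile = profile_rows(columns, rows)
for name, stats in profile.items():
    print(name, stats)
```

Note that the "function" column reports no missing values, because the last row stores the string "None" rather than a real null. A profile is a convenient way to spot such placeholder values.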

References

https://github.com/ironmussa/Optimus

https://github.com/ironmussa/Optimus/tree/master/examples

https://htmlpreview.github.io/?https://github.com/ironmussa/Optimus/blob/master/docs/cheatsheet/optimus_cheat_sheet.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions