Using Optimus for Data Profiling in PySpark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips & Traps

  1. Optimus requires Python 3.6+.
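
A quick way to guard against the version trap above is to check the interpreter before importing Optimus. The helper name `meets_minimum` is hypothetical and only for illustration; it is not part of Optimus.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    # Fail early with a readable message instead of an obscure import error.
    raise RuntimeError(
        "Python 3.6+ is required, found %d.%d" % tuple(sys.version_info[:2])
    )
```

Calling `meets_minimum((3, 5, 9))` returns `False`, so the same helper can be reused to test other interpreters' version tuples.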

import pandas as pd
import findspark

# A symbolic link of the Spark Home is made to /opt/spark for convenience
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType

spark = (
    SparkSession.builder.appName("PySpark Example").enableHiveSupport().getOrCreate()
)
from optimus import Optimus

# Create an Optimus instance that runs against a local Spark master
ops = Optimus(master="local")
df = ops.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ],
)
df.table()
ops.profiler.run(df)
<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
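
The rendered profiler report is not reproduced on this page. To give a rough idea of the kind of per-column summary a data profile contains, here is a minimal pure-Python sketch over the same rows. It is not Optimus's implementation or output format; the function names `profile_column` and `profile_rows` and the chosen statistics are assumptions made for illustration.

```python
columns = ["names", "height", "function", "rank"]
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]


def profile_column(values):
    """Summarize one column: row count, missing count, distinct count,
    plus min/max/mean when all non-missing values are numeric."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = sum(present) / len(present)
    return summary


def profile_rows(columns, rows):
    """Transpose the rows into columns and profile each one (hypothetical helper)."""
    return {
        name: profile_column(list(col)) for name, col in zip(columns, zip(*rows))
    }


profile = profile_rows(columns, rows)
for name in columns:
    print(name, profile[name])
```

Note that in this sketch the string "None" in the function column counts as a regular value, while the Python `None` entries in height and rank count as missing.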