Ben Chuanlong Du's Blog

It is never too late to learn.

Using Optimus for Data Profiling in PySpark

Tips & Traps

  1. Optimus requires Python 3.6+.
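Since the interpreter version is the one stated requirement, it can be worth checking before importing Optimus. The guard below is only a minimal sketch: `meets_minimum` is a hypothetical helper written for this note, not part of Optimus, and the `(3, 6)` floor is taken directly from the tip above.

```python
import sys


def meets_minimum(version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper for illustration; not an Optimus API.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum(sys.version_info):
    # Report early instead of failing later inside the Optimus import.
    print("Python 3.6+ is required; found", sys.version.split()[0])
```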
In [1]:
import pandas as pd
import findspark

# The Spark home directory is symlinked to /opt/spark for convenience
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType

spark = (
    SparkSession.builder.appName("PySpark Example").enableHiveSupport().getOrCreate()
)
In [5]:
from optimus import Optimus

ops = Optimus(master="local")
In [6]:
df = ops.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ],
)
df.table()
Viewing 5 of 5 rows / 4 columns
15 partition(s)

names          height      function           rank
1 (string)     2 (float)   3 (string)         4 (int)
nullable       nullable    nullable           nullable
-------------  ----------  -----------------  ---------
bumbl#ebéé⋅⋅   17.5        Espionage          7
Optim'us       28.0        Leader             10
ironhide&      26.0        Security           7
Jazz           13.0        First⋅Lieutenant   8
Megatron       None        None               None
In [7]:
ops.profiler.run(df)

Overview

Dataset info

Number of columns 4
Number of rows 5
Total Missing (%) 2
Total size in memory -1 Bytes

Column types

Categorical 0
Numeric 0
Date 0
Array 0
Not available 0

names

categorical
Unique 4
Unique (%)
Missing 0
Missing (%)

Datatypes

String 5
Integer
Decimal
Bool
Date
Missing 0
Null 0

Frequency

Value Count Frequency (%)
Jazz 1 20.0%
Megatron 1 20.0%
bumbl#ebéé 1 20.0%
Optim'us 1 20.0%
ironhide& 1 20.0%
"Missing" 0 %

height

numeric
Unique 4
Unique (%)
Missing 1
Missing (%)

Datatypes

String
Integer
Decimal 4
Bool
Date
Missing 0
Null 1

Basic Stats

Mean 21.125
Minimum 13.0
Maximum 28.0
Zeros(%) 0

Quantile statistics

Minimum 13.0
5-th percentile 13.0
Q1 13.0
Median 17.5
Q3 26.0
95-th percentile 28.0
Maximum 28.0
Range
Interquartile range

Descriptive statistics

Standard deviation 7.07549
Coef of variation
Kurtosis -1.70021
Mean 21.125
MAD
Skewness -0.15561
Sum 84.5
Variance 50.0625

function

categorical
Unique 5
Unique (%)
Missing 0
Missing (%)

Datatypes

String 5
Integer
Decimal
Bool
Date
Missing 0
Null 0

Frequency

Value Count Frequency (%)
First Lieutenant 1 20.0%
Leader 1 20.0%
Security 1 20.0%
Espionage 1 20.0%
None 1 20.0%
"Missing" 0 %

rank

numeric
Unique 3
Unique (%)
Missing 1
Missing (%)

Datatypes

String
Integer 4
Decimal
Bool
Date
Missing 0
Null 1

Basic Stats

Mean 8.0
Minimum 7
Maximum 10
Zeros(%) 0

Quantile statistics

Minimum 7
5-th percentile 7
Q1 7
Median 7
Q3 8
95-th percentile 10
Maximum 10
Range
Interquartile range

Descriptive statistics

Standard deviation 1.41421
Coef of variation
Kurtosis -1.0
Mean 8.0
MAD
Skewness 0.8165
Sum 32
Variance 2.0
Out[7]:
<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
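As a sanity check on the numeric sections of the report, a few of the figures can be recomputed with the Python standard library alone. This is a rough sketch, not how Optimus computes them: `summarize` is a hypothetical helper, and it assumes nulls are simply dropped before aggregating, which is what the reported counts suggest.

```python
import statistics

# Same values as the example DataFrame; None marks a null cell.
heights = [17.5, 28.0, 26.0, 13.0, None]
ranks = [7, 10, 7, 8, None]


def summarize(values):
    """Summarize one column, excluding nulls (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return {
        "null": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present),
        "sum": sum(present),
        "variance": statistics.variance(present),  # sample variance (n - 1)
        "stddev": statistics.stdev(present),  # sample standard deviation
        "median_low": statistics.median_low(present),  # an observed value
    }


height_stats = summarize(heights)
rank_stats = summarize(ranks)
print(height_stats)
print(rank_stats)
```

Two things stand out when comparing against the report. The variance and standard deviation match the sample (n - 1) versions. The reported medians (17.5 and 7) match the lower median rather than the interpolated one (21.75 and 7.5), which would be consistent with an approximate-quantile method that returns an observed value, though that is an inference from the output rather than something the report states. Also note that the string "None" in the function column is counted as an ordinary value, not as a null.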