Installation¶
!pip3 install -U datacompy
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: datacompy in /home/dclong/.local/lib/python3.8/site-packages (0.8.1)
Requirement already satisfied: numpy<=1.22.3,>=1.11.3 in /usr/local/lib/python3.8/dist-packages (from datacompy) (1.21.6)
Requirement already satisfied: ordered-set<=4.1.0,>=4.0.2 in /home/dclong/.local/lib/python3.8/site-packages (from datacompy) (4.1.0)
Requirement already satisfied: pandas<=1.4.2,>=0.25.0 in /usr/local/lib/python3.8/dist-packages (from datacompy) (1.4.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.8/dist-packages (from pandas<=1.4.2,>=0.25.0->datacompy) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.8/dist-packages (from pandas<=1.4.2,>=0.25.0->datacompy) (2022.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas<=1.4.2,>=0.25.0->datacompy) (1.14.0)
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import (
IntegerType,
StringType,
StructType,
StructField,
ArrayType,
)
import datacompy
spark = SparkSession.builder.appName("datacompy").enableHiveSupport().getOrCreate()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/01 13:50:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Comparing Two pandas DataFrames¶
from io import StringIO
import pandas as pd
data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""
data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
df1
df2
compare = datacompy.Compare(
df1,
df2,
join_columns="acct_id", # You can also specify a list of columns
abs_tol=0.0001,
rel_tol=0,
df1_name="original",
df2_name="new",
)
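Two numeric values are treated as equal when they fall within the tolerances above. datacompy's numeric check appears to follow numpy.isclose semantics, i.e. a and b match when |a − b| ≤ abs_tol + rel_tol·|b|. A small standalone sketch of that rule (plain NumPy, not datacompy itself):

```python
import numpy as np

# Assumed match rule (numpy.isclose semantics): |a - b| <= atol + rtol * |b|
abs_tol, rel_tol = 0.0001, 0

a = np.array([123.45, 14530.1555, 1.0])
b = np.array([123.4, 14530.155, 1.00005])
# With abs_tol=0.0001 the first two pairs differ by 0.05 and 0.0005,
# so both are flagged as unequal; the last pair (diff 0.00005) matches.
print(np.isclose(a, b, atol=abs_tol, rtol=rel_tol))
```

This is why the report below flags both the dollar_amt difference of 0.05 and the float_fld difference of 0.0005: both exceed abs_tol=0.0001.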
print(compare.report())
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 original 5 6
1 new 4 5
Column Summary
--------------
Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0
Row Summary
-----------
Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0
Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 8
Columns with Unequal Values or Types
------------------------------------
Column original dtype new dtype # Unequal Max Diff # Null Diff
0 dollar_amt float64 float64 1 0.0500 0
2 float_fld float64 float64 4 0.0005 3
1 name object object 3 0.0000 0
Sample Rows with Unequal Values
-------------------------------
acct_id dollar_amt (original) dollar_amt (new)
0 10000001234 123.45 123.4
acct_id name (original) name (new)
4 10000001238 Lucille Bluth Loose Seal Bluth
0 10000001234 George Maharis George Michael Bluth
3 10000001237 Bob Loblaw Robert Loblaw
acct_id float_fld (original) float_fld (new)
1 10000001235 1.0000 NaN
4 10000001238 NaN 111.000
0 10000001234 14530.1555 14530.155
2 10000001236 NaN 1.000
Sample Rows Only in original (First 10 Columns)
-----------------------------------------------
acct_id dollar_amt name float_fld date_fld
5 10000001238 1.05 Loose Seal Bluth NaN 2017-01-01
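Note that the report finds a row only in original even though its acct_id (10000001238) also exists in new: when join keys are duplicated, datacompy matches repeated key values by their order of occurrence. A rough pure-pandas sketch of that idea (the _order column name is made up for illustration):

```python
import pandas as pd
from io import StringIO

data1 = """acct_id,dollar_amt
10000001237,123456
10000001238,1.05
10000001238,1.05
"""
data2 = """acct_id,dollar_amt
10000001237,123456
10000001238,1.05
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

# Disambiguate duplicated join keys with a per-key occurrence counter,
# then outer-merge on (key, occurrence):
for df in (df1, df2):
    df["_order"] = df.groupby("acct_id").cumcount()
merged = df1.merge(df2, on=["acct_id", "_order"], how="outer", indicator=True)
only_in_df1 = merged[merged["_merge"] == "left_only"]
print(only_in_df1)  # the second occurrence of 10000001238
```

The second duplicate of 10000001238 has no counterpart in df2, so it surfaces as a left_only row, mirroring the "Sample Rows Only in original" section above.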
Comparing Two Spark DataFrames¶
There is no advantage to running datacompy on a local installation of Spark: it consumes more memory and takes longer than running datacompy directly on pandas DataFrames.
If you do use datacompy with a local installation of Spark, make sure to import datacompy after calling
findspark.init(...).
sdf1 = spark.createDataFrame(df1)
sdf1.show()
+-----------+----------+----------------+----------+----------+
| acct_id|dollar_amt| name| float_fld| date_fld|
+-----------+----------+----------------+----------+----------+
|10000001234| 123.45| George Maharis|14530.1555|2017-01-01|
|10000001235| 0.45| Michael Bluth| 1.0|2017-01-01|
|10000001236| 1345.0| George Bluth| NaN|2017-01-01|
|10000001237| 123456.0| Bob Loblaw| 345.12|2017-01-01|
|10000001238| 1.05| Lucille Bluth| NaN|2017-01-01|
|10000001238| 1.05|Loose Seal Bluth| NaN|2017-01-01|
+-----------+----------+----------------+----------+----------+
sdf2 = spark.createDataFrame(df2)
sdf2.show()
+-----------+----------+--------------------+---------+
| acct_id|dollar_amt| name|float_fld|
+-----------+----------+--------------------+---------+
|10000001234| 123.4|George Michael Bluth|14530.155|
|10000001235| 0.45| Michael Bluth| NaN|
|10000001236| 1345.0| George Bluth| 1.0|
|10000001237| 123456.0| Robert Loblaw| 345.12|
|10000001238| 1.05| Loose Seal Bluth| 111.0|
+-----------+----------+--------------------+---------+
comparison = datacompy.SparkCompare(
spark,
sdf1,
sdf2,
join_columns=["acct_id"], # must use a list of column(s)
cache_intermediates=True,
abs_tol=0.0001,
rel_tol=0,
match_rates=True,
)
type(comparison)
datacompy.sparkcompare.SparkCompare
comparison.report()
****** Column Summary ******
Number of columns in common with matching schemas: 4
Number of columns in common with schema differences: 0
Number of columns in base but not compare: 1
Number of columns in compare but not base: 0
****** Columns In Base Only ******
Column Name Dtype
----------- -------------
date_fld string
****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 1
Number of duplicate rows found in compare: 0
****** Row Comparison ******
Number of rows with some columns unequal: 4
Number of rows with all columns equal: 1
****** Column Comparison ******
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 0
****** Columns with Unequal Values ******
Base Column Name Compare Column Name Base Dtype Compare Dtype # Matches # Mismatches Match Rate %
---------------- ------------------- ------------- ------------- --------- ------------ ------------
dollar_amt dollar_amt double double 4 1 80.00000
float_fld float_fld double double 3 2 60.00000
name name string string 2 3 40.00000
SparkCompare.report takes a file handle to write the report to.
Pass a StringIO object to capture the report in a string buffer instead.
with StringIO() as sio:
comparison.report(sio)
report = sio.getvalue()
report
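Since report only needs a writable text handle, the same mechanism persists the report to disk. A hedged sketch (the helper name and path are illustrative; the stub class merely mimics the .report(file) interface for demonstration):

```python
import os
import tempfile

def save_report(comparison, path):
    """Persist a SparkCompare-style report; `comparison` needs a .report(file) method."""
    with open(path, "w") as f:
        comparison.report(f)

# A tiny stand-in with the same .report(file) interface, used here
# so the sketch runs without a live Spark comparison:
class _StubCompare:
    def report(self, file):
        file.write("****** Column Summary ******\n")

path = os.path.join(tempfile.gettempdir(), "datacompy_report.txt")
save_report(_StubCompare(), path)
```

In a real session you would pass the `comparison` object from above instead of the stub.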