Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Compare Data Frames Using DataCompy in Python

Comments

  1. data-diff is a similar tool that efficiently diffs rows across two different databases.

Installation

In [2]:
!pip3 install -U datacompy
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: datacompy in /home/dclong/.local/lib/python3.8/site-packages (0.8.1)
Requirement already satisfied: numpy<=1.22.3,>=1.11.3 in /usr/local/lib/python3.8/dist-packages (from datacompy) (1.21.6)
Requirement already satisfied: ordered-set<=4.1.0,>=4.0.2 in /home/dclong/.local/lib/python3.8/site-packages (from datacompy) (4.1.0)
Requirement already satisfied: pandas<=1.4.2,>=0.25.0 in /usr/local/lib/python3.8/dist-packages (from datacompy) (1.4.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.8/dist-packages (from pandas<=1.4.2,>=0.25.0->datacompy) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.8/dist-packages (from pandas<=1.4.2,>=0.25.0->datacompy) (2022.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas<=1.4.2,>=0.25.0->datacompy) (1.14.0)
In [3]:
from pathlib import Path
import findspark

findspark.init(str(next(Path("/opt").glob("spark-3*"))))

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import (
    IntegerType,
    StringType,
    StructType,
    StructField,
    ArrayType,
)
import datacompy

spark = SparkSession.builder.appName("datacompy").enableHiveSupport().getOrCreate()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/01 13:50:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Comparing Two pandas DataFrames

In [5]:
from io import StringIO
import pandas as pd

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
In [6]:
df1
Out[6]:
acct_id dollar_amt name float_fld date_fld
0 10000001234 123.45 George Maharis 14530.1555 2017-01-01
1 10000001235 0.45 Michael Bluth 1.0000 2017-01-01
2 10000001236 1345.00 George Bluth NaN 2017-01-01
3 10000001237 123456.00 Bob Loblaw 345.1200 2017-01-01
4 10000001238 1.05 Lucille Bluth NaN 2017-01-01
5 10000001238 1.05 Loose Seal Bluth NaN 2017-01-01
In [7]:
df2
Out[7]:
acct_id dollar_amt name float_fld
0 10000001234 123.40 George Michael Bluth 14530.155
1 10000001235 0.45 Michael Bluth NaN
2 10000001236 1345.00 George Bluth 1.000
3 10000001237 123456.00 Robert Loblaw 345.120
4 10000001238 1.05 Loose Seal Bluth 111.000
In [8]:
compare = datacompy.Compare(
    df1,
    df2,
    join_columns="acct_id",  # You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name="original",
    df2_name="new",
)
print(compare.report())
DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 8

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
2   float_fld        float64   float64          4    0.0005            3
1        name         object    object          3    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id name (original)            name (new)
4  10000001238   Lucille Bluth      Loose Seal Bluth
0  10000001234  George Maharis  George Michael Bluth
3  10000001237      Bob Loblaw         Robert Loblaw

       acct_id  float_fld (original)  float_fld (new)
1  10000001235                1.0000              NaN
4  10000001238                   NaN          111.000
0  10000001234            14530.1555        14530.155
2  10000001236                   NaN            1.000

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt              name  float_fld    date_fld
5  10000001238        1.05  Loose Seal Bluth        NaN  2017-01-01
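Under the hood, the tolerance check in the report above can be approximated with a plain pandas merge plus `numpy.isclose`. A minimal sketch (the merged column names here are illustrative, not datacompy internals):

```python
import numpy as np
import pandas as pd

# Two tiny frames sharing the join key, with one dollar_amt off by 0.05
df1 = pd.DataFrame({"acct_id": [10000001234, 10000001235], "dollar_amt": [123.45, 0.45]})
df2 = pd.DataFrame({"acct_id": [10000001234, 10000001235], "dollar_amt": [123.40, 0.45]})

merged = df1.merge(df2, on="acct_id", suffixes=("_orig", "_new"))
# A pair counts as equal when |orig - new| <= abs_tol + rel_tol * |new|
equal = np.isclose(merged["dollar_amt_orig"], merged["dollar_amt_new"], atol=0.0001, rtol=0)
print(equal.tolist())  # [False, True]
```

With `abs_tol=0.0001`, the 123.45 vs 123.4 pair is flagged unequal, which matches the single `dollar_amt` mismatch in the report.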


Comparing Two Spark DataFrames

  1. There is no advantage to running datacompy on a local version of Spark! This approach consumes more memory and takes more time than running datacompy on pandas DataFrames.

  2. If you use datacompy with a local version of Spark, make sure to import datacompy after calling `findspark.init(...)`.

In [9]:
sdf1 = spark.createDataFrame(df1)
sdf1.show()
+-----------+----------+----------------+----------+----------+
|    acct_id|dollar_amt|            name| float_fld|  date_fld|
+-----------+----------+----------------+----------+----------+
|10000001234|    123.45|  George Maharis|14530.1555|2017-01-01|
|10000001235|      0.45|   Michael Bluth|       1.0|2017-01-01|
|10000001236|    1345.0|    George Bluth|       NaN|2017-01-01|
|10000001237|  123456.0|      Bob Loblaw|    345.12|2017-01-01|
|10000001238|      1.05|   Lucille Bluth|       NaN|2017-01-01|
|10000001238|      1.05|Loose Seal Bluth|       NaN|2017-01-01|
+-----------+----------+----------------+----------+----------+

                                                                                
In [10]:
sdf2 = spark.createDataFrame(df2)
sdf2.show()
+-----------+----------+--------------------+---------+
|    acct_id|dollar_amt|                name|float_fld|
+-----------+----------+--------------------+---------+
|10000001234|     123.4|George Michael Bluth|14530.155|
|10000001235|      0.45|       Michael Bluth|      NaN|
|10000001236|    1345.0|        George Bluth|      1.0|
|10000001237|  123456.0|       Robert Loblaw|   345.12|
|10000001238|      1.05|    Loose Seal Bluth|    111.0|
+-----------+----------+--------------------+---------+

In [11]:
comparison = datacompy.SparkCompare(
    spark,
    sdf1,
    sdf2,
    join_columns=["acct_id"],  # must use a list of column(s)
    cache_intermediates=True,
    abs_tol=0.0001,
    rel_tol=0,
    match_rates=True,
)
In [18]:
type(comparison)
Out[18]:
datacompy.sparkcompare.SparkCompare
In [12]:
comparison.report()
****** Column Summary ******
Number of columns in common with matching schemas: 4
Number of columns in common with schema differences: 0
Number of columns in base but not compare: 1
Number of columns in compare but not base: 0

****** Columns In Base Only ******
Column Name  Dtype
-----------  -------------
date_fld     string       

****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 1
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 4
Number of rows with all columns equal: 1

****** Column Comparison ******
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 0

****** Columns with Unequal Values ******
Base Column Name  Compare Column Name  Base Dtype     Compare Dtype  # Matches  # Mismatches  Match Rate %
----------------  -------------------  -------------  -------------  ---------  ------------  ------------
dollar_amt        dollar_amt           double         double                 4             1      80.00000
float_fld         float_fld            double         double                 3             2      60.00000
name              name                 string         string                 2             3      40.00000
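The "Number of duplicate rows found in base: 1" line reflects the repeated `acct_id` 10000001238 in the base DataFrame. It can be worth checking for duplicate join keys before running a compare; a minimal pandas sketch, independent of datacompy:

```python
import pandas as pd

# A frame where the join key 10000001238 appears twice
df = pd.DataFrame(
    {"acct_id": [10000001237, 10000001238, 10000001238], "dollar_amt": [123456.0, 1.05, 1.05]}
)
# Join keys appearing more than once would be matched ambiguously
dup_keys = df.loc[df["acct_id"].duplicated(), "acct_id"].unique()
print(dup_keys.tolist())  # [10000001238]
```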

`SparkCompare.report` takes a file handle to which it writes the report. Passing a `StringIO` object captures the report in a string buffer instead.

In [16]:
with StringIO() as sio:
    comparison.report(sio)
    report = sio.getvalue()
In [17]:
report
Out[17]:
'\n****** Column Summary ******\nNumber of columns in common with matching schemas: 4\nNumber of columns in common with schema differences: 0\nNumber of columns in base but not compare: 1\nNumber of columns in compare but not base: 0\n\n****** Columns In Base Only ******\nColumn Name  Dtype\n-----------  -------------\ndate_fld     string       \n\n****** Row Summary ******\nNumber of rows in common: 5\nNumber of rows in base but not compare: 0\nNumber of rows in compare but not base: 0\nNumber of duplicate rows found in base: 1\nNumber of duplicate rows found in compare: 0\n\n****** Row Comparison ******\nNumber of rows with some columns unequal: 4\nNumber of rows with all columns equal: 1\n\n****** Column Comparison ******\nNumber of columns compared with some values unequal: 3\nNumber of columns compared with all values equal: 0\n\n****** Columns with Unequal Values ******\nBase Column Name  Compare Column Name  Base Dtype     Compare Dtype  # Matches  # Mismatches  Match Rate %\n----------------  -------------------  -------------  -------------  ---------  ------------  ------------\ndollar_amt        dollar_amt           double         double                 4             1      80.00000\nfloat_fld         float_fld            double         double                 3             2      60.00000\nname              name                 string         string                 2             3      40.00000\n'
