Installation¶
!pip3 install -U datacompy
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: datacompy in /home/dclong/.local/lib/python3.8/site-packages (0.8.1)
Requirement already satisfied: numpy<=1.22.3,>=1.11.3 in /usr/local/lib/python3.8/dist-packages (from datacompy) (1.21.6)
Requirement already satisfied: ordered-set<=4.1.0,>=4.0.2 in /home/dclong/.local/lib/python3.8/site-packages (from datacompy) (4.1.0)
Requirement already satisfied: pandas<=1.4.2,>=0.25.0 in /usr/local/lib/python3.8/dist-packages (from datacompy) (1.4.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.8/dist-packages (from pandas<=1.4.2,>=0.25.0->datacompy) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.8/dist-packages (from pandas<=1.4.2,>=0.25.0->datacompy) (2022.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas<=1.4.2,>=0.25.0->datacompy) (1.14.0)
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import (
IntegerType,
StringType,
StructType,
StructField,
ArrayType,
)
import datacompy
spark = SparkSession.builder.appName("datacompy").enableHiveSupport().getOrCreate()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/01 13:50:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Comparing Two pandas DataFrames¶
from io import StringIO
import pandas as pd
data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""
data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
df1
df2
compare = datacompy.Compare(
df1,
df2,
join_columns="acct_id", # You can also specify a list of columns
abs_tol=0.0001,
rel_tol=0,
df1_name="original",
df2_name="new",
)
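Two numeric values are treated as equal when they fall within the tolerances above. datacompy's numeric check appears to follow numpy.isclose semantics, i.e. a and b match when |a − b| ≤ abs_tol + rel_tol·|b|. A small standalone sketch of that rule (plain NumPy, not datacompy itself):

```python
import numpy as np

# Assumed match rule (numpy.isclose semantics): |a - b| <= atol + rtol * |b|
abs_tol, rel_tol = 0.0001, 0

a = np.array([123.45, 14530.1555, 1.0])
b = np.array([123.4, 14530.155, 1.00005])
# With abs_tol=0.0001 the first two pairs differ by 0.05 and 0.0005,
# so both are flagged as unequal; the last pair (diff 0.00005) matches.
print(np.isclose(a, b, atol=abs_tol, rtol=rel_tol))
```

This is why the report below flags both the dollar_amt difference of 0.05 and the float_fld difference of 0.0005: both exceed abs_tol=0.0001.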
print(compare.report())
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 original 5 6
1 new 4 5
Column Summary
--------------
Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0
Row Summary
-----------
Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0
Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 8
Columns with Unequal Values or Types
------------------------------------
Column original dtype new dtype # Unequal Max Diff # Null Diff
0 dollar_amt float64 float64 1 0.0500 0
2 float_fld float64 float64 4 0.0005 3
1 name object object 3 0.0000 0
Sample Rows with Unequal Values
-------------------------------
acct_id dollar_amt (original) dollar_amt (new)
0 10000001234 123.45 123.4
acct_id name (original) name (new)
4 10000001238 Lucille Bluth Loose Seal Bluth
0 10000001234 George Maharis George Michael Bluth
3 10000001237 Bob Loblaw Robert Loblaw
acct_id float_fld (original) float_fld (new)
1 10000001235 1.0000 NaN
4 10000001238 NaN 111.000
0 10000001234 14530.1555 14530.155
2 10000001236 NaN 1.000
Sample Rows Only in original (First 10 Columns)
-----------------------------------------------
acct_id dollar_amt name float_fld date_fld
5 10000001238 1.05 Loose Seal Bluth NaN 2017-01-01
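Note that the report finds a row only in original even though its acct_id (10000001238) also exists in new: when join keys are duplicated, datacompy matches repeated key values by their order of occurrence. A rough pure-pandas sketch of that idea (the _order column name is made up for illustration):

```python
import pandas as pd
from io import StringIO

data1 = """acct_id,dollar_amt
10000001237,123456
10000001238,1.05
10000001238,1.05
"""
data2 = """acct_id,dollar_amt
10000001237,123456
10000001238,1.05
"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

# Disambiguate duplicated join keys with a per-key occurrence counter,
# then outer-merge on (key, occurrence):
for df in (df1, df2):
    df["_order"] = df.groupby("acct_id").cumcount()
merged = df1.merge(df2, on=["acct_id", "_order"], how="outer", indicator=True)
only_in_df1 = merged[merged["_merge"] == "left_only"]
print(only_in_df1)  # the second occurrence of 10000001238
```

The second duplicate of 10000001238 has no counterpart in df2, so it surfaces as a left_only row, mirroring the "Sample Rows Only in original" section above.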
Comparing Two Spark DataFrames¶
There is no advantage to running datacompy on a local installation of Spark: it consumes more memory and takes longer than running datacompy directly on pandas DataFrames.
If you do use datacompy with a local installation of Spark, make sure to import datacompy after calling
findspark.init(...).
sdf1 = spark.createDataFrame(df1)
sdf1.show()
+-----------+----------+----------------+----------+----------+
| acct_id|dollar_amt| name| float_fld| date_fld|
+-----------+----------+----------------+----------+----------+
|10000001234| 123.45| George Maharis|14530.1555|2017-01-01|
|10000001235| 0.45| Michael Bluth| 1.0|2017-01-01|
|10000001236| 1345.0| George Bluth| NaN|2017-01-01|
|10000001237| 123456.0| Bob Loblaw| 345.12|2017-01-01|
|10000001238| 1.05| Lucille Bluth| NaN|2017-01-01|
|10000001238| 1.05|Loose Seal Bluth| NaN|2017-01-01|
+-----------+----------+----------------+----------+----------+
sdf2 = spark.createDataFrame(df2)
sdf2.show()
+-----------+----------+--------------------+---------+
| acct_id|dollar_amt| name|float_fld|
+-----------+----------+--------------------+---------+
|10000001234| 123.4|George Michael Bluth|14530.155|
|10000001235| 0.45| Michael Bluth| NaN|
|10000001236| 1345.0| George Bluth| 1.0|
|10000001237| 123456.0| Robert Loblaw| 345.12|
|10000001238| 1.05| Loose Seal Bluth| 111.0|
+-----------+----------+--------------------+---------+
comparison = datacompy.SparkCompare(
spark,
sdf1,
sdf2,
join_columns=["acct_id"], # must use a list of column(s)
cache_intermediates=True,
abs_tol=0.0001,
rel_tol=0,
match_rates=True,
)
type(comparison)
datacompy.sparkcompare.SparkCompare
comparison.report()
****** Column Summary ******
Number of columns in common with matching schemas: 4
Number of columns in common with schema differences: 0
Number of columns in base but not compare: 1
Number of columns in compare but not base: 0
****** Columns In Base Only ******
Column Name Dtype
----------- -------------
date_fld string
****** Row Summary ******
Number of rows in common: 5
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 1
Number of duplicate rows found in compare: 0
****** Row Comparison ******
Number of rows with some columns unequal: 4
Number of rows with all columns equal: 1
****** Column Comparison ******
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 0
****** Columns with Unequal Values ******
Base Column Name Compare Column Name Base Dtype Compare Dtype # Matches # Mismatches Match Rate %
---------------- ------------------- ------------- ------------- --------- ------------ ------------
dollar_amt dollar_amt double double 4 1 80.00000
float_fld float_fld double double 3 2 60.00000
name name string string 2 3 40.00000
SparkCompare.report takes a file handle to write the report to.
Pass a StringIO object to capture the report in a string buffer instead.
with StringIO() as sio:
comparison.report(sio)
report = sio.getvalue()
report
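Since report only needs a writable text handle, the same mechanism persists the report to disk. A hedged sketch (the helper name and path are illustrative; the stub class merely mimics the .report(file) interface for demonstration):

```python
import os
import tempfile

def save_report(comparison, path):
    """Persist a SparkCompare-style report; `comparison` needs a .report(file) method."""
    with open(path, "w") as f:
        comparison.report(f)

# A tiny stand-in with the same .report(file) interface, used here
# so the sketch runs without a live Spark comparison:
class _StubCompare:
    def report(self, file):
        file.write("****** Column Summary ******\n")

path = os.path.join(tempfile.gettempdir(), "datacompy_report.txt")
save_report(_StubCompare(), path)
```

In a real session you would pass the `comparison` object from above instead of the stub.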