Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Tips and Traps¶
1. It is suggested that you use multiprocessing (e.g., pool_size=8) to speed up data profiling. Note: it seems to me that multiprocessing currently works only when minimal=True.

2. minimal=True helps reduce memory consumption.

    profile = ProfileReport(
        df, title="Data Profiling Report", explorative=True, minimal=True, pool_size=8
    )

3. ProfileReport.dump dumps the report to a pickle file (for caching purposes), while ProfileReport.to_file dumps the report to an HTML file or a JSON file.
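As a small illustration of the pool_size tip above, the sketch below caps the worker count at the number of CPUs the machine reports. The helper names pick_pool_size and profile_kwargs are hypothetical (they are not part of ydata-profiling); only the keyword arguments title, explorative, minimal and pool_size come from the call shown above.

```python
import os


def pick_pool_size(cap: int = 8) -> int:
    """Return a worker count between 1 and `cap`, limited by the CPU count."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, cpus))


def profile_kwargs(title: str, cap: int = 8) -> dict:
    """Build the keyword arguments used in the tips above (hypothetical helper)."""
    return {
        "title": title,
        "explorative": True,
        "minimal": True,  # multiprocessing seems to need minimal=True (see the note above)
        "pool_size": pick_pool_size(cap),
    }


kwargs = profile_kwargs("Data Profiling Report")
print(kwargs)
```

The dictionary can then be unpacked into the constructor, e.g. ProfileReport(df, **kwargs).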
Installation¶
pip3 install --user -U ydata-profiling[notebook]
!wget https://raw.githubusercontent.com/z-o-e/bank_data_analysis/master/bank-full.csv
--2023-06-11 15:15:56-- https://raw.githubusercontent.com/z-o-e/bank_data_analysis/master/bank-full.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4610348 (4.4M) [text/plain]
Saving to: ‘bank-full.csv’
bank-full.csv 100%[===================>] 4.40M --.-KB/s in 0.1s
2023-06-11 15:15:56 (36.4 MB/s) - ‘bank-full.csv’ saved [4610348/4610348]
!head bank-full.csv
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
58;"retired";"married";"primary";"no";121;"yes";"no";"unknown";5;"may";50;1;-1;0;"unknown";"no"
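The head output above shows that the file uses semicolons as field separators and double quotes around text fields, which is why a semicolon separator has to be specified when loading it. A minimal sketch using only Python's standard csv module on the first rows shown above (the variable names are illustrative):

```python
import csv
import io

# The header and the first two data rows, copied from the output above.
sample = '''"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
'''

# Tell the reader that fields are separated by ";" and quoted with '"'.
reader = csv.DictReader(io.StringIO(sample), delimiter=";", quotechar='"')
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["job"], rows[0]["balance"])
```

Note that the csv module returns every value as a string, so numeric columns such as balance would still need converting.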
!pip3 install --user ydata-profiling
Requirement already satisfied: ydata-profiling in /home/dclong/.local/lib/python3.10/site-packages (4.2.0)
Requirement already satisfied: scipy<1.11,>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (1.10.1)
Requirement already satisfied: pandas!=1.4.0,<2,>1.1 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (1.5.3)
Requirement already satisfied: matplotlib<4,>=3.2 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (3.7.1)
Requirement already satisfied: pydantic<2,>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (1.10.7)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (6.0)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (3.1.2)
Requirement already satisfied: visions[type_image_path]==0.7.5 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (0.7.5)
Requirement already satisfied: numpy<1.24,>=1.16.0 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (1.23.5)
Requirement already satisfied: htmlmin==0.1.12 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (0.1.12)
Requirement already satisfied: phik<0.13,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (0.12.3)
Requirement already satisfied: requests<3,>=2.24.0 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (2.30.0)
Requirement already satisfied: tqdm<5,>=4.48.2 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (4.65.0)
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (0.12.2)
Requirement already satisfied: multimethod<2,>=1.4 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (1.9.1)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (0.14.0)
Requirement already satisfied: typeguard<3,>=2.13.2 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (2.13.3)
Requirement already satisfied: imagehash==4.3.1 in /usr/local/lib/python3.10/dist-packages (from ydata-profiling) (4.3.1)
Requirement already satisfied: wordcloud>=1.9.1 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (1.9.2)
Requirement already satisfied: dacite>=1.8 in /home/dclong/.local/lib/python3.10/site-packages (from ydata-profiling) (1.8.1)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.10/dist-packages (from imagehash==4.3.1->ydata-profiling) (1.4.1)
Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (from imagehash==4.3.1->ydata-profiling) (9.5.0)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.10/dist-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (23.1.0)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.10/dist-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (3.1)
Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in /usr/local/lib/python3.10/dist-packages (from visions[type_image_path]==0.7.5->ydata-profiling) (0.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2<3.2,>=2.11.1->ydata-profiling) (2.1.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (1.0.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (4.39.4)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (23.1)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/lib/python3/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib<4,>=3.2->ydata-profiling) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas!=1.4.0,<2,>1.1->ydata-profiling) (2023.3)
Requirement already satisfied: joblib>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from phik<0.13,>=0.11.1->ydata-profiling) (1.1.1)
Requirement already satisfied: typing-extensions>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<2,>=1.8.1->ydata-profiling) (4.5.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (2.0.2)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3,>=2.24.0->ydata-profiling) (2023.5.7)
Requirement already satisfied: patsy>=0.5.2 in /home/dclong/.local/lib/python3.10/site-packages (from statsmodels<1,>=0.13.2->ydata-profiling) (0.5.3)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from patsy>=0.5.2->statsmodels<1,>=0.13.2->ydata-profiling) (1.16.0)
import jupyter
jupyter.textOutputLimit = 0
from pathlib import Path
import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file
/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py:262: NumbaDeprecationWarning: numba.generated_jit is deprecated. Please see the documentation at: https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-generated-jit for more information and advice on a suitable replacement.
warnings.warn(msg, NumbaDeprecationWarning)
/home/dclong/.local/lib/python3.10/site-packages/visions/backends/shared/nan_handling.py:51: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
def hasna(x: np.ndarray) -> bool:
df = pd.read_csv("bank-full.csv", sep=";")
df
profile = ProfileReport(
    df,
    title="Profile Report of the UCI Bank Marketing Dataset",
    explorative=True,
    minimal=True,
    pool_size=8,
)
profile
Save the Report¶
ProfileReport.to_json returns the report as a JSON string, and ProfileReport.to_html returns it as an HTML string.
You can save a report to a file using the method ProfileReport.to_file. The output format depends on the file extension you specify (e.g., .html or .json).
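Since the extension decides the format, it can be worth validating it before calling ProfileReport.to_file. A minimal sketch, assuming only the .html and .json extensions mentioned on this page; save_report is a hypothetical helper, and it accepts any object with a to_file method so it can be tried without building a real report:

```python
from pathlib import Path
from typing import Union

# Extensions mentioned on this page; treated as the allowed set (an assumption).
SUPPORTED_EXTENSIONS = {".html", ".json"}


def save_report(report, output_file: Union[str, Path]) -> Path:
    """Validate the extension, create the parent directory and save the report.

    `report` is expected to be a ProfileReport, but any object that provides
    a `to_file(path)` method works (hypothetical helper, not a library API).
    """
    path = Path(output_file)
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported report extension: {path.suffix!r}")
    path.parent.mkdir(parents=True, exist_ok=True)
    report.to_file(path)
    return path
```

With a real report, save_report(profile, "report.html") would write the HTML version, and a path ending in .json would write the JSON version.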
from pathlib import Path
from loguru import logger
import pandas as pd
from ydata_profiling import ProfileReport


def dump_profile(df: pd.DataFrame | str | Path, title: str, output_dir: str | Path):
    """Run ydata-profiling on a DataFrame and dump the report into files.

    :param df: A pandas DataFrame, or the path to a Parquet/Pickle/CSV file containing one.
    :param title: The title of the generated report.
    :param output_dir: The output directory for reports.
    :raises ValueError: If an input file other than Parquet/Pickle/CSV is provided.
    """
    if isinstance(df, str):
        df = Path(df)
    if isinstance(df, Path):
        # load the DataFrame based on the file extension
        logger.info("Reading the DataFrame from {}...", df)
        ext = df.suffix.lower()
        if ext == ".parquet":
            df = pd.read_parquet(df)
        elif ext == ".pickle":
            df = pd.read_pickle(df)
        elif ext == ".csv":
            df = pd.read_csv(df)
        else:
            raise ValueError("Only Parquet, Pickle and CSV files are supported!")
    logger.info("Shape of the DataFrame: {}", df.shape)
    logger.info("Profiling the DataFrame...")
    report = ProfileReport(df, title=title, minimal=True, explorative=True)
    if isinstance(output_dir, str):
        output_dir = Path(output_dir)
    # create the output directory (and any missing parents) if needed
    output_dir.mkdir(parents=True, exist_ok=True)
    # dump report
    logger.info("Dumping the report to HTML...")
    report.to_file(output_dir / "report.html")
    logger.info("Dumping the report to JSON...")
    report.to_file(output_dir / "report.json")
    logger.info("Dumping the report to Pickle...")
    report.dump(output_dir / "report.pickle")

Write the report to an HTML file.
profile.to_file(
    "../../../../../home/media/pandas-profiling/uci_bank_marketing_report.html"
)
Write the report to a JSON file.
profile.to_file(
    "../../../../../home/media/pandas-profiling/uci_bank_marketing_report.json"
)
Get the HTML string representation of the report.
profile.to_html()
Configuration¶
src
# Sort the variables. Possible values: ascending, descending or None (leaves original sorting)
sort: None
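To make the three possible values of the sort option concrete, here is a small sketch of the behaviour the comment describes. It is only an illustration written for this page, not the library's implementation; the function name order_variables and the case-insensitive ordering are assumptions.

```python
from typing import List, Optional


def order_variables(names: List[str], sort: Optional[str] = None) -> List[str]:
    """Order variable names according to a sort setting.

    "ascending" / "descending" sort alphabetically (case-insensitively, an
    assumption); None leaves the original order untouched.
    """
    if sort is None:
        return list(names)
    if sort not in ("ascending", "descending"):
        raise ValueError("sort must be 'ascending', 'descending' or None")
    return sorted(names, key=str.casefold, reverse=(sort == "descending"))


columns = ["job", "age", "balance", "y"]
print(order_variables(columns, "ascending"))
print(order_variables(columns, "descending"))
print(order_variables(columns, None))
```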