Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

  1. Use multiprocessing (e.g., pool_size=8) to speed up data profiling. Note: in my experience, multiprocessing currently seems to work only when minimal=True.

  2. minimal=True helps reduce memory consumption.

    profile = ProfileReport(
        df, title="Data Profiling Report", 
        explorative=True, minimal=True, pool_size=8
    )
  3. ProfileReport.dump dumps the report to a pickle file (for caching purposes), while ProfileReport.to_file writes the report to an HTML file or a JSON file.
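
As a rough sketch of the first two tips, the helper below picks a pool size that never exceeds the number of CPUs on the machine and bundles it with minimal=True. The names choose_pool_size and profile_kwargs are hypothetical helpers made up for this note; they are not part of ydata-profiling.

```python
import os


def choose_pool_size(cap: int = 8) -> int:
    """Return a worker count between 1 and `cap`, bounded by the CPU count."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, cpus))


def profile_kwargs(title: str, cap: int = 8) -> dict:
    """Build keyword arguments for ProfileReport following the tips above."""
    return {
        "title": title,
        "explorative": True,
        "minimal": True,  # reduces memory consumption
        "pool_size": choose_pool_size(cap),
    }


if __name__ == "__main__":
    print(profile_kwargs("Data Profiling Report"))
    # With ydata-profiling installed and a DataFrame `df` at hand:
    # profile = ProfileReport(df, **profile_kwargs("Data Profiling Report"))
```

Capping the pool size at the CPU count is just a conservative default; whether more workers help depends on the data and the machine.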

Installation

pip3 install --user -U "ydata-profiling[notebook]"
!wget https://raw.githubusercontent.com/z-o-e/bank_data_analysis/master/bank-full.csv
--2023-06-11 15:15:56--  https://raw.githubusercontent.com/z-o-e/bank_data_analysis/master/bank-full.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4610348 (4.4M) [text/plain]
Saving to: ‘bank-full.csv’

bank-full.csv       100%[===================>]   4.40M  --.-KB/s    in 0.1s    

2023-06-11 15:15:56 (36.4 MB/s) - ‘bank-full.csv’ saved [4610348/4610348]

!head bank-full.csv
"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
58;"retired";"married";"primary";"no";121;"yes";"no";"unknown";5;"may";50;1;-1;0;"unknown";"no"
!pip3 install --user ydata-profiling
Requirement already satisfied: ydata-profiling in /home/dclong/.local/lib/python3.10/site-packages (4.2.0)
import jupyter

jupyter.textOutputLimit = 0
from pathlib import Path
import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file
/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py:262: NumbaDeprecationWarning: numba.generated_jit is deprecated. Please see the documentation at: https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-generated-jit for more information and advice on a suitable replacement.
  warnings.warn(msg, NumbaDeprecationWarning)
/home/dclong/.local/lib/python3.10/site-packages/visions/backends/shared/nan_handling.py:51: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def hasna(x: np.ndarray) -> bool:
df = pd.read_csv("bank-full.csv", sep=";")
df
profile = ProfileReport(
    df,
    title="Profile Report of the UCI Bank Marketing Dataset",
    explorative=True,
    minimal=True,
    pool_size=8,
)
profile

Save the Report

  1. ProfileReport.to_json returns the report as a JSON string and ProfileReport.to_html returns the report as an HTML string.

  2. You can save a report to a file using the method ProfileReport.to_file. The output format (HTML or JSON) is determined by the extension of the file name you pass in.

from pathlib import Path
from loguru import logger
import pandas as pd
from ydata_profiling import ProfileReport


def dump_profile(df: pd.DataFrame | str | Path, title: str, output_dir: str | Path):
    """Run ydata-profiling on a DataFrame and dump the report into files.

    :param df: A pandas DataFrame, or the path to a Parquet/Pickle/CSV file to read it from.
    :param title: The title of the generated report.
    :param output_dir: The output directory for reports.
    :raises ValueError: If an input file other than Parquet/Pickle/CSV is provided.
    """
    if isinstance(df, str):
        df = Path(df)
    if isinstance(df, Path):
        logger.info("Reading the DataFrame from {}...", df)
        ext = df.suffix.lower()
        if ext == ".parquet":
            df = pd.read_parquet(df)
        elif ext == ".pickle":
            df = pd.read_pickle(df)
        elif ext == ".csv":
            df = pd.read_csv(df)
        else:
            raise ValueError("Only Parquet, Pickle and CSV files are supported!")
    logger.info("Shape of the DataFrame: {}", df.shape)
    logger.info("Profiling the DataFrame...")
    report = ProfileReport(df, title=title, minimal=True, explorative=True)
    if isinstance(output_dir, str):
        output_dir = Path(output_dir)
    # create the output directory (and any missing parents) if needed
    output_dir.mkdir(parents=True, exist_ok=True)
    # dump report
    logger.info("Dumping the report to HTML...")
    report.to_file(output_dir / "report.html")
    logger.info("Dumping the report to JSON...")
    report.to_file(output_dir / "report.json")
    logger.info("Dumping the report to Pickle...")
    report.dump(output_dir / "report.pickle")

Write the report to an HTML file.

profile.to_file(
    "../../../../../home/media/pandas-profiling/uci_bank_marketing_report.html"
)

Write the report to a JSON file.

profile.to_file(
    "../../../../../home/media/pandas-profiling/uci_bank_marketing_report.json"
)

Get the HTML string representation of the report.

profile.to_html()

Configuration

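The default settings live in src/ydata_profiling/config_default.yaml (src/pandas_profiling/config_default.yaml before the package was renamed).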

# Sort the variables. Possible values: ascending, descending or None (leaves original sorting)
sort: None
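
To make the three possible values concrete, here is a small sketch of what the sort option means for the order of variables. It only illustrates the semantics described in the comment above and is not the library's actual implementation; sort_variables is a hypothetical name.

```python
from typing import Optional


def sort_variables(names: list[str], sort: Optional[str]) -> list[str]:
    """Order variable names according to the `sort` setting."""
    if sort is None:
        return list(names)  # leave the original order untouched
    if sort == "ascending":
        return sorted(names)
    if sort == "descending":
        return sorted(names, reverse=True)
    raise ValueError('sort should be "ascending", "descending" or None')


if __name__ == "__main__":
    columns = ["job", "age", "marital"]
    print(sort_variables(columns, None))
    print(sort_variables(columns, "ascending"))
    print(sort_variables(columns, "descending"))
```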