Ben Chuanlong Du's Blog

It is never too late to learn.

Extract Identifier Names from a SQL Query Using Sqlfluff

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

In [2]:
!pip3 install sqlfluff
Collecting sqlfluff
  Downloading sqlfluff-0.5.2-py3-none-any.whl (278 kB)
     |████████████████████████████████| 278 kB 6.4 MB/s 
Requirement already satisfied: pytest in /usr/local/lib/python3.8/dist-packages (from sqlfluff) (6.2.3)
Collecting diff-cover>=2.5.0
  Downloading diff_cover-5.0.1-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 5.0 MB/s 
Requirement already satisfied: pathspec in /usr/local/lib/python3.8/dist-packages (from sqlfluff) (0.8.1)
Collecting oyaml
  Downloading oyaml-1.0-py2.py3-none-any.whl (3.0 kB)
Collecting appdirs
  Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Requirement already satisfied: Jinja2 in /usr/local/lib/python3.8/dist-packages (from sqlfluff) (2.11.3)
Collecting cached-property
  Downloading cached_property-1.5.2-py2.py3-none-any.whl (7.6 kB)
Collecting bench-it
  Downloading bench_it-1.0.1-py2.py3-none-any.whl (19 kB)
Collecting colorama>=0.3
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: click>=7.1 in /usr/local/lib/python3.8/dist-packages (from sqlfluff) (7.1.2)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.8/dist-packages (from sqlfluff) (3.7.4.3)
Collecting configparser
  Downloading configparser-5.0.2-py3-none-any.whl (19 kB)
Requirement already satisfied: toml in /usr/local/lib/python3.8/dist-packages (from pytest->sqlfluff) (0.10.2)
Requirement already satisfied: pluggy<1.0.0a1,>=0.12 in /usr/local/lib/python3.8/dist-packages (from pytest->sqlfluff) (0.13.1)
Requirement already satisfied: iniconfig in /usr/local/lib/python3.8/dist-packages (from pytest->sqlfluff) (1.1.1)
Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.8/dist-packages (from pytest->sqlfluff) (20.3.0)
Requirement already satisfied: py>=1.8.2 in /usr/local/lib/python3.8/dist-packages (from pytest->sqlfluff) (1.10.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from pytest->sqlfluff) (20.9)
Collecting jinja2-pluralize
  Downloading jinja2_pluralize-0.3.0-py2.py3-none-any.whl (4.8 kB)
Requirement already satisfied: pygments in /usr/local/lib/python3.8/dist-packages (from diff-cover>=2.5.0->sqlfluff) (2.8.1)
Requirement already satisfied: chardet>=3.0.0 in /usr/lib/python3/dist-packages (from diff-cover>=2.5.0->sqlfluff) (3.0.4)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.8/dist-packages (from oyaml->sqlfluff) (5.4.1)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.8/dist-packages (from Jinja2->sqlfluff) (1.1.1)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->pytest->sqlfluff) (2.4.7)
Collecting inflect>=0.2.4
  Downloading inflect-5.3.0-py3-none-any.whl (32 kB)
Installing collected packages: inflect, jinja2-pluralize, diff-cover, oyaml, appdirs, cached-property, bench-it, colorama, configparser, sqlfluff
Successfully installed appdirs-1.4.4 bench-it-1.0.1 cached-property-1.5.2 colorama-0.4.4 configparser-5.0.2 diff-cover-5.0.1 inflect-5.3.0 jinja2-pluralize-0.3.0 oyaml-1.0 sqlfluff-0.5.2

Parse the SQL file Using sqlfluff parse

You can also use the Python API sqlfluff.parse to parse a SQL query/file, however, you have to manually handle exceptions at this time.

In [1]:
!sqlfluff parse example.sql > example_parse.log
/bin/bash: sqlfluff: command not found

Parse the Log File to Extract Identifier Names

In [ ]:
import re
from pathlib import Path
from typing import Set
In [ ]:
def extract_identifier(line):
    identifier = line.strip().rsplit(maxsplit=1)[-1]
    return identifier[1:-1]


def extract_reference(lines, idx):
    if "dot:" in lines[idx + 2]:
        return extract_identifier(lines[idx + 1]), extract_identifier(lines[idx + 3])
    return "", extract_identifier(lines[idx + 1])


def gen_mapping(names: Set[str], prefix):
    mapping = {}
    idx = 1
    for name in names:
        if name in mapping:
            continue
        name_lower = name.lower()
        if name_lower in mapping:
            mapping[name] = mapping[name_lower]
            continue
        mapping[name] = prefix + str(idx)
        mapping[name_lower] = mapping[name]
        idx += 1
    return mapping

    def anonymize(sql, *mappings):
        for mapping in mappings:
            for key, val in mapping.items():
                sql = sql.replace(key, val)
        return sql
In [ ]:
with Path("parse_example.log").open() as fin:
    lines = fin.readlines()
col_refs = [
    extract_reference(lines, idx)
    for idx, line in enumerate(lines)
    if re.search("\|\s*column_reference:", line)
]
cols = set(col for _, col in col_refs)
cols
In [ ]:
tab_refs = [
    extract_table_reference(lines, idx)
    for idx, line in enumerate(lines)
    if re.search("\|\s*table_reference:", line)
]
dbs = set(db for db, _ in tab_refs if db)
dbs
In [ ]:
tables = set(table for _, table in tab_refs if table)
tables

Anonymize a SQL File

Once you get all database, table and column names from a SQL query, you can anonymize the SQL query by replacing database, table and column names with non-sensitive names.

In [ ]:
sql = Path("example.sql").read_text()
sql_anon = anonymize(
    sql,
    gen_mapping(dbs, "db_"),
    gen_mapping(tables, "table_"),
    gen_mapping(cols, "col_"),
)

Comments