Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Construct pandas DataFrames in Python

In [1]:
import pandas as pd

DataFrame from Dictionary

  1. By default, each key-value pair becomes a column in the resulting data frame. When using the method pandas.DataFrame.from_dict, you can specify the option orient="index" to make each key-value pair a row instead.

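  2. Starting from Python 3.7, a dict preserves insertion order. As a result, a pandas DataFrame created from a dict keeps the insertion order of the keys as its column order. Older pandas versions sorted the keys instead, which is why some outputs in this post show the columns in alphabetical order.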
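
The two points above can be checked with a small sketch. The dictionary and the variable names (data, by_column, by_row) are made up for illustration, and the result assumes a recent pandas version running on Python 3.7 or later.

```python
import pandas as pd

# Keys are deliberately inserted in the order "z", "x", "y".
data = {"z": [1, 1], "x": [1, 2], "y": [5, 4]}

# Default orient="columns": each key-value pair becomes a column.
by_column = pd.DataFrame.from_dict(data)

# orient="index": each key-value pair becomes a row.
by_row = pd.DataFrame.from_dict(data, orient="index")

print(list(by_column.columns))  # column labels follow the insertion order
print(list(by_row.index))       # the keys are now the row labels
```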

In [2]:
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1], "z": [1, 1, 1, 1, 1]})

df.head()
Out[2]:
x y z
0 1 5 1
1 2 4 1
2 3 3 1
3 4 2 1
4 5 1 1
In [3]:
df = pd.DataFrame.from_dict({"x": [1, 2, 3, 4, 5], "a": [5, 4, 3, 2, 1]})

df.head()
Out[3]:
x a
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
In [4]:
df = pd.DataFrame.from_dict(
    {"x": [1, 2, 3, 4, 5], "a": [5, 4, 3, 2, 1]}, orient="index"
)

df.head()
Out[4]:
0 1 2 3 4
x 1 2 3 4 5
a 5 4 3 2 1
In [8]:
df = pd.DataFrame.from_dict(
    {"how": 9, "are": 3, "you": 7, "doing": 5, "today": 6},
    orient="index",
    columns=["freq"],
)

df
Out[8]:
freq
how 9
are 3
you 7
doing 5
today 6

DataFrame from List of Dictionaries (as Rows)

Each dictionary is a row in the resulting data frame.

In [2]:
d = [
    {"points": 50, "time": "5:00", "year": 2010},
    {"points": 25, "time": "6:00", "month": "february"},
    {"points": 90, "time": "9:00", "month": "january"},
    {"points_h1": 20, "month": "june"},
]
pd.DataFrame(d)
Out[2]:
month points points_h1 time year
0 NaN 50.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN
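
The NaN values in the output above come from keys that are missing in some of the dictionaries. A minimal sketch of how to inspect this, using a made-up list named rows: a numeric column that contains missing values is stored as float, which is why year is printed as 2010.0.

```python
import pandas as pd

# Hypothetical rows: "year" and "month" each appear in only one dictionary.
rows = [
    {"points": 50, "year": 2010},
    {"points": 25, "month": "february"},
]
df_rows = pd.DataFrame(rows)

# Count the missing values per column.
print(df_rows.isna().sum())

# "points" stays an integer column; "year" becomes float because of the NaN.
print(df_rows.dtypes)
```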

DataFrame from List of Lists/Tuples (as Rows)

Each list/tuple is a row in the resulting data frame.

In [8]:
df = pd.DataFrame(
    data=[
        ["foo", "one", "small", 1],
        ["foo", "one", "large", 2],
        ["foo", "one", "large", 2],
        ["foo", "two", "small", 3],
        ["foo", "two", "small", 3],
        ["bar", "one", "large", 4],
        ["bar", "one", "small", 5],
        ["bar", "two", "small", 6],
        ["bar", "two", "large", 7],
    ],
    columns=["a", "b", "c", "d"],
)

df.head()
Out[8]:
a b c d
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
In [28]:
df = pd.DataFrame.from_records(
    data=[
        ["foo", "one", "small", 1],
        ["foo", "one", "large", 2],
        ["foo", "one", "large", 2],
        ["foo", "two", "small", 3],
        ["foo", "two", "small", 3],
        ["bar", "one", "large", 4],
        ["bar", "one", "small", 5],
        ["bar", "two", "small", 6],
        ["bar", "two", "large", 7],
    ],
    columns=["a", "b", "c", "d"],
)

df.head()
Out[28]:
a b c d
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3

DataFrame from List of Lists/Tuples (as Columns)

Each list/tuple is a column in the resulting data frame. Note that pd.concat on a list of lists/tuples won't work here. You have to first create a DataFrame with the lists/tuples as rows and then transpose it.

In [15]:
df = pd.DataFrame(
    data=[
        ["foo", "one", "small", 1],
        ["foo", "one", "large", 2],
        ["foo", "one", "large", 2],
        ["foo", "two", "small", 3],
        ["foo", "two", "small", 3],
        ["bar", "one", "large", 4],
        ["bar", "one", "small", 5],
        ["bar", "two", "small", 6],
        ["bar", "two", "large", 7],
    ],
    columns=["a", "b", "c", "d"],
).transpose()

df.head()
Out[15]:
0 1 2 3 4 5 6 7 8
a foo foo foo foo foo bar bar bar bar
b one one one two two one one two two
c small large large small small large small small large
d 1 2 2 3 3 4 5 6 7
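
As a possible alternative to transposing (not part of the original post), the built-in zip can regroup the values before the DataFrame is created. This is only a sketch with a shortened, made-up list named columns; it produces the same layout as the transpose approach.

```python
import pandas as pd

# Each inner list is meant to become one column.
columns = [
    ["foo", "one", "small", 1],
    ["bar", "two", "large", 7],
]

# Approach from the text: build the lists as rows, then transpose.
via_transpose = pd.DataFrame(columns).transpose()

# Alternative: zip(*columns) yields the values row by row.
via_zip = pd.DataFrame(list(zip(*columns)))

print(via_zip)
```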

DataFrame from One Series (as a Row)

The series is a row in the resulting data frame.

In [11]:
id = pd.Series([1, 2, 3, 4, 5], name="id")

pd.DataFrame(data=[id])
Out[11]:
0 1 2 3 4
id 1 2 3 4 5
In [12]:
id = pd.Series([1, 2, 3, 4, 5], name="id")

pd.DataFrame([id])
Out[12]:
0 1 2 3 4
id 1 2 3 4 5

DataFrame from Multiple Series (as Rows)

The series are rows in the resulting data frame.

In [13]:
id = pd.Series([1, 2, 3, 4, 5], name="id")
x = pd.Series(["a", "b", "c", "d", "e"], name="x")
pd.DataFrame([id, x])
Out[13]:
0 1 2 3 4
id 1 2 3 4 5
x a b c d e

DataFrame from One Series (as a Column)

The series is a column in the resulting data frame.

In [6]:
id = pd.Series([1, 2, 3, 4, 5], name="id")
id.to_frame()
Out[6]:
id
0 1
1 2
2 3
3 4
4 5
In [7]:
id = pd.Series([1, 2, 3, 4, 5], name="id")
pd.DataFrame(id)
Out[7]:
id
0 1
1 2
2 3
3 4
4 5

DataFrame from Multiple Series (as Columns)

The series are columns in the resulting data frame.

In [10]:
id = pd.Series([1, 2, 3, 4, 5], name="id")
x = pd.Series(["a", "b", "c", "d", "e"], name="x")
pd.concat([id, x], axis=1)
Out[10]:
id x
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
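
One thing to keep in mind is that pd.concat with axis=1 aligns the series on their index labels rather than on their positions. The sketch below uses made-up series (ids and x) whose indexes only partly overlap, so labels that exist in only one series get NaN in the other column.

```python
import pandas as pd

# "ids" uses the default index 0, 1, 2; "x" is given the index 1, 2, 3.
ids = pd.Series([1, 2, 3], name="id")
x = pd.Series(["a", "b", "c"], name="x", index=[1, 2, 3])

# Rows are matched by index label, so labels 0 and 3 are only partly filled.
aligned = pd.concat([ids, x], axis=1)
print(aligned)
```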

Series to Underlying Data

In [19]:
id = pd.Series([1, 2, 3, 4, 5], name="id")
id.tolist()
Out[19]:
[1, 2, 3, 4, 5]

DataFrame to Underlying Data

In [21]:
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "a": [5, 4, 3, 2, 1]})
print(df.head())
df.values.tolist()
   a  x
0  5  1
1  4  2
2  3  3
3  2  4
4  1  5
Out[21]:
[[5, 1], [4, 2], [3, 3], [2, 4], [1, 5]]

Index

An index will always be created. By default, a sequence of integers (starting from 0) is used as the index.

In [3]:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "a": [5, 4, 3, 2, 1]}, index=None)

df
Out[3]:
a x
0 5 1
1 4 2
2 3 3
3 2 4
4 1 5
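
For comparison, here is a small sketch of passing an explicit index instead of relying on the default one. The labels r1, r2, r3 and the variable names are invented for illustration.

```python
import pandas as pd

# Without an index argument, the row labels are 0, 1, 2.
default_idx = pd.DataFrame({"x": [1, 2, 3]})

# With an explicit index, rows can be looked up by these labels.
df_idx = pd.DataFrame({"x": [1, 2, 3]}, index=["r1", "r2", "r3"])

print(list(default_idx.index))  # [0, 1, 2]
print(df_idx.loc["r2", "x"])    # 2
```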

Column Names

Similar to the index, a sequence of integers (starting from 0) is used as the column names by default.

In [1]:
import pandas as pd
In [2]:
df = pd.DataFrame([(1, "a"), (2, "b")], columns=None)
df
Out[2]:
0 1
0 1 a
1 2 b
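
The default integer column names can be replaced either at construction time or afterwards. A minimal sketch, where the names id and x are chosen only for illustration:

```python
import pandas as pd

# No column names given: the columns are labeled 0 and 1.
unnamed = pd.DataFrame([(1, "a"), (2, "b")])

# Column names given at construction time.
named = pd.DataFrame([(1, "a"), (2, "b")], columns=["id", "x"])

# Column names assigned afterwards by mapping old labels to new ones.
renamed = unnamed.rename(columns={0: "id", 1: "x"})

print(list(unnamed.columns))  # [0, 1]
print(list(renamed.columns))  # ['id', 'x']
```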

Empty DataFrame

Create an empty DataFrame without any columns or rows.

In [2]:
pd.DataFrame({})
Out[2]:

Create an empty (no rows) DataFrame with 1 column named x.

In [4]:
df = pd.DataFrame({"x": []})
df
Out[4]:
x

Create an empty (no rows) DataFrame with 2 columns named x and y.

In [6]:
df = pd.DataFrame([], columns=["x", "y"])
df
Out[6]:
x y

You can use the property DataFrame.empty to check whether a DataFrame is empty or not.

In [3]:
df.empty
Out[3]:
True
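
Note that DataFrame.empty only looks at the shape of the data frame. The sketch below, with made-up variable names, contrasts a DataFrame without rows with one that has a single row containing only NaN; the latter is not considered empty.

```python
import pandas as pd

# Two columns but no rows: empty.
no_rows = pd.DataFrame([], columns=["x", "y"])

# One row whose only value is NaN: still not empty.
only_nan = pd.DataFrame({"x": [float("nan")]})

print(no_rows.empty)   # True
print(only_nan.empty)  # False
```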

You can operate on columns of an empty (no rows) DataFrame as usual.

In [66]:
df = pd.DataFrame({"cal_dt": []})
df.cal_dt = pd.to_datetime(df.cal_dt)
df
Out[66]:
cal_dt
In [72]:
len(df.cal_dt.unique())
Out[72]:
0
In [67]:
d = df.cal_dt.max() - df.cal_dt.min()
d
Out[67]:
NaT
In [14]:
pd.isnull(d)
Out[14]:
True
In [15]:
pd.isnull(None)
Out[15]:
True