Ben Chuanlong Du's Blog

It is never too late to learn.

Parsing YAML in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. PyYAML (which currently implements YAML 1.1) and ruamel.yaml (YAML 1.2) are two Python libraries for parsing YAML. PyYAML is the more widely used of the two.

  2. YAML (via PyYAML) is preferred over JSON (via the json module) for serialization and deserialization for several reasons.

    • YAML is (nearly) a superset of JSON, so PyYAML can read most JSON documents.
    • PyYAML supports serializing and deserializing sets while json does not.
    • YAML is more readable than JSON.
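
The points above can be checked with a short sketch. The variable names are only illustrative, and yaml.safe_load is used for the JSON text because it is enough for plain data.

```python
import json

import yaml

# A JSON document is also readable as YAML (flow style).
json_text = '{"a": 1, "b": {"c": [3, 4]}}'
from_json = json.loads(json_text)
from_yaml = yaml.safe_load(json_text)
print(from_yaml == from_json)

# A set survives a PyYAML round trip ...
items = {1, 2, 3}
restored = yaml.load(yaml.dump(items), Loader=yaml.FullLoader)
print(type(restored), restored)

# ... while the json module refuses to serialize it.
try:
    json.dumps(items)
    json_handles_sets = True
except TypeError:
    json_handles_sets = False
print(json_handles_sets)
```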
In [1]:
!pip3 install pyyaml
Requirement already satisfied: pyyaml in /usr/local/lib/python3.8/site-packages (5.3.1)
In [1]:
import yaml
In [2]:
doc = """
  a: 1
  b:
    c: 3
    d: 4
"""
In [4]:
dic = yaml.load(doc, Loader=yaml.FullLoader)
dic
Out[4]:
{'a': 1, 'b': {'c': 3, 'd': 4}}
In [5]:
with open("test.yml", "w") as fout:
    yaml.dump(dic, fout)
In [7]:
from pathlib import Path
yaml.load(Path("test.yml").read_text(), Loader=yaml.FullLoader)
Out[7]:
{'a': 1, 'b': {'c': 3, 'd': 4}}

Read YAML from a String

In [8]:
doc = """
- 
    cal_dt: 2019-01-01
- 
    cal_dt: 2019-01-02
    
    
"""
yaml.load(doc, Loader=yaml.FullLoader)
Out[8]:
[{'cal_dt': datetime.date(2019, 1, 1)}, {'cal_dt': datetime.date(2019, 1, 2)}]

Read YAML From a File (Single Doc)

In [10]:
with open("items.yaml") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
print(data)
{'raincoat': 1, 'coins': 5, 'books': 23, 'spectacles': 2, 'chairs': 12, 'pens': 6}
In [11]:
!cat set.yaml
!!set
1: null
2: null
3: null
In [12]:
with open("set.yaml", "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
data
Out[12]:
{1, 2, 3}
In [13]:
type(data)
Out[13]:
set

Read YAML (Multiple Docs)

Notice that the function yaml.load_all returns a generator, so the documents are parsed lazily as you iterate over them!

In [15]:
with open("data.yaml") as f:
    docs = yaml.load_all(f, Loader=yaml.FullLoader)
    for doc in docs:
        for k, v in doc.items():
            print(k, "->", v)
cities -> ['Bratislava', 'Kosice', 'Trnava', 'Moldava', 'Trencin']
companies -> ['Eset', 'Slovnaft', 'Duslo Sala', 'Matador Puchov']

Convert the generator to a list so that you can use it outside the with block.

In [16]:
with open("data.yaml") as f:
    docs = list(yaml.load_all(f, Loader=yaml.FullLoader))
In [17]:
docs
Out[17]:
[{'cities': ['Bratislava', 'Kosice', 'Trnava', 'Moldava', 'Trencin']},
 {'companies': ['Eset', 'Slovnaft', 'Duslo Sala', 'Matador Puchov']}]
In [18]:
for doc in docs:
    for k, v in doc.items():
        print(k, "->", v)
cities -> ['Bratislava', 'Kosice', 'Trnava', 'Moldava', 'Trencin']
companies -> ['Eset', 'Slovnaft', 'Duslo Sala', 'Matador Puchov']
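
Since data.yaml is not shown on this page, here is a self-contained sketch of the same behavior using an in-memory string. The document contents are shortened and made up for illustration. It also shows that the generator is exhausted after one pass, which is another reason to convert it to a list.

```python
import types

import yaml

# Two YAML documents separated by "---".
text = """\
cities:
  - Bratislava
  - Kosice
---
companies:
  - Eset
  - Slovnaft
"""

docs_gen = yaml.load_all(text, Loader=yaml.FullLoader)
is_generator = isinstance(docs_gen, types.GeneratorType)

docs = list(docs_gen)      # consumes the generator
leftover = list(docs_gen)  # nothing is left on a second pass

print(is_generator)
print(docs)
print(leftover)
```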

YAML Dump to String

In [19]:
users = [
    {"name": "John Doe", "occupation": "gardener"},
    {"name": "Lucy Black", "occupation": "teacher"},
]

print(yaml.dump(users))
- name: John Doe
  occupation: gardener
- name: Lucy Black
  occupation: teacher

In [20]:
print(yaml.dump(set([1, 2, 3])))
!!set
1: null
2: null
3: null

YAML Dump to File

In [21]:
with open("users.yaml", "w") as fout:
    yaml.dump(users, fout)
In [22]:
with open("set.yaml", "w") as fout:
    yaml.dump(set([1, 2, 3]), fout)
In [23]:
!cat set.yaml
!!set
1: null
2: null
3: null

Tokens

PyYAML also exposes a lower-level API for parsing YAML files. The function scan scans a YAML stream and produces scanning tokens.

The following example scans and prints tokens.

In [24]:
with open("items.yaml") as f:
    data = yaml.scan(f, Loader=yaml.FullLoader)
    for token in data:
        print(token)
StreamStartToken(encoding=None)
BlockMappingStartToken()
KeyToken()
ScalarToken(plain=True, style=None, value='raincoat')
ValueToken()
ScalarToken(plain=True, style=None, value='1')
KeyToken()
ScalarToken(plain=True, style=None, value='coins')
ValueToken()
ScalarToken(plain=True, style=None, value='5')
KeyToken()
ScalarToken(plain=True, style=None, value='books')
ValueToken()
ScalarToken(plain=True, style=None, value='23')
KeyToken()
ScalarToken(plain=True, style=None, value='spectacles')
ValueToken()
ScalarToken(plain=True, style=None, value='2')
KeyToken()
ScalarToken(plain=True, style=None, value='chairs')
ValueToken()
ScalarToken(plain=True, style=None, value='12')
KeyToken()
ScalarToken(plain=True, style=None, value='pens')
ValueToken()
ScalarToken(plain=True, style=None, value='6')
BlockEndToken()
StreamEndToken()
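
A self-contained variant of the same idea, scanning a short string instead of items.yaml (the string is a made-up two-key excerpt). Note that scalar values are still plain strings at the token level; type resolution happens later.

```python
import yaml
from yaml.tokens import ScalarToken

tokens = list(yaml.scan("raincoat: 1\ncoins: 5\n", Loader=yaml.FullLoader))

# Names of the token classes, in the order they were produced.
token_names = [type(token).__name__ for token in tokens]

# Only scalar tokens carry a value; it is always a str at this stage.
scalar_values = [token.value for token in tokens if isinstance(token, ScalarToken)]

print(token_names)
print(scalar_values)
```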

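Fix Indentation Issue

PyYAML currently has an indentation issue with lists: by default, a block sequence nested in a mapping is not indented relative to its key. For details, please refer to the GitHub issue "Incorrect indentation with lists" (#234).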

In [ ]:
class Dumper(yaml.Dumper):
    # Always pass indentless=False so that list items are indented under their key.
    def increase_indent(self, flow=False, *args, **kwargs):
        return super().increase_indent(flow=flow, indentless=False)


yaml.dump(data, Dumper=Dumper)
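
To make the difference visible, the sketch below dumps the same small mapping with the default dumper and with a dumper like the one above. The class name IndentedDumper and the sample data are only illustrative.

```python
import yaml


class IndentedDumper(yaml.Dumper):
    """Dumper that indents block sequences nested inside mappings."""

    def increase_indent(self, flow=False, *args, **kwargs):
        # Ignore the indentless flag chosen by the emitter and force False.
        return super().increase_indent(flow=flow, indentless=False)


config = {"a": [1, 2]}

default_out = yaml.dump(config, default_flow_style=False)
indented_out = yaml.dump(config, Dumper=IndentedDumper, default_flow_style=False)

print(default_out)   # list items start in the same column as the key
print(indented_out)  # list items are indented by two spaces
```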

Examples

In [20]:
with open("ex1.yaml", "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
print(data)
{'args': [{'cal_dt': '2019-01-01', 'path': '/path/1'}, {'cal_dt': '2019-01-02', 'path': '/path/2'}]}
In [21]:
with open("ex2.yaml", "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
print(data)
[{'cal_dt': datetime.date(2019, 1, 1), 'path': '/path/1'}, {'cal_dt': datetime.date(2019, 1, 2), 'path': '/path/2'}]
In [22]:
type(data[0]["cal_dt"])
Out[22]:
datetime.date
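
The difference between the two outputs above (strings vs. datetime.date) is most likely down to quoting: an unquoted ISO date is resolved to a date object, while a quoted one stays a string. The files ex1.yaml and ex2.yaml are not shown here, so the sketch below uses a made-up string instead. yaml.BaseLoader is another option if every scalar should stay a string.

```python
import datetime

import yaml

text = """\
unquoted: 2019-01-01
quoted: "2019-01-01"
"""

# FullLoader resolves the unquoted scalar to a date and leaves the quoted one alone.
parsed = yaml.load(text, Loader=yaml.FullLoader)
print(parsed)

# BaseLoader does no type resolution at all: everything is a str.
as_strings = yaml.load(text, Loader=yaml.BaseLoader)
print(as_strings)
```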
In [23]:
with open("ex3.yaml", "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
print(data)
{'args': {'x': [1, 2, 3], 'y': ['a', 'b', 'c']}}
In [24]:
with open("ex4.yaml", "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
print(data)
{'x': [1, 2, 3], 'y': ['a', 'b', 'c']}
In [26]:
with open("ex5.yaml", "r") as fin:
    data = yaml.load(fin, Loader=yaml.FullLoader)
data
Out[26]:
{'x': [1, 2, 3],
 'y': "import dsutil\ndsutil.datetime.range('2019-01-01', '2019-01-05')"}
In [27]:
data["y"]
Out[27]:
"import dsutil\ndsutil.datetime.range('2019-01-01', '2019-01-05')"
In [28]:
eval(compile(data["y"], "some_file", "exec"))
In [31]:
x = eval("range(10)")
In [32]:
x
Out[32]:
range(0, 10)
In [33]:
import json

json.dumps(list(x))
Out[33]:
'[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]'
In [34]:
list(exec(data["y"]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-42675847b9bb> in <module>
----> 1 list(exec(data['y']))

TypeError: 'NoneType' object is not iterable

eval, exec, single, compile

  1. Simple one-line Python code (eval): requires you to have every library already imported ...

  2. Multiple lines of code (exec): need a way to reliably run the code and return the result ...
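
One possible way to handle the multi-line case is sketched below. It is an assumption about what "return the result" should mean here: run every statement with exec, then evaluate the final expression (if there is one) with eval in the same namespace. The helper name run_and_return is hypothetical, and math stands in for dsutil so that the example is self-contained. Only do this with YAML you trust, since the code is executed as-is.

```python
import ast

import yaml


def run_and_return(source):
    """Run all statements; return the value of the trailing expression, if any."""
    tree = ast.parse(source, mode="exec")
    namespace = {}
    last_expr = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Detach the final expression so that it can be evaluated separately.
        last_expr = ast.Expression(body=tree.body.pop().value)
    exec(compile(tree, "<yaml>", "exec"), namespace)
    if last_expr is None:
        return None
    return eval(compile(last_expr, "<yaml>", "eval"), namespace)


doc = """
x: [1, 2, 3]
y: |
  import math
  [math.factorial(n) for n in range(5)]
"""
data = yaml.safe_load(doc)
result = run_and_return(data["y"])
print(result)
```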

In [ ]:
yaml.load("""!!python/list(range(10))""", Loader=yaml.FullLoader)
