Working with XML using Python

In this article, we will learn how to parse, explore, modify and populate an XML file using Python's ElementTree. We will look at what an XML file is, how its data is laid out, why it is used and how to explore its tree structure.

What is XML?

XML stands for “Extensible Markup Language”. It is a text format for storing and exchanging structured data. It is widely used on the web, where the data follows a defined structure that programs can read and interpret.

XML represents data as a tree-like structure that is easy to interpret and naturally supports hierarchy. Any page or file that follows the XML rules can be called an XML document.

An XML document is made up of sections called elements. An element consists of a start tag and an end tag and can contain other elements, which are called child elements. There is always exactly one top-level element, known as the root of the document.

Elements can have attributes, each of which has a name and a value. To understand this better, let’s look at the following XML file:

<countries>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
    </country>
    <country name="Canada">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="United States" direction="W"/>
    </country>
</countries>

In this document:
<countries> is the root tag
The root tag has <country> child elements
Each <country> has an attribute called name
Each <country> element also has child elements of its own: <rank>, <year>, <gdppc> and <neighbor>.

What is Python ElementTree?

The document above shows what an XML document looks like and how it stores data. Different programming languages handle XML in different ways. Python has a built-in library, ElementTree, that provides functions to read and manipulate XML files.

First of all, we have to import ElementTree. It is common practice to use the alias ET.

import xml.etree.ElementTree as ET

Parsing XML Data

Our document holds information about countries in multiple country tags, and each country tag contains details such as rank, year, neighbor and GDP per capita (gdppc). First, we read our XML file using ElementTree (ET). Then we get its root element with the getroot() function.

tree = ET.parse('your path to file/countries.xml')
root = tree.getroot()
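
If the XML is already in memory as a string rather than in a file, ET.fromstring() can be used instead of ET.parse(). The following is a minimal sketch using a trimmed, two-country copy of the sample document; the variable name xml_text is only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, held in a string instead of a file.
xml_text = """<countries>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</countries>"""

# fromstring() parses the text and returns the root Element directly,
# so there is no separate getroot() step.
root = ET.fromstring(xml_text)

print(root.tag)   # countries
print(len(root))  # 2 (number of direct child elements)
```

Note that ET.parse() returns an ElementTree object, while ET.fromstring() returns the root Element itself.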

We can check the root tag name as:

root.tag
output: 'countries'

root.attrib
output: {}

The tag of the root element is countries and it does not have any attributes. Now let’s find out what is inside the root tag.

for child in root:
    print(child.tag, child.attrib)

output:
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
country {'name': 'Canada'}
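
Besides looping over every child, ElementTree lets you look up children by tag name. The sketch below uses findall(), find() and get() on a trimmed, two-country copy of the sample document; the names xml_text and pairs are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document so the example runs on its own.
xml_text = """<countries>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</countries>"""
root = ET.fromstring(xml_text)

pairs = []
# findall() returns the direct children of root whose tag is 'country'.
for country in root.findall('country'):
    name = country.get('name')        # attribute lookup; None if missing
    rank = country.find('rank').text  # text of the first <rank> child
    pairs.append((name, rank))

print(pairs)  # [('Liechtenstein', '1'), ('Singapore', '4')]
```

Keep in mind that find() returns None when no matching child exists, so real-world code should check the result before reading .text.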

To list every element under the root, including the nested ones, we can use iter():

[elem.tag for elem in root.iter()]

output:
['countries',
'country',
'rank',
'year',
'gdppc',
'neighbor',
'country',
'rank',
'year',
'gdppc',
'neighbor',
'country',
'rank',
'year',
'gdppc',
'neighbor',
'country',
'rank',
'year',
'gdppc',
'neighbor']
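
iter() also accepts a tag name, which restricts the walk to matching elements at any depth. Here is a small sketch, again on a trimmed copy of the sample document, that collects only the neighbor elements and then the distinct tag names; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document so the example runs on its own.
xml_text = """<countries>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</countries>"""
root = ET.fromstring(xml_text)

# iter('neighbor') visits only <neighbor> elements, however deeply nested.
neighbors = [(n.get('name'), n.get('direction')) for n in root.iter('neighbor')]
print(neighbors)  # [('Austria', 'E'), ('Malaysia', 'N')]

# A set comprehension gives the distinct tag names used in the document.
unique_tags = sorted({elem.tag for elem in root.iter()})
print(unique_tags)  # ['countries', 'country', 'gdppc', 'neighbor', 'rank', 'year']
```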

To store each country's information in a Python list, we can do the following:

countries_list = []
for child in root:
    countryName, neighborName, rank, gdppc, year = '', '', '', '', ''
    countryName = child.attrib['name']
    for eachChild in child:
        if eachChild.tag == 'neighbor':
            # neighbor has no text; its data is stored in attributes
            neighborName = eachChild.attrib['name']
        elif eachChild.tag == 'rank':
            rank = eachChild.text
        elif eachChild.tag == 'gdppc':
            gdppc = eachChild.text
        elif eachChild.tag == 'year':
            year = eachChild.text
    obj = {
        'CountryName': countryName,
        'Rank': rank,
        'Neighbor': neighborName,
        'GDPPC': gdppc,
        'Year': year
    }
    countries_list.append(obj)

In this code example, we have extracted each country's information from the country element, its child tags and their attributes. Iterating over an element directly yields its child elements (the older getchildren() method was removed in Python 3.9).

countries_list

output:
[{'CountryName': 'Liechtenstein',
  'Rank': '1',
  'Neighbor': 'Austria',
  'GDPPC': '141100',
  'Year': '2008'},
 {'CountryName': 'Singapore',
  'Rank': '4',
  'Neighbor': 'Malaysia',
  'GDPPC': '59900',
  'Year': '2011'},
 {'CountryName': 'Panama',
  'Rank': '68',
  'Neighbor': 'Costa Rica',
  'GDPPC': '13600',
  'Year': '2011'},
 {'CountryName': 'Canada',
  'Rank': '68',
  'Neighbor': 'United States',
  'GDPPC': '13600',
  'Year': '2011'}]

This is our final output: a Python list extracted from the XML document.
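
The same extraction can be written more compactly with findtext() and get(), converting the numeric fields to int along the way. This is a sketch of one possible alternative, not the only way to do it; country_to_dict is a hypothetical helper name, and the XML is a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document so the example runs on its own.
xml_text = """<countries>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</countries>"""
root = ET.fromstring(xml_text)


def country_to_dict(country):
    """Convert one <country> element into a dict (hypothetical helper)."""
    neighbor = country.find('neighbor')
    return {
        'CountryName': country.get('name'),
        # findtext() returns the text of the first matching child.
        'Rank': int(country.findtext('rank')),
        'Neighbor': neighbor.get('name') if neighbor is not None else '',
        'GDPPC': int(country.findtext('gdppc')),
        'Year': int(country.findtext('year')),
    }


countries = [country_to_dict(c) for c in root.findall('country')]
print(countries[0])
# {'CountryName': 'Liechtenstein', 'Rank': 1, 'Neighbor': 'Austria', 'GDPPC': 141100, 'Year': 2008}
```

This version assumes every country has rank, gdppc and year children; if one is missing, findtext() returns None and int() raises a TypeError.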

ElementTree is an important Python library for handling and manipulating XML files. It breaks an XML document down into a tree structure that is easy to work with programmatically.
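
As a brief illustration of the manipulation side, the sketch below changes an element's text, sets an attribute, appends a new child and serializes the result back to a string. The one-country document and the updated attribute are made up for this example and are not part of the sample file.

```python
import xml.etree.ElementTree as ET

# A deliberately tiny document, invented for this illustration.
xml_text = '<countries><country name="Panama"><rank>68</rank></country></countries>'
root = ET.fromstring(xml_text)

country = root.find('country')
country.find('rank').text = '69'       # change the text of an existing element
country.set('updated', 'yes')          # add (or overwrite) an attribute
year = ET.SubElement(country, 'year')  # append a new child element
year.text = '2011'

# encoding='unicode' makes tostring() return a str instead of bytes.
xml_out = ET.tostring(root, encoding='unicode')
print(xml_out)
```

The changes live only in memory until the tree is serialized, so the original input is left untouched.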

Now you are ready to understand and work with basic XML parsing!
