XML is a widely used format for storing and transporting data over the internet, so parsing an XML file is a common programming task. Here we'll see how to parse and print the content of an XML file in the C programming language.
XML File Format
Before jumping into the code, we should understand the basic format of an XML file. We'll use this XML file as an example.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
</catalog>
XML is a markup language like HTML. Unlike HTML, its tags are not predefined: you choose your own tag names, as long as they follow XML's naming rules. That is why it is called eXtensible Markup Language. Here are a few XML constructs you should be aware of.
Tag: A tag is the most important XML markup construct. It starts with '<' and ends with '>'. <catalog> and <book> are examples of tags in our example XML file. There are three types of tag: 1) a start tag such as <catalog>, 2) an end tag such as </catalog> and 3) an empty-element tag such as <catalog />. Our example contains no empty-element tags.
Element: An element is a logical component of an XML document. It either starts with a start tag and ends with a matching end tag, or consists of a single empty-element tag. The characters between the start tag and the end tag, if any, are called the content of the element. The content can contain markup, including other elements, which are called children. Our example file contains one big catalog element, which has three book children. We can picture an XML file as a hierarchical tree of elements.
Attribute: An attribute is a markup construct consisting of a name-value pair. It appears inside a start tag or an empty-element tag. In our XML file, id is an attribute of the <book id="bk101"> tag.
C Program to Parse and Print XML File
The standard C library does not provide an XML parser, so I used the libxml2 parser. You have to install the libxml2 development library explicitly. If you don't have it installed already, run one of the following commands.
For Red Hat-based Linux:
yum install libxml2-devel
For Debian-based Linux:
apt-get install libxml2-dev
The Program
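#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Build with: gcc test.c `xml2-config --cflags --libs` */

/* Return 1 if the node has no child elements, 0 otherwise. */
int is_leaf(xmlNode *node)
{
    xmlNode *child = node->children;
    while (child)
    {
        if (child->type == XML_ELEMENT_NODE) return 0;
        child = child->next;
    }
    return 1;
}

/* Print the node, its following siblings and all of their descendants. */
void print_xml(xmlNode *node, int indent_len)
{
    while (node)
    {
        if (node->type == XML_ELEMENT_NODE)
        {
            /* Leaf: text content. Non-leaf: value of the "id" attribute.
               Both calls return a copy that must be released with xmlFree(). */
            xmlChar *value = is_leaf(node) ? xmlNodeGetContent(node)
                                           : xmlGetProp(node, BAD_CAST "id");
            printf("%*c%s:%s\n", indent_len * 2, '-', (const char *)node->name,
                   value ? (const char *)value : "(null)");
            if (value) xmlFree(value);
        }
        print_xml(node->children, indent_len + 1);
        node = node->next;
    }
}

int main(void)
{
    xmlDoc *doc = NULL;
    xmlNode *root_element = NULL;

    doc = xmlReadFile("dummy.xml", NULL, 0);
    if (doc == NULL)
    {
        /* Stop here: there is no tree to walk or free. */
        fprintf(stderr, "Could not parse the XML file\n");
        return 1;
    }
    root_element = xmlDocGetRootElement(doc);
    print_xml(root_element, 1);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}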
This program first reads the XML file using the xmlReadFile() function. The file name is hard-coded as "dummy.xml", so this file must be present in the directory you run the program from. xmlReadFile() returns an XML document tree, or NULL if the file cannot be read or parsed. We get the root element from the document tree using the xmlDocGetRootElement() function.
The root node (element) of the XML tree is passed to the print_xml() function, which prints the whole XML content in hierarchical form. This function walks through the input node and all of its following siblings. If a node is of type XML_ELEMENT_NODE, it prints some information about the node. libxml2 also stores other node types as siblings of element nodes, for example text nodes holding the whitespace between tags, which is why we skip every node that is not an element. For each element we print the tag name. If the element is a leaf, we print its content; otherwise, we print the value of its "id" attribute.
We do not print the content of non-leaf nodes because libxml2 returns the content of all nested children as the content of the node, so the output would be lengthy and repetitive. Apart from printing information about the node, print_xml() also calls itself recursively on the children of the current node. This way, all nodes get printed.
To compile this program, run this command.
gcc test.c `xml2-config --cflags --libs`
Here is the output of the program.
 -catalog:(null)
   -book:bk101
     -author:Gambardella, Matthew
     -title:XML Developer's Guide
     -genre:Computer
     -price:44.95
     -publish_date:2000-10-01
     -description:An in-depth look at creating applications
with XML.
   -book:bk102
     -author:Ralls, Kim
     -title:Midnight Rain
     -genre:Fantasy
     -price:5.95
     -publish_date:2000-12-16
     -description:A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
   -book:bk103
     -author:Corets, Eva
     -title:Maeve Ascendant
     -genre:Fantasy
     -price:5.95
     -publish_date:2000-11-17
     -description:After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
I am using the CodeBlocks IDE on Debian Buster, and after installing libxml2-devel the program would not build.
There are a couple of tricks:
1) Go to /usr/include/libxml2 and copy the libxml folder to /usr/include, so that the path to its contents is /usr/include/libxml.
2) Open Project -> Build Options -> Linker Settings tab -> Link Libraries and add /usr/lib/x86_64-linux-gnu/libxml2.so to the list.
Then it will find everything and build properly.
If you are using some other IDE, step 2 will of course be different.
If you are building the hard way from the command line, you will have to tell the linker where libxml2.so is located.
Thank you very much for sharing this information.
How do I read this XML file?
I am not able to read it using the code above.
Thank you.
You have to keep the “dummy.xml” file in the same directory you are running the program from.
This is one of the best articles for XML parsing. Thanks for sharing.
How can I convert an XML file into a tabular CSV file using C?
Try this function:
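void save_xml_to_csv(xmlNode *node, FILE *fp, int indent_len)
{
    int i = 0;
    while (node)
    {
        if (node->type == XML_ELEMENT_NODE)
        {
            /* Leaf: text content. Non-leaf: value of the "id" attribute. */
            xmlChar *value = is_leaf(node) ? xmlNodeGetContent(node)
                                           : xmlGetProp(node, BAD_CAST "id");
            /* One empty column per nesting level (assumed layout),
               then the tag name and its value. */
            for (i = 0; i < indent_len; i++) fprintf(fp, ",");
            fprintf(fp, "%s,%s\n", (const char *)node->name,
                    value ? (const char *)value : "");
            if (value) xmlFree(value);
        }
        save_xml_to_csv(node->children, fp, indent_len + 1);
        node = node->next;
    }
}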
Call this function like:
doc = xmlReadFile("dummy.xml", NULL, 0);
if (doc == NULL) return 1;   /* could not parse the XML file */
root_element = xmlDocGetRootElement(doc);
FILE *fp = fopen("output.csv", "w");
if (fp != NULL)
{
    save_xml_to_csv(root_element, fp, 0);
    fclose(fp);
}
xmlFreeDoc(doc);
xmlCleanupParser();
Another important thing: the text in the XML file should not contain any commas (,) or newlines, because this function writes values without quoting them and they will distort the output CSV file. The example XML file has both, so remove them before you try this.
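If editing the XML file is not an option, one possible alternative is to quote the values instead. The sketch below is not part of the function above. It is a hypothetical helper, csv_escape, that follows the common CSV convention of wrapping a field in double quotes when it contains a comma, a quote or a line break, and doubling any embedded quote. It uses only the standard C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write a CSV-safe copy of in to out.
   Returns the length of the result, or -1 if out is too small. */
static int csv_escape(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    if (out_size == 0) return -1;

    /* Quoting is only needed for fields with a comma, quote or line break. */
    int quote = strpbrk(in, ",\"\r\n") != NULL;
    size_t n = 0;

    if (quote)
    {
        if (n + 1 >= out_size) return -1;
        out[n++] = '"';
    }
    for (const char *p = in; *p; p++)
    {
        if (*p == '"')
        {
            /* An embedded quote is written twice. */
            if (n + 1 >= out_size) return -1;
            out[n++] = '"';
        }
        if (n + 1 >= out_size) return -1;
        out[n++] = *p;
    }
    if (quote)
    {
        if (n + 1 >= out_size) return -1;
        out[n++] = '"';
    }
    out[n] = '\0';
    return (int)n;
}
```

Inside save_xml_to_csv() you could pass each value through csv_escape() and write the escaped buffer with fprintf(). Spreadsheet programs generally read a quoted field as a single cell, even when it contains commas or line breaks.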
Very clear and crisp explanation. Good one. Thanks a lot
Thanks a lot for sharing this document.
I need some more information on XML parsing with a C program in Eclipse and Visual Studio 2013.
If there is any document on that, please share it.
Thanks in advance.