Introduction
As a GUI programmer, I prefer using the SAX2 API for my XML parsing needs because of its events-based usage model. However, the API set is large, COM-based (MS version), and requires you to work with wide character strings. I find it simpler to wrap the API and expose just the functionality that I need.
In this article, I present a C++ wrapper framework that allows you to do basic XML parsing and validation using SAX2. The wrapper interfaces use pure C++/STL, and the events-based model is retained. In addition, beyond just returning strings as they are parsed, I've included some basic XML data types in the wrapper layer to represent XML elements and attributes.
The attached demo is a VC++ 6 workspace (TestXml.dsw) composed of two separate projects. The first project is a Win32 static library (XmlSupport) that contains the wrapper framework classes. This project does not use MFC. The second project does use MFC and is a dialog-based test application (TestXmlSupport) that demonstrates how to use the wrapper classes. I've divided the code in this way to make it clearer which classes are intended for reuse and which are merely for demonstration. If you find the code useful, you can repackage the classes as you like, such as in a DLL. Also, note that the wrapper exposes only a small part of the SAX2 API set, but it's enough for what I want to demonstrate in this article.
Prerequisites
Support for SAX2 (the latest version of the Simple API for XML) is provided as part of the Microsoft XML Core Services (MSXML) SDK. I am using MSXML 4.0 SP2, which is a prerequisite for compiling the demo projects because they rely on functionality (such as schema validation using SAX2) that is only supported as of version 4.0. On my XP Pro machine, I initially had problems installing SP2, as I already had a partial install of MSXML 4.0. However, after I uninstalled that first (as recommended by Microsoft), I was able to re-install SP2 just fine. As I understand it, MSXML 4.0 can coexist with an MSXML 3.0 SDK installation.
Background
The SAX2 API is useful when you need to parse large XML documents because it does not require you to read in the entire file before returning parse results. When you initiate a parse with the SAX2 parser, XML elements and attributes are reported as soon as they are encountered, while the document is still being processed serially. In addition, with SAX2, you can abort a parse at any time. For example, this allows you to stop parsing as soon as you've found a particular record that's stored in your XML file. The test application demonstrates this ability to abort.
To use the SAX2 API directly yourself, you typically begin by specifying an import directive to incorporate information from the MSXML COM type library. For example, add the following to an include file before using any SAX2 interfaces or types (note: these examples do not use smart pointers):
#import <msxml4.dll> raw_interfaces_only
using namespace MSXML2;
The SAX2 parser is encapsulated by the COM class, SAXXMLReader40. The parser interface it supports is ISAXXMLReader, which offers methods such as parseURL(). Below is the code I use to create an instance of the COM class, and at the same time, get a pointer to the parser interface (m_reader):
void CXmlParserImpl::CreateSaxReader()
{
    HRESULT hr = CoCreateInstance(
                     __uuidof(SAXXMLReader40),
                     NULL,
                     CLSCTX_ALL,
                     __uuidof(ISAXXMLReader),
                     (void **)&m_reader);
    if ( SUCCEEDED(hr) )
    {
        ...
    }
}
To initiate a parse using the SAX reader, I simply call ISAXXMLReader::parseURL() and pass in a wide character string that contains an HTTP URL or the file path of an XML file. An HTTP URL can be used, for example, to specify an XML-based RSS feed from a website such as CNN.com.
HRESULT hr = m_reader->parseURL(wszURL);
Since SAX2 is events-based, you need to implement one or more event handler interfaces in order to receive parsing results or error notifications. For example, there is the ISAXContentHandler interface, which has virtual methods such as startElement() and endElement(). These methods are invoked by the SAX reader when it encounters the start or end of an XML element. In your application, you need to write a class that implements ISAXContentHandler and overrides methods such as startElement(). Then create an instance of your class and register the instance with the SAX reader. This registration is typically done immediately after creating the SAX reader:
void CXmlParserImpl::CreateSaxReader()
{
    ...
    if ( SUCCEEDED(hr) )
    {
        m_contentHandler = new CSaxContentHandler;
        hr = m_reader->putContentHandler(m_contentHandler);
        if ( FAILED(hr) )
        {
            delete m_contentHandler;
            m_contentHandler = NULL;
        }
    }
}
Similarly, to receive parsing errors, you can write a class that implements the ISAXErrorHandler interface, and register that handler instance with the SAX reader as well. The ISAXErrorHandler interface has virtual methods such as error() and fatalError() that are called when errors in the input XML source are detected.
Below is an example XML file, test.xml, which is a slightly modified version of a sample file from the MSDN technical article, "JumpStart for Creating a SAX2 Application with C++" (see References section for link). I've added comments beside each line in the code block below to indicate which of the relevant ISAXContentHandler methods are invoked as each line is processed by the SAX reader.
<?xml version="1.0" encoding="ISO-8859-1"?> ........................ startDocument()
<root> ............................................................. startElement()
    <PARTS> ........................................................ startElement()
        <PART ID="ABC" Tag="cab"> .................................. startElement()
            <PARTNO>12345</PARTNO> ................................. startElement(), characters(), endElement()
            <DESCRIPTION>VIP - Very Important Part</DESCRIPTION> ... startElement(), characters(), endElement()
        </PART> .................................................... endElement()
        <PART ID="XYZ" Tag="zxy"> .................................. startElement()
            <PARTNO>5678</PARTNO> .................................. startElement(), characters(), endElement()
            <DESCRIPTION>LIP - Less Important Part</DESCRIPTION> ... startElement(), characters(), endElement()
        </PART> .................................................... endElement()
    </PARTS> ....................................................... endElement()
</root> ............................................................ endElement(), endDocument()
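To make the trace above more concrete, here is a minimal, portable sketch (no MSXML or COM involved) that replays the start/end events for the first PARTNO element and tracks the nesting depth. The DepthTracker class and the ReplayFirstPart() function are hypothetical names used only for illustration; they are not part of ISAXContentHandler or of the wrapper described in this article.

```cpp
#include <algorithm>
#include <string>

// Hypothetical stand-in for a content handler: it only tracks nesting depth.
class DepthTracker
{
public:
    DepthTracker() : m_depth(0), m_maxDepth(0) {}

    // Called where the SAX reader would invoke startElement().
    void StartElement(const std::string& /*name*/)
    {
        ++m_depth;
        m_maxDepth = std::max(m_maxDepth, m_depth);
    }

    // Called where the SAX reader would invoke endElement().
    void EndElement(const std::string& /*name*/)
    {
        --m_depth;
    }

    int Depth() const { return m_depth; }
    int MaxDepth() const { return m_maxDepth; }

private:
    int m_depth;     // number of currently open elements
    int m_maxDepth;  // deepest nesting level seen so far
};

// Replays the events for root > PARTS > PART > PARTNO in document order.
inline DepthTracker ReplayFirstPart()
{
    DepthTracker tracker;
    tracker.StartElement("root");
    tracker.StartElement("PARTS");
    tracker.StartElement("PART");
    tracker.StartElement("PARTNO");
    tracker.EndElement("PARTNO");
    tracker.EndElement("PART");
    tracker.EndElement("PARTS");
    tracker.EndElement("root");
    return tracker;
}
```

Because every startElement() is eventually matched by an endElement(), the depth returns to zero at the end of the document, and the deepest level reached for this file is four.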
The C++ wrapper discussed in this article hides many of the coding details above, such as creation of the COM class, optional remapping of wide character strings, etc. However, the wrapper layer still uses an events-based approach, so understanding the above will help in learning how to use and even extend the wrapper.
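As an illustration of the string remapping mentioned above, the following is a minimal, portable sketch of converting between char-based and wide-character strings. It assumes the text is plain 7-bit ASCII. The helper names Widen() and Narrow() are hypothetical, and the actual wrapper may well use a Windows API such as WideCharToMultiByte() instead.

```cpp
#include <string>

// Hypothetical helper: widens a char string one character at a time.
// Assumes the input is 7-bit ASCII.
inline std::wstring Widen(const std::string& s)
{
    std::wstring out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i)
        out += static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
    return out;
}

// Hypothetical helper: narrows a wide string. Characters outside the
// 7-bit ASCII range cannot be represented and are replaced with '?'.
inline std::string Narrow(const std::wstring& ws)
{
    std::string out;
    out.reserve(ws.size());
    for (std::wstring::size_type i = 0; i < ws.size(); ++i)
    {
        const unsigned long code = static_cast<unsigned long>(ws[i]);
        out += (code < 0x80) ? static_cast<char>(code) : '?';
    }
    return out;
}
```

The lossy '?' substitution in Narrow() is one reason a wide-character variant of the wrapper classes is worth having for documents that contain non-ASCII text.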
The C++ wrapper layer
The classes that are exported by the C++ wrapper layer are described below. Note that these classes use char-based STL strings. For each of these classes, there is an equivalent class in the wrapper layer that uses wide-character (wchar_t) strings instead. This is discussed further in a later section.
- CXmlAttribute: This is a data class that represents a single XML attribute. An XML attribute is basically a name-value pair of strings. For example, in the test.xml file listed earlier, ID="ABC" is an attribute where the attribute name is "ID" and the attribute value is "ABC".
- CXmlElement: This is a data class that represents either a start element or an end element. It contains information such as the element name and a set of zero or more CXmlAttribute objects. For example, in the test.xml file listed earlier, <PART ID="ABC" Tag="cab"> is an XML element with element name "PART" and two attributes.
- IXmlElementHandler: This is a pure abstract class that defines an interface for receiving/handling XML event notifications during parsing. One of your application classes should implement this interface and register itself with a CXmlParser instance.
class IXmlElementHandler
{
public:
    virtual void OnXmlStartElement(const CXmlElement& xmlElement) = 0;
    virtual void OnXmlElementData(const std::string& elementData,
                                  int depth) = 0;
    virtual void OnXmlEndElement(const CXmlElement& xmlElement) = 0;
    virtual void OnXmlError(int line, int column,
                            const std::string& errorText, unsigned long errorCode) = 0;
    virtual bool OnXmlAbortParse(const CXmlElement& xmlElement) = 0;
};
- CXmlParser: This is the primary class in the wrapper layer. It's a concrete class that wraps the functionality of the SAX2 reader (parser). Below is the class definition for reference:
class CXmlParser
{
public:
    CXmlParser();
    ~CXmlParser();
    bool IsReady() const;
    void AttachElementHandler(IXmlElementHandler* pElementHandler);
    void DetachElementHandler();
    bool SetParserFeature(const std::string& featureName, bool value);
    bool GetParserFeature(const std::string& featureName,
                          bool& value) const;
    bool AddValidationSchema(const std::string& namespaceURI,
                             const std::string& xsdPath);
    bool RemoveValidationSchema(const std::string& namespaceURI);
    bool Parse(const std::string& xmlPath);
private:
    CXmlParserImpl* m_impl;
};
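To show how a handler might use the abort callback to stop as soon as a record is found, here is a minimal, self-contained sketch. It does not use the real wrapper types: SketchElement is a hypothetical stand-in for CXmlElement (whose accessors are not shown in this article), PartFinder mirrors two of the IXmlElementHandler callbacks without deriving from the interface, and FeedStartElements() simulates the parser under the assumption that the abort callback is consulted after each start element.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for CXmlElement: a name plus name/value attributes.
struct SketchElement
{
    std::string name;
    std::map<std::string, std::string> attributes;
};

// Looks for a PART element with a given ID and requests an abort once found.
class PartFinder
{
public:
    explicit PartFinder(const std::string& wantedId)
        : m_wantedId(wantedId), m_found(false), m_seen(0) {}

    void OnXmlStartElement(const SketchElement& e)
    {
        ++m_seen;
        std::map<std::string, std::string>::const_iterator it = e.attributes.find("ID");
        if (e.name == "PART" && it != e.attributes.end() && it->second == m_wantedId)
            m_found = true;
    }

    // Returning true asks the caller to stop parsing.
    bool OnXmlAbortParse(const SketchElement&) const { return m_found; }

    bool Found() const { return m_found; }
    int ElementsSeen() const { return m_seen; }

private:
    std::string m_wantedId;
    bool m_found;
    int m_seen;
};

// Simulated parse loop. Returns true if all elements were processed,
// false if the handler aborted early.
inline bool FeedStartElements(const std::vector<SketchElement>& elements, PartFinder& handler)
{
    for (std::size_t i = 0; i < elements.size(); ++i)
    {
        handler.OnXmlStartElement(elements[i]);
        if (handler.OnXmlAbortParse(elements[i]))
            return false;
    }
    return true;
}

// Builds the start elements of test.xml in document order.
inline std::vector<SketchElement> MakeTestXmlStartElements()
{
    std::vector<SketchElement> v;
    SketchElement e;
    e.name = "root";  v.push_back(e);
    e.name = "PARTS"; v.push_back(e);
    e.name = "PART";  e.attributes["ID"] = "ABC"; e.attributes["Tag"] = "cab"; v.push_back(e);
    e.attributes.clear();
    e.name = "PARTNO";      v.push_back(e);
    e.name = "DESCRIPTION"; v.push_back(e);
    e.name = "PART";  e.attributes["ID"] = "XYZ"; e.attributes["Tag"] = "zxy"; v.push_back(e);
    e.attributes.clear();
    e.name = "PARTNO";      v.push_back(e);
    e.name = "DESCRIPTION"; v.push_back(e);
    return v;
}
```

Searching for ID "ABC" stops after the third start element, whereas searching for an ID that is not in the file visits all eight start elements and runs to completion.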
You can find each of the above classes defined in the XmlSupport project. As mentioned earlier, this is a Win32 static library. To use the library in your own application, just include the XmlParser.h file, and link your project against XmlSupport.lib. Note that the run-time library I am using in my projects is Debug Multithreaded DLL (for the Win32 Debug configuration) and Multithreaded DLL (for the Win32 Release configuration). Check the project settings of your application (C/C++ tab, Code Generation category) to make sure your settings are compatible, or else you will get linker errors.
Validation
Before discussing the test application, I will backtrack a bit and give some background information on validating XML files using SAX2. As of MSXML 4.0, SAX2 supports validation using XML schemas as defined in XSD files. An XML schema is like a grammar for describing XML instance documents. As an analogy, I like to think of it as a blueprint that tells you how to build a house and also allows you to check whether you've built it correctly. For example, for the test.xml file listed earlier, you can create a corresponding test.xsd file that defines a schema which can be used to detect errors such as someone using <PARTNUM> instead of <PARTNO> in the XML file, or <DESC> instead of <DESCRIPTION>.
The SAX2 reader reports validation errors during parsing through the normal error-handling mechanism. You just need to attach an error handler that implements ISAXErrorHandler as discussed earlier.
By default, the SAX2 reader does not perform validation during parsing. To enable validation, you must turn on a "feature", which is like a boolean property that controls a particular parsing option. Using ISAXXMLReader directly:
HRESULT hr = m_reader->putFeature(L"schema-validation", VARIANT_TRUE);
Correspondingly, in the CXmlParser wrapper class, you can set features using the SetParserFeature() method.
bool result = m_xmlParser->SetParserFeature("schema-validation", true);
Once validation is enabled, the SAX reader must be able to find the XSD file corresponding to the XML file being parsed. There are two ways to do this. The first, and easiest, way is to have the XML file reference the location of the schema file. For example, the books.xml file from MSDN examples on validation contains an attribute specification near the top of the file that specifies books.xsd as the schema file to use.
xsi:schemaLocation="urn:books books.xsd"
A second SAX reader feature, "use-schema-location", must also be enabled in order to validate using the schema location. This feature is enabled by default, but it can be set explicitly:
bool result = m_xmlParser->SetParserFeature("use-schema-location", true);
The second way to associate an XML file with an XSD file for validation is to use a schema cache. A schema cache is basically a container of XSD file paths, each indexed by a key: the XML namespace. Take a look again at the schemaLocation attribute I showed earlier. The string "urn:books" is actually the namespace associated with the books.xsd schema file. When you add a new XSD path to the schema cache, you need to specify the namespace to use. The code below shows how the schema cache is created and associated with the SAX reader:
void CXmlParserImpl::CreateSchemaCache()
{
    if ( m_reader == NULL )
        return;
    HRESULT hr = CoCreateInstance(
                     __uuidof(XMLSchemaCache40),
                     NULL,
                     CLSCTX_ALL,
                     __uuidof(IXMLDOMSchemaCollection2),
                     (void **)&m_schemaCache);
    if ( SUCCEEDED(hr) )
    {
        hr = m_reader->putProperty(L"schemas",
                                   _variant_t(m_schemaCache));
        if ( FAILED(hr) )
        {
            OutputDebugString("CXmlParserImpl::Create"
                "SchemaCache(): putProperty(\"schemas\",...) failed\n");
        }
    }
}
Once the schema cache is created and registered with the SAX reader, you can add a schema:
hr = m_schemaCache->add(wszNamespaceURI, _variant_t(xsdPath.c_str()));
Or, remove a schema:
hr = m_schemaCache->remove(wszNamespaceURI);
Note that if you add the wrong schema (e.g., you specify a namespace that is not used by your XML file), validation won't work properly, even if you have the "schema-validation" feature enabled in the SAX reader. In this case, the SAX reader will report an error when it finishes parsing the root element: "Validate failed because the root element had no associated DTD/Schema". If an XML file does not use namespaces, you can use an empty string ("") for the namespace when adding/removing schemas. A proper validation error reported by the SAX reader looks something like: "Element content is invalid according to the DTD/Schema. Expecting: ...".
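The namespace-keyed lookup described above can be pictured with a small conceptual model. The sketch below is only an analogy built on std::map; the SchemaCacheModel class and its methods are hypothetical, and it is not how MSXML's XMLSchemaCache40 is implemented. It does show why a schema registered under the wrong namespace is never found, and how the empty string serves as the key for documents without namespaces.

```cpp
#include <map>
#include <string>

// Conceptual model only: maps a namespace URI to an XSD file path.
class SchemaCacheModel
{
public:
    // Adds (or, in this model, replaces) the schema registered for a namespace.
    void Add(const std::string& namespaceURI, const std::string& xsdPath)
    {
        m_schemas[namespaceURI] = xsdPath;
    }

    // Removes a schema; returns false if the namespace was not registered.
    bool Remove(const std::string& namespaceURI)
    {
        return m_schemas.erase(namespaceURI) > 0;
    }

    // Looks up the schema for the namespace used by a document.
    // Returns false when nothing is registered under that namespace.
    bool Find(const std::string& namespaceURI, std::string& xsdPath) const
    {
        std::map<std::string, std::string>::const_iterator it = m_schemas.find(namespaceURI);
        if (it == m_schemas.end())
            return false;
        xsdPath = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> m_schemas;
};
```

In this model, a document whose namespace has no entry in the map simply has no schema to validate against, which corresponds to the "no associated DTD/Schema" situation.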
The TestXmlSupport application
The TestXmlSupport application demonstrates the use of the CXmlParser wrapper class. It's a dialog-based MFC application that allows you to choose an XML file, initiate a parse, and see the parsing results appear in a list box. The parsing results consist of "Log" messages and a printout of each XML element as it is received by the class that implements IXmlElementHandler. Below is a snapshot of the application after it has parsed an RSS feed from the CNN website.
The dialog has a validation section that allows you to experiment with various parser options. The Enable exhaustive errors checkbox corresponds to the "exhaustive-errors" feature. When enabled, the SAX reader will continue parsing even if it has found an error. This allows you to receive all of the errors in the file instead of just the first one encountered. The other two checkboxes correspond to the "schema-validation" and "use-schema-location" features which were discussed earlier.
Below the row of checkboxes is a set of controls for adding and removing schemas. To use the controls, you type in a namespace string, select your XSD file, and then press the Add button to add the schema to the parser. A log message in the results list box will tell you whether the operation succeeded or not. Note that it is possible to leave the namespace field empty if your XML file does not use namespaces. However, I will mention again that even if the add operation succeeds, if the namespace you entered (blank or otherwise) does not actually match what your XML file is using, then validation will not work properly.
The bottom half of the dialog contains a list box that displays parsing results, along with options for controlling the parse. There is an edit box that allows you to specify a parsing delay in milliseconds. This tells the application to pause for a short period of time after each start element is encountered. Using a value of 500 milliseconds, for example, will slow down the parsing enough that the user has time to abort the parse. Otherwise, the parse will likely complete before you can press the Abort button. The Clear button clears the contents of the results list box.
Instead of embedding a lot of logic into the dialog class, I've moved most of its functionality into a helper class, CXmlTester. This is actually the class that uses the CXmlParser wrapper and also handles the XML parser events by implementing the IXmlElementHandler interface. For example, here is the CXmlTester implementation of the virtual OnXmlStartElement() method:
void CXmlTester::OnXmlStartElement(const CXmlElement& xmlElement)
{
    if ( m_resultsWnd != NULL )
    {
        m_resultsWnd->InsertString(-1, xmlElement.ToString().c_str());
    }
    if ( m_parsingDelay > 0 )
    {
        PumpWindowsMessages();
        ::Sleep(m_parsingDelay);
    }
}
In the TestXmlSupport project, if you go to the FileView in the workspace window within VC6, you can see I've added a folder to the project called "XML Files". Here I've inserted some sample XML/XSD files that you can try out using the test application. These files are slightly modified versions taken from the MSDN documentation:
- test.xml: Models a parts catalog.
- books.xml/books.xsd: Models a catalog of books. Uses the default namespace (empty string). I've modified this from the MSDN original by changing the element name of the last book record to "mybook". This is to make the XML file non-valid according to books.xsd (so I can test that validation is working and that it is able to detect the error that I introduced).
- books2.xml/books2.xsd: Models a set of books. Uses the namespace "urn:books". The XML file also is non-valid according to the XSD file (so it can also be used to test that validation is working). The XML file uses the schemaLocation attribute.
Building an application model
Although CXmlTester exercises all of the methods in the CXmlParser wrapper class, it's not a practical example of why you would want to parse an XML file in the first place. Typically, you want to do more than just re-display the XML file in a list box, or log parser events. Thus, I've provided a second class, CBookCatalog, in the test application which models a catalog of books. This class also uses CXmlParser in order to build up a collection of books and is designed to work with the books.xml file. In the dialog, when the user presses the Parse button, the CXmlTester instance is used to perform an initial parse. If that parse completes successfully, a CBookCatalog instance will attempt to build its catalog from the same XML file (thus parsing a second time). If the proper books.xml file was selected, the catalog should have 11 books in total. If not, the catalog will remain empty. If the catalog is built successfully, you can search for a particular book in the catalog by entering a book ID in the dialog and then pressing the Find Book button. The search result will be displayed in a simple message box.
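The idea of building an application model from parser events can be sketched as follows. This is not the CBookCatalog source; BookCatalogSketch is a hypothetical, simplified class, and the element names ("book", "title"), the ID attribute, and the sample data are assumptions made for illustration. The callbacks take plain strings instead of CXmlElement so that the sketch stays self-contained.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical catalog that keeps one title per book ID.
class BookCatalogSketch
{
public:
    // idAttribute is the value of the element's ID attribute, or "" if none.
    void OnStartElement(const std::string& name, const std::string& idAttribute)
    {
        m_currentElement = name;
        if (name == "book")
            m_currentId = idAttribute;
    }

    // Character data may arrive in several chunks, so it is appended.
    void OnElementData(const std::string& data)
    {
        if (m_currentElement == "title" && !m_currentId.empty())
            m_titles[m_currentId] += data;
    }

    void OnEndElement(const std::string& name)
    {
        m_currentElement.clear();
        if (name == "book")
            m_currentId.clear();
    }

    // Returns true and fills in the title if the book ID is known.
    bool FindBook(const std::string& id, std::string& title) const
    {
        std::map<std::string, std::string>::const_iterator it = m_titles.find(id);
        if (it == m_titles.end())
            return false;
        title = it->second;
        return true;
    }

    std::size_t Count() const { return m_titles.size(); }

private:
    std::string m_currentElement;                 // element whose data is being read
    std::string m_currentId;                      // ID of the enclosing book, if any
    std::map<std::string, std::string> m_titles;  // book ID -> title
};

// Replays the events for a made-up two-book catalog.
inline BookCatalogSketch BuildSampleCatalog()
{
    BookCatalogSketch catalog;
    catalog.OnStartElement("catalog", "");
    catalog.OnStartElement("book", "b1");
    catalog.OnStartElement("title", "");
    catalog.OnElementData("First ");
    catalog.OnElementData("Title");
    catalog.OnEndElement("title");
    catalog.OnEndElement("book");
    catalog.OnStartElement("book", "b2");
    catalog.OnStartElement("title", "");
    catalog.OnElementData("Second Title");
    catalog.OnEndElement("title");
    catalog.OnEndElement("book");
    catalog.OnEndElement("catalog");
    return catalog;
}
```

The handler keeps just enough state (the current element and the enclosing book ID) to know where each piece of character data belongs, which is the usual pattern when turning a flat event stream into a structured model.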
Wide-character string support
In my initial posting of the demo source code, the classes exported by the wrapper layer used char-based strings (e.g., std::string). In other words, there is a conversion that takes place in the wrapper layer from SAX strings (which are wide-character) to std::string. While this can be convenient for applications which are not using Unicode strings, feedback from ucancode members correctly pointed out that this is non-standard and that, to be generic, wide-character strings should be supported. Thus, I've added an alternate set of classes to the wrapper layer based on std::wstring. To avoid duplication of code, I used templates to parameterize the two XML data types:
typedef CBasicXmlAttribute<char> CXmlAttribute;
typedef CBasicXmlAttribute<wchar_t> CWXmlAttribute;
typedef CBasicXmlElement<char> CXmlElement;
typedef CBasicXmlElement<wchar_t> CWXmlElement;
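For readers unfamiliar with this technique, here is a minimal sketch of how a class can be parameterized on the character type. The class and accessor names below are hypothetical and are not taken from the XmlSupport source; the actual CBasicXmlAttribute template may be laid out differently.

```cpp
#include <string>

// Hypothetical attribute class parameterized on the character type.
template <typename CharT>
class BasicAttributeSketch
{
public:
    typedef std::basic_string<CharT> StringType;

    BasicAttributeSketch(const StringType& name, const StringType& value)
        : m_name(name), m_value(value) {}

    const StringType& GetName() const { return m_name; }
    const StringType& GetValue() const { return m_value; }

private:
    StringType m_name;
    StringType m_value;
};

// One template, two concrete types: char-based and wchar_t-based.
typedef BasicAttributeSketch<char>    AttributeSketch;
typedef BasicAttributeSketch<wchar_t> WAttributeSketch;
```

With this arrangement, std::basic_string<char> is the same type as std::string and std::basic_string<wchar_t> is the same type as std::wstring, so the two typedefs share a single implementation.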
I also added a new element handler interface, IWXmlElementHandler, which is used in conjunction with the wide-character version of the parser wrapper class, CWXmlParser. These changes do not affect the test application that I provided originally, as none of the original class names were changed. However, in order to test the new parser wrapper, CWXmlParser, I decided to write a simple console application, TestConsole, that parses the books2.xml file a couple of times and prints out the elements as they are received.
Summary
The presented wrapper framework can be a starting point for adding XML parsing support to your own applications. I want to make it clear, though, that this is not a generic solution for all cases, as I have only wrapped a small subset of the SAX2 API. If you have more generic requirements, perhaps this article can help you to come up with a better wrapper design. Or, if you are relatively new to the MSXML SDK, you can use this work to help you ramp up more quickly than by relying solely on the MSDN documentation. Another goal of my article was to provide an example of how to structure/organize classes for reusability. For example, the wrapper approach should help in applications where you need to parse more than one type of XML file. Future areas to look at include the MXXMLWriter COM class in SAX2, which provides for XML writing/generation.
References