Reading EDXML Data

The EDXML SDK features several classes and subpackages for parsing EDXML data streams, all based on the excellent lxml library. All EDXML parsers are incremental, which makes it possible to develop efficient system components that process a never-ending stream of input events.

EDXML Parsers

All EDXML parsers are based on the EDXMLParserBase class, which has several subclasses for specific purposes. During parsing, the parser generates calls to a set of callback methods which can be overridden to process input data. There are callbacks for processing events and tracking ontology updates.

The EDXMLParserBase class has two types of subclasses: push parsers and pull parsers. Pull parsers read from a provided file-like object in a blocking fashion. Push parsers need to be actively fed with string data. Push parsers provide control over the input process, which allows implementing efficient, low-latency event processing components.

The two most used EDXML parsers are EDXMLPullParser and EDXMLPushParser. For the specific purpose of extracting ontology data from EDXML data, there are EDXMLOntologyPullParser and EDXMLOntologyPushParser. The latter pair of classes skip event data and only invoke callbacks when ontology information is received.

An example of using the pull parser is shown below:

import os

from edxml import EDXMLPullParser


class MyParser(EDXMLPullParser):
    def _parsed_event(self, event):
        # Do whatever you want here
        ...

    def _parsed_ontology(self, ontology):
        # Do whatever you want here
        ...


with MyParser() as parser:
    parser.parse(os.path.dirname(__file__) + '/input.edxml')

Besides extending a parser class and overriding callbacks, there is a secondary mechanism specifically for processing events. EDXML parsers allow callbacks to be registered for specific event types or for events from specific sources. These callbacks can be any Python callable. This allows EDXML data streams to be processed using a set of classes, each of which is registered with the parser to process specific event data. The parser takes care of routing the events to the appropriate class.
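As a minimal sketch of this mechanism, a handler can be a small class whose instances are callable. The class name EventBuffer and its register() helper are hypothetical and not part of the SDK. The only SDK call used is set_event_type_handler(), as described in the class documentation of EDXMLParserBase.

```python
class EventBuffer:
    """Hypothetical handler: a callable that collects the events it receives."""

    def __init__(self):
        self.events = []

    def __call__(self, event):
        # The parser calls the handler with the parsed event as its only argument.
        self.events.append(event)

    def register(self, parser, event_types):
        # Ask the parser to route events of the given types to this instance.
        parser.set_event_type_handler(event_types, self)
        return self
```

Given a parser instance, EventBuffer().register(parser, ['some.event.type']) would collect all events of that type while the parser runs. The event type name used here is a placeholder.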

EventCollection

Instead of reading EDXML data in a streaming fashion, it can also be useful to read and access an EDXML document as a whole. This can be done using the EventCollection class. The following example illustrates this:

import os

from edxml import EventCollection

with open(os.path.dirname(__file__) + '/input.edxml', 'rb') as f:
    data = f.read()

collection = EventCollection.from_edxml(data)

for event in collection:
    print(event)

Class Documentation

The class documentation of the various parsers can be found below.

EDXMLParserBase

class edxml.EDXMLParserBase(validate=True)

Bases: object

This is the base class for all EDXML parsers.

Create a new EDXML parser. By default, the parser validates the input. Validation can be disabled by setting validate=False.

Parameters:validate (bool, optional) – Validate input or not
close()

Close the parser after parsing has finished. After closing, the parser instance can be reused for parsing another EDXML data file.

Returns:The EDXML parser
Return type:EDXMLParserBase
get_event_counter()

Returns the number of parsed events. This counter is incremented after the _parsed_event callback has returned.

Returns:The number of parsed events
Return type:int
get_event_type_counter(event_type_name)

Returns the number of parsed events of the specified event type. These counters are incremented after the _parsed_event callback has returned.

Parameters:event_type_name (str) – The type of parsed events
Returns:The number of parsed events
Return type:int
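
The two counter methods can be combined into a small report once parsing has finished. The following is only a sketch: the helper name count_events is hypothetical, and the parser argument is assumed to be any EDXML parser instance.

```python
def count_events(parser, event_type_names):
    """Return the total event count plus a per-type breakdown for a parser."""
    per_type = {
        name: parser.get_event_type_counter(name)
        for name in event_type_names
    }
    return {'total': parser.get_event_counter(), 'per_type': per_type}
```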
get_ontology()

Returns the ontology that was read by the parser. The ontology is updated whenever new ontology information is parsed from the input data.

Returns:The parsed ontology
Return type:edxml.ontology.Ontology
set_custom_event_class(event_class)

By default, EDXML parsers will generate ParsedEvent instances for representing event elements. When this method is used to set a custom element class, this class will be instantiated instead of ParsedEvent. This can be used to implement custom APIs on top of the EDXML events that are generated by the parser.

Note

It is strongly recommended to extend the ParsedEvent class and implement additional class methods on top of it.

Note

Implementing a custom element class that can replace the standard etree.Element class is tricky. Be sure to read the lxml documentation about custom Element classes.

Parameters:event_class (etree.ElementBase) – The custom element class
Returns:The EDXML parser
Return type:EDXMLParserBase
set_event_source_handler(source_patterns, handler)

Register a handler for specified event sources. Whenever an event is parsed that has an event source URI matching any of the specified regular expressions, the supplied handler will be called with the event (which will be a ParsedEvent instance) as its only argument.

Multiple handlers can be installed for a given event source; they will be invoked in the order of registration. Event source handlers are invoked after event type handlers.

Parameters:
  • source_patterns (List[str]) – List of regular expressions
  • handler (callable) – Handler
Returns:

The EDXML parser

Return type:

EDXMLParserBase

set_event_type_handler(event_types, handler)

Register a handler for specified event types. Whenever an event is parsed of any of the specified types, the supplied handler will be called with the event (which will be a ParsedEvent instance) as its only argument.

Multiple handlers can be installed for a given type of event; they will be invoked in the order of registration. Event type handlers are invoked before event source handlers.

Parameters:
  • event_types (List[str]) – List of event type names
  • handler (callable) – Handler
Returns:

The EDXML parser

Return type:

EDXMLParserBase
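
The invocation order described for the two handler registration methods can be illustrated with a toy model. This is not the SDK's implementation: the HandlerRouter class and its dispatch() method are hypothetical stand-ins, and matching each pattern from the start of the source URI is an assumption.

```python
import re


class HandlerRouter:
    """Toy model of the documented handler ordering; not the SDK's code."""

    def __init__(self):
        self._type_handlers = []    # (event type names, handler)
        self._source_handlers = []  # (compiled patterns, handler)

    def set_event_type_handler(self, event_types, handler):
        self._type_handlers.append((set(event_types), handler))
        return self

    def set_event_source_handler(self, source_patterns, handler):
        compiled = [re.compile(p) for p in source_patterns]
        self._source_handlers.append((compiled, handler))
        return self

    def dispatch(self, event, event_type, source_uri):
        # Type handlers run first, in registration order.
        for names, handler in self._type_handlers:
            if event_type in names:
                handler(event)
        # Source handlers run afterwards, also in registration order.
        # Assumption: patterns are matched from the start of the URI.
        for patterns, handler in self._source_handlers:
            if any(p.match(source_uri) for p in patterns):
                handler(event)
```

In this model, a source handler that was registered first still runs after every matching type handler.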

EDXMLPushParser

class edxml.EDXMLPushParser(validate=True, foreign_element_tags=None)

Bases: edxml.parser.EDXMLParserBase

An incremental push parser for EDXML data. Unlike the pull parser, this parser does not read data by itself and does not block when the data stream dries up. It needs to be actively fed with strings, allowing full control of the input process.

Optionally, a list of tags of foreign elements can be supplied. Each tag must be prefixed with its namespace in James Clark notation. Example:

['{http://some/foreign/namespace}attribute']

These elements will be passed to the _parse_foreign_element() method when encountered.

Note

This class extends EDXMLParserBase, refer to that class for more details about the EDXML parsing interface.

feed(data)

Feeds the specified string to the parser. A call to the feed() method may or may not trigger calls to callback methods, depending on the size and content of the passed string buffer.

Parameters:data (bytes) – String data
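
A typical way to drive feed() is to read the input in chunks and pass each chunk on to the parser. The sketch below assumes a binary file-like object whose read() blocks until data is available and returns an empty result at end of input. The helper name feed_from_stream and the default chunk size are hypothetical.

```python
def feed_from_stream(parser, stream, chunk_size=4096):
    """Feed a binary file-like object to a push parser, chunk by chunk."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read signals the end of the stream.
            break
        parser.feed(chunk)
    # Close the parser once all input has been fed.
    return parser.close()
```

Because the caller owns the loop, it can just as well interleave feeding with other work, which is the kind of control over the input process that push parsers are meant to offer.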

EDXMLPullParser

class edxml.EDXMLPullParser(validate=True)

Bases: edxml.parser.EDXMLParserBase

A blocking, incremental pull parser for EDXML data, for parsing EDXML data from file-like objects.

Note

This class extends EDXMLParserBase, refer to that class for more details about the EDXML parsing interface.

Create a new EDXML parser. By default, the parser validates the input. Validation can be disabled by setting validate=False.

Parameters:validate (bool, optional) – Validate input or not
parse(input_file, foreign_element_tags=())

Parses the specified file. The file can be any file-like object, or the name of a file that should be opened and parsed. The parser will generate calls to the various callback methods in the base class, allowing the parsed data to be processed.

Optionally, a list of tags of foreign elements can be supplied. Each tag must be prefixed with its namespace in James Clark notation. Example:

['{http://some/foreign/namespace}tag']

These elements will be passed to the _parse_foreign_element() method when encountered.

Notes

Passing a file name rather than a file-like object is preferred and may result in a small performance gain.

Parameters:
  • input_file (Union[io.TextIOBase, file, str]) –
  • foreign_element_tags (List[str]) –
Returns:

edxml.EDXMLPullParser

EDXMLOntologyPushParser

class edxml.EDXMLOntologyPushParser(validate=True, foreign_element_tags=None)

Bases: edxml.parser.EDXMLPushParser

A variant of the incremental push parser which ignores the events, parsing only the ontology information.

EDXMLOntologyPullParser

class edxml.EDXMLOntologyPullParser(validate=True)

Bases: edxml.parser.EDXMLPullParser

A variant of the incremental pull parser which ignores the events, parsing only the ontology information.

Create a new EDXML parser. By default, the parser validates the input. Validation can be disabled by setting validate=False.

Parameters:validate (bool, optional) – Validate input or not

EventCollection

class edxml.EventCollection(events=(), ontology=None)

Bases: list, typing.Generic

Class representing a collection of EDXML events. It is an extension of the list type and can be used like any other list.

Creates a new event collection, optionally initializing it with events and an ontology.

Parameters:
  • events (Iterable[edxml.event.EDXMLEvent]) – Initial event collection
  • ontology (edxml.ontology.Ontology) – Corresponding ontology
extend(iterable)

Extend list by appending elements from the iterable.

create_dict_by_hash()

Creates a dictionary mapping sticky hashes to event collections containing the events that have that hash. The hashes are represented as hexadecimal strings.

Returns:Dict[str, EventCollection]
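
As a sketch of how the returned dictionary might be used, the helper below reports which sticky hashes are shared by more than one event instance. The function name find_duplicate_hashes is hypothetical; it relies only on create_dict_by_hash() returning a mapping from hash strings to event collections.

```python
def find_duplicate_hashes(collection):
    """Map each sticky hash that occurs more than once to its event count."""
    return {
        sticky_hash: len(events)
        for sticky_hash, events in collection.create_dict_by_hash().items()
        if len(events) > 1
    }
```

An empty result suggests that the collection contains only one instance of each logical event.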
is_equivalent_of(other)

Compares the collection with another specified collection. It returns True in case the two collections are equivalent, i.e. there are no semantic differences. For example, when one collection contains two instances of the same logical event while the other contains the result of merging those two events, there is no difference. The ordering of events, and of properties within an event, is also irrelevant and does not result in any differences.

Parameters:other (EventCollection) – Another event collection
Returns:bool
set_ontology(ontology)

Associates the events in the collection with the specified EDXML ontology.

Parameters:ontology (edxml.ontology.Ontology) –
Returns:edxml.EventCollection
update_ontology(ontology)

Updates the ontology that is associated with the events in the collection using the given ontology.

Parameters:ontology (edxml.ontology.Ontology) –
resolve_collisions()

Returns a new EventCollection that contains only a single instance of each logical event in this collection. All input event instances that share a sticky hash are merged into a single output event.

Returns:edxml.EventCollection
classmethod from_edxml(edxml_data, foreign_element_tags=())

Parses EDXML data and returns a new EventCollection containing the events and ontology information from the EDXML data.

Foreign elements are ignored by default. Optionally, tags of foreign elements can be specified, allowing the parser to process them. Each tag must be prefixed with its namespace in James Clark notation. Example:

['{http://some/foreign/namespace}tag']

Parameters:
  • edxml_data (bytes) – The EDXML data
  • foreign_element_tags (Tuple[str]) – Foreign element tags
Returns:

Return type:

EventCollection

to_edxml(pretty_print=True)

Returns a byte string containing the EDXML representation of the events in the collection.

Parameters:pretty_print (bool) – Whether to pretty-print the output
Returns:
Return type:bytes
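
Together with from_edxml() and is_equivalent_of(), this method allows a simple round-trip check. The sketch below assumes that the collection has an ontology associated with it, so that it can be serialized. The helper name survives_round_trip is hypothetical.

```python
def survives_round_trip(collection):
    """Serialize a collection to EDXML, parse it back and compare the two."""
    edxml_data = collection.to_edxml()
    # Use the collection's own class, so that subclasses are restored as such.
    restored = type(collection).from_edxml(edxml_data)
    return collection.is_equivalent_of(restored)
```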
filter_type(event_type_name)

Returns a new event collection containing the subset of events of the specified event type.

Parameters:event_type_name (str) –
Returns:
Return type:EventCollection
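
As a small illustration, filter_type() can be applied once per event type to split a collection into one sub-collection per type. The helper name split_by_type is hypothetical, and the caller is assumed to supply the event type names.

```python
def split_by_type(collection, event_type_names):
    """Return a dictionary mapping each event type name to its sub-collection."""
    return {
        name: collection.filter_type(name)
        for name in event_type_names
    }
```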