Data Transcoding

EDXML data is most commonly generated from some type of input data. The input data is used to generate output events. The EDXML SDK features the concept of a record transcoder, which is a class that contains all required information and logic for transcoding a chunk of input data into an output event. The SDK can use record transcoders to generate events and event type definitions for you. It also facilitates unit testing of record transcoders.

In case the input data transforms into events of more than one event type, the transcoding process can be done by multiple record transcoders. This allows splitting the problem of transcoding the input into multiple parts. A transcoder mediator can be used to automatically route chunks of input data to the correct transcoder.

A record transcoder is an extension of the EventTypeFactory class. Because of this, record transcoders use class constants to describe event types. These class constants will be used to populate the output ontology while transcoding.

All record transcoders feature a TYPE_MAP constant which maps record selectors to event types. Record selectors identify chunks of input data that should transcode into a specific type of output event. What these selectors look like depends on the type of input data.

Object Transcoding

The most broadly usable record transcoder is the ObjectTranscoder class. Have a look at a quick example:

from edxml.transcode.object import ObjectTranscoder
from edxml_bricks.computing.generic import ComputingBrick


class UserTranscoder(ObjectTranscoder):
    TYPES = ['com.acme.staff.account']
    TYPE_MAP = {'user': 'com.acme.staff.account'}
    PROPERTY_MAP = {
        'com.acme.staff.account': {
            'name': 'user.name'
        }
    }
    TYPE_PROPERTIES = {
        'com.acme.staff.account': {
            'user.name': ComputingBrick.OBJECT_USER_NAME
        }
    }

This minimal example shows a basic transcoder for input data records representing a user account. It shows the general structure of a transcoder, how to map input record types to output event types and map input record fields to output event properties.

The record selector shown in the TYPE_MAP constant is simply the name of the input record type: user. We will show how to label input records and how input record types are used to route input records to record transcoders later, when we discuss transcoder mediators.

The PROPERTY_MAP constant maps input record fields to event type properties. Just one field-to-property mapping is shown here; extending the example is straightforward.

The TYPE_PROPERTIES constant specifies the properties for each event type as well as the object type of each property. The object type that we refer to here using the ComputingBrick.OBJECT_USER_NAME constant is defined using an ontology brick from the public collection of ontology bricks.

Transcoding Steps

When the generate() method of the transcoder is called, the following happens:

  1. It will first generate an ontology using the TYPES constant to determine which event types to define and using the PROPERTY_MAP constant to look up which properties each event type should have. From the TYPE_PROPERTIES constant it determines the object types for each event property. In this example we have just one output event type which has a single property.
  2. The transcoder will check the TYPE_MAP constant and see that an input record of type user should yield an output event of type com.acme.staff.account.
  3. It checks the PROPERTY_MAP constant to see which record fields it should read and which property its values should be stored in. In this example, the name field goes into the user.name property.
  4. The transcoder reads the name field from the input record and uses it to populate the user.name property.
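The lookups in steps 2 to 4 can be pictured with plain dictionaries. The sketch below is not the SDK's implementation: the transcode() helper and the dictionary-shaped output event are hypothetical, it only handles dictionary records, and it merely illustrates how TYPE_MAP and PROPERTY_MAP are consulted.

```python
TYPE_MAP = {'user': 'com.acme.staff.account'}
PROPERTY_MAP = {
    'com.acme.staff.account': {
        'name': 'user.name'
    }
}


def transcode(record_type, record):
    """Hypothetical helper mimicking steps 2 to 4 for dictionary records."""
    # Step 2: the record type selects the output event type.
    event_type = TYPE_MAP[record_type]
    properties = {}
    # Steps 3 and 4: copy each mapped field into its event property.
    for field, property_name in PROPERTY_MAP[event_type].items():
        if field in record:
            properties.setdefault(property_name, set()).add(record[field])
    return {'type': event_type, 'properties': properties}


event = transcode('user', {'name': 'Alice'})
print(event)
```

Object values are collected in a set here because an event property can hold more than one object value.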

Input Record Fields

We referred to name as a “field” because we did not specify what name refers to. Each input record might be a Python object and a field might actually be an attribute of the input record. Or the record might be a dictionary and the field a key in that dictionary. Both scenarios are supported. The transcoder will first try to treat the record as a dictionary and use its get() method to read the name item. When the record is not a dictionary, as in the mediator example further below, this read will fail. The transcoder will then check whether the record has an attribute named name by attempting to read it using the getattr() function. When that succeeds, the output event property is populated.
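The fallback from dictionary access to attribute access might look roughly like the sketch below. The read_field() helper is hypothetical and only mirrors the order of the two attempts described above; the actual SDK code may differ.

```python
def read_field(record, field):
    """Hypothetical helper: dictionary lookup first, attribute lookup second."""
    try:
        return record.get(field)
    except AttributeError:
        # The record has no get() method, so it is not a dictionary.
        return getattr(record, field, None)


class Record:
    name = 'Alice'


print(read_field({'name': 'Bob'}, 'name'))
print(read_field(Record, 'name'))
```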

Object Types and Concepts

Record transcoders usually define only event types, not the object types and concepts that these event types refer to. The reason is that these ontology elements are rarely specific to one particular transcoder. Object types and concepts are typically used by multiple transcoders and multiple data sources. In fact, object types and concepts are the very thing that enable machines to correlate information from multiple EDXML documents and forge it into a single consistent data set.

For that reason object types and concepts are usually defined by means of ontology bricks rather than a transcoder.

Selector Syntax

As we mentioned before, the name key in the PROPERTY_MAP constant is not just a name. It is a selector. As such, it can point to more than just dictionary entries or attributes in input records. Selectors support a dotted syntax to address values within values. For example, foo.bar can be used to access an item named bar inside a dictionary named foo. And if bar happens to be a list, you can address the first entry in that list by using foo.bar.0 as selector.
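To make the dotted syntax concrete, here is a minimal sketch of how such a selector could be resolved against a nested record. The resolve() function is hypothetical and written for illustration only; it is not taken from the SDK.

```python
def resolve(record, selector):
    """Hypothetical resolver for dotted selectors such as 'foo.bar.0'."""
    value = record
    for part in selector.split('.'):
        if isinstance(value, dict):
            # Dictionary: the selector part is a key.
            value = value[part]
        elif isinstance(value, (list, tuple)):
            # List: the selector part is a position.
            value = value[int(part)]
        else:
            # Anything else: the selector part is an attribute name.
            value = getattr(value, part)
    return value


record = {'foo': {'bar': ['first', 'second']}}
print(resolve(record, 'foo.bar'))
print(resolve(record, 'foo.bar.0'))
```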

Using a Mediator

Now we will extend the example to include a transcoder mediator:

import sys

from edxml.transcode.object import ObjectTranscoder, ObjectTranscoderMediator
from edxml_bricks.computing.generic import ComputingBrick


class UserTranscoder(ObjectTranscoder):
    TYPES = ['com.acme.staff.account']
    TYPE_MAP = {'user': 'com.acme.staff.account'}
    PROPERTY_MAP = {
        'com.acme.staff.account': {
            'name': 'user.name'
        }
    }
    TYPE_PROPERTIES = {
        'com.acme.staff.account': {
            'user.name': ComputingBrick.OBJECT_USER_NAME
        }
    }


class MyMediator(ObjectTranscoderMediator):
    TYPE_FIELD = 'type'


class Record:
    type = 'user'
    name = 'Alice'


with MyMediator(output=sys.stdout.buffer) as mediator:
    # Register the transcoder
    mediator.register('user', UserTranscoder())
    # Define an EDXML event source
    mediator.add_event_source('/acme/offices/amsterdam/')
    # Set the source as current source for all output events
    mediator.set_event_source('/acme/offices/amsterdam/')
    # Process the input record
    mediator.process(Record)

Now we see that the TYPE_FIELD constant of the mediator is used to set the name of the field in the input records that contains the record type. The example uses a single type of input record named user. The record transcoder is registered with the mediator using the same record type name. When the record is fed to the mediator using the process() method the mediator will read the record type and use the associated transcoder to generate an output event. In this case the output EDXML stream will be written to standard output. Note that the transcoder produces binary data, so we write the output to sys.stdout.buffer rather than sys.stdout.
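The routing performed by the mediator can be pictured as a lookup in a registry keyed by record type. The Mediator class below is a hypothetical stand-in that sketches only this routing; transcoders are represented by plain callables and no EDXML output is produced.

```python
class Mediator:
    """Hypothetical stand-in for a mediator; only the routing is sketched."""
    TYPE_FIELD = 'type'

    def __init__(self):
        self._transcoders = {}

    def register(self, record_type, transcoder):
        self._transcoders[record_type] = transcoder

    def process(self, record):
        # Read the record type from the field named by TYPE_FIELD.
        record_type = getattr(record, self.TYPE_FIELD)
        # Hand the record to the transcoder registered for that type.
        return self._transcoders[record_type](record)


class Record:
    type = 'user'
    name = 'Alice'


mediator = Mediator()
mediator.register('user', lambda record: {'user.name': record.name})
print(mediator.process(Record))
```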

XML Transcoding

When you need to transcode XML input you can use the XmlTranscoder class. It is highly similar to the ObjectTranscoder class. The main difference is in how input records and fields are identified. This is done using XPath expressions. The example below illustrates this:

import sys
from io import BytesIO

from edxml.transcode.xml import XmlTranscoder, XmlTranscoderMediator
from edxml_bricks.computing.generic import ComputingBrick

# Define an input document
xml = bytes(
    '<records>'
    '  <users>'
    '    <user>'
    '      <name>Alice</name>'
    '    </user>'
    '  </users>'
    '</records>', encoding='utf-8'
)


# Define a transcoder for user records
class UserTranscoder(XmlTranscoder):
    TYPES = ['com.acme.staff.account']
    TYPE_MAP = {'.': 'com.acme.staff.account'}
    PROPERTY_MAP = {
        'com.acme.staff.account': {
            'name': 'user.name'
        }
    }
    TYPE_PROPERTIES = {
        'com.acme.staff.account': {
            'user.name': ComputingBrick.OBJECT_USER_NAME
        }
    }


# Transcode the input document
with XmlTranscoderMediator(output=sys.stdout.buffer) as mediator:
    # Register transcoder
    mediator.register('/records/users/user', UserTranscoder())
    # Define an EDXML event source
    mediator.add_event_source('/acme/offices/amsterdam/')
    # Set the source as current source for all output events
    mediator.set_event_source('/acme/offices/amsterdam/')
    # Parse the XML data
    mediator.parse(BytesIO(xml))

XPath expressions are used in various places in the above example. Firstly, the transcoder is registered to transcode all XML elements that are found using XPath expression /records/users/user. These will be treated as input records for the transcoding process. Note that transcoders should be registered using absolute XPath expressions. Second, the TYPE_MAP constant indicates that all of the matching elements should be used for outputting an event of type com.acme.staff.account. This is achieved by using an XPath expression relative to the root of the input element. In our example the entire input element becomes the output event, so we use . as the XPath expression. In case the transcoder would produce various types of output events from sub-elements, then the XPath expressions in TYPE_MAP would need to select various sub-elements. Finally, the PROPERTY_MAP constant indicates that the event property named user.name should be populated by applying XPath expression name to the input record. This ultimately results in a single output event containing object value Alice.
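The way these expressions select records and fields can be tried out in isolation. The sketch below uses xml.etree.ElementTree from the Python standard library, which supports only a subset of XPath and is an assumption made for illustration; it says nothing about which XML library the SDK uses internally.

```python
import xml.etree.ElementTree as ET

xml = (
    '<records>'
    '  <users>'
    '    <user>'
    '      <name>Alice</name>'
    '    </user>'
    '  </users>'
    '</records>'
)

root = ET.fromstring(xml)

# ElementTree paths are relative to the element they are applied to,
# so './users/user' stands in for the absolute '/records/users/user'.
records = root.findall('./users/user')

# Apply the 'name' expression relative to each record element.
names = [record.findtext('name') for record in records]
print(names)
```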

Efficiency Considerations

The XML input is parsed in an incremental fashion. An in-memory tree is built while parsing. For reasons of efficiency, the mediator will delete XML elements after transcoding them, thus preventing the in-memory XML tree from growing while parsing. However, XML elements that have no matching transcoder are not deleted. The reason is that XML transcoders may need to aggregate information from multiple XML elements, while being registered on just one of those elements. So, the in-memory XML tree may still grow when parts of the document have no transcoder associated with them.

The solution is to use the NullTranscoder. Registering this transcoder with an absolute XPath selector will allow the mediator to delete the matching XML elements and keep the in-memory tree nice and clean.
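The general technique of discarding elements during incremental parsing can be illustrated with the standard library. This is a sketch of the idea only, using xml.etree.ElementTree as an assumption; it does not show the mediator's implementation or the NullTranscoder itself.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

xml = (
    b'<records><users>'
    b'<user><name>Alice</name></user>'
    b'<user><name>Bob</name></user>'
    b'</users></records>'
)

names = []
for _, element in ET.iterparse(BytesIO(xml), events=('end',)):
    if element.tag == 'user':
        # "Transcode" the record, then discard its contents to free memory.
        names.append(element.findtext('name'))
        element.clear()

print(names)
```

Elements that are never cleared stay in the tree, which is the growth that the text above describes for elements without a matching transcoder.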

Advanced Subjects

Defining Event Sources

The transcoder mediator examples showed how to add and select an event source for the output events. This will suffice for cases where the transcoder presents its output as a single EDXML event source. Transcoders may also define and use multiple EDXML event sources. This can be done by defining multiple sources and switching sources while feeding input records. For advanced use cases there are other methods that may suit you better.

Defining event sources can also be done by overriding the generate_ontology() method.

Assigning an event source to individual output events can be done by overriding the post_process() method. This is described in more detail in the section on customizing output events.

Outputting Multiple Events

By default, the transcoding process produces a single EDXML output event for each input data record. When input records contain a lot of information it may make sense to transcode a single input record into multiple output events. This can be achieved by overriding the post_process() function. This is explained in more detail in the next subsection.

Customizing Output Events

The events that are generated by the transcoder may be all you need for simple use cases. Sometimes the events may require additional processing. For example, you might need to adjust the values for some property. As an example, let us assume that the object values for some property need to be converted to lowercase. This can be done by defining the following generator:

def lowercase_user_name(name):
    yield name.lower()

Then, this generator can be used in the TYPE_PROPERTY_POST_PROCESSORS constant:

TYPE_PROPERTY_POST_PROCESSORS = {
    'com.acme.staff.account': {
        'user.name': lowercase_user_name,
    }
}

Note that the postprocessor can also be used to generate multiple object values from a single input record value.
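As an illustration of a postprocessor that yields several object values, consider a hypothetical input field that holds a comma separated list of names. The split_user_names() generator below is an assumption made for this sketch; it would be referenced from TYPE_PROPERTY_POST_PROCESSORS in the same way as lowercase_user_name.

```python
def split_user_names(names):
    # Yield one lowercase object value per comma separated name.
    for name in names.split(','):
        name = name.strip().lower()
        if name:
            yield name


print(list(split_user_names('Alice, Bob')))
```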

If you need full access to the generated events to adjust them to your liking, then you can override the post_process() function. This function is actually a generator taking a single EDXML event as input and generating zero or more output events. The input record from which the event was constructed is provided as a function argument as well.

As an example, you might want to add the original JSON input record as an event attachment. An implementation of this could look like the following:

import json

def post_process(self, event, input_record):
    # Attach the JSON serialized input record to the output event
    event.attachments['input-record']['input-record'] = json.dumps(input_record)
    yield event

Customizing the Ontology

Not every aspect of the output ontology can be specified by means of the record transcoder class constants. Defining property relations is an example of this. Property relations are much better expressed in a procedural fashion. This can be done by overriding the create_event_type() class method. This is demonstrated in the following example:

@classmethod
def create_event_type(cls, event_type_name, ontology):

    user = super().create_event_type(event_type_name, ontology)

    user['name'].relate_intra('can be reached at', 'phone').because(
        "the LDAP directory entry of [[name]] mentions [[phone]]"
    )

    return user

Unit Testing

Record transcoders can be tested using a transcoder test harness. This is a special kind of transcoder mediator. There is the TranscoderTestHarness base class and the ObjectTranscoderTestHarness and XmlTranscoderTestHarness extensions. Feeding input records into these mediators will have the test harness use your transcoder to generate an EDXML document containing the output ontology and events, validating the output in the process. The EDXML document will then be parsed back into Python objects. The data is validated again in the process. Finally, any colliding events will be merged and the final event collection will be validated a third time. This assures that the output of the transcoder can be serialized into EDXML, deserialized and merged correctly.

After feeding the input records, the parsed ontology and events are exposed by means of the events attribute of the test harness. This attribute is an EventCollection which you can use to write assertions about the resulting ontology and events. So, provided you feed the test harness with a good set of test records, this results in unit tests that cover everything: the transcoding process itself, ontology generation, validity of the output events and event merging logic.

A full example of the use of a test harness is shown below:

import pytest

from edxml.transcode.object import ObjectTranscoderTestHarness, ObjectTranscoder
from edxml_bricks.computing.generic import ComputingBrick


class TestObjectTranscoder(ObjectTranscoder):
    __test__ = False

    TYPES = ['com.acme.staff.account']
    TYPE_MAP = {'user': 'com.acme.staff.account'}
    PROPERTY_MAP = {
        'com.acme.staff.account': {
            'name': 'user.name'
        }
    }
    TYPE_PROPERTIES = {
        'com.acme.staff.account': {
            'user.name': ComputingBrick.OBJECT_USER_NAME
        }
    }


@pytest.fixture()
def fixture_object():
    return {'type': 'user', 'name': 'Alice'}


def test(fixture_object):
    with ObjectTranscoderTestHarness(TestObjectTranscoder(), record_selector='type') as harness:
        harness.process_object(fixture_object)

    assert harness.events[0]['user.name'] == {'Alice'}

Automatic Data Normalization and Cleaning

By default a transcoder will just copy values from the input records into the properties of the output events. Often this will not yield the desired result. A common case is date / time values. There are many different formats for representing time, and EDXML accepts just one specific format. Even greater challenges arise when the types of values contained in a single input record field may vary from one input record to another. Or when an input record field may occasionally contain downright nonsensical gibberish.

Transcoders feature various means of dealing with these challenges. Input data can be automatically normalized and cleaned. The default transcoder behavior is to raise an error when it produces an event containing an invalid object value. Instead of adding code to properly normalize event object values in your transcoders, you can also have the SDK do the normalization for you. To do so, use the TYPE_AUTO_REPAIR_NORMALIZE constant to opt into automatic normalization for a particular event property.

Automatic normalization also supports events containing non-string values. For example, placing a float in a property that uses an EDXML data type from the decimal family can automatically normalize that float into the proper decimal string representation that fits the data type. Some of the supported Python types are float, bool, datetime, Decimal and IP (from IPy).

In some cases, input data may contain values that cannot be normalized automatically. Using the TYPE_AUTO_REPAIR_DROP constant it is possible to opt into dropping these values from the output event. A common case is an input record that contains a field that may hold both IPv4 and IPv6 internet addresses. In EDXML these must be represented using different data types and separate event properties. This can be done by having the transcoder store the value in both properties. This means that one of the two properties will always contain an invalid value. By allowing these invalid values to be dropped, the output events will always have the values in the correct event property.
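The effect of storing a value in both properties and dropping the invalid one can be sketched with the ipaddress module from the standard library. The property names ip.v4 and ip.v6 and both helper functions are hypothetical; the SDK performs its own validation against the EDXML data types.

```python
import ipaddress


def is_valid(property_name, value):
    """Hypothetical validator: each property accepts one IP version."""
    try:
        address = ipaddress.ip_address(value)
    except ValueError:
        return False
    return address.version == {'ip.v4': 4, 'ip.v6': 6}[property_name]


def drop_invalid(properties):
    # Keep only the object values that are valid for their property.
    return {
        name: {value for value in values if is_valid(name, value)}
        for name, values in properties.items()
    }


# The same input value is stored in both properties.
event = {'ip.v4': {'2001:db8::1'}, 'ip.v6': {'2001:db8::1'}}
print(drop_invalid(event))
```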

Note that there can be quite a performance penalty for enabling automatic normalization and cleaning. In many cases, this will not matter much because the transcoder is optimistic. As long as the output events are valid, no normalization or cleaning is done and there is no performance hit. Only when an output event fails to validate is the expensive event repair code run.
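The optimistic strategy amounts to validating first and repairing only on failure. The sketch below illustrates that control flow with hypothetical validate() and write() functions. The ISO 8601 check is a stand-in chosen for this example and is not the EDXML validation logic.

```python
from datetime import datetime

repairs = 0


def validate(value):
    """Hypothetical check: accept ISO 8601 strings only."""
    if not isinstance(value, str):
        raise ValueError('not a string')
    datetime.fromisoformat(value)


def write(value):
    global repairs
    try:
        # Fast path: valid values pass through untouched.
        validate(value)
        return value
    except ValueError:
        # Slow path: only now pay for normalization.
        if not isinstance(value, datetime):
            raise
        repairs += 1
        return value.isoformat()


outputs = [write('2020-01-01T12:00:00'), write(datetime(2020, 1, 1, 12))]
print(outputs, repairs)
```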

In case performance turns out to be an issue, you can always optimize your transcoder by normalizing event objects yourself. You might find the TYPE_PROPERTY_POST_PROCESSORS constant helpful. Alternatively, you can override the post_process() function to modify the autogenerated events as needed.

In case you want to retain the original record values as they were before normalization and cleaning, there are two options for doing so. First, the original value could be stored in another property and an ‘original’ relation could be used to relate the original value to the normalized one. Second, (part of) the original input record could be stored as an event attachment.

Description & Visualization

The TranscoderMediator class contains two methods that allow EDXML transcoders to generate descriptions and visualizations of their output ontology. Both can be great aids for reviewing the ontology information in your record transcoders. The first of these methods is describe_transcoder(). The other is generate_graphviz_concept_relations(). Refer to the documentation of both methods for details.

The image displayed below shows an example of the output of generate_graphviz_concept_relations() for an advanced transcoder that employs multiple record transcoders:

[Figure: example concept relations graph (_images/transcoder-concepts-graph.png)]

The image shows the various reasoning pathways for concept mining provided by the full ontology of all record transcoders combined. It tells the transcoder developer how machines can correlate information to mine knowledge from the data.

API Documentation

Below, the documentation of the various transcoding classes is given.

Base Classes

Transcoders & Mediators

Test Harnesses

RecordTranscoder

class edxml.transcode.RecordTranscoder

Bases: edxml.ontology.event_type_factory.EventTypeFactory

This is a base class that can be extended to implement record transcoders for the various input data record types that are processed by a particular TranscoderMediator implementation. The class extends the EventTypeFactory class, which is used to generate the event types for the events that will be produced by the record transcoder.

TYPE_MAP = {}

The TYPE_MAP attribute is a dictionary mapping input record type selectors to the corresponding EDXML event type names. This mapping is used by the transcoding mediator to find the correct record transcoder for each input data record.

Note

When no EDXML event type name is specified for a particular input record type selector, it is up to the record transcoder to set the event type on the events that it generates.

Note

The fallback record transcoder must set the None key to the name of the EDXML fallback event type.

PROPERTY_MAP = {}

The PROPERTY_MAP attribute is a dictionary mapping event type names to the property value selectors for finding property objects in input records. Each value in the dictionary is another dictionary that maps value selectors to property names. The exact nature of the value selectors differs between record transcoder implementations.

TYPE_PROPERTY_POST_PROCESSORS = {}

The TYPE_PROPERTY_POST_PROCESSORS attribute is a dictionary mapping EDXML event type names to property processors. The property processors are a dictionary mapping property names to processors. A processor is a function that accepts a value from the input field that corresponds with the property and returns an iterable yielding zero or more values which will be stored in the output event. The processors will be applied to input record values before using them to create output events.

Example:

{
  'event-type-name': {
    'property-a': lambda x: [x.lower()]
  }
}

TYPE_AUTO_REPAIR_NORMALIZE = {}

The TYPE_AUTO_REPAIR_NORMALIZE attribute is a dictionary mapping EDXML event type names to properties which should be repaired automatically by normalizing their object values. This means that the transcoder is not required to store valid EDXML string representations in its output events. Rather, it may store any type of value which can be normalized into valid string representations automatically. Please refer to the normalize_objects() method for a list of supported value types. The names of properties for which values may be normalized are specified as a list. Example:

{'event-type-name': ['some-property']}

TYPE_AUTO_REPAIR_DROP = {}

The TYPE_AUTO_REPAIR_DROP attribute is a dictionary mapping EDXML event type names to properties which should be repaired automatically by dropping invalid object values. This means that the transcoder is permitted to store object values which cause the output event to be invalid. The EDXML writer will attempt to repair invalid output events. First, it will try to normalize object values when configured to do so. As a last resort, it can try to drop any offending object values. The names of properties for which values may be dropped are specified as a list. Example:

{'event-type-name': ['some-property']}

generate(record, record_selector, **kwargs)

Generates one or more EDXML events from the given input record

Parameters:
  • record – Input data record
  • record_selector (str) – The selector matching the input record
  • **kwargs – Arbitrary keyword arguments
Yields:

edxml.EDXMLEvent

post_process(event, input_record)

Generates zero or more EDXML output events from the given EDXML input event. If this method is overridden by an extension of the RecordTranscoder class, all events generated by the generate() method are passed through this method for post processing. This allows the generated events to be modified or omitted. Or, multiple derivative events can be created from a single input event.

The input record that was used to generate the input event is also passed as a parameter. Post processors can use this to extract additional information and add it to the input event.

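Parameters:
  • event (edxml.EDXMLEvent) – The EDXML event generated by generate()
  • input_record – The input record that the event was generated from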
Yields:

edxml.EDXMLEvent

TranscoderMediator

class edxml.transcode.TranscoderMediator(output=None)

Bases: object

Base class for implementing mediators between a non-EDXML input data source and a set of RecordTranscoder implementations that can transcode the input data records into EDXML events.

Sources can instantiate the mediator and feed it input data records, while record transcoders can register themselves with the mediator in order to transcode the types of input record that they support.

The class is a Python context manager which will automatically flush the output buffer when the mediator goes out of scope.

Create a new transcoder mediator which will output streaming EDXML data using specified output. The output parameter is a file-like object that will be used to send the XML data to. Note that the XML data is binary data, not string data. When the output parameter is omitted, the generated XML data will be returned by the methods that generate output.

Parameters:output (file, optional) – a file-like object
register(record_selector, record_transcoder)

Register a record transcoder for processing records identified by the specified record selector. The exact nature of the record selector depends on the mediator implementation.

The same record transcoder can be registered for multiple record selectors.

Note

Any record transcoder that registers itself as a transcoder using None as selector is used as the fallback record transcoder. The fallback record transcoder is used to transcode any record for which no transcoder has been registered.

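Parameters:
  • record_selector – Selector identifying the records to transcode
  • record_transcoder (RecordTranscoder) – The record transcoder to register
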
debug(warn_no_transcoder=True, warn_fallback=True, log_repaired_events=True)

Enable debugging mode, which prints informative messages about transcoding issues, disables event buffering and stops on errors.

Using the keyword arguments, specific debug features can be disabled. When warn_no_transcoder is set to False, no warnings will be generated when no matching record transcoder can be found. When warn_fallback is set to False, no warnings will be generated when an input record is routed to the fallback transcoder. When log_repaired_events is set to False, no message will be generated when an invalid event was repaired.

Parameters:
  • warn_no_transcoder (bool) – Warn when no record transcoder found
  • warn_fallback (bool) – Warn when using fallback transcoder
  • log_repaired_events (bool) – Log events that were repaired
Returns:

Return type:

TranscoderMediator

disable_event_validation()

Instructs the EDXML writer not to validate its output. This may be used to boost performance in case you know that the data will be validated at the receiving end, or in case you know that your generator is perfect. :)

Returns:
Return type:TranscoderMediator
enable_auto_repair_drop(event_type_name, property_names)

Allows dropping invalid object values from the specified event properties while repairing invalid events. This will only be done as a last resort when normalizing object values failed or is disabled.

Note

Dropping object values may still lead to invalid events.

Parameters:
  • event_type_name (str) –
  • property_names (List[str]) –
Returns:

Return type:

TranscoderMediator

ignore_invalid_events(warn=False)

Instructs the EDXML writer to ignore invalid events. After calling this method, any event that fails to validate will be dropped. If warn is set to True, a detailed warning will be printed, allowing the source and cause of the problem to be determined.

Note

If automatic event repair is enabled the writer will attempt to repair any invalid events before dropping them.

Note

This has no effect when event validation is disabled.

Parameters:warn (bool) – Log warnings or not
Returns:
Return type:TranscoderMediator
ignore_post_processing_exceptions(warn=True)

Instructs the mediator to ignore exceptions raised by the post_process() methods of record transcoders. After calling this method, any input record that fails to transcode due to post processing errors will be ignored and a warning is logged. If warn is set to False these warnings are suppressed.

Parameters:warn (bool) – Log warnings or not
Returns:
Return type:TranscoderMediator
enable_auto_repair_normalize(event_type_name, property_names)

Enables automatic repair of the property values of events of specified type. Whenever an invalid event is generated by the mediator it will try to repair the event by normalizing object values of specified properties.

Parameters:
  • event_type_name (str) –
  • property_names (List[str]) –
Returns:

TranscoderMediator

add_event_source(source_uri)

Adds an EDXML event source definition. If no event sources are added, a bogus source will be generated.

Warning

The source URI is used to compute sticky hashes. Therefore, adjusting the source URIs of events after generating them changes their hashes.

The mediator will not output the EDXML ontology until it receives its first event through the process() method. This means that the caller can generate an event source ‘just in time’ by inspecting the input record and use this method to create the appropriate source definition.

Returns the created EventSource instance, to allow it to be customized.

Parameters:source_uri (str) – An Event Source URI
Returns:
Return type:EventSource
set_event_source(source_uri)

Set a fixed event source for the output events. This source will automatically be set on every output event.

Parameters:source_uri (str) – The event source URI
Returns:
Return type:TranscoderMediator
generate_graphviz_concept_relations()

Returns a graph that shows possible concept mining reasoning paths.

Returns:graphviz.Digraph
describe_transcoder(input_description)

Returns a reStructuredText description for a transcoder that uses this mediator. This is done by combining the ontologies of all registered record transcoders and describing what the resulting data would entail.

Parameters:input_description (str) – Short description of the input data
Returns:str
process(record)

Processes a single input record, invoking the correct transcoder to generate an EDXML event and writing the event into the output.

If no output was specified while instantiating this class, any generated XML data will be returned as bytes.

Parameters:record – Input data record
Returns:Generated output XML data
Return type:bytes
close(write_ontology_update=True)

Finalizes the transcoding process by flushing the output buffer. When the mediator is not used as a context manager, this method must be called explicitly to properly close the mediator.

By default the current ontology will be written to the output if needed. This can be prevented by using the method parameter.

If no output was specified while instantiating this class, any generated XML data will be returned as bytes.

Parameters:write_ontology_update (bool) – Output ontology yes/no
Returns:Generated output XML data
Return type:bytes

TranscoderTestHarness

class edxml.transcode.TranscoderTestHarness(transcoder, record_selector, base_ontology=None, register=True)

Bases: edxml.transcode.mediator.TranscoderMediator

This class is a substitute for the transcoding mediators which can be used to test record transcoders. It provides the means to feed input records to record transcoders and make assertions about the output events.

After processing is completed, either by closing the context or by explicitly calling close(), any colliding events are merged. This means that unit tests will also test the merging logic of the events.

Creates a new test harness for testing specified record transcoder. Optionally a base ontology can be provided. When provided, the harness will try to upgrade the base ontology to the ontology generated by the record transcoder, raising an exception in case of backward incompatibilities.

By default the record transcoder will be automatically registered at the specified selector. In case you wish to do the record registration on your own you must set the register parameter to False.

Parameters:
events = None

The resulting event collection

process(record, selector=None)

Processes a single record, invoking the correct record transcoder to generate an EDXML event and adding the event to the event set.

The event is also written into an EDXML writer and parsed back to an event object. This means that all validation that would be applied to the event when using the real transcoder mediator has been applied.

Parameters:
  • record – The input record
  • selector – The selector that matched the record
close(write_ontology_update=True)

Finalizes the transcoding process by flushing the output buffer. When the mediator is not used as a context manager, this method must be called explicitly to properly close the mediator.

By default the current ontology will be written to the output if needed. This can be prevented by setting the write_ontology_update parameter to False.

If no output was specified while instantiating this class, any generated XML data will be returned as bytes.

Parameters: write_ontology_update (bool) – Output ontology yes/no
Returns: Generated output XML data
Return type: bytes

ObjectTranscoder

class edxml.transcode.object.ObjectTranscoder

Bases: edxml.transcode.transcoder.RecordTranscoder

PROPERTY_MAP = {}

The PROPERTY_MAP attribute is a dictionary mapping event type names to their associated property mappings. Each property mapping is itself a dictionary mapping input record attribute names to EDXML event properties. The map is used to automatically populate the properties of the output events produced by the generate() method of the ObjectTranscoder class. The attribute names may contain dots, indicating a subfield or positions within a list, like so:

{'event-type-name': {'fieldname.0.subfieldname': 'property-name'}}

Mapping field values to multiple event properties is also possible:

{'event-type-name': {'fieldname.0.subfieldname': ['property', 'another-property']}}

Note that the event structure will not be validated until the event is yielded by the generate() method. This creates the possibility to add nonexistent properties to the attribute map and remove them in the generate() method, which may be convenient for composing properties from multiple input record attributes, or for splitting the auto-generated event into multiple output events.
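The dotted notation used in the PROPERTY_MAP keys can be illustrated with a small, self-contained sketch. The resolve_dotted helper and the Account class below are hypothetical and are not part of the SDK; the SDK's own lookup logic may differ, for example in how it treats missing fields.

```python
def resolve_dotted(record, path):
    """Follow a dotted path through dicts, lists and object attributes.

    Hypothetical helper for illustration only, not the SDK implementation.
    """
    value = record
    for part in path.split('.'):
        if isinstance(value, dict):
            value = value[part]            # dictionary item
        elif isinstance(value, (list, tuple)):
            value = value[int(part)]       # position within a list
        else:
            value = getattr(value, part)   # object attribute
    return value


class Account:
    def __init__(self, name):
        self.name = name


# A dictionary containing a list of objects
record = {'users': [Account('alice'), Account('bob')]}
first_name = resolve_dotted(record, 'users.0.name')
```

In this sketch a missing key, index or attribute raises an exception. That is a choice made for brevity and says nothing about how the SDK handles absent fields.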

EMPTY_VALUES = {}

The EMPTY_VALUES attribute is a dictionary mapping input record fields to values of the associated property that should be considered empty. As an example, the data source might use a specific string to indicate a value that is absent or irrelevant, like ‘-’, ‘n/a’ or ‘none’. By listing these values with the field associated with an output event property, the property will be automatically omitted from the generated EDXML events. Example:

{'fieldname.0.subfieldname': ('none', '-')}

Note that empty values are always omitted, because empty values are not permitted in EDXML event objects.
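The effect of EMPTY_VALUES can be sketched in plain Python. The drop_empty function and the field names are assumptions made for illustration; the sketch only mimics the documented behaviour of omitting listed values and empty strings, and does not use the SDK.

```python
# Values that the (hypothetical) data source uses to say "no value"
EMPTY_VALUES = {
    'status': ('none', '-'),
    'owner': ('n/a',),
}


def drop_empty(fields, empty_values):
    """Return only the fields that carry a real value."""
    kept = {}
    for name, value in fields.items():
        # Empty strings are always dropped; listed values are dropped per field
        if value == '' or value in empty_values.get(name, ()):
            continue
        kept[name] = value
    return kept


fields = {'status': '-', 'owner': 'alice', 'comment': ''}
kept = drop_empty(fields, EMPTY_VALUES)
```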

generate(input_object, record_selector, **kwargs)

Generates one or more EDXML events from the given input record, populating them with properties using the PROPERTY_MAP class property.

When the record transcoder is the fallback transcoder, record_selector will be None.

The input record can be a dictionary or act like one. It can also be an object, a dictionary containing objects, or an object containing dictionaries. Object attributes and dictionary items may themselves contain lists or other objects. The keys in PROPERTY_MAP are used to access the items or attributes of the input record. Using dotted notation in PROPERTY_MAP, values can be extracted from arbitrarily nested combinations of these structures.

This method can be overridden to create a generic event generator, populating the output events with generic properties that may or may not be useful to the specific record transcoders. The specific record transcoders can refine the events that are generated upstream by adding, changing or removing properties, editing the event attachments, and so on.

Parameters:
  • input_object (dict, object) – Input object
  • record_selector (Optional[str]) – The name of the input record type
  • **kwargs – Arbitrary keyword arguments
Yields:

EDXMLEvent
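The pattern of overriding generate() to refine auto-generated events, including the use of temporary properties that are removed again before the event is yielded, might look roughly like the sketch below. To keep it runnable without the SDK, SketchTranscoder is a hypothetical stand-in for ObjectTranscoder and events are plain dictionaries mapping property names to sets of values. A real transcoder would extend ObjectTranscoder and work on EDXMLEvent objects instead.

```python
class SketchTranscoder:
    """Hypothetical stand-in for ObjectTranscoder (not the SDK class)."""
    TYPE_MAP = {}
    PROPERTY_MAP = {}

    def generate(self, input_object, record_selector, **kwargs):
        # Populate one event from the property map of the selected event type
        event_type = self.TYPE_MAP[record_selector]
        event = {}
        for field, prop in self.PROPERTY_MAP[event_type].items():
            if field in input_object:
                event.setdefault(prop, set()).add(input_object[field])
        yield event


class UserTranscoder(SketchTranscoder):
    TYPE_MAP = {'user': 'com.acme.staff.account'}
    PROPERTY_MAP = {
        'com.acme.staff.account': {
            'first': 'tmp.first',  # temporary property, removed in generate()
            'last': 'tmp.last',    # temporary property, removed in generate()
        }
    }

    def generate(self, input_object, record_selector, **kwargs):
        for event in super().generate(input_object, record_selector, **kwargs):
            # Compose one real property from two input record fields
            first = event.pop('tmp.first', set())
            last = event.pop('tmp.last', set())
            if first and last:
                event['user.name'] = {f'{next(iter(first))} {next(iter(last))}'}
            yield event


events = list(UserTranscoder().generate({'first': 'Alice', 'last': 'Smith'}, 'user'))
```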

ObjectTranscoderMediator

class edxml.transcode.object.ObjectTranscoderMediator(output=None)

Bases: edxml.transcode.mediator.TranscoderMediator

This class is a mediator between a source of Python objects, also called input records, and a set of ObjectTranscoder implementations that can transcode the objects into EDXML events.

Sources can instantiate the mediator and feed it records, while record transcoders can register themselves with the mediator in order to transcode the record types that they support. Note that we talk about “record types” rather than “object types” because mediators distinguish between types of input record by inspecting the attributes of the object rather than the Python type obtained by calling type() on the object.

Creates a new transcoder mediator that outputs streaming EDXML data to the specified output. The output parameter is a file-like object that the XML data will be written to. Note that the XML data is binary data, not string data. When the output parameter is omitted, the generated XML data will be returned by the methods that generate output.

Parameters: output (file, optional) – a file-like object
TYPE_FIELD = None

This constant must be set to the name of the item or attribute in the object that contains the input record type, allowing the TranscoderMediator to route objects to the correct record transcoder.

If the constant is set to None, all objects will be routed to the fallback transcoder. If there is no fallback transcoder available, the record will not be processed.

Note

The fallback transcoder is a record transcoder that registered itself using None as record type.

register(record_type_identifier, transcoder)

Register a record transcoder for processing objects of specified record type. The same record transcoder can be registered for multiple record types.

Note

Any record transcoder that registers itself using None as record_type_identifier is used as the fallback transcoder. The fallback transcoder is used to transcode any record for which no record transcoder has been registered.

Parameters:
  • record_type_identifier (Optional[str]) – Name of the record type
  • transcoder (ObjectTranscoder) – ObjectTranscoder class
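How TYPE_FIELD, register() and the fallback transcoder interact can be sketched as follows. SketchMediator and its route() method are hypothetical and only model the routing decision described above. The placeholder strings stand in for record transcoder classes, and nothing here calls the SDK.

```python
class SketchMediator:
    """Hypothetical model of record routing (not the SDK mediator)."""
    TYPE_FIELD = 'type'

    def __init__(self):
        self._transcoders = {}

    def register(self, record_type_identifier, transcoder):
        # Registering with None installs the fallback transcoder
        self._transcoders[record_type_identifier] = transcoder

    def route(self, record):
        record_type = None
        if self.TYPE_FIELD is not None:
            # The record type is read from an item or an attribute
            if isinstance(record, dict):
                record_type = record.get(self.TYPE_FIELD)
            else:
                record_type = getattr(record, self.TYPE_FIELD, None)
        if record_type in self._transcoders:
            return self._transcoders[record_type]
        # No match: use the fallback, or None when there is no fallback
        return self._transcoders.get(None)


mediator = SketchMediator()
mediator.register('user', 'user-transcoder')
mediator.register(None, 'fallback-transcoder')

chosen = mediator.route({'type': 'user', 'name': 'alice'})
fallback = mediator.route({'type': 'printer'})
```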
process(input_record)

Processes a single input object, invoking the correct object transcoder to generate an EDXML event and writing the event into the output.

If no output was specified while instantiating this class, any generated XML data will be returned as bytes.

The object may optionally be a dictionary or act like one. Object transcoders can extract EDXML event object values from both dictionary items and object attributes as listed in the PROPERTY_MAP of the matching record transcoder. Using dotted notation, the keys in PROPERTY_MAP can refer to dictionary items or object attributes that are themselves dictionaries or lists.

Parameters: input_record (dict,object) – Input object
Returns: Generated output XML data
Return type: bytes

XmlTranscoder

class edxml.transcode.xml.XmlTranscoder

Bases: edxml.transcode.transcoder.RecordTranscoder

TYPE_MAP = {}

The TYPE_MAP attribute is a dictionary mapping XPath expressions to EDXML event type names. The XPath expressions are relative to the XPath of the elements that the record transcoder is registered to at the transcoding mediator. The expressions in TYPE_MAP are evaluated on each XML input element to obtain sub-elements. For each sub-element an EDXML event of the corresponding type is generated. In case the events are supposed to be generated from the input element as a whole, you can use ‘.’ for the XPath expression. However, you can also use the expressions to produce multiple types of output events from different parts of the input element.

Note

When no EDXML event type name is specified for a particular XPath expression, it is up to the record transcoder to set the event type on the events that it generates.

Note

The fallback transcoder must set the None key to the name of the EDXML fallback event type.

Example

{'.': 'some-event-type'}
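The way TYPE_MAP expressions select sub-elements of an input element can be sketched with the standard library. This is only an approximation: the SDK uses lxml, whereas xml.etree.ElementTree supports a limited XPath subset, and the event type names and XML structure below are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical map: one event for the order itself, one per contained item
TYPE_MAP = {
    '.': 'com.acme.order',
    'items/item': 'com.acme.order.item',
}

order = ET.fromstring(
    '<order id="1"><items><item>pen</item><item>ink</item></items></order>'
)

# Evaluate each expression relative to the input element and note which
# event type would be generated for each matching sub-element
planned = []
for xpath, event_type in TYPE_MAP.items():
    for sub_element in order.findall(xpath):
        planned.append((event_type, sub_element.tag))
```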

PROPERTY_MAP = {}

The PROPERTY_MAP attribute is a dictionary mapping event type names to the XPath expressions for finding property objects. Each value in the dictionary is another dictionary that maps XPath expressions to property names. The XPath expressions are relative to the source XML element of the event. Example:

{'event-type-name': {'some/subtag[@attribute]': 'property-name'}}

The use of EXSLT regular expressions is supported and may be used in XPath keys like this:

{'event-type-name': {'*[re:test(., "^abc$", "i")]': 'property-name'}}

Mapping XPath expressions to multiple event properties is also possible:

{'event-type-name': {'some/subtag[@attribute]': ['property', 'another-property']}}

Extending XPath by injecting custom Python functions is supported due to the lxml implementation of XPath that is being used in the record transcoder implementation. Please refer to the lxml documentation about this subject. This record transcoder implementation provides a small set of custom XPath functions already, which shows how it is done.

Note that the event structure will not be validated until the event is yielded by the generate() method. This creates the possibility to add nonexistent properties to the XPath map and remove them in the generate() method, which may be convenient for composing properties from multiple XML input tags or attributes, or for splitting the auto-generated event into multiple output events.
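A rough sketch of how an XPath-keyed property map can be applied to an XML element is shown below. It uses xml.etree.ElementTree from the standard library, which does not support EXSLT regular expressions or custom XPath functions, so it covers only the simple cases. The extract_properties helper, the property names and the XML structure are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PROPERTY_MAP = {
    'com.acme.staff.account': {
        'name': 'user.name',
        # One XPath expression feeding two (hypothetical) properties
        'mail[@verified]': ['user.email', 'user.contact'],
    }
}


def extract_properties(element, event_type):
    """Collect property values by evaluating each XPath on the element."""
    properties = {}
    for xpath, targets in PROPERTY_MAP[event_type].items():
        if not isinstance(targets, list):
            targets = [targets]
        values = {found.text for found in element.findall(xpath) if found.text}
        for target in targets:
            if values:
                properties.setdefault(target, set()).update(values)
    return properties


user = ET.fromstring(
    '<user><name>alice</name>'
    '<mail verified="yes">alice@example.com</mail>'
    '<mail>old@example.com</mail></user>'
)
properties = extract_properties(user, 'com.acme.staff.account')
```

Only the mail element that carries the verified attribute matches the second expression, so the unverified address is not picked up.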

EMPTY_VALUES = {}

The EMPTY_VALUES attribute is a dictionary mapping XPath expressions to values of the associated property that should be considered empty. As an example, the data source might use a specific string to indicate a value that is absent or irrelevant, like ‘-’, ‘n/a’ or ‘none’. By listing these values with the XPath expression associated with an output event property, the property will be automatically omitted from the generated EDXML events. Example:

{'./some/subtag[@attribute]': ('none', '-')}

Note that empty values are always omitted, because empty values are not permitted in EDXML event objects.

generate(element, record_selector, **kwargs)

Generates one or more EDXML events from the given XML element, populating them with properties using the PROPERTY_MAP class property.

When the record transcoder is the fallback transcoder, record_selector will be None.

This method can be overridden to create a generic event generator, populating the output events with generic properties that may or may not be useful to the specific record transcoders. The specific record transcoders can refine the events that are generated upstream by adding, changing or removing properties, editing the event content, and so on.

Parameters:
  • element (etree.Element) – XML element
  • record_selector (Optional[str]) – The matching XPath selector
  • **kwargs – Arbitrary keyword arguments
Yields:

EDXMLEvent

XmlTranscoderMediator

class edxml.transcode.xml.XmlTranscoderMediator(output=None)

Bases: edxml.transcode.mediator.TranscoderMediator

This class is a mediator between a source of XML elements and a set of XmlTranscoder implementations that can transcode the XML elements into EDXML events.

Sources can instantiate the mediator and feed it XML elements, while record transcoders can register themselves with the mediator in order to transcode the types of XML element that they support.

register(xpath_expression, transcoder, tag=None)

Register a record transcoder for processing XML elements matching specified XPath expression. The same record transcoder can be registered for multiple XPath expressions. The transcoder argument must be an XmlTranscoder class or an extension of it.

The optional tag argument can be used to pass a list of tag names. Only the tags in the input XML data that are included in this list will be visited while parsing and matched against the XPath expressions associated with registered record transcoders. When the argument is not used, the tag names will be guessed from the XPath expressions that the record transcoders have been registered with. Namespaced tags can be specified using James Clark notation:

{http://www.w3.org/1999/xhtml}html
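James Clark notation is also how parsed elements report their own tag names, which makes it easy to find the right value to pass. The short sketch below uses xml.etree.ElementTree from the standard library rather than lxml; both use the same notation for namespaced tags.

```python
import xml.etree.ElementTree as ET

XHTML = 'http://www.w3.org/1999/xhtml'
document = ET.fromstring(f'<html xmlns="{XHTML}"><body/></html>')

# Tag names of namespaced elements include the namespace URI in braces
root_tag = document.tag
body_tags = [child.tag for child in document]
```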

The use of EXSLT regular expressions in XPath expressions is supported and can be specified like in this example:

*[re:test(., "^abc$", "i")]

Note

Any record transcoder that registers itself using None as the XPath expression is used as the fallback transcoder. The fallback transcoder is used to transcode any record that does not match any XPath expression of any registered transcoder.

Parameters:
  • xpath_expression (Optional[str]) – XPath of matching XML records
  • transcoder (XmlTranscoder) – XmlTranscoder
  • tag (Optional[str]) – XML tag name
parse(input_file, attribute_defaults=False, dtd_validation=False, load_dtd=False, no_network=True, remove_blank_text=False, remove_comments=False, remove_pis=False, encoding=None, html=False, recover=None, huge_tree=False, schema=None, resolve_entities=False)

Parses the specified file, writing the resulting EDXML data into the output. The file can be any file-like object, or the name of a file that should be opened and parsed.

The other keyword arguments are passed directly to lxml.etree.iterparse. Please refer to the lxml documentation for details.

If no output was specified while instantiating this class, any generated XML data will be collected in a memory buffer and returned when the transcoder is closed.

Notes

Passing a file name rather than a file-like object is preferred and may result in a small performance gain.

Parameters:
  • schema – an XMLSchema to validate against
  • huge_tree (bool) – disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)
  • recover (bool) – try hard to parse through broken input (default: True for HTML, False otherwise)
  • html (bool) – parse input as HTML (default: XML)
  • encoding – override the document encoding
  • remove_pis (bool) – discard processing instructions
  • remove_comments (bool) – discard comments
  • remove_blank_text (bool) – discard blank text nodes
  • no_network (bool) – prevent network access for related files
  • load_dtd (bool) – use DTD for parsing
  • dtd_validation (bool) – validate (if DTD is available)
  • attribute_defaults (bool) – read default attributes from DTD
  • resolve_entities (bool) – replace entities by their text value (default: False)
  • input_file (Union[io.TextIOBase, file, str]) – file-like object or name of the file to parse
generate(input_file, attribute_defaults=False, dtd_validation=False, load_dtd=False, no_network=True, remove_blank_text=False, remove_comments=False, remove_pis=False, encoding=None, html=False, recover=None, huge_tree=False, schema=None, resolve_entities=False)

Parses the specified file, yielding bytes containing the resulting EDXML data while parsing. The file can be any file-like object, or the name of a file that should be opened and parsed.

If an output was specified when instantiating this class, the EDXML data will be written into the output and this generator will yield empty strings.

The other keyword arguments are passed directly to lxml.etree.iterparse. Please refer to the lxml documentation for details.

Notes

Passing a file name rather than a file-like object is preferred and may result in a small performance gain.

Parameters:
  • schema – an XMLSchema to validate against
  • huge_tree (bool) – disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)
  • recover (bool) – try hard to parse through broken input (default: True for HTML, False otherwise)
  • html (bool) – parse input as HTML (default: XML)
  • encoding – override the document encoding
  • remove_pis (bool) – discard processing instructions
  • remove_comments (bool) – discard comments
  • remove_blank_text (bool) – discard blank text nodes
  • no_network (bool) – prevent network access for related files
  • load_dtd (bool) – use DTD for parsing
  • dtd_validation (bool) – validate (if DTD is available)
  • attribute_defaults (bool) – read default attributes from DTD
  • resolve_entities (bool) – replace entities by their text value (default: False)
  • input_file (Union[io.TextIOBase, file, str]) – file-like object or name of the file to parse
Yields:

bytes – Generated output XML data

process(element, tree=None)

Processes a single XML element, invoking the correct record transcoder to generate an EDXML event and writing the event into the output.

If no output was specified while instantiating this class, any generated XML data will be returned as bytes.

Parameters:
  • element (etree.Element) – XML element
  • tree (etree.ElementTree) – Root of XML document being parsed
Returns:

Generated output XML data

Return type:

bytes

static get_visited_tag_name(xpath)

Tries to determine the name of the tag of elements that match the specified XPath expression. Raises ValueError in case the XPath expression is too complex to determine the tag name.

Returns: Optional[List[str]]
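The idea of deriving a tag name from an XPath expression can be sketched as below. The guess_tag_name function is hypothetical and much simpler than whatever the SDK does: it looks only at the last path step, strips a trailing predicate, and gives up with ValueError on anything that is not a plain element name.

```python
import re


def guess_tag_name(xpath):
    """Guess the tag name matched by a simple XPath expression.

    Illustrative simplification, not the SDK implementation.
    """
    last_step = xpath.rstrip('/').split('/')[-1]
    # Drop a trailing predicate such as [@type="a"]
    name = re.sub(r'\[.*\]$', '', last_step)
    # Wildcards, functions, axes and similar are treated as too complex
    if not re.fullmatch(r'[A-Za-z_][\w.\-]*', name):
        raise ValueError(f'cannot determine tag name from {xpath!r}')
    return [name]


tags = guess_tag_name('/root/records/record[@type="a"]')
```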

ObjectTranscoderTestHarness

class edxml.transcode.object.ObjectTranscoderTestHarness(transcoder, record_selector, base_ontology=None, register=True)

Bases: edxml.transcode.test_harness.TranscoderTestHarness

Creates a new test harness for testing specified record transcoder. When provided, the harness will try to upgrade the base ontology to the ontology generated by the record transcoder, raising an exception in case of backward incompatibilities.

The record_selector is the selector that the record transcoder will be registered at.

By default the record transcoder will be automatically registered at the specified selector. In case you wish to do the record registration on your own you must set the register parameter to False.

Parameters:
  • transcoder (edxml.transcode.RecordTranscoder) – The record transcoder under test
  • base_ontology (edxml.Ontology) – Base ontology
  • register (bool) – Register the transcoder yes or no
process_object(fixture_object, selector=None, close=True)

Parses specified object and transcodes it to produce output events. The output events are added to the event set.

The selector is only relevant when the record transcoder can output multiple types of events. It must be set to the selector of the sub-object inside the object being transcoded that corresponds with one specific output event type. When unspecified and the transcoder produces a single type of output events it will be fetched from the TYPE_MAP constant of the record transcoder.

By default, the test harness is automatically closed after processing the record. When processing multiple input records this can be prevented. Note that the harness must be closed before using it in assertions.

Parameters:
  • fixture_object – The object to use as input record fixture
  • selector – Selector of the sub-object that corresponds with one specific output event type (optional)
  • close (bool) – Close test harness after processing yes / no

XmlTranscoderTestHarness

class edxml.transcode.xml.XmlTranscoderTestHarness(fixtures_path, transcoder, transcoder_root, base_ontology=None, register=True)

Bases: edxml.transcode.test_harness.TranscoderTestHarness

Creates a new test harness for testing specified record transcoder using XML fixtures stored at the indicated path. When provided, the harness will try to upgrade the base ontology to the ontology generated by the record transcoder, raising an exception in case of backward incompatibilities.

The transcoder_root is the XPath expression that the record transcoder will be registered at. It will be used to extract the input elements for the record transcoder from XML fixtures, exactly as XmlTranscoderMediator would do on real data.

By default the record transcoder will be automatically registered at the specified XPath expression. In case you wish to do the record registration on your own you must set the register parameter to False.

Parameters:
  • fixtures_path (str) – Path to the fixtures set
  • transcoder (edxml.transcode.RecordTranscoder) – The record transcoder under test
  • transcoder_root (str) – XPath for record transcoder registration
  • base_ontology (edxml.Ontology) – Base ontology
  • register (bool) – Register the transcoder yes or no
process_xml(filename, element_root=None)

Parses specified XML file and transcodes it to produce output events. The output events are added to the event set. The filename argument must be a path relative to the fixtures path.

The XML file is expected to be structured like real input data stripped down to contain the XML elements that are relevant to the record transcoder under test.

The element_root is only relevant when the record transcoder can output multiple types of events. It must be set to the XPath expression of the sub-element inside the element being transcoded that corresponds with one specific output event type. When unspecified and the transcoder produces a single type of output events it will be fetched from the TYPE_MAP constant of the record transcoder.

Parameters:
  • filename (str) – The XML file to use as input record fixture
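  • element_root – XPath expression of the sub-element that corresponds with one specific output event type (optional)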

NullTranscoder

class edxml.transcode.NullTranscoder

Bases: edxml.transcode.transcoder.RecordTranscoder

This is a pseudo-transcoder that is used to indicate input records that should be discarded rather than transcoded into output events. By registering this transcoder for transcoding particular types of input records, those records will be ignored.