Digital Commons to DSpace issues using the xml_util Python module

During the Digital Commons to DSpace data migration (documented as Harvesting OAI data to Dspace), a namespace issue stopped the harvest_digital_commons.py script from running.

The command “dcxml = xml_util.xml(dcString)” is meant to let the Python code access and process the document as XML. The information extracted from the document includes page URLs as well as links to the PDF and zip files needed for the data migration.

Problem: The dcString does not contain any namespace declarations at the head of the document, because it is a temporary metadata file split from a full Digital Commons document. As a result, Python was not able to access and process the XML.

Issue: When the above command was issued, Python was unable to match the dc and xsi elements and attributes, because “dcxml” had no valid DOM.
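The same kind of failure can be reproduced without xml_util. What follows is a minimal sketch using Python's standard xml.etree.ElementTree, not the xml_util module itself. The fragment content and the helper name try_parse are hypothetical, made up for illustration. A fragment that uses the dc: prefix without declaring it is rejected by the parser (ElementTree typically reports this as an "unbound prefix" error).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the temporary metadata file (dcString).
# It uses the dc: prefix but never declares the namespace.
dc_string = "<record><dc:title>Sample title</dc:title></record>"


def try_parse(text):
    """Return (True, root element) on success, or (False, error message) on failure."""
    try:
        return True, ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)


ok, result = try_parse(dc_string)
print(ok, result)
```

Here ok comes back False, and result holds the parser's error message instead of a usable document tree.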

Solution: The code needed to be updated to pass both the dc and xsi namespaces in the argument list, as follows:

dcxml = xml_util.xml(dcString, [("dc", "http://purl.org/dc/elements/1.1/"), ("xsi", "http://www.w3.org/2001/XMLSchema-instance")])
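For readers without xml_util, a similar effect can be sketched with the standard library. This is an illustration under stated assumptions, not a description of how xml_util works internally: the helper parse_fragment, the wrapper element, and the sample dc:identifier content are all hypothetical. The idea is to supply the missing declarations by wrapping the fragment in an element that declares the same (prefix, URI) pairs, then to pass those pairs again when searching.

```python
import xml.etree.ElementTree as ET

# The same (prefix, URI) pairs passed to xml_util.xml in the post.
NAMESPACES = [
    ("dc", "http://purl.org/dc/elements/1.1/"),
    ("xsi", "http://www.w3.org/2001/XMLSchema-instance"),
]


def parse_fragment(fragment, namespaces):
    """Parse a fragment whose prefixes are undeclared (hypothetical helper).

    The fragment is wrapped in an element that declares the namespaces.
    This assumes the fragment has no <?xml ...?> declaration of its own.
    """
    decls = " ".join('xmlns:%s="%s"' % (prefix, uri) for prefix, uri in namespaces)
    wrapped = "<wrapper %s>%s</wrapper>" % (decls, fragment)
    return ET.fromstring(wrapped)


# Hypothetical fragment carrying a link to a PDF.
fragment = "<record><dc:identifier>http://example.org/article.pdf</dc:identifier></record>"

root = parse_fragment(fragment, NAMESPACES)
# findall needs the prefix-to-URI mapping to resolve "dc:" in the path.
links = [el.text for el in root.findall(".//dc:identifier", dict(NAMESPACES))]
print(links)
```

With the namespaces declared, the dc elements can be matched and the links needed for the migration can be read out of the fragment.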
