Handling HTML With Drupal’s Migrate API

In Drupal 8, we use the core Migrate API for

  • Upgrading Drupal 6 and Drupal 7 sites
  • Migrating sites from other systems to Drupal
  • Recurring imports from external systems (feeds)

It is a robust, flexible tool.

Drupal works best with structured data, and the Migrate API supports this: file attachments, related taxonomy terms, references to authors or other nodes, and so on. Along with the structured data, we also have to deal with blocks of text, and these blocks often contain HTML markup.

Until now, the Migrate API has supported basic processing of text fields using regular expressions. Marco Villegas and I contributed some plugins to the Migrate Plus module to support proper HTML parsing. This is easier to use and more reliable than using regular expressions.

We originally wrote these plugins while working for Isovera on a project for Pega Systems. Both Isovera and Pega have supported sharing these plugins with the Drupal community. I hope other developers will use them and give back some of their own plugins that use the same approach.

Parsing HTML versus processing with regular expressions

Suppose the source text contains a link like this one:

<a href="https://www.drupal.org">Drupal home page</a>

Using regular expressions

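It is tempting to write a regular expression that matches this link and captures the value of the href attribute. Unfortunately, it is more complicated than that. For example: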

  • The HTML tags are case-insensitive: you have to match a or A.
  • There might be other attributes, such as class, id, or name, before or after the href attribute.
  • The URL (value of the href attribute) might be enclosed in single quotes instead of double quotes.
  • There might be newlines within the HTML element.
  • Are you sure that an escaped quote (like \") is not allowed in a URL?

Before you start researching that last question, the point is that you should not spend your time reinventing the wheel.
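
The pitfalls listed above are easy to reproduce. The following sketch uses a deliberately naive pattern (my own illustration, not taken from any module) and tests it against a few variations of the same link. Only the first variation matches.

```php
<?php
// A deliberately naive pattern, assumed here only for illustration.
$naive = '/<a href="([^"]*)">/';

$samples = [
  // The simple case.
  '<a href="https://www.drupal.org">Drupal home page</a>',
  // Upper-case tag and attribute names.
  '<A HREF="https://www.drupal.org">Drupal home page</A>',
  // Another attribute before href.
  '<a class="external" href="https://www.drupal.org">Drupal home page</a>',
  // Single quotes around the URL.
  "<a href='https://www.drupal.org'>Drupal home page</a>",
  // A newline inside the element.
  "<a\n  href=\"https://www.drupal.org\">Drupal home page</a>",
];

$matched = [];
foreach ($samples as $html) {
  $matched[] = preg_match($naive, $html) === 1;
}
```

Each failure can be patched with a more elaborate pattern, but every patch makes the expression harder to read and to trust.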

There is an amusing answer on StackOverflow describing the dangers of trying to process HTML with regular expressions, and this practice has come to be known as Parsing Html The Cthulhu Way. The StackOverflow answer ends with the suggestion,

Have you tried using an XML parser instead?

Using the DOMDocument class

// $html is the source markup as a string.
$document = new \DOMDocument();
$document->loadHTML($html);
$xpath = new \DOMXPath($document);

After this bit of boilerplate code, we can search the $xpath object with any XPath query and extract whatever attributes we need. For example, to find the href attribute of each <a> element in the source,

foreach ($xpath->query('//a') as $html_node) {
  $href = $html_node->getAttribute('href');
  // Your processing goes here.
}

Using XPath queries gives us a lot of flexibility: we can find <a> elements having a specific class, or we can select those that are nested inside some other HTML element. We did not even think about these possibilities when discussing regular expressions above.
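
As a small illustration of that flexibility, here is a sketch that selects links by class and by position in the document. The sample markup and the class name "external" are assumptions made up for this example.

```php
<?php
// Sample markup, invented for this example.
$html = '<div><p>See <a class="external" href="https://www.drupal.org">Drupal</a>'
  . ' or <a href="/docs">the docs</a>.</p>'
  . '<a href="/about">About</a></div>';

$document = new \DOMDocument();
$document->loadHTML($html);
$xpath = new \DOMXPath($document);

// Links whose class attribute is exactly "external".
$external = [];
foreach ($xpath->query('//a[@class="external"]') as $html_node) {
  $external[] = $html_node->getAttribute('href');
}

// Links nested anywhere inside a <p> element.
$in_paragraph = [];
foreach ($xpath->query('//p//a') as $html_node) {
  $in_paragraph[] = $html_node->getAttribute('href');
}
```

Here $external contains only the drupal.org URL, and $in_paragraph contains the two links inside the paragraph but not the "About" link.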

When you are finished processing your DOMDocument object, you can convert it back to a string:

$processed_html = $document->saveHTML();
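
Putting the pieces together, a full round trip might look like the following sketch. The task (removing inline style attributes) and the sample markup are assumptions chosen only to show a change that survives the conversion back to a string.

```php
<?php
// Sample markup, invented for this example.
$html = '<p style="color: red">Hello <span style="font-size: 2em">world</span></p>';

$document = new \DOMDocument();
$document->loadHTML($html);
$xpath = new \DOMXPath($document);

// Find every element that has a style attribute and remove that attribute.
foreach ($xpath->query('//*[@style]') as $html_node) {
  $html_node->removeAttribute('style');
}

// Convert the modified document back to a string.
$processed_html = $document->saveHTML();
```

Note that saveHTML() with no argument returns the whole document, including the doctype and the <html> and <body> wrappers that loadHTML() added around the fragment.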

Migrate API and the ETL paradigm

The Migrate API follows the Extract, Transform, Load (ETL) paradigm, and each stage corresponds to a type of plugin:

  • Extract (source plugin): read data from the source
  • Transform (process plugins): change data to match the site’s structure
  • Load (destination plugin): save the data

Each migration has a single source plugin and a single destination plugin, but each field uses at least one process plugin and may use several. I think this is the fun part: creating new, easy-to-configure process plugins is the best way to add reusable code to the framework.
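
To make the shape of this concrete, here is a toy pipeline in plain PHP. It does not use the Migrate API at all: the arrays and field names are hypothetical stand-ins for the source, process, and destination plugins, and the point is only that each field passes through its own chain of steps.

```php
<?php
// Extract: hypothetical rows from a source.
$source_rows = [
  ['title' => '  Hello  ', 'body' => '<b>Hi</b> there'],
];

// Transform: each destination field has a chain of one or more steps.
// Plain PHP function names stand in for process plugins here.
$process = [
  'title' => ['trim', 'strtoupper'],
  'body' => ['strip_tags'],
];

// Load: an array stands in for the destination.
$destination = [];
foreach ($source_rows as $row) {
  $item = [];
  foreach ($process as $field => $steps) {
    $value = $row[$field];
    // Feed the output of each step into the next one.
    foreach ($steps as $step) {
      $value = $step($value);
    }
    $item[$field] = $value;
  }
  $destination[] = $item;
}
```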

The Transform/process phase is also the right place to handle HTML processing.

New process plugins for managing HTML

The dom plugin

process:
  'body/value':
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    # Other plugins do their work here.
    -
      plugin: dom
      method: export
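
In plain PHP, the import and export steps might correspond roughly to the following. This is only a sketch of the idea, not the plugin's actual code, and the function names dom_import_sketch() and dom_export_sketch() are hypothetical.

```php
<?php
// Hypothetical helper: turn an HTML fragment into a DOMDocument.
function dom_import_sketch(string $html): \DOMDocument {
  $document = new \DOMDocument();
  // loadHTML() wraps the fragment in <html> and <body> elements.
  $document->loadHTML($html);
  return $document;
}

// Hypothetical helper: turn the document back into an HTML fragment.
function dom_export_sketch(\DOMDocument $document): string {
  $body = $document->getElementsByTagName('body')->item(0);
  $html = '';
  // Serialize only the children of <body>, dropping the wrappers.
  foreach ($body->childNodes as $child) {
    $html .= $document->saveHTML($child);
  }
  return $html;
}

$document = dom_import_sketch('<p>Hello <b>world</b></p>');
// Other DOM-based steps would modify $document here.
$exported = dom_export_sketch($document);
```

The design point is that the string is parsed once and serialized once, and every step in between works on the same DOMDocument object.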

The dom_str_replace plugin

plugin: dom_str_replace
mode: attribute
xpath: '//a'
attribute_options:
  name: href
search: 'documentation.example.com'
replace: 'help.example.com'

Warning: The xpath key was called expression in version 8.x-4.2 of the Migrate Plus module. Use xpath starting with the recently released version 8.x-5.0-rc1.

Like the str_replace plugin that is already part of the Migrate Plus module, this plugin supports either basic string replacement, using the PHP str_replace() or str_ireplace() function, or regular expressions, using preg_replace().
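
The three PHP functions mentioned above behave differently on the same input. The sketch below applies each one to a single attribute value; the URL is invented for the example.

```php
<?php
// A sample href value with a capital letter in the host name.
$href = 'https://Documentation.example.com/guide';

// Case-sensitive replacement: no match, because of the capital "D".
$case_sensitive = str_replace('documentation.example.com', 'help.example.com', $href);

// Case-insensitive replacement.
$case_insensitive = str_ireplace('documentation.example.com', 'help.example.com', $href);

// Regular expression: anchored to the start, case-insensitive.
$regex = preg_replace(
  '@^https?://documentation\.example\.com@i',
  'https://help.example.com',
  $href
);
```

Only the first call leaves the value unchanged; the other two both produce https://help.example.com/guide.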

The dom_apply_styles plugin

This plugin lets you search for an XPath expression and replace the corresponding HTML elements with whatever is configured in the Editor module. For example,

plugin: dom_apply_styles
format: full_html
rules:
  -
    xpath: '//b'
    style: Bold

This will replace <b>...</b> with whatever style is labeled “Bold” in the Full HTML text format, perhaps <strong class="normal-size">...</strong>.
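
For comparison, here is how the same replacement could be done by hand with the DOM classes. This is a sketch of the general technique, not the plugin's code: the replacement element and class are hard-coded, whereas the plugin reads them from the text format's configuration.

```php
<?php
$document = new \DOMDocument();
$document->loadHTML('<p>Some <b>bold</b> text.</p>');
$xpath = new \DOMXPath($document);

foreach ($xpath->query('//b') as $old) {
  // Build the replacement element (hard-coded for this sketch).
  $new = $document->createElement('strong');
  $new->setAttribute('class', 'normal-size');
  // Move all children of the old element into the new one.
  while ($old->firstChild) {
    $new->appendChild($old->firstChild);
  }
  // Swap the old element for the new one.
  $old->parentNode->replaceChild($new, $old);
}

$processed_html = $document->saveHTML();
```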

The dom_migration_lookup plugin

If references to other migrated content appear as links in a text field, then you can now use the dom_migration_lookup plugin:

plugin: dom_migration_lookup
mode: attribute
xpath: '//a'
attribute_options:
  name: href
search: '@/node/(\d+)@'
replace: '/node/[mapped-id]'
migrations:
  - article
  - page

If either the article or page migration has mapped 123 to 456, then this will replace /node/123 in any href attributes with /node/456.
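
The effect can be imitated in plain PHP with a lookup table. In this sketch the $id_map array is a hypothetical stand-in for the migration map tables, and leaving unmapped links untouched is a choice made for the example, not a statement about how the plugin handles them.

```php
<?php
// Hypothetical map from source IDs to destination IDs.
$id_map = [123 => 456];

$document = new \DOMDocument();
$document->loadHTML('<p><a href="/node/123">Mapped</a> <a href="/node/999">Unmapped</a></p>');
$xpath = new \DOMXPath($document);

$hrefs = [];
foreach ($xpath->query('//a') as $html_node) {
  $href = $html_node->getAttribute('href');
  // Replace the captured ID when the map has an entry for it.
  $new_href = preg_replace_callback(
    '@/node/(\d+)@',
    function (array $matches) use ($id_map) {
      $id = (int) $matches[1];
      return isset($id_map[$id]) ? '/node/' . $id_map[$id] : $matches[0];
    },
    $href
  );
  $html_node->setAttribute('href', $new_href);
  $hrefs[] = $new_href;
}
```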

Like the core migration_lookup plugin, this one violates the strict ETL paradigm, since a process plugin (i.e., code in the Transform stage) has to “peek” at the destination database. Ditto for the dom_apply_styles plugin, which reads configuration from the destination database.
