Importing an Atom Feed with the Drupal 8 Migrate API and Paragraphs (Part 1)

Benji Fisher // December 2017

I recently worked on a Drupal 8 site where we had to create blog posts (nodes) from an Atom feed that updates regularly. In earlier versions of Drupal, this would have been a job for the Feeds module, but in Drupal 8 we use the core Migrate API instead.

Here are some of the features of this project.

  • Handle an Atom feed with the Migrate framework.
  • Download images and create File and Media entities in Drupal.
  • Skip some entries based on a custom field.
  • Split the main body into separate image and text paragraphs.

The last part is the most fun, but I will postpone that to Part 2 of this post.

There are several code snippets in this post. The full code is available at https://github.com/isovera/atom_migrate.

Getting Started

We already had a Drupal 8 site set up. I added several contrib modules, including Migrate Plus and Migrate Tools.

I also added a custom module, atom_migrate, to hold the configuration and a little bit of custom code for the migration.

This is pretty standard for using the Migrate API in Drupal 8. The Migrate Plus module adds several features, including the framework for handling XML sources. The Migrate Tools module provides drush commands for managing migrations.

I use the Features module to update the site configuration after making updates to my module. I declare the module as a feature, and then run

drush fim atom_migrate

after making changes. Another method is to uninstall and re-install the module (ugh). I am told that you can put your YAML files under atom_migrate/migrations/ instead of atom_migrate/config/install/, and then clear caches (the plugin cache, to be specific), but I have not tested this. I have also seen the Configuration Development module recommended, but I have not tested that, either.

Finally, this migration is supposed to run periodically, so we need a way to trigger that. One way would be to set up a cron job on the server that invokes

drush migrate-import --group=atom --update

On this project, we decided it would be more portable to avoid server dependencies, so I implemented hook_cron:

File atom_migrate/atom_migrate.module (excerpt)

use Drupal\migrate\MigrateExecutable;
use Drupal\migrate\MigrateMessage;

/**
 * Implements hook_cron().
 *
 * Run the migrations.
 */
function atom_migrate_cron() {
  $manager = Drupal::service('plugin.manager.migration');
  $migration_ids = ['blog_image', 'blog_featured_image', 'blog_node'];
  foreach ($migration_ids as $migration_id) {
    $migration = $manager->createInstance($migration_id);
    $executable = new MigrateExecutable($migration, new MigrateMessage());
    $executable->import();
  }
}

Room for improvement

  • I do not think this code will update existing articles, as drush migrate-import --update would.
  • I should get the migration IDs programmatically instead of hard-coding the list. As it is, if I add a new migration, then I will have to update this code.

Maybe Migrate Tools should add API functions so that it is easier to get the same effect as the drush commands from custom code.
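
Both points can probably be addressed with core APIs alone. The following is an untested sketch, not the code that runs on the project. The function name atom_migrate_import_group is hypothetical, and the sketch assumes that Migrate Plus exposes the group as a migration_group entry in each migration's plugin definition. It calls prepareUpdate() on the ID map, which, as far as I can tell, is what the --update option does when no ID list is given.

```php
<?php

use Drupal\migrate\MigrateExecutable;
use Drupal\migrate\MigrateMessage;

/**
 * Hypothetical helper: import all migrations in a group, with updates.
 *
 * Assumption: the group ID is stored under 'migration_group' in the
 * plugin definition of each migration.
 */
function atom_migrate_import_group($group_id) {
  $manager = \Drupal::service('plugin.manager.migration');
  // An empty ID list asks the plugin manager for every migration.
  foreach ($manager->createInstances([]) as $migration) {
    $definition = $migration->getPluginDefinition();
    $group = isset($definition['migration_group']) ? $definition['migration_group'] : NULL;
    if ($group !== $group_id) {
      continue;
    }
    // Flag previously imported rows so that they are imported again.
    $migration->getIdMap()->prepareUpdate();
    $executable = new MigrateExecutable($migration, new MigrateMessage());
    $executable->import();
  }
}
```

With something like this, hook_cron would only need to call atom_migrate_import_group('atom'). I believe createInstances() returns the migrations sorted by their dependencies, but that is worth checking before relying on it.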

Configuring for an Atom feed

The Migrate Plus module provides plugins for downloading a file from an external URL (or from a local file, useful for testing purposes) and for parsing XML, so this is mostly just a question of configuration. I have the following configuration in migrate_plus.migration_group.atom.yml:

File atom_migrate/config/install/migrate_plus.migration_group.atom.yml (excerpt)

shared_configuration:
  source:
    plugin: url
    data_fetcher_plugin: http
    data_parser_plugin: xml
    namespaces:
      atom: 'http://www.w3.org/2005/Atom'
    urls: 'https://api.example.com/v2/feed?format=atom'
    item_selector: '/feed/entry'
    ids:
      guid:
        type: string

The one difficulty I had is that the XML I got had some elements qualified by namespaces and some (including everything from the Atom spec) without. Thanks to @robcast on the Drupal #migration Slack channel for pointing out the namespaces option, which has the effect of doing

$xpath->registerNamespace('atom', "http://www.w3.org/2005/Atom");

in PHP. This lets me use selectors like atom:id to target XML tags like <id> (no namespace prefix). Curiously, this option does not seem to apply to the item_selector key.

File atom_migrate/config/install/migrate_plus.migration.blog_node.yml (excerpt)

source:
  fields:
    -
      name: guid
      label: Guid
      selector: 'atom:id'

(The selector is relative to the item_selector used above.)

Before using the namespaces key, I found a work-around with a little help from Google and Stack Overflow:

selector: '*[local-name()="author"]/*[local-name()="name"]'
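
To see why both selectors work, here is a small stand-alone sketch in plain PHP, outside of Drupal. The sample XML is made up for illustration; the only thing it shares with the real feed is the Atom default namespace. It uses DOMXPath directly rather than the Migrate Plus parser, so treat it as an illustration of the XPath behavior and not of the module's internals.

```php
<?php

// Made-up sample: elements are in the Atom namespace but carry no prefix.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>urn:example:post-1</id>
    <author><name>Example Author</name></author>
  </entry>
</feed>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

$entry = $xpath->query('/atom:feed/atom:entry')->item(0);

// An unprefixed name only matches elements that are in no namespace.
$unprefixed = $xpath->query('id', $entry)->length;

// The registered prefix matches the element in the Atom namespace.
$prefixed = $xpath->query('atom:id', $entry)->item(0)->textContent;

// The local-name() work-around ignores namespaces altogether.
$author = $xpath->query('*[local-name()="author"]/*[local-name()="name"]', $entry)
  ->item(0)->textContent;

var_dump($unprefixed, $prefixed, $author);
```

The unprefixed query finds nothing, while the other two find the id and the author name.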

Skip some entries based on a custom field

In order to use the Migrate API effectively, it helps to get some practice with chaining multiple process plugins. The building blocks are there, and a few examples go a long way in learning how to use them.

For example, the feed I was using had some articles that I wanted to import and also some items that I wanted to ignore. In the source section of my migration, I defined the content_format key and the XPath that selects it. Then I chained the static_map and skip_on_empty plugins in the process section of the migration:

File atom_migrate/config/install/migrate_plus.migration.blog_node.yml (excerpt)

process:
  blog_type:
    -
      plugin: static_map
      source: content_format
      map:
        Blog: Blog
    -
      plugin: skip_on_empty
      method: row

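The static_map plugin converts “Blog” to “Blog” (not much of a conversion). Any other value has no entry in the map, and the item is skipped: the skip_on_empty plugin cancels processing of the current item when it receives an empty value. (As far as I can tell from the core code, static_map already skips the row on its own for unmapped values unless default_value or bypass is set, so skip_on_empty acts as a safety net here.)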

There is not actually a field called blog_type. The Migrate API lets you make up a destination field like this and ignore it, or use it as an intermediate result in other fields. There is an example of this in the next section.

Creating File and Media entities

There are two ways to manage complex migrations. The first is to have a separate migration for each step of the process, and that is how I managed the “featured image” for my migration. (This is not part of the Atom specification. It is a custom field on the feed I was using.)

The second method is to create the intermediate entities in the “process” phase of the migration. My next blog post will give an example of this method.

Here is an example of using separate migrations. The first migration uses the download process plugin to fetch images referenced in the feed and create file entities of type image:

File atom_migrate/config/install/migrate_plus.migration.blog_image.yml (excerpt)

source:
  fields:
    -
      name: url
      label: 'Image URL'
      selector: 'media:content/@url'
  constants:
    image_base_dir: 'public:/'
    image_name: 'post.jpg'
    date_format: Y-m
process:
  settings:
    plugin: skip_row_if_not_set
    source: url
  temp_date:
    plugin: callback
    callable: date
    source: constants/date_format
  temp_image_uri:
    plugin: concat
    source:
      - constants/image_base_dir
      - '@temp_date'
      - constants/image_name
    delimiter: /
  uri:
    plugin: download
    source:
      - url
      - '@temp_image_uri'
    rename: true
  type:
    plugin: default_value
    default_value: image
destination:
  plugin: 'entity:file'

There are a few things going on in the snippet above.

The settings key is a fake destination. It is just there so that the migration will skip the current row early if the url for this row is empty.

First look at how I create the destination file name.

  • I have a fake source key called constants. (You can call it whatever you want, but constants is the convention.) It has several sub-keys, which are referenced as constants/image_base_dir, constants/image_name, and so on.
  • I set one of my fake destination keys, temp_date, using the callback plugin: this has the effect of setting this intermediate result to date('Y-m'), or something like 2017-12.
  • I set the next fake destination key, temp_image_uri, using the concat plugin to paste together 'public:/', the date string I just created, and 'post.jpg', using / as glue. That gives something like public://2017-12/post.jpg.
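
The steps above amount to the following plain PHP. This is a sketch of the effect, not the plugins' code: it assumes that callback passes the source value to the callable and that concat joins its sources with the delimiter.

```php
<?php

// The constants from the source section of the migration.
$constants = [
  'image_base_dir' => 'public:/',
  'image_name' => 'post.jpg',
  'date_format' => 'Y-m',
];

// callback plugin: call date() with the source value as its argument.
$temp_date = call_user_func('date', $constants['date_format']);

// concat plugin: join the three pieces, using "/" as the delimiter.
$temp_image_uri = implode('/', [
  $constants['image_base_dir'],
  $temp_date,
  $constants['image_name'],
]);

print $temp_image_uri . "\n";
```

The single slash at the end of 'public:/' plus the delimiter is what produces the double slash in the stream wrapper URI.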

Next I use the download plugin. The first argument is the URL of the image file, which I have defined in the source section of the migration. The second argument is the destination file name. Since I supply the optional rename: true key, Drupal will add _0, _1, and so on in order to create distinct file names.
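
The renaming can be pictured with the following simplified sketch. The function example_distinct_filename is hypothetical and checks against a plain list of names; Drupal does the equivalent check against the file system.

```php
<?php

/**
 * Hypothetical helper: append _0, _1, ... until the name is unused.
 */
function example_distinct_filename($uri, array $existing) {
  if (!in_array($uri, $existing, TRUE)) {
    return $uri;
  }
  // Split "public://2017-12/post.jpg" into name and ".jpg".
  $dot = strrpos($uri, '.');
  if ($dot === FALSE) {
    $dot = strlen($uri);
  }
  $name = substr($uri, 0, $dot);
  $extension = (string) substr($uri, $dot);
  $counter = 0;
  do {
    $candidate = $name . '_' . $counter++ . $extension;
  } while (in_array($candidate, $existing, TRUE));
  return $candidate;
}

$existing = ['public://2017-12/post.jpg', 'public://2017-12/post_0.jpg'];
$renamed = example_distinct_filename('public://2017-12/post.jpg', $existing);
$unchanged = example_distinct_filename('public://2017-12/post.jpg', []);
print $renamed . "\n";
```

Since every image in this migration asks for the same name, post.jpg, the counter is what keeps the downloaded files apart.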

The download plugin returns the URI of the created file, something like public://2017-12/post_17.jpg. I assign this to the uri property of the File entity that this migration creates.

Now that we have the migration creating File entities, we could attach those files to a content type with a file-reference field, using the migration_lookup plugin. In fact, we did something a little different on this project.

We decided to use the core Media module. This is a little aggressive with Drupal 8.4, but the Media module is developing quickly. It looks as though everyone will be using it once Drupal 8.5 is released, and we hope to be one step ahead of the crowd.

Since we are using Media entities, we have another migration that deals with them. Here is the interesting part of this migration:

process:
  bundle:
    plugin: default_value
    default_value: image
  field_media_image/target_id:
    plugin: migration_lookup
    migration: blog_image
    no_stub: true
    source: guid
  field_media_image/alt: description
destination:
  plugin: 'entity:media'
migration_dependencies:
  required:
    - blog_image

This migration creates Media entities.

The Migrate API keeps track of the association between guid (the source key) and file ID in the blog_image migration (the one before this). The migration_lookup plugin translates the guid to the file ID, and that gets assigned to the target_id property of field_media_image on the Media entity.
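
As a mental model, the lookup behaves like an array keyed by the source ID. The sketch below is plain PHP with made-up guids, file IDs, and description; the function example_migration_lookup is hypothetical and only stands in for the map that the blog_image migration maintains.

```php
<?php

// Stand-in for the blog_image map: source guid => file ID (made-up values).
$blog_image_map = [
  'urn:example:post-1' => 17,
  'urn:example:post-2' => 18,
];

/**
 * Hypothetical helper: return the file ID for a guid, or NULL if none.
 */
function example_migration_lookup(array $map, $guid) {
  return isset($map[$guid]) ? $map[$guid] : NULL;
}

// The values that end up on the Media entity for one feed entry.
$media_values = [
  'bundle' => 'image',
  'field_media_image' => [
    'target_id' => example_migration_lookup($blog_image_map, 'urn:example:post-1'),
    'alt' => 'Example description',
  ],
];

$missing = example_migration_lookup($blog_image_map, 'urn:example:no-image');
```

As I understand it, no_stub: true is what makes a failed lookup come back empty instead of creating a placeholder file entity.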

The alt property is populated by description, one of the source fields that I left out because there is nothing new about it.

The migration_dependencies key tells the Migrate API that this migration should not be run until the blog_image migration is complete. We can override that by adding the --force option to drush migrate-import.

The migration that creates nodes uses these Media entities, translating the feed’s guid into a Media entity ID using the migration_lookup plugin. There is nothing new in this step, but you can see the full code in the GitHub repository.

References

In the spirit of Open Source, I have borrowed heavily from blog posts, Slack messages, documentation, and other forms of support. Here are links to some of the sources I found helpful.

I mentioned it at the top, but here again is the link to the full code for this post: https://github.com/isovera/atom_migrate.

Here are all the contrib modules mentioned above: Feeds, Migrate Plus, Migrate Tools, Features, and Configuration Development.
