Scraping Content from HTML Source in Drupal

Continuing our series on migrating from Drupal 7 to Drupal 10, we take you through our step-by-step method for scraping content from HTML sources in Drupal.

Web scraping, also known as content scraping, involves extracting data from your current website or an external site. This technique is especially valuable during website updates or migrations.

How To Scrape Content from HTML Source in Drupal

Suppose the description field (Body) of the source Drupal 7 site has the following HTML structure:

    <h1>Ea enim facilisi lenis nobis</h1>
    <p><img alt="" src="/sites/default/files/image_body.jpeg" style="height:454px; width:228px" /></p>
    <p>Blandit eligo laoreet pertineo quibus quidem. Euismod gemino iaceo praesent uxor. Amet gilvus nutus pecus turpis. Amet at comis in jus oppeto os patria qui ymo. Acsi blandit damnum exputo ibidem occuro praesent verto. Antehabeo distineo ibidem imputo pecus quibus refero sino turpis velit. Ea enim facilisi lenis nobis </p>

For our new Drupal 10 site, we'll extract the title and the image into dedicated fields, then remove them from the body.

The result should look like this:

field_subtitle: Ea enim facilisi lenis nobis

body: 

    <p>Blandit eligo laoreet pertineo quibus quidem. Euismod gemino iaceo praesent uxor. Amet gilvus nutus pecus turpis. Amet at comis in jus oppeto os patria qui ymo. Acsi blandit damnum exputo ibidem occuro praesent verto. Antehabeo distineo ibidem imputo pecus quibus refero sino turpis velit. Ea enim facilisi lenis nobis </p>

field_image: /sites/default/files/image_body.jpeg 

Choosing the Migration Process

To streamline our migration, we first extract the data and then remove it from the new body field, as described above. This can be done with existing process plugins contributed by the community.

The migrate_plus module comes with some extra process plugins that can help us with this task.

In this case, we are going to use dom, dom_select, and dom_remove.

  • dom: Imports a string as a DOM document and exports a DOM document back to a string. You will see it in action in the example.
  • dom_select: Selects part of a DOM document with an XPath selector for further use.
  • dom_remove: Like dom_select, but deletes the part of the document matched by an XPath selector. After this step, we usually need to convert the DOM document back into a string so it can be stored in a normal text field.

Finally, we are going to use image_import, a process plugin provided by the Migrate Files (extended) module.

  • image_import: Imports an image from a remote or local source, downloading it and saving it in the new Drupal 10 site with no extra steps, right in the same migration file.

Migration File

We will build on the example from our previous article on migrating from Drupal 7 to 10, adapting it to the purpose of this article.

    id: example_articles
    label: Migrate Articles from Drupal 7
    migration_tags:
      - Example
    source:
      plugin: d7_node
      node_type: article
      constants:
        remote_url: http://drupal7.lndo.site
        file_destination: 'public://images/'
    process:
      uuid: uuid
      langcode: langcode
      revision_timestamp: revision_timestamp
      revision_uid: revision_uid
      revision_log: revision_log
      status: status
      title: title
      created: created
      changed: changed
      promote: promote
      sticky: sticky
      default_langcode: default_langcode
      revision_default: revision_default
      revision_translation_affected: revision_translation_affected
      path: path
      field_tags:
        plugin: sub_process
        source: field_tags
        process:
          target_id:
            plugin: migration_lookup
            migration: example_tags
            source: tid
      body/value:
        - plugin: dom
          method: import
          source: body/0/value
        - plugin: dom_remove
          mode: element
          selector: '//h1'
        - plugin: dom_remove
          mode: element
          selector: '//p'
          limit: 1
        - plugin: dom
          method: export
      body/format:
        plugin: default_value
        default_value: basic_html
      field_subtitle:
        - plugin: dom
          method: import
          source: body/0/value
        - plugin: dom_select
          selector: //h1
      _image_path:
        - plugin: dom
          method: import
          source: body/0/value
        - plugin: dom_select
          selector: //img/@src
      _image_url:
        plugin: concat
        source:
          - constants/remote_url
          - '@_image_path/0'
      field_image:
        plugin: image_import
        source: '@_image_url'
        destination: 'constants/file_destination'
      uid:
        plugin: migration_lookup
        migration: example_users
        source: node_uid
        no_stub: true
    destination:
      plugin: entity:node
      default_bundle: article
    migration_dependencies:
      required:
        - example_users
        - example_files
        - example_media_images
        - example_tags

Removing content from the original Body

In this section, we chain multiple process plugins. First, we convert the string into a DOM object for subsequent processing.

In the second and third steps, we manipulate the DOM object using XPath selectors. A cheat sheet of XPath selectors is a useful reference here.

We remove the <h1> first, and then the first <p> element, which contains the <img> tag. The limit property restricts that second removal to a single occurrence.

After the modifications are made, we convert the DOM object back into a string to store it in the body field:

    body/value:
      - plugin: dom
        method: import
        source: body/0/value
      - plugin: dom_remove
        mode: element
        selector: '//h1'
      - plugin: dom_remove
        mode: element
        selector: '//p'
        limit: 1
      - plugin: dom
        method: export
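The `//p` selector with `limit: 1` relies on the image paragraph always being the first <p> in the body. As a hedged alternative, assuming some source bodies might open with a text paragraph instead, an XPath predicate can target only a paragraph that directly contains an <img>. This is a sketch of that variation, not part of the original migration:

```yaml
body/value:
  - plugin: dom
    method: import
    source: body/0/value
  - plugin: dom_remove
    mode: element
    selector: '//h1'
  # Assumption: the image is a direct child of its <p>, as in the sample body.
  # '//p[img]' matches only <p> elements that have an <img> child.
  - plugin: dom_remove
    mode: element
    selector: '//p[img]'
    limit: 1
  - plugin: dom
    method: export
```

With this selector, a body that has no image paragraph is left with all of its text paragraphs intact.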

Extracting info to store into separate fields

This section covers isolating specific data from the content and storing it in dedicated fields. Let’s start with the title.

Extracting the Title

First, we extract the title from the content and store it in field_subtitle.

As with the body field, we first convert the string into a DOM object. We then select the text enclosed in the <h1> tag with the `dom_select` process plugin:

    field_subtitle:
      - plugin: dom
        method: import
        source: body/0/value
      - plugin: dom_select
        selector: //h1
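dom_select returns a list of matches, which is why the image example in this article reads `@_image_path/0`. If field_subtitle should receive a single string, one possible approach is to append Drupal core's extract process plugin. This is a minimal sketch that assumes only the first <h1> matters and that an empty subtitle is acceptable when no <h1> exists:

```yaml
field_subtitle:
  - plugin: dom
    method: import
    source: body/0/value
  - plugin: dom_select
    selector: //h1
  # Keep only the first match from the list returned by dom_select.
  - plugin: extract
    index:
      - 0
    # Assumption: fall back to an empty string when the body has no <h1>.
    default: ''
```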

Extracting the Image

To extract the image, we use the same dom_select process plugin, plus image_import to store the image in an image field:

    _image_path:
      - plugin: dom
        method: import
        source: body/0/value
      - plugin: dom_select
        selector: //img/@src
    _image_url:
      plugin: concat
      source:
        - constants/remote_url
        - '@_image_path/0'
    field_image:
      plugin: image_import
      source: '@_image_url'
      destination: 'constants/file_destination'

I prefixed the first two fields with an underscore _ just to indicate that we are using those fields to store temporary data to be used later.

In _image_path, we extract the image path from the body. For the sake of this example, let’s suppose the img tag in the body holds the relative URL of the image. We use the XPath selector //img/@src to capture the source URL.

For _image_url, we combine the extracted path with the base URL to form a complete URL to the image.

Lastly, we import the image into the field_image using the image_import process plugin provided by Migrate Files (extended).
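Not every article body necessarily contains an image. When it does not, `_image_path` ends up empty and `_image_url` holds only the base URL. As a hedged sketch, assuming such nodes should simply be migrated without an image, Drupal core's skip_on_empty process plugin can guard the field before image_import runs:

```yaml
field_image:
  # Assumption: leave field_image unset when no <img> was found in the body.
  # 'method: process' skips only this field, not the whole row.
  - plugin: skip_on_empty
    method: process
    source: '@_image_path'
  - plugin: image_import
    source: '@_image_url'
    destination: 'constants/file_destination'
```

The `_image_path` and `_image_url` definitions stay the same. Only the field_image pipeline gains the extra first step.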

Finally, simply run the following command:

drush migrate:import example_articles

At Octahedroid We Can Help With Your Drupal Website Migration

With over a decade of Drupal expertise, our team at Octahedroid is perfectly equipped to support your migration to Drupal 10. We understand the needs of users like you who depend on Drupal's powerful features for managing complex user permissions, workflows, and digital functionalities.

Interested in learning how we can assist you with all things Drupal? Contact us to find out more about our services.
