How to migrate a large site to Drupal

Technology Blog

I’ve worked on a few large site migration jobs now, so I thought I’d share some of my experiences and what I’ve learned along the way.

I recently moved my hobby site Odd Books to Drupal, so that’s the one I’ll use as the prime example.

Preparation

Before starting, take a good long look at the existing site. How many pages does it have? What is its directory structure? How many different scripts? How is it maintained? Is it based on a database? What about other assets such as image and multimedia files?
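
For a site that lives in files on disk, a quick way to start answering the first few questions is to count the files by extension. What follows is a minimal sketch in plain PHP rather than anything Drupal-specific. The function name inventory_site() is made up for illustration, and it assumes PHP 5.3 or later for FilesystemIterator.

```php
<?php
// Hypothetical helper: count the files under a site root, grouped by extension.
function inventory_site($root) {
  $counts = array();
  $iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
  );
  foreach ($iterator as $file) {
    // Skip anything that is not a regular file.
    if (!$file->isFile()) {
      continue;
    }
    $ext = strtolower(pathinfo($file->getFilename(), PATHINFO_EXTENSION));
    if ($ext === '') {
      $ext = '(none)';
    }
    $counts[$ext] = isset($counts[$ext]) ? $counts[$ext] + 1 : 1;
  }
  // Most common file type first, keeping the extension as the array key.
  arsort($counts);
  return $counts;
}
```

Calling inventory_site() on the old document root returns an array of extension => count, which gives a rough idea of how much HTML, script and media content there is to move.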

Draw up a plan which includes the answers to all these questions. They define the problem you have to solve in migrating the site, which consists of two elements:

  1. How to get the content of the old site into the new one.
  2. How to present that content, once it is there.

Mapping content types

One of the first things I tend to do with any Drupal site is install the Content Construction Kit (CCK) module. It’s a good idea to define a content type for each section of your original site, as this has two advantages. Firstly, you can use the Pathauto module to assign Drupal paths to the different content types automatically, and so replicate the structure of the old site. Secondly, you can theme the different areas individually using a separate template for each content type.

In the case of Odd Books I have separate content types for, among others, pages about Frank Harris and pages about Amanda McKittrick Ros. As yet they are themed identically, but if I choose to give them a different look it will be easy to do so.
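
To make the idea of replicating the old structure concrete, here is a small sketch of the kind of mapping involved, in which each content type gets a path prefix matching a section of the old site. This is plain PHP for illustration and not the Pathauto API. The function name section_path() and the ros_page type are hypothetical, and the prefixes are assumptions.

```php
<?php
// Hypothetical map from content type to the path prefix of an old site section.
function section_path($type, $key) {
  $prefixes = array(
    'harris_book' => 'harris/book', // prefix assumed for illustration
    'ros_page'    => 'ros',         // hypothetical content type and prefix
  );
  if (!isset($prefixes[$type])) {
    // Unknown content type: no legacy-style path to offer.
    return NULL;
  }
  return $prefixes[$type] . '/' . $key . '.html';
}
```

In practice Pathauto does this job from a pattern configured per content type, so a table like this is only useful as a planning aid when deciding what those patterns should be.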

Migrating content

One of the major challenges of any site migration exercise is transferring content in bulk. In the case of Odd Books most of the content was in a home-grown CMS, with some data in flat files and some in a MySQL database. I couldn’t find an off-the-shelf module to deal with this, so I wrote my own and took an incremental approach to the import process, as follows:

  1. Back up the new site’s database in case the migration is unsuccessful.
  2. Create the code to migrate one type of content from the old site.
  3. Execute the code and check the result.
  4. Repeat for the next type of content.

Here’s some sample code I used to import entries for books in the Frank Harris section, which may be of interest:

  $dir = "c:/likemind_sites/oddbooks/harris/html/";
  $dh = opendir($dir);
  while (($file = readdir($dh)) !== false) {
    if (substr($file, -5) == '.html') {
      $content = file_get_contents($dir . $file);
      // The file name without its extension is the key into the legacy book table.
      $page_key = substr($file, 0, -5);

      // Query the legacy database, then switch back to the default connection.
      db_set_active('legacy');
      $result = db_query("SELECT * FROM {book} WHERE book_key = '%s'", $page_key);
      $row = db_fetch_object($result);
      db_set_active('default');
      $title = $row ? $row->title : 'no title';

      // Point image URLs at the new files directory.
      $content = preg_replace('/(img.*src=")/isU', '$1/files/images/harris/', $content);
      $node = new stdClass();
      $node->type = 'harris_book';
      $node->uid = 1;
      $node->title = $title;
      $node->format = 3;
      $node->body = $content;
      $node->status = 1;
      $node->path = 'harris/review/' . $file;
      // Only set the CCK fields when a matching database row was found.
      if ($row) {
        $node->field_subtitle[0]['value'] = $row->sub_title;
        $node->field_rating[0]['value'] = $row->rating;
      }

      node_save($node);
    }
  }
  closedir($dh);

The code reads one file at a time from a directory, finds the related entry in a database table and combines the two to make a new node of the required type. The URLs for image content are amended to comply with the directory layout I’d decided on for the new site.
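
The image URL rewrite is easy to try out on its own. The pattern in the import code prefixes every src attribute it finds, which was fine for my files, but it would also prefix absolute URLs. The sketch below swaps in a slightly stricter pattern that leaves alone any src starting with a scheme or a slash. The function name prefix_image_urls() is hypothetical, and it assumes the prefix contains no "$" or backslash characters.

```php
<?php
// Hypothetical helper: prepend $prefix to relative <img src="..."> values only.
function prefix_image_urls($html, $prefix) {
  return preg_replace(
    // Match up to the opening quote of src, unless the value that follows
    // starts with a URL scheme (such as http:) or a leading slash.
    '/(<img\b[^>]*\bsrc=")(?![a-z][a-z0-9+.-]*:|\/)/i',
    '${1}' . $prefix,
    $html
  );
}
```

With a prefix of /files/images/harris/, a tag such as <img src="cover.jpg"> gets the new directory prepended, while one pointing at a full http:// address is left unchanged.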

Note the use of two CCK fields to store additional data. Each field is represented as a two-dimensional array. The first level of the array has an entry for every value of the field (a CCK field may hold multiple values); in this case there is only one value. The second level of the array holds the value data in a format that depends on the field type. For the two fields used here, it is simply a single entry with a key of ‘value’.
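
Written out in full, the structure looks something like the following sketch. The field_subtitle name comes from the import code above, while field_alt_title, the sample strings and the field_values() helper are hypothetical and only there to show a field with more than one value.

```php
<?php
$node = new stdClass();

// Single-valued field: one entry at the first level, holding one 'value' key.
$node->field_subtitle = array(
  0 => array('value' => 'An example subtitle'),
);

// Hypothetical multi-valued field: one first-level entry per value.
$node->field_alt_title = array(
  0 => array('value' => 'First alternative title'),
  1 => array('value' => 'Second alternative title'),
);

// Hypothetical helper: flatten a field of this shape into a list of values.
function field_values(array $field) {
  $values = array();
  foreach ($field as $item) {
    if (isset($item['value'])) {
      $values[] = $item['value'];
    }
  }
  return $values;
}
```

Other field types may use different keys at the second level, so the helper only applies to fields that store their data under ‘value’.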

As well as the home-made CMS, the original site had a phpBB2 discussion board which I wanted to migrate. (This was the second move for some of the older entries, which had originally come from an online guest book I set up many years ago.) Fortunately there’s a module specifically for importing phpBB2 boards. It works OK, though I did find it to be a little buggy.

Manage URLs

As far as possible I tried to carry over URLs from the old site to the new. That’s the best thing to do for your users and search engines. However, there were some pages I wanted to deal with differently:

  1. Where there were lists of items on index pages I wanted to give them URLs that matched their function. So for example the page at the path harris/book is an index of entries which all have URLs of the form harris/book/something.html. On the original site, however, the corresponding page had a URL of “harris/booklist.php”.
  2. Some pages of the old site had URLs with query strings included, such as “http://oddbooks.co.uk/harris/page.php?page_key=drugfiend”. There’s no simple way to define a corresponding path in Drupal.

Items of the first type were easy to sort out. The path_redirect module deals with precisely this need by allowing you to specify the incoming and outgoing paths and issuing 301 redirects accordingly.

For the other cases, I added more code to my custom migration module: an init hook based on the one in the path_redirect module, but hard-coded to deal with the specific URLs in question and map them to the new path names for the pages:

function myimport_init() {
  $path = $_GET['q'];
  // Collapse the doubled slashes found in some old inbound links.
  $path = str_replace('//', '/', $path);

  // Stays NULL unless the request matches one of the legacy script URLs.
  $new_path = NULL;

  switch ($path) {
    case 'harris/book.php' :
    case 'harris/edition.php' :
      $new_path = 'harris/book/'.$_GET['book_key'].'.html';
      break;
    case 'harris/page.php' :
      $new_path = 'harris/'.$_GET['page_key'].'.html';
      break;
    case 'harris/person.php' :
    case 'harris/person.php3' :
      $new_path = 'harris/whoswho/'.$_GET['name'].'.html';
      break;
    case 'harris/gallery.php' :
    case 'gallery.php' :
      $new_path = 'harris/gallery/'.$_GET['image_key'].'.html';
      break;
    case 'harris/genealogy.php' :
      $new_path = 'harris/genealogy/'.$_GET['ikey'];
      break;
  }
  if ($new_path) {
    if (function_exists('drupal_goto')) {
      // if there's a result found, do the redirect
      unset($_REQUEST['destination']);
      drupal_goto($new_path);
    }
    else {
      // page caching is turned on so drupal_goto() (common.inc) hasn't been loaded
      path_redirect_goto($new_path);
    }
  }
}

Note the str_replace() call right at the start of the function. It’s there because I’d noticed from the data for the site in Google Webmaster Tools that Google had a lot of entries with doubled slashes on these pages, no doubt due to some programming bug on the old site, so this was an opportunity to deal with those problems as well.
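
The mapping itself can be pulled out into a function that is easy to test away from Drupal. The following is a sketch of that idea and not the code I ran: the function name legacy_redirect_path() is hypothetical, only two of the legacy scripts are listed, and it uses preg_replace() rather than str_replace() so that runs of three or more slashes are collapsed too. It also returns NULL when the expected query parameter is missing, so no redirect is attempted for incomplete URLs.

```php
<?php
// Hypothetical helper: map a legacy script path plus its query parameters
// to the new path, or return NULL when there is nothing to redirect.
function legacy_redirect_path($path, array $query) {
  // Collapse any run of slashes into a single slash.
  $path = preg_replace('#/{2,}#', '/', $path);

  // Legacy script => (query parameter holding the key, new path pattern).
  $map = array(
    'harris/book.php' => array('book_key', 'harris/book/%s.html'),
    'harris/page.php' => array('page_key', 'harris/%s.html'),
  );
  if (!isset($map[$path])) {
    return NULL;
  }
  list($param, $pattern) = $map[$path];
  if (!isset($query[$param]) || $query[$param] === '') {
    return NULL;
  }
  return sprintf($pattern, $query[$param]);
}
```

Inside an init hook this would be called with $_GET['q'] and $_GET, and the redirect issued only when the result is not NULL.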

Aftercare

There is a lot more to the process of migration than is covered here. I may return to specific topics in future posts, but I didn’t want to make this one too long. Finally, I would like to mention that once the site has been migrated there’s still work to be done. Webmaster Tools and Drupal’s logs for pages not found will help you to find inbound links pointing to URLs that no longer exist. Of course you should also be monitoring your logs for other errors, and it’s a good idea to have the Update Status module installed so you get news of module updates.
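
If you have access to the web server’s access log as well, the same check can be done outside Drupal. The sketch below counts 404 responses per requested path. It assumes log lines in the common Apache format, with the request in double quotes followed by the status code, and the function name count_not_found() is hypothetical.

```php
<?php
// Hypothetical helper: count 404 responses per requested path, given an
// array of access log lines in the common Apache log format (assumed).
function count_not_found(array $lines) {
  $counts = array();
  foreach ($lines as $line) {
    // Capture the request path when the status code after the quoted request is 404.
    if (preg_match('#"(?:GET|HEAD) (\S+) [^"]*" 404 #', $line, $m)) {
      $path = $m[1];
      $counts[$path] = isset($counts[$path]) ? $counts[$path] + 1 : 1;
    }
  }
  // Most frequently requested missing path first.
  arsort($counts);
  return $counts;
}
```

Feeding it the result of file() on the log gives a list of missing paths ordered by how often they were requested, which is a reasonable order in which to add redirects.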

Migration is hard work, but it’s very satisfying to finally have your new site built and ready to release, with all the familiar old content in a shiny new form. In my experience Drupal is one of the best platforms to migrate to, given its flexibility and forgiving design. What are your migration stories and lessons?

3 Responses to “How to migrate a large site to Drupal”

  1. Robert

    Very interesting work. I’ve used the Import HTML module to migrate my sites over to Drupal. It’s a little buggy but does the trick for the most part.

    One question – changing the URLs for images makes sense, but how do you handle changing the URLs for other non-HTML files (e.g. pdfs, docs, xls, etc)?

    Rob

  2. Anonymous

    Hi Robert. That’s a good question, but not one I’ve had to deal with. I think it would depend on the number of files involved. I’d probably use mod_rewrite to rewrite the URLs so dir1/dir2/dir3.pdf would go to files/dir1/dir2/dir3.pdf.
