Relieve the pain of using Microsoft Word with WordPress or Drupal

If you are having trouble transferring text between MS Word and a CMS such as WordPress or Drupal, here’s some straightforward practical advice that should help.

What’s the matter with Word?

Here’s a fairly common scenario: you have some text in a Microsoft Word document you’d like to put in a web page. So you open the document, select the text and do a Copy.

Then you go to your website’s administration panel and find the page to insert into, and do Paste.

At this point the new text probably looks OK, so you save the page. But when you check it on the site itself, it looks out of place: there’s additional weird code, the font is different from the rest of your content, and very often it’s also the wrong colour or size. Aaaarrrggghhhh!!

What’s going on? First, for this to happen you must be using what’s called a Rich Text Editor to edit text on your site: one that gives you a WYSIWYG (What You See Is What You Get) view of content as you enter it. WordPress comes with the TinyMCE WYSIWYG editor built in; it’s the one shown when you select the Visual mode for editing. Drupal sites may have any one of a number of different editors installed, of which the commonest are TinyMCE and FCKeditor.

It doesn’t matter which editor you are using, though: if it’s in WYSIWYG mode, you are likely to encounter this problem when copying text from Word. It happens because, when you copy, Word automatically tries to preserve the formatting of your original document by translating it to HTML. Unfortunately the HTML it generates is non-standard and of poor quality. It contains code that will tend to override the choices for font, text colour and layout that your web designer has built into your site, hence the problems described above.

“Wait a minute, though,” you may be saying: “I don’t see this problem, yet I copy and paste content from Word all the time.” This may be the case if your site has been set up to filter the HTML entered, removing all the extraneous code added by Word before it is displayed. Even if that’s the case, and you are happy with how things are working, you may still have a problem with copying images. That’s dealt with in the next section, so keep reading.
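
To make the idea of an HTML filter concrete, here is a minimal sketch in Python. It is not the filter that WordPress, Drupal or any editor actually uses; the allowed-tag list, the names `WordCleaner` and `clean_word_html`, and the sample markup are all assumptions chosen for illustration. It keeps a small set of structural tags and drops everything else, including the inline styles, `class` attributes, comments and Office-specific tags such as `<o:p>` that Word typically adds.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a deliberately small whitelist; a real site would tune this.
ALLOWED_TAGS = {"p", "br", "strong", "b", "em", "i", "ul", "ol", "li", "a", "h2", "h3"}
ALLOWED_ATTRS = {"a": {"href"}}   # attributes kept per tag; all others are dropped
SKIP_CONTENT = {"style", "script"}  # elements whose text content is discarded


class WordCleaner(HTMLParser):
    """Rebuilds the HTML using only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # greater than zero while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drops <span>, <font>, <o:p> and similar
        keep = ALLOWED_ATTRS.get(tag, set())
        attr_text = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in keep
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))

    # Comments (including Word's conditional comments) are ignored by default.


def clean_word_html(html):
    cleaner = WordCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


# Illustrative sample in the style of Word's output (not taken from a real document).
sample = (
    '<p class="MsoNormal" style="font-family:Calibri;color:red">'
    '<span style="font-size:14pt">Hello <b>world</b></span><o:p></o:p></p>'
)
print(clean_word_html(sample))  # <p>Hello <b>world</b></p>
```

A whitelist is used rather than a list of things to remove because Word's extra markup varies between versions, so it is safer to state what is allowed than to try to enumerate everything that is not.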

What becomes of the broken image?

(Yes, it's meant to look like this!)

It’s worth noting that if your Word document contains images, you will not be able to insert these successfully into a web page by means of a copy and paste operation, whatever you do. Generally, the best alternative is to copy the text first, then use your editor’s image upload function to add each image. This is only possible if you have the images available as separate files. Also, each image file must be in a format compatible with the web: GIF, JPEG or PNG.

If you don’t have your images as files in the right format, here’s a quick solution that works most of the time.

  1. While editing your document in Word, go to the File -> Save As… menu option.
  2. Choose to save the document as a web page, and select a suitable name and location for it.
  3. Close the document.
  4. Using Windows Explorer, browse to where you saved the document. If you called it “my_document” then you’ll see a file called “my_document.htm” and a folder called “my_document_files”. In the folder you’ll find copies of all the images from your document.
  5. Now you can use the image uploader that comes with your CMS to insert each image into your new post.
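
If the folder from step 4 contains a lot of files, a short script can pick out the ones that are ready to upload. The following is only a sketch: the function name `find_web_images` is made up for this example, and the folder name follows the “my_document” example above. It lists the files whose extensions match the web-compatible formats mentioned earlier (GIF, JPEG, PNG) and ignores everything else.

```python
from pathlib import Path

# File extensions for the web-compatible formats: GIF, JPEG and PNG.
WEB_IMAGE_SUFFIXES = {".gif", ".jpg", ".jpeg", ".png"}


def find_web_images(folder):
    """Return the web-compatible image files in `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in WEB_IMAGE_SUFFIXES
    )


# Example: list the images Word saved alongside "my_document.htm".
folder = Path("my_document_files")
if folder.is_dir():
    for image in find_web_images(folder):
        print(image.name)
```

Any file the script skips is either not an image or is in a format that would need converting before upload.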

Cures for your Word ills

Here are three simple ways to solve your problems with copying text from MS Word:

  1. TinyMCE’s Paste from Word tool. If you are using the TinyMCE editor (which is almost certainly the case for WordPress users), there is a special button designed to help with this problem, appropriately enough called “Paste From Word”. This button opens a popup window with an area into which you can paste the text copied from a Word document. TinyMCE attempts to remove all unwanted code from the text before inserting it. Most of the time it does a good enough job, but you should be aware that this function is far from perfect: sometimes code remains that can cause trouble.
  2. Back to basics. It’s usually possible to turn off the WYSIWYG function of your editor altogether (in WordPress click on the tab labelled “HTML”), and use a plain text entry box instead. If you copy and paste into this, you’ll get the text you want, but without formatting. You can then switch back into the WYSIWYG mode and add back the formatting manually. How attractive an option this is will depend on how much text you are entering and how it’s formatted, but it’s a safe way to deal with the problem. Another way of achieving the same effect is to use a text editor such as Notepad as an intermediate place to hold the text: copy from Word into Notepad, copy from Notepad into your WYSIWYG editor. Again, you’ll have to put back any formatting you need by hand.
  3. OpenOffice. OpenOffice is a free alternative to Microsoft Office that includes a word processing program called Writer. One of the advantages of using this instead of Word is that it generates much better HTML than Word does, although the text may still include some unwanted formatting. In that case, combining OpenOffice with the Paste from Word tool will often do the trick.
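
For the curious, the “back to basics” route in option 2 amounts to throwing away all markup and keeping only the words. The sketch below approximates that in Python; it is not what Notepad or any editor does internally, and the names `PlainTextExtractor` and `html_to_plain_text`, as well as the choice of which tags end a paragraph, are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: closing any of these tags ends a paragraph in the plain text.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class PlainTextExtractor(HTMLParser):
    """Collects the text of a document and discards all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than zero while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_plain_text(html):
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line between paragraphs
    return text.strip()


# Illustrative Word-style markup (not taken from a real document).
word_html = (
    '<p class="MsoNormal"><span style="color:red">First</span> paragraph.</p>'
    '<p class="MsoNormal">Second &amp; last.</p>'
)
print(html_to_plain_text(word_html))
```

As with the manual approach, bold text, headings and links are all lost, so any formatting you need has to be added back by hand afterwards.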

Please note that I haven’t tested the above solutions with every version of Word and every WYSIWYG editor, so your experience may be different. Hopefully, though, I’ve managed to shed some useful light on what is for many a frustrating problem. As ever, if you have any questions please let me know.

This article was originally published exclusively for subscribers to our free newsletter.

7 Responses

  1. Jim Says
    May 7, 2009 3:01 pm

    Thanks for the write up on this common problem.
    Added to DrupalSightings.com

  2. May 8, 2009 10:00 am

    Thanks Jim, I couldn’t find a decent article about this so I wrote my own!

  3. Adam Says
    March 18, 2010 3:06 pm

    It would be a much nicer idea i think to integrate Openoffice into drupal, so that pages/docs can be automatically saved as a node, with no worries about image associations.

  4. March 18, 2010 3:35 pm

    Interesting idea, Adam, although I wouldn’t like to attempt the integration myself (unless someone has a big pot of money they’d like to spend on it).

    The integration code would have to upload embedded images behind the scenes so they could be turned into HTML IMG tags, and there might be other issues to resolve, but I can see how it might work.

  5. Dane Says
    December 13, 2010 6:27 pm

    If you’re using Drupal, the Office HTML Filter can strip Office-generated HTML gunk, no matter what your choice of editor (TinyMCE, FCKeditor, even posts submitted by mail using Mailhandler)

  6. December 13, 2010 7:15 pm

    Thanks, that sounds very useful. I do have this problem on a couple of Drupal sites so I’ll try it out.
