Plone and Office 2007 / OpenXML file formats

Plone (both v2 and v3) does not currently support indexing the new Office 2007 file formats: docx, xlsx, and pptx. Even TextIndexNG3 3.1.6 doesn't understand them (I assume TNG 3.2.x, which is Plone 3 only, doesn't either). Strangely, numerous Google searches turned up no mention of this issue (though I did find plenty of sites powered by Plone that mention or carry ads for Office 2007 books and software, and catalog sites listing Plone and Office books on the same page).

Since I didn't want to tell my client they would have to save everything in the old format, I started digging into how to make this work. The new format is a zip file containing a bunch of XML files, but unlike ODF, OpenXML stores the content in different places depending on the file type. Word 2007 uses /word/document.xml, while Excel 2007 uses /xl/worksheets/sheet<N>.xml, one file for each worksheet. ODF stores everything in content.xml, which is much easier to parse.
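
To make the layout concrete, here is a minimal sketch of what parsing the files directly could look like. It is plain Python, independent of Plone and TextIndexNG, and it is not what the patch in this post does. The CONTENT_PARTS table and the extract_text function are hypothetical names made up for illustration, and the sketch simply collects every text node from the matching parts. For xlsx that is only a rough approximation, since string cells usually hold an index into /xl/sharedStrings.xml rather than the text itself.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical table: which zip members hold the main content for each format.
CONTENT_PARTS = {
    "docx": re.compile(r"^word/document\.xml$"),
    "xlsx": re.compile(r"^xl/worksheets/sheet\d+\.xml$"),
}


def extract_text(data, kind):
    """Return the text nodes of an OpenXML package, joined by single spaces.

    data is the raw bytes of the file; kind is a key of CONTENT_PARTS.
    """
    pattern = CONTENT_PARTS[kind]
    chunks = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in sorted(package.namelist()):
            if not pattern.match(name):
                continue  # skip styles, properties, relationships, etc.
            root = ET.fromstring(package.read(name))
            # Walk every element in document order and keep non-blank text.
            for element in root.iter():
                if element.text and element.text.strip():
                    chunks.append(element.text.strip())
    return " ".join(chunks)
```

Calling extract_text(open("report.docx", "rb").read(), "docx") on a hypothetical report.docx would return the document's words as one string, which is roughly what an indexer needs.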

Instead of parsing those files myself, I used the ODFConverter from SF.net, which meets my needs on the Windows server the client is running. In the long run, parsing the various XML files directly would be cleaner and platform-neutral, but maybe TNG3 will build that in sooner.

The file below should be unpacked into the root of your Products folder after installing TNG 3.1.6 (the changes should be compatible with any subsequent 3.1.x release). The only file it overwrites is configure.zcml, which has to change to tell Plone that TNG knows how to convert the OpenXML formats.

Patch: TextIndexNG3-openxml.zip

Enjoy! Feedback and suggestions are welcome!

Comments

This wasn't around when I needed it, but glad to see someone else chip in!

(perhaps coming late but...) did you check Products.OpenXml?

http://plone.org/products/openxml

Didn't I see this in a book somewhere? :)
