Sunday, February 22, 2009

OpenXML & Word Doc Manipulation using Sharepoint Data

Call me a lousy blogger but there is no code in this post because I've been busy refactoring it. That's not the only reason. I have been involved in a lot more Sharepoint development at work lately - moreso than .Net, so I've been kind of busy reading up SDKs and playing with APIs.

On that note, I got a chance to get my hands dirty with some OpenXML 2.0. The cool thing about that was that it was @ Microsoft. Not only did I get to play with it, but met the guys that developed parts of it and even wrote 2,000 of the 6,000 pages of documentation - Tristan Davis and Zeyad Rajabi - who helped me a great deal.

The story that I worked on involved gathering data from Sharepoint lists (some of it fed from the BDC(business data catalog)) and using OpenXML to populate a Word template. The real world scenario behind it is at my company, like I'm sure at a lot of other companies, we have administrative assistants who end up generating the documents for meetings. They just sit there copying and pasting data from various data sources and reports into this document. And there are as many documents generated as there are assistants multiplied by weeks in the year. Seriously - these people end up working 60 hour workweeks.

The program utilizing OpenXML to generate these documents for them is actually pretty simple. I had some issues deploying it as a web app onto Sharepoint because of various little issues that arose along the way (running sharepoint on a 64-bit windows 2008 server w/ IIS 7 - some assemblies were 32-bit, so couldn't get it to run; other times, there were IIS7 issues. Note to self: improve knowledge of IIS7). Because of that, I wrote it as a WinForms app.

The interface is pretty boring - it has 2 listboxes.
The first listbox contains key data - in this case, a list of product names. This field is used as a foreign key to relate it to a bunch of other lists - sales data list, product manager list, and even a Sharepoint picture library.
The second listbox contains a list of template documents from a document library that was created.
There is also a textbox for indicating what the new filename is going to be once the document is generated, and a button that generates the document.

So how does this all work?
First of all, none of this would be possible if Microsoft hadn't created the new .docx, .xlsx, .pptx formats in Office 2007. As much as I cursed it for doing so - I have seen the light last week. The new format is purely XML based. Hence, OpenXML allows us to manipulate the hell out of our documents without ever opening up Microsoft Word.

The one caveat is that Microsoft Word's Markup Language (WordML) is a bit cumbersome to learn right off the bat. The good news is that with a bit of blog reading from Brian Jones on OpenXML and an extremely powerful tool that comes with the SDK called DocumentReflector, you will be generating documents in no time. Here is a video by Alistair Spears that gives a demo of this wonderful tool: []

So, how did I get data in from sharepoint lists into a word document? Let's take a look at the process step-by-step.

Traversing Sharepoint Lists/Doc Libraries/Picture Libraries
1. The first thing you will need is to connect to a sharepoint list. The code is the easy part - it's just 2 lines:

protected SPListItemCollection product_owner_items;
protected SPListItemCollection template_doc_items;
protected SPListItemCollection marketing_info_items;
protected SPPictureLibrary product_model_pics;
protected SPSite mysite;
protected SPWeb myweb;
mysite = new SPSite("http://izmqdp6452022/Docs/Lists/Product%20Owners/AllItems.aspx");
myweb = mysite.AllWebs["Docs"];

SPSite's constructor can actually determine what site you are connecting to in your Sharepoint farm based on the URL passed in. You just have to make sure that whichever user your app is running as, must have access to these libraries and to the Sharepoint site.

Here is the code used to get my Sharepoint list objects ready for access:
private void initialize_sharepoint_lists()
  product_owner_items = mysite.AllWebs["Docs"].Lists["Product Owners"].Items;
  marketing_info_items = mysite.AllWebs["Docs"].Lists["Marketing Info List"].Items;
  template_doc_items = mysite.AllWebs["Docs"].Lists["Documents"].Items;
  product_model_pics = (SPPictureLibrary)mysite.AllWebs["Docs"].Lists["Product Model Images"];
In this case, "Docs" is the name of the sharepoint site and the Lists indexers contain the names of the libraries that I am working with in sharepoint.

Once I have the data I need to be inserted into the Word document, I need to create a template in Microsoft Word. This is the easiest part of the operation. You simply open up Microsoft Word, enable the Developer toolbar by right-clicking on the ribbon and insert a text-box field wherever you will have data inserted. Make sure you assign an alias to the text-box - this is how you will access your placeholder for insertion.

Before attempting this exercise, you should find a sample Word 2007 document and open it in DocumentReflector so that you can get an idea of how the XML is formatted because the code below can get confusing REAL quickly if you don't.

The fields that you created using the developer tab can be accessed one of two ways. If the field is on a new line, it will live in a <SdtContentBlock> element. If the field is on the same line as some text, say, a label, it will live in a element. Now, the trick is to locate these elements and pass the data into them.

Here is the method that will take a string, the alias of the field you are manipulating, and a WordProcessingDocument object which contains the open word document. This method will modify a SdtRun block - the element that lives on the same line as other text:

 private void populate_run_control(WordprocessingDocument wordDoc,string alias, string value)
            var contentControls = wordDoc.MainDocumentPart
                            .Where(s => s.SdtProperties.GetFirstChild().Val.Value == alias);
            if (contentControls.Count() != 0)
                var sdt = new SdtContentRun(
                           new Run(
                                new RunProperties(
                                    new RunStyleId() { Val = "PlaceholderText" }),
                                new Text(value)));
                contentControls.First().SdtContentRun.InsertAt(sdt, 0);

The method below has the same task in mind, but modifies an SdtContentBlock element that starts on a new line. Another difference is that it takes in an html Table object, which can be easily formed just like you are used to in ASP.Net/C# - Table, TableRow [tr], TableCell, etc. This could be as simplistic or as complex as you want.

        private void populate_block_table(WordprocessingDocument wordDoc, string alias, Table value)
            var contentControls = wordDoc.MainDocumentPart
                            .Where(s => s.SdtProperties.GetFirstChild().Val.Value == alias);
            if (contentControls.Count() != 0)
                var sdt = new SdtContentBlock(
                contentControls.First().SdtContentBlock.InsertAt(sdt, 0);

I won't bore you with the rest of the code used to open files into MemoryStreams or save them, but if you would like to see the inner workings anyway, I recommend taking a look at Brian Jones's blog post about Office Extensibility: []

Don't hesitate to comment and let me know if you have any questions.


Anonymous said...

s => sourceFile.Contains(s.SdtProperties.Alias.Val.Value)).ToList();

In the above statement i am getting following error.

'DocumentFormat.OpenXml.Wordprocessing.SdtBlock' does not contain a definition for 'SdtProperties' and no extension method 'SdtProperties' accepting a first argument of type 'DocumentFormat.OpenXml.Wordprocessing.SdtBlock' could be found (are you missing a using directive or an assembly reference?)

Max said...

Couple of questions
1) Which version of the OpenXML SDK are you using?
2) What kind of an object is sourceFile?

Also, have you considered getting your list using the code block I posted:
var contentControls = wordDoc.MainDocumentPart
.Document.Descendants<SdtBlock>().Where(s => s.SdtProperties.GetFirstChild<Alias>().Val.Value == alias);

That snippet gets only the first element, but you can easily utilize any of the IEnumerable methods like All<> to get the result you want.

Anonymous said...

Here is my code block, Source and destinationfile are files located in my physical drive.

and iam using Openxml 2.0

MergeSourceDocument(string sourceFile, string destinationFile)
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(destinationFile, true))
MainDocumentPart mainPart = myDoc.MainDocumentPart;
// Find content controls that have the name of the source file as
// an Alias value.
Where(s => sourceFile.Contains(s.SdtProperties.Alias.Val.Value)).ToList();
i removed some of the html tags since this text box not allowing the tags

Note: i don't understand why i am getting compile time error saying SdtProperties doesn't exists in Open XML library.

Max said...

1) Check your lambda expression
2) s.SdtProperties.Alias.Val.Value may not work the way you want it to, so use an IEnumerable method like GetFirstChild<Alias> instead.

In other words, instead of

Where(s => sourceFile.Contains(s.SdtProperties.Alias.Val.Value)).ToList();


Where(s => sourceFile.Contains(s.SdtProperties.GetFirstChild<Alias>.Val.Value)).ToList();

But that will only work if your lambda expression is correct/

RHurst said...

Great post! Do you know how to convert an xml to pdf format? Or maybe a website that has a converter to do it for me?