INSTICC is a company that organizes international scientific conferences. For each conference, INSTICC receives scientific papers on the conference's subject in the form of Microsoft Word documents. The papers are corrected and bundled so they can be distributed among the participants of the conference. The problem is that correcting these papers sometimes takes more than a week of work.
INSTICC has created some macros for Microsoft Word to simplify the correction of the layout of the papers. To use a macro, you select a paragraph and press the corresponding hotkey, which automatically adjusts the layout settings of that paragraph. This is useful, but it is not an ideal solution. The macros can only change the layout, and it still takes too much time to open a document, select the paragraphs one by one, and apply the macros.
To solve this problem, INSTICC started a project to correct the papers automatically. In a perfect scenario, a paper would be corrected completely with a single click on a button. That scenario will probably never be reached, mainly because the papers come from many different authors and the layout problems in the documents vary widely.
The project is being developed in C# using Visual Studio. It is an open-ended project: the idea is to create functions that correct as many problems in the papers as possible. To make these functions possible, INSTICC has built a basic API that transforms Microsoft Word documents into the XML form of Word, called WordprocessingML or WordML. WordML represents a Microsoft Word document in a form that is easier for programmers to read and process. The API also includes functions to extract data from the transformed Microsoft Word documents.
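To illustrate what extracting data from a WordML document can look like, here is a minimal sketch. It is written in Python rather than the project's C#, it does not reproduce INSTICC's API, and the function name `paragraph_texts` and the sample document are hypothetical. It assumes the Word 2003 WordML namespace, in which paragraphs are `w:p` elements and text is stored in `w:t` elements inside runs (`w:r`).

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordML (assumed here; newer .docx files use a different one).
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Hypothetical sample: a styled heading followed by a body paragraph of three runs.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>1 INTRODUCTION</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Papers arrive as </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Word</w:t></w:r>
      <w:r><w:t> documents.</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def paragraph_texts(wordml):
    """Return the plain text of every paragraph in a WordML string."""
    root = ET.fromstring(wordml)
    texts = []
    # iter() walks the whole tree, so paragraphs nested in wrapper
    # elements are found as well as direct children of w:body.
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is the concatenation of all its w:t elements.
        texts.append("".join(t.text or "" for t in paragraph.iter(W + "t")))
    return texts


if __name__ == "__main__":
    for text in paragraph_texts(SAMPLE):
        print(text)
```

Note that one visible sentence can be split over several runs whenever the formatting changes, which is why the text is joined per paragraph rather than read run by run.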
The biggest problems that will be solved include formatting the reference section of the papers, formatting and numbering the chapters, correcting the layout of the title page, and positioning images and tables in the text. The greatest difficulty in creating functions to solve these problems is isolating the relevant parts of the papers. There are many ways to create a paragraph, from applying styles to manually adjusting the spacing and font size. As a result, two paragraphs can look very different in the WordML code even though both are, for example, section titles. For this reason the functions will not succeed for every paper, and they can always be adjusted further to handle more cases.
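The difficulty of isolating section titles can be made concrete with a small sketch. Again this is Python rather than the project's C#, and it is not the method used in the thesis: the function `looks_like_section_title`, the "bold and at least 12 pt" rule, and the sample paragraphs are all assumptions chosen for illustration. The sketch accepts a paragraph either because it carries a heading style or because every text run in it was manually made bold and large.

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordML (assumed).
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

# Hypothetical sample: a styled title, a manually formatted title, and body text.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>1 INTRODUCTION</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:rPr><w:b/><w:sz w:val="24"/></w:rPr><w:t>2 RELATED WORK</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Ordinary body text.</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def looks_like_section_title(paragraph, min_half_points=24):
    """Heuristic (an assumption, not the thesis's rule) for spotting a title."""
    # Case 1: the author used a named heading style.
    style = paragraph.find(W + "pPr/" + W + "pStyle")
    if style is not None and style.get(W + "val", "").lower().startswith("heading"):
        return True

    # Case 2: manual formatting. Require every text run to be bold and
    # at least min_half_points in size (w:sz is given in half-points).
    runs = [r for r in paragraph.iter(W + "r") if r.find(W + "t") is not None]
    if not runs:
        return False
    for run in runs:
        props = run.find(W + "rPr")
        if props is None:
            return False
        bold = props.find(W + "b")
        if bold is None or bold.get(W + "val", "on") == "off":
            return False
        size = props.find(W + "sz")
        value = size.get(W + "val", "") if size is not None else ""
        if not value.isdigit() or int(value) < min_half_points:
            return False
    return True


flags = [looks_like_section_title(p) for p in ET.fromstring(SAMPLE).iter(W + "p")]
print(flags)  # [True, True, False]
```

A rule like this will misfire on real papers, for example on a bold, enlarged sentence inside the body text, which is the kind of case that makes per-paper adjustment of the functions necessary.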
If you want to cite this thesis in your own thesis, paper, or report, use this format (APA):
VAN LERSBERGHE, K. (2009). Word File Validation.
Unpublished thesis, Xios, IWT.