I am trying to read through a Word document using Open XML.
I am looking for key tags within the document in order to identify the values I need to pick up from it.
I loop through each paragraph, and then each run within the paragraph, to find these.
However, the spelling and grammar check appears to be causing problems: it splits up the runs around any errors it identifies with "ProofError" elements, which makes it difficult to parse the document correctly.
I have tried removing all ProofError elements and saving the document, but they appear to come back.
If I run the spelling and grammar check manually within MS Word there is no issue, but this isn't practical.
Does anyone know a way I can get around this?
Sample text from doc:
Communication System: UID 0, CW (0); Frequency: 900 MHz;Duty Cycle: 1:1
Medium: 900MHz HSL Medium parameters used: f = 900 MHz; σ = 0.979 S/m; εr = 40.68; ρ = 1000 kg/m3
Code used to explore the document
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(openFileDialog.FileName, false))
{
    // start looking through the file here
    // correct proof errors here
    Body body = wordDocument.MainDocumentPart.Document.Body;
    foreach (Paragraph p in body.OfType<Paragraph>())
    {
        p.GetType();
        List<ProofError> errList = new List<ProofError>();
        foreach (ProofError err in p.OfType<ProofError>())
        {
            errList.Add(err);
        }
        foreach (ProofError err in errList)
        {
            err.Remove();
        }
    }
    wordDocument.Save();
}
The code above removes the ProofError elements from each paragraph. I hoped that doing this and saving would merge any similar runs together, but the proof errors come back when saving.
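For illustration only, here is a minimal sketch of reading a paragraph's text without depending on how it is split into runs. It works on the raw paragraph XML with System.Xml.Linq rather than the Open XML SDK, and the class name ParagraphText is hypothetical. It relies on the fact that proofErr markers sit between runs and carry no text themselves, so joining every w:t element in document order gives the same string whether or not the markers are present. In the Open XML SDK, the InnerText property of a Paragraph should give a comparable result.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class ParagraphText
{
    // Main WordprocessingML namespace used by document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Joins the text of every w:t element inside the paragraph, in document
    // order. w:proofErr elements contain no w:t, so they drop out naturally.
    public static string Read(XElement paragraph)
    {
        return string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
    }
}
```

Searching for the key tags in the joined string, instead of in individual runs, sidesteps the splitting. Tabs, line breaks and deleted text are not covered by this sketch.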
Screenshot below should show you the children of a paragraph.
Link to an example document which throws up errors. These are due to the language being incorrect, but I have no control over the format coming in to me, and there will be other errors thrown up unrelated to language.
Sample File
Related
So I'm doing Google Code Jam, and for their new format I have to upload my code as a single text file.
I like writing my code as properly constructed classes in multiple files even when under time pressure (I find that I save more time in clarity and debugging speed than I lose), and I want to re-use the common code.
Once I've got my code finished I have to convert from a series of classes in multiple files, to a single file.
Currently I'm just manually copying and pasting all the files' text into a single file, and then manually massaging the usings and namespaces to make it all work.
Is there a better option?
Ideally a tool that will JustDoIt for me?
Alternatively, is there some predictable algorithm that I could implement that wouldn't require any manual tweaks?
Write your classes so that all "using"s are inside "namespace"
Write a script which collects all *.cs files and concatenates them
This is probably not the optimal way to do this, but here is an algorithm which can do what you need:
loop through every file and grab every line starting with "using" -> write them to a temp file/buffer
check for duplicates and remove them
get the position of the first '{' after the charsequence "namespace"
get the position of the last '}' in the file
append the text in between these two positions onto a temp file/buffer
append the second file/buffer to the first one
write out the merged buffer
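The steps above could be sketched roughly as follows. This is a naive illustration, not a robust tool: the SourceMerger name is hypothetical, it takes file contents as strings rather than reading from disk, it assumes every file has its using directives at the top followed by exactly one namespace block, and it finds that block with a plain text search for "namespace".

```csharp
using System;
using System.Collections.Generic;

public static class SourceMerger
{
    // Returns one source text: the deduplicated using directives of all
    // inputs, followed by the contents of each input's namespace block.
    public static string Merge(IEnumerable<string> sources)
    {
        var usings = new SortedSet<string>(StringComparer.Ordinal);
        var bodies = new List<string>();

        foreach (string source in sources)
        {
            // First '{' after "namespace" and last '}' in the file.
            int ns = source.IndexOf("namespace", StringComparison.Ordinal);
            int open = ns < 0 ? -1 : source.IndexOf('{', ns);
            int close = source.LastIndexOf('}');
            if (open < 0 || close <= open)
                throw new ArgumentException("Expected a namespace block in every file.");

            // Collect using directives that appear before the namespace;
            // the set removes duplicates across files.
            foreach (string line in source.Substring(0, ns).Split('\n'))
            {
                string trimmed = line.Trim();
                if (trimmed.StartsWith("using ", StringComparison.Ordinal))
                    usings.Add(trimmed);
            }

            // Keep only the text between the two braces.
            bodies.Add(source.Substring(open + 1, close - open - 1).Trim());
        }

        return string.Join("\n", usings) + "\n\n" + string.Join("\n\n", bodies) + "\n";
    }
}
```

Because the namespace wrappers are dropped, all types end up in the global namespace, so any using directive that points at one of your own namespaces would still have to be removed by hand or filtered out.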
It is very subjective. I see the algorithm as the following in pseudo code:
var usingsLines = new HashSet<string>();
var newFile = new StringBuilder();
foreach (var file in listOfFiles)
{
    var textFromFile = file.ReadToEnd();
    // GetUsings and GetBody are placeholders for your own parsing helpers
    usingsLines.UnionWith(textFromFile.GetUsings());
    newFile.AppendLine(textFromFile.GetBody());
}
var result = string.Join(Environment.NewLine, usingsLines) + Environment.NewLine + newFile;
// As a result it will have something like this
// using usingsfromFirstFile;
// using usingsfromSecondFile;
//
// namespace FirstFileNamespace
// {
// ...
// }
//
// namespace SecondFileNamespace
// {
// ...
// }
But keep in mind that this approach can lead to conflicts if two different namespaces contain classes with the same name, and so on. To solve that you need to fix it manually, or get rid of the using directives and use fully qualified names with namespaces.
Also these few links may be useful:
Merge files,
Merge file in Java
I have code (shown below) that loops through all the paragraphs and finds a specific string. When I find the string, I print out the automated number of the header it belongs to. The problem is that I want the header number as well and can't figure out how to get it. Code here:
Edit: Cleaned up code with working content
string LastHeader = "";
foreach (Paragraph paragraph in aDoc.Paragraphs)
{
    Microsoft.Office.Interop.Word.Style thisStyle =
        (paragraph.Range.get_Style() as Microsoft.Office.Interop.Word.Style);
    if (thisStyle != null && thisStyle.NameLocal.StartsWith("Heading 2"))
    {
        LastHeader = paragraph.Range.ListFormat.ListString + " " +
            paragraph.Range.Text.Replace("\"", "\"\"").Replace("\r", "");
    }
    else
    {
        string content = paragraph.Range.Text;
        for (int i = 0; i < textToSearch.Count; i++)
        {
            if (content.Contains(textToSearch[i]))
            {
                // Do stuff here
            }
        }
    }
}
Every time I re-read your information I get the feeling I'm not understanding completely what you're asking, so the following may contain more information than you're looking for...
To get the numbering of a specific Paragraph:
paragraph.Range.ListFormat.ListString
Word has some built-in bookmarks that can give you information that is otherwise a lot of work to determine. One of these is \HeadingLevel. This gets the first paragraph formatted with a Heading style that precedes a SELECTION. And that's the sticky part, because we don't really want to be making selections... But it dates back to the old WordBasic days when code mimicked user actions, so we're stuck with that.
textToAnalyse.Select
aDoc.Bookmarks("\HeadingLevel").Range.ListFormat.ListString
The bookmark call, itself, will NOT change the selection.
The other option I see, since you're already looping the Paragraphs collection, would be to check the paragraph.Range.Style and, if it begins with the string "Heading" (assuming that's not used otherwise in these documents' styles) save the ListFormat.ListString so that you can call on it if your condition is met.
I do have to wonder, however, why you're "walking" the paragraphs collection and not using Word's built-in Find capability as that would be much faster. It can search text AND (style) formatting.
I am working on ranking images. Initially I built a data set to store image metadata. I ran into a problem when extracting the metadata of the images: I am able to extract all of the metadata except the "tags" field, which I need in order to rank the images.
I am attaching a link to a similar post, but it is in MATLAB.
Extracting meta data "tags field"
I only need the information which is encircled in red.
Simply put by means of meme: "One does not simply extract tag metadata", apparently.
From what I have discovered, the Tag metadata that gets set when on the properties page of a JPG, from within Explorer, is relatively easy to get.
They are in the PropertyItem with ID 9C9E (or 40094 in Decimal).
This PropertyItem's Value is a byte[] with Unicode characters and is null terminated.
Below is a method to extract the tags (they are separated by ";", so you can adjust the method to return a list of them split up if you want):
private string ReadBasicTags(string filename)
{
    string foundTags = string.Empty;
    using (Image inputImage = new Bitmap(filename))
    {
        try
        {
            PropertyItem basicTag = inputImage.GetPropertyItem(40094); // Hex 9C9E
            if (basicTag != null)
            {
                foundTags = Encoding.Unicode.GetString(basicTag.Value).Replace("\0", string.Empty);
            }
        }
        // ArgumentException is thrown when GetPropertyItem(int) is not found
        catch (ArgumentException)
        {
            // finalOutput = "Tags not found";
        }
    }
    return foundTags;
}
With this sample image you should get a string output of
TagOne;TagTwo:
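If you would rather have the tags as separate strings, the decoding and splitting step could look something like the sketch below. The TagParser name is hypothetical, and it only assumes what was described earlier: the value is a null-terminated byte array of Unicode (UTF-16LE) characters with ";" between tags.

```csharp
using System;
using System.Linq;
using System.Text;

public static class TagParser
{
    // Decodes the raw property value and returns the individual tags.
    public static string[] Split(byte[] value)
    {
        // Decode UTF-16LE and drop the trailing null terminator.
        string text = Encoding.Unicode.GetString(value).TrimEnd('\0');

        // Split on ';', trim stray spaces and skip empty entries.
        return text.Split(';')
                   .Select(tag => tag.Trim())
                   .Where(tag => tag.Length > 0)
                   .ToArray();
    }
}
```

You would pass it the Value of the PropertyItem instead of building the single string.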
However, you may have noticed that I am talking about basic tag information, meaning tags that are set in Windows Explorer itself. They are easy to get.
If you were to run the above code on either this sample or this sample, you would not get any data back. This is because of the huge number of different ways that metadata can be stored inside a JPEG by all the various tools and hardware.
If you want to have an idea of how many different types of metadata and all the different names of the tags and their formats, head on over to the ExifTool by Phil Harvey, explore a bit on that page, especially the "Tag Names" page and you are sure to get a headache of the sheer amount of different tags.
Now you are probably wondering if you should even delve deeper into the world of metadata tag extraction or if you could make a wrapper for that tool to integrate it into C# (which apparently some have done, but requires the tool to be present and other things, check under the Additional Resources section of the tool's page for info).
Ah, but fear not! For someone has figured out a simpler way, in C#, to extract the right information so that you can use the metadata tags alone.
I found this little GitHub project (JpegMetadata by Marcel Wijnands) while trying to wrap my head around the enormity that is Exif metadata. This little library makes it easy to extract the metadata tags and works on all the samples that I have linked so far.
You can install it into your project with NuGet if you are using Visual Studio.
The resulting List<string> that you get when calling the library gives you each tag separately.
As examples, the image Arena Chapel frescoes (example027.jpg) and Getty Villa (GettyVilla0001.JPG) produce the below lists with the shown code:
JpegMetadataAdapter metaAdapter = new JpegMetadataAdapter(@"C:\Dev\example027.JPG");
foreach (string item in metaAdapter.Metadata.Keywords)
{
    outputString += string.Format("{0}{1}", item, Environment.NewLine);
}
Medieval
Italian
paintings (visual works)
frescoes (paintings)
fresco painting (technique)
allegorical
architectural interiors
cycles or series
New Testament
Old Testament and Apocrypha
saints
Jesus Christ
Mary, Blessed Virgin
Saint
Christian iconography
the Passion
Judas Iscariot
disciples
JpegMetadataAdapter metaAdapter = new JpegMetadataAdapter(@"C:\Dev\GettyVilla0001.JPG");
foreach (string item in metaAdapter.Metadata.Keywords)
{
    outputString += string.Format("{0}{1}", item, Environment.NewLine);
}
sunset
shadows
mural painting
peristyles (colonnades)
trompe-l'oeil
greek
roman
Getty Villa
Both of these images store their metadata tags differently, so it is fair to say that with that library you should be covered by most of the versions of the metadata tags in jpegs.
I wrote a utility for another team that recursively goes through folders and converts the Word docs found to PDF by using Word Interop with C#.
The problem we're having is that the documents were created with date fields that update to today's date before they get saved out. I found a method to disable updating fields before printing, but I need to prevent the fields from updating on open.
Is that possible? I'd like to do the fix in C#, but if I have to do a Word macro, I can.
As described in Microsoft's endless maze of documentation you can lock the field code. For example in VBA if I have a single date field in the body in the form of
{DATE \# "M/d/yyyy h:mm:ss am/pm" \* MERGEFORMAT }
I can run
ActiveDocument.Fields(1).Locked = True
Then if I make a change to the document, save, then re-open, the field code will not update.
Example using C# Office Interop:
Word.Application wordApp = new Word.Application();
Word.Document wordDoc = wordApp.ActiveDocument;
wordDoc.Fields.Locked = 1; // it's apparently an Int32 rather than a bool
You can place the code in the DocumentOpen event. I'm assuming you have an add-in which subscribes to the event. If not, clarify, as that can be a battle on its own.
EDIT: In my testing, locking fields in this manner locks them across all StoryRanges, so there is no need to get the field instances in headers, footers, footnotes, textboxes, ..., etc. This is a surprising treat.
Well, I didn't find a way to do it with Interop, but my company did buy Aspose.Words and I wrote a utility to convert the Word docs to TIFF images. The Aspose tool won't update fields unless you explicitly tell it to. Here's a sample of the code I used with Aspose. Keep in mind, I had a requirement to convert the Word docs to single page TIFF images and I hard-coded many of the options because it was just a utility for myself on this project.
private static bool ConvertWordToTiff(string inputFilePath, string outputFilePath)
{
    try
    {
        Document doc = new Document(inputFilePath);
        for (int i = 0; i < doc.PageCount; i++)
        {
            ImageSaveOptions options = new ImageSaveOptions(SaveFormat.Tiff);
            options.PageIndex = i;
            options.PageCount = 1;
            options.TiffCompression = TiffCompression.Lzw;
            options.Resolution = 200;
            options.ImageColorMode = ImageColorMode.BlackAndWhite;
            var extension = Path.GetExtension(outputFilePath);
            var pageNum = String.Format("-{0:000}", (i+1));
            var outputPageFilePath = outputFilePath.Replace(extension, pageNum + extension);
            doc.Save(outputPageFilePath, options);
        }
        return true;
    }
    catch (Exception ex)
    {
        LogError(ex);
        return false;
    }
}
I think a new question on SO is appropriate then, because this will require XML processing rather than just Office Interop. If you have both .doc and .docx file types to convert, you might require two separate solutions: one for WordML (Word 2003 XML format), and another for OpenXML (Word 2007/2010/2013 XML format), since you cannot open the old file format and save as the new without the fields updating.
Inspecting the OOXML of a locked field shows us this w:fldLock="1" attribute. This can be inserted using appropriate XML processing against the document, such as through the OOXML SDK, or through a standard XSLT transform.
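As a rough, untested-against-Word sketch of that XML processing, the following uses System.Xml.Linq to set the attribute on the main document XML once it has been loaded. The FieldLocker name is hypothetical, and getting document.xml out of the .docx package and writing it back (for example with the Open XML SDK or System.IO.Packaging) is left out. It assumes fields appear either as w:fldSimple elements or as complex fields whose first w:fldChar has w:fldCharType="begin", both of which accept w:fldLock.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class FieldLocker
{
    // Main WordprocessingML namespace used by document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Sets w:fldLock="1" on every field and returns how many were marked.
    public static int LockFields(XElement root)
    {
        // Simple fields carry the attribute on the w:fldSimple element.
        var simpleFields = root.Descendants(W + "fldSimple");

        // Complex fields carry it on the w:fldChar that begins the field.
        var complexFields = root.Descendants(W + "fldChar")
            .Where(e => (string)e.Attribute(W + "fldCharType") == "begin");

        var targets = simpleFields.Concat(complexFields).ToList();
        foreach (XElement field in targets)
        {
            field.SetAttributeValue(W + "fldLock", "1");
        }
        return targets.Count;
    }
}
```

Fields in headers, footers and footnotes live in separate parts of the package, so the same call would need to be repeated for each of those parts.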
Might be helpful: the how-do-i-unlock-a-content-control-using-the-openxml-sdk-in-a-word-2010-document question describes a similar situation, but for Content Controls. You may be able to apply the same solution to fields, if the Lock and LockingValues types apply the same way to fields. I am not certain of this, however.
To give more confidence that this is the way to do it, see example of this vendor's solution for the problem. If you need to develop this in-house, then openxmldeveloper.org is a good place to start - look for Eric White's examples for manipulating fields such as this.
I have been looking around for this but couldn't find an answer anywhere, so I hope someone here can help.
I am writing a WinForms application in C# in which I use WordApplication.CompareDocuments to compare two documents and get a result document with the changes marked as Revisions.
This works well and, apart from the revisions hiding stuff inside textboxes (which I don't care about just yet), I get exactly what I want.
So the next step is to count how many words got revised, specifically wdRevisionDelete and wdRevisionInsert.
The only problem is that final.Revisions is sometimes empty or contains enormous amounts of data (over 500 words).
I read on the MSDN page for Revisions.Count that document.Revisions won't show all revisions, only the ones in the main story, and that I must use a range, but that didn't help.
Here's my current code:
using Word = Microsoft.Office.Interop.Word;
And
foreach (Word.Section s in final.Sections)
{
    foreach (Word.Revision r in s.Range.Revisions)
    {
        counter += r.Range.Words.Count;
        if (r.Type == Word.WdRevisionType.wdRevisionDelete)
            delcnt += r.Range.Words.Count;
        if (r.Type == Word.WdRevisionType.wdRevisionInsert)
            inscnt += r.Range.Words.Count;
    }
}
final is the Word document created by WordApplication.CompareDocuments.
So, as I said, and according to MSDN, I use range.Revisions instead of document.Revisions, and go section by section.
However, one document with half a dozen revisions shows none, while others show hundreds.
So my question is: how do I use Revisions to count added and deleted words?
I have opened the documents that CompareDocuments creates in Word 2007, and the revisions are correctly marked and can be accepted or rejected inside Word.
Any ideas on what I might be overlooking?
EDIT: I have noticed something odd. When I try to save the original doc files as txt (the ones reporting 0 changes, although CompareDocuments correctly marks a few), not all pages get saved to the txt file, and the missing pages include all the areas with revisions.
I tried converting to txt using both Word 2007 and LibreOffice 3.3, and both give the same result (lots of text missing).
It might be related somehow.
I wonder what is wrong with these files.
Any ideas?
Well, apparently there's nothing wrong with that code, and it works on simpler files.
There's just something odd about the files I was testing on.
As my edit says, I can't even save them as txt files properly.
If anyone knows what could cause this, let me know. Meanwhile, this one is solved as a problem with the Word document files.