How to maintain style formatting when merging two ODT documents together - c#

I am working with the AODL library for C#. So far I have been able to wholesale import the text of the second document into the first. The issue is I can't quite figure out what I need to grab to make sure the styling is also moved over to the merged document. Below is the simple code I'm using to test. The closest answer I can find is Merging two .odt files from code, which somewhat answers my question, but it still doesn't tell me where I need to put the styling/ where to get it from. It at least lets me know that I need to go through the styles in the second document and make sure there are not matching names in the first otherwise there will be conflicts. I'm not sure exactly what to do, and documentation has been very slim. Before you suggest anything I would like to let you know that, yes, odt is the filetype I need to work with, and doing any kind of interop stuff like Microsoft does with Word is not what I'm after. If there is another library out there that works similarly to AODL I'm all ears.
TextDocument mergeTemplateDoc = ReadContentsOfFile(mergeTemplateFileName);
TextDocument vehicleTemplateDoc = ReadContentsOfFile(vehicleTemplateFileName);
foreach (IContent piece in vehicleTemplateDoc.Content)
{
XmlNode newNode = mergeTemplateDoc.XmlDoc.ImportNode(piece.Node,true);
Paragraph p = ParagraphBuilder.CreateParagraphWithExistingNode(mergeTemplateDoc, newNode);
mergeTemplateDoc.Content.Add(p);
}
mergeTemplateDoc.SaveTo("MergComplete.odt");

Here is what I ended up doing to solve my issue. Keep in mind I have since migrated to using Java since this question was asked, as the library appears to work a little better in that language.
Essentially what the methods below are doing is Grabbing the Automatic Styles that are generated in each document. It iterates through the second document and finds each style node, checking for the name attribute. That name is then tagged with an extra identifier that is unique to that document, so when they are merged together they won't conflict name wise.
The mergeFontTypesToPrimaryDoc just grabs the fonts that don't already exist in the primary doc since all the fonts are referenced in the same way in the documents there is no editing to be done.
The updateNodeChildrenStyleNames is just a recursive method that I used to make sure I get all the in line style nodes updated to remove any conflicting names between the two documents.
This similar idea should work in C# as well.
private static void mergeStylesToPrimaryDoc(OdfTextDocument primaryDoc, OdfTextDocument secondaryDoc) throws Exception {
OdfFileDom primaryContentDom = primaryDoc.getContentDom();
OdfOfficeAutomaticStyles primaryDocAutomaticStyles = primaryDoc.getContentDom().getAutomaticStyles();
OdfOfficeAutomaticStyles secondaryDocAutomaticStyles = secondaryDoc.getContentDom().getAutomaticStyles();
//Adopt style nodes from secondary doc
for(int i =0; i<secondaryDocAutomaticStyles.getLength();i++){
Node style = secondaryDocAutomaticStyles.item(i).cloneNode(true);
if(style.hasAttributes()){
NamedNodeMap attributes = style.getAttributes();
for(int j=0; j< attributes.getLength();j++){
Node a = attributes.item(j);
if(a.getLocalName().equals("name")){
a.setNodeValue(a.getNodeValue()+_stringToAddToStyle);
}
}
}
if(style.hasChildNodes()){
updateNodeChildrenStyleNames(style, _stringToAddToStyle, "name");
}
primaryDocAutomaticStyles.appendChild(primaryContentDom.adoptNode(style));
}
}
private static void mergeFontTypesToPrimaryDoc(OdfTextDocument primaryDoc, OdfTextDocument secondaryDoc) throws Exception {
//Insert referenced font types that are not in the primary document you are merging into
NodeList sdDomNodes = secondaryDoc.getContentDom().getChildNodes().item(0).getChildNodes();
NodeList pdDomNodes = primaryDoc.getContentDom().getChildNodes().item(0).getChildNodes();
OdfFileDom primaryContentDom = primaryDoc.getContentDom();
Node sdFontNode=null;
Node pdFontNode=null;
for(int i =0; i<sdDomNodes.getLength();i++){
if(sdDomNodes.item(i).getNodeName().equals("office:font-face-decls")){
sdFontNode = sdDomNodes.item(i);
break;
}
}
for(int i =0; i<pdDomNodes.getLength();i++){
Node n =pdDomNodes.item(i);
if(n.getNodeName().equals("office:font-face-decls")){
pdFontNode = pdDomNodes.item(i);
break;
}
}
if(sdFontNode !=null && pdFontNode != null){
NodeList sdFontNodeChildList = sdFontNode.getChildNodes();
NodeList pdFontNodeChildList = pdFontNode.getChildNodes();
List<String> fontNames = new ArrayList<String>();
//Get list of existing fonts in primary doc
for(int i=0; i<pdFontNodeChildList.getLength();i++){
NamedNodeMap attributes = pdFontNodeChildList.item(i).getAttributes();
for(int j=0; j<attributes.getLength();j++){
if(attributes.item(j).getLocalName().equals("name")){
fontNames.add(attributes.item(j).getNodeValue());
}
}
}
//Check each font in the secondary doc to make sure it gets added if the primary doesn't have it
for(int i=0; i<sdFontNodeChildList.getLength();i++){
Node fontNode = sdFontNodeChildList.item(i).cloneNode(true);
NamedNodeMap attributes = fontNode.getAttributes();
String fontName="";
for(int j=0; j< attributes.getLength();j++){
if(attributes.item(j).getLocalName().equals("name")){
fontName = attributes.item(j).getNodeValue();
break;
}
}
if(!fontName.equals("") && !fontNames.contains(fontName)){
pdFontNode.appendChild(primaryContentDom.adoptNode(fontNode));
}
}
}
}
private static void updateNodeChildrenStyleNames(Node n, String stringToAddToStyle, String nodeLocalName){
NodeList childNodes = n.getChildNodes();
for (int i=0; i< childNodes.getLength(); i++){
Node currentChild = childNodes.item(i);
if(currentChild.hasAttributes()){
NamedNodeMap attributes = currentChild.getAttributes();
for(int j =0; j < attributes.getLength(); j++){
Node a = attributes.item(j);
if(a.getLocalName().equals(nodeLocalName)){
a.setNodeValue(a.getNodeValue() + stringToAddToStyle);
}
}
}
if(currentChild.hasChildNodes()){
updateNodeChildrenStyleNames(currentChild, stringToAddToStyle, nodeLocalName);
}
}
}
}

I do not know how precisely it should be coded, but using 7zip i have been able to just copy the whole styles.xml from one file to another. Programatically it should be just as easy.
I always format my files with styles and never with direct formatting. So just replacing any file is prone to eliminate the local styles.
I found this answer (to the question "Cleaning a stylesheet of unused styles") https://www.mobileread.com/forums/showpost.php?s=cbbee08a1204df71ec5cd88bcf222253&p=2100914&postcount=13
which iterates through all the styles in one document. It doesn't show how to incorporate one into the other, but the backbone is clear.
'---------------------------------------------------------- 03/02/2012
' Supprimer les styles personnalisés inutilisés
' d'un document texte ou d'un classeur
'---------------------------------------------------------------------
sub stylesPersoInutiles()
dim coStylesFamilles as object, oStyleFamille as object
dim oStyle as object, nomFamille as string
dim f as long, x as long
dim ts(), buf as string, iRet as integer
const SEP = ", "
coStylesFamilles = thisComponent.StyleFamilies
for f = 0 to coStylesFamilles.count -1
' Pour chaque famille
nomFamille = coStylesFamilles.elementNames(f)
oStyleFamille = coStylesFamilles.getByName(nomFamille)
buf = ""
for x = 0 to oStyleFamille.Count -1
' Pour chaque style
oStyle = oStyleFamille(x)
'xray oStyle
if (oStyle.isUserDefined) and (not oStyle.isInUse) then
buf = buf & oStyle.name & SEP
end if
next x
if len(buf) > len(SEP) then
buf = left(buf, len(buf) - len(SEP))
iRet = msgBox("Styles personnalisés non utilisés : " _
& chr(13) & buf & chr(13) & chr(13) _
& "Faut-il les détruire ?", 4+32+256, nomFamille)
if iRet = 6 then
ts = split(buf, SEP)
for x = 0 to uBound(ts)
oStyleFamille.removeByName(ts(x))
next x
end if
end if
next f
end sub

Related

Reading the Huffman encoding tree in C#

I'm trying to build a Huffman encoder that can read the bytes out of a text file and convert them to Huffman codes. And I need to be able to decode.
So far I've been able to read the bytes from a file and put them in an tree that is based on a double-linked list representing the Huffman encoding. I have a text file with this text: "abcabcabd". This is the tree I generate:
My goal is to read out the tree and put the whole output in an array. My expected array would look like this: "1, 01, 000, 1, 01, 000, 1, 01, 001"
The only problem is I have no idea how to determine where what letter is positioned. I want to do it using the 1/0 method.
So my question here is: how do I determine where the 'byte' codes are located. Would I do it in a while loop? I want it to be variable so creating an library isn't possible. I have tried this:
First I create my tree:
public static void mBitTree()
{
W = H;
if (W.N != null)
{
int Fn = W.A;
int Sn = W.N.A;
int Cn = Fn + Sn;
cZip n = new cZip(0, Cn);
//First Set new node to link to subnodes.
n.L = W;
n.R = W.N;
if (W.N.N != null)
{
n.N = H.N.N;
H.N.N.P = n;
H.N.N = n;
//Safe and set new head and fix links.
W = H.N.N;
H.N.N = null;
H.N.P = null;
H.N = null;
H = W;
}
//this means there were 2 nodes left. so the newly created one will become Head ant Tail and the tree is complete.
else
{
H.N.P = null;
H.N = null;
H = n;
T = n;
}
}
else
{
return;
}
}
The top node from where I start is H and T. Also it has an L and R.
The first thing needed to be checked is H.L is not equal to current. Then go to H.R.L, then H.R.R.L and so on. If I have a bigger text file this needs to work as well. So I created something like this:
public static string[] mCodedchain(uint[] count, byte[] bytesInFile)
{
int counter = 0;
for (int i = 0; i < count.Length; i++) { if (count[i] != 0) counter = counter + (int)count[i]; }
string[] codedArray = new string[counter];
for (int i = 0; i < bytesInFile.Length; i++)
{
int number = 0;
while ((int)bytesInFile[i] != number )
{
number = H.L.V //and if not H.L Add R in between to go to H.R.L.V
}
}
return codedArray;
}
But have no clue as to how I would build up the while loop. I've tried a lot of things but this one seems the most valid to use, but I can't get it to work.
I'm not quite sure if this question is clear. But I hope it is.
Thanks in advance, and happy coding.
Decoding static Huffman data is actually quite simple.
What you need:
The node tree with symbols (letters/bytes in your case) attached in the same configuration as were used during encoding
A way to read bits from the stream, one bit at a time
With that in place, here's pseudocode for doing the decoding:
<node> ← <root>
WHILE MORE
<bit> ← <ReadBit()>
IF <bit> IS 1 THEN
<node> ← <node.LEFT>
ELSE
<node> ← <node.RIGHT>
END IF
IF <node.SYMBOL> THEN
OUTPUT <node.SYMBOL> AS DECODED SYMBOL
<node> ← <root>
END IF
END WHILE
Or in plain English:
Start at the root node.
Read 1 bit from the stream. If the bit is 1, go to the left child of the current node, otherwise go to the right child.
Whenever you hit a node with a symbol in it, output the symbol and go back to the root node
Go back to "read 1 bit" and keep going until you've decoded the entire stream
NOTE! You need to know how many symbols to output separately. The reason for this is that the very last byte in the encoded stream might have extra bits, and if you decode these as well you might end up with extra symbols, compared to the original data before encoding.

Custom Queue Class Iteration and Data Retrieval C#

So I'm working in Visual Studio 2015 with a few custom classes. One of which is called MinPriorityQueue, and it is a priority queue that, in this situation, allows me to retrieve the object of MinimumPriority in the queue via a property MinimumPriority. There is also a method called RemoveMinimumPriority, which is self-explanatory.
I am not allowed to modify this method, it was pre-made for us for this assignment, otherwise I would have already found a simple solution.
My program is meant to compare two text files, and return a value based off a certain equation which isn't important as far as this post goes. The problem I am having is within my UserInterface code. Here is my click event for the 'Analyze' button on my GUI.
private void uxAnalyze_Click(object sender, EventArgs e)
{
Dictionary<string, StoreWord> dictionary = new Dictionary<string, StoreWord>();
const int _numFiles = 2;
MinPriorityQueue<float, StoreInfo> minQueue = new MinPriorityQueue<float, StoreInfo>();
int numWords1 = 0;
int numWords2 = 0;
//Process Both Input Files
using (StreamReader sr = new StreamReader(uxTextBox1.Text))
{
for (int i = 0; i < _numFiles; i++)
{
if (i == 0)
{
dictionary = ReadFile(dictionary, uxTextBox1.Text, i, out numWords1);
}
if (i == 1)
{
dictionary = ReadFile(dictionary, uxTextBox2.Text, i, out numWords2);
}
}
}
int[] numWords = new int[2];
numWords[0] = numWords1;
numWords[1] = numWords2;
//Get 50 Words with Highest Combined Frequencies
foreach(var entry in dictionary.Values)
{
StoreInfo freq = new StoreInfo(entry, numWords);
minQueue.Add(freq, Convert.ToSingle(entry[0] + entry[1]));
if(minQueue.Count > 50)
{
minQueue.RemoveMinimumPriority();
}
}
//Compute and Display the Difference Measure
float diffMeasure = 0;
float temp = 0;
foreach( x in minQueue)
for (int i = 0; i < minQueue.Count; i++)
{
temp += minQueue.????; //This is where my problem stems
}
diffMeasure = (float)(100 * Math.Sqrt(temp));
}
A few lines from the end you will see a comment showing where my problem is located. The MinPriorityQueue (minQueue) has two parameters, a Priority, and a Value, where the Priority is a Float, and the Value is another class called StoreInfo. This class has an Indexer, which will return information from a different file depending on what the index is. In this case, there are only two files. For example: StoreInfo[i] returns the frequency of a word in the ith text file.
Ideally, my code would look like this:
for (int i = 0; i < minQueue.Count; i++)
{
temp += (minQueue.minimumValue[0] - minQueue.minimumValue[1])*(minQueue.minimumValue[0] - minQueue.minimumValue[1]);
}
diffMeasure = (float)(100 * Math.Sqrt(temp));
Problem is, that would require a minimumValue property, which I don't have access to. All I have is minimumPriority.
As far as I can see, there is no other way for me to get the Values that I need in order to get the frequencies that I need to get from the indexer and put into the equation.
Help is much appreciated.
Alright guys, I've been thinking at this for far too long, and it doesn't seem like anyone else sees another solution either.
At this point, I'm just going to go with the logical solution and add another property into the MinPriorityQueue class, even though it is against my professor's wishes.
Thank you all anyway.

excel xlsx file parsing - using koogra

After trying few packages with git hub, and trying to parse/process this quite a large excel document.
Each one of methods I was trying throw exception on out of memory.
I was google ing some more and found this GNU Library named koogra which seems to be only one I could see fit for the job, couldn't bother too much and continue on searching as I am running out of time for this part of the project .
The code I have got by now is working pass the part of the "out of memory" issue,
so only thing left is how do I properly parse an Excel Document so it will be possible to extract say a kind of dictionary collection key is one column and value is another.
this is the file in question
this is the code i have so far
var path = Path.Combine(Environment.CurrentDirectory, "tst.xlsx");
Net.SourceForge.Koogra.Excel2007.Workbook xcel = new Net.SourceForge.Koogra.Excel2007.Workbook(path);
var ss = xcel.GetWorksheets();
found it by some more .... google ing...
first row for usage on 2007 (xlsx)
second row is for xls version
Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader("tst.xlsx");
//genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcelBIFFReader("some.xls");
Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
{
Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
for (uint c = genericWS.FirstCol; c <= genericWS.LastCol; ++c)
{
// raw value
Console.WriteLine(row.GetCell(c).Value);
// formatted value
Console.WriteLine(row.GetCell(c).GetFormattedValue());
}
}
i hope that i helped anyone else out there that encountered same "out of memory" issue ... '
enjoy
a small update to the code above
OK.. I Have played with this a little , so as far as it is related to the content of the file
the chart is ranked based on Unique IP and the current code is
//place source file within your current:
//project directory\bin\debug and you should find extracted file next to the source file
var pathtoRead = Path.Combine(Environment.CurrentDirectory, "tst.xlsx");
var pathtoWrite = Path.Combine(Environment.CurrentDirectory, "tst.txt");
Net.SourceForge.Koogra.IWorkbook genericWB = Net.SourceForge.Koogra.WorkbookFactory.GetExcel2007Reader(pathtoRead);
Net.SourceForge.Koogra.IWorksheet genericWS = genericWB.Worksheets.GetWorksheetByIndex(0);
StringBuilder SbXls = new StringBuilder();
for (uint r = genericWS.FirstRow; r <= genericWS.LastRow; ++r)
{
Net.SourceForge.Koogra.IRow row = genericWS.Rows.GetRow(r);
string LineEnding = string.Empty;
for (uint ColCount = genericWS.FirstCol; ColCount <= genericWS.LastCol; ++ColCount)
{
var formated = row.GetCell(ColCount).GetFormattedValue();
if (ColCount == 1)
LineEnding = Environment.NewLine;
else if (ColCount == 0)
LineEnding = "\t";
if (ColCount > 1 == false)
SbXls.Append(string.Concat(formated, LineEnding));
}
}
File.WriteAllText(pathtoWrite, SbXls.ToString());

C# Best way to parse flat file with dynamic number of fields per row

I have a flat file that is pipe delimited and looks something like this as example
ColA|ColB|3*|Note1|Note2|Note3|2**|A1|A2|A3|B1|B2|B3
The first two columns are set and will always be there.
* denotes a count for how many repeating fields there will be following that count so Notes 1 2 3
** denotes a count for how many times a block of fields are repeated and there are always 3 fields in a block.
This is per row, so each row may have a different number of fields.
Hope that makes sense so far.
I'm trying to find the best way to parse this file, any suggestions would be great.
The goal at the end is to map all these fields into a few different files - data transformation. I'm actually doing all this within SSIS but figured the default components won't be good enough so need to write own code.
UPDATE I'm essentially trying to read this like a source file and do some lookups and string manipulation to some of the fields in between and spit out several different files like in any normal file to file transformation SSIS package.
Using the above example, I may want to create a new file that ends up looking like this
"ColA","HardcodedString","Note1CRLFNote2CRLF","ColB"
And then another file
Row1: "ColA","A1","A2","A3"
Row2: "ColA","B1","B2","B3"
So I guess I'm after some ideas on how to parse this as well as storing the data in either Stacks or Lists or?? to play with and spit out later.
One possibility would be to use a stack. First you split the line by the pipes.
var stack = new Stack<string>(line.Split('|'));
Then you pop the first two from the stack to get them out of the way.
stack.Pop();
stack.Pop();
Then you parse the next element: 3* . For that you pop the next 3 items on the stack. With 2** you pop the next 2 x 3 = 6 items from the stack, and so on. You can stop as soon as the stack is empty.
while (stack.Count > 0)
{
// Parse elements like 3*
}
Hope this is clear enough. I find this article very useful when it comes to String.Split().
Something similar to below should work (this is untested)
ColA|ColB|3*|Note1|Note2|Note3|2**|A1|A2|A3|B1|B2|B3
string[] columns = line.Split('|');
List<string> repeatingColumnNames = new List<string();
List<List<string>> repeatingFieldValues = new List<List<string>>();
if(columns.Length > 2)
{
int repeatingFieldCountIndex = columns[2];
int repeatingFieldStartIndex = repeatingFieldCountIndex + 1;
for(int i = 0; i < repeatingFieldCountIndex; i++)
{
repeatingColumnNames.Add(columns[repeatingFieldStartIndex + i]);
}
int repeatingFieldSetCountIndex = columns[2 + repeatingFieldCount + 1];
int repeatingFieldSetStartIndex = repeatingFieldSetCountIndex + 1;
for(int i = 0; i < repeatingFieldSetCount; i++)
{
string[] fieldSet = new string[repeatingFieldCount]();
for(int j = 0; j < repeatingFieldCountIndex; j++)
{
fieldSet[j] = columns[repeatingFieldSetStartIndex + j + (i * repeatingFieldSetCount))];
}
repeatingFieldValues.Add(new List<string>(fieldSet));
}
}
System.IO.File.ReadAllLines("File.txt").Select(line => line.Split(new[] {'|'}))

How can I programmatically convert Microsoft Word comments to bookmarks using C#?

Since bookmarks can be included in a URL, I want to convert all the comments in a document to bookmarks.
I wrote a c# application which displays a Microsoft Word document in a web browser activex control. I get a handle to the document and am able to enumerate the comments. But when I attempt to insert bookmarks at the comment location, I end up with NULL bookmarks that don't point to anything, e.g.:
void ButtonConvertCommentsClick(object sender, EventArgs e)
{
Word.Comments wordComments = this.wordDoc.Comments;
MessageBox.Show("This document has " + wordComments.Count + " comments.");
for (int n = 1; n <= wordComments.Count; n++)
{
Word.Comment comment = this.wordDoc.Comments[n];
Word.Range range = comment.Range;
String commentText = comment.Range.Text;
this.wordDoc.Application.ActiveDocument.Bookmarks.Add("BOOKMARK"+n, range);
}
this.wordDoc.Save();
....
}
Assuming there were 3 comments in the doc, "BOOKMARK1", "BOOKMARK2" and "BOOKMARK3" shows up in the bookmark list, but the "Go To..." button is disabled for all of them.
What am I doing wrong?
Use scope to get the range of the comment...
for (int n = 1; n <= wordComments.Count; n++)
{
Word.Comment comment = this.wordDoc.Comments[n];
Word.Range range = this.wordDoc.Range(comment.Scope.Start, comment.Scope.End);
String commentText = comment.Range.Text;
this.wordDoc.Application.ActiveDocument.Bookmarks.Add("BOOKMARK"+n, range);
}

Categories

Resources