I have to convert Word documents to HTML, which I'm doing with Aspose, and that is working well. The problem is that it produces some redundant elements, which I think is due to the way the text is stored in Word.
For example, in my Word document the text below appears:
AUTHORIZATION FOR RELEASE
When converted to html it becomes:
<span style="font-size:9pt">A</span>
<span style="font-size:9pt">UTHORIZATION FOR R</span>
<span style="font-size:9pt">ELEASE</span>
I'm using C# and would like a way to remove the redundant span elements. I'm thinking either AngleSharp or html-agility-pack should be able to do this, but I'm not sure whether this is the best way.
What I wound up doing is iterating over all the elements, and when adjacent span elements with matching attributes were detected I concatenated their text. Here is some code in case others run into this issue. Note that the code could use some cleanup.
static void CombineRedundantSpans(IElement parent)
{
    if (parent == null)
        return;

    if (parent.Children.Length > 1)
    {
        // Work on a copy of the children so that removing spans does not disturb the loop.
        var children = parent.Children.ToArray();
        var previousSibling = children[0];
        for (int i = 1; i < children.Length; i++)
        {
            var current = children[i];
            if (previousSibling is IHtmlSpanElement && current is IHtmlSpanElement
                && IsSpanMatch((IHtmlSpanElement)previousSibling, (IHtmlSpanElement)current))
            {
                // Matching spans: fold the text into the earlier span and drop this one.
                previousSibling.TextContent = previousSibling.TextContent + current.TextContent;
                current.Remove();
            }
            else
            {
                previousSibling = current;
            }
        }
    }

    // Recurse into the children that remain after merging.
    foreach (var child in parent.Children)
    {
        CombineRedundantSpans(child);
    }
}

static bool IsSpanMatch(IHtmlSpanElement first, IHtmlSpanElement second)
{
    // Only consider simple spans that carry the same number of attributes.
    if (first.ChildElementCount < 2 && first.Attributes.Length == second.Attributes.Length)
    {
        // Every attribute of the first span must also be present on the second.
        foreach (var a in first.Attributes)
        {
            if (!second.Attributes.Any(t => t.Equals(a)))
            {
                return false;
            }
        }
        return true;
    }
    return false;
}
I'm in a bit of a pickle. There is a list of images I want to grab from a website. I know how to do that much, but I have to filter the images by where they are located.
For example, I want to grab the images in a div tag with the id "theseImages", but there is another set of images within another div tag with the id "notTheseImages". Looping through every "img" tag in an HtmlElementCollection would ignore the divs, so it would also grab the images from "notTheseImages".
Is there a way I could loop through the images while doing a check to see where those images are located in the div tags?
This could help you select elements in your current HTML, and maybe on future occasions too :)
protected HtmlElement[] GetElementsByParent(HtmlDocument document, HtmlElement baseElement = null, params string[] singleSelectors)
{
if (singleSelectors == null || singleSelectors.Length == 0)
{
throw new ArgumentException("Please give at least 1 selector!", "singleSelectors");
}
IList<HtmlElement> result = new List<HtmlElement>();
bool last = singleSelectors.Length == 1;
string singleSelector = singleSelectors[0];
// IsNullOrWhiteSpace already covers the trimmed case, so one check is enough.
if (string.IsNullOrWhiteSpace(singleSelector))
{
return null;
}
singleSelector = singleSelector.Trim();
if (singleSelector.StartsWith("#"))
{
var item = document.GetElementById(singleSelector.Substring(1));
if (item == null)
{
return null;
}
if (last)
{
result.Add(item);
}
else
{
var results = GetElementsByParent(document, item, singleSelectors.Skip(1).ToArray());
if (results != null && results.Length > 0)
{
foreach (var res in results)
{
result.Add(res);
}
}
}
}
else if (singleSelector.StartsWith("."))
{
if (baseElement == null)
{
baseElement = document.Body;
}
foreach (HtmlElement child in baseElement.Children)
{
string cls;
if (!string.IsNullOrWhiteSpace((cls = child.GetAttribute("class"))))
{
if (cls.Split(' ').Contains(singleSelector.Substring(1)))
{
if (last)
{
result.Add(child);
}
else
{
var results = GetElementsByParent(document, child, singleSelectors.Skip(1).ToArray());
if (results != null && results.Length > 0)
{
foreach (var res in results)
{
result.Add(res);
}
}
}
}
}
}
}
else
{
HtmlElementCollection elements = null;
if (baseElement != null)
{
elements = baseElement.GetElementsByTagName(singleSelector);
}
else
{
elements = document.GetElementsByTagName(singleSelector);
}
foreach (HtmlElement item in elements)
{
if (last)
{
result.Add(item);
}
else
{
var results = GetElementsByParent(document, item, singleSelectors.Skip(1).ToArray());
if (results != null && results.Length > 0)
{
foreach (var res in results)
{
result.Add(res);
}
}
}
}
}
return result.ToArray();
}
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
// here we can query
var result = GetElementsByParent(webBrowser1.Document, null, "#theseImages", "img");
}
The result would then contain the images that are under #theseImages.
Mind you, GetElementsByParent is fairly untested; I only tested it for your use case and it seemed to work.
Don't forget to only start the query once you are sure the document is completed ;)
I know you can't modify a collection during a foreach, but I should be able to set property values on the items it iterates over. For some reason, the method below gives me the "Collection was modified..." error every time it executes:
private static IInstrument AdjustForSimpleInstrument(DateTime valueDate, IInstrument temp)
{
var instr = temp;
foreach (var component in instr.Components)
{
component.Schedule.ScheduleRows.RemoveAll(
sr =>
((sr.Payment != null) && (sr.Payment.PaymentDate != null) &&
(sr.Payment.PaymentDate.AdjustedDate.Date <= valueDate.Date)));
if (
!component.ScheduleInputs.ScheduleType.In(ComponentType.Floating, ComponentType.FloatingLeg,
ComponentType.Cap, ComponentType.Floor)) continue;
foreach (var row in component.Schedule.ScheduleRows)
{
var clearRate = false;
if (row.Payment.CompoundingPeriods != null)
{
if (row.Payment.CompoundingPeriods.Count > 0)
{
foreach (
var period in
row.Payment.CompoundingPeriods.Where(
period => ((FloatingRate)period.Rate).ResetDate.FixingDate > valueDate))
{
period.Rate.IndexRate = null;
clearRate = true;
}
}
}
else if (row.Payment.PaymentRate is FloatingRate)
{
if (((FloatingRate)row.Payment.PaymentRate).ResetDate.FixingDate > valueDate)
clearRate = true;
}
else if (row.Payment.PaymentRate is MultipleResetRate)
{
if (
((MultipleResetRate)row.Payment.PaymentRate).ChildRates.Any(
rate => rate.ResetDate.FixingDate > valueDate))
{
clearRate = true;
}
}
if (clearRate)
{
row.Payment.PaymentRate.IndexRate = null;
}
}
}
return temp;
}
Am I just missing something easy here? The loop that is causing the exception is the second, this one:
foreach (var row in component.Schedule.ScheduleRows)
I suspect these types are not part of the .NET Framework, so I assume that a row is connected to its collection. Modifying the contents of the row might shift its place inside the collection, thus modifying the collection, which is not allowed while a foreach is enumerating it.
The solution is simple: iterate over a copy of the collection (created with LINQ's ToList()).
foreach (var row in component.Schedule.ScheduleRows.ToList())
...
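To illustrate why the copy helps, here is a minimal, self-contained sketch. It uses a plain List<int> rather than the ScheduleRows type from the question (whose implementation is unknown), and the class and method names are made up for the example. The first method enumerates a snapshot and can change the original list freely; the second shows the failure you get without the copy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Removes negative values while looping. The loop enumerates a copy
    // made by ToList(), so changing the original list is safe.
    public static List<int> RemoveNegatives(List<int> values)
    {
        foreach (var value in values.ToList())
        {
            if (value < 0)
                values.Remove(value);
        }
        return values;
    }

    // Same loop without the copy. After a Remove, the next step of the
    // enumerator throws InvalidOperationException ("Collection was modified").
    public static bool ThrowsWithoutCopy(List<int> values)
    {
        try
        {
            foreach (var value in values)
            {
                if (value < 0)
                    values.Remove(value);
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

The copy costs one extra allocation per loop, which is usually acceptable for small collections such as a schedule.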
The idea is simple, but the answer may get complicated:
In fact, I can check the run properties for the font size.
If absent I need to check the style applied to the paragraph in order to find the run properties defined for the font size, then that style's paragraph run properties.
If not found I need to check everything again regarding the style which this style is based on.
If not found, I should check the following style going up in the style hierarchy, and go on till I reach the default style.
I also need to check if the previous paragraph has a style applied to it. In this case the applied style may define the style for the next paragraph that affects the text I'm working with.
If there is no style influencing my paragraph, then I need to look in the default run properties from the styles part. After that I should look to the default paragraph properties in the same part.
If nothing applies, then the responsibility for defining the size falls to the application that is working with the document.
Am I right?
Is there any help for this in the Open XML SDK and/or OpenXmlPowerTools?
An important aspect is that this question extends to almost any paragraph or run property besides text font size.
My ultimate goal is to find out whether a piece of text is a section header (like heading1, heading2, etc.) based on its formatting, but it looks difficult to get something as simple as "the current formatting of a piece of text". To make things harder, I also need to deal with (section) numbering, which often doesn't have a numbering format applied to the paragraph.
Thanks,
So, I'm answering my own question as promised.
I developed a method that returns the "effective" run properties of a specific run from a Word document paragraph. It takes into account the default document properties, the applied styles including the related style hierarchy, and the direct run properties, according to the standard (ISO/IEC 29500-1).
It is interesting to note that Word doesn't seem to follow completely the standard in these two aspects:
1 - If a paragraph has no style applied to it, Word applies the default paragraph style. As far as I know, no style should be applied in that case. This doesn't happen for a run: when a run has no run style, the default run style is not applied.
2 - In order to get the effective run properties it is necessary to "roll up" styles. Paragraph styles and run styles follow a style hierarchy. To get a specific property value, look for it in the applied style; if it is not present, look for it in the parent style, and so on. A property that is defined with a specific value in a certain style shouldn't be repeated in a child style if it has the same value. Word doesn't follow this rule for character styles. In fact, all run properties applied by the run style can be obtained directly from that run style, without needing to follow the style hierarchy. This is not according to the standard.
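As a rough illustration of the "roll up" lookup described in point 2, here is a small sketch that walks the w:basedOn chain for a single run property. It is not the code from my solution: the class and method names are made up, it uses plain System.Xml.Linq instead of the PowerTools W class, and it only handles properties that carry a w:val attribute.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StyleLookupSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the w:val of the first rPr/<propertyName> found when walking from
    // styleId up through its w:basedOn parents, or null if no style defines it.
    // Simplification: elements without w:val (e.g. a bare w:b) are not handled.
    public static string FindRunProperty(XElement styles, string styleId, string propertyName)
    {
        var visited = new HashSet<string>(); // guards against a cyclic basedOn chain
        string current = styleId;
        while (current != null && visited.Add(current))
        {
            XElement style = styles.Elements(W + "style")
                .FirstOrDefault(s => (string)s.Attribute(W + "styleId") == current);
            if (style == null)
                return null; // unknown style id

            string value = (string)style.Elements(W + "rPr")
                .Elements(W + propertyName)
                .Attributes(W + "val")
                .FirstOrDefault();
            if (value != null)
                return value; // the nearest definition wins

            current = (string)style.Elements(W + "basedOn")
                .Attributes(W + "val")
                .FirstOrDefault();
        }
        return null;
    }
}
```

For a style "Heading1" based on "Normal", where only "Normal" defines w:sz, a call such as FindRunProperty(styles, "Heading1", "sz") would return the value stored on "Normal".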
Now, let me go into some details of my solution:
First, my code uses the Open XML PowerTools:
http://powertools.codeplex.com/
Next, for rolling up styles regarding the style inheritance I adapted and implemented the solution provided by Eric White at:
http://blogs.msdn.com/b/ericwhite/archive/2009/12/13/implementing-inheritance-in-xml.aspx
and
http://blogs.msdn.com/b/ericwhite/archive/2009/10/29/open-xml-wordprocessingml-style-inheritance.aspx
The complete algorithm for getting the run properties can be found in the standard and it is also provided by Eric White at:
http://blogs.msdn.com/b/ericwhite/archive/2009/11/12/assembling-paragraph-and-run-properties-for-cells-in-a-table.aspx
That article covers the extraction of properties from a cell inside a table. My method doesn't work for paragraphs inside tables (I just don't need it :-) ), but it can be extended to deal with these cases (all the information is in Eric's article).
Please note that I deal properly with toggle properties and with the way Word really works (the two points I made above about the differences from the standard).
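To make the toggle-property handling easier to follow, here is a simplified sketch of the rule that AssembleToggleProperties in the code below applies, reduced to nullable booleans. The names are hypothetical and it only mirrors my code, not the full wording of the standard: a value set in the document defaults wins, otherwise the paragraph style and the run style combine with XOR, and direct run formatting overrides everything.

```csharp
public static class ToggleSketch
{
    // Each argument is null when the property is not specified at that level.
    public static bool Resolve(bool? docDefault, bool? paragraphStyle,
                               bool? characterStyle, bool? direct)
    {
        // Direct run formatting is merged last and replaces the style result.
        if (direct.HasValue)
            return direct.Value;

        // A toggle that is on in the document defaults stays on.
        if (docDefault == true)
            return true;

        // Otherwise the two style levels toggle each other (XOR).
        return (paragraphStyle ?? false) ^ (characterStyle ?? false);
    }
}
```

So bold set in both the paragraph style and the run style cancels out, which is why these properties cannot simply be merged like the others.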
Finally, the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;
using OpenXmlPowerTools;
namespace MyNameSpace
{
class OpenXmlPowerToolsUtilities
{
public static XElement GetEffectiveRunProperties(WordprocessingDocument wordDoc, XElement run)
{
XElement runProperties = null;
List<XElement> runPropertiesList = new List<XElement>();
XElement paragraph = run.Parent;
if (paragraph.Name != W.p)
return null;
StyleDefinitionsPart styleDefinitionsPart = wordDoc.MainDocumentPart
.StyleDefinitionsPart;
if (styleDefinitionsPart == null)
return null;
XElement styles = styleDefinitionsPart.GetXDocument().Root;
// 1 - Get run default
XElement runDefault = styles.Elements(W.docDefaults)
.Elements(W.rPrDefault)
.Elements(W.rPr)
.FirstOrDefault();
if (runDefault != null)
runPropertiesList.Add(runDefault);
// 2 - get paragraph style run properties
XElement pStyleRunProperties = null;
string pStyle = (string)paragraph.Elements(W.pPr)
.Elements(W.pStyle)
.Attributes(W.val)
.FirstOrDefault();
if (pStyle != null)
{
pStyleRunProperties = AssembleStyleInformation(styles, pStyle)
.Elements(W.rPr)
.FirstOrDefault();
}
else
{
XElement defaultParagraphStyle = styles
.Elements(W.style)
.Where(e =>
(string)e.Attribute(W.type) == "paragraph" &&
(string)e.Attribute(W._default) == "1")
.FirstOrDefault();
// The styles part may not declare a default paragraph style.
if (defaultParagraphStyle != null)
pStyleRunProperties = defaultParagraphStyle.Elements(W.rPr).FirstOrDefault();
}
if (pStyleRunProperties != null)
runPropertiesList.Add(pStyleRunProperties);
// 3 - get run style run properties
string rStyle = (string)run.Elements(W.rPr).Elements(W.rStyle).Attributes(W.val).FirstOrDefault();
XElement rStyleRunProperties = null;
if (rStyle != null)
{
rStyleRunProperties = AssembleStyleInformation(styles, rStyle)
.Elements(W.rPr)
.FirstOrDefault();
}
if (rStyleRunProperties != null)
runPropertiesList.Add(rStyleRunProperties);
XElement toggleProperties = AssembleToggleProperties(runDefault, pStyleRunProperties, rStyleRunProperties);
if (toggleProperties != null)
runPropertiesList.Add(toggleProperties);
// 4 - direct run properties
XElement directRunProperties = run.Elements(W.rPr).FirstOrDefault();
if (directRunProperties != null)
runPropertiesList.Add(directRunProperties);
runProperties = AssembleRunProperties(runPropertiesList);
return runProperties;
}
private static XElement AssembleRunProperties(List<XElement> runPropertiesList)
{
return runPropertiesList
.Aggregate(
new XElement(W.rPr,
new XAttribute(XNamespace.Xmlns + "w", W.w)),
(mergedRun, run) =>
MergeChildElements(mergedRun, run));
}
static XElement AssembleToggleProperties(XElement runDefault, XElement pStyleRunProperties, XElement rStyleRunProperties)
{
XElement runToggleProperties;
runToggleProperties = new XElement(W.rPr,
new XAttribute(XNamespace.Xmlns + "w", W.w));
foreach (XName toggleProperty in toggleProperties)
{
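// runDefault is null when the styles part has no docDefaults/rPrDefault/rPr.
XElement runDefaultToggleProperty = runDefault == null
? null
: runDefault.Elements(toggleProperty).FirstOrDefault();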
if (runDefaultToggleProperty != null)
{
if ((string)runDefaultToggleProperty.Attributes(W.val).FirstOrDefault() != "0")
{
runToggleProperties.Add(runDefaultToggleProperty);
continue;
}
}
XElement pStyleToggleProperty = null;
if (pStyleRunProperties == null)
pStyleToggleProperty = null;
else
pStyleToggleProperty = pStyleRunProperties.Elements(toggleProperty).FirstOrDefault();
XElement rStyleToggleProperty = null;
if (rStyleRunProperties == null)
rStyleToggleProperty = null;
else
rStyleToggleProperty = rStyleRunProperties.Elements(toggleProperty).FirstOrDefault();
if (pStyleToggleProperty == null && rStyleToggleProperty != null)
runToggleProperties.Add(rStyleToggleProperty);
else if (pStyleToggleProperty != null && rStyleToggleProperty == null)
runToggleProperties.Add(pStyleToggleProperty);
else if (pStyleToggleProperty != null && rStyleToggleProperty != null)
{
if ((string)rStyleToggleProperty.Attributes(W.val).FirstOrDefault() == "0")
runToggleProperties.Add(pStyleToggleProperty);
else if ((string)pStyleToggleProperty.Attributes(W.val).FirstOrDefault() == "0")
runToggleProperties.Add(rStyleToggleProperty);
else
runToggleProperties.Add(new XElement(toggleProperty, new XAttribute(W.val, "0")));
}
}
return runToggleProperties;
}
public static IEnumerable<XElement> StyleChainReverseOrder(XElement styles, string styleId)
{
string current = styleId;
while (true)
{
XElement style = styles.Elements(W.style)
.Where(s => (string)s.Attribute(W.styleId) == current).FirstOrDefault();
// Stop if the style id does not resolve, instead of yielding null.
if (style == null)
yield break;
yield return style;
current = (string)style.Elements(W.basedOn).Attributes(W.val).FirstOrDefault();
if (current == null)
yield break;
}
}
public static IEnumerable<XElement> StyleChain(XElement styles, string styleId)
{
return StyleChainReverseOrder(styles, styleId).Reverse();
}
private static XElement AssembleStyleInformation(XElement styles, string styleId)
{
return StyleChain(styles, styleId)
.Aggregate(
new XElement(W.style, new XAttribute(XNamespace.Xmlns + "w", W.w)),
(mergedStyle, style) => MergeChildElements(mergedStyle, style));
}
public static XName[] Others =
{
W.pStyle,
W.rStyle
};
public static XName[] ElementsWithMergeElementsSemantics =
{
W.style,
W.rPr,
W.pPr
};
public static XName[] ElementsWithMergeAttributesSemantics =
{
W.ind,
W.spacing,
W.lang
};
public static XName[] ElementsWithReplaceElementsSemantics =
{
W.name, // The style Name element
W.adjustRightInd,
W.autoSpaceDE,
W.autoSpaceDN,
W.bidi,
W.cnfStyle, // within a table
W.contextualSpacing,
W.divId,
W.framePr,
W.jc,
W.keepLines,
W.keepNext,
W.kinsoku,
W.mirrorIndents,
W.numPr,
W.outlineLvl,
W.overflowPunct,
W.pageBreakBefore,
W.pBdr,
W.shd,
W.snapToGrid,
W.suppressAutoHyphens,
W.suppressLineNumbers,
W.suppressOverlap,
W.tabs,
W.textAlignment,
W.textboxTightWrap, // within a textbox
W.textDirection,
W.topLinePunct,
W.widowControl,
W.wordWrap,
W.b,
W.bCs,
W.bdr,
W.caps,
W.color,
W.cs,
W.dstrike,
W.eastAsianLayout,
W.effect,
W.em,
W.emboss,
W.fitText,
W.highlight,
W.i,
W.iCs,
W.imprint,
W.kern,
W.noProof,
W.oMath,
W.outline,
W.position,
W.rFonts,
W.rtl,
W.shadow,
W.shd,
W.smallCaps,
W.snapToGrid,
//W.spacing, // different from paragraph spacing
W.specVanish,
W.strike,
W.sz,
W.szCs,
W.u,
W.vanish,
W.vertAlign,
W._w,
W.webHidden
};
public static XName[] toggleProperties =
{
W.b,
W.bCs,
W.caps,
W.emboss,
W.i,
W.iCs,
W.imprint,
W.outline,
W.shadow,
W.smallCaps,
W.strike,
W.vanish
};
public static bool IsValidMergeElement(XName name)
{
if (ElementsWithMergeAttributesSemantics.Contains(name) ||
ElementsWithMergeElementsSemantics.Contains(name) ||
ElementsWithReplaceElementsSemantics.Contains(name))
return true;
return false;
}
public static bool IsToggleProperty(XName name)
{
if (toggleProperties.Contains(name))
return true;
return false;
}
public static bool HasReplaceSemantics(XName name)
{
if (ElementsWithReplaceElementsSemantics.Contains(name))
return true;
return false;
}
public static bool HasMergeElementsSemantics(XName name)
{
if (ElementsWithMergeElementsSemantics.Contains(name))
return true;
return false;
}
public static bool HasMergeAttributesSemantics(XName name)
{
if (ElementsWithMergeAttributesSemantics.Contains(name))
return true;
return false;
}
public static XElement MergeChildElements(XElement mergedElement, XElement element)
{
if (mergedElement == null || element == null)
{
if (element == null)
element = mergedElement;
XElement newElement = new XElement(element.Name,
new XAttribute(XNamespace.Xmlns + "w", W.w),
element.Attributes()
.Where(a =>
{
if (a.IsNamespaceDeclaration)
return false;
if (element.Name == W.style)
if (!(a.Name == W.type || a.Name == W.styleId))
return false;
return true;
}),
element.Elements().Select(e =>
{
if (e.Name == W.rPr || e.Name == W.pPr)
return MergeChildElements(null, e);
if (IsValidMergeElement(e.Name))
return e;
return null;
}));
return newElement;
}
XElement newMergedElement = new XElement(element.Name,
new XAttribute(XNamespace.Xmlns + "w", W.w),
element.Attributes()
.Where(a =>
{
if (a.IsNamespaceDeclaration)
return false;
if (element.Name == W.style)
if (!(a.Name == W.type || a.Name == W.styleId))
return false;
return true;
}),
element.Elements().Select(e =>
{
if (HasReplaceSemantics(e.Name))
return e;
// spacing within run properties has replace semantics
if (element.Name == W.rPr && e.Name == W.spacing)
return e;
if (HasMergeAttributesSemantics(e.Name))
{
XElement newElement;
newElement = new XElement(e.Name,
e.Attributes(),
mergedElement.Elements(e.Name).Attributes()
.Where(a =>
!(e.Attributes().Any(z => z.Name == a.Name))));
return newElement;
}
if (e.Name == W.rPr || e.Name == W.pPr)
{
XElement correspondingElement = mergedElement.Element(e.Name);
return MergeChildElements(correspondingElement, e);
}
return null;
}),
mergedElement.Elements()
.Where(m => !element.Elements(m.Name).Any()));
return newMergedElement;
}
}
}
I am using a TreeListView (ObjectListView) http://objectlistview.sourceforge.net/cs/index.html - and populated it with a number of items. One of the columns I made editable on double click for user input. Unfortunately, the editing is extremely slow and going from one cell edit in the Qty column (see picture further below) to the next cell edit takes about 5-10 seconds each time. Also, the cell editor takes a while to appear and disappear. Below is the code I use to populate the TreeListView:
TreeListView.TreeRenderer renderer = this.treeListView.TreeColumnRenderer;
renderer.LinePen = new Pen(Color.Firebrick, 0.5f);
renderer.LinePen.DashStyle = DashStyle.Solid;
renderer.IsShowLines = true;
treeListView.RowFormatter = delegate(OLVListItem olvi)
{
var item = (IListView)olvi.RowObject;
if (item.ItemType == "RM")
olvi.ForeColor = Color.LightSeaGreen;
};
treeListView.CanExpandGetter = delegate(object x)
{
var job = x as IListView;
if (job != null)
{
if (job.ItemType == "PA" || job.ItemType == "JC")
{
var rm = job.ItemPart.GetRawMaterial();
var subParts = job.ItemPart.SubParts.Where(v => v != null).ToList();
if (rm.Count > 0 || subParts.Count > 0)
return true;
}
}
return false;
};
this.treeListView.ChildrenGetter = delegate(object x)
{
try
{
var job = x as IListView;
if (job != null)
{
if (job.ItemType == "PA" || job.ItemType == "JC")
{
var part = job.ItemPart;
var rm = part.GetRawMaterial();
var subParts = part.SubParts.Where(v => v != null).ToList();
var items = new List<IListView>();
items.AddRange(subParts.GetRange(0, subParts.Count).ToList<IListView>());
items.AddRange(rm.GetRange(0, rm.Count).ToList<IListView>());
return items;
}
}
return null;
}
catch (UnauthorizedAccessException ex)
{
MessageBox.Show(this, ex.Message, "ObjectListViewDemo", MessageBoxButtons.OK, MessageBoxIcon.Exclamation);
return null;
}
};
var lItems= jobs.ToList<IListView>();
treeListView.SetObjects(lItems );
Expand(lItems[0]);
treeListView.RebuildAll(true);
}
public void Expand(object expItem)
{
treeListView.ToggleExpansion(expItem);
foreach (var item in treeListView.GetChildren(expItem))
{
Expand(item);
}
}
Here is a picture of the cell editing:
Why is the editing so very slow? Am I doing something wrong? What can I do to make it faster?
In your delegates you're using linear searches and several list copies (which are also linear), and this is done for each item.
With that much repeated work, bad performance is to be expected.
If you want to improve on this, you can pre-calculate the results instead.
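One way to pre-calculate, sketched under assumptions: compute each item's child list once and keep it in a dictionary, so that both getters become cheap lookups. ChildrenCache is a hypothetical helper, not part of ObjectListView, and it assumes the items can serve as dictionary keys (reference equality by default).

```csharp
using System;
using System.Collections.Generic;

public sealed class ChildrenCache<T>
{
    private readonly Func<T, List<T>> _compute;
    private readonly Dictionary<T, List<T>> _cache = new Dictionary<T, List<T>>();

    // Number of times the expensive computation actually ran.
    public int ComputeCount { get; private set; }

    public ChildrenCache(Func<T, List<T>> compute)
    {
        _compute = compute;
    }

    // Returns the cached children, computing them only on the first request.
    public List<T> GetChildren(T item)
    {
        List<T> children;
        if (!_cache.TryGetValue(item, out children))
        {
            children = _compute(item) ?? new List<T>();
            ComputeCount++;
            _cache[item] = children;
        }
        return children;
    }

    public bool CanExpand(T item)
    {
        return GetChildren(item).Count > 0;
    }

    // Call this when an item's sub-parts or raw materials change.
    public void Invalidate(T item)
    {
        _cache.Remove(item);
    }
}
```

The CanExpandGetter and ChildrenGetter delegates from the question could then simply forward to CanExpand and GetChildren on a shared cache, with the existing SubParts/GetRawMaterial logic passed in as the compute function.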