I have two strings (RTFs), that I'd have to merge somehow - insert a new line between the two -, to display in a RichEditBox, in UWP. I have read a workaround where merging the two is done by the usage of two RichTextBox controls, but in UWP, that's not really an option (and I can't display the two RTFs in two RichEditBox controls either). Is there an alternative way, without using 3rd party libraries?
While using RichEditBox class, we can merge two RTFs by taking advantage of
ITextDocument interface and ITextRange interface. Following is a simple sample:
var rtf1 = @"{\rtf1{\fonttbl{\f0 Verdana;}{\f1 Arial;}{\f2 Verdana;}{\f3 Calibri;}}{\colortbl;\red255\green255\blue255;\red255\green0\blue0;}\f0\cf2 This is red text marked by Verdana font.\par}";
// Sets rtf1 as the content of the document
editor.Document.SetText(Windows.UI.Text.TextSetOptions.FormatRtf, rtf1);
// Get a new text range for the active story of the document.
var range = editor.Document.GetRange(0, rtf1.Length);
// Collapses the text range into a degenerate point at the end of the range for inserting.
range.Collapse(false);
var rtf2 = @"{\rtf1{\fonttbl{\f0 Times New Roman;}}{\colortbl;\red255\green255\blue255;\red0\green0\blue255;}\f0\cf2 This is blue text marked by Times New Roman font.\par}";
// Inserts rtf2
range.SetText(Windows.UI.Text.TextSetOptions.FormatRtf, rtf2);
//var newrtf = string.Empty;
//editor.Document.GetText(Windows.UI.Text.TextGetOptions.FormatRtf, out newrtf);
//System.Diagnostics.Debug.WriteLine(newrtf);
This merges rtf2 onto the end of rtf1 and automatically creates a new valid RTF. You can retrieve the new RTF with the ITextDocument.GetText method.
I'm trying to parse a pdf file using itextsharp (version: 5.5.1.0). The pdf file has content-type as "application/octet-stream". I'm using C# code to read based on Location Strategy
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]);
var word = renderInfo.GetText().Trim();
// get column no
var position = (int)rect.Left;
Pdf file image
Issue: When I read the text with RenderInfo.GetText(), I get incomplete words; for example, instead of "Daily" I get "Dai" and then "ly" in the next loop. Is there any way I can read complete words, word by word?
Please let me know if you need more info, unfortunately there is no option to attach the pdf file here.
Regards
Pradeep Jain
When I read it RenderInfo.GetText() I get incomplete words like instead of "Daily" I get "Dai" and "ly" in next loop.
That behavior is expected.
In a render listener / text extraction strategy you get the individual atomic string parameters of text drawing instructions. There is no requirement for PDF creation software to put whole words into these strings.
Actually the PDF format even encourages this splitting of words! It does not by itself use the kerning information of fonts; thus, any software that wants to create text output with kerning has to split strings wherever kerning comes into play and slightly move the text insertion point between the string parts in text drawing instructions.
Thus, a render listener has to collect the strings and glue them together before it can expect to get whole words.
Is there any way I can read complete word by word ?
Yes, by collecting the strings and gluing them together.
You mentioned you read based on Location Strategy - then look closer at what the LocationTextExtractionStrategy itself does: In its RenderText implementation it collects the text pieces with some coordinates, and only after collecting all those pieces, it sorts them and glues them together in its GetResultantText method. (You can find the code here.)
Unfortunately many members of that strategy are not immediately available in derived classes, so you may have to resort to reflection or simply to copying the whole class code and changing it in situ.
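The collect-sort-glue idea can be sketched without any PDF library. The following is only an illustration in plain Java, not the code of LocationTextExtractionStrategy: the Chunk type, the rounding of baselines into a line key and the spaceGap threshold are all assumptions made for this sketch. In a real listener the coordinates would come from the baseline of each TextRenderInfo.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChunkGluer {

    /** Hypothetical record of one text piece: its string and baseline extent. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float baselineY;

        Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }

        /** Line key: the baseline rounded to whole units (an assumption of this sketch). */
        int line() {
            return Math.round(baselineY);
        }
    }

    /**
     * Sorts the collected pieces top to bottom and left to right, then joins them.
     * A space is inserted only where the horizontal gap exceeds spaceGap.
     */
    static String glue(List<Chunk> chunks, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so the highest baseline comes first.
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -c.line())
                .thenComparingDouble(c -> c.startX));

        StringBuilder result = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : sorted) {
            if (previous != null) {
                if (current.line() != previous.line()) {
                    result.append('\n');            // a new line starts
                } else if (current.startX - previous.endX > spaceGap) {
                    result.append(' ');             // a gap wide enough to be a word break
                }
            }
            result.append(current.text);
            previous = current;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("ly", 25.2f, 33f, 700f));
        chunks.add(new Chunk("Dai", 10f, 25f, 700f));
        chunks.add(new Chunk("Report", 38f, 70f, 700f));
        System.out.println(glue(chunks, 2f));
    }
}
```

The threshold decides what counts as a word break, so it has to be tuned to the font size of the document; a fixed value like the one in main is only good for a demonstration.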
I have a template Word document which contains multiple text boxes on top of shapes (to give it a better border outline than what can be achieved by the outline of a text box). These text boxes contain mail merge fields that I wish to merge to. I have the following code in an attempt to do this
foreach (Microsoft.Office.Interop.Word.Range range in document.StoryRanges)
{
foreach (Microsoft.Office.Interop.Word.Field field in range.Fields)
{
if (field.Code.Text.Contains("Test Field"))
{
field.Select();
application.Selection.TypeText("test");
}
}
}
The problem is this only changes the fields within the first text box, I have searched both on here and MSDN for a solution, however I am still having trouble actually finding a solution. I have also added the following lines in an attempt to figure out something
Console.WriteLine(document.StoryRanges.Count);
And within the foreach loop I also have
Console.WriteLine(range.Fields.Count);
The first call to WriteLine indicates there are two StoryRanges, one being the main document, the other being the range that all the text boxes are on, I presume. However, the second WriteLine indicates the first range has 0 fields, whereas the second range only has 1 field, even though the template document I am using contains over 10 fields.
Are the StoryRanges nested? Are they of different types?
If so, you should consider using StoryRange.NextStoryRange as proposed here:
http://word.mvps.org/faqs/customization/ReplaceAnywhere.htm
'Iterate through all story types in the current document
For Each rngStory In ActiveDocument.StoryRanges
'Iterate through all linked stories
Do
With rngStory.Find
.Text = "find text"
.Replacement.Text = "I'm found"
.Wrap = wdFindContinue
.Execute Replace:=wdReplaceAll
End With
'Get next linked story (if any)
Set rngStory = rngStory.NextStoryRange
Loop Until rngStory Is Nothing
Next
See attached K-1 Document. I have attempted to use numerous tweaks with iTextSharp library but haven't had success in loading data correctly.
Ideally I would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents.
var reader = new PdfReader(FILE, Encoding.ASCII.GetBytes(password));
string[] lines;
var strategy = new LocationTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
lines = currentPageText.Split(new string[] {"\r\n", "\n"}, StringSplitOptions.None);
I also tried playing with Annotation parsing but didn't have luck.
I'm a newbie and probably looking at wrong place. Can you help guide me in the right direction?
Thanks a lot.
You would like to parse out the document similar to how humans would read them, one textbox at a time, reading its contents. That means you first will have to try and automatically recognize those text boxes. Then you can extract text by these areas.
To recognize those text boxes automatically in your document, you have to extract the border lines enclosing the boxes. For this you will first have to find out how those border lines are created. They might be drawn using vector graphics as lines or rectangles, but they could also be part of a background bitmap image.
Unfortunately I don't have your IRS form at hand and so cannot analyze its internals. Let's assume the borders are created using vector graphics for now. Thus, you have to extract vector graphics.
To extract vector graphics with iText(Sharp), you make use of classes from the iText(Sharp) parser namespace by making them parse the document and feed the parsing events into a listener you create which collects the vector graphic operations:
You implement IExtRenderListener, in particular its ModifyPath and RenderPath methods which respectively are called when additional path elements (e.g. lines or rectangles) are added to the current path or when the current path is rendered (stroked? filled?). Your implementation collects this information.
You parse your document into an instance of your listener, e.g. using PdfReaderContentParser.
You analyse the lines and rectangles found and derive the coordinates of the boxes they build.
You parse the same page in a LocationTextExtractionStrategy instance.
You retrieve the texts of the recognized text boxes by calling LocationTextExtractionStrategy.GetResultantText with a matching ITextChunkFilter argument for each box.
(Actually you can do the parsing into the instance of your listener and the LocationTextExtractionStrategy instance in one pass for a bit of optimization.)
All iText(Sharp) specific tasks are trivial, and the only other task, the analysis of the lines and rectangles found to derive the coordinates of the boxes, should be no big problem for a software developer proficient in C#.
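As a rough idea of what that analysis step could look like, here is a library-free Java sketch. It is not part of iText(Sharp); the Segment and Box types and the eps tolerance are hypothetical. It also makes strong simplifying assumptions: all border lines are axis-aligned, their coordinates are already snapped to common values, every box edge is covered by one single segment, and boxes are cells of a grid (a box spanning several grid cells is not detected).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class BoxFinder {

    /** Hypothetical axis-aligned line segment collected from the path operations. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        boolean isHorizontal() { return y1 == y2; }
        boolean isVertical() { return x1 == x2; }
    }

    /** Hypothetical result type: one recognized box. */
    static final class Box {
        final double left, bottom, right, top;

        Box(double left, double bottom, double right, double top) {
            this.left = left; this.bottom = bottom; this.right = right; this.top = top;
        }
    }

    /** Returns every grid cell whose four edges are each covered by a segment. */
    static List<Box> findBoxes(List<Segment> segments, double eps) {
        TreeSet<Double> xs = new TreeSet<>();
        TreeSet<Double> ys = new TreeSet<>();
        for (Segment s : segments) {
            if (s.isVertical()) xs.add(s.x1);
            if (s.isHorizontal()) ys.add(s.y1);
        }
        List<Double> xList = new ArrayList<>(xs);
        List<Double> yList = new ArrayList<>(ys);

        List<Box> boxes = new ArrayList<>();
        // Walk the rows from the top of the page downwards, cells from left to right.
        for (int j = yList.size() - 2; j >= 0; j--) {
            for (int i = 0; i + 1 < xList.size(); i++) {
                double left = xList.get(i), right = xList.get(i + 1);
                double bottom = yList.get(j), top = yList.get(j + 1);
                if (horizontalCovered(segments, bottom, left, right, eps)
                        && horizontalCovered(segments, top, left, right, eps)
                        && verticalCovered(segments, left, bottom, top, eps)
                        && verticalCovered(segments, right, bottom, top, eps)) {
                    boxes.add(new Box(left, bottom, right, top));
                }
            }
        }
        return boxes;
    }

    private static boolean horizontalCovered(List<Segment> segments, double y,
                                             double from, double to, double eps) {
        for (Segment s : segments) {
            if (s.isHorizontal() && Math.abs(s.y1 - y) <= eps
                    && Math.min(s.x1, s.x2) <= from + eps
                    && Math.max(s.x1, s.x2) >= to - eps) {
                return true;
            }
        }
        return false;
    }

    private static boolean verticalCovered(List<Segment> segments, double x,
                                           double from, double to, double eps) {
        for (Segment s : segments) {
            if (s.isVertical() && Math.abs(s.x1 - x) <= eps
                    && Math.min(s.y1, s.y2) <= from + eps
                    && Math.max(s.y1, s.y2) >= to - eps) {
                return true;
            }
        }
        return false;
    }
}
```

Rectangles found in the content stream can be fed into the same routine by first splitting each of them into its four edges.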
The first question is whether this form is electronic or scanned. The latter would make the data extraction much harder, as it would involve OCR too.
In case you have an electronic PDF and all the forms are similar, why don't you just use the following strategy:
store the coordinates of each "box" in a config file
process the documents and extract the text from every "box" (i.e. region)
additionally process the extracted text with regular expressions to separate the name from the address (or maybe just set the region so that the text is read line by line)
If you have a few variations of the form, you can check the very first box to extract the name of the form and load the appropriate settings file (the one that contains the set of regions for that variation).
This approach should work with any PDF library.
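A minimal Java sketch of this region-based strategy follows; it uses no PDF library at all. The Chunk type, the region names and the "name on the first line, address on the rest" rule are assumptions for illustration only. In practice the chunks would come from whatever PDF library you use, and the regions from your config file.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionExtractor {

    /** Hypothetical positioned text piece delivered by a PDF library. */
    static final class Chunk {
        final String text;
        final double x, y;

        Chunk(String text, double x, double y) {
            this.text = text; this.x = x; this.y = y;
        }
    }

    /**
     * Collects the text per configured region.
     * Each region is given as {left, bottom, right, top} in page coordinates.
     */
    static Map<String, String> extract(Map<String, double[]> regions, List<Chunk> chunks) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> region : regions.entrySet()) {
            double[] r = region.getValue();
            List<Chunk> inside = new ArrayList<>();
            for (Chunk c : chunks) {
                if (c.x >= r[0] && c.x <= r[2] && c.y >= r[1] && c.y <= r[3]) {
                    inside.add(c);
                }
            }
            // Reading order: top line first (y grows upwards), then left to right.
            inside.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                    .thenComparingDouble(c -> c.x));
            StringBuilder text = new StringBuilder();
            Chunk previous = null;
            for (Chunk c : inside) {
                if (previous != null) {
                    text.append(c.y == previous.y ? ' ' : '\n');
                }
                text.append(c.text);
                previous = c;
            }
            result.put(region.getKey(), text.toString());
        }
        return result;
    }

    // Assumption of this sketch: the first line is the name, the rest is the address.
    private static final Pattern NAME_THEN_ADDRESS =
            Pattern.compile("^([^\\n]+)\\n(.+)$", Pattern.DOTALL);

    /** Returns {name, address}; the address is empty if the text has one line only. */
    static String[] splitNameAndAddress(String boxText) {
        Matcher m = NAME_THEN_ADDRESS.matcher(boxText);
        if (!m.matches()) {
            return new String[] { boxText, "" };
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

Supporting several form variations then only means keeping one region map per variation and choosing the map after reading the box that holds the form name.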
Take a look at the IvyPdf library and template editor. It uses C# and provides high-level functions to parse and extract data, so you don't have to deal with the internals of PDF documents. You can build fairly complex scenarios using it.
I don't think it can read annotations though.
I am trying to create solid databars in EPPlus 4.0.4, and am running into two problems.
First, I haven't been able to figure out how to create a solid fill color.
Second, at least for small values, the bars aren't showing up the way I expect them to.
The screenshot below illustrates both issues. In both cases, the desired outcome is that of the databar I've added manually in Excel:
This is the code I'm currently using:
var bars = doc.ConditionalFormatting.AddDatabar(range, Color.FromArgb(99,195,132));
bars.HighValue.Type = eExcelConditionalFormattingValueObjectType.Num;
bars.LowValue.Type = eExcelConditionalFormattingValueObjectType.Num;
bars.HighValue.Value = numResponses; //82
bars.LowValue.Value = 0;
For the solid color, I've been trying out variations of values for the different properties of bars.Style.Fill, to no avail. If this is implemented, it is a simple matter of me not finding the right property.
I'm having a harder time understanding the second issue. If I go into "Manage rule" in Excel, the high and low values are properly set, and I have found no value I can change them to that will make their appearance match that of the manually created bars.
This is an extension list problem. This comes up a lot when getting into more complex exports. Conditional formatting is probably one of the tougher ones because there are so many nuances and it has changed so much over the years.
Extension list (extLst tags in xml) is kind of a catchall bucket that the OpenOfficeXml standard can use to add new features and formatting. In your case Excel populates the extension list section to allow for the extended min/max limit. EPPlus does not support this, which is why you see the difference.
Your simplest option would be just to inject it yourself via xml/string manipulation. Not pretty, but it gets the job done:
var bars = doc.ConditionalFormatting.AddDatabar(range, Color.FromArgb(99, 195, 132));
bars.HighValue.Type = eExcelConditionalFormattingValueObjectType.Num;
bars.LowValue.Type = eExcelConditionalFormattingValueObjectType.Num;
bars.HighValue.Value = numResponses; //82
bars.LowValue.Value = 0;
//Get reference to the worksheet xml for proper namespace
var xdoc = doc.WorksheetXml;
var nsm = new XmlNamespaceManager(xdoc.NameTable);
nsm.AddNamespace("default", xdoc.DocumentElement.NamespaceURI);
//Create the conditional format extension list entry
var extLstCf = xdoc.CreateNode(XmlNodeType.Element, "extLst", xdoc.DocumentElement.NamespaceURI);
extLstCf.InnerXml = @"<ext uri=""{B025F937-C7B1-47D3-B67F-A62EFF666E3E}"" xmlns:x14=""http://schemas.microsoft.com/office/spreadsheetml/2009/9/main""><x14:id>{3F3F0E19-800E-4C9F-9CAF-1E3CE014ED86}</x14:id></ext>";
var cfNode = xdoc.SelectSingleNode("/default:worksheet/default:conditionalFormatting/default:cfRule", nsm);
cfNode.AppendChild(extLstCf);
//Create the extension list content for the worksheet
var extLstWs = xdoc.CreateNode(XmlNodeType.Element, "extLst", xdoc.DocumentElement.NamespaceURI);
extLstWs.InnerXml = @"<ext uri=""{78C0D931-6437-407d-A8EE-F0AAD7539E65}"" xmlns:x14=""http://schemas.microsoft.com/office/spreadsheetml/2009/9/main""><x14:conditionalFormattings><x14:conditionalFormatting xmlns:xm=""http://schemas.microsoft.com/office/excel/2006/main""><x14:cfRule type=""dataBar"" id=""{3F3F0E19-800E-4C9F-9CAF-1E3CE014ED86}""><x14:dataBar minLength=""0"" maxLength=""100"" gradient=""0""><x14:cfvo type=""num""><xm:f>0</xm:f></x14:cfvo><x14:cfvo type=""num""><xm:f>82</xm:f></x14:cfvo><x14:negativeFillColor rgb=""FFFF0000""/><x14:axisColor rgb=""FF000000""/></x14:dataBar></x14:cfRule><xm:sqref>B2:B11</xm:sqref></x14:conditionalFormatting></x14:conditionalFormattings></ext>";
var wsNode = xdoc.SelectSingleNode("/default:worksheet", nsm);
wsNode.AppendChild(extLstWs);
pck.Save();
Note the gradient=""0"" which will set the color bars to solid instead of a gradient as well as the min/max settings to get the spread you are looking for.
A more "proper" way would be to recreate the xml objects node by node and attribute by attribute, which will take a while, but you only have to do it once.
I have written an extraction tool using iTextSharp that extracts annotation information from PDF documents. For the highlight annotation, I only get a rectangle for the area on the page which is highlighted.
I am aiming to extract the text that has been highlighted. For that I use PdfTextExtractor.
Rectangle rect = new Rectangle(
pdfArray.GetAsNumber(0).FloatValue,
pdfArray.GetAsNumber(1).FloatValue,
pdfArray.GetAsNumber(2).FloatValue,
pdfArray.GetAsNumber(3).FloatValue);
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
string textInsideRect = PdfTextExtractor.GetTextFromPage(pdfReader, pageNo, strategy);
return textInsideRect;
The result returned by PdfTextExtractor is not entirely correct. For instance it returns "was going to eliminate the paper chase" even though only "eliminate" was highlighted.
Interesting enough the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
I would love to hear any input regarding this issue, including solutions that don't involve iTextSharp.
The cause
Interesting enough the entire text for the TJ containing the highlighted "eliminate" is "was going to eliminate the paper chase" (TJ is the PDF instruction that writes text to the page).
This actually is the reason for your issue. The iText parser classes forward the text to the render listeners in the pieces they find as continuous strings in the content stream. The filter mechanism you use filters these pieces. Thus, that whole sentence is accepted by the filter.
What you need, therefore, is some pre-processing step which splits these pieces into their individual characters and forwards these individually to your filtered render listener.
This actually is fairly easy to implement. The argument type in which the text pieces are forwarded, TextRenderInfo, offers a method to split itself up:
/**
* Provides detail useful if a listener needs access to the position of each individual glyph in the text render operation
* @return A list of {@link TextRenderInfo} objects that represent each glyph used in the draw operation. The next effect is if there was a separate Tj opertion for each character in the rendered string
* @since 5.3.3
*/
public List<TextRenderInfo> getCharacterRenderInfos() // iText / Java
virtual public List<TextRenderInfo> GetCharacterRenderInfos() // iTextSharp / .Net
Thus, all you have to do is create and use a RenderListener / IRenderListener implementation which forwards all the calls it gets to another listener (your filtered listener in your case) with the twist that renderText / RenderText splits its TextRenderInfo argument and forwards the splinters one by one individually.
A Java sample
As the OP asked for more details, here is some more code. As I'm predominantly working with Java, though, I'm providing it in Java for iText. But it is easy to port to C# for iTextSharp.
As mentioned above a pre-processing step is needed which splits the text pieces into their individual characters and forwards them individually to your filtered render listener.
For this step you can use this class TextRenderInfoSplitter:
package stackoverflow.itext.extraction;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
public class TextRenderInfoSplitter implements TextExtractionStrategy
{
public TextRenderInfoSplitter(TextExtractionStrategy strategy)
{
this.strategy = strategy;
}
public void renderText(TextRenderInfo renderInfo)
{
for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
{
strategy.renderText(info);
}
}
public void beginTextBlock()
{
strategy.beginTextBlock();
}
public void endTextBlock()
{
strategy.endTextBlock();
}
public void renderImage(ImageRenderInfo renderInfo)
{
strategy.renderImage(renderInfo);
}
public String getResultantText()
{
return strategy.getResultantText();
}
final TextExtractionStrategy strategy;
}
If you have a TextExtractionStrategy strategy (like your new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter)), you now can feed it with single-character TextRenderInfo instances like this:
String textInsideRect = PdfTextExtractor.getTextFromPage(reader, pageNo, new TextRenderInfoSplitter(strategy));
I tested it with the PDF created in this answer for the area
Rectangle rect = new Rectangle(200, 600, 200, 135);
For reference I marked the area in the PDF:
Text extraction filtered by area without the TextRenderInfoSplitter results in:
I am trying to create a PDF file with a lot
of text contents in the document. I am
using PDFBox
Text extraction filtered by area with the TextRenderInfoSplitter results in:
to create a PDF f
ntents in the docu
n g P D F
BTW, you here see a disadvantage of splitting the text into individual characters early: The final text line is typeset using very large character spacing. If you keep the text segments from the PDF as they are, text extraction strategies still easily can see that the line consists of the two words using and PDFBox. As soon as you feed the text segments character by character into the text extraction strategies, they are likely to interpret such widely set words as many one-letter words.
An improvement
The highlighted word "eliminate" is for instance extracted as "o eliminate t". This has been highlighted by double clicking the word and highlighted in Adobe Acrobat Reader.
Something similar happens in my sample above, letters barely touching the area of interest make it into the result.
This is due to the RegionTextRenderFilter implementation of allowText allowing all text to continue whose baseline intersects the rectangle in question, even if the intersection consists of merely a single dot:
public boolean allowText(TextRenderInfo renderInfo){
LineSegment segment = renderInfo.getBaseline();
Vector startPoint = segment.getStartPoint();
Vector endPoint = segment.getEndPoint();
float x1 = startPoint.get(Vector.I1);
float y1 = startPoint.get(Vector.I2);
float x2 = endPoint.get(Vector.I1);
float y2 = endPoint.get(Vector.I2);
return filterRect.intersectsLine(x1, y1, x2, y2);
}
Given that you first split the text into characters, you might want to check whether their respective baseline is completely contained in the area in question, i.e. implement your own RenderFilter by copying RegionTextRenderFilter and then replacing the line
return filterRect.intersectsLine(x1, y1, x2, y2);
by
return filterRect.contains(x1, y1) && filterRect.contains(x2, y2);
Depending on how exactly text is highlighted in Adobe Acrobat Reader, though, you might want to change this in a completely custom way.
Highlight annotations are represented as a collection of quadrilaterals, stored in the /QuadPoints entry of the annotation dictionary, that describe the area(s) on the page covered by the annotation.
Why are they this way?
This is my fault, actually. In Acrobat 1.0, I worked on the "find text" code which initially only used a rectangle for the representation of a selected area on the page. While working on the code, I was very unhappy with the results, especially with maps where the text followed land details.
As a result, I made the find tool build up a set of quadrilaterals on the page and anneal them, when possible, to build words.
In Acrobat 2.0, the engineer responsible for full generalized text extraction built an algorithm called Wordy that was better than my first cut, but he kept the quadrilateral code since that was the most accurate representation of what was on the page.
Almost all text-related code was refactored to use this code.
Then we get highlight annotations. When markup annotations were added to Acrobat, they were used to decorate text that was already on the page. When a user clicks down on a page, Wordy extracts the text into appropriate data structures and then the text select tool maps mouse motion onto the quadrilateral sets. When a text highlight annotation is created, the subset of quadrilaterals from Wordy get placed into a new text highlight annotation.
How do you get the words on the page that are highlighted? Tricky. You have to extract the text on the page (you don't have Wordy, sorry) and then find all quads that are contained within the set from the annotation.
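A simplified, library-free Java sketch of that matching step is shown below. The Glyph type is hypothetical and stands for whatever per-character boxes your text extraction delivers. The sketch reduces each quadrilateral to its axis-aligned bounding box, which makes it independent of the order of the four points but is only adequate for text that is not rotated, and it tests the centre of each glyph box instead of full containment so that glyphs touching the border are handled predictably. Both choices are assumptions of this sketch, not something the annotation format prescribes.

```java
import java.util.ArrayList;
import java.util.List;

final class QuadHighlight {

    /** Hypothetical per-character box in page coordinates, in reading order. */
    static final class Glyph {
        final String text;
        final double left, bottom, right, top;

        Glyph(String text, double left, double bottom, double right, double top) {
            this.text = text;
            this.left = left; this.bottom = bottom; this.right = right; this.top = top;
        }
    }

    /**
     * Converts a flat QuadPoints array (8 numbers per quadrilateral) into
     * bounding boxes {minX, minY, maxX, maxY}, whatever the point order is.
     */
    static List<double[]> quadBounds(double[] quadPoints) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException("QuadPoints needs 8 numbers per quadrilateral");
        }
        List<double[]> bounds = new ArrayList<>();
        for (int i = 0; i < quadPoints.length; i += 8) {
            double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
            double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
            for (int k = 0; k < 4; k++) {
                double x = quadPoints[i + 2 * k];
                double y = quadPoints[i + 2 * k + 1];
                minX = Math.min(minX, x); maxX = Math.max(maxX, x);
                minY = Math.min(minY, y); maxY = Math.max(maxY, y);
            }
            bounds.add(new double[] { minX, minY, maxX, maxY });
        }
        return bounds;
    }

    /** Concatenates the glyphs whose centre lies inside any of the quadrilaterals. */
    static String highlightedText(double[] quadPoints, List<Glyph> glyphs) {
        List<double[]> bounds = quadBounds(quadPoints);
        StringBuilder result = new StringBuilder();
        for (Glyph g : glyphs) {
            double centerX = (g.left + g.right) / 2;
            double centerY = (g.bottom + g.top) / 2;
            for (double[] b : bounds) {
                if (centerX >= b[0] && centerX <= b[2] && centerY >= b[1] && centerY <= b[3]) {
                    result.append(g.text);
                    break; // count each glyph once, even if quads overlap
                }
            }
        }
        return result.toString();
    }
}
```

For rotated or skewed text the bounding box is too generous, and a real point-in-quadrilateral test would be needed instead.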