A project I work on extracts text from PDF documents and includes it in an Elasticsearch index. In some cases, some or all of the text in a document is encoded such that it renders correctly in English, but copy & paste and normal iText7 text extraction produce garbage characters (including control-type characters such as 0x0002).
I already go page by page and check all the fonts for a different reason. My current approach for detecting bad characters is to inspect each font and check whether it has a ToUnicode entry:
for (int pageIndex = 1; pageIndex <= document.GetNumberOfPages(); pageIndex++)
{
    PdfPage page = document.GetPage(pageIndex);
    PdfDictionary fontResources = page.GetResources()?.GetResource(PdfName.Font);
    if (fontResources != null)
    {
        foreach (PdfObject font in fontResources.Values(true))
        {
            if (font is PdfDictionary fontDict)
            {
                PdfName subType = fontDict.GetAsName(PdfName.Subtype);
                // ToUnicode is normally a stream, not a name, so check for
                // its presence directly rather than via GetAsName.
                PdfObject toUnicode = fontDict.Get(PdfName.ToUnicode);
                if (subType == PdfName.Type0 && toUnicode == null)
                {
                    // Font is Type0 and there's no Unicode mapping available,
                    // so extract the text a different way.
                }
            }
        }
    }
}
Does iText7 have a definitive way to determine if a PDF page has unmapped/unextractable characters in it?
Also, the current plan is to detect any pages with unextractable characters and then render and finally OCR them. Is there a better approach?
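For context, the complementary heuristic I'm also experimenting with is to scan the text iText7 extracts from each page and look for suspicious characters. This is only a sketch of a heuristic, not a definitive check:

```csharp
using System.Linq;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Heuristic: flag a page as "suspect" if its extracted text contains
// the replacement character or non-whitespace control characters.
static bool PageLooksGarbled(PdfDocument document, int pageIndex)
{
    string text = PdfTextExtractor.GetTextFromPage(document.GetPage(pageIndex));
    return text.Any(c =>
        c == '\uFFFD' ||
        (char.IsControl(c) && c != '\r' && c != '\n' && c != '\t'));
}
```

It catches pages like the ones producing 0x0002, but it obviously can't detect mappings that are present yet wrong.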
I am trying to extract specific pages from my Word document (.docx), say pages 2 and 4. The page numbers are given as a comma-separated string, which I split on ',' and loop over. Below is the code:
if (startEnd.Contains(','))
{
    arrSpecificPage = startEnd.Split(',');
    for (int i = 0; i < arrSpecificPage.Length; i++)
    {
        range.Start = doc.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, arrSpecificPage[i]).Start;
        range.End = doc.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, arrSpecificPage[i]).End;
        range.Copy();
        newDocument.Range().Paste();
    }
    newDocument.SaveAs(outputSplitDocpath);
}
The issue is that this code copies only the last page (4 in this case) to the new document. How do I get page 2 in as well? What's wrong with the code?
Since you always specify the entire document "range" as the paste target, each paste replaces the entire content of the new document.
It's correct that you work with a Range object and not with a selection, but it helps if you think about a Range like a selection. If you select everything (Ctrl+A) then paste, what was selected is replaced by what is pasted. Whatever is assigned to a Range will replace the content of the Range.
The way to solve this is to "collapse" the Range - think of it like pressing the Right-arrow or left-arrow key to "collapse" a selection to its start or end point. In the object model, this is the Collapse method that takes a parameter indicating whether to collapse to the start or end point (see the code below).
Note that I've also changed the code to use document.Content instead of document.Range(). Content is a property that returns the entire body of the document; Range is a method that expects a start and end point defining a Range. Using the property is the preferred way to refer to the entire document.
if (startEnd.Contains(','))
{
    arrSpecificPage = startEnd.Split(',');
    for (int i = 0; i < arrSpecificPage.Length; i++)
    {
        range.Start = doc.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, arrSpecificPage[i]).Start;
        range.End = doc.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, arrSpecificPage[i]).End;
        range.Copy();
        // Collapse the target range to the end of the document
        // so each paste appends instead of replacing everything.
        Word.Range targetRange = newDocument.Content;
        targetRange.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
        targetRange.Paste();
    }
    newDocument.SaveAs(outputSplitDocpath);
}
Is it possible in a FieldMergingCallback to retrieve the field size, calculate a suitable font size, and change the font size of the merged text?
In my template, I have a table with fixed-size cells. The table cannot grow.
So when I put long text into a cell, I would like the font to become smaller whenever the text does not fit.
The Fit-text option in Word's table settings does not behave the way I would like.
Before performing Mail Merge, you can use the following code to apply some Font formatting to all Run nodes inside Merge Field:
foreach (Field field in doc.Range.Fields)
{
    if (field.Type.Equals(Aspose.Words.Fields.FieldType.FieldMergeField))
    {
        Node currentNode = field.Start;
        bool isContinue = true;
        while (currentNode != null && isContinue)
        {
            if (currentNode.NodeType.Equals(NodeType.FieldEnd))
            {
                FieldEnd end = (FieldEnd)currentNode;
                if (end == field.End)
                    isContinue = false;
            }
            if (currentNode.NodeType.Equals(NodeType.Run))
            {
                // Specify Font formatting here
                Run run = (Run)currentNode;
                run.Font.Size = 6;
            }
            currentNode = currentNode.NextPreOrder(currentNode.Document);
        }
    }
}
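If you want the size to depend on the merged value rather than be a fixed 6pt, a sketch of a merge callback that writes the value itself at a length-dependent size is below. The length thresholds are invented for illustration; tune them to your cell width:

```csharp
using Aspose.Words;
using Aspose.Words.MailMerging;

// Sketch: shrink the font when the merged value is long.
class FitFontCallback : IFieldMergingCallback
{
    void IFieldMergingCallback.FieldMerging(FieldMergingArgs args)
    {
        string value = args.FieldValue?.ToString() ?? string.Empty;
        DocumentBuilder builder = new DocumentBuilder(args.Document);
        builder.MoveToMergeField(args.FieldName);
        // Hypothetical thresholds: 10pt up to 20 chars, 8pt up to 40, then 6pt.
        builder.Font.Size = value.Length <= 20 ? 10 : value.Length <= 40 ? 8 : 6;
        builder.Write(value);
        args.Text = string.Empty; // the text has already been written above
    }

    void IFieldMergingCallback.ImageFieldMerging(ImageFieldMergingArgs args) { }
}
```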
Hope this helps. I work with Aspose as a Developer Evangelist.
I have been creating an Outlook add-in in C# that reads email attachments (PDF, Doc/Docx) and searches them for keywords entered in a search bar. The problem is that I can find the emails with matching attachments, but the keyword count is wrong. I think this is because I am not retrieving the words from the attachment properly. Any help would be appreciated, thank you!
Here's what I have so far; for my test case it should output 1.
EDIT: Added the code that gives the unexpected results:
private int countKeywords(Outlook.Attachment attachment, string keyword)
{
    const string PR_ATTACH_DATA_BIN = "http://schemas.microsoft.com/mapi/proptag/0x37010102";
    byte[] attachmentData = (byte[])attachment.PropertyAccessor.GetProperty(PR_ATTACH_DATA_BIN);
    //MessageBox.Show(TextFromWord(attachment));
    string data = System.Text.Encoding.Unicode.GetString(attachmentData);
    int count = 0;
    if (data.Contains(" "))
    {
        int startIndex = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == ' ')
            {
                if (data.Substring(startIndex, i - startIndex).Equals(keyword))
                    count++;
                // Advance past the space whether or not the word matched.
                startIndex = i + 1;
            }
        }
        // Check the final word after the last space.
        if (data.Substring(startIndex).Equals(keyword))
            count++;
    }
    else if (data.Equals(keyword))
    {
        count++;
    }
    return count;
}
Firstly, do you really have UTF-16 encoded data in the attachment? Or is it single byte? Use ASCII encoding instead of Unicode.
Secondly, keep in mind that OOM won't let you access large (32kB+) binary properties using PropertyAccessor.GetProperty. You will need to save the attachment as a file (Attachment.SaveAsFile) or use other means of getting the attachment data without saving it (Extended MAPI or Redemption).
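A sketch of the Attachment.SaveAsFile route (the temp-path handling is simplified and not collision-safe):

```csharp
using System.IO;

// Save the attachment to a temp file and read its raw bytes,
// avoiding the PropertyAccessor size limit on large binary properties.
string tempPath = Path.Combine(Path.GetTempPath(), attachment.FileName);
attachment.SaveAsFile(tempPath);
byte[] attachmentData = File.ReadAllBytes(tempPath);
File.Delete(tempPath);
```

Note that for PDF or Word attachments the saved bytes are still the file's binary format, so you'd feed them to a proper text extractor rather than decode them as a string.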
Have you tried to step through your code and inspect the values of the variables to make sure you get the data you expect?
I've been trying to get my program to replace Unicode text in a binary file.
The user inputs what to find, and the program finds it and replaces it with a specific string, if present.
I've searched around, but couldn't find anything matching my specifics. What I would like is something like:
string text = File.ReadAllText(path, Encoding.Unicode);
text = text.Replace(userInput, specificString);
File.WriteAllText(path, text);
but anything that works in a similar manner should suffice.
Using that results in a file that is larger and unusable, though.
I use:
bool found = File.ReadAllText(path, Encoding.Unicode).Contains(userInput);
if (found)
{
    // Missing Part
}
to check whether the file contains the user-entered string, if that matters.
This can work only in very limited situations. Unfortunately, you haven't offered enough details as to the nature of the binary file for anyone to know if this will work in your situation or not. There are a practically endless variety of binary file formats out there, at least some of which would be rendered invalid if you modify a single byte, many more of which could be rendered invalid if the file length changes (i.e. data after your insertion point is no longer where it is expected to be).
Of course, many binary files are also either encrypted, compressed, or both. In such cases, even if you do by some miracle find the text you're looking for, it probably doesn't actually represent that text, and modifying it will render the file unusable.
All that said, for the sake of argument let's assume your scenario doesn't have any of these problems and it's perfectly okay to just completely replace some text found in the middle of the file with some entirely different text.
Note that we also need to make an assumption about the text encoding. Text can be represented in a wide variety of ways, and you will need to use the correct encoding not just to find the text, but also to ensure the replacement text will be valid. For the sake of argument, let's say your text is encoded as UTF8.
Now we have everything we need:
void ReplaceTextInFile(string fileName, string oldText, string newText)
{
    byte[] fileBytes = File.ReadAllBytes(fileName),
        oldBytes = Encoding.UTF8.GetBytes(oldText),
        newBytes = Encoding.UTF8.GetBytes(newText);
    int index = IndexOfBytes(fileBytes, oldBytes);
    if (index < 0)
    {
        // Text was not found
        return;
    }
    byte[] newFileBytes =
        new byte[fileBytes.Length + newBytes.Length - oldBytes.Length];
    Buffer.BlockCopy(fileBytes, 0, newFileBytes, 0, index);
    Buffer.BlockCopy(newBytes, 0, newFileBytes, index, newBytes.Length);
    Buffer.BlockCopy(fileBytes, index + oldBytes.Length,
        newFileBytes, index + newBytes.Length,
        fileBytes.Length - index - oldBytes.Length);
    File.WriteAllBytes(fileName, newFileBytes);
}

int IndexOfBytes(byte[] searchBuffer, byte[] bytesToFind)
{
    // <= so that a match ending at the last byte of the buffer is found too.
    for (int i = 0; i <= searchBuffer.Length - bytesToFind.Length; i++)
    {
        bool success = true;
        for (int j = 0; j < bytesToFind.Length; j++)
        {
            if (searchBuffer[i + j] != bytesToFind[j])
            {
                success = false;
                break;
            }
        }
        if (success)
        {
            return i;
        }
    }
    return -1;
}
Notes:
The above is destructive. You may want to run it only on a copy of the file, or modify the code so that it takes an additional parameter specifying the new file to which the modification should be written.
This implementation does everything in-memory. This is much more convenient, but if you are dealing with large files, and especially if you are on a 32-bit platform, you may find you need to process the file in smaller chunks.
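A chunked search is more involved because a match can straddle a chunk boundary. One way to handle that (a sketch, not hardened against every edge case) is to carry an overlap of pattern-length minus one bytes between reads:

```csharp
using System;
using System.IO;

// Find the offset of `pattern` in a file without loading it all into memory,
// carrying an overlap so matches spanning chunk boundaries aren't missed.
static long IndexOfBytesInFile(string fileName, byte[] pattern, int chunkSize = 1 << 20)
{
    using (FileStream stream = File.OpenRead(fileName))
    {
        byte[] buffer = new byte[chunkSize + pattern.Length - 1];
        long baseOffset = 0; // file offset of buffer[0]
        int carried = 0;
        int read;
        while ((read = stream.Read(buffer, carried, buffer.Length - carried)) > 0)
        {
            int filled = carried + read;
            for (int i = 0; i <= filled - pattern.Length; i++)
            {
                bool match = true;
                for (int j = 0; j < pattern.Length; j++)
                {
                    if (buffer[i + j] != pattern[j]) { match = false; break; }
                }
                if (match) return baseOffset + i;
            }
            // Carry the last pattern.Length - 1 bytes into the next round.
            carried = Math.Min(pattern.Length - 1, filled);
            Array.Copy(buffer, filled - carried, buffer, 0, carried);
            baseOffset += filled - carried;
        }
    }
    return -1;
}
```

Writing the replacement in a streaming fashion is a further step again, since an in-place rewrite only works when old and new byte sequences have the same length; otherwise you must write to a second file.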
I have a textbox control from DevExpress, and we can't allow more characters than its capacity. The problem is that the input string is XML-formatted and can contain multiple fonts. As the font size increases, the maximum number of characters decreases.
My first thought was to count by line, because lines are measurable regardless of font size, but I could not see a way to handle the column position.
How could I fill this textbox taking into account the string's fonts and XML tags?
You can use Exception Handling to figure it out for you:
bool flag = false;
int count = line.Length;
do
{
    try
    {
        txt.Text = line.Substring(0, count);
        flag = true;
    }
    catch (Exception) // replace with the specific exception the control throws
    {
        count--;
    }
}
while (!flag);
This works if you are getting an exception for putting in too long a line.
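An alternative that avoids driving control flow with exceptions is to measure the string and trim proactively. This is a different technique from the loop above, sketched with plain System.Drawing measurement; the font and width values are whatever your control actually uses, and measuring styled XML runs would need one measurement per run:

```csharp
using System.Drawing;

// Trim `line` until it fits within `maxWidth` pixels when drawn in `font`.
static string FitToWidth(string line, Font font, int maxWidth)
{
    using (Bitmap bmp = new Bitmap(1, 1))
    using (Graphics g = Graphics.FromImage(bmp))
    {
        int count = line.Length;
        while (count > 0 && g.MeasureString(line.Substring(0, count), font).Width > maxWidth)
            count--;
        return line.Substring(0, count);
    }
}
```

A binary search over `count` would make this faster for long lines, but the linear version is easier to follow.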