Get text from PDF stored in LocalFolder using iTextSharp

Get text from PDF stored in LocalFolder using iTextSharp - c#

I am trying to get the text from a PDF stored in localStorage in a Windows Phone 8.1 application,but I always get an FileNotFoundException.
To explain the whole story, I get a PDF from an online source, I store it to a folder with name same as the username (The username is an email address, but I tried also without the # sign) of the user and then I want to get some text from the PDF file. I use iTextSharp and follow the examples, but cannot succeed. When I send the PDF to the Launcher is opening succesfully with another app like Acrobat Reader.
My function is like below. I first send an PDF Object, which has an attribute called Path and it is stored to folder specific to the username of the user.
Then I get the pdf as a StorageFile Item. When I create the PDFReader calling the constructor I get a FileNotFoundException. Does anybody knows or can guess what can be the problem? Is iTextSharp compatible with Windows Phone 8.1?
internal async Task<bool> OpenPdfFromDownloadedCollections(PDF pdfToOpen, string username)
{
try
{
StorageFolder folder = ApplicationData.Current.LocalFolder;
var pdfFolder = await folder.GetFolderAsync(username + "PDFs");
var pdf = await pdfFolder.GetFileAsync(Object.Path);
StringBuilder text = new StringBuilder();
using (PdfReader reader = new PdfReader(pdf.Path))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string[] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
}
}
return true;
}
catch (Exception)
{
return false;
}
}

var pdf = await pdfFolder.GetFileAsync(Object.Path);
In this line of code you should only pass the file name but you are giving the whole Path as parameter. As pdfFolder currently represents the path.

Related

How can I debug my add-in when WordEditor's Content property is crashing Outlook?

I have folder full of *.msg files saved from Outlook and I'm trying to convert them to Word.
There is a loop that loads each *.msg as MailItem and saves them.
public static ConversionResult ConvertEmailsToWord(this Outlook.Application app, string source, string target)
{
var word = new Word.Application();
var emailCounter = 0;
var otherCounter = 0;
var directoryTree = new PhysicalDirectoryTree();
foreach (var node in directoryTree.Walk(source))
{
foreach (var fileName in node.FileNames)
{
var currentFile = Path.Combine(node.DirectoryName, fileName);
var branch = Regex.Replace(node.DirectoryName, $"^{Regex.Escape(source)}", string.Empty).Trim('\\');
Debug.Print($"Processing file: {currentFile}");
// This is an email. Convert it to Word.
if (Regex.IsMatch(fileName, #"\.msg$"))
{
if (app.Session.OpenSharedItem(currentFile) is MailItem item)
{
if (item.SaveAs(word, Path.Combine(target, branch), fileName))
{
emailCounter++;
}
item.Close(SaveMode: OlInspectorClose.olDiscard);
}
}
// This is some other file. Copy it as is.
else
{
Directory.CreateDirectory(Path.Combine(target, branch));
File.Copy(currentFile, Path.Combine(target, branch, fileName), true);
otherCounter++;
}
}
}
word.Quit(SaveChanges: false);
return new ConversionResult
{
EmailCount = emailCounter,
OtherCount = otherCounter
};
}
The save method looks likes this:
public static bool SaveAs(this MailItem mail, Word.Application word, string path, string name)
{
Directory.CreateDirectory(path);
name = Path.Combine(path, $"{Path.GetFileNameWithoutExtension(name)}.docx");
if (File.Exists(name))
{
return false;
}
var copy = mail.GetInspector.WordEditor as Word.Document;
copy.Content.Copy();
var doc = word.Documents.Add();
doc.Content.Paste();
doc.SaveAs2(FileName: name);
doc.Close();
return true;
}
It works for most *.msg files but there are some that crash Outlook when I call copy.Content on a Word.Document.
I know you cannot tell me what is wrong with it (or maybe you do?) so I'd like to findit out by myself but the problem is that I am not able to catch the exception. Since a simple try\catch didn't work I tried it with AppDomain.CurrentDomain.UnhandledException this this didn't catch it either.
Are there any other ways to debug it?
The mail that doesn't let me get its content inside a loop doesn't cause any troubles when I open it in a new Outlook window and save it with the same method.

It makes sense to add some delays between Word calls. IO operations takes some time to finish. Also there is no need to create another document in Word for copying the content:
var copy = mail.GetInspector.WordEditor as Word.Document;
copy.Content.Copy();
var doc = word.Documents.Add();
doc.Content.Paste();
doc.SaveAs2(FileName: name);
doc.Close();
Instead, do the required modifications on the original document instance and then save it to the disk. The original mail item will remain unchanged until you call the Save method from the Outlook object model. You may call the Close method passing the olDiscard which discards any changes to the document.
Also consider using the Open XML SDK if you deal with open XML documents only, see Welcome to the Open XML SDK 2.5 for Office for more information.

Do you actually need to use Inspector.WordEditor? You can save the message in a format supported by Word (such as MHTML) using OOM alone by calling MailItem.Save(..., olMHTML) and open the file in Word programmatically to save it in the DOCX format.

How to programmatically save an outlook email attachment (e.g. PDF) that has been copied to clipboard by the user using CTRL-C, in C#

Context:
I am able to copy (using CTRL-C) and paste programmatically a file from the clipboard when the file copied to the clipboard is for example a file on the desktop. This is simple enough using the following syntax:
File.Copy(Clipboard.GetFileDropList()[0], savePath)
where Clipboard.GetFileDropList()[0] returns the path of the copied file and savePath is the paste location.
However, i find that the above syntax does NOT work if the copied file (using CTRL-C) is a file attachment in an Outlook email. In that scenario, Clipboard.ContainsFileDropList() returns false and Clipboard.GetFileDropList()[0] results in the following error message:
"ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index"
This is despite the fact that pressing CTRL-V does successfully paste the file, confirming that the file was initially successfully copied to the clipboard.
Question:
Sorry if i missed something very basic. My question is how to programmatically paste/save an email attachment (PDF, Word, etc...) from the clipboard to a file location when that email attachment was copied into the clipboard using CTRL-C from within Outlook.
Note that i do understand that what i am trying to do can be solved by skipping the Clipboard and interacting programmatically with Outlook to access a selected email attachment. However, my goal here is to learn how to interact programmatically with the Clipboard under different scenario.

You are using the wrong DataFormat. You can always get a list of currently present data formats by calling Clipboard.GetDataObject().GetFormats().
You need to use:
"FileGroupDescriptor" to retrieve the file names
private static async Task<List<string>> GetAttachedFileNamesFromClipboardAsync(IDataObject clipboardData)
{
if (!clipboardData.GetDataPresent("FileGroupDescriptor"))
{
return new List<string>();
}
using (var descriptorStream = clipboardData.GetData("FileGroupDescriptor", true) as MemoryStream)
{
using (var streamReader = new StreamReader(descriptorStream))
{
var streamContent = await streamReader.ReadToEndAsync();
string[] fileNames = streamContent.Split(new[] { '\0' }, StringSplitOptions.RemoveEmptyEntries);
return new List<string>(fileNames.Skip(1));
}
}
}
"FileContents" to retrieve the raw file contents
// Returns the attachment file content as string
private static async Task<string> GetAttachmentFromClipboardAsync(IDataObject clipboardData)
{
if (!clipboardData.GetDataPresent("FileContents"))
{
return string.Empty;
}
using (var fileContentStream = clipboardData.GetData("FileContents", true) as MemoryStream)
{
using (var streamReader = new StreamReader(fileContentStream))
{
return await streamReader.ReadToEndAsync();
}
}
}
// Returns the attachment file content as MemoryStream
private static MemoryStream GetAttachmentFromClipboard(IDataObject clipboardData)
{
if (!clipboardData.GetDataPresent("FileContents"))
{
return null;
}
return clipboardData.GetData("FileContents", true) as MemoryStream;
}
Save attached file to disk
Since Windows only adds the first selected attachment to the system clipboard, this solution can only save a single attachment. Apparently the office clipboard is not accessible.
private static async Task SaveAttachmentFromClipboardToFileAsync(IDataObject clipboardData, string destinationFilePath)
{
if (!clipboardData.GetDataPresent("FileContents"))
{
return;
}
using (var attachedFileStream = clipboardData.GetData("FileContents", true) as MemoryStream)
{
using (var destinationFileStream = File.Open(destinationFilePath, FileMode.OpenOrCreate))
{
await attachedFileStream.CopyToAsync(destinationFileStream);
}
}
}
Save attached files to disk using the Office API
Requires to reference Microsoft.Office.Interop.Outlook.dll. This solution does not depend on the system clipboard. It just reaads the selected attachments from the currently open message item in the Outlook explorer.
You still can monitor the system clipboard to trigger the process to save the attachments.
private static void SaveSelectedAttachementsToFolder(string destinationFolderPath)
{
var outlookApplication = new Microsoft.Office.Interop.Outlook.Application();
Explorer activeOutlookExplorer = outlookApplication.ActiveExplorer();
AttachmentSelection selectedAttachments = activeOutlookExplorer.AttachmentSelection;
foreach(Attachment attachment in selectedAttachments)
{
attachment.SaveAsFile(Path.Combine(destinationFolderPath, attachment.FileName));
}
}

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

i have a netcore 3 app to read and split a PDF containing paychecks of some companies which i am working for.
This app ran pretty well since last builds... my the way, the PDF reader started to fail to parse the contents of any PDF.
PDF is built only with Italian words, no special chars. Few tables and a single logo. I'm not able to attach it due to privacy.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paycheck = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Anything i can do?
As far as i can see... the only and best way to have something to read a PDF text for free is iText7...

Umbraco Adding base64 File with SetValue

I'll explain the problem right away, but first of all...is this achievable?
I have a Document Type in Umbraco where I store data from a Form. I can store everything except the file.
...
content.SetValue("notes", item.Notes);
content.SetValue("curriculum", item.Curriculum); /*this is the file*/
...
I'm adding items like this where SetValue comes from the following namespace namespace Umbraco.Core.Models and this is the function signature void SetValue(string propertyTypeAlias, object value)
And the return error is the following
"String or binary data would be truncated.
↵The statement has been terminated."
Did I missunderstood something? Shouldn't I be sending the base64? I'm adding the image to a media file where it creates a sub-folder with a sequential number. If I try to add an existing folder it appends the file just fine but if I point to a new media sub-folder it also returns an error. Any ideas on how should I approach this?
Thanks in advance
Edit 1: After Cryothic answer I've updated my code with the following
byte[] tempByte = Convert.FromBase64String(item.Curriculum);
var mediaFile = _mediaService.CreateMedia(item.cvExtension, -1, Constants.Conventions.MediaTypes.File);
Stream fileStream = new MemoryStream(tempByte);
var fileName = Path.GetFileNameWithoutExtension(item.cvExtension);
mediaFile.SetValue("umbracoFile", fileName, fileStream);
_mediaService.Save(mediaFile);
and the error happens at mediaFile.SetValue(...).
If I upload a file from umbraco it goes to "http://localhost:3295/media/1679/test.txt" and the next one would go to "http://localhost:3295/media/1680/test.txt". Where do I tell on my request that it has to add to the /media folder and increment? Do I only point to the media folder and umbaco handles the incrementation part?
If I change on SetValue to the following mediaFile.SetValue("curriculum", fileName, fileStream); the request succeeds but the file is not added to the content itself and the file is added to "http://localhost:3295/umbraco/media" instead of "http://localhost:3295/media".
If I try the following - content.SetValue("curriculum", item.cvExtension); - the file is added to the content but with the path "http://localhost:3295/umbraco/test.txt".
I'm not understanding very well how umbraco inserts files into the media folder (outside umbraco) and how you add the media service path to the content service.

Do you need to save base64?
I have done something like that, but using the MediaService.
My project had the option to upload multiple images on mulitple wizard-steps, and I needed to save them all at once. So I looped through the uploaded files (HttpFileCollection) per step. acceptedFiletypes is a string-list with the mimetypes I'd allow.
for (int i = 0; i < files.Count; i++) {
byte[] fileData = null;
UploadedFile uf = null;
try {
if (acceptedFiletypes.Contains(files[i].ContentType)) {
using (var binaryReader = new BinaryReader(files[i].InputStream)) {
fileData = binaryReader.ReadBytes(files[i].ContentLength);
}
if (fileData.Length > 0) {
uf = new UploadedFile {
FileName = files[i].FileName,
FileType = fileType,
FileData = fileData
};
}
}
}
catch { }
if (uf != null) {
projectData.UploadedFiles.Add(uf);
}
}
After the last step, I would loop throug my projectData.UploadedFiles and do the following.
var service = Umbraco.Core.ApplicationContext.Current.Services.MediaService;
var mediaTypeAlias = "Image";
var mediaItem = service.CreateMedia(fileName, parentFolderID, mediaTypeAlias);
Stream fileStream = new MemoryStream(file.FileData);
mediaItem.SetValue("umbracoFile", fileName, fileStream);
service.Save(mediaItem);
I also had a check which would see if the uploaded filename was ending on ".pdf". In that case I'd change the mediaTypeAlias to "File".
I hope this helps.

getting URLs from PDF using PDFNet

We're using the PDFNet library to extract the contents of a PDF file. One of the things we need to do is extract the URLs in the PDF. Unfortunately, as you scan through the elements in the file, you get the URL in pieces, and it's not always clear which piece goes with which.
What is the best way to get complete URLs from PDFNet?

Links are stored on the pages as annotations. You can do something like the following code to get the URI from the annotation. The try/catch block is there because if any of the values are missing, they still return an Obj object, but you cannot call any method on it without it throwing.
Also, be aware that not everything that looks like a link is the same. We created two PDFs from the same Word file. The first we created with print to PDF. The second we created from within Acrobat.
The links in both files work fine with Acrobat Reader, but only the second file has annotations that PDFNet can see.
Page page = doc.GetPage(1);
for (int i = 1; j < page.GetNumAnnots(); j++) {
Annot annot = page.GetAnnot(i);
if (!annot.IsValid())
continue;
var sdf = annot.GetSDFObj();
string uri = ParseURI(sdf);
Console.WriteLine(uri);
}
private string ParseURI(pdftron.SDF.Obj obj) {
try {
if (obj.IsDict()) {
var aDictionary = obj.Find("A").Value();
var uri = aDictionary.Find("URI").Value();
return uri.GetAsPDFText();
}
} catch (Exception ) {
return null;
}
return null;
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.