I would like to know how to upload a resume as a PDF file on an ASP.NET page. I know how to upload a simple txt file whose fields are separated by ",". Here's my code.
using System.IO;
string uploadfile = Server.MapPath("~/uploads3/") + FileUpload1.FileName;
FileUpload1.PostedFile.SaveAs(uploadfile);
if (File.Exists(uploadfile))
{
    string inputline = "";
    using (StreamReader sr = File.OpenText(uploadfile))
    {
        while ((inputline = sr.ReadLine()) != null)
        {
            string tempstr = inputline;
            string firstname = tempstr.Substring(0, tempstr.IndexOf(","));
            tempstr = tempstr.Substring(tempstr.IndexOf(",") + 1);
            string lastname = tempstr.Substring(0, tempstr.IndexOf(","));
            tempstr = tempstr.Substring(tempstr.IndexOf(",") + 1);
            (...)
Now, I have absolutely no idea how to do this with a PDF file containing a resume. How do I do that? Please explain your answers, as I'm new to System.IO. Thanks again.
You will want to take a look at the open source iTextSharp library. It provides all the methods you will need for writing to a PDF. There are plenty of other PDF writing libraries that can do the same. As far as I know it isn't practical to do this using System.IO. You can still upload a CSV file, have the codebehind do the formatting and PDF creation, and then save it to the web server.
PDF is not an easy format to read. You will need a library to extract the information you need.
The iTextSharp library can work, but you will need to walk through the tree structure of the document.
A (sometimes) simpler alternative is to use the .NET port of PDFBox, as instructed in this article. PDFBox converts the PDF to a plain-text representation that may be easier to parse. The downside of this approach is that the IKVM.NET library that PDFBox uses is huge, ~17MB.
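Once a library has produced plain text, the parsing step looks much like the comma-splitting code in the question. The following is only a minimal sketch of that step: the class name, the method names, and the patterns are hypothetical, and it assumes the extracted text has already been obtained from PDFBox or a similar library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls contact details out of text that a PDF
// library has already extracted. The patterns are illustrative only.
public static class ResumeTextParser
{
    // Deliberately loose e-mail pattern; good enough for a first pass.
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");

    // Returns the first e-mail address found, or null if there is none.
    public static string FindEmail(string extractedText)
    {
        if (string.IsNullOrEmpty(extractedText)) return null;
        Match match = EmailPattern.Match(extractedText);
        return match.Success ? match.Value : null;
    }

    // Returns the first non-blank line, which on many resumes is the name.
    // This is an assumption about layout, not something PDF guarantees.
    public static string FindFirstNonBlankLine(string extractedText)
    {
        if (string.IsNullOrEmpty(extractedText)) return null;
        foreach (string line in extractedText.Split('\n'))
        {
            string trimmed = line.Trim();
            if (trimmed.Length > 0) return trimmed;
        }
        return null;
    }
}
```

Resumes have no fixed field order, so heuristics like these will miss cases that a delimited text file would never produce.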
Related
I want to find out whether a given text is present in an uploaded PDF file in ASP.NET C#.
using (MemoryStream str = new MemoryStream(this.docUploadField.FileBytes))
{
    using (StreamReader sr = new StreamReader(str, Encoding.UTF8))
    {
        string line = sr.ReadToEnd();
    }
}
I am getting the below as the file content when I read the contents of file.
Please help me with this
You will need a PDF reading library.
The best known are:
iText (formerly iTextSharp, for those who remember it): https://github.com/itext/itext7-dotnet
PDFsharp: https://github.com/empira/PDFsharp
and many other free options.
With those, you open the PDF file, read it, and take the text you need.
Usually they give you a collection of the PDF elements (paragraphs, images, etc.), and you loop through them or use a search function to look for what you need.
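The looping step can be sketched independently of the library. This example assumes you already have one string of extracted text per page, however your chosen library produces it. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PdfTextSearch
{
    // pageTexts: one string per page, as produced by whichever PDF
    // library you use (an assumption of this sketch).
    // Returns the 1-based numbers of the pages that contain the term.
    public static List<int> FindPagesContaining(IList<string> pageTexts, string term)
    {
        var hits = new List<int>();
        if (pageTexts == null || string.IsNullOrEmpty(term)) return hits;

        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Treat a page with no extractable text as empty.
            string pageText = pageTexts[i] ?? string.Empty;

            // Case-insensitive match, since extracted text often differs in casing.
            if (pageText.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(i + 1);
            }
        }
        return hits;
    }
}
```

An empty result means the term was not found in any extracted text. It does not prove the term is absent from the PDF, because scanned pages contain images rather than text.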
I'm trying to read a file as a string, but it seems that the data is corrupted.
string filepaths = Files[0].FullName;
System.IO.StreamReader myFile = new System.IO.StreamReader(filepaths);
string datas = myFile.ReadToEnd();
But datas contains "pk0101" etc. instead of the original data. I'm doing this so that I can replace a placeholder with this string, datas. When I do the replacement, the inserted text is 0101 etc. Is that because of the content in datas? How can I read the file as a string? Your help will be greatly appreciated. Thank you.
*.docx is a file format that is a ZIP package of XML documents, which is why the raw content starts with "PK" when you read the file as plain text. Take a look here to become more familiar with this format definition.
For working with Office formats, Microsoft recommends the Open XML SDK, available as the DocumentFormat.OpenXml library.
Here is a great article for learning how to work with Word files.
It works as follows:
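using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
// filepaths is the .docx path from the question; false opens it read-only
using (var wordDocument = WordprocessingDocument.Open(filepaths, false))
{
    var body = wordDocument.MainDocumentPart.Document.Body;
    // Text of the first paragraph; body.InnerText gives the whole document
    var text = body.GetFirstChild<Paragraph>().InnerText;
}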
Also, take a look at this SO question: How do I read data from a word with format using the OpenXML Format SDK with c#?
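To see why StreamReader returned "PK..." in the question, it can help to look at the container directly. The sketch below builds a tiny ZIP in memory as a stand-in for a real .docx (a real file has many more parts) and reads the word/document.xml entry back. It uses only System.IO.Compression; the class and method names are hypothetical. On .NET Framework this needs a reference to the System.IO.Compression assembly.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class DocxPackageDemo
{
    // Builds a tiny ZIP in memory with one XML entry, standing in for a real .docx.
    public static byte[] BuildSamplePackage(string xml)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer is still readable after the archive is closed.
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                ZipArchiveEntry entry = zip.CreateEntry("word/document.xml");
                using (var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false)))
                {
                    writer.Write(xml);
                }
            }
            return buffer.ToArray();
        }
    }

    // True when the bytes start with the ZIP signature "PK".
    public static bool LooksLikeZip(byte[] bytes)
    {
        return bytes != null && bytes.Length >= 2
            && bytes[0] == (byte)'P' && bytes[1] == (byte)'K';
    }

    // Reads one entry of the package as text, or returns null if it is missing.
    public static string ReadEntry(byte[] packageBytes, string entryName)
    {
        using (var zip = new ZipArchive(new MemoryStream(packageBytes), ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = zip.GetEntry(entryName);
            if (entry == null) return null;
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Reading the raw XML this way is fine for inspection, but for replacing placeholders the Open XML SDK shown above is the safer route, because text in a Word document is often split across several XML elements.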
I have to make an application that exports the list of fonts used in a PDF and an .indd file to an Excel sheet. After a lot of research, I came to the conclusion that this is not possible with C#. I came across the InDesign Navigator API, which can be integrated into the Visual Studio IDE. I know C# and JavaScript. Is there any way this could be built so that it runs on both macOS and Windows? Thank you!
One way you could do this is by saving a text file out of InDesign and Acrobat with the font information. You could probably use ExtendScript to do this. The text file can then be imported easily into Excel as a CSV or text file (whitespace delimited).
You weren't very clear about what your intentions are, but here's an example of a JavaScript snippet that pulls font information out of InDesign and saves a list of fonts for a document.
var doc = app.activeDocument;
// Collect every font used in the active document
var docFonts = doc.fonts.everyItem().getElements();
var fileContents = "";
for (var i = 0; i < docFonts.length; i++) {
    var font = docFonts[i];
    fileContents += font.name + "\n";
}
// Save next to the document as <name>_fonts.txt
var newFilePath = doc.filePath + "/" + doc.name.replace(/\.indd$/i, '') + "_fonts.txt";
var newFile = File(newFilePath);
newFile.open('w');
newFile.write(fileContents);
newFile.close();
Here is a possible approach...
It is possible to write out an XML representation of an InDesign file...
To generate IDML, choose File > Export and select the format InDesign Markup (IDML)...
This is a ZIP archive with all the information.
There is a folder Resources which contains Fonts.xml (Resources: Fonts.xml)
This can be parsed cross-platform because it is just XML...
Here you can find a description of the anatomy of an IDML InDesign document...
http://www.indesignsecrets.com/downloads/Anatomy_of_IDML.pdf
Hope this helps...
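Since the question mentions C#, the parsing step could be sketched as follows. This assumes the exported package contains Resources/Fonts.xml as described above, and that each font appears as a Font element with a Name attribute. That layout is an assumption of this sketch, so check it against one of your own exports. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class IdmlFontLister
{
    // Reads Resources/Fonts.xml from an IDML package and returns the font names.
    // Assumption: fonts are <Font> elements carrying a Name attribute.
    public static List<string> ListFontNames(Stream idmlStream)
    {
        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var zip = new ZipArchive(idmlStream, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("Resources/Fonts.xml");
            if (entry == null) return new List<string>();

            using (Stream xmlStream = entry.Open())
            {
                XDocument doc = XDocument.Load(xmlStream);
                return doc.Descendants()
                    // Match on the local name so a namespace prefix does not matter.
                    .Where(e => e.Name.LocalName == "Font")
                    .Select(e => (string)e.Attribute("Name"))
                    .Where(name => !string.IsNullOrEmpty(name))
                    .Distinct()
                    .ToList();
            }
        }
    }
}
```

The resulting list can be written out one name per line and opened in Excel. Everything used here is in the base class library, so it should run on both Windows and macOS with a current .NET runtime.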
I have a requirement for an application that takes Doc, Docx and PDF and converts them to RTF.
The conversion is one way and I do not need to convert back to Doc or PDF.
Has anyone done this, and can you recommend a library? I know there is Aspose, but it's way too pricey and the licenses are per year, so that's not going to work for the company I work for.
I'm OK with using more than one library, one for each of the file types, if that's what it takes.
Thanks in advance
Telerik has a nice library to do this. They actually have an entire editor that looks like Microsoft Word. It can open multiple file formats and it saves natively as RTF (although it can save as PDF, DOCX, etc.) The one thing I'm not sure of is opening the PDF and saving as an RTF. I'm not sure that the Telerik library can do that.
Here is a link to the library:
http://www.telerik.com/products/wpf/richtextbox.aspx
For a PDF to RTF library, you could use this:
http://www.sautinsoft.com/products/pdf-focus/index.php
GroupDocs.Conversion Cloud is a REST API that converts all common file formats from one format to another reliably and easily. Its free pricing plan offers 50 free credits per month.
Here is sample code for PDF to RTF from default storage:
// Get App Key and App SID from https://dashboard.groupdocs.cloud/
var configuration = new GroupDocs.Conversion.Cloud.Sdk.Client.Configuration(MyAppSid, MyAppKey);
var apiInstance = new ConvertApi(configuration);
try
{
    // convert settings
    var settings = new GroupDocs.Conversion.Cloud.Sdk.Model.ConvertSettings
    {
        StorageName = null,
        FilePath = "02_pages.pdf",
        Format = "rtf",
        ConvertOptions = new RtfConvertOptions(),
        OutputPath = "02_pages.rtf"
    };
    // convert to specified format
    List<StoredConvertedResult> response = apiInstance.ConvertDocument(new ConvertDocumentRequest(settings));
    Console.WriteLine("Document converted successfully: " + response[0].Url);
}
catch (Exception e)
{
    Console.WriteLine("Exception when calling ConvertApi.ConvertDocument: " + e.Message);
}
I'm a developer evangelist at Aspose.
Currently I am using the following code, along with some DLL files from PDFBox.
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText(doc);
// Show the extracted text (the original assigned an undefined variable here)
richTextBox1.Text = text;
Using this code I am able to get the text, but not in the correct format. Please give me some ideas.
Extracting the text from a pdf file is anything but trivial.
To quote from the iTextSharp tutorial:
"The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler http://poppler.freedesktop.org/ which is used by the PDF viewers of GNOME and KDE. It ships with a command-line utility called pdftotext, but I have no experience with it. It may be your best free option.
There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
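To make the "canvas" point concrete: what an extractor typically sees is a set of positioned strings, and any line or paragraph structure has to be guessed from coordinates. The sketch below shows one common heuristic, grouping fragments with nearly the same y coordinate into a line. The TextFragment type, the tolerance parameter, and the class names are hypothetical and are not taken from PDFBox or iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A positioned run of text, roughly what a PDF page contains (hypothetical type).
public sealed class TextFragment
{
    public double X { get; set; }
    public double Y { get; set; }
    public string Text { get; set; }
}

public static class LineRebuilder
{
    // Groups fragments whose y coordinates lie within 'tolerance' of the first
    // fragment of the current line, orders lines top to bottom (PDF y grows
    // upward) and the fragments inside each line left to right.
    public static List<string> RebuildLines(IEnumerable<TextFragment> fragments, double tolerance)
    {
        var lines = new List<List<TextFragment>>();
        foreach (TextFragment fragment in fragments.OrderByDescending(f => f.Y))
        {
            List<TextFragment> current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= tolerance)
            {
                current.Add(fragment);
            }
            else
            {
                lines.Add(new List<TextFragment> { fragment });
            }
        }

        return lines
            .Select(line => string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

Heuristics like this break down on multi-column layouts, tables, and rotated text, which is one reason the extracted text often comes out in the wrong format.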