I have to build an application that lists the fonts used in a PDF and in an .indd file and writes them to an Excel sheet. After a lot of research I came to know that this is not possible with C#. I came across the InDesign Navigator API, which can be integrated into the Visual Studio IDE. I know C# and JavaScript. Is there any way to build this so that it runs on both Mac and Windows? Thank you!
One way you could do this is by saving a text file out of InDesign and Acrobat with the font information. You could probably use ExtendScript to do this. The text file can then be imported easily into Excel as a CSV or text file (whitespace delimited).
You weren't very clear about what your intentions are, but here's an example of a JavaScript snippet that pulls font information out of InDesign and saves a list of fonts for a document.
// Collect the fonts used by the document that is currently open in InDesign.
var doc = app.activeDocument;
var docFonts = doc.fonts.everyItem().getElements();
var fileContents = "";
for (var i = 0; i < docFonts.length; i++) {
    var font = docFonts[i];
    fileContents += font.name + "\n";
}
// Write the list next to the document as <document name>_fonts.txt.
var newFilePath = doc.filePath + "/" + doc.name.replace(/\.indd$/i, '') + "_fonts.txt";
var newFile = new File(newFilePath);
newFile.open('w');
newFile.write(fileContents);
newFile.close();
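If more than one column is wanted in Excel, the font list can be written as CSV instead of one name per line. The following is only a sketch: `csvField`, `toCsv`, and the sample rows are hypothetical names and data that are not part of the script above, and it sticks to plain loops so that it should also run in ExtendScript.

```javascript
// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded quotes (the usual CSV convention that Excel reads).
function csvField(value) {
  var text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Build CSV text from an array of plain objects and a list of column keys.
function toCsv(rows, columns) {
  var header = [];
  for (var c = 0; c < columns.length; c++) {
    header.push(csvField(columns[c]));
  }
  var lines = [header.join(",")];
  for (var r = 0; r < rows.length; r++) {
    var fields = [];
    for (var k = 0; k < columns.length; k++) {
      var value = rows[r][columns[k]];
      fields.push(csvField(value == null ? "" : value));
    }
    lines.push(fields.join(","));
  }
  return lines.join("\r\n") + "\r\n";
}

// Hypothetical sample rows standing in for values collected from the document.
var sampleFontRows = [
  { name: "Minion Pro Regular", family: "Minion Pro", style: "Regular" },
  { name: 'Foo "Display", Bold', family: "Foo", style: "Bold" }
];
var csvText = toCsv(sampleFontRows, ["name", "family", "style"]);
```

The resulting `csvText` could be written with the same `File` calls as the plain list, using a `.csv` extension so that Excel opens it directly.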
Here is a possible approach...
It is possible to write out an XML representation of an InDesign file...
To generate IDML, choose File > Export and pick the format InDesign Markup (IDML)...
The result is a zip archive with all the information.
It contains a Resources folder that holds Fonts.xml (Resources/Fonts.xml).
This can be parsed cross-platform because it is just XML...
Here you can find a description of the anatomy of an IDML InDesign document...
http://www.indesignsecrets.com/downloads/Anatomy_of_IDML.pdf
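As a rough sketch of the parsing step, assuming Fonts.xml has already been unzipped from the package and read into a string: the function below pulls a few attributes out of each `<Font>` element. The attribute names (`FontFamily`, `Name`, `PostScriptName`) and the sample XML are assumptions based on a typical export and should be checked against a real file; `listFonts`, `getXmlAttribute`, and `decodeXmlEntities` are hypothetical helper names. A regular expression is used only to keep the example dependency-free. A real XML parser would be more robust.

```javascript
// Decode the five predefined XML entities in an attribute value.
function decodeXmlEntities(text) {
  return text.replace(/&lt;/g, "<").replace(/&gt;/g, ">")
             .replace(/&quot;/g, '"').replace(/&apos;/g, "'")
             .replace(/&amp;/g, "&");
}

// Read one attribute from the text of a single start tag.
// The leading \s keeps "Name" from matching inside "PostScriptName".
function getXmlAttribute(tagText, attributeName) {
  var match = new RegExp('\\s' + attributeName + '="([^"]*)"').exec(tagText);
  return match ? decodeXmlEntities(match[1]) : "";
}

// Return one record per <Font ...> element found in the Fonts.xml text.
function listFonts(fontsXml) {
  var fonts = [];
  var tagPattern = /<Font\s[^>]*>/g; // matches <Font ...>, not <FontFamily ...>
  var tag;
  while ((tag = tagPattern.exec(fontsXml)) !== null) {
    fonts.push({
      family: getXmlAttribute(tag[0], "FontFamily"),
      name: getXmlAttribute(tag[0], "Name"),
      postScriptName: getXmlAttribute(tag[0], "PostScriptName")
    });
  }
  return fonts;
}

// Hand-written sample in the assumed shape of Resources/Fonts.xml.
var sampleFontsXml =
  '<idPkg:Fonts xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging">' +
  '<FontFamily Self="di1" Name="Minion Pro">' +
  '<Font Self="di1f1" FontFamily="Minion Pro" Name="Minion Pro Regular" PostScriptName="MinionPro-Regular"/>' +
  '<Font Self="di1f2" FontFamily="Minion Pro" Name="Minion Pro Bold" PostScriptName="MinionPro-Bold"/>' +
  '</FontFamily>' +
  '</idPkg:Fonts>';
var parsedFonts = listFonts(sampleFontsXml);
```

Each record in `parsedFonts` can then be written out as one row of the spreadsheet.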
Hope this helps...
How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespace, and a complete paragraph is broken into multiple lines, etc.
How do I get the desired result as in the desired text file?
Additionally, is it possible to detect and mark (for example, with a tag) text that is bold, italic or underlined? Things also get more problematic for pages that have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, but not the exact output that you want.
If you want exact, pre-interpreted output, then you should check paid OCR services like the OmniPage Capture SDK and the ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
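As an illustration of what such a clean-up pass might look like, here is a small sketch that collapses runs of spaces and rejoins hard-wrapped lines into paragraphs, treating blank lines as paragraph breaks. `tidyExtractedText` is a hypothetical helper, not part of any PDF library, and its rules are guesses: in particular, the de-hyphenation step will also merge genuinely hyphenated words that happen to break at a line end.

```javascript
// Heuristic clean-up for text extracted from a PDF:
// - collapse runs of spaces/tabs and trim each line
// - join consecutive non-empty lines into one paragraph
// - rejoin a word that was hyphenated across a line break
// - treat blank lines as paragraph separators
function tidyExtractedText(rawText) {
  var lines = rawText.replace(/\r\n?/g, "\n").split("\n");
  var paragraphs = [];
  var current = "";
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].replace(/[ \t]+/g, " ").replace(/^ | $/g, "");
    if (line === "") {
      if (current !== "") {
        paragraphs.push(current);
        current = "";
      }
      continue;
    }
    if (current === "") {
      current = line;
    } else if (/[A-Za-z]-$/.test(current) && /^[a-z]/.test(line)) {
      current = current.slice(0, -1) + line; // "para-" + "graph" becomes "paragraph"
    } else {
      current += " " + line;
    }
  }
  if (current !== "") {
    paragraphs.push(current);
  }
  return paragraphs.join("\n\n");
}

var tidied = tidyExtractedText(
  "This  is a para-\ngraph that was   broken\nacross lines.\n\n  Second   paragraph. "
);
```

Multi-column pages are not helped by this at all, since the line order coming out of the extractor is already wrong at that point.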
I'm trying to read a file as string. But it seems that the data is corrupted.
string filepaths = Files[0].FullName;
System.IO.StreamReader myFile = new System.IO.StreamReader(filepaths);
string datas = myFile.ReadToEnd();
But datas contains "pk0101" etc. instead of the original data. I'm doing this so that I can replace a placeholder with this string, datas, and when I do the replacement, the inserted text is 0101 etc. Is that because of the content in datas? How can I read the file as a string? Your help will be greatly appreciated. Thank you.
A *.docx file is a ZIP archive that contains XML documents, which is why the raw content starts with "PK" (the ZIP signature) when you read it as plain text. Take a look here to become more familiar with the format definition.
For working with Office formats, Microsoft recommends using the Open XML SDK in the DocumentFormat.OpenXml library.
Here is a great article for learning how to work with Word files.
It works as follows:
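// Requires: using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
using (var wordDocument = WordprocessingDocument.Open(filepaths, false)) // false = open read-only
{
    var body = wordDocument.MainDocumentPart.Document.Body;
    // Text of the first paragraph only; body.InnerText gives the text of the whole body.
    var text = body.GetFirstChild<Paragraph>().InnerText;
}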
Also, take a look at this SO question: How do I read data from a word with format using the OpenXML Format SDK with c#?
I am trying to get the content of an attachment. It may be an Excel file, a Word document, or a text file, but whatever it is, I want to store it in a database, so I am using this code:
foreach (FileAttachment file in em.Attachments)// Here em is type of EmailMessage class
{
Console.Write("Hello friends" + file.Name);
file.Load();
var stream = new System.IO.MemoryStream(file.Content);
var reader = new System.IO.StreamReader(stream, UTF8Encoding.UTF8);
var text = reader.ReadToEnd();
reader.Close();
Console.Write("Text Document" + text);
}
Printing file.Name shows the attachment's file name. Printing text to the console works if the attachment is a .txt file, but for a .doc or .xls attachment it shows only symbols and no readable text. Am I doing something wrong or missing something? I want the text of any kind of file attachment. Please help me, I am a beginner in C#.
What you are seeing is what is actually in the file. Try opening one with Notepad.
There is no built-in way in .NET to show the "text contents" of arbitrary file formats. You'll have to create (preferably using third-party libraries that already solve this problem) some kind of logic that extracts plaintext from rich text documents.
See for example How to extract text from Pdf, Word and Excel documents?, Extract text from pdf and word files, and so on.
First, what do you expect when reading a binary file?
Your result is exactly what is expected. A text file can be shown as a string, but a doc or xls file is a binary file. You will see the binary content of the file. You will need to use a tool/lib to get the text/content from a binary file in human readable format.
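One way to decide which tool or library to hand an attachment to is to look at its first few bytes rather than decoding everything as UTF-8. The sketch below checks three well-known signatures: "PK" for ZIP-based formats such as .docx and .xlsx, the OLE2 compound-file header used by legacy .doc and .xls, and "%PDF-". `sniffContainer` and `bytesStartWith` are hypothetical helpers, and the final "text" result is only a guess based on the absence of NUL bytes, so UTF-16 text would be reported as "binary".

```javascript
// True when the byte array begins with the given signature bytes.
function bytesStartWith(bytes, signature) {
  if (bytes.length < signature.length) {
    return false;
  }
  for (var i = 0; i < signature.length; i++) {
    if (bytes[i] !== signature[i]) {
      return false;
    }
  }
  return true;
}

// Classify file content by its leading bytes.
function sniffContainer(bytes) {
  // "PK\x03\x04": ZIP archive (.docx, .xlsx, .pptx, .idml, ...)
  if (bytesStartWith(bytes, [0x50, 0x4B, 0x03, 0x04])) {
    return "zip";
  }
  // OLE2 compound file: legacy .doc, .xls, .ppt
  if (bytesStartWith(bytes, [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])) {
    return "ole2";
  }
  // "%PDF-"
  if (bytesStartWith(bytes, [0x25, 0x50, 0x44, 0x46, 0x2D])) {
    return "pdf";
  }
  // Heuristic only: a NUL byte is unlikely in single-byte or UTF-8 text.
  for (var j = 0; j < bytes.length; j++) {
    if (bytes[j] === 0) {
      return "binary";
    }
  }
  return "text";
}
```

Only content classified as "text" is safe to pass through a plain text reader. Everything else needs a format-specific library to get readable text out of it.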
The TXT type is simple; DOC or XLS are much more complex. You can read TXT because it is just text, whereas DOC, XLS, PPT, or anything else needs to be interpreted by some other mechanism.
For example, you may have different colors or font sizes in a Word document, or a chart in an Excel document. How can you show that in a simple TextBox or RichTextBox? Short answer: you can't.
I would like to know how to upload a resume in a pdf file in an asp.net page. I know how to upload a simple txt file and when the fields are separated by ",". Here's my code.
using System.IO;
string uploadfile = Server.MapPath("~/uploads3/") + FileUpload1.FileName;
FileUpload1.PostedFile.SaveAs(uploadfile);
if (File.Exists(uploadfile))
{
string inputline = "";
using (StreamReader sr = File.OpenText(uploadfile))
{
while ((inputline = sr.ReadLine()) != null)
{
string tempstr = inputline;
string firstname = tempstr.Substring(0, tempstr.IndexOf(","));
tempstr = tempstr.Substring(tempstr.IndexOf(",") + 1);
string lastname = tempstr.Substring(0, tempstr.IndexOf(","));
tempstr = tempstr.Substring(tempstr.IndexOf(",") + 1);
(...)
Now, I have absolutely no idea how to do this with a PDF file containing a resume. How can I do that? Please explain your answers, I'm new to System.IO. Thanks again.
You will want to take a look at the open source iTextSharp library. It provides all the methods you will need for writing to a PDF. There are plenty of other PDF writing libraries that can do the same. As far as I know it isn't practical to do this using System.IO. You can still upload a CSV file, have the codebehind do the formatting and PDF creation, and then save it to the web server.
PDF is not an easy to read format. You will need a library to extract the needed information.
The iTextSharp library can work, but you will need to walk through the tree structure of the document.
A (sometimes) simple alternative is to use the .NET port of PDFBox, as instructed in this article. PDFBox converts the PDF to a pure text representation that may be easier to parse. The downside of this approach is that the IKVM.Net library that PDFBox uses is huge, ~17MB.
I have a requirement for an application that takes Doc, Docx and PDF and converts them to RTF.
The conversion is one way and I do not need to convert back to Doc or PDF.
Has anyone done this, and can you recommend a library? I know there is Aspose, but it's way too pricey and the licenses are per year, so that's not going to work for the company I work for.
I'm OK with using a different library for each of the file types if that's what it takes.
Thanks in advance
Telerik has a nice library to do this. They actually have an entire editor that looks like Microsoft Word. It can open multiple file formats and it saves natively as RTF (although it can save as PDF, DOCX, etc.) The one thing I'm not sure of is opening the PDF and saving as an RTF. I'm not sure that the Telerik library can do that.
Here is a link to the library:
http://www.telerik.com/products/wpf/richtextbox.aspx
For a PDF to RTF library, you could use this:
http://www.sautinsoft.com/products/pdf-focus/index.php
GroupDocs.Conversion Cloud is a REST API that converts all common file formats from one format to another reliably and easily. Its free pricing plan offers 50 free credits per month.
Here is sample code for PDF to RTF from default storage:
// Get App Key and App SID from https://dashboard.groupdocs.cloud/
var configuration = new GroupDocs.Conversion.Cloud.Sdk.Client.Configuration(MyAppSid, MyAppKey);
var apiInstance = new ConvertApi(configuration);
try
{
// convert settings
var settings = new GroupDocs.Conversion.Cloud.Sdk.Model.ConvertSettings
{
StorageName = null,
FilePath = "02_pages.pdf",
Format = "rtf",
ConvertOptions = new RtfConvertOptions(),
OutputPath = "02_pages.rtf"
};
// convert to specified format
List<StoredConvertedResult> response = apiInstance.ConvertDocument(new ConvertDocumentRequest(settings));
Console.WriteLine("Document converted successfully: " + response[0].Url);
}
catch (Exception e)
{
Console.WriteLine("Exception when calling ConvertApi.ConvertDocument: " + e.Message);
}
I'm a developer evangelist at Aspose.