I have a very large PDF catalog with over 50K part numbers in it, and I would like to script a process that turns the part numbers into clickable links. I have been looking at Acrobat, iTextSharp, PDFSharp and a few others, but I can't tell whether anything like this has been done before.
Will I need to manually update each link, or is there some hope of automating this process?
Thanks!
This task can be accomplished with the Docotic.Pdf library.
The library can retrieve all words from a page together with their bounding rectangles. It can also create hyperlinks at specified locations on a PDF page.
Here is a short sample for your task. The following code opens the specified file, finds all words that start with L, and turns these words into links.
public static void makeWordsHyperlinks(string file, string outputFile)
{
    using (PdfDocument pdf = new PdfDocument(file))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            PdfCollection<PdfTextData> words = page.GetWords();
            foreach (PdfTextData word in words)
            {
                // let's take anything starting with L
                // you can discriminate words as you like, of course
                if (word.Text.StartsWith("L", StringComparison.InvariantCultureIgnoreCase))
                {
                    // build lookup query. you can use any url, of course
                    string lookupUrl = string.Format(@"https://www.google.ru/#q={0}", word.Text);
                    // let's draw a rectangle around the word,
                    // just to make links easier to find
                    page.Canvas.DrawRectangle(word.Bounds, PdfDrawMode.Stroke);
                    page.AddHyperlink(word.Bounds, new Uri(lookupUrl));
                }
            }
        }
        pdf.Save(outputFile);
    }
}
I assume that your part numbers are something like XXX-YYYYY. If your part numbers consist of several words, then the task is a bit harder: you will need to combine the words and their bounding rectangles.
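As a rough illustration of "discriminating words as you like", the StartsWith check could be swapped for a pattern match. This is only a sketch: the XXX-YYYYY shape (three letters or digits, a hyphen, five letters or digits) is an assumption about the catalog, and the class name, method names and example.com URL are hypothetical placeholders.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PartNumberLinks
{
    // Assumed part number shape: 3 letters/digits, a hyphen, 5 letters/digits (e.g. "ABC-12345").
    private static readonly Regex PartNumberPattern =
        new Regex(@"^[A-Z0-9]{3}-[A-Z0-9]{5}\z", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // True when the whole word matches the assumed part number shape.
    public static bool IsPartNumber(string word)
    {
        return word != null && PartNumberPattern.IsMatch(word);
    }

    // Builds a lookup link; the host and query parameter are hypothetical.
    // EscapeDataString keeps unusual characters from breaking the URL.
    public static Uri BuildLookupUri(string partNumber)
    {
        return new Uri("https://example.com/parts?q=" + Uri.EscapeDataString(partNumber));
    }
}
```

Inside the word loop, the condition would then read IsPartNumber(word.Text), and the link target would come from BuildLookupUri(word.Text).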
Disclaimer: I work for the vendor of the library.
I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using StreamReader and removing all content before the 4th line, for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C sharp code)
The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
There are 4 lines of text. But you say this isn't consistent across files.
The text lines may not contain the same number of commas as the data table. But that is unlikely to be reliable for all files.
There is a blank/whitespace-only line between the text and the data.
The data appears to be in the form word-comma-word. If this is true, it should be easy to identify non-data lines (any line which doesn't contain exactly one comma, or has multiple words, etc.).
You may be able to use a combination of these heuristics to more reliably detect the data.
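A minimal sketch of combining two of these heuristics, assuming the intro text is separated from the data by a blank line and that every data row has a known number of commas. The class and method names are made up for illustration, and quoted fields containing commas are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvPreamble
{
    // Skips everything up to and including the first blank line (if any),
    // then keeps only lines with the expected number of commas.
    public static List<string> ExtractDataLines(IEnumerable<string> lines, int expectedCommas)
    {
        List<string> all = lines.ToList();
        int blankIndex = all.FindIndex(l => string.IsNullOrWhiteSpace(l));
        IEnumerable<string> candidates = blankIndex >= 0 ? all.Skip(blankIndex + 1) : all;
        return candidates
            .Where(line => line.Count(c => c == ',') == expectedCommas)
            .ToList();
    }

    // Reads the original file and writes the cleaned rows to a new CSV.
    public static void CleanFile(string inputPath, string outputPath, int expectedCommas)
    {
        File.WriteAllLines(outputPath, ExtractDataLines(File.ReadAllLines(inputPath), expectedCommas));
    }
}
```

In the sample from the question, the second intro line ("...the website GG, blah blah blah...") happens to contain exactly one comma, so the comma count alone would let it through. That is why the sketch applies the blank-line rule first.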
You could scan line by line (splitting on \r\n) and ignore lines whose comma count doesn't match your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
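// requires: using System; using System.Linq;
var csv = "some test\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
// Split on the two-character line break; a char literal cannot hold "\r\n"
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after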
// requires: using System.Collections.Generic; using System.Linq;
int counter = 0;
// Open the file to read from.
List<string> info = System.IO.File.ReadAllLines(path).ToList();
// Find the lines up until (& including) the empty one
foreach (string s in info)
{
    counter++;
    if (string.IsNullOrWhiteSpace(s))
        break; // exit from the loop
}
// Remove the lines including the blank one.
info.RemoveRange(0, counter);
Something like this should work. You should probably add a check for the case where no blank line is found (counter then equals the number of lines and everything would be removed), plus other tests to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
Regards,
Slipoch
I have to build an application in C#.NET with which I can search for certain words in a Word document. I've seen that there are APIs for this in C#.NET, but I need to take this a step further.
One thing I want to be able to do is search with a regex string.
Another thing I need to do is search for a range of numbers. So I should be able to say something like >500, and it should then find every "word" that has a value larger than 500.
These last two things are my problem. I couldn't find any direct info about this. Is it possible to search in a Word document using regex with C# code? And is there a good way to specify a range of numbers that it should find?
I want to do this in C#.NET.
Any info on this is appreciated!
I've done it on a .txt file. You will need to change the first line of code to open the Word file instead, but the rest should look like this:
string fileData = System.IO.File.ReadAllText(@"C:\1\1.txt");
string[] words = fileData.Split(' ');
List<int> integers = new List<int>();
foreach (string word in words)
{
    // TryParse skips anything that is not an integer without throwing
    int integer;
    if (int.TryParse(word, out integer) && integer > 500)
    {
        integers.Add(integer);
    }
}
foreach (int integer in integers)
{
    MessageBox.Show(integer.ToString());
}
For opening Word documents, take a look at how to read .docx files.
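For the regex part of the question, one possible sketch combines a pattern match with a numeric comparison once the document text is available as a string. The class and method names are hypothetical. The pattern only picks up standalone runs of digits, so "3.14" would be seen as 3 and 14, and "abc123" is ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberSearch
{
    // Finds standalone whole numbers in the text that are greater than the threshold.
    public static List<long> FindNumbersGreaterThan(string text, long threshold)
    {
        var results = new List<long>();
        foreach (Match match in Regex.Matches(text, @"\b\d+\b"))
        {
            long value;
            // Digit runs too long for a long fail to parse and are skipped.
            if (long.TryParse(match.Value, NumberStyles.None, CultureInfo.InvariantCulture, out value)
                && value > threshold)
            {
                results.Add(value);
            }
        }
        return results;
    }
}
```

Calling NumberSearch.FindNumbersGreaterThan(fileData, 500) would replace the Split and parse loop above. Any other regex can be dropped into the same Regex.Matches call to search for arbitrary patterns.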
I have a little C# app that is extracting text from a Microsoft Publisher file via the COM Interop API.
This works fine, but I'm struggling if I have multiple styles in one section. Potentially every character in a word could have a different font, format, etc.
Do I really have to compare character after character? Or is there something that returns me the different style sections? Kinda like I can get the different Paragraphs?
foreach (Microsoft.Office.Interop.Publisher.Shape shp in pg.Shapes)
{
    if (shp.HasTextFrame == MsoTriState.msoTrue)
    {
        text.Append(shp.TextFrame.TextRange.Text);
        for (int i = 0; i < shp.TextFrame.TextRange.WordsCount; i++)
        {
            TextRange range = shp.TextFrame.TextRange.Words(i + 1, 1);
            string test = range.Text;
        }
    }
}
Or is there in general a better way to extract the text from a Publisher file? But I have to be able to actually write it back with the same formatting. It's for a translation.
You could consider using the clipboard to copy text sections as RTF, which you can later paste back as RTF, as in the Word example below. I am not familiar with Publisher's object model.
// Word collections are 1-based. Range.Copy() puts the paragraph on the clipboard with its formatting.
wordDocument.Content.Paragraphs[1].Range.Copy();
string rtf = System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Rtf);
Other than that, I have not found a collection of applied styles when using interop with any of the office products.
We tried an approach where we compared as many font styles as possible for every character. Not pretty, but it works in most cases...
I am working on a Word automation project in C# and am using the Word interop library to read from and write to Word. I am currently using bookmarks in a Word template doc to find where to write info from C#. One of my bookmarks consists of two highlighted lines in the doc. Based on a boolean value, I have to decide whether to leave that text there and add a new line of text right after it, or delete those two existing lines from the doc.
So here is my pseudocode for it:
if (writeToDoc)
{
    // leave selected bookmark text intact and press enter to write another line right after
}
else
{
    // delete the selected bookmark text
}
Can anyone please show me how to delete existing text, as well as do the equivalent of pressing enter and writing another line, from C#?
Thanks
EDIT: Here is the code I have (roughly)
foreach (Microsoft.Office.Interop.Word.Bookmark bookmark in wordDoc.Bookmarks)
{
    var bookMarkNameExistsInCode = listOfBookmarks.Contains(bookmark.Name);
    if (bookMarkNameExistsInCode)
    {
        rng = bookmark.Range;
        // at this point I am pointing to the two selected lines labelled as a bookmark in Word. How can I deselect and add a new line?
    }
}
If the Word manipulation is done on DocX files, you could use the DocX library, which offers some very simple commands like text.ReplaceText(); and other easy, intuitive commands. Replacing Interop with DocX, if possible, should get you up and running in no time :)
I saw a lot of solutions in here but none are clear or good answers.
Here is my simple question, hoping with a straight answer.
I have a PDF file (a template) which is created having text something like this:
{FIRSTNAME} {LASTNAME} {ADDRESS} {PHONENUMBER}
Is it possible to have C# code that replaces these placeholders with text of my choice?
No fields, no other complex stuff.
Is there any open-source library that would help me achieve that?
This thread is dead, however I'm posting my solution for other lost souls that might face this problem in the future. Unfortunately my company doesn't allow posting code online so I'll describe the solution :).
So basically what you have to do is use PdfSharp and modify this sample to replace text in stream, but you must take into account that text may be split into many parentheses (convert stream to string to see what the format is).
Then, with code similar to this sample traverse through source pdf page by page and modify current page by searching for PdfContent items inside PdfReference items and replacing text in content's stream.
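To illustrate the point about text being split across several parentheses: in a content stream, a single visible word is often written as an array of string fragments with kerning numbers between them, followed by the TJ operator, so a plain search for the whole placeholder finds nothing. The sketch below is not the poster's code. It is a hypothetical helper that joins the literal strings of one such array so the result can be compared with a placeholder. It assumes an uncompressed stream, a simple single-byte encoding, no hex strings, and no nested unescaped parentheses.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfTextFragments
{
    // Matches one PDF literal string "( ... )", allowing backslash escapes
    // but not nested unescaped parentheses (a simplifying assumption).
    private static readonly Regex LiteralString =
        new Regex(@"\((?<body>(?:\\.|[^\\()])*)\)", RegexOptions.Singleline);

    // Concatenates the string fragments of a text array such as "[(Hel)-20(lo)] TJ",
    // ignoring the kerning numbers between them.
    public static string JoinLiteralStrings(string textArray)
    {
        var joined = new StringBuilder();
        foreach (Match match in LiteralString.Matches(textArray))
        {
            string body = match.Groups["body"].Value;
            // Undo only the simplest escapes: \( \) and \\
            joined.Append(Regex.Replace(body, @"\\([()\\])", "$1"));
        }
        return joined.ToString();
    }
}
```

For example, JoinLiteralStrings("[({FIRST)-12(NA)3(ME})] TJ") yields {FIRSTNAME}, even though that placeholder never appears contiguously in the stream.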
The 'problem' with PDF documents is that they are inherently not suitable for editing. Especially ones without fields. The best thing is to step back and look at your process and see if there is a way to replace the text before the PDF was generated. Obviously, you may not always have this freedom.
If you are able to replace text, then you should be aware that there will be no automatic reflow of the text following the replaced text. Given that you are fine with that, there are very few solutions that allow you to replace text.
I know that you are looking for an OpenSource solution so I feel reluctant to offer you a commercial solution. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and then write it back to a PDF again.
Here it is:
http://www.tallcomponents.com/pdfkit
Disclosure: I am the founder of TallComponents, vendor of this component
For simple text replacement, use the iTextSharp library.
The code that replaces one string with another is below.
Note that this will replace only simple text and may not work in all cases.
// using System.IO;
// using iTextSharp.text.pdf;
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Decode the page content, replace the text, and write it back
            byte[] contentBytes = reader.GetPageContent(i);
            string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
            contentString = contentString.Replace(origText, replaceText);
            reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        }
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
As stated in a similar thread, there is no easy way to do this. The easier route seems to be to start from a DocX file, use the DocX library (which allows easy word swapping), and then convert the DocX to PDF (using the PDF Creator printer or similar).
Or use PDFsharp/MigraDoc to create new documents.
Updating a PDF is hard and dirty. So maybe adding content on top of the existing content will work for you as well, as it worked for me. If so, here's my primitive but working solution covering a lot of cases ("covering", indeed):
https://github.com/astef/PatchPdfText