I am developing a PDF reader. I want to find any string in a PDF and get the corresponding page number. I am using iTextSharp.
Something like this should work:
// add any string you want to match on
Regex regex = new Regex("the",
RegexOptions.IgnoreCase | RegexOptions.Compiled
);
PdfReader reader = new PdfReader(pdfPath);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.NumberOfPages; i++) {
ITextExtractionStrategy strategy = parser.ProcessContent(
i, new SimpleTextExtractionStrategy()
);
if ( regex.IsMatch(strategy.GetResultantText()) ) {
// do whatever with corresponding page number i...
}
}
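If you need every matching page rather than a single action per page, the loop above can collect the page numbers into a list. The following is only a minimal sketch: `PageSearch` and `FindPages` are hypothetical names, and the page texts are passed in as plain strings (assumed to come from `GetResultantText()` or similar) so that the example runs without iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageSearch
{
    // Hypothetical helper: pageTexts[0] is assumed to hold the text of page 1,
    // pageTexts[1] the text of page 2, and so on.
    public static List<int> FindPages(IReadOnlyList<string> pageTexts, Regex regex)
    {
        var pages = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (regex.IsMatch(pageTexts[i]))
            {
                pages.Add(i + 1); // report 1-based page numbers, like the loop above
            }
        }
        return pages;
    }

    public static void Main()
    {
        // Stand-in page texts; in the real reader these would be extracted per page.
        var texts = new[] { "Nothing here", "Over the hill", "THE END" };
        var regex = new Regex("the", RegexOptions.IgnoreCase);
        Console.WriteLine(string.Join(",", FindPages(texts, regex)));
    }
}
```

Keeping the matching logic separate from the extraction makes it easy to test the search without a PDF file at hand.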
As an alternative to iTextSharp, you can use Acrobat.dll to find the current page number. First, open the PDF file and search for the string using
Acroavdoc.Open("Filepath", "Temporary title")
and
Acroavdoc.FindText("String")
If the string is found in the PDF file, the view moves to the matching page and the matched string is highlighted. Then call Acroavpageview.GetPageNum() to get the current page number.
Dim AcroXAVDoc As CAcroAVDoc
Dim Acroavpage As AcroAVPageView
Dim AcroXApp As CAcroApp
AcroXAVDoc = CType(CreateObject("AcroExch.AVDoc"), Acrobat.CAcroAVDoc)
AcroXApp = CType(CreateObject("AcroExch.App"), Acrobat.CAcroApp)
AcroXAVDoc.Open(TextBox1.Text, "Original document")
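' FindText returns True when a match is found and scrolls the view to it
If AcroXAVDoc.FindText("String to be searched", True, True, False) Then
Acroavpage = AcroXAVDoc.GetAVPageView()
' GetPageNum is zero-based, so add 1 for a human-readable page number
Dim x As Integer = Acroavpage.GetPageNum() + 1
MsgBox("The string was found on page number " & x)
End If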
In my code, I need to read the content of a PDF file and, based on some specific requirements, insert that content into a SQL Server database.
I used iTextSharp to read the PDF. It reads well when the text runs as whole lines across the page.
Problems arise when there is a table inside the PDF.
It reads a line from column 1, then jumps to column 2 and reads a line there, and so on.
The problem is that column 1 and column 2 each contain paragraphs. The extraction breaks those paragraphs into separate single lines that have no meaning on their own.
I want it to read column 1 paragraph by paragraph: read a whole paragraph, and if a new paragraph follows after a newline, read that one next.
Only after processing column 1 should it move on to column 2.
Currently I am using the code below:
PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
int PageNum = reader.NumberOfPages;
StringBuilder text = new StringBuilder();
for (int i = 1; i <= PageNum; i++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
ReadContent(text.ToString());
text.Clear();
}
There are other similar questions that have been asked and answered, but none of those answers work in what I'm trying to do, or there isn't enough information for me to know how to implement it in my own code. I've been at it for two days and now must ask for help.
I have a script task in an SSIS package where I need to do a match and replace on a large XML file that contains thousands of Record Identifier tags. Each one contains a number. I need those numbers to be consecutive and increment by one. For example, within the xml file, I am able to find tags that appear like this:
<ns1:recordIdentifier>1</ns1:recordIdentifier>
<ns1:recordIdentifier>6</ns1:recordIdentifier>
<ns1:recordIdentifier>223</ns1:recordIdentifier>
<ns1:recordIdentifier>4102</ns1:recordIdentifier>
I need to find and replace those tags with consecutive increments like so:
<ns1:recordIdentifier>1</ns1:recordIdentifier>
<ns1:recordIdentifier>2</ns1:recordIdentifier>
<ns1:recordIdentifier>3</ns1:recordIdentifier>
<ns1:recordIdentifier>4</ns1:recordIdentifier>
The code I have so far is causing all the numbers to be "1" with no incrementation.
I've tried dozens of different methods, but nothing has worked yet.
Any ideas as to how I can modify the below code to increment as desired?
public void Main()
{
string varStart = "<ns1:recordIdentifier>";
string varEnd = "</ns1:recordIdentifier>";
int i = 1;
string path = Dts.Variables["User::xmlFilename"].Value.ToString();
string outPath = Dts.Variables["User::xmlOutputFile"].Value.ToString();
string ptrn = @"<ns1:recordIdentifier>\d{1,4}<\/ns1:recordIdentifier>";
string replace = varStart + i + varEnd;
using (StreamReader sr = File.OpenText(path))
{
string s = "";
while ((s = sr.ReadLine()) != null && i>0)
{
File.WriteAllText(outPath, Regex.Replace(File.ReadAllText(path),
ptrn, replace));
i++;
}
}
}
You were on the right path with the Replace method, but you will need to use the MatchEvaluator parameter to increment the number for each match.
string inputFile = Dts.Variables["User::xmlFilename"].Value.ToString();
string outPutfile = Dts.Variables["User::xmlOutputFile"].Value.ToString();
string fileText = File.ReadAllText(inputFile);
//match a number of any length between the elements
Regex reg = new Regex("<ns1:recordIdentifier>[0-9]+</ns1:recordIdentifier>");
string xmlStartTag = "<ns1:recordIdentifier>";
string xmlEndTag = "</ns1:recordIdentifier>";
//assuming this starts at 1
int incrementInt = 1;
fileText = reg.Replace(fileText, tag =>
{ return xmlStartTag + incrementInt++.ToString() + xmlEndTag; });
File.WriteAllText(outPutfile, fileText);
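To try the MatchEvaluator approach outside SSIS, here is a small self-contained sketch that works on an in-memory string instead of the `Dts.Variables` paths. The class and method names (`Renumber`, `RenumberIdentifiers`) are made up for illustration, and the pattern uses `\d+` on the assumption that identifiers can have any number of digits, as in the sample tags from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Renumber
{
    // Hypothetical helper: rewrites every recordIdentifier value as 1, 2, 3, ...
    // in the order the tags appear in the text.
    public static string RenumberIdentifiers(string xml)
    {
        const string startTag = "<ns1:recordIdentifier>";
        const string endTag = "</ns1:recordIdentifier>";
        var pattern = new Regex(Regex.Escape(startTag) + @"\d+" + Regex.Escape(endTag));

        int next = 1;
        // The lambda is called once per match, left to right, so the counter
        // advances for each tag instead of being fixed at 1.
        return pattern.Replace(xml, match => startTag + (next++) + endTag);
    }

    public static void Main()
    {
        string sample =
            "<ns1:recordIdentifier>1</ns1:recordIdentifier>\n" +
            "<ns1:recordIdentifier>223</ns1:recordIdentifier>\n" +
            "<ns1:recordIdentifier>4102</ns1:recordIdentifier>";
        Console.WriteLine(RenumberIdentifiers(sample));
    }
}
```

The original attempt produced all 1s because the replacement string was built once, before the loop, while `i` was still 1. The evaluator defers building the replacement until each match is found.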
My code is in C#
I am using Aspose to search for text in a PDF and highlight it.
It works, but it takes far too long.
Example: my document has 25 pages and contains 25 instances of the search text, one on each page.
It takes 2 minutes, which is unacceptable.
I have 3 questions:
Is there a way to reduce the time taken?
Currently this approach is for PDF only, but in my case I have all types of documents (xls, pdf, ppt, doc). Is there any way to perform this search and highlighting in all of them?
Is there a better way of doing it than Aspose?
// open document
Document document = new Document(@"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\OpenAML.pdf");
//create TextAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Martin");
//accept the absorber for all the pages
for (int i = 1; i <= document.Pages.Count; i++)
{
document.Pages[i].Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
//update text and other properties
// textFragment.TextState.Invisible = false;
//textFragment.Text = "TEXT";
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 9;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Blue);
textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow);
//textFragment.TextState.Underline = true;
}
}
// Save resulting PDF document.
document.Save(@"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\Highlightdoc.pdf");
I am working on a final year project. I have a file that contains some text. I need to get the words from this file that carry the "//jj" tag, e.g. abc//jj, bcd//jj, etc.
Suppose the file contains the following text:
ffafa adada//bb adad ssss//jj aad adad adadad aaada dsdsd//jj
dsdsd sfsfhf//vv
dfdfdf
I need all the words that are associated with the //jj tag. I have been stuck on this for the past few days.
This is the code I am trying:
// Create OpenFileDialog
Microsoft.Win32.OpenFileDialog dlg = new Microsoft.Win32.OpenFileDialog();
// Set filter for file extension and default file extension
dlg.DefaultExt = ".txt";
dlg.Filter = "Text documents (.txt)|*.txt";
// Display OpenFileDialog by calling ShowDialog method
Nullable<bool> result = dlg.ShowDialog();
// Get the selected file name and display in a TextBox
string filename = string.Empty;
if (result == true)
{
// Open document
filename = dlg.FileName;
FileNameTextBox.Text = filename;
}
string text;
using (var streamReader = new StreamReader(filename, Encoding.UTF8))
{
text = streamReader.ReadToEnd();
}
string FilteredText = string.Empty;
string pattern = @"(?<before>\w+) //jj (?<after>\w+)";
MatchCollection matches = Regex.Matches(text, pattern);
for (int i = 0; i < matches.Count; i++)
{
FilteredText="before:" + matches[i].Groups["before"].ToString();
//Console.WriteLine("after:" + matches[i].Groups["after"].ToString());
}
textbx.Text = FilteredText;
I can't get the result I expect. Please help me.
With LINQ you could do this with one line:
string[] taggedwords = input.Split(' ').Where(x => x.EndsWith(@"//jj")).ToArray();
And all your //jj words will be there...
Personally I think Regex is overkill if that's definitely how the string will look. You haven't specified that you definitely need to use Regex so why not try this instead?
// A list that will hold the words ending with '//jj'
List<string> results = new List<string>();
// The text you provided
string input = @"ffafa adada//bb adad ssss//jj aad adad adadad aaada dsdsd//jj dsdsd sfsfhf//vv dfdfdf";
// Split the string on the space character to get each word
string[] words = input.Split(' ');
// Loop through each word
foreach (string word in words)
{
// Does it end with '//jj'?
if(word.EndsWith(@"//jj"))
{
// Yes, add to the list
results.Add(word);
}
}
// Show the results
foreach(string result in results)
{
MessageBox.Show(result);
}
Results are:
ssss//jj
dsdsd//jj
Obviously this is not quite as robust as a regex, but you didn't provide any more detail for me to go on.
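One caveat with splitting on a single space: the text in the question spans several lines, so a word at the end of a line stays glued to the first word of the next line and no longer ends with "//jj". A minimal sketch that splits on any whitespace instead is shown here; `TaggedWords` and `FindTagged` are illustrative names, not part of any library.

```csharp
using System;
using System.Linq;

public static class TaggedWords
{
    // Hypothetical helper: returns every whitespace-separated word ending with the tag.
    public static string[] FindTagged(string text, string tag)
    {
        return text
            // A null separator array makes Split treat any whitespace
            // (spaces, tabs, line breaks) as a delimiter.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => word.EndsWith(tag, StringComparison.Ordinal))
            .ToArray();
    }

    public static void Main()
    {
        string input = "ffafa adada//bb adad ssss//jj aad adad adadad aaada dsdsd//jj\n" +
                       "dsdsd sfsfhf//vv\n" +
                       "dfdfdf";
        foreach (string word in FindTagged(input, "//jj"))
        {
            Console.WriteLine(word);
        }
    }
}
```

With the multi-line input above, this still returns both ssss//jj and dsdsd//jj, whereas `Split(' ')` would miss the second one because of the line break that follows it.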
You have an extra space in your regex: it assumes there is a space before "//jj". What you want is:
string pattern = @"(?<before>\w+)//jj (?<after>\w+)";
This regular expression will yield the words you are looking for:
string pattern = "(\\S*)\\/\\/jj";
A bit nicer without the C# string escaping:
(\S*)\/\/jj
Matches will include the //jj but you can get the word from the first bracketed group.
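As a rough illustration of reading the first group, the sketch below runs that pattern over the sample text and collects `Groups[1]` from each match. The names `TagRegex` and `WordsBeforeTag` are invented for the example, and the pattern is written as a verbatim string without the optional escaping of the slashes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagRegex
{
    // Hypothetical helper: returns the word part of every "word//jj" occurrence.
    public static List<string> WordsBeforeTag(string text)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(text, @"(\S*)//jj"))
        {
            // Groups[0] is the whole match including "//jj";
            // Groups[1] is the bracketed part, i.e. the word itself.
            words.Add(match.Groups[1].Value);
        }
        return words;
    }

    public static void Main()
    {
        string input = "ffafa adada//bb adad ssss//jj aad adad adadad aaada dsdsd//jj\n" +
                       "dsdsd sfsfhf//vv\n" +
                       "dfdfdf";
        Console.WriteLine(string.Join(", ", WordsBeforeTag(input)));
    }
}
```

Because `\S` never matches whitespace, this works across line breaks without any extra splitting.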
I'm trying to parse PDF documents in order for certain values be added to an existing database. The problem is with parsing the PDF.
First try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
String text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
}
}
}
But unfortunately that only parsed the text after the titles (Employer, Website, Language etc). And I need the titles in order to create a class which will be mapped to a relation in the database.
Second try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
byte[] streamBytes = reader.GetPageContent(page);
PRTokeniser tokenizer = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(streamBytes)));
while (tokenizer.NextToken())
{
if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
{
String text = tokenizer.StringValue;
}
}
}
}
}
Fortunately, this parsed the missing titles, but it parsed them first (words in new lines instead of single line) and the value afterwards.
iTextSharp documentation?
There must be classes in iTextSharp which can find the titles/values pair. Or at least parse the titles in readable format. I am happy to write my own implementation of ITextExtractionStrategy.
iTextSharp does not have an official documentation page, but you can find some answers here on SO. Instead of getting the data from the PDF as a String, try parsing it as XML and then use XPath to get the data you need. Alternatively, you can use LINQ to XML. I'm guessing that each page in the PDF has the same format, so the XML structure can have the same format as well.
Here is a sample project using iTextSharp, and here is a (paid) SDK that you can use; if you want it for free, it is only a temporary solution.