My code is in C#.
I am using Aspose to search for text in a PDF and highlight it.
It works, but it is very slow.
Example: my document has 25 pages and 25 instances of the search text, one on each page.
It takes 2 minutes, which is unacceptable.
I have 3 questions:
Is there a way to reduce the time taken?
Currently this approach works only for PDF, but I have all types of documents (xls, pdf, ppt, doc). Is there a way to search and highlight in all of them?
Is there a better way of doing this than Aspose?
// open document
Document document = new Document(@"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\OpenAML.pdf");
//create TextAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Martin");
//accept the absorber for all the pages
for (int i = 1; i <= document.Pages.Count; i++)
{
document.Pages[i].Accept(textFragmentAbsorber);
//get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
//loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
//update text and other properties
// textFragment.TextState.Invisible = false;
//textFragment.Text = "TEXT";
textFragment.TextState.Font = FontRepository.FindFont("Verdana");
textFragment.TextState.FontSize = 9;
textFragment.TextState.ForegroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Blue);
textFragment.TextState.BackgroundColor = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Yellow);
//textFragment.TextState.Underline = true;
}
}
// Save resulting PDF document.
document.Save(@"C:\TestArea\Destination\SUP000011\ATM-1B4L2KQ0ZE0-0001\Highlightdoc.pdf");
I am writing a program in C# using Open XML that transfers data from excel to word.
Currently, I have this:
internal override void UpdateSectionSheets(int sectionNum, List<List<string>> tableContents)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(MainForm.WordFileDialog.FileName, true))
{
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
foreach(Table table in tables)
{
int row = 1;
if (table.Descendants<TableRow>().FirstOrDefault().Descendants<TableCell>().FirstOrDefault().InnerText == sectionNum.ToString())
{
foreach(var item in tableContents[0])
{
// splits the tableContents[0][row - 1] into individual strings at each instance of "\n\n"
String str = tableContents[0][row - 1];
String[] separator = {"\n\n"};
Int32 count = 6; // max 6 sub strings (usually only two but allowed for extra)
String[] subStrs = str.Split(separator, count, StringSplitOptions.RemoveEmptyEntries);
// transfer comment
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).RemoveAllChildren<Paragraph>(); // removes the existing contents in the cell
foreach (string s in subStrs)
{
// for every substring, create a new paragraph and append the substring to that new paragraph. Makes it so that each sentence is on its own line
Text text = new Text(s);
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
}
// transfer verdict
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).RemoveAllChildren<Paragraph>();
Paragraph p = new Paragraph(new ParagraphProperties(new Justification() { Val = JustificationValues.Center }));
p.Append(new Run(new Text(tableContents[1][row - 1])));
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).AppendChild(p);
row++;
}
}
}
doc.Save();
}
}
I believe the line causing the issue is: table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
If I put new Text(tableContents[0][row - 1]) in place of (text) in the above line, the program runs and the Word document opens with no errors, but the output is not in the format I need.
The program runs without throwing any errors, but when I try to open the Word document it gives a "Word found unreadable content in xxx.docm" error. If I say I trust the source and want Word to recover the document, I can open it and see that the code is working how I want. However, I don't want to have to do that every time. Does anyone know what is causing the error and how I can fix it?
I want to extract all the words from a Word file (doc/docx) and put them into a list. It seems that Microsoft.Office.Interop only works if I extract paragraphs and add them to a list.
List<string> data = new List<string>();
Microsoft.Office.Interop.Word.Application app = new
Microsoft.Office.Interop.Word.Application();
Document doc = app.Documents.Open(dlg.FileName);
foreach (Paragraph objParagraph in doc.Paragraphs)
data.Add(objParagraph.Range.Text.Trim());
((_Document)doc).Close();
((_Application)app).Quit();
I also found a way to extract word by word, but it doesn't work with a big document because the loop throws an exception.
Dictionary<int, string> motRap = new Dictionary<int, string>();
Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
Document document = application.Documents.Open("C:/Users/Titri/Desktop/test/test/bin/Debug/po.txt");
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
string text = document.Words[i].Text;
motRap.Add(i, text);
}
// Close word.
application.Quit();
So my question is whether there is a way to extract words from a big Word file. I think that Microsoft.Office.Interop is not a good tool for extracting from a big file.
Sorry, my English is not good.
The object inside a paragraph is called a Run, though I don't know whether it is available in Interop. For better performance, I would suggest switching to the Open XML SDK, especially if you have to process a large number of documents.
If you want to stick to Interop, why don't you just split each paragraph into an array (using a space as the delimiter) and add all the words after that?
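The splitting idea above can be sketched roughly as follows. This is only a minimal illustration, not tested against Interop itself: `WordSplitter` and `SplitIntoWords` are made-up names, and the input is assumed to be the paragraph strings already collected in the `data` list from the question. It splits on any whitespace rather than only spaces, because paragraph text read through Interop usually ends with a carriage return.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Interop): turns paragraph texts into a flat word list.
public static class WordSplitter
{
    public static List<string> SplitIntoWords(IEnumerable<string> paragraphs)
    {
        var words = new List<string>();
        foreach (string paragraph in paragraphs)
        {
            // A null separator array makes Split treat every whitespace character
            // (space, tab, \r, \n) as a delimiter; empty entries are dropped.
            string[] parts = paragraph.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            words.AddRange(parts);
        }
        return words;
    }
}
```

With the question's code this would be called as `WordSplitter.SplitIntoWords(data)`. Punctuation stays attached to the neighbouring word, so trim it separately if that matters.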
I'm currently trying to find a way to read in, and insert data into a word document. So far this is what I have gotten:
class Program
{
static void Main(string[] args)
{
var FileName = @"C:\temp\test.DOC";
List<string> data = new List<string>();
Application app = new Application();
Document doc = app.Documents.Open(@"C:\temp\test.DOC");
foreach (Paragraph objParagraph in doc.Paragraphs)
{
data.Add(objParagraph.Range.Text.Trim());
}
//data.Insert
data.Insert(16, "Test 1");
data.Insert(16, "\tTest 2\tName\tAmount");
data.Insert(16, "Test 3");
data.Insert(16, "Test 4");
data.Insert(16, "Test 5");
data.Insert(16, "Test 6");
data.Insert(16, "Test 7");
data.Insert(16, "Test 8");
data.Insert(16, "Test 9");
data.Insert(16, "Test 10");
var x = doc.Paragraphs.Add();
x.Range.Text.Insert(0,"\tTest 2\tName\tAmount");
doc.SaveAs2(@"C:\temp\test3.DOC");
((_Document)doc).Close();
((_Application)app).Quit();
}
}
Now, this successfully populates the List data - but I'm trying to append each new test element at the [16]th index, and save it into the word document. Is there a simple way to accomplish this, or am I just over-thinking this issue?
I realize the string list is separate from the Document object which represents the word document.
I have a few other places in the document where I am using bookmarks to add data, but I don't think it is possible to use bookmarks for placing the data in this instance - or If I don't have to use bookmarks I'd like to stray away from that.
EDIT: I am trying to insert X amount of elements at the [16]th position within the data[].
EDIT 2:
Essentially I am sourcing the data dynamically, and I'm not sure how many records/rows I'll need to add to the document, so it could be as follows:
[15]
[16]\tName\tID\tAMOUNT
[17]\tName\tID\tAMOUNT
[18]\tName\tID\tAMOUNT
The headers (NAME, ID, AMOUNT) will already be there, and each time I run the program I'm not sure how many elements I'll be inserting into the document. As long as each element is placed under the previous one, starting on the 16th line of the document template I have set up, that should accomplish what I am trying to do.
Image 1 - Image into string array
Image 2 - Image after adding content into the string - this is what the resulting document should look like (this is to be saved).
I'm attempting to put each element (i.e. Test1, Test2, Test3) in its own column (see above).
Again I am totally confused as to why you want to read the word file into a string list array. This simply adds the text you show after line 15 into the word document. You do not specify WHERE Test 1, Test 2, Test3... are coming from.
Edit: Added a try-catch just in case the document does not have at least 16 paragraphs.
static void Main(string[] args)
{
List<string> data = new List<string>();
Application app = new Application();
Document doc = app.Documents.Open(@"C:\temp\test.DOC");
string testRows = "Test 1\n\tTest 2\tName\tAmount\nTest 3\nTest 4\nTest 5\nTest 6\nTest 7\nTest 8\nTest 9\nTest 10\n";
try
{
var x = doc.Paragraphs[16];
x = doc.Paragraphs.Add(x.Range);
x.Range.Text = testRows;
doc.SaveAs2(@"C:\temp\test3.DOC");
}
catch (System.Runtime.InteropServices.COMException e)
{
Console.WriteLine("COMException: " + e.StackTrace.ToString());
Console.ReadKey();
}
((_Document)doc).Close();
((_Application)app).Quit();
}
So what I figured out (for my purposes) is that it is easiest to insert a list of strings into makeshift columns separated by tabs by inserting at specific paragraphs.
Since I am using bookmarks to place text as well, I found it useful to work from a copy of the document instead of worrying about removing/creating bookmarks each time.
When populating the list that you are going to place at a specific paragraph mark, it is useful to append tab characters as well as newline characters on the fly. Later on this makes it easier to loop through the list and place the items nicely in the document.
Depending on how you go about placing columns, some logic is needed to space everything correctly. I did this by defining maximum lengths for the columns, trimming longer values, and accommodating shorter or longer values by adding a specific number of tab characters.
So, my columns I am populating would look like:
myList.Add("\t12345678912345\tJohn Doe\t\t\t\t123456\r\n");
myList.Add("\t987654321654987\tJohn Smith\t\t\t\t98765\r\n");
These lines would be inserted at paragraph 17 and placed neatly under headers.
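The trimming and tab padding described above could look something like the sketch below. It is not the code I actually used: `ColumnFormatter`, `PadWithTabs`, `FormatRow`, and the `TabWidth` and `NameColumnStops` values are all made up for illustration. It also assumes tab stops behave as if they were a fixed number of characters apart, which is only approximately true in Word with proportional fonts, so the constants would need tuning against the real template.

```csharp
using System;

// Hypothetical helper: builds tab-separated rows like the ones added to myList.
public static class ColumnFormatter
{
    // Assumed layout values, not taken from any real template.
    private const int TabWidth = 8;        // characters per tab stop (assumption)
    private const int NameColumnStops = 4; // width of the name column in tab stops (assumption)

    // Trims the value so it fits the column, then appends enough tabs
    // to reach the tab stop where the next column starts.
    public static string PadWithTabs(string value, int columnStops)
    {
        int maxLength = columnStops * TabWidth - 1; // leave room for at least one tab
        string trimmed = value.Length > maxLength ? value.Substring(0, maxLength) : value;
        int tabsNeeded = columnStops - trimmed.Length / TabWidth;
        return trimmed + new string('\t', tabsNeeded);
    }

    // Produces one row: leading tab, id, padded name, amount, line break.
    public static string FormatRow(string id, string name, string amount)
    {
        return "\t" + id + "\t" + PadWithTabs(name, NameColumnStops) + amount + "\r\n";
    }
}
```

A row would then be added with something like `myList.Add(ColumnFormatter.FormatRow("12345678912345", "John Doe", "123456"));`.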
Lastly, I decided to use bookmarks to place single lines of text such as the date, title, and signature values, since those values don't need to be spaced precisely.
At the end I delete the copy of the Word document I'm working on, and delete the PDF (since in my case I'm sending it via email).
Thank you for the help @JohnG - I hope this answer might help others who come across it. I removed the try-catch since I'm working from the template as well.
File.Copy(sCurrentPath + "\\" + "testTemplate.DOC", sCurrentPath + "\\" + "test.DOC");
Application app = new Application();
Document doc = app.Documents.Add(sCurrentPath + "\\" + "test.DOC");
foreach (string sValue in myList)
{
// Insert a new paragraph at paragraph 17 and fill it with the current row
var paragraph = doc.Paragraphs[17];
paragraph = doc.Paragraphs.Add(paragraph.Range);
paragraph.Range.Text = sValue;
}
if (doc.Bookmarks.Exists("Date"))
{
object oBookMark = "Date";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = DateTime.Now.ToString("MM/dd/yyyy");
}
if (doc.Bookmarks.Exists("Signature"))
{
object oBookMark = "Signature";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = "My Name";
}
if (doc.Bookmarks.Exists("Title"))
{
object oBookMark = "Title";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = "Title Here";
}
doc.ExportAsFixedFormat(sCurrentPath + "\\" + "test.pdf", WdExportFormat.wdExportFormatPDF);
File.Delete(sCurrentPath + "\\" + "testCopy.DOC");
File.Delete(sCurrentPath + "\\" + "test.pdf");
((_Document)doc).Close();
((_Application)app).Quit();
if (richTextBox1.Lines[i].StartsWith(@"<a href=""") ||
richTextBox1.Lines[i].EndsWith(@""""))
The StartsWith should be <a href="
The EndsWith should be one single "
But the way it is now i'm getting no results.
Input for example:
Screen-reader users, click here to turn off ggg Instant.
I need to get this part:
/setprefs?suggon=2&prev=https://www.test.com/search?q%3D%2Band%2B%26espv%3D2%26biw%3D960%26bih%3D489%26source%3Dlnms%26tbm%3Disch%26sa%3DX%26ei%3DYrxxVb-hJqac7gba0YOgDQ%26ved%3D0CAYQ_AUoAQ&sig=0_seDQVVTDQQx1hvN3BRktZNFc9Ew%3D
The part between the
I also tried to use htmlagilitypack:
HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("https://www.test.com");
foreach (HtmlAgilityPack.HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!newHtmls.Contains(hrefValue) && hrefValue.Contains("images"))
newHtmls.Add(hrefValue);
}
But this gave me only 1 link.
When I open the page's view-source in the browser and search for the word image or images, I get over 350 results.
I tried also this solution:
var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
But it didn't give me the results I needed.
I forgot to mention that I copied the page's view-source content into the richTextBox1 window and am reading the text line by line from richTextBox1, so maybe that's why I'm not getting the results I need?
for (int i = 0; i < richTextBox1.Lines.Length; i++)
{
if (richTextBox1.Lines[i].StartsWith("<a href=\"") &&
richTextBox1.Lines[i].EndsWith("\""))
{
listBox1.Items.Add(richTextBox1.Lines[i]);
}
}
Maybe the view-source content as it appears in the browser (Chrome) is not the same as in richTextBox1. And maybe I should not read it line by line from richTextBox1, but read the whole text first?
Based on your input, EndsWith isn't going to help (as your input actually ends with </a>). Your next-best option would be to store the location (position) of href=", then look for the next occurrence of a " beginning at your stored location, e.g.
var input = @"Screen-reader users, click here to turn off ggg Instant.";
var needle = @"href=""";
var start = input.IndexOf(needle);
if (start != -1)
{
start += needle.Length;
var end = input.IndexOf(@"""", start);
// final result:
var href = input.Substring(start, end - start).Dump();
}
Better than that would be to use an actual HTML parser (might I recommend HtmlAgilityPack?).
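If you stay with plain string searching, the same IndexOf idea can be run over the whole text instead of line by line, which also copes with several links on one line. The following is only a rough sketch: `HrefScanner` and `ExtractHrefs` are made-up names, and it assumes every href value is wrapped in double quotes, which real-world HTML does not guarantee (one more reason to prefer a parser).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: collects every double-quoted href value found in a block of HTML text.
public static class HrefScanner
{
    public static List<string> ExtractHrefs(string html)
    {
        const string needle = "href=\"";
        var results = new List<string>();
        int position = 0;
        while (true)
        {
            // Find the next href=" at or after the current position (case-insensitive).
            int start = html.IndexOf(needle, position, StringComparison.OrdinalIgnoreCase);
            if (start == -1) break;
            start += needle.Length;

            // The value runs up to the next double quote.
            int end = html.IndexOf('"', start);
            if (end == -1) break; // unterminated attribute: stop scanning

            results.Add(html.Substring(start, end - start));
            position = end + 1;
        }
        return results;
    }
}
```

In the question's setup this would be called once on the full text, for example `HrefScanner.ExtractHrefs(richTextBox1.Text)`, and the results could then be filtered for "images".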
I'm trying to parse PDF documents in order for certain values be added to an existing database. The problem is with parsing the PDF.
First try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
String text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
}
}
}
But unfortunately that only parsed the text after the titles (Employer, Website, Language etc). And I need the titles in order to create a class which will be mapped to a relation in the database.
Second try
String[] AllPdf = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf", SearchOption.TopDirectoryOnly);
foreach (var pdfDoc in AllPdf)
{
using (PdfReader reader = new PdfReader(pdfDoc))
{
for (int page = 1; page <= reader.NumberOfPages; page++)
{
byte[] streamBytes = reader.GetPageContent(page);
PRTokeniser tokenizer = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(streamBytes)));
while (tokenizer.NextToken())
{
if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
{
String text = tokenizer.StringValue;
}
}
}
}
}
Fortunately, this parsed the missing titles, but it parsed them first (words in new lines instead of single line) and the value afterwards.
iTextSharp documentation?
There must be classes in iTextSharp which can find the titles/values pair. Or at least parse the titles in readable format. I am happy to write my own implementation of ITextExtractionStrategy.
iTextSharp does not have an official documentation page, but you can find some answers here on SO. Instead of getting the data from the PDF as a String, try parsing it as XML and then use XPath to get the data you need. Or you can use LINQ to XML. I'm guessing that each page in the PDF has the same format, so the XML structure can have the same format as well.
Here is a sample project using iTextSharp, and here is an SDK (paid) that you can use; if you want it for free, it is only a temporary solution.
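To illustrate the XPath / LINQ to XML suggestion, here is a minimal sketch. It assumes the PDF content has already been converted to XML by some other tool, and the `page`, `field`, `title` and `value` element names are purely hypothetical; the real structure depends on whatever produces the XML. `FieldReader` and `ReadFields` are likewise made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper: reads title/value pairs out of an XML rendering of one page.
public static class FieldReader
{
    public static Dictionary<string, string> ReadFields(string xml)
    {
        XDocument document = XDocument.Parse(xml);
        var fields = new Dictionary<string, string>();

        // The XPath expression and element names are assumptions about the XML layout.
        foreach (XElement field in document.XPathSelectElements("/page/field"))
        {
            string title = (string)field.Element("title"); // null if the element is missing
            string value = (string)field.Element("value");
            if (!string.IsNullOrEmpty(title))
            {
                fields[title.Trim()] = (value ?? string.Empty).Trim();
            }
        }
        return fields;
    }
}
```

The resulting dictionary (for example "Employer" mapped to its value) could then be mapped onto the class that represents the database relation.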