Advanced searching in Word documents

I have to build an application in C#.NET that can search for certain words in a Word document. I've seen that there are APIs for this in C#.NET, but I need to take this a step further.
One thing I want to be able to do is search with a regex string.
Another thing I need to do is search for a range of numbers. I should be able to say something like >500, and it should then find every "word" that has a value larger than 500.
Those last two things are my problem. I couldn't find any direct info about them. Is it possible to search in a Word document using regex from C# code? And is there a good way to specify a range of numbers that it should find?
I want to do this in C#.NET.
Any info on this is appreciated!

I've done this for a .txt file. For a Word document you would have to change the first line so that it opens the Word file instead, but the rest should look like this:
string fileData = System.IO.File.ReadAllText(@"C:\1\1.txt");
// A null separator splits on any whitespace, so numbers at line ends are found too
string[] words = fileData.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
List<int> integers = new List<int>();
foreach (string word in words)
{
    // TryParse avoids throwing an exception for every non-numeric word
    int integer;
    if (int.TryParse(word, out integer) && integer > 500)
    {
        integers.Add(integer);
    }
}
foreach (int integer in integers)
{
    MessageBox.Show(integer.ToString());
}
As for opening Word documents, take a look at how to read .docx files.
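The same idea can be extended to the regex and number-range parts of the question. The following is only a minimal sketch, assuming the document's text has already been read into a plain string (reading the Word file itself is not shown). The names TextSearch, FindPattern and NumbersGreaterThan are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any Word API.
public static class TextSearch
{
    // Returns every regex match in the text as (character offset, matched text).
    public static List<KeyValuePair<int, string>> FindPattern(string text, string pattern)
    {
        var hits = new List<KeyValuePair<int, string>>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            hits.Add(new KeyValuePair<int, string>(m.Index, m.Value));
        }
        return hits;
    }

    // Returns every number in the text that is strictly greater than the threshold.
    // Assumption: numbers look like 123 or 123.45, with "." as the decimal separator
    // and no thousands separators ("1,200" would be read as 1 and 200).
    public static List<decimal> NumbersGreaterThan(string text, decimal threshold)
    {
        var result = new List<decimal>();
        foreach (Match m in Regex.Matches(text, @"-?\d+(\.\d+)?"))
        {
            decimal value;
            bool parsed = decimal.TryParse(
                m.Value, NumberStyles.Number, CultureInfo.InvariantCulture, out value);
            if (parsed && value > threshold)
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

Calling TextSearch.NumbersGreaterThan(text, 500) would then cover the ">500" case, and other ranges only need a different comparison inside the loop.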

Related

How to extract text from multiple files

I have upwards of 200 files that I need to extract a certain sequence of lines from, and write the results in a new csv file. I am just learning C#, but have experience with other languages far in the past. I have tried looking up all the individual steps, along with Regex, which I don't understand, but I don't know how to stitch it all together.
Sample text:
--> SAT1_988_Connection_Verify
EA0683010A01030F15A40202004E2000
E0068300
E40683010278053A
>
(S45, 10:38:35 AM)
Algorithm Steps
1) I need to point the program at a directory with the files.
2) I need the program to search through each file in the directory.
3) I need to find the lines that start with "E40", of which there could be multiple or none. Additionally, this line varies in length.
4) I need to grab that line, as well as the two before it, which are highlighted in the nested block quote above.
5) There is always a blank line after the target line.
6) I need to write those three lines, separated by commas, to a text document.
My code so far:
using System;
using System.Collections.Generic;
using System.IO;
namespace ConsoleApplication2
{
    class Program
    {
        static void Main()
        {
            string path = @"C:\ETT\Test.txt";
            string[] readText = File.ReadAllLines(path);
            foreach (string s in readText)
            {
            }
        }
        public static string getBetween(string[] strSource, string strKey)
        {
            int Start, End;
            if (strSource.Contains(strKey))
            {
                Start = Array.IndexOf(strSource, strKey) -2;
                End = Array.IndexOf(strSource, strKey) + 1;
                return strSource.Substring(Start, End - Start);
            }
            else
            {
                return "";
            }
        }
    }
}

There are many ways of doing this. However, just to help you (and because you included a comparatively detailed amount of information for a first post), these are the topics you need to look up:
Directory.EnumerateFiles Method
Returns an enumerable collection of file names that match a search
pattern in a specified path.
File.ReadAllLines Method
Opens a text file, reads all lines of the file into a string array,
and then closes the file.
Enumerable.Where<TSource> Method (IEnumerable<TSource>, Func<TSource, Boolean>)
Filters a sequence of values based on a predicate.
String.StartsWith Method
Determines whether the beginning of this string instance matches a
specified string.
https://joshclose.github.io/CsvHelper/
A library for reading and writing CSV files. Extremely fast, flexible,
and easy to use. Supports reading and writing of custom class objects.
CSV helper implements RFC 4180. By default, it's very conservative in
its writing, but very liberal in its reading. There is a large set of
configuration that can be done to change how reading and writing
behaves, giving you the ability read/write non-standard files also.
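The only tricky part will be getting the lines before the match:
List<T>.IndexOf Method (T)
Searches for the specified object and returns the zero-based index of
the first occurrence within the entire List.
From that index, you can use List[Index-1] and List[Index-2] to get the two preceding lines. (Note that IndexOf only finds an exact match; to locate a line by its prefix, List<T>.FindIndex with a StartsWith predicate does the same job.)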
Good luck.
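Stitched together, those pieces could look roughly like the sketch below. It assumes the files have a .txt extension and that none of the three lines contains a comma (otherwise a CSV library such as CsvHelper would be the safer choice for writing). The class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; the names are not from the question.
public static class E40Extractor
{
    // For every line starting with the prefix, returns that line and the two
    // lines before it, joined by commas in their original order.
    // A match on the first or second line is skipped because it has
    // fewer than two predecessors.
    public static List<string> ExtractRows(IList<string> lines, string prefix)
    {
        var rows = new List<string>();
        for (int i = 2; i < lines.Count; i++)
        {
            if (lines[i].StartsWith(prefix, StringComparison.Ordinal))
            {
                rows.Add(string.Join(",", lines[i - 2], lines[i - 1], lines[i]));
            }
        }
        return rows;
    }

    // Scans every *.txt file in the directory and writes all rows to one file.
    // Assumption: outputPath is not itself a .txt file inside that directory,
    // otherwise it would be scanned on the next run.
    public static void ExtractDirectory(string directory, string outputPath)
    {
        var allRows = new List<string>();
        foreach (string file in Directory.EnumerateFiles(directory, "*.txt"))
        {
            allRows.AddRange(ExtractRows(File.ReadAllLines(file), "E40"));
        }
        File.WriteAllLines(outputPath, allRows);
    }
}
```

A plain index loop is used instead of IndexOf so that files with several "E40" lines produce one row per match.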

How to manipulate MS Word 2010 document word by word

I want to develop an MS Word 2010 Add-in (2013/2016 also) which works like a spellchecker (restores accent characters) for Turkish Text. I want to give 3 options (via context menu) to the users to use the tool.
fix all text in the document. (including the ones in the tables and lists etc.)
fix all text in a selected area
fix the word at the cursor position.
For the first option I tried to iterate all the words and fix them one by one by the following code:
var words = App.ActiveDocument.Words;
foreach (Range word in words)
{
    var corr = MyCorrecter(word.Text);
    word.Select();
    App.Selection.TypeText(corr);
}
However, this gets stuck in an infinite loop: word.Next() always returns the first word. If I remove the line word.Text = MyCorrecter(word.Text);, the code iterates over all the words successfully. There are find/replace examples around, but those are not very efficient for this particular case.
In short, what is the most effective way to manipulate words one by one in a Word Document?

For this kind of situation, where you're actually changing the content of the target Range (the "word"), you need to work with a loop that "counts" with an index. For example:
Word.Words words = App.ActiveDocument.Words;
int iWordCount = words.Count;
Word.Range rngWord = null;
for (int i = 1; i <= iWordCount; i++)
{
    rngWord = words[i];
    var corr = MyCorrecter(rngWord.Text);
    rngWord.Text = corr;
}
// When you're done, don't forget to release the COM objects
rngWord = null;
words = null;
I strongly recommend you do NOT use Select or Selection in your code unless what you need to do cannot be done any other way. Assign directly to the Range.Text property.
Note that there are situations in Word when it helps to run a loop backwards through the document (going from the highest counter to the lowest). I think this situation will work going forwards, however.
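To illustrate why a backwards loop can help, here is a small sketch that uses an ordinary List<string> as a stand-in for a document collection (it does not touch Word at all, and the class name is invented). When items are removed during the loop, counting down means the indexes of the items still to be visited never shift.

```csharp
using System;
using System.Collections.Generic;

// Stand-in demo on a plain list; not Word interop code.
public static class BackwardLoopDemo
{
    // Removes every item for which shouldRemove returns true,
    // walking from the last index down to the first.
    public static void RemoveWhere(List<string> items, Func<string, bool> shouldRemove)
    {
        for (int i = items.Count - 1; i >= 0; i--)
        {
            if (shouldRemove(items[i]))
            {
                // Only items after index i move, and those were already visited.
                items.RemoveAt(i);
            }
        }
    }
}
```

A forward loop with a cached count would skip the item that slides into the removed slot and eventually run past the end of the list.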

Removing text above real content of CSV file

I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using a StreamReader and removing all content before the 4th line, for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C sharp code)

The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
There are 4 lines of text, but you say this isn't consistent across files.
The text lines may not contain the same number of commas as the data table, but that is unlikely to be reliable for all files.
There is a blank/whitespace-only line between the text and the data.
The data appears to be in the form word-comma-word. If this is true, it should be easy to identify non-data lines (any line which doesn't contain exactly one comma, or has multiple words, etc.).
You may be able to use a combination of these heuristics to more reliably detect the data.
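As a rough sketch of the last heuristic, assuming every data row really is "word, word" with no spaces inside a field (an assumption taken from the sample, not a general CSV rule), the filter could be written like this. The names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the word-comma-word heuristic.
public static class CsvPreambleFilter
{
    // True when the line has exactly one comma and a single
    // whitespace-free token on each side of it.
    public static bool LooksLikeDataRow(string line)
    {
        string[] fields = line.Split(',');
        if (fields.Length != 2)
        {
            return false;
        }
        return fields.All(f =>
        {
            string token = f.Trim();
            return token.Length > 0 && !token.Any(char.IsWhiteSpace);
        });
    }

    // Keeps only the lines that pass the heuristic.
    public static List<string> KeepDataRows(IEnumerable<string> lines)
    {
        return lines.Where(LooksLikeDataRow).ToList();
    }
}
```

The result of KeepDataRows can be passed straight to File.WriteAllLines to produce the cleaned CSV.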

You could scan line by line (splitting on \r\n) and ignore lines whose comma count doesn't match your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
// requires: using System; using System.Linq;
var csv = "some test\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after

List<string> info = new List<string>();
int counter = 0;
// Open the file to read from.
info = System.IO.File.ReadAllLines(path).ToList();
// Find the lines up until (& including) the empty one
foreach (string s in info)
{
    counter++;
    if (string.IsNullOrEmpty(s))
        break; // exit from the loop
}
// Remove the lines including the blank one.
info.RemoveRange(0, counter);
Something like this should work. You should probably add checks for edge cases, such as a file with no blank line (in which case this removes every line), and other tests to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
Regards,
Slipoch
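For comparison, the LINQ variant mentioned above could be sketched as follows. It treats whitespace-only lines as blank, which is a small deliberate difference from the loop version, and the class name is invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper showing the LINQ form of the same idea.
public static class PreambleSkipper
{
    // Drops everything up to and including the first blank line.
    // If there is no blank line, the result is empty, which matches
    // what the loop version does in that case.
    public static List<string> SkipPreamble(IEnumerable<string> lines)
    {
        return lines
            .SkipWhile(l => !string.IsNullOrWhiteSpace(l)) // skip the intro text
            .Skip(1)                                       // skip the blank line itself
            .ToList();
    }
}
```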

PDF Converting Part Numbers to Links

I have a very large PDF catalog with over 50K part numbers in it. I would like to script a process that turns the part numbers into clickable links. I have been looking at Acrobat, iTextSharp, PDFSharp and a few others, but can't tell whether anything like that has been done before.
Will I need to manually update each link, or is there some hope of automating this process?
Thanks!

This task can be easily accomplished using Docotic.Pdf library.
The library can retrieve all words from a page with their bounding rectangles. Also, the library can create hyperlinks at specified locations of a PDF page.
Here is a short sample for your task. The following code opens the specified file, finds all words that start with L and "turns" these words into links.
public static void makeWordsHyperlinks(string file, string outputFile)
{
    using (PdfDocument pdf = new PdfDocument(file))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            PdfCollection<PdfTextData> words = page.GetWords();
            foreach (PdfTextData word in words)
            {
                // let's take anything starting with L
                // you can discriminate words as you like, of course
                if (word.Text.StartsWith("L", StringComparison.InvariantCultureIgnoreCase))
                {
                    // build lookup query. you can use any url, of course
                    string lookupUrl = string.Format(@"https://www.google.ru/#q={0}", word.Text);
                    // let's draw rectangle around word.
                    // just to make links easier to find
                    page.Canvas.DrawRectangle(word.Bounds, PdfDrawMode.Stroke);
                    page.AddHyperlink(word.Bounds, new Uri(lookupUrl));
                }
            }
        }
        pdf.Save(outputFile);
    }
}
I assume that your part numbers are something like XXX-YYYYY. If your part numbers consist of several words, then the task is a bit harder: you will need to combine the words and their bounding rectangles.
Disclaimer: I work for the vendor of the library.

How to find the location of a string in a text file

I'm trying to figure out how to find a certain string and display how many lines down it is in a text file.
For example, let's say I'm trying to find the string "I'm a string" in a text file and also want the location of the string (as in how many lines down) recorded in a variable.
Anyone got any tips to accomplish this?
Thanks

First, I would read in the file, then loop through each line searching for the text. Something like...
string[] lines = System.IO.File.ReadAllLines(@"C:\file.txt");
int count = 0;
bool found = false;
foreach (string line in lines)
{
    count++;
    if (line.IndexOf("I'm a string") > -1)
    {
        // found it: count now holds the 1-based line number
        found = true;
        break;
    }
}

Since this looks like a homework question, I will not be posting the complete solution, only pointers and guidelines.
You basically want to scan through your whole text file, letter by letter, reading the next n chars, where n is the length of your search string.
If that set matches your search string, you have your answer.
The number of "\n" characters you encountered before the match is the number of lines you had to traverse.
Simpler regex solutions also exist. You should try looking at those.
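A compact sketch of that newline-counting idea is shown below. Instead of comparing n characters at every position by hand, it lets String.IndexOf locate the match and then counts the line breaks in front of it. The class and method names are invented for the example.

```csharp
using System;

// Hypothetical helper illustrating the newline-counting approach.
public static class LineLocator
{
    // Returns the 1-based line number of the first occurrence of needle,
    // or -1 when the text does not contain it.
    public static int FindLineNumber(string text, string needle)
    {
        int index = text.IndexOf(needle, StringComparison.Ordinal);
        if (index < 0)
        {
            return -1;
        }
        int line = 1;
        for (int i = 0; i < index; i++)
        {
            // Every '\n' before the match means the match is one line further down.
            if (text[i] == '\n')
            {
                line++;
            }
        }
        return line;
    }
}
```

Because only '\n' is counted, the same code works for files with "\r\n" line endings.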

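Better than ReadAllLines is File.ReadLines, which returns the lines lazily instead of loading the whole file into memory first: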
public static IEnumerable<string> ReadLines(string path)
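A short sketch of how File.ReadLines could be used for this question: the loop stops reading as soon as the string is found, so the rest of the file is never loaded. The class and method names are only illustrative.

```csharp
using System;
using System.IO;

// Hypothetical helper built on File.ReadLines.
public static class LazyLineSearch
{
    // Returns the 1-based number of the first line containing needle,
    // or -1 if no line contains it.
    public static int FindLineNumber(string path, string needle)
    {
        int lineNumber = 0;
        foreach (string line in File.ReadLines(path))
        {
            lineNumber++;
            if (line.Contains(needle))
            {
                return lineNumber; // stop early; later lines are not read
            }
        }
        return -1;
    }
}
```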
