Find in Files C#

I have a Folder which has multiple sub folders. Each sub folder has many .dot and .txt files in them.
Is there a simple solution in C# .NET that will iterate through each file and check the contents of that file for a key phrase or keyword?
Document Name    Keyword1    Keyword2    Keyword3    ...
test.dot         Y           N           Y
To summarise:
1. Select a folder.
2. Enter a list of keywords to search for.
3. The program then searches through each file and, at the end, outputs something like the table above.
I am not too worried about creating the DataTable to show in the DataGrid, as I can do that myself. I just need to perform the find-in-files function, similar to Notepad++'s Find in Files option.
Thanks in advance

What you want is to recursively iterate over the files in a directory (and maybe its subdirectories).
So your steps would be to loop over each file in the specified directory with Directory.GetFiles() from .NET, and whenever you encounter a subdirectory, repeat the process for it.
This can easily be done with this code sample:
public static IEnumerable<string> GetFiles(string path)
{
    foreach (string s in Directory.GetFiles(path, "*.extension_here"))
    {
        yield return s;
    }
    foreach (string s in Directory.GetDirectories(path))
    {
        foreach (string s1 in GetFiles(s))
        {
            yield return s1;
        }
    }
}
A more in-depth look at iterating through files in directories in .NET is located here:
http://blogs.msdn.com/b/brada/archive/2004/03/04/84069.aspx
Then use the IndexOf method of String to check whether your keywords are in the file. I discourage the use of ReadAllText: if your file is 5 MB, your string will be that big too. Reading line by line is less memory-hungry.
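As a rough sketch of the line-by-line approach, the per-file Y/N flags could be collected as shown here. The helper name FindInFiles.Search is hypothetical (it is not from the question or any library), and the case-insensitive comparison and the early exit are my own assumptions, so adjust them as needed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FindInFiles
{
    // Hypothetical helper: for each file, returns one flag per keyword
    // (true if the keyword occurs anywhere in that file).
    public static Dictionary<string, bool[]> Search(
        string folder, string[] patterns, string[] keywords)
    {
        var result = new Dictionary<string, bool[]>();
        IEnumerable<string> files = patterns.SelectMany(
            p => Directory.EnumerateFiles(folder, p, SearchOption.AllDirectories));

        foreach (string file in files)
        {
            var found = new bool[keywords.Length];
            // File.ReadLines streams the file, so only one line is held in memory at a time.
            foreach (string line in File.ReadLines(file))
            {
                for (int i = 0; i < keywords.Length; i++)
                {
                    // Assumption: case-insensitive matching is wanted.
                    if (!found[i] && line.IndexOf(keywords[i], StringComparison.OrdinalIgnoreCase) >= 0)
                        found[i] = true;
                }
                if (found.All(f => f)) break; // every keyword seen, stop reading this file
            }
            result[file] = found;
        }
        return result;
    }
}
```

Each entry maps a file path to one flag per keyword, so a row such as "test.dot Y N Y" can be produced with string.Join(" ", flags.Select(f => f ? "Y" : "N")).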

You can use Directory.EnumerateFiles with a search pattern and the recursion option (SearchOption.AllDirectories). The rest is easy with LINQ:
var keyWords = new[] { "Y", "N", "Y" };
var allDotFiles = Directory.EnumerateFiles(folder, "*.dot", SearchOption.AllDirectories);
var allTxtFiles = Directory.EnumerateFiles(folder, "*.txt", SearchOption.AllDirectories);
var allFiles = allDotFiles.Concat(allTxtFiles);
var allMatches = from fn in allFiles
                 from line in File.ReadLines(fn)
                 from kw in keyWords
                 where line.Contains(kw)
                 select new {
                     File = fn,
                     Line = line,
                     Keyword = kw
                 };
foreach (var matchInfo in allMatches)
    Console.WriteLine("File => {0} Line => {1} Keyword => {2}"
        , matchInfo.File, matchInfo.Line, matchInfo.Keyword);
Note that you need to add using System.Linq;
Is there a way just to get the line number?
If you just want the line numbers you can use this query:
var matches = allFiles.Select(fn => new
    {
        File = fn,
        LineIndices = String.Join(",",
            File.ReadLines(fn)
                .Select((l, i) => new { Line = l, Index = i })
                .Where(x => keyWords.Any(w => x.Line.Contains(w)))
                .Select(x => x.Index)),
    })
    .Where(x => x.LineIndices.Any());
foreach (var match in matches)
    Console.WriteLine("File => {0} Linenumber => {1}"
        , match.File, match.LineIndices);
It's a little more difficult because LINQ's query syntax has no way to access the element index, so the Select overload from the method syntax is used instead. Note that the indices are zero-based.

The first step: locate all files. It is easily done with System.IO.Directory.GetFiles() + System.IO.File.ReadAllText(), as others have mentioned.
The second step: find keywords in a file. This is simple if you have one keyword and it can be done with IndexOf() method, but iterating a file multiple times (especially if it is big) is a waste.
To quickly find multiple keywords in a text I think you should use the Aho-Corasick automaton (algorithm). See the C# implementation at CodeProject: http://www.codeproject.com/Articles/12383/Aho-Corasick-string-matching-in-C
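For illustration only, here is a minimal sketch of the idea behind Aho-Corasick. It is not the CodeProject implementation linked above: the class name KeywordMatcher and its Scan method are made up for this example, keywords are assumed to be non-empty and are matched case-sensitively, and the syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: finds which of several keywords occur in a text in a single pass.
public sealed class KeywordMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public readonly List<int> Hits = new List<int>(); // indices of keywords ending at this node
        public Node Fail;                                  // longest proper suffix that is also a trie prefix
    }

    private readonly Node _root = new Node();

    public KeywordMatcher(IReadOnlyList<string> keywords)
    {
        // Step 1: build a trie containing every keyword.
        for (int i = 0; i < keywords.Count; i++)
        {
            Node node = _root;
            foreach (char c in keywords[i])
            {
                if (!node.Next.TryGetValue(c, out Node child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Hits.Add(i);
        }

        // Step 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(edge.Key))
                    fail = fail.Fail;
                edge.Value.Fail = fail == null ? _root : fail.Next[edge.Key];
                // Keywords that end at the failure target also end here.
                edge.Value.Hits.AddRange(edge.Value.Fail.Hits);
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Marks found[i] = true for every keyword i that occurs in text.
    public void Scan(string text, bool[] found)
    {
        Node node = _root;
        foreach (char c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out Node next))
                node = next;
            foreach (int hit in node.Hits)
                found[hit] = true;
        }
    }
}
```

Scan can be called once per line with the same found array, as long as no keyword spans a line break, so it combines with File.ReadLines without loading the whole file.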

Here's a way using Tim's original answer to get the line number:
var keyWords = new[] { "Keyword1", "Keyword2", "Keyword3" };
var allDotFiles = Directory.EnumerateFiles(folder, "*.dot", SearchOption.AllDirectories);
var allTxtFiles = Directory.EnumerateFiles(folder, "*.txt", SearchOption.AllDirectories);
var allFiles = allDotFiles.Concat(allTxtFiles);
var allMatches = from fn in allFiles
                 from line in File.ReadLines(fn).Select((item, index) => new { LineNumber = index, Line = item })
                 from kw in keyWords
                 where line.Line.Contains(kw)
                 select new
                 {
                     File = fn,
                     Line = line.Line,
                     LineNumber = line.LineNumber,
                     Keyword = kw
                 };
foreach (var matchInfo in allMatches)
    Console.WriteLine("File => {0} Line => {1} Keyword => {2} Line Number => {3}"
        , matchInfo.File, matchInfo.Line, matchInfo.Keyword, matchInfo.LineNumber);

Related

LINQ IN where query

I want to write a foreach loop that gets all files whose extensions are specified in an external txt file. For example, the file gives me a variable such as:
extensions = "jpg,tif,bmp,png"
or
extensions = "jpg,tif"
and I want to get only those files.
So far I have something like this, but I don't know how to go on.
extensions = Extensions.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sourceFile in Directory.GetFiles(SourcePath, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(extensions.)))
{
}
I don't know how to get to every element in the 'extensions' array. How can I solve that?
You can use Enumerable.Contains and System.IO.Path.GetExtension:
string[] extensions = {".jpg",".tif",".bmp",".png" };
var files = Directory.EnumerateFiles(SourcePath, "*.*", SearchOption.AllDirectories)
.Where(s => extensions.Contains(Path.GetExtension(s), StringComparer.InvariantCultureIgnoreCase));
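Since the extensions in the question arrive as a comma-separated string without leading dots, they need to be normalised before they can be compared with Path.GetExtension. A minimal sketch, assuming hypothetical helper names (ExtensionFilter, Parse, FindFiles) that are not part of any library:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExtensionFilter
{
    // Turns "jpg,tif" into a case-insensitive set { ".jpg", ".tif" }.
    public static HashSet<string> Parse(string commaSeparated)
    {
        return new HashSet<string>(
            commaSeparated
                .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(e => e.Trim().TrimStart('.'))   // tolerate " jpg" and ".jpg"
                .Where(e => e.Length > 0)
                .Select(e => "." + e),
            StringComparer.OrdinalIgnoreCase);
    }

    // Lazily enumerates every file under sourcePath whose extension is in the list.
    public static IEnumerable<string> FindFiles(string sourcePath, string commaSeparated)
    {
        HashSet<string> wanted = Parse(commaSeparated);
        return Directory
            .EnumerateFiles(sourcePath, "*", SearchOption.AllDirectories)
            .Where(f => wanted.Contains(Path.GetExtension(f)));
    }
}
```

The foreach from the question then becomes foreach (string sourceFile in ExtensionFilter.FindFiles(SourcePath, Extensions)) { ... }.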

How to get files from a specific sub-directories using c#?

I'm trying to get all *.html files which are inside sub-directories named abcd to an array.
The given path can contain multiple *.html files in multiple sub-directories and even in the root directory (i.e. immediately inside the user-given path), but I only want those *.html files which are inside the specifically named sub-directories (abcd), using LINQ.
This is what I tried
string workingPath = @"D:\Testing";
string[] myFiles = workingPath.Select(dirs => Directory.GetDirectories(workingPath)
.Select(folders => (from item in Directory.GetDirectories(folders, "abcd", SearchOption.AllDirectories)
.Select(item => Directory.GetFiles(item, "*.html"))
)));
I'm getting the error "A query body must end with a select clause or a group clause (CS0742)". How do I fix this?
Your code does not look like it will compile. To start with, workingPath.Select enumerates the string as a collection of chars, and you are trying to iterate over that again, which does not make sense given your requirements.
You need something like this:
var files = new List<string>();
if (Directory.Exists(workingPath))
{
    foreach (var f in Directory.GetDirectories(workingPath, "abcd",
        SearchOption.AllDirectories))
    {
        files.AddRange(Directory.GetFiles(f, "*.html"));
    }
}
You can also do it as a one-liner using LINQ:
var files2 = Directory.GetDirectories(workingPath, "abcd", SearchOption.AllDirectories)
    .SelectMany(d => Directory.GetFiles(d, "*.html")).ToArray();

Remove names that contain another in a list

I have a file with "Name|Number" in each line and I wish to remove the lines with names that contain another name in the list.
For example, if there are "PEDRO|3", "PEDROFILHO|5" and "PEDROPHELIS|1" in the file, I wish to remove the lines "PEDROFILHO|5" and "PEDROPHELIS|1".
The list has 1.8 million lines. I did it like this, but it's too slow:
List<string> names = File.ReadAllLines("firstNames.txt").ToList();
List<string> result = File.ReadAllLines("firstNames.txt").ToList();
foreach (string name in names)
{
    string tempName = name.Split('|')[0];
    List<string> temp = names.Where(t => t.Contains(tempName)).ToList();
    foreach (string str in temp)
    {
        if (str.Equals(name))
        {
            continue;
        }
        result.Remove(str);
    }
}
File.WriteAllLines("result.txt", result);
Does anyone know a faster way? Or how to improve the speed?
Since you are looking for matches anywhere in the word, you will end up with an O(n^2) algorithm. You can improve the implementation a bit by avoiding removal of strings from a list, which is an O(n) operation in itself:
var toDelete = new HashSet<string>();
var names = File.ReadAllLines("firstNames.txt");
foreach (string name in names) {
    var tempName = name.Split('|')[0];
    toDelete.UnionWith(
        // Length constraint removes self-matches
        names.Where(t => t.Length > name.Length && t.Contains(tempName))
    );
}
File.WriteAllLines("result.txt", names.Where(name => !toDelete.Contains(name)));
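If the names are short, one way to avoid comparing every line with every other line is to put all names in a HashSet and look up each proper substring of a name. This is a different technique from the code above, offered only as a sketch: NameFilter and its methods are hypothetical, the comparison is case-sensitive and uses only the part before '|', and lines that share exactly the same name are all kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NameFilter
{
    // Keeps only the lines whose name does not contain another, shorter name from the list.
    public static List<string> RemoveContainingNames(IEnumerable<string> lines)
    {
        List<string> all = lines.ToList();
        var names = new HashSet<string>(all.Select(NameOf), StringComparer.Ordinal);
        return all.Where(line => !ContainsOtherName(NameOf(line), names)).ToList();
    }

    // "PEDRO|3" -> "PEDRO"
    private static string NameOf(string line)
    {
        return line.Split('|')[0];
    }

    private static bool ContainsOtherName(string name, HashSet<string> names)
    {
        // Try every proper substring of the name; each lookup is O(1) on average,
        // so the cost per line depends on the name length, not on the list size.
        for (int len = 1; len < name.Length; len++)
            for (int start = 0; start + len <= name.Length; start++)
                if (names.Contains(name.Substring(start, len)))
                    return true;
        return false;
    }
}
```

It would be called as File.WriteAllLines("result.txt", NameFilter.RemoveContainingNames(File.ReadLines("firstNames.txt"))). I have not measured it on 1.8 million lines.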
This works, but I don't know if it's quicker. I haven't tested it on millions of lines. Remove the ToLower calls if the names are all in the same case.
List<string> names = File.ReadAllLines(@"C:\Users\Rob\Desktop\File.txt").ToList();
var result = names.Where(w => !names.Any(a => w.Split('|')[0].Length > a.Split('|')[0].Length && w.Split('|')[0].ToLower().Contains(a.Split('|')[0].ToLower())));
File.WriteAllLines(@"C:\Users\Rob\Desktop\result.txt", result);
The test file had:
Rob|1
Robbie|2
Bert|3
Robert|4
Jan|5
John|6
Janice|7
Carol|8
Carolyne|9
Geoff|10
Geoffrey|11
The result had:
Rob|1
Bert|3
Jan|5
John|6
Carol|8
Geoff|10

How to get the last file that does not contain a specific string

I know how to get the last file; this is the code:
string pattern = "Log*.xml";
string directory = set.Path;
var dirInfo = new DirectoryInfo(directory);
var file = (from f in dirInfo.GetFiles(pattern) orderby f.LastWriteTime descending select f).First();
My question is: how can I get the last file that does not contain a specific string? In other words, how can I get the last file that does not contain the string "This is temporally file"?
Thank you!
Off the top of my head:
dirInfo.EnumerateFiles(pattern)
    .OrderByDescending(f => f.LastWriteTime)
    .Where(f => DoesntContain(f, myText))
    .FirstOrDefault()
Now you are free to make DoesntContain as complex or as simple as you want. Either use File.ReadAllText or something like:
bool DoesntContain(FileInfo fileInfo, string text) {
    using (StreamReader r = fileInfo.OpenText()) {
        var contents = r.ReadToEnd();
        return !contents.Contains(text);
    }
}
You can write the method as an extension method to get a more natural syntax like fi.DoesntContain(...).
Additionally, I suggest using EnumerateFiles instead of GetFiles if the directory can contain many files: there is no need to retrieve them all if the first one matches.
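As a sketch of the extension-method variant, the file can also be read line by line so that reading stops at the first hit. This assumes the text never spans a line break; LogFinder and NewestWithout are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Linq;

public static class FileInfoExtensions
{
    // True when no line of the file contains the text. Stops reading at the first hit.
    public static bool DoesntContain(this FileInfo file, string text)
    {
        return !File.ReadLines(file.FullName).Any(line => line.Contains(text));
    }
}

public static class LogFinder
{
    // Newest file matching the pattern that does not contain the text, or null if there is none.
    public static FileInfo NewestWithout(string directory, string pattern, string text)
    {
        return new DirectoryInfo(directory)
            .EnumerateFiles(pattern)
            .OrderByDescending(f => f.LastWriteTime)
            .FirstOrDefault(f => f.DoesntContain(text));
    }
}
```

Because FirstOrDefault is lazy, older files are only opened when every newer one contains the text.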
You can do something like this (negate the condition to get the files that do not contain the string):
string pattern = "Log*.xml";
var dirInfo = new DirectoryInfo(directory);
var filesThatContains = dirInfo.GetFiles(pattern)
    .Where(f => File.ReadAllText(f.FullName, Encoding.UTF8).IndexOf(SEARCH_STRING) >= 0);
I would do something simpler for a start:
public static string[] FileNamesExcluding(string path, string pattern, string textToExclude)
{
    // Get all files in the directory that match the pattern (matching is case-insensitive on Windows).
    string[] allFilesMatchingPattern = Directory.GetFiles(path, pattern);
    // Keep only the files whose path does not contain the text. Note that this checks the path, not the file contents.
    return allFilesMatchingPattern.Where(a => !a.Contains(textToExclude)).ToArray();
}
To call this method you can do:
FileNamesExcluding(@"C:\", "*.sys", "config").Last();

Exclude certain file extensions when getting files from a directory

How do I exclude certain file types when getting files from a directory?
I tried
var files = Directory.GetFiles(jobDir);
But it seems that this function can only choose the file types you want to include, not exclude.
You have to filter the files yourself. You can write something like this:
var files = Directory.GetFiles(jobDir).Where(name => !name.EndsWith(".xml"));
I know this is an old question, but it is still relevant to me.
If you want to exclude a list of file extensions (based on https://stackoverflow.com/a/19961761/1970301):
var exts = new[] { ".mp3", ".jpg" };
public IEnumerable<string> FilterFiles(string path, params string[] exts) {
    return
        Directory
            .GetFiles(path)
            .Where(file => !exts.Any(x => file.EndsWith(x, StringComparison.OrdinalIgnoreCase)));
}
You could try something like this:
var allFiles = Directory.GetFiles(@"C:\Path\", "*");
var filesToExclude = Directory.GetFiles(@"C:\Path\", "*.txt");
var wantedFiles = allFiles.Except(filesToExclude);
I guess you can use a lambda expression:
var files = Array.FindAll(Directory.GetFiles(jobDir), x => !x.EndsWith(".myext"));
You can try this:
var directoryInfo = new DirectoryInfo(@"C:\YourPath");
var filesInfo = directoryInfo.GetFiles().Where(x => x.Extension != ".pdb");
As far as I know, there is no way to specify exclude patterns.
You have to do it manually, like:
string[] files = Directory.GetFiles(myDir);
foreach (string fileName in files)
{
    DoSomething(fileName);
}
This is my version of the answers above:
List<FileInfo> fileInfoList = new DirectoryInfo(myPath).GetFiles(fileNameOnly + "*").Where(x => !x.Name.EndsWith(".pdf")).ToList();
I came across this while looking for a way to do this where the exclusion could use the search pattern rules and not just EndsWith-style logic.
e.g. Search pattern wildcard specifier matches:
* (asterisk) Zero or more characters in that position.
? (question mark) Zero or one character in that position.
This could be used for the above as follows.
string dir = @"C:\Temp";
var items = Directory.GetFiles(dir, "*.*").Except(Directory.GetFiles(dir, "*.xml"));
Or to exclude items that would otherwise be included.
string dir = @"C:\Temp";
var items = Directory.GetFiles(dir, "*.txt").Except(Directory.GetFiles(dir, "*HOLD*.txt"));
I used this:
Directory.GetFiles(_PATH, "*.dll");
where _PATH is:
public static string _PATH = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
