.net linq with regex ismatch in where - c#

In the following C# method, I know that Directory.GetFiles() does return the list of files, and adding the Where(f => f.Contains(contact)) filter works. However, for the life of me I cannot determine why searchPattern.IsMatch() fails to find files. I've tested the pattern at http://regexpal.com/ and it works as expected. The namePattern is "^\d{3}(.*).pdf" and there should be a match.
public static List<string> GetFileNames(string pathName, string namePattern, string contact)
{
var searchPattern = new Regex(namePattern, RegexOptions.IgnoreCase);
var files = Directory.GetFiles(pathName).Where(f => searchPattern.IsMatch(f));
//.Where(f => f.Contains(contact));
return files.ToList();
}
If this is already answered somewhere, please let me know, but I've not been able to locate it. I thought this was pretty simple and straightforward.

Directory.GetFiles returns the full file path, which looks like Drive\Directory\File.ext. That's why your pattern doesn't match. You need the file name alone as the subject. Try this:
var files = Directory.GetFiles(pathName)
.Where(f => searchPattern.IsMatch(Path.GetFileName(f)));

Directory.GetFiles() returns a list of filenames appended to the path supplied as a parameter. Your regular expression is "^\d{3}(.*).pdf", that is a string beginning with three digits. If you supplied a string that's an absolute path, it will start with either "/" on Unix or "C:\" on Windows and if it's a relative path, it will start with a directory name. Your code would work if pathName was just an empty string and you were searching the current directory.
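A minimal sketch of the fix described in the answers above, with the matching step separated from the directory listing so it can be exercised without touching the disk. The names FileNameFilter and FilterByName are hypothetical, and the contact parameter from the question is left out for brevity.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileNameFilter
{
    // Hypothetical helper: keeps only the paths whose file name
    // (not the full path) matches the pattern.
    public static List<string> FilterByName(IEnumerable<string> paths, string namePattern)
    {
        var searchPattern = new Regex(namePattern, RegexOptions.IgnoreCase);
        return paths
            .Where(p => searchPattern.IsMatch(Path.GetFileName(p)))
            .ToList();
    }

    // Same shape as the method in the question, minus the contact filter.
    public static List<string> GetFileNames(string pathName, string namePattern)
        => FilterByName(Directory.EnumerateFiles(pathName), namePattern);
}
```

Note that in the original pattern "^\d{3}(.*).pdf" the dot before pdf is unescaped and therefore matches any character; a stricter variant would be @"^\d{3}(.*)\.pdf$".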

Related

Constrain a path to be within a folder

Say you want to store a file within a folder C:\A\B\C and let the user supply the file name.
Just combine them, right?
Wrong.
If the user selects something like \..\..\Ha.txt you might be in for a surprise.
So how do we restrict the result to within C:\A\B\C? It's fine if it's within a subfolder, just not over it.
I've used one of my test projects, it really doesn't matter:
Using c#10
internal class Program
{
static void Main(string[] args)
{
string template = @"F:\Projectes\Test\SourceGenerators";
string folder = @"..\..\..\..\Test1.sln";
Console.WriteLine(MatchDirectoryStructure(template, folder)
? "Match"
: "Doesn't match");
}
static bool MatchDirectoryStructure(string template, string folder)
=> new DirectoryInfo(folder).FullName.StartsWith(template);
}
As you can see, new DirectoryInfo(folder).FullName returns the fully resolved path.
From here you can check whether it matches the desired result.
In this case the returned value is:
Match
If you're asking for a file name, then it should be just the name of the file. The more control you give to the user about subdirectories, the more they can mess with you.
The idea here is to split your path by both possible slashes (/ and \) and see if the value of any of the entries in the array is "..".
string input = @"\..\..\Ha.txt";
bool containsBadSegments = input
.Split(new [] { '/', '\\' })
.Any(s => s is "..");
This answer only takes care of detecting \..\ in the path. There are plenty of other ways to input bad values, such as characters not allowed by the OS's file system, or absolute or rooted paths.
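To cover rooted paths and sibling folders as well, one possible approach is to resolve the combined path and compare it against the base folder with a trailing separator. This is only a sketch: PathGuard and IsInside are hypothetical names, the case-insensitive comparison assumes a Windows-style file system, and symbolic links are not resolved.

```csharp
using System;
using System.IO;

public static class PathGuard
{
    // Hypothetical helper: true when userPath, resolved against baseDir,
    // stays inside baseDir (subfolders allowed, parents and siblings not).
    public static bool IsInside(string baseDir, string userPath)
    {
        string root = Path.GetFullPath(baseDir);

        // Add a trailing separator so that "C:\A\B\C" does not accept "C:\A\B\CX\file".
        if (!root.EndsWith(Path.DirectorySeparatorChar.ToString(), StringComparison.Ordinal))
            root += Path.DirectorySeparatorChar;

        // Path.Combine returns userPath unchanged when it is rooted,
        // so rooted input resolves outside root and is rejected below.
        string candidate = Path.GetFullPath(Path.Combine(root, userPath));

        // Assumption: case-insensitive comparison, as on Windows.
        return candidate.StartsWith(root, StringComparison.OrdinalIgnoreCase);
    }
}
```

A call such as PathGuard.IsInside(@"C:\A\B\C", userSuppliedName) could then gate the write. Invalid characters still need separate validation.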

Confused about Directory.GetFiles

I've read the docs about the Directory.GetFiles search pattern and how it is used, because I noticed that *.dll finds both test.dll and test.dll_20170206. That behavior is documented.
Now, I have a program that lists files in a folder based on a user-configured mask and processes them. I noticed that masks like *.txt lead to the above mentioned "problem" as expected.
However, the mask fixedname.txt also causes fixedname.txt_20170206 or the like to appear in the list, even though the documentation states this only occurs
When you use the asterisk wildcard character in a searchPattern such as "*.txt"
Why is that?
PS: I just checked: Changing the file mask to fixednam?.txt does not help even though the docs say
When you use the question mark wildcard character, this method returns only files that match the specified file extension. For example, given two files, "file1.txt" and "file1.txtother", in a directory, a search pattern of "file?.txt" returns just the first file, whereas a search pattern of "file*.txt" returns both files.
If you need a solution, you can transform the filter pattern into a regular expression by replacing * with (.*) and ? with a single dot (.). You also have to escape pattern characters that are special in regular expressions, such as the dot. Then check each filename returned by Directory.GetFiles against this regular expression. Keep in mind to check not only that there is a match but also that the match length equals the length of the filename. Otherwise you get the same results as before.
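The conversion described above could be sketched as follows. WildcardMatcher and its members are hypothetical names; instead of comparing the match length with the filename length, this version anchors the expression with ^ and $, which has the same effect of requiring the whole name to match.

```csharp
using System.Text.RegularExpressions;

public static class WildcardMatcher
{
    // Hypothetical helper: converts a mask such as "fixednam?.txt"
    // into an anchored, case-insensitive regular expression.
    public static Regex ToRegex(string mask)
    {
        // Regex.Escape turns "*" into "\*" and "?" into "\?" and escapes the dot,
        // so the two wildcards can be swapped back in afterwards.
        string pattern = "^" + Regex.Escape(mask)
            .Replace(@"\*", ".*")
            .Replace(@"\?", ".") + "$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    // Pass the bare file name (e.g. from Path.GetFileName), not the full path.
    public static bool IsMatch(string fileName, string mask)
        => ToRegex(mask).IsMatch(fileName);
}
```

Combined with Directory.EnumerateFiles and Path.GetFileName, this lets the user-configured mask be applied to the file name alone.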
GetFiles uses a pattern search: it looks for all names in the path that end with the letters specified.
You can write code similar to the following to get only files with the .txt extension:
foreach (string strFileName in Directory.GetFiles(@"D:\test\", "*.txt"))
{
string extension;
extension = Path.GetExtension(strFileName);
if (extension != ".txt")
continue;
else
{
// process the file
}
}

How to use GetFiles() search to include doc files but exclude docx files?

Currently I am looping through my file system like this
For Each filename As String In Directory.GetFiles(sourceFolder, "*.doc")
However this is including docx files to the list of files that GetFiles returns. I wish to only search for doc files and not docx. Any idea if there is a truncate or stop search character I can use in the search pattern?
This is the default behaviour of GetFiles; you can use LINQ to do further filtering.
var files = Directory.GetFiles(@"C:\test", "*.doc")
.Where(file=> file.EndsWith(".doc", StringComparison.CurrentCultureIgnoreCase))
.ToArray();//If you want an array back
Directory.GetFiles Method (String, String)
When you use the asterisk wildcard character in a searchPattern such
as "*.txt", the number of characters in the specified extension
affects the search as follows:
If the specified extension is exactly three characters long, the method returns files with extensions that begin with the specified extension. For example, "*.xls" returns both "book.xls" and "book.xlsx".
Given the fact that you want to iterate over your files and considering the default behavior of these methods I suggest to use EnumerateFiles instead of GetFiles. In this way you could add a simple check on the extension of the current file
foreach(string filename in Directory.EnumerateFiles(sourceFolder, "*.doc"))
{
if(!filename.EndsWith("x", StringComparison.CurrentCultureIgnoreCase))
{
.....
}
}
Not as elegant as the LINQ-only solution, but it works and does not create an array of all the filenames present in the directory.
I am not a C# programmer, so there may be syntax mistakes, but I think this may solve your problem.
foreach (FileInfo fi in di.GetFiles("*.doc")
.Where(f => string.Compare(".doc", f.Extension,
StringComparison.OrdinalIgnoreCase) == 0))
{
myFiles.Add(fi);
}
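The answers above share one idea: let GetFiles or EnumerateFiles do the coarse filtering and then compare the extension exactly. A minimal sketch of that idea, using the hypothetical names ExtensionFilter, HasExtension and EnumerateExact:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExtensionFilter
{
    // True only when the extension matches exactly (ignoring case),
    // so ".doc" accepts "a.doc" but not "a.docx".
    public static bool HasExtension(string path, string extension)
        => string.Equals(Path.GetExtension(path), extension, StringComparison.OrdinalIgnoreCase);

    // Lazily enumerates the folder and drops the extra matches
    // that the "*.doc" search pattern lets through.
    public static IEnumerable<string> EnumerateExact(string folder, string extension)
        => Directory.EnumerateFiles(folder, "*" + extension)
                    .Where(f => HasExtension(f, extension));
}
```

For example, ExtensionFilter.EnumerateExact(sourceFolder, ".doc") could replace the GetFiles call in the loop from the question.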

List files in folder which match pattern

I need to list files in directory which match some pattern.
I tried playing with Directory.GetFiles, but don't fully
get why it behaves in some way.
1) For example, this code:
string[] dirs = Directory.GetFiles(@"c:\test\", "*t");
foreach (string dir in dirs)
{
Debugger.Log(0, "", dir);
Debugger.Log(0, "", "\n");
}
outputs this:
c:\test\11.11.2007.txtGif
c:\test\12.1.1990.txt
c:\test\2.tGift
c:\test\2.txtGif
c:\test\test.txt
...others hidden
You can see some files end with f but were still returned by query, why?
2) Also, this:
string[] dirs = Directory.GetFiles(@"c:\test\", "*.*.*.txt");
foreach (string dir in dirs)
{
Debugger.Log(0, "", dir);
Debugger.Log(0, "", "\n");
}
returns this:
c:\test\1.1.1990.txt
c:\test\1.31.1990.txt
c:\test\12.1.1990.txt
c:\test\12.31.1990.txt
But according to the documentation (http://msdn.microsoft.com/en-us/library/07wt70x2(v=vs.110).aspx) I think it had to return also
this file which is in the directory:
11.11.2007.txtGif
since extension (in the query string) is 3 letters long, but it didn't. why?
(when query extension is 3 letters long, doc says it will return extensions which start with specified extensions too, e.g., see Remarks).
Am I the only one who finds these results strange?
Is there any other approach you would recommend for using when one wants to list files in folder which match certain pattern?
User in my case may arbitrarily type some pattern, and I don't want to rely on
method which I am unsure about the result (like it happened with GetFiles).
This is the way that the Windows API works - you will see the same results if you use the dir command in a command prompt. This does NOT use regular expressions! It's pretty obscure...
If you want to do your own filtering, you can do it like so:
var filesEndingInT = Directory.EnumerateFiles(@"c:\test\").Where(f => f.EndsWith("t"));
If you want to use regular expressions to match, you can do it thusly:
Regex regex = new Regex(".*t$");
var matches = Directory.EnumerateFiles(@"c:\test\").Where(f => regex.IsMatch(f));
I suspect that you will want to let the user type in a simplified form of pattern and turn it into a regular expression, e.g.
"*t" -> ".*t$"
The regular expression to find all filenames ending in t is ".*t$":
All of this behavior is exactly as described in the documentation you've linked. Here's an excerpt of the pertinent bits:
When you use the asterisk wildcard character in a searchPattern such
as "*.txt", the number of characters in the specified extension
affects the search as follows:
If the specified extension is exactly three characters long, the method returns files with extensions that begin with the specified
extension. For example, "*.xls" returns both "book.xls" and
"book.xlsx".
In all other cases, the method returns files that exactly match the specified extension. For example, "*.ai" returns "file.ai" but not
"file.aif".
When you use the question mark wildcard character, this method returns
only files that match the specified file extension. For example, given
two files, "file1.txt" and "file1.txtother", in a directory, a search
pattern of "file?.txt" returns just the first file, whereas a search
pattern of "file*.txt" returns both files.
Note
Because this method checks against file names with both the 8.3 file
name format and the long file name format, a search pattern similar to
"*1*.txt" may return unexpected file names. For example, using a
search pattern of "*1*.txt" returns "longfilename.txt" because the
equivalent 8.3 file name format is "LONGFI~1.TXT".
http://msdn.microsoft.com/en-us/library/wz42302f%28v=vs.110%29.aspx
The last paragraph above clearly explains your results when searching for *t. You can see this by using the command dir C:\test /x to show the 8.3 filenames. Here, C:\test\11.11.2007.txtGif matches *t because its 8.3 filename is 111120~1.TXT.
For the treatment of *.*.*.txt, I think you're either mis-interpreting the first bit about three-letter file extensions or perhaps it wasn't written quite clearly. Note that they quite specifically mentioned wildcard usage 'in a searchPattern such as "*.txt"'. Your search pattern doesn't match that, so you have to read between the lines a bit to see why their comment about three-letter file extensions applies to the example they gave but not yours. Really, I think that whole top section can be ignored if you just put a bit of thought into the last bit about 8.3 filenames. The treatment of three-letter file extensions after wildcards is really just a side-effect of the 8.3 filename search behavior.
Consider the examples they gave:
"*.xls" returns both "book.xls" and "book.xlsx"
This is because the filename for "book.xls" (both 8.3 and long filename, since the name naturally complies with 8.3) and the 8.3 filename for "book.xlsx" ("BOOK~1.XLS") matches a query of "*.xls".
"*.ai" returns "file.ai" but not "file.aif"
This is because "file.ai" naturally matches "*.ai" while "file.aif" doesn't. 8.3 search behavior doesn't come into play here at all, because both filenames are already 8.3-compliant. However, even if they weren't, the same would still hold true because any 8.3 filename for a file with an extension of ".ai" is still going to end in just ".AI".
The only reason it matters whether or not the file extension in your search is exactly three characters is that the 8.3 filenames are included in the search, and 8.3 filename extensions for objects with long filenames will always have just the first three characters after the last dot in the long filename. The key part missing from the documentation above is that the "first three characters" matching is done only against the 8.3 filename.
So, let's look at the anomalies you're asking about here. (If you want any other strange behaviors explained, beyond your results for *t and *.*.*.txt, please post them as separate questions.)
TL;DR:
Output of a search for *t includes 11.11.2007.txtGif and 2.txtGif.
This is because the 8.3 filenames match a pattern of *t.
11.11.2007.txtGif = 111120~1.TXT
2.txtGif = 2BEFD~1.TXT
(Both 8.3 filenames end in "T".)
Output of a search for *.*.*.txt does not include 11.11.2007.txtGif.
This is because neither the long filename, nor the 8.3 filename, match a pattern of *.*.*.txt.
11.11.2007.txtGif = 111120~1.TXT
(The long filename doesn't match because it doesn't end in ".txt", and the 8.3 filename doesn't match because it only has one dot.)
https://learn.microsoft.com/en-us/dotnet/api/system.io.directoryinfo.getfiles?view=netframework-4.5
The above Microsoft documentation is wrong as usual,
it says this code:
DirectoryInfo di = new DirectoryInfo(@"C:\Users\tomfitz\Documents\ExampleDir");
Console.WriteLine("No search pattern returns:");
Console.WriteLine();
Console.WriteLine("Search pattern *2* returns:");
foreach (var fi in di.GetFiles("*2*"))
{
Console.WriteLine(fi.Name);
Console.WriteLine(fi.FullName); // this reveals the bug
}
should return the following, but it does not.
It still matches against the whole file path, not just the filename.
Search pattern *2* returns:
log2.txt
test2.txt

C# - Check filename ends with certain word

Hi, I would like to know how to validate a filename in C# to check that it ends with a certain word (not just contains it, but has it at the end of the filename).
E.g., I want to modify certain files that end with the suffix Supplier, so I want to test each file to see whether it ends with Supplier: LondonSupplier.txt, ManchesterSupplier.txt and BirminghamSupplier.txt would all pass and return true, but ManchesterSuppliers.txt wouldn't.
Is this even possible? I know you can test for a certain word anywhere within a filename, but is it possible to do what I'm suggesting?
Try:
Path.GetFileNameWithoutExtension(path).EndsWith("Supplier")
if (myFileName.EndsWith("whatever"))
{
// Do stuff
}
By utilizing the System.IO.Path.GetFileNameWithoutExtension(string) method, you can extract the filename (sans extension) from a string. For example, calling it with the string C:\svn\trunk\MySourceFile.cs would return the string MySourceFile. After this, you can use the String.EndsWith method to see if your filename matches your criteria.
Linq solution:
var result = FilePaths.Where(name => Path.GetFileNameWithoutExtension(name).EndsWith("Supplier"))
.ToList();
Assuming that FilePaths is a list containing all the paths.
