exact functionality of the * wildcard in searchpatterns - c#

I have a folder with more than 150 files, and I want to collect a list of all the ones that contain a certain keyword. The keyword can be at the beginning or somewhere in the middle. "*.xml" catches all the XML files.
Here is my question: when I use "*partkey*.xml", does this catch all the files that contain the substring?
For example:
string[] files = Directory.GetFiles("thepathtothefolder", "*key*.xml");
Do I get my expected output?

You might want to look it up in the documentation for Directory.GetFiles. There you will find the "exact" description of the meaning of the wildcard characters * and ?. The * character has had the same meaning since MS-DOS times: it stands for 'zero or more' characters.
The line
string[] files = Directory.GetFiles("thepathtothefolder", "*key*.xml");
will give you an array with all the file names that contain the three characters 'key' and end in .xml.

Yes. With the search pattern "*partkey*.xml" you will get all the files that end with ".xml" and contain the string "partkey".
Output example:
123partkey123.xml
456partkey456.xml
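If you would rather not depend on how the search pattern is interpreted, one option is to enumerate the .xml files and check for the keyword yourself. This is only a sketch: KeywordFilter, NameContainsKeyword and FindXmlFilesWithKeyword are made-up names, and the case-insensitive comparison is an assumption about what you want.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class KeywordFilter
{
    // True when the file has the extension ".xml" and the keyword
    // appears anywhere in the name before the extension.
    // Both checks ignore case (an assumption, adjust as needed).
    public static bool NameContainsKeyword(string path, string keyword)
    {
        string name = Path.GetFileNameWithoutExtension(path);
        string ext = Path.GetExtension(path);
        return ext.Equals(".xml", StringComparison.OrdinalIgnoreCase)
            && name.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Enumerates the folder and keeps only the files that pass the check above.
    public static List<string> FindXmlFilesWithKeyword(string folder, string keyword)
    {
        return Directory.EnumerateFiles(folder, "*.xml")
                        .Where(f => NameContainsKeyword(f, keyword))
                        .ToList();
    }
}
```

Calling KeywordFilter.FindXmlFilesWithKeyword("thepathtothefolder", "partkey") would then give the same kind of list as the GetFiles call, with the matching rule spelled out in code.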

Related

How to use Path.GetInvalidPathChars to find a file name inside a larger string?

My goal is to find a file name ("MyFile.txt") inside a larger string. I.e.:
Some text before MyFile.txt some other text after
Currently I'm successfully using a Regular Expression with a character class of something like the following (simplified):
[\w\.\-]
This works fine, until the file contains other characters that are outside the \w group, e.g. an em dash: "My—File.txt".
My approach:
The method Path.GetInvalidPathChars returns an array of invalid characters. I've tried to use this method. Unfortunately, I found no way of "converting" this to be useful inside a Regular Expression.
I'm aware of
The SO posting "How to remove illegal characters from path and filenames?"
The concept of "Character class subtraction"
Still, I found no solution.
My question:
Is there any Regular Expression (or any other way) to find and extract a file name inside a larger string, based on the result of Path.GetInvalidPathChars?
I wouldn't use a regex for this at all, as it becomes incredibly complex and unreadable. In particular, a file name could be nearly any string, including most special characters, numbers, and spaces. Even worse, there are files without a dot to separate an extension. So I'd suggest simply doing a Contains check on all your invalid characters (this needs using System.Linq):
char[] invalidChars = Path.GetInvalidPathChars();
bool valid = !myString.Any(x => invalidChars.Contains(x));
Extracting the candidates instead is even simpler. The idea is to split your large string on all invalid characters. This means everything in between the invalid characters is considered a file-name, e.g:
"myTest.extension" → "myTest.extension"
"myFile:anotherFile" → "myFile"; "anotherFile"
"myFile with space" → "myFile with space"
"a File with .-determined extension.dot" → "a File with .-determined extension.dot"
This is achieved by this code:
var fileNames = myText.Split(invalidChars);
EDIT: If you really want a regex you can build one dynamically from your invalid characters:
var pattern = String.Format("([^{0}]+)", Regex.Escape(new String(invalidChars)));
var r = new Regex(pattern);
If your file names do not contain spaces and do contain an extension, then this simple idea may help you:
string line = "Some text before MyFile.txt some other text after";
//If you look for path:
//var array = Path.GetInvalidPathChars().ToList();
//If you look for file name
var array = Path.GetInvalidFileNameChars().ToList();
array.Add(' ');
var potentialFileNames = line.Split(array.ToArray(), StringSplitOptions.RemoveEmptyEntries)
.Where(i => i.Contains('.')).ToList();
//potentialFileNames[0] = "MyFile.txt"

Confused about Directory.GetFiles

I've read the docs about the Directory.GetFiles search pattern and how it is used, because I noticed that *.dll finds both test.dll and test.dll_20170206. That behavior is documented.
Now, I have a program that lists files in a folder based on a user-configured mask and processes them. I noticed that masks like *.txt lead to the above mentioned "problem" as expected.
However, the mask fixedname.txt also causes fixedname.txt_20170206 or the like to appear in the list, even though the documentation states this only occurs
When you use the asterisk wildcard character in a searchPattern such as "*.txt"
Why is that?
PS: I just checked: Changing the file mask to fixednam?.txt does not help even though the docs say
When you use the question mark wildcard character, this method returns only files that match the specified file extension. For example, given two files, "file1.txt" and "file1.txtother", in a directory, a search pattern of "file?.txt" returns just the first file, whereas a search pattern of "file*.txt" returns both files.
If you need a solution, you can transform the filter pattern into a regular expression by replacing * with .* and ? with a single dot (.). You also have to escape the pattern characters that are special in a regex, such as the dot. Then check each file name you got from Directory.GetFiles against this regular expression. Keep in mind to check not only that there is a match but also that the match length equals the length of the file name; otherwise you get the same results as before.
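The conversion described above could be sketched roughly as follows. WildcardMatcher, ToRegex and IsMatch are illustrative names, not part of any library. Anchoring the pattern with ^ and $ stands in for the "match length equals file name length" check, and treating ? as exactly one character is a simplification of what Windows actually does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardMatcher
{
    // Converts a file mask such as "fixednam?.txt" into an anchored regex.
    public static Regex ToRegex(string mask)
    {
        // Regex.Escape turns "*" into "\*" and "?" into "\?" (and escapes
        // the dot), so the escaped forms are what gets replaced here.
        string pattern = "^" + Regex.Escape(mask)
                                    .Replace(@"\*", ".*")
                                    .Replace(@"\?", ".") + "$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    // True only when the whole file name matches the mask.
    public static bool IsMatch(string fileName, string mask)
    {
        return ToRegex(mask).IsMatch(fileName);
    }
}
```

You would pass Path.GetFileName(file) for each result of Directory.GetFiles, so that only the name and not the directory part is compared against the mask.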
GetFiles uses a pattern search; it searches for all names in the path ending with the letters specified.
You can write code similar to the following to get only files with the .txt extension:
foreach (string strFileName in Directory.GetFiles(@"D:\test\", "*.txt"))
{
    // Skip names such as "file.txt_20170206" that the pattern also returns.
    string extension = Path.GetExtension(strFileName);
    if (!extension.Equals(".txt", StringComparison.OrdinalIgnoreCase))
        continue;

    // process the file
}

List files in folder which match pattern

I need to list files in directory which match some pattern.
I tried playing with Directory.GetFiles, but don't fully
get why it behaves in some way.
1) For example, this code:
string[] dirs = Directory.GetFiles(@"c:\test\", "*t");
foreach (string dir in dirs)
{
Debugger.Log(0, "", dir);
Debugger.Log(0, "", "\n");
}
outputs this:
c:\test\11.11.2007.txtGif
c:\test\12.1.1990.txt
c:\test\2.tGift
c:\test\2.txtGif
c:\test\test.txt
...others hidden
You can see some files end with f but were still returned by query, why?
2) Also, this:
string[] dirs = Directory.GetFiles(@"c:\test\", "*.*.*.txt");
foreach (string dir in dirs)
{
Debugger.Log(0, "", dir);
Debugger.Log(0, "", "\n");
}
returns this:
c:\test\1.1.1990.txt
c:\test\1.31.1990.txt
c:\test\12.1.1990.txt
c:\test\12.31.1990.txt
But according to the documentation (http://msdn.microsoft.com/en-us/library/07wt70x2(v=vs.110).aspx) I think it had to return also
this file which is in the directory:
11.11.2007.txtGif
since extension (in the query string) is 3 letters long, but it didn't. why?
(when query extension is 3 letters long, doc says it will return extensions which start with specified extensions too, e.g., see Remarks).
Am I the only one who finds these results strange?
Is there any other approach you would recommend for using when one wants to list files in folder which match certain pattern?
User in my case may arbitrarily type some pattern, and I don't want to rely on
method which I am unsure about the result (like it happened with GetFiles).
This is the way that the Windows API works - you will see the same results if you use the dir command in a command prompt. This does NOT use regular expressions! It's pretty obscure...
If you want to do your own filtering, you can do it like so:
var filesEndingInT = Directory.EnumerateFiles(@"c:\test\").Where(f => f.EndsWith("t"));
If you want to use regular expressions to match, you can do it thusly:
Regex regex = new Regex(".*t$");
var matches = Directory.EnumerateFiles(@"c:\test\").Where(f => regex.IsMatch(f));
I suspect that you will want to let the user type in a simplified form of pattern and turn it into a regular expression, e.g.
"*t" -> ".*t$"
The regular expression to find all filenames ending in t is ".*t$".
All of this behavior is exactly as described in the documentation you've linked. Here's an excerpt of the pertinent bits:
When you use the asterisk wildcard character in a searchPattern such
as "*.txt", the number of characters in the specified extension
affects the search as follows:
If the specified extension is exactly three characters long, the method returns files with extensions that begin with the specified
extension. For example, "*.xls" returns both "book.xls" and
"book.xlsx".
In all other cases, the method returns files that exactly match the specified extension. For example, "*.ai" returns "file.ai" but not
"file.aif".
When you use the question mark wildcard character, this method returns
only files that match the specified file extension. For example, given
two files, "file1.txt" and "file1.txtother", in a directory, a search
pattern of "file?.txt" returns just the first file, whereas a search
pattern of "file*.txt" returns both files.
Note
Because this method checks against file names with both the 8.3 file
name format and the long file name format, a search pattern similar to
"*1*.txt" may return unexpected file names. For example, using a
search pattern of "*1*.txt" returns "longfilename.txt" because the
equivalent 8.3 file name format is "LONGFI~1.TXT".
http://msdn.microsoft.com/en-us/library/wz42302f%28v=vs.110%29.aspx
The last paragraph above clearly explains your results when searching for *t. You can see this by using the command dir C:\test /x to show the 8.3 filenames. Here, C:\test\11.11.2007.txtGif matches *t because its 8.3 filename is 111120~1.TXT.
For the treatment of *.*.*.txt, I think you're either mis-interpreting the first bit about three-letter file extensions or perhaps it wasn't written quite clearly. Note that they quite specifically mentioned wildcard usage 'in a searchPattern such as "*.txt"'. Your search pattern doesn't match that, so you have to read between the lines a bit to see why their comment about three-letter file extensions applies to the example they gave but not yours. Really, I think that whole top section can be ignored if you just put a bit of thought into the last bit about 8.3 filenames. The treatment of three-letter file extensions after wildcards is really just a side-effect of the 8.3 filename search behavior.
Consider the examples they gave:
"*.xls" returns both "book.xls" and "book.xlsx"
This is because the filename for "book.xls" (both 8.3 and long filename, since the name naturally complies with 8.3) and the 8.3 filename for "book.xlsx" ("BOOK~1.XLS") both match a query of "*.xls".
"*.ai" returns "file.ai" but not "file.aif"
This is because "file.ai" naturally matches "*.ai" while "file.aif" doesn't. 8.3 search behavior doesn't come into play here at all, because both filenames are already 8.3-compliant. However, even if they weren't, the same would still hold true because any 8.3 filename for a file with an extension of ".ai" is still going to end in just ".AI".
The only reason it matters whether the file extension in your search is exactly three characters long is that the 8.3 filenames are included in the search, and 8.3 filename extensions for objects with long filenames will always have just the first three characters after the last dot in the long filename. The key part missing from the documentation above is that the "first three characters" matching is done only against the 8.3 filename.
So, let's look at the anomalies you're asking about here. (If you want any other strange behaviors explained, beyond your results for *t and *.*.*.txt, please post them as separate questions.)
TL;DR:
Output of a search for *t includes 11.11.2007.txtGif and 2.txtGif.
This is because the 8.3 filenames match a pattern of *t.
11.11.2007.txtGif = 111120~1.TXT
2.txtGif = 2BEFD~1.TXT
(Both 8.3 filenames end in "T".)
Output of a search for *.*.*.txt does not include 11.11.2007.txtGif.
This is because neither the long filename, nor the 8.3 filename, match a pattern of *.*.*.txt.
11.11.2007.txtGif = 111120~1.TXT
(The long filename doesn't match because it doesn't end in ".txt", and the 8.3 filename doesn't match because it only has one dot.)
https://learn.microsoft.com/en-us/dotnet/api/system.io.directoryinfo.getfiles?view=netframework-4.5
The Microsoft documentation above is wrong, as usual. It says that this code:
DirectoryInfo di = new DirectoryInfo(@"C:\Users\tomfitz\Documents\ExampleDir");
Console.WriteLine("No search pattern returns:");
Console.WriteLine();
Console.WriteLine("Search pattern *2* returns:");
foreach (var fi in di.GetFiles("*2*"))
{
    Console.WriteLine(fi.Name);
    Console.WriteLine(fi.FullName); // this reveals the bug
}
should return the following, but it does not. It still matches against the whole file path, not just the file name.
Search pattern *2* returns:
log2.txt
test2.txt

Get the last part of file name in C#

I need to get the last part, that is, the numeric value (318, 319), of the following text (it will vary):
C:\Uploads\X\X-1\37\Misc_318.pdf
C:\Uploads\X\X-1\37\Misc_ 319.pdf
C:\Uploads\X\C-1\37\Misc _ 320.pdf
Once I get that value, I need to search the entire folder. Once I find the file names with the matching number, I need to remove all spaces and rename the files in that particular folder.
Here is what I want:
First, get the last part of the file name (the number may vary).
Based on that number, search the folder to get all matching file names.
Once I have the file names, check them for spaces and remove the spaces.
Finding the Number
If the naming follows the convention SOMEPATH\SomeText_[Optional spaces]999.pdf, try
var file = System.IO.Path.GetFileNameWithoutExtension(thePath);
string[] parts = file.Split('_');
int number = int.Parse(parts[1]);
Of course, add error checking as appropriate. You may want to check that there are 2 parts after the split, and perhaps use int.TryParse() instead, depending on your confidence that the file names will follow that pattern and your ability to recover if TryParse() returns false.
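With the checks mentioned above added, the parsing step might look something like this. FileNumber and TryGetTrailingNumber are hypothetical names, and the rule that the name must contain exactly one underscore is an assumption based on the three sample paths.

```csharp
using System;
using System.IO;

public static class FileNumber
{
    // Returns true and sets 'number' when the name looks like "Text_ 319.pdf".
    // Returns false when there is not exactly one '_' or the part after it
    // is not an integer.
    public static bool TryGetTrailingNumber(string path, out int number)
    {
        number = 0;
        string name = Path.GetFileNameWithoutExtension(path);
        string[] parts = name.Split('_');
        if (parts.Length != 2)
            return false;
        // Trim so that "Misc_ 319" and "Misc _ 320" parse as well.
        return int.TryParse(parts[1].Trim(), out number);
    }
}
```

The caller can then decide what to do with names that do not follow the convention instead of getting an exception from int.Parse.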
Constructing the New File Name
I don't fully understand what you want to do once you have the number. However, have a look at Path.Combine() to build a new path if that's what you need, and you can use Directory.GetFiles() to search for a specific file name, or for files matching a pattern, in the desired directory.
Removing Spaces
If you have a file name with spaces in it, and you want all spaces removed, you can do
string newFilename = oldFilename.Replace(" ", "");
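If the old name is a full path, you probably only want to strip spaces from the file name and leave the directory alone. A rough sketch of that, with invented names (SpaceRemover, WithoutSpacesInFileName, RenameWithoutSpaces); note that File.Move throws if the target already exists, which this sketch does not handle:

```csharp
using System;
using System.IO;

public static class SpaceRemover
{
    // Builds the target path: same directory, file name without spaces.
    public static string WithoutSpacesInFileName(string path)
    {
        string directory = Path.GetDirectoryName(path) ?? "";
        string fileName = Path.GetFileName(path).Replace(" ", "");
        return Path.Combine(directory, fileName);
    }

    // Renames the file on disk, but only if its name actually changes.
    public static void RenameWithoutSpaces(string path)
    {
        string target = WithoutSpacesInFileName(path);
        if (target != path)
            File.Move(path, target);
    }
}
```

You could call RenameWithoutSpaces for every path returned by Directory.GetFiles for the number you are looking for.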
Here's a solution using a regex:
var s = @"C:\Uploads\X\X-1\37\Misc_ 319.pdf";
var match = Regex.Match(s, @"^.*?(\d+)(\.\w+)?$");
int i = int.Parse(match.Groups[1].Value);
// do something with i
It should work with or without an extension of any length (as long as it's a single extension, not like my file 123.tar.gz).

Getting files in directory with *.tif file mask using C#

So, I feel lame for asking this, but I'm kinda stumped. I'm trying to get a list of files in a directory that end in tif ... only tif ... not tiff. So, I did this in C# ...
Directory.GetFiles(path, "*.tif", SearchOption.TopDirectoryOnly);
I would expect it to only return tif files, but that is not the case. I get tiff as well. I would think that if I supplied the mask .tif? that would get me both, but not the mask .tif. I tried it at a command prompt as well and I am getting both as well in DOS. Am I missing something here? This just seems wrong to me. I guess I could sanitize the results afterwards, but if I don't have to that would be best.
From MSDN:
When using the asterisk wildcard character in a searchPattern (for
example, "*.txt"), the matching behavior varies depending on the
length of the specified file extension. A searchPattern with a file
extension of exactly three characters returns files with an extension
of three or more characters, where the first three characters match
the file extension specified in the searchPattern. A searchPattern
with a file extension of one, two, or more than three characters
returns only files with extensions of exactly that length that match
the file extension specified in the searchPattern. When using the
question mark wildcard character, this method returns only files that
match the specified file extension. For example, given two files in a
directory, "file1.txt" and "file1.txtother", a search pattern of
"file?.txt" returns only the first file, while a search pattern of
"file*.txt" returns both files.
That's just how Directory.GetFiles works. From the manual:
When using the asterisk wildcard character in a searchPattern, such as
"*.txt", the matching behavior when the extension is exactly three
characters long is different than when the extension is more or less
than three characters long. A searchPattern with a file extension of
exactly three characters returns files having an extension of three or
more characters, where the first three characters match the file
extension specified in the searchPattern.
Directory.GetFiles internally uses FindFirstFile function from Win32 API.
From the documentation of FindFirstFile:
• The search includes the long and short file names.
A file that has a long file name of asd.tiff will have a short file name like asd~1.tif, and this is why it shows up in the results.
Extensions longer than three characters are matched this way except when the path is on a network share (or mapped drive). For some reason, the pattern only matches the long file name on remote drives.
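Since the behavior depends on short file names and on where the folder lives, sanitizing the results afterwards is probably the most predictable route. A small sketch of that, where ExactExtension, HasExtension and GetTifFiles are made-up names and ignoring case is an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExactExtension
{
    // True only when the extension is exactly the given one (ignoring case).
    public static bool HasExtension(string path, string extension)
    {
        return string.Equals(Path.GetExtension(path), extension,
                             StringComparison.OrdinalIgnoreCase);
    }

    // Lets the file system pre-filter with "*.tif", then drops ".tiff" etc.
    public static IEnumerable<string> GetTifFiles(string folder)
    {
        return Directory.EnumerateFiles(folder, "*.tif", SearchOption.TopDirectoryOnly)
                        .Where(f => HasExtension(f, ".tif"));
    }
}
```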
