Search file with Regular expression - c#

I have the following recursive search function:
public List<FileInfo> Search_Files(String strDir, String line)
{
    List<FileInfo> files = new List<FileInfo>();
    try
    {
        foreach (String strFile in Directory.GetFiles(strDir,line+r))
        {
            files.Add(new FileInfo(strFile));
        }
        foreach (String strSubDir in Directory.GetDirectories(strDir))
        {
            List<FileInfo> sublist = Search_Files(strSubDir, line);
            foreach (FileInfo file_infow in sublist)
            {
                files.Add(file_infow);
            }
        }
    }
    catch (Exception)
    {
        ...
    }
    return (files);
}
The line variable's value looks like "1234".
Now I want to search for files like 1234c.something or 1234.something.
I created the following Regex:
Regex r = new Regex("[a-z].* | .*");
I added it to line string, but it doesn't work. Why does this not work and how can I correct this?

I used LINQ; give it a try:
string[] allFiles = Directory.GetFiles(@"C:\Users\UserName\Desktop\Files");
List<string> neededFiles = (from c in allFiles
                            where Path.GetFileName(c).StartsWith("fileStartName")
                            select c).ToList<string>();
foreach (var file in neededFiles)
{
    // do the task you want with the matching files
}

The GetDirectories and GetFiles methods accept a searchPattern that is not a regex.
The search string to match against the names of files in path. This parameter can contain a combination of valid literal path and wildcard (* and ?) characters (see Remarks), but doesn't support regular expressions.
You can filter the results with the following regex:
var r = new Regex(@"\d{4}.*");
// var r = new Regex(@"^\d{4}.*"); // Use this if file names should start with the 4 digits.
// files is a List<FileInfo>, so convert each matching path before adding it.
files.AddRange(Directory.GetFiles(strDir)
    .Where(p => r.IsMatch(Path.GetFileName(p)))
    .Select(p => new FileInfo(p)));
The \d{4}.* regex matches 4 digits (\d{4}) and any 0 or more characters but a newline.

If you want to match the '.' you have to escape it as '\.'. '.*' by itself means any character n-times. Have a look here for the specifics about formats: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
I would also suggest that you use a more strict regular expression. If you know that your file name starts by 1234, use it in the regular expression as well.

There are two ways to do this. The first is to use a Windows search filter. This is what you can pass directly to the GetFiles() method. (EnumerateFiles() does the same thing, and might be faster in this case, but that's irrelevant to your question.)
A Windows search pattern uses * to represent 'any number of any character' and ? to represent a single unknown character. These are not actually regular expressions.
You can then perform a search like this:
return Directory.EnumerateFiles(strDir, line + "*.*", SearchOption.AllDirectories)
.Select(f => new FileInfo(f))
.ToList();
The second is what you were originally looking for and that's performing a linq query with actual regular expressions. That can be done like this:
Regex pattern = new Regex("^" + Regex.Escape(line) + @".*\..*");
// regex says: start with line, then anything any number of times,
// and then a dot and then any chars any amount of times
return Directory.EnumerateFiles(strDir, "*.*", SearchOption.AllDirectories)
    .Where(f => pattern.IsMatch(Path.GetFileName(f))) // match the file name, not the full path
    .Select(f => new FileInfo(f))
    .ToList();
Note: The above two examples show how to also convert the provided strings to FileInfo objects like the signature of your Search_Files method requires in the "linq-way." Also I'm using the SearchOption.AllDirectories flag that performs the recursive search for you, without you needing to write your own.
As for why your originally posted method did not work; there are two issues with it.
You are attempting to concatenate a regex object with a string. This is not possible because you are looking to concat the regex pattern with the string. This should be done before (or inside of) the construction of the regex object as I showed in my example.
Assuming you did not attempt to concat a regex object with a string, the Regex pattern that you are using pretty much would match anything, always. This would not limit anything down.
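Putting the points above together, here is a minimal sketch of a corrected search. It assumes the goal is exactly what the question describes: the prefix, an optional single lowercase letter, then a dot and an extension. The names FileSearch, BuildPattern and SearchFiles are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileSearch
{
    // Builds a regex for names such as "1234.something" or "1234c.something".
    // Regex.Escape keeps any special characters in the prefix literal.
    public static Regex BuildPattern(string prefix)
    {
        return new Regex("^" + Regex.Escape(prefix) + @"[a-z]?\..+$");
    }

    // Recursively lists files under strDir whose file name matches the pattern.
    // The wildcard narrows the candidates; the regex does the precise check.
    public static List<FileInfo> SearchFiles(string strDir, string prefix)
    {
        Regex pattern = BuildPattern(prefix);
        return Directory.EnumerateFiles(strDir, prefix + "*", SearchOption.AllDirectories)
            .Where(f => pattern.IsMatch(Path.GetFileName(f)))
            .Select(f => new FileInfo(f))
            .ToList();
    }
}
```

Calling FileSearch.SearchFiles(strDir, "1234") would then take the place of the hand-written recursion. Note that SearchOption.AllDirectories can throw if a subdirectory is not accessible, so the try/catch from the original method may still be worth keeping around the call.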

Related

Get the string from known index till \r\n is found

I have a string s which reads my batch file content.
Suppose the content of s is as follows:
"\t\r\n##echo off\r\necho \"Hello World!!!\"\r\necho \"One\"\r\nset /p DUMMY=Hit ENTER to continue...\r\ncall second.bat\r\necho \"done!!!\"\r\ncall third.bat\r\necho \"done 3!!!\""
I want to write a condition that does the following:
while (s.Contains("call")) && (if string next to "call" contains(.bat))
How can I achieve this?
I am new to C#. Please help me with this.
Thanks in advance.
You can split the string on new lines and process only the lines you want as follows:
foreach (string line in s.Split(new[] { "\r\n" }, StringSplitOptions.None).Where(x => x.ToLower().StartsWith("call") && x.ToLower().EndsWith(".bat")))
{
// do stuff here
}
It seems that you are parsing some kind of log; in this case I suggest using regular expressions, e.g.
using System.Text.RegularExpressions;
...
string source =
"\t\r\n##echo off\r\necho \"Hello World!!!\"\r\necho \"One\"\r\nset /p DUMMY=Hit ENTER to continue...\r\ncall second.bat\r\necho \"done!!!\"\r\ncall third.bat\r\necho \"done 3!!!\"";
var matches = Regex
.Matches(source, @"call.+?\.bat", RegexOptions.IgnoreCase)
.OfType<Match>()
.Select(match => match.Value);
// call second.bat
// call third.bat
foreach (string match in matches) {
...
}
It's unclear what "string next" means; in the code above I've treated it as "after". If it means "after several white spaces", the pattern will be
.Matches(source, @"call\s+?\.bat", RegexOptions.IgnoreCase)
The first thing that comes to my mind is using the text.Split ('\n', '\r') method. This way you get an array of strings which are separated by those line break symbols. Because you'd get empty strings, you should also filter those out. For that, I would recommend converting the array to a list, iterate through all elements and remove all empty strings (consider using string.IsNullOrEmpty (text)).
If you always have \r\n, you can use text.Split("\r\n", StringSplitOptions.None) instead, and don't have to worry about empty strings in between. You could still convert it to a list for easier use.
Then you would get a fine list of the entire content separated through line breaks. Now you could loop through that and do whatever you want.
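The split-and-filter idea described above can be sketched as follows. This is only an illustration: the names BatchParser and FindBatCalls are hypothetical, and it assumes a "call" line consists of the word call, a space, and a file name ending in .bat.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchParser
{
    // Returns the trimmed lines that start with "call " and end with ".bat".
    public static List<string> FindBatCalls(string content)
    {
        return content
            // Splitting on both characters and dropping empty entries
            // handles "\r\n" as well as a lone "\n".
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(l => l.Trim())
            .Where(l => l.StartsWith("call ", StringComparison.OrdinalIgnoreCase)
                     && l.EndsWith(".bat", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

You can then loop over the result of BatchParser.FindBatCalls(s) and handle each call line in turn.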

What is a better-performing alternative to the following Regex?

I have a list and a text file, and I want to:
Find all list items that also occur in the text (matched words) and store them in a list or array
Replace all the matched words with "Names"
Count the matched words
The code works, but it takes about 10 minutes to execute, and I want to improve its performance. I have also tried using Contains instead of the regex, but that breaks the behavior, because I need to match whole words, not substrings.
Here is the code:
List<string> names = new List<string>();
// names = millions of values from the database
string text = System.IO.File.ReadAllText(@"D:\record-13.txt");
var letter = new Regex(@"(?<letter>\W)");
var pattern = string.Join("|", names
    .Select(n => $@"((?<=(^|\W)){letter.Replace(n, "[${letter}]")}(?=($|\W)))"));
var regex = new Regex(pattern);
var matchedWords = regex
.Matches(text)
.Cast<Match>()
.Select(m => m.Value)
//.Distinct()
.ToList();
text = regex.Replace(text, "Names");
Console.WriteLine($"Matched Words: {string.Join(", ", matchedWords.Distinct())}");
Console.WriteLine($"Count: {matchedWords.Count}");
Console.WriteLine($"Replaced Text: {text}");
Is there an alternative way to do the same thing as the code above, with better performance?
What you are doing is building a regular expression with "millions" of strings embedded in it, if Names really contains "millions" of strings. This is going to perform very poorly, even just to parse the regular expression, let alone evaluate it.
What you should do instead is load your Names into a HashSet<string>, then parse through the document one time, pulling out whole words. You can use a regular expression or write a state machine to do this. For each "word" you read, check if it exists in the HashSet<string> of names, and if so, write "Names" to your output (a StringBuilder would be ideal for this). If the word is not in the Names hashset, write the actual word to your output. Be sure to also write any non-word characters to the output as you encounter them. When you are done, your output will contain the sanitized result, and it should complete in milliseconds rather than minutes.
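A minimal sketch of this approach, under some assumptions: a "word" is taken to be a run of \w characters, so names that contain spaces or punctuation would need a different tokenizer, and the names NameScrubber and ReplaceNames are invented for the example. It uses Regex.Replace with a match evaluator instead of a hand-written StringBuilder loop; the effect is the same single pass over the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameScrubber
{
    // Replaces every whole word found in 'names' with "Names" in one pass.
    // 'matched' receives each replaced word, in order of appearance.
    public static string ReplaceNames(string text, HashSet<string> names, List<string> matched)
    {
        return Regex.Replace(text, @"\w+", m =>
        {
            // One hash lookup per word, regardless of how many names there are.
            if (names.Contains(m.Value))
            {
                matched.Add(m.Value);
                return "Names";
            }
            return m.Value; // not a name: keep the word unchanged
        });
    }
}
```

After the call, matched.Count gives the number of matched words, and the distinct matches can be obtained from matched if needed. Non-word characters are never touched by the regex, so they carry over to the output as they are.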
If I understand what you want, I think you can use this code instead.
If you can ignore the resulting matched words and count:
text = names.Select(name => $@"\b{Regex.Escape(name)}\b")
    .Aggregate(text, (current, pattern) => Regex.Replace(current, pattern, "Names"));
Else:
var count = 0;
var matchedWord = new List<string>();
foreach (var name in names)
{
var regex = new Regex($@"\b{name}\b");
if (regex.IsMatch(text))
{
count++;
matchedWord.Add(name);
}
text = regex.Replace(text, "Names");
}

How to filter out string with dot 4 digit number dot and rfa extension?

This should be quite straightforward. In DOS I can use a *.????.rfa filter, and it even works in the Windows context search.
How can I do this simply in a C# if condition?
if (file.Extension.Contains(".rfa") &!file.Extension.Contains(".0001.rfa"))
...
You can use this regular expression: @"\.\d{4}\.rfa$":
Regex.IsMatch(file.FullName, @"\.\d{4}\.rfa$")
I would combine LINQ and Regex.
Regex r = new Regex (@"\.\d{4}\.rfa$");
foreach (FileInfo fi in new DirectoryInfo(@"c:\path").GetFiles().Where(x=> r.IsMatch(x.Name))) {
Console.WriteLine (fi.FullName);
}
//edit: Added the pattern mentioned here.

Why does regex replace my feed string once + 3 times?

Interesting situation I have here. I have some files in a folder that all have a very explicit string in the first line that I always know will be there. What I want to do is really just append |DATA_SOURCE_KEY right after AVAILABLE_IND.
//regex to search for the bb_course_*.bbd files
string courseRegex = @"BB_COURSES_([C][E][Q]|[F][A]|[H][S]|[S][1]|[S][2]|[S][P])\d{1,6}.bbd";
string courseHeaderRegex = @"EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND";
//get files from the directory specified in the GetFiles parameter and return the matches to the regex
var matches = Directory.GetFiles(@"c:\courseFolder\").Where(path => Regex.Match(path, courseRegex).Success);
//prints the files returned
foreach (string file in matches)
{
Console.WriteLine(file);
File.WriteAllText(file, Regex.Replace(File.ReadAllText(file), courseHeaderRegex, "EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY"));
}
But this code takes the original occurrence of the matching regex, replaces it with my replacement value, and then does it 3 more times.
EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY|EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY|EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY|EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY
And I can't figure out why with breakpoints. My loop is running only 12 times to match the # of files I have in the directory. My only guess is that File.WriteAllText is somehow recursively searching itself after replacing the text and re-replacing. If that makes sense. Any ideas? Is it because courseHeaderRegex is so explicit?
If I change courseHeaderRegex to string courseHeaderRegex = @"AVAILABLE_IND";
then I get the correct changes in my files
EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY
I'd just like to understand why the original way doesn't work.
I think your problem is that you need to escape the | character in courseHeaderRegex:
string courseHeaderRegex = @"EXTERNAL_COURSE_KEY\|COURSE_ID\|COURSE_NAME\|AVAILABLE_IND";
The character | is the alternation operator, so the unescaped pattern matches 'EXTERNAL_COURSE_KEY', 'COURSE_ID', 'COURSE_NAME' and 'AVAILABLE_IND' as four separate alternatives and replaces each of them with your substitution string. That is why the replacement text appears four times.
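Instead of escaping each | by hand, Regex.Escape can do it for you. The following is a small sketch of that idea; the class name HeaderFix and the method AppendKey are hypothetical, and the header text is taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderFix
{
    const string Header = "EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND";

    // Appends |DATA_SOURCE_KEY to the header, treating '|' as a literal character.
    public static string AppendKey(string content)
    {
        // Regex.Escape turns each '|' into '\|', so the whole header is one literal match.
        string pattern = Regex.Escape(Header);
        return Regex.Replace(content, pattern, Header + "|DATA_SOURCE_KEY");
    }
}
```

Inside the loop this would be used as File.WriteAllText(file, HeaderFix.AppendKey(File.ReadAllText(file))). Keep in mind that running it a second time on the same file would append the key again, because the original header is still a prefix of the new one.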
What about
string newString = File.ReadAllText(file)
    .Replace(@"EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND",@"EXTERNAL_COURSE_KEY|COURSE_ID|COURSE_NAME|AVAILABLE_IND|DATA_SOURCE_KEY");
just using a simple String.Replace()

C# search into a string for a specific pattern, and put in an Array

I'm having the following string as an example:
<tr class="row_odd"><td>08:00</td><td>08:10</td><td>TEST1</td></tr><tr class="row_even"><td>08:10</td><td>08:15</td><td>TEST2</td></tr><tr class="row_odd"><td>08:15</td><td>08:20</td><td>TEST3</td></tr><tr class="row_even"><td>08:20</td><td>08:25</td><td>TEST4</td></tr><tr class="row_odd"><td>08:25</td><td>08:30</td><td>TEST5</td></tr>
I need to have the output as a one-dimensional array.
Like 11111=myArray(0) , 22222=myArray(1) , 33333=myArray(2) ,......
I have already tried the myString.replace, but it seems I can only replace a single Char that way. So I need to use expressions and a for loop for filling the array, but since this is my first c# project, that is a bridge too far for me.
Thanks,
It seems like you want to use a Regex search pattern. Then return the matches (using a named group) into an array.
var regex = new Regex(@"act=\?(?<Id>\d+)");
int[] ids = regex.Matches(input).Cast<Match>()
    .Select(m => m.Groups["Id"])
    .Where(g => g.Success)
    .Select(g => Int32.Parse(g.Value))
    .ToArray();
(PS. I'm not positive about the regex pattern - you should check into it yourself).
Several ways you could do this. A couple are:
a) Use a regular expression to look for what you want in the string. Use a named group so you can access the matches directly.
http://www.regular-expressions.info/dotnet.html
b) Split the expression at the location where your substrings are (e.g. split on "act="). You'll have to do a bit more parsing to get what you want, but that won't be too difficult, since it will be at the beginning of each split string (and you will have other strings that don't have your substring in them).
Use a combination of IndexOf and Substring... something like this would work (not sure how much your string varies). This will probably be quicker than any Regex you come up with. Although, looking at the length of your string, it might not really be an issue.
public static List<string> GetList(string data)
{
data = data.Replace("\"", ""); // get rid of annoying "'s
string[] S = data.Split(new string[] { "act=" }, StringSplitOptions.None);
var results = new List<string>();
foreach (string s in S)
{
if (!s.Contains("<tr"))
{
string output = s.Substring(0, s.IndexOf(">"));
results.Add(output);
}
}
return results;
}
Split your string on HTML tags like "<tr>", "</tr>", "<td>", "</td>", "<a>", "</a>" with the string variable's Split() function. This gives you an array of strings.
Split html row into string array
