What is a faster alternative to the following Regex? - C#

I have a list and a text file, and I want to:
Find all list items that also occur in the text (matched words) and store them in a list or array
Replace all the matched words with "Names"
Count the matched words
The code works fine, but it takes about 10 minutes to execute and I want to improve its performance. I also tried using Contains instead of the regex, but that changes the behaviour, because I need to match whole words, not substrings.
Here is the code:
List<string> names = new List<string>();
// names = millions of values from the database
string text = System.IO.File.ReadAllText(@"D:\record-13.txt");
var letter = new Regex(@"(?<letter>\W)");
var pattern = string.Join("|", names
    .Select(n => $@"((?<=(^|\W)){letter.Replace(n, "[${letter}]")}(?=($|\W)))"));
var regex = new Regex(pattern);
var matchedWords = regex
    .Matches(text)
    .Cast<Match>()
    .Select(m => m.Value)
    //.Distinct()
    .ToList();
text = regex.Replace(text, "Names");
Console.WriteLine($"Matched Words: {string.Join(", ", matchedWords.Distinct())}");
Console.WriteLine($"Count: {matchedWords.Count}");
Console.WriteLine($"Replaced Text: {text}");
Is there an alternative way to do the same thing as the code above, with better performance?

What you are doing is building a regular expression with "millions" of strings embedded in it, if Names really contains "millions" of strings. This is going to perform very poorly, even just to parse the regular expression, let alone evaluate it.
What you should do instead is load your Names into a HashSet<string>, then parse through the document one time, pulling out whole words. You can use a regular expression or write a state machine to do this. For each "word" you read, check if it exists in the HashSet<string> of names, and if so, write "Names" to your output (a StringBuilder would be ideal for this). If the word is not in the Names hashset, write the actual word to your output. Be sure to also write any non-word characters to the output as you encounter them. When you are done, your output will contain the sanitized result, and it should complete in milliseconds rather than minutes.
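The single-pass idea above could be sketched roughly as follows. This is a minimal sketch, not the answerer's own code: the sample names and text are made up, and it assumes every name is a single run of word characters (\w+), so names containing spaces or punctuation would need a different tokenizer. Regex.Replace with a match evaluator copies all non-word characters through unchanged, which stands in for the StringBuilder loop described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample data; in the question the names come from a database
// and the text comes from a file.
var names = new HashSet<string>(StringComparer.Ordinal) { "Alice", "Bob" };
string text = "Alice met Bob and Bobby; Alice waved.";

var matchedWords = new List<string>();

// \w+ pulls out one whole word at a time, so "Bobby" is never confused with "Bob".
// Each word costs one hash lookup instead of a scan over every name.
string replaced = Regex.Replace(text, @"\w+", m =>
{
    if (!names.Contains(m.Value))
    {
        return m.Value; // not a name: keep the original word
    }
    matchedWords.Add(m.Value);
    return "Names";
});

Console.WriteLine($"Count: {matchedWords.Count}");
Console.WriteLine($"Replaced Text: {replaced}");
```

For this sample the count is 3 and the replaced text is "Names met Names and Bobby; Names waved.".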

If I understand what you really want, I think you can use this code instead.
If you can ignore the resulting Matched Words and Count:
// Regex.Escape keeps names that contain regex metacharacters from breaking the pattern
text = names.Select(name => $@"\b{Regex.Escape(name)}\b")
    .Aggregate(text, (current, pattern) => Regex.Replace(current, pattern, "Names"));
Else:
var count = 0;
var matchedWord = new List<string>();
foreach (var name in names)
{
var regex = new Regex($@"\b{Regex.Escape(name)}\b");
if (regex.IsMatch(text))
{
count++;
matchedWord.Add(name);
}
text = regex.Replace(text, "Names");
}

Related

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "aabbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?
Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]
I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();
There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find the last character that matches the one preceding it but does not match the one following it.
2. It creates capture groups for every match, which are then fed into the Split method as delimiters. The capture groups are required by the negative lookahead, specifically the \1 identifier, which basically means "the value of the first capture group in the statement", so it cannot be omitted.
3. Regex.Split, given a capture group or multiple capture groups to match on when identifying the splitting delimiters, will include the delimiters used for every individual Split operation.
Number 3 is why your string array is looking weird, Split will split on the last a in the string, which becomes split[0]. This is followed by the delimiter at split[1], etc...
There is no way to override this behaviour on calling Split.
Either compensation as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
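The effect of capture groups on Regex.Split can be seen in isolation with a much simpler pattern. This is only an illustration with a made-up input string, not the regex from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// No capture group: only the pieces between the delimiters are returned.
string[] plain = Regex.Split("a-b-c", "-");

// Capture group: the captured delimiter text is interleaved with the pieces.
string[] withGroup = Regex.Split("a-b-c", "(-)");

Console.WriteLine(string.Join(",", plain));     // a,b,c
Console.WriteLine(string.Join(",", withGroup)); // a,-,b,-,c
```

In the question's pattern the group (.) sits inside the lookbehind, so the "delimiter" that gets interleaved is the last character of each run.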
To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();
You could do this easily with LINQ, but I don't think its runtime will be as good as regex.
A whole lot easier to read though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
This returns the values ['aaa', 'bb', 'cccc', 'd', 'eee'].
However this won't work if you have a string like "aabbaa", you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"]
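If repeated runs such as "aabbaa" matter, a plain loop that starts a new piece whenever the character changes avoids the GroupBy limitation. A minimal sketch, where SplitRuns is a hypothetical helper name that does not appear in any of the answers:

```csharp
using System;
using System.Collections.Generic;

// SplitRuns is a hypothetical helper: it cuts the input wherever the
// current character differs from the first character of the current run.
static List<string> SplitRuns(string input)
{
    var runs = new List<string>();
    int start = 0;
    for (int i = 1; i <= input.Length; i++)
    {
        // Close the run at the end of the string or when the character changes.
        if (i == input.Length || input[i] != input[start])
        {
            runs.Add(input.Substring(start, i - start));
            start = i;
        }
    }
    return runs;
}

Console.WriteLine(string.Join(",", SplitRuns("aabbaa"))); // aa,bb,aa
```

An empty input simply produces an empty list.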

Find only equal words in list exist in string

I have several lists, each containing about 2000-3000 words:
var list1 = new List<string> {"able", "adorable", "adventurous", ...};
Then, given string inputStr = "do, dream";, I want to check whether it contains any value from the list. I split it with string[] words = inputStr.Split(' '); and loop with foreach (string word in words), testing each word with if (list1.Any(word.Contains)).
I'm not sure whether it is because I use a list or because Contains is the wrong search method for this case, but the result includes words that are not equal to the words in the input string and merely share part of a word with them. For example, for the word "do" or the word "dream":
(do) adorable, doubt, fully, do, doh, freedom, down, double
(dream) dreamily, dream
I'm not sure how to avoid this; maybe a Dictionary or SortedDictionary would be better if the list is the problem. I get the same result if I check it this way: var val1 = list1.FirstOrDefault(stringToCheck => stringToCheck.Contains(word)); It seems that the different searches all give me the same results with a list, namely every word that shares part of a word with the input string, but the desired result is to find only equal words:
(do) do
(dream) dream
IndexOf() method will get you the index of any equivalent strings within the collection.
You could also do something like this with LINQ:
list.Any(x => x == "testString");
To find the words that contain your "word", you can use LINQ:
// (do) adorable, doubt, fully, do, doh, freedom, down, double
var result = list1.Where(word => word.Contains("do"));
But if you're trying to get the words that match fully:
var result = list1.Where(word => word.Equals("do"));
Combining this with your input words:
var result = input.SelectMany(x => list1.Where(w => w.Equals(x)));
You can get it done with a single Linq line:
List<string> list1 = new List<string> { "able", "adorable", "adventurous" };
string inputstr = "the adorable adventurous cat";
var found_words = inputstr.Split(' ').Where(word => list1.Contains(word)).ToList();
// found_words[0] = "adorable"
// found_words[1] = "adventurous"
if (list1.Contains(word))
Will only match whole exact strings in list.
But in that case, you should make list1 a HashSet instead, that will have much better performance.
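A minimal sketch of the HashSet suggestion, using a shortened, made-up word list. Splitting on both spaces and commas is an assumption added here so that "do," from the question's input becomes "do"; the case-insensitive comparer is also an assumption and can be dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Shortened sample list; the question's real lists hold 2000-3000 words.
var list1 = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "able", "adorable", "do", "doubt", "dream", "dreamily", "freedom"
};
string inputStr = "do, dream";

// Split on spaces and commas so punctuation does not stick to the words.
char[] separators = { ' ', ',' };
List<string> found = inputStr
    .Split(separators, StringSplitOptions.RemoveEmptyEntries)
    .Where(word => list1.Contains(word)) // whole-word lookup, no substring matching
    .ToList();

Console.WriteLine(string.Join(", ", found)); // do, dream
```

Each lookup is a hash probe, so "adorable" or "freedom" can never be returned for "do".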
Linq is still your best bet. Assuming you want case sensitivity but don't want to observe hanging whitespace:
public string Foo(string input, List<string> list)
{
return list.FirstOrDefault(t => t.Trim() == input.Trim());
}
I personally prefer to compare strings with == rather than Equals most of the time, though for string comparisons you may want to specify the culture as necessary.

Search file with Regular expression

I have following recursive Search-Function:
public List<FileInfo> Search_Files(String strDir, String line)
{
List<FileInfo> files = new List<FileInfo>();
try
{
foreach (String strFile in Directory.GetFiles(strDir,line+r))
{
files.Add(new FileInfo(strFile));
}
foreach (String strSubDir in Directory.GetDirectories(strDir))
{
List<FileInfo> sublist = Search_Files(strSubDir, line);
foreach (FileInfo file_infow in sublist)
{
files.Add(file_infow);
}
}
}
catch (Exception)
{
...
}
return (files);
}
The line variable's value looks like "1234".
Now I wanted to search for files like: 1234c.something or 1234.something
I created following Regex:
Regex r = new Regex("[a-z].* | .*");
I added it to line string, but it doesn't work. Why does this not work and how can I correct this?
I used LINQ; give it a try:
string[] allFiles = Directory.GetFiles(@"C:\Users\UserName\Desktop\Files");
List<string> neededFiles = (from c in allFiles
where Path.GetFileName(c).StartsWith("fileStartName")
select c).ToList<string>();
foreach (var file in neededFiles)
{
// do the task you want with the matching files
}
The GetDirectories and GetFiles methods accept a searchPattern that is not a regex.
The search string to match against the names of files in path. This parameter can contain a combination of valid literal path and wildcard (* and ?) characters (see Remarks), but doesn't support regular expressions.
You can filter the results with the following regex:
var r = new Regex(@"\d{4}.*");
// var r = new Regex(@"^\d{4}.*"); // Use this if file names should start with the 4 digits.
// files is a List<FileInfo>, so convert each matching path before adding it
files.AddRange(Directory.GetFiles(strDir)
    .Where(p => r.IsMatch(Path.GetFileName(p)))
    .Select(p => new FileInfo(p)));
The \d{4}.* regex matches 4 digits (\d{4}) and any 0 or more characters but a newline.
If you want to match the '.' you have to escape it as '\.'. '.*' by itself means any character n-times. Have a look here for the specifics about formats: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
I would also suggest that you use a more strict regular expression. If you know that your file name starts by 1234, use it in the regular expression as well.
There are two ways to do this. The first is to use a windows search filter. This is what you can pass directly to the GetFiles() method. (EnumerateFiles() does the same thing, and might be faster in this case, but that's irrelevant to your question).
A windows search pattern uses * to represent 'any number of any character' and ? is used to represent a single unknown character. These are not actually regular expressions.
You can then perform a search like this:
return Directory.EnumerateFiles(strDir, line + "*.*", SearchOption.AllDirectories)
.Select(f => new FileInfo(f))
.ToList();
The second is what you were originally looking for and that's performing a linq query with actual regular expressions. That can be done like this:
Regex pattern = new Regex(line + @".*\..*");
// regex says use line, then anything any number of times,
// and then a dot and then any chars any amount of times
return Directory.EnumerateFiles(strDir, "*.*", SearchOption.AllDirectories)
    // match against the file name only, not the whole path
    .Where(f => pattern.IsMatch(Path.GetFileName(f)))
    .Select(f => new FileInfo(f))
    .ToList();
Note: The above two examples show how to also convert the provided strings to FileInfo objects like the signature of your Search_Files method requires in the "linq-way." Also I'm using the SearchOption.AllDirectories flag that performs the recursive search for you, without you needing to write your own.
As for why your originally posted method did not work; there are two issues with it.
You are attempting to concatenate a regex object with a string. This is not possible because you are looking to concat the regex pattern with the string. This should be done before (or inside of) the construction of the regex object as I showed in my example.
Assuming you did not attempt to concat a regex object with a string, the Regex pattern that you are using pretty much would match anything, always. This would not limit anything down.
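Putting both points together, a stricter version could look roughly like the sketch below. SearchFiles is a hypothetical rewrite of Search_Files, and the pattern assumes the file names are exactly the prefix, at most one lowercase letter, then a dot and an extension (1234c.something or 1234.something), which is inferred from the question's two examples.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// SearchFiles is a hypothetical rewrite of the question's Search_Files.
static List<FileInfo> SearchFiles(string dir, string line)
{
    // Build the pattern string first, then construct the Regex from it.
    // Regex.Escape guards against metacharacters in the prefix.
    var pattern = new Regex("^" + Regex.Escape(line) + @"[a-z]?\..+$");

    // The wildcard narrows the candidates; the regex applies the strict rule.
    return Directory.EnumerateFiles(dir, line + "*", SearchOption.AllDirectories)
        .Where(path => pattern.IsMatch(Path.GetFileName(path)))
        .Select(path => new FileInfo(path))
        .ToList();
}

// Example call against the current directory; the prefix is the one from the question.
foreach (FileInfo file in SearchFiles(Directory.GetCurrentDirectory(), "1234"))
{
    Console.WriteLine(file.FullName);
}
```

With this pattern a file such as 12345.something is rejected, because the fifth character is neither a lowercase letter nor the dot.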

How can I compare a string to a "filter" list in linq?

I'm trying to filter a collection of strings by a "filter" list... a list of bad words. If a string contains a word from the list, I don't want it.
I've gotten so far, the bad Word here is "frakk":
string[] filter = { "bad", "words", "frakk" };
string[] foo =
{
"this is a lol string that is allowed",
"this is another lol frakk string that is not allowed!"
};
var items = from item in foo
where (item.IndexOf( (from f in filter select f).ToString() ) == 0)
select item;
But this aint working, why?
You can use Any + Contains:
var items = foo.Where(s => !filter.Any(w => s.Contains(w)));
if you want to compare case-insensitively:
var items = foo.Where(s => !filter.Any(w => s.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0));
Update: If you want to exclude sentences where at least one word is in the filter list, you can use String.Split() and Enumerable.Intersect:
var items = foo.Where(sentence => !sentence.Split().Intersect(filter).Any());
Enumerable.Intersect is very efficient since it uses a set under the hood. It's more efficient to put the long sequence first. Because of LINQ's deferred execution, it stops at the first matching word.
( note that the "empty" Split includes other white-space characters like tab or newline )
The first problem you need to solve is breaking up the sentence into a series of words. The simplest way to do this is based on spaces
string[] words = sentence.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries);
From there you can use a simple LINQ expression to find the profanities
var badWords = words.Where(x => filter.Contains(x));
However this is a bit of a primitive solution. It won't handle a number of complex cases that you likely need to think about
There are many characters which qualify as a space. My solution only uses ' '
The split doesn't handle punctuations. So dog! won't be viewed as dog. Probably much better to break up words on legal characters
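One way to cover both caveats is to split on any run of non-word characters instead of on ' ' alone. This is a sketch under assumptions: \W+ is taken as the word boundary rule, the case-insensitive comparer is an addition, and the second sample sentence is altered slightly ("Frakk!") to show punctuation and casing being handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var filter = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "bad", "words", "frakk" };
string[] foo =
{
    "this is a lol string that is allowed",
    "this is another lol Frakk! string that is not allowed!"
};

// \W+ splits on spaces, tabs, newlines and punctuation alike, so "Frakk!" yields "Frakk".
// Any empty pieces produced at the ends are harmless: they are never in the filter.
List<string> items = foo
    .Where(sentence => !Regex.Split(sentence, @"\W+").Any(word => filter.Contains(word)))
    .ToList();

foreach (string item in items)
{
    Console.WriteLine(item);
}
```

Only the first sentence survives the filter here.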
The reason your initial attempt didn't work is that this line:
(from f in filter select f).ToString()
evaluates to a string of the Array Iterator type name that's implied by the linq expression portion. So you're actually comparing the characters of the following string:
System.Linq.Enumerable+WhereSelectArrayIterator`2[System.String,System.String]
rather than the words of the filter when examining your phrases.

C# search into a string for a specific pattern, and put in an Array

I'm having the following string as an example:
<tr class="row_odd"><td>08:00</td><td>08:10</td><td>TEST1</td></tr><tr class="row_even"><td>08:10</td><td>08:15</td><td>TEST2</td></tr><tr class="row_odd"><td>08:15</td><td>08:20</td><td>TEST3</td></tr><tr class="row_even"><td>08:20</td><td>08:25</td><td>TEST4</td></tr><tr class="row_odd"><td>08:25</td><td>08:30</td><td>TEST5</td></tr>
I need to have the output as a one-dimensional array.
Like 11111=myArray(0) , 22222=myArray(1) , 33333=myArray(2) ,......
I have already tried the myString.replace, but it seems I can only replace a single Char that way. So I need to use expressions and a for loop for filling the array, but since this is my first c# project, that is a bridge too far for me.
Thanks,
It seems like you want to use a Regex search pattern. Then return the matches (using a named group) into an array.
var regex = new Regex(@"act=\?(?<Id>\d+)");
regex.Matches(input).Cast<Match>()
.Select(m => m.Groups["Id"])
.Where(g => g.Success)
.Select(g => Int32.Parse(g.Value))
.ToArray();
(PS. I'm not positive about the regex pattern - you should check into it yourself).
Several ways you could do this. A couple are:
a) Use a regular expression to look for what you want in the string. Used a named group so you can access the matches directly
http://www.regular-expressions.info/dotnet.html
b) Split the expression at the location where your substrings are (e.g. split on "act="). You'll have to do a bit more parsing to get what you want, but that won't be too difficult since it will be at the beginning of the split string (and you'll have other strings that don't have your substring in them)
Use a combination of IndexOf and Substring... something like this would work (not sure how much your string varies). This will probably be quicker than any Regex you come up with. Although, looking at the length of your string, it might not really be an issue.
public static List<string> GetList(string data)
{
data = data.Replace("\"", ""); // get rid of annoying "'s
string[] S = data.Split(new string[] { "act=" }, StringSplitOptions.None);
var results = new List<string>();
foreach (string s in S)
{
if (!s.Contains("<tr"))
{
string output = s.Substring(0, s.IndexOf(">"));
results.Add(output);
}
}
return results;
}
Split your string on HTML tags like "<tr>","</tr>","<td>","</td>", "<a>","</a>" with the string's Split() method. This gives you an array of the pieces.
Split html row into string array
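For the sample string as shown in the question, a regex that captures the text between <td> and </td> is another option. This is a minimal sketch that assumes the cells never contain nested tags and uses a shortened copy of the sample; for arbitrary HTML a real parser is the safer choice.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened copy of the sample string from the question.
string html = "<tr class=\"row_odd\"><td>08:00</td><td>08:10</td><td>TEST1</td></tr>" +
              "<tr class=\"row_even\"><td>08:10</td><td>08:15</td><td>TEST2</td></tr>";

// (.*?) is lazy, so each match stops at the nearest closing </td>.
string[] cells = Regex.Matches(html, "<td>(.*?)</td>")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

Console.WriteLine(string.Join(" | ", cells)); // 08:00 | 08:10 | TEST1 | 08:10 | 08:15 | TEST2
```

The result is a one-dimensional array, so cells[0] is "08:00", cells[1] is "08:10", and so on.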
