Filter string array with substring - c#

So I have these values in an Array:
1_642-name.xml
1_642-name2.xml
1_678-name.xml
1_678-name2.xml
I only ever want the values with the highest number to be in my array, but I cannot figure out how to do it.
The string consists of these factors:
1 is a static number - And will always only be 1
642 or numbers between _ and - is an identity and can always get larger
name.xml is always the same
I want to filter by the largest identity (678) in this case.
I've tried something like this without luck:
string[] filter = lines.FindAll(lines, x => x.Substring(3, 3));
Result:
1_678-name.xml
1_678-name2.xml

Because the number of characters in your format can vary easily, this is a great job for Regular Expressions. For example:
var input = "1_642-name2.xml";
var pattern = @"^\d+_(\d+)-.+$";
var match = Regex.Match(input, pattern);
var id = match.Groups[1].Value; // "642" (as a string)
An explanation of the regex string can be found here.
We can use that to extract various parts of each element of your array.
The first thing to do is find the max value, which, if we have this format:
#_###-wordswords
Then we want the number between the _ and the -.
var list = new string[]
{
"1_642-name.xml",
"1_642-name2.xml",
"1_678-name.xml",
"1_678-name2.xml"
};
var pattern = new Regex(@"^\d+_(\d+)-.+$");
var maxValue = list.Max(x => int.Parse(pattern.Match(x).Groups[1].Value));
This finds "678" as the max value. Now we just need to filter the list to only show entries that have "678" in that format slot.
var matchingEntries = list
.Where(x => pattern.Match(x).Groups[1].Value == maxValue.ToString());
foreach (var entry in matchingEntries)
{
Console.WriteLine(entry);
}
The Where filters the list with your max value.
There are a good number of inefficiencies with this code. I'm regex parsing each value twice, and calculating the string equivalent of maxValue on each element. I'll leave fixing those as an exercise to the reader.
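One way to address those inefficiencies is sketched below: each entry is parsed once, the number is kept next to the entry, and the maximum is compared as an int rather than as a string. The class and method names (HighestIdFilter, KeepHighest) are made up for this illustration, and entries that do not match the pattern are assumed to be safe to skip.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class HighestIdFilter
{
    // Same pattern as above: digits, underscore, captured identity, dash, rest.
    private static readonly Regex Pattern = new Regex(@"^\d+_(\d+)-.+$");

    public static List<string> KeepHighest(IEnumerable<string> entries)
    {
        // Run the regex once per entry and keep the parsed number with it.
        var parsed = entries
            .Select(e => new { Entry = e, Match = Pattern.Match(e) })
            .Where(x => x.Match.Success) // assumption: non-matching entries are skipped
            .Select(x => new { x.Entry, Id = int.Parse(x.Match.Groups[1].Value) })
            .ToList();

        if (parsed.Count == 0) return new List<string>();

        // Computed once and compared as an int, so no repeated ToString() calls.
        int maxId = parsed.Max(x => x.Id);
        return parsed.Where(x => x.Id == maxId).Select(x => x.Entry).ToList();
    }
}
```

Calling HighestIdFilter.KeepHighest(list) with the four sample entries should leave only the two 1_678 entries, in their original order.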

Just to provide an alternative to regular expressions, you can also simply parse each line, examine the number, and if it's the largest we've found so far, add the line to a list. Reset the list any time a larger number is found, and then return the list at the end.
A bonus is that we only loop through the list once instead of twice:
public static List<string> GetHighestNumberedLines(List<string> input)
{
if (input == null || !input.Any()) return input;
var result = new List<string>();
var highNum = int.MinValue;
foreach (var line in input)
{
var parts = line.Split('_', '-');
int number;
// Making sure we actually have a number where we expect it
if (parts.Length > 1 && int.TryParse(parts[1], out number))
{
// If this is the highest number we've found, update
// our variable and reset the list to contain this line
if (number > highNum)
{
highNum = number;
result = new List<string> {line};
}
// If this matches our high number, add this line to our list
else if (number == highNum)
{
result.Add(line);
}
}
}
return result;
}

Related

Compare 2 big string lists

I have two lists of strings. This is not mandatory; I can convert them to any collection (list, dictionary, etc.).
First is "text":
Birds sings
Dogs barks
Frogs jumps
Second is "words":
sing
dog
cat
I need to iterate through "text" and, if a line contains any of "words", do one thing, and if not, another thing.
Important: yes, in my case I need to find partial matches ignoring case, so the text "Dogs" is a match for the word "dog". This is why I use .Contains and .ToLower().
My naive try looks like this:
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
bool found = false;
foreach (string word in words)
{
if (line.ToLower().Contains(word.ToLower()))
{
;// one thing
found = true;
break;
}
}
if (!found)
;// another
}
The problem is the size: 8000 entries in the first list and ~50000 in the second. This takes too much time.
How can I make it faster?

I'm assuming that you only want to match on the specific words in your text list: that is, if text contains "dogs", and words contains "dog", then that shouldn't be a match.
Note that this is different to what your code currently does.
Given this, we can construct a HashSet<string> of all of the words in your text list. We can then query this very cheaply.
We'll also use StringComparer.OrdinalIgnoreCase to do our comparisons. This is a better way of doing a case-insensitive match than ToLower(), and ordinal comparisons are relatively cheap. If you're dealing with languages other than English, you'll need to consider whether you actually need StringComparer.CurrentCultureIgnoreCase or StringComparer.InvariantCultureIgnoreCase.
var textWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (var line in text)
{
var lineWords = line.Split(' ');
textWords.UnionWith(lineWords);
}
if (textWords.Overlaps(words))
{
// One thing
}
else
{
// Another
}
If this is not the case, and you do want to do a .Contains on each, then you can speed it up a bit by avoiding the calls to .ToLower(). Each call to .ToLower() creates a new string in memory, so you're creating two new, useless objects per comparison.
Instead, use:
if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
As above, you might have to use StringComparison.CurrentCultureIgnoreCase or StringComparison.InvariantCultureIgnoreCase depending on the language of your strings. However, you should see a significant speedup if your strings are entirely ASCII and you use OrdinalIgnoreCase as this makes the string search a lot quicker.
If you're using .NET Framework, another thing to try is moving to .NET Core. .NET Core introduced a lot of optimizations in this area, and you might find that it's quicker.
Another thing you can do is see if you have duplicates in either text or words. If you have a lot, you might be able to save a lot of time by removing them first. Consider using a HashSet<string> for this, or LINQ's .Distinct() (you'll need to measure which is quicker).
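The last two suggestions (IndexOf with OrdinalIgnoreCase, plus removing duplicate words) can be combined roughly as follows. This is only a sketch: SubstringMatcher and ContainsAnyWord are invented names, and it still does a partial-match scan per line, so it reduces the constant cost rather than changing the overall complexity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class SubstringMatcher
{
    // For each line, reports whether it contains any of the words
    // (partial match, ignoring case).
    public static List<bool> ContainsAnyWord(IEnumerable<string> text, IEnumerable<string> words)
    {
        // Drop empty and duplicate words once, before the main loop.
        var distinctWords = words
            .Where(w => !string.IsNullOrEmpty(w))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();

        // IndexOf with OrdinalIgnoreCase avoids allocating lower-cased copies.
        return text
            .Select(line => distinctWords.Any(
                w => line.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

With the sample data from the question, "Birds sings" and "Dogs barks" would be reported as matches and "Frogs jumps" would not.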

You can try using LINQ for the second looping construct.
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
bool found = words.FirstOrDefault(w=>line.ToLower().Contains(w.ToLower()))!=null;
if (found)
{
//Do something
}
else
{
//Another
}
}
This is more concise, but note that it still performs the same line-by-word comparisons as the nested loops, so it is unlikely to be noticeably faster.

You can improve the search algorithm.
public static int Search(string word, List<string> stringList)
{
string wordCopy = word.ToLower();
List<string> stringListCopy = new List<string>();
stringList.ForEach(s => stringListCopy.Add(s.ToLower()));
stringListCopy.Sort();
int position = -1;
int count = stringListCopy.Count;
if (count > 0)
{
int min = 0;
int max = count - 1;
int middle = (max - min) / 2;
int comparisonStatus = 0;
do
{
comparisonStatus = string.Compare(wordCopy, stringListCopy[middle]);
if (comparisonStatus == 0)
{
position = middle;
break;
}
else if (comparisonStatus < 0)
{
max = middle - 1;
}
else
{
min = middle + 1;
}
middle = min + (max - min) / 2;
} while (min <= max); // <= so the last remaining candidate is also checked
}
return position;
}
Inside this method we create a copy of the string list in which all elements are lower case.
After that we sort the copied list in ascending order. This is crucial because the entire algorithm relies on the list being sorted.
If the word exists in the list, the Search method returns its position in the sorted copy (not in the original list); otherwise it returns -1.
How does the algorithm work?
Instead of checking every element in the list, we halve the search range in every iteration.
In every iteration we take the element in the middle and compare two strings (the element and our word). If our string is the same as the one in the middle, the search is finished. If our string sorts before the string in the middle, it must be in the first half of the range, because the list is sorted in ascending order. If our string sorts after the string in the middle, it must be in the second half of the range, for the same reason. Then we take the appropriate half and repeat the process.
In the first iteration the range is the entire list.
I've tested the Search method using these data:
List<string> stringList = new List<string>();
stringList.Add("Serbia");
stringList.Add("Greece");
stringList.Add("Egypt");
stringList.Add("Peru");
stringList.Add("Palau");
stringList.Add("Slovakia");
stringList.Add("Kyrgyzstan");
stringList.Add("Mongolia");
stringList.Add("Chad");
Search("Serbia", stringList);
This way you will search the entire list of ~50,000 elements in 16 iterations at most.
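For comparison, the same idea can be sketched with the framework's own List<T>.BinarySearch, sorting once up front instead of on every call. The names SortedWordIndex, Build and Find are invented for this example. Like the method above, it finds exact (case-insensitive) matches only, so it does not cover the partial matches the question asks for.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class SortedWordIndex
{
    // Sort once; reuse the sorted list for every lookup.
    public static List<string> Build(IEnumerable<string> words)
    {
        var sorted = new List<string>(words);
        sorted.Sort(StringComparer.OrdinalIgnoreCase);
        return sorted;
    }

    // Returns the position in the sorted list, or -1 if the word is absent.
    // The comparer must be the same one that was used for sorting.
    public static int Find(List<string> sorted, string word)
    {
        int index = sorted.BinarySearch(word, StringComparer.OrdinalIgnoreCase);
        return index >= 0 ? index : -1;
    }
}
```

Building the index costs one sort; each lookup afterwards takes a logarithmic number of comparisons.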

How to treat integers from a string as multi-digit numbers and not individual digits?

My input is a string of integers, which I have to check whether they are even and display them on the console, if they are. The problem is that what I wrote checks only the individual digits and not the numbers.
string even = "";
while (true)
{
string inputData = Console.ReadLine();
if (inputData.Equals("x", StringComparison.OrdinalIgnoreCase))
{
break;
}
for (int i = 0; i < inputData.Length; i++)
{
if (inputData[i] % 2 == 0)
{
even +=inputData[i];
}
}
}
foreach (var e in even)
Console.WriteLine(e);
bool something = string.IsNullOrEmpty(even);
if( something == true)
{
Console.WriteLine("N/A");
}
For example, if the input is:
12
34
56
my output is going to be
2
4
6 (every number needs to be displayed on a new line).
What am I doing wrong? Any help is appreciated.

Use string.Split to get the independent sections and then int.TryParse to check if it is a number (check Parse v. TryParse). Then take only even numbers:
var evenNumbers = new List<int>();
foreach(var s in inputData.Split(" "))
{
if(int.TryParse(s, out var num) && num % 2 == 0)
evenNumbers.Add(num); // If can't use collections: Console.WriteLine(num);
}
(notice the use of out vars introduced in C# 7.0)
If you can use LINQ then, similar to this answer:
var evenNumbers = inputData.Split(" ")
.Select(s => (int.TryParse(s, out var value), value))
.Where(pair => pair.Item1 && pair.value % 2 == 0) // parsed successfully and even
.Select(pair => pair.value);

I think you do too many things here at once. Instead of already checking if the number is even, it is better to solve one problem at a time.
First we can make substrings by splitting the string into "words". Next we convert every substring to an int, and finally we filter on even numbers, like:
var words = inputData.Split(' '); // split the words by a space
var intwords = words.Select(int.Parse); // convert these to ints
var evenwords = intwords.Where(x => x % 2 == 0); // check if these are even
foreach(var even in evenwords) { // print the even numbers
Console.WriteLine(even);
}
Here it can still happen that some "words" are not integers, for example "12 foo 34". So you will need to implement some extra filtering between splitting and converting.
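A minimal sketch of that extra filtering step, assuming that non-numeric words should simply be ignored. EvenNumberParser and ParseEven are names made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class EvenNumberParser
{
    public static List<int> ParseEven(string inputData)
    {
        var result = new List<int>();

        // RemoveEmptyEntries guards against repeated spaces in the input.
        var words = inputData.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // TryParse returns false for words such as "foo" instead of throwing.
            if (int.TryParse(word, out int number) && number % 2 == 0)
                result.Add(number);
        }
        return result;
    }
}
```

For the input "12 foo 34 7" this would keep 12 and 34 and skip both "foo" and the odd 7.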

Int.Parse(String.Split()) returns "Input string was not in a correct format" error

I am trying to perform a LINQ query on an array to filter out results based on a user's query. I am having a problem parsing two int's from a single string.
In my database, TimeLevels are stored as strings in the format [mintime]-[maxtime] minutes, for example 0-5 Minutes. My users have a slider with which they can select a min and max time range, and this is stored as an int array with two values. I'm trying to compare the [mintime] with the first value, and the [maxtime] with the second, to find database entries which fit the user's time range.
Here is my C# code from the controller which is supposed to perform that filtering:
RefinedResults = InitialResults.Where(
x => int.Parse(x.TimeLevel.Split('-')[0]) >= data.TimeRange[0] &&
int.Parse(x.TimeLevel.Split('-')[1]) <= data.TimeRange[1]).ToArray();
My thinking was that it would firstly split the 0-5 Minutes string at the - resulting in two strings, 0 and 5 Minutes, then parse the ints from those, resulting in just 0 and 5.
But as soon as it gets to Int.Parse, it throws the error in the title.

Some of the x.TimeLevel database records are stored as "30-40+ Minutes". Is there any method just to extract the int?
You could use regular expressions to match the integer parts of the string for you, like this:
RefinedResults = InitialResults
.Where(x => {
var m = Regex.Match(x.TimeLevel, @"^(\d+)-(\d+)");
return m.Success
&& int.Parse(m.Groups[1].Value) >= data.TimeRange[0]
&& int.Parse(m.Groups[2].Value) <= data.TimeRange[1];
}).ToArray();
This approach requires the string to start with a pair of dash-separated decimal numbers. It ignores anything after the second number, ensuring that only sequences of digits are passed to int.Parse.

The reason your code doesn't work is that "0-5 Minutes".Split('-') returns [0] = "0" and [1] = "5 Minutes", and the latter is not parseable as an int.
You can use the regular expression "\d+" to split up groups of digits and ignore non-digits. This should work:
var refinedResults =
(
from result in InitialResults
let numbers = Regex.Matches(result.TimeLevel, @"\d+")
where ((int.Parse(numbers[0].Value) >= data.TimeRange[0]) && (int.Parse(numbers[1].Value) <= data.TimeRange[1]))
select result
).ToArray();
Here's a complete compilable console app which demonstrates it working. I've used dummy classes to represent your actual classes.
using System;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication2
{
public class SampleTime
{
public SampleTime(string timeLevel)
{
TimeLevel = timeLevel;
}
public readonly string TimeLevel;
}
public class Data
{
public int[] TimeRange = new int[2];
}
class Program
{
private static void Main(string[] args)
{
var initialResults = new []
{
new SampleTime("0-5 Minutes"),
new SampleTime("4-5 Minutes"), // Should be selected below.
new SampleTime("1-8 Minutes"),
new SampleTime("4-6 Minutes"), // Should be selected below.
new SampleTime("4-7 Minutes"),
new SampleTime("5-6 Minutes"), // Should be selected below.
new SampleTime("20-30 Minutes")
};
// Find all ranges between 4 and 6 inclusive.
Data data = new Data();
data.TimeRange[0] = 4;
data.TimeRange[1] = 6;
// The output of this should be (as commented in the array initialisation above):
//
// 4-5 Minutes
// 4-6 Minutes
// 5-6 Minutes
// Here's the significant code:
var refinedResults =
(
from result in initialResults
let numbers = Regex.Matches(result.TimeLevel, @"\d+")
where ((int.Parse(numbers[0].Value) >= data.TimeRange[0]) && (int.Parse(numbers[1].Value) <= data.TimeRange[1]))
select result
).ToArray();
foreach (var result in refinedResults)
{
Console.WriteLine(result.TimeLevel);
}
}
}
}

The error happens because of the " Minutes" part of the string.
You can truncate the " Minutes" part before splitting, like this:
x.TimeLevel.Remove(x.TimeLevel.IndexOf(" "))
Then you can split.

The problem is that you are splitting by - and not also by space which is the separator of the minutes part. So you could use Split(' ', '-') instead:
InitialResults
.Where(x => int.Parse(x.TimeLevel.Split('-')[0]) >= data.TimeRange[0]
&& int.Parse(x.TimeLevel.Split(' ','-')[1]) <= data.TimeRange[1])
.ToArray();
As an aside, don't store three pieces of information in one column in the database. That's just a source of nasty errors and bad performance. It also makes it more difficult to filter in the database, which should be the preferred way, and to maintain database consistency.
Regarding your comment that the format can be 0-40+ Minutes. Then you could use...
InitialResults
.Select(x => new {
TimeLevel = x.TimeLevel,
MinMaxPart = x.TimeLevel.Split(' ')[0]
})
.Select(x => new {
TimeLevel = x.TimeLevel,
Min = int.Parse(x.MinMaxPart.Split('-')[0].Trim('+')),
Max = int.Parse(x.MinMaxPart.Split('-')[1].Trim('+'))
})
.Where(x => x.Min >= data.TimeRange[0] && x.Max <= data.TimeRange[1])
.Select(x => x.TimeLevel)
.ToArray();

What's the best way to split a list of strings to match first and last letters?

I have a long list of words in C#, and I want to find all the words within that list that have the same first and last letters and that have a length of between, say, 5 and 7 characters. For example, the list might have:
"wasted was washed washing was washes watched watches wilts with wastes wits washings"
It would return
Length: 5-7, First letter: w, Last letter: d, "wasted, washed, watched"
Length: 5-7, First letter: w, Last letter: s, "washes, watches, wilts, wastes"
Then I might change the specification for a length of 3-4 characters which would return
Length: 3-4, First letter: w, Last letter: s, "was, wits"
I found this method of splitting which is really fast, made each item unique, used the length and gave an excellent start:
Spliting string into words length-based lists c#
Is there a way to modify/use that to take account of first and last letters?
EDIT
I originally asked about the 'fastest' way because I usually solve problems like this with lots of string arrays (which are slow and involve a lot of code). LINQ and lookups are new to me, but I can see that the ILookup used in the solution I linked to is amazing in its simplicity and is very fast. I don't actually need the minimum processor time. Any approach that avoids me creating separate arrays for this information would be fantastic.

This one-liner will give you groups with the same first/last letter in your length range:
int min = 5;
int max = 7;
var results = str.Split()
.Where(s => s.Length >= min && s.Length <= max)
.GroupBy(s => new { First = s.First(), Last = s.Last()});

var minLength = 5;
var maxLength = 7;
var firstPart = "w";
var lastPart = "d";
var words = new List<string> { "washed", "wash" }; // so on
var matches = words.Where(w => w.Length >= minLength && w.Length <= maxLength &&
w.StartsWith(firstPart) && w.EndsWith(lastPart))
.ToList();
For the most part, this should be fast enough, unless you're dealing with tens of thousands of words and worrying about milliseconds. Then we can look further.

Just in LINQPad I created this:
void Main()
{
var words = new []{"wasted", "was", "washed", "washing", "was", "washes", "watched", "watches", "wilts", "with", "wastes", "wits", "washings"};
var firstLetter = "w";
var lastLetter = "d";
var minimumLength = 5;
var maximumLength = 7;
var sortedWords = words.Where(w => w.StartsWith(firstLetter) && w.EndsWith(lastLetter) && w.Length >= minimumLength && w.Length <= maximumLength);
sortedWords.Dump();
}
If that isn't fast enough, I would create a lookup table:
Dictionary<char, Dictionary<char, List<string>>> lookupTable;
and do:
lookupTable[firstLetter][lastLetter].Where(<check length>)
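One possible way to build and query such a lookup table is sketched below. The names WordLookup, Build and Find are invented for this example, and removing duplicate words is an assumption based on the expected output in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class WordLookup
{
    // Outer key: first letter. Inner key: last letter. Value: the words.
    public static Dictionary<char, Dictionary<char, List<string>>> Build(IEnumerable<string> words)
    {
        var table = new Dictionary<char, Dictionary<char, List<string>>>();
        foreach (var word in words.Where(w => !string.IsNullOrEmpty(w)).Distinct())
        {
            char first = word[0];
            char last = word[word.Length - 1];

            if (!table.TryGetValue(first, out var byLast))
                table[first] = byLast = new Dictionary<char, List<string>>();
            if (!byLast.TryGetValue(last, out var list))
                byLast[last] = list = new List<string>();
            list.Add(word);
        }
        return table;
    }

    // Two dictionary lookups, then a length check on the small remaining list.
    public static List<string> Find(
        Dictionary<char, Dictionary<char, List<string>>> table,
        char first, char last, int minLength, int maxLength)
    {
        if (table.TryGetValue(first, out var byLast) && byLast.TryGetValue(last, out var list))
            return list.Where(w => w.Length >= minLength && w.Length <= maxLength).ToList();
        return new List<string>();
    }
}
```

The table is built once; changing the length range afterwards only requires another call to Find.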

Here's a method that does exactly what you want. You are only given a list of strings and the min/max length, correct? You aren't given the first and last letters to filter on. This method processes all the first/last letters in the strings.
private static void ProcessInput(string[] words, int minLength, int maxLength)
{
var groups = from word in words
where word.Length > 0 && word.Length >= minLength && word.Length <= maxLength
let key = new Tuple<char, char>(word.First(), word.Last())
group word by key into @group
orderby Char.ToLowerInvariant(@group.Key.Item1), @group.Key.Item1, Char.ToLowerInvariant(@group.Key.Item2), @group.Key.Item2
select @group;
Console.WriteLine("Length: {0}-{1}", minLength, maxLength);
foreach (var group in groups)
{
Console.WriteLine("First letter: {0}, Last letter: {1}", group.Key.Item1, group.Key.Item2);
foreach (var word in group)
Console.WriteLine("\t{0}", word);
}
}

Just as a quick thought, I have no clue if this would be faster or more efficient than the linq solutions posted, but this could also be done fairly easily with regular expressions.
For example, if you wanted to get 5-7 letter length words that begin with "w" and end with "s", you could use a pattern along the lines of:
\bw[A-Za-z]{3,5}s\b
(This could fairly easily be made more variable-driven. For example, have variables for the first letter, min length, max length and last letter, and plug them into the pattern to replace w, 3, 5 and s.)
Then, using the Regex class, you could just take your matches to be your list.
Again, I don't know how this compares efficiency-wise to linq, but I thought it might deserve mention.
Hope this helps!!
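A variable-driven version of that pattern might look like the sketch below. WordPatternFinder and Find are invented names, the letters-only character class is kept from the pattern above, and the sketch assumes minLength is at least 2 and maxLength is not smaller than minLength.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class WordPatternFinder
{
    public static List<string> Find(string text, char first, char last, int minLength, int maxLength)
    {
        // The first and last letters are fixed, so the middle part is 2 shorter.
        int minMiddle = Math.Max(0, minLength - 2);
        int maxMiddle = maxLength - 2;

        // Regex.Escape keeps the pattern valid if a letter is a regex metacharacter.
        string pattern = @"\b" + Regex.Escape(first.ToString())
            + "[A-Za-z]{" + minMiddle + "," + maxMiddle + "}"
            + Regex.Escape(last.ToString()) + @"\b";

        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct() // the question wants each word only once
            .ToList();
    }
}
```

For the sample sentence, Find(text, 'w', 's', 5, 7) builds the same pattern as shown above, \bw[A-Za-z]{3,5}s\b.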

How to parse a numbered sequence from a List of filenames?

I would like to automatically parse a range of numbered sequences from an already sorted List<FileData> of filenames by checking which part of the filename changes.
Here is an example (file extension has already been removed):
First filename: IMG_0000
Last filename: IMG_1000
Numbered Range I need: 0000 and 1000
Except I need to deal with every possible type of file naming convention such as:
0000 ... 9999
20080312_0000 ... 20080312_9999
IMG_0000 - Copy ... IMG_9999 - Copy
8er_green3_00001 .. 8er_green3_09999
etc.
I would like the entire 0-padded range e.g. 0001 not just 1
The sequence number is 0-padded e.g. 0001
The sequence number can be located anywhere e.g. IMG_0000 - Copy
The range can start and end with anything i.e. doesn't have to start with 1 and end with 9999
Numbers may appear multiple times in the filename of the sequence e.g. 20080312_0000
Whenever I get something working for 8 random test cases, the 9th test breaks everything and I end up re-starting from scratch.
I've currently been comparing only the first and last filenames (as opposed to iterating through all filenames):
void FindRange(List<FileData> files, out string startRange, out string endRange)
{
string firstFile = files.First().ShortName;
string lastFile = files.Last().ShortName;
...
}
Does anyone have any clever ideas? Perhaps something with Regex?

If you're guaranteed to know the files end with the number (eg. _\d+), and are sorted, just grab the first and last elements and that's your range. If the filenames are all the same, you can sort the list to get them in order numerically. Unless I'm missing something obvious here -- where's the problem?

Use a regex to parse out the last run of digits from each filename (the lazy prefix makes sure the whole run is captured, not just its final digit):
^.*?(\d+)\D*$
From these parsed strings, find the maximum length, and left-pad any that are less than the maximum length with zeros.
Sort these padded strings alphabetically. Take the first and last from this sorted list to give you your min and max numbers.
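The steps above could be sketched as follows. SequenceRange and Find are invented names, and the sketch assumes the sequence number is the last run of digits in each filename, which holds for the examples in the question but not for names such as "IMG_0000 - Copy (2)". It uses the pattern ^.*?(\d+)\D*$, whose lazy prefix lets the group capture the whole last run of digits.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SequenceRange
{
    // Lazy prefix, then the last run of digits, then only non-digits to the end.
    private static readonly Regex LastNumber = new Regex(@"^.*?(\d+)\D*$");

    public static (string Start, string End) Find(IEnumerable<string> fileNames)
    {
        var numbers = fileNames
            .Select(name => LastNumber.Match(name))
            .Where(m => m.Success)
            .Select(m => m.Groups[1].Value)
            .ToList();

        if (numbers.Count == 0)
            throw new ArgumentException("No file name contains a number.", nameof(fileNames));

        // Left-pad to a common width so that ordinal order equals numeric order.
        int width = numbers.Max(n => n.Length);
        var padded = numbers
            .Select(n => n.PadLeft(width, '0'))
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();

        return (padded[0], padded[padded.Count - 1]);
    }
}
```

Because the numbers stay strings throughout, the zero padding asked for in the question is preserved in the result.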

Firstly, I will assume that the numbers are always zero-padded so that they are the same length. If not then bigger headaches lie ahead.
Secondly, assume that the file names are exactly the same apart from the increment number component.
If these assumptions are true then the algorithm should be to look at each character in the first and last filenames to determine which same-positioned characters do not match.
var start = String.Empty;
var end = String.Empty;
for (var index = 0; index < firstFile.Length; index++)
{
char c = firstFile[index];
if (filenames.Any(filename => filename[index] != c))
{
start += firstFile[index];
end += lastFile[index];
}
}
// convert to int if required
edit: Changed to check every filename until a difference is found. Not as efficient as it could be but very simple and straightforward.

Here is my solution. It works with all of the examples that you have provided and it assumes the input array to be sorted.
Note that it doesn't look exclusively for numbers; it looks for a consistent sequence of characters that might differ across all of the strings. So if you provide it with {"0000", "0001", "0002"} it will hand back "0" and "2" as the start and end strings, since that's the only part of the strings that differ. If you give it {"0000", "0010", "0100"}, it will give you back "00" and "10".
But if you give it {"0000", "0101"}, it will whine since the differing parts of the string are not contiguous. If you would like this behavior modified so it will return everything from the first differing character to the last, that's fine; I can make that change. But if you are feeding it a ton of filenames that will have sequential changes to the number region, this should not be a problem.
public static class RangeFinder
{
public static void FindRange(IEnumerable<string> strings,
out string startRange, out string endRange)
{
using (var e = strings.GetEnumerator()) {
if (!e.MoveNext())
throw new ArgumentException("No elements.", "strings");
if (e.Current == null)
throw new ArgumentException(
"Null element encountered at index 0.", "strings");
var template = e.Current;
// If an element in here is true, it means that index differs.
var matchMatrix = new bool[template.Length];
int index = 1;
string last = null;
while (e.MoveNext()) {
if (e.Current == null)
throw new ArgumentException(
"Null element encountered at index " + index + ".", "strings");
last = e.Current;
if (last.Length != template.Length)
throw new ArgumentException(
"Element at index " + index + " has incorrect length.", "strings");
for (int i = 0; i < template.Length; i++)
if (last[i] != template[i])
matchMatrix[i] = true;
index++; // keep the index used in error messages in step with the enumerator
}
// Verify the matrix:
// * There must be at least one true value.
// * All true values must be consecutive.
int start = -1;
int end = -1;
for (int i = 0; i < matchMatrix.Length; i++) {
if (matchMatrix[i]) {
if (end != -1)
throw new ArgumentException(
"Inconsistent match matrix; no usable pattern discovered.", "strings");
if (start == -1)
start = i;
} else {
if (start != -1 && end == -1)
end = i;
}
}
if (start == -1)
throw new ArgumentException(
"Strings did not vary; no usable pattern discovered.", "strings");
if (end == -1)
end = matchMatrix.Length;
startRange = template.Substring(start, end - start);
endRange = last.Substring(start, end - start);
}
}
}
