Complex string compare logic

I need help with some complex (for me anyway, as I am not too experienced) string comparison logic. Basically, I want to validate a string to make sure it matches a format rule. I am using C#, targeting .NET 4.5.2.
I am trying to work with an API which gives me the expected format of the string this way:
1:420+4:9#### (must have “420” starting in position 1 AND have a “9” in position 4 AND have numeric digits in positions 5-8)
2:Z+14:&&+20:10,11,12 (must have a “Z” in position 2 AND have alpha letters in positions 14 and 15 AND have either “10”, “11”, or “12” starting in position 20)
Legend:
":" = position/valuelist separator
"," = value separator
"+" = test separator
"#" = numeric digit-only wildcard
"&" = alpha letter-only wildcard
Given this, my first thought is to do a series of substrings and splits of the input string and then do compare on each section? Or, I could do a for loop and iterate through each character one by one until I hit the end of the length of the input string.
Let's assume in this case that the input string is something like "420987435744585". Using rule number one, I should get a pass on this, since the first three characters are 420, position 4 is a 9, and positions 5-8 are numeric.
So far, I have created a method that returns a bool if I pass/fail validation. The input string is passed in. I then started to split on + or - to get all of the and or not sections and then split on comma to get the groups of rules. But this is where I am stuck. It seems like it should be easy and maybe it is but I just can't seem to wrap my head around it and I am thinking I am going to end up with a ton of arrays, foreach loops, if statements, etc... Just to validate and return true/false if the input string matches my format.
Can somebody please assist and give some guidance?
Thank you!!!!

The best way to handle these conditions would be to use Regular Expressions (Regex). At first you may find them a bit complicated, but it's worth putting time into learning them, so you can handle all types of string patterns in a simple, non-verbose way.
You can start with these tutorials :
http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial
http://www.tutorialspoint.com/csharp/csharp_regular_expressions.htm
And use this one as a reference :
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
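
For what it's worth, the rule format from the question can be translated into a regex fairly mechanically. The following is only a minimal sketch under a few assumptions: positions are 1-based, the literal values never contain the separator characters `+`, `:` or `,`, and "alpha letter" means A-Z or a-z. The names `FormatRuleRegex`, `Build` and `ValueToPattern` are made up for this illustration. Each test becomes a lookahead anchored at the start of the string, so all tests must hold at once (AND), while the comma-separated values become an alternation (OR).

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class FormatRuleRegex
{
    // Translates one value such as "9####" into a regex fragment.
    // '#' becomes a digit class, '&' a letter class, anything else is escaped literally.
    private static string ValueToPattern(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (c == '#') sb.Append("[0-9]");
            else if (c == '&') sb.Append("[A-Za-z]");
            else sb.Append(Regex.Escape(c.ToString()));
        }
        return sb.ToString();
    }

    // Builds a regex for a whole rule such as "1:420+4:9####".
    public static Regex Build(string rule)
    {
        var sb = new StringBuilder("^");
        foreach (string test in rule.Split('+'))
        {
            string[] parts = test.Split(':');
            int position = int.Parse(parts[0]); // 1-based position (assumption)
            string alternatives = string.Join("|",
                parts[1].Split(',').Select(v => ValueToPattern(v)));
            // Lookahead: skip (position - 1) characters, then one of the values must follow.
            sb.Append("(?=.{" + (position - 1) + "}(?:" + alternatives + "))");
        }
        return new Regex(sb.ToString(), RegexOptions.Singleline);
    }
}
```

With this sketch, `FormatRuleRegex.Build("1:420+4:9####").IsMatch("420987435744585")` should return true, because "420" starts at position 1, position 4 is a "9", and positions 5-8 are digits.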

I think the best way is a custom function, it will be faster than RegEx, and it would be a lot of manual work to convert that format to RegEx.
I've made a start at the validation function, and it's testing ok for the samples you provided.
Here is the code:
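static bool CheckFormat(string formatString, string value)
{
    // Every '+'-separated test must pass (logical AND).
    foreach (string test in formatString.Split('+'))
    {
        string[] testElement = test.Split(':');
        // Positions in the format are 1-based; convert to a 0-based index.
        int startIndex = int.Parse(testElement[0]) - 1;
        string[] patternElements = testElement[1].Split(',');
        // The comma-separated values are alternatives: one match is enough (logical OR).
        bool anyMatch = false;
        foreach (string patternElement in patternElements)
        {
            if (MatchesAt(patternElement, value, startIndex))
            {
                anyMatch = true;
                break;
            }
        }
        if (!anyMatch)
            return false;
    }
    return true;
}
static bool MatchesAt(string pattern, string value, int startIndex)
{
    // Value string not long enough, so fail.
    if (startIndex < 0 || startIndex + pattern.Length > value.Length)
        return false;
    for (int i = 0; i < pattern.Length; i++)
    {
        // Compare against the character at the test's position, not at index i.
        char c = value[startIndex + i];
        switch (pattern[i])
        {
            case '#':
                if (!Char.IsDigit(c))
                    return false;
                break;
            case '&':
                if (!Char.IsLetter(c))
                    return false;
                break;
            default:
                if (pattern[i] != c)
                    return false;
                break;
        }
    }
    return true;
}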
The dotnet fiddle is here if you want to play with it: https://dotnetfiddle.net/52olLQ.
Good luck.

Related

How to find one of many possible substrings in a larger string?

I have a simple problem, but I could not find a simple solution yet.
I have a string containing for example this
UNB+123UNH+234BGM+345DTM+456
The actual string is lots larger, but you get the idea
now I have a set of values I need to find in this string
for example UNH and BGM and DTM and so on
So I need to search in the large string, and find the position of the first set of values.
something like this (not existing but to explain the idea)
string[] chars = {"UNH", "BGM", "DTM" };
int pos = test.IndexOfAny(chars);
in this case pos would be 7 (zero-based) because, of all 3 substrings, UNH is the first occurrence in the variable test
What I am actually trying to accomplish is splitting the large string into a list of strings, but the delimiter can be one of many values ("BGM", "UNH", "DTM")
So the result would be
UNB+123
UNH+234
BGM+345
DTM+456
I can of course build a loop that does IndexOf for each of the substrings and then remembers the smallest value, but that seems so inefficient. I am hoping for a better way to do this
EDIT
the substrings to search for are always 3 letters, but the text in between can be anything at all with any length
EDIT
They are always 3 alphanumeric characters, and then anything can follow, including lots of + signs

You will find more problems with EDI than just splitting into corresponding fields: what about conditions, multiple values, or lists? I recommend you take a look at EDI.net.
EDIT:
EDIFact is too complex a format to handle with just regex. As I mentioned before, you will have conditions for each format/field/process, and you will need to capture the whole field in order to really parse it. For example, DTM can have one specific datetime format in one EDI message and a totally different one in another.
However, this is the structure of a DTM field:
DTM DATE/TIME/PERIOD
Function: To specify date, and/or time, or period.
010 C507 DATE/TIME/PERIOD M 1
    2005 Date or time or period function code qualifier M an..3
    2380 Date or time or period text C an..35
    2379 Date or time or period format code C an..3
So you will have always something like 'DTM+d3:d35:d3' to search for.
Really, it isn't worth the struggle: use EDI.net, create your own POCO classes and work from there.
Friendly reminder that EDIFact changes every 6 months in Europe.

If the separators can be any one of UNB, UNH, BGM, or DTM, the following Regex could work:
foreach (Match match in Regex.Matches(input, @"(UNB|UNH|BGM|DTM).+?(?=(UNB|UNH|BGM|DTM)|$)"))
{
Console.WriteLine(match.Value);
}
Explanation:
(UNB|UNH|BGM|DTM) matches either of the separators
.+? matches any string with at least one character (but as short as possible)
(?=(UNB|UNH|BGM|DTM)|$) matches if either a separator follows or if the string ends there - the match is however not included in the value.
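
A related variant, in case it helps: since the goal is to cut the string in front of each tag, `Regex.Split` with only a lookahead can do the splitting directly. This is a minimal sketch, assuming the tags are plain strings; the names `TagSplitter` and `SplitBeforeTags` are invented for the example. The lookahead consumes no characters, so the tag stays at the start of the next piece, and empty pieces (which can appear when the input itself starts with a tag) are filtered out.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Splits 'input' immediately before every occurrence of one of the tags.
    public static string[] SplitBeforeTags(string input, params string[] tags)
    {
        // Escape each tag so it is treated as literal text inside the pattern.
        string alternation = string.Join("|", tags.Select(t => Regex.Escape(t)));
        // (?=...) is a zero-width lookahead: it marks a split point without removing the tag.
        return Regex.Split(input, "(?=" + alternation + ")")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }
}
```

Calling `TagSplitter.SplitBeforeTags("UNB+123UNH+234BGM+345DTM+456", "UNH", "BGM", "DTM")` should give the four pieces UNB+123, UNH+234, BGM+345 and DTM+456. Note that, like the regex above, this also splits wherever a tag happens to occur inside the data between two tags.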

It sounds like the other answer recognises the format - you should definitely consider a library specifically for parsing this format!
If you're intent on parsing it yourself, you could simply find the index of your identifiers in the string, determine the first 2 by position, and use those positions to Substring the original input
var input = "UNB+123UNH+234BGM+345DTM+456";
var chars = new[]{"UNH", "BGM", "DTM" };
var indexes = chars.Select(c => new{Length=c.Length,Position= input.IndexOf(c)}) // Get position and length of each input
.Where(x => x.Position>-1) // where there is actually a match
.OrderBy(x =>x.Position) // put them in order of the position in the input
.Take(2) // only interested in first 2
.ToArray(); // make it an array
if(indexes.Length < 2)
throw new Exception("Did not find 2");
var result = input.Substring(indexes[0].Position + indexes[0].Length, indexes[1].Position - indexes[0].Position - indexes[0].Length);
Live example: https://dotnetfiddle.net/tDiQLG

There are already a lot of answers here, but I took the time to write mine, so I might as well post it even if it's not as elegant.
The code assumes all tags are accounted for in the chars array.
string str = "UNB+123UNH+234BGM+345DTM+456";
string[] chars = { "UNH", "BGM", "DTM" };
var locations = chars.Select(o => str.IndexOf(o)).Where(i => i > -1).OrderBy(o => o);
var resultList = new List<string>();
for(int i = 0;i < locations.Count();i++)
{
var nextIndex = locations.ElementAtOrDefault(i + 1);
nextIndex = nextIndex > 0 ? nextIndex : str.Length;
nextIndex = nextIndex - locations.ElementAt(i);
resultList.Add(str.Substring(locations.ElementAt(i), nextIndex));
}

This is a fairly efficient O(n) solution using a HashSet
It's extremely simple, low allocations, more efficient than regex, and doesn't need a library
Given
private static HashSet<string> _set;
public static IEnumerable<string> Split(string input)
{
var last = 0;
for (int i = 0; i <= input.Length - 3; i++) // <= so that a tag in the last 3 characters is also checked
{
if (!_set.Contains(input.Substring(i, 3))) continue;
yield return input.Substring(last, i - last);
last = i;
}
yield return input.Substring(last);
}
Usage
_set = new HashSet<string>(new []{ "UNH", "BGM", "DTM" });
var results = Split("UNB+123UNH+234BGM+345DTM+456");
foreach (var item in results)
Console.WriteLine(item);
Output
UNB+123
UNH+234
BGM+345
DTM+456
Note : You could get this faster with a simple sorted tree, but would require more effort

How can I check if a string follows the pattern of american currency?

I'm working on a problem that wants me to get a string input from a user and run it through a method that checks every character to see if it follows the pattern of American currency. It has to be a string that goes into the method. The amount can be anywhere from 1 dollar to a thousand, but it must be entered in the format $x.xx, $xx.xx, or $xxx.xx. As long as the user enters an amount that is consistent with the above formats, my program should output that it's "valid"; anything else should produce an "invalid format" output. The first character must be the '$', and I cannot use regex.
I get the user input and validate it with string.IsNullOrWhiteSpace, and then send the string value holding the user input down to my method. From this point I have no idea how to continue. I've tried .ToCharArray, I have also tried making a long and complicated if statement, and I have researched for a few hours now, but I can't find a solid way to write this out.
class Program
{
static void Main(string[] args)
{
Console.WriteLine("enter amount between $1.00 and $1000.00");
string valueUS = Console.ReadLine();
while (string.IsNullOrWhiteSpace(valueUS))
{
Console.WriteLine("Please enter in an amount");
valueUS = Console.ReadLine();
}
currencyChecker(valueUS);
}
public static string currencyChecker(string currencyString)
{
char[] currencyArray;
currencyArray = currencyString.ToCharArray();
for (int i = 0; i < currencyArray.Length; i++)
{
if (currencyArray[0] == '$')
{
}
}
return currencyString;
}
}
The method above should check every character entered by the user and verify that it matches the pattern for American currency described above, and output that it's "valid"; anything else should be reported back as "invalid".

Usually, you would use a regular expression for something like this.
A simple regex for this would be ^\$\d+\.\d\d$. Basically, it means the string should start with a $ sign, have at least one digit, a dot, and two more digits.
However, this can be done without regular expressions, and since it seems like a homework task, I'll give you a nudge in the right direction.
So you need to test that the string starts with $, that the 3rd char from the right is a ., and that everything else is a digit.
Your method should return a bool indicating valid / invalid results - so you should do something like this:
static bool IsCurrency(string currency)
{
// Check if the string is not null or empty - if not, return false
// check if the string is at least 5 chars long, since you need at least $x.xx - if not, return false
// Check if the first char is $ - if not, return false
// Check if the 3rd char from the end is . - if not, return false
// check if all the other chars are digits - if not, return false
// If all checks are valid -
return true;
}
Note that the order of the tests is critical. For instance, if you check that the 3rd char from the right is a . before you check that you have at least 5 chars, you might attempt to check a string that is only 2 chars long and get an exception.
Since this is (probably) homework I'm going to leave the code-writing part for you, so you would actually learn something from this.

Read input with different datatypes and space separation

I'm trying to figure out how to write code to let the user input three values (string, int, int) in one line with space to separate the values.
I thought of doing it with String.Split Method but that only works if all the values have the same datatype.
How can I do it with different datatypes?
For example:
The user might want to input
Hello 23 54
I'm using console application C#

Well the first problem is that you need to decide whether the text the user enters itself can contain spaces. For example, is the following allowed?
Hello World, it's me 08 15
In that case, String.Split will not really be helpful.
What I'd try is using a regular expression. The following may serve as a starting point:
Match m = Regex.Match(input, @"^(?<text>.+) (?<num1>(\+|\-)?\d+) (?<num2>(\+|\-)?\d+)$");
if (m.Success)
{
string stringValue = m.Groups["text"].Value;
int num1 = Convert.ToInt32(m.Groups["num1"].Value);
int num2 = Convert.ToInt32(m.Groups["num2"].Value);
}
BTW: The following part of your question makes me frown:
I thought of doing it with String.Split Method but that only works if all the values have the same datatype.
A string is always just a string. Whether it contains a text, your email-address or your bank account balance. It is always just a series of characters. The notion that the string contains a number is just your interpretation!
So from a program's point of view, the string you gave is a series of characters. And for splitting that it doesn't matter at all what the real semantics of the content are.
That's why the splitting part is separate from the conversion part. You need to tell your application that the first part is a string, while the second and third parts are supposed to be numbers. That's what you need type conversions for.

You are confusing things. A string is either null, empty or contains a sequence of characters. It never contains other data types. However, it might contain parts that could be interpreted as numbers, dates, colors etc... (but they are still strings). "123" is not an int! It is a string containing a number.
In order to extract these pieces you need to do two things:
Split the string into several string parts.
Convert string parts that are supposed to represent whole numbers into the int type (=System.Int32).
string input = "Abc 123 456";
string[] parts = input.Split(); //Whitespaces are assumed as separators by default.
if (parts.Length == 3) {
Console.WriteLine("The text is \"{0}\"", parts[0]);
int n1;
if (Int32.TryParse(parts[1], out n1)) {
Console.WriteLine("The 1st number is {0}", n1);
} else {
Console.WriteLine("The second part is supposed to be a whole number.");
}
int n2;
if (Int32.TryParse(parts[2], out n2)) {
Console.WriteLine("The 2nd number is {0}", n2);
} else {
Console.WriteLine("The third part is supposed to be a whole number.");
}
} else {
Console.WriteLine("You must enter three parts separated by a space.");
}

What you have to do is get "Hello 23 54" in a string variable. Split by " " and treat them.
string value = "Hello 23 54";
var listValues = value.Split(' ').ToList();
After that you have to parse each item from listValues to your related types.
Hope it helps. ;)
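
If the text part may itself contain spaces (as in the "Hello World, it's me 08 15" example above), one option that still uses plain splitting is to treat the last two tokens as the numbers and everything before them as the text. A minimal sketch, assuming that convention is acceptable; `InputLine.TryParse` is a made-up helper name, and runs of spaces inside the text are collapsed to single spaces.

```csharp
using System;

public static class InputLine
{
    // Parses "<text> <int> <int>", where <text> may contain spaces.
    // Returns false if there are fewer than three tokens or a number is invalid.
    public static bool TryParse(string line, out string text, out int first, out int second)
    {
        text = null;
        first = 0;
        second = 0;
        if (line == null) return false;

        string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 3) return false;

        // The last two tokens must be whole numbers.
        if (!int.TryParse(parts[parts.Length - 2], out first)) return false;
        if (!int.TryParse(parts[parts.Length - 1], out second)) return false;

        // Everything before the two numbers is the text.
        text = string.Join(" ", parts, 0, parts.Length - 2);
        return true;
    }
}
```

For "Hello 23 54" this should produce the text "Hello" with the numbers 23 and 54; for "Hello World, it's me 08 15" the text is "Hello World, it's me" and the numbers are 8 and 15.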

grouping adjacent similar substrings

I am writing a program in which I want to group the adjacent substrings, e.g ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options I want to find the resultant string with the minimum length.
Here is code what i have written so far but not doing job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}

You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a Python solution should do the trick, too.
So you have an input string ABCABCBC. And you want to transform this into an advanced variant of run length encoding (let's called it advanced RLE).
My idea consists of a general first idea onto which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
This input can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix (e.g. for the prefix ABC and the suffix ABCBC, the prefix occurs only once at the beginning of the suffix, making a total of 2 ABC following each other). So, we can compact this prefix / suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix one time, there still is a suffix for most of the string combinations. E.g. in the output (2ABC, BC), there still is the string BC as suffix. So, we need to find the shortest representation for this remaining suffix. Wooo, we already have a function for this called shortest_repr, so let's just call it on the remaining suffix.
This image displays how this recursion works (I only expanded one of the node after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr to the string ABABB (I selected a shorter sample for the image). Then, we split this string at all possible split positions and get a list of prefix / suffix pairs in the second row. On each of the elements of this list we first call the prefix/suffix optimization (shorten_prefix) and retrieve a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffix, we call our recursion function shortest_repr.
I did not display the upward-direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
When the result of the call to shortest_repr was received inside our loop, we just select the shortest string inside the loop and return this.
This is some quickly hacked code that does the trick:
def shorten_beginning(beginning, ending):
    count = 1
    while ending.startswith(beginning):
        count += 1
        ending = ending[len(beginning):]
    return str(count) + beginning, ending
def find_shortest_repr(string):
    possible_variants = []
    if not string:
        return ''
    for i in range(1, len(string) + 1):
        beginning = string[:i]
        ending = string[i:]
        shortened, new_ending = shorten_beginning(beginning, ending)
        shortest_ending = find_shortest_repr(new_ending)
        possible_variants.append(shortened + shortest_ending)
    return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the recursive Levenshtein distance calculation. It calculates the same suffixes multiple times. So, it would be a nice exercise to try to implement this with dynamic programming.
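
Since the question asked for C#, here is a rough sketch of how the same idea might look with memoization, so that each suffix is solved only once. This is an illustration rather than a tested port of the Python code: the names `AdvancedRle` and `ShortestRepr` are invented, and on ties it keeps the first shortest candidate it finds, whereas the Python version picks the alphabetically smallest one. Like the Python code, it always folds as many repetitions of a prefix as possible.

```csharp
using System;
using System.Collections.Generic;

public static class AdvancedRle
{
    // Returns a shortest "advanced RLE" encoding of s, e.g. "ABCABCBC" -> "2ABC1BC".
    public static string ShortestRepr(string s)
    {
        return ShortestRepr(s, 0, new Dictionary<int, string>());
    }

    // Solves the suffix of s that begins at 'start'; results are cached per start index.
    private static string ShortestRepr(string s, int start, Dictionary<int, string> memo)
    {
        if (start == s.Length) return "";
        string cached;
        if (memo.TryGetValue(start, out cached)) return cached;

        string best = null;
        for (int len = 1; start + len <= s.Length; len++)
        {
            // Count how often the block of 'len' characters repeats back to back.
            int count = 1;
            int next = start + len;
            while (next + len <= s.Length &&
                   string.CompareOrdinal(s, start, s, next, len) == 0)
            {
                count++;
                next += len;
            }
            // Encode the repeated block, then recurse on whatever is left.
            string candidate = count + s.Substring(start, len) + ShortestRepr(s, next, memo);
            if (best == null || candidate.Length < best.Length) best = candidate;
        }
        memo[start] = best;
        return best;
    }
}
```

The cache is keyed by the start index of the suffix, which works because the best encoding of a suffix does not depend on how the part before it was encoded.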

If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(@"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional; it only pays off if you reuse the regex many times
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like to store each token that is enclosed in double quotes in an array. Is there a safe way to do this instead of using the String split() function? Currently I load the file into a RichTextBox, and then, using its Lines[] property, I loop over each Lines[] element and do this:
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, the comma inside a token can easily mess up the split() function. So, instead of ending up with three tokens as I want, I end up with 6 tokens.
Any help will be appreciated!

You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = @"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979

I've done this with my own method. It simply counts the number of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}
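
As one possible way to "improve this to your needs", here is a sketch of a small state machine that also strips the surrounding quotes, tolerates spaces around quoted fields, and treats a doubled quote ("") inside a quoted field as a literal quote. It is only an illustration for a single line of input: `CsvLine.Split` is a made-up name, only double quotes are recognised, and fields that span several lines are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits one CSV line into fields, honouring double-quoted fields.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;   // currently inside a quoted field
        bool wasQuoted = false;  // the current field started with a quote

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote means a literal quote
                    i++;
                }
                else if (c == '"') inQuotes = false; // closing quote
                else current.Append(c);
            }
            else if (c == '"' && !wasQuoted && current.ToString().Trim().Length == 0)
            {
                // Opening quote at the start of a field (leading spaces are dropped).
                inQuotes = true;
                wasQuoted = true;
                current.Clear();
            }
            else if (c == ',')
            {
                fields.Add(wasQuoted ? current.ToString() : current.ToString().Trim());
                current.Clear();
                wasQuoted = false;
            }
            else if (!wasQuoted) current.Append(c); // text after a closing quote is ignored
        }
        fields.Add(wasQuoted ? current.ToString() : current.ToString().Trim());
        return fields;
    }
}
```

For the sample line from the question this should return three fields: Eko S. Wibowo, then Tamanan, Banguntapan, Bantul, DIY, then 6/27/1979, all without the quotes.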

It's not an exact answer to your question, but why don't you use an already written library to manipulate the CSV file? A good example would be LinqToCsv. CSV can be delimited with various punctuation signs. Moreover, there are gotchas that the library creators have already addressed, such as dealing with the name row, dealing with different date formats, and mapping rows to C# objects.

You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');

If your CSV line is tightly packed, it's easiest to remove the first and last quote as mentioned earlier and then do a simple split on the joining string
string[] tokens = input.Substring(1, input.Length - 2).Split(new[] { "\",\"" }, StringSplitOptions.None);
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"

Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader))
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file
