How to find one of many possible substrings in a larger string? - c#

I have a simple problem, but I could not find a simple solution yet.
I have a string containing for example this
UNB+123UNH+234BGM+345DTM+456
The actual string is lots larger, but you get the idea
now I have a set of values I need to find in this string
for example UNH and BGM and DTM and so on
So I need to search in the large string, and find the position of the first set of values.
something like this (not existing but to explain the idea)
string[] chars = {"UNH", "BGM", "DTM" };
int pos = test.IndexOfAny(chars);
in this case pos would be 8 because from all 3 substrings, UNH is the first occurrence in the variable test
What I actually trying to accomplish is splitting the large string into a list of strings, but the delimiter can be one of many values ("BGM", "UNH", "DTM")
So the result would be
UNB+123
UNH+234
BGM+345
DTM+456
I can off course build a loop that does IndexOf for each of the substrings, and then remember the smallest value, but that seems so inefficient. I am hoping for a better way to do this
EDIT
the substrings to search for are always 3 letters, but the text in between can be anything at all with any length
EDIT
It are always 3 alfanumeric characters, and then anything can be there, also lots of + signs

You will find more problems with EDI than just splitting into corresponding fields, what about conditions or multiple values or lists?. I recommend you to take a look at EDI.net
EDIT:
EDIFact is a format pretty complex to just use regex, as I mentioned before, you will have conditions for each format/field/process, you will need to catch the whole field in order to really parse it, means as example DTM can have one specific datetime format and in another EDI can have a DateTime format totally different.
However, this is the structure of a DTM field:
DTM DATE/TIME/PERIOD
Function: To specify date, and/or time, or period.
010 C507 DATE/TIME/PERIOD M 1
2005 Date or time or period function code
qualifier M an..3
2380 Date or time or period text C an..35
2379 Date or time or period format code C an..3
So you will have always something like 'DTM+d3:d35:d3' to search for.
Really, it doesn't worth the struggle, use EDI.net, create your own POCO classes and work from there.
Friendly reminder that EDIFact changes every 6 months on Europe.

If the separators can be any one of UNB, UNH, BGM, or DTM, the following Regex could work:
foreach (Match match in Regex.Matches(input, #"(UNB|UNH|BGM|DTM).+?(?=(UNB|UNH|BGM|DTM)|$)"))
{
Console.WriteLine(match.Value);
}
Explanation:
(UNB|UNH|BGM|DTM) matches either of the separators
.+? matches any string with at least one character (but as short as possible)
(?=(UNB|UNH|BGM|DTM)|$) matches if either a separator follows or if the string ends there - the match is however not included in the value.

It sounds like the other answer recognises the format - you should definitely consider a library specifically for parsing this format!
If you're intent on parsing it yourself, you could simply find the index of your identifiers in the string, determine the first 2 by position, and use those positions to Substring the original input
var input = "UNB+123UNH+234BGM+345DTM+456";
var chars = new[]{"UNH", "BGM", "DTM" };
var indexes = chars.Select(c => new{Length=c.Length,Position= input.IndexOf(c)}) // Get position and length of each input
.Where(x => x.Position>-1) // where there is actually a match
.OrderBy(x =>x.Position) // put them in order of the position in the input
.Take(2) // only interested in first 2
.ToArray(); // make it an array
if(indexes.Length < 2)
throw new Exception("Did not find 2");
var result = input.Substring(indexes[0].Position + indexes[0].Length, indexes[1].Position - indexes[0].Position - indexes[0].Length);
Live example: https://dotnetfiddle.net/tDiQLG

There is already a lot of answers here, but I took the time to write mine so might as well post it even if it's not as elegant.
The code assumes all tags are accounted for in the chars array.
string str = "UNB+123UNH+234BGM+345DTM+456";
string[] chars = { "UNH", "BGM", "DTM" };
var locations = chars.Select(o => str.IndexOf(o)).Where(i => i > -1).OrderBy(o => o);
var resultList = new List<string>();
for(int i = 0;i < locations.Count();i++)
{
var nextIndex = locations.ElementAtOrDefault(i + 1);
nextIndex = nextIndex > 0 ? nextIndex : str.Length;
nextIndex = nextIndex - locations.ElementAt(i);
resultList.Add(str.Substring(locations.ElementAt(i), nextIndex));
}

This is a fairly efficient O(n) solution using a HashSet
It's extremely simple, low allocations, more efficient than regex, and doesn't need a library
Given
private static HashSet<string> _set;
public static IEnumerable<string> Split(string input)
{
var last = 0;
for (int i = 0; i < input.Length-3; i++)
{
if (!_set.Contains(input.Substring(i, 3))) continue;
yield return input.Substring(last, i - last);
last = i;
}
yield return input.Substring(last);
}
Usage
_set = new HashSet<string>(new []{ "UNH", "BGM", "DTM" });
var results = Split("UNB+123UNH+234BGM+345DTM+456");
foreach (var item in results)
Console.WriteLine(item);
Output
UNB+123
UNH+234
BGM+345
DTM+456
Full Demo Here
Note : You could get this faster with a simple sorted tree, but would require more effort

Related

grouping adjacent similar substrings

I am writing a program in which I want to group the adjacent substrings, e.g ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options I want to find the resultant string with the minimum length.
Here is code what i have written so far but not doing job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}
You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a python solution should do the trick, too.
So you have an input string ABCABCBC. And you want to transform this into an advanced variant of run length encoding (let's called it advanced RLE).
My idea consists of a general first idea onto which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
This input can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix (e.g. for the prefix ABC and the suffix ABCBC, the prefix is only one time at the beginning of the suffix, making a total of 2 ABC following each other. So, we can compact this prefix / suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix one time, there still is a suffix for most of the string combinations. E.g. in the output (2ABC, BC), there still is the string BC as suffix. So, need to find the shortest representation for this remaining suffix. Wooo, we still have a function for this called shortest_repr, so let's just call this onto the remaining suffix.
This image displays how this recursion works (I only expanded one of the node after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr to the string ABABB (I selected a shorter sample for the image). Then, we split this string at all possible split positions and get a list of prefix / suffix pairs in the second row. On each of the elements of this list we first call the prefix/suffix optimization (shorten_prefix) and retrieve a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffix, we call our recursion function shortest_repr.
I did not display the upward-direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
When the result of the call to shortest_repr was received inside our loop, we just select the shortest string inside the loop and return this.
This is some quickly hacked code that does the trick:
def shorten_beginning(beginning, ending):
count = 1
while ending.startswith(beginning):
count += 1
ending = ending[len(beginning):]
return str(count) + beginning, ending
def find_shortest_repr(string):
possible_variants = []
if not string:
return ''
for i in range(1, len(string) + 1):
beginning = string[:i]
ending = string[i:]
shortened, new_ending = shorten_beginning(beginning, ending)
shortest_ending = find_shortest_repr(new_ending)
possible_variants.append(shortened + shortest_ending)
return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the recursive levenshtein distance calculation. It calculates the same suffices multiple times. So, it would be a nice exercise to try to implement this with dynamic programming.
If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(#"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional if you use it more than once
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

Complex string compare logic

I need help with some complex (for me anyway as I not too experienced) string comparison logic. Basically, I want to validate a string to make sure it matches a format rule. I am using C#, targeting .NET 4.5.2.
I am trying to work with an API which gives me the expected format of the string this way:
1:420+4:9#### (must have “420” starting in position 1 AND have a “9” in position 4 AND have numeric digits in positions 5-8
2:Z+14:&&+20:10,11,12 (must have a “Z” in position 2 AND and alpha letters in positions 14, 15 AND have either “10”, “11”, or “12” starting in position 20
Legend:
":" = position/valuelist separator
"," = value separator
"+" = test separator
"#" = numeric digit-only wildcard
"&" = alpha letter-only wildcard
Given this, my first thought is to do a series of substrings and splits of the input string and then do compare on each section? Or, I could do a for loop and iterate through each character one by one until I hit the end of the length of the input string.
Let's assume in this case that the input string is something like "420987435744585". Using rule number one, I should get a pass on this since the first three are 420, position 4 is a 9 and the next 5-8 are numeric.
So far, I have created a method that returns a bool if I pass/fail validation. The input string is passed in. I then started to split on + or - to get all of the and or not sections and then split on comma to get the groups of rules. But this is where I am stuck. It seems like it should be easy and maybe it is but I just can't seem to wrap my head around it and I am thinking I am going to end up with a ton of arrays, foreach loops, if statements, etc... Just to validate and return true/false if the input string matches my format.
Can somebody please assist and give some guidance?
Thank you!!!!
The best way to handle these conditions would be using Regular Expressions (Regex). At first, you may find it a bit complicated, but it's worth to put time on learning it to handle all types of string patterns in a simple non-verbose way.
You can start with these tutorials :
http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial
http://www.tutorialspoint.com/csharp/csharp_regular_expressions.htm
And use this one as a reference :
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
I think the best way is a custom function, it will be faster than RegEx, and it would be a lot of manual work to convert that format to RegEx.
I've made a start at the validation function, and it's testing ok for the samples you provided.
Here is the code:
static bool CheckFormat(string formatString, string value)
{
string[] tests = formatString.Split('+');
foreach(string test in tests)
{
string[] testElement = test.Split(':');
int startPos = int.Parse(testElement[0]);
string patterns = testElement[1];
string[] patternElements = patterns.Split(',');
foreach(string patternElement in patternElements)
{
//value string not long enough, so fail.
if(startPos + patternElement.Length > value.Length)
return false;
for (int i = 0; i < patternElement.Length; i++)
{
switch(patternElement[i])
{
case '#':
if (!Char.IsNumber(value[i]))
return false;
break;
case '&':
if (!Char.IsLetter(value[i]))
return false;
break;
default:
if(patternElement[i] != value[i])
return false;
break;
}
}
}
}
return true;
}
The dotnet fiddle is here if you want to play with it: https://dotnetfiddle.net/52olLQ.
Good luck.

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like like to store each token that enclosed using a double quotes to be in an array, is there a safe to do this instead of using the String split() function? Currently I load up the file in a RichTextBox, and then using its Lines[] property, I do a loop for each Lines[] element and doing this :
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, the comma inside a token can easily messed up split() function. So, instead of ended with three token as I want it, I ended with 6 tokens
Any help will be appreciated!
You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = #"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979
I've done this with my own method. It simply counts the amout of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}
It's not an exact answer to your question, but why don't you use already written library to manipulate CSV file, good example would be LinqToCsv. CSV could be delimited with various punctuation signs. Moreover, there are gotchas, which are already addressed by library creators. Such as dealing with name row, dealing with different date formats and mapping rows to C# objects.
You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');
If your CSV line is tightly packed it's easiest to use the end and tail removal mentioned earlier and then a simple split on a joining string
string[] tokens = input.Substring(1, input.Length - 2).Split("\",\"");
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"
Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader)
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file

c# searching large text file

I am trying to optimize the search for a string in a large text file (300-600mb). Using my current method, it is taking too long.
Currently I have been using IndexOf to search for the string, but the time it takes is way too long (20s) to build an index for each line with the string.
How can I optimize searching speed? I've tried Contains() but that is slow as well. Any suggestions? I was thinking regex match but I don't see that having a significant speed boost. Maybe my search logic is flawed
example
while ((line = myStream.ReadLine()) != null)
{
if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
{
LineIndex.Add(CurrentPosition);
LinesCounted += 1;
}
}
The brute force algorithm you're using performs in O(nm) time, where n is the length of the string being searched and m the length of the substring/pattern you're trying to find. You need to use a string search algorithm:
Boyer-Moore is "the standard", I think:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
But there are lots more out there:
http://www-igm.univ-mlv.fr/~lecroq/string/
including Morris-Pratt:
http://www.stoimen.com/blog/2012/04/09/computer-algorithms-morris-pratt-string-searching/
and Knuth-Morris-Pratt:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
However, using a regular expression crafted with care might be sufficient, depending on what you are trying to find. See Jeffrey's Friedl's tome, Mastering Regular Expressions for help on building efficient regular expressions (e.g., no backtracking).
You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java])
Unfortunately, I don't think there's a whole lot you can do in straight C#.
I have found the Boyer-Moore algorithm to be extremely fast for this task. But I found there was no way to make even that as fast as IndexOf. My assumption is that this is because IndexOf is implemented in hand-optimized assembler while my code ran in C#.
You can see my code and performance test results in the article Fast Text Search with Boyer-Moore.
Have you seen these questions (and answers)?
Processing large text file in C#
Is there a way to read large text file in parts?
Matching a string in a Large text file?
Doing it the way you are now seems to be the way to go if all you want to do is read the text file. Other ideas:
If it is possible to pre-sort the data, such as when it gets inserted into the text file, that could help.
You could insert the data into a database and query it as needed.
You could use a hash table
You can user regexp.Match(String). RegExp Match is faster.
static void Main()
{
string text = "One car red car blue car";
string pat = #"(\w+)\s+(car)";
// Instantiate the regular expression object.
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
int matchCount = 0;
while (m.Success)
{
Console.WriteLine("Match"+ (++matchCount));
for (int i = 1; i <= 2; i++)
{
Group g = m.Groups[i];
Console.WriteLine("Group"+i+"='" + g + "'");
CaptureCollection cc = g.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine("Capture"+j+"='" + c + "', Position="+c.Index);
}
}
m = m.NextMatch();
}
}

Appending a string in a loop in effective way

for long time , I always append a string in the following way.
for example if i want to get all the employee names separated by some symbol , in the below example i opeted for pipe symbol.
string final=string.Empty;
foreach(Employee emp in EmployeeList)
{
final+=emp.Name+"|"; // if i want to separate them by pipe symbol
}
at the end i do a substring and remove the last pipe symbol as it is not required
final=final.Substring(0,final.length-1);
Is there any effective way of doing this.
I don't want to appened the pipe symbol for the last item and do a substring again.
Use string.Join() and a Linq projection with Select() instead:
finalString = string.Join("|", EmployeeList.Select( x=> x.Name));
Three reasons why this approach is better:
It is much more concise and readable
– it expresses intend, not how you
want to achieve your goal (in your
case concatenating strings in a
loop). Using a simple projection with Linq also helps here.
It is optimized by the framework for
performance: In most cases string.Join() will
use a StringBuilder internally, so
you are not creating multiple strings that are
then un-referenced and must be
garbage collected. Also see: Do not
concatenate strings inside loops
You don’t have to worry about special cases. string.Join()
automatically handles the case of
the “last item” after which you do
not want another separator, again
this simplifies your code and makes
it less error prone.
I like using the aggregate function in linq, such as:
string[] words = { "one", "two", "three" };
var res = words.Aggregate((current, next) => current + ", " + next);
You should join your strings.
Example (borrowed from MSDN):
using System;
class Sample {
public static void Main() {
String[] val = {"apple", "orange", "grape", "pear"};
String sep = ", ";
String result;
Console.WriteLine("sep = '{0}'", sep);
Console.WriteLine("val[] = {{'{0}' '{1}' '{2}' '{3}'}}", val[0], val[1], val[2], val[3]);
result = String.Join(sep, val, 1, 2);
Console.WriteLine("String.Join(sep, val, 1, 2) = '{0}'", result);
}
}
For building up like this, a StringBuilder is probably a better choice.
For your final pipe issue, simply leave the last append outside of the loop
int size = EmployeeList.length()
for(int i = 0; i < size - 1; i++)
{
final+=EmployeeList.getEmployee(i).Name+"|";
}
final+=EmployeeList.getEmployee(size-1).Name;

Categories

Resources