grouping adjacent similar substrings

grouping adjacent similar substrings - c#

I am writing a program in which I want to group adjacent repeated substrings, e.g. ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options, I want to find the resulting string with the minimum length.
Here is the code I have written so far, but it does not do the job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
    class Program
    {
        static void Main(string[] args)
        {
            string input;
            Console.WriteLine("Please enter the set of operations: ");
            input = Console.ReadLine();
            char[] array = input.ToCharArray();
            List<string> list = new List<string>();
            string temp = "";
            string firstTemp = "";
            foreach (var x in array)
            {
                if (temp.Contains(x))
                {
                    firstTemp = temp;
                    if (list.Contains(firstTemp))
                    {
                        list.Add(firstTemp);
                    }
                    temp = "";
                    list.Add(firstTemp);
                }
                else
                {
                    temp += x;
                }
            }
            /*foreach (var item in list)
            {
                Console.WriteLine(item);
            }*/
            Console.ReadLine();
        }
    }
}

You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a Python solution should do the trick, too.
So you have an input string ABCABCBC, and you want to transform it into an advanced variant of run-length encoding (let's call it advanced RLE).
My approach starts from one basic idea, to which I then apply recursion:
The overall goal is to find the shortest representation of the string using advanced RLE, so let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
Each pair can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix. For example, for the prefix ABC and the suffix ABCBC, the prefix occurs only once at the beginning of the suffix, making a total of 2 ABC following each other. So we can compact this prefix/suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix once, there is still a suffix left for most of the combinations. E.g. in the output (2ABC, BC), the string BC remains as the suffix. So we need to find the shortest representation for this remaining suffix. We already have a function for this, shortest_repr, so let's just call it on the remaining suffix.
This image displays how this recursion works (I only expanded one of the nodes after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr on the string ABABB (I selected a shorter sample for the image). Then we split this string at all possible split positions and get a list of prefix/suffix pairs in the second row. On each element of this list we first call the prefix/suffix optimization (shorten_prefix) and get back a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffixes, we call our recursive function shortest_repr.
I did not display the upward direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return it immediately.
Once the results of the calls to shortest_repr have been collected inside our loop, we just select the shortest candidate string and return it.
This is some quickly hacked code that does the trick (here the two functions are named shorten_beginning and find_shortest_repr):
def shorten_beginning(beginning, ending):
    count = 1
    while ending.startswith(beginning):
        count += 1
        ending = ending[len(beginning):]
    return str(count) + beginning, ending
def find_shortest_repr(string):
    possible_variants = []
    if not string:
        return ''
    for i in range(1, len(string) + 1):
        beginning = string[:i]
        ending = string[i:]
        shortened, new_ending = shorten_beginning(beginning, ending)
        shortest_ending = find_shortest_repr(new_ending)
        possible_variants.append(shortened + shortest_ending)
    return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the naive recursive Levenshtein distance calculation: it computes the result for the same suffixes multiple times. So it would be a nice exercise to try to implement this with dynamic programming.
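One way to approach the dynamic programming exercise mentioned above is to cache the result for each suffix. The following is only a sketch under that assumption: it reuses shorten_beginning from the answer, and the name find_shortest_repr_memo is made up for illustration. It keeps the same greedy behaviour (a prefix always absorbs as many repetitions as possible) and only avoids recomputing suffixes.

```python
from functools import lru_cache


def shorten_beginning(beginning, ending):
    # Count how many times `beginning` repeats at the start of `ending`.
    count = 1
    while ending.startswith(beginning):
        count += 1
        ending = ending[len(beginning):]
    return str(count) + beginning, ending


@lru_cache(maxsize=None)
def find_shortest_repr_memo(string):
    # Hypothetical memoized variant: each distinct suffix is solved only once.
    if not string:
        return ''
    best = None
    for i in range(1, len(string) + 1):
        shortened, new_ending = shorten_beginning(string[:i], string[i:])
        candidate = shortened + find_shortest_repr_memo(new_ending)
        # Same tie-breaking as the original: shortest first, then lexicographic.
        if best is None or (len(candidate), candidate) < (len(best), best):
            best = candidate
    return best


print(find_shortest_repr_memo('ABCABCBC'))
```

Since the cache is keyed by the suffix string, the number of distinct subproblems is bounded by the number of distinct suffixes that can occur, instead of growing with every path through the recursion tree.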

If this is not a school assignment or a performance-critical part of the code, a Regex might be enough:
string input = "ABCABCBC";
// RegexOptions.Compiled is optional; it only pays off if the regex is used more than once
var re = new Regex(@"(.+)\1+|(.+)", RegexOptions.Compiled);
// $1$2 is the repeated unit; match length / unit length is the repeat count
string output = re.Replace(input,
    m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

Related

How to find one of many possible substrings in a larger string?

I have a simple problem, but I have not found a simple solution yet.
I have a string containing, for example, this:
UNB+123UNH+234BGM+345DTM+456
The actual string is much larger, but you get the idea.
Now I have a set of values I need to find in this string,
for example UNH and BGM and DTM and so on.
So I need to search the large string and find the position of whichever of these values occurs first.
Something like this (IndexOfAny does not exist for strings, this is just to explain the idea):
string[] chars = {"UNH", "BGM", "DTM" };
int pos = test.IndexOfAny(chars);
In this case pos would be 7 (zero-based, i.e. the 8th character), because of the 3 substrings, UNH is the first to occur in the variable test.
What I am actually trying to accomplish is splitting the large string into a list of strings, where the delimiter can be one of many values ("BGM", "UNH", "DTM").
So the result would be
UNB+123
UNH+234
BGM+345
DTM+456
I can of course build a loop that does IndexOf for each of the substrings and then remembers the smallest value, but that seems so inefficient. I am hoping for a better way to do this.
EDIT
The substrings to search for are always 3 letters, but the text in between can be anything at all, of any length.
EDIT
They are always 3 alphanumeric characters, and then anything can follow, including lots of + signs.

You will run into more problems with EDI than just splitting into the corresponding fields: what about conditions, multiple values, or lists? I recommend you take a look at EDI.net.
EDIT:
EDIFACT is too complex a format to handle with just a regex. As I mentioned before, you will have conditions for each format/field/process, and you will need to capture the whole field in order to really parse it. For example, DTM can have one specific date/time format in one EDI message and a totally different date/time format in another.
However, this is the structure of a DTM field:
DTM   DATE/TIME/PERIOD
Function: To specify date, and/or time, or period.
010   C507  DATE/TIME/PERIOD                                M  1
      2005  Date or time or period function code qualifier  M  an..3
      2380  Date or time or period text                     C  an..35
      2379  Date or time or period format code              C  an..3
So you will always have something like 'DTM+d3:d35:d3' to search for.
Really, it isn't worth the struggle: use EDI.net, create your own POCO classes and work from there.
Friendly reminder that EDIFACT changes every 6 months in Europe.

If the separators can be any one of UNB, UNH, BGM, or DTM, the following Regex could work:
foreach (Match match in Regex.Matches(input, @"(UNB|UNH|BGM|DTM).+?(?=(UNB|UNH|BGM|DTM)|$)"))
{
    Console.WriteLine(match.Value);
}
Explanation:
(UNB|UNH|BGM|DTM) matches either of the separators
.+? matches any string with at least one character (but as short as possible)
(?=(UNB|UNH|BGM|DTM)|$) matches if either a separator follows or if the string ends there - the match is however not included in the value.
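For quick experimentation outside of C#, the same pattern can be sketched in Python, the language used by an earlier answer on this page. The function name split_segments and the TAGS tuple are made up for illustration; the regular expression itself is the one explained above, built from the tag list instead of being written out by hand.

```python
import re

# Assumed tag list; extend it with every segment tag that can occur.
TAGS = ("UNB", "UNH", "BGM", "DTM")


def split_segments(text, tags=TAGS):
    # Build "UNB|UNH|..." and escape each tag in case it contains regex characters.
    alternation = "|".join(re.escape(tag) for tag in tags)
    # A tag, then as few characters as possible, up to the next tag or the end.
    pattern = re.compile(
        rf"(?:{alternation}).+?(?=(?:{alternation})|$)", re.DOTALL
    )
    return [match.group(0) for match in pattern.finditer(text)]


for segment in split_segments("UNB+123UNH+234BGM+345DTM+456"):
    print(segment)
```

Note that `.+?` needs at least one character, so a tag that is immediately followed by another tag would not be reported as a segment of its own.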

It sounds like the other answer recognises the format - you should definitely consider a library specifically for parsing this format!
If you're intent on parsing it yourself, you could simply find the index of each of your identifiers in the string, determine the first 2 by position, and use those positions to Substring the original input:
var input = "UNB+123UNH+234BGM+345DTM+456";
var chars = new[] { "UNH", "BGM", "DTM" };
var indexes = chars
    .Select(c => new { Length = c.Length, Position = input.IndexOf(c) }) // get position and length of each identifier
    .Where(x => x.Position > -1) // where there is actually a match
    .OrderBy(x => x.Position)    // put them in order of the position in the input
    .Take(2)                     // only interested in first 2
    .ToArray();                  // make it an array
if (indexes.Length < 2)
    throw new Exception("Did not find 2");
// The text between the end of the first identifier and the start of the second
var result = input.Substring(indexes[0].Position + indexes[0].Length, indexes[1].Position - indexes[0].Position - indexes[0].Length);
Live example: https://dotnetfiddle.net/tDiQLG

There are already a lot of answers here, but I took the time to write mine, so I might as well post it even if it's not as elegant.
The code assumes all tags are accounted for in the chars array.
string str = "UNB+123UNH+234BGM+345DTM+456";
string[] chars = { "UNH", "BGM", "DTM" };
// Materialize the sorted positions once instead of re-running the query on every access
var locations = chars.Select(o => str.IndexOf(o)).Where(i => i > -1).OrderBy(o => o).ToList();
var resultList = new List<string>();
for (int i = 0; i < locations.Count; i++)
{
    // A segment runs to the next tag, or to the end of the string for the last tag
    int end = i + 1 < locations.Count ? locations[i + 1] : str.Length;
    resultList.Add(str.Substring(locations[i], end - locations[i]));
}

This is a fairly efficient O(n) solution using a HashSet
It's extremely simple, low allocations, more efficient than regex, and doesn't need a library
Given
private static HashSet<string> _set;
public static IEnumerable<string> Split(string input)
{
    var last = 0;
    // <= so that a tag in the last 3 characters is also checked
    for (int i = 0; i <= input.Length - 3; i++)
    {
        if (!_set.Contains(input.Substring(i, 3))) continue;
        // Skip the empty piece when the input starts with a tag
        if (i > last) yield return input.Substring(last, i - last);
        last = i;
    }
    yield return input.Substring(last);
}
Usage
_set = new HashSet<string>(new[] { "UNH", "BGM", "DTM" });
var results = Split("UNB+123UNH+234BGM+345DTM+456");
foreach (var item in results)
    Console.WriteLine(item);
Output
UNB+123
UNH+234
BGM+345
DTM+456
Note: you could probably make this faster with a simple sorted tree, but that would require more effort.

How do I find a variable set of 5 numbers qualified by surrounding underscores?

I am pulling file names into a variable (@[User::FileName]) and attempting to extract the work order number (always 5 digits with underscores on both sides) from that string. For example, a file name would look like "ABC_2017_DEF_9_12_GHI_35132_S5160.csv", and I want the result to be "35132". I have found examples of how to do it, such as SUBSTRING(FileName,1,FINDSTRING(FileName,"_",1) - 1), but the underscore will not always be in the same location.
Is it possible to do this in the expression builder?
Answer:
public void Main()
{
    string strFilename = Dts.Variables["User::FileName"].Value.ToString();
    var RegexObj = new Regex(@"_([\d]{5})_");
    var match = RegexObj.Match(strFilename);
    if (match.Success)
    {
        Dts.Variables["User::WorkOrder"].Value = match.Groups[1].Value;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}

First of all, the example you have provided, ABC_2017_DEF_9_12_GHI_35132_S5160.csv, contains 4 numbers located between underscores:
2017 , 9 , 12 , 35132
I don't know whether a 5-digit number can occur more than once in the filename, so in my answer I will assume that the number you want is the last occurrence of a 5-digit number.
Solution
You can use the following regular expression (PCRE syntax):
(?:_)\K[0-9][0-9][0-9][0-9][0-9](?=_)
Note that .NET does not support \K; the equivalent there uses a lookbehind: (?<=_)[0-9]{5}(?=_)
Or, as @MartinSmith suggested (in a comment), you can use the following RegEx:
_([\d]{5})_
Implementing RegEx in SSIS
First add another variable (e.g. @[User::FileNumber]).
Add a Script Task and choose the @[User::FileName] variable as ReadOnlyVariable and @[User::FileNumber] as ReadWriteVariable.
Inside the Script Task use the following code:
using System.Text.RegularExpressions;
public void Main()
{
    string strFilename = Dts.Variables["User::FileName"].Value.ToString();
    // .NET does not support \K, so a lookbehind requires the leading underscore instead
    var objRegEx = new Regex(@"(?<=_)[0-9]{5}(?=_)");
    var mc = objRegEx.Matches(strFilename);
    if (mc.Count > 0)
    {
        // The last match contains the value needed
        Dts.Variables["User::FileNumber"].Value = mc[mc.Count - 1].Value;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}

Do the other pieces mean something?
Anyway, you can use a Script Task and the Split function.
Pass in @fileName as read-only and @WO as read-write:
string fn = Dts.Variables["fileName"].Value.ToString();
string[] parts = fn.Split('_');
// Assuming it's always the 7th part (index 6)
// You could extract the other parts as well.
Dts.Variables["WO"].Value = parts[6];

I would do this with a Script Transformation (or Script Task if this is not in a DataFlow) and use a Regex.

Read input with different datatypes and space seperation

I'm trying to figure out how to write code to let the user input three values (string, int, int) in one line with space to separate the values.
I thought of doing it with String.Split Method but that only works if all the values have the same datatype.
How can I do it with different datatypes?
For example:
The user might want to input
Hello 23 54
I'm writing a C# console application.

Well the first problem is that you need to decide whether the text the user enters itself can contain spaces. For example, is the following allowed?
Hello World, it's me 08 15
In that case, String.Split will not really be helpful.
What I'd try is using a regular expression. The following may serve as a starting point:
Match m = Regex.Match(input, @"^(?<text>.+) (?<num1>(\+|\-)?\d+) (?<num2>(\+|\-)?\d+)$");
if (m.Success)
{
    string stringValue = m.Groups["text"].Value;
    int num1 = Convert.ToInt32(m.Groups["num1"].Value);
    int num2 = Convert.ToInt32(m.Groups["num2"].Value);
}
BTW: The following part of your question makes me frown:
I thought of doing it with String.Split Method but that only works if all the values have the same datatype.
A string is always just a string. Whether it contains a text, your email-address or your bank account balance. It is always just a series of characters. The notion that the string contains a number is just your interpretation!
So from a program's point of view, the string you gave is a series of characters. And for splitting that it doesn't matter at all what the real semantics of the content are.
That's why the splitting part is separate from the conversion part. You need to tell your application that the first part is a string, while the second and third parts are supposed to be numbers. That's what you need type conversions for.
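The separation of splitting and conversion described above can be sketched in a few lines. This is only an illustration in Python (the language of an earlier answer on this page), and the function name parse_line is made up. It splits from the right so that the text part may itself contain spaces, then converts the last two pieces to integers in a separate step.

```python
def parse_line(line):
    # Step 1: splitting. Take the last two whitespace-separated pieces as the
    # numbers; everything before them is the text (which may contain spaces).
    parts = line.strip().rsplit(None, 2)
    if len(parts) != 3:
        raise ValueError("expected a text followed by two whole numbers")
    text, first, second = parts
    # Step 2: conversion. int() raises ValueError if a piece is not a number.
    return text, int(first), int(second)


print(parse_line("Hello 23 54"))
print(parse_line("Hello World, it's me 08 15"))
```

Up to the conversion step every piece is still just a string; only the call to int() gives the last two pieces their numeric interpretation.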

You are confusing things. A string is either null, empty or contains a sequence of characters. It never contains other data types. However, it might contain parts that could be interpreted as numbers, dates, colors etc. (but they are still strings). "123" is not an int! It is a string containing a number.
In order to extract these pieces you need to do two things:
1. Split the string into several string parts.
2. Convert the string parts that are supposed to represent whole numbers into the int type (= System.Int32).
string input = "Abc 123 456";
string[] parts = input.Split(); // Whitespace is used as the separator by default.
if (parts.Length == 3) {
    Console.WriteLine("The text is \"{0}\"", parts[0]);
    int n1;
    if (Int32.TryParse(parts[1], out n1)) {
        Console.WriteLine("The 1st number is {0}", n1);
    } else {
        Console.WriteLine("The second part is supposed to be a whole number.");
    }
    int n2;
    if (Int32.TryParse(parts[2], out n2)) {
        Console.WriteLine("The 2nd number is {0}", n2);
    } else {
        Console.WriteLine("The third part is supposed to be a whole number.");
    }
} else {
    Console.WriteLine("You must enter three parts separated by a space.");
}

What you have to do is read "Hello 23 54" into a string variable, split it on " " and then process the parts.
string value = "Hello 23 54";
var listValues = value.Split(' ').ToList();
After that you have to parse each item from listValues to your related types.
Hope it helps. ;)

Finding longest word in string

Ok, so I know that questions LIKE this have been asked a lot on here, but I can't seem to make solutions work.
I am trying to take a string from a file and find the longest word in that string.
Simples.
I think the issue is down to whether I am calling my methods on a string[] or char[], currently stringOfWords returns a char[].
I am trying to then order by descending length and get the first value but am getting an ArgumentNullException on the OrderByDescending method.
Any input much appreciated.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading.Tasks;
namespace TextExercises
{
    class Program
    {
        static void Main(string[] args)
        {
            var fileText = File.ReadAllText(@"C:\Users\RichardsPC\Documents\TestText.txt");
            var stringOfWords = fileText.ToArray();
            Console.WriteLine("Text in file: " + fileText);
            Console.WriteLine("Words in text: " + fileText.Split(' ').Length);
            // This is where I am trying to solve the problem
            var finalValue = stringOfWords.OrderByDescending(n => n.length).First();
            Console.WriteLine("Largest word is: " + finalValue);
        }
    }
}

Don't split the string, use a Regex
If you care about performance, you may not want to split the string. Split has to traverse the entire string, create a new string for every item it finds and put them all into an array, which already costs more than a single pass; ordering the result then adds at least another O(n log n) steps.
You can use a Regex for this instead, which can be more efficient because it iterates over the string only once:
// \w+ alone also catches a last word that is not followed by whitespace
var regex = new Regex(@"\w+", RegexOptions.Compiled);
var match = regex.Match(fileText);
var currentLargestString = "";
while (match.Success)
{
    if (match.Value.Length > currentLargestString.Length)
    {
        currentLargestString = match.Value;
    }
    match = match.NextMatch();
}
The nice thing about this is that you don't need to break the string up all at once to do the analysis. If you need to load the file incrementally, it is a fairly easy change: persist the longest word found so far in an object and run the same loop against each chunk.
If you're set on using an array, don't order it, just iterate over it
You don't need an OrderBy, since you're just looking for the largest item. The computational complexity of OrderBy is O(n log n) in most cases, while iterating over the list is O(n).
var largest = "";
// strArr is the string[] of words, e.g. the result of fileText.Split(' ')
foreach (var item in strArr)
{
    if (item.Length > largest.Length)
        largest = item;
}

Method ToArray() in this case returns char[] which is an array of individual characters. But instead you need an array of individual words. You can get it like this:
string[] stringOfWords = fileText.Split(' ');
And you have a typo in your lambda expression (Length needs an uppercase L):
n => n.Length

Try this:
var fileText = File.ReadAllText(@"C:\Users\RichardsPC\Documents\TestText.txt");
var words = fileText.Split(' ');
// Order the words (not the characters of fileText) by length
var finalValue = words.OrderByDescending(n => n.Length).First();
Console.WriteLine("Longest word: " + finalValue);

As suggested in the other answer, you need to split your string.
string[] stringOfWords = fileText.Split(new char[] { ',', ' ' });
// All is well, now let's loop over it and see which is the biggest
int biggest = 0;
int biggestIndex = 0;
for (int i = 0; i < stringOfWords.Length; i++) {
    if (biggest < stringOfWords[i].Length) {
        biggest = stringOfWords[i].Length;
        biggestIndex = i;
    }
}
return stringOfWords[biggestIndex];
What we're doing here is splitting the string on spaces (' ') or commas (you can add any number of delimiters there), so that each word gets its own slot in the array.
From there, we iterate over the array. If we encounter a word that's longer than the current longest word, we record its length and index.

Complex string compare logic

I need help with some complex (for me anyway, as I am not too experienced) string comparison logic. Basically, I want to validate a string to make sure it matches a format rule. I am using C#, targeting .NET 4.5.2.
I am trying to work with an API which gives me the expected format of the string this way:
1:420+4:9#### (must have “420” starting in position 1 AND have a “9” in position 4 AND have numeric digits in positions 5-8)
2:Z+14:&&+20:10,11,12 (must have a “Z” in position 2 AND alpha letters in positions 14 and 15 AND have either “10”, “11”, or “12” starting in position 20)
Legend:
":" = position/valuelist separator
"," = value separator
"+" = test separator
"#" = numeric digit-only wildcard
"&" = alpha letter-only wildcard
Given this, my first thought is to do a series of substrings and splits of the input string and then do compare on each section? Or, I could do a for loop and iterate through each character one by one until I hit the end of the length of the input string.
Let's assume in this case that the input string is something like "420987435744585". Using rule number one, I should get a pass on this since the first three are 420, position 4 is a 9 and the next 5-8 are numeric.
So far, I have created a method that returns a bool if I pass/fail validation. The input string is passed in. I then started to split on + or - to get all of the and or not sections and then split on comma to get the groups of rules. But this is where I am stuck. It seems like it should be easy and maybe it is but I just can't seem to wrap my head around it and I am thinking I am going to end up with a ton of arrays, foreach loops, if statements, etc... Just to validate and return true/false if the input string matches my format.
Can somebody please assist and give some guidance?
Thank you!!!!

The best way to handle these conditions would be to use Regular Expressions (Regex). At first you may find them a bit complicated, but it's worth taking the time to learn them, so that you can handle all kinds of string patterns in a simple, non-verbose way.
You can start with these tutorials :
http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial
http://www.tutorialspoint.com/csharp/csharp_regular_expressions.htm
And use this one as a reference :
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
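As a rough illustration of the regex route for the rule format in the question, each position test can be turned into a lookahead anchored at the start of the string. This is only a sketch in Python (the language of an earlier answer on this page); the names rule_to_regex and matches_rule are made up, and it assumes that "#" means an ASCII digit, "&" means an ASCII letter, and that rules contain only "+", ":" and "," as separators as listed in the legend.

```python
import re


def rule_to_regex(rule):
    # Every '+'-separated test becomes one lookahead; all of them must hold (AND).
    lookaheads = []
    for test in rule.split("+"):
        position, values = test.split(":", 1)
        offset = int(position) - 1  # rule positions are 1-based
        alternatives = []
        # Comma-separated values are alternatives (OR).
        for value in values.split(","):
            pieces = []
            for ch in value:
                if ch == "#":
                    pieces.append("[0-9]")
                elif ch == "&":
                    pieces.append("[A-Za-z]")
                else:
                    pieces.append(re.escape(ch))
            alternatives.append("".join(pieces))
        lookaheads.append("(?=.{%d}(?:%s))" % (offset, "|".join(alternatives)))
    return re.compile("".join(lookaheads), re.DOTALL)


def matches_rule(rule, text):
    # match() anchors at the start, so each lookahead counts from position 1.
    return rule_to_regex(rule).match(text) is not None


print(matches_rule("1:420+4:9####", "420987435744585"))
```

For the first rule this builds the pattern (?=.{0}(?:420))(?=.{3}(?:9[0-9][0-9][0-9][0-9])), which may help when comparing the regex route with a hand-written loop.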

I think the best way is a custom function; it will probably be faster than a RegEx, and it would be a lot of manual work to convert that format to a RegEx.
I've made a start on the validation function, and it tests OK for the samples you provided.
Here is the code:
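static bool CheckFormat(string formatString, string value)
{
    // Every '+'-separated test must pass (logical AND).
    foreach (string test in formatString.Split('+'))
    {
        string[] testElement = test.Split(':');
        int startIndex = int.Parse(testElement[0]) - 1; // rule positions are 1-based
        string[] patternElements = testElement[1].Split(',');
        // Comma-separated values are alternatives: one match is enough (logical OR).
        bool anyMatch = false;
        foreach (string patternElement in patternElements)
        {
            if (MatchesAt(patternElement, value, startIndex))
            {
                anyMatch = true;
                break;
            }
        }
        if (!anyMatch)
            return false;
    }
    return true;
}
static bool MatchesAt(string pattern, string value, int startIndex)
{
    // Value string not long enough, so fail.
    if (startIndex < 0 || startIndex + pattern.Length > value.Length)
        return false;
    for (int i = 0; i < pattern.Length; i++)
    {
        char c = value[startIndex + i]; // compare at the rule's position, not from the start
        switch (pattern[i])
        {
            case '#':
                if (!char.IsDigit(c))
                    return false;
                break;
            case '&':
                if (!char.IsLetter(c))
                    return false;
                break;
            default:
                if (pattern[i] != c)
                    return false;
                break;
        }
    }
    return true;
}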
The dotnet fiddle is here if you want to play with it: https://dotnetfiddle.net/52olLQ.
Good luck.
