Best way to reduce sequences in an array of strings

Please, now that I've re-written the question, and before it suffers from further fast-gun answers or premature closure by eager editors let me point out that this is not a duplicate of this question. I know how to remove duplicates from an array.
This question is about removing sequences from an array, not duplicates in the strict sense.
Consider this sequence of elements in an array:
[0] a
[1] a
[2] b
[3] c
[4] c
[5] a
[6] c
[7] d
[8] c
[9] d
In this example I want to obtain the following...
[0] a
[1] b
[2] c
[3] a
[4] c
[5] d
Notice that duplicate elements are retained but that sequences of the same element have been reduced to a single instance of that element.
Further, notice that when two lines repeat they should be reduced to one set (of two lines).
[0] c
[1] d
[2] c
[3] d
...reduces to...
[0] c
[1] d
I'm coding in C# but algorithms in any language appreciated.

EDIT: made some changes and new suggestions
What about a sliding window...
REMOVE LENGTH 2: (no other length has other matches)
//the lower case letters are the matches
ABCBAbabaBBCbcbcbVbvBCbcbcAB
__ABCBABABABBCBCBCBVBVBCBCBCAB
REMOVE LENGTH 1 (duplicate characters):
//* denote that a string was removed to prevent continual contraction
//of the string, unless this is what you want.
ABCBA*BbC*V*BC*AB
_ABCBA*BBC*V*BC*AB
RESULT:
ABCBA*B*C*V*BC*AB == ABCBABCVBCAB
This example starts at length 2 only because no other length has matches; in general, start at length L/2 (half the input length) and iterate down to 1.
I'm also thinking of two other approaches:
digraph - Build a stateful digraph from the data and walk it with the string; if a cycle is found, you have a duplication. I'm not sure how easy it is to check for these cycles... possibly some dynamic programming, so it could be equivalent to the distance-matrix method below. I'm going to have to think about this one a while longer.
distance matrix - Using a Levenshtein distance matrix, you might be able to detect duplication from zero-cost diagonal movement off the main diagonal. This could indicate duplication of data. I will have to think about this more.

Here's a C# app I wrote that solves this problem.
takes
aabccacdcd
outputs
abcacd
Probably looks pretty messy, took me a bit to get my head around the dynamic pattern length bit.
using System;
using System.Collections.Generic;
class Program
{
private static List<string> values;
private const int MAX_PATTERN_LENGTH = 4;
static void Main(string[] args)
{
values = new List<string>();
// same input as the question: a a b c c a c d c d
values.AddRange(new string[] { "a", "a", "b", "c", "c", "a", "c", "d", "c", "d" });
for (int i = MAX_PATTERN_LENGTH; i > 0; i--)
{
RemoveDuplicatesOfLength(i);
}
foreach (string s in values)
{
Console.WriteLine(s);
}
}
private static void RemoveDuplicatesOfLength(int dupeLength)
{
for (int i = 0; i < values.Count; i++)
{
if (i + dupeLength > values.Count)
break;
if (i + dupeLength + dupeLength > values.Count)
break;
var patternA = values.GetRange(i, dupeLength);
var patternB = values.GetRange(i + dupeLength, dupeLength);
bool isPattern = ComparePatterns(patternA, patternB);
if (isPattern)
{
values.RemoveRange(i, dupeLength);
i--; // re-check the same index so runs of three or more repeats collapse fully
}
}
}
private static bool ComparePatterns(List<string> pattern, List<string> candidate)
{
for (int i = 0; i < pattern.Count; i++)
{
if (pattern[i] != candidate[i])
return false;
}
return true;
}
}
Edit: fixed the initial values to match the question's values.

I would dump them all into your favorite Set implementation.
EDIT: Now that I understand the question, your original solution looks like the best way to do this. Just loop through the array once, keeping an array of flags to mark which elements to keep, plus a counter to keep track of the size of the new array. Then loop through again to copy all the keepers to a new array.
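A minimal sketch of that flag-and-copy idea, assuming "keep" means "differs from the element just before it". Under that assumption it only collapses runs of one repeated element, not longer repeated patterns such as c d c d. The names RunCollapser and CollapseRuns are made up for illustration.

```csharp
public static class RunCollapser
{
    // Hypothetical helper: collapses runs of identical adjacent strings.
    public static string[] CollapseRuns(string[] input)
    {
        var keep = new bool[input.Length];
        int kept = 0;

        // Pass 1: flag every element that differs from its predecessor
        // and count how many elements survive.
        for (int i = 0; i < input.Length; i++)
        {
            keep[i] = i == 0 || input[i] != input[i - 1];
            if (keep[i]) kept++;
        }

        // Pass 2: copy the flagged elements into a right-sized array.
        var result = new string[kept];
        int next = 0;
        for (int i = 0; i < input.Length; i++)
        {
            if (keep[i]) result[next++] = input[i];
        }
        return result;
    }
}
```

For the question's input (a a b c c a c d c d) this would leave a b c a c d c d, so the trailing c d pair still needs a pattern-length pass like the one in the longer answer.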

I agree that if you can just dump the strings into a Set, then that might be the easiest solution.
If you don't have access to a Set implementation for some reason, I would just sort the strings alphabetically and then go through once and remove the duplicates. How to sort them and remove duplicates from the list will depend on what language and environment you are running your code.
EDIT: Oh, ick.... I see based on your clarification that you expect that patterns might occur even over separate lines. My approach won't solve your problem. Sorry. Here is a question for you: if I had the following file,
a
a
b
c
c
a
a
b
c
c
would you expect it to simplify to the following?
a
b
c

Related

How could I convert a sequence of Listnode into 2 separate lists

I have a sequence of listnode objects
list -> [1] -> [2] -> [3] -> [4] /
and I need to convert it into 2 separate lists.
list -> [4] -> [2] /
list2 -> [3] -> [1] /
I'm not even sure where I'd begin with this one. I've been playing around with my_list.AddLast() and my_list.Remove() but am not sure what I'd do in order to create that one list into 2, and then move the numbers around as indicated.

Here's an answer based on my comment above. I take your input, convert it to an array, and then iterate over the array backwards (from the end to the beginning). I put the items at odd indexes into one list and the items at even indexes into another (the parity is that of the zero-based index, not of the value). Then I return the two lists as a tuple:
private (List<T>, List<T>) SplitList<T>(IEnumerable<T> input)
{
var asArray = input.ToArray();
var evens = new List<T>();
var odds = new List<T>();
for (var i = asArray.Length - 1; i >= 0; --i)
{
if (i % 2 == 0) //if even
{
evens.Add(asArray[i]);
}
else
{
odds.Add(asArray[i]);
}
}
return (evens, odds);
}
Once that's done, it's easily callable. This will work with your 4-valued example, but it will also work with any number of things of any type. I'm using a range of integers to make my point clear, but it should work with just about anything.
var oneToFour = Enumerable.Range(1, 4);
var result = SplitList(oneToFour);
var oneToFifteen = Enumerable.Range(1, 15);
var other = SplitList(oneToFifteen);
Some people may say "Why not just index over the list backwards?" Yes, List<T> is implemented internally with an array, so indexing it is O(1), just like an array. But the method accepts any IEnumerable<T>, and a sequence in general (a linked list, for example) is O(N) to index if it can be indexed at all, so I copy the input to an array first rather than rely on that implementation detail.
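If the input really is a LinkedList<T> (an assumption, based on the AddLast() call mentioned in the question), a rough alternative is to walk from the tail through each node's Previous link, which avoids the array copy. The names LinkedListSplitter and SplitBackwards are hypothetical. Note that this sketch alternates starting from the last node, so for odd-length input it distributes items differently from the index-parity version above.

```csharp
using System.Collections.Generic;

public static class LinkedListSplitter
{
    // Hypothetical variant for LinkedList<T>: walks from the tail via Previous,
    // sending nodes alternately to the first and second result list.
    public static (LinkedList<T> First, LinkedList<T> Second) SplitBackwards<T>(LinkedList<T> input)
    {
        var first = new LinkedList<T>();
        var second = new LinkedList<T>();
        bool toFirst = true;

        for (LinkedListNode<T> node = input.Last; node != null; node = node.Previous)
        {
            (toFirst ? first : second).AddLast(node.Value);
            toFirst = !toFirst;
        }
        return (first, second);
    }
}
```

With the question's list 1, 2, 3, 4 the first result holds 4, 2 and the second holds 3, 1.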

How to group array of char/string with UNION?

I have a two dimensional array of char, called Letters[ ][ ]
Letters[0][0] = A
[0][1] = B
Letters[1][0] = C
[1][1] = D
Letters[2][0] = B
[2][1] = A
[2][2] = F
Letters[3][0] = I
[3][1] = F
[3][2] = J
I need to group it, so it will be something like this:
group[0] [0] = A
group[0] [1] = B
group[0] [2] = F
group[0] [3] = I
group[0] [4] = J
group[1] [0] = C
group[1] [1] = D
My logic so far is to check every element against the other elements. If two arrays share a letter, they are grouped together, along with all their other elements, with no duplicated elements. But I'm not sure whether to use C# LINQ Union or just standard array access.
What is the best way to group it? Or are there any other solutions for this?

I think a pure LINQ solution would be overly complex. This isn't (if I understand your specification correctly) a simple union operation. You want to union based on non-empty intersections. That would mean having to first rearrange the data so LINQ can do a join, to find the data that matches, and since LINQ will only join on equality, doing that while preserving the original grouping information is going to result in syntax that would be more trouble than it's worth, IMHO.
Here is a non-LINQ approach that works for the example you've given:
static void Main(string[] args)
{
char[][] letters =
{
new [] { 'A', 'B' },
new [] { 'C', 'D' },
new [] { 'B', 'A', 'F' },
new [] { 'I', 'F', 'J' },
};
List<HashSet<char>> sets = new List<HashSet<char>>();
foreach (char[] row in letters)
{
List<int> setIndexes = Enumerable.Range(0, sets.Count)
.Where(i => row.Any(ch => sets[i].Contains(ch))).ToList();
CoalesceSets(sets, row, setIndexes);
}
foreach (HashSet<char> set in sets)
{
Console.WriteLine("{ " + string.Join(", ", set) + " }");
}
}
private static void CoalesceSets(List<HashSet<char>> sets, char[] row, List<int> setIndexes)
{
if (setIndexes.Count == 0)
{
sets.Add(new HashSet<char>(row));
}
else
{
HashSet<char> targetSet = sets[setIndexes[0]];
targetSet.UnionWith(row);
for (int i = setIndexes.Count - 1; i >= 1; i--)
{
targetSet.UnionWith(sets[setIndexes[i]]);
sets.RemoveAt(setIndexes[i]);
}
}
}
It builds up sets of the input data by scanning the previously identified sets to find which ones the current row of data intersects with, and then coalesces these sets into a single set containing all of the members (your specification appears to impose transitive membership…i.e. if one letter joins sets A and B, and a different letter joins set B and C, you want A, B, and C all joined into a single set).
This isn't an optimal solution, but it's readable. You could avoid the O(N^2) search by maintaining a Dictionary<char, int> to map each character to the set which contains it. Then instead of scanning all the sets, it's a simple lookup for each character in the current row, to build up the list of set indexes. But there's a lot more "housekeeping" code going that approach; I would not bother implementing it that way unless you find a proven performance issue doing it the more basic way.
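For what it's worth, here is a rough sketch of the lookup idea. It is a variation, not the exact scheme described: instead of a Dictionary<char, int> of set indexes, it maps each character straight to the HashSet<char> that currently holds it, which sidesteps re-indexing when sets are removed. When two sets meet, the smaller one is merged into the larger. LetterGrouper and Group are made-up names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LetterGrouper
{
    public static List<HashSet<char>> Group(char[][] rows)
    {
        // Maps every character seen so far to the set that currently contains it.
        var owner = new Dictionary<char, HashSet<char>>();

        foreach (char[] row in rows)
        {
            var target = new HashSet<char>();
            foreach (char ch in row)
            {
                if (owner.TryGetValue(ch, out HashSet<char> existing) && !ReferenceEquals(existing, target))
                {
                    // Keep the larger set as the merge target.
                    if (existing.Count > target.Count)
                    {
                        (existing, target) = (target, existing);
                    }
                    // Repoint the smaller set's members, then absorb them.
                    foreach (char moved in existing) owner[moved] = target;
                    target.UnionWith(existing);
                }
                target.Add(ch);
                owner[ch] = target;
            }
        }

        // HashSet<char> compares by reference here, so Distinct yields one entry per group.
        return owner.Values.Distinct().ToList();
    }
}
```

The order of the returned groups follows dictionary enumeration, so it should not be relied on.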
By the way: I have a vague recollection I've seen this type of question before on Stack Overflow, i.e. this sort of transitive unioning of sets. I looked for the question but couldn't find it. You may have more luck, and may find there is additional helpful information with that question and its answers.

Loading data from text file into a dictionary

I have a file consisting of a list of text which looks as follows:
ABC Abbey something
ABD Aasdasd
This is the text file
The first string will always have a length of 3. So I want to loop through the file content and store those first 3 letters as the key and the remainder as the value. I am removing the white space between them and using Substring as follows to store them. The key works out fine, but the line where I store the value throws an ArgumentOutOfRangeException.
This is the exact code causing the problem.
line.Substring(4, line.Length)
If I call Substring with 0 and line.Length it works fine. As soon as I call it with a start index of 1 or higher and line.Length, I get the error. I honestly don't get it and have been at it for hours. Some assistance please.
class Program {
static string line;
static Dictionary<string, string> stations = new Dictionary<string, string>();
static void Main(string[] args) {
var lines = File.ReadLines("C:\\Users\\username\\Desktop\\a.txt");
foreach (var l in lines) {
line = l.Replace("\t", "");
stations.Add(line.Substring(0, 3), line.Substring(4, line.Length));//error caused by this line
}
foreach(KeyValuePair<string, string> item in stations) {
//Console.WriteLine(item.Key);
Console.WriteLine(item.Value);
}
Console.ReadLine();
}
}

This is because the documentation specifies it will throw an ArgumentOutOfRangeException if:
startIndex plus length indicates a position not within this instance.
With the signature:
public string Substring(int startIndex, int length)
Since you use line.Length, you know that startIndex plus length will be 4+line.Length which is definitely not a position of this instance.
I recommend using the one parameter version:
public string Substring(int startIndex)
Thus line.Substring(3) (credit to @adv12 for spotting that), since here you only need to provide the startIndex. Of course you can use line.Substring(3, line.Length - 3), but as always, it's better to let the library do the arithmetic, since libraries are made to make programs fool-proof (this is not meant to be offensive; the point is simply to reduce the brain cycles spent on this task). Mind, however, that it can still throw an error if:
startIndex is less than zero or greater than the length of this instance.
So it's better to check that 3 is less than or equal to line.Length...
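A minimal sketch of that check, assuming a fixed three-character key as in the question. StationLine, TrySplit and KeyLength are hypothetical names, and the Trim() call is an assumption that covers the case where the separator has not already been stripped from the line.

```csharp
public static class StationLine
{
    private const int KeyLength = 3; // key width taken from the question

    // Hypothetical helper: returns false instead of throwing when the line is too short.
    public static bool TrySplit(string line, out string key, out string value)
    {
        key = null;
        value = null;
        if (line == null || line.Length < KeyLength)
        {
            return false;
        }

        key = line.Substring(0, KeyLength);
        // One-argument overload: everything from startIndex to the end of the string.
        value = line.Substring(KeyLength).Trim();
        return true;
    }
}
```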
Additional advice
Perhaps you should take a look to regex capturing. Now each key in your file contains three characters. But it is possible that in the (near) future four characters will be possible. Using regex capture, you could specify a pattern such that it is less likely that errors will occur during parsing.
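As a sketch of what regex capturing could look like here: the pattern below assumes the key is a run of non-whitespace characters, followed by whitespace, followed by the value. That format is a guess from the sample lines, and StationParser and TryParse are illustrative names.

```csharp
using System.Text.RegularExpressions;

public static class StationParser
{
    // Assumed line format: key (no whitespace), whitespace, then the rest as the value.
    private static readonly Regex LinePattern = new Regex(@"^(\S+)\s+(.+)$");

    public static bool TryParse(string line, out string key, out string value)
    {
        Match m = LinePattern.Match(line);
        key = m.Success ? m.Groups[1].Value : null;   // first capture group: the key
        value = m.Success ? m.Groups[2].Value : null; // second capture group: the value
        return m.Success;
    }
}
```

Because the key length is not hard-coded, a four-character key would parse without any change to the code.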

You need to actually get less than the length of total line:
line.Substring(4, line.Length - 4) //subtract the chars which you're skipping
Your string:
ABC Abbey something
Length = 19
Start = 4
Remaining chars = 19 - 4 = 15 //and you are expecting 19, that is the error

I know this is a late answer that doesn't address what's wrong with your code, but I feel that has already been done by other people. Instead, I have a different way to make the dictionary that doesn't involve Substring at all, so it's a little more robust, IMHO.
As long as you can guarantee that the two values are always separated by a tab, this will work even if there are more or fewer characters in the key. It uses LINQ, which is available from .NET 3.5 onward.
// LINQ
using System.Linq;
// Creates a string[][] array with the list of keys in the first array position
// and the values in the second
var lines = File.ReadAllLines(@"path/to/file.txt")
.Select(s => s.Split('\t'))
.ToArray();
// Your dictionary
Dictionary<string, string> stations = new Dictionary<string, string>();
// Loop through the array and add the key/value pairs to the dictionary
for (int i = 0; i < lines.Length; i++)
{
// For example lines[i][0] = ABW, lines[i][1] = Abbey Wood
stations[lines[i][0]] = lines[i][1];
}
// Prove it works
foreach (KeyValuePair<string, string> entry in stations)
{
MessageBox.Show(entry.Key + " - " + entry.Value);
}
Hope this makes sense and gives you an alternate to consider ;-)

Find intersection of two multi-dimensional Arrays in C# 4.0

Trying to find a solution to my ranking problem.
Basically I have two multi-dimensional double[,] arrays. Both containing rankings for certain scenarios, so [rank number, scenario number]. More than one scenario can have the same rank.
I want to generate a third multi-dimensional array, taking the intersections of the previous two multi-dimensional arrays to provide a joint ranking.
Does anyone have an idea how I can do this in C#?
Many thanks for any advice or help you can provide!
Edit:
Thank you for all the responses, sorry I should have included an example.
Here it is:
Array One:
[{0,4},{1,0},{1,2},{2,1},{3,5},{4,3}]
Array Two:
[{0,1},{0,4},{1,0},{1,2},{3,5},{4,3}]
Required Result:
[{0,4},{1,0},{1,2},{1,1},{2,5},{3,3}]

Here's some sample code that makes a bunch of assumptions but might be something like what you are looking for. I've added a few comments as well:
static double[,] Intersect(double[,] a1, double[,] a2)
{
// Assumptions:
// a1 and a2 are two-dimensional arrays of the same size
// An element in the array matches if and only if its value is found in the same location in both arrays
// result will contain not-a-number (NaN) for non-matches
double[,] result = new double[a1.GetLength(0), a1.GetLength(1)];
for (int i = 0; i < a1.GetLength(0); i++)
{
for (int j = 0; j < a1.GetLength(1); j++)
{
if (a1[i, j] == a2[i, j])
{
result[i, j] = a1[i, j];
}
else
{
result[i, j] = double.NaN;
}
}
}
return result;
}
For the most part, finding the intersection of multidimensional arrays will involve iterating over the elements in each of the dimensions in the arrays. If the indices of the array are not part of the match criteria (that is, if the second assumption in my code is removed), you would have to walk each dimension in each array, which increases the run-time of the algorithm (in this case, from O(n^2) to O(n^4)).
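If position really does not matter, one way to avoid the nested walk is to swap in a hash set: collect the values of the second array once, then test each element of the first against it. This is a different technique from the loop above, sketched under the assumption that exact double equality is an acceptable match test. ArrayIntersection and IntersectByValue are hypothetical names.

```csharp
using System.Collections.Generic;

public static class ArrayIntersection
{
    // Keeps a1's values that occur anywhere in a2; non-matches become NaN.
    public static double[,] IntersectByValue(double[,] a1, double[,] a2)
    {
        var seen = new HashSet<double>();
        foreach (double v in a2) seen.Add(v); // foreach visits every element of a double[,]

        int rows = a1.GetLength(0), cols = a1.GetLength(1);
        var result = new double[rows, cols];
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                result[i, j] = seen.Contains(a1[i, j]) ? a1[i, j] : double.NaN;
            }
        }
        return result;
    }
}
```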
If you care enough about run-time, I believe array matching is one of the typical examples of dynamic programming (DP) optimization; which you can read up on at your leisure.
I'm not sure how you wanted your results...you could probably return a flat collection of results that can be indexed by a pair, which would potentially save a lot of space if the expected result set is typically small. I went with a third fixed-sized array because it was the easiest thing to do.
Lastly, I'll mention that I don't see a keen C# way of doing this using IEnumerable, LINQ, or something like that. Someone more C# knowledgeable than I can chime in anytime now....

Given the additional information, I'd argue that you aren't actually working with multidimensional arrays, but instead are working with a collection of pairs. The pair is a pair of doubles. I think the following should work nicely:
public class Pair : IEquatable<Pair>
{
public double Rank;
public double Scenario;
public bool Equals(Pair p)
{
return Rank == p.Rank && Scenario == p.Scenario;
}
public override int GetHashCode()
{
int hashRank= Rank.GetHashCode();
int hashScenario = Scenario.GetHashCode();
return hashRank ^ hashScenario;
}
}
You can then use the Intersect operator on IEnumerable:
List<Pair> one = new List<Pair>();
List<Pair> two = new List<Pair>();
// ... populate the lists
List<Pair> result = one.Intersect(two).ToList();
Check out the following msdn article on Enumerable.Intersect() for more information:
http://msdn.microsoft.com/en-us/library/bb910215%28v=vs.90%29.aspx
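On a newer compiler (C# 7 or later, which is an assumption given the question targets C# 4.0), value tuples already compare element by element, so a sketch of the same idea needs no custom class. RankingDemo and Common are illustrative names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RankingDemo
{
    public static List<(double Rank, double Scenario)> Common(
        IEnumerable<(double Rank, double Scenario)> one,
        IEnumerable<(double Rank, double Scenario)> two)
    {
        // Value tuples implement equality and hashing over their elements,
        // so Intersect works without a custom comparer.
        return one.Intersect(two).ToList();
    }
}
```

Applied to the two arrays from the question, this yields only the pairs present in both: {0,4}, {1,0}, {1,2}, {3,5}, {4,3}. It does not attempt the re-ranking shown in the question's required result.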

C# - Using LINQ to take two variables into a 2-dimensional array?

I have a List<> of a "region" class with two variables, "startLocation" and "endLocation".
I'd like to combine those two into a new sorted two-dimensional array where each row is just a location and an integer representing whether it's a start or an end.
For example, if the list has three region objects with
[Region 1] : startLocation = 5,
endLocation = 7
[Region 2] : startLocation = 3,
endLocation = 5
[Region 3] : startLocation = 8,
endLocation = 9
I'd like to get a sorted two dimensional array (or list or similar) looking like:
[3] [1]
[5] [1]
[5] [-1]
[7] [-1]
[8] [1]
[9] [-1]
(Preferably I'd like the overlaps to add their second values together, so the two separate 5's in the array would be combined into [5 0]... but that's not too important.)
I'm currently using a regular for loop, going through the regions one by one and adding them to a list one at a time. This implementation is quite slow because I'm working with large datasets, and I'm guessing there's a more elegant / faster way to accomplish this through LINQ.
Any suggestions would be much appreciated.

You'll need to define a helper method which splits a region into 2 parts, and it's much easier to represent this using a new struct than a 2D array:
struct Data {
public int Value;
public bool IsStart;
}
// Extension methods must live in a static class; the class name here is arbitrary.
public static class RegionExtensions {
public static IEnumerable<Data> Split(this Region region) {
yield return new Data() { Value = region.StartLocation, IsStart=true};
yield return new Data() { Value = region.EndLocation, IsStart=false};
}
}
Then you can use the following LINQ query to break them up and sort them.
List<Region> list = GetTheList();
var query = list
.SelectMany(x => x.Split())
.OrderBy(x => x.Value);
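The question also asks, as a nice-to-have, for entries at the same location to be summed (so the two 5's become [5 0]). A rough sketch of that using GroupBy follows. The Region class here is a hypothetical stand-in for the asker's type, and RegionEvents and Combine are made-up names; it uses +1 for a start and -1 for an end, as in the question's example.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the asker's class; only the two properties matter here.
public class Region
{
    public int StartLocation { get; set; }
    public int EndLocation { get; set; }
}

public static class RegionEvents
{
    // Returns (location, delta) pairs sorted by location, where delta sums
    // +1 for each region starting there and -1 for each region ending there.
    public static List<(int Location, int Delta)> Combine(IEnumerable<Region> regions)
    {
        return regions
            .SelectMany(r => new[]
            {
                (Location: r.StartLocation, Delta: 1),
                (Location: r.EndLocation, Delta: -1),
            })
            .GroupBy(e => e.Location)
            .Select(g => (Location: g.Key, Delta: g.Sum(e => e.Delta)))
            .OrderBy(e => e.Location)
            .ToList();
    }
}
```

For the three regions in the question this gives (3, 1), (5, 0), (7, -1), (8, 1), (9, -1).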

This isn't a problem that's suitable for LINQ as anything other than an intellectual exercise. A foreach loop will be just as fast as (actually, likely faster than) any cobbled-together LINQ implementation.
As a side note, I'm assuming that you're using foreach rather than for. If not, and your collection is not array-backed (a linked list, say, or anything whose indexer is not O(1)), you could significantly speed up your process by switching to the foreach loop.
foreach(Region r in regionList)
{
// add your entries using r
}
can be much faster in that case than..
for(int i = 0; i < regionList.Count; i++)
{
// add your entries using the indexer
}
For a List<T> or an array, the two loops perform about the same.
