Parsing delimited data for specific instance of repeated line - c#

I have an array of strings in the following format, where each string begins with a series of three characters indicating what type of data it contains. For example:
ABC|.....
DEF|...
RHG|1........
RHG|2........
RHG|3........
XDF|......
I want to find any repeating lines (RHG in this example) and mark the last line with a special character:
>RHG|3.........
What's the best way to do this? My current solution has a method to count the line headers and create a dictionary with the header counts.
protected Dictionary<string, int> CountHeaders(string[] lines)
{
Dictionary<string, int> headerCounts = new Dictionary<string, int>();
for (int i = 0; i < lines.Length; i++)
{
string s = lines[i].Substring(0, 3);
int value;
if (headerCounts.TryGetValue(s, out value))
headerCounts[s]++;
else
headerCounts.Add(s, 1);
}
return headerCounts;
}
In the main parsing method, I select the lines that are repeated.
var repeats = CountHeaders(lines).Where(x => x.Value > 1).Select(x => x.Key);
foreach (string s in repeats)
{
// Get last instance of line in lines and mark it
}
This is as far as I've gotten. I think I can do what I want with another LINQ query but I'm not too sure. Also, I can't help but feel that there's a more optimal solution.

You can use LINQ to achieve that.
Input string:
var input = @"ABC|.....
DEF|...
RHG|1........
RHG|2........
RHG|3........
XDF|......";
LINQ query:
var results = input.Split(new[] { Environment.NewLine }, StringSplitOptions.None)
.GroupBy(x => x.Substring(0, 3))
.Select(g => g.ToList())
.SelectMany(g => g.Count > 1 ? g.Take(g.Count - 1).Concat(new[] { string.Format(">{0}", g[g.Count - 1]) }) : g)
.ToArray();
I used the Select(g => g.ToList()) projection to make g.Count an O(1) operation in the later query steps.
You can Join result array into one string using String.Join method:
var output = String.Join(Environment.NewLine, results);
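Note that GroupBy yields groups in order of first appearance, so the query above moves lines together when a repeated header is not contiguous. If the original order has to be kept, a single pass that records the count and last index per header is one alternative. This is a minimal sketch, assuming every line is at least three characters long; LineMarker and MarkLastRepeats are hypothetical names, not from the original post:

```csharp
using System;
using System.Collections.Generic;

public static class LineMarker
{
    // Returns a copy of lines in which the last line of every repeated
    // three-character header is prefixed with '>'. Line order is unchanged.
    public static string[] MarkLastRepeats(string[] lines)
    {
        // header -> (number of occurrences, index of the last occurrence)
        var seen = new Dictionary<string, (int Count, int LastIndex)>();
        for (int i = 0; i < lines.Length; i++)
        {
            string header = lines[i].Substring(0, 3);
            seen.TryGetValue(header, out var entry); // default (0, 0) when absent
            seen[header] = (entry.Count + 1, i);
        }

        var result = (string[])lines.Clone();
        foreach (var entry in seen.Values)
        {
            if (entry.Count > 1)
                result[entry.LastIndex] = ">" + result[entry.LastIndex];
        }
        return result;
    }
}
```

Calling LineMarker.MarkLastRepeats(lines) leaves the input array untouched and returns the marked copy.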

Alternatively, you could find repeating lines with a backreferencing regex. I wrote this hacky regex using your sample data; it matches a run of consecutive lines of pipe-separated values that all start with the same 'tag'.
^(?<Tag>.+)[|].+[\n\r](\k<Tag>[|].+[\n\r])+
The match range starts at the beginning of the first RHG line and selects up to the last RHG line.
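One way this pattern might be applied is sketched below. It assumes RegexOptions.Multiline so that ^ matches at each line start, and it relies on the unnamed group being group 1 (in .NET, unnamed groups are numbered before named ones). RegexMarker and MarkLastOfEachRun are hypothetical names. Because the pattern requires a line break after every line, a run that ends the input without a trailing newline is not handled correctly:

```csharp
using System.Text.RegularExpressions;

public static class RegexMarker
{
    // Inserts '>' before the last line of each run of consecutive lines sharing a tag.
    public static string MarkLastOfEachRun(string input)
    {
        var pattern = new Regex(
            @"^(?<Tag>.+)[|].+[\n\r](\k<Tag>[|].+[\n\r])+",
            RegexOptions.Multiline);

        return pattern.Replace(input, m =>
        {
            // Group 1 is captured once per repeated line; take the final capture.
            CaptureCollection repeats = m.Groups[1].Captures;
            int offset = repeats[repeats.Count - 1].Index - m.Index;
            return m.Value.Insert(offset, ">");
        });
    }
}
```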

Here's an example that includes the parsing and the counting in one LINQ statement; feel free to break it up if you want to:
string[] data = new string[]
{
"ABC|.....",
"DEF|...",
"RHG|1........",
"RHG|2........",
"RHG|3........",
"XDF|......"
};
var duplicates = data.Select(d => d.Split('|')) // split the strings
.Select(d => new { Key = d[0], Value = d[1] }) // select the key and value
.GroupBy(d => d.Key) // group by the key
.Where(g => g.Count() > 1) // find duplicates
.Select(d => d.Skip(1)) // select the repeating elements (all but the first)
.SelectMany(g => g); // flatten into a single list
This will give you a list of key/value pairs that are duplicates. With the sample data it will return:
Key Value
RHG 2........
RHG 3........
I'm not sure what you mean by "marking" the line, however...

Related

Get Elements from String List in order of Occurrence in provided string

Hi I have List of strings as below.
List<string> MyList = new List<string> { "[FirstName]", "[LastName]", "[VoicePhoneNumber]", "[SMSPhoneNumber]" };
I need to get all the elements from the list that exist in a string, in the order they occur. For example, my string is
string MessageContent = "Hello [LastName] [FirstName]There, this message is for [SMSPhoneNumber]";
Right now I am doing
var Exists = MyList.Where(MessageContent.Contains);
This new list has all the items from MyList that occurred in the MessageContent string, but not in order.
How can I get them in their order of occurrence in the string?
Desired List as per example is = { "[LastName]","[FirstName]","[SMSPhoneNumber]" }
I would suggest using IndexOf to determine position (and thereby order) as well as existence to avoid searching MessageContent twice at the expense of sorting the answer:
var ans = MyList.Select(w => new { w, pos = MessageContent.IndexOf(w) })
.Where(wp => wp.pos >= 0)
.OrderBy(wp => wp.pos)
.Select(wp => wp.w)
.ToList();
However, if a field may appear more than once, or if you think a single scan of MessageContent that avoids the sort is faster than one IndexOf per MyList member (probably not), then you can invert the search (using Span to avoid generating lots of strings):
var ans2 = Enumerable.Range(0, MessageContent.Length - MyList.Select(w => w.Length).Min() + 1)
.Select(p => MyList.FirstOrDefault(w => MessageContent.AsSpan().Slice(p).StartsWith(w)))
.Where(w => w != null)
.ToList();
I did it using:
var Exists = MyList.Where(MessageContent.Contains).OrderBy(s => MessageContent.IndexOf(s));

Get the matching index of a value in a list

So I've got the following code:
string matchingName = "Bob";
List<string> names = GetAllNames();
if (names.Contains(matchingName))
// Get the index/position in the list of names where Bob exists
Is it possible to do this with a couple of lines of code, rather than iterating through the list to get the index or position?
If you have multiple matching instances and want to get all the indices you can use this:
var result = Enumerable.Range(0, names.Count).Where(i => names[i] == matchingName);
If it is just one index you want, then this will work:
int result = names.IndexOf(matchingName);
If there is no matching instance in names, the former solution will yield an empty enumeration, while the latter will give -1.
var index = names.IndexOf(matchingName);
if (index != -1)
{
// do something with index
}
If you want to look for a single match, then IndexOf will suit your purposes.
If you want to look for multiple matches, consider:
var names = new List<string> {"Bob", "Sally", "Hello", "Bob"};
var bobIndexes = names.Select((value, index) => new {value, index})
.Where(z => z.value == "Bob")
.Select(z => z.index);
Console.WriteLine(string.Join(",", bobIndexes)); // this outputs 0,3
The use of (value, index) within Select gives you access to both the element and its index.

LINQ for getting all entries of a IEnumerable<string> which start with the same characters

I got an IEnumerable<string> and I want to gather all entries which start with the same characters.
For example:
Hans
Hannes
Gustav
Klaus
Herbert
Hanne
Now I want to find all entries where the first 2 characters are the same which would return Hans, Hannes, Hanne.
You just need to use .GroupBy
list.GroupBy(x=>x.Substring(0, n)).OrderByDescending(x=>x.Count()).First()
Where n is the number of char you want to compare.
Can also add a Where to filter any requirements you may have:
list.GroupBy(x=>x.Substring(0, n))
.Where(x=>x.Count() > 1)
.OrderByDescending(x=>x.Count())
.First()
Complete example:
var lst = new string[]
{
"Hans",
"Hannes",
"Gustav",
"Klaus",
"Herbert",
"Hanne"
};
var source = lst.GroupBy(x => x.Substring(0, 2)).OrderByDescending(x => x.Count()).First();
Console.WriteLine(source.Key);
Console.WriteLine(string.Join(",", source));

Find the index position of duplicate entries in a comma separated string

My problem just got more complicated than I thought and I've just wiped out my original question... So I'll probably post multiple questions depending on how I get on with this.
Anyway, back to the problem. I need to find the index position of duplicate entries in string that contains csv data. For example,
FirstName,LastName,Address,Address,Address,City,PostCode,PostCode, Country
As you can see, Address is duplicated, and I need to find the index of each duplicate, assuming the first index position is 0.
If you have a better suggestion on how to do this, let me know, but assuming it can be done, could we maybe have a Dictionary<string, List<int>>?
So if I had to code this, you would have:
duplicateIndexList.Add(2);
duplicateIndexList.Add(3);
duplicateIndexList.Add(4);
myDuplicateList.Add("Address", duplicateIndexList);
duplicateIndexList.Add(6);
duplicateIndexList.Add(7);
myDuplicateList.Add("PostCode", duplicateIndexList);
Obviously I don't want to do this but is it possible to achieve the above using Linq to do this? I could probably write a function that does this, but I love seeing how things can be done with Linq.
In case you're curious as to why I want to do this? Well, in short, I have an xml definition which is used to map csv fields to a database field and I want to first find out if there are any duplicate columns, I then want to append the relevant values from the actual csv row i.e. Address = Address(2) + Address(3) + Address(4), PostCode = PostCode(6) + PostCode(7)
The next part will be how to remove all the relevant values from the csv string defined above based on the indexes found once I have appended their actual values, but that will be the next part.
Thanks.
T.
UPDATE:
Here is the function that does what I want but as I said, linq would be nice. Note that in this function I'm using a list instead of the comma separated string as I haven't converted that list yet to a csv string.
Dictionary<string, List<int>> duplicateEntries = new Dictionary<string, List<int>>();
int indexPosition = 0;
foreach (string fieldName in Mapping.Select(m=>m.FieldName))
{
string key = fieldName.ToUpper();
if (duplicateEntries.ContainsKey(key))
{
List<int> indexes = duplicateEntries[key];
indexes.Add(indexPosition);
duplicateEntries[key] = indexes;
indexes = null;
}
else
{
duplicateEntries.Add(key, new List<int>() { indexPosition });
}
indexPosition += 1;
}
Maybe this will help clarify what I'm trying to achieve.
You need to do the following:
Split the string on the commas to get an array of column names.
Use .Select on the resulting array to project a new IEnumerable of objects that contains the index of the item in the array along with the value.
Use either ToLookup or GroupBy and ToDictionary to group the results by column value.
Seems like an ILookup<string, int> would be appropriate here:
var lookup = columnArray
.Select((c, i) => new { Value = c, Index = i })
.ToLookup(o => o.Value, o => o.Index);
List<int> addressIndexes = lookup["Address"].ToList(); // 2, 3, 4
Or if you wanted to create a Dictionary<string, List<int>>:
Dictionary<string, List<int>> dictionary = columnArray
.Select((c, i) => new { Value = c, Index = i })
.GroupBy(o => o.Value, o => o.Index)
.ToDictionary(grp => grp.Key, grp => grp.ToList());
List<int> addressIndexes = dictionary["Address"]; // 2, 3, 4
Edit
(in response to updated question)
This should work:
Dictionary<string, List<int>> duplicateEntries = Mapping
.Select((m, i) => new { Value = m.FieldName, Index = i })
.GroupBy(o => o.Value, o => o.Index)
.ToDictionary(grp => grp.Key, grp => grp.ToList());
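Once the index lists exist, the appending step described in the question (Address = Address(2) + Address(3) + Address(4)) could look roughly like the following. This is only a sketch: RowMerger, MergeDuplicates, and the space separator are assumptions, and the row is assumed to be already split into fields at the same positions as the header columns.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RowMerger
{
    // For each column name, joins the row values found at that column's indexes.
    public static Dictionary<string, string> MergeDuplicates(
        Dictionary<string, List<int>> columnIndexes, string[] row)
    {
        return columnIndexes.ToDictionary(
            pair => pair.Key,
            pair => string.Join(" ", pair.Value.Select(i => row[i])));
    }
}
```

Columns that appear only once pass through unchanged, since their index list has a single entry.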
You could do something like:
int count = 0;
var numbered_collection =
from line in File.ReadAllLines("your_csv_name.csv").Skip(1)
let parts = line.Split(',')
select new CarClass()
{
Id = count++,
First_Field = parts[0],
Second_Field = parts[1], // rinse and repeat
};
This gives you an Id per item (and also skips the first line, which has the header). You could put it in a method if you want to automatically map the names from the first csv line to the fields.
From there, you can do:
var duplicates = (from items in numbered_collection
group items by items.First_Field into g
select g)
.Where(g => g.Count() > 1);
Now you have all the groups where you actually have duplicates, and you can just get the 'Id' from the object to know which one is the duplicated.

Determine if string appears more than once in string array (C#)

I have an array of strings, e.g.
string [] letters = { "a", "a", "b", "c" };
I need to find a way to determine if any string in the array appears more than once.
I thought the best way is to make a new string-array without the string in question and to use Contains,
foreach (string letter in letters)
{
string [] otherLetters = //?
if (otherLetters.Contains(letter))
{
//etc.
}
}
but I cannot figure out how.
If anyone has a solution for this or a better approach, please answer.
The easiest way is to use GroupBy:
var lettersWithMultipleOccurences = letters.GroupBy(x => x)
.Where(g => g.Count() > 1)
.Select(g => g.Key);
This will first group your array using the letters as keys. It then returns only those groups with multiple entries and returns the key of these groups. As a result, you will have an IEnumerable<string> containing all letters that occur more than once in the original array. In your sample, this is only "a".
Beware: because LINQ uses deferred execution, enumerating lettersWithMultipleOccurences multiple times will perform the grouping and filtering multiple times. To avoid this, call ToList() on the result:
var lettersWithMultipleOccurences = letters.GroupBy(x => x)
.Where(g => g.Count() > 1)
.Select(g => g.Key)
.ToList();
lettersWithMultipleOccurences will now be of type List<string>.
You can use the LINQ extension methods:
if (letters.Distinct().Count() == letters.Count()) {
// no duplicates
}
Enumerable.Distinct removes duplicates. Thus, letters.Distinct() would return three elements in your example.
Create a HashSet from the array and compare their sizes:
var set = new HashSet<string>(letters);
bool hasDoubleLetters = set.Count != letters.Length;
A HashSet will give you good performance:
HashSet<string> hs = new HashSet<string>();
foreach (string letter in letters)
{
if (hs.Contains(letter))
{
// etc.: this letter appears more than once
}
else
{
hs.Add(letter);
}
}
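The loop above can be shortened, because HashSet<T>.Add returns false when the item is already in the set. A minimal sketch; DuplicateCheck and HasDuplicates are hypothetical names chosen for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DuplicateCheck
{
    // True as soon as one string is seen a second time; stops at the first duplicate.
    public static bool HasDuplicates(IEnumerable<string> items)
    {
        var seen = new HashSet<string>();
        return items.Any(item => !seen.Add(item));
    }
}
```

Unlike the GroupBy approach, this only answers whether a duplicate exists and does not report which strings are repeated.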
