How to use .OrderBy for multiple conditions, one only sometimes used? - c#

Sorry if the title is confusing; I had some trouble putting my problem into words.
I have a List<string>, where every string is composed of 2 words, delimited by a space.
For example:
{ "word1 word2", "wordA wordB", "dog cat", "mouse cat" }
I want to use OrderBy to sort the list by the 2nd word, if any words are equal, I then want to sort those by the 1st word. I'm having trouble figuring out how to handle the 2nd condition for this (sorting by 1st word only if 2nd words are equal).
I originally tried:
public List<string> SpecialSort(List<string> text)
{
    return text.OrderBy(x => x.Split(' ')[1]).ThenBy(x => x.Split(' ')[0]).ToList();
}
but this seems to just sort first by the 2nd word, and then re-sort everything by the 1st word. Is there a way for me to do this where I only sort by 1st word if the 2nd words are equal?
Thanks!

My advice would be to split the text into words, while keeping the original text in a Select. Then sort the sequence and finally remove the split words.
Requirement
Input: a sequence of strings, every string has exactly one space.
This space is neither the first nor the last character.
The characters before this one and only space are defined as the first word.
The characters after the space are defined as the second word.
Output: Sort the sequence by 2nd word, then by 1st word.
IEnumerable<string> inputTexts = ...
const char splitChar = ' ';

// first add the split words
var sortedSequence = inputTexts.Select(txt => new
    {
        Original = txt,
        Split = txt.Split(splitChar),
    })
    // then sort by the split words
    .OrderBy(splitTxt => splitTxt.Split[1])
    .ThenBy(splitTxt => splitTxt.Split[0])
    // finally remove the split words
    .Select(splitTxt => splitTxt.Original);
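As a small self-contained sketch of the same idea: ThenBy only breaks ties left by the preceding OrderBy, so first words are compared only when the second words are equal. The names WordSort and SortBySecondThenFirst are made up for illustration, and ordinal comparison is an assumption, since the question does not say which comparison it wants.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSort
{
    // Hypothetical helper: sorts "first second" strings by second word, then by first word.
    // Assumes every string contains exactly one space that is neither first nor last.
    public static List<string> SortBySecondThenFirst(IEnumerable<string> texts)
    {
        return texts
            .Select(txt => new { Original = txt, Words = txt.Split(' ') })
            .OrderBy(x => x.Words[1], StringComparer.Ordinal)   // primary key: 2nd word
            .ThenBy(x => x.Words[0], StringComparer.Ordinal)    // tie-breaker: 1st word
            .Select(x => x.Original)
            .ToList();
    }
}

public static class WordSortDemo
{
    public static void Main()
    {
        var input = new[] { "word1 word2", "wordA wordB", "mouse cat", "dog cat" };
        foreach (var item in WordSort.SortBySecondThenFirst(input))
        {
            Console.WriteLine(item);
        }
        // Prints, one per line: dog cat, mouse cat, word1 word2, wordA wordB
    }
}
```

"mouse cat" and "dog cat" share the second word, so only those two are ordered by their first word; the other entries keep the order given by the second word.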

Creating intermediate results inside an .OrderBy() statement can be painful, because a comparer may be called several times for each object and you don't want to redo the splitting on every call. To keep the code maintainable, I would write a class that takes the original value and creates the desired elements, then feed these intermediate objects into a dedicated comparer that can sort them. At the end, just get the original value back out of the intermediate class and you're done.
A rough sketch for your example would look something like this:
using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
    private static void Main(string[] args)
    {
        var words = new List<string>{"word1 word2", "wordA wordB", "dog cat", "mouse cat"};
        var ordered = words
            .Select(SpecialComparerInstance.Create)
            .OrderBy(special => special, SpecialComparer.Default)
            .Select(special => special.Value);

        foreach (var item in ordered)
        {
            Console.WriteLine(item);
        }
    }
}

public class SpecialComparerInstance
{
    public static SpecialComparerInstance Create(string value) => new SpecialComparerInstance(value);

    public SpecialComparerInstance(string value)
    {
        if (string.IsNullOrEmpty(value))
            throw new ArgumentNullException(nameof(value));

        var elements = value.Split(' ');
        if (elements.Length != 2)
            throw new ArgumentException("Must contain exactly one space character", nameof(value));

        Value = value;
        FirstOrderValue = elements[1];
        SecondOrderValue = elements[0];
    }

    public string Value { get; }
    public string FirstOrderValue { get; }
    public string SecondOrderValue { get; }
}

public class SpecialComparer : IComparer<SpecialComparerInstance>
{
    public static readonly IComparer<SpecialComparerInstance> Default = new SpecialComparer(StringComparer.Ordinal);

    private readonly StringComparer _comparer;

    public SpecialComparer(StringComparer comparer)
    {
        _comparer = comparer;
    }

    public int Compare(SpecialComparerInstance x, SpecialComparerInstance y)
    {
        if (ReferenceEquals(x, y))
            return 0;
        if (ReferenceEquals(x, null))
            return 1;
        if (ReferenceEquals(y, null))
            return -1;

        var result = _comparer.Compare(x.FirstOrderValue, y.FirstOrderValue);
        if (result == 0)
            result = _comparer.Compare(x.SecondOrderValue, y.SecondOrderValue);
        return result;
    }
}

Related

how to merge two csv files with different columns and rows in c#

I'm trying to merge two csv files which have different headers and a different number of rows/lines.
I'm using the following code, but it doesn't produce the correct output. It works when both files have the same number of rows.
var first = File.ReadAllLines("firstfile.csv");
var second = File.ReadAllLines("secondfile.csv");
var result = first.Zip(second, (f, s) => string.Join(",", f, s));
File.WriteAllLines("combined.csv", result);
For example, the first file is:
col1,colb,colc
a,b,c
a,v,f
The second file is:
colx,coly
x,y
cc,aa
bb,vv
m,n
The output I get is:
col1,colb,colc,colx,coly
a,b,c,x,y
a,v,f,cc,aa
The remaining rows of the second file are missing.
My expected output is:
col1,colb,colc,colx,coly
a,b,c,x,y
a,v,f,cc,aa
,,,bb,vv
,,,m,n
There is no inbuilt method that allows you to merge two lists of unequal length. Zip only merges down to the shortest length. However, you can achieve what you want by modifying Marc Gravell's excellent answer here, in order to allow a default value. Create yourself an extensions class, something like this:
public static class Extensions
{
    public static IEnumerable<T> Merge<T>(this IEnumerable<T> first,
        IEnumerable<T> second, T defaultValue, Func<T, T, T> operation)
    {
        using (var iter1 = first.GetEnumerator())
        using (var iter2 = second.GetEnumerator())
        {
            while (iter1.MoveNext())
            {
                if (iter2.MoveNext())
                {
                    yield return operation(iter1.Current, iter2.Current);
                }
                else
                {
                    yield return operation(iter1.Current, defaultValue);
                }
            }
            while (iter2.MoveNext())
            {
                yield return operation(defaultValue, iter2.Current);
            }
        }
    }
}
You can now call it with code like this:
char separator = ',';
var first = File.ReadAllLines("firstfile.csv").AsEnumerable();
var second = File.ReadAllLines("secondfile.csv").AsEnumerable();
string defaultValue = "";
int cnt = 0;
// Count the columns of the shorter file; its rows are the ones that run out.
if (first.Count() < second.Count())
{
    cnt = first.FirstOrDefault().Split(separator).Length;
}
else
{
    cnt = second.FirstOrDefault().Split(separator).Length;
}
// An empty row with cnt columns consists of cnt - 1 separators.
defaultValue = defaultValue.PadLeft(cnt - 1, separator);
var result = first.Merge(second, defaultValue, (f, s) => string.Join(separator.ToString(), f, s));
File.WriteAllLines("combined.csv", result);
Note I have added a char separator and changed the result of ReadAllLines to give an IEnumerable<string> rather than string[] to make the code more generic. Also, the above code assumes that both files have an internally consistent number of columns.
First you need to find out which of the two lists is the larger one so you can loop over that one and once you're past the length of the smaller list you can fill up the missing cells with empty values.
Next you need to know how many columns you have in the smaller list as you want to fill these columns with empty values. That means you have to take the header line of the smaller list, split it by comma and count the columns.
Then generate a string containing your empty cells (e.g. if your smaller list has 3 columns, you need the string ",,"; string padding may be of help here).
So then you only have to loop over the larger list and get the two corresponding rows (or use the empty one you generated earlier) and concatenate them with a comma and put them in a list.
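The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: CsvMerge, MergeRows and EmptyRow are hypothetical names, each file is assumed to have a consistent column count, and fields are assumed not to contain quoted separators.

```csharp
using System;
using System.Collections.Generic;

public static class CsvMerge
{
    // Hypothetical helper: joins the rows of two CSV files side by side,
    // padding the shorter file with empty cells.
    public static List<string> MergeRows(IReadOnlyList<string> first,
        IReadOnlyList<string> second, char separator = ',')
    {
        string emptyFirst = EmptyRow(first, separator);
        string emptySecond = EmptyRow(second, separator);
        int rowCount = Math.Max(first.Count, second.Count);

        var merged = new List<string>(rowCount);
        for (int i = 0; i < rowCount; i++)
        {
            // Past the end of a file, fall back to its empty placeholder row.
            string left = i < first.Count ? first[i] : emptyFirst;
            string right = i < second.Count ? second[i] : emptySecond;
            merged.Add(left + separator + right);
        }
        return merged;
    }

    // An empty row with n columns is n - 1 separators; the header line gives n.
    private static string EmptyRow(IReadOnlyList<string> rows, char separator)
    {
        if (rows.Count == 0) return string.Empty;
        int columns = rows[0].Split(separator).Length;
        return new string(separator, columns - 1);
    }
}

public static class CsvMergeDemo
{
    public static void Main()
    {
        var first = new[] { "col1,colb,colc", "a,b,c", "a,v,f" };
        var second = new[] { "colx,coly", "x,y", "cc,aa", "bb,vv", "m,n" };
        foreach (string row in CsvMerge.MergeRows(first, second))
        {
            Console.WriteLine(row);
        }
        // Last two lines printed: ",,,bb,vv" and ",,,m,n"
    }
}
```

In real code the two arrays would come from File.ReadAllLines, as in the question.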

Custom List<string[]> Sort

I have a list of string[].
List<string[]> cardDataBase;
I need to sort that list by each list-item's second string value (item[1]) in custom order.
The custom order is a bit complicated, order by those starting characters:
"MW1"
"FW"
"DN"
"MWSTX1CK"
"MWSTX2FF"
then order by these letters following above starting letters:
"A"
"Q"
"J"
"C"
"E"
"I"
"A"
and then by the numbers following above.
A sample, with the unordered list on the left and the ordered list on the right:
MW1E10         MW1Q04
MWSTX2FFI06    MW1Q05
FWQ02          MW1E10
MW1Q04         MW1I06
MW1Q05         FWQ02
FWI01          FWI01
MWSTX2FFA01    DNC03
DNC03          MWSTX1CKC02
MWSTX1CKC02    MWSTX2FFI03
MWSTX2FFI03    MWSTX2FFI06
MW1I06         MWSTX2FFA01
I tried Linq but I am not that good in it right now and cannot solve this on my own. Do I need a dictionary, regex or a dictionary with regex in it? What would be the best approach?
I think you're approaching this incorrectly. You're not sorting strings, you're sorting structured objects that are misrepresented as strings (somebody aptly named this antipattern "stringly typed"). Your requirements show that you know this structure, yet it's not represented in the data structure List<string[]>, and that's making your life hard. You should parse that structure into a real type (struct or class), and then sort that.
enum PrefixCode { MW1, FW, DN, MWSTX1CK, MWSTX2FF, }
enum TheseLetters { Q, J, C, E, I, A, }

struct CardRecord : IComparable<CardRecord> {
    public readonly PrefixCode Code;
    public readonly TheseLetters Letter;
    public readonly uint Number;

    public CardRecord(string input) {
        Code = ParseEnum<PrefixCode>(ref input);
        Letter = ParseEnum<TheseLetters>(ref input);
        Number = uint.Parse(input);
    }

    static T ParseEnum<T>(ref string input) { //assumes non-overlapping prefixes
        foreach(T val in Enum.GetValues(typeof(T))) {
            if(input.StartsWith(val.ToString())) {
                input = input.Substring(val.ToString().Length);
                return val;
            }
        }
        throw new InvalidOperationException("Failed to parse: "+input);
    }

    public int CompareTo(CardRecord other) {
        var codeCmp = Code.CompareTo(other.Code);
        if (codeCmp!=0) return codeCmp;
        var letterCmp = Letter.CompareTo(other.Letter);
        if (letterCmp!=0) return letterCmp;
        return Number.CompareTo(other.Number);
    }

    public override string ToString() {
        return Code.ToString() + Letter + Number.ToString("00");
    }
}
A program using the above to process your example might then be:
static class Program {
    static void Main() {
        var inputStrings = new []{ "MW1E10", "MWSTX2FFI06", "FWQ02", "MW1Q04", "MW1Q05",
            "FWI01", "MWSTX2FFA01", "DNC03", "MWSTX1CKC02", "MWSTX2FFI03", "MW1I06" };
        var outputStrings = inputStrings
            .Select(s => new CardRecord(s))
            .OrderBy(c => c)
            .Select(c => c.ToString());
        Console.WriteLine(string.Join("\n", outputStrings));
    }
}
This generates the same ordering as in your example. In real code, I'd recommend you name the types according to what they represent, and not, for example, TheseLetters.
This solution - with a real parse step - is superior because it's almost certain that you'll want to do more with this data at some point, and this allows you to actually access the components of the data easily. Furthermore, it's comprehensible to a future maintainer since the reason behind the ordering is somewhat clear. By contrast, if you chose to do complex string-based processing it's often very hard to understand what's going on (especially if it's part of a larger program, and not a tiny example as here).
Making new types is cheap. If your method's return value doesn't quite "fit" in an existing type, just make a new one, even if that means 1000's of types.
This is a bit of spoon-feeding, but I found the question pretty interesting and perhaps it will be useful for others. I have also added some comments to explain:
void Main()
{
var cardDatabase = new List<string>{
"MW1E10",
"MWSTX2FFI06",
"FWQ02",
"MW1Q04",
"MW1Q05",
"FWI01",
"MWSTX2FFA01",
"DNC03",
"MWSTX1CKC02",
"MWSTX2FFI03",
"MW1I06",
};
var orderTable = new List<string>[]{
new List<string>
{
"MW1",
"FW",
"DN",
"MWSTX1CK",
"MWSTX2FF"
},
new List<string>
{
"Q",
"J",
"C",
"E",
"I",
"A"
}
};
var test = cardDatabase.Select(input => {
var r = Regex.Match(input, "^(MW1|FW|DN|MWSTX1CK|MWSTX2FF)(A|Q|J|C|E|I|A)([0-9]+)$");
if(!r.Success) throw new Exception("Invalid data!");
// for each input string,
// we are going to split it into "substrings",
// eg: MWSTX1CKC02 will be
// [MWSTX1CK, C, 02]
// after that, we use IndexOf on each component
// to calculate "real" order,
// note that thirdComponent(aka number component)
// does not need IndexOf because it is already representing the real order,
// we still want to convert string to integer though, because we don't like
// "string ordering" for numbers.
return new
{
input = input,
firstComponent = orderTable[0].IndexOf(r.Groups[1].Value),
secondComponent = orderTable[1].IndexOf(r.Groups[2].Value),
thirdComponent = int.Parse(r.Groups[3].Value)
};
// and after it's done,
// we start using LINQ OrderBy and ThenBy functions
// to have our custom sorting.
})
.OrderBy(calculatedInput => calculatedInput.firstComponent)
.ThenBy(calculatedInput => calculatedInput.secondComponent)
.ThenBy(calculatedInput => calculatedInput.thirdComponent)
.Select(calculatedInput => calculatedInput.input)
.ToList();
Console.WriteLine(string.Join("\n", test));
}
You can use the Array.Sort() method, where the first parameter is the array you're sorting and the second parameter contains the complicated logic that determines the order.
You can use the Enumerable.OrderBy extension method provided by the System.Linq namespace.
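To make the comparison-based suggestion concrete, here is a hedged sketch that sorts a List<string[]> by item[1] using List<T>.Sort with a Comparison<T>. CardSorter, Key and CompareRows are hypothetical names, the letter order Q, J, C, E, I, A is taken from the sample output in the question, and the row layout (an id in item[0]) is invented for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CardSorter
{
    private static readonly string[] Prefixes = { "MW1", "FW", "DN", "MWSTX1CK", "MWSTX2FF" };
    private static readonly string[] Letters = { "Q", "J", "C", "E", "I", "A" };
    private static readonly Regex Pattern =
        new Regex("^(MW1|FW|DN|MWSTX1CK|MWSTX2FF)([QJCEIA])([0-9]+)$");

    // Turns e.g. "MWSTX1CKC02" into (prefix rank, letter rank, number).
    public static Tuple<int, int, int> Key(string code)
    {
        Match m = Pattern.Match(code);
        if (!m.Success) throw new FormatException("Unrecognised card code: " + code);
        return Tuple.Create(
            Array.IndexOf(Prefixes, m.Groups[1].Value),
            Array.IndexOf(Letters, m.Groups[2].Value),
            int.Parse(m.Groups[3].Value));
    }

    // Usable as a Comparison<string[]>: orders rows by their second element.
    public static int CompareRows(string[] x, string[] y)
    {
        var kx = Key(x[1]);
        var ky = Key(y[1]);
        int c = kx.Item1.CompareTo(ky.Item1);
        if (c != 0) return c;
        c = kx.Item2.CompareTo(ky.Item2);
        if (c != 0) return c;
        return kx.Item3.CompareTo(ky.Item3);
    }
}

public static class CardSorterDemo
{
    public static void Main()
    {
        var cardDataBase = new List<string[]>
        {
            new[] { "1", "MWSTX2FFA01" },
            new[] { "2", "FWQ02" },
            new[] { "3", "MW1E10" },
            new[] { "4", "MW1Q04" },
        };
        cardDataBase.Sort(CardSorter.CompareRows);
        foreach (var row in cardDataBase)
        {
            Console.WriteLine(row[1]);
        }
        // Prints, one per line: MW1Q04, MW1E10, FWQ02, MWSTX2FFA01
    }
}
```

This version re-parses the code on every comparison, which is fine for small lists; for large ones it would be cheaper to compute each key once, as the other answers do.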

C#: Get an integer representation of a string array

I am new to C# and I ran into the following problem (I have looked for a solution here and on google but was not successful):
Given an array of strings (some columns can possibly be doubles or integers "in string format") I would like to convert this array to an integer array.
The question only concerns the columns with actual string values (say a list of countries).
Now I believe a Dictionary can help me to identify the unique values in a given column and associate an integer number to every country that appears.
Then to create my new array which should be of type int (or double) I could loop through the whole array and define the new array via the dictionary. This I would need to do for every column which has string values.
This seems inefficient, is there a better way?
In the end I would like to do multiple linear regression (or even fit a generalized linear model, meaning I want to get a design matrix eventually) with the data.
EDIT:
1) Sorry for being unclear, I will try to clarify:
Given:
MAKE;VALUE;GENDER
AUDI;40912.2;m
VW;3332;f
AUDI;1234.99;m
DACIA;0;m
AUDI;12354.2;m
AUDI;123;m
VW;21321.2;f
I want to get a "numerical" matrix with identifiers for the string-valued columns:
MAKE;VALUE;GENDER
1;40912.2;0
2;3332;1
1;1234.99;0
3;0;0
1;12354.2;0
1;123;0
2;21321.2;1
2) I think this is actually not what I need to solve my problem. Still it does seem like an interesting question.
3) Thank you for the responses so far.
This takes every string that represents an integer and puts it in a List.
You can do the same with strings which represent a double.
Is this what you mean?
List<int> myIntList = new List<int>();
foreach (string value in stringArray)
{
    int myInt;
    if (int.TryParse(value, out myInt))
    {
        myIntList.Add(myInt);
    }
}
Dictionary is good if you want to map each string to a key like this:
var myDictionary = new Dictionary<int,string>();
myDictionary.Add(1,"CountryOne");
myDictionary.Add(2,"CountryTwo");
myDictionary.Add(3,"CountryThree");
Then you can get your values like:
string myCountry = myDictionary[2];
But I'm still not sure if I'm helping you right now. Do you have some code to show what you mean?
I'm not sure if this is what you are looking for but it does output the result you are looking for, from which you can create an appropriate data structure to use. I use a list of string but you can use something else to hold the processed data. I can expand further, if needed.
It assumes that the number of "columns", based on the semicolon character, is equal throughout the data, and it is flexible enough to handle any number of columns. It's kind of ugly but it should get what you want.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

namespace ConsoleApplication3
{
    class StringColIndex
    {
        public int ColIndex { get; set; }
        public List<string> StringValues { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var StringRepresentationAsInt = new List<StringColIndex>();
            List<string> rawDataList = new List<string>();
            List<string> rawDataWithStringsAsIdsList = new List<string>();
            rawDataList.Add("AUDI;40912.2;m"); rawDataList.Add("VW;3332;f ");
            rawDataList.Add("AUDI;1234.99;m"); rawDataList.Add("DACIA;0;m");
            rawDataList.Add("AUDI;12354.2;m"); rawDataList.Add("AUDI;123;m");
            rawDataList.Add("VW;21321.2;f ");

            foreach (var rawData in rawDataList)
            {
                var split = rawData.Split(';');
                var cells = new List<string>();
                for (int i = 0; i < split.Length; i++)
                {
                    // Trim so that "f " and "f" count as the same value.
                    var txt = split[i].Trim();
                    double outValue;
                    // Invariant culture: "." is always the decimal separator.
                    var isNumeric = double.TryParse(txt, NumberStyles.Float,
                        CultureInfo.InvariantCulture, out outValue);
                    if (!isNumeric)
                    {
                        // Find (or create) the list of distinct strings for this column.
                        var obj = StringRepresentationAsInt
                            .FirstOrDefault(x => x.ColIndex == i);
                        if (obj == null)
                        {
                            obj = new StringColIndex { ColIndex = i,
                                StringValues = new List<string>() };
                            StringRepresentationAsInt.Add(obj);
                        }
                        if (!obj.StringValues.Contains(txt))
                        {
                            obj.StringValues.Add(txt);
                        }
                        // Replace the string by its 1-based id within the column.
                        cells.Add((obj.StringValues.IndexOf(txt) + 1).ToString());
                    }
                    else
                    {
                        cells.Add(txt);
                    }
                }
                rawDataWithStringsAsIdsList.Add(string.Join(";", cells));
            }

            rawDataWithStringsAsIdsList.ForEach(x => Console.WriteLine(x));
            Console.ReadLine();
            /*
            Output (ids are 1-based within each column, so GENDER
            is 1/2 rather than the 0/1 used in the question):
            1;40912.2;1
            2;3332;2
            1;1234.99;1
            3;0;1
            1;12354.2;1
            1;123;1
            2;21321.2;2
            */
        }
    }
}
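The dictionary idea from the question can also be sketched directly for a single column. This is only an illustration: CategoryEncoder and Encode are hypothetical names, and numbering ids from 1 in order of first appearance is an assumption based on the MAKE column in the example.

```csharp
using System;
using System.Collections.Generic;

public static class CategoryEncoder
{
    // Hypothetical helper: maps each distinct string to a 1-based id in order of
    // first appearance and returns the encoded column plus the lookup table.
    public static int[] Encode(IEnumerable<string> column, out Dictionary<string, int> ids)
    {
        ids = new Dictionary<string, int>();
        var codes = new List<int>();
        foreach (string value in column)
        {
            int id;
            if (!ids.TryGetValue(value, out id))
            {
                id = ids.Count + 1;   // next unused id
                ids.Add(value, id);
            }
            codes.Add(id);
        }
        return codes.ToArray();
    }
}

public static class CategoryEncoderDemo
{
    public static void Main()
    {
        var makes = new[] { "AUDI", "VW", "AUDI", "DACIA", "AUDI", "AUDI", "VW" };
        Dictionary<string, int> ids;
        int[] encoded = CategoryEncoder.Encode(makes, out ids);
        Console.WriteLine(string.Join(";", encoded));
        // Prints: 1;2;1;3;1;1;2
    }
}
```

Each value costs one dictionary lookup, so a column is encoded in a single pass.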

Finding duplicates in List<string>

In a list with some hundred thousand entries, how does one go about comparing each entry with the rest of the list for duplicates?
For example, a List<string> fileNames contains both "00012345.pdf" and "12345.pdf", which are considered duplicates. What is the best strategy for flagging this kind of duplicate?
Thanks
Update: The naming of files is restricted to numbers. They are padded with zeros. Duplicates are where the padding is missing. Thus, "123.pdf" & "000123.pdf" are duplicates.
You probably want to implement your own substring comparer to test equality based on whether a substring is contained within another string.
This isn't necessarily optimised, but it will work. You could also possibly consider using Parallel Linq if you are using .NET 4.0.
EDIT: Answer updated to reflect refined question after it was edited
void Main()
{
    List<string> stringList = new List<string> { "00012345.pdf","12345.pdf","notaduplicate.jpg","3453456363234.jpg"};
    IEqualityComparer<string> comparer = new NumericFilenameEqualityComparer();
    var duplicates = stringList.GroupBy(s => s, comparer).Where(grp => grp.Count() > 1);
    // do something with grouped duplicates...
}

// Not safe for nulls!
// NB: do your own parameter / null checks / string-case options etc.!
public class NumericFilenameEqualityComparer : IEqualityComparer<string>
{
    private static readonly Regex digitFilenameRegex = new Regex(@"\d+", RegexOptions.Compiled);

    public bool Equals(string left, string right)
    {
        Match leftDigitsMatch = digitFilenameRegex.Match(left);
        Match rightDigitsMatch = digitFilenameRegex.Match(right);
        // Names without any digits all map to long.MaxValue and so compare as equal.
        long leftValue = leftDigitsMatch.Success ? long.Parse(leftDigitsMatch.Value) : long.MaxValue;
        long rightValue = rightDigitsMatch.Success ? long.Parse(rightDigitsMatch.Value) : long.MaxValue;
        return leftValue == rightValue;
    }

    public int GetHashCode(string value)
    {
        // Returns the same hash for every string, so every pair falls through
        // to Equals: correct, but slow for large lists.
        return base.GetHashCode();
    }
}
I understand you are looking for duplicates in order to remove them?
One way to go about it could be the following:
Create a class MyString which takes care of the duplication rules. That is, it overrides Equals and GetHashCode to recreate exactly the duplication rules you are considering. (I understand from your question that 00012345.pdf and 12345.pdf should be considered duplicates?)
Make this class explicitly or implicitly convertible to string (or override ToString() for that matter).
Create a HashSet<MyString> and fill it up by iterating through your original List<String>, checking for duplicates.
It might be dirty but it will do the trick. The only "hard" part here is correctly implementing your duplication rules.
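A simplified sketch of that idea, assuming the rule from the question's update (purely numeric names, where duplicates differ only in zero padding). Instead of a wrapper class it stores a normalized string key in a HashSet<string>; DuplicateFinder, NormalizeName and FindDuplicates are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DuplicateFinder
{
    // Strips leading zeros from the name part: "00012345.pdf" -> "12345.pdf".
    public static string NormalizeName(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName);
        string extension = Path.GetExtension(fileName);
        string trimmed = stem.TrimStart('0');
        if (trimmed.Length == 0) trimmed = "0";   // "000.pdf" becomes "0.pdf"
        return trimmed + extension.ToLowerInvariant();
    }

    // Returns every entry whose normalized name was already seen earlier in the list.
    public static List<string> FindDuplicates(IEnumerable<string> fileNames)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var duplicates = new List<string>();
        foreach (string name in fileNames)
        {
            // HashSet.Add returns false when the key is already present.
            if (!seen.Add(NormalizeName(name)))
            {
                duplicates.Add(name);
            }
        }
        return duplicates;
    }
}

public static class DuplicateFinderDemo
{
    public static void Main()
    {
        var fileNames = new[] { "00012345.pdf", "12345.pdf", "123.pdf", "000123.pdf", "7.pdf" };
        foreach (string name in DuplicateFinder.FindDuplicates(fileNames))
        {
            Console.WriteLine(name);
        }
        // Prints, one per line: 12345.pdf, 000123.pdf
    }
}
```

Because each entry needs only one hash lookup, this stays roughly linear in the list size, which matters with some hundred thousand entries.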
I have a simple solution for finding duplicate words and characters in a string (this one is in Java).
For words:
import java.util.HashMap;

public class Test {
    public static void main(String[] args) {
        findDuplicateWords("i am am a a learner learner learner");
    }

    private static void findDuplicateWords(String string) {
        // word -> number of occurrences
        HashMap<String,Integer> hm=new HashMap<>();
        String[] s=string.split(" ");
        for(String tempString:s){
            if(hm.get(tempString)!=null){
                hm.put(tempString, hm.get(tempString)+1);
            }
            else{
                hm.put(tempString,1);
            }
        }
        System.out.println(hm);
    }
}
For characters, use a for loop over the string's length and read each character with charAt().
Maybe something like this:
List<string> theList = new List<string>() { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };
theList.GroupBy(txt => txt)
.Where(grouping => grouping.Count() > 1)
.ToList()
.ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
groupItem.Key,
groupItem.Count(),
string.Join(" ", groupItem.ToArray())));

Custom Comparer against a parameter failing

I am trying to write a custom comparer to sort a list of search results based on similarity. I would like the term most like the entered search term to appear first in the list, followed by phrases that start with the search phrase, then all other values in alpha order.
Given this test code:
string searchTerm = "fleas";
List<string> list = new List<string>
{
"cat fleas",
"dog fleas",
"advantage fleas",
"my cat has fleas",
"fleas",
"fleas on my cat"
};
I'm trying to use this Comparer:
public class MatchComparer : IComparer<string>
{
private readonly string _searchTerm;
public MatchComparer(string searchTerm)
{
_searchTerm = searchTerm;
}
public int Compare(string x, string y)
{
if (x.Equals(_searchTerm) ||
y.Equals(_searchTerm))
return 0;
if (x.StartsWith(_searchTerm) ||
y.StartsWith(_searchTerm))
return 0;
if (x.Contains(_searchTerm))
return 1;
if (y.Contains(_searchTerm))
return 1;
return x.CompareTo(y);
}
}
Calling list.Sort(new MatchComparer(searchTerm)) results in 'my cat has fleas' at the top of the list.
I think I must be doing something odd/weird here .. Is something wrong here or is there a better approach to what I'm trying to do?
Thanks!
You aren't using all of the possible return values of Compare:
-1 == x comes first
0 == both are equal
1 == y comes first
You want something more like this:
public class MatchComparer : IComparer<string>
{
    private readonly string _searchTerm;

    public MatchComparer(string searchTerm)
    {
        _searchTerm = searchTerm;
    }

    public int Compare(string x, string y)
    {
        if (x.Equals(y)) return 0; // Both entries are equal;
        if (x.Equals(_searchTerm)) return -1; // first string is search term so must come first
        if (y.Equals(_searchTerm)) return 1; // second string is search term so must come first
        if (x.StartsWith(_searchTerm)) {
            // first string starts with search term
            // if second string also starts with search term sort alphabetically else first string first
            return (y.StartsWith(_searchTerm)) ? x.CompareTo(y) : -1;
        }
        if (y.StartsWith(_searchTerm)) return 1; // second string starts with search term so comes first
        if (x.Contains(_searchTerm)) {
            // first string contains search term
            // if second string also contains the search term sort alphabetically else first string first
            return (y.Contains(_searchTerm)) ? x.CompareTo(y) : -1;
        }
        if (y.Contains(_searchTerm)) return 1; // second string contains search term so comes first
        return x.CompareTo(y); // fall back on alphabetic
    }
}
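An alternative that avoids writing the pairwise logic by hand is to give each string a numeric rank and let OrderBy/ThenBy do the rest. This is a sketch, not a replacement for the comparer above: SearchRanker, Rank and Sort are hypothetical names, the four tiers (exact match, starts with, contains, everything else) mirror that comparer, and ordinal, case-sensitive matching is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchRanker
{
    // Lower rank sorts first: 0 = exact match, 1 = starts with the term,
    // 2 = contains the term, 3 = no match.
    public static int Rank(string candidate, string searchTerm)
    {
        if (string.Equals(candidate, searchTerm, StringComparison.Ordinal)) return 0;
        if (candidate.StartsWith(searchTerm, StringComparison.Ordinal)) return 1;
        if (candidate.IndexOf(searchTerm, StringComparison.Ordinal) >= 0) return 2;
        return 3;
    }

    // Sorts by rank first, then alphabetically (ordinal) within each rank.
    public static List<string> Sort(IEnumerable<string> items, string searchTerm)
    {
        return items
            .OrderBy(item => Rank(item, searchTerm))
            .ThenBy(item => item, StringComparer.Ordinal)
            .ToList();
    }
}

public static class SearchRankerDemo
{
    public static void Main()
    {
        var list = new List<string>
        {
            "cat fleas", "dog fleas", "advantage fleas",
            "my cat has fleas", "fleas", "fleas on my cat"
        };
        foreach (string item in SearchRanker.Sort(list, "fleas"))
        {
            Console.WriteLine(item);
        }
        // Prints, one per line: fleas, fleas on my cat, advantage fleas,
        // cat fleas, dog fleas, my cat has fleas
    }
}
```

Because every string maps to exactly one rank, the resulting ordering is always consistent, which is the property the original comparer in the question was missing.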
