I am trying to remove unwanted images from a website. The product image folder contains more than 200,000 images. I have a list of inactive product codes, and another list with the file names.
List<string> lFileList = files.ToList();
List<string> lNotinfiles = new List<string>();
foreach (var s in lFileList)
{
var s2 = (from s3 in lProductsList
where s.Contains(s3.cProductCode)
select s3.cProductCode).FirstOrDefault();
if (s2 == null)
{
lNotinfiles.Add(s);
}
}
Here lProductsList is the list containing ProductCodes that are not used.
The image list contains multiple images for the same product. Each image name contains the product code (the name mostly starts with it, followed by a suffix such as _1 or _2 before .jpg).
The above code works, but it takes more than 5 minutes for a single folder to build the not-in list. I also tried the following, but that took more than 15 minutes.
var s2 = (from s3 in lProductsList
where s.IndexOf(s3.cProductCode) >= 0
select s3.cProductCode).FirstOrDefault();
I have also tried to remove the loop altogether, but that didn't work either.
What would be the best way to do this faster?
I'd suggest using a HashSet, holding off on ToList, and maybe using GroupBy.
HashSet + use of ToList
Currently your code has a time complexity of O(n * m), with n files and m product codes: you iterate the outer list and, for each item, iterate all the items of the inner list.
Change the type of lProductsList from a list to a HashSet<string> containing the codes. Finding an item in a HashSet is O(1) on average (in a list it is O(n)). Then, when you iterate over the items of lFileList to check whether they are in lProductsList, the total time complexity is O(n) instead of O(n * m).
This code will show you the time difference between when using 2 lists or when using a list and a HashSet:
var items = (new[] { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" }).SelectMany(x => Enumerable.Repeat(x, 10000)).ToList();
var itemsToFilterOut = new List<string> { "1", "2", "3" };
var efficientItemsToFilterOut = new HashSet<string>(itemsToFilterOut);
// Time the lookup against a List<string>
var watch = System.Diagnostics.Stopwatch.StartNew();
var unwantedItems = items.Where(item => itemsToFilterOut.Contains(item)).ToList();
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
// Time the same lookup against a HashSet<string>
watch = System.Diagnostics.Stopwatch.StartNew();
var efficientUnwantedItems = items.Where(item => efficientItemsToFilterOut.Contains(item)).ToList();
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
As for putting it in the context of your code:
var notInUseItems = new HashSet<string>(from item in lProductsList
                                        select item.cProductCode);
// Notice that here I am not using the materialized `lFileList`
lNotinfiles = files.Where(item => !notInUseItems.Contains(item)).ToList();
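One caveat: HashSet.Contains tests for an exact match, while the question looks for a product code inside a longer file name. A minimal sketch of one way to bridge that gap, assuming every file name starts with the product code followed by an optional suffix such as _1 or _2 (the ExtractCode helper and the sample data are hypothetical, not from the question):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical sample data standing in for the real folder and product list.
var files = new[] { "A100.jpg", "A100_1.jpg", "B200_2.jpg", "C300.jpg" };
var inactiveCodes = new HashSet<string>(new[] { "A100", "C300" },
                                        StringComparer.OrdinalIgnoreCase);

// Assumption: the code is everything before the first '_' (or the whole name).
static string ExtractCode(string fileName)
{
    string name = Path.GetFileNameWithoutExtension(fileName);
    int p = name.IndexOf('_');
    return p >= 0 ? name.Substring(0, p) : name;
}

// One hash lookup per file instead of scanning every product code.
List<string> notInList = files
    .Where(f => !inactiveCodes.Contains(ExtractCode(f)))
    .ToList();

Console.WriteLine(string.Join(", ", notInList)); // B200_2.jpg
```

If the code can appear anywhere in the name rather than at the start, this shortcut does not apply and a substring scan is still needed.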
GroupBy
Moreover, you said that the list contains multiple items mapping to the same key. Use GroupBy before filtering. Check the performance of this addition:
watch = System.Diagnostics.Stopwatch.StartNew();
var moreEfficientUnwantedItems = items.GroupBy(item => item)
    .Where(group => efficientItemsToFilterOut.Contains(group.Key))
    .Select(group => group.Key)
    .ToList(); // materialize so the query actually runs inside the timed section
watch.Stop();
Console.WriteLine(watch.Elapsed.TotalMilliseconds);
Check your data to see how significant the amount of duplication is and, if needed, use GroupBy.
Two suggestions:
Do not materialize files with .ToList(), i.e. do not wait until all files are retrieved.
Organize NotInFiles as a HashSet<String> to get a better lookup complexity: O(1) instead of O(N).
Something like this:
//TODO: you have to implement this
private static String ExtractProductCode(string fileName) {
    int p = fileName.IndexOf('_');
    if (p >= 0)
        return fileName.Substring(0, p);
    else
        return fileName;
}
...
HashSet<String> NotInFiles = new HashSet<String>(
    lNotinfiles,
    StringComparer.OrdinalIgnoreCase); // file names are case insensitive
...
var files = Directory
    .EnumerateFiles(@"C:\MyPictures", "*.jpeg", SearchOption.AllDirectories)
    .Select(path => Path.GetFileNameWithoutExtension(path)) // strip directory and extension
    .Select(name => ExtractProductCode(name))
    .Where(code => !NotInFiles.Contains(code))
    .ToList(); // if you want List materialization
You are converting your array (I assume it is one) to a List and then running a foreach over it.
Iterating over the array directly should make it at least a bit faster.
List<string> lNotinfiles = new List<string>();
foreach (var s in files)
{
    var s2 = (from s3 in lProductsList where s.Contains(s3.cProductCode) select s3.cProductCode).FirstOrDefault();
    if (s2 == null)
    {
        lNotinfiles.Add(s);
    }
}
I can't seem to find a ready answer to this, or even if the question has ever been asked before, but I want functionality similar to the SQL STRING_SPLIT functions floating around, where each item in a comma separated list is identified by its ordinal in the string.
Given the string "abc,xyz,def,tuv", I want to get a list of tuples like:
<1, "abc">
<2, "xyz">
<3, "def">
<4, "tuv">
Order is important, and I need to preserve the order, and be able to take the list and further join it with another list using linq, and be able to preserve the order. For example, if a second list is <"tuv", "abc">, I want the final output of the join to be:
<1, "abc">
<4, "tuv">
Basically, I want the comma separated string to determine the ORDER of the end result, where the comma separated string contains ALL possible strings, and it is joined with an unordered list of a subset of strings, and the output is a list of ordered tuples that consists only of the elements in the second list, but in the order determined by the comma separated string at the beginning.
I could likely figure out all of this on my own if I could just get a C# equivalent to all the various SQL STRING_SPLIT functions out there, which do the split but also include the ordinal element number in the output. But I've searched, and I find nothing for C# but splitting a string into individual elements, or splitting them into tuples where both elements of the tuple are in the string itself, not generated integers to preserve order.
The order is the important thing to me here. So if an element number isn't readily possible, a way to inner join two lists and guarantee preserving the order of the first list while returning only those elements in the second list would be welcome. The tricky part for me is this last part: the result of a join needs a specific (not easy to sort by) order. The ordinal number would give me something to sort by, but if I can inner join with some guarantee the output is in the same order as the first input, that'd work too.
This version should also work on .NET Framework.
using System.Linq;
string str = "abc,xyz,def,tuv";
string str2 = "abc,tuv";
IEnumerable< PretendFileObject> secondList = str2.Split(',').Select(x=> new PretendFileObject() { FileName = x}); //
var tups = str.Split(',')
.Select((x, i) => { return (i + 1, x); })
.Join(secondList, //Join Second list ON
item => item.Item2 //This is the filename in the tuples
,item2 => item2.FileName, // This is the filename property for a given object in the second list to join on
(item,item2) => new {Index = item.Item1,FileName = item.Item2, Obj = item2})
.OrderBy(JoinedObject=> JoinedObject.Index)
.ToList();
foreach (var tup in tups)
{
Console.WriteLine(tup.Obj.FileName);
}
public class PretendFileObject
{
public string FileName { get; set; }
public string Foo { get; set; }
}
Original Response Below
If you want to stick to something SQL-like, here is how to do it with LINQ operators. The Select method has a built-in index parameter you can make use of, and you can use IntersectBy (available from .NET 6) to perform an easy inner join.
using System.Linq;
string str = "abc,xyz,def,tuv";
string str2 = "abc,tuv";
var secondList = str2.Split(',');
var tups = str.Split(',')
.Select((x, i) => { return (i + 1, x); })
.IntersectBy(secondList, s=>s.Item2) //Filter down to only the strings found in both.
.ToList();
foreach(var tup in tups)
{
Console.WriteLine(tup);
}
This will get you a list of tuples:
var input = "abc,xyz,def,tuv";
string[] items = input.Split(',');
var tuples = new List<(int, string)>();
for (int i = 0; i < items.Length; i++)
{
    // ordinals are 1-based
    tuples.Add(((i + 1), items[i]));
}
If you then want to match against a list of "tuv" and "abc" and keep the original numbers, you probably want something like a left join. I am not sure how best to do that in LINQ, because you first need to iterate the original list and assign each element its number, and only then join. Alternatively, you could join first and assign the numbers afterwards, but then the order is technically not guaranteed. If you assign the numbers first, you can sort by them at the end.
I am slightly confused by "and be able to take the list and further join it with another list using linq". A join usually means combining data from both sides, but in your case it seems you want a filtered subset rather than joined data.
--
"I want to remove any items from the second list that are not in the first list, and then I need to iterate over the second list IN THE ORDER of the first list"
var input2 = "xxx,xyz,yyy,tuv,";
string[] items2 = input2.Split(',');
IEnumerable<(int, string)> finalTupleOutput =
tuples.Join(items2, t => t.Item2, i2 => i2, (t, i2) => (t.Item1, i2)).OrderBy(tpl => tpl.Item1);
This will give you what you want: the matching items from the second list, in the order of the first list.
with LINQ
string inputString = "abc,xyz,def,tuv";
var output = inputString.Split(',')
.Select((item, index) => { return (index + 1, item); });
Now you can use the output sequence however you need.
Not 100% sure what you're after, but here's an attempt:
string[] vals = new[] { "abc", "xyz", "dev", "tuv"};
string[] results = new string[vals.Length];
int index = 0;
for (int i = 0; i < vals.Length; i++)
{
results[i] = $"<{++index},\"{vals[i]}\">";
}
foreach (var item in results)
{
Console.WriteLine(item);
}
This produces:
<1,"abc">
<2,"xyz">
<3,"dev">
<4,"tuv">
Given the example
For example, if a second list is <"tuv", "abc">, I want the final
output of the join to be:
<1, "abc"> <4, "tuv">
I think this might be close?
List<string> temp = new List<string>() { "abc", "def", "xyz", "tuv" };
List<string> temp2 = new List<string>() { "dbc", "ace", "zyw", "tke", "abc", "xyz" };
var intersect = temp.Intersect(temp2).Select((list, idx) => (idx+1, list));
This produces an intersect result that has the elements from list 1 that are also in list 2, which in this case would be:
<1, "abc">
<2, "xyz">
If you want all the elements from both lists, switch the Intersect to Union.
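Note that numbering after Intersect counts positions within the intersection, not within the first list. If the original ordinals matter, a sketch along these lines numbers the items first and filters afterwards (variable names are illustrative; the sample data follows the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string ordered = "abc,xyz,def,tuv";
var subset = new HashSet<string> { "tuv", "abc" };

// Number every element first, then keep only those present in the subset.
// Where preserves source order, so no explicit OrderBy is needed.
List<(int Ordinal, string Value)> result = ordered
    .Split(',')
    .Select((value, index) => (Ordinal: index + 1, Value: value))
    .Where(t => subset.Contains(t.Value))
    .ToList();

foreach (var (ordinal, value) in result)
    Console.WriteLine($"<{ordinal}, \"{value}\">");
// <1, "abc">
// <4, "tuv">
```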
I have a list of strings
e.g.{"apple.txt", "orange.sd.2.txt", "apple.2.tf.txt", "orange.txt"}
and another list of strings to group the first list
e.g. {"apple", "orange"}
so that the first list is split into a list of lists and looks like this:
{{"apple.txt", "apple.2.tf.txt"},{"orange.txt", "orange.sd.2.txt"}}
How can I achieve this with linq?
How about this:
var groupedList = firstList.GroupBy(x => secondList.Single(y => x.Contains(y)));
You could group the elements of the original list by all possible keys using Split, SelectMany, and GroupBy with an anonymous type:
var list = new List<string> { "apple.txt", "orange.sd.2.txt", "apple.2.tf.txt", "orange.txt" };
var groups = list
.SelectMany(element => element
.Split('.')
.Select(part => new { Part = part, Full = element }))
.GroupBy(entry => entry.Part);
Now you can select the groups you want to keep using Where, and convert the results into the nested lists using Select and ToList:
var keys = new List<string> { "apple", "orange" };
var result = groups
.Where(group => keys.Contains(group.Key))
.Select(group => group
.Select(entry => entry.Full)
.ToList())
.ToList();
N.B. Elements of the original list which do not contain any of the specified keys will not appear in the results, and elements which contain more than one of the specified keys will appear more than once in the result.
Edit: As @NetMage noted, I've made an incorrect assumption about splitting strings. Here's another version, although it's O(m * n):
var result = keys
.Select(key => list.Where(element => element.Contains(key)).ToList())
.ToList();
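If every file name begins with its key followed by a dot, as in the question's sample, the O(m * n) scan can be avoided by indexing the list once. A sketch under that assumption (it will not catch a key that appears elsewhere in the name):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list = new List<string> { "apple.txt", "orange.sd.2.txt", "apple.2.tf.txt", "orange.txt" };
var keys = new List<string> { "apple", "orange" };

// Assumption: the key is the text before the first '.'.
ILookup<string, string> byPrefix = list.ToLookup(name => name.Split('.')[0]);

// Indexing a lookup with a missing key yields an empty sequence, not an exception.
List<List<string>> result = keys
    .Select(key => byPrefix[key].ToList())
    .ToList();

foreach (var g in result)
    Console.WriteLine(string.Join(", ", g));
// apple.txt, apple.2.tf.txt
// orange.sd.2.txt, orange.txt
```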
This is one simple way to do it. There are many ways, and this one will include duplicates, as noted in my comment on your question: if several keys match the same item, that item is copied into each matching group.
// have the list of keys (groups)
var keyList = new List<string>() {"apple", "orange"};
// have the list of all the data to split
var dataToSplit = new List<string>()
{
"apple.txt",
"apple.2.tf.txt",
"orange.txt",
"orange.sd.2.txt"
};
// now split to get just as desired you select what you want for each keys
var groupedData = keyList.Select(key => dataToSplit.Where(data => data.Contains(key)).ToList()).ToList();
// groupedData is a List<List<string>>
A second option, to get the values in a more "object" fashion, is to use an anonymous type. This is especially useful if you will do lots of manipulation, though it makes the code more verbose. If you are new to this I do NOT recommend this approach, but here it is anyway.
// have the list of keys (groups)
var keyList = new List<string>() {"apple", "orange"};
// have the list of all the data to split
var dataToSplit = new List<string>()
{
"apple.txt",
"apple.2.tf.txt",
"orange.txt",
"orange.sd.2.txt"
};
// create the anonymous objects
var anonymousGroup = keyList.Select(key =>
{
    return new
    {
        Key = key,
        Data = dataToSplit.Where(data => data.Contains(key)).ToList()
    };
});
// anonymousGroup is a sequence of anonymous objects in key order; access the data for orange like this
var orangeGroup = anonymousGroup.FirstOrDefault(o => o.Key == "orange"); // get the anonymous object
var orangeData = orangeGroup.Data; // get the List<string> for that group
A third way, which often does less work than the plain O(m*n) scan (the worst case is still O(m*n)). The trick is to remove items from the collection as you go, so that items already processed are not checked again. This is from my codebase: an extension method for List that removes the items matching a predicate and returns what was removed.
public static List<T> RemoveAndGet<T>(this List<T> list, Func<T, bool> predicate)
{
var itemsRemoved = new List<T>();
// iterate backward for performance
for (int i = list.Count - 1; i >= 0; i--)
{
// keep item pointer
var item = list[i];
// if the item match the remove predicate
if (predicate(item))
{
// add the item to the returned list
itemsRemoved.Add(item);
// remove the item from the source list
list.RemoveAt(i);
}
}
return itemsRemoved;
}
Now with that extension when you have a list you can use it easily like this :
// have the list of keys (groups)
var keyList = new List<string>() {"apple", "orange"};
// have the list of all the data to split
var dataToSplit = new List<string>()
{
"apple.txt",
"apple.2.tf.txt",
"orange.txt",
"orange.sd.2.txt"
};
// now split to get just as desired you select what you want for each keys
var groupedData = keyList.Select(key => dataToSplit.RemoveAndGet(data => data.Contains(key))).ToList();
In this case, because of the order of both collections, the first key is apple, so it iterates over the 4 items in dataToSplit, keeps 2, and reduces dataToSplit to the 2 remaining items containing orange. For the second key it iterates over only 2 items, which makes it faster for this case. Typically this method will be as fast as or faster than the first two I provided, while being just as clear and still making use of LINQ.
You can achieve this using this simple code:
var list1 = new List<string>() {"apple.txt", "orange.sd.2.txt", "apple.2.tf.txt", "orange.txt"};
var list2 = new List<string>() {"apple", "orange"};
var result = new List<List<string>>();
list2.ForEach(e => {
result.Add(list1.Where(el => el.Contains(e)).ToList());
});
Tuples to the rescue!
var R = new List<(string, List<string>)> { ("orange", new List<string>()), ("apple", new List<string>()) };
var L = new List<string> { "apple.txt", "apple.2.tf.txt", "orange.txt", "orange.sd.2.txt" };
R.ForEach(r => L.ForEach(l => { if (l.Contains(r.Item1)) { r.Item2.Add(l); } }));
var resultString = string.Join("," , R.Select(x => "{" + string.Join(",", x.Item2) + "}"));
You can build R dynamically trivially if you need to.
I have 2 lists. First is a list of objects that has an int property ID. The other is a list of ints.
I need to compare these 2 lists and copy the objects to a new list with only the objects that matches between the two lists based on ID. Right now I am using 2 foreach loops as follows:
var matched = new List<Cars>();
foreach(var car in cars)
foreach(var i in intList)
{
if (car.id == i)
matched.Add(car);
}
This seems like it is going to be very slow as it is iterating over each list many times. Is there way to do this without using 2 foreach loops like this?
One slow but clear way would be
var matched = cars.Where(car => intList.Contains(car.id)).ToList();
You can make this quicker by turning the intList into a dictionary and using ContainsKey instead.
var intLookup = intList.ToDictionary(k => k);
var matched = cars.Where(car => intLookup.ContainsKey(car.id)).ToList();
Even better still, a HashSet:
var intHash = new HashSet<int>(intList);
var matched = cars.Where(car => intHash.Contains(car.id)).ToList();
You could try some simple LINQ; something like this should work:
var matched = cars.Where(w => intList.Contains(w.id)).ToList();
This will take your list of cars and find only those items whose id is contained in your intList.
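The same match can also be written as a LINQ Join, which builds a hash lookup over the inner sequence internally. A small sketch, with a tuple standing in for the question's Cars class (its shape is assumed here, and the sample values are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Tuples stand in for the Cars objects; only the id matters for matching.
var cars = new List<(int id, string name)> { (1, "Audi"), (2, "BMW"), (3, "Fiat") };
var intList = new List<int> { 3, 1, 7 };

// Distinct() guards against a car being emitted twice if intList repeats an id.
var matched = cars
    .Join(intList.Distinct(), car => car.id, i => i, (car, i) => car)
    .ToList();

Console.WriteLine(string.Join(", ", matched.Select(c => c.name))); // Audi, Fiat
```

Join keeps the order of the outer sequence (cars), so the result order does not depend on the order of intList.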
I wanted to ask for suggestions how I can simplify the foreach block below. I tried to make it all in one linq statement, but I couldn't figure out how to manipulate "count" values inside the query.
More details about what I'm trying to achieve:
- I have a huge list with potential duplicates, where Ids are repeated but the "Count" property holds different numbers
- I want to get rid of the duplicates without losing those "Count" values
- so for the items with the same Id I sum up the "Count" properties
Still, the current code doesn't look pretty:
var grouped = bigList.GroupBy(c => c.Id).ToList();
foreach (var items in grouped)
{
var count = 0;
items.Each(c=> count += c.Count);
items.First().Count = count;
}
var filtered = grouped.Select(y => y.First());
I don't expect the whole solution, pieces of ideas will be also highly appreciated :)
Given that you're mutating the collection, I would personally just make a new "item" with the count:
var results = bigList.GroupBy(c => c.Id)
.Select(g => new Item(g.Key, g.Sum(i => i.Count)))
.ToList();
This performs a simple mapping from the original to a new collection of Item instances, with the proper Id and Count values.
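As a self-contained sketch of the same grouping, using value tuples in place of the Item class (whose constructor is not shown in the question, so its shape is an assumption):

```csharp
using System;
using System.Linq;

// Hypothetical data: (Id, Count) pairs with repeated ids.
var bigList = new[] { (Id: 1, Count: 2), (Id: 2, Count: 5), (Id: 1, Count: 3) };

// One output element per distinct Id, carrying the summed Count.
var results = bigList
    .GroupBy(c => c.Id)
    .Select(g => (Id: g.Key, Count: g.Sum(i => i.Count)))
    .ToList();

foreach (var r in results)
    Console.WriteLine($"{r.Id}: {r.Count}");
// 1: 5
// 2: 5
```

Unlike the loop in the question, this leaves the original items untouched.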
var filtered = bigList.GroupBy(c=>c.Id)
.Select(g=> {
var f = g.First();
f.Count = g.Sum(c=>c.Count);
return f;
});
I have a generic list
Simplified example
var list = new List<string>()
{
"lorem1.doc",
"lorem2.docx",
"lorem3.ppt",
"lorem4.pptx",
"lorem5.doc",
"lorem6.doc",
};
What I would like to do is to sort these items based on an external list ordering
For example
var sortList = new[] { "pptx", "ppt", "docx", "doc" };
// Or
var sortList = new List<string>() { "pptx", "ppt", "docx", "doc" };
Is there anything built-in to linq that could help me achieve this or do I have to go the foreach way?
With the list you can use IndexOf for Enumerable.OrderBy:
var sorted = list.OrderBy(s => sortList.IndexOf(Path.GetExtension(s)));
So the index of the extension in the sortList determines the priority in the other list. Unknown extensions have highest priority since their index is -1.
But you need to add a dot to the extension to get it working:
var sortList = new List<string>() { ".pptx", ".ppt", ".docx", ".doc" };
If that's not an option you have to fiddle around with Substring or Remove, for example:
var sorted = list.OrderBy(s => sortList.IndexOf(Path.GetExtension(s).Remove(0,1)));
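Because IndexOf returns -1 for anything not in sortList, files with unlisted extensions sort first. If they should go last instead, one option is to map -1 to int.MaxValue. In the sketch below, the Rank helper and the file notes.txt are added for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

var sortList = new List<string> { "pptx", "ppt", "docx", "doc" };
var list = new List<string> { "lorem1.doc", "notes.txt", "lorem3.ppt", "lorem4.pptx" };

// Rank by position in sortList; unknown extensions get the largest rank.
int Rank(string fileName)
{
    int index = sortList.IndexOf(Path.GetExtension(fileName).TrimStart('.'));
    return index < 0 ? int.MaxValue : index;
}

var sorted = list.OrderBy(f => Rank(f)).ToList();

Console.WriteLine(string.Join(", ", sorted));
// lorem4.pptx, lorem3.ppt, lorem1.doc, notes.txt
```

TrimStart('.') also copes with names that have no extension, where Remove(0,1) would throw on the empty string.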
This solution will work even if some file names do not have extensions:
var sortList = new List<string>() { "pptx", "ppt", "docx", "doc" };
var list = new List<string>()
{
"lorem1.doc",
"lorem2.docx",
"lorem3.ppt",
"lorem4.pptx",
"lorem5.doc",
"lorem6.doc",
};
var result =
list.OrderBy(f => sortList.IndexOf(Path.GetExtension(f).Replace(".","")));
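You could try using the Array.IndexOf() method (the leading dot of the extension is trimmed so that it matches the entries in sortList):
var sortedList = list.OrderBy(i => Array.IndexOf(sortList, System.IO.Path.GetExtension(i).TrimStart('.'))).ToList();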
A sort dictionary would be more efficient:
var sortDictionary = new Dictionary<string, int> {
{ ".pptx", 0 },
{ ".ppt" , 1 },
{ ".docx", 2 },
{ ".doc" , 3 } };
var sortedList = list.OrderBy(i => {
var s = Path.GetExtension(i);
int rank;
if (sortDictionary.TryGetValue(s, out rank))
return rank;
return int.MaxValue; // for unknown at end, or -1 for at start
});
This way the lookup is O(1) rather than O(# of extensions).
Also, if you have a large number of filenames and a small number of extensions, it might actually be faster to do
var sortedList = list
.GroupBy(p => Path.GetExtension(p))
.OrderBy(g => {
int rank;
if (sortDictionary.TryGetValue(g.Key, out rank))
return rank;
return int.MaxValue; // for unknown at end, or -1 for at start
})
.SelectMany(g => g);
This means the sort scales by the number of distinct extensions in the input, rather than the number of items in the input.
This also allows you to give two extensions the same priority.
Here's another way that does not use OrderBy:
var res =
sortList.SelectMany(x => list.Where(f => Path.GetExtension(f).EndsWith(x)));
Note that the complexity of this approach is O(n * m), with n = sortList.Count and m = list.Count.
The worst-case complexity of the OrderBy approach is instead O(n * m * log m), but in general it will probably be faster (since IndexOf does not always take O(n)). However, with small n and m you won't notice any difference.
For big lists the fastest way (complexity O(n + m)) could be constructing a temporary lookup, i.e.:
var lookup = list.ToLookup(x => Path.GetExtension(x).Remove(0,1));
var res = sortList.Where(x => lookup.Contains(x)).SelectMany(x => lookup[x]);