I want to add a column of data to a Deedle data frame.
I'm doing it this way, and it works, but I believe there should be a better way:
void addTrendValues(Frame<int, string> df){
List<double> trend_val = new List<double>();
df.FillMissing(0);
List<int> indexes= new List<int>();
for(int i =0; i<1000;i++){
double trendpips = getPipsTilNextTrend(df,i);
trend_val.Add(trendpips);
indexes.Add(i);
if(i% 10000 == 0) { Console.WriteLine(i) ;}
}
df.AddColumn("trend_val",new Series<int, double>(indexes.ToArray(),trend_val.ToArray()));
}
Any ideas?
If you generate IEnumerable<KeyValuePair<K,V>> in some way, then you can use the ToSeries extension method provided by Deedle. The following does that quite nicely using LINQ (but it also removes the printing, which you might want to recover).
var newSeries = Enumerable.Range(0, 1000).Select(i =>
KeyValue.Create(i, getPipsTilNextTrend(df, i))).ToSeries();
df.AddColumn("trend_val", newSeries);
I'd like to have a
data structure with <Key, Value> where
Key = (start, end),
Value = string
After which I should be able to search an integer optimally in the data structure and get corresponding value.
Example:
var lookup = new Something<(int, int), string>()
{
{(1,100),"In 100s"},
{(101,200),"In 100-200"},
};
var value1 = lookup[10]; //value1 = "In 100s"
var value2 = lookup[110]; //value2 = "In 100-200"
Could anyone suggest?
If you want to be able to use something like lookup[10] as you mentioned, you can create your own class that implements some sort of key/value data type. Which underlying data type you ultimately decide to use really depends on what your data looks like.
Here's a simple example of doing this by inheriting from Dictionary<>:
public class RangeLookup : Dictionary<(int Min, int Max), string>
{
public string this[int index] => this.Single(x => x.Key.Min <= index && index <= x.Key.Max).Value;
}
This allows you to define a custom indexer on top of the dictionary to encapsulate your range lookup. A usage of this class would look like:
var lookup = new RangeLookup
{
{ (1, 100), "In 100s" },
{ (101, 200), "In 101-200s" },
};
Console.WriteLine($"50: {lookup[50]}");
Which produces the output:
50: In 100s
In terms of performance with this approach, the following is an example of some tests (using Win10 with an Intel i7-4770 CPU) retrieving a value from a dictionary with 10,000,000 records:
var lookup = new RangeLookup();
for (var i = 1; i <= 10000000; i++)
{
var max = i * 100;
var min = max - 99;
lookup.Add((min, max), $"In {min}-{max}s");
}
var stopwatch = new Stopwatch();
stopwatch.Start();
Console.WriteLine($"50: {lookup[50]} (TimeToLookup: {stopwatch.ElapsedMilliseconds})");
stopwatch.Restart();
Console.WriteLine($"5,000: {lookup[5000]} (TimeToLookup: {stopwatch.ElapsedMilliseconds})");
stopwatch.Restart();
Console.WriteLine($"1,000,000,000: {lookup[1000000000]} (TimeToLookup: {stopwatch.ElapsedMilliseconds})");
Which gives the following results:
So unless you plan on working with more than tens of millions of records inside of this data set, an approach like this should be satisfactory in terms of performance.
You basically have a Dictionary<> structure here, for example:
var lookup = new Dictionary<(int, int), string>()
{
{(1,100),"In 100s"},
{(101,200),"In 100-200"},
};
You can use some basic Linq queries to search that container, for example:
var searchValue = 10;
var value1 = lookup.First(l => l.Key.Item1 <= searchValue && l.Key.Item2 >= searchValue).Value;
searchValue = 110;
var value2 = lookup.First(l => l.Key.Item1 <= searchValue && l.Key.Item2 >= searchValue).Value;
But as Lee suggested in the comments, you might get better performance using a SortedDictionary. Your mileage may vary, so you need to test the performance of both.
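Both answers above scan the entries one by one. As a rough sketch of an alternative (my own assumption, not something either answer proposes), the ranges can be kept sorted by their start so that a lookup becomes a binary search. The class name SortedRangeLookup and its members are hypothetical, and the sketch assumes the ranges do not overlap and are added in ascending order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: stores non-overlapping ranges sorted by their start
// and finds the matching value with a binary search over the starts.
public class SortedRangeLookup
{
    private readonly List<int> _starts = new List<int>();
    private readonly List<int> _ends = new List<int>();
    private readonly List<string> _values = new List<string>();

    // Assumption: ranges are added in ascending order and never overlap.
    public void Add(int min, int max, string value)
    {
        if (min > max)
            throw new ArgumentException("min must not exceed max.");
        if (_ends.Count > 0 && min <= _ends[_ends.Count - 1])
            throw new ArgumentException("Ranges must be added in ascending order without overlap.");
        _starts.Add(min);
        _ends.Add(max);
        _values.Add(value);
    }

    public bool TryGetValue(int key, out string value)
    {
        // BinarySearch returns the index of an exact match, or the bitwise
        // complement of the index of the first larger element.
        int pos = _starts.BinarySearch(key);
        if (pos < 0)
            pos = ~pos - 1; // index of the last range whose start is <= key

        if (pos >= 0 && key <= _ends[pos])
        {
            value = _values[pos];
            return true;
        }

        value = null;
        return false;
    }
}
```

Usage would be along the lines of lookup.Add(1, 100, "In 100s") followed by lookup.TryGetValue(10, out var text). Whether this beats the simpler approaches depends on the data, so it is worth measuring.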
Although I can successfully force a column into my jagged array, can anyone help me find a better way of doing so using lambda expressions?
This code works, but there has to be a better way. I tried converting each row into a list and then using List.Insert at the column number in hopes of converting back to an array, but this led to even less clean code.
I also tried using various combinations of .Select() and .Where() with no success. My guess is that some kind of compound LINQ is the way to go, but I don't know how to do it.
Example DataOut.Select(x => x.Select(y => y[columnindex] ... something ...
public object[][] DataOut;
...
int newcol = GetColumnNumber("NewData");
int insertioncol = GetColumnNumber("InsertionPoint");
var newarr = DataOut.Select(x => x[newcol]).ToArray();
var existingcol = DataOut.Select(x => x[insertioncol]).ToArray();
if (newcol > insertioncol)
{
for (int j = newcol; j > insertioncol; j--)
{
DataOut.All(x => { x[j] = x[j-1]; return true; });
}
// Yikes, there has to be a better way to insert the data
int c = 0;
DataOut.All(x => { x[insertioncol] = newarr[c]; c++; return true; });
}
Again, this works fine and my data set is relatively small, so I can go with it for the moment, but maybe I can make this scalable and hopefully learn something too.
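One possible LINQ-based shape for this, offered only as a sketch: project each row into a new array in which the value from one column has been moved to another position. The names JaggedArrayColumns and MoveColumn are hypothetical, and the sketch assumes every row contains both column indexes. Unlike the loop above, it builds new row arrays instead of modifying the existing ones.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class JaggedArrayColumns
{
    // Returns new rows in which the value at fromCol has been moved to toCol.
    // The columns in between shift by one position to make room.
    public static object[][] MoveColumn(object[][] rows, int fromCol, int toCol)
    {
        return rows.Select(row =>
        {
            var cells = new List<object>(row); // copy, so the input row is untouched
            object moved = cells[fromCol];
            cells.RemoveAt(fromCol);
            cells.Insert(toCol, moved);
            return cells.ToArray();
        }).ToArray();
    }
}
```

With the variables from the question, the call would look like DataOut = JaggedArrayColumns.MoveColumn(DataOut, newcol, insertioncol);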
I'm after some help with how to query a list and return the index, but without using Linq. I've seen many examples where Linq is used, but the software I'm writing the query into doesn't support Linq. :(
So example to get us going:
List<string> location = new List<string>();
location.Add(@"C:\test\numbers\FileName_IgnoreThis_1.jpg");
location.Add(@"C:\test\numbers\FileName_IgnoreThis_2.jpg");
location.Add(@"C:\test\numbers\FileName_ImAfterThis_3.jpg");
location.Add(@"C:\test\numbers\FileName_IgnoreThis_4.jpg");
location.Add(@"C:\test\numbers\FileName_ImAfterThis_5.jpg");
This list will probably be populated with a few hundred records. What I need to do is query the list for the text "ImAfterThis" and then return the index of each matching item into a string array, but without using Linq.
The desired result would be 2 and 4 being added to the string array.
I was thinking of doing a for loop through the list, but is there a better way to achieve this?
List<int> results = new List<int>();
int i = 0;
foreach (string value in location)
{
if (value.Contains("ImAfterThis"))
{
results.Add(i);
Console.WriteLine("Found in Index: " + i);
}
i++;
}
Console.ReadLine();
Thanks in advance.
If you want to get just the first occurrence you could simply use the IndexOf() method.
If you want all occurrences of the string "whatever" then a for loop would certainly work for you. For the sake of argument, I've captured the indexes in another list here:
string MyString = "whatever";
List<int> indexes = new List<int>();
for (int i = 0; i < location.Count; i++)
{
if (location[i] == MyString)
{
indexes.Add(i);
}
}
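The loop above compares whole strings, while the question asks for entries that contain a piece of text. A small sketch of that variant, still without Linq, might look like the following. The names IndexSearch and FindIndexesContaining are hypothetical, and the ordinal (case-sensitive) comparison is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;

public static class IndexSearch
{
    // Returns the index of every item that contains the given text.
    public static List<int> FindIndexesContaining(List<string> items, string text)
    {
        var indexes = new List<int>();
        for (int i = 0; i < items.Count; i++)
        {
            // IndexOf returns -1 when the text does not occur in the item.
            if (items[i].IndexOf(text, StringComparison.Ordinal) >= 0)
            {
                indexes.Add(i);
            }
        }
        return indexes;
    }
}
```

For the list in the question, FindIndexesContaining(location, "ImAfterThis") should give the indexes 2 and 4. If a string array is required, each index can be converted with ToString() in a second loop.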
I have a list of strings:
var list = new List<string>();
list.Add("CAT");
list.Add("DOG");
var listofItems = new List<string>();
listofItems.Add("CATS ARE GOOD");
listofItems.Add("DOGS ARE NICE");
listofItems.Add("BIRD");
listofItems.Add("CATAPULT");
listofItems.Add("DOGGY");
and now I want a function like this:
listofItems.Where(r=> list.Contains(r));
but instead of Contains, I want it to do a starts-with check, so 4 out of the 5 items would be returned (BIRD would NOT).
What is the fastest way to do that?
You can use StartsWith inside of an Any
listofItems.Where(item=>list.Any(startsWithWord=>item.StartsWith(startsWithWord)))
You can visualize this as a double for loop, with the second for breaking out as soon as it hits a true case
var filteredList = new List<string>();
foreach(var item in listofItems)
{
foreach(var startsWithWord in list)
{
if(item.StartsWith(startsWithWord))
{
filteredList.Add(item);
break; // stop at the first matching prefix
}
}
}
return filteredList;
The fastest way would be usage of another data structure, for example Trie. Basic C# implementation can be found here: https://github.com/kpol/trie
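To illustrate the idea only, here is a minimal prefix trie. It is a sketch and not the API of the linked library; the names PrefixTrie, AddPrefix and StartsWithAnyPrefix are hypothetical. The prefixes go into the trie once, and each item is then checked by walking the trie one character at a time.

```csharp
using System.Collections.Generic;

// Minimal prefix trie, written for illustration only.
public class PrefixTrie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEndOfPrefix;
    }

    private readonly Node _root = new Node();

    // Stores one prefix, creating child nodes as needed.
    public void AddPrefix(string prefix)
    {
        Node node = _root;
        foreach (char c in prefix)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsEndOfPrefix = true;
    }

    // True if the text starts with any stored prefix.
    public bool StartsWithAnyPrefix(string text)
    {
        Node node = _root;
        foreach (char c in text)
        {
            if (node.IsEndOfPrefix) return true;                       // a full prefix was matched
            if (!node.Children.TryGetValue(c, out node)) return false; // no prefix continues this way
        }
        return node.IsEndOfPrefix; // the text itself is exactly a stored prefix
    }
}
```

With "CAT" and "DOG" added, items can be filtered with a plain loop or with listofItems.Where(trie.StartsWithAnyPrefix). For only a handful of prefixes the Any/StartsWith version is probably fast enough, so it is worth measuring before switching.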
This should get you what you need in a more simplified format:
var result = listofItems.Select(n =>
{
bool res = list.Any(v => n.StartsWith(v));
return res
? n
: string.Empty;
}).Where(b => !b.Equals(string.Empty));
The Trie data structure is what you need. Take a look at this more mature library: TrieNet
using Gma.DataStructures.StringSearch;
...
var trie = new SuffixTrie<int>(3);
trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);
var result = trie.Retrieve("hel");
I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:
public void GetCRCs(List<DupInfo> dupInfos)
{
var crc = new Crc32();
for (int i = 0; i < dupInfos.Count(); i++)
{
if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
{
dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
}
}
}
My question is:
How can I compare each entry to its neighbors without the out of bounds error?
Should I be using a loop for this, or is there a better LINQ or other function?
Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.
Compute the CRCs first:
// It is assumed that DupInfo.CheckSum is nullable and that dupInfos is sorted by size
public void GetCRCs(List<DupInfo> dupInfos)
{
if (dupInfos.Count == 0) return;
var crc = new Crc32();
dupInfos[0].CheckSum = null;
for (int i = 1; i < dupInfos.Count; i++)
{
dupInfos[i].CheckSum = null;
if (dupInfos[i].Size == dupInfos[i - 1].Size)
{
// Compute the previous file's checksum only if it has not been computed yet
if (dupInfos[i - 1].CheckSum == null) dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));
dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
}
}
}
After having sorted your files by size and CRC, identify duplicates:
public void GetDuplicates(List<DupInfo> dupInfos)
{
for (int i = dupInfos.Count - 1; i > 0; i--)
{ // loop is inverted to allow list items deletion
if (dupInfos[i].Size == dupInfos[i - 1].Size &&
dupInfos[i].CheckSum != null &&
dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
{ // i is duplicated with i-1
... // your code here
... // eventually, dupInfos.RemoveAt(i) ;
}
}
}
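For the literal question of comparing each entry with both neighbours without an out-of-bounds error, one option is to guard each comparison with an index check. Because && short-circuits, the neighbour is never read when it does not exist. The sketch below works on a plain list of sizes instead of DupInfo, since that type is not shown in the question; the names NeighbourCheck and HasSameSizeNeighbour are hypothetical.

```csharp
using System.Collections.Generic;

public static class NeighbourCheck
{
    // Assumes sizes is already sorted. For each index, reports whether the
    // previous or the next entry has the same size.
    public static bool[] HasSameSizeNeighbour(IList<long> sizes)
    {
        var result = new bool[sizes.Count];
        for (int i = 0; i < sizes.Count; i++)
        {
            // The index check runs first, so sizes[i - 1] and sizes[i + 1]
            // are only evaluated when they exist.
            bool sameAsPrevious = i > 0 && sizes[i] == sizes[i - 1];
            bool sameAsNext = i < sizes.Count - 1 && sizes[i] == sizes[i + 1];
            result[i] = sameAsPrevious || sameAsNext;
        }
        return result;
    }
}
```

In the original loop, the same two guarded conditions could replace the if statement, with the checksum computed only when either of them is true.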
"I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it."
The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.
I suggest taking this approach
Use LINQ's .GroupBy to group the files by size. Then use .Where to keep only the groups with more than one file.
Within those groups, calculate the CRC32 checksum and add it to a collection of known checksums. Compare with previously calculated checksums. If you need to know which files specifically are duplicates, you could use a dictionary keyed by this checksum (you can achieve this with another GroupBy). Otherwise a simple list will suffice to detect any duplicates.
The code might look something like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
.Where(group => group.Count() > 1);
foreach (var grp in filesSetsWithPossibleDupes)
{
var checksums = new List<CRC32CheckSum>(); //or whatever type
foreach (var file in grp)
{
var currentCheckSum = crc.ComputeChecksum(file);
if (checksums.Contains(currentCheckSum))
{
//Found a duplicate
}
else
{
checksums.Add(currentCheckSum);
}
}
}
Or, if you need the specific objects that could be duplicates, the code might look like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
.Where(grp => grp.Count() > 1);
var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats
//, and whose value is a collection of the possible duplicates
foreach (var grp in filesSetsWithPossibleDupes)
{
var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
.Where(g => g.Count() > 1);
//Same GroupBy logic, but applied to the checksum (instead of file size)
foreach(var dupGrp in likelyDuplicates)
{
//Create the key for the dictionary (your code is likely different)
var sample = dupGrp.First();
var key = new DupStats() {FileSize = sample.FileSize, Checksum = sample.Checksum};
masterDuplicateDict.Add(key, dupGrp);
}
}
A demo of this idea.
I think the for loop should be: for (int i = 1; i < dupInfos.Count() - 1; i++)
var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
...
});
Can you do a union between your two lists? If you have a list of filenames and do a union it should result in only a list of the overlapping files. I can write out an example if you want but this link should give you the general idea.
https://stackoverflow.com/a/13505715/1856992
Edit: Sorry for some reason I thought you were comparing file name not size.
So here is an actual answer for you.
using System;
using System.Collections.Generic;
using System.Linq;
public class ObjectWithSize
{
public int Size {get; set;}
public ObjectWithSize(int size)
{
Size = size;
}
}
public class Program
{
public static void Main()
{
Console.WriteLine("start");
var list = new List<ObjectWithSize>();
list.Add(new ObjectWithSize(12));
list.Add(new ObjectWithSize(13));
list.Add(new ObjectWithSize(14));
list.Add(new ObjectWithSize(14));
list.Add(new ObjectWithSize(18));
list.Add(new ObjectWithSize(15));
list.Add(new ObjectWithSize(15));
var duplicates = list.GroupBy(x=>x.Size)
.Where(g=>g.Count()>1);
foreach (var dup in duplicates)
foreach (var objWithSize in dup)
Console.WriteLine(objWithSize.Size);
}
}
This will print out
start
14
14
15
15
Here is a .NET Fiddle for that.
https://dotnetfiddle.net/0ub6Bs
Final note. I actually think your answer looks better and will run faster. This was just an implementation in Linq.