Fast reading large table - c#

I have a CSV file structured as below:
1,0,2.2,0,0,0,0,1.2,0
0,1,2,4,0,1,0.2,0.1,0
0,0,2,3,0,0,0,1.2,2.1
0,0,0,1,2,1,0,0.2,0.1
0,0,1,0,2.1,0.1,0,1.2
0,0,2,3,0,1.1,0.1,1.2
0,0.2,0,1.2,2,0,3.2,0
0,0,1.2,0,2.2,0,0,1.1
but with 10k columns and 10k rows.
I want to read it in such a way that the result is a dictionary with the row index as the key and a float array filled with every value in that row as the value.
For now my code looks like this:
var lines = File.ReadAllLines(filePath).ToList();
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
var values = line?.Split(',').Where(v => !string.IsNullOrEmpty(v))
.Select(f => f.Replace('.', ','))
.Select(float.Parse).ToArray();
return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);
but it takes up to 30 seconds to finish, so it's quite slow and I want to optimize it to be a bit faster.

While there are many small optimizations you can make, what is really killing you is the garbage collector because of all the allocations.
Your code takes 12 seconds to run on my machine. Reading the file uses 2 of those 12 seconds.
By using all the optimizations mentioned in the comments (using File.ReadLines, StringSplitOptions.RemoveEmptyEntries, and float.Parse(f, CultureInfo.InvariantCulture) instead of calling string.Replace), we get down to 9 seconds. There are still a lot of allocations, especially by File.ReadLines. Can we do better?
Just activate server GC in the app.config:
<runtime>
<gcServer enabled="true" />
</runtime>
With that, the execution time drops to 6 seconds using your code, and 3 seconds using the optimizations mentioned above. At that point, the file I/O is taking more than 60% of the execution time, so it's not really worth optimizing more.
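If you target .NET Core with an SDK-style project rather than .NET Framework, app.config is not used for this setting; the server GC switch goes into the project file instead (or "System.GC.Server": true in runtimeconfig.json), for example:
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>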
Final version of the code:
var lines = File.ReadLines(filePath);
var separator = new[] {','};
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
var values = line?.Split(separator, StringSplitOptions.RemoveEmptyEntries)
.Select(f => float.Parse(f, CultureInfo.InvariantCulture)).ToArray();
return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);

Replacing Split and Replace with hand parsing, using InvariantInfo to accept the period as the decimal point, removing the wasteful ReadAllLines().ToList(), and letting AsParallel() read from the file while parsing speeds this up about four times on my PC.
var lines = File.ReadLines(filepath);
var result = lines.AsParallel().AsOrdered().Select((line, index) => {
var values = new List<float>(10000);
var pos = 0;
while (pos < line.Length) {
var commapos = line.IndexOf(',', pos);
commapos = commapos < 0 ? line.Length : commapos;
var fs = line.Substring(pos, commapos - pos);
if (fs != String.Empty) // remove if no value is ever missing
values.Add(float.Parse(fs, NumberFormatInfo.InvariantInfo));
pos = commapos + 1;
}
return values;
}).ToList();
I also replaced ToArray on values with a List, as that is generally faster (ToList is preferred over ToArray).

using Microsoft.VisualBasic.FileIO;

protected void CSVImport(string importFilePath)
{
    string csvData = System.IO.File.ReadAllText(importFilePath, System.Text.Encoding.GetEncoding("WINDOWS-1250"));
    foreach (string row in csvData.Split('\n'))
    {
        // TextFieldParser handles quoted fields and the chosen delimiter for us
        using (var parser = new TextFieldParser(new System.IO.StringReader(row)))
        {
            parser.HasFieldsEnclosedInQuotes = true;
            parser.SetDelimiters(",");
            string[] fields = parser.ReadFields();
            // do what you need with the data in the array
        }
    }
}

Related

How to search vast code base for multiple literal strings efficiently?

This question is a follow up on How to optimize performance in a simple TPL DataFlow pipeline?
The source code is here - https://github.com/MarkKharitonov/LearningTPLDataFlow
Given:
Several solutions covering about 400 C# projects encompassing thousands of C# source files, totaling more than 10,000,000 lines of code.
A file containing string literals, one per line.
I want to produce a JSON file listing all the occurrences of the literals in the source code. For every matching line I want to have the following pieces of information:
The project path
The C# file path
The matching line itself
The matching line number
And all the records arranged as a dictionary keyed by the respective literal.
So the challenge is to do it as efficiently as possible (in C#, of course).
The DataFlow pipeline can be found in this file - https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
Here it is:
private void Run(string workspaceRoot, string outFilePath, string[] literals, bool searchAllFiles, int workSize, int maxDOP1, int maxDOP2, int maxDOP3, int maxDOP4)
{
var res = new SortedDictionary<string, List<MatchingLine>>();
var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();
var progress = new Progress();
var taskSchedulerPair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, Environment.ProcessorCount);
var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(p => YieldCSFiles(p, searchAllFiles), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP1
});
var produceCSFileContent = new TransformBlock<CSFile, CSFile>(CSFile.PopulateContentAsync, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP2
});
var produceWorkItems = new TransformManyBlock<CSFile, (CSFile CSFile, int Pos, int Length)>(csFile => csFile.YieldWorkItems(literals, workSize, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP3,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var produceMatchingLines = new TransformManyBlock<(CSFile CSFile, int Pos, int Length), MatchingLine>(o => o.CSFile.YieldMatchingLines(literals, o.Pos, o.Length, progress), new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = maxDOP4,
TaskScheduler = taskSchedulerPair.ConcurrentScheduler
});
var getMatchingLines = new ActionBlock<MatchingLine>(o => AddResult(res, o));
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
produceCSFiles.LinkTo(produceCSFileContent, linkOptions);
produceCSFileContent.LinkTo(produceWorkItems, linkOptions);
produceWorkItems.LinkTo(produceMatchingLines, linkOptions);
produceMatchingLines.LinkTo(getMatchingLines, linkOptions);
var progressTask = Task.Factory.StartNew(() =>
{
var delay = literals.Length < 10 ? 1000 : 10000;
for (; ; )
{
var current = Interlocked.Read(ref progress.Current);
var total = Interlocked.Read(ref progress.Total);
Console.Write("Total = {0:n0}, Current = {1:n0}, Percents = {2:P} \r", total, current, ((double)current) / total);
if (progress.Done)
{
break;
}
Thread.Sleep(delay);
}
Console.WriteLine();
}, TaskCreationOptions.LongRunning);
projects.ForEach(p => produceCSFiles.Post(p));
produceCSFiles.Complete();
getMatchingLines.Completion.GetAwaiter().GetResult();
progress.Done = true;
progressTask.GetAwaiter().GetResult();
res.SaveAsJson(outFilePath);
}
The default parameters are (https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs#L24-L28):
private int m_maxDOP1 = 3;
private int m_maxDOP2 = 20;
private int m_maxDOP3 = Environment.ProcessorCount;
private int m_maxDOP4 = Environment.ProcessorCount;
private int m_workSize = 1_000_000;
My idea is to divide the work into work items, where a work item's size is computed by multiplying the number of lines in the respective file by the count of the string literals. So, if a C# file contains 500 lines, then searching it for all the 3401 literals amounts to work of size 3401 * 500 = 1,700,500.
The work size is 1,000,000 by default, so in the aforementioned example the file would result in 2 work items:
Literals 0..1999
Literals 2000..3400
And it is the responsibility of the produceWorkItems block to generate these work items from files.
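For illustration, a minimal sketch of that splitting logic (a hypothetical helper, not the actual YieldWorkItems implementation from the repository) could look like this:
using System;
using System.Collections.Generic;

// Hypothetical sketch: split the literal range for one file into work items,
// so that each item covers at most about workSize = lineCount * literalsPerItem units.
static IEnumerable<(int Pos, int Length)> SplitIntoWorkItems(
    int lineCount, int literalCount, int workSize)
{
    int literalsPerItem = Math.Max(1, workSize / lineCount);
    for (int pos = 0; pos < literalCount; pos += literalsPerItem)
    {
        yield return (pos, Math.Min(literalsPerItem, literalCount - pos));
    }
}
For a 500-line file, 3401 literals and workSize = 1,000,000 this yields (0, 2000) and (2000, 1401), matching the two items above.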
Example runs:
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\2.txt
Locating all the instances of the 4 literals found in the file C:\temp\2.txt in the C# code ...
Total = 49,844,516, Current = 49,702,532, Percents = 99.72%
Elapsed: 00:00:18.4320676
C:\work\TPLDataFlow [master ≡]> .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\xyz\tip -o c:\temp -l C:\temp\1.txt
Locating all the instances of the 3401 literals found in the file c:\temp\1.txt in the C# code ...
Total = 42,379,095,775, Current = 42,164,259,870, Percents = 99.49%
Elapsed: 01:44:13.4289270
Question
Many work items are undersized. If I have 3 C# files, 20 lines each, my current code would produce 3 work items, because in my current implementation work items never cross a file boundary. This is inefficient. Ideally, they would be batched into a single work item, because 60 * 3401 = 204060 < 1000000. But the BatchBlock cannot be used here, because it expects me to provide the batch size, which I do not know - it depends on the work items in the pipeline.
How would you achieve such batching ?
I have realized something. Maybe it is obvious, but I have just figured it out. The TPL DataFlow library is of no use if one can buffer all the items first. In my case I can do that, so I can buffer and sort the items from large to small. This way a simple Parallel.ForEach will do the work just fine. Having realized that, I changed my implementation to use Reactive Extensions like this:
Phase 1 - get all the items, this is where all the IO is
var input = (workspaceRoot + "build\\projects.yml")
.YieldAllProjects()
.ToObservable()
.Select(project => Observable.FromAsync(() => Task.Run(() => YieldFiles(project, searchAllFiles))))
.Merge(2)
.SelectMany(files => files)
.Select(file => Observable.FromAsync(file.PopulateContentAsync))
.Merge(10)
.ToList()
.GetAwaiter().GetResult()
.AsList();
input.Sort((x, y) => y.EstimatedLineCount - x.EstimatedLineCount);
Phase 2 - find all the matching lines (CPU only)
var res = new SortedDictionary<string, List<MatchingLine>>();
input
.ToObservable()
.Select(file => Observable.FromAsync(() => Task.Run(() => file.YieldMatchingLines(literals, 0, literals.Count, progress).ToList())))
.Merge(maxDOP.Value)
.ToList()
.GetAwaiter().GetResult()
.SelectMany(m => m)
.ForEach(m => AddResult(res, m));
So, even though I have hundreds of projects, thousands of files and millions of lines of code - it is not the scale for TPL DataFlow, because my tool can read all the files into memory, rearrange them in a favorable order and then process.
Regarding the first question (configuring the pipeline), I can't really offer any guidance. Optimizing the parameters of a dataflow pipeline seems like a black art to me!
Regarding the second question (how to batch a work load consisting of work items having unknown size at compile time), you could use the custom BatchBlock<T> below. It uses the DataflowBlock.Encapsulate method in order to combine two dataflow blocks into one. The first block is an ActionBlock<T> that receives the input and puts it into a buffer, and the second is a BufferBlock<T[]> that holds the batched items and propagates them downstream. The weightSelector is a lambda that returns the weight of each received item. When the accumulated weight surpasses the batchWeight threshold, a batch is emitted.
public static IPropagatorBlock<T, T[]> CreateDynamicBatchBlock<T>(
int batchWeight, Func<T, int> weightSelector,
DataflowBlockOptions options = null)
{
// Arguments validation omitted
options ??= new DataflowBlockOptions();
var outputBlock = new BufferBlock<T[]>(options);
List<T> buffer = new List<T>();
int sumWeight = 0;
var inputBlock = new ActionBlock<T>(async item =>
{
checked
{
int weight = weightSelector(item);
if (weight + sumWeight > batchWeight && buffer.Count > 0)
await SendBatchAsync();
buffer.Add(item);
sumWeight += weight;
if (sumWeight >= batchWeight) await SendBatchAsync();
}
}, new()
{
BoundedCapacity = options.BoundedCapacity,
CancellationToken = options.CancellationToken,
TaskScheduler = options.TaskScheduler,
MaxMessagesPerTask = options.MaxMessagesPerTask,
NameFormat = options.NameFormat
});
PropagateCompletion(inputBlock, outputBlock, async () =>
{
if (buffer.Count > 0) await SendBatchAsync();
});
Task SendBatchAsync()
{
var batch = buffer.ToArray();
buffer.Clear();
sumWeight = 0;
return outputBlock.SendAsync(batch);
}
static async void PropagateCompletion(IDataflowBlock source,
IDataflowBlock target, Func<Task> postCompletionAction)
{
try { await source.Completion.ConfigureAwait(false); } catch { }
Exception ex =
source.Completion.IsFaulted ? source.Completion.Exception : null;
try { await postCompletionAction(); }
catch (Exception actionError) { ex = actionError; }
if (ex != null) target.Fault(ex); else target.Complete();
}
return DataflowBlock.Encapsulate(inputBlock, outputBlock);
}
Usage example:
var batchBlock = CreateDynamicBatchBlock<WorkItem>(1_000_000, wi => wi.Size);
If the int type does not have enough range for the weights and overflows, you could switch to long or double.

Why does it take more time when you run a LINQ OrderBy before Select?

While writing a solution for a coding problem I discovered an interesting behavior of my LINQ statements. I had two scenarios:
First:
arr.Select(x => x + 5).OrderBy(x => x)
Second:
arr.OrderBy(x => x).Select(x => x + 5)
After a little bit of testing with System.Diagnostics.Stopwatch I got the following results for an integer array of length 100_000.
For the first approach:
00:00:00.0000152
For the second:
00:00:00.0073650
Now I'm interested in why it takes more time if I do the ordering first. I wasn't able to find anything on Google, so I thought about it myself.
I ended up with two ideas:
1. The second scenario has to convert to IOrderedEnumerable and then back to IEnumerable while the first scenario only has to convert to IOrderedEnumerable and not back.
2. You end up having 2 loops. The first for sorting and the second for the selecting while approach 1 does everything in 1 loop.
So my question is: why does it take much more time to do the ordering before the select?
Let's have a look at the sequences:
private static void UnderTestOrderBySelect(int[] arr) {
var query = arr.OrderBy(x => x).Select(x => x + 5);
foreach (var item in query)
;
}
private static void UnderTestSelectOrderBy(int[] arr) {
var query = arr.Select(x => x + 5).OrderBy(x => x);
foreach (var item in query)
;
}
// See Marc Gravell's comment; let's compare Linq and inplace Array.Sort
private static void UnderTestInPlaceSort(int[] arr) {
var tmp = arr;
var x = new int[tmp.Length];
for (int i = 0; i < tmp.Length; i++)
x[i] = tmp[i] + 5;
Array.Sort(x);
}
In order to perform the benchmark, let's run 10 times and average the 6 middle results:
private static string Benchmark(Action<int[]> methodUnderTest) {
List<long> results = new List<long>();
int n = 10;
for (int i = 0; i < n; ++i) {
Random random = new Random(1);
int[] arr = Enumerable
.Range(0, 10000000)
.Select(x => random.Next(1000000000))
.ToArray();
Stopwatch sw = new Stopwatch();
sw.Start();
methodUnderTest(arr);
sw.Stop();
results.Add(sw.ElapsedMilliseconds);
}
var valid = results
.OrderBy(x => x)
.Skip(2) // get rid of top 2 runs
.Take(results.Count - 4) // get rid of bottom 2 runs
.ToArray();
return $"{string.Join(", ", valid)} average : {(long) (valid.Average() + 0.5)}";
}
Time to run and have a look at the results:
string report = string.Join(Environment.NewLine,
$"OrderBy + Select: {Benchmark(UnderTestOrderBySelect)}",
$"Select + OrderBy: {Benchmark(UnderSelectOrderBy)}",
$"Inplace Sort: {Benchmark(UnderTestInPlaceSort)}");
Console.WriteLine(report);
Outcome: (Core i7 3.8GHz, .Net 4.8 IA64)
OrderBy + Select: 4869, 4870, 4872, 4874, 4878, 4895 average : 4876
Select + OrderBy: 4763, 4763, 4793, 4802, 4827, 4849 average : 4800
Inplace Sort: 888, 889, 890, 893, 896, 904 average : 893
I don't see any significant difference; Select + OrderBy seems to be slightly more efficient (about a 2% gain) than OrderBy + Select. In-place Sort, however, performs far better (about 5 times faster) than either LINQ version.
Depending on which LINQ provider you have, some optimization may happen on the query. E.g. if you used some kind of database, chances are high your provider would create the exact same query for both statements, similar to this one:
select myColumn from myTable order by myColumn;
Thus performance should be identical, no matter whether you order first in LINQ or select first.
As this does not seem to happen here, you are probably using LINQ to Objects, which has no such optimization at all. So the order of your statements may have an effect, in particular if you had some kind of filter using Where which would filter many objects out, so that later statements would not operate on the entire collection.
To keep it short: the difference most probably comes from some internal initialization logic. As a dataset of 100,000 numbers is not really big - at least not big enough - even some fast initialization has a big impact.

How to split a string into efficient way c#

I have a string like this:
-82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0
I need to have output like this:
-82.9494547 36.2913021, -83.0784938 36.2347521, -82.9537782,36.079235
I have tried this following to code to achieve the desired output:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++)
{
coordinatesVal[i] = coordinatesVal[i].Trim();
coordinatesVal[i] = coordinatesVal[i].Replace(',', ' ');
numbers.Append(coordinatesVal[i]);
if (i != coordinatesVal.Length - 1)
{
numbers.Append(", ");
}
}
But this process does not seem to me like a professional solution. Can anyone please suggest a more efficient way of doing this?
Your code is okay. You could dismiss temporary results and chain the method calls:
var numbers = new StringBuilder();
string[] coordinatesVal = coordinateTxt
.Trim()
.Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++) {
numbers
.Append(coordinatesVal[i].Trim().Replace(',', ' '))
.Append(", ");
}
numbers.Length -= 2;
Note that the last statement assumes that there is at least one coordinate pair available. If the coordinates can be empty, you would have to enclose the loop and this last statement in if (coordinatesVal.Length > 0 ) { ... }. This is still more efficient than having an if inside the loop.
You ask about efficiency, but you don't specify whether you mean code efficiency (execution speed) or programmer efficiency (how much time you have to spend on it).
One key part of professional programming is to judge which one of these is more important in any given situation.
The other answers do a good job of covering programmer efficiency, so I'm taking a stab at code efficiency. I'm doing this at home for fun, but for professional work I would need a good reason before putting in the effort to even spend time comparing the speeds of the methods given in the other answers, let alone try to improve on them.
Having said that, waiting around for the program to finish doing the conversion of millions of coordinate pairs would give me such a reason.
One of the speed pitfalls of C# string handling is the way String.Replace() and String.Trim() return a whole new copy of the string. This involves allocating memory, copying the characters, and eventually cleaning up the garbage generated. Do that a few million times, and it starts to add up. With that in mind, I attempted to avoid as many allocations and copies as possible.
enum CurrentField
{
FirstNum,
SecondNum,
UnwantedZero
};
static string ConvertStateMachine(string input)
{
// Pre-allocate enough space in the string builder.
var numbers = new StringBuilder(input.Length);
var state = CurrentField.FirstNum;
int i = 0;
while (i < input.Length)
{
char c = input[i++];
switch (state)
{
// Copying the first number to the output, next will be another number
case CurrentField.FirstNum:
if (c == ',')
{
// Separate the two numbers by space instead of comma, then move on
numbers.Append(' ');
state = CurrentField.SecondNum;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
// Copying the second number to the output, next will be the ,0\n that we don't need
case CurrentField.SecondNum:
if (c == ',')
{
numbers.Append(", ");
state = CurrentField.UnwantedZero;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
case CurrentField.UnwantedZero:
// Output nothing, just track when the line is finished and we start all over again.
if (c == '\n')
{
state = CurrentField.FirstNum;
}
break;
}
}
return numbers.ToString();
}
This uses a state machine to treat incoming characters differently depending on whether they are part of the first number, second number, or the rest of the line, and output characters accordingly. Each character is only copied once into the output, then I believe once more when the output is converted to a string at the end. This second conversion could probably be avoided by using a char[] for the output.
The bottleneck in this code seems to be the number of calls to StringBuilder.Append(). If more speed were required, I would first attempt to keep track of how many characters were to be copied directly into the output, then use .Append(string value, int startIndex, int count) to send an entire number across in one call.
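For illustration, a rough, unbenchmarked sketch of that bulk-Append idea (a hypothetical ConvertBulkAppend helper, assuming no whitespace ever appears inside a number) might look like this:
// Unbenchmarked sketch: find the two commas on each line and copy each number
// with a single Append(string, startIndex, count) call instead of char by char.
// Requires: using System.Text;
static string ConvertBulkAppend(string input)
{
    var numbers = new StringBuilder(input.Length);
    int pos = 0;
    while (pos < input.Length)
    {
        // Skip leading whitespace and line breaks
        while (pos < input.Length && (input[pos] == ' ' || input[pos] == '\r' || input[pos] == '\n'))
            pos++;
        if (pos >= input.Length) break;

        int firstComma = input.IndexOf(',', pos);
        if (firstComma < 0) break;
        int secondComma = input.IndexOf(',', firstComma + 1);
        if (secondComma < 0) break;

        if (numbers.Length > 0) numbers.Append(", ");
        numbers.Append(input, pos, firstComma - pos)                          // first number
               .Append(' ')
               .Append(input, firstComma + 1, secondComma - firstComma - 1);  // second number

        int lineEnd = input.IndexOf('\n', secondComma + 1);
        pos = lineEnd < 0 ? input.Length : lineEnd + 1;
    }
    return numbers.ToString();
}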
I put a few example solutions into a test harness, and ran them on a string containing 300,000 coordinate-pair lines, averaged over 50 runs. The results on my PC were:
String Split, Replace each line (see Olivier's answer, though I pre-allocated the space in the StringBuilder):
6542 ms / 13493147 ticks, 130.84ms / 269862.9 ticks per conversion
Replace & Trim entire string (see Heriberto's second version):
3352 ms / 6914604 ticks, 67.04 ms / 138292.1 ticks per conversion
- Note: Original test was done with 900000 coord pairs, but this entire-string version suffered an out of memory exception so I had to rein it in a bit.
Split and Join (see Łukasz's answer):
8780 ms / 18110672 ticks, 175.6 ms / 362213.4 ticks per conversion
Character state machine (see above):
1685 ms / 3475506 ticks, 33.7 ms / 69510.12 ticks per conversion
So, the question of which version is most efficient comes down to: what are your requirements?
Your solution is fine. Maybe you could write it a bit more elegantly, like this:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" },
StringSplitOptions.RemoveEmptyEntries);
string result = string.Empty;
foreach (string line in coordinatesVal)
{
string[] numbers = line.Trim().Split(',');
result += numbers[0] + " " + numbers[1] + ", ";
}
result = result.Remove(result.Count()-2, 2);
Note the StringSplitOptions.RemoveEmptyEntries parameter of Split method so you don't have to deal with empty lines into foreach block.
Or you can do an extremely short one-liner. Harder to debug, but in simple cases it does the job.
string result =
string.Join(", ",
coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.RemoveEmptyEntries).
Select(i => i.Replace(",", " ")));
Here's another way, without defining your own loops and replace methods, or using LINQ.
string coordinateTxt = @" -82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string[] coordinatesVal = coordinateTxt.Replace(",", "*").Trim().Split(new string[] { "*0", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(",", coordinatesVal).Replace("*", " ");
Console.WriteLine(result);
or even
string coordinateTxt = @" -82.9494540,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string result = coordinateTxt.Replace(Environment.NewLine, "").Replace($",", " ").Replace(" 0", ", ").Trim(new char[]{ ',',' ' });
Console.WriteLine(result);

Out Of Memory Exception when using less than 1.2 GB?

I have a tricky situation here. I am trying to avoid hitting out of memory exceptions when writing a large CSV dataset to an H5 file via the HDFDotNet API. However, I get an out of memory exception on the second loop through my file data, even though it is the same size as the first iteration, the first one works, and the amount of memory being used should be much less than the ~1.2 GB ceiling. I've determined the size of the chunks I want to read in at a time and the size of the chunks I need to write at a time due to limitations of the API. The CSV file is about 105k lines long by 500 columns wide.
private void WriteDataToH5(H5Writer h5WriterUtil)
{
int startRow = 0;
int skipHeaders = csv.HasColumnHeaders ? 1 : 0;
int readIntervals = (-8 * csv.NumColumns) + 55000;
int numTaken = readIntervals;
while (numTaken == readIntervals)
{
int timeStampCol = HasTimestamps ? 1 : 0;
var readLines = File.ReadLines(this.Filepath)
.Skip(startRow + skipHeaders).Take(readIntervals)
.Select(s => s.Split(new char[] { ',' }).Skip(timeStampCol)
.Select(x => Convert.ToSingle(x)).ToList()).ToList();
//175k is max number of cells that can be written at one time
//(unconfirmed via API, tested and seems to be definitely less than 200k and 175k works)
int writeIntervals = Convert.ToInt32(175000/csv.NumColumns);
for (int i = 0; i < readIntervals; i += writeIntervals)
{
long[] startAt = new long[] { startRow, 0 };
h5WriterUtil.WriteTwoDSingleChunk(readLines.Skip(i).Take(writeIntervals).ToList()
, DatasetsByNamePair[Tuple.Create(groupName, dataset)], startAt);
startRow += writeIntervals;
}
numTaken = readLines.Count;
GC.Collect();
}
}
I end up hitting my out of memory exception on the second pass through the ReadLines section:
var readLines = File.ReadLines(this.Filepath)
.Skip(rowStartAt).Take(numToTake)
.Select(s => s.Split(new char[] { ',' }).Skip(timeStampCol)
.Select(x => Convert.ToSingle(x)).ToList()).ToList();
In this case, my read intervals var would come out to 50992 and the writeIntervals would come out to about 350. Thanks!
You do a lot of unnecessary allocations:
var readLines = File.ReadLines(this.Filepath)
.Skip(rowStartAt).Take(numToTake)
.Select(s => s.Split(new char[] { ',' }) // why do you need to split here?
.Skip(timeStampCol)
.Select(x => Convert.ToSingle(x)).ToList()).ToList(); // why call ToList() twice?
File.ReadLines returns an enumerator, so simply iterate over it: split each line, skip the required column, and convert only the values you need for saving.
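A minimal sketch of that streaming approach (simplified, and assuming the same split/skip logic as in the question) could look like this:
// Simplified sketch: stream the lines of one chunk and parse each row lazily,
// instead of materializing nested lists for the whole read interval up front.
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
static IEnumerable<List<float>> ReadChunk(string path, int skip, int take, int timeStampCol)
{
    foreach (var line in File.ReadLines(path).Skip(skip).Take(take))
    {
        var row = new List<float>();
        foreach (var field in line.Split(',').Skip(timeStampCol))
            row.Add(Convert.ToSingle(field));
        yield return row;   // one row at a time, nothing else is kept alive
    }
}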
As for the out of memory exception while still using less than 1.2 GB of memory, consider the following:
You may try to compile for x64 (still, re-architect your code first!)
Regardless of what you do, there is still a limit on the size of a single collection, which is 2 GB.
You may be allocating more than the stack can offer you, which is 1 MB for 32-bit processes and 4 MB for 64-bit processes. See: Why is stack size in C# exactly 1 MB?

Int.Parse(String.Split()) returns "Input string was not in a correct format" error

I am trying to perform a LINQ query on an array to filter out results based on a user's query. I am having a problem parsing two ints from a single string.
In my database, TimeLevels are stored as strings in the format [mintime]-[maxtime] Minutes, for example 0-5 Minutes. My users have a slider with which they can select a min and max time range, and this is stored as an int array with two values. I'm trying to compare the [mintime] with the first value and the [maxtime] with the second, to find database entries which fit the user's time range.
Here is my C# code from the controller which is supposed to perform that filtering:
RefinedResults = InitialResults.Where(
x => int.Parse(x.TimeLevel.Split('-')[0]) >= data.TimeRange[0] &&
int.Parse(x.TimeLevel.Split('-')[1]) <= data.TimeRange[1]).ToArray();
My thinking was that it would firstly split the 0-5 Minutes string at the - resulting in two strings, 0 and 5 Minutes, then parse the ints from those, resulting in just 0 and 5.
But as soon as it gets to Int.Parse, it throws the error in the title.
Some of the x.TimeLevel database records are stored as "30-40+ Minutes". Is there any method to extract just the int?
You could use regular expressions to match the integer parts of the string for you, like this:
RefinedResults = InitialResults
.Where(x => {
var m = Regex.Match(x.TimeLevel, @"^(\d+)-(\d+)");
return m.Success
&& int.Parse(m.Groups[1].Value) >= data.TimeRange[0]
&& int.Parse(m.Groups[2].Value) <= data.TimeRange[1];
}).ToArray();
This approach requires the string to start in a pair of dash-separated decimal numbers. It would ignore anything after the second number, ensuring that only sequences of digits are passed to int.Parse.
The reason your code doesn't work is because string.Split("-", "0-5 Minutes") will return [0] = "0" and [1] = "5 Minutes", and the latter is not parseable as an int.
You can use the regular expression "\d+" to match the groups of digits and ignore non-digits. This should work:
var refinedResults =
(
from result in InitialResults
let numbers = Regex.Matches(result.TimeLevel, @"\d+")
where ((int.Parse(numbers[0].Value) >= data.TimeRange[0]) && (int.Parse(numbers[1].Value) <= data.TimeRange[1]))
select result
).ToArray();
Here's a complete compilable console app which demonstrates it working. I've used dummy classes to represent your actual classes.
using System;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication2
{
public class SampleTime
{
public SampleTime(string timeLevel)
{
TimeLevel = timeLevel;
}
public readonly string TimeLevel;
}
public class Data
{
public int[] TimeRange = new int[2];
}
class Program
{
private static void Main(string[] args)
{
var initialResults = new []
{
new SampleTime("0-5 Minutes"),
new SampleTime("4-5 Minutes"), // Should be selected below.
new SampleTime("1-8 Minutes"),
new SampleTime("4-6 Minutes"), // Should be selected below.
new SampleTime("4-7 Minutes"),
new SampleTime("5-6 Minutes"), // Should be selected below.
new SampleTime("20-30 Minutes")
};
// Find all ranges between 4 and 6 inclusive.
Data data = new Data();
data.TimeRange[0] = 4;
data.TimeRange[1] = 6;
// The output of this should be (as commented in the array initialisation above):
//
// 4-5 Minutes
// 4-6 Minutes
// 5-6 Minutes
// Here's the significant code:
var refinedResults =
(
from result in initialResults
let numbers = Regex.Matches(result.TimeLevel, @"\d+")
where ((int.Parse(numbers[0].Value) >= data.TimeRange[0]) && (int.Parse(numbers[1].Value) <= data.TimeRange[1]))
select result
).ToArray();
foreach (var result in refinedResults)
{
Console.WriteLine(result.TimeLevel);
}
}
}
}
The error happens because of the " Minutes" part of the string.
You can truncate the " Minutes" part before splitting, like:
x.TimeLevel.Remove(x.TimeLevel.IndexOf(" "))
Then you can split.
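Plugged into the original query, that would look something like this (an untested sketch; note it still would not handle the "40+" values mentioned in the comment without an extra Trim):
// Untested sketch: cut off " Minutes" first, then split on '-'
RefinedResults = InitialResults.Where(x =>
{
    var parts = x.TimeLevel.Remove(x.TimeLevel.IndexOf(' ')).Split('-');
    return int.Parse(parts[0]) >= data.TimeRange[0]
        && int.Parse(parts[1]) <= data.TimeRange[1];
}).ToArray();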
The problem is that you are splitting by '-' but not also by space, which is the separator before the Minutes part. So you could use Split(' ', '-') instead:
InitialResults
.Where(x => int.Parse(x.TimeLevel.Split('-')[0]) >= data.TimeRange[0]
&& int.Parse(x.TimeLevel.Split(' ','-')[1]) <= data.TimeRange[1])
.ToArray();
As an aside, don't store three pieces of information in one column in the database. That's just a source of nasty errors and bad performance. It also makes it more difficult to filter in the database (which should be the preferred way) or to maintain database consistency.
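As a sketch of that idea (hypothetical entity and property names, assuming the schema can be changed): store the minimum and maximum as separate int columns, so the comparison can run on plain ints, or directly in the database:
// Hypothetical shape of the data with the range split into two int columns
public class Recipe
{
    public int Id { get; set; }
    public int MinMinutes { get; set; }
    public int MaxMinutes { get; set; }
}

// With an ORM such as Entity Framework the same Where clause is translated to SQL;
// here it is shown against an in-memory list for illustration.
var recipes = new List<Recipe>
{
    new Recipe { Id = 1, MinMinutes = 0, MaxMinutes = 5 },
    new Recipe { Id = 2, MinMinutes = 30, MaxMinutes = 40 },
};
var refined = recipes
    .Where(r => r.MinMinutes >= data.TimeRange[0] && r.MaxMinutes <= data.TimeRange[1])
    .ToArray();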
Regarding your comment that the format can also be 0-40+ Minutes, you could use...
InitialResults
.Select(x => new {
TimeLevel = x.TimeLevel,
MinMaxPart = x.TimeLevel.Split(' ')[0]
})
.Select(x => new {
TimeLevel = x.TimeLevel,
Min = int.Parse(x.MinMaxPart.Split('-')[0].Trim('+')),
Max = int.Parse(x.MinMaxPart.Split('-')[1].Trim('+'))
})
.Where(x => x.Min >= data.TimeRange[0] && x.Max <= data.TimeRange[1])
.Select(x => x.TimeLevel)
.ToArray();
