Related
I have csv file structured as below:
1,0,2.2,0,0,0,0,1.2,0
0,1,2,4,0,1,0.2,0.1,0
0,0,2,3,0,0,0,1.2,2.1
0,0,0,1,2,1,0,0.2,0.1
0,0,1,0,2.1,0.1,0,1.2
0,0,2,3,0,1.1,0.1,1.2
0,0.2,0,1.2,2,0,3.2,0
0,0,1.2,0,2.2,0,0,1.1
but with 10k columns and 10k rows.
I want to read it in such a way that in the result i get a dictionary
with Key as a index of the row and Value as float array filed with every value in this row.
For now my code look like this:
var lines = File.ReadAllLines(filePath).ToList();
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
var values = line?.Split(',').Where(v =>!string.IsNullOrEmpty(v))
.Select(f => f.Replace('.', ','))
.Select(float.Parse).ToArray();
return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);
but it takes up to 30 seconds to finish, so it's quite slow and i want to optimize it to be a bit faster.
While there are many small optimizations you can make, what is really killing you is the garbage collector because of all the allocations.
Your code takes 12 seconds to run on my machine. Reading the file uses 2 of those 12 seconds.
By using all the optimizations mentioned in the comments (using File.ReadLines, StringSplitOptions.RemoveEmptyEntries, also using float.Parse(f, CultureInfo.InvariantCulture) instead of calling string.Replace), we get down to 9 seconds. There's still a lot of allocations done, especially by File.ReadLines. Can we do better?
Just activate server GC in the app.config:
<runtime>
<gcServer enabled="true" />
</runtime>
With that, the execution time drops to 6 seconds using your code, and 3 seconds using the optimizations mentioned above. At that point, the file I/O are taking more than 60% of the execution time, so it's not really worth optimizing more.
Final version of the code:
var lines = File.ReadLines(filePath);
var separator = new[] {','};
var result = lines.AsParallel().AsOrdered().Select((line, index) =>
{
var values = line?.Split(separator, StringSplitOptions.RemoveEmptyEntries)
.Select(f => float.Parse(f, CultureInfo.InvariantCulture)).ToArray();
return (index, values);
}).ToDictionary(d => d.Item1, d => d.Item2);
Replacing the Split and Replace with hand parsing and using InvariantInfo to accept the period as decimal point, and then removing the wasteful ReadAllLines().ToList() and letting the AsParallel() read from the file while parsing, speeds up on my PC about four times.
var lines = File.ReadLines(filepath);
var result = lines.AsParallel().AsOrdered().Select((line, index) => {
var values = new List<float>(10000);
var pos = 0;
while (pos < line.Length) {
var commapos = line.IndexOf(',', pos);
commapos = commapos < 0 ? line.Length : commapos;
var fs = line.Substring(pos, commapos - pos);
if (fs != String.Empty) // remove if no value is ever missing
values.Add(float.Parse(fs, NumberFormatInfo.InvariantInfo));
pos = commapos + 1;
}
return values;
}).ToList();
Also replaced ToArray on values with a List as that is generally faster (ToList is preferred over ToArray).
using Microsoft.VisualBasic.FileIO;
protected void CSVImport(string importFilePath)
{
string csvData = System.IO.File.ReadAllText(importFilePath, System.Text.Encoding.GetEncoding("WINDOWS-1250"));
foreach (string row in csvData.Split('\n'))
{
var parser = new TextFieldParser(new StringReader(row));
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields;
fields = parser.ReadFields();
//do what you need with data in array
}
}
I have a string like this:
-82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0
I need to have output like this:
-82.9494547 36.2913021, -83.0784938 36.2347521, -82.9537782,36.079235
I have tried this following to code to achieve the desired output:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++)
{
coordinatesVal[i] = coordinatesVal[i].Trim();
coordinatesVal[i] = coordinatesVal[i].Replace(',', ' ');
numbers.Append(coordinatesVal[i]);
if (i != coordinatesVal.Length - 1)
{
coordinatesVal.Append(", ");
}
}
But this process does not seem to me the professional solution. Can anyone please suggest more efficient way of doing this?
Your code is okay. You could dismiss temporary results and chain method calls
var numbers = new StringBuilder();
string[] coordinatesVal = coordinateTxt
.Trim()
.Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++) {
numbers
.Append(coordinatesVal[i].Trim().Replace(',', ' '))
.Append(", ");
}
numbers.Length -= 2;
Note that the last statement assumes that there is at least one coordinate pair available. If the coordinates can be empty, you would have to enclose the loop and this last statement in if (coordinatesVal.Length > 0 ) { ... }. This is still more efficient than having an if inside the loop.
You ask about efficiency, but you don't specify whether you mean code efficiency (execution speed) or programmer efficiency (how much time you have to spend on it).
One key part of professional programming is to judge which one of these is more important in any given situation.
The other answers do a good job of covering programmer efficiency, so I'm taking a stab at code efficiency. I'm doing this at home for fun, but for professional work I would need a good reason before putting in the effort to even spend time comparing the speeds of the methods given in the other answers, let alone try to improve on them.
Having said that, waiting around for the program to finish doing the conversion of millions of coordinate pairs would give me such a reason.
One of the speed pitfalls of C# string handling is the way String.Replace() and String.Trim() return a whole new copy of the string. This involves allocating memory, copying the characters, and eventually cleaning up the garbage generated. Do that a few million times, and it starts to add up. With that in mind, I attempted to avoid as many allocations and copies as possible.
enum CurrentField
{
FirstNum,
SecondNum,
UnwantedZero
};
static string ConvertStateMachine(string input)
{
// Pre-allocate enough space in the string builder.
var numbers = new StringBuilder(input.Length);
var state = CurrentField.FirstNum;
int i = 0;
while (i < input.Length)
{
char c = input[i++];
switch (state)
{
// Copying the first number to the output, next will be another number
case CurrentField.FirstNum:
if (c == ',')
{
// Separate the two numbers by space instead of comma, then move on
numbers.Append(' ');
state = CurrentField.SecondNum;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
// Copying the second number to the output, next will be the ,0\n that we don't need
case CurrentField.SecondNum:
if (c == ',')
{
numbers.Append(", ");
state = CurrentField.UnwantedZero;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
case CurrentField.UnwantedZero:
// Output nothing, just track when the line is finished and we start all over again.
if (c == '\n')
{
state = CurrentField.FirstNum;
}
break;
}
}
return numbers.ToString();
}
This uses a state machine to treat incoming characters differently depending on whether they are part of the first number, second number, or the rest of the line, and output characters accordingly. Each character is only copied once into the output, then I believe once more when the output is converted to a string at the end. This second conversion could probably be avoided by using a char[] for the output.
The bottleneck in this code seems to be the number of calls to StringBuilder.Append(). If more speed were required, I would first attempt to keep track of how many characters were to be copied directly into the output, then use .Append(string value, int startIndex, int count) to send an entire number across in one call.
I put a few example solutions into a test harness, and ran them on a string containing 300,000 coordinate-pair lines, averaged over 50 runs. The results on my PC were:
String Split, Replace each line (see Olivier's answer, though I pre-allocated the space in the StringBuilder):
6542 ms / 13493147 ticks, 130.84ms / 269862.9 ticks per conversion
Replace & Trim entire string (see Heriberto's second version):
3352 ms / 6914604 ticks, 67.04 ms / 138292.1 ticks per conversion
- Note: Original test was done with 900000 coord pairs, but this entire-string version suffered an out of memory exception so I had to rein it in a bit.
Split and Join (see Ćukasz's answer):
8780 ms / 18110672 ticks, 175.6 ms / 362213.4 ticks per conversion
Character state machine (see above):
1685 ms / 3475506 ticks, 33.7 ms / 69510.12 ticks per conversion
So, the question of which version is most efficient comes down to: what are your requirements?
Your solution is fine. Maybe you could write it a bit more elegant like this:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" },
StringSplitOptions.RemoveEmptyEntries);
string result = string.Empty;
foreach (string line in coordinatesVal)
{
string[] numbers = line.Trim().Split(',');
result += numbers[0] + " " + numbers[1] + ", ";
}
result = result.Remove(result.Count()-2, 2);
Note the StringSplitOptions.RemoveEmptyEntries parameter of Split method so you don't have to deal with empty lines into foreach block.
Or you can do extremely short one-liner. Harder to debug, but in simple cases does the work.
string result =
string.Join(", ",
coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.RemoveEmptyEntries).
Select(i => i.Replace(",", " ")));
heres another way without defining your own loops and replace methods, or using LINQ.
string coordinateTxt = #" -82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string[] coordinatesVal = coordinateTxt.Replace(",", "*").Trim().Split(new string[] { "*0", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(",", coordinatesVal).Replace("*", " ");
Console.WriteLine(result);
or even
string coordinateTxt = #" -82.9494540,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string result = coordinateTxt.Replace(Environment.NewLine, "").Replace($",", " ").Replace(" 0", ", ").Trim(new char[]{ ',',' ' });
Console.WriteLine(result);
I have the following code. Is there a more efficient way to accomplish the same tasks?
Given a folder, loop over the files within the folder.
Within each file, skip the first four header lines,
After splitting the row based on a space, if the resulting array contains less than 7 elements, skip it,
Check if the specified element is already in the dictionary. If it is, increment the count. If not, add it.
It's not a complicated process. Is there a better way to do this? LINQ?
string sourceDirectory = #"d:\TESTDATA\";
string[] files = Directory.GetFiles(sourceDirectory, "*.log",
SearchOption.TopDirectoryOnly);
var dictionary = new Dictionary<string, int>();
foreach (var file in files)
{
string[] lines = System.IO.File.ReadLines(file).Skip(4).ToArray();
foreach (var line in lines)
{
var elements = line.Split(' ');
if (elements.Length > 6)
{
if (dictionary.ContainsKey(elements[9]))
{
dictionary[elements[9]]++;
}
else
{
dictionary.Add(elements[9], 1);
}
}
}
}
Something Linqy should do you. Doubt its any more efficient. And, it's almost certainly more of a hassle to debug. But it is very trendy these days:
static Dictionary<string,int> Slurp( string rootDirectory )
{
Dictionary<string,int> instance = Directory.EnumerateFiles(rootDirectory,"*.log",SearchOption.TopDirectoryOnly)
.SelectMany( fn => File.ReadAllLines(fn)
.Skip(4)
.Select( txt => txt.Split( " ".ToCharArray() , StringSplitOptions.RemoveEmptyEntries) )
.Where(x => x.Length > 9 )
.Select( x => x[9])
)
.GroupBy( x => x )
.ToDictionary( x => x.Key , x => x.Count())
;
return instance ;
}
A more efficient (performance-wise) way to do this would be to parallelize your outer foreach with the Parallel.Foreach method. You'd also need a ConcurrentDictionary object instead of a standard dictionary.
Not sure if you are looking for better performance or more elegant code.
If you prefer functional style linq, something like this maybe:
var query= from element in
(
//go through all file names
from fileName in files
//read all lines from every file and skip first 4
from line in File.ReadAllLines(fileName).Skip(4)
//split every line into words
let lineData = line.Split(new[] {' '})
//select only lines with more than 6 words
where lineData.Count() > 6
//take 6th element from line
select lineData.ElementAt(6)
)
//outer query will group by element
group element by element
into g
select new
{
Key = g.Key,
Count = g.Count()
};
var dictionary = list.ToDictionary(e=>e.Key,e=>e.Count);
Result is dictionary with word as Key and count of word occrances as value.
I would expect that reading the files is going to be the most consuming part of the operation. In many cases, trying to read multiple files at once on different threads will hurt rather than help performance, but it may be helpful to have a thread which does nothing but read files, so that it can keep the drive as busy as possible.
If the files could get big (as seems likely) and if no line will exceed 32K bytes (8,000-32,000 Unicode characters), I'd suggest that you read them in chunks of about 32K or 64K bytes (not characters). Reading a file as bytes and subdividing into lines yourself may be faster than reading it as lines, since the subdivision can happen on a different thread from the physical disk access.
I'd suggest starting with one thread for disk access and one thread for parsing and counting, with a blocking queue between them. The disk-access thread should put on the queue data items which contain a 32K array of bytes, an indication of how many bytes are valid [may be less than 32K at the end of a file], and an indicator of whether it's the last record of a file. The parsing thread should read those items, parse them into lines, and update the appropriate counts.
To improve counting performance, it may be helpful to define
class ExposedFieldHolder<T> {public T Value; }
and then have a Dictionary<string, ExposedFieldHolder<int>>. One would have to create a new ExposedFieldHolder<int> for each dictionary slot, but dictionary[elements[9]].Value++; would likely be faster than dictionary[elements[9]]++; since the latter statement translates as dictionary[elements[9]] = dictionary[elements[9]]+1;, and has to look up the element once when reading and again when writing].
If it's necessary to do the parsing and counting on multiple threads, I would suggest that each thread have its own queue, and the disk-reading thread switch queues after each file [all blocks of a file should be handled by the same thread, since a text line might span two blocks]. Additionally, while it would be possible to use a ConcurrentDictionary, it might be more efficient to have each thread have its own independent Dictionary and consolidate the results at the end.
I have a list of 500000 randomly generated Tuple<long,long,string> objects on which I am performing a simple "between" search:
var data = new List<Tuple<long,long,string>>(500000);
...
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing of the great wonders that sorting does to searching, however, I decided to sort my data - first by Item1, then by Item2, and finally by Item3 - before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking has been that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would predict the branch correctly as "no take", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!
I tried switching around the order in which I ran my experiments, and used different seed for the random number generator, but the effect has been the same: searches in an unsorted array ran nearly twice as fast as the searches in the same array, but sorted!
Does anyone have a good explanation of this strange effect? The source code of my tests follows; I am using .NET 4.0.
private const int TotalCount = 500000;
private const int TotalQueries = 100;
private static long NextLong(Random r) {
var data = new byte[8];
r.NextBytes(data);
return BitConverter.ToInt64(data, 0);
}
private class TupleComparer : IComparer<Tuple<long,long,string>> {
public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y) {
var res = x.Item1.CompareTo(y.Item1);
if (res != 0) return res;
res = x.Item2.CompareTo(y.Item2);
return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
}
}
static void Test(bool doSort) {
var data = new List<Tuple<long,long,string>>(TotalCount);
var random = new Random(1000000007);
var sw = new Stopwatch();
sw.Start();
for (var i = 0 ; i != TotalCount ; i++) {
var a = NextLong(random);
var b = NextLong(random);
if (a > b) {
var tmp = a;
a = b;
b = tmp;
}
var s = string.Format("{0}-{1}", a, b);
data.Add(Tuple.Create(a, b, s));
}
sw.Stop();
if (doSort) {
data.Sort(new TupleComparer());
}
Console.WriteLine("Populated in {0}", sw.Elapsed);
sw.Reset();
var total = 0L;
sw.Start();
for (var i = 0 ; i != TotalQueries ; i++) {
var x = NextLong(random);
var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);
total += cnt;
}
sw.Stop();
Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
}
static void Main() {
Test(false);
Test(true);
Test(false);
Test(true);
}
Populated in 00:00:01.3176257
Found 15614281 matches in 00:00:04.2463478 (Unsorted)
Populated in 00:00:01.3345087
Found 15614281 matches in 00:00:08.5393730 (Sorted)
Populated in 00:00:01.3665681
Found 15614281 matches in 00:00:04.1796578 (Unsorted)
Populated in 00:00:01.3326378
Found 15614281 matches in 00:00:08.6027886 (Sorted)
When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.
When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.
This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.
The penalty from cache misses outweighs the saved branch prediction penalty in this case.
Try switching to a struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.
Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.
This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tell me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number one could expect the program to become >10x slower when introducing random memory accesses.
LINQ doesn't know whether you list is sorted or not.
Since Count with predicate parameter is extension method for all IEnumerables, I think it doesn't even know if it's running over the collection with efficient random access. So, it simply checks every element and Usr explained why performance got lower.
To exploit performance benefits of sorted array (such as binary search), you'll have to do a little bit more coding.
I have a 1 GB text file which I need to read line by line. What is the best and fastest way to do this?
private void ReadTxtFile()
{
string filePath = string.Empty;
filePath = openFileDialog1.FileName;
if (string.IsNullOrEmpty(filePath))
{
using (StreamReader sr = new StreamReader(filePath))
{
String line;
while ((line = sr.ReadLine()) != null)
{
FormatData(line);
}
}
}
}
In FormatData() I check the starting word of line which must be matched with a word and based on that increment an integer variable.
void FormatData(string line)
{
if (line.StartWith(word))
{
globalIntVariable++;
}
}
If you are using .NET 4.0, try MemoryMappedFile which is a designed class for this scenario.
You can use StreamReader.ReadLine otherwise.
Using StreamReader is probably the way to since you don't want the whole file in memory at once. MemoryMappedFile is more for random access than sequential reading (it's ten times as fast for sequential reading and memory mapping is ten times as fast for random access).
You might also try creating your streamreader from a filestream with FileOptions set to SequentialScan (see FileOptions Enumeration), but I doubt it will make much of a difference.
There are however ways to make your example more effective, since you do your formatting in the same loop as reading. You're wasting clockcycles, so if you want even more performance, it would be better with a multithreaded asynchronous solution where one thread reads data and another formats it as it becomes available. Checkout BlockingColletion that might fit your needs:
Blocking Collection and the Producer-Consumer Problem
If you want the fastest possible performance, in my experience the only way is to read in as large a chunk of binary data sequentially and deserialize it into text in parallel, but the code starts to get complicated at that point.
You can use LINQ:
int result = File.ReadLines(filePath).Count(line => line.StartsWith(word));
File.ReadLines returns an IEnumerable<String> that lazily reads each line from the file without loading the whole file into memory.
Enumerable.Count counts the lines that start with the word.
If you are calling this from an UI thread, use a BackgroundWorker.
Probably to read it line by line.
You should rather not try to force it into memory by reading to end and then processing.
StreamReader.ReadLine should work fine. Let the framework choose the buffering, unless you know by profiling you can do better.
TextReader.ReadLine()
I was facing same problem in our production server at Agenty where we see large files (sometimes 10-25 gb (\t) tab delimited txt files). And after lots of testing and research I found the best way to read large files in small chunks with for/foreach loop and setting offset and limit logic with File.ReadLines().
int TotalRows = File.ReadLines(Path).Count(); // Count the number of rows in file with lazy load
int Limit = 100000; // 100000 rows per batch
for (int Offset = 0; Offset < TotalRows; Offset += Limit)
{
var table = Path.FileToTable(heading: true, delimiter: '\t', offset : Offset, limit: Limit);
// Do all your processing here and with limit and offset and save to drive in append mode
// The append mode will write the output in same file for each processed batch.
table.TableToFile(#"C:\output.txt");
}
See the complete code in my Github library : https://github.com/Agenty/FileReader/
Full Disclosure - I work for Agenty, the company who owned this library and website
My file is over 13 GB:
You can use my class:
public static void Read(int length)
{
StringBuilder resultAsString = new StringBuilder();
using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(#"D:\_Profession\Projects\Parto\HotelDataManagement\_Document\Expedia_Rapid.jsonl\Expedia_Rapi.json"))
using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, length))
{
for (int i = 0; i < length; i++)
{
//Reads a byte from a stream and advances the position within the stream by one byte, or returns -1 if at the end of the stream.
int result = memoryMappedViewStream.ReadByte();
if (result == -1)
{
break;
}
char letter = (char)result;
resultAsString.Append(letter);
}
}
}
This code will read text of file from start to the length that you pass to the method Read(int length) and fill the resultAsString variable.
It will return the bellow text:
I'd read the file 10,000 bytes at a time. Then I'd analyse those 10,000 bytes and chop them into lines and feed them to the FormatData function.
Bonus points for splitting the reading and line analysation on multiple threads.
I'd definitely use a StringBuilder to collect all strings and might build a string buffer to keep about 100 strings in memory all the time.