CSV File Splitting with specific size - c#

Hi guys, I have a function that creates multiple CSV files from a DataTable in smaller chunks, based on a size passed through an app.config key/value pair.
Issues with the code below:
The configured value is multiplied by 1024 (1 KB), so when I pass a value of 20 it should create CSV files of 20 KB each. Currently it creates files of about 5 KB for the same value.
It also never creates a file for the records left over at the end.
Kindly help me fix this. Thanks!
Code:
public static void CreateCSVFile(DataTable dt, string CSVFileName)
{
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024; //1 KB size
    string CSVPath = ConfigurationManager.AppSettings["CSVPath"];
    StringBuilder FirstLine = new StringBuilder();
    StringBuilder records = new StringBuilder();
    int num = 0;
    int length = 0;
    IEnumerable<string> columnNames = dt.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
    FirstLine.AppendLine(string.Join(",", columnNames));
    records.AppendLine(FirstLine.ToString());
    length += records.ToString().Length;
    foreach (DataRow row in dt.Rows)
    {
        //Putting field values in double quotes
        IEnumerable<string> fields = row.ItemArray.Select(field =>
            string.Concat("\"", field.ToString().Replace("\"", "\"\""), "\""));
        records.AppendLine(string.Join(",", fields));
        length += records.ToString().Length;
        if (length > size)
        {
            //Create a new file
            num++;
            File.WriteAllText(CSVPath + CSVFileName + DateTime.Now.ToString("yyyyMMddHHmmss") + num.ToString("_000") + ".csv", records.ToString());
            records.Clear();
            length = 0;
            records.AppendLine(FirstLine.ToString());
        }
    }
}

Use File.ReadLines: it returns a lazily evaluated sequence, so the file is read with deferred execution as you enumerate it.
foreach (var line in File.ReadLines(FilePath))
{
    // logic here.
}
From MSDN
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient.
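To see the difference in behaviour, here is a small illustration (the path is a placeholder; assumes using System.IO and System.Linq):
// ReadLines streams lazily: only the lines actually pulled from the enumerable are read from disk.
var firstTen = File.ReadLines(@"C:\data\big.csv").Take(10).ToList();

// ReadAllLines is eager: the whole file is read into an array before this call returns.
string[] everything = File.ReadAllLines(@"C:\data\big.csv");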
So you could rewrite your method as below.
public static void SplitCSV(string FilePath, string FileName)
{
    // Read the configured chunk size
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024 * 1024; // 1 MB size
    int total = 0;
    int num = 0;
    string FirstLine = null; // header, repeated in every new file
    var writer = new StreamWriter(GetFileName(FileName, num));
    // Loop through all source lines
    foreach (var line in File.ReadLines(FilePath))
    {
        if (string.IsNullOrEmpty(FirstLine)) FirstLine = line;
        // Length of the current line
        int length = line.Length;
        // See if adding this line would exceed the size threshold
        if (total + length >= size)
        {
            // Start a new file
            num++;
            total = 0;
            writer.Dispose();
            writer = new StreamWriter(GetFileName(FileName, num));
            writer.WriteLine(FirstLine);
            length += FirstLine.Length;
        }
        // Write the line to the current file
        writer.WriteLine(line);
        // Add length of line in bytes to running size
        total += length;
        // Add size of newlines
        total += Environment.NewLine.Length;
    }
    // Flush and close the last chunk
    writer.Dispose();
}
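The GetFileName helper is assumed but not shown in the answer; a minimal sketch, mirroring the timestamp-plus-counter naming from the original code, might look like this:
// Hypothetical helper - not part of the original answer.
// Mirrors the question's naming: <CSVPath><FileName><timestamp>_<counter>.csv
private static string GetFileName(string fileName, int num)
{
    string csvPath = ConfigurationManager.AppSettings["CSVPath"];
    return csvPath + fileName + DateTime.Now.ToString("yyyyMMddHHmmss") + num.ToString("_000") + ".csv";
}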

The solution is quite simple... you don't need to put all your lines in memory (as you do in string[] arr = File.ReadAllLines(FilePath);).
Instead, create a StreamReader on the input file and read line by line into a buffer. When the buffer goes over your "threshold size", write it to disk as a single CSV file. The code should be something like this:
using (var sr = new System.IO.StreamReader(filePath))
{
    var linesBuffer = new List<string>();
    while (sr.Peek() >= 0)
    {
        linesBuffer.Add(sr.ReadLine());
        if (linesBuffer.Count > yourThreshold)
        {
            // TODO: implement function WriteLinesToPartialCsv
            WriteLinesToPartialCsv(linesBuffer);
            // Clear the buffer:
            linesBuffer.Clear();
            // Try forcing c# to clear the memory:
            GC.Collect();
        }
    }
    // Write out whatever is left in the buffer after the loop
    if (linesBuffer.Count > 0)
        WriteLinesToPartialCsv(linesBuffer);
}
As you can see, by reading the stream line by line (instead of the whole CSV input file, as your code did) you have better control over the memory.
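WriteLinesToPartialCsv is left as a TODO above; a minimal sketch, assuming the parts are simply numbered by a counter kept in a field, might look like this:
private static int partNumber = 0;

// Hypothetical implementation of the TODO above: writes one buffered chunk to its own CSV file.
private static void WriteLinesToPartialCsv(List<string> linesBuffer)
{
    partNumber++;
    string partPath = "part_" + partNumber.ToString("000") + ".csv"; // naming is just an assumption
    File.WriteAllLines(partPath, linesBuffer);
}
If each part needs the CSV header, the caller can insert it as the first element of the buffer before calling this.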

Related

How to Stream string data from a txt file into an array

I'm doing this exercise from a lab. The instructions are as follows:
This method should read the product catalog from a text file called “catalog.txt” that you should
create alongside your project. Each product should be on a separate line. Use the instructions in
the video to create the file and add it to your project, and to return an array with the first 200
lines from the file (use the StreamReader class and a while loop to read from the file). If the file
has more than 200 lines, ignore them. If the file has fewer than 200 lines, it's OK if some of the
array elements are empty (null).
I don't understand how to stream data into the string array; any clarification would be greatly appreciated!
static string[] ReadCatalogFromFile()
{
    //create instance of the catalog.txt
    StreamReader readCatalog = new StreamReader("catalog.txt");
    //store the information in this array
    string[] storeCatalog = new string[200];
    int i = 0;
    //test and store the array information
    while (storeCatalog != null)
    {
        //store each string in the elements of the array?
        storeCatalog[i] = readCatalog.ReadLine();
        i = i + 1;
        if (storeCatalog != null)
        {
            //test to see if its properly stored
            Console.WriteLine(storeCatalog[i]);
        }
    }
    readCatalog.Close();
    Console.ReadLine();
    return storeCatalog;
}
Here are some hints:
int i = 0;
This needs to be outside your loop (now it is reset to 0 each time).
In your while() you should check the result of readCatalog.ReadLine() and/or the maximum number of lines to read (i.e. the size of your array).
Thus: if you reached the end of the file -> stop - or if your array is full -> stop.
static string[] ReadCatalogFromFile()
{
    var lines = new string[200];
    using (var reader = new StreamReader("catalog.txt"))
        for (var i = 0; i < 200 && !reader.EndOfStream; i++)
            lines[i] = reader.ReadLine();
    return lines;
}
A for loop is used when you know the exact number of iterations beforehand, so you can say it should iterate exactly 200 times and you won't cross the index boundaries. At the moment you just check that your array isn't null, which it will never be.
using (var readCatalog = new StreamReader("catalog.txt"))
{
    string[] storeCatalog = new string[200];
    for (int i = 0; i < 200; i++)
    {
        string temp = readCatalog.ReadLine();
        if (temp != null)
            storeCatalog[i] = temp;
        else
            break;
    }
    return storeCatalog;
}
As soon as there are no more lines in the file, temp will be null and the loop will be stopped by the break.
I suggest you use your disposable resources (like any stream) in a using statement. After the operations in the braces, the resource will automatically get disposed.

Intersect and Union in byte array of 2 files

I have two files: a source file and a destination file.
Below is my code to Intersect and Union the two files using byte arrays.
FileStream frsrc = new FileStream("Src.bin", FileMode.Open);
FileStream frdes = new FileStream("Des.bin", FileMode.Open);
int length = 24; // get file length
byte[] src = new byte[length];
byte[] des = new byte[length]; // create buffer
int Counter = 0; // actual number of bytes read
int subcount = 0;
while (frsrc.Read(src, 0, length) > 0)
{
    try
    {
        Counter = 0;
        frdes.Position = subcount * length;
        while (frdes.Read(des, 0, length) > 0)
        {
            var data = src.Intersect(des);
            var data1 = src.Union(des);
            Counter++;
        }
        subcount++;
        Console.WriteLine(subcount.ToString());
    }
    catch (Exception ex)
    {
    }
}
It works fine and is very fast.
But now the problem is that I want the count, and when I use the code below it becomes very slow.
var data = src.Intersect(des).Count();
var data1 = src.Union(des).Count();
So, is there any solution for that?
If yes, then please let me know as soon as possible.
Thanks
Intersect and Union are not the fastest operations. The reason you see it being fast is that you never actually enumerate the results!
Both return an enumerable, not the actual results of the operation. You're supposed to go through that and enumerate the enumerable, otherwise nothing happens - this is called "deferred execution". Now, when you do Count, you actually enumerate the enumerable, and incur the full cost of the Intersect and Union - believe me, the Count itself is relatively trivial (though still an O(n) operation!).
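To make that concrete:
var data = src.Intersect(des); // builds a lazy query; nothing is computed yet
int n = data.Count();          // enumeration happens here, so this line pays the full cost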
You'll need to make your own methods, most likely. You want to avoid the enumerable overhead, and more importantly, you'll probably want a lookup table.
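As a rough sketch of the lookup-table idea (my own illustration, not code from this answer): since a byte has only 256 possible values, presence flags are enough to produce the same distinct-value counts that Intersect(...).Count() and Union(...).Count() give.
// Sketch: distinct-value intersection/union counts for two byte arrays using presence flags.
static void CountIntersectAndUnion(byte[] src, byte[] des, out int intersectCount, out int unionCount)
{
    var inSrc = new bool[256];
    var inDes = new bool[256];
    foreach (byte b in src) inSrc[b] = true;
    foreach (byte b in des) inDes[b] = true;

    intersectCount = 0;
    unionCount = 0;
    for (int i = 0; i < 256; i++)
    {
        if (inSrc[i] && inDes[i]) intersectCount++;
        if (inSrc[i] || inDes[i]) unionCount++;
    }
}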
A few points: the comment // get file length is misleading as it is the buffer size. Counter is not the number of bytes read, it is the number of blocks read. data and data1 will end up with the result of the last block read, ignoring any data before them. That is assuming that nothing goes wrong in the while loop - you need to remove the try structure to see if there are any errors.
What you can do is count the number of occurrences of each byte in each file; if the count of a byte is greater than zero in any file then it is a member of the union of the files, and if its count is greater than zero in all the files then it is a member of the intersection of the files.
It is just as easy to write the code for more than two files as it is for two files, whereas LINQ is easy for two but a little bit more fiddly for more than two. (I put in a comparison with using LINQ in a naïve fashion for only two files at the end.)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var file1 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\Crysis3.exe"; // 26MB
            var file2 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\d3dcompiler_46.dll"; // 3MB
            List<string> files = new List<string> { file1, file2 };
            var sw = System.Diagnostics.Stopwatch.StartNew();
            // Prepare array of counters for the bytes
            var nFiles = files.Count;
            int[][] count = new int[nFiles][];
            for (int i = 0; i < nFiles; i++)
            {
                count[i] = new int[256];
            }
            // Get the counts of bytes in each file
            int bufLen = 32768;
            byte[] buffer = new byte[bufLen];
            int bytesRead;
            for (int fileNum = 0; fileNum < nFiles; fileNum++)
            {
                using (var sr = new FileStream(files[fileNum], FileMode.Open, FileAccess.Read))
                {
                    bytesRead = bufLen;
                    while (bytesRead > 0)
                    {
                        bytesRead = sr.Read(buffer, 0, bufLen);
                        for (int i = 0; i < bytesRead; i++)
                        {
                            count[fileNum][buffer[i]]++;
                        }
                    }
                }
            }
            // Find which bytes are in any of the files or in all the files
            var inAny = new List<byte>(); // union
            var inAll = new List<byte>(); // intersect
            for (int i = 0; i < 256; i++)
            {
                Boolean all = true;
                for (int fileNum = 0; fileNum < nFiles; fileNum++)
                {
                    if (count[fileNum][i] > 0)
                    {
                        if (!inAny.Contains((byte)i)) // avoid adding same value more than once
                        {
                            inAny.Add((byte)i);
                        }
                    }
                    else
                    {
                        all = false;
                    }
                }
                if (all)
                {
                    inAll.Add((byte)i);
                }
            }
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            // Display the results
            Console.WriteLine("Union: " + string.Join(",", inAny.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            Console.WriteLine("Intersect: " + string.Join(",", inAll.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            // Compare to using LINQ.
            // N.B. Will need adjustments for more than two files.
            var srcBytes1 = File.ReadAllBytes(file1);
            var srcBytes2 = File.ReadAllBytes(file2);
            sw.Restart();
            var intersect = srcBytes1.Intersect(srcBytes2).ToArray().OrderBy(x => x);
            var union = srcBytes1.Union(srcBytes2).ToArray().OrderBy(x => x);
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.WriteLine("Union: " + String.Join(",", union.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            Console.WriteLine("Intersect: " + String.Join(",", intersect.Select(x => x.ToString("X2"))));
            Console.ReadLine();
        }
    }
}
The counting-the-byte-occurrences method is roughly five times faster than the LINQ method on my computer, even with the LINQ method's file loading excluded from its timing, and across a range of file sizes (a few KB to a few MB).

c# string array out of memory

I am trying to take two txt files and combine them so that every line in file1 is concatenated with every line in file2.
Example:
file1:
a
b
file2:
c
d
result:
a c
a d
b c
b d
This is the code:
{
    //int counter = 0;
    string[] lines1 = File.ReadLines("e:\\1.txt").ToArray();
    string[] lines2 = File.ReadLines("e:\\2.txt").ToArray();
    int len1 = lines1.Length;
    int len2 = lines2.Length;
    string[] names = new string[len1 * len2];
    int i = 0;
    int finish = 0;
    //Console.WriteLine("Check this");
    for (i = 0; i < lines2.Length; i++)
    {
        for (int j = 0; j < lines1.Length; j++)
        {
            names[finish] = lines2[i] + ' ' + lines1[j];
            finish++;
        }
    }
    using (System.IO.StreamWriter file = new System.IO.StreamWriter(@"E:\text.txt"))
    {
        foreach (string line in names)
        {
            // Write each combined line to the file.
            file.WriteLine(line);
        }
    }
}
I get this exception:
"An unhandled exception of type 'System.OutOfMemoryException' occurred
in ConsoleApplication2.exe" on this line:
string[] names = new string[len1 * len2];
Is there another way to combine these two files without getting an OutOfMemoryException?
Something like:
using (var output = new StreamWriter(@"E:\text.txt"))
{
    foreach (var line1 in File.ReadLines("e:\\1.txt"))
    {
        foreach (var line2 in File.ReadLines("e:\\2.txt"))
        {
            output.WriteLine("{0} {1}", line1, line2);
        }
    }
}
Unless the lines are very long, this should avoid an OutOfMemoryException.
It looks like you want a Cartesian product rather than simple concatenation. Instead of loading all lines into memory, use ReadLines with SelectMany; this may not be fast, but it will avoid the exception:
var file1 = File.ReadLines("e:\\1.txt");
var file2 = File.ReadLines("e:\\2.txt");
var lines = file1.SelectMany(x => file2.Select(y => string.Join(" ", x, y)));
File.WriteAllLines("output.txt", lines);
Use a StringBuilder instance instead of concatenating strings. Strings are immutable in .NET, so each change to an instance creates a new one, consuming the available memory.
Make names a StringBuilder[] and use the Append method to construct your result.
If your files are large (this is the reason for the out-of-memory error) you should never load the complete file into memory. In particular the result file (with size = size1 * size2) will get very large.
I'd suggest using a StreamReader to read through the input files line by line and
a StreamWriter to write the result file line by line.
With this technique you can process arbitrarily large files (as long as the result fits on your hard disk).
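A minimal sketch of that approach (file paths are placeholders); note that the second file has to be re-opened for each line of the first so it can be streamed again:
using (var writer = new StreamWriter(@"E:\text.txt"))
using (var outer = new StreamReader(@"E:\1.txt"))
{
    string line1;
    while ((line1 = outer.ReadLine()) != null)
    {
        // Re-open the second file for every outer line so it can be read from the start again.
        using (var inner = new StreamReader(@"E:\2.txt"))
        {
            string line2;
            while ((line2 = inner.ReadLine()) != null)
            {
                writer.WriteLine(line1 + " " + line2);
            }
        }
    }
}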
Use StringBuilder instead of appending via "+".
Use a List<string> names and Add your combined lines to the list.
That way you don't need to allocate the whole array up front.

How to read text file lines after Specific lines using StreamReader

I have a text file which I am reading using StreamReader. As per my requirement, whatever lines I have read the first time, I don't want to read again, i.e. I don't want to take that data again. So I have added File.ReadLines(FileToCopy).Count(); to get the number of lines read the first time. Now I want to continue reading after the line returned by that count.
Here is my code.
string FileToCopy = "E:\\vikas\\call.txt";
if (System.IO.File.Exists(FileToCopy) == true)
{
    lineCount = File.ReadLines(FileToCopy).Count();
    using (StreamReader reader = new StreamReader(FileToCopy))
    {
    }
}
What condition do I need to specify here? Please help me.
while ((line = reader.ReadLine()) != null)
{
    var nextLines = File.ReadLines(FileToCopy).Skip(lineCount);
    if (line != "")
    {
    }
There's a much faster way to do this that doesn't require you to read the entire file in order to get to the point where you left off. The key is to keep track of the file's length. Then you open the file as a FileStream, position to the previous length (i.e. the end of where you read before), and then create a StreamReader. So it looks like this:
long previousLength = 0;
Then, when you want to copy new stuff:
using (var fs = File.OpenRead(FileToCopy))
{
    // position to just beyond where you read before
    fs.Position = previousLength;
    // and update the length for next time
    previousLength = fs.Length;
    // now open a StreamReader and read
    using (var sr = new StreamReader(fs))
    {
        while (!sr.EndOfStream)
        {
            var line = sr.ReadLine();
            // do something with the line
        }
    }
}
This will save you huge amounts of time if the file gets large. For example if the file was a gigabyte in size the last time you read it, then File.ReadLines(filename).Skip(count) will take you 20 seconds to get to the end so you can read the next lines. The method I described above will take much less time--probably less than a second.
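To reuse that snippet across calls, previousLength has to be kept somewhere between reads; one possible shape (the names here are my own) is a small class with a field:
// Sketch: keeps the last-seen length in a field so each call only reads what was appended since.
static class FileTail
{
    private static long previousLength = 0;

    public static void CopyNewLines(string fileToCopy, Action<string> handleLine)
    {
        using (var fs = File.OpenRead(fileToCopy))
        {
            fs.Position = previousLength;   // skip everything read on earlier calls
            previousLength = fs.Length;     // remember how far we got this time
            using (var sr = new StreamReader(fs))
            {
                while (!sr.EndOfStream)
                {
                    handleLine(sr.ReadLine());
                }
            }
        }
    }
}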
This:
lineCount = File.ReadLines(FileToCopy).Count();
will return the total line count of your file, which is useless for you. You need to store the number of lines you have already read from the file. Then, every time you read again, use the Skip method:
var nextLines = File.ReadLines("filaPath").Skip(lineCount);
You don't need a StreamReader here. For example, the first time you read the file, let's say 10 lines:
var lines = File.ReadLines(filePath).Take(10);
lineCount += 10;
The second time, skip the first 10 lines, read more, and update lineCount:
var nextLines = File.ReadLines(filePath).Skip(lineCount).Take(20);
lineCount += 20;
More generically, you can write a method for this and call it whenever you want to read the next lines:
public static string[] ReadFromFile(string filePath, int count, ref int lineCount)
{
    // Skip what has already been read, take the next batch, then advance the counter.
    string[] lines = File.ReadLines(filePath).Skip(lineCount).Take(count).ToArray();
    lineCount += count;
    return lines;
}
private static int lineCount = 0;

private static void Main(string[] args)
{
    // read the first ten lines
    string[] lines = ReadFromFile("sample.txt", 10, ref lineCount);
    // read the next 30 lines
    string[] otherLines = ReadFromFile("sample.txt", 30, ref lineCount);
}
I hope you get the idea.
Just read lineCount lines from your new stream:
for (int n = 0; n < lineCount; n++)
{
    reader.ReadLine();
}
That is the easiest method, when you have to actually skip N lines (not N bytes).

Split large file into smaller files by number of lines in C#?

I am trying to figure out how to split a file by the number of lines in each file. The files are CSV and I can't do it by bytes; I need to do it by lines. 20k seems to be a good number per file. What is the best way to read a stream at a given position? Stream.BaseStream.Position? So if I read the first 20k lines, I would start the position at 39,999? How do I know I am almost at the end of a file? Thanks all
using (System.IO.StreamReader sr = new System.IO.StreamReader("path"))
{
    int fileNumber = 0;
    while (!sr.EndOfStream)
    {
        int count = 0;
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter("other path" + ++fileNumber))
        {
            sw.AutoFlush = true;
            while (!sr.EndOfStream && count++ < 20000)
            {
                sw.WriteLine(sr.ReadLine());
            }
        }
    }
}
int index = 0;
var groups = from line in File.ReadLines("myfile.csv")
             group line by index++ / 20000 into g
             select g.AsEnumerable();
int file = 0;
foreach (var group in groups)
    File.WriteAllLines((file++).ToString(), group.ToArray());
I'd do it like this:
// helper method to break up into blocks lazily
public static IEnumerable<ICollection<T>> SplitEnumerable<T>
    (this IEnumerable<T> Sequence, int NbrPerBlock)
{
    List<T> Group = new List<T>(NbrPerBlock);
    foreach (T value in Sequence)
    {
        Group.Add(value);
        if (Group.Count == NbrPerBlock)
        {
            yield return Group;
            Group = new List<T>(NbrPerBlock);
        }
    }
    if (Group.Any()) yield return Group; // flush out any remaining
}

// now it's trivial; if you want to make smaller files, just foreach
// over this and write out the lines in each block to a new file
public static IEnumerable<ICollection<string>> SplitFile(string filePath)
{
    return File.ReadLines(filePath).SplitEnumerable(20000);
}
Is that not sufficient for you? You mention moving from position to position, but I don't see why that's necessary.
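For completeness, a short usage sketch (the output file naming is my own assumption) that writes each 20,000-line block to its own file:
// Hypothetical usage of SplitFile above: writes each block to part0.csv, part1.csv, ...
int part = 0;
foreach (var block in SplitFile(@"C:\data\big.csv"))
{
    File.WriteAllLines("part" + part++ + ".csv", block);
}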
