C# string array out of memory

I'm trying to take two text files and combine them so that every line in file1 is concatenated with every line in file2.
Example:
file1:
a
b
file2:
c
d
result:
a c
a d
b c
b d
This is the code:
{
    //int counter = 0;
    string[] lines1 = File.ReadLines("e:\\1.txt").ToArray();
    string[] lines2 = File.ReadLines("e:\\2.txt").ToArray();
    int len1 = lines1.Length;
    int len2 = lines2.Length;
    string[] names = new string[len1 * len2];
    int i = 0;
    int finish = 0;
    //Console.WriteLine("Check this");
    for (i = 0; i < lines2.Length; i++)
    {
        for (int j = 0; j < lines1.Length; j++)
        {
            names[finish] = lines2[i] + ' ' + lines1[j];
            finish++;
        }
    }
    using (System.IO.StreamWriter file = new System.IO.StreamWriter(@"E:\text.txt"))
    {
        foreach (string line in names)
        {
            file.WriteLine(line);
        }
    }
}
I get this exception:
"An unhandled exception of type 'System.OutOfMemoryException' occurred
in ConsoleApplication2.exe" on this line:
string[] names = new string[len1 * len2];
Is there another way to combine these two files without getting an OutOfMemoryException?

Something like this:
using (var output = new StreamWriter(@"E:\text.txt"))
{
    foreach (var line1 in File.ReadLines("e:\\1.txt"))
    {
        foreach (var line2 in File.ReadLines("e:\\2.txt"))
        {
            output.WriteLine("{0} {1}", line1, line2);
        }
    }
}
Unless the lines are very long, this should avoid an OutOfMemoryException.

It looks like you want a Cartesian product rather than a simple concatenation. Instead of loading all lines into memory, use ReadLines with SelectMany. This may not be fast, but it avoids the exception:
var file1 = File.ReadLines("e:\\1.txt");
var file2 = File.ReadLines("e:\\2.txt");
var lines = file1.SelectMany(x => file2.Select(y => string.Join(" ", x, y)));
File.WriteAllLines("output.txt", lines);

Use a StringBuilder instance instead of concatenating strings. Strings are immutable in .NET, so every change to an instance creates a new string, consuming the available memory.
Make names a StringBuilder and use the Append method to construct your result.
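A minimal sketch of that idea (assuming the lines1 and lines2 arrays from the question, and a single StringBuilder rather than an array of them); note that it still keeps the whole combined result in memory, so it only helps when that result fits in RAM:
// Sketch only: accumulate the combined output in one StringBuilder
// instead of allocating one string per pair up front.
var sb = new StringBuilder();
foreach (string line2 in lines2)
{
    foreach (string line1 in lines1)
    {
        sb.Append(line2).Append(' ').Append(line1).AppendLine();
    }
}
File.WriteAllText(@"E:\text.txt", sb.ToString());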

If your files are large (which is the reason for the out-of-memory error), you should never load the complete file into memory. The result file in particular (with size = size1 * size2) will get very large.
I'd suggest using a StreamReader to read through the input files line by line and
a StreamWriter to write the result file line by line.
With this technique you can process arbitrarily large files (as long as the result fits on your hard disk).
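A rough sketch of that approach (paths taken from the question; the inner file is re-opened for every outer line, which is slow but keeps memory use flat):
using (var writer = new StreamWriter(@"E:\text.txt"))
using (var reader1 = new StreamReader("e:\\1.txt"))
{
    string line1;
    while ((line1 = reader1.ReadLine()) != null)
    {
        // re-read the second file for every line of the first one
        using (var reader2 = new StreamReader("e:\\2.txt"))
        {
            string line2;
            while ((line2 = reader2.ReadLine()) != null)
            {
                writer.WriteLine(line1 + " " + line2);
            }
        }
    }
}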

Use a StringBuilder instead of appending via "+".

Use "List names" and "Add" your combined lines to the List.
So you don't need to alloc memory.
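A quick sketch of that suggestion (it avoids the single huge array allocation, though the combined result still has to fit in memory):
var names = new List<string>();
foreach (string line2 in lines2)
{
    foreach (string line1 in lines1)
    {
        names.Add(line2 + " " + line1);  // the list grows as needed
    }
}
File.WriteAllLines(@"E:\text.txt", names);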

Related

How to improve performance of accessing lines in IEnumerable<T> File.ReadLines()

I am loading a file using the File.ReadLines method (files could get very large, so I use this rather than ReadAllLines).
I need to access each line and perform an action on it, so my code is like this:
IEnumerable<String> lines = File.ReadLines(@"c:\myfile.txt", new UTF8Encoding());
StringBuilder sb = new StringBuilder();
int totalLines = lines.Count(); //used for progress calculation
//use for instead of foreach here - easier to know the line I'm on for the progress (percent complete) calculation
for (int i = 0; i < totalLines; i++)
{
    //for example, get the line and do something
    sb.Append(lines.ElementAt(i) + "\r\n");
    //get the line again using ElementAt(i) and do something else
    //...ElementAt(i)...
}
So my bottleneck is each access to ElementAt(i), because it has to iterate over the entire IEnumerable to get to position i.
Is there any way to keep using File.ReadLines, but improve this somehow?
EDIT - the reason I count at the beginning is so I can calculate the progress to display to the user, which is why I replaced the foreach with the for.
How about using foreach? It's designed to handle exactly this situation.
IEnumerable<String> lines = File.ReadLines(@"c:\myfile.txt", new UTF8Encoding());
StringBuilder sb = new StringBuilder();
string previousLine = null;
int lineCounter = 0;
int totalLines = lines.Count();
foreach (string line in lines)
{
    // show progress (cast to float to avoid integer division)
    float done = (float)++lineCounter / totalLines;
    Debug.WriteLine($"{done * 100:0.00}% complete");
    //get the line and do something
    sb.AppendLine(line);
    //do something else, like look at the previous line to compare
    if (line == previousLine)
    {
        Debug.WriteLine($"Line {lineCounter} is the same as the previous line.");
    }
    previousLine = line;
}
Sure, you can use a foreach instead of the for loop, so you don't have to go back and reference the line via its index:
foreach (string line in lines)
{
sb.AppendLine(line);
}
You will also no longer need the int totalLines = lines.Count(); line, because you don't need the count for anything (unless you're using it somewhere you're not showing).

CSV File Splitting with specific size

Hi guys, I have a function which creates multiple CSV files from a DataTable in smaller chunks, based on a size passed through an app.config key/value pair.
Issues with the code below:
I've hardcoded the file size to 1 KB; when I pass a value of 20, it should create CSV files of 20 KB each. Currently it creates files of about 5 KB for the same value.
For the last leftover records it doesn't create any file at all.
Kindly help me fix this. Thanks!
Code:
public static void CreateCSVFile(DataTable dt, string CSVFileName)
{
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024; //1 KB size
    string CSVPath = ConfigurationManager.AppSettings["CSVPath"];
    StringBuilder FirstLine = new StringBuilder();
    StringBuilder records = new StringBuilder();
    int num = 0;
    int length = 0;
    IEnumerable<string> columnNames = dt.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
    FirstLine.AppendLine(string.Join(",", columnNames));
    records.AppendLine(FirstLine.ToString());
    length += records.ToString().Length;
    foreach (DataRow row in dt.Rows)
    {
        //Putting field values in double quotes
        IEnumerable<string> fields = row.ItemArray.Select(field =>
            string.Concat("\"", field.ToString().Replace("\"", "\"\""), "\""));
        records.AppendLine(string.Join(",", fields));
        length += records.ToString().Length;
        if (length > size)
        {
            //Create a new file
            num++;
            File.WriteAllText(CSVPath + CSVFileName + DateTime.Now.ToString("yyyyMMddHHmmss") + num.ToString("_000") + ".csv", records.ToString());
            records.Clear();
            length = 0;
            records.AppendLine(FirstLine.ToString());
        }
    }
}
Use File.ReadLines; because of its LINQ-style deferred execution, the lines are streamed instead of being loaded all at once.
foreach (var line in File.ReadLines(FilePath))
{
    // logic here.
}
From MSDN
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient.
So you could rewrite your method as below.
public static void SplitCSV(string FilePath, string FileName)
{
    //Read the specified file size
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024 * 1024; //1 MB size
    int total = 0;
    int num = 0;
    string FirstLine = null; // header, repeated at the top of each new file
    var writer = new StreamWriter(GetFileName(FileName, num));
    // Loop through all source lines
    foreach (var line in File.ReadLines(FilePath))
    {
        if (string.IsNullOrEmpty(FirstLine)) FirstLine = line;
        // Length of current line
        int length = line.Length;
        // See if adding this line would exceed the size threshold
        if (total + length >= size)
        {
            // Create a new file
            num++;
            total = 0;
            writer.Dispose();
            writer = new StreamWriter(GetFileName(FileName, num));
            writer.WriteLine(FirstLine);
            length += FirstLine.Length;
        }
        // Write the line to the current file
        writer.WriteLine(line);
        // Add length of line (in characters) to running size
        total += length;
        // Add size of newlines
        total += Environment.NewLine.Length;
    }
    // Flush and close the last file
    writer.Dispose();
}
The solution is quite simple... you don't need to keep all your lines in memory (as you would with string[] arr = File.ReadAllLines(FilePath);).
Instead, create a StreamReader on the input file and read line by line into a line buffer. When the buffer goes over your threshold size, write it to disk as a single CSV file. The code should look something like this:
using (var sr = new System.IO.StreamReader(filePath))
{
    var linesBuffer = new List<string>();
    while (sr.Peek() >= 0)
    {
        linesBuffer.Add(sr.ReadLine());
        if (linesBuffer.Count > yourThreshold)
        {
            // TODO: implement function WriteLinesToPartialCsv
            WriteLinesToPartialCsv(linesBuffer);
            // Clear the buffer:
            linesBuffer.Clear();
            // Try forcing C# to reclaim the memory:
            GC.Collect();
        }
    }
    // Write out whatever is still left in the buffer after the loop
    if (linesBuffer.Count > 0)
        WriteLinesToPartialCsv(linesBuffer);
}
As you can see, by reading the stream line by line (instead of the whole CSV input file, as your code did) you have much better control over memory usage.

How to Stream string data from a txt file into an array

I'm doing this exercise from a lab. The instructions are as follows:
This method should read the product catalog from a text file called “catalog.txt” that you should create alongside your project. Each product should be on a separate line. Use the instructions in the video to create the file and add it to your project, and to return an array with the first 200 lines from the file (use the StreamReader class and a while loop to read from the file). If the file has more than 200 lines, ignore them. If the file has fewer than 200 lines, it's OK if some of the array elements are empty (null).
I don't understand how to stream data into the string array; any clarification would be greatly appreciated!
static string[] ReadCatalogFromFile()
{
    //create instance of the catalog.txt
    StreamReader readCatalog = new StreamReader("catalog.txt");
    //store the information in this array
    string[] storeCatalog = new string[200];
    int i = 0;
    //test and store the array information
    while (storeCatalog != null)
    {
        //store each string in the elements of the array?
        storeCatalog[i] = readCatalog.ReadLine();
        i = i + 1;
        if (storeCatalog != null)
        {
            //test to see if its properly stored
            Console.WriteLine(storeCatalog[i]);
        }
    }
    readCatalog.Close();
    Console.ReadLine();
    return storeCatalog;
}
Here are some hints:
Keep int i = 0; outside the loop (so it isn't reset), and make sure i never runs past the end of the 200-element array.
In your while() you should check the result of ReadLine() and/or the maximum number of lines to read (i.e. the size of your array).
Thus: if you reached the end of the file -> stop, or if your array is full -> stop.
static string[] ReadCatalogFromFile()
{
var lines = new string[200];
using (var reader = new StreamReader("catalog.txt"))
for (var i = 0; i < 200 && !reader.EndOfStream; i++)
lines[i] = reader.ReadLine();
return lines;
}
A for-loop is used when you know the exact number of iterations beforehand, so you can make it iterate exactly 200 times and never cross the index boundaries. At the moment you only check that your array isn't null, which it never will be.
using (var readCatalog = new StreamReader("catalog.txt"))
{
    string[] storeCatalog = new string[200];
    for (int i = 0; i < 200; i++)
    {
        string temp = readCatalog.ReadLine();
        if (temp != null)
            storeCatalog[i] = temp;
        else
            break;
    }
    return storeCatalog;
}
As soon as there are no more lines in the file, temp will be null and the loop is stopped by the break.
I suggest you wrap your disposable resources (like any stream) in a using statement; after the operations in the braces, the resource is automatically disposed.
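Roughly speaking, a using block is shorthand for a try/finally that calls Dispose, so the stream is closed even if an exception is thrown partway through:
// Approximately what the compiler generates for the using block:
var readCatalog = new StreamReader("catalog.txt");
try
{
    // ... read the lines into the array ...
}
finally
{
    if (readCatalog != null)
        readCatalog.Dispose();  // closes the underlying file handle
}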

Reading input from a text file, converting into int, and storing into array.

I need some help. I have a text file that looks like so:
21,M,S,1
22,F,M,2
19,F,S,3
65,F,M,4
66,M,M,4
What I need to do is put the first column into an array int[] age and the last column into an array int[] districts. This is for a college project due in a week. I've been having a lot of trouble trying to figure this out. Any help would be greatly appreciated. I did try searching for an answer already but didn't find anything that I understood. I also cannot use anything we haven't learned from the book, so that rules out List<> and the like.
FileStream census = new FileStream("census.txt", FileMode.Open, FileAccess.Read);
StreamReader inFile = new StreamReader(census);
string input = "";
string[] fields;
int[] districts = new int[SIZE];
int[] ageGroups = new int[SIZE];
input = inFile.ReadLine();
while (input != null)
{
    fields = input.Split(',');
    for (int i = 0; i < 1; i++)
    {
        int x = int.Parse(fields[i]);
        districts[i] = x;
    }
    input = inFile.ReadLine();
}
Console.WriteLine(districts[0]);
If your file contains nothing but this, then File.ReadAllLines() will return a string array with each element being a line of your file. Having done that, you can use the length of the returned array to initialize the other two arrays into which the data will be stored.
Once you have your string array, call string.Split() on each element with "," as the delimiter; now you have another array of strings minus the commas. You then take the values you want by their index positions, 0 and 3 respectively, and store them somewhere. Your code would look something like this:
//you will need to replace "path" with the actual path to the file.
string[] file = File.ReadAllLines("path");
int[] age = new int[file.Length];
int[] districts = new int[file.Length];
int counter = 0;
foreach (var item in file)
{
    string[] values = item.Split(',');
    age[counter] = Convert.ToInt32(values[0]);
    districts[counter] = Convert.ToInt32(values[3]);
    counter++;
}
Proper way of writing this code:
Write down each step you're trying to perform:
// open file
// for each line
// parse line
Then refine "parse line":
// split into fields
// parse and handle age
// parse and handle gender
// parse and handle marital status
// parse and handle ....
Then start writing the missing code.
At that point you should figure out that iterating through the fields of a single record isn't going to do you any good, since the fields all have different meanings.
So you'll need to remove the for loop and replace it with field-by-field parsing/assignments, as sketched below.
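A sketch of that field-by-field parsing (variable names here are illustrative, not from the assignment; gender and marital status are kept as strings, since the arrays in the question only need the age and district columns):
string line;
int count = 0;
while ((line = inFile.ReadLine()) != null)
{
    string[] fields = line.Split(',');
    int age = int.Parse(fields[0]);         // e.g. 21
    string gender = fields[1];              // e.g. "M"
    string maritalStatus = fields[2];       // e.g. "S"
    int district = int.Parse(fields[3]);    // e.g. 1
    ageGroups[count] = age;
    districts[count] = district;
    count++;
}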
Instead of looping through all your fields, simply refer to the actual index of the field:
Wrong:
for (int i = 0; i < 1; i++)
{
    int x = int.Parse(fields[i]);
    districts[i] = x;
}
Right:
districts[i] = int.Parse(fields[0]);
ageGroups[i] = int.Parse(fields[3]);
i++;
So I just made some BS to do what you are seeking. I do not agree with it, because I hate hardcoding indexes into the split result, but since you can't use a List this is what you get:
FileStream census = new FileStream(path, FileMode.Open, FileAccess.Read);
StreamReader inFile = new StreamReader(census);
int[] districts = new int[1024];
int[] ageGroups = new int[1024];
int counter = 0;
string line;
while ((line = inFile.ReadLine()) != null)
{
    string[] splitString = line.Split(',');
    int.TryParse(splitString[0], out ageGroups[counter]);
    int.TryParse(splitString[3], out districts[counter]);
    counter++;
}
This will give you two arrays, districts and ageGroups, each of length 1024, containing the values for each row of the census.txt file.

Improve the code for performance?

I wrote the following C# code to generate a set of numbers and then compare it against another set of numbers to remove the unwanted ones.
But it's taking too long at run time to complete. Following is the code-behind file.
The numbers it has to generate run to about seven digits, and the list of numbers to exclude is around 700 entries.
Is there a way to improve the run-time performance?
string[] strAry = txtNumbersToBeExc.Text.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
int[] intAry = new int[strAry.Length];
List<int> intList = new List<int>();
for (int i = 0; i < strAry.Length; i++)
{
    intList.Add(int.Parse(strAry[i]));
}
List<int> genList = new List<int>();
for (int i = int.Parse(txtStartSeed.Text); i <= int.Parse(txtEndSeed.Text); i++)
{
    genList.Add(i);
}
lblStatus.Text += "Generated: " + genList.Capacity;
var finalvar = from s in genList where !intList.Contains(s) select s;
List<int> finalList = finalvar.ToList();
foreach (var item in finalList)
{
    txtGeneratedNum.Text += "959" + item + "\n";
}
First thing to do is grab a profiler and see which area of your code is taking too long to run, try http://www.jetbrains.com/profiler/ or http://www.red-gate.com/products/dotnet-development/ants-performance-profiler/.
You should never start performance tuning until you know for sure where the problem is.
If the problem is in the LINQ query, you could try sorting intList and doing a binary search for each item to remove, though you can probably get similar behaviour with the right LINQ query.
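A sketch of the sort-and-binary-search idea, reusing the names from the question (BinarySearch returns a non-negative index only when the value is present):
intList.Sort();
int start = int.Parse(txtStartSeed.Text);
int end = int.Parse(txtEndSeed.Text);
var kept = new List<int>();
for (int i = start; i <= end; i++)
{
    if (intList.BinarySearch(i) < 0)   // not found in the exclusion list
    {
        kept.Add(i);
    }
}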
string numbersStr = txtNumbersToBeExc.Text;
string startSeedStr = txtStartSeed.Text;
string endSeedStr = txtEndSeed.Text;
// the inputs are really ints, so ideally we'd also validate that the strings parse correctly
var intAry = numbersStr.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries).Select(s => Int32.Parse(s));
int startSeed = Int32.Parse(startSeedStr);
int endSeed = Int32.Parse(endSeedStr);
/*FROM HERE*/
// using Enumerable.Range
var genList = Enumerable.Range(startSeed, endSeed - startSeed + 1);
// we can use LINQ's Except
var finalList = genList.Except(intAry);
// do you need a string? for this many concatenations I would suggest StringBuilder
var sb = new StringBuilder();
foreach (var item in finalList)
{
    sb.AppendLine(string.Concat("959", item.ToString()));
}
var finalString = sb.ToString();
/*TO HERE, refactor it into a method or class*/
txtGeneratedNum.Text = finalString;
The key point here is that String is an immutable class, so the "+" operation between two strings creates yet another string; StringBuilder doesn't do this. In your situation it really doesn't matter whether you use for loops, foreach loops, or fancy LINQ functions to perform the exclusion: the performance hit came from the string concatenations. I trust the System.Linq functions more because they are already tested for performance.
Change intList from a List to a HashSet - that gives much better performance when determining whether an entry is present.
Consider using LINQ's Enumerable.Except, especially combined with #1.
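A sketch of both suggestions combined, reusing the names from the question:
var exclude = new HashSet<int>(intList);               // O(1) lookups
var result = genList.Where(n => !exclude.Contains(n)); // keep only non-excluded numbers
// or, using the LINQ set operation directly (it builds a set internally):
var result2 = genList.Except(intList);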
Change the block of code that creates genList to this:
List<int> genList = new List<int>();
for (int i = int.Parse(txtStartSeed.Text); i <= int.Parse(txtEndSeed.Text); i++)
{
    if (!intList.Contains(i)) genList.Add(i);
}
and afterwards build txtGeneratedNum by looping over genList. This reduces the number of loops in your implementation.
Why not do the exclusion check while you are parsing the ints and build the result directly?
There is not much point in iterating over the list twice. In fact, why build the intermediate list at all? Just write straight to a StringBuilder, since a newline-delimited string seems to be your goal.
var exclusions = new HashSet<int>();
foreach (string s in txtNumbersToBeExc.Text.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries))
{
    int value;
    if (int.TryParse(s, out value))
    {
        exclusions.Add(value);
    }
}
var output = new StringBuilder();
for (int i = int.Parse(txtStartSeed.Text); i <= int.Parse(txtEndSeed.Text); i++)
{
    if (!exclusions.Contains(i))
    {
        output.AppendFormat("959{0}\n", i);
    }
}
txtGeneratedNum.Text = output.ToString();
