Most memory efficient way to merge two files - c#

I need to merge two files while also applying a sort. It is important that I keep the task light on memory usage. I need to create a console app in C# for this.
Input File 1:
Some Header
A12345334
A00123445
A44566555
B55677
B55683
B66489
record count: 6
Input File 2:
Some Header
A00123465
B99423445
record count: 2
So, I need to make sure that the third file should have all the "A" records coming first and then the "B" records followed by the Total record count.
Output File:
Some header
A12345334
A00123445
A44566555
A00123465
B99423445
B55677
B55683
B66489
record count: 8
Record sorting within "A" and "B" is not relevant.

Since your source files appear sorted, you can do this with very low memory usage.
Just open both input files as well as a new file for writing. Then compare the next available line from each input file and write the line that comes first to your output file. Each time you write a line to the output file, get the next line from the input file it came from.
Continue until both input files are finished.

If memory is an issue the easiest way to do this is probably going to be to read the records from both files, store them in a SQLite or SQL Server Compact database, and execute a SELECT query that returns a sorted record set. Make sure you have an index on the field you want to sort on.
That way, you don't have to store the records in memory, and you don't need any sorting algorithms; the database will store the records on disk and do your sorting for you.

Quick idea, assuming the records are already sorted in the original files:
Start looping through file 2, collecting all A-records
Once you reach the first B-record, start collecting those in a separate collection.
Read all of File 1.
Write out the content of the A-records collection from file 2, then append the contents read from file 1, followed by the B-records from file 2.
Visualized:
<A-data from file 2>
<A-data, followed by B-data from file 1>
<B-data from file 2>

If you are concerned about memory, this is a perfect case for insertion sort, reading one line at a time from each file. If that is not an issue, read the whole thing into a list, call Sort, then write it out.
If you can't even keep the whole sorted list in memory, then a database or memory-mapped file is your best bet.
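If both files fit comfortably in memory, the "read everything into a list and sort" route might look like the sketch below. It assumes, as in the question's samples, that each file has exactly one header line and one trailing "record count" line, and that the record type is the first character of each record. The names `InMemoryMerge`, `MergeRecords` and `MergeFiles` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class InMemoryMerge
{
    // Assumption: each input has one header line first and one "record count" line last.
    public static List<string> MergeRecords(IList<string> file1, IList<string> file2)
    {
        string header = file1[0];

        List<string> records = file1.Skip(1).Take(file1.Count - 2)
            .Concat(file2.Skip(1).Take(file2.Count - 2))
            .OrderBy(r => r[0]) // LINQ's OrderBy is stable, so order inside "A" and "B" is kept
            .ToList();

        var output = new List<string> { header };
        output.AddRange(records);
        output.Add("record count: " + records.Count);
        return output;
    }

    // Convenience wrapper that works on file paths.
    public static void MergeFiles(string in1, string in2, string outPath)
    {
        File.WriteAllLines(outPath,
            MergeRecords(File.ReadAllLines(in1), File.ReadAllLines(in2)));
    }
}
```

This holds every record in memory at once, so it only suits inputs that are small relative to available RAM.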

Assuming your input files are already ordered:
Open Input files 1 and 2 and create the Output file.
Read the first record from file 1. If it starts with A, write it to the output file. Continue reading from input file 1 until you reach a record that starts with B.
Read the first record from file 2. If it starts with A, write it to the output file. Continue reading from input file 2 until you reach a record that starts with B.
Go back to file 1, and write the 'B' record to the output file. Continue reading from input file 1 until you reach the end of the stream.
Go back to file 2, and write the 'B' record to the output file. Continue reading from input file 2 until you reach the end of the stream.
This method will prevent you from ever having to hold more than 2 rows of data in memory at a time.
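The steps above could be sketched roughly as follows. This is only an illustration under assumptions taken from the question's sample data: the first line of each file is a header, records start with "A" or "B", and anything after the last "B" record (the old "record count" line) is discarded and replaced by a freshly computed total. `StreamingMerge` and `CopyWhile` are hypothetical names.

```csharp
using System;
using System.IO;

public static class StreamingMerge
{
    // Writes lines to 'writer' for as long as they start with 'prefix'.
    // 'pending' is the line already read but not yet written; on return it holds
    // the first non-matching line (or null at end of input).
    private static int CopyWhile(TextReader reader, TextWriter writer, string prefix, ref string pending)
    {
        int copied = 0;
        while (pending != null && pending.StartsWith(prefix, StringComparison.Ordinal))
        {
            writer.WriteLine(pending);
            copied++;
            pending = reader.ReadLine();
        }
        return copied;
    }

    public static void Merge(TextReader in1, TextReader in2, TextWriter output)
    {
        // Assumption: line 1 of each file is a header; keep the one from file 1.
        string header = in1.ReadLine();
        in2.ReadLine();
        if (header != null) output.WriteLine(header);

        string next1 = in1.ReadLine();
        string next2 = in2.ReadLine();

        int total = 0;
        total += CopyWhile(in1, output, "A", ref next1); // A records from file 1
        total += CopyWhile(in2, output, "A", ref next2); // A records from file 2
        total += CopyWhile(in1, output, "B", ref next1); // B records from file 1
        total += CopyWhile(in2, output, "B", ref next2); // B records from file 2

        // The old "record count" lines are not copied; a new total is written instead.
        output.WriteLine("record count: " + total);
    }
}
```

For real files, `Merge` would be called with two `StreamReader` instances and one `StreamWriter`, each wrapped in a `using` block. Only the two pending lines are held in memory at any time.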

I would recommend using StreamReader and StreamWriter for this application. You can open the output file using StreamWriter, then copy all lines using a StreamReader for file #1, then for file #2. These operations are very fast, have integrated buffers and are very lightweight.
If the input files are already sorted by A and B, you can switch between the source readers to make the output sorted.

Since you have two sorted sequences you just need to merge the two sequences into a single sequence, in much the same way the second half of the MergeSort algorithm works.
Unfortunately, given the interface that IEnumerable provides, it ends up a bit messy and copy-pasty, but it should perform quite well and use a very small memory footprint:
// requires: using System.Collections.Generic;
public class Wrapper<T>
{
    public T Value { get; set; }
}
public static IEnumerable<T> Merge<T>(IEnumerable<T> first, IEnumerable<T> second, IComparer<T> comparer = null)
{
    comparer = comparer ?? Comparer<T>.Default;
    using (var secondIterator = second.GetEnumerator())
    {
        Wrapper<T> secondItem = null; // when the wrapper is null there are no more items in the second sequence
        if (secondIterator.MoveNext())
            secondItem = new Wrapper<T>() { Value = secondIterator.Current };
        foreach (var firstItem in first)
        {
            // emit every item of the second sequence that sorts before the current first item
            while (secondItem != null && comparer.Compare(firstItem, secondItem.Value) > 0)
            {
                yield return secondItem.Value;
                if (secondIterator.MoveNext())
                    secondItem.Value = secondIterator.Current;
                else
                    secondItem = null;
            }
            yield return firstItem;
        }
        // the first sequence is exhausted; emit whatever is left of the second
        if (secondItem != null)
        {
            yield return secondItem.Value;
            while (secondIterator.MoveNext())
                yield return secondIterator.Current;
        }
    }
}
Once you have a Merge function it's pretty trivial:
File.WriteAllLines("output.txt",
Merge(File.ReadLines("File1.txt"), File.ReadLines("File2.txt")));
The File ReadLines and WriteAllLines here each utilize IEnumerable and will stream the lines accordingly.

Here's the source code for the more generic/boiler plate solution for merge sorting 2 files.
public static void Merge(string inFile1, string inFile2, string outFile)
{
string line1 = null;
string line2 = null;
using (StreamReader sr1 = new StreamReader(inFile1))
{
using (StreamReader sr2 = new StreamReader(inFile2))
{
using (StreamWriter sw = new StreamWriter(outFile))
{
line1 = sr1.ReadLine();
line2 = sr2.ReadLine();
while(line1 != null && line2 != null)
{
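// your comparison function here; this one compares only the first character
// (the record type), so every "A" line is written before any "B" line
if (string.CompareOrdinal(line1, 0, line2, 0, 1) <= 0)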
{
sw.WriteLine(line1);
line1 = sr1.ReadLine();
}
else
{
sw.WriteLine(line2);
line2 = sr2.ReadLine();
}
}
while(line1 != null)
{
sw.WriteLine(line1);
line1 = sr1.ReadLine();
}
while(line2 != null)
{
sw.WriteLine(line2);
line2 = sr2.ReadLine();
}
}
}
}
}

public void merge_click(Object sender, EventArgs e)
{
DataTable dt = new DataTable();
dt.Clear();
dt.Columns.Add("Name");
dt.Columns.Add("designation");
dt.Columns.Add("age");
dt.Columns.Add("year");
string[] lines = File.ReadAllLines(@"C:\Users\user1\Desktop\text1.txt", Encoding.UTF8);
string[] lines1 = File.ReadAllLines(@"C:\Users\user2\Desktop\text1.txt", Encoding.UTF8);
foreach (string line in lines)
{
string[] values = line.Split(',');
DataRow dr = dt.NewRow();
dr["Name"] = values[0].ToString();
dr["designation"] = values[1].ToString();
dr["age"] = values[2].ToString();
dr["year"] = values[3].ToString();
dt.Rows.Add(dr);
}
foreach (string line in lines1)
{
string[] values = line.Split(',');
DataRow dr = dt.NewRow();
dr["Name"] = values[0].ToString();
dr["designation"] = values[1].ToString();
dr["age"] = values[2].ToString();
dr["year"] = values[3].ToString();
dt.Rows.Add(dr);
}
grdstudents.DataSource = dt;
grdstudents.DataBind();
}

Related

C# Converting an array to a list

I am working on an assignment that deals with file input and output. The instructions are as follows:
Write a program to update an inventory file. Each line of the inventory file will have a product number, a product name and a quantity separated by vertical bars. The transaction file will contain a product number and a change amount, which may be positive for an increase or negative for a decrease. Use the transaction file to update the inventory file, writing a new inventory file with the updated quantities. I have provided 2 Input files to test your program with as well as a sample output file so you see what it should look like when you are done.
Hints:
This program requires 3 files
Initial Inventory File
File showing updates to be made
New Inventory File with changes completed
Use Lists to capture the data so you don’t have to worry about the number of items in the files
Each line of the Inventory file looks something like this:
123 | television | 17
I have also been given the basic structure and outline of the program:
class Program
{
public class InventoryNode
{
// Create variables to hold the 3 elements of each item that you will read from the file
// Make them all public
public InventoryNode()
{
// Create a constructor that sets all 3 of the items to default values
}
public InventoryNode(int ID, string InvName, int Number)
{
// Create a constructor that sets all 3 of the items to values that are passed in
}
public override string ToString() // This one is a freebie
{
return IDNumber + " | " + Name + " | " + Quantity;
}
}
static void Main(String[] args)
{
// Create variables to hold the 3 elements of each item that you will read from the file
// Create variables for all 3 files (2 for READ, 1 for WRITE)
List<InventoryNode> Inventory = new List<InventoryNode>();
InventoryNode Item = null;
// Create any other variables that you need to complete the work
// Check for proper number of arguments
// If there are not enough arguments, give an error message and return from the program
// Otherwise
// Open Output File
// Open Inventory File (monitor for exceptions)
// Open Update File (monitor for exceptions)
// Read contents of Inventory into the Inventory List
// Read each item from the Update File and process the data
// Write output file
//Close all files
return;
}
}
There are a lot of steps to this problem, but right now I am only really concerned with how to read the inventory file into a list. I have read files into arrays before, so I thought I could do that and then convert the array to a list. But I am not entirely sure how to do that. Below is what I have created to add to the main method of the structure above.
int ID;
string InvName;
int Number;
string line;
List<InventoryNode> Inventory = new List<InventoryNode>();
InventoryNode Item = null;
StreamReader f1 = new StreamReader(args[0]);
StreamReader f2 = new StreamReader(args[1]);
StreamWriter p = new StreamWriter(args[2]);
// Read each item from the Update File and process the data
while ((line = f1.ReadLine()) != null)
{
string[] currentLine = line.Split('|');
{
ID = Convert.ToInt16(currentLine[0]);
InvName = currentLine[1];
Number = Convert.ToInt16(currentLine[2]);
}
}
I am a bit hung up on the InventoryNode Item = null; line. I am not really sure what this is supposed to be doing. I really just want to read the file to an array so I can parse it and then pass that data to a list. Is there a way to do that that is something similar to the block I have written? Maybe there is a simpler way. I am open to that, but I figured I'd show my train of thought.
There is no need to add everything to an array and then convert it to the list. InventoryNode Item = null is there to represent a line from the file.
You're pretty close. You just need to instantiate the InventoryNode and feed it the results of the split() method.
You're almost there. You already have fetched ID, InvName and Number, so you just have to instantiate the InventoryNode:
Item = new InventoryNode(...);
And then add Item to your list.
Note that InventoryNode Item = null; is not doing much; it just declares a variable that you can use later. This wasn't strictly necessary, as the variable could have been declared inside the loop instead.
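Putting those suggestions together, reading the inventory file straight into a list might look like the sketch below. The field names `IDNumber`, `Name` and `Quantity` are taken from the `ToString` method in the assignment skeleton. The `InventoryLoader` class and the choice to skip blank lines are assumptions made for illustration, not part of the assignment.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class InventoryNode
{
    public int IDNumber;
    public string Name;
    public int Quantity;

    public InventoryNode(int id, string invName, int number)
    {
        IDNumber = id;
        Name = invName;
        Quantity = number;
    }

    public override string ToString()
    {
        return IDNumber + " | " + Name + " | " + Quantity;
    }
}

public static class InventoryLoader
{
    // Reads "id | name | quantity" lines and builds one InventoryNode per line.
    public static List<InventoryNode> Load(TextReader reader)
    {
        var inventory = new List<InventoryNode>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length == 0) continue; // assumption: blank lines are ignored

            string[] parts = line.Split('|');
            InventoryNode item = new InventoryNode(
                int.Parse(parts[0].Trim()),
                parts[1].Trim(),
                int.Parse(parts[2].Trim()));
            inventory.Add(item); // no intermediate array-to-list conversion needed
        }
        return inventory;
    }
}
```

In the assignment's `Main`, this could be called as `InventoryLoader.Load(f1)` with the `StreamReader` opened on `args[0]`.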

Extremely Large Single-Line File Parse

I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "<ClinicalData>" and end with "</ClinicalData>". Therefore, an example string would look something like:
<ClinicalData><ID="1"></ClinicalData><ClinicalData><ID="2"></ClinicalData><ClinicalData><ID="3"></ClinicalData><ClinicalData><ID="4"></ClinicalData><ClinicalData><ID="5"></ClinicalData>
Under "ideal" circumstances, the block is meant to be a single line of data; however, sometimes there are erroneous newline characters. Since I want to parse the <ClinicalData> chunks within the block, I want to make my data parse-able line by line. Therefore, I take the text file, read it all into a StringBuilder, remove newlines (just in case), and then insert my own newlines, so that I can read line by line.
StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);
// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");
// set my own newline characters so the data becomes parse-able by line
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");
// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());
This has been working out great (albeit maybe not efficient, but at least it is friendly to me :)), until I encountered a chunk of data that is being given to me as a 280MB file.
Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280MB of straight text? Well, I have tried string splits, regex.match splits, and various other ways to break it into guaranteed "<ClinicalData>" chunks, but I continue to get the memory exception. I have also had no luck in attempting to read pre-defined chunks (e.g. using .ReadBytes).
Any suggestions on how to handle a 280MB large, potentially-but-might-not-actually-be single line of text would be great!
That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead you only need to maintain a single intermediate state, something like:
enum ReadState
{
Start,
SawOpen
}
using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
var rs = ReadState.Start;
while (true)
{
var r = sr.Read();
if (r < 0)
{
if (rs == ReadState.SawOpen)
sw.Write('<');
break;
}
char c = (char) r;
if ((c == '\r') || (c == '\n'))
continue;
if (rs == ReadState.SawOpen)
{
if (c == 'C')
sw.WriteLine();
sw.Write('<');
rs = ReadState.Start;
}
if (c == '<')
{
rs = ReadState.SawOpen;
continue;
}
sw.Write(c);
}
}
First off, I don't think you need to put all the text in a StringBuilder, since you aren't even concatenating parts to it. You could just try the following:
File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData");
Why not try a StreamReader for this task? You can pick a "chunk" size that you want to read by and then split up those chunks into the <ClinicalData>data</ClinicalData> parts. Here is some detailed code on how to do this:
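char[] buffer = new char[1024];
string remainder = string.Empty;
List<ClientData> list = new List<ClientData>();
using (StreamReader reader = File.OpenText(@"source.txt"))
{
    int charsRead;
    // Read returns how many characters were actually placed in the buffer;
    // only that many may be used, the rest is stale data from an earlier pass
    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        remainder = Parse(remainder + new string(buffer, 0, charsRead), list);
    }
}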
with the following method:
string Parse(string value, List<ClientData> list)
{
string[] parts = value.Split(new string[1] { "</ClinicalData>" }, StringSplitOptions.None); // closing tag as it appears in the data
for (int i = 0; i < parts.Length - 1; i++)
list.Add(new ClientData(parts[i]));
return parts[parts.Length - 1];
}
and the ClientData class however you have it implemented:
class ClientData
{
public ClientData(string value)
{
// fill in however you are already parsing out ID, and other info
}
}
There are many ways to implement something like this, but hopefully this can help get you started.
StreamReader's ReadLine() method is only one of the many ways you can read the text from the file. You can read into a buffer with a specified length, and then parse out the ClinicalData tags. I can provide an example if you'd like.
http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx
Alternately, if you are reading an XML file, XmlReader is another option.
http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx
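As one possible shape for the "read and parse out the tags" idea, the sketch below streams the input one character at a time and yields each chunk as soon as its closing tag has been seen, so neither the whole file nor a rewritten copy is ever held in memory. It assumes chunks end with `</ClinicalData>`, matching the tag used in the question's code, and that stray newlines can simply be dropped. `ChunkReader` and `ReadChunks` are hypothetical names, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkReader
{
    // Lazily yields one chunk at a time, each ending with closeTag; CR and LF are skipped.
    public static IEnumerable<string> ReadChunks(TextReader reader, string closeTag = "</ClinicalData>")
    {
        var current = new StringBuilder();
        char lastTagChar = closeTag[closeTag.Length - 1];
        int r;
        while ((r = reader.Read()) >= 0)
        {
            char c = (char)r;
            if (c == '\r' || c == '\n') continue; // drop erroneous newlines

            current.Append(c);
            // Only check for the full tag when its final character has just arrived.
            if (c == lastTagChar && EndsWith(current, closeTag))
            {
                yield return current.ToString();
                current.Clear();
            }
        }
        if (current.Length > 0)
            yield return current.ToString(); // trailing data without a closing tag
    }

    private static bool EndsWith(StringBuilder sb, string suffix)
    {
        if (sb.Length < suffix.Length) return false;
        int offset = sb.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (sb[offset + i] != suffix[i]) return false;
        }
        return true;
    }
}
```

With a file, this would be used as `foreach (var chunk in ChunkReader.ReadChunks(File.OpenText(path)))`, with the reader wrapped in a `using` block. Memory use is bounded by the largest single chunk rather than by the file size.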

Can't find string in input file

I have a text file, which I am trying to insert a line of code into. Using my linked-lists I believe I can avoid having to take all the data out, sort it, and then make it into a new text file.
What I did was come up with the code below. I set my bools, but still it is not working. I went through debugger and what it seems to be going on is that it is going through the entire list (which is about 10,000 lines) and it is not finding anything to be true, so it does not insert my code.
Why or what is wrong with this code?
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
using (StreamReader inFile = new StreamReader("Students.txt", true))
{
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant@mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i];
if (line.StartsWith("(LIST (LIST "))
{
values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.CompareTo(lastName) < 0)
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
{
lines.Add(newRecord);
}
You're just reading the file into memory and not committing it anywhere.
I'm afraid that you're going to have to load and completely re-write the entire file. Files support appending, but they don't support insertions.
You can write to a file the same way that you read from it:
string[] lines;
// instantiate and build `lines`
File.WriteAllLines("path", lines);
WriteAllLines also takes an IEnumerable, so you can pass a List of string into there if you want.
One more issue: it appears as though you're reading your file twice, once with ReadAllLines and again with your StreamReader.
There are at least four possible errors:
Opening the StreamReader is not required; you have already read all the lines. (Well, not really an error, but...)
The check for StartsWith can be fooled if your line starts with blank space, and you will miss the insertion point. (Adding a Trim will remove any problem here.)
In the CompareTo line you check for < 0 but you should check for == 0. CompareTo returns 0 if the strings are equivalent, however...
To check whether two strings are equal you should avoid using CompareTo, as explained on MSDN, and use string.Equals instead.
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant@mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i].Trim();
if (line.StartsWith("(LIST (LIST "))
{
string[] values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.Equals(lastName))
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
lines.Add(newRecord);
I don't list as an error the missing write back to the file. Hope that you have just omitted that part of the code. Otherwise it is a very simple problem.
(However, I think that the way in which CompareTo is used is probably the main reason for your problem.)
EDIT: Looking at your comment below, it seems that the answer from Sam I Am is the right one for you. Of course you need to write back the modified list of lines. All the changes are made to an in-memory list of lines, and nothing is written back to a file unless you have code that writes it. However, you don't need a new file:
File.WriteAllLines("Students.txt", lines);

How to delete first and last line from a text file c#?

I found this code on stackoverflow to delete first and last line from a text file.
But I'm not getting how to combine this code into one so that it will delete 1st and
last line from a single file?
What I tried was using streamreader read the file and then skip 1st and last line then
streamwriter to write in new file, but couldn't get the proper structure.
To delete first line.
var lines = System.IO.File.ReadAllLines("test.txt");
System.IO.File.WriteAllLines("test.txt", lines.Skip(1).ToArray());
to delete last line.
var lines = System.IO.File.ReadAllLines("...");
System.IO.File.WriteAllLines("...", lines.Take(lines.Length - 1).ToArray());
You can chain the Skip and Take methods. Remember to subtract the appropriate number of lines in the Take method. The more you skip at the beginning, the fewer lines remain.
var filename = "test.txt";
var lines = System.IO.File.ReadAllLines(filename);
System.IO.File.WriteAllLines(
filename,
lines.Skip(1).Take(lines.Length - 2)
);
Whilst probably not a major issue in this case, the existing answers all rely on reading the entire contents of the file into memory first. For small files, that's probably fine, but if you're working with very large files, this could prove prohibitive.
It is reasonably trivial to create a SkipLast equivalent of the existing Skip Linq method:
public static class SkipLastExtension
{
public static IEnumerable<T> SkipLast<T>(this IEnumerable<T> source, int count)
{
var queue = new Queue<T>();
foreach (var item in source)
{
queue.Enqueue(item);
if (queue.Count > count)
{
yield return queue.Dequeue();
}
}
}
}
If we also define a method that allows us to enumerate over each line of a file without pre-buffering the whole file (per: https://stackoverflow.com/a/1271236/381588):
static IEnumerable<string> ReadFrom(string filename)
{
using (var reader = File.OpenText(filename))
{
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
Then we can use the following one-liner to write a new file that contains all the lines from the original file, except the first and last:
File.WriteAllLines("output.txt", ReadFrom("input.txt").Skip(1).SkipLast(1));
This is undoubtedly (considerably) more code than the other answers that have already been posted here, but it should work on files of essentially any size (as well as providing code for a potentially useful SkipLast extension method).
Here's a different approach that uses ArraySegment<string> instead:
var lines = File.ReadAllLines("test.txt");
File.WriteAllLines("test.txt", new ArraySegment<string>(lines, 1, lines.Length-2));

Copy line number in text file c#

I have two files, file A and file B. I need to copy line 30 from file A and paste it over line 30 in file B. Can I do this in C#?
Here's a very simple way, assuming file B is small enough to read into memory:
string lineFromA = File.ReadLines("fileA.txt").Skip(29).First();
string[] linesFromB = File.ReadAllLines("fileB.txt");
linesFromB[29] = lineFromA;
File.WriteAllLines("fileC.txt", linesFromB);
This assumes you're using .NET 4, with its lazy File.ReadLines method. If you're not, the simplest approach would be to read both files into memory completely, using File.ReadAllLines twice:
string[] linesFromA = File.ReadAllLines("fileA.txt");
string[] linesFromB = File.ReadAllLines("fileB.txt");
linesFromB[29] = linesFromA[29];
File.WriteAllLines("fileC.txt", linesFromB);
There are definitely more efficient approaches, but I'd go with the above unless I had any reason to need a more efficient one.
If you use a streamwriter for the writing side you get a routine that does not use a lot of memory and can also be used for larger files.
string lineFromA = File.ReadLines("fileA.txt").Skip(29).First();
using (var fileC = File.CreateText("fileC.txt")) // CreateText starts a fresh file; AppendText would add to an existing fileC.txt
{
int i = 0;
foreach (var lineFromB in File.ReadLines("fileB.txt"))
{
i++;
fileC.WriteLine(i != 30 ? lineFromB : lineFromA);
}
}
