I am trying to figure out how to split a file by the number of lines in each output file. The files are CSV, so I can't split by bytes; I need to do it by lines. 20k seems to be a good number per file. What is the best way to read a stream at a given position? Stream.BaseStream.Position? So if I read the first 20k lines, would I start the next read at position 39,999? How do I know when I am almost at the end of a file? Thanks all.
using (System.IO.StreamReader sr = new System.IO.StreamReader("path"))
{
    int fileNumber = 0;
    while (!sr.EndOfStream)
    {
        int count = 0;
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter("other path" + ++fileNumber))
        {
            sw.AutoFlush = true;
            // count++ (not ++count) so each output file gets the full 20,000 lines
            while (!sr.EndOfStream && count++ < 20000)
            {
                sw.WriteLine(sr.ReadLine());
            }
        }
    }
}
int index = 0;
// group lines into blocks of 20,000; note the side effect on index,
// so the query must be enumerated exactly once
var groups = from line in File.ReadLines("myfile.csv")
             group line by index++ / 20000 into g
             select g.AsEnumerable();

int file = 0;
foreach (var group in groups)
    File.WriteAllLines((file++).ToString(), group.ToArray());
I'd do it like this:
// helper method to break a sequence up into blocks lazily;
// note the "this" modifier: it is an extension method, so it
// must be declared inside a static class
public static IEnumerable<ICollection<T>> SplitEnumerable<T>
    (this IEnumerable<T> sequence, int nbrPerBlock)
{
    List<T> block = new List<T>(nbrPerBlock);
    foreach (T value in sequence)
    {
        block.Add(value);
        if (block.Count == nbrPerBlock)
        {
            yield return block;
            block = new List<T>(nbrPerBlock);
        }
    }
    if (block.Any()) yield return block; // flush out any remaining lines
}

// now it's trivial; if you want to make smaller files, just foreach
// over this and write out the lines in each block to a new file
public static IEnumerable<ICollection<string>> SplitFile(string filePath)
{
    return File.ReadLines(filePath).SplitEnumerable(20000);
}
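For completeness, a minimal sketch of the foreach described in the comment above (the output naming scheme is my own, purely illustrative):

int part = 0;
foreach (var block in SplitFile("myfile.csv"))
{
    // hypothetical names: myfile.part0.csv, myfile.part1.csv, ...
    File.WriteAllLines("myfile.part" + part++ + ".csv", block);
}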
Is that not sufficient for you? You mention moving from position to position, but I don't see why that's necessary.
I am new to C# and I am working on an app that displays the time difference between the two dates on the last two lines of a text file.
I want to read the second-to-last line of the text file; I already know how to read the last line, but I need the one before it.
This is my code:
var lastLine = File.ReadAllLines("C:\\test.log").Last();
richTextBox1.Text = lastLine.ToString();
All the previous answers eagerly load the whole file into memory before returning the requested last lines. This can be an issue if the file is big. Luckily, it is easily avoidable.
public static IEnumerable<string> ReadLastLines(string path, int count)
{
    if (count < 1)
        return Enumerable.Empty<string>();

    var queue = new Queue<string>(count);
    foreach (var line in File.ReadLines(path))
    {
        if (queue.Count == count)
            queue.Dequeue();
        queue.Enqueue(line);
    }
    return queue;
}
This will only keep in memory the last n read lines avoiding memory issues with large files.
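For the two-line case in the question, usage might look like this (richTextBox1 is the control from the question):

var lastTwo = ReadLastLines("C:\\test.log", 2).ToArray();
richTextBox1.Text = string.Join(Environment.NewLine, lastTwo);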
Since
File.ReadAllLines("C:\\test.log");
returns an array you can take the last two items of the array:
var data = File.ReadAllLines("C:\\test.log");
string last = data[data.Length - 1];
string lastButOne = data[data.Length - 2];
In the general case, with long files (and that's why ReadAllLines is a bad choice), you can implement
public static partial class EnumerableExtensions {
    public static IEnumerable<T> Tail<T>(this IEnumerable<T> source, int count) {
        if (null == source)
            throw new ArgumentNullException("source");
        else if (count < 0)
            throw new ArgumentOutOfRangeException("count");
        else if (0 == count)
            yield break;

        Queue<T> queue = new Queue<T>(count + 1);

        foreach (var item in source) {
            queue.Enqueue(item);
            if (queue.Count > count)
                queue.Dequeue();
        }

        foreach (var item in queue)
            yield return item;
    }
}
...
var lastTwoLines = File
    .ReadLines("C:\\test.log") // not all lines
    .Tail(2);
You can try to do this
var lastLines = File.ReadAllLines("C:\\test.log").Reverse().Take(2).Reverse();
But depending on how large your file is there are probably more efficient methods to process this than reading all lines at once. See Get last 10 lines of very large text file > 10GB and How to read last “n” lines of log file
Simply store the result of ReadAllLines in a variable and then take the last two:
var allText = File.ReadAllLines("C:\\test.log");
var lastLines = allText.Skip(allText.Length - 2);
You can use Skip() and Take() like

var lines = File.ReadAllLines("C:\\test.log");
var lastTwo = lines.Skip(lines.Length - 2).Take(2);
richTextBox1.Text = string.Join(Environment.NewLine, lastTwo);
You can use a StreamReader in combination with a Queue<string>, since you have to read the whole file either way.

// if you want to keep more lines, change this to the number of lines you want
const int LINES_KEPT = 2;

Queue<string> meQueue = new Queue<string>();

using ( StreamReader reader = new StreamReader(File.OpenRead("C:\\test.log")) )
{
    string line = string.Empty;
    while ( ( line = reader.ReadLine() ) != null )
    {
        if ( meQueue.Count == LINES_KEPT )
            meQueue.Dequeue();
        meQueue.Enqueue(line);
    }
}
Now you can just use these two lines like so:
string line1 = meQueue.Dequeue();
string line2 = meQueue.Dequeue(); // <-- this is the last line.
Or to add this to the RichTextBox:

richTextBox1.Text = string.Empty; // clear the text
while ( meQueue.Count != 0 )
{
    // append lines in the same order as they appeared in the file
    richTextBox1.Text += meQueue.Dequeue() + Environment.NewLine;
}
Using File.ReadAllLines reads the whole text, and then LINQ iterates over lines that have already been read. The following method does everything in one pass.
string line;
string[] lines = new string[] { "", "" };
int index = 0;

using ( StreamReader reader = new StreamReader(File.OpenRead("C:\\test.log")) )
{
    while ( ( line = reader.ReadLine() ) != null )
    {
        lines[index] = line;
        index = 1 - index; // alternate between the two slots
    }
}
// second-to-last line = lines[index]
// last line = lines[1 - index]
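Hooking that back up to the question's RichTextBox is then one line (a sketch, assuming the file has at least two lines):

richTextBox1.Text = lines[index] + Environment.NewLine + lines[1 - index];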
I'm doing this exercise from a lab. The instructions are as follows:

This method should read the product catalog from a text file called “catalog.txt” that you should create alongside your project. Each product should be on a separate line. Use the instructions in the video to create the file and add it to your project, and to return an array with the first 200 lines from the file (use the StreamReader class and a while loop to read from the file). If the file has more than 200 lines, ignore them. If the file has fewer than 200 lines, it's OK if some of the array elements are empty (null).

I don't understand how to stream data into the string array; any clarification would be greatly appreciated!
static string[] ReadCatalogFromFile()
{
    //create instance of the catalog.txt
    StreamReader readCatalog = new StreamReader("catalog.txt");

    //store the information in this array
    string[] storeCatalog = new string[200];

    int i = 0;
    //test and store the array information
    while (storeCatalog != null)
    {
        //store each string in the elements of the array?
        storeCatalog[i] = readCatalog.ReadLine();
        i = i + 1;
        if (storeCatalog != null)
        {
            //test to see if its properly stored
            Console.WriteLine(storeCatalog[i]);
        }
    }
    readCatalog.Close();
    Console.ReadLine();
    return storeCatalog;
}
Here are some hints:
In your while() you should check the result of readCatalog.ReadLine() and/or the maximum number of lines to read (i.e. the size of your array). At the moment you test storeCatalog, which is the array itself and is never null, so the loop never stops and i eventually runs past the end of the array.
Thus: if you reached the end of the file -> stop, or if your array is full -> stop.
static string[] ReadCatalogFromFile()
{
    var lines = new string[200];
    using (var reader = new StreamReader("catalog.txt"))
        for (var i = 0; i < 200 && !reader.EndOfStream; i++)
            lines[i] = reader.ReadLine();
    return lines;
}
A for-loop is used when you know the exact number of iterations beforehand, so you can say it should iterate exactly 200 times and you won't cross the index boundaries. At the moment you just check that your array isn't null, which it never will be.
using (var readCatalog = new StreamReader("catalog.txt"))
{
    string[] storeCatalog = new string[200];
    for (int i = 0; i < 200; i++)
    {
        string temp = readCatalog.ReadLine();
        if (temp != null)
            storeCatalog[i] = temp;
        else
            break;
    }
    return storeCatalog;
}
As soon as there are no more lines in the file, temp will be null and the loop will be stopped by the break.
I suggest you use your disposable resources (like any stream) in a using statement. After the operations in the braces, the resource will automatically get disposed.
I'm working on an ASP.NET MVC application, and I'm trying to get every line that follows a line containing a given word.
I've used the code below, but I can only get the last line that contains the word.
int counter = 0;
string line;
List<string> found = new List<string>();

// Read the file and display it line by line.
System.IO.StreamReader file = new System.IO.StreamReader("C:\\Users\\Chaimaa\\Documents\\path.txt");
while ((line = file.ReadLine()) != null)
{
    if (line.Contains("fact"))
    {
        found.Add(line);
    }
    foreach (var i in found)
    {
        var output = i;
        ViewBag.highlightedText = output;
    }
}
Any help on what I should add to
1- get ALL the lines that contain the word
2- and preferably get ALL the NEXT lines
You can use an overload of Where that provides an index, store indexes in a hash set, and use the containment check to decide whether a line should be kept or not, like this:

var seen = new HashSet<int>();
var res = data.Where((v, i) => {
    if (v.Contains("fact")) {
        seen.Add(i);
    }
    return seen.Contains(i - 1);
});
As a side benefit, seen would contain indexes of all lines where the word "fact" has been found.
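For completeness, wiring the snippet above to the question's file might look like this (a sketch; note that res is lazy, so seen only fills up as res is enumerated):

var data = System.IO.File.ReadLines("C:\\Users\\Chaimaa\\Documents\\path.txt");
var seen = new HashSet<int>();
var res = data.Where((v, i) => {
    if (v.Contains("fact")) {
        seen.Add(i);
    }
    return seen.Contains(i - 1);
}).ToList(); // lines that immediately follow a line containing "fact"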
You can write a Pairwise method that takes in a sequence of values and returns a sequence containing each item paired with the item that came before it:
public static IEnumerable<Tuple<T, T>> Pairwise<T>(this IEnumerable<T> source)
{
    using (var iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
            yield break;

        T prev = iterator.Current;
        while (iterator.MoveNext())
        {
            yield return Tuple.Create(prev, iterator.Current);
            prev = iterator.Current;
        }
    }
}
With this method we can pair off each of the lines, get the lines where the previous value contains a word, and then project out the second value, which is the line after it:
var query = lines.Pairwise()
    .Where(pair => pair.Item1.Contains(word))
    .Select(pair => pair.Item2);
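Here lines and word are assumed from the surrounding context; wiring it to the question's scenario might look like this (a sketch):

var lines = System.IO.File.ReadLines("C:\\Users\\Chaimaa\\Documents\\path.txt");
var word = "fact";
ViewBag.highlightedText = string.Join(Environment.NewLine,
    lines.Pairwise()
        .Where(pair => pair.Item1.Contains(word))
        .Select(pair => pair.Item2));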
I want to write the result of var item to a text file.
I use File.WriteAllText, and the path is @"C:\Users\TBM\Desktop\test.txt",
but I only get the last value of item, which is EDC.
static void Main(string[] args)
{
    var alphabet = "ABCDE";
    var q = alphabet.Select(x => x.ToString());
    int size = 3;
    for (int i = 0; i < size - 1; i++)
    {
        q = q.SelectMany(x => alphabet, (x, y) => x + y);
    }
    foreach (var item in q)
    {
        if ((item[0] == item[1]) || (item[1] == item[2]) || (item[0] == item[2]))
        {
            continue;
            File.WriteAllText(@"C:\Users\TBM\Desktop\test.txt", item);
        }
    }
}
The StreamWriter is the easiest way to write to a text file while you're looping through something:
using (var sw = new StreamWriter(filename))
{
    foreach (var something in somethingElse)
    {
        string line = "compute this line somehow";
        sw.WriteLine(line);
    }
}
That's all you need - your file will be created, written, saved, and closed.
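Applied to the question's loop, it would look something like this (a sketch; the path is the one from the question):

using (var sw = new StreamWriter(@"C:\Users\TBM\Desktop\test.txt"))
{
    foreach (var item in q)
    {
        // skip items with any repeated character
        if ((item[0] == item[1]) || (item[1] == item[2]) || (item[0] == item[2]))
        {
            continue;
        }
        sw.WriteLine(item);
    }
}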
File.WriteAllText() opens the file, writes your text to it, then closes the file. You need to look up one of the many examples of writing a text file line by line.
WriteAllText overwrites the entire contents of the text file with that item. The smallest change would be to use AppendAllText instead (clearing the file before the loop if that's what you want), but the more idiomatic solution is to refactor your code to use WriteAllLines:
File.WriteAllLines(path, q);
That will simply write each item out on its own line.
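If you do want to keep the filtering loop, a sketch of the smallest-change AppendAllText variant described above (path as in the question):

var path = @"C:\Users\TBM\Desktop\test.txt";
File.WriteAllText(path, string.Empty); // clear the file before the loop
foreach (var item in q)
{
    if ((item[0] == item[1]) || (item[1] == item[2]) || (item[0] == item[2]))
    {
        continue;
    }
    File.AppendAllText(path, item + Environment.NewLine);
}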
Your File.WriteAllText() call in your example comes after continue -- how does it ever get invoked? It needs to be outside your if block.
Should it be

foreach (var item in q)
{
    if ((item[0] == item[1]) || (item[1] == item[2]) || (item[0] == item[2]))
    {
        continue;
    }
    File.WriteAllText(@"C:\Users\TBM\Desktop\test.txt", item);
}

Doesn't the continue keyword skip the rest of the loop body?
I'm writing a program that scans a text file for blocks of strings (lines) and outputs the blocks to a file when found.
In my process class, the function proc() is taking an unusually long time to process a 6MB file. In a previous program, where I scanned the text for only one specific type of string, it took 5 seconds to process the same file. Now that I've rewritten it to scan for the presence of different strings, it is taking over 8 minutes, which is a significant difference. Does anyone have any ideas how to optimize this function?
This is my RegEx:

System.Text.RegularExpressions.Regex RegExp
{
    get
    {
        return new System.Text.RegularExpressions.Regex(
            @"(?s)(?-m)MSH.+?(?=[\r\n]([^A-Z0-9]|.{1,2}[^A-Z0-9])|$)",
            System.Text.RegularExpressions.RegexOptions.Compiled);
    }
}
public static class TypeFactory
{
    public static List<IMessageType> GetTypeList()
    {
        List<IMessageType> types = new List<IMessageType>();

        types.AddRange(from assembly in AppDomain.CurrentDomain.GetAssemblies()
                       from t in assembly.GetTypes()
                       where t.IsClass && t.GetInterfaces().Contains(typeof(IMessageType))
                       select Activator.CreateInstance(t) as IMessageType);

        return types;
    }
}
public class process
{
    public void proc()
    {
        IOHandler.Read reader = new IOHandler.Read(new string[1] { @"C:\TEMP\DeIdentified\DId_RSLTXMIT.LOG" });
        List<IMessageType> types = MessageType.TypeFactory.GetTypeList();

        //TEST1
        IOHandler.Write.writeReport(System.DateTime.Now.ToString(), "TEST", "v3test.txt", true);

        foreach (string file in reader.FileList)
        {
            using (FileStream readStream = new FileStream(file, FileMode.Open, FileAccess.Read))
            {
                int charVal = 0;
                Int64 position = 0;
                StringBuilder fileFragment = new StringBuilder();
                string message = string.Empty;
                string current = string.Empty;
                string previous = string.Empty;
                int currentLength = 0;
                int previousLength = 0;
                bool found = false;

                do
                {
                    //string line = reader.ReturnLine(readStream, out charVal, ref position);
                    string line = reader.ReturnLine(readStream, out charVal);

                    for (int i = 0; i < types.Count; i++)
                    {
                        if (Regex.IsMatch(line, types[i].BeginIndicator)) //found first line of a message type
                        {
                            found = true;
                            message += line;
                            do
                            {
                                previousLength = types[i].RegExp.Match(message).Length;

                                //keep adding lines until match length stops growing
                                //message += reader.ReturnLine(readStream, out charVal, ref position);
                                message += reader.ReturnLine(readStream, out charVal);
                                currentLength = types[i].RegExp.Match(message).Length;
                                if (currentLength == previousLength)
                                {
                                    //stop - message complete
                                    IOHandler.Write.writeReport(message, "TEST", "v3test.txt", true);
                                    //reset
                                    message = string.Empty;
                                    currentLength = 0;
                                    previousLength = 0;
                                    break;
                                }
                            } while (charVal != -1);
                            break;
                        }
                    }
                } while (charVal != -1);

                //END OF FILE CONDITION
                if (charVal == -1)
                {
                }
            }
        }

        IOHandler.Write.writeReport(System.DateTime.Now.ToString(), "TEST", "v3test.txt", true);
    }
}
EDIT: I ran the profiling wizard in VS2012 and found that most of the time was spent in the Regex.Match function.
Here are some thoughts:
RegEx matching is not the most efficient way to do a substring search, and you are performing the match check once per "type" of match. Have a look at efficient substring-matching algorithms such as Boyer-Moore if you need to match literal substrings rather than patterns.
If you must use RegEx, cache a compiled expression instead of constructing a new one on each call: the RegExp property in your code builds (and compiles) a fresh Regex on every access, which throws away the benefit of RegexOptions.Compiled. See the sketch after this list.
Use a BufferedStream to improve IO performance. Probably marginal for a 6MB file, but it only costs a line of code.
Use a profiler to be sure exactly where time is being spent.
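A minimal sketch of that caching idea (the pattern is copied from the question; the class name is just for illustration, and in your code the static field would live in each IMessageType implementation):

using System.Text.RegularExpressions;

public class MessageTypeRegexCache
{
    // built and compiled once per process, not once per property access
    private static readonly Regex MsgRegex = new Regex(
        @"(?s)(?-m)MSH.+?(?=[\r\n]([^A-Z0-9]|.{1,2}[^A-Z0-9])|$)",
        RegexOptions.Compiled);

    public Regex RegExp { get { return MsgRegex; } }
}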
High level ideas:
Use Regex.Matches to find all matches at once instead of one by one -- probably the main performance hit.
Pre-build the search pattern to include multiple message types at once; you can use Regex alternation (|), as in the sketch below.
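A rough sketch of combining the types' begin patterns into one alternation and scanning the text in a single pass (BeginIndicator, types, and file come from the question's code; everything else is illustrative):

// combine each type's begin pattern into one alternation
string combined = string.Join("|",
    types.Select(t => "(?:" + t.BeginIndicator + ")"));
Regex all = new Regex(combined, RegexOptions.Compiled);

// scan the whole file once instead of regex-matching line by line
string text = File.ReadAllText(file);
foreach (Match m in all.Matches(text))
{
    // m.Index and m.Value give the start of each candidate message
}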