Reading a large log file in C# - c#

For my project, I need to extract message types from a log file. I have a 700 MB log file containing about 4.7 million lines, and I need to read it line by line and extract the message field from each entry. I need to find the size of the message in each entry (which is the event size) and store it along with that message in a dictionary. There can be multiple messages for the same event size. But I get an OutOfMemoryException when I use the logic below.
Dictionary<Int32,List<String>> dt=new Dictionary<Int32,List<String>>();
List<String> entries=new List<String>();
StreamReader sr=new StreamReader("Bluegene.log");
String s;
while((s=sr.readLine())!=null)
{
eventsize=s.length - 9; //size of only the message field
entries.Add(s);
if (!dt.ContainsKey(eventsize))
{
dt.Add(eventsize, entries);
}
else
{
dt.Remove(eventsize);
dt.Add(eventsize, entries);
}
}
Will using MemoryMappedFile help?

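The problem is that your single entries list keeps growing: every line is added to it, and every dictionary key ends up pointing at that same list.
So, you can try the following, which creates a separate list for each event size: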
Dictionary<Int32, List<String>> dt = new Dictionary<Int32, List<String>>();
int eventsize;
// The using block closes the file when reading finishes.
using (StreamReader sr = new StreamReader("Bluegene.log"))
{
    string s;
    while ((s = sr.ReadLine()) != null)
    {
        eventsize = s.Length - 9; //size of only the message field
        if (!dt.ContainsKey(eventsize))
        {
            // First message of this size: start a new list for it.
            List<String> entries = new List<String>();
            entries.Add(s);
            dt.Add(eventsize, entries);
        }
        else
        {
            dt[eventsize].Add(s);
        }
    }
}
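A more compact variant of the same idea, as a rough sketch only: File.ReadLines streams the file one line at a time, and TryGetValue avoids looking the key up twice. The names LogGrouper, GroupByEventSize and GroupFile, and the prefixLength parameter (9 in the question), are made up for this illustration. Note that it still keeps every line in memory, so a 700 MB file may still be too much for a 32-bit process.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LogGrouper
{
    // Groups lines by "event size", i.e. line length minus a fixed prefix.
    // prefixLength is an assumption taken from the question (9 characters).
    public static Dictionary<int, List<string>> GroupByEventSize(IEnumerable<string> lines, int prefixLength)
    {
        var groups = new Dictionary<int, List<string>>();
        foreach (string line in lines)
        {
            int eventSize = line.Length - prefixLength;
            if (!groups.TryGetValue(eventSize, out var entries))
            {
                // First line with this size: create its own list.
                entries = new List<string>();
                groups.Add(eventSize, entries);
            }
            entries.Add(line);
        }
        return groups;
    }

    public static Dictionary<int, List<string>> GroupFile(string path, int prefixLength)
    {
        // File.ReadLines enumerates lazily instead of loading the whole file first.
        return GroupByEventSize(File.ReadLines(path), prefixLength);
    }
}
```

Usage would be something like LogGrouper.GroupFile("Bluegene.log", 9). If only the sizes and counts matter, storing a count per size instead of the lines themselves would cut memory use much further.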

Related

How do you store a text file into an array with different string formats?

Using C#. I am creating a program that stores movies in a text file such as:
38#Finding Nemo#Family#2001#yes
36#Wonder Woman#Action#2017#yes
35#Solo#Action#2018#yes
I am trying to process a batch transaction into my movie inventory file. This text file holds an action code after the first delimiter('#') whether to add, change, or delete a movie in the movie inventory file. The problem that I am having is that each line has a different format while I am trying to store it into an array. What is a way to read the header line differently and ignore empty values between delimiters?
An example batch file is:
H#Movie Inventory Updates#10/31/2019
D#C#5#Family Movie 1###no
D#A#21#Sci-Fi Movie 1#sci-fi#2005#yes
D#A#22#Other Movie 1#other#2001#yes
D#C#1###2002#
D#D#4####
public static Batch[] ReadBatchFile()
{
Batch[] bTransactions = new Batch[100];
if (File.Exists("batch_transaction.txt"))
{
StreamReader inFile = new StreamReader("batch_transaction.txt");
//string path = "batch_transaction.txt";
string line = inFile.ReadLine();
while (line != null) //read all transaction records, ignoring header and footer
{
string[] tempArray = line.Split('#');
bTransactions[Batch.GetCount()] = new Batch(tempArray[0], tempArray[1], int.Parse(tempArray[2]), tempArray[3], tempArray[4], double.Parse(tempArray[5]), tempArray[6]);
Batch.SetCount(Batch.GetCount() + 1);
line = inFile.ReadLine();
}
inFile.Close();
}
else
{
Console.WriteLine("File not found.");
}
return bTransactions;
}
}
Here is how I'm currently trying to read in the other lines:
public Batch(string recordType, string actionCode, int movieId, string movieTitle, string movieGenre, double releaseYear, string inStock)
{
this.recordType = recordType;
this.actionCode = actionCode;
this.movieId = movieId;
this.movieTitle = movieTitle;
this.movieGenre = movieGenre;
this.releaseYear = releaseYear;
}
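One possible way to approach this, offered as a sketch rather than a definitive answer: look at the first field to decide how to treat the line (skip anything that is not a "D" detail record, such as the "H" header), and use TryParse with a nullable year so that empty fields between delimiters do not throw. BatchRecord and BatchParser are hypothetical names for this illustration, since the question's Batch class is not shown in full, and the seven-field layout is inferred from the sample batch file.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record type for illustration only.
public class BatchRecord
{
    public string ActionCode { get; set; } = "";
    public int MovieId { get; set; }
    public string Title { get; set; } = "";   // empty when the field is blank
    public string Genre { get; set; } = "";   // empty when the field is blank
    public int? ReleaseYear { get; set; }     // null when the field is blank
    public string InStock { get; set; } = ""; // empty when the field is blank
}

public static class BatchParser
{
    // Parses detail lines ("D#...") and skips the header ("H#...") and blank lines.
    public static List<BatchRecord> Parse(IEnumerable<string> lines)
    {
        var records = new List<BatchRecord>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            string[] f = line.Split('#');
            // Anything that is not a 7-field detail record is ignored here.
            if (f[0] != "D" || f.Length < 7) continue;

            // The movie id is required; skip the line if it is missing or not a number.
            if (!int.TryParse(f[2], out int id)) continue;

            // The year is optional: keep null when the field is empty or invalid.
            int? year = null;
            if (int.TryParse(f[5], out int parsedYear)) year = parsedYear;

            records.Add(new BatchRecord
            {
                ActionCode = f[1],
                MovieId = id,
                Title = f[3],
                Genre = f[4],
                ReleaseYear = year,
                InStock = f[6]
            });
        }
        return records;
    }
}
```

With File.ReadLines("batch_transaction.txt") as input this returns a list, so there is no fixed array size of 100 to manage. When applying a change record, an empty string or null year can then be read as "leave this value as it is".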

How to count strings from text file in C#

In this button click event I am trying to count the strings in a text file that are the same as those in the textboxes, then display the number of them in a label. My problem is that I have no idea how to count them. I'm talking about the code inside the if-statement. I would really appreciate any help.
private void btnCalculate_Click(object sender, EventArgs e)
{
string openFileName;
using (OpenFileDialog ofd = new OpenFileDialog())
{
if (ofd.ShowDialog() != DialogResult.OK)
{
MessageBox.Show("You did not select OK");
return;
}
openFileName = ofd.FileName;
}
FileStream fs = null;
StreamReader sr = null;
try
{
fs = new FileStream("x", FileMode.Open, FileAccess.Read);
fs.Seek(0, SeekOrigin.Begin);
sr = new StreamReader(fs);
string s = sr.ReadLine();
while (s != null)
{
s = sr.ReadLine();
}
if(s.Contains(tbFirstClub.Text))
{
s.Count = lblResult1.Text; //problem is here
}
else if(s.Contains(tbSecondClub.Text))
{
s.Count = lblResult2.Text; //problem is here
}
}
catch (IOException)
{
MessageBox.Show("Error reading file");
}
catch (Exception)
{
MessageBox.Show("Something went wrong");
}
finally
{
if (sr != null)
{
sr.Close();
}
}
}
Thanks in advance.
s.Count = lblResult1.Text; //problem is here
Wait, look at what you are saying here:
you have a variable (s),
you access its property (Count),
and then you set it to the label text (lblResult1.Text).
Is that what you're trying to do? Because the reverse seems more likely.
Using LINQ you can get the number of occurrences, like below:
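// Counts the lines in the file that contain the text (needs System.IO and System.Linq).
int numOfOccurrences = File.ReadLines(openFileName).Count(line => line.Contains(tbFirstClub.Text));
lblResult1.Text = numOfOccurrences.ToString();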
Welcome to Stack Overflow.
I want to point out something you said.
else if(s.Contains(tbSecondClub.Text))
{
s.Count = lblResult2.Text; //problem is here
}
Here s is the string that we just read from the file.
You're assigning the label's text to s.Count (the length of the string).
I don't think this is what you want. We want to return the number of times the specified strings show up in the specified file.
Let's refactor this (and add some tricks along the way).
// Let's create a dictionary to store all of our desired texts and their counts.
var textAndCounts = new Dictionary<string, int>();
textAndCounts.Add(tbFirstClub.Text, 0); // Assuming the type of Text is string; change accordingly
textAndCounts.Add(tbSecondClub.Text, 0);
// We added both of our text fields to the dictionary with a value of 0.
// Read all the lines from the file.
var allLines = File.ReadAllLines(openFileName); /* using System.IO */
foreach (var line in allLines)
{
    if (line.Contains(tbFirstClub.Text))
    {
        textAndCounts[tbFirstClub.Text] += 1; // Go to where we stored the count for this text and increment it
    }
    if (line.Contains(tbSecondClub.Text))
    {
        textAndCounts[tbSecondClub.Text] += 1;
    }
}
This should solve your problem, but it's still pretty brittle. Optimally, we want to design a system that works for any number of strings and counts them.
So how would I do it?
public Dictionary<string, int> GetCountsPerStringInFile(IEnumerable<string> textsToSearch, string filePath)
{
    // Let's use LINQ to create a dictionary, assuming all strings are unique.
    // Each string in the list becomes a key, and every value starts at 0: <text, 0>
    var textsAndCounts = textsToSearch.ToDictionary(text => text, text => 0);
    var allLines = File.ReadAllLines(filePath);
    foreach (var line in allLines)
    {
        // You didn't specify whether a line could contain multiple values, so let's handle that here.
        // ToList() takes a snapshot of the matching keys, so the dictionary is not being enumerated while we update it.
        var keysContained = textsAndCounts.Keys.Where(c => line.Contains(c)).ToList();
        foreach (var key in keysContained)
        {
            textsAndCounts[key] += 1; // increment the count associated with that string
        }
    }
    return textsAndCounts;
}
The above code allows us to return a data structure holding any number of strings, each with a count.
I think this is a good example that will save you some headaches going forward, and it's probably a good first toe-dip into design patterns. I'd suggest looking up some material on data structures and their use cases.

Add two lines from csv file to array(s)

I have a csv file with the following data:
500000,0.005,6000
690000,0.003,5200
I need to add each line as a separate array. So 500000, 0.005, 6000 would be array1. How would I do this?
Currently my code adds each column into one element.
For example data[0] is showing 500000
690000
static void ReadFromFile(string filePath)
{
try
{
// Create an instance of StreamReader to read from a file.
// The using statement also closes the StreamReader.
using (StreamReader sr = new StreamReader(filePath))
{
string line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
string[] data = line.Split(',');
Console.WriteLine(data[0] + " " + data[1]);
}
}
}
catch (Exception e)
{
// Let the user know what went wrong.
Console.WriteLine("The file could not be read:");
Console.WriteLine(e.Message);
}
}
Using the limited data set you've provided...
const string test = @"500000,0.005,6000
690000,0.003,5200";
var result = test.Split('\n')
.Select(x=> x.Split(',')
.Select(y => Convert.ToDecimal(y))
.ToArray()
)
.ToArray();
foreach (var element in result)
{
Console.WriteLine($"{element[0]}, {element[1]}, {element[2]}");
}
Can it be done without LINQ? Yes, but it's messy...
const string test = @"500000,0.005,6000
690000,0.003,5200";
List<decimal[]> resultList = new List<decimal[]>();
string[] lines = test.Split('\n');
foreach (var line in lines)
{
List<decimal> decimalValueList = new List<decimal>();
string[] splitValuesByComma = line.Split(',');
foreach (string value in splitValuesByComma)
{
decimal convertedValue = Convert.ToDecimal(value);
decimalValueList.Add(convertedValue);
}
decimal[] decimalValueArray = decimalValueList.ToArray();
resultList.Add(decimalValueArray);
}
decimal[][] resultArray = resultList.ToArray();
That will give exactly the same output as the first example.
If you can use a List<string[]>, you do not have to worry about the array length.
In the following example, the variable lines will be a list of arrays, like:
["500000", "0.005", "6000"]
["690000", "0.003", "5200"]
static void ReadFromFile(string filePath)
{
try
{
// Create an instance of StreamReader to read from a file.
// The using statement also closes the StreamReader.
using (StreamReader sr = new StreamReader(filePath))
{
List<string[]> lines = new List<string[]>();
string line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
string[] splittedLine = line.Split(',');
lines.Add(splittedLine);
}
}
}
catch (Exception e)
{
// Let the user know what went wrong.
Console.WriteLine("The file could not be read:");
Console.WriteLine(e.Message);
}
}
While the other answers use Split, I'll take a more "scholarly", specification-first approach.
You have some CSV values in a file. Find a name for the object stored in the CSV, name every column, and give each one a type.
Define the default value of those fields. Define what happens for a missing column or a malformed field. Is there a header?
Now that you know what you have, define what you want. Once again: object name -> property -> type.
Believe it or not, simply defining your input and output solves most of your issue.
Use CsvHelper to simplify your code.
CSV File Definition:
public class CsvItem_WithARealName
{
public int data1;
public decimal data2;
public int goodVariableNames;
}
public class CsvItemMapper : ClassMap<CsvItem_WithARealName>
{
public CsvItemMapper()
{ //mapping based on index. cause file has no header.
Map(m => m.data1).Index(0);
Map(m => m.data2).Index(1);
Map(m => m.goodVariableNames).Index(2);
}
}
A CSV reader method: point it at a document and it will give you the CSV items.
Here we have some configuration: no header, and InvariantCulture for decimal conversion.
private IEnumerable<CsvItem_WithARealName> GetCsvItems(string filePath)
{
using (var fileReader = File.OpenText(filePath))
using (var csvReader = new CsvHelper.CsvReader(fileReader))
{
csvReader.Configuration.CultureInfo = CultureInfo.InvariantCulture;
csvReader.Configuration.HasHeaderRecord = false;
csvReader.Configuration.RegisterClassMap<CsvItemMapper>();
while (csvReader.Read())
{
var record = csvReader.GetRecord<CsvItem_WithARealName>();
yield return record;
}
}
}
Usage :
var filename = "csvExemple.txt";
var items = GetCsvItems(filename);

How to avoid c# File.ReadLines First() locking file

I do not want to read the whole file at any point. I know there are answers to that question; I want to read only the first or last line.
I know that my code locks the file that it's reading for two reasons 1) The application that writes to the file crashes intermittently when I run my little app with this code but it never crashes when I am not running this code! 2) There are a few articles that will tell you that File.ReadLines locks the file.
There are some similar questions, but their answers seem to involve reading the whole file, which is slow for large files and therefore not what I want to do. My requirement to read only the last line most of the time is also different from what I have read about.
I need to know how to read the first line (header row) and the last line (latest row). I do not want to read all lines at any point in my code because this file can become huge and reading the entire file will become slow.
I know that
line = File.ReadLines(fullFilename).First().Replace("\"", "");
... is the same as ...
FileStream fs = new FileStream(@fullFilename, FileMode.Open, FileAccess.Read, FileShare.Read);
My question is: how can I repeatedly read the first and last lines of a file that may be being written to by another application, without locking it in any way? I have no control over the application that is writing to the file. It is a data log which can be appended to at any time. The reason I am listening in this way is that this log can be appended to for days on end. I want to see the latest data in this log in my own C# program without waiting for the log to finish being written.
My code to call the reading / listening function ...
//Start Listening to the "data log"
private void btnDeconstructCSVFile_Click(object sender, EventArgs e)
{
MySandbox.CopyCSVDataFromLogFile copyCSVDataFromLogFile = new MySandbox.CopyCSVDataFromLogFile();
copyCSVDataFromLogFile.checkForLogData();
}
My class which does the listening. For now it simply adds the data to two generic lists ...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using MySandbox.Classes;
using System.IO;
namespace MySandbox
{
public class CopyCSVDataFromLogFile
{
static private List<LogRowData> listMSDataRows = new List<LogRowData>();
static String fullFilename = string.Empty;
static LogRowData previousLineLogRowList = new LogRowData();
static LogRowData logRowList = new LogRowData();
static LogRowData logHeaderRowList = new LogRowData();
static Boolean checking = false;
public void checkForLogData()
{
//Initialise
string[] logHeaderArray = new string[] { };
string[] badDataRowsArray = new string[] { };
//Get the latest full filename (file with new data)
//Assumption: only 1 file is written to at a time in this directory.
String directory = "C:\\TestDir\\";
string pattern = "*.csv";
var dirInfo = new DirectoryInfo(directory);
var file = (from f in dirInfo.GetFiles(pattern) orderby f.LastWriteTime descending select f).First();
fullFilename = directory + file.ToString(); //This is the full filepath and name of the latest file in the directory!
if (logHeaderArray.Length == 0)
{
//Populate the Header Row
logHeaderRowList = getRow(fullFilename, true);
}
LogRowData tempLogRowList = new LogRowData();
if (!checking)
{
//Read the latest data in an asynchronous loop
callDataProcess();
}
}
private async void callDataProcess()
{
checking = true; //Begin checking
await checkForNewDataAndSaveIfFound();
}
private static Task checkForNewDataAndSaveIfFound()
{
return Task.Run(() => //Call the async "Task"
{
while (checking) //Loop (asynchronously)
{
LogRowData tempLogRowList = new LogRowData();
if (logHeaderRowList.ValueList.Count == 0)
{
//Populate the Header row
logHeaderRowList = getRow(fullFilename, true);
}
else
{
//Populate Data row
tempLogRowList = getRow(fullFilename, false);
if ((!Enumerable.SequenceEqual(tempLogRowList.ValueList, previousLineLogRowList.ValueList)) &&
(!Enumerable.SequenceEqual(tempLogRowList.ValueList, logHeaderRowList.ValueList)))
{
logRowList = getRow(fullFilename, false);
listMSDataRows.Add(logRowList);
previousLineLogRowList = logRowList;
}
}
//System.Threading.Thread.Sleep(10); //Wait for next row.
}
});
}
private static LogRowData getRow(string fullFilename, bool isHeader)
{
string line;
string[] logDataArray = new string[] { };
LogRowData logRowListResult = new LogRowData();
try
{
if (isHeader)
{
//Assign first (header) row data.
//Works but seems to block writing to the file!
line = File.ReadLines(fullFilename).First().Replace("\"", "");
}
else
{
//Assign data as last row (default behaviour).
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
}
logDataArray = line.Split(',');
//Copy Array to Generics List and remove last value if it's empty.
for (int i = 0; i < logDataArray.Length; i++)
{
if (i < logDataArray.Length)
{
if (i < logDataArray.Length - 1)
{
//Value is not at the end, from observation, these always have a value (even if it's zero) and so we'll store the value.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else
{
//This is the last value
if (logDataArray[i].Replace("\"", "").Trim().Length > 0)
{
//In this case, the last value is not empty, store it as normal.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else { /*The last value is empty, e.g. "123,456,"; the final comma denotes another field but this field is empty so we will ignore it now. */ }
}
}
}
}
catch (Exception ex)
{
if (ex.Message == "Sequence contains no elements")
{ /*Empty file, no problem. The code will safely loop and then will pick up the header when it appears.*/ }
else
{
//TODO: catch this error properly
Int32 problemID = 10; //Unknown ERROR.
}
}
return logRowListResult;
}
}
}
I found the answer in a combination of other questions. One answer explained how to read from the end of a file, which I adapted so that it reads only one line from the end of the file. Another explained how to read the entire file without locking it (I did not want to read the entire file, but the not-locking part was useful). So now you can read the last line of the file (if it contains end-of-line characters) without locking it. For other end-of-line delimiters, just replace my 10 and 13 with your end-of-line character bytes...
Add the method below to public class CopyCSVDataFromLogFile
private static string Reverse(string str)
{
char[] arr = new char[str.Length];
for (int i = 0; i < str.Length; i++)
arr[i] = str[str.Length - 1 - i];
return new string(arr);
}
and replace this line ...
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
with this code block ...
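//Note: the loop below appends to 'line', so it must be initialised where it is declared (string line = string.Empty;).
Int32 endOfLineCharacterCount = 0;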
Int32 previousCharByte = 0;
Int32 currentCharByte = 0;
//Read the file, from the end, for 1 line, allowing other programmes to access it for read and write!
using (FileStream reader = new FileStream(fullFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 0x1000, FileOptions.SequentialScan))
{
int i = 0;
StringBuilder lineBuffer = new StringBuilder();
int byteRead;
while ((-i < reader.Length) /*Belt and braces: if there were no end of line characters, reading beyond the file would give a catastrophic error here (to be avoided thus).*/
&& (endOfLineCharacterCount < 2)/*Exit Condition*/)
{
reader.Seek(--i, SeekOrigin.End);
byteRead = reader.ReadByte();
currentCharByte = byteRead;
//Exit condition: the first 2 characters we read (reading backwards, remember) were the end of line (bytes 10 and 13).
//So when we read the second end of line, we have read 1 whole line (the last line in the file)
//and we must exit now.
if (currentCharByte == 13 && previousCharByte == 10)
{
endOfLineCharacterCount++;
}
if (byteRead == 10 && lineBuffer.Length > 0)
{
line += Reverse(lineBuffer.ToString());
lineBuffer.Remove(0, lineBuffer.Length);
}
lineBuffer.Append((char)byteRead);
previousCharByte = byteRead;
}
reader.Close();
}
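For comparison, here is a self-contained sketch of the same technique: open the file with FileShare.ReadWrite so that the writing application is not blocked, then scan backwards from the end for the last line. TailReader and ReadLastLine are made-up names for this example, and it assumes LF or CR LF line endings and UTF-8 (or plain ASCII) content. Opening the file can still fail if the writer itself holds it with an exclusive lock.

```csharp
using System;
using System.IO;
using System.Text;

public static class TailReader
{
    // Returns the last non-empty line of the file, or "" if there is none.
    // FileShare.ReadWrite lets another process keep reading and writing while we read.
    public static string ReadLastLine(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            long pos = fs.Length;

            // Skip any trailing CR/LF bytes at the end of the file.
            while (pos > 0)
            {
                fs.Seek(pos - 1, SeekOrigin.Begin);
                int b = fs.ReadByte();
                if (b != 10 && b != 13) break;
                pos--;
            }
            long end = pos; // one past the last byte of the last line

            // Walk backwards until the previous LF, or the start of the file.
            while (pos > 0)
            {
                fs.Seek(pos - 1, SeekOrigin.Begin);
                if (fs.ReadByte() == 10) break;
                pos--;
            }

            // Read the bytes of the last line in one go and decode them.
            var buffer = new byte[end - pos];
            fs.Seek(pos, SeekOrigin.Begin);
            int read = 0;
            while (read < buffer.Length)
            {
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }
}
```

Decoding the whole line at once, rather than casting each byte to a char, keeps multi-byte UTF-8 characters intact. The header row can be read the same way by opening a FileStream with FileShare.ReadWrite, wrapping it in a StreamReader, and calling ReadLine once.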

Ignore certain lines in a text file (C# Streamreader)

I'm trying to work out a way of removing records from a program I'm writing. I have a text file with all the customer data spread over a set of lines, and I read in these lines one at a time and store them in a List.
When writing, I simply append to the file. For deleting, however, I had the idea of adding a character such as * or # to the front of lines that are no longer needed, but I am unsure how to do this.
Below is how I currently read the data in:
Thanks in advance
StreamReader dataIn = null;
CustomerClass holdcus; //holdcus and holdacc are used as "holding pens" for the next customer/account
Accounts holdacc;
bool moreData = false;
string[] cusdata = new string[13]; //holds customer data
string[] accdata = new string[8]; //holds account data
if (fileIntegCheck(inputDataFile, ref dataIn))
{
moreData = getCustomer(dataIn, cusdata);
while (moreData == true)
{
holdcus = new CustomerClass(cusdata[0], cusdata[1], cusdata[2], cusdata[3], cusdata[4], cusdata[5], cusdata[6], cusdata[7], cusdata[8], cusdata[9], cusdata[10], cusdata[11], cusdata[12]);
customers.Add(holdcus);
int x = Convert.ToInt32(cusdata[12]);
for (int i = 0; i < x; i++) //Takes the ID number for the last customer, as uses it to set the first value of the following accounts
{ //this is done as a key to which accounts map to which customers
moreData = getAccount(dataIn, accdata);
accdata[0] = cusdata[0];
holdacc = new Accounts(accdata[0], accdata[1], accdata[2], accdata[3], accdata[4], accdata[5], accdata[6], accdata[7]);
accounts.Add(holdacc);
}
moreData = getCustomer(dataIn, cusdata);
}
}
if (dataIn != null) dataIn.Close();
Since you're using string arrays, you can just do cusdata[index] = "#" + cusdata[index] to prepend it to the line. However, if your question is how to delete it from the file, why not skip the above step and simply leave out the line you want deleted when writing the file?
Here is a small read/write sample that should suit your needs. If it doesn't, let me know in the comments.
class Program
{
static readonly string filePath = "c:\\test.txt";
static void Main(string[] args)
{
// Read your file
List<string> lines = ReadLines();
//Create your remove logic here ..
lines = lines.Where(x => !x.Contains("Julia Roberts")).ToList();
// Rewrite the file
WriteLines(lines);
}
static List<string> ReadLines()
{
List<string> lines = new List<string>();
using (StreamReader sr = new StreamReader(new FileStream(filePath, FileMode.Open)))
{
while (!sr.EndOfStream)
{
string buffer = sr.ReadLine();
lines.Add(buffer);
// Just to show you the results
Console.WriteLine(buffer);
}
}
return lines;
}
static void WriteLines(List<string> lines)
{
using (StreamWriter sw = new StreamWriter(new FileStream(filePath, FileMode.Create)))
{
foreach (var line in lines)
{
sw.WriteLine(line);
}
}
}
}
I used the following "data sample" for this
Matt Damon 100 222
Julia Roberts 125 152
Robert Downey Jr. 150 402
Tom Hanks 55 932
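If you would rather keep the marker idea from the question than rewrite the file without the record, a minimal sketch could look like this. It assumes one record per line, as in the data sample above, and uses '#' as the marker. RecordFilter, MarkDeleted and ActiveLines are hypothetical names for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecordFilter
{
    // The marker character is an assumption; the question suggests '*' or '#'.
    public const char DeletedMarker = '#';

    // Flags a line as deleted by putting the marker in front of it.
    public static string MarkDeleted(string line)
    {
        return DeletedMarker + line;
    }

    // Returns only the lines that have not been flagged as deleted.
    public static List<string> ActiveLines(IEnumerable<string> lines)
    {
        return lines.Where(l => l.Length == 0 || l[0] != DeletedMarker).ToList();
    }
}
```

Marking a record still means writing the file out again with the changed line, so the work is similar to removing it. The difference is that the data stays in the file and can be restored later. For records that span several lines, as in the question, each line of the record would need the marker, or the reader would need to skip a whole record at a time.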
