Text File - Read and fix delimiter problems - Too slow

I am looking for a bit of advice on ways I can make this function quicker.
The function is designed to run through a delimited text file (with CRLF row ends) and remove any carriage returns or line breaks in between data rows.
E.g. A file of -
A|B|C|D
A|B|C|D
A|B|
C|D
A|B|C|D
Would become -
A|B|C|D
A|B|C|D
A|B|C|D
A|B|C|D
The function seems to work well, but performance is too slow when we start processing large files. For example, 800k rows take 3 seconds, while 130 million rows take over an hour.
The code is -
private void CleanDelimitedFile(string readFilePath, string writeFilePath, string delimiter, string problemFilePath, string rejectsFilePath, int estimateNumberOfRows)
{
    ArrayList rejects = new ArrayList();
    ArrayList problems = new ArrayList();
    int safeSameLengthBreak = 0;
    int numberOfLinesSameLength = 0;
    int lineCount = 0;
    int maxCount = 0;
    string previousLine = string.Empty;
    string currentLine = string.Empty;
    // determine after how many rows with the same number of delimiter chars we can safely
    // say that we have found the expected length of a row (to save reading the full file twice)
    if (estimateNumberOfRows > 100000000)
        safeSameLengthBreak = estimateNumberOfRows / 200; // set the safe check limit as 0.5% of the file (minimum of 500,000)
    else if (estimateNumberOfRows > 10000000)
        safeSameLengthBreak = estimateNumberOfRows / 50; // set the safe check limit as 2% of the file (minimum of 200,000)
    else
        safeSameLengthBreak = 50000; // set the safe check limit as 50,000 (if there are fewer than 50,000 rows this won't be required anyway)
    // open a reader
    using (var reader = new StreamReader(readFilePath))
    {
        // check the file is still being read
        while (!reader.EndOfStream)
        {
            // increment the line count (for debugging)
            lineCount += 1;
            // get the current line
            currentLine = reader.ReadLine();
            // get the number of delimiter chars in the new line
            int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
            // if the number is higher than the previous maximum set the new maximum
            if (maxCount < chars)
            {
                maxCount = chars;
                // the maximum has changed, reset the number of lines in a row with the same delimiter count
                numberOfLinesSameLength = 0;
            }
            else
            {
                // the maximum has not changed, add to the number of lines in a row with the same delimiter count
                numberOfLinesSameLength += 1;
            }
            // is the number of lines parsed in a row with the same number of delimiter chars above the safe limit? If so break the loop
            if (numberOfLinesSameLength > safeSameLengthBreak)
            {
                break;
            }
        }
    }
    // reset the line count
    lineCount = 0;
    // open a writer for the duration of the next read
    using (var writer = new StreamWriter(writeFilePath))
    {
        using (var reader = new StreamReader(readFilePath))
        {
            // check the file is still being read
            while (!reader.EndOfStream)
            {
                // increment the line count (for debugging)
                lineCount += 1;
                // get the current line
                currentLine = reader.ReadLine();
                // get the number of delimiter chars in the new line
                int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
                // check the number of delimiter chars in the line matches the required number
                if (chars == maxCount)
                {
                    // write line
                    writer.WriteLine(currentLine);
                    // clear the previous line variable as this was a valid write
                    previousLine = string.Empty;
                }
                else
                {
                    // add the line to problems
                    problems.Add(currentLine);
                    // append the new line to the previous line
                    previousLine += currentLine;
                    // get the number of delimiter chars in the new appended previous line
                    int newPreviousChars = (previousLine.Length - previousLine.Replace(delimiter, "").Length);
                    // check the number of delimiter chars in the previous appended line matches the required number
                    if (newPreviousChars == maxCount)
                    {
                        // write line
                        writer.WriteLine(previousLine);
                        // clear the previous line as this was a valid write
                        previousLine = string.Empty;
                    }
                    else if (newPreviousChars > maxCount)
                    {
                        // the number of delimiter chars in the new line is higher than the file maximum, add to rejects
                        rejects.Add(previousLine);
                        // clear the previous line and move on
                        previousLine = string.Empty;
                    }
                }
            }
        }
    }
    // rename the original file as _original
    System.IO.File.Move(readFilePath, readFilePath.Replace(".txt", "") + "_Original.txt");
    // rename the new file as the original file name
    System.IO.File.Move(writeFilePath, readFilePath);
    // Write rejects
    using (var rejectWriter = new StreamWriter(rejectsFilePath))
    {
        // loop through the reject array list and write each reject row to the rejects file
        foreach (string reject in rejects)
        {
            rejectWriter.WriteLine(reject);
        }
    }
    // Write problems
    using (var problemWriter = new StreamWriter(problemFilePath))
    {
        // loop through the problem array list and write each problem row to the problem file
        foreach (string problem in problems)
        {
            problemWriter.WriteLine(problem);
        }
    }
}
Any pointers would be greatly appreciated.
Thanks in advance.

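A few ideas:
Use List<string> instead of ArrayList for rejects and problems, and give each an initial capacity close to what you think it will need.
Don't process the file over the network.
Get an SSD, copy the file to it, process it and write the output lines there, then copy the result back.
This does not look like an efficient way to count delimiters:
int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
The call currentLine.Replace(delimiter, "") is wastefully expensive, because it allocates a new string for every line. Count the characters directly instead (this assumes a single-character delimiter, hence delimiter[0] while delimiter is still a string):
int chars = 0;
foreach (char c in currentLine) if (c == delimiter[0]) chars++;
This is not efficient either, because every concatenation copies the whole string:
previousLine += currentLine;
Use a StringBuilder instead.
Allocate the StringBuilder once, outside the loop.
Inside the loop, call .Clear() where the code currently resets previousLine.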
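The counting and StringBuilder suggestions above can be sketched together as follows. This is a minimal sketch, not the asker's full function: the names DelimiterSketch, CountDelimiters and RepairRows are hypothetical, the delimiter is assumed to be a single character, and rows that end up with too many delimiters are simply dropped where the original code records them as rejects.

```csharp
using System.Collections.Generic;
using System.Text;

public static class DelimiterSketch
{
    // Counts a single-character delimiter without allocating a new string.
    public static int CountDelimiters(string line, char delimiter)
    {
        int count = 0;
        foreach (char c in line)
        {
            if (c == delimiter) count++;
        }
        return count;
    }

    // Rejoins rows that were split by stray line breaks.
    // expectedCount is the number of delimiters in a complete row.
    public static List<string> RepairRows(IEnumerable<string> lines, char delimiter, int expectedCount)
    {
        var repaired = new List<string>();
        var pending = new StringBuilder(); // allocated once, reused for every broken row
        int pendingCount = 0;              // delimiters seen so far in the pending fragments

        foreach (string line in lines)
        {
            int count = CountDelimiters(line, delimiter);
            if (count == expectedCount)
            {
                // Complete row: write it and discard any unfinished fragments.
                repaired.Add(line);
                pending.Clear();
                pendingCount = 0;
                continue;
            }

            // Fragment: append it and keep a running delimiter count
            // instead of re-scanning the whole accumulated string.
            pending.Append(line);
            pendingCount += count;

            if (pendingCount >= expectedCount)
            {
                if (pendingCount == expectedCount) repaired.Add(pending.ToString());
                // Too many delimiters: dropped here (the original adds these to rejects).
                pending.Clear();
                pendingCount = 0;
            }
        }
        return repaired;
    }
}
```

Keeping a running count for the pending fragments avoids counting the accumulated string again on every append, which is the same cost the Replace-based count incurs in the original loop.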

Related

Grab text between two lines

I am just learning and have a problem working with files.
I have a method with two inputs: the line number where the range I want begins (lineStart) and the line number where it ends (lineEnd).
I need a method that extracts the lines between these two numbers and writes them to a file.
For example, with lineStart = 20 and lineEnd = 90, the output must be lines 21-89 of the text file.
string[] lines = File.ReadAllLines(@"");
int lineStart = 0;
foreach (string line0 in lines)
{
    lineStart++;
    if (line0.IndexOf("target1") > -1)
    {
        Console.Write(lineStart + "\n");
    }
}
int lineEnd = 0;
foreach (string line1 in lines)
{
    lineEnd++;
    if (line1.IndexOf("target2") > -1)
    {
        Console.Write(lineEnd);
    }
}
// method grabText(lineStart,lineEnd){}

It is just one line of code:
IEnumerable<string> lines = File.ReadLines(@"").Skip(lineStart).Take(lineEnd - lineStart);
Notice also that I use ReadLines and not ReadAllLines. ReadLines doesn't load everything into memory; it returns a lazy sequence rather than an array.
It is not very clear what the boundaries of the lines to take are, but it is easy to adapt the calculation.
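As one possible adaptation, assuming the boundaries are exclusive as in the question's example (lineStart = 20 and lineEnd = 90 should yield lines 21-89), the calculation could look like the sketch below. The class and method names are hypothetical, and the method takes any sequence of lines so it can be fed from File.ReadLines.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LineRange
{
    // Returns the lines strictly between the 1-based line numbers lineStart and lineEnd.
    // Skip(lineStart) drops lines 1..lineStart; Take(lineEnd - lineStart - 1) stops before lineEnd.
    public static IEnumerable<string> LinesBetween(IEnumerable<string> lines, int lineStart, int lineEnd)
    {
        return lines.Skip(lineStart).Take(lineEnd - lineStart - 1);
    }
}
```

Take with a zero or negative count returns an empty sequence, so adjacent or reversed boundaries produce no lines rather than an exception.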

If your text file is huge, don't read it into memory. Don't look for indexes either; just process it line by line:
bool writing = false;
using var sw = File.CreateText(@"C:\some\path\to.txt");
foreach(var line in File.ReadLines(...)){ //don't use ReadAllLines, use ReadLines - it's incremental and burns little memory
    if(!writing && line.Contains("target1")){
        writing = true; //start writing
        continue; //don't write this line
    }
    if(writing){
        if(line.Contains("target2"))
            break; //exit loop without writing this line
        sw.WriteLine(line);
    }
}

Binary search on file with different line length

I have some code which does a binary search over a file with sorted hex values (SHA1 hashes) on each line. This is used to search the HaveIBeenPwned database. The latest version contains a count of the number of times each password hash was found, so some lines have extra characters at the end, in the format ':###'
The length of this additional check isn't fixed, and it isn't always there. This causes the buffer to read incorrect values and fail to find values that actually exist.
Current code:
static bool Check(string asHex, string filename)
{
    const int LINELENGTH = 40; //SHA1 hash length
    var buffer = new byte[LINELENGTH];
    using (var sr = File.OpenRead(filename))
    {
        //Number of lines
        var high = (sr.Length / (LINELENGTH + 2)) - 1;
        var low = 0L;
        while (low <= high)
        {
            var middle = (low + high + 1) / 2;
            sr.Seek((LINELENGTH + 2) * ((long)middle), SeekOrigin.Begin);
            sr.Read(buffer, 0, LINELENGTH);
            var readLine = Encoding.ASCII.GetString(buffer);
            switch (readLine.CompareTo(asHex))
            {
                case 0:
                    return true;
                case 1:
                    high = middle - 1;
                    break;
                case -1:
                    low = middle + 1;
                    break;
                default:
                    break;
            }
        }
    }
    return false;
}
My idea is to seek forward from the middle until a newline character is found, then seek backwards for the same point, which should give me a complete line which I can split by the ':' delimiter. I then compare the first part of the split string array which should be just a SHA1 hash.
I think this should still centre on the correct value, however I am wondering if there is a neater way to do this? If the midpoint isn't that actual midpoint between the end of line characters, should it be adjusted before the high and low values are?

I think this may be a simpler (and faster) solution that avoids backtracking to the beginning of the line. You can use byte offsets into the file instead of trying to work with whole records/lines. Because the middle offset will not always fall at the start of a line, the first ReadLine can return a partial line. If you immediately do a second ReadLine, you get a full line. It isn't quite optimal, because you are actually comparing a little ahead of the middle offset.
I downloaded pwned-passwords-update-1 and pulled out about 30 records from the start, the end, and the middle, and it seemed to find them all. What do you think?
const int HASHLENGTH = 40;
static bool Check(string asHex, string filename)
{
    using (var fs = File.OpenRead(filename))
    {
        var low = 0L;
        // We don't need to start at the very end
        var high = fs.Length - (HASHLENGTH - 1); // EOF - 1 HASHLENGTH
        StreamReader sr = new StreamReader(fs);
        while (low <= high)
        {
            var middle = (low + high + 1) / 2;
            fs.Seek(middle, SeekOrigin.Begin);
            // Resync with base stream after seek
            sr.DiscardBufferedData();
            var readLine = sr.ReadLine();
            // 1) If we are NOT at the beginning of the file, we may have only read a partial line, so
            //    read again to make sure we get a full line.
            // 2) No sense reading again if we are at the EOF
            if ((middle > 0) && (!sr.EndOfStream)) readLine = sr.ReadLine() ?? "";
            string[] parts = readLine.Split(':');
            string hash = parts[0];
            // By default, string comparison is culture-sensitive, which may not be what we want.
            // Do an ordinal compare (0-9 < A-Z < a-z)
            int compare = String.Compare(asHex, hash, StringComparison.Ordinal);
            if (compare < 0)
            {
                high = middle - 1;
            }
            else if (compare > 0)
            {
                low = middle + 1;
            }
            else
            {
                return true;
            }
        }
    }
    return false;
}

My way of solving your problem was to create a new binary file containing only the hashes. A SHA-1 hash is 20 bytes, so every record has the same length and the binary search is faster. (I don't have the 50 reputation needed to post this as a comment.)
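A minimal sketch of that idea follows. It assumes the hashes-only file stores full 20-byte SHA-1 digests back to back, sorted in ascending byte order (which matches the ordinal order of the uppercase hex text). The class name BinaryHashSearch and its members are hypothetical, and building the binary file from the text file is not shown.

```csharp
using System;
using System.IO;

public static class BinaryHashSearch
{
    public const int HashSize = 20; // a SHA-1 digest is 160 bits

    // Binary search over a seekable stream of sorted, fixed-length hash records.
    public static bool Contains(Stream sortedHashes, byte[] target)
    {
        if (target.Length != HashSize)
            throw new ArgumentException("Expected a 20-byte SHA-1 digest.", nameof(target));

        var buffer = new byte[HashSize];
        long low = 0;
        long high = sortedHashes.Length / HashSize - 1; // index of the last record

        while (low <= high)
        {
            long middle = low + (high - low) / 2;
            // Every record starts at a multiple of HashSize, so no resync is needed.
            sortedHashes.Seek(middle * HashSize, SeekOrigin.Begin);
            ReadFully(sortedHashes, buffer);

            int cmp = Compare(buffer, target);
            if (cmp == 0) return true;
            if (cmp < 0) low = middle + 1;
            else high = middle - 1;
        }
        return false;
    }

    // Stream.Read may return fewer bytes than requested, so loop until the record is complete.
    private static void ReadFully(Stream stream, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = stream.Read(buffer, offset, buffer.Length - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
    }

    // Lexicographic comparison of two equal-length byte arrays.
    private static int Compare(byte[] a, byte[] b)
    {
        for (int i = 0; i < a.Length; i++)
        {
            int diff = a[i].CompareTo(b[i]);
            if (diff != 0) return diff;
        }
        return 0;
    }
}
```

Because every seek lands exactly on a record boundary, there are no partial lines to skip, which is what makes this layout simpler than searching the variable-length text file. The occurrence counts from the ':###' suffix are not kept in this layout.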

editing one column in the Csv File

I am trying to edit only one column in my CSV file, but the code does not seem to affect the file. The change I am trying to make is to separate the data in the 4th column with a comma.
class Program
{
    static void Main(string[] args)
    {
        var filePath = Path.Combine(Directory.GetCurrentDirectory(), "kaviaReport 02_08_2016.csv");
        var fileContents = ReadFile(filePath);
        foreach (var line in fileContents)
        {
            Console.WriteLine(line);
        }
        Console.WriteLine("Press any key to exit...");
        Console.ReadKey();
    }
    public static IList<string> ReadFile(string fileName)
    {
        var results = new List<string>();
        int lineCounter = 0;
        string currentLine = string.Empty;
        var target = File
            .ReadAllLines(fileName);
        while ((currentLine = fileName) != null)//while there are lines to read
        {
            List<string> fielded = new List<string>(currentLine.Split(','));
            if (lineCounter != 0)
            {
                //If it's not the first line
                var lineElements = currentLine.Split(',');//split your fields into an array
                var replace = target[4].Replace(' ', ',');//replace the space in position 4(field 5) of your array
                results.Add(replace);
                //target.WriteAllLines(string.Join(",", fielded));//write the line in the new file
            }
            lineCounter++;
            File.WriteAllLines(fileName, target);
        }
        return results;
    }
}

The current code has some errors.
The biggest one is the assignment of fileName to currentLine. This is meaningless if you want to loop over the lines, so you need a foreach over the lines you have read.
Then, inside the loop, you should use the variable lineElements to get the fifth column after splitting currentLine.
Finally, the rewrite of the file belongs outside the loop and should use the results list.
// Loop, but skip the first line....
foreach(string currentLine in target.Skip(1))
{
    // split your line into an array of strings
    var lineElements = currentLine.Split(',');
    // Replace spaces with commas on the fifth column of lineElements
    var replace = lineElements[4].Replace(' ', ',');
    // Add the changed line to the result list
    results.Add(replace);
}
// move the write of your changes outside the foreach loop
File.WriteAllLines(fileName, results.ToArray());
Something occurred to me while writing this code. It is not clear whether you want to rewrite the CSV file with only the data from the fifth column expanded with commas, or whether you want to rewrite the entire line (columns 0, 1, 2, 3, 4, etc.). In the latter case you need different code:
// Replace spaces with commas on the fifth column of lineElements
// And reassign the result to the same fifth column
lineElements[4] = lineElements[4].Replace(' ', ',');
// Add the changed line to the result list putting the comma
// between the array of strings lineElements
results.Add(string.Join(",", lineElements));

while ((currentLine = fileName) != null) assigns fileName to currentLine, so the condition is always true and the loop never ends.
I would write it as a for loop instead of a while:
public static IList<string> ReadFile(string fileName)
{
    var target = File.ReadAllLines(fileName).ToList();
    // i = 1 (skip first line)
    for (int i = 1; i < target.Count; i++)
    {
        var fields = target[i].Split(',');
        fields[4] = fields[4].Replace(' ', ','); //replace the space in position 4 (field 5)
        target[i] = string.Join(",", fields);
    }
    File.WriteAllLines(fileName, target);
    // Uncomment the RemoveAt(0) to remove first line
    // target.RemoveAt(0);
    return target;
}

How to edit and replace a group of lines of a text file using c#?

I have a large text file (20MB), and I'm trying to change every 4th and 5th line to 0,0.
I've tried the following code, but I would be interested to know if there's a better way of doing it.
EDIT:
Power = new List<float>();
Time = new List<float>();
string line;
float _i =0.0f;
float _q =0.0f;
int counter = 0;
StreamReader file = new StreamReader(iqFile2Open);
while ((line = file.ReadLine()) != null)
{
    if (Regex.Matches(line, @"[a-zA-Z]").Count == 0)
    {
        string[] IQ = line.Split(',');
        if (IQ.Length == 2)
        {
            _i = float.Parse(IQ[0]);
            _q = float.Parse(IQ[1]);
            double _p = 10 * (Math.Log10((_i * _i) + (_q * _q)));
            if((counter%4)==0 || (counter%5)==0)
                sw.WriteLine("0,0");
            else
                sw.WriteLine(string.Format("{0},{1}", _i, _q));
            counter++;
        }
    }
}
Thanks in advance!

You can read in all of the lines, map each line to what it should be based on its position, and then write them all out:
var lines = File.ReadLines(inputFile)
    .Select((line, i) => ComputeLine(line, i + 1));
File.WriteAllLines(outputFile, lines);
As for the actual mapping, you can mod the line number by 5 to get an "every 5th item" result, and then just compare the result to the two mod values you care about. Note that since you don't want the first item wiped out it's important that the index is 1-indexed, not zero indexed.
private static string ComputeLine(string line, int i)
{
    if (i % 5 == 4 || i % 5 == 0)
        return "0,0";
    else
        return line;
}
This streams through each line in the file, rather than loading the entire file into memory. Because of this it's important that the input and output files be different. You can copy the output file to the input file if needed, or you could instead use ReadAllLines to bring the entire file into memory (assuming the file stays suitably small) thus allowing you to write to the same file you read from.

What exactly are you trying to replace? Are you replacing by specific LINE or by specific TEXT?
If you are looking to replace specific text, you can simply use the string.Replace() method:
StreamReader fileIn = new StreamReader("somefile");
string fileText = fileIn.ReadToEnd();
fileText = fileText.Replace("old", "new");
//Repeat last line for all old strings.
//write file...

Trying to get the largest int in each line of a file and sum the result

When I try and run my code I get the error:
Input string was not in a correct format.
I am trying to find the largest int in each line of a text file and then add them all up.
I am sure that there are no letters in this file and everything is separated by a space.
Here is my code:
int counter = 0;
string line;
List<int> col = new List<int>();
// Read the file and display it line by line.
System.IO.StreamReader file =
    new System.IO.StreamReader(label3.Text);
while ((line = file.ReadLine()) != null)
{
    int[] storage = new int[10000];
    Console.WriteLine(line);
    counter++;
    string s = line;
    string[] words = s.Split(' ');
    for (int i = 0; i < words.Length; i++)
    {
        storage[i] = Convert.ToInt32(words[i]);
    }
    int large = storage.Max();
    col.Add(large);
    Console.WriteLine(" ");
    foreach (int iii in col)
    {
        Console.WriteLine(iii);
    }
    int total = col.Sum();
    Console.WriteLine(total);
}
file.Close();
// Suspend the screen.
Console.ReadLine();

It's possible that the target string cannot be stored in a 32-bit integer. You can try parsing to a 64-bit type such as long or ulong. Take a look at the Integral Types Table and the Floating-Point Types Table.
Instead of using Convert.ToInt32(), try int.TryParse(). It returns a bool telling you whether the operation succeeded, and it has an out parameter where it places the result of the parse. TryParse is also available on the other numeric types if you decide you need them.
E.g.
int val;
string strVal = "1000";
if (int.TryParse(strVal, out val))
{
    // do something with val
}
else
{
    // report error or skip
}
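Applied to the question, the TryParse approach might look like the sketch below. It is only a sketch under stated assumptions: the class and method names are hypothetical, values are parsed as long to leave room for large numbers, empty tokens are skipped (which also covers trailing spaces and stray carriage returns), and tokens that fail to parse are silently ignored where real code would probably report them.

```csharp
using System;
using System.Collections.Generic;

public static class LineMaxSum
{
    // Sums the largest integer found on each line; lines without any valid integer add nothing.
    public static long SumOfLineMaxima(IEnumerable<string> lines)
    {
        long total = 0;
        foreach (string line in lines)
        {
            long? max = null;
            // Split on whitespace and drop empty entries so "" never reaches the parser.
            string[] tokens = line.Split(new[] { ' ', '\t', '\r' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string token in tokens)
            {
                long value;
                if (long.TryParse(token, out value) && (max == null || value > max.Value))
                {
                    max = value;
                }
            }
            if (max.HasValue) total += max.Value;
        }
        return total;
    }
}
```

Tracking the maximum in a nullable value instead of a fixed-size, zero-filled array also keeps the result correct for lines that contain only negative numbers.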

I did a quick test and it is likely you get the error in the line
storage[i] = Convert.ToInt32(words[i]);
If so, make sure what you are trying to convert is an integer and not an empty string, for example.

I believe that the line in your code that can cause this error is
Convert.ToInt32(words[i]);
When you run this application in debug mode in Visual Studio (which you probably are), you can check what is going on in your program when the exception happens.
At the very bottom of the screen there are some tabs. They include your error list, among other things. The ones I like to use are called "Locals" and "Watch". You can use the Locals tab here.
When you click on the Locals tab, you should see a tree of all the local variables in your program. If you expand the words variable, you should see the individual members of the array. You should also be able to see the variable i. Check the i-th member of your words array and make sure that it is an integer and not something else.

You're either converting a value that is out of range, or attempting to parse a carriage return ('\r').
Make sure you're trimming your input.
My solution:
static void Main(string[] args)
{
    int linecount = 100;
    string path = @"C:\test\test.txt";
    Random rand = new Random();
    //Create File
    StreamWriter writer = new StreamWriter(path, false);
    for (int i = 0; i < linecount; i++)
    {
        for (int j = 0; j < rand.Next(10, 15); j++)
        {
            writer.Write(rand.Next() + " ");
        }
        writer.WriteLine("");
    }
    writer.Close();
    //Sum File (File.ReadAllText closes the file when it is done)
    long sum = Enumerable.Sum<string>(
        File.ReadAllText(path).Split(new char[] { '\n' }, StringSplitOptions.RemoveEmptyEntries),
        l => Enumerable.Max(
            l.Split(' '),
            i => String.IsNullOrEmpty(i.Trim()) ? 0 : long.Parse(i.Trim())
        )
    );
}
