Efficiently replace unwanted separator from csv field

As input, I have a CSV file of approximately 1,000,000 lines (about 300 MB) whose columns are separated by semicolons:
colA;colB;colC;colD;colE
aaaa;bbbb;cccc;dddd;"eeee;"
aaaa;bbbb;cccc;dddd;evfvdfeee
aaaa;bb1bb;cc2cc;dd3dd;evfve
Some fields in the 5th column may contain a semicolon, though not all of them do. I want to remove every semicolon after the 4th occurrence (the code below replaces each one with a period). The code works, but it takes ages (approx. 10 min) to save the CSV to the file. How can I speed this up?
void Main()
{
string finput = @"myfile.csv";
FileInfo finfo = new FileInfo(finput);
var Lines = File.ReadLines(finfo.FullName);
List<string> output = new List<string>();
Stopwatch sw = new Stopwatch();
sw.Start();
int count = 0;
foreach (string s in Lines)
{
count++;
if (count % 10000 == 0)
count.Dump();
output.Add((StringExtender.ReplaceAfterNthOccurrency(s, ";", ".", 5)));
}
sw.Elapsed.Dump();
File.WriteAllLines(finfo.DirectoryName + finfo.Name + "_conv.csv", output);
}
public static class StringExtender
{
public static string ReplaceAfterNthOccurrency(string input, string to_replace, string to_add, int n)
{
var cont = true;
int count = 0;
int start = 0;
while (cont)
{
int i = input.IndexOf(to_replace, start);
if (i != -1)
{
count++;
start = i + 1;
if (count >= n)
{
input = input.Remove(i, 1);
input = input.Insert(i, to_add);
}
}
else
cont = false;
}
return input;
}
}
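
One possible direction, offered as a sketch rather than a measured fix: the method above allocates two new strings (Remove, then Insert) for every replaced separator, and all output lines are held in a list before anything is written. Assuming both the separator and its replacement are single characters, the replacement can be done in one pass over a char array, and each line can be written as soon as it is converted. The names SeparatorFixer, ReplaceFromNthOccurrence and ConvertFile are hypothetical and not part of the original code.

```csharp
using System;
using System.IO;

public static class SeparatorFixer
{
    // Replaces every occurrence of `separator` from the n-th one onward
    // (1-based) with `replacement`, in a single pass over the line.
    public static string ReplaceFromNthOccurrence(string input, char separator, char replacement, int n)
    {
        char[] chars = input.ToCharArray();
        int seen = 0;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] == separator && ++seen >= n)
            {
                chars[i] = replacement;
            }
        }
        return new string(chars);
    }

    // Streams the input line by line and writes each converted line
    // immediately, so the whole output is never held in memory.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(ReplaceFromNthOccurrence(line, ';', '.', 5));
            }
        }
    }
}
```

With n = 5 this matches the call in the question: the first four semicolons are kept and any later ones become periods. Whether this removes the 10-minute delay depends on where the time is actually spent, so it is worth timing on the real file.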

Related

split a txt file into multiple files with the number of lines in each file being able to be set by a user

I'm writing something in C# and need a way to split a text file into several files, each containing a number of lines given by the user.
Example: file a has 1000 lines in it, and I want the code to ask the user for a number and then use that number to create more files, like this:
a = 1000 lines .
Then after the code has run with the input of 300
a = 300 lines
b = 300 lines
c = 300 lines
d = 300 lines
e = 300 lines
Repeat that until the original file has been split into files of 300 lines each.
This is what I have so far:
var file = File.ReadAllLines(ofd.FileName);
Console.Write("> ");
int userlinestosplit = int.Parse(Console.ReadLine());
ArrayList fileA = new ArrayList();
for (int i = 0; i < userlinestosplit; i++)
{
string line = file[i];
fileA.Add(line);
}
int linesleft = file.Length - userlinestosplit;
ArrayList fileB = new ArrayList();
for (int i = linesleft; i < file.Length; i++)
{
string line = file[i];
fileB.Add(line);
}
string[] fileAArr = (string[])fileA.ToArray(typeof(string));
string[] fileBArr = (string[])fileB.ToArray(typeof(string));
string resdir = "results";
string modir = "splited";
Directory.CreateDirectory(resdir);
Directory.SetCurrentDirectory(resdir);
Directory.CreateDirectory(modir);
Directory.SetCurrentDirectory(modir);
File.WriteAllLines("FA.txt", fileAArr);
File.WriteAllLines("FB.txt", fileBArr);
Console.ReadKey();
Any help would be greatly appreciated

Here's a way to do it using streams. This has the benefit of not needing to read it all into memory at once, allowing it to work on very large files.
Console.Write("> ");
var maxLines = int.Parse(Console.ReadLine());
var filename = ofd.FileName;
var nameBase = filename[0..^4]; // strip ".txt"
var parts = 0;
// using blocks dispose the reader and each writer even if an exception is thrown
using (var readStream = new StreamReader(filename))
{
while (!readStream.EndOfStream)
{
parts++;
// each part holds at most maxLines lines
using (var writer = new StreamWriter($"{nameBase}-{parts}.txt"))
{
for (int i = 0; i < maxLines && !readStream.EndOfStream; i++)
{
writer.WriteLine(readStream.ReadLine());
}
}
}
}
Console.WriteLine($"Done splitting the file into {parts} parts.");

Splitting text file to multiple parts
public void SplitFile(string inputFile, int size, string path)
{
int index = 0;
string s = string.Empty;
using (StreamReader sr = File.OpenText(inputFile))
{
while (true)
{
if (sr.EndOfStream) break;
using (StreamWriter output = new StreamWriter($"{path}\\part{index}.txt", false, Encoding.UTF8))
{
int linesRead = 0;
// Check the line count first so that no line is read and then discarded
while (linesRead < size && (s = sr.ReadLine()) != null)
{
output.WriteLine(s);
linesRead++;
}
}
index++;
}
}
}
How to use:
var inputFile = "test.txt";
int size =300;
SplitFile(inputFile, size, "c:\\data");

Read a text file and write the texts in chunks of 1000 characters keeping the words intact in C#

I am trying to read a file and split the text after every 1000 characters. But I want to keep the words intact. So it should just split at the space. If the 1000th character is not a space, then split at the first space just before or just after it. Any idea how to do that? I am also removing the extra spaces from the text.
while ((line = file.ReadLine()) != null)
{
text = text + line.Trim();
noSpaceText = Regex.Replace(text, @"\r\n?|\n/", "");
}
List<string> rowsToInsert = new List<string>();
int splitAt = 1000;
for (int i = 0; i < noSpaceText.Length; i = i + splitAt)
{
if (noSpaceText.Length - i >= splitAt)
{
rowsToInsert.Add(noSpaceText.Substring(i, splitAt));
}
else
rowsToInsert.Add(noSpaceText.Substring(i,
((noSpaceText.Length - i))));
}
foreach(var item in rowsToInsert)
{
Console.WriteLine(item);
}

Okay, I just typed up this untested solution, which should do the trick:
public static List<string> SplitOn(this string input, int charLength, char[] seperator)
{
List<string> splits = new List<string>();
var tokens = input.Split(seperator);
// -1 because first token adds 1 to length
int totalLength = -1;
List<string> segments = new List<string>();
foreach(var t in tokens)
{
if(totalLength + t.Length+1 > charLength)
{
splits.Add(String.Join(" ", segments));
totalLength = -1;
segments.Clear();
}
totalLength += t.Length + 1;
segments.Add(t);
}
if(segments.Count>0)
{
splits.Add(String.Join(" ", segments));
}
return splits;
}
It's an extension method that splits the input text on the separator characters, so it iterates over an array of individual words. It keeps a running total of the current segment's length and, when the next word would push it past charLength, joins the collected words and adds them to the result list.

An alternate solution:
public static List<string> SplitString(string stringInput, int blockLength)
{
var output = new List<string>();
var count = 0;
while(count < stringInput.Length)
{
string block = "";
// >= rather than >: otherwise Substring(count, blockLength + 1) runs past the end of the string
if(count + blockLength >= stringInput.Length)
{
block = stringInput.Substring(count, stringInput.Length - count);
}
else
{
block = stringInput.Substring(count, blockLength + 1);
}
if(block.Length < blockLength)
{
output.Add(block);
count += block.Length;
}
else if(block.EndsWith(" "))
{
output.Add(block);
count = count+blockLength + 1;
}
else
{
output.Add(block.Substring(0, block.LastIndexOf(" ")));
count = count + block.LastIndexOf(" ") +1;
}
}
return output;
}
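
Both versions above assume that every block contains at least one space. As a hedged sketch of one way to cover the remaining case, the variant below searches backwards for the last space inside each window and falls back to a hard cut when there is none. TextChunker and SplitAtSpaces are hypothetical names, and the hard-cut fallback is an assumption about the desired behaviour, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Splits text into chunks of at most maxLength characters,
    // breaking at the last space inside each window when one exists.
    public static List<string> SplitAtSpaces(string text, int maxLength)
    {
        var chunks = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int remaining = text.Length - start;
            if (remaining <= maxLength)
            {
                // The rest fits into one chunk.
                chunks.Add(text.Substring(start));
                break;
            }
            // Search backwards from start + maxLength down to start.
            int cut = text.LastIndexOf(' ', start + maxLength, maxLength + 1);
            if (cut <= start)
            {
                // No usable space in the window: cut hard at maxLength.
                chunks.Add(text.Substring(start, maxLength));
                start += maxLength;
            }
            else
            {
                chunks.Add(text.Substring(start, cut - start));
                start = cut + 1; // skip the space itself
            }
        }
        return chunks;
    }
}
```

For the question's case the call would be TextChunker.SplitAtSpaces(noSpaceText, 1000). This sketch only ever splits at the space before the limit, never after it, so no chunk exceeds maxLength.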

Reading the multiline text box in Windows AS it is Entered by user

I have a multiline text box with a fixed width.
When I type long text continuously, it wraps in the text box, but when I read it in the code-behind it comes back as a single line.
I want to print this data with ComponentOne (C1PrintDocument) exactly as the user entered it, but everything appears on one line and gets truncated.
I am dealing with Japanese as well as English text.
Below is a sample of the Japanese text.
Text Example: "れはれはれはれはれはれはれ1はれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれははれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれ6はれはれれれれれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれれれれれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれはれれれれれ7はれはれはれはれはれはれは" .
I want to detect, internally, the line breaks as they appear when the user types.
I tried using Environment.NewLine, but it doesn't work because the text is continuous.
Please help me read the text as it appears in the Windows text box so that I can print it.

The algorithm I wrote to solve this problem is below:
/// <summary>
/// Takes a large text and splits it into lines according to the maximum width allowed per line.
/// </summary>
/// <param name="strData">String data</param>
/// <param name="intMaxSize">Maximum size of a row</param>
/// <returns>The text split into rows</returns>
public static List<string> GetLines(string strData, int intMaxSize)
{
List<string> strReturn = new List<string>();
string[] strLineSplited = strData.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
for (int intNoOfLine = 0; intNoOfLine < strLineSplited.Length; intNoOfLine++)
{
List<KeyValuePair<char, int>> listOfChar = new List<KeyValuePair<char, int>>();
string strValue = strLineSplited[intNoOfLine];
StringBuilder sbSingleLine = new StringBuilder();
if (!string.IsNullOrEmpty(strValue))
{
char[] charArr = strValue.ToCharArray();
int intCharCount = 0;
byte[] bascii = Encoding.ASCII.GetBytes(strValue);
for (int intCounter = 0; intCounter < bascii.Length; intCounter++)
{
int intASCIICode = bascii[intCounter];
if (intASCIICode >= 32 && intASCIICode <= 62)
{
listOfChar.Add(new KeyValuePair<char, int>(charArr[intCounter], 1));
}
else if (intASCIICode >= 65 && intASCIICode <= 122)
{
listOfChar.Add(new KeyValuePair<char, int>(charArr[intCounter], 1));
}
else
{
listOfChar.Add(new KeyValuePair<char, int>(charArr[intCounter], 2));
}
}
bool bFlag = true;
foreach (var charValue in listOfChar)
{
intCharCount += Convert.ToInt32(charValue.Value);
if (intCharCount < intMaxSize)
{
sbSingleLine.Append(charValue.Key);
bFlag = true;
}
else
{
sbSingleLine.Append(charValue.Key);
strReturn.Add(sbSingleLine.ToString());
sbSingleLine.Length = 0;
sbSingleLine.Capacity = 0;
intCharCount = 0;
bFlag = false;
}
}
if (intCharCount < intMaxSize && bFlag)
{
strReturn.Add(sbSingleLine.ToString());
}
}
else
{
strReturn.Add(sbSingleLine.ToString());
}
}
return strReturn;
}

Replace the start of line in a file quickly

I have an initial file containing lines such as:
34 964:0.049759 1123:0.0031 2507:0.015979
32,48 524:0.061167 833:0.030133 1123:0.002549
34,52 534:0.07349 698:0.141667 1123:0.004403
106 389:0.013396 417:0.016276 534:0.023859
The first part of a line is the class number. A line can have several classes.
For each class, I create a new file.
For instance for class 34 the resulting file will be :
+1 964:0.049759 1123:0.0031 2507:0.015979
-1 524:0.061167 833:0.030133 1123:0.002549
+1 534:0.07349 698:0.141667 1123:0.004403
-1 389:0.013396 417:0.016276 534:0.023859
For class 106 the resulting file will be :
-1 964:0.049759 1123:0.0031 2507:0.015979
-1 524:0.061167 833:0.030133 1123:0.002549
-1 534:0.07349 698:0.141667 1123:0.004403
+1 389:0.013396 417:0.016276 534:0.023859
The problem is that I have 13 files to process for 200 classes.
I already ran a less optimized version of my code and it took several hours.
With the code below it takes 1 hour to generate the 2600 files.
Is there a faster way to perform such a replacement? Is regex a viable option?
Below is my implementation (it works in LINQPad with this data file):
static void Main()
{
const string filePath = @"C:\data.txt";
const string generatedFilesFolderPath = @"C:\";
const string fileName = "data";
using (new TimeIt("Whole process"))
{
var fileLines = File.ReadLines(filePath).Select(l => l.Split(new[] { ' ' }, 2)).ToList();
var classValues = GetClassValues();
foreach (var classValue in classValues)
{
var directoryPath = Path.Combine(generatedFilesFolderPath, classValue);
if (!Directory.Exists(directoryPath))
Directory.CreateDirectory(directoryPath);
var classFilePath = Path.Combine(directoryPath, fileName);
using (var file = new StreamWriter(classFilePath))
{
foreach (var line in fileLines)
{
var lineFirstPart = line.First();
string newFirstPart = "-1";
var hashset = new HashSet<string>(lineFirstPart.Split(','));
if (hashset.Contains(classValue))
{
newFirstPart = "+1";
}
file.WriteLine("{0} {1}", newFirstPart, line.Last());
}
}
}
}
Console.Read();
}
public static List<string> GetClassValues()
{
// In real life there are 200 class values.
return Enumerable.Range(0, 2).Select(c => c.ToString()).ToList();
}
public class TimeIt : IDisposable
{
private readonly string _name;
private readonly Stopwatch _watch;
public TimeIt(string name)
{
_name = name;
_watch = Stopwatch.StartNew();
}
public void Dispose()
{
_watch.Stop();
Console.WriteLine("{0} took {1}", _name, _watch.Elapsed);
}
}
The output:
Whole process took 00:00:00.1175102
EDIT: I also ran a profiler and it looks like the split method is the hottest spot.
EDIT 2: Simple example:
2,1 1:0.8 2:0.2
3 1:0.4 3:0.6
12 1:0.02 4:0.88 5:0.1
Expected output for class 2:
+1 1:0.8 2:0.2
-1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1
Expected output for class 3:
-1 1:0.8 2:0.2
+1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1
Expected output for class 4:
-1 1:0.8 2:0.2
-1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1

I have eliminated the hottest paths from your code by removing the Split call and using a bigger buffer on the FileStream.
Instead of calling Split, I now call ToCharArray and parse the leading chars up to the first space, matching them against classValue char by char along the way. The boolean found indicates an exact match for one of the comma-separated values before the first space. The rest of the handling is the same.
var fsw = new FileStream(classFilePath,
FileMode.Create,
FileAccess.Write,
FileShare.None,
64*1024*1024); // use a large buffer
using (var file = new StreamWriter(fsw)) // use the filestream
{
foreach(var line in fileLines) // for( int i = 0;i < fileLines.Length;i++)
{
char[] chars = line.ToCharArray();
int matched = 0;
int parsePos = -1;
bool takeClass = true;
bool found = false;
bool space = false;
// parse until space
while (parsePos<chars.Length && !space )
{
parsePos++;
space = chars[parsePos] == ' '; // end
// tokens
if (chars[parsePos] == ' ' ||
chars[parsePos] == ',')
{
if (takeClass
&& matched == classValue.Length)
{
found = true;
takeClass = false;
}
else
{
// reset matching
takeClass = true;
matched = 0;
}
}
else
{
if (takeClass
&& matched < classValue.Length
&& chars[parsePos] == classValue[matched])
{
matched++; // on the next iteration, match next
}
else
{
takeClass = false; // no match!
}
}
}
chars[parsePos - 1] = '1'; // replace 1 in front of space
var correction = 1;
if (parsePos > 1)
{
// is classValue before the comma (or before space)
if (found)
{
chars[parsePos - 2] = '+';
}
else
{
chars[parsePos - 2] = '-';
}
correction++;
}
else
{
// is classValue before the comma (or before space)
if (found)
{
// not enough space in the array, write a single char
file.Write('+');
}
else
{
file.Write('-');
}
}
file.WriteLine(chars, parsePos - correction, chars.Length - (parsePos - correction));
}
}

Instead of iterating over the unparsed lines 200 times, how about parsing the lines up front into a data structure and then iterating over that 200 times? This should minimize the number of string manipulation operations.
It also streams the file through a StreamReader, so only the parsed Detail[] is kept in memory rather than both the raw lines and their parsed form.
static void Main(string[] args)
{
var details = ReadDetail("data.txt").ToArray();
var classValues = Enumerable.Range(0, 10).ToArray();
foreach (var classValue in classValues)
{
// Create file/directory etc
using (var file = new StreamWriter("out.txt"))
{
foreach (var detail in details)
{
file.WriteLine("{0} {1}", detail.Classes.Contains(classValue) ? "+1" : "-1", detail.Line);
}
}
}
}
static IEnumerable<Detail> ReadDetail(string filePath)
{
using (StreamReader reader = new StreamReader(filePath))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
int separator = line.IndexOf(' ');
Detail detail = new Detail
{
Classes = line.Substring(0, separator).Split(',').Select(c => Int32.Parse(c)).ToArray(),
Line = line.Substring(separator + 1)
};
yield return detail;
}
}
}
public class Detail
{
public int[] Classes { get; set; }
public string Line { get; set; }
}

Reading lines from a text file, converting them, then writing back to new file

I have some basic knowledge of C#, but I am having trouble coding something that seems simple in concept. I want to read a file (.asm) containing values such as
#1
#12
#96
#2
#46
etc.
on multiple consecutive lines. I then want to get rid of the # symbols (if they are present), convert the remaining number values to binary, then write these binary values back to a new file (.hack) on their own lines. There isn't a set limit on the number of lines, which is my biggest issue as I don't know how to check for lines dynamically. So far I can only read and convert lines if I code to look for them, then I can't figure out how to write these values on their own lines in the new file. Sorry if this sounds a bit convoluted, but any help would be appreciated. Thanks!
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
var line = File.ReadAllText(openFileDialog1.FileName);
using (StreamWriter sw = File.CreateText("testCode.hack"))
{
var str = line;
var charsToRemove = new string[] {"#"};
foreach (var c in charsToRemove)
{
str = str.Replace(c, string.Empty);
}
int value = Convert.ToInt32(str);
string value2 = Convert.ToString(value, 2);
if (value2.Length < 16)
{
int zeroes = 16 - value2.Length;
if(zeroes == 12)
{
sw.WriteLine("000000000000" + value2);
}
}
else
{
sw.WriteLine(value2);
}
}

This code should help you get going real fast:
static void Main(string[] args)
{
string line = string.Empty;
System.IO.StreamReader reader = new System.IO.StreamReader(@"C:\test.txt");
System.IO.StreamWriter writer = new System.IO.StreamWriter(@"C:\test.hack");
while ((line = reader.ReadLine()) != null) // Read until there is nothing more to read
{
if (line.StartsWith("#"))
{
line = line.Remove(0, 1); // Remove '#'
}
int value = -1;
if (Int32.TryParse(line, out value)) // Check if the rest string is an integer
{
// Convert the rest string to its binary representation and write it to the file
writer.WriteLine(intToBinary(value));
}
else
{
// Couldn't convert the string to an integer..
}
}
reader.Close();
writer.Close();
Console.WriteLine("Done!");
Console.Read();
}
//http://www.dotnetperls.com/binary-representation
static string intToBinary(int n)
{
char[] b = new char[32];
int pos = 31;
int i = 0;
while (i < 32)
{
if ((n & (1 << i)) != 0)
{
b[pos] = '1';
}
else
{
b[pos] = '0';
}
pos--;
i++;
}
return new string(b);
}

My suggestion is to use a List<string>. Here are the steps:
Read the input (.asm) file into the list.
Open a StreamWriter for the output (.hack) file.
Loop through the List<string>, convert each string, and write it to the file.
Code example:
List<string> lstInput = new List<string>();
using (StreamReader reader = new StreamReader(@"input.asm"))
{
string sLine = string.Empty;
//read one line at a time
while ((sLine = reader.ReadLine()) != null)
{
lstInput.Add(sLine);
}
}
using (StreamWriter writer = new StreamWriter(@"output.hack"))
{
foreach(string sFullLine in lstInput)
{
string sNumber = sFullLine;
//remove leading # sign
if(sFullLine.StartsWith("#"))
sNumber = sFullLine.Substring(1);
int iNumber;
if(int.TryParse(sNumber, out iNumber))
{
writer.WriteLine(IntToBinaryString(iNumber));
}
}
}
public string IntToBinaryString(int number)
{
const int mask = 1;
// Zero never enters the loop below, so handle it explicitly
if (number == 0)
return "0";
var binary = string.Empty;
while(number > 0)
{
// Bitwise AND extracts the lowest bit, which is prepended to the result string
binary = (number & mask) + binary;
number = number >> 1;
}
return binary;
}
Reference: IntToBinaryString method.
NOTE: The int-to-binary-string method in @TheDutchMan's answer is a better choice.
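
Neither method above produces the 16-digit output the question pads towards. As a minimal sketch, assuming non-negative values that fit in 16 bits, the built-in Convert.ToString(value, 2) already used in the question can be combined with PadLeft. HackConverter and ToBinary16 are hypothetical names introduced only for this illustration.

```csharp
using System;

public static class HackConverter
{
    // Converts a line such as "#12" to a 16-character binary string.
    // Returns null when the line does not hold a non-negative integer.
    public static string ToBinary16(string line)
    {
        // Drop surrounding whitespace and any leading '#' characters.
        string digits = line.Trim().TrimStart('#');
        if (int.TryParse(digits, out int value) && value >= 0)
        {
            // PadLeft adds leading zeros up to a total width of 16.
            return Convert.ToString(value, 2).PadLeft(16, '0');
        }
        return null;
    }
}
```

Inside either read loop above, writer.WriteLine(HackConverter.ToBinary16(line)) would then write one padded value per line, after skipping null results. Values above 65535 produce more than 16 digits, because PadLeft never truncates.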
