Reading a line in C# without trimming the line delimiter character

I've got a string that I want to read line by line, but I also need the line delimiter character, which StringReader.ReadLine unfortunately trims (unlike in Ruby, where it is kept). What is the fastest and most robust way to accomplish this?
Alternatives I've been thinking about:
Reading the input character by character and checking for the line delimiter each time
Using Regex.Split with a lookaround (sketched below)
Alternatively, I only care about the line delimiter because I need to know the actual position in the string, and the delimiter can be one or two characters long. So getting back the actual position of the cursor within the string would also work, but StringReader doesn't have this feature.
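For reference, a minimal sketch of the regex route (illustrative only; input is a placeholder name). Keeping the delimiter attached to its line is most direct with a lookbehind after the '\n' rather than a lookahead, and since both "\r\n" and "\n" end with '\n', one zero-width pattern covers both:
using System.Text.RegularExpressions;

string[] lines = Regex.Split(input, @"(?<=\n)");
// each element keeps its trailing delimiter; if the input itself ends
// with a delimiter, the final element is an empty string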
EDIT: Here is my current implementation. End-of-file is signaled by returning an empty string.
StringBuilder line = new StringBuilder();
int r = _input.Read();
while (r >= 0)
{
    char c = Convert.ToChar(r);
    line.Append(c);
    if (c == '\n') break; // LF ends the line; the delimiter stays in the result
    if (c == '\r')
    {
        int peek = _input.Peek();
        if (peek == -1) break; // CR at end of input
        if (Convert.ToChar(peek) != '\n') break; // bare CR; otherwise fall through to consume the LF
    }
    r = _input.Read();
}
return line.ToString();

Are you concerned about inconsistencies between files (e.g. coming from Unix/Mac vs. Windows), or within files?
One very easy optimization, if you know that individual files are consistent with themselves, would be to read only the first line character by character and figure out what the delimiter is. Determining the exact position of any other line would then be simple math (a sketch follows at the end of this answer).
Failing that, I think I would go the character-by-character route. A regex seems too "clever." This sounds like a complex function, and I think the most important thing would be to make it easy to write, read, understand, and most importantly debug.
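For illustration, a minimal sketch of that first-line probe (a hypothetical helper, assuming the file is consistent with itself):
string DetectDelimiter(string s)
{
    int i = s.IndexOf('\n');
    if (i < 0) return ""; // no line break in the input at all
    return (i > 0 && s[i - 1] == '\r') ? "\r\n" : "\n"; // CRLF vs. bare LF
}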
There's another way to do this, which would be more efficient if your data source were a stream. Unfortunately it's not, as referenced in your comment, so you would have to create one first; however, I'll include the solution anyway, as it might give you some inspiration:
public IEnumerable<long> GetLineStartIndices(string s)
{
    yield return 0;
    byte[] bytes = Encoding.UTF8.GetBytes(s);
    using (MemoryStream stream = new MemoryStream(bytes))
    using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
    {
        while (reader.ReadLine() != null)
        {
            // note: the element type is long because Stream.Position is a long;
            // also, StreamReader reads ahead into an internal buffer, so the
            // reported position can run ahead of the line just returned
            yield return stream.Position;
        }
    }
}
This will give you back the start position of each new line. Obviously you can tweak this to do whatever else you need, e.g. do something else with the actual lines you read.
Just note that this has to make a copy of the string to create the byte array, so it's really not suitable for very large strings. It's a bit nicer than the char-by-char approach, though, and less bug-prone, so perhaps worth considering if the strings are not megabytes long.

If you only care about the position: ReadLine() moves you to the next line. If you store the .Position of the underlying stream, you can compare it to the .Position after the following ReadLine(). That difference is the length of the string you just read plus its delimiter.
The length of the delimiter is currentPosition - previousPosition - line.Length.
That way you can easily find out whether it was 1 or 2 bytes (without knowing the details, but you said you only care about the positions anyway).
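One caveat: a StreamReader buffers its reads, so the underlying stream's Position typically jumps ahead in buffer-sized steps rather than line by line. If the data is already in a string, the same arithmetic can be done directly on it; a sketch with illustrative names:
int pos = 0;
while (pos < s.Length)
{
    int nl = s.IndexOf('\n', pos);
    int end = (nl == -1) ? s.Length : nl + 1; // line end, delimiter included
    string lineWithDelimiter = s.Substring(pos, end - pos);
    // delimiter length: 2 for "\r\n", 1 for "\n", 0 for a final line without one
    int delimiterLength = (nl == -1) ? 0 : (nl > pos && s[nl - 1] == '\r' ? 2 : 1);
    pos = end; // exact cursor position within the string
}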

File.ReadAllText will get you all of the file contents. Yup. All. So you'd better check the file size before using it.
EDIT:
Read it all in, then create an enumerator that yields line by line.
foreach (string line in Read("some.file"))
{ ... }

private IEnumerable<string> Read(string file)
{
    string buffer = File.ReadAllText(file);
    for (int index = 0; index < buffer.Length; index++)
    {
        string line = ...; // logic to build a "line" here
        yield return line;
    }
}

FileStream fs = new FileStream("E:\\hh.txt", FileMode.Open, FileAccess.Read);
BinaryReader read = new BinaryReader(fs);
byte[] ch = read.ReadBytes((int)fs.Length);
byte[] che = new byte[(int)fs.Length];
int size = (int)fs.Length, j = 0;
for (int i = 0; i <= size - 1; i++)
{
    if (ch[i] != '|') // copy everything except the '|' delimiter bytes
    {
        che[j] = ch[i];
        j++;
    }
}
richTextBox1.Text = Encoding.ASCII.GetString(che, 0, j); // only the bytes actually copied
read.Close();
fs.Close();

Related

read a text file and search for string in memory efficient way (and abort when found)

I'm searching for a string in a text file (which may also include XML). This is what I thought of first:
using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        if (s.Contains("mySpecialString"))
            return true;
    }
}
return false;
I want to read line by line to minimize the amount of RAM used. When the string has been found, the operation should abort. The reason I don't process it as XML is that it would have to be parsed, which would consume more memory than necessary.
Another easy implementation would be
bool found = File.ReadAllText(path).Contains("mySpecialString");
but that would read the complete file into memory, which isn't what I want. On the other hand, it could be faster.
Another one would be this
foreach (string line in File.ReadLines(path))
{
if (line.Contains("mySpecialString"))
{
return true;
}
}
return false;
But which one of them (or another one from you?) is the most memory-efficient?
You can use a query with File.ReadLines, so it only reads as many lines as it needs to, in order to satisfy your query. The Any() method will stop when it hits a line containing your string.
return File.ReadLines(fileName).Any(line => line.Contains("mySpecialString"));
I also prefer the accepted answer. Maybe I'm micro-optimizing here, but you asked for a memory-efficient approach. Also consider that the text you are searching for could itself contain newline characters like '\r', '\n' or "\r\n", and that a large file could theoretically consist of a single line, which negates the benefit of ReadLines.
So you could use this method:
public static bool FileContainsString(string path, string str, bool caseSensitive = true)
{
    if (String.IsNullOrEmpty(str))
        return false;

    string needle = caseSensitive ? str : str.ToUpperInvariant();

    using (var reader = new StreamReader(path))
    {
        // keep a rolling window of the last needle.Length characters, so every
        // position in the file is checked (simply consuming needle.Length chars
        // per attempt would skip over matches that start mid-attempt)
        var window = new char[needle.Length];
        long total = 0;
        int r;
        while ((r = reader.Read()) != -1)
        {
            window[(int)(total % window.Length)] = caseSensitive ? (char)r : Char.ToUpperInvariant((char)r);
            total++;
            if (total < needle.Length)
                continue;
            bool stringFound = true;
            for (int i = 0; i < needle.Length; i++)
            {
                if (window[(int)((total - needle.Length + i) % window.Length)] != needle[i])
                {
                    stringFound = false;
                    break; // mismatch; keep streaming from the next position
                }
            }
            if (stringFound)
                return true;
        }
    }
    return false;
}
bool containsString = FileContainsString(path, "mySpecialString", false); // ignore case if desired
Note that this may be the most memory-efficient approach, and hidden in a method it remains readable. But it has one drawback: a culture-sensitive comparison is not feasible, because it looks at single characters and not at substrings.
So you have to keep some edge cases in mind where you can run into issues, like the famous Turkish-i example or surrogate pairs.
I think both of your solutions behave the same. See MSDN: https://msdn.microsoft.com/en-us/library/dd383503%28v=vs.110%29.aspx
There it says: "The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned."
The same article also suggests that ReadLines should be used in conjunction with LINQ to Objects.
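For example (with using System.Linq; path and the predicate are illustrative), a deferred query only pulls lines from disk as they are consumed:
var firstMatches = File.ReadLines(path)
                       .Where(line => line.Contains("mySpecialString"))
                       .Take(5)
                       .ToList(); // stops reading the file after five matches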

Extremely Large Single-Line File Parse

I am downloading data from a site, and the site gives the data to me in very large blocks. Within each large block there are "chunks" that I need to parse individually. These "chunks" begin with "<ClinicalData" and end with "</ClinicalData>". Therefore, an example string would look something like:
<ClinicalData ID="1"></ClinicalData><ClinicalData ID="2"></ClinicalData><ClinicalData ID="3"></ClinicalData><ClinicalData ID="4"></ClinicalData><ClinicalData ID="5"></ClinicalData>
Under "ideal" circumstances the block is meant to be a single line of data, but sometimes there are erroneous newline characters. Since I want to parse the <ClinicalData> chunks within the block, I want to make the data parse-able line by line. So I take the text file, read it all into a StringBuilder, remove newlines (just in case), and then insert my own newlines so the data can be read line by line.
StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);
// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");
// set my own newline characters so the data becomes parse-able by line
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");
// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());
This has been working out great (albeit maybe not efficiently, but at least it is friendly to me :)), until I encountered a block of data handed to me as a 280MB large file.
Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280MB of straight text. I have tried string splits, Regex.Match splits, and various other ways to break it into guaranteed <ClinicalData> chunks, but I continue to get the memory exception. I have also had no luck attempting to read pre-defined chunks (e.g. using .ReadBytes).
Any suggestions on how to handle a 280MB large, potentially-but-might-not-actually-be single line of text would be great!
That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead, you only need to maintain a single intermediate state, something like:
enum ReadState
{
    Start,
    SawOpen
}

using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
    var rs = ReadState.Start;
    while (true)
    {
        var r = sr.Read();
        if (r < 0)
        {
            if (rs == ReadState.SawOpen)
                sw.Write('<'); // flush a pending '<' at end of input
            break;
        }
        char c = (char)r;
        if ((c == '\r') || (c == '\n'))
            continue; // drop stray newlines
        if (rs == ReadState.SawOpen)
        {
            if (c == 'C')
                sw.WriteLine(); // start a new line before each <ClinicalData...
            sw.Write('<');
            rs = ReadState.Start;
        }
        if (c == '<')
        {
            rs = ReadState.SawOpen;
            continue;
        }
        sw.Write(c);
    }
}
First off, I don't think you need to put all the text in a StringBuilder, since you aren't even concatenating parts to it. You could just try the following:
File.WriteAllText(filepath, File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData"));
// the WriteAllText call writes the result back to the file, as the original code did
Why not try a StreamReader for this task? You can pick a "chunk" size to read by and then split those chunks into the <ClinicalData>...</ClinicalData> parts. Here is some detailed code on how to do this:
char[] buffer = new char[1024];
string remainder = string.Empty;
List<ClientData> list = new List<ClientData>();

using (StreamReader reader = File.OpenText(@"source.txt"))
{
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // use only the chars actually read; the tail of the buffer may
        // still hold stale data from the previous iteration
        remainder = Parse(remainder + new string(buffer, 0, read), list);
    }
}
with the following method:
string Parse(string value, List<ClientData> list)
{
    // every element except the last is a complete chunk; the last element
    // is the (possibly incomplete) remainder carried into the next read
    string[] parts = value.Split(new string[] { "</ClinicalData>" }, StringSplitOptions.None);
    for (int i = 0; i < parts.Length - 1; i++)
        list.Add(new ClientData(parts[i]));
    return parts[parts.Length - 1];
}
and the ClientData class however you have it implemented:
class ClientData
{
    public ClientData(string value)
    {
        // fill in however you are already parsing out ID and other info
    }
}
There are many ways to implement something like this, but hopefully this can help get you started.
StreamReader's ReadLine() method is only one of the many ways you can read the text from the file. You can read into a buffer with a specified length, and then parse out the ClinicalData tags. I can provide an example if you'd like.
http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx
Alternately, if you are reading an XML file, XmlReader is another option.
http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx
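For what it's worth, a sketch of the XmlReader route (assuming the input really is a sequence of well-formed <ClinicalData> fragments; ConformanceLevel.Fragment allows multiple root elements, and the file name and attribute name are taken from the question):
var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create("source.txt", settings))
{
    while (reader.ReadToFollowing("ClinicalData"))
    {
        string id = reader.GetAttribute("ID");
        // process one chunk at a time; the whole file is never held in memory
    }
}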

Can't find string in input file

I have a text file into which I am trying to insert a line of code. Using my linked lists, I believe I can avoid having to take all the data out, sort it, and then write it into a new text file.
What I did was come up with the code below. I stepped through it in the debugger, and what seems to be going on is that it goes through the entire list (which is about 10,000 lines) without finding anything to be true, so it never inserts my record.
Why or what is wrong with this code?
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
using (StreamReader inFile = new StreamReader("Students.txt", true))
{
    string newLastName = "'Constant";
    string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
    string line;
    string lastName;
    string[] values;
    bool insertionPointFound = false;
    for (int i = 0; i < lines.Count && !insertionPointFound; i++)
    {
        line = lines[i];
        if (line.StartsWith("(LIST (LIST "))
        {
            values = line.Split(" ".ToCharArray());
            lastName = values[2];
            if (newLastName.CompareTo(lastName) < 0)
            {
                lines.Insert(i, newRecord);
                insertionPointFound = true;
            }
        }
    }
    if (!insertionPointFound)
    {
        lines.Add(newRecord);
    }
}
You're just modifying the lines in memory and never committing them anywhere.
I'm afraid you're going to have to load and completely re-write the entire file: files support appending, but they don't support insertion.
You can write to a file the same way that you read from it:
string[] lines;
// instantiate and build `lines`
File.WriteAllLines("path", lines);
WriteAllLines also takes an IEnumerable, so you can pass a List of string in there if you want.
One more issue: it appears as though you're reading your file twice, once with ReadAllLines and again with your StreamReader.
There are at least four possible errors.
1. Opening the StreamReader is not required; you have already read all the lines. (Not really an error, but...)
2. The StartsWith check can be fooled if your lines start with blank space, and you will miss the insertion point. (Adding a Trim removes that problem.)
3. On the CompareTo line you check for < 0, but you should check for == 0. CompareTo returns 0 if the strings are equivalent; however...
4. To check whether two strings are equal, you should avoid CompareTo, as explained in the MSDN link above, and use string.Equals instead:
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
string[] values;
bool insertionPointFound = false;

for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
    line = lines[i].Trim();
    if (line.StartsWith("(LIST (LIST "))
    {
        values = line.Split(" ".ToCharArray());
        lastName = values[2];
        if (newLastName.Equals(lastName))
        {
            lines.Insert(i, newRecord);
            insertionPointFound = true;
        }
    }
}
if (!insertionPointFound)
    lines.Add(newRecord);
I didn't list the missing write back to the file as an error; I hope you have just omitted that part of the code. Otherwise it is a very simple problem.
(However, I think the way CompareTo is used is probably the main reason for your problem.)
EDIT: Looking at your comment below, it seems the answer from Sam I Am is the right one for you. Of course, you need to write back the modified array of lines: all the changes are made to an in-memory array, and nothing is written back to a file unless you have code that writes one. You don't need a new file, however:
File.WriteAllLines("Students.txt", lines);

how to replace characters in an array quickly

I am using an XML text reader on an XML file that may contain characters that are invalid for the reader. My initial thought was to create my own version of the stream reader and clean out the bad characters, but it is severely slowing down my program.
public class ClensingStream : StreamReader
{
    private static char[] badChars = { '\x00', '\x09', '\x0A', '\x10' };

    //snip

    public override int Read(char[] buffer, int index, int count)
    {
        var tmp = base.Read(buffer, index, count);
        for (int i = 0; i < buffer.Length; ++i)
        {
            // check the element in the buffer to see if it is one of the bad characters
            if (badChars.Contains(buffer[i]))
                buffer[i] = ' ';
        }
        return tmp;
    }
}
According to my profiler, the code is spending 88% of its time in if (badChars.Contains(buffer[i])). What is the correct way to do this so I am not causing horrible slowness?
The reason that it spends so much time in that line is because the Contains method loops through the array to look for the character.
Put the characters in a HashSet<char> instead:
private static HashSet<char> badChars =
new HashSet<char>(new char[] { '\x00', '\x09', '\x0A', '\x10' });
The code to check if the set contains the character looks the same as when looking in the array, but it uses the hash code of the character to look for it instead of looping through all the items in the array.
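The call site then stays exactly as it was, but the lookup is O(1):
if (badChars.Contains(buffer[i])) // HashSet<char>.Contains: hash lookup, not a linear scan
    buffer[i] = ' ';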
Alternatively, you could put the characters in a switch; that way the compiler creates an efficient comparison:
switch (buffer[i])
{
    case '\x00':
    case '\x09':
    case '\x0A':
    case '\x10':
        buffer[i] = ' ';
        break;
}
If you have more characters (five or six IIRC), the compiler will actually create a hash table to look up the cases, so that would be similar to using a HashSet.
You might have better results with a switch statement:
switch (buffer[i])
{
    case '\x00':
    case '\x09':
    case '\x0A':
    case '\x10':
        buffer[i] = ' ';
        break;
}
This should be compiled down to fast code by the JIT compiler at runtime. Heck, the compiler might get close too. You don't need a method call this way either.
You could use regular expressions for this, which should be well optimized: read the text into a string, then use Replace with your characters in the pattern.
That said, your code also looks fine to me; I suspect a regex can't do much more than search through your text either, and it needs the input as a string, which you don't need with the other options.
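A sketch of that idea (illustrative; text stands for the already-read input, and the pattern simply lists the bad characters in a character class):
string cleaned = Regex.Replace(text, @"[\x00\x09\x0A\x10]", " ");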
You could check how well it optimizes if you only scan the chars actually read, making it
var tmp = base.Read(buffer, index, count);
for (int i = index; i < index + tmp; i++)
{
    // etc.
}
I don't know if or how much this would help; you'd have to profile your real-world application to check.
Try converting the char[] to a string and then using IndexOfAny.
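Roughly like this inside Read (illustrative; tmp is the number of chars just read, and note this allocates a temporary string per call):
string s = new string(buffer, index, tmp);
for (int i = s.IndexOfAny(badChars); i >= 0; i = s.IndexOfAny(badChars, i + 1))
    buffer[index + i] = ' ';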
You could use a boolean array
char[] badChars = { '\x00', '\x09', '\x0A', '\x10' };
char maxChar = badChars.Max();
Debug.Assert(maxChar < 256);
bool[] badCharsTable = new bool[maxChar + 1];
Array.ForEach(badChars, ch => badCharsTable[ch] = true);
and replace badChars.Contains(...) with (ch < badCharsTable.Length && badCharsTable[ch]).
Edit: Finally had time to improve the answer.

C# how do I count lines in a text file

Any problems with doing this?
int i = new StreamReader("file.txt").ReadToEnd().Split(new char[] {'\n'}).Length;
The method you posted isn't particularly good. Lets break this apart:
// new StreamReader("file.txt").ReadToEnd().Split(new char[] {'\n'}).Length
// becomes this:
var file = new StreamReader("file.txt").ReadToEnd(); // big string
var lines = file.Split(new char[] {'\n'}); // big array
var count = lines.Length; // arrays expose Length, not Count
You're actually holding this file in memory twice: once as the whole string, and once split into an array. The garbage collector hates that.
If you like one liners, you can write System.IO.File.ReadAllLines(filePath).Length, but that still retrieves the entire file in an array. There's no point doing that if you aren't going to hold onto the array.
A faster solution would be:
int TotalLines(string filePath)
{
    using (StreamReader r = new StreamReader(filePath))
    {
        int i = 0;
        while (r.ReadLine() != null) { i++; }
        return i;
    }
}
The code above holds (at most) one line of text in memory at any given time. It's going to be efficient as long as the lines are relatively short.
Well, the problem with doing it that way is that you allocate a lot of memory on large files.
I would rather read the file line by line and manually increment a counter. This may not be a one-liner but it's much more memory-efficient.
Alternatively, you may load the data in even-sized chunks and count the line breaks in these. This is probably the fastest way.
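A sketch of that chunked variant (the buffer size is arbitrary):
int count = 0;
char[] buf = new char[8192];
using (var reader = new StreamReader("file.txt"))
{
    int n;
    while ((n = reader.Read(buf, 0, buf.Length)) > 0)
    {
        for (int i = 0; i < n; i++)
            if (buf[i] == '\n') count++; // count line breaks as we go
    }
}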
If you're looking for a short solution, I can give you a one-liner that at least saves you from having to split the result:
int i = File.ReadAllLines("file.txt").Length;
But that has the same problems of reading a large file into memory as your original. You should really use a streamreader and count the line breaks as you read them until you reach the end of the file.
Sure - it reads the entire stream into memory. It's terse, but I can create a file today that will make it fail, hard.
Read a character at a time and increment your count on newline.
EDIT - after some quick research
If you want terse and want that shiny new generic feel, consider this:
public class StreamEnumerator : IEnumerable<char>
{
    StreamReader _reader;

    public StreamEnumerator(Stream stm)
    {
        if (stm == null)
            throw new ArgumentNullException("stm");
        if (!stm.CanSeek)
            throw new ArgumentException("stream must be seekable", "stm");
        if (!stm.CanRead)
            throw new ArgumentException("stream must be readable", "stm");
        _reader = new StreamReader(stm);
    }

    public IEnumerator<char> GetEnumerator()
    {
        int c = 0;
        while ((c = _reader.Read()) >= 0)
        {
            yield return (char)c;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
This defines a new class that allows you to enumerate over streams; your counting code can then look like this:
StreamEnumerator chars = new StreamEnumerator(stm);
int lines = chars.Count(c => c == '\n');
which gives you a nice terse lambda expression to do (more or less) what you want.
I still prefer the Old Skool:
public static int CountLines(Stream stm)
{
    StreamReader _reader = new StreamReader(stm);
    int c = 0, count = 0;
    while ((c = _reader.Read()) != -1)
    {
        if (c == '\n')
        {
            count++;
        }
    }
    return count;
}
NB: Environment.NewLine version left as an exercise for the reader
That should do the trick:
using System.Linq;
....
int i = File.ReadLines(file).Count();
Maybe this?
string file = new StreamReader("YourFile.txt").ReadToEnd();
string[] lines = file.Split('\n');
int countOfLines = lines.GetLength(0);
Assuming the file exists and you can open it, that will work.
It's not very readable or safe...
