How to reliably split the contents of a TextBox into lines? - c#

When I get TextBox.Text, how can I split it into lines?
For text files I do this by splitting at \r\n (Notepad++). With the TextBox, usually it is \r\n, but apparently not always - since my \r\n split occasionally fails for pasted input. Luckily for text files I can use Notepad++'s "Show all characters" features to inspect the whitespace. But how can I see what characters are in the TextBox?
Clearly the TextBox itself knows how to deal with this, since it is able to display the text with linebreaks at correct locations. How can I take var s = MyTextBox.Text; and "split s at every location where the TextBox would have displayed a break"?
Edit: I've checked my Regex and actually I am already splitting at Environment.NewLine, not \r\n.

You could use
var lines = Regex.Split(s, @"\r\n|\r|\n");
to reliably split at any newline, no matter where the newlines came from.
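To make the difference concrete, here is a minimal sketch that puts the regex split next to a split on Environment.NewLine. The class and method names are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class NewlineSplitDemo
{
    // Splits at \r\n, a lone \r, or a lone \n. The \r\n alternative is
    // listed first so a Windows line ending counts as one break, not two.
    public static string[] SplitAnyNewline(string s)
    {
        return Regex.Split(s, @"\r\n|\r|\n");
    }

    // Splits only at the platform newline (\r\n on Windows), so other
    // line-ending styles in pasted text are left inside the elements.
    public static string[] SplitPlatformNewline(string s)
    {
        return s.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
    }
}
```

For an input such as "one\r\ntwo\nthree\rfour", SplitAnyNewline yields four elements, while SplitPlatformNewline yields fewer because it only recognizes one of the three styles.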

In the case of a TextBox, you can use the Lines property, as suggested in the comments. But you might need to get the lines from a string in other situations too.
For doing this I use an extension method based on a StringReader:
public static IEnumerable<string> GetLines(this string s)
{
using (var reader = new StringReader(s))
{
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
(this method is also available in my NString library)
You can use it like this:
string[] lines = textBox.Text.GetLines().ToArray();
The StringReader class knows where to split the lines, and it's probably faster than using a regex (just an intuition, I haven't actually benchmarked it). EDIT: I just did a quick benchmark; StringReader is about 6 times faster than Regex.

Related

Filtering RTF codes from text in C#

I am trying to parse data that I receive as a text file. My goal is to remove the formatting codes. It appears to be in Rich Text Format, but I suspect it contains some proprietary codes. I suspect this because when I run the following code, I get an error that says 'File format is not valid.'
public static string RemoveRTF(string rtfString)
{
RichTextBox rtb = new RichTextBox();
rtb.Rtf = rtfString;
return rtb.Text;
}
I have tried using a string aggregate, as in the code below, to remove specific codes.
public static string RemoveSpecificCodes(string text)
{
List<string> words = new List<string>();
words.Add("\\par\\pard");
words.Add("\\pard\\par");
words.Add("\\pard");
words.Add("\\par");
words.Add("\\~");
string output = words.Aggregate(text, (input, word) => input.Replace(word, ""));
return output;
}
This approach works if I know all of the format codes, but I have >10,000 lines to process, and I don't have a list of all of the codes (there appears to be a lot of them). I suspect regex may be a more appropriate way to remove the codes, but I know virtually nothing about regular expressions. Can someone help me get started? The text doesn't have any backslashes, so I would like to identify the format codes by finding a backslash, and then deleting the backslash and everything up to, but not including the next backslash or space.
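The rule described above (a backslash, then everything up to the next backslash or space) can be sketched as a single regex replacement. This is only a rough illustration of that rule, not a real RTF parser: the class and method names are hypothetical, and the pattern stops at any whitespace rather than only at a space, which is an assumption made so that a code at the end of a line does not swallow the start of the next line. It also does nothing about RTF braces or escaped characters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class RtfCodeStripper
{
    public static string RemoveBackslashCodes(string text)
    {
        // \\        a literal backslash
        // [^\\\s]*  any run of characters that are neither a backslash
        //           nor whitespace (assumption: whitespace, not just ' ')
        return Regex.Replace(text, @"\\[^\\\s]*", "");
    }
}
```

With this pattern, the delimiter that ends a code is kept, so an input like \pard\par Hello\~ world comes out with the spaces before "Hello" and "world" still in place.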

Trouble with finding new line character from a string, Reading Text from .txt file

I am reading text from a .TXT file in a String. I am using File.ReadAllText.
string Str= File.ReadAllText(@"C:\temp\file.txt", Encoding.Default);
Let's assume the Str contains following string.
string Str= @"one
two
three";
Now the problem is I cannot find the newline characters from Str.
string[] lines = Str.Split('\n');
foreach(string line in lines)
{
Console.WriteLine(line.IndexOf('\n')); // prints -1 three times
}
Is there any way I can find newline character in this situation? Please suggest.
.Split will delete the delimiter characters, and the resulting output will not include them.
From MSDN:
Delimiter characters are not included in the elements of the returned
array.
If you need to find the length of a line, just use the .Length property of the string.
In any case, as mentioned in the comments, use the File.ReadAllLines method to avoid having to split the file contents yourself.
Per MSDN documentation for the String.Split method:
Delimiter characters are not included in the elements of the returned
array.
Just as an observation, if you are trying to load a file and process it line by line you may want to do something like this:
foreach (var line in File.ReadLines(@"C:\temp\file.txt"))
{
Console.WriteLine(line);
}

Remove excessive whitespace in user input field

In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, htmlencoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (i will likely change this to reinserting paragraphs as p tags instead of br, but for now i'm using br)
Now actually inserting real html breaks opens me up to a subtle vulnerability: the enter key. The regex.replace code is there to strip out a malicious user just standing on the enter key and filling the page with crap.
This is a fix for big crap floods of just white but still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks all down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it and instead will need heuristic techniques or bayesian filters. Hopefully, someone has an easier, better way.
EDIT: perhaps I wasn't clear in the problem description, the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.
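A minimal sketch of that compression check follows. The class and method names are hypothetical, and no cut-off is built in, since the useful threshold has to be found by experimenting on real comments.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for illustration only.
public static class FloodCheck
{
    // Returns compressed size divided by original size (in UTF-8 bytes).
    // Highly repetitive input gives a value close to 0.
    public static double CompressionRatio(string input)
    {
        byte[] raw = Encoding.UTF8.GetBytes(input ?? string.Empty);
        if (raw.Length == 0) return 1.0; // nothing to judge

        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the
            // DeflateStream is disposed, which is what flushes the output.
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)buffer.Length / raw.Length;
        }
    }
}
```

A caller would compare the returned ratio against its own cut-off and reject the comment when the ratio falls below it. Very short inputs do not compress well, so the check is only meaningful above some minimum length.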
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also you should perform this logic when you are outputting to the user, not when saving in the database. The only validation I do on the database is make sure it's properly escaped (other than normal business rules that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
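subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n", RegexOptions.Singleline);
I'm not sure if you would need RegexOptions.Singleline. (It only changes how . matches, and this pattern contains no ., so it can probably be dropped.)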
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string copying them to a StringBuilder, filtering as you go.
Any that fail a char.IsWhiteSpace() test are not copied. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>'s to be added until you have hit a non-whitespace character).
edit
If you want to stop the user entering any old crap, give up now. You will never find a way filtering that a user can't find a way around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
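The limit-based approach could look something like the sketch below. The class name, the method name and both limits are assumptions chosen for illustration; the real numbers depend on what the page layout can tolerate.

```csharp
using System.Linq;

// Hypothetical validator for illustration only.
public static class CommentLimits
{
    public const int MaxChars = 2000;      // assumed limit
    public const int MaxLineBreaks = 40;   // assumed limit

    public static bool IsWithinLimits(string comment)
    {
        if (comment == null) return true;
        if (comment.Length > MaxChars) return false;

        // Normalize first so that \r\n counts as one break, not two.
        string normalized = comment.Replace("\r\n", "\n").Replace('\r', '\n');
        int breaks = normalized.Count(c => c == '\n');
        return breaks <= MaxLineBreaks;
    }
}
```

Input that fails the check is simply rejected at validation time, which avoids having to decide whether a particular pattern of short lines is abusive.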
This is not the most efficient way of handling this, nor the smartest (disclaimer),
but if your text is not too big it doesn't matter much and short of any smarter algorithms (note: it's hard to detect something like char\nchar\nchar\n... though you could set a limit on the line len)
You could just Split on white characters (add any you can think of, short of \n) - then Join with just one space and then split on \n (to get lines) - join with <br />. While joining the lines you can test for line.Length > 2 e.g. or something.
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc..
Again not the most efficient or perfect way of handling this but would give you something fast.
EDIT: to filter 'same lines' - you could use e.g. DistinctUntilChanged - that's from the Ix - Interactive extensions (see NuGet Ix-experimental I think) which should filter 'same lines' consecutive + you could add line test for those.
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
static void Main() {
// Arbitrary cutoff used to join short strings.
const int Cutoff = 6;
string input =
"\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
"unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
"\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.
StringBuilder temp = new StringBuilder();
List<string> result = new List<string>();
var items = input.Split(
new[] { '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries)
.Select(i => new { i.Length, Value = i });
foreach (var item in items) {
if (item.Length > Cutoff) {
if (temp.Length > 0) {
result.Add(temp.ToString());
temp.Clear();
}
result.Add(item.Value);
continue;
}
if (temp.Length > 0) { temp.Append(" "); }
temp.Append(item.Value);
}
if (temp.Length > 0) {
result.Add(temp.ToString());
}
Console.WriteLine(String.Join("<br />", result));
}
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution, but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {
string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
"\r\nunsanatized\r\nbreaks\r\n\r\n";
input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
string output = Regex.Replace(
input,
"\\\n+",
"<br />",
RegexOptions.Multiline);
Console.WriteLine(output);
}
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks

C# Regex quick help

I'm trying to read a text file and then break it up into lines, splitting at "\n". Then I want to run a regex on each line and write out the result.
string contents = File.ReadAllText(filename);
string[] firefox = filename.Split("\r\n");
string prefix = prefix = Regex.Match(firefox, @"(\d)").Groups[0].Value;
File.AppendAllText(workingdirform2 + "configuration.txt", prefix);
string[] firefox = filename.Split("\r\n"); doesnt exactly work.
What I want to do is run a regex foreach line of contents and then write out each line after the regex
So...
filename:
Hero123
Hero243
Hero5959
writes out to:
123
243
5959
Well everybody is suggesting something off the base in which i started. the ending result will be about a 20 line regex with Ints. I've got to parse it out line by line.
File.ReadAllLines
var lines = File.ReadAllLines(originalPath);
File.WriteAllLines(newPath, lines
.Select(l => Regex.Match(l, @"\d+").Value).ToArray());
There are a number of problems with your code:
The reason the splitting doesn't work, is because you're splitting filename, not contents, which contains the actual file data. I agree with the other poster on using File.ReadAllLines :) It's a little more flexible with the file format compared to using \r\n, amongst other things.
Also, you have string prefix = prefix = ...; the second equals sign is probably intended to be a +. You should use a StringBuilder if the data files can become large, or better yet, write to an output stream as you go.
Passing an array to Regex.Match doesn't work either. To apply the regex to all lines, you should do something like:
foreach (string line in firefox)
{
prefix = prefix + Regex.Match(line, @"\d+").Value;
// Or rather:
// stringBuilder.AppendLine(Regex.Match(line, @"\d+").Value);
}
Either that, or do it all at once with a multiline regex :)

Reading line by line

I have a program that generates a plain text file. The structure (layout) is always the same. Example:
Text File:
LinkLabel
"Hello, this text will appear in a LinkLabel once it has been
added to the form. This text may not always cover more than one line. But will always be surrounded by quotation marks."
240, 780
So, to explain what is going on in that file:
Control
Text
Location
And when a button on the Form is clicked, and the user opens one of these files from the OpenFileDialog dialog, I need to be able to read each line. Starting from the top, I want to check to see what control it is, then starting on the second line I need to be able to get all text inside the quotation marks (regardless of whether it is one line of text or more), and on the next line (after the closing quotation mark), I need to extract the location (240, 780)... I have thought of a few ways of going about this, but when I go to write them down and put them into practice, they don't make much sense and I end up figuring out ways that they won't work.
Has anybody ever done this before? Would anybody be able to provide any help, suggestions or advice on how I'd go about doing this?
I have looked up CSV files but that seems too complicated for something that seems so simple.
Thanks
jase
You could use a regular expression to get the lines from the text:
MatchCollection lines = Regex.Matches(File.ReadAllText(fileName), @"(.+?)\r\n""([^""]+)""\r\n(\d+), (\d+)\r\n");
foreach (Match match in lines) {
string control = match.Groups[1].Value;
string text = match.Groups[2].Value;
int x = Int32.Parse(match.Groups[3].Value);
int y = Int32.Parse(match.Groups[4].Value);
Console.WriteLine("{0}, \"{1}\", {2}, {3}", control, text, x, y);
}
I'll try and write down the algorithm, the way I solve these problems (in comments):
// while not at end of file
// read control
// read line of text
// while last char in line is not "
// read line of text
// read location
Try and write code that does what each comment says and you should be able to figure it out.
HTH.
You are trying to implement a parser and the best strategy for that is to divide the problem into smaller pieces. And you need a TextReader class that enables you to read lines.
You should separate your ReadControl method into three methods: ReadControlType, ReadText, ReadLocation. Each method is responsible for reading only the item it should read and leave the TextReader in a position where the next method can pick up. Something like this.
public Control ReadControl(TextReader reader)
{
string controlType = ReadControlType(reader);
string text = ReadText(reader);
Point location = ReadLocation(reader);
... return the control ...
}
Of course, ReadText is the most interesting one, since it spans multiple lines. In fact it's a loop that calls TextReader.ReadLine until the line ends with a quotation mark:
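private string ReadText(TextReader reader)
{
string line = reader.ReadLine();
string text = line.Substring(1); // Strip first quotation mark.
while (!text.EndsWith("\"")) {
line = reader.ReadLine();
if (line == null) {
// Ran out of input before the closing quotation mark.
throw new FormatException("Unterminated quoted text.");
}
text += Environment.NewLine + line; // Keep the line break between lines.
}
return text.Substring(0, text.Length - 1); // Strip last quotation mark.
}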
This kind of stuff gets irritating: it's conceptually simple, but you can end up with gnarly code. You've got a comparatively simple case: one record per file. It gets much harder if you have lots of records and you want to deal nicely with badly formed records (consider writing a parser for a language such as C#).
For large scale problems one might use a grammar driven parser such as this: link text
Much of your complexity comes from the lack of regularity in the file. The first field is terminated by a newline, the second is delimited by quotes, the third is terminated by a comma ...
My first recommendation would be to adjust the format of the file so that it's really easy to parse. You write the file, so you're in control. For example, just don't have new lines in the text, and put each item on its own line. Then you can just read four lines, job done.
