string manipulations - c#

I have a string variable declared globally, and I have to append a substring to it dynamically based on user input. To do this I use:
str=str+substring;
The resulting string is not a meaningful sentence, because there are no spaces between the words. To fix that I used one of the following statements instead:
str=str+" "+substring; or str=str+substring+" ";
Here I have to add an extra space to the substring every time before appending it to the main string, which requires additional string processing.
Can anybody suggest a more efficient way to do this?

It depends on how often you are doing it. If this is intermittent (or in fact pretty-much anything except a tight loop), then forget it; what you have is fine. Sure an extra string is generated occasionally (the combined substring/space), but it will be collected at generation 0; very cheap.
If you are doing this aggressively (in a loop etc), then use a StringBuilder instead:
// declaration
StringBuilder sb = new StringBuilder();
...
// composition
sb.Append(' ').Append(substring);
...
// obtaining the string
string s = sb.ToString();
A final (unrelated) point - re "globally" - if you mean static, you might want to synchronize access if you have multiple threads.

What do you want to achieve exactly? You could store the words in a list
List<string> words = new List<string>();
...
words.Add(str);
And then delay the string manipulation (i.e. adding the spaces between words) until the very end. This way, your on-the-fly operation is just an add to a list, and you can do all the complex processing (whatever it may be) at the end.
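A minimal sketch of the list-based idea, assuming the only final processing is putting a single space between fragments. The class and method names (WordCollector, Add, Build) are hypothetical and not part of the original answer.

```csharp
using System.Collections.Generic;

// Hypothetical helper: collects fragments cheaply and joins them once at the end.
public class WordCollector
{
    private readonly List<string> words = new List<string>();

    // On-the-fly operation: just remember the fragment, no string work yet.
    public void Add(string fragment)
    {
        words.Add(fragment);
    }

    // All the spacing work happens here, in a single pass.
    public string Build()
    {
        return string.Join(" ", words);
    }
}
```

Calling Add for each user input and Build once when the sentence is needed avoids creating an intermediate string per append.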

If you are doing it rarely, you could slightly pretty up the code by doing:
str += " " + substring;
Otherwise, I'd go with Nanda's solution.

@Nanda: in your case you should use a StringBuilder.
StringBuilder data = new StringBuilder();
data.AppendFormat(" {0}", substring);

Related

C# String class: No way of pushing a character to the end of the string?

Really???
I've searched through https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx and don't see any method that can directly push a character onto the end of a string. The best I can figure is
mystr.Insert(mystr.Length, newchar.ToString());
which seems inefficient because of the overhead involved in converting the character to a string and performing string concatenation. My particular use case looks like
while (eqtn[curidx] >= '0' && eqtn[curidx] <= '9') istr.Insert(istr.Length, eqtn[curidx++].ToString());
only because I can't think of a better way to do it. Is there a better way?
Strings in .NET are immutable, so your code doesn't do anything. Every method on a String creates a new instance, it doesn't modify the existing string.
The + operator on strings creates a new string with the character appended to the end:
istr = istr + eqtn[curidx++];
If you are doing a lot of such operations it will be more efficient to use a StringBuilder. It's basically a mutable String.
You can use the Append method to add a char to end. When you're ready, call ToString to get the constructed string.
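As a rough sketch, the digit-collecting loop from the question could be written with StringBuilder.Append as shown here. The wrapper class DigitScanner, the method ReadDigits, and the added bounds check are assumptions made for illustration; the names eqtn and curidx come from the question.

```csharp
using System.Text;

// Hypothetical wrapper around the loop from the question.
public static class DigitScanner
{
    // Returns the run of ASCII digits in 'eqtn' beginning at index 'start'.
    public static string ReadDigits(string eqtn, int start)
    {
        var sb = new StringBuilder();
        int curidx = start;
        // The length check is an addition; the original loop relied on a non-digit terminator.
        while (curidx < eqtn.Length && eqtn[curidx] >= '0' && eqtn[curidx] <= '9')
        {
            sb.Append(eqtn[curidx++]); // appends the char directly, no ToString() needed
        }
        return sb.ToString();
    }
}
```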
Yes, that is absolutely right: you cannot push a character onto the end of a string because C# strings are immutable. Once you have an object, you are stuck with its value until you create a new string object.
On the other hand, creating a new string with an extra character at the end is very simple: use + operator overload that performs concatenation:
string s = "abc";
s += '9'; // s becomes "abc9"
Note that this solution is not so good for use in a loop, because if your loop runs N times you create N throw-away objects in the process. A better solution is to use StringBuilder, which provides a mutable string in C#. StringBuilder class has a convenient Append method, which pushes characters to the end of the StringBuilder. Once you are done building the string, call ToString to harvest the result as an immutable string object.

Remove excessive whitespace in user input field

In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, HTML-encoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (I will likely change this to reinserting paragraphs as p tags instead of br, but for now I'm using br.)
Now actually inserting real html breaks opens me up to a subtle vulnerability: the enter key. The regex.replace code is there to strip out a malicious user just standing on the enter key and filling the page with crap.
This is a fix for big crap floods of just white but still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks all down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it and instead will need heuristic techniques or bayesian filters. Hopefully, someone has an easier, better way.
EDIT: perhaps I wasn't clear in the problem description, the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.
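A minimal sketch of the compression check, assuming UTF-8 input and a cut-off that still has to be found by experiment. The class FloodCheck, its method names, and the default cut-off of 0.1 are placeholders, not values from the answer.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: highly repetitive input deflates to a tiny fraction of its size.
public static class FloodCheck
{
    // Ratio of deflated size to original UTF-8 size; lower means more repetitive.
    public static double CompressionRatio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        if (raw.Length == 0) return 1.0;
        using (var buffer = new MemoryStream())
        {
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
            {
                deflate.Write(raw, 0, raw.Length);
            } // disposing the DeflateStream flushes the compressed bytes into 'buffer'
            return (double)buffer.Length / raw.Length;
        }
    }

    // 'cutoff' is a placeholder value to be tuned against real comments.
    public static bool LooksLikeFlood(string text, double cutoff = 0.1)
    {
        return CompressionRatio(text) < cutoff;
    }
}
```

Very short inputs barely compress at all, so a check like this is only informative for longer comments.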
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also you should perform this logic when you are outputting to the user, not when saving in the database. The only validation I do on the database is make sure it's properly escaped (other than normal business rules that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n", RegexOptions.Singleline);
I'm not sure if you would need RegexOptions.Singleline.
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string copying them to a StringBuilder, filtering as you go.
Any character for which char.IsWhiteSpace() returns true is not copied. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>s to be added until you have hit a non-whitespace character.)
edit
If you want to stop the user entering any old crap, give up now. You will never find a way of filtering that a user can't find a way around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
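The simple limits suggested above might look something like this sketch. The class CommentLimits and both limit values are made-up placeholders, not recommendations.

```csharp
using System.Linq;

// Hypothetical validation helper with placeholder limits.
public static class CommentLimits
{
    public const int MaxChars = 2000;
    public const int MaxLineBreaks = 50;

    public static bool IsWithinLimits(string comment)
    {
        if (comment == null) return false;
        if (comment.Length > MaxChars) return false;
        // Counts '\n' only, so "\r\n" counts once; a lone '\r' is not counted.
        int breaks = comment.Count(ch => ch == '\n');
        return breaks <= MaxLineBreaks;
    }
}
```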
This is not the most efficient way of handling this, nor the smartest (disclaimer), but if your text is not too big it doesn't matter much, short of any smarter algorithm. (Note: it's hard to detect something like char\nchar\nchar\n..., though you could set a limit on the line length.)
You could just Split on whitespace characters (add any you can think of, except \n), then Join with a single space, then Split on \n (to get lines) and Join with <br />. While joining the lines you can test for something like line.Length > 2.
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc.
Again, not the most efficient or perfect way of handling this, but it would give you something fast.
EDIT: to filter 'same lines' - you could use e.g. DistinctUntilChanged - that's from the Ix - Interactive extensions (see NuGet Ix-experimental I think) which should filter 'same lines' consecutive + you could add line test for those.
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
    static void Main() {
        // Arbitrary cutoff used to join short strings.
        const int Cutoff = 6;
        string input =
            "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
            "unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
            "\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
        input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.
        StringBuilder temp = new StringBuilder();
        List<string> result = new List<string>();
        var items = input.Split(
                new[] { '\r', '\n' },
                StringSplitOptions.RemoveEmptyEntries)
            .Select(i => new { i.Length, Value = i });
        foreach (var item in items) {
            if (item.Length > Cutoff) {
                // A long line stands on its own; flush any pending short lines first.
                if (temp.Length > 0) {
                    result.Add(temp.ToString());
                    temp.Clear();
                }
                result.Add(item.Value);
                continue;
            }
            // Short lines are joined together with single spaces.
            if (temp.Length > 0) { temp.Append(" "); }
            temp.Append(item.Value);
        }
        if (temp.Length > 0) {
            result.Add(temp.ToString());
        }
        Console.WriteLine(String.Join("<br />", result));
    }
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
    static void Main() {
        string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
            "\r\nunsanatized\r\nbreaks\r\n\r\n";
        input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
        string output = Regex.Replace(
            input,
            "\\\n+",
            "<br />",
            RegexOptions.Multiline);
        Console.WriteLine(output);
    }
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks

C# replace string in string

Is it possible to replace a substring in a string without assigning a return value?
I have a string:
string test = "Hello [REPLACE] world";
And I want to replace the substring [REPLACE] with something else:
test = test.Replace("[REPLACE]", "test");
This works fine, but how can I do it without assigning the return value to a variable?
I want something like this:
test.Replace("[REPLACE]", "test");
As mentioned by dlev, you can't do this with string as strings are immutable in .NET - once a string has been constructed, there's nothing you can do (excluding unsafe code or reflection) to change the contents. This makes strings generally easier to work with, as you don't need to worry about defensive copying, they're naturally thread-safe etc.
Its mutable cousin, however, is StringBuilder - which has a Replace method to perform an in-object replacement. For example:
string x = "Hello [first] [second] world";
StringBuilder builder = new StringBuilder(x);
builder.Replace("[first]", "1st");
builder.Replace("[second]", "2nd");
string y = builder.ToString(); // Value of y is "Hello 1st 2nd world"
You can't, because string is immutable. It was designed so that any "changes" to a string would actually result in the creation of a new string object. As such, if you don't assign the return value (which is the "updated" string, actually copy of the original string with applied changes), you have effectively discarded the changes you wanted to make.
If you wanted to make in-place changes, you could in theory work directly with a char[] (array of characters), but that is dangerous, and should be avoided.
Another option (as pointed out by Mr. Skeet below) is to use StringBuilder and its Replace() method. That being said, simple replacements like the one you've shown are quite fast, so you may not want to bother with a StringBuilder unless you'll be doing so quite often.
Strings in .NET are immutable. They cannot be edited in-line.
The closest you can get to in-line editing is to create a StringBuilder from a string, fiddle with its contents in-line, and then get it to spit a string back out again.
But this will still produce a new string rather than altering the original. It is a useful technique, though, to avoid generating lots of intermediary strings when doing lots of string fiddling, e.g. in a loop.
You can't. You have to assign the value, as strings are immutable.
Built-in reference types (C# reference)
You can't. Strings are immutable in .NET.
You can't, as in C# strings are immutable. Something like this would violate that.
You need to have the return type of string, because the one you're working with cannot change.
Here is code that reads a string from HTML content, passes it to a StringBuilder, and substitutes the value of your variable. string.Replace cannot change the string in place, so one option is to use a StringBuilder for the manipulation. In the HTML page I added [Name], which is replaced by Name from the code-behind. Make sure the [Name] placeholder is unique, or use any other unique name.
string Name = txtname.Text;
string contents = File.ReadAllText(Server.MapPath("~/Admin/invoice.html"));
StringBuilder builder = new StringBuilder(contents);
builder.Replace("[Name]", Name);
StringReader sr = new StringReader(builder.ToString());

Is there a more elegant way to change Unicode to Ascii?

I have seen this problem a lot: you have some obscure Unicode character which is somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.
In this case I am trying to export to CSV. Having already used a nasty fix for dash, em dash, en dash and hbar, I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?
Here's what I have at the moment...
formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");
Any Ideas?
It's a rather inelegant problem, so no method will really be deeply elegant.
Still, we can certainly improve things. Just which approach will work best will depend on the number of changes that need to be made (and the size of the string to change, though it's often best to assume this either is or could be quite large).
At one replacement character, the approach you use so far - using .Replace is superior, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible but it's more readable, since it's more usual to refer to that character in hexadecimal as in U+2013 than in decimal notation (of course char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage on just using the char notation). (One could also just put '–' straight into the code - more readable in some cases, but less so in this where it looks much the same as ‒, — or - to the reader).
I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).
Taking this approach to your code it becomes:
formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');
Even with as few replacements as 3, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to do a test to find how long formattedString would need to be for this, above a certain number it becomes more efficient to use a single pass even for strings of only a few characters). One approach is:
// The final length is known in advance, so use it as the initial capacity.
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
{
    switch (c)
    {
        case '\u2013':
        case '\u2014':
        case '\u2015':
            sb.Append('-');
            break;
        default:
            sb.Append(c);
            break;
    }
}
formattedString = sb.ToString();
(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015 but the reduction in number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other).
There are various variants of this (e.g. if formattedString is going to be output to a stream at some point, it may be best to do so as each final character is obtained, rather than buffering again).
Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, e.g. we could include:
case 'ß':
    sb.Append("ss");
    break;
Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.
Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:
char[] replacements = { /* list of replacement characters */ };
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
    sb.Append(replacements[(int)c]);
formattedString = sb.ToString();
Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range going from 0 to 1114111. However, chances are the characters you care about are not only far fewer than that (and if you really did care about that many cases, you'd want the approach given just above) but also in a relatively restricted block.
Consider also if you don't especially care about any surrogates (we'll come to those later). Well, most characters you just don't care about, so, let's consider this:
char[] unchanged = new char[128];
for (int i = 0; i != 128; ++i)
    unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
// U+2013..U+2015 sit at offsets 19..21 of the block that starts at U+2000 (0x2013 - 0x2000 = 19).
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray();
char[][] blocks = new char[8704][];
for (int i = 1; i != 8704; ++i)
    blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;
/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor */
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
{
    int cAsI = (int)c;
    sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if (ret.IndexOf('\uFFFD') != -1)
    throw new ArgumentException("Unconvertible character");
formattedString = ret;
The balance between whether it's better to test for an unconvertible character in one go at the end (as above) or on each conversion varies according to how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't, and can remove that check - but you have to be really sure.
The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data, (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?
In this case, you are probably going to have to at the very least read a character or more ahead in your switch (if using the block-method for most cases, you can use an unconvertible case to signal such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer and handle it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation can deal only with the special cases.
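As a simplified sketch of the encoding round trip, the built-in EncoderReplacementFallback can stand in for the custom EncoderFallback described above. Note the swap: this version maps every unencodable character to the same replacement string, whereas a custom fallback would be needed to choose a different replacement per character. The class AsciiFolder and method ToAscii are hypothetical names.

```csharp
using System.Text;

// Hypothetical helper: round-trips through US-ASCII with a replacement fallback.
public static class AsciiFolder
{
    // Every character that US-ASCII cannot encode is replaced by 'replacement'.
    public static string ToAscii(string input, string replacement)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderExceptionFallback());
        byte[] bytes = ascii.GetBytes(input); // the fallback is applied while encoding
        return ascii.GetString(bytes);
    }
}
```

For example, AsciiFolder.ToAscii("a\u2013b", "-") turns the en dash into a plain hyphen while leaving the ASCII characters untouched.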
You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on character array to prevent lots of intermediary string churn which would be a result of using string.Replace.
For example:
var lookup = new Dictionary<char, char>
{
    { '`', '-' },
    { 'இ', '-' },
    // next pair, etc.
};
var input = "blah இ blah ` blah";
// "out char r" declares the variable inline (C# 7 or later).
var result = input.Select(c => lookup.TryGetValue(c, out char r) ? r : c);
string output = new string(result.ToArray());
Or if you want blanket treatment of non ASCII range characters:
string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());
Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.
That being said, you could make a few improvements.
If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
You could load both the "from" and "to" character at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration.
If they are all replaced with the same string:
formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));
or
foreach (char c in "\u2013\u2014\u2015")
formattedString = formattedString.Replace(c, '-');

Are StringBuilder strings immutable?

StringBuilder has a reputation as being a faster string manipulation tool than simply concatenating strings. Whether or not that's true, I'm left wondering about the results of StringBuilder operations and the strings they produce.
A quick jaunt into Reflector shows that StringBuilder.ToString() doesn't always return a copy, sometimes it seems to return an instance of the internal string. It also seems to use some internal functions to manipulate the internal strings.
So what do I get if I do this?
string s = "Yo Ho Ho";
StringBuilder sb = new StringBuilder(s);
string newString = sb.ToString();
sb.Append(" and a bottle of rum.");
string newNewString = sb.ToString();
Are newString and newNewString different string instances or the same? I've tried to figure this out via reflector, and I'm just not quite understanding everything.
How about this code?
StringBuilder sb = new StringBuilder("Foo\n");
StringReader sr = new StringReader(sb.ToString());
string s = sr.ReadLine();
sb.Append("Bar\n");
s = sr.ReadLine();
Will the last statement return null or "Bar"? And if it returns one or the other, is this defined or undefined behavior? In other words, can I rely on it?
The documentation is remarkably terse on this subject, and I'm reluctant to rely on observed behavior over specification.
Outside of mscorlib, any instance of a System.String is immutable, period.
StringBuilder does some interesting manipulation of Strings internally but at the end of the day it won't return a string to you and then subsequently mutate it in a way that is visible to your code.
As to whether subsequent calls to StringBuilder.ToString() returns the same instance of a String or a different String with the same value, that is implementation dependent and you should not rely on this behavior.
newString and newNewString are different string instances.
Although ToString() returns the current string, it clears its current thread variable. That means next time you append, it will take a copy of the current string before appending.
I'm not entirely sure what you're getting at in your second question, but s will be null: if the final characters in a file are the line termination character(s) for the previous line, the line is not deemed to have an empty line between those characters and the end of the file. The string which has been read previously makes no difference to this.
Are newString and newNewString
different string instances or the
same? I've tried to figure this out
via reflector, and I'm just not quite
understanding everything.
They are different string instances: newString is "Yo Ho Ho" and newNewString is "Yo Ho Ho and a bottle of rum.". strings are immutable, and when you call StringBuilder.ToString() the method returns an immutable string that represents the current state.
Will the last statement return null or
"Bar"? And if it returns one or the
other, is this defined or undefined
behavior? In other words, can I rely
on it?
It will return null. The StringReader is working on the immutable string you passed to it at the constructor, so it is not affected by whatever you do to the StringBuilder.
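Both claims can be checked with a small sketch along these lines. The class BuilderSnapshotDemo and its method names are hypothetical; the string values come from the question.

```csharp
using System.IO;
using System.Text;

// Hypothetical demo of the two snippets from the question.
public static class BuilderSnapshotDemo
{
    // True when the first ToString() result is a distinct object that kept its old value.
    public static bool SnapshotsAreDistinct()
    {
        var sb = new StringBuilder("Yo Ho Ho");
        string newString = sb.ToString();
        sb.Append(" and a bottle of rum.");
        string newNewString = sb.ToString();
        return !object.ReferenceEquals(newString, newNewString)
            && newString == "Yo Ho Ho";
    }

    // Returns whatever the second ReadLine() call yields.
    public static string SecondReadLine()
    {
        var sb = new StringBuilder("Foo\n");
        var sr = new StringReader(sb.ToString());
        sr.ReadLine();        // reads "Foo"
        sb.Append("Bar\n");   // changes the builder, not the string the reader holds
        return sr.ReadLine(); // null: the reader's string ended after "Foo\n"
    }
}
```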
The whole purpose of this class is to make string mutable, so it actually is. I believe (but not sure) it'll return the same string that goes into it only if nothing else had been done with this object. So after
String s_old = "Foo";
StringBuilder sb = new StringBuilder(s_old);
String s_new = sb.ToString();
s_old would be the same as s_new but it won't be in any other case.
I should note that the Java compiler automatically converts multiple string additions into operations with StringBuilder (or StringBuffer, which is similar but synchronized), and I would be really surprised if the .NET compiler doesn't do this conversion as well.
