Text parsing, conditional text - c#

I have a text template with placeholders that I parse in order to replace the placeholders
with real values.
Text Template:
Name:%name%
Age:%age%
I use StringBuilder.Replace() to replace placeholders
sb.Replace("%name%", Person.Name);
Now I want to make a more advanced algorithm. Some lines of the template are conditional. They
have to be either removed completely or kept.
Text Template
Name:%Name%
Age:%age%
Employer:%employer%
The Employer line should appear only when the person is employed (controlled by the boolean variable Person.IsEmployed).
UPDATE: I could use open/close tags. How can I find the text between string A and string B?
Can I use Regex? How?

Perhaps you could include the label "Employer:" in the replacement text instead of the template:
Template:
Name:%Name%
Age:%age%
%employer%
Replacement
sb.Replace("%employer%",
string.IsNullOrEmpty(Person.Employer) ? "" : "Employer: " + Person.Employer);

Another alternative might be to use a template engine such as Spark or NVelocity.
See a quick example for Spark here
A full-fledged template engine should give you the most control over the formatted output. For example conditionals and repeating sections.

Your current templating scheme isn't robust enough - you should add more special placeholders, like this for example:
Name:%Name%
Age:%age%
[if IsEmployed]
Employer:%employer%
[/if]
You can parse out [if *] blocks using a regex (not tested):
MatchCollection ifBlocks = Regex.Matches(input, "\\[if ([a-zA-Z0-9]+)\\]([^\\[]*)\\[/if\\]");
foreach (Match m in ifBlocks) {
    string originalBlockText = m.Groups[0].Value; // the whole [if ...]...[/if] block
    string propertyToCheck = m.Groups[1].Value;   // the keyword, e.g. "IsEmployed"
    string templateString = m.Groups[2].Value;    // the text between the tags
    // check the property that corresponds to the keyword, i.e. "IsEmployed"
    // if it's true, do the normal replacement on the templateString
    // and then replace the originalBlockText with the "filled" templateString
    // else, just don't write anything out
}
Really, though, this implementation is full of holes... You may be better off with a templating framework like another answer suggested.
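As a rough illustration of the [if ...] idea above, here is a minimal sketch. The class name ConditionalTemplate, the Render method, and the two dictionaries are hypothetical names chosen for this example, not part of any library. It assumes blocks are not nested and that a block body contains no '[' character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ConditionalTemplate
{
    // Matches [if Flag] ... [/if], plus the line break right after each tag.
    // The body may span several lines but must not contain '['.
    private static readonly Regex IfBlock =
        new Regex(@"\[if ([a-zA-Z0-9]+)\]\r?\n?([^\[]*)\[/if\]\r?\n?");

    public static string Render(string template,
                                IDictionary<string, bool> flags,
                                IDictionary<string, string> values)
    {
        // Keep the body of a block only when its flag exists and is true.
        string result = IfBlock.Replace(template, m =>
            flags.TryGetValue(m.Groups[1].Value, out bool enabled) && enabled
                ? m.Groups[2].Value
                : string.Empty);

        // Ordinary %placeholder% substitution, as in the question.
        foreach (KeyValuePair<string, string> pair in values)
        {
            result = result.Replace("%" + pair.Key + "%", pair.Value);
        }
        return result;
    }
}
```

Calling Render with the flag "IsEmployed" set to false drops the Employer line entirely, including its line break. Nested blocks or an else branch would need a real parser or a template engine.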

One option would be to do all of your replacement as you're doing now, then fix empty variables with a RegEx replacement on the way out the door. Something like this:
Response.Write(Regex.Replace(sb.ToString(), "\r\n[^:\r\n]+:\r\n", "\r\n"));

Related

How to Replace the generic css style into empty string using C#

When I copy a list of items from one list, some CSS styles are added to it. I want to remove or replace them.
The following code is meant to replace the style attribute generically, but it does not give the expected result:
flag.Text = flag.Text.Replace("style=[\"'](.*)[\"']", "");
Nothing gets replaced. How should I write this? Or should I use the Contains method?
You probably want to try using Regex.Replace (and using Multiline option, just in case) instead of string.Replace:
RegexOptions options = RegexOptions.Multiline;
flag.Text = Regex.Replace(flag.Text, "style=[\"'](.*)[\"']", "", options);
What you show above is Replace using string.Replace. It tries to find exact match of the static text instead of text with pattern. If you want to replace text with pattern, use Regex.Replace instead.
I think you are trying to replace the style by using the string.Replace() method, but it can't do a regex-style replace. You need to take a look at Regex.Replace.
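A minimal sketch of the Regex.Replace approach follows. The class and method names are hypothetical. Unlike the pattern above, it uses a lazy quantifier and a backreference so that each style attribute is matched separately instead of everything from the first quote to the last quote on the line.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class StyleStripper
{
    public static string RemoveInlineStyles(string html)
    {
        // \s+       the whitespace in front of the attribute is removed too
        // (["'])    opening quote, captured as group 1
        // .*?       lazily consume the attribute value
        // \1        the same quote character that opened the value
        return Regex.Replace(html, @"\s+style\s*=\s*([""']).*?\1", string.Empty,
                             RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

This is still a text-level fix: it does not understand HTML, so an unquoted style value or the text style="..." appearing inside page content would not be handled correctly.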

C# Regex Replace ignore specific string

Since this is my first question here on Stack Overflow, I hope my question is asked correctly.
Basically I have a normal .txt file which contains any text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: With another function, you can search and replace text in the textfile, but as you can guess if someone wants to replace the word "Text" it will be replaced in the xml-stylish comment(timestamp) as well.
My result until now is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pros will see without executing it.
Now I hope that some regex pro could help me out here and provide a search pattern which replaces the normal text but ignores the timestamp.
I'm not really familiar with the logic of regex yet, but I do understand the individual expressions, so this would be a hook for me to understand regex more properly.
Thanks in advance.
If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easiest way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, @"(?<!<!--.*?)" + input, replace);
What you're doing is attempting to replace a repetition of any length of a character that is NOT <+.*> or a character contained in input with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to @stema
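For comparison, here is a sketch of a different technique from the lookbehind above: match either a whole comment or the search term, and let a MatchEvaluator put comments back unchanged. The class and method names are hypothetical. It assumes the search term is not empty and does not itself start with "<!--"; Regex.Escape is used so that characters such as '.' or '(' in the user's input are treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class CommentAwareReplace
{
    public static string ReplaceOutsideComments(string content, string input, string replace)
    {
        // First alternative: a complete <!-- ... --> comment (may span lines).
        // Second alternative: the literal search term.
        string pattern = "<!--.*?-->|" + Regex.Escape(input);

        return Regex.Replace(
            content,
            pattern,
            // Comments are returned as-is; everything else is the search term.
            m => m.Value.StartsWith("<!--", StringComparison.Ordinal) ? m.Value : replace,
            RegexOptions.Singleline);
    }
}
```

Because the regex engine consumes a comment as a single match, the search term can never match inside it, while text after the closing --> is still replaced.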

multiline textbox to string

I have a multiline textbox that I wish to convert to a string.
I found this:
string textBoxValue = textBox1.Text.Replace(Environment.NewLine,"TOKEN");
But I don't understand TOKEN. What is TOKEN? Whitespace, or a \n newline?
If this is the incorrect approach, then please let me know the correct way of doing this.
Thanks
In the code snippet you gave, "TOKEN" is any value you wish to insert, such as an HTML <br /> tag, more Environment.NewLines for formatting, or just some random delimiter that will later allow you to split the text on it.
A very simple example:
string text = textBox1.Text.Replace(Environment.NewLine, "^"); // a random token
string[] lines = text.Split( '^' );
If you are handling input from a textbox available on the web, you also need to take into account XSS (http://en.wikipedia.org/wiki/Cross-site_scripting). Also, in a real scenario I would split on a more complex token and make sure to handle multiple carriage returns in the input value.
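If the goal is only to get the individual lines, the intermediate token can be skipped by splitting on the line breaks directly. The sketch below is one way to do that; the class and method names are hypothetical. It treats "\r\n", "\n" and "\r" as line breaks and drops empty lines, which also takes care of repeated carriage returns.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class TextBoxLines
{
    public static string[] SplitLines(string text)
    {
        // Splitting on the separators themselves means no token can
        // collide with characters the user actually typed.
        return text.Split(new[] { "\r\n", "\n", "\r" },
                          StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Use StringSplitOptions.None instead if blank lines in the textbox are meaningful and should be kept.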
EDIT: now that I see your actual requirements, this code may do what you need:
// replace newlines with a single whitespace
string text = textBox1.Text.Replace(Environment.NewLine, " ");
EDIT #2:
Further, I need to enter this data into SQLite and rewrite his whole application. The company does not wish to have information from the previous application entered into the new database. There are hyperlinks etc. embedded in the content, so if there is a way I can make the text box only accept RAW data, this would be the best.
Regular Expressions are the way to go for something like this, unless the data is structured enough to load into an XML or HTML DOM and process. You can build regular expressions in a variety of tools (do a Google search for a free online tester and you will find many). Once you have determined the expressions you need, you can use the Regex object in C# to match, replace, etc.
http://msdn.microsoft.com/en-us/library/ms228595(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace(v=VS.100).aspx
As it stands, "TOKEN" is just a meaningless string, unless it is used elsewhere in your code. You can replace "TOKEN" with any text you like.
Edit:
Okay, so you say you're removing NewLine's from your client's text. So you would do it like this. Paste their text into a textBox called textBox2, then use the following:
textBox2.Text = textBox2.Text.Replace(Environment.NewLine, string.Empty);

Regex for a string

It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples provided above, I need to match the string no matter how many times </div> occurs. If any other string occurs between </div> and <br>, such as <div>abc</div></div></div>DEF</div></div><br> or <div>abc</div></div></div></div></div>DEF<br>, then the regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are not tags in the abc part (or anything that has a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
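To check the anchored pattern from the notes against the four strings in the question, something like the following sketch can be used. The class and method names are hypothetical. The backslash before the slash in <\/div> is dropped because '/' needs no escaping in a .NET pattern.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class DivPattern
{
    // <div>, some text without '<', any number of </div>, then <br>.
    private static readonly Regex Pattern =
        new Regex(@"^<div>([^<]+)(?:</div>)*<br>$");

    public static bool IsMatch(string s)
    {
        return Pattern.IsMatch(s);
    }
}
```

Both samples from the question match, while the two variants containing DEF between the closing tags and <br> do not.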
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think, this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ at the beginning and the end of my regex because we cannot be sure that your sample will always be on a single line.

Removing <div>'s from text file?

I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed from the BBC website on load and then looks for keywords which either increment or decrease the percentage chance of DOOM.
It's a crazy little project, but maybe one day the classes will come in handy again for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
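If only the div tags themselves should go while their content stays, a sketch along these lines may be enough. The class and method names are hypothetical. A plain string.Replace on "<div>" and "</div>" works when the tags never carry attributes; the regex below is a small step up that also handles attributes and mixed case.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class DivStripper
{
    public static string StripDivTags(string html)
    {
        // </?div   opening or closing tag
        // \b       so that a tag such as <divider> is not matched
        // [^>]*    any attributes up to the closing '>'
        return Regex.Replace(html, @"</?div\b[^>]*>", string.Empty,
                             RegexOptions.IgnoreCase);
    }
}
```

This assumes attribute values never contain a '>' character; for feeds where that can happen, an HTML parser is the safer choice.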
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
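would match every element that has paired opening and closing tags. Use RegexOptions.IgnoreCase with it, since the pattern as written only matches uppercase tag names.
Use this to remove them from your data.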
