C# Regex Replace ignore specific string - c#

Since this is my first question here on stackoverflow I hope my question is correctly asked.
Basicly I have a normal .txt file which contains any text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: With another function, you can search and replace text in the textfile, but as you can guess if someone wants to replace the word "Text" it will be replaced in the xml-stylish comment(timestamp) as well.
My result until now is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pro's will see without executing it.
Now I hope that some regex pro could help me out here and provide me a search pattern which replaces the normal text but ignores the timestamp.
I'm not realy aware of the logic from regex until now, nevertheless I understand the single expressions so this would be a hook for me to understand Regex more properly.
Thanks in advice.

If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easist way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, #"(?<!<!--.*?)" + input, replace);
What you're doing is attempting to replace a repetition of any length of a character that is NOT <+.*> or a character contained in input with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to #stema

Related

C# Matching a recursive structure in a string

I'm currently working on a program that loads up a text file, searches through it to find a specific structure, and then replaces a certain part of that structure with different text.
The structure I need to find and extract is "N"(N) where N is any character. For example. Lets say I had a text file like this:
Everyone knows the saying "Do not do more than you can do" (Jim Doe).
Well, I'm here to tell you that this saying is awesome. Here is
another, "The sky is blue and the sun is bright" (Sally Wantsmore).
I would want to be able to match the text "Do not do more than you can do" (Jim Doe) along with "The sky is blue and the sun is bright" (Sally Wantsmore).
I don't think there is really a way to do this with a regular expression from the best of my knowledge. I've been trying for the last few days. I can't get it to work, it's a recursive pattern by nature. My question is, how would I go about writing C# code to parse through and find these patterns. I would like to do something where I can find the position within the string and the length, that way I can then extract it from the string.
EDIT
I need to be able to capture all characters in the quote. This means that there could also be another set of quotes within the quote and even another set of parenthesis. This means that the structure could also contain a match within itself.
I'm now trying to use this expression because I need to be able to capture all characters within a quote: \"(.+)\" \(([\w ]+)\)
The listed answers below both work. However, I've discovered a limitation. There is a possible recursive structure to this. The problem I am currently having is when there is a "N"(N) inside of a "N"(N)". For example:
"Random quote" (random person) Here is a fun saying, "The sky is blue and
the sun is bright, some even say "really bright" (others)" (Sally
Wantsmore).
This presents many problems. There is only one match because it takes the very first ", and then finds the last " just after (others) and finds the set of parens for (Sally Wantsmore) and only finds that match. However, I desire for it to find all the matches, especially the beginning one and last one separably, and even the inner quote. Is this possible with Regular expressions? If not, how do I go about solving this with Recursive c# code.
The following regex should find the two things you're looking for:
\"([\w ]+)\" \(([\w ]+)\)
In C# you can use Regex.Match to retrieve the two items in brackets.
An example on how you could have it in C#:
var quotes = Regex.Matches(#"Everyone knows the saying ""Do not do more than you can do"" (Jim Doe). Well, I'm here to tell you that this saying is awesome. Here is another, ""The sky is blue and the sun is bright"" (Sally Wantsmore).",
"(?<Quotes>\"(?<Text>[\\w ]+)\\\" \\((?<Author>[\\w ]+)\\))", RegexOptions.Singleline);
foreach (Match quote in quotes)
{
var text = quote.Groups["Text"].Value;
var author = quote.Groups["Author"].Value;
Console.WriteLine($"Text: {text}, Author: {author}");
}

Outputting Programmatically to MSword; sensing end of line

I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formated text (isolated arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need the words to be on the same line, without wrapping, which is the default behavior. I'm finding this difficult because when I have the Arabic letters of the word isolated with spaces, they are treated as individual characters and therefore behave differently then connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know it has been wrapped, and can remove the last character or reformat accordingly
Actually I don't know how this behaves today, but I've written something for the MSWork API when I was facing a somewhat weird fact. Actually you can't find that out. In MSWord, text in a document is always in paragraphs.
If you input text to your document, you won't get it in a page only, but this page will at least contain a paragraph for the text you wrote into it.
Unfortunately I can't figure this out again, because I don't have a license for MS Word these day.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski
I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume that non breaking space is what you need.
Here's more details.

multiline textbox to string

I have a multiline textbox that I wish to convert to a string,
I found this
string textBoxValue = textBox1.Text.Replace(Environment.NewLine,"TOKEN");
But dont understand TOKEN what is TOKEN? whitespace or /n newline ?
If this is the incorrect answer then Please let me know of the correct way of doing this
Thanks
In the code snippet you gave, "TOKEN" is any value you wish to insert, such as an HTML <br /> tag, more Environment.NewLines for formatting, or just some random delimiter that will later allow you to split the text on it.
A very simple example:
string text = textBox1.Text.Replace(Environment.NewLine, "^"); // a random token
string[] lines = test.Split( '^' );
If you are handling input from a textbox available on the web, you also need to take into account XSS (http://en.wikipedia.org/wiki/Cross-site_scripting). Also, in a real scenario I would split on a more complex token and make sure to handle multiple carriage returns in the input value.
EDIT: now that I see your actual requirements, this code may do what you need:
// replace newlines with a single whitespace
string text = textBox1.Text.Replace(Environment.NewLine, " ");
EDIT #2:
further I need to enter this data into
SQLite and rewrite his whole
application, The company does not wish
to have information from the previos
application inputted to the new
database, there are hyperlinks etc
inbedded in the content , so if there
is a way I can make the text box only
accept RAW data this would be the
best.
Regular Expressions are the way to go for something like this, unless the data is structured enough to load into an XML or HTML DOM and process. You can build regular expressions in a variety of tools (do a Google search for a free online tester and you will find many). Once you have determined the expressions you need, you can use the Regex object in C# to match, replace, etc.
http://msdn.microsoft.com/en-us/library/ms228595(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace(v=VS.100).aspx
As it stands, "TOKEN" is just a meangingless string, unless it is elsewhere in your code? You can replace "TOKEN" with any text you like.
Edit:
Okay, so you say you're removing NewLine's from your client's text. So you would do it like this. Paste their text into a textBox called textBox2, then use the following:
textBox2.Text = textBox2.Text.Replace(Environment.NewLine, string.Empty);

Wikilinks - turn the text [[a]] into an internal link

I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, what expression would do this? Is there a library out there somewhere that already does this in C#?
On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure to only capture legitimate wiki links texts, without closing square brackets or newline characters.
([^\] ]\S+) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am note sure if there is C# libraries already implementing this kind of detection.
However, to make that kind of detection really full-proof with regexp, you should use the pushdown automaton present in the C# regexp engine, as illustrated here.
I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with \1\3
match \[\[(.+?)\]\](\S+) and replace with \1\2
Or something like that, anyway.
Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% there, but here is the last bit for anyone looking for code to get straight on with trying:
string html = "Some text with a wiki style [[page2.html|link]]";
html = Regex.Replace(html, #"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)", #"$2$3");
html = Regex.Replace(html, #"\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)", #"$1$2");
The only change to the actual regex is I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them.

Removing <div>'s from text file?

Ive made a small program in C#.net which doesnt really serve much of a purpose, its tells you the chance of your DOOM based on todays news lol. It takes an RSS on load from the BBC website and will then look for key words which either increment of decrease the percentage chance of DOOM.
Crazy little project which maybe one day the classes will come uin handy to use again for something more important.
I recieve the RSS in an xml format but it contains alot of div tags and formatting characters which i dont really want to be in the database of keywords,
What is the best way of removing these unwanted characters and div's?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>[^" + Regex.Escape(end) + "]*)" + Regex.Escape(end), string.Empty);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, #"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Would highlight all HTML tags.
Use this to remove them form your data.

Categories

Resources