Regex to capture multiple groups separated by an initial delimiter - C#

I have a string like this:
|T1| This is some text for the first tag |T2| this is some text for the second tag
I need to parse out the tags and the text that is associated with each one. The tags are not known ahead of time but they are delimited by \|\w+\|.
I know there is something I can do here with capturing groups, but after messing around in PowerShell the best I can come up with is to first isolate each pairing using \|\w+\|.* with the ExplicitCapture option and then parse out the tag and text from there.
But that is doing double the work and totally not super-cool haxor. What's the regex-pro way to do this?
Edit: Actually I realize that it's late and I misread my results. The above doesn't actually work so now I don't even have a bad solution.

\|(?<tag>\w+)\|(?<text>[^|]*)
Matches |T1| This is some text for the first tag |T2| this is some text for the second tag
into
|T1| This is some text for the first tag
|T2| this is some text for the second tag
EDIT:
Use the match's Groups collection to get the parts of each match:
var tagName = match.Groups["tag"].Value;
var text = match.Groups["text"].Value;
Switched to named groups instead of numbered ones.
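
As a minimal sketch of how the pattern above could be applied to the whole string, the following loops over every match and collects tag/text pairs. The class name TagParser, the method name Parse, and the decision to trim surrounding whitespace are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class TagParser
{
    // Same pattern as in the answer: a |tag| followed by everything up to the next pipe.
    private static readonly Regex TagPattern =
        new Regex(@"\|(?<tag>\w+)\|(?<text>[^|]*)");

    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match match in TagPattern.Matches(input))
        {
            // Trim() is an assumption: the raw capture keeps the spaces around the text.
            result.Add(new KeyValuePair<string, string>(
                match.Groups["tag"].Value,
                match.Groups["text"].Value.Trim()));
        }
        return result;
    }
}
```

Because the text group stops at the next pipe character, this only works as long as the text itself never contains a pipe.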

Related

How do I remove specific paired tags?

I'm using C# and trying to remove a specific pair of tags from string like
Remove <color=#FFFFFF>White </color>Not <color=#000000>Black</color>
And what I want is
Remove White Not <color=#000000>Black</color>
I tried to do it by myself but failed. Is there a way to do this? Any help would be appreciated!

I'm not sure what exactly you've already tried, but this can easily be done with a regular expression replace:
var original = "Remove <color=#FFFFFF>White </color>Not <color=#000000>Black</color>";
var replaced = Regex.Replace(original, "<color=#FFFFFF>(.*?)<\\/color>", "$1");
In the regex pattern, you can see I captured what's within the color tag to preserve the text inside, and then replaced the entire thing with that group capture, essentially removing everything around it.
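
If the colour to strip is not fixed, the same idea might be wrapped in a small helper. This is only a sketch: the class ColorTagRemover, the method Remove, and its color parameter are hypothetical names, and Regex.Escape is used so that the colour value is treated as literal text inside the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class ColorTagRemover
{
    public static string Remove(string input, string color)
    {
        // Lazy (.*?) keeps the inner text and stops at the first closing tag.
        string pattern = "<color=" + Regex.Escape(color) + ">(.*?)</color>";
        return Regex.Replace(input, pattern, "$1");
    }
}
```

Like the original pattern, this assumes the color tags are not nested inside each other.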

Extract text between tags c#

I have to make a function that reads all substrings between all tags of format: <tag>sometext</tag>. 'tag' can be any alphanumeric character, and user can enter as many different tags as he wants, but without nested tags. I have to use regex-es...
I made something that prints first substring between first tags, but I can't figure out how to automate function to work from start to end of user input string...
Thanks!

You can use the back reference:
<([^>]+)>([^<]*)</(\1)>
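(\1) indicates that the closing tag must be the same text that was captured by the first group.
I put [^<]* as the content, but if you may have sub-elements, you should use .*? instead (lazy, so that a match stops at the first matching closing tag).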
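
To cover the "from start to end of the input" part of the question, one possible sketch runs the back-reference pattern through Regex.Matches and collects group 2 of every match. The class name TagTextExtractor and the method Extract are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class TagTextExtractor
{
    // Group 1 is the tag name, group 2 the text; \1 forces the closing tag to match the opening one.
    private static readonly Regex Element = new Regex(@"<([^>]+)>([^<]*)</\1>");

    public static List<string> Extract(string input)
    {
        var texts = new List<string>();
        foreach (Match match in Element.Matches(input))
        {
            texts.Add(match.Groups[2].Value);
        }
        return texts;
    }
}
```

A pair whose closing tag does not match its opening tag, such as <a>x</b>, is simply skipped.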

Try this:
<[a-zA-Z0-9]+>(.*?)</[a-zA-Z0-9]+>

C# Regex Replace ignore specific string

Since this is my first question here on Stack Overflow, I hope it is asked correctly.
Basically, I have a normal .txt file which contains text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: another function lets you search and replace text in the text file, but as you can guess, if someone wants to replace the word "Text", it will be replaced in the XML-style comment (the timestamp) as well.
What I have so far is:
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pros will see without executing it.
Now I hope that some regex pro can help me out here and provide a search pattern which replaces the normal text but ignores the timestamp.
I'm not really familiar with the logic of regex yet, but I do understand the individual expressions, so this would be a hook for me to understand regex more properly.
Thanks in advance.

If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easiest way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, @"(?<!<!--.*?)" + input, replace);
Your original pattern is a single negated character class, so it attempts to replace any run of characters that are NOT <, +, ., *, > or one of the characters in input with the value in replace.
If you're going to be working a lot with regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to regex that I've found, and the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to @stema
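
A slightly more defensive sketch of the same lookbehind is shown below. The class CommentAwareReplace and its method are hypothetical names. Two assumptions go beyond the answer: the search term is passed through Regex.Escape so that characters such as . or * in user input are matched literally, and a MatchEvaluator is used so that a $ in the replacement text is not interpreted as a substitution.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative only.
public static class CommentAwareReplace
{
    public static string Replace(string content, string input, string replace)
    {
        // Negative lookbehind: skip any hit that has "<!--" earlier on the same line
        // (without RegexOptions.Singleline, '.' does not cross a newline).
        string pattern = @"(?<!<!--.*?)" + Regex.Escape(input);
        return Regex.Replace(content, pattern, m => replace);
    }
}
```

Note that this skips everything after <!-- on the same line, including any text that follows the closing -->, which is fine when each timestamp comment sits on its own line as in the question.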

Checking a HTML string for unopened tags

I have a string containing HTML source and I want to check whether it contains a closing tag that was never opened.
For example, the string below contains </u> after WAVEFORM which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these kinds of unopened tags and then prepend the matching opening tag to the start of the string. How can I do that?

For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}

Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, i.e. if it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing so you can ignore it. (If you have XHTML, this is all a bit easier.)
If you have a start tag that does need closing, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there; otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close to the end, again in reverse order.
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)
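
The bookkeeping described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the class TagBalancer, the method Balance, a deliberately simplified token pattern (it does not handle a > inside an attribute value the way the longer pattern above does), and a short, incomplete list of EMPTY elements. It also does not check that an end tag matches the tag it takes off the list.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: names, pattern and tag list are illustrative only.
public static class TagBalancer
{
    // Simplified tokens: a start tag ("open"), an end tag ("close"), or a comment (neither group set).
    private static readonly Regex Token = new Regex(
        @"<(?<open>\w+)\b[^>]*>|</(?<close>\w+)\s*>|<!--.*?-->",
        RegexOptions.Singleline);

    // Incomplete list of EMPTY elements that never need a closing tag.
    private static readonly HashSet<string> EmptyTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    public static string Balance(string html)
    {
        var toOpen = new List<string>();
        var toClose = new List<string>();

        foreach (Match m in Token.Matches(html))
        {
            if (m.Groups["open"].Success)
            {
                string name = m.Groups["open"].Value;
                if (!EmptyTags.Contains(name)) toClose.Add(name);
            }
            else if (m.Groups["close"].Success)
            {
                // An end tag closes the most recent start tag, or is recorded as unopened.
                if (toClose.Count > 0) toClose.RemoveAt(toClose.Count - 1);
                else toOpen.Add(m.Groups["close"].Value);
            }
        }

        var sb = new StringBuilder();
        // The last unopened end tag seen belongs to the outermost element, so emit it first.
        for (int i = toOpen.Count - 1; i >= 0; i--) sb.Append('<').Append(toOpen[i]).Append('>');
        sb.Append(html);
        // Close the innermost still-open element first.
        for (int i = toClose.Count - 1; i >= 0; i--) sb.Append("</").Append(toClose[i]).Append('>');
        return sb.ToString();
    }
}
```

For the string in the question, this prepends a single <u> and leaves the rest unchanged.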

Removing <div>'s from text file?

I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed on load from the BBC website and will then look for key words which either increment or decrease the percentage chance of DOOM.
Crazy little project, but maybe one day the classes will come in handy again for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash

If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
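// Lazy .*? stops at the first closing tag; a negated character class built from "</div>" would reject any content containing the letters d, i or v.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);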
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>

IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.

A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
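would match each HTML element that has a matching closing tag (use RegexOptions.IgnoreCase, since the pattern only lists uppercase letters).
Use this to remove them from your data.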
