Regex Split in Dictionary C# - c#

I have a text that contains many questions, and each question is preceded by a separator of the form QId {QuestionId}.
I want to split this text on QId {QuestionId} to get a dictionary with {QuestionId} as the key and the question body as the value. The question ID is dynamic and can be any integer.
I have tried Regex.Split with the regex "\\s*QId\\s*[\\d]+\\s*\\s*".
I can get the question body, but I also want the question ID as a key so I can take some action based on it.
I have tried the code below:
Regex.Split(
    text,
    string.Format("\\s*{0}\\s*[\\d]+\\s*\\s*", "QID"),
    RegexOptions.Singleline);
It did not give me the data as a dictionary.

Your input does not look like this:
QID 1
Where, oh where, did the Joel Data go?
QID 2
What time is it?
QID 3
In what time zone?
It's more something like this:
<p><B>QID 1</B><BR>Where, oh where, did the Joel Data go?</p>
<p><B>QID 2</B><BR>What time is it?</p>
<p><B>QID 3</B><BR> In what time zone?</p>
This means you will have to take the HTML tags into account, and \s (whitespace) alone will not be enough.
HTML and regex don't mix well, so you may want to consider using an HTML library or some other approach (maybe string.Split). Please see "RegEx match open tags except XHTML self-contained tags" and its legendary answer.
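For the plain-text case only (input shaped like the first sample above, with no HTML), the ID and the body can both be captured with Regex.Matches instead of Regex.Split. This is a minimal sketch under that assumption; the class name QuestionSplitter and the method ToDictionary are hypothetical names chosen for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuestionSplitter
{
    // Hypothetical helper: maps each question ID to its question body.
    // Assumes plain text in which every question starts with "QID <number>".
    public static Dictionary<int, string> ToDictionary(string text)
    {
        // "id" captures the number; "body" lazily captures everything up to
        // the next "QID <number>" marker or the end of the input.
        var pattern = new Regex(
            @"QID\s*(?<id>\d+)\s*(?<body>.*?)(?=\s*QID\s*\d+|\s*\z)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        var result = new Dictionary<int, string>();
        foreach (Match m in pattern.Matches(text))
        {
            // The indexer overwrites the body if the same ID appears twice.
            result[int.Parse(m.Groups["id"].Value)] = m.Groups["body"].Value;
        }
        return result;
    }
}
```

Calling QuestionSplitter.ToDictionary(text) on the three-question sample gives a dictionary with keys 1, 2 and 3. If the input is HTML, as suspected above, the markup would still have to be handled first.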

Related

Stop regex from spanning across unrequired content

I need to extract a series of meaningful values from a file. The basic pattern for the values I need to match looks like:
"indicator\..+?"\[true\]
Unfortunately, in places this is spanning across quite a bit of content to get a true match, and the lazy quantifier (?) is not being as lazy as I'd like.
How do I modify the above so that out of the following:
"indicator.value here"[false],"other content","more other
content","indicator don't match this one because the full stop is missing"[true],"indicator.this is the
value I want matched"[true]
only this value is returned: "indicator.this is the value I want matched"[true]
Currently, that whole string is being returned by my above regex.
Assuming commas are the delimiter - simply avoid matching on them:
@"""indicator\.[^,]+?""\[true\]"
Try using "indicator\.(.*)?"\[true\] instead and see if that helps. I think the lazy only applies to the * operator. I vaguely remember having this issue years ago.
You can leverage the discard technique by discarding the pattern you don't want. So, you could have something like this:
"indicator\..+?"\[false\]|"indicator\.(.+?)"\[true\]
Discard this pattern --^ Capture this --^
Working demo
Match information
MATCH 1
1. [150-182] `this is the value I want matched`
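Both suggestions (excluding commas, and the discard technique) can be tried side by side in C#. The following is only a sketch: the class and method names are made up for illustration, a capture group has been added to the comma-excluding pattern so that just the value is returned, and the input is assumed to use commas as delimiters as stated above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IndicatorMatcher
{
    // Variant 1: never match across a comma, so a match cannot
    // run from one quoted item into the next.
    public static List<string> MatchExcludingCommas(string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, @"""indicator\.([^,]+?)""\[true\]"))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    // Variant 2: the discard technique. The first alternative consumes
    // the unwanted [false] items; only the second alternative captures.
    public static List<string> MatchWithDiscard(string input)
    {
        var values = new List<string>();
        string pattern = @"""indicator\..+?""\[false\]|""indicator\.(.+?)""\[true\]";
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            // Group 1 is set only when the [true] alternative matched.
            if (m.Groups[1].Success)
            {
                values.Add(m.Groups[1].Value);
            }
        }
        return values;
    }
}
```

On the sample input from the question, both methods should return the single value "this is the value I want matched".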

Regular expression - how to match xml value [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I want to use regular expression to get the airline code between <AirlineCode> and </AirlineCode> tags.
I only want the values of the <AirlineCode> tags that are within the <Flight> tags. There are more <AirlineCode> tags outside, and I don't want the airline values from them.
I tried the regex below, but it gives me all airline codes regardless of their position. Please help.
var regex = new Regex(@"<AirlineCode>(.*?)</AirlineCode>", RegexOptions.IgnoreCase);
Match m = regex.Match("<PNRViewRS><AirGroup><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>DL</AirlineCode></Carrier></Flight><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>AA</AirlineCode></Carrier></Flight></AirGroup></PNRViewRS>");
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    for (int i = 1; i <= 2; i++)
    {
        Group g = m.Groups[i];
        //do stuff...
    }
    m = m.NextMatch();
}
In general, it's a bad idea to try parsing XML with regular expressions. The reason is that regex is insufficiently expressive, even with back references and such. The questions linked in the comments are worth reading to understand why this is generally a bad idea.
That said, you can be successful if you know for certain the format of your file, and if you're willing to do a little non-regex parsing as well.
In your situation, you have essentially:
<Flight>
    <AirlineCode>
    </AirlineCode>
</Flight>
<AirlineCode>
</AirlineCode>
<Flight>
    <AirlineCode>
    </AirlineCode>
</Flight>
And you want all of the <AirlineCode> tags that occur within <Flight> tags.
The way to approach this problem is to extract the <Flight> tags and their contents with one regex, and then use another regex to extract the <AirlineCode> tags from those extracted <Flight> tags. Don't try to do it in a single regular expression. You will not succeed.
If your data really is that simple, then this will work. I won't say that I recommend this approach. There are too many things that can go wrong. Data formats have a distressing tendency to change, and that fragile regex solution is likely to break if the format changes even a little bit. An XML parser solution will be much more robust.
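The two-pass idea described above, and the parser-based alternative, might look roughly like this in C#. This is a sketch under assumptions: the class and method names are invented for illustration, <Flight> elements are assumed never to be nested inside each other, and the LINQ to XML version assumes the input is well-formed XML with exactly these element names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AirlineCodes
{
    // Two-pass regex approach: first isolate each <Flight>...</Flight>
    // block, then pull <AirlineCode> values out of that block only.
    public static List<string> WithRegex(string xml)
    {
        var flight = new Regex(@"<Flight\b[^>]*>(.*?)</Flight>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var code = new Regex(@"<AirlineCode>(.*?)</AirlineCode>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        var codes = new List<string>();
        foreach (Match f in flight.Matches(xml))
        {
            foreach (Match c in code.Matches(f.Groups[1].Value))
            {
                codes.Add(c.Groups[1].Value);
            }
        }
        return codes;
    }

    // Parser-based alternative: select AirlineCode elements that are
    // descendants of a Flight element (element names are case-sensitive).
    public static List<string> WithXmlParser(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("Flight")
            .Descendants("AirlineCode")
            .Select(e => e.Value)
            .ToList();
    }
}
```

Both methods skip an <AirlineCode> that sits outside every <Flight>, but only the parser version copes with attribute changes, comments, or CDATA without further work.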

regex to capture multiple groups separated by an initial delimiter

I have a string like this:
|T1| This is some text for the first tag |T2| this is some text for the second tag
I need to parse out the tags and the text that is associated with each one. The tags are not known ahead of time but they are delimited by \|\w+\|.
I know there is something I can do here with capturing groups and so on, but after messing around in PowerShell the best I can come up with is to first isolate each pairing using \|\w+\|.* with the ExplicitCapture option and then parse out the tag and text from there.
But that is doing double the work and totally not super-cool haxor. What's the regex-pro way to do this?
Edit: Actually I realize that it's late and I misread my results. The above doesn't actually work so now I don't even have a bad solution.
\|(?<tag>\w+)\|(?<text>[^|]*)
Matches |T1| This is some text for the first tag |T2| this is some text for the second tag
into
|T1| This is some text for the first tag
|T2| this is some text for the second tag
EDIT:
Use Regex groups to get the parts of each match:
var tagName = match.Groups["tag"].Value;
var text = match.Groups["text"].Value;
Switched to named groups instead of numbered ones.
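Put together, the pattern and the group lookups can be wrapped in a loop over all matches. A minimal sketch follows; the TagParser class and its Parse method are hypothetical names, and trimming the text is an added assumption about what the caller wants.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagParser
{
    // Returns a tag -> text map for input such as "|T1| some text |T2| more text".
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match match in Regex.Matches(input, @"\|(?<tag>\w+)\|(?<text>[^|]*)"))
        {
            string tagName = match.Groups["tag"].Value;
            // Trim the spaces that surround the text between two tags.
            string text = match.Groups["text"].Value.Trim();
            result[tagName] = text;
        }
        return result;
    }
}
```

Note that [^|]* stops at the first pipe, so this only works while the text itself never contains a | character.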

Parse Text with RegEx?

I need to Parse Values out of a Text that looks like this:
Description. Question?
A. First Answer
B. Second Answer
C. Third Answer
Answer: A, B
Now I need to find the description, the question, the answers, and which answers are correct. Is that possible with regex? I know it should be possible, but I'm not a regex expert.
Seriously, regex is great, but once the parsing logic becomes advanced, so does the regex needed to solve the problem. I would suggest breaking the logic up into smaller pieces (I take it you have some sort of scripting language available to do some preprocessing?).
Even if you get the whole thing matched with one killer regex, changing it later (by you or some other sorry person) would be a pain.
I would match the answers with something like this (You'd need to strip the commas):
^Answer: (\w,?)+
And then I'd do logic to reparse the text with the answers found with the first regex, with something like this (rebuilding the match, in this case A was an answer):
^A\.\s(.*)
It might not be something to flash at your friends, but it will be easier to maintain, and a heck of a lot easier to understand.
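The two-step approach above could be sketched in C# as follows. The names QuizAnswers and CorrectAnswerTexts are hypothetical, and the first pattern is a looser variant of ^Answer: (\w,?)+ that captures the whole line so that the space after each comma in "Answer: A, B" does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuizAnswers
{
    // Returns the texts of the answers named on the "Answer:" line.
    public static List<string> CorrectAnswerTexts(string text)
    {
        var texts = new List<string>();

        // Step 1: find the "Answer: A, B" line and capture the letter list.
        Match answerLine = Regex.Match(text, @"^Answer:\s*(.+?)\s*$", RegexOptions.Multiline);
        if (!answerLine.Success)
        {
            return texts;
        }

        foreach (string part in answerLine.Groups[1].Value.Split(','))
        {
            string letter = part.Trim();
            if (letter.Length == 0)
            {
                continue;
            }

            // Step 2: rebuild the per-answer pattern, e.g. ^A\.\s(.*) for "A".
            string pattern = "^" + Regex.Escape(letter) + @"\.\s(.*?)\s*$";
            Match answer = Regex.Match(text, pattern, RegexOptions.Multiline);
            if (answer.Success)
            {
                texts.Add(answer.Groups[1].Value);
            }
        }
        return texts;
    }
}
```

For the sample text in the question this should yield "First Answer" and "Second Answer".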
Just about anything you could possibly want to do with parsing text is possible with regular expressions, but you will have to invest some time to learn them. How tricky your particular task is depends on how consistent your body of text is. So in short: yes, but don't ask me for the regex! Good luck.
If you could be more specific with your example and show an actual question and description it would be easier to tell for sure, but if I'm reading this right you could find all the text up to the last full stop "." before the question-mark "?", then find the text after it up to the question mark "?", and finally use the letters with full stops "." right after them, so something like this pseudo:
lastFullStopBeforeQ = text.substring(0 to first question mark).lastIndexOf(".")
Description = text.substring(0 to lastFullStopBeforeQ)
Question = text.substring(lastFullStopBeforeQ+1 to first question mark)
Answers[0] = text.substring(first question mark+1 to next "\n") ...
CorrectAnswers[0] = text.substring(next index of "Answer:" to next ",") ...
I know this is possible using C#; if you use something else, then I can't give you a clear answer.
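The first three lines of the pseudocode translate fairly directly to C# string methods. This is a sketch only: QuizText and SplitHeader are made-up names, and it assumes the description ends at the last full stop before the first question mark, as described above.

```csharp
using System;

public static class QuizText
{
    // Splits "Description. Question?" at the last full stop that
    // precedes the first question mark.
    public static (string Description, string Question) SplitHeader(string text)
    {
        int questionMark = text.IndexOf('?');
        if (questionMark < 0)
        {
            throw new FormatException("No question mark found.");
        }

        // LastIndexOf searches backwards, starting at the question mark.
        int lastFullStop = text.LastIndexOf('.', questionMark);

        // With no full stop, the description is empty and everything
        // up to the question mark is treated as the question.
        string description = lastFullStop < 0
            ? string.Empty
            : text.Substring(0, lastFullStop).Trim();
        string question = text
            .Substring(lastFullStop + 1, questionMark - lastFullStop)
            .Trim();

        return (description, question);
    }
}
```

The answers and the "Answer:" line could be pulled out the same way with IndexOf and Substring, one line at a time.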

Removing <div>'s from text file?

I've made a small program in C#/.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It fetches an RSS feed from the BBC website on load and then looks for keywords which either increment or decrease the percentage chance of DOOM.
A crazy little project, but maybe one day the classes will come in handy again for something more important.
I receive the RSS in XML format, but it contains a lot of div tags and formatting characters which I don't really want in the keyword database.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
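As a rough illustration of the two options discussed so far (removing div elements together with their content, or stripping every tag but keeping the text), here is a small sketch. HtmlCleaner and its method names are hypothetical, div elements are assumed not to be nested, and the tag-stripping pattern <[^>]*> is a simpler stand-in for <(.|\n)*?> that behaves the same on ordinary markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Removes every <div ...>...</div> pair together with its content.
    // Assumes div elements are not nested inside each other.
    public static string RemoveDivsWithContent(string html)
    {
        return Regex.Replace(html, @"<div\b[^>]*>.*?</div>", string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    // Removes all tags but keeps the text between them.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }
}
```

For feeds with nested or malformed markup, an HTML or XML parser is likely to be more dependable than either pattern.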
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
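would match every paired HTML tag together with its content (combine it with RegexOptions.IgnoreCase if the tag names are lowercase).
Use this to remove them from your data.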
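One way to use that pattern in C# is to replace each matched element with its second capture group, which keeps the inner text and drops only the tags. The sketch below is an assumption about what "remove them" should mean here; PairedTagRemover and Unwrap are invented names, and the loop repeats the replacement so that nested elements are unwrapped as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PairedTagRemover
{
    // \1 refers back to the tag name, so an opening tag only pairs
    // with a closing tag of the same name.
    private static readonly Regex PairedTag = new Regex(
        @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces each paired element with its inner text, repeating
    // until no paired tags are left.
    public static string Unwrap(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = PairedTag.Replace(html, "$2");
        } while (html != previous);
        return html;
    }
}
```

Self-closing or unpaired tags such as <br> are not touched by this pattern. Replacing with string.Empty instead of "$2" would delete the content along with the tags.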
