Parse Text with RegEx?

I need to parse values out of a text that looks like this:
Description. Question?
A. First Answer
B. Second Answer
C. Third Answer
Answer: A, B
Now I need to find the description, the question, the answers, and which answers are correct. Is that possible with regex? I know it should be possible, but I'm not a regex expert.

Seriously, regex is great, but once the parsing logic becomes advanced, so does the regex needed to solve the problem. I would suggest breaking the logic up into smaller pieces (I take it you have some sort of scripting language available to do some preprocessing?).
Even if you get the whole thing matched with one killer regex, changing it later (by you or some other sorry person) would be a pain.
I would match the answer line with something like this, in multiline mode so that ^ matches at the start of each line (you'd need to strip the commas and spaces from the captures):
^Answer: (\w,?\s*)+
Then I'd reparse the text using the answers found by the first regex, rebuilding the match for each one. For example, if A was an answer:
^A\.\s(.*)
It might not be something to flash your friends with, but it will be easier to maintain, and a heck of a lot easier to understand.
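The two-step idea above could be sketched in C# roughly as follows. This is only a minimal sketch: the class and method names (TwoStepQuizParser, CorrectLetters, AnswerText) are made up for illustration, and it assumes the input always has a single "Answer:" line and answers of the form "A. text".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the "small pieces" approach.
public static class TwoStepQuizParser
{
    // Step 1: read the letters listed on the "Answer:" line.
    public static List<string> CorrectLetters(string text)
    {
        var letters = new List<string>();
        Match line = Regex.Match(text, @"^Answer:\s*(.+?)\s*$", RegexOptions.Multiline);
        if (!line.Success) return letters;

        // Split on commas and trim, instead of capturing each letter in the regex.
        foreach (string part in line.Groups[1].Value.Split(','))
        {
            string letter = part.Trim();
            if (letter.Length > 0) letters.Add(letter);
        }
        return letters;
    }

    // Step 2: rebuild a small pattern per letter to fetch that answer's text.
    // Returns an empty string when no line starts with "<letter>.".
    public static string AnswerText(string text, string letter)
    {
        string pattern = "^" + Regex.Escape(letter) + @"\.\s*(.*?)\s*$";
        Match m = Regex.Match(text, pattern, RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For the sample text in the question, CorrectLetters yields A and B, and AnswerText(text, "B") yields "Second Answer".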

Just about anything you could possibly want to do with parsing text is possible with regular expressions, but you will have to invest some time to learn them. How tricky your particular task is depends on how consistent your body of text is. So in short, yes, but don't ask me for the regex! Good luck.

If you could be more specific with your example and show an actual question and description, it would be easier to tell for sure. If I'm reading this right, you could find all the text up to the last full stop "." before the question mark "?", then take the text after it up to the question mark "?", and finally use the letters with full stops "." right after them. So something like this pseudocode:
lastFullStopBeforeQ = text.substring(0 to first question mark).lastIndexOf(".")
Description = text.substring(0 to lastFullStopBeforeQ)
Question = text.substring(lastFullStopBeforeQ+1 to first question mark)
Answers[0] = text.substring(first question mark+1 to next "\n") ...
CorrectAnswers[0] = text.substring(next index of "Answer:" to next ",") ...
I know this is possible in C#; if you use something else, I can't give you a clear answer.
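As a rough illustration, the first three lines of that pseudocode might look like this in C#. The names SubstringQuizParser and SplitHeader are invented for this sketch, and it assumes the description itself contains no question mark.

```csharp
using System;

// Hypothetical helper: splits "Description. Question?" without any regex.
public static class SubstringQuizParser
{
    public static (string Description, string Question) SplitHeader(string text)
    {
        int questionMark = text.IndexOf('?');
        if (questionMark < 0) throw new FormatException("No question mark found.");

        // Search backwards from the question mark for the last full stop.
        int lastFullStop = text.LastIndexOf('.', questionMark);

        // Everything up to and including that full stop is the description.
        string description = lastFullStop < 0
            ? string.Empty
            : text.Substring(0, lastFullStop + 1).Trim();

        // Everything after it, up to and including the '?', is the question.
        string question = text.Substring(lastFullStop + 1, questionMark - lastFullStop).Trim();

        return (description, question);
    }
}
```

With the sample from the question, this gives "Description." and "Question?". The answers and the "Answer:" line can be handled the same way, one line at a time.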

Related

Regex find and replace

I am not very good at regex, and frankly find it difficult to wrap my head around. Therefore my question may not make any sense.
Could you use regular expressions so that, when someone enters a string, the closest fit is found in a list and the input is made to match one of the entries?
Here is what the list might look like.
QR9456
QR6222
QR9487
QR2311
QR2311 AB
QR2311 A
QR4781
QR4781 A
XX920-009
QR9456 Z
I apologize if this question can't be answered or doesn't make sense.

Nope. Regular expressions only describe exact matches against the patterns you specify. I doubt you could handcraft patterns to match the list above satisfactorily (much less define regexes to match any list).
It sounds like what you are after is a fuzzy search algorithm such as bitap.
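To give an idea of what a fuzzy lookup could look like, here is a small sketch that uses plain Levenshtein edit distance rather than bitap (a simpler technique, swapped in because it fits in a few lines). The names ClosestMatch, Distance and Find are illustrative only, and ties simply go to whichever candidate comes first in the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: picks the list entry with the smallest edit distance.
public static class ClosestMatch
{
    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    public static int Distance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Returns the closest candidate (case-insensitive), or null for an empty list.
    public static string Find(string input, IEnumerable<string> candidates)
    {
        string best = null;
        int bestDistance = int.MaxValue;
        foreach (string candidate in candidates)
        {
            int d = Distance(input.ToUpperInvariant(), candidate.ToUpperInvariant());
            if (d < bestDistance) { bestDistance = d; best = candidate; }
        }
        return best;
    }
}
```

For example, looking up "xx920009" in the list above returns "XX920-009". In practice you would probably also want a maximum distance beyond which the input is rejected instead of silently corrected.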

C# Regex Replace ignore specific string

Since this is my first question here on Stack Overflow, I hope it is asked correctly.
Basically, I have a normal .txt file which contains text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and adds a little timestamp.
So my problem is: with another function, you can search and replace text in the text file, but as you can guess, if someone wants to replace the word "Text", it will be replaced in the XML-style comment (the timestamp) as well.
My attempt so far is:
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pros will see without executing it.
Now I hope that a regex pro can help me out here and provide a search pattern which replaces the normal text but ignores the timestamp.
I don't really understand the logic of regex yet, although I understand the individual expressions, so this would be a hook for me to understand regex more properly.
Thanks in advance.

If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easiest way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, @"(?<!<!--.*?)" + input, replace);
What your current pattern does is attempt to replace a run, of any length, of characters that are NOT one of <, +, ., *, > or a character contained in input with the value in replace.
If you're going to be working a lot with regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to regex that I've found, and the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to @stema
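As a possible alternative to the lookbehind, here is a sketch that matches either a whole comment or the search term and only replaces the latter. The names CommentAwareReplace and ReplaceOutsideComments are made up for this example; it assumes comments are always closed with --> and escapes the search term so that characters like "." or "(" in the user's input are treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces a search term everywhere except inside <!-- ... --> comments.
public static class CommentAwareReplace
{
    public static string ReplaceOutsideComments(string content, string input, string replace)
    {
        // First alternative consumes a whole comment (group 1);
        // second alternative is the literal search term.
        string pattern = @"(<!--.*?-->)|" + Regex.Escape(input);

        return Regex.Replace(
            content,
            pattern,
            // Comments are put back unchanged; everything else is replaced.
            m => m.Groups[1].Success ? m.Value : replace,
            RegexOptions.Singleline);
    }
}
```

Because the comment alternative comes first and swallows the entire comment, any occurrence of the search term inside it is never seen as a separate match.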

Text macros - replace them with function result

I need to introduce some text macros, for example:
"Some text here, some text here #from_file[a.txt,2,N] and here and here"
The #from_file[a.txt,2,N] macro should get 2 random lines from a.txt and join them with a newline character; another, #from_file[a.txt,5,S], should take 5 random lines and join them with a space.
Of course I need some other macros too: #random[0-9] for a random number, #random[A-B,5] for a random string of 5 characters.
The macros could also use another format, e.g. {from_file:a.txt,2,N}.
My first idea was to use regular expressions, but maybe there is another solution to my problem?

It sounds like you want to create some sort of "general purpose" text-macro system. While I'm sure this can be done with regexps, it basically boils down to what you want it to be capable of, and how extensive and flexible it needs to be.
You basically need to define your grammar and constraints. Can the file name contain the macro-block terminator character '}'? If so, does it need to be escaped? Should escaping be supported? Are spaces within a macro block allowed?
Basically, find out how you want things to work, preferably as constrained as possible, as this means you can implement a simpler solution, and there might not be any need for a full-blown parser and the like.
Maybe a regex-based solution will be sufficient (although most certainly not very good). But before you can tell that, you need to spec better ;)
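For a tightly constrained grammar, a regex-based expander might be as small as the sketch below. It assumes the #name[arg1,arg2,...] form from the question, with no nesting, no escaping and no ']' inside the arguments. MacroExpander, Expand and the handler delegate are hypothetical names; what each macro actually does (reading a.txt, picking random lines) is left to the handler.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: finds #name[args] macros and lets a callback produce the replacement.
public static class MacroExpander
{
    // Group 1 is the macro name, group 2 the raw comma-separated argument list.
    private static readonly Regex MacroPattern = new Regex(@"#(\w+)\[([^\]]*)\]");

    public static string Expand(string text, Func<string, string[], string> handler)
    {
        return MacroPattern.Replace(text, m =>
        {
            string name = m.Groups[1].Value;
            string[] args = m.Groups[2].Value.Split(',');
            // The handler decides what each macro expands to.
            return handler(name, args);
        });
    }
}
```

A handler would typically switch on the name ("from_file", "random", ...) and return the original text for macros it does not recognise. As soon as the spec needs escaping or nested macros, this pattern stops being enough.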

What's the best way to parse a string for "bad" words in C#?

I'm thinking of something like:
foreach (var word in paragraph.Split(' ')) {
    if (badWordArray.Contains(word)) {
        // do something about it
    }
}
but I'm sure there's a better way.
Thanks in advance!
UPDATE
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto flagging system of sorts.

While your way works, it may be a bit time consuming. There is a wonderful response here for a previous SO question. Though the question talks about PHP instead of C#, I think it can be easily ported.
Edit to add sample code:
public string FilterWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.Replace(inputWords, "<3");
}
That should work for you, more or less.
Edit to answer OP clarification:
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.
Much like the replacement code above, you can check whether anything matches like so:
public bool HasBadWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.IsMatch(inputWords);
}
It will return true if the string you passed to it contains any words in the list.
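One thing to watch with a bare alternation is that it also matches inside longer words and is case-sensitive. A variation that might suit a flagging system better is sketched below; BadWordFlagger, BuildFilter and FindBadWords are names invented here, and the word list is assumed to come from your own configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds a whole-word, case-insensitive filter from a word list.
public static class BadWordFlagger
{
    public static Regex BuildFilter(IEnumerable<string> words)
    {
        // Escape each word so regex metacharacters in the list are taken literally.
        string alternatives = string.Join("|", words.Select(Regex.Escape));
        // \b on both sides restricts matches to whole words.
        return new Regex(@"\b(?:" + alternatives + @")\b", RegexOptions.IgnoreCase);
    }

    // Returns every flagged word as it appears in the input, for manual review.
    public static List<string> FindBadWords(string input, Regex filter)
    {
        return filter.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

Returning the matched words instead of a plain true/false makes it easier to review why a post was flagged.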

At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).
One tip is to pre-process the user input before testing it against your list, in case someone is trying to sneak something by you. By way of preprocessing, we:
uppercase everything in the input
remove most non-alphanumerics (that is, just splice out any spaces, punctuation, etc.)
and then, assuming someone is trying to pass off digits for letters, do something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative)
And then get some friends to try to break it. It's fun.
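The preprocessing steps above could be sketched like this. InputNormalizer and Normalize are illustrative names, and only the three digit substitutions mentioned in the text are included; a real table would be longer.

```csharp
using System;
using System.Text;

// Hypothetical helper: normalizes user input before it is checked against the word list.
public static class InputNormalizer
{
    public static string Normalize(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input.ToUpperInvariant())       // 1. uppercase everything
        {
            if (!char.IsLetterOrDigit(c)) continue;        // 2. drop spaces, punctuation, etc.

            switch (c)                                     // 3. map look-alike digits to letters
            {
                case '0': sb.Append('O'); break;
                case '9': sb.Append('G'); break;
                case '5': sb.Append('S'); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

For example, "d.0 9 5!" normalizes to "DOGS". Note that removing spaces joins neighbouring words together, so the normalized text is best checked with a substring search, and innocent words that happen to contain a listed word will be flagged too.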

You could consider using a hash-based collection such as HashSet<T> or Dictionary<TKey, TValue> instead of the array. The linear .Contains() scan over the array then becomes a hash lookup (.Contains() on a set, .ContainsKey() on a dictionary), which is far more efficient. This is especially true if you have a large list of profanities (not sure how many there are! :)
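A minimal sketch of the set-based lookup, assuming the word list fits in memory. BadWordSet and Flag are made-up names, and the separator list is just an example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: checks each word of a paragraph against a hash set.
public static class BadWordSet
{
    public static List<string> Flag(string paragraph, HashSet<string> badWords)
    {
        var hits = new List<string>();
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        foreach (string word in paragraph.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // HashSet<T>.Contains is a hash lookup, not a scan over the whole list.
            if (badWords.Contains(word)) hits.Add(word);
        }
        return hits;
    }
}
```

Creating the set with new HashSet<string>(StringComparer.OrdinalIgnoreCase) makes the lookup case-insensitive without having to change the case of the input first.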

Wikilinks - turn the text [[a]] into an internal link

I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, and if so, what expression would do this? Is there a library out there somewhere that already does this in C#?

On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure that you only capture legitimate wiki link texts, without closing square brackets or newline characters.
([^\] ]\S*) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am not sure whether there are C# libraries that already implement this kind of detection.
However, to make that kind of detection really foolproof with regexp, you should use the pushdown automaton present in the C# regexp engine (balancing group definitions), as illustrated here.

I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with <a href="\2">\1</a>\3
match \[\[(.+?)\]\](\S+) and replace with <a href="\1">\1</a>\2
Or something like that, anyway.

Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% of the way there, but here is the last bit for anyone looking for code to get straight on with trying:
string html = "Some text with a wiki style [[page2.html|link]]";
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)?", @"<a href=""$1"">$2</a>$3");
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)?", @"<a href=""$1"">$1</a>$2");
The main change to the actual regex is that I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them. The trailing group is also made optional here, because otherwise a link at the very end of the string (as in the sample above) or one followed by a space is not matched.
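If you would rather handle both link forms in one pass, something along these lines may work. It is only a sketch: WikiLinks and ToHtml are invented names, the pattern assumes [[target]] or [[target|text]] with no nesting, and the target and text are HTML-encoded with WebUtility.HtmlEncode so that user input cannot break out of the attribute.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: turns [[target]] and [[target|text]] into anchor tags in one pass.
public static class WikiLinks
{
    // Group 1 is the link target, optional group 2 the display text.
    // Neither part may contain ']', '|' or a line break.
    private static readonly Regex LinkPattern =
        new Regex(@"\[\[([^\]\|\r\n]+)(?:\|([^\]\|\r\n]+))?\]\]");

    public static string ToHtml(string input)
    {
        return LinkPattern.Replace(input, m =>
        {
            string target = m.Groups[1].Value;
            // Fall back to the target when no display text was given.
            string text = m.Groups[2].Success ? m.Groups[2].Value : target;
            return "<a href=\"" + WebUtility.HtmlEncode(target) + "\">"
                 + WebUtility.HtmlEncode(text) + "</a>";
        });
    }
}
```

Only the link parts are encoded here; the surrounding plain text is passed through untouched, so it still needs whatever encoding the rest of the page applies.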
