Wikilinks - turn the text [[a]] into an internal link - c#

I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, what expression would do this? Is there a library out there somewhere that already does this in C#?

On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure to only capture legitimate wiki links texts, without closing square brackets or newline characters.
([^\] ]\S+) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am note sure if there is C# libraries already implementing this kind of detection.
However, to make that kind of detection really full-proof with regexp, you should use the pushdown automaton present in the C# regexp engine, as illustrated here.

I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with \1\3
match \[\[(.+?)\]\](\S+) and replace with \1\2
Or something like that, anyway.

Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% there, but here is the last bit for anyone looking for code to get straight on with trying:
string html = "Some text with a wiki style [[page2.html|link]]";
html = Regex.Replace(html, #"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)", #"$2$3");
html = Regex.Replace(html, #"\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)", #"$1$2");
The only change to the actual regex is I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them.

Related

Parse Text with RegEx?

I need to Parse Values out of a Text that looks like this:
Description. Question?
A. First Answer
B. Second Answer
C. Third Answer
Answer: A, B
Now i need to find out the Description, the Question, the Answers and which Answers are correct. Is that Possible with RegEx? I know it should be possible, but I'm not an RegEx Expert.
Seriously Regex is great, but once the parsing logic becomes advanced, so does the regex needed to solve the problem. I would suggest breaking up the logic into smaller pieces (i take it you have some sort of scripting language available to do some preprocessing?)
Even if you get the whole thing matched with one killer regex - changing it later (by you or some other sorry person) would be a pain.
I would match the answers with something like this (You'd need to strip the commas):
^Answer: (\w,?)+
And then I'd do logic to reparse the text with the answers found with the first regex, with something like this (rebuilding the match, in this case A was an answer):
^A\.\s(.*)
It might not be something to flash your friends with, but it will be easier to maintain, and a heck lot easier to understand.
Just about anything you could possibly want to do with parsing text is possible with Regular Expressions, you will have to invest some time to learn it though. How tricky your particular task is depends on how consistent your body of text it. So in short, yes, but don't ask me for the Reg Ex! Good Luck.
If you could be more specific with your example and show an actual question and description it would be easier to tell for sure, but if I'm reading this right you could find all the text up to the last full stop "." before the question-mark "?", then find the text after it up to the question mark "?", and finally use the letters with full stops "." right after them, so something like this pseudo:
lastFullStopBeforeQ = text.substring(0 to first question
mark).lastIndexOf(".")
Description = text.substring(0 to lastFullStopBeforeQ)
Question = text.substring(lastFullStopBeforeQ+1 to first question
mark)
Answers[0] = text.substring(first question mark+1 to next "\n") ...
CorrectAnswers[0] = text.substring(next index of "Answer:" to next
",") ...
I know this is possible using C#, if you use something else then i can't give you a clear answer.

Regex Help (again)

I don't really know what to entitle this, but I need some help with regular expressions. Firstly, I want to clarify that I'm not trying to match HTML or XML, although it may look like it, it's not. The things below are part of a file format I use for a program I made to specify which details should be exported in that program. There is no hierarchy involved, just that each new line contains a 'tag':
<n>
This is matched with my program to find an enumeration, which tells my program to export the name value, anyway, I also have tags like this:
<adr:home>
This specifies the home address. I use the following regex:
<((?'TAG'.*):(?'SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
The problem is that the regex will split the adr:home tag fine, but fail to find the n tag because it lacks a colon, but when I add a ? or a *, it then doesn't split the adr:home and similar tags. Can anyone help? I'm sure it's only simple, it's just this is my first time at creating a regular expression. I'm working in C#, by the way.
Will this help
<((?'TAG'.*?)(?::(?'SUBTAG'.*))?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
I've wrapped the : capture into a non capturing group round subtag and made the tag capture non greedy
Not entirely sure what your aim is but try this:
(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['"](?'VALUE'[^'"]*)['"])?(?>>)
I find this site extremely useful for testing C# regex expressions.
What if you put the colon as part of the second tag?
<((?'TAG'.*)(?':SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>

C# Regex Replace ignore specific string

Since this is my first question here on stackoverflow I hope my question is correctly asked.
Basicly I have a normal .txt file which contains any text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: With another function, you can search and replace text in the textfile, but as you can guess if someone wants to replace the word "Text" it will be replaced in the xml-stylish comment(timestamp) as well.
My result until now is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pro's will see without executing it.
Now I hope that some regex pro could help me out here and provide me a search pattern which replaces the normal text but ignores the timestamp.
I'm not realy aware of the logic from regex until now, nevertheless I understand the single expressions so this would be a hook for me to understand Regex more properly.
Thanks in advice.
If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easist way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, #"(?<!<!--.*?)" + input, replace);
What you're doing is attempting to replace a repetition of any length of a character that is NOT <+.*> or a character contained in input with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to #stema

Regular expression in C# , is this possible?

I never use regular expression before and plan to use it to solve my problem but not quite sure whether it can help me.
I have a situation where I need store a rule or formula to build string values like following examples in a database field then retrieve this rule and build the string value.
FacilityCode + Left(ModelNO,2)
Right(PO,3) + Left(Serial,2)
Is this achievable using .net regular expression? Any good tutorial or simple examples of this problem.
Regexp : http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
But it doesn't seems fitting :)
It might be better to code some random string generator. Regex is for searching data not creating data.
The thing to remember about regex is that it is like an aircraft carrier; it does one thing very very well, it does not do other jobs very well at all.
An aircraft carrier moves planes very well on the ocean; it does not make a cheese sandwich well AT ALL!!
That is to say, if you use regex when you shouldn't you will almost certainly use far more processing power than if you used another tool for that job. Html parsing comes to mind.
Regex is provided as part of System.Text.RegularExpressions, but you can't rely exclusively on it. It'll let you search existing strings, but you'll need to implement your own logic for building new strings based on what you find in the existing data.
Also, keep in mind that System.Text.RegularExpressions works differently from regexp in Perl and other implementations. For example, it doesn't recognize POSIX character class definitions.
Since you're new to regex, you might want to check out the "Regular Expressions User Guide" on zytrax.com. It's not as comprehensive as an O'Reilly manual, but it'll do as a start.

Removing <div>'s from text file?

Ive made a small program in C#.net which doesnt really serve much of a purpose, its tells you the chance of your DOOM based on todays news lol. It takes an RSS on load from the BBC website and will then look for key words which either increment of decrease the percentage chance of DOOM.
Crazy little project which maybe one day the classes will come uin handy to use again for something more important.
I recieve the RSS in an xml format but it contains alot of div tags and formatting characters which i dont really want to be in the database of keywords,
What is the best way of removing these unwanted characters and div's?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>[^" + Regex.Escape(end) + "]*)" + Regex.Escape(end), string.Empty);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, #"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Would highlight all HTML tags.
Use this to remove them form your data.

Categories

Resources