Remove some bbcodes from code - c#

I am using a BBCode-to-HTML parser, and I run into a problem when the input contains tags that are not in the set my parser knows about.
For example:
My parser permits only the [b], [center] and [i] tags.
If I try to parse [u] or [color={anyColor}] tags, it throws an exception.
I would like to remove any tag that is not permitted.
At first I thought about blocking them in my textarea, but when users paste text with Ctrl+C/Ctrl+V the textarea fills with those tags, and I only notice once the data is already in my database.
What I have in mind:
The user enters a string containing unsupported tags.
I call some method to remove the tags that are not permitted (this is the part I am stuck on).
I save the data to my database.
Can anyone help me with this, or suggest something else?

After a quick look at the parser source from the link you provided, it seems that when the parser runs into a tag it does not know (meaning one not in the list of tags provided during instantiation), it raises an error.
As it stands, you have a few options:
Change your ErrorMode to ErrorFree. The parser will then no longer throw exceptions and will instead treat unknown tags as text.
Go with your original idea and restrict the input on the front end.
If you can, instead of going straight to HTML, add all possible tags to the parser, check whether you can get a C# object out of the parser, and eliminate the unwanted tags before outputting HTML.
Or, at the other end, after the HTML has been produced, disallow the unwanted tags in the generated HTML.
Send the authors of the parser an email, or (if you know German) open a ticket/issue on CodePlex, and ask them to add support for stripping unwanted tags.
Or, since you have the source, add the functionality to strip unwanted tags yourself.
This shouldn't be too hard, I think: follow the pattern they use for the current Tags list in BBCodeParser.cs, make a TagsToIgnore list, and add a check before the rest of the tag parsing that strips the tag and continues on to the next token.
EDIT:
You may be able to make the parser render the unwanted tags as nothing, at the point where you initialize the BBCodeParser:
var parser = new BBCodeParser(new[]
{
// keep these tags
new BBTag("b", "<b>", "</b>"),
new BBTag("i", "<span style=\"font-style:italic;\">", "</span>"),
new BBTag("u", "<span style=\"text-decoration:underline;\">", "</span>"),
// remove these (or at least their markup)
new BBTag("code", "", ""),
new BBTag("img", "", ""),
});
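The idea of stripping unsupported tags before the text ever reaches the parser can also be sketched without touching the parser source. The following is a minimal sketch, not part of the parser library: the class name BbCodeFilter, the method StripDisallowedTags and the whitelist contents are all assumptions made for illustration, and the pattern only recognizes the simple [tag], [/tag] and [tag=value] forms.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BbCodeFilter
{
    // Hypothetical whitelist: mirror whatever tags your parser is configured with.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "i", "center" };

    // Matches [tag], [/tag] and [tag=value]; group 1 captures the tag name.
    private static readonly Regex TagPattern =
        new Regex(@"\[/?([A-Za-z]+)(?:=[^\]]*)?\]");

    // Removes the markup of every tag that is not whitelisted,
    // keeping the text between the tags.
    public static string StripDisallowedTags(string input)
    {
        return TagPattern.Replace(
            input,
            m => AllowedTags.Contains(m.Groups[1].Value) ? m.Value : string.Empty);
    }
}
```

Calling BbCodeFilter.StripDisallowedTags on the user input before saving it would keep [b], [i] and [center] and drop the markup of everything else, so the parser never sees an unknown tag.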

Related

How to restrict the HTML tags in multiline textbox?

I have a multiline TextBox. I do not want to let users type HTML tags; validation on the server side would also be fine. Any suggestions?
When I set ValidateRequest="true" it throws the error
A potentially dangerous Request.Form value was detected from the client
which is not what I want either. I tried validating by checking for the < character, but that is not proper validation, because you can type something like <kanavi, which is not an HTML tag.
Set ValidateRequest="false"
and handle it on the server: if there is a tag in the input, show a message.
You can remove the tags:
Regex.Replace(source, "<.*?>", string.Empty);
Or HTML-encode the input if you want to keep them.
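The two options above (strip the tags, or encode them) might look like this side by side. This is a small sketch; the class and method names are made up for illustration, and WebUtility.HtmlEncode is just one of the encoders the framework offers.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class InputSanitizer
{
    // Option 1: drop anything that looks like <...>.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Option 2: keep the text as typed, but encode it so the browser
    // displays the characters instead of interpreting them as markup.
    public static string Encode(string source)
    {
        return WebUtility.HtmlEncode(source);
    }
}
```

Note that StripTags leaves an input such as <kanavi alone, because the pattern only matches when a closing > is present.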
Have a look at the HtmlLaundry package on NuGet;
it should help you clean out the HTML before it gets to the server.
Try a regular expression; this one finds HTML tags. Use it on the application side.
Regex.Match(TextBox.Text, @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>");
I have another solution using XDocument, also on the application side.
Create an XDocument and give it a root element:
XDocument yourXDocument = new XDocument(new XElement("Root"));
Then parse the TextBox content and add it under the root (XElement.Parse throws an XmlException if the text is not well-formed XML):
yourXDocument.Root.Add(XElement.Parse("<Input>" + TextBox.Text + "</Input>").Nodes());
Then use a recursive function to check whether your XDocument goes more than 2 levels deep.
Of course, if you want to detect only HTML tags, I think you have to create a Dictionary that stores all of them and compare your TextBox value against each of them.
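The lookup idea from the last sentence can be sketched with a set of known tag names instead of a Dictionary. The list below is deliberately partial and purely illustrative, and the names HtmlTagDetector and ContainsKnownHtmlTag are assumptions, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlTagDetector
{
    // Partial, illustrative list of tag names; extend it to suit your needs.
    private static readonly HashSet<string> KnownTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "b", "i", "u", "p", "br", "div", "span", "img", "script",
            "style", "table", "tr", "td", "ul", "ol", "li", "form", "input", "iframe"
        };

    // Finds "<name" or "</name"; group 1 captures the candidate tag name.
    private static readonly Regex TagStart = new Regex(@"</?([A-Za-z][A-Za-z0-9]*)");

    // Returns true only when a "<" is followed by a name from the list,
    // so text such as "<kanavi" is not reported as HTML.
    public static bool ContainsKnownHtmlTag(string text)
    {
        foreach (Match m in TagStart.Matches(text))
        {
            if (KnownTags.Contains(m.Groups[1].Value))
            {
                return true;
            }
        }
        return false;
    }
}
```

This still produces false positives for text like x<b (a comparison against a variable named b), so it is a heuristic for showing a validation message rather than a security boundary.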

Read specific text from page into string array in C#

I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression
to extract the content you want from a string, such as your html string.
Or you can use a DOM parser such as
Html Agility Pack
Hope this helps!
You could use something like this -
var text = "12 hello 45 yes 890 bye 999";
var matches = System.Text.RegularExpressions.Regex.Matches(text, @"\d+").Cast<Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
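Adapted to the article-title scenario from the question, the same pattern might look like the sketch below. It assumes, purely for illustration, that each title sits in an h2 element; the real page will need its own pattern. WebUtility.HtmlDecode turns entities back into the characters they stand for, so special characters are not lost.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Returns the decoded inner text of every <h2>...</h2> in document order.
    // The h2 element is an assumption about the page layout.
    public static string[] ExtractTitles(string html)
    {
        return Regex.Matches(
                html,
                @"<h2[^>]*>(.*?)</h2>",
                RegexOptions.IgnoreCase | RegexOptions.Singleline)
            .Cast<Match>()
            .Select(m => WebUtility.HtmlDecode(m.Groups[1].Value).Trim())
            .ToArray();
    }
}
```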
If the page is well-formed XML, you could use LINQ to XML: load the page into an XDocument, use XPath or another way of traversing to reach the element(s) you want, and load what you need into your array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution: any subtle change that breaks the well-formedness of the XML will break your code. If that's the case, you're probably better off using regular expressions. Either way, the page could be changed under you and your code would suddenly stop working.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.
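For the well-formed case, the LINQ to XML route could be sketched as follows. The element name title is an assumption for illustration. XDocument.Parse throws an XmlException when the input is not well-formed, which is exactly the brittleness described above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlTitleReader
{
    // Collects the text of every <title> element, in document order.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static string[] ReadTitles(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("title")
                  .Select(e => e.Value)
                  .ToArray();
    }
}
```

Descendants("title") only matches elements without a namespace; for XHTML or feeds that declare a default namespace, the name would have to be built from an XNamespace.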

Is this not a suitable scenario for an Html parser?

I have to deal with malformed Html and Html tags inside Html attributes:
<p class="<sometag attr="something"></sometag>">
Link
</p>
I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:
<p class="<sometag attr=" something"="">">
Link
</p>
The p tag becomes malformed and the someothertag inside the href attribute of the a tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).
Is there something else I can use to help me parse bad Html like this?
It's not valid HTML, so I don't think you can rely on an HTML parser to parse it.
You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.
The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?
If that's the case, my suggestion would be taking the source for an HTML parser such as Html Agility Pack and adding some functionality to the attribute parsing. Use a stack; for every opening character, push it, then read until you find another opening or closing character. If it's opening, push it, if it's closing, pop it.
Alternately, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
One other possible solution is to modify the source markup before passing it to the parser, changing the illegal characters in the attribute values to character entities (the ampersand-semicolon escapes, such as &lt; and &gt;). Unfortunately, this would require doing some preliminary parsing on your part.
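A rough sketch of that pre-processing step is shown below. It relies on the assumption raised earlier, that every < has a matching >, and it simply counts nesting depth: any <, > or double quote found while already inside a tag is replaced by an entity. The names AttributeEscaper and EscapeNestedMarkup are invented for this example, and a stray < in ordinary text (as in a < b) would throw the counter off.

```csharp
using System;
using System.Text;

public static class AttributeEscaper
{
    // Escapes markup that appears inside another tag (for example inside
    // an attribute value) so that an HTML parser sees a single valid tag.
    // Assumes every '<' in the input has a matching '>'.
    public static string EscapeNestedMarkup(string html)
    {
        var sb = new StringBuilder(html.Length + 16);
        int depth = 0; // 0 = text, 1 = inside a tag, 2+ = tag nested in a tag

        foreach (char c in html)
        {
            if (c == '<')
            {
                sb.Append(depth == 0 ? "<" : "&lt;");
                depth++;
            }
            else if (c == '>' && depth > 0)
            {
                depth--;
                sb.Append(depth == 0 ? ">" : "&gt;");
            }
            else if (c == '"' && depth > 1)
            {
                // Quotes belonging to the nested tag would otherwise
                // terminate the outer attribute value early.
                sb.Append("&quot;");
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The escaped string can then be loaded into the parser; the nested markup survives as text in the attribute value and can be decoded and parsed separately if it needs to be treated as a tag.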

Checking a HTML string for unopened tags

I have a string containing HTML source, and I want to check whether it contains a closing tag that was never opened.
For example, the string below contains </u> after WAVEFORM, which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I want to detect this kind of unopened tag and then prepend the missing opening tag to the start of the string. How can I do that?
For this specific case you can use HTML Agility Pack to check whether the HTML is well formed or contains tags that were never opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
"WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
// Prints: TagNotOpened
Console.WriteLine(error.Code);
// Prints: Start tag <u> was not found
Console.WriteLine(error.Reason);
}
Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know whether it needs closing, i.e. whether it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)
Otherwise, for a start tag, add the tag name from the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as the end tag; otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've reached the end of the input string, prepend an open tag for each entry in the tags-to-open list to the string, in reverse order, and append a close tag for each entry in the tags-to-close list to the end, again in reverse order.
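The steps above can be sketched in C# roughly as follows. This is a simplified illustration rather than a faithful port: the tag pattern is a cut-down version of the regex shown earlier, the list of EMPTY elements is partial, and mismatched tag names are not reported. The names TagBalancer and Balance are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagBalancer
{
    // Group 1: start-tag name, group 2: end-tag name; comments match neither.
    private static readonly Regex TagPattern = new Regex(
        @"<(\w+)(?:\s[^>]*)?>|</(\w+)\s*>|<!--.*?-->",
        RegexOptions.Singleline);

    // Partial list of EMPTY elements that never need a closing tag.
    private static readonly HashSet<string> EmptyTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "br", "hr", "img", "input", "meta", "link" };

    public static string Balance(string html)
    {
        var toOpen = new List<string>();   // end tags seen with nothing open
        var toClose = new List<string>();  // start tags still waiting for an end tag

        foreach (Match m in TagPattern.Matches(html))
        {
            if (m.Groups[1].Success)
            {
                string name = m.Groups[1].Value;
                if (!EmptyTags.Contains(name))
                {
                    toClose.Add(name);
                }
            }
            else if (m.Groups[2].Success)
            {
                if (toClose.Count > 0)
                {
                    toClose.RemoveAt(toClose.Count - 1);
                }
                else
                {
                    toOpen.Add(m.Groups[2].Value);
                }
            }
        }

        // Both lists are emitted in reverse order so the nesting stays correct.
        string prefix = string.Concat(Enumerable.Reverse(toOpen).Select(t => "<" + t + ">"));
        string suffix = string.Concat(Enumerable.Reverse(toClose).Select(t => "</" + t + ">"));
        return prefix + html + suffix;
    }
}
```

For the string from the question, Balance prepends a single <u> and appends nothing, since the second <u> is already closed.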
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)

Removing <div>'s from text file?

I've made a small program in C#/.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It loads an RSS feed from the BBC website and then looks for keywords which either increment or decrease the percentage chance of DOOM.
It's a crazy little project, but maybe one day the classes will come in handy for something more important.
I receive the RSS in XML format, but it contains a lot of div tags and formatting characters which I don't really want in the keyword database.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
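If only the div tags themselves need to go and their content should stay, a sketch along these lines may be enough. The class and method names are made up for illustration. A plain string.Replace works for bare <div> and </div>, while the regex variant also covers div tags that carry attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivStripper
{
    // Plain search and replace: handles only the exact strings <div> and </div>.
    public static string RemoveBareDivTags(string html)
    {
        return html.Replace("<div>", string.Empty)
                   .Replace("</div>", string.Empty);
    }

    // Regex variant: also removes opening div tags with attributes,
    // regardless of letter case. The content between the tags is kept.
    public static string RemoveDivTags(string html)
    {
        return Regex.Replace(html, @"</?div\b[^>]*>", string.Empty, RegexOptions.IgnoreCase);
    }
}
```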
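A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
would match each HTML element: an opening tag, its content, and the matching closing tag (use RegexOptions.IgnoreCase so that lowercase tag names match too).
Use this to remove them from your data.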
