Find specific text and convert it into a link (C#)

I have a bunch of HTML files (about 5000).
My business requirements define a reference format; let's say it's XXX-YY(Year)-ZZZ.
In all the HTML files, I want to replace every occurrence of that format with a link like this:
<a href='~/app/document/XXX-YY(Year)-ZZZ'>XXX-YY(Year)-ZZZ</a>
While it sounds "simple" with a standard regex replace, it's actually more difficult than I thought, because the process can run multiple times.
My current process "nests" the replacements and produces something like this:
<a href='~/app/document/<a href='~/app/document/XXX-YY(Year)-ZZZ>XXX-YY(Year)-ZZZ</a>><a href='~/app/document/XXX-YY(Year)-ZZZ>XXX-YY(Year)-ZZZ</a></a>
How can I reach my goal?
PS: performance is not an issue (as long as it stays reasonable).

All you need is the HTML Agility Pack.
See the "c# html agility pack" question, and the many other questions about it here on SO ;-)
The reason: you are better off using a parser with a solid understanding of the HTML tree than a plain regex or text search, which may fail depending on the specific markup.
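A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced. The class name ReferenceLinker and the concrete pattern (three letters, a four-digit year, three digits) are hypothetical stand-ins for the real XXX-YY(Year)-ZZZ format. Because it only rewrites text nodes that are not already inside an <a> element, running it a second time should leave existing links alone:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class ReferenceLinker
{
    // Assumed shape of a reference, e.g. ABC-2013-001. Adjust to the real format.
    private static readonly Regex Reference = new Regex(@"\b[A-Z]{3}-\d{4}-\d{3}\b");

    public static string Linkify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Text nodes only, skipping text already inside a link, a script or a style block.
        var textNodes = doc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null) return html; // SelectNodes returns null when nothing matches

        foreach (var textNode in textNodes.ToList())
        {
            string original = textNode.InnerHtml;
            if (!Reference.IsMatch(original)) continue;

            string linked = Reference.Replace(original,
                m => "<a href='~/app/document/" + m.Value + "'>" + m.Value + "</a>");

            // Parse the replacement markup and splice its nodes in place of the text node.
            var fragment = new HtmlDocument();
            fragment.LoadHtml(linked);
            var parent = textNode.ParentNode;
            foreach (var child in fragment.DocumentNode.ChildNodes.ToList())
            {
                parent.InsertBefore(child, textNode);
            }
            parent.RemoveChild(textNode);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling ReferenceLinker.Linkify on each file's content and writing the result back would cover the batch case. References that appear inside attribute values are never touched, since attributes are not text nodes.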

Related

Detect Razor/C# code?

Is there a way to detect if an HTML page contains any Razor/C# code? Essentially, I want users to be able to provide custom layouts, with tags that I will replace with RenderSection. Before making this replacement, I want to validate that none of the HTML contains anything like, for example, <a href="@(some C# code)".
All discussions about alternative ways to do this, should/could/would aside, just simply:
Is there a way to programmatically detect if a file contains C#/Razor code?
I don't know a lot about the Razor markup, but I am thinking that when you grab the layout string they are passing in, you will want to parse the text, grab everything that starts with an @, and toss those words into an array. Then, when you republish it to your website, use Razor code to access the data in the array...
Alternatively, and easier, go through all the passed-in code and replace every @ sign with a different symbol, say &, so that it won't get interpreted by the Razor processor:
layoutString = layoutString.Replace('@', '&');
In the browser? No, because unless the programmer made a mistake, there is no Razor/C# code in the rendered HTML, only the HTML that resulted from it.
What you ask is like asking, from the pizza alone, what type of oven was used to bake it. Bad news: you never will know.
If you provide sensible tags for those, you could parse them in JavaScript, but you have to output that metadata yourself as part of the generated HTML.
After reading your comment to TomTom; the answer is:
No. Razor does not come with any public syntax parser.
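Without a public parser, the closest thing is a heuristic check on the raw layout text before it reaches the view engine. The sketch below is only that: Razor code is introduced by the @ character, so it flags any @ that is not part of an escaped @@ and does not look like the middle of an e-mail address. The class and method names are hypothetical, and the check can produce both false positives and false negatives:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RazorSniffer
{
    // Heuristic only: an '@' that is not next to another '@' (the "@@" escape)
    // and is not preceded by a word character (which looks like an e-mail address).
    private static readonly Regex RazorMarker = new Regex(@"(?<![\w@])@(?!@)");

    public static bool LooksLikeRazor(string layout)
    {
        return RazorMarker.IsMatch(layout);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeRazor("<a href=\"@(Model.Url)\">x</a>"));   // True
        Console.WriteLine(LooksLikeRazor("<p>mail me: joe@example.com</p>")); // False
        Console.WriteLine(LooksLikeRazor("<p>plain @@ sign</p>"));            // False
    }
}
```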

How to find all tags in a string using Regex in C#.NET?

I want to find all HTML tags in the input string and remove them or replace them with some text.
Suppose that I have this string:
INPUT=>
<img align="right" src="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg" /><p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, Il Giardino Ristorante in Newport Beach.</p>
OUTPUT=>
string strSrc="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg";
<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, http://www.tenrestaurantgroup.com in Newport Beach.</p>
From the above string:
if an <IMG> tag is found, I want to get the SRC of the tag;
if an <A> tag is found, I want to get the HREF of the tag;
and all other tags should stay as they are.
How can I achieve this using Regex in C#.NET?
You really, really shouldn't use regex for this. In fact, parsing HTML cannot be done perfectly with regex. Have you considered using an XML parser or HTML DOM library?
You can use HtmlAgilityPack for parsing (valid/non valid) html and get what you want.
I agree with Justin: Regex really isn't the best way to do this, and the HTML Agility Pack is well worth a look if this is something you will need to be doing a lot of.
With that said, the expression below will store attributes in a group, from which you should be able to pull them into your text while ignoring the rest of the element:
</?([^ >]+)( [^=]+?="(.+?)")*>
Hope this helps.
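For comparison, here is a rough sketch of the parser-based route suggested above, assuming the HtmlAgilityPack NuGet package is referenced. The TagFlattener class and its method names are made up for illustration. It replaces each <img> with its src and each <a> with its href, and leaves every other tag untouched:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class TagFlattener
{
    // Replaces every element matched by the XPath with the value of one of its attributes.
    private static void ReplaceWithAttribute(HtmlDocument doc, string xpath, string attribute)
    {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return; // SelectNodes returns null when nothing matches

        foreach (var node in nodes.ToList())
        {
            string value = node.GetAttributeValue(attribute, string.Empty);
            node.ParentNode.ReplaceChild(doc.CreateTextNode(value), node);
        }
    }

    public static string Flatten(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        ReplaceWithAttribute(doc, "//img[@src]", "src");
        ReplaceWithAttribute(doc, "//a[@href]", "href");
        return doc.DocumentNode.OuterHtml;
    }
}
```

If the src needs to end up in a separate variable rather than inline, the same loop can collect the attribute values into a list before replacing or removing the nodes.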

Regex with not match at end

I'm trying to write a regex to match patterns like this:
<td style="alskdjf" />
i.e. a self terminating <td>
but not this:
<td style="alsdkjf"><br /></td>
I initially came up with:
<td\s+.*?/>
but that obviously fails on the second example and I thought that something like this might work:
<td\s+.*?[^>]/>
but it doesn't. I'm using C#.NET.
Only looking for <td>'s that have an attribute. e.g. looking for <td style="alsdfkj" /> but not <td>.
This will match what you're looking for, and not match the problematic case you had with your first few tries:
<td[^>]*?/>
Note, however, that if you need to allow > characters in attribute values, you'd need something like this:
<td(?:[^>]|"[^"]*?")*?/>
Which allows > only within matching double-quotes (you could similarly expand it to allow single-quotes).
You can add whatever specific attribute you're looking for into the regex; for instance for your example:
<td[^>]*? style="alskdjf"[^>]*?/>
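To check the first pattern against the two examples from the question, a small console sketch like the following can be used (the class and method names are only illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

public static class SelfClosingTd
{
    // [^>]*? cannot cross the first '>', so the match only succeeds
    // when the opening tag itself ends in "/>".
    private static readonly Regex Pattern = new Regex(@"<td[^>]*?/>");

    public static bool HasSelfClosingTd(string html)
    {
        return Pattern.IsMatch(html);
    }

    public static void Main()
    {
        Console.WriteLine(HasSelfClosingTd("<td style=\"alskdjf\" />"));          // True
        Console.WriteLine(HasSelfClosingTd("<td style=\"alsdkjf\"><br /></td>")); // False
        Console.WriteLine(HasSelfClosingTd("<td>"));                              // False
    }
}
```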
You're going to have problems using regexps with HTML since HTML is not regular. I'd recommend using an HTML parser for all but the very simplest cases.
Regex will have serious trouble interpreting messy HTML of the sort browsers often have to deal with. There are all sorts of horrible obfuscations that can be done to the markup that you just don't want to have to think about!
The HTML Agility Pack is what you really want to be using, and it has had very good reviews everywhere I've seen. It is a robust library for reading any kind of mangled HTML into a DOM model. I have personally found it to be a superb library, as surely have others, many of whom use it in business applications.

RegEx matching HTML tags and extracting text

I have a string of text like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.
I wouldn't use regex for this either, but if you must, this expression should work:
<customtag>(.+?)</customtag>
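A short sketch of how that expression could be combined with a MatchEvaluator, as the question asks. The helper name AppendToCustomTag is hypothetical, and the same caveats about nested tags apply:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomTagEditor
{
    public static string AppendToCustomTag(string input, string suffix)
    {
        // The lambda is converted to a MatchEvaluator; group 1 is the text between the tags.
        return Regex.Replace(
            input,
            "<customtag>(.+?)</customtag>",
            m => "<customtag>" + m.Groups[1].Value + suffix + "</customtag>");
    }

    public static void Main()
    {
        Console.WriteLine(AppendToCustomTag("<customtag>hey</customtag>", ", this is changed!"));
        // prints: <customtag>hey, this is changed!</customtag>
    }
}
```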
I'd chew my own leg off before using a regular expression to parse and alter HTML.
Use XSL or DOM.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags, attributes, or content.
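As a rough illustration of the DOM route, the sketch below uses System.Xml's XmlDocument and assumes the input is well-formed XML or XHTML (LoadXml throws otherwise). The helper name and the <root> wrapper in the example are assumptions made for the demo. Unlike the regex, it keeps working when <customtag> contains other tags:

```csharp
using System;
using System.Linq;
using System.Xml;

public static class CustomTagDom
{
    public static string AppendToCustomTag(string xml, string suffix)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the markup is not well-formed

        // Materialize the node list first so the document is not modified while iterating it.
        var targets = doc.SelectNodes("//customtag").Cast<XmlNode>().ToList();
        foreach (XmlNode node in targets)
        {
            // Append a text node so child elements already inside the tag are preserved.
            node.AppendChild(doc.CreateTextNode(suffix));
        }
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(AppendToCustomTag(
            "<root><customtag>hey <b>you</b></customtag></root>", ", this is changed!"));
    }
}
```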
Here are a couple of articles on how to use XSL with C#:
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
Most people use HTML Agility Pack for HTML text parsing. However, I find it a little too heavyweight and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it. (see example below)
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
// JavaScript: remove all HTML tags
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re,"");
// Remove all non-breaking spaces (\u00a0)
var x3 = x2.replace(/\u00a0/g,'');

Removing <div>'s from text file?

I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed on load from the BBC website and then looks for keywords which either increment or decrease the percentage chance of DOOM.
It's a crazy little project, but maybe one day the classes will come in handy for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want to be in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
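If you want to remove the DIV tags WITH content as well (this does not handle nested DIVs):
string start = "<div>";
string end = "</div>";
// .*? stops at the first closing tag; a class like [^</div>] would only exclude the single characters <, /, d, i, v and >.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);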
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
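A self-contained sketch of that tag-stripping replace, wrapped in a hypothetical StripTags helper. It removes the tags but keeps the text between them, which is what a keyword scan would need:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripTags(string html)
    {
        // (.|\n)*? lets a tag span line breaks; the lazy quantifier stops at the first '>'.
        return Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<div class=\"story\"><p>Markets <b>fall</b></p></div>"));
        // prints: Markets fall
    }
}
```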
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
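A regular expression such as this (used with RegexOptions.IgnoreCase, since the tag name is written in uppercase):
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
would match every paired HTML tag together with its content.
Use this to remove them from your data.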
