Regex with not match at end - c#

I'm trying to write a regex to match patterns like this:
<td style="alskdjf" />
i.e. a self terminating <td>
but not this:
<td style="alsdkjf"><br /></td>
I initially came up with:
<td\s+.*?/>
but that obviously fails on the second example and I thought that something like this might work:
<td\s+.*?[^>]/>
but it doesn't. I'm using C#.NET.
I'm only looking for <td>s that have an attribute, e.g. <td style="alsdfkj" /> but not <td>.

This will match what you're looking for, and not match the problematic case you had with your first few tries:
<td[^>]*?/>
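Note, however, that if you need to allow > characters in attribute values, you'd need something like this:
<td(?:[^>"]|"[^"]*")*?/>
This allows > only within matching double quotes (you could similarly expand it to allow single quotes). The character class excludes " so that a quoted value can only be consumed as a whole by the second alternative; otherwise backtracking could start a "quoted value" at a closing quote and run past the end of the tag.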
You can add whatever specific attribute you're looking for into the regex; for instance for your example:
<td[^>]*? style="alskdjf"[^>]*?/>
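As a rough sketch of how a pattern like the ones above could be used from C#, the helper below wraps a variant in a static method. The class and method names are made up for illustration. The pattern also assumes that "has an attribute" can be approximated by requiring whitespace plus a word character after td, and it accepts single-quoted values as well as double-quoted ones.

```csharp
using System.Text.RegularExpressions;

public static class TdMatcher
{
    // <td, whitespace, the first character of an attribute name, then any mix of
    // plain characters and whole quoted values (which may contain '>'), then "/>".
    private static readonly Regex SelfClosingTd = new Regex(
        @"<td\s+\w(?:[^>""']|""[^""]*""|'[^']*')*/>");

    // True when the input contains a self-closing <td ... /> with an attribute.
    public static bool IsSelfClosingTd(string html)
    {
        return SelfClosingTd.IsMatch(html);
    }
}
```

Calling TdMatcher.IsSelfClosingTd on the first example from the question should return true, and on the second example or a bare <td> it should return false. It is still a regex over markup, so unusual input (comments, unbalanced quotes) can fool it.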

You're going to have problems using regexps with HTML since HTML is not regular. I'd recommend using an HTML parser for all but the very simplest cases.

Regex will have serious trouble interpreting messy HTML of the sort browsers often have to deal with. There are all sorts of horrible obfuscations that can be done to the markup that you just don't want to have to think about!
The HTML Agility Pack is what you really want to be using, and has had very good reviews everywhere I've seen. It is a robust library for reading any kind of mangled HTML into a DOM. I have personally found it to be a superb library, as surely have others, many of whom use it in business applications.

Related

Get value between unknown string

I'm trying to pull out a string between two other strings. But to make it more complicated, the preceding contents will often differ.
The string I'm trying to retrieve is Christchurch.
The regex I have so far is (?<=300px">).*(?=</td) and it pulls out the string I'm looking for fine, but it also returns dozens of other strings throughout the LARGE text file I'm searching.
What I'd like to do is limit the prefix to start searching from Office:, all the way to 300px">, but the contents between those two strings will sometimes differ depending on user preferences.
To put it in crude non-regex terms, I want to do the following: starting at Office:, go all the way to 300px">, then find the string that starts there and ends with </td. This should result in Christchurch.
Have you considered using the HtmlAgilityPack instead? It's a NuGet package for handling HTML, and it copes with malformed HTML pretty well. Most people on Stack Overflow would recommend against using regex for HTML; see: RegEx match open tags except XHTML self-contained tags
Here's how you'd do it for your example:
using System;
using HtmlAgilityPack; // This is a NuGet package!
var html = @"<tr >
<td align=""right"" valign=""top""><strong>Office:</strong> </td>
<td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
</tr>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// SelectSingleNode is a method of DocumentNode; it returns null when nothing matches.
var node = htmlDoc.DocumentNode.SelectSingleNode("//td[@class='stippel']");
Console.WriteLine(node.InnerHtml);
I haven't tested this code but it should do what you need.
I guess you need something like this:
office.*\n.*|(?<=300px">).*(?=<\/td)
The issue you're encountering is that * is greedy. Use the lazy/reluctant version *?.
Office:[\s\S]*?300px">(.*?)</td
This solution uses a group match rather than look-arounds.
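A minimal sketch of the group-match approach in C#, assuming the whole file has been read into a string. The class and method names are invented for illustration; the pattern is the one given above.

```csharp
using System.Text.RegularExpressions;

public static class OfficeExtractor
{
    // Lazy [\s\S]*? skips as little as possible (including newlines) between
    // "Office:" and the first 300px"> that follows; group 1 is the cell text.
    private static readonly Regex OfficeCell = new Regex(
        @"Office:[\s\S]*?300px"">(.*?)</td");

    // Returns true and the trimmed cell text when the pattern is found.
    public static bool TryExtractOffice(string html, out string office)
    {
        Match m = OfficeCell.Match(html);
        office = m.Success ? m.Groups[1].Value.Trim() : string.Empty;
        return m.Success;
    }
}
```

Because the value is read from a capture group, no look-behind is needed, and the text between Office: and 300px"> can vary freely.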
Thanks to the posts from adamdc78 and greg, I have been able to come up with the regex below. This is exactly what I needed.
Thanks for your help.
(?<=office.*\n.*300px">).*(?=<\/td)

Use C# to grab text from an HTML table

I need some advice and possibly code examples for parsing an HTML table from a website. I'm using the WebClient class to download the HTML from an address. I then need to find the table I want the data from. So, for example, if the table is <table id="cia_list">, I want to loop through the <td> tags and get just the text inside them. What would be the best way to approach this?
In the past I have converted the HTML to XML and then used XSLT to parse the results. If this is an approach you want to take I would recommend looking at SGMLReader, which will handle the conversion.
People will often attempt to use regex to do what you are talking about. This is something I typically advise against. Here is an amusing post that goes over some of the reasons not to do this:
RegEx match open tags except XHTML self-contained tags

looking for specific text and convert into link

I have a bunch of html files (5000).
My business requirements define a reference format, let's say it's XXX-YY(Year)-ZZZ.
In all HTML files, I want to replace every occurrence of that format with a link like this:
<a href='~/app/document/XXX-YY(Year)-ZZZ'>XXX-YY(Year)-ZZZ</a>
While it sounds "simple" using a standard regex replace, it's actually more difficult than I thought, because the process can run multiple times.
My current process "nests" the replacements and produces something like this:
<a href='~/app/document/<a href='~/app/document/XXX-YY(Year)-ZZZ>XXX-YY(Year)-ZZZ</a>><a href='~/app/document/XXX-YY(Year)-ZZZ>XXX-YY(Year)-ZZZ</a></a>
How can I reach my goal ?
PS: performance is not an issue (at least when it stays reasonable)
All you need is the HTML Agility Pack.
Check this one: c# html agility pack, and plenty of other questions about it here on SO ;-)
This is because you are better off using a parser with a solid understanding of the HTML tree, rather than regex or text parsing, which may fail depending on the specific markup.
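If a parser is not an option, one regex-only way to make the replacement safe to re-run is to let the pattern consume existing <a>...</a> elements whole and only wrap references found outside them. This is a sketch, not a substitute for the parser approach. The class name, the method name and the concrete reference pattern (three capitals, two digits, three digits) are assumptions, since the real format is only given as XXX-YY(Year)-ZZZ.

```csharp
using System.Text.RegularExpressions;

public static class ReferenceLinker
{
    // Hypothetical stand-in for the XXX-YY(Year)-ZZZ format; adjust to the real one.
    private const string ReferencePattern = @"[A-Z]{3}-\d{2}-\d{3}";

    // Alternative 1 swallows an entire existing anchor (case-insensitive, may span
    // lines), so references in its href or text never reach alternative 2.
    private static readonly Regex LinkOrReference = new Regex(
        @"(?is:<a\b[^>]*>.*?</a>)|(?<ref>\b" + ReferencePattern + @"\b)");

    public static string AddLinks(string html)
    {
        return LinkOrReference.Replace(html, m =>
            m.Groups["ref"].Success
                ? "<a href='~/app/document/" + m.Value + "'>" + m.Value + "</a>"
                : m.Value); // existing anchor: leave untouched
    }
}
```

Running AddLinks a second time on its own output should leave the text unchanged, which is the property the question asks for. The sketch does not know about comments, scripts or other attribute values, so references in those places would still be rewritten.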

Why isn't MarkdownSharp encoding my HTML?

In my mind, one of the bigger goals of Markdown is to prevent the user from typing potentially malformed HTML directly.
Well that isn't exactly working for me in MarkdownSharp.
This example works properly when you have the extra line break immediately after "abc"...
But when that line break isn't there, I think it should still be HtmlEncoded, but that isn't happening here...
Behind the scenes, the rendered markup is coming from an iframe. And this is the code behind it...
<%
var md = new MarkdownSharp.Markdown();
%>
<%= md.Transform(Request.Form[0]) %>
Surely I must be missing something. Oh, and I am using v1.13 (the latest version as of this writing).
EDIT (this is a test for StackOverflow's implementation)
abc
this shouldn't be red
For those not wanting to use Steve Wortham's customized solution, I have submitted an issue and a proposed fix to the MarkdownSharp guys: http://code.google.com/p/markdownsharp/issues/detail?id=43
If you download my attached Markdown.cs file you will find a new option that you can set. It will stop MarkdownSharp from re-encoding text within the code blocks.
Just don't forget to HTML encode your input BEFORE you pass it into markdown, NOT after.
Another solution is to white-list HTML tags like Stack Overflow does. You would do this AFTER you pass your content to markdown.
See this for more information: http://www.CodeTunnel.com/blog/post/24/mardownsharp-and-encoded-html
Since it became clear that the StackOverflow implementation contains quite a few customizations that could be time consuming to test and figure out, I decided to go another direction.
I created my own simplified markup language that's a subset of Markdown. The open-source project is at http://ultralight.codeplex.com/ and you can see a working example at http://www.bucketsoft.com/ultralight/
The project is a complete ASP.NET MVC solution with a Javascript editor. And unlike MarkdownSharp, safe HTML is guaranteed. The Javascript parser is used both client-side and server-side to guarantee consistent markup (special thanks to the Jurassic Javascript compiler). It's a beautiful thing to only have to maintain one codebase for that parser.
Although the project is still in beta, I'm using it on my own site already and it seems to be working well so far.
Maybe I'm not understanding? If you are starting a new code block in Markdown, in all its varieties, you do need a double linebreak and four-space indentation -- a single newline won't do in any of the renderers I have to hand.
abc -- Here comes a code block:

    <div style="background-color: red"> This is code</div>
yielding:
abc -- Here comes a code block:
<div style="background-color: red"> This is code</div>
From what you are saying it seems that MarkdownSharp does fine with this rule, so with just one newline (but indentation):
abc -- Here comes a code block:
    <div style="background-color: red"> This should be code</div>
we get a mess not a code block:
abc -- Here comes a code block:
This should be code
I assume StackOverflow is stripping the <div> tags, because they think comments shouldn't have divisions and suchlike things. (?) (In general they have to do a lot of other processing don't they, e.g. to get syntax highlighting and so on?)
EDIT: I think people are expecting the wrong thing of a Markdown implementation. For example, as I say below, there is no such thing as 'invalid markdown'. It isn't a programming language or anything like one. I have verified that all three markdown implementations I have available from the command line indifferently 'convert' random .js and .c files, or those inserted into otherwise sensible markdown -- and also interpolated zip files and other nonsense -- into valid html that browsers don't mind displaying at all -- chicken scratches though it is. If you want to exclude something, e.g. in a wiki program, you do something further, of course, as most markdown-employing wiki programs do.

Retrieve Html attributes using Regex

I'm in need of a quick way to put a bunch of html attributes in a Dictionary. Like so
<body topmargin=10 leftmargin=0 class="something"> should amount to
attr["topmargin"]="10"
attr["leftmargin"]="0"
attr["class"]="something"
This is to be done server-side and the tag contents are already available. I just need to weed out the tags with no value and take into account different quotation marks or the lack thereof.
I'm guessing regex should be employed. Found some similar questions, but none that really match my need.
Thanks
edit: clarifying server-side
What about HtmlAgilityPack?
I also think that using specialized parsers will be better, but if you want to use regex, try something like:
\<(?<tag>[a-zA-Z]+)( (?<name>\w+)="?(?<value>\w+)"?)*\>
I just tested it, works pretty well
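Here is a sketch of how that regex could fill the dictionary in C#, relying on the fact that .NET keeps every capture of a repeated group. The class and method names are invented for illustration. The pattern is used as given, so it only handles values made of word characters, optional double quotes and single spaces between attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    private static readonly Regex TagPattern = new Regex(
        @"\<(?<tag>[a-zA-Z]+)( (?<name>\w+)=""?(?<value>\w+)""?)*\>");

    // Returns attribute name/value pairs from a single opening tag,
    // or an empty dictionary when the tag does not fit the pattern.
    public static Dictionary<string, string> Parse(string tag)
    {
        var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        Match m = TagPattern.Match(tag);
        if (!m.Success) return attributes;

        // Each repetition of the outer group adds one capture to both named groups,
        // so the two collections line up index by index.
        CaptureCollection names = m.Groups["name"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            attributes[names[i].Value] = values[i].Value;
        }
        return attributes;
    }
}
```

For the <body> tag in the question this should yield the three entries listed there. Values containing spaces, hyphens or single quotes would need a broader value pattern.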
