Regex matching URLs by sub-folder - C#

I am trying to essentially write an outbound URL matcher so I can replace a stream of HTML containing URLs to point to my CDN. I can't use the IIS URL Rewrite module as I am using compression. I currently have a regex that matches on a sub-folder for a specific file type, i.e.
Regex ASSET_PATH = new Regex(@"(?i)assets/([A-Za-z0-9\-_/.]+)\.(jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase );
This works great and allows me to manipulate anything in the string from that point onwards ( i.e. from "assets/" onwards to the right ). What I need to achieve is to manipulate the string to the left of the "assets/" sub-folder, without necessarily knowing the format? Here are some examples :
<img src="./assets/123/pig.jpg" />
<img src="http://mysite.blah/assets/123/pig.jpg" />
<img src="http://www.mysite.blah/assets/123/pig.jpg" />
<img src='assets/123/pig.jpg' />
in css / inline styles :
background-image : URL('assets/123/pig.jpg')
background-image : URL(http://www.mysite.blah/assets/123/pig.jpg)
anyway, I think you get the picture. I essentially want to be able to look to the "left" of the word "assets" until I can find the logical start point of the url and then manipulate it from there to point to my CDN.
I'm not sure this is possible in regex, so any suggestions using a combination of regex / c# /HTML Agility Pack are welcome

Is this what you're after?
(?<BeforeAssets>.*?(?:\/|^))assets\/(?<AfterAssets>[A-Za-z0-9\-_\/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)
You can try this out here: http://regexstorm.net/tester
Or here: https://regex101.com/r/b8XxcF/1
NB: In the above regex I escaped the forward slash characters. .Net doesn't require this, but doesn't complain; and doing so makes this compatible with other Regex engines; which means it can be tested on Regex101.
When testing with those tools you'll need to specify the Multiline option to get the example where assets/ has nothing preceding it, since otherwise the ^ character only matches the start of the whole input, not the start of that line. This option may not be required in your code, i.e. if you're only matching one string at a time rather than a whole block of text.
Update
Apologies for misreading; you're parsing the full HTML page; not just the URIs returned from that page. To do this you could use something like:
["'\(](?<BeforeAssets>[^"'\(\)]*?)assets\/(?<AfterAssets>[A-Za-z0-9\-_\/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)
(Thankfully " is not allowed in a URL, and ' and ( are reserved characters that rarely appear unencoded, so all three should be OK for detecting the start of a value: https://www.rfc-editor.org/rfc/rfc3986#section-2.2.)
This isn't fool-proof; it's better to use an HTML parsing tool, then pull out the URIs from that; but if you are doing everything with regex, hopefully this will help.
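As a rough illustration of how the pattern above could drive the actual rewrite, here is a minimal sketch using Regex.Replace with a match evaluator. The class name CdnRewriter, the CDN root https://cdn.example.com/ and the extra Open group (added so the opening delimiter can be put back) are assumptions made for this example, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CdnRewriter
{
    // Hypothetical CDN root, used only for illustration.
    private const string CdnBase = "https://cdn.example.com/";

    // The answer's pattern, with the opening delimiter (", ' or ( ) captured
    // in an extra "Open" group so it can be written back unchanged.
    private static readonly Regex AssetUrl = new Regex(
        @"(?<Open>[""'(])(?<BeforeAssets>[^""'()]*?)assets/(?<AfterAssets>[A-Za-z0-9\-_/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    public static string Rewrite(string html)
    {
        // Keep the delimiter, drop whatever preceded "assets/",
        // and prepend the CDN root instead.
        return AssetUrl.Replace(html, m =>
            m.Groups["Open"].Value + CdnBase + "assets/" +
            m.Groups["AfterAssets"].Value + "." + m.Groups["FileExtension"].Value);
    }
}
```

Calling CdnRewriter.Rewrite on a chunk of the response should leave everything outside the matched URLs untouched. Like the regex it is built on, this is not fool-proof.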

Related

Regular expression Google image

I am making an RSS reader and I need to find the image URL path (Google RSS) using a regular expression.
The RSS channel is https://news.google.com/?output=rss.
An example image tag from the RSS is:
<img src="//t0.gstatic.com/images?q=tbn:ANd9GcRfMZ3MOzznCthFKCdIan17n9B8vZvEE-tRSQVTcgJa5i1OPfdf90zi4mBuGzPfB7Bj2mwE0TE" alt="" border="1" width="80" height="80" />
By the way, I use this regular expression:
Regex regx = new Regex("\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
RegexOptions.IgnoreCase);
Some advice?
First, you should not parse XML with regex; use XmlDocument, XmlParser, readers, ...
If you know what you are doing, here is the quick and dirty regex solution.
All image tags in your feed seem to be in description tags, and they are of course XML-encoded (keep that in mind for the next few steps).
Next, you should look at some example img tags.
Are you looking for img tags without src too, or with an empty src?
Overall: define what you are looking for.
Design your regex
Because the feed is generated automatically, the tags seem to be in the same order every time (we use that fact for a shorter regex).
Each img tag starts with < (but keep point 1 in mind: it is XML-encoded, so it appears as &lt;).
Look for &lt; followed by img (current regex: &lt;img)
Next comes at least one whitespace character (current regex: &lt;img\s+)
The src attribute is always the first attribute (in this case) if present, so we select src=" (current regex: &lt;img\s+src=")
Next, select the URL itself with .*, but be careful: the * quantifier is greedy, so we have to use lazy quantification .*? and finally close with "
Final regex: &lt;img\s+src="(.*?)" Make sure that you use parentheses around the URL for easy group selection.
Last Step: C# Code
//quick & dirty :-)
var url = "https://news.google.com/?output=rss";
// The raw feed is XML-encoded, so the tag starts with &lt; rather than <.
var regex = @"&lt;img\s+src=""(.*?)""";
var RssContent = new StreamReader(((HttpWebRequest)HttpWebRequest.Create(url)).GetResponse().GetResponseStream()).ReadToEnd();
foreach (Match match in Regex.Matches(RssContent, regex))
{
    //print img urls
    Debug.WriteLine(match.Groups[1].Value);
}
PS: If you are trying to write an RSS reader, you should NOT use regex to parse HTML at all! Try to find a way to transform the HTML into XAML and write your reader in WPF, or start by studying some open source RSS readers to learn more about these problems.
You can use the below regex pattern:
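For comparison, the parser-first route recommended at the top of this answer might look like the sketch below. It assumes the feed has already been downloaded into a string and that the images sit inside description elements, as described above. The class name FeedImages is made up for the example. Because XElement.Value returns decoded text, the regex here can use a literal < instead of the encoded form.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class FeedImages
{
    // Same shape as the regex designed above, applied to decoded HTML.
    private static readonly Regex ImgSrc =
        new Regex(@"<img\s+src=""(.*?)""", RegexOptions.IgnoreCase);

    public static List<string> Extract(string rssXml)
    {
        var urls = new List<string>();
        XDocument doc = XDocument.Parse(rssXml);

        foreach (XElement description in doc.Descendants("description"))
        {
            // description.Value is already XML-decoded by the parser.
            foreach (Match m in ImgSrc.Matches(description.Value))
            {
                urls.Add(m.Groups[1].Value);
            }
        }
        return urls;
    }
}
```

This still leans on a regex for the embedded HTML fragment, so it inherits the same assumption that src is the first attribute of each img tag.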
/(.*\/images.*)/

Regex to find WSDL files in HTML

I am writing a discover service that takes a URL and returns the HTML located at that page.
From that page, I need to "scrape" all the WSDL URLs.
So I need something like the following, but I am not sure how to specify the regex to pass into the pattern matching.
string wsdlPattern = //SOME REGEX THAT MATCHES WSDL http:{address}wsdl
Regex wsdlRegex = new Regex(wsdlPattern);
MatchCollection matches = wsdlRegex.Matches(html);
Can somebody please help me figure how I can do this?
Try this:
http://[^\s]*?\.wsdl
The regular text parts are obvious: it needs to start with http:// and end with .wsdl. [^\s] means "any non-whitespace character", and *? means "as few as possible" (this is necessary in case you have something like http://www.blah.com/a.wsdl<br>http://www.blah.com/b.wsdl. Without the ?, you'd match that whole thing as one string.)
This isn't perfect, but it should get you started.
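To make the lazy-versus-greedy point concrete, here is a small sketch that runs both variants over the same input. The class name WsdlFinder and its two methods are invented for the illustration. The dot before wsdl is escaped so that it only matches a literal dot.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WsdlFinder
{
    // Lazy: stop at the first ".wsdl" after each "http://".
    private static readonly Regex Lazy = new Regex(@"http://[^\s]*?\.wsdl");

    // Greedy: run on to the last ".wsdl" before any whitespace.
    private static readonly Regex Greedy = new Regex(@"http://[^\s]*\.wsdl");

    public static List<string> FindLazy(string html) => Collect(Lazy, html);

    public static List<string> FindGreedy(string html) => Collect(Greedy, html);

    private static List<string> Collect(Regex regex, string html)
    {
        var urls = new List<string>();
        foreach (Match m in regex.Matches(html))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

On the input http://www.blah.com/a.wsdl<br>http://www.blah.com/b.wsdl, FindLazy returns the two URLs separately, while FindGreedy returns the whole run as a single match.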
If you want to play with regex, this is a great resource:
http://www.gskinner.com/RegExr
I used the RE below for validating WSDL URLs; as you can see, I had to check that they end with "?wsdl".
RE : (http|https):\/\/[^\s]*?.\?wsdl
Ignore Case : (?i)(http|https):\/\/[^\s]*?.\?wsdl(?-i)
( Test Case : http://localhost/WebService1.asmx?wSDl )
wsdls can be uploaded using ftp and files as well therefore:
(http|https|ftp|file)://[^\s]*?.(wsdl|WSDL)
Hope this helps!

How to extract string between 2 markers using Regex in .NET?

I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.
I've tried the following with no success:
var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.
What am I missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;
Regex is not meant for this kind of HTML handling, as many here would say. Without your sample web page or HTML, I can only suggest removing the non-greedy ? quantifier in (.*?) and trying again. After all, an HTML page will have only one head and one body.
Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:
Un-escape the angle brackets. With the @ before your string, the backslashes go through to the regex, and angle brackets do not need to be escaped in a .NET regex.
With your regex, you need to make sure that the head/body tag combinations do not have any white-space between them.
With your regex, the body tag cannot have any attributes.
I would suggest something more like:
(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)
this seems to work for me on the source of this page!
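A minimal sketch of that suggestion in C# follows. RegexOptions.Singleline is an assumption added here so that . also matches line breaks inside the body. Whether that explains the truncation described in the question is not something the thread establishes. The class name BodyExtractor is hypothetical. The pattern relies on .NET supporting variable-length lookbehind.

```csharp
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Lookbehind: "</head>", optional whitespace, then a <body> tag that
    // may carry attributes. Lookahead: "</body>", optional whitespace, "</html>".
    private static readonly Regex Body = new Regex(
        @"(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between the body tags, or an empty string if no match.
    public static string Extract(string html)
    {
        Match m = Body.Match(html);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Both lookarounds are zero-width, so Match.Value is just the body content. As the other answers say, an HTML parser remains the more robust choice.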
As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.
First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.
As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?

Get URL from HTML code using a regular expression

Consider:
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
What is the regular expression to get http://anirudhagupta.blogspot.com/
from the following?
<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
If you suggest something in C# that's good. I also like jQuery to do this.
If you want to use jQuery you can do the following.
$('a').attr('href')
Quick and dirty:
href="(.*?)"
Ok, let's go with another regex for parsing URLs. This comes from RFC 2396 - URI Generic Syntax: Parsing a URI Reference with a Regular Expression
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Of course, your HTML may also contain relative URLs, which you'll need to handle in another way; I recommend the C# Uri constructor (Uri, String).
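As a small illustration of the Uri(Uri, String) constructor mentioned above, the sketch below resolves an href against the address of the page it came from. The LinkResolver class and the example.com addresses in the usage note are placeholders invented for the example.

```csharp
using System;

public static class LinkResolver
{
    // Resolves an href (relative or absolute) against the page's own URL.
    public static string Resolve(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        // An absolute href is returned as-is; a relative one is combined
        // with the base according to the usual URI resolution rules.
        return new Uri(baseUri, href).AbsoluteUri;
    }
}
```

For instance, resolving ../img/a.png against http://example.com/blog/index.html gives http://example.com/img/a.png, while an already absolute href passes through unchanged.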
The simplest way to do this is using the following regular expression.
/href="([^"]+)"/
This will get all characters from the first quote until it finds a character that is a quote. This is, in most languages, the fastest way to get a quoted string, that can't itself contain quotes. Quotes should be encoded when used in attributes.
UPDATE: A complete Perl program for parsing URLs would look like this:
use 5.010;
my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;
    push @matches, m/href='([^']+)'/gi;
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;
}
say for @matches;
It reads from stdin and prints all URLs. It takes care of the three possible quotes. Use it with curl to find all the URLs in a webpage:
curl url | perl urls.pl
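Since the question asked for C#, the same three quoting cases could be handled roughly as follows. This is a sketch rather than a port of the Perl program: it folds the three patterns into one alternation and reuses the group name url in each branch, which .NET permits. The class name HrefExtractor is made up.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Three alternatives: double-quoted, single-quoted, and unquoted values.
    private static readonly Regex Href = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^""'\s>]+))",
        RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            // Whichever branch matched, its capture is available as "url".
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Like the Perl version, this matches any href attribute it sees, including ones inside comments or scripts.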
The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to worry about parsing at all.
You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.
data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
for item in data.split("</a>"):
    if "<a href" in item:
        start_of_href = item.index("<a href")  # get where <a href=" is
        print(item[start_of_href + len('<a href="'):])  # print substring after <a href=" onwards.
The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each split field, check for "href", then take the substring after "href". Those will be your links.

Why is a left parenthesis being escaped in this Regex?

I'm using an HTML sanitizing whitelist code found here:
http://refactormycode.com/codes/333-sanitize-html
I needed to add the "font" tag as an additional tag to match, so I tried adding this condition after the <img tag check
if (tagname.StartsWith("<font"))
{
// detailed <font> tag checking
// Non-escaped expression (for testing in a Regex editor app)
// ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
if (!IsMatch(tagname, @"<font
(\s*size=""\d{1}"")?
(\s*color=""((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
\s*?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
}
Aside from the condition above, my code is almost identical to the code in the page I linked to. When I try to test this in C#, it throws an exception saying "Not enough )'s". I've counted the parentheses several times and I've run the expression through a few online JavaScript-based regex testers and none of them seem to tell me of any problems.
Am I missing something in my Regex that is causing a parenthesis to escape? What do I need to do to fix this?
UPDATE
After a lot of trial and error, I remembered that the # sign is a comment in regexes. The key to fixing this is to escape the # character. In case anyone else comes across the same problem, I've included my fix (just escaping the # sign)
if (tagname.StartsWith("<font"))
{
// detailed <font> tag checking
// Non-escaped expression (for testing in a Regex editor app)
// ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
if (!IsMatch(tagname, @"<font
(\s*size=""\d{1}"")?
(\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
\s*?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
}
Your IsMatch method is using the option RegexOptions.IgnorePatternWhitespace, which allows you to put comments inside the regular expression, so you have to escape the # character; otherwise it will be interpreted as the start of a comment.
if (!IsMatch(tagname,@"<font(\s*size=""\d{1}"")?
(\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
\s?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
I don't see anything obviously wrong with the regex. I would try isolating the problem by removing pieces of the regex until the problem goes away and then focus on the part that causes the issue.
It works fine for me... what version of the .NET framework are you using, and what is the exact exception?
Also - what does your IsMatch method look like? Is this just a pass-thru to Regex.IsMatch?
[update] The problem is that the OP's example code didn't show they are using the IgnorePatternWhitespace regex option; with this option it doesn't work; without this option (i.e. as presented) the code is fine.
Download Chris Sells Regex Designer. It's a great free tool for testing .NET regexes.
I'm not sure this regex is going to do what you want because it depends on the order of the attributes matching what you have in the regex. If for example face="Arial" preceded size="5" then face= wouldn't match.
There are some escaping problems in your regex. You need to escape your " with \. You need to escape your # with \. You need to use \s in Courier New instead of just the space. You need to use the RegexOptions.IgnorePatternWhitespace and RegexOptions.IgnoreCase options.
<font
(\s+size=\"\d{1}\")?
(\s+color=\"((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)\")?
(\s+face=\"(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)\")?
The # characters are what was causing the exception with the somewhat misleading missing ) message.
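The effect described in this thread can be reproduced in isolation. The sketch below is an illustration under assumptions: the FontTagCheck class and its method names are invented, and the pattern is a tidied variant of the ones above rather than the original sanitizer code. It shows that, under RegexOptions.IgnorePatternWhitespace, an unescaped # swallows the rest of the line (including the closing parenthesis), while \# and \s keep the literal characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontTagCheck
{
    // With IgnorePatternWhitespace, unescaped whitespace is ignored and '#'
    // starts a comment, so a literal '#' is written \# and the space as \s.
    private static readonly Regex FontTag = new Regex(@"
        ^<font
        (\s+size=""\d"")?
        (\s+color=""(\#[0-9a-f]{6}|\#[0-9a-f]{3}|red|green|blue|black|white)"")?
        (\s+face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
        \s*>$",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    public static bool IsAllowed(string tag) => FontTag.IsMatch(tag);

    // Returns false when the pattern fails to parse under IgnorePatternWhitespace.
    public static bool Parses(string pattern)
    {
        try
        {
            new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Here Parses(@"(#[0-9a-f]{6})") reports a parse failure, because everything after the # is treated as a comment and the group is never closed, whereas Parses(@"(\#[0-9a-f]{6})") succeeds. As noted above, the pattern still depends on the attributes appearing in size, color, face order.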
