C# extracting certain parts of a string - c#

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!

Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian

?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)

I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
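Putting the corrections from the answers together, a minimal sketch might look like the following. The class name PeopleCountExtractor, the method name Extract and the group name count are made up for illustration; the pattern assumes the span always looks exactly like the fragment in the question and contains only digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PeopleCountExtractor
{
    // The named group "count" captures the digits between the opening and closing span tags.
    // < and > have no special meaning in a .NET regex, so they need no escaping.
    private static readonly Regex CountRegex = new Regex(
        "<span class=\"peopleCount\">(?<count>\\d+)</span>",
        RegexOptions.CultureInvariant);

    // Returns the captured text, or null when the span is not found.
    public static string Extract(string responseHtml)
    {
        Match match = CountRegex.Match(responseHtml);
        return match.Success ? match.Groups["count"].Value : null;
    }
}
```

Calling PeopleCountExtractor.Extract(responseHtml) on the fragment from the question should return the string 1001. If the markup can vary (extra attributes, whitespace, different quoting), an HTML parser is likely the more robust choice.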

Related

Regular expression Google image

I am writing an RSS reader and need to find the image URL in a Google RSS feed using a regular expression.
The RSS channel is https://news.google.com/?output=rss.
An example image tag from the feed:
<img src="//t0.gstatic.com/images?q=tbn:ANd9GcRfMZ3MOzznCthFKCdIan17n9B8vZvEE-tRSQVTcgJa5i1OPfdf90zi4mBuGzPfB7Bj2mwE0TE" alt="" border="1" width="80" height="80" />
btw. I use regex expressions:
Regex regx = new Regex("\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
RegexOptions.IgnoreCase);
Some advice?
First, you should not parse XML with regex; use XmlDocument, an XML reader, or a similar parser.
If you know what you are doing, here is the quick and dirty regex solution.
All image tags in your feed seem to be inside description tags, and they are of course XML-encoded (keep that in mind for the next few steps).
Next, look at some example img tags.
Are you also looking for img tags without a src attribute, or with an empty one?
Overall: define what you are looking for.
Design your regex.
Because the feed is generated automatically, the attributes seem to be in the same order every time (we use that fact for a shorter regex).
Each img tag starts with < (but remember that the feed is XML-encoded, so in the raw feed it appears as &lt;).
Look for &lt; followed by img (current regex: &lt;img)
Next comes at least one whitespace character (current regex: &lt;img\s+)
The src attribute is always the first attribute (in this case), if present, so we select src=" (current regex: &lt;img\s+src=")
Next, select the URL itself with .*, but be careful: the * quantifier is greedy, so we have to use the lazy form .*? and finally close with "
Final regex: &lt;img\s+src="(.*?)" Make sure to put parentheses around the URL for easy group selection.
Last step: C# code
// quick & dirty :-)
// requires: using System.Diagnostics; using System.IO; using System.Net; using System.Text.RegularExpressions;
var url = "https://news.google.com/?output=rss";
// verbatim string: a literal " is written as ""
var regex = @"&lt;img\s+src=""(.*?)""";
var RssContent = new StreamReader(((HttpWebRequest)HttpWebRequest.Create(url)).GetResponse().GetResponseStream()).ReadToEnd();
foreach (Match match in Regex.Matches(RssContent, regex))
{
// print img urls (group 1 is the captured URL)
Debug.WriteLine(match.Groups[1].Value);
}
PS: If you are trying to write an RSS reader, you should NOT use regex to parse HTML at all! Try to find a way to transform the HTML into XAML and write your reader in WPF, or start by studying some open source RSS readers to learn more about these problems.
You can use the regex pattern below:
/(.*\/images.*)/
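The advice to let an XML parser handle the feed can be sketched roughly as follows. This is only an illustration under stated assumptions: the feed is RSS 2.0 with un-namespaced description elements, the class and method names (FeedImageExtractor, ExtractImageUrls) are invented here, and a regex is still used on the decoded description HTML, which a real reader would probably hand to an HTML parser instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class FeedImageExtractor
{
    // Matches an img tag and captures the value of its double-quoted src attribute.
    private static readonly Regex ImgSrcRegex = new Regex(
        @"<img\s+[^>]*?src=""(?<src>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Parses the feed as XML and collects every img src found in description elements.
    public static List<string> ExtractImageUrls(string rssXml)
    {
        var urls = new List<string>();
        XDocument feed = XDocument.Parse(rssXml);
        foreach (XElement description in feed.Descendants("description"))
        {
            // Value returns the decoded text, so an encoded &lt;img reads as <img here.
            foreach (Match match in ImgSrcRegex.Matches(description.Value))
            {
                urls.Add(match.Groups["src"].Value);
            }
        }
        return urls;
    }
}
```

Because the XML parser decodes the entities, the pattern can be written against plain <img rather than the encoded form, and it no longer depends on src being the first attribute.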

Extracting image link through regex in C#

I have a bunch of links in this format
http://imgur.com/a/bwBpM
http://imgur.com/a/bwBpM[/IMG]
[IMG]http://imgur.com/a/bwBpM
[IMG]http://imgur.com/a/bwBpM[/IMG]
The IMG tags are only supplied in some cases, and I want to extract the link, i.e. http://imgur.com/a/bwBpM in this case. Is there an easy way to do this through regex in C#?
If you're saying that you have the text in the question in some kind of list and they are always either in the format of:
Just the Url
the Url + partial or full tags
then the easiest thing to do is to run:
url = url.Replace("[IMG]", "").Replace("[/IMG]", "");
if there are no tags then there is no change, but if the tags are there they will be stripped out.
You could use this pattern:
^(?:\[IMG\])?([^[]*)(?:\[/IMG\])?$
You can get the output using:
var match = Regex.Match(input, @"^(?:\[IMG\])?([^[]*)(?:\[/IMG\])?$");
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // http://imgur.com/a/bwBpM
}

Regex for Removing Comma between <a> tag text C#

I have the following string. I have tried many regexes to remove the comma inside the text of an a tag, but none of them worked. Whenever the text inside an a tag contains a comma, I want the comma replaced with an empty string.
<a href="...">Getty Center, Restaurant at the</a>
I have tried this regex, but it is not working; here input is the string that contains the HTML.
input = Regex.Replace(input, @"<a(\s+[^>]*)?>[^\w\s]</a(\s+[^>]*)?>", "");
Please help me out. Thank you.
You can use the Regex to find and modify the content of the tag like so.
var input = "<a href=\"...\">Getty Center, Restaurant at the</a>";
var regex = new Regex(@"<a[^>]*>(?<content>.*?)</a[^>]*>",
RegexOptions.Singleline);
var match = regex.Match(input);
while (match.Success) {
var group = match.Groups["content"];
// rebuild the string with the commas removed from the tag content
input = input.Substring(0, group.Index)
+ group.Value.Replace(",", "")
+ input.Substring(group.Index + group.Length);
// continue searching after the start of the content just processed
match = regex.Match(input, group.Index);
}
The loop is in place to catch multiple tags in the same string. The Regex however is fairly naive. It will mess with tags nested inside the A tag, and will parse incorrectly if a > is in any of the attributes. (Though that would probably be bad HTML anyway.) A proper HTML parser is recommended for this reason.
I would suggest using an HTML parser. There are plenty available that are open source and free. One of the best I have found is HtmlAgilityPack.
In a nutshell, the following code snippet will give you all img tags:
HtmlDocument myDoc = new HtmlDocument();
myDoc.Load(path);
HtmlNodeCollection imgs = myDoc.DocumentNode.SelectNodes("//img");
Hope that helps
If you want to directly use the replace, you will have to match only the comma and not the text before or after the comma. You'd have to use look ahead and look behind to check if the comma is in the tag. Although this is doable, it is not advised to do this.
An alternative is to use matching groups to match the whole text in the tag and group the comma if it exists and replace the match.
<a[^>]+>[\w\s]*(,?)[\w\s]*<\/a>
The first capture group captures the comma if present. You can test it here: http://rubular.com/r/K2jjIaObty
The best option would be to use a html parser to capture contents of the a tag, search for comma and replace.
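As a middle ground between the manual loop and a full parser, Regex.Replace accepts a match evaluator, which lets the replacement be computed per match. The sketch below is an assumption-laden illustration, not a robust solution: the names AnchorCommaRemover and RemoveCommas are invented, and the pattern shares the limitations already mentioned (nested tags, a > inside an attribute value).

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorCommaRemover
{
    // Three named groups: the opening tag, the text between the tags, and the closing tag.
    private static readonly Regex AnchorRegex = new Regex(
        @"(?<open><a\b[^>]*>)(?<content>.*?)(?<close></a\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes commas from the content of every a tag and leaves the rest of the string untouched.
    public static string RemoveCommas(string html)
    {
        return AnchorRegex.Replace(html, m =>
            m.Groups["open"].Value
            + m.Groups["content"].Value.Replace(",", "")
            + m.Groups["close"].Value);
    }
}
```

Since the evaluator runs once per match, multiple a tags in the same string are handled without re-running the regex by hand, and commas outside the tags are preserved.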

Strip out content between and including h2 tag

I am trying to strip the content from between the h2 tags in a string using a Regex in C#:
<h2>content needs removing</h2> other content...
I have the following Regex, which according to the Regex buddy software I used to test it, should work, but it doesn't:
myString = Regex.Replace(myString, @"<h[0-9]>.*</h[0-9]>", String.Empty);
I have another Regex that is run after this to remove all other HTML tags, it is called in the same way and works fine. Can anyone help me out with why this isn't working?
Don't use Regular Expressions.
HTML is not a Regular Language, thus it can't be parsed correctly with a Regular Expression.
For example, your Regex would match:
<h2>sample</h1>
which is not valid. When dealing with nested structures, this would lead to unexpected results (.* is greedy and matches everything until the last closing h[0-9] tag in your input HTML string)
You can use XMLDocument (HTML is not XML but that would be sufficient for what you're trying to do) or you can use Html Agility Pack.
Try this code:
String sourcestring = "<h2>content needs removing</h2> other content...";
String matchpattern = @"\s?<h[0-9]>[^<]+</h[0-9]>\s?";
String replacementpattern = "";
MessageBox.Show(Regex.Replace(sourcestring,matchpattern,replacementpattern));
[^<]+ is safer than .+ because it stops matching as soon as it sees a <.
This works fine for me:
string myString = "<h2>content needs removing</h2> other content...";
Console.WriteLine(myString);
myString = Regex.Replace(myString, "<h[0-9]>.*</h[0-9]>", string.Empty);
Console.WriteLine(myString);
Displays:
<h2>content needs removing</h2> other content...
other content...
As expected.
If your problem is that your real case has several different heading tags, then you have an issue with the greedy * quantifier. It will create the longest match that it can. For example, if you have:
<h2>content needs removing</h2> other content...<h3>some more headings</h3> and some other stuff
You will match everything from <h2> to </h3> and replace it. To fix this, you need to use a lazy quantifier:
myString = Regex.Replace(myString, "<h[0-9]>.*?</h[0-9]>", string.Empty);
Will leave you with:
other content... and some other stuff
Note, however, that this will not fix nested <h> tags. As @fardjad said, using regex for HTML isn't generally a good idea.

Find and replace - should I use Regex?

I need to create a simple markup fix, and I have already done everything else I need, such as bold and italic. But this one is a bit harder than what I've done so far, and I have no idea how to do it. Basically, my input is very simple:
[imgGroup="group1"]
image1.jpg
[/imgGroup]
As you can see, I pass a parameter, group1, and inside I have image1.jpg. I need to convert this into a link that contains the image and has the group in its rel attribute, like so:
<a href="image1.jpg" rel="group1" >
<img src="image1.jpg" />
</a>
I think that I will need to use Regex for this problem, however I only know how to find something in between 2 tags, not so much for this problem... I'm using ASP.NET MVC3 with C#.
You could use named groups in RegEx to match, then you can just re-assemble them into the order you want:
var regex = new Regex(@"(?<first>\d\s)(?<second>[a-z])"); // Set up your regex with named groups
var result = regex.Replace("inputstring", "${second} ${first}"); // Replace each match with the named groups re-assembled in the order you want
Be warned though, RegEx with things like Urls and HTML parsing can very, very quickly turn into a horror beyond your wildest dreams ;)
Good luck!
Named groups for RegEx in dot net
Here is my suggestion for you:
var regex = new Regex(@"\[imgGroup=""group1""\]\s*(?<Content>\S*)\s*\[/imgGroup\]");
var newValue = regex.Replace(oldValue, @"<a href=""${Content}"" rel=""group1"" ><img src=""${Content}"" /> </a>");
That should do what you've expected.
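If the group name should not be hard-coded, one possible generalisation is to capture it as a second named group. The sketch below is an assumption, not part of the answer above: the names ImgGroupMarkup and ToHtml are invented, the group value is assumed to be double-quoted as in the question, and the captured values are inserted without HTML encoding, which real user input would need.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgGroupMarkup
{
    // "group" captures the quoted parameter, "image" the file name between the tags.
    private static readonly Regex ImgGroupRegex = new Regex(
        @"\[imgGroup=""(?<group>[^""]*)""\]\s*(?<image>\S+)\s*\[/imgGroup\]",
        RegexOptions.CultureInvariant);

    // Replaces every [imgGroup="..."]...[/imgGroup] block with a linked image.
    public static string ToHtml(string input)
    {
        return ImgGroupRegex.Replace(
            input,
            @"<a href=""${image}"" rel=""${group}""><img src=""${image}"" /></a>");
    }
}
```

The ${name} substitution syntax refers to the named groups directly, so the replacement string stays readable even if more groups are added to the pattern later.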
