Finding what's between two entered strings using regex? - C#

I am working on a simple Facebook Messenger client (without the need for a developer account), and so far I have managed to get all my messages: name, preview, and time. What I'd like to find is each user's href link.
So far I have this:
MatchCollection name = Regex.Matches(
htmlText, "<div class=\"_l2\">(.*?)</div>");
MatchCollection preview = Regex.Matches(
htmlText, "<div class=\"_l3 fsm fwn fcg\">(.*?)</div>");
MatchCollection time = Regex.Matches(
htmlText, "<div class=\"_l4\">(.*?)</div>");
which works fully.
I've tried a few things that I found on this website, but nothing seemed to work. The href starts with: <a class="_k_ hoverZoomLink" rel="ignore" href="
and ends with a ". Could someone point me to an article that might help me work out how to get that href? A better way of doing it than regex would also be welcome, but I would really prefer regex:
for (int i = 0; i < name.Count; i++)
{
String resultName = Regex.Replace(name[i].Value, @"<[^>]*>", String.Empty);
String newName = resultName.Substring(0, resultName.Length - 5);
String resultPreview = Regex.Replace(preview[i].Value, @"<[^>]*>", String.Empty);
String s = time[i].Value;
int start = s.IndexOf("data-utime=\"") + 28;
int end = s.IndexOf("</abbr>", start);
String newTime = s.Substring(start, (end - start));
threads.Add(new Thread(newName, resultPreview, newTime, ""));
}
Thanks in advance.

Use a real html parser like HtmlAgilityPack
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
var link = doc.DocumentNode.SelectSingleNode("//a[@class='_k_ hoverZoomLink']")
.Attributes["href"].Value;
Instead of XPath, you can use Linq too
var link = doc.DocumentNode.Descendants("a")
.Where(a => a.Attributes["class"] != null)
.First(a => a.Attributes["class"].Value == "_k_ hoverZoomLink")
.Attributes["href"].Value;
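If you do stay with regex, as the question prefers, here is a minimal sketch that follows the same style as the existing Regex.Matches calls. It assumes the attributes always appear in the order quoted in the question (class, rel, href); the sample markup and the /messages/... paths are made up for illustration. It is written as a C# 9 top-level program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample markup; the real page's attributes may differ.
string htmlText =
    "<a class=\"_k_ hoverZoomLink\" rel=\"ignore\" href=\"/messages/alice\">Alice</a>" +
    "<a class=\"_k_ hoverZoomLink\" rel=\"ignore\" href=\"/messages/bob\">Bob</a>";

// Capture everything between href=" and the next double quote.
MatchCollection links = Regex.Matches(
    htmlText,
    "<a class=\"_k_ hoverZoomLink\" rel=\"ignore\" href=\"(.*?)\"");

foreach (Match m in links)
{
    // Groups[1] holds only the captured part; m.Value would also include the tag prefix.
    Console.WriteLine(m.Groups[1].Value);
}
```

Reading Groups[1].Value instead of Value also avoids having to strip the surrounding tags afterwards.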

Related

Extract Portion of the URL after "Home"

I need to extract the rest of the portion of the URL after "home"
for example the URLs can be
https://www.example.com/site/country/home/products
https://www.example.com/site/country/home/products/consumer
https://www.example.com/site/country/home/products/consumer/kids
The keywords in the url "site", "country" might change.
All I need in the output is:
/products
/products/consumer
/products/consumer/kids
I tried using Regex, but it didn't work in the above situation.
As suggested by Corion and David in the comments, in this case, the simplest method is probably just to find the index of /home/ and then strip everything up to that point (but not the second /):
string home = "/home/";
int homeIndex = url.IndexOf(home);
string relativeUrl = url.Substring(homeIndex + home.Length - 1);
Using a regular expression, you want to match the /home/ substring, and capture the second / and everything following it:
Match match = Regex.Match(url, @"/home(/.*)");
string relativeUrl = "/";
if (match.Success) {
relativeUrl = match.Groups[1].Value;
}
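One thing to watch with the IndexOf version: if the URL has no /home/ part, IndexOf returns -1 and Substring cuts at an arbitrary position instead of failing. A minimal sketch with that case guarded might look like this. The helper name AfterHome and the "/" fallback (borrowed from the regex version) are my own assumptions, and it is written as a C# 9 top-level program.

```csharp
using System;

// Hypothetical helper: returns the part of the URL after "/home",
// or "/" when the URL has no "/home/" part.
static string AfterHome(string url)
{
    const string home = "/home/";
    int homeIndex = url.IndexOf(home, StringComparison.Ordinal);
    if (homeIndex < 0)
    {
        return "/"; // same fallback as the regex version
    }
    // Keep the trailing slash of "/home/" as the leading slash of the result.
    return url.Substring(homeIndex + home.Length - 1);
}

Console.WriteLine(AfterHome("https://www.example.com/site/country/home/products")); // /products
Console.WriteLine(AfterHome("https://www.example.com/site/country/about"));         // /
```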
Here is some very simple C# code that I think may help you:
string sub = "https://www.example.com/site/country/home/products";
string temp = "";
string[] ss = sub.Split('/');
for (int i = 0; i < ss.Length; i++) // iterate over the split parts, not the characters of sub
{
if (ss[i] == "home")
{
i++;
for (int j = i; j < ss.Length; j++)
temp +='/'+ ss[j];
break;
}
}
Console.WriteLine(temp);
You could use the System.Uri class to extract the segments of the URL:
Uri link = new Uri("https://www.example.com/site/country/home/products/consumer/kids");
string[] segs = link.Segments;
int idxOfHome = Array.IndexOf(segs, "home/");
string restOfUrl = string.Join("", segs, idxOfHome+1, segs.Length - (idxOfHome + 1));
Yields:
products/consumer/kids
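Note that this output has no leading slash, while the question asks for /products/consumer/kids. A small sketch that adds the slash and handles a missing home/ segment could look like the following. The helper name AfterHomeSegment and the "/" fallback are assumptions of mine, and it is written as a C# 9 top-level program.

```csharp
using System;

// Hypothetical helper built on Uri.Segments: returns "/" plus everything
// after the "home/" segment, or "/" if there is no such segment.
static string AfterHomeSegment(string url)
{
    var link = new Uri(url);
    string[] segs = link.Segments;
    int idxOfHome = Array.IndexOf(segs, "home/");
    if (idxOfHome < 0)
    {
        return "/";
    }
    string rest = string.Join("", segs, idxOfHome + 1, segs.Length - (idxOfHome + 1));
    return "/" + rest;
}

Console.WriteLine(AfterHomeSegment("https://www.example.com/site/country/home/products/consumer/kids"));
```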
This is easy with a regex. Please try the following pattern against your scenario:
Regex: '(?<=\/home).*\b'
There is no need to worry about the portion before home. The lookbehind matches only what comes after /home.

How to extract an url from a String in C#

I have this string :
"<figure><img
src='http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg'
href='JavaScript:void(0);' onclick='return takeImg(this)'
tabindex='1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>"
How can I retrieve this link :
http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg
All the strings have the same format, so I need to get the substring between src= and href, but I don't know how to do that. Thanks.
If you parse HTML, do not use string methods; use a real HTML parser like HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is your string
var linksAndImages = doc.DocumentNode.SelectNodes("//a/@href | //img/@src");
var allSrcList = linksAndImages
.Select(node => node.GetAttributeValue("src", "[src not found]"))
.ToList();
You can use regex:
var src = Regex.Match("the string", "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
In general, you should use an HTML/XML parser when parsing a value from HTML code, but with a limited string like this, Regex would be fine.
string url = Regex.Match(htmlString, @"src='(.*?)'").Groups[1].Value;
If your string is always in same format, you can easily do this like so :
string input = "<figure><img src='http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg' href='JavaScript:void(0);' onclick='return takeImg(this)' tabindex='1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
// The link sits between the first pair of ' characters, so find both of them:
int start = input.IndexOf('\'') + 1;
int end = input.IndexOf('\'', start);
input = input.Substring(start, end - start);
// input now holds only the URL from the src attribute
return input;
string str = "<figure><imgsrc = 'http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg'href = 'JavaScript:void(0);' onclick = 'return takeImg(this)'tabindex = '1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
int pFrom = str.IndexOf("src = '") + "src = '".Length;
int pTo = str.LastIndexOf("'href");
string url = str.Substring(pFrom, pTo - pFrom);
Source :
Get string between two strings in a string
q is your string in this case. I look for the index of the attribute you want (src = '), remove the first few characters (7, including spaces), and then find where the text ends by looking for the next '.
When removing the first few characters, you could use .IndexOf to work out how many to delete so it's not hard-coded.
string q =
"<figure><img src = 'http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg' href = 'JavaScript:void(0);' onclick = 'return takeImg(this)'" +
"tabindex = '1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
string z = q.Substring(q.IndexOf("src = '"));
z = z.Substring(7);
z = z.Substring(0, z.IndexOf("'"));
MessageBox.Show(z);
This is certainly not the most elegant way (look at the other answers for that :)).

How can i escape "?

if (richTextBox1.Lines[i].StartsWith(@"<a href=""") ||
richTextBox1.Lines[i].EndsWith(@""""))
The StartsWith should be <a href="
The EndsWith should be one single "
But the way it is now i'm getting no results.
Input for example:
Screen-reader users, click here to turn off ggg Instant.
I need to get this part:
/setprefs?suggon=2&prev=https://www.test.com/search?q%3D%2Band%2B%26espv%3D2%26biw%3D960%26bih%3D489%26source%3Dlnms%26tbm%3Disch%26sa%3DX%26ei%3DYrxxVb-hJqac7gba0YOgDQ%26ved%3D0CAYQ_AUoAQ&sig=0_seDQVVTDQQx1hvN3BRktZNFc9Ew%3D
The part between the
I also tried to use htmlagilitypack:
HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("https://www.test.com");
foreach (HtmlAgilityPack.HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!newHtmls.Contains(hrefValue) && hrefValue.Contains("images"))
newHtmls.Add(hrefValue);
}
But this gave me only 1 link.
When I view the page source in the browser and search for the word image or images, I get over 350 results.
I tried also this solution:
var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
But it didn't give me the results I needed.
I forgot to mention that I copied the page's view-source content into a richTextBox1 window and then read the text from richTextBox1 line by line, so maybe that's why I'm not getting the results I need?
for (int i = 0; i < richTextBox1.Lines.Length; i++)
{
if (richTextBox1.Lines[i].StartsWith("<a href=\"") &&
richTextBox1.Lines[i].EndsWith("\""))
{
listBox1.Items.Add(richTextBox1.Lines[i]);
}
}
Maybe the view-source content as shown in the browser (Chrome) is not the same as in richTextBox1. And maybe I should not read it line by line from richTextBox1, but read the whole text from richTextBox1 first?
Based on your input, EndsWith isn't going to help (as your input actually ends with </a>). Your next-best option would be to store the location (position) of href=", then look for the next occurrence of a " beginning at your stored location, e.g.
var input = @"Screen-reader users, click here to turn off ggg Instant.";
var needle = @"href=""";
var start = input.IndexOf(needle);
if (start != -1)
{
start += needle.Length;
var end = input.IndexOf(@"""", start);
// final result:
var href = input.Substring(start, end - start).Dump();
}
Better than that would be to use an actual HTML parser (might I recommend HtmlAgilityPack?).
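On the question of reading line by line versus reading the whole text: the same IndexOf idea can be run in a loop over the full text, so links are found wherever they sit on a line. This is only a sketch. The helper name FindHrefs and the sample markup are made up, and in the question's setting the input would presumably be the whole text of richTextBox1 rather than individual lines. It is written as a C# 9 top-level program.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: collects every href="..." value found in the text.
static List<string> FindHrefs(string input)
{
    const string needle = "href=\"";
    var results = new List<string>();
    int start = input.IndexOf(needle, StringComparison.Ordinal);
    while (start != -1)
    {
        start += needle.Length;
        int end = input.IndexOf('"', start);
        if (end == -1)
        {
            break; // unterminated attribute, stop scanning
        }
        results.Add(input.Substring(start, end - start));
        // Continue searching after the closing quote.
        start = input.IndexOf(needle, end + 1, StringComparison.Ordinal);
    }
    return results;
}

// Hypothetical sample text standing in for the page source.
string page = "<a href=\"/images/1\">one</a>\n<p>text <a href=\"/images/2\">two</a></p>";
foreach (string href in FindHrefs(page))
{
    Console.WriteLine(href);
}
```

The second link here does not start its line, which is exactly the case a StartsWith check on each line would miss.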

Searching a String using C#

I have the following String "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>"
I require to get the attribute value from the div tag. How can i retrieve this using C#.
Avoid parsing HTML with regex
Regex is not a good choice for parsing HTML files.
HTML is not strict, nor is it regular in its format.
Use HtmlAgilityPack
You can do it like this with HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
List<string> itemList = doc.DocumentNode.SelectNodes("//div[@id]")//selects all div having id attribute
.Select(x=>x.Attributes["id"].Value)//select the id attribute value
.ToList<string>();
//itemList will now contain all div's id attribute value
If you're a masochist you can do this old school VB3 style:
string input = @"</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string startString = "div id='";
int startIndex = input.IndexOf(startString);
if (startIndex != -1)
{
startIndex += startString.Length;
int endIndex = input.IndexOf("'", startIndex);
string subString = input.Substring(startIndex, endIndex - startIndex);
}
Strictly solving the question asked, one of a myriad ways of solving it would be to isolate the div element, parse it as an XElement and then pull the attribute's value that way.
string bobo = "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string justDiv = bobo.Substring(bobo.IndexOf("<div"));
XElement xelem = XElement.Parse(justDiv);
var id = xelem.Attribute("id");
var value = id.Value;
There are certainly lots of ways to solve this but this one answers the mail.
A .NET Regex that looks something like this will do the trick
^</script><div id='(?<attrValue>[^']+)'.*$
you can then get hold of the value as
MatchCollection matches = Regex.Matches(input, @"^</script><div id='(?<attrValue>[^']+)'.*$");
if (matches.Count > 0)
{
var attrValue = matches[0].Groups["attrValue"];
}

Particular value from a string using regex in c#

I need to extract the $Value from the given string.
string text = "<h2 class=\"knownclass unknownclass1 unknownclass2\" title=\"Example title>$Value </h2>";
Using this code:
Match m2 = Regex.Match(text, @"<h2 class=""knownclass(.*)</h2>", RegexOptions.IgnoreCase);
I get the full value: unknownclass1 unknownclass2" title="Example title>$Value . But I just need the $Value part.
Please advise. Thanks in advance.
Assuming the string always follows this format, consider the following code:
var index = text.IndexOf(">");
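// Substring takes a length, not an end index, so subtract the start position.
var value = text.Substring(index + 1, text.IndexOf("<", index) - index - 1);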
As has been said many times, using a regex to parse HTML or XML is a bad idea. Ignoring that, you are capturing too much. Here is an alternative regex that should work.
@"<h2 class=""knownclass[^""]*"">(.*)</h2>"
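A pattern that stops at the closing quote of class expects the tag to end right after the class attribute, which is not the case when a title attribute follows, as in the question's string. A hedged variant that skips any remaining attributes with [^>]* and reads the capture through Groups[1] is sketched below. It assumes no attribute value contains a > character, and it is written as a C# 9 top-level program.

```csharp
using System;
using System.Text.RegularExpressions;

// The string from the question, with the quotes escaped.
string text = "<h2 class=\"knownclass unknownclass1 unknownclass2\" title=\"Example title>$Value </h2>";

// [^>]* skips the rest of the opening tag; (.*?) captures the element text.
Match m = Regex.Match(text, @"<h2 class=""knownclass[^>]*>(.*?)</h2>", RegexOptions.IgnoreCase);
if (m.Success)
{
    // Groups[0] is the whole match; Groups[1] is just the captured text.
    Console.WriteLine(m.Groups[1].Value.Trim()); // $Value
}
```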
If your string always follows the same pattern, you can consider this:
string text = "<h2 class=\"knownclass unknownclass1 unknownclass2\" title=\"Example title>$Value </h2>";
string result = "";
Regex test = new Regex(@"\<.*?\>(.*?)\</h2\>");
MatchCollection matchlist = test.Matches(text);
if (matchlist.Count > 0)
{
for (int i = 0; i < matchlist.Count; i++)
{
result = matchlist[i].Groups[1].ToString();
}
}
But if you are working with XML files or HTML files, I recommend you use XmlTextReader for XML and HtmlAgilityPack for HTML
http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.aspx
http://htmlagilitypack.codeplex.com/
hope it helps!
