Remove URL from a string in C# - c#

I have a string that I parsed from an RSS feed:
thumbnail url='http://photos3.media.pix.ie/11/C5/11C5B77C92204ADBBD0CF5FDF4BA351B-0000314357- 0002211156-00240L-00000000000000000000000000000000.jpg' height='240' width='226'
I need to extract just the URL from the string to use as the source of an image in a Windows Phone 7 application.
What would you suggest as the best way of doing this?
The code from the phone app is here:
FeedItems.ItemsSource = from imageFeed in xmlImageEntries.Descendants("item")
select new PixIEPanoramaTest.Data.FeedItem
{
ThumbSource = imageFeed.Element("thumbnail").Value,
};
FeedItems is just a bound list box. The ThumbSource property just needs the URL from the string.
Any thoughts greatly appreciated,
Rob

You can just access the attribute value for the url attribute:
ThumbSource = imageFeed.Element("thumbnail").Attribute("url").Value,
It would probably be worth using some kind of extension method that returns string.Empty if the attribute is missing, though.
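A minimal sketch of such an extension method, assuming LINQ to XML (System.Xml.Linq) as in the question. The names XElementExtensions and AttributeValueOrEmpty are made up for illustration, and the sample XML is a simplified stand-in for the real feed item.

```csharp
using System;
using System.Xml.Linq;

public static class XElementExtensions
{
    // Hypothetical helper: returns the attribute's value, or string.Empty
    // when either the element or the attribute is missing.
    public static string AttributeValueOrEmpty(this XElement element, string name)
    {
        if (element == null) return string.Empty;
        XAttribute attribute = element.Attribute(name);
        return attribute == null ? string.Empty : attribute.Value;
    }
}

public static class Program
{
    public static void Main()
    {
        // Simplified sample item; the URL is a placeholder.
        XElement item = XElement.Parse(
            "<item><thumbnail url='http://example.com/a.jpg' height='240' width='226' /></item>");

        Console.WriteLine(item.Element("thumbnail").AttributeValueOrEmpty("url"));

        // Element("missing") returns null, but the extension method still
        // returns an empty string instead of throwing.
        Console.WriteLine(item.Element("missing").AttributeValueOrEmpty("url").Length);
    }
}
```

In the query from the question this would read ThumbSource = imageFeed.Element("thumbnail").AttributeValueOrEmpty("url").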

There are libraries available to help parse RSS feeds, e.g.
working with rss + c#
on CodePlex: http://rssr7.codeplex.com/releases/view/43832
If you want to do a quick "hacky" parse of the string yourself, you could use regular expressions. For example, a Regex like thumbnail url='(?<imageurl>http.*?)' height='240' width='226' would pull out your URL as the imageurl group:
Match match = Regex.Match(input, @"thumbnail url='(?<imageurl>http.*?)' height='240' width='226'");
if (match.Success)
{
    // Read the named group rather than relying on its numeric index.
    string url = match.Groups["imageurl"].Value;
}

Related

Replace locale name in URL using regex only (no string last index of)

I have a URL scheme set up with the following structure:
www.foo.com/bar
To choose a locale-specific page (bar), the user can select different values from a dropdown, which appends the locale code to the URL, like:
www.foo.com/en/bar
www.foo.com/za/bar
How can I use regex to replace the part of the URL before bar so that it includes the correct locale? I have the following replacement, which partially works but keeps appending the locale code if I keep selecting different values, as in:
www.foo.com/za/bar
www.foo.com/za/en/bar
string referer = Request.Headers["Referer"].ToString();
string url = Regex.Replace(referer, Request.Host.ToString(), $"{Request.Host.ToString()}/{lang}");
return Redirect(url);
I managed to get it working using:
string url = Regex.Replace(referer, @".*\/", $"{Request.Scheme}://{Request.Host.ToString()}/{lang}/");
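For comparison, here is a sketch of a replacement that swaps an existing locale segment instead of discarding everything up to the last slash. It assumes a locale is exactly two lowercase letters sitting directly after the host, which is only a guess based on the en/za examples; LocaleUrl and WithLocale are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LocaleUrl
{
    // Assumption: a locale is a two-letter lowercase path segment right after
    // the host. "root" is scheme://host, "rest" is the remaining path.
    private static readonly Regex Pattern = new Regex(
        @"^(?<root>\w+://[^/]+)(?:/[a-z]{2}(?=/|$))?(?<rest>/.*)?$");

    public static string WithLocale(string url, string lang)
    {
        // Drop the old locale segment (if any) and insert the new one.
        // URLs that do not match the pattern are returned unchanged.
        return Pattern.Replace(url,
            m => m.Groups["root"].Value + "/" + lang + m.Groups["rest"].Value);
    }
}

public static class Program
{
    public static void Main()
    {
        // Both calls print http://www.foo.com/en/bar
        Console.WriteLine(LocaleUrl.WithLocale("http://www.foo.com/bar", "en"));
        Console.WriteLine(LocaleUrl.WithLocale("http://www.foo.com/za/bar", "en"));
    }
}
```

The obvious weakness is that a genuine two-letter page name right after the host would be mistaken for a locale, so matching against a fixed list of supported codes may be safer.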

Search keywords in Google through a C# Windows application

I want to work on a scraper program that searches for a keyword on Google, but I have a problem getting started.
My problem is:
Suppose a Windows application (C#) has two text boxes and a button control. The first text box contains "www.google.com" and the second contains the keyword, for example:
textbox1: www.google.com
textbox2: "cricket"
I want code for the button click event that will search for "cricket" on Google. If anyone has a programming idea in C#, please help me.
Best regards
I have googled my problem and found a solution to it.
We can use the Google API for this purpose. After adding a reference to the Google API, add the following namespace to the program:
using Google.API.Search;
Write the following code in the button click event:
var client = new GwebSearchClient("http://www.google.com");
var results = client.Search("google api for .NET", 100);
foreach (var webResult in results)
{
//Console.WriteLine("{0}, {1}, {2}", webResult.Title, webResult.Url, webResult.Content);
listBox1.Items.Add(webResult.ToString ());
}
Test my solution and leave comments. Thanks, everybody.
I agree with Paqogomez that you don't appear to have put much work into this but I also understand that it can be hard to get started. Here is some sample code that should get you on the right path.
private void button1_Click(object sender, EventArgs e)
{
string uriString = "http://www.google.com/search";
string keywordString = "Test Keyword";
WebClient webClient = new WebClient();
NameValueCollection nameValueCollection = new NameValueCollection();
nameValueCollection.Add("q", keywordString);
webClient.QueryString.Add(nameValueCollection);
textBox1.Text = webClient.DownloadString(uriString);
}
This code will search for "Test Keyword" on Google and return the results as a string.
The problem with what you are asking is that Google returns the results as HTML, which you will need to parse. I really think you need to do some research on the Google API and what is needed to programmatically request data from Google. Start your search at Google Developers.
Hope this helps get you started on the right path.
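If you want to see what request URL a keyword turns into without touching the network, something like the sketch below may help. SearchUrl.Build is a made-up helper; the "q" parameter name comes from the sample above, and the keyword is escaped with Uri.EscapeDataString.

```csharp
using System;

public static class SearchUrl
{
    // Hypothetical helper: appends the keyword as an escaped "q" parameter.
    public static string Build(string baseUrl, string keyword)
    {
        return baseUrl + "?q=" + Uri.EscapeDataString(keyword);
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints http://www.google.com/search?q=Test%20Keyword
        Console.WriteLine(SearchUrl.Build("http://www.google.com/search", "Test Keyword"));
    }
}
```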
You can use the WebClient class and its DownloadString method for searches, then use a regex to match URLs in the result string.
For example:
WebClient web = new WebClient();
// Escape the keyword so spaces and symbols survive in the query string.
string source = web.DownloadString("https://www.google.com/search?q=" + Uri.EscapeDataString(textBox2.Text));
// No ^/$ anchors: the URLs are embedded in the HTML, not the whole string.
Regex regex = new Regex(@"https?://([\w-]+\.)+[\w-]+(/[\w\-./?%&=]*)?");
MatchCollection collection = regex.Matches(source);
List<string> urls = new List<string>();
foreach (Match match in collection)
{
    urls.Add(match.Value);
}

Pulling data from a webpage, parsing it for specific pieces, and displaying it

I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, put in a game they have they want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do. Using ASP.NET and C# in Visual Studio 2012, I need to get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.
Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning C++ along with C# and ASP.NET, so I keep getting my wires crossed. If someone could point me in the right direction, it would be a big help. Thanks
This small example uses HtmlAgilityPack with XPath selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load(url);
string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:
Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
Select the element in the page that you want the XPath for.
Right click the element in the "Elements" tab.
Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
Make sure you add error handling, because web scraping breaks whenever the site changes the HTML structure of the page and the XPath no longer matches.
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System.Net;
using System.IO;
using System.Text; // needed for Encoding.UTF8
using System.Windows.Forms;
string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
response = request.GetResponse();
reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
result = reader.ReadToEnd();
}
catch (Exception ex)
{
// handle error
MessageBox.Show(ex.Message);
}
finally
{
if (reader != null)
reader.Close();
if (response != null)
response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
og:title
og:type
og:url
og:image
og:site_name
og:description
The format of each tag is: <meta name="og:title" content="In a World...">
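As a rough sketch of that parsing step, the snippet below pulls og: tags out of an HTML string with a regex. It assumes the tags use double quotes with name before content, as in the format above; OpenGraph.Parse is a hypothetical helper and the sample HTML is invented. A real HTML parser will cope better with variations in the markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class OpenGraph
{
    // Assumes tags look like <meta name="og:title" content="...">,
    // with double quotes and name before content.
    private static readonly Regex MetaTag = new Regex(
        @"<meta\s+name=""og:(?<key>[\w:]+)""\s+content=""(?<value>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns a dictionary keyed by the part after "og:" (title, image, ...).
    public static Dictionary<string, string> Parse(string html)
    {
        var tags = new Dictionary<string, string>();
        foreach (Match m in MetaTag.Matches(html))
        {
            // Decode entities such as &amp; in the attribute value.
            tags[m.Groups["key"].Value] = WebUtility.HtmlDecode(m.Groups["value"].Value);
        }
        return tags;
    }
}

public static class Program
{
    public static void Main()
    {
        // Invented sample markup for illustration only.
        string html =
            "<meta name=\"og:title\" content=\"Some Game\">" +
            "<meta name=\"og:site_name\" content=\"Metacritic\">";

        foreach (KeyValuePair<string, string> tag in OpenGraph.Parse(html))
        {
            Console.WriteLine(tag.Key + " = " + tag.Value);
        }
    }
}
```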
I recommend Dcsoup. There's a NuGet package for it, and it uses CSS selectors, so it is familiar if you use jQuery. I've tried others, but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API here.) I absolutely love it.
var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);
// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);
// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
I'd recommend WebsiteParser. It's based on HtmlAgilityPack (mentioned by Hanlet Escaño), but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
[Selector("#BirdthDate")]
[Converter(typeof(DateTimeConverter))]
public DateTime BirdthDate { get; set; }
}
// ...
PersonModel person = WebContentParser.Parse<PersonModel>(html);
Nuget link

Need help for parsing HTML in C#

For personal use I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebResponse result = null;
WebRequest req = WebRequest.Create(Url);
result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
while (sr.Read() != -1)
{
Line = sr.ReadLine();
Line = Regex.Replace(Line, @"<(.|\n)*?>", " ");
Line = Line.Replace(" ", "");
Line = Line.TrimEnd();
Line = Line.TrimStart();
After that I really don't have a clue whether to process it line by line or the whole stream at once, or how to retrieve only each team's name with the number that follows it, which would be the score.
In the end I want to put both teams with their scores in a list or XML to use in a phone application.
If anyone has an idea, it would be great. Thanks!
Take a look at Html Agility Pack
You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
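To give an idea of the LINQ to XML route, here is a sketch that only works on well-formed markup. The table layout, team names and scores are invented for illustration and will not match the real page; ResultParser is a hypothetical helper.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ResultParser
{
    // Assumes each row holds four cells: home team, home score, away score, away team.
    public static XElement Parse(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        var matches = doc.Descendants("tr")
            .Select(row => row.Elements("td").Select(td => td.Value.Trim()).ToArray())
            .Where(cells => cells.Length == 4) // skip header or malformed rows
            .Select(cells => new XElement("match",
                new XAttribute("home", cells[0]),
                new XAttribute("homeScore", int.Parse(cells[1])),
                new XAttribute("awayScore", int.Parse(cells[2])),
                new XAttribute("away", cells[3])));
        return new XElement("matches", matches);
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical, well-formed sample markup.
        string xhtml =
            "<table>" +
            "<tr><td>Lyon</td><td>2</td><td>1</td><td>Marseille</td></tr>" +
            "<tr><td>Lille</td><td>0</td><td>0</td><td>Nice</td></tr>" +
            "</table>";

        Console.WriteLine(ResultParser.Parse(xhtml));
    }
}
```

If the real page is not well-formed XML, XDocument.Parse will throw, which is where a more forgiving HTML parser comes in.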
You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).
You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.

How to match URL in c#?

I have found many examples of how to match particular types of URLs in PHP and other languages. I need to match any URL from my C# application. How can I do this? By URL I mean links to any site, or to files on sites, subdirectories and so on.
I have a text like this: "Go to my awsome website http:\www.google.pl\something\blah\?lang=5" or similar, and I need to get the link out of the message. Links can also start with just www.
If you need to test your regex to find URLs you can try this resource
http://gskinner.com/RegExr/
It will test your regex while you're writing it.
In C# you can use regex for example as below:
Regex r = new Regex(@"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*");
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
while (m.Success)
{
//do things with your matching text
m = m.NextMatch();
}
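Since the question also mentions links that start with just www., here is a deliberately loose sketch that accepts either a scheme or a www. prefix. It assumes forward slashes in the links and is not a validating URL parser; UrlFinder and its pattern are my own illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Loose pattern: http://, https:// or www. followed by anything up to
    // whitespace, a double quote or an angle bracket.
    private static readonly Regex UrlPattern = new Regex(
        @"\b(?:https?://|www\.)[^\s""<>]+", RegexOptions.IgnoreCase);

    public static List<string> Find(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            // Trailing punctuation usually belongs to the sentence, not the link.
            urls.Add(m.Value.TrimEnd('.', ',', ';', ':', '!', '?', ')'));
        }
        return urls;
    }
}

public static class Program
{
    public static void Main()
    {
        string text = "Go to http://www.google.pl/something/blah/?lang=5 or www.example.com.";
        foreach (string url in UrlFinder.Find(text))
        {
            Console.WriteLine(url);
        }
    }
}
```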
Microsoft has a nice page of common regular expressions. This is the one they give for URLs (it works pretty well too):
^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
http://msdn.microsoft.com/en-us/library/ff650303.aspx#paght000001_commonregularexpressions
I am not sure exactly what you are asking, but a good start would be the Uri class, which will parse the url for you.
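A small sketch of what the Uri class gives you, using Uri.TryCreate so that invalid input does not throw. UrlCheck.IsHttpUrl is a hypothetical helper, and the sample URL is the one from the question with forward slashes.

```csharp
using System;

public static class UrlCheck
{
    // Hypothetical helper: true when the text parses as an absolute http/https URL.
    public static bool IsHttpUrl(string text)
    {
        Uri uri;
        return Uri.TryCreate(text, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}

public static class Program
{
    public static void Main()
    {
        Uri uri;
        if (Uri.TryCreate("http://www.google.pl/something/blah/?lang=5", UriKind.Absolute, out uri))
        {
            Console.WriteLine(uri.Scheme);       // http
            Console.WriteLine(uri.Host);         // www.google.pl
            Console.WriteLine(uri.AbsolutePath); // /something/blah/
            Console.WriteLine(uri.Query);        // ?lang=5
        }
    }
}
```

Note that Uri parses a string that is already a URL; it does not find URLs inside a longer message, so it is best combined with a regex or a split on whitespace.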
Here's one defined for URLs.
^http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
http://msdn.microsoft.com/en-us/library/ms998267.aspx
Regex regx = new Regex("http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
This will return a match collection of all matches found within "yourStringThatHasUrlsInIt":
var pattern = @"((ht|f)tp(s?)\:\/\/|~/|/)?([w]{2}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?";
var regex = new Regex(pattern);
var matches = regex.Matches(yourStringThatHasUrlsInIt);
The return will be a "MatchCollection" which you can read more about here:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchcollection.aspx
//This code returns (protocol://)host:port from a URL
//Commented-out URLs with different protocols. Just uncomment to test.
//string url = "http://www.contoso.com:8080/letters/readme.html";
//string url = "ftp://www.contoso.com:8080/letters/readme.html";
//string url = "l2tp://1.5.8.6:8080/letters/readme.html";
string url = "l2tp://1.5.8.6:8080/letters/readme.html";
string host = "";//empty string with host from url
//protocol, (ip/domain), port
host = Regex.Match(url, @"^(?<proto>\w+)://+?(?<host>[A-Za-z0-9\-\.]+)+?(?<port>:\d+)?/", RegexOptions.None, TimeSpan.FromMilliseconds(150)).Result("${proto}://${host}${port}");
//(ip/domain):port without protocol. If HTTPS board loading images from HTTP host.
//host = Regex.Match(url, @"^(?<proto>\w+)://+?(?<host>[A-Za-z0-9\-\.]+)+?(?<port>:\d+)?/", RegexOptions.None, TimeSpan.FromMilliseconds(150)).Result("${host}${port}");
Console.WriteLine("url: "+url+"\nhost: "+host); //display host
see https://rextester.com/PVSO54371
You can also use https://github.com/d-kistanov-parc/DotNetUrlPatternMatching
The library allows you to match a URL to a pattern.
How it works:
A URL pattern is split into parts.
Each non-empty part is matched with the corresponding part of the URL.
You can specify a wildcard, * or ~:
* is any character set within a group (scheme, host, port, path, parameter, fragment).
~ is any character set within a group segment (host, path).
Only supply parts of the URL you care about. Parts which are left out will match anything. E.g. if you don’t care about the host, then leave it out.
