I am trying to get the translated text using Google Translate's API.
public JsonResult getCultureMeaning(string word, string langcode)
{
string url = String.Format("https://translate.google.com/#en/{0}/{1}", langcode, word);
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string m = "";
foreach (HtmlNode node in doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes)
{
m += node.InnerHtml;
}
return Json(m, JsonRequestBehavior.AllowGet);
}
In the method above I am passing parameters; say word is Welcome and langcode is hi in this case.
So the url would be https://translate.google.com/#en/hi/welcome and the result is आपका स्वागत है.
But when I select the result container with its child nodes, as doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes, it does not find the result container in the loaded document. Hence I can't get this approach to work in my case.
Edit-
result container from the url-
<span id="result_box" class="short_text" lang="hi"><span class="hps">आपका स्वागत है</span></span>
How should I approach this to get it working? For reference, I am using HtmlAgilityPack.
If you inspect the page requests, you might notice that the actual translation is fetched via an AJAX request; a sample query for your translation is: https://translate.google.com/translate_a/single?client=t&sl=en&tl=hi&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qc&dt=rw&dt=rm&dt=ss&dt=t&dt=at&dt=sw&ie=UTF-8&oe=UTF-8&ssel=0&tsel=0&q=welcome
It returns JSON; you can inspect it and pull out what you are looking for (the data is pretty big, so I won't post it here).
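For illustration, here is a minimal sketch of calling that AJAX endpoint directly and slicing the first translation out of the response. Everything about it (the parameters, the response shape) is observed rather than documented, and the endpoint has since grown anti-bot measures, so treat it as fragile:
using System;
using System.Net;
using System.Text;

public static class TranslateAjaxSketch
{
    // The response is not always strictly valid JSON (it can contain empty
    // array slots), so the translated text (the first quoted string) is
    // sliced out directly instead of going through a JSON parser.
    public static string Translate(string word, string langcode)
    {
        string url = "https://translate.google.com/translate_a/single" +
                     "?client=t&sl=en&tl=" + langcode +
                     "&dt=t&ie=UTF-8&oe=UTF-8&q=" + Uri.EscapeDataString(word);
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            string raw = client.DownloadString(url);
            // Response begins like: [[["आपका स्वागत है","welcome",...
            int start = raw.IndexOf("[[[\"", StringComparison.Ordinal) + 4;
            int end = raw.IndexOf('"', start);
            return raw.Substring(start, end - start);
        }
    }
}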
HtmlAgilityPack only fetches the document as served; it cannot see content added after an AJAX request completes. Thanks to @Uriil for shedding light on this issue.
However, I was able to manage it the traditional way, using WebClient.
Here is what I did:
public JsonResult getCultureMeaning(string word, string langcode)
{
string languagePair = "en|" + langcode;
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", word, languagePair);
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
string result = webClient.DownloadString(url);
result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
result = result.Substring(result.IndexOf(">") + 1);
result = result.Substring(0, result.IndexOf("</span>"));
result = HttpUtility.HtmlDecode(result.Trim());
return Json(result, JsonRequestBehavior.AllowGet);
}
It works for every culture pair, except en|en; in that case it would return the whole HTML document as the result.
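A simple guard (my suggestion, not part of the original answer) sidesteps that edge case by skipping the request when the source and target languages match:
// Hypothetical guard at the top of getCultureMeaning: no round-trip for en -> en.
if (string.Equals(langcode, "en", StringComparison.OrdinalIgnoreCase))
{
    return Json(word, JsonRequestBehavior.AllowGet);
}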
I want to count the number of rows in an HTML string returned from an API. Any idea how to get the row count without using HtmlAgilityPack?
The following code connects to the API and returns the HTML string in apiContent.
using (var client = new HttpClient())
{
client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
var response = client.GetAsync(apiURL).Result;
using (HttpContent content = response.Content)
{
Task<string> result = content.ReadAsStringAsync();
apiContent = result.Result;
}
}
Now I need to count the number of rows (tr) in the HTML string in the variable "apiContent", but without using HtmlAgilityPack.
If the only <tr>'s being returned are rows you are interested in, why not just count the occurrences of the opening tag? A character-wise LINQ Count() can't match the multi-character "<tr", so a one-line Regex (using System.Text.RegularExpressions instead of LINQ) does it:
int count = Regex.Matches(apiContent, @"<tr\b", RegexOptions.IgnoreCase).Count;
Here is a robust solution without HtmlAgilityPack, using AngleSharp.
Let's consider this HTML (note the commented-out row, which naive substring counting would incorrectly include):
var html = "<table><tr><td>cell</td></tr><!--<tr><td>comment</td></tr>--></table>";
Let's load this HTML as a document:
// Requires the AngleSharp NuGet package (using AngleSharp;)
// Create a new context for evaluating webpages with the default configuration
var context = BrowsingContext.New(Configuration.Default);
// Parse the document from the content of a response to a virtual request
var document = await context.OpenAsync(req => req.Content(html));
Query whatever you are looking for in your HTML:
var rows = document.QuerySelectorAll("tr");
Console.WriteLine(rows.Count());
Whenever you want to parse HTML, always rely on an HTML parser. If you don't want to use HAP, AngleSharp is a great alternative. If you don't want to use an existing HTML parser, you are doomed to make your own. It will be easier on a subset of HTML, but mostly not worth the hassle. Help yourself; use a library.
In my application I need to get the URL of the image of a blog post. To do this I'm using HtmlAgilityPack.
This is the code I have so far:
static string GetBlogImageUrl(string postUrl)
{
string imageUrl = string.Empty;
using (WebClient client = new WebClient())
{
string htmlString = client.DownloadString(postUrl);
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString);
string xPath = "/html/body/div[contains(@class, 'container')]/div[contains(@class, 'content_border')]/div[contains(@class, 'single-post')]/main[contains(@class, 'site-main')]/article/header/div[contains(@class, 'featured_image')]/img";
HtmlNode node = htmlDocument.DocumentNode.SelectSingleNode(xPath);
imageUrl = node.GetAttributeValue("src", string.Empty);
}
return imageUrl;
}
The problem is that this is too slow: in my tests it took about three seconds to extract the image URL from the given page, which is a problem when I'm loading a feed and trying to read several articles.
I tried using the absolute XPath of the element I want to load, but I didn't notice any improvement. Is there a faster way to achieve this?
Can you try this code and see if it's faster or not?
string Url = "http://blog.cedrotech.com/5-tendencias-mobile-que-sua-empresa-precisa-acompanhar/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var featureDiv = doc.DocumentNode.Descendants("div").FirstOrDefault(_ => _.Attributes.Contains("class") && _.Attributes["class"].Value.Contains("featured_image"));
var img = featureDiv.ChildNodes.First(_ => _.Name.Equals("img"));
var imgUrl = img.Attributes["src"].Value; // .Value gives the URL string rather than the HtmlAttribute
I have a problem. My goal is to save some text from a (Japanese Shift-JIS encoded) HTML page into a UTF-8 encoded text file.
But I don't really know how to encode the text. The HtmlNode object is encoded in Shift-JIS, but after I use the ToString() method, the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
String content = "";
HtmlDocument page = new HtmlWeb(){AutoDetectEncoding = true}.Load(url);
HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(@class, 'article-def')]");
if (anchor != null)
{
content = anchor.InnerHtml.ToString();
}
return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName.ToString());
and got: Japanese Shift-JIS
But converting the HTML into a String produces the corrupted output. I thought there should be a way, but since the documentation for HtmlAgilityPack is sparse and I couldn't really find a solution via Google, I'm here to get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what I found by looking at the source code of the Agility Pack, the property is only used when loading a local file from disk, not from a URL.
So there are three options. One would be to just set the encoding:
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same, that's the easiest fix.
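In context that looks something like this (a sketch; OverrideEncoding is available on HtmlWeb in current HtmlAgilityPack releases):
var htmlWeb = new HtmlWeb { OverrideEncoding = Encoding.GetEncoding("shift-jis") };
HtmlDocument page = htmlWeb.Load(url);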
Or you could download the file locally and load it the same way you do now, but passing the file path instead of the URL:
using (var client=new WebClient())
{
client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb(){AutoDetectEncoding = true};
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
page.Load(ms);
var metaContentType = page.DocumentNode.SelectSingleNode("//meta[#http-equiv='Content-Type']").GetAttributeValue("content", "");
var contentType = new System.Net.Mime.ContentType(metaContentType);
ms.Position = 0;
page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the Content-Type in the response headers, you can look here for how to get the encoding from it.
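For example, a sketch that trusts the response header (assuming the server sends a charset at all; the utf-8 fallback is my assumption):
byte[] data;
string charset;
using (var client = new WebClient())
{
    data = client.DownloadData(url);
    // ResponseHeaders is populated once the request has completed.
    var contentType = new System.Net.Mime.ContentType(client.ResponseHeaders["Content-Type"]);
    charset = contentType.CharSet ?? "utf-8";
}
var page = new HtmlDocument();
page.LoadHtml(Encoding.GetEncoding(charset).GetString(data));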
Your code would of course need a few more null checks than mine does. ;)
I am trying to set up a simple app that consumes the Yahoo Fantasy Sports API and allows queries to be executed through YQL.
class Program
{
static void Main(string[] args)
{
string yql = "select * from fantasysports.games where game_key in ('268')";
//var xml = QueryYahoo(yql);
// Console.Write(xml.InnerText);
string consumerKey = "--my key--";
string consumerSecret = "--my secret--";
var xml = QueryYahoo(yql, consumerKey, consumerSecret);
Console.Write(xml.InnerText);
}
private static XmlDocument QueryYahoo(string yql)
{
string url = "http://query.yahooapis.com/v1/public/yql?format=xml&diagnostics=false&q=" + Uri.EscapeUriString(yql);
var req = System.Net.HttpWebRequest.Create(url);
var xml = new XmlDocument();
using (var res = req.GetResponse().GetResponseStream())
{
xml.Load(res);
}
return xml;
}
private static XmlDocument QueryYahoo(string yql, string consumerKey, string consumerSecret)
{
string url = "http://query.yahooapis.com/v1/yql?format=xml&diagnostics=true&q=" + Uri.EscapeUriString(yql);
url = OAuth.GetUrl(url, consumerKey, consumerSecret);
var req = System.Net.HttpWebRequest.Create(url);
var xml = new XmlDocument();
using (var res = req.GetResponse().GetResponseStream())
{
xml.Load(res);
}
return xml;
}
}
Some of the work is hidden here: I have a custom class that makes the URL acceptable to the Yahoo API. Here is the structure of the URL that the OAuth.GetUrl() method returns:
http://query.yahooapis.com/v1/yql?diagnostics=true&format=xml&oauth_consumer_key=mykey&oauth_nonce=rlfmxniesu&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1332785286&oauth_version=1.0&q=select%20%2A%20from%20fantasysports.games%20where%20game_key%20in%20%28%27268%27%29&oauth_signature=NYKIbhjoirJwB6ADxVq5DOgLW1w%3D
With this, I always seem to get
Authentication Error. The table fantasysports.games requires a higher security level than is provided, you provided APP but at least USER is expected
I am not sure what this means; I am passing my auth information to the API, but it seems I need more permissions. Does anyone have a working example of this? If needed, I can supply the code for the GetUrl method, but it is more or less copy-pasted from here:
http://andy.edinborough.org/Getting-Started-with-Yahoo-and-OAuth
Let me know if you have any questions. Thanks!
I couldn't make it work using YQL, but I was able to get the player data, draft results, etc. by directly using the APIs at https://fantasysports.yahooapis.com/fantasy/v2/
e.g. to get NFL player David Johnson's details:
GET /fantasy/v2/players;player_keys=371.p.28474 HTTP/1.1
Host: fantasysports.yahooapis.com
Authorization: Bearer [[Base64 encoded ClientId:Secret]]
Content-Type: application/json
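A rough C# equivalent of that raw request (my sketch; accessToken is a placeholder for whatever token Yahoo's OAuth flow returns):
using System.Net.Http;
using System.Net.Http.Headers;

static string GetPlayerJson(string accessToken)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", accessToken);
        client.DefaultRequestHeaders.Accept.Add(
            new MediaTypeWithQualityHeaderValue("application/json"));
        return client.GetStringAsync(
            "https://fantasysports.yahooapis.com/fantasy/v2/players;player_keys=371.p.28474").Result;
    }
}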
Forgive my ignorance on the subject
I am using
string p = "http://" + textBox2.Text;
string r = textBox3.Text;
System.Net.WebClient webClient = new System.Net.WebClient();
webClient.DownloadFile(p, r);
to download a webpage. Can you please help me enhance the code so that it downloads the entire website? I tried HTML screen scraping, but it returns only the href links of the index.html file. How do I proceed?
Thanks
Scraping a website is actually a lot of work, with a lot of corner cases.
Invoke wget instead. The manual explains how to use the "recursive retrieval" options.
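For instance, shelling out from C# (a sketch; assumes wget is installed and on the PATH, and uses the standard recursive-retrieval flags from the manual):
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "wget",
    // --recursive follows links, --page-requisites grabs images/CSS,
    // --convert-links rewrites links for local viewing.
    Arguments = "--recursive --page-requisites --convert-links http://example.com/",
    UseShellExecute = false
};
using (var process = Process.Start(psi))
{
    process.WaitForExit();
}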
protected string GetWebString(string url)
{
string appURL = url;
HttpWebRequest wrWebRequest = WebRequest.Create(appURL) as HttpWebRequest;
HttpWebResponse hwrWebResponse = (HttpWebResponse)wrWebRequest.GetResponse();
StreamReader srResponseReader = new StreamReader(hwrWebResponse.GetResponseStream());
string strResponseData = srResponseReader.ReadToEnd();
srResponseReader.Close();
return strResponseData;
}
This puts the webpage from the supplied URL into a string.
You can then use regex to parse through the string.
This little piece gets specific links out of Craigslist and adds them to an ArrayList. Modify it to your purpose.
protected ArrayList GetListings(int pages)
{
ArrayList list = new ArrayList();
string page = GetWebString("http://albany.craigslist.org/bik/");
// Note: the original pattern "(<p>)(?<TITLE>.*)(-)" never defined the LINK group used below;
// this pattern (a best guess against old Craigslist markup) captures each listing's relative href.
MatchCollection listingMatches = Regex.Matches(page, "<a href=\"(?<LINK>/bik/[^\"]+)\">(?<TITLE>[^<]+)</a>");
foreach (Match m in listingMatches)
{
list.Add("http://albany.craigslist.org" + m.Groups["LINK"].Value.ToString());
}
return list;
}
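Example usage (hypothetical):
ArrayList listings = GetListings(1);
foreach (string link in listings)
{
    Console.WriteLine(link);
}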