I am trying to scrape Google search suggestions using C#, but I am unable to parse the response, which looks like JSON.
The url I am using is
http://clients1.google.com/complete/search?client=youtube&hl=en&gl=us&gs_rn=23&gs_ri=youtube&ds=yt&cp=2&gs_id=d&q=jk
and here is an example of the response data:
window.google.ac.h(["jk",[["jk news",0],["jkfilms",0],["jk party",0],["jkt48 kokoro no placard",0],["jkt48 river",0],["jk simmons",0,[3]],["jkn",0],["jkt48",0],["jk rowling",0],["jkt48 fortune cookie",0]],{"q":"M9pm0qoSNfax1agFT10pPSqRq54","j":"d","k":1}])
I have tried using Json.NET and string operations like Trim, Replace, Remove etc. without any success.
Is there an easy way to get the suggested keywords into an array?
Assuming that it always starts with window.google.ac.h( and ends with ) then you can just do:
var json = input.Replace("window.google.ac.h(", "").TrimEnd(')');
This produces valid JSON according to http://jsonlint.com/, which you can then pass to Json.NET or similar.
P.S. Scraping this kind of thing may be against Google's ToS. I suggest you read them.
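To get from the unwrapped JSON to an array of keywords, something like the following sketch may work. It assumes the response keeps the shape shown in the question (a top-level array whose second element is a list of [suggestion, ...] arrays) and uses Json.NET's JArray. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

public static class SuggestionParser
{
    // Strips the "window.google.ac.h(...)" wrapper and returns the suggested keywords.
    // Assumes the payload shape from the question: [query, [[suggestion, ...], ...], metadata].
    public static string[] ParseSuggestions(string input)
    {
        // Take everything between the first '(' and the last ')'.
        int start = input.IndexOf('(');
        int end = input.LastIndexOf(')');
        if (start < 0 || end <= start)
            throw new FormatException("Response is not wrapped in a callback.");

        string json = input.Substring(start + 1, end - start - 1);

        // The second element of the root array holds the suggestions;
        // the first element of each inner array is the suggested text.
        JArray root = JArray.Parse(json);
        return root[1]
            .Select(item => (string)item[0])
            .ToArray();
    }
}
```

Cutting between the first '(' and the last ')' avoids depending on the exact callback name, which is a design choice of this sketch rather than anything the service guarantees.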
Let's say I use HttpClient (.NET or any equivalent framework) to send a search request to Google to see the results for the best desktop computer brand:
HttpResponseMessage response = await client.GetAsync("https://www.google.com/search?q=best+desktop+brand");
Then I get raw HTML. Let's say there are 10 results and "https://www.dell.com/" is the 3rd result. In the raw HTML, how can I tell that it is the 3rd result? Is there any special string delimiter that separates each result?
You cannot rely on anything about the HTML that is returned. It's meant to be shown to humans in a web browser, not parsed by a script. It might change at any time.
Doing this is also against their TOS, and they may block you if they detect you.
Thankfully, Google provides an API for programmatically fetching search results. I suggest you use it.
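As a rough sketch of what that could look like: assuming the API in question is Google's Custom Search JSON API, and assuming its response carries an "items" array whose entries each have a "link" field, the position of a site can be read from the array order instead of from HTML. The class and method names below are hypothetical, and Json.NET is used for parsing.

```csharp
using System;
using Newtonsoft.Json.Linq;

public static class SearchRank
{
    // Returns the 1-based position of the first result whose link is on the given host,
    // or -1 when no result matches. Assumes an "items" array with a "link" field per entry.
    public static int FindRank(string responseJson, string host)
    {
        JObject root = JObject.Parse(responseJson);
        if (!(root["items"] is JArray items)) return -1;

        for (int i = 0; i < items.Count; i++)
        {
            string link = (string)items[i]["link"];
            if (link != null &&
                Uri.TryCreate(link, UriKind.Absolute, out Uri uri) &&
                uri.Host.Equals(host, StringComparison.OrdinalIgnoreCase))
            {
                return i + 1;
            }
        }
        return -1;
    }
}
```

The JSON string would come from a normal HttpClient call to the API endpoint with your own API key and search engine id. Check the current API documentation for the exact URL and parameters before relying on this.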
This is a question that has been asked before, but I've not found the information I'm looking for or maybe I'm just missing the point so please bear with me. I can always adjust my question if I'm asking it the wrong way.
For example, I have a POST endpoint that uses a simple DTO object with 2 properties (i.e. companyRequestDto), and the request contains a script tag in one of those properties. When I call my endpoint from Postman I use the following:
{
"company": "My Company<script>alert(1);</script>",
"description": "This is a description"
}
When it is received by the action in my endpoint,
public void Post(CompanyRequestDto companyRequestDto)
my DTO object will automatically be set and its properties will be set to:
companyRequestDto.Company = "My Company<script>alert(1);</script>";
companyRequestDto.Description = "This is a description";
I clearly don't want this information to be stored in our database as is, nor do I want it stored as an escaped string as displayed above.
1) Request: So my first question is how do I throw an error if the DTO posted contains some invalid content such as the <script> tag?
I've looked at Microsoft AntiXss but I don't understand how to use it here, as the data provided in the properties of a DTO object is not an HTML string but just a string. What am I missing? I don't understand how this helps with sanitizing or validating the passed data.
When I call
var test = AntiXss.AntiXssEncoder.HtmlEncode(companyRequestDto.Company, true);
It returns an encoded string, but then what??
Is there a way to remove disallowed keywords or just simply throw an error?
2) Response: Assuming 1) was not implemented or didn't work properly and the value ended up being stored in our database, am I supposed to return encoded data as a JSON string? So instead of returning:
"My Company<script>alert(1)</script>"
Am I supposed to return:
"My Company&lt;script&gt;alert(1)&lt;/script&gt;"
Is the browser (or whatever app) just supposed to display as below then?:
"My Company<script>alert(1)</script>"
3) Code: Assuming there is a way to sanitize or throw an error, should I do this at the property level, using an attribute on every property of my various DTO objects, or is there a way to apply it at the class level, using an attribute that will validate and/or sanitize all string properties of a DTO object?
I found interesting articles but none really answering my problems or I'm having other problems with some of the answers:
asp.net mvc What is the difference between AntiXss.HtmlEncode and HttpUtility.HtmlEncode?
Stopping XSS when using WebAPI (currently looking into this one but don't see how example is solving problem as property is always failing whether I use the script tag or not)
how to sanitize input data in web api using anti xss attack (also looking at this one but having a problem calling ReadFromStreamAsync from my project at work. Might be down to some of the settings in my web.config but haven't figured out why but it always seems to return an empty string)
Thanks.
UPDATE 1:
I've just finished going through the answer from Stopping XSS when using WebAPI
This is probably the closest one to what I am looking for. Except I don't want to encode the data, as I don't want to store it in my database, so I'll see if I can figure out how to throw an error but I'm not sure what the condition will be. Maybe I should just look for characters such as <, >, ; , etc... as these will not likely be used in any of our fields.
You need to consider where your data will be used when you think about encoding. Data with <script> in it is only a problem if it's rendered as HTML, so if you are going to display user-provided data anywhere, the point of display is probably where you want to HTML-encode it (you want to avoid, for example, repeatedly HTML-encoding the same string each time it is saved).
Again, it depends what the response is going to be used for. You probably want to HTML-encode it at the point it's going to be displayed. Remember that if you encode something in the response it may not match what's in the data, so if the calling code does something like call your API to search for a company with that name, that could cause problems. If the browser does display the HTML-encoded version it might look ugly, but it's better than users being compromised by XSS attacks.
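A minimal sketch of "encode at the point of display", assuming the raw value is stored and returned unchanged and only the rendering code encodes it. It uses WebUtility.HtmlEncode from System.Net. The helper name is made up.

```csharp
using System;
using System.Net;

public static class DisplayEncoding
{
    // Encodes a stored value just before it is written into an HTML page.
    // The stored value itself is left untouched, so searches by name still match.
    public static string ForHtml(string storedValue)
    {
        return WebUtility.HtmlEncode(storedValue);
    }
}
```

With the company name from the question, ForHtml returns the text with < and > replaced by &amp;lt; and &amp;gt;, so the browser shows the tag as literal text instead of running it.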
It's quite difficult to sanitize text for things like tags if you allow most characters for normal use. It's easier if you can whitelist the allowed characters and only allow, say, alphanumerics, but that often isn't possible. Where it is possible, it can be done with a regex validation attribute on the DTO's properties. The best approach, I think, is to encode values for display if you can't block certain characters. It's really difficult to try to allow all characters but avoid things like <script>, as people can start using ASCII characters etc.
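A sketch of the whitelist approach using the standard RegularExpression attribute from System.ComponentModel.DataAnnotations. The allowed character set here (letters, digits, spaces and a few punctuation marks) is an assumption for illustration and would need to match your real data. DtoValidator is a made-up helper that runs the same checks by hand.

```csharp
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

public class CompanyRequestDto
{
    // Whitelist (assumed policy): letters, digits, spaces and . , ' -
    [Required]
    [RegularExpression(@"^[a-zA-Z0-9 .,'-]*$", ErrorMessage = "Company contains disallowed characters.")]
    public string Company { get; set; }

    [RegularExpression(@"^[a-zA-Z0-9 .,'-]*$", ErrorMessage = "Description contains disallowed characters.")]
    public string Description { get; set; }
}

public static class DtoValidator
{
    // Runs all data-annotation checks on the object and collects any failures.
    public static bool IsValid(object dto, out List<ValidationResult> errors)
    {
        errors = new List<ValidationResult>();
        return Validator.TryValidateObject(
            dto, new ValidationContext(dto), errors, validateAllProperties: true);
    }
}
```

Inside a Web API action the framework normally evaluates these attributes during model binding, so checking ModelState.IsValid and returning a 400 response would be the usual way to throw an error. The manual helper above is mainly useful for unit tests.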
I have a plain-text string which contains braces in JSON format, as it is created using the JavaScriptSerializer().Serialize() method. I need to remove the braces and colons and convert it into key=value,key=value format.
Need to convert
{
"account":"rf750",
"type":null,
"amount":"31",
"auth_type":"5",
"balance":"2.95",
"card":"re0724"
}
to
'account=rf750,type=null,amount=31,auth_type=5,balance=2.95,card=re0724'
Well, you've got three different things going on here.
The first, and surface issue, is: how do you change the string?
Simple - you do some string substitutions, preferably using Regex. Remove the starting/ending braces, change "[a]":"[b]", to [a]=[b], - or however you want the final format to look.
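A sketch of the substitution idea, assuming a flat object like the one in the question: string values without escaped quotes, plus bare tokens such as null or numbers. The class name and the pattern are illustrative only, and the pattern will mis-handle nested objects, arrays and escaped quotes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FlatJsonConverter
{
    // Matches "key":"value" or "key":bareToken (null, numbers, true/false).
    private static readonly Regex Pair = new Regex(
        "\"(?<key>[^\"]+)\"\\s*:\\s*(?:\"(?<quoted>[^\"]*)\"|(?<bare>[^,}\\s]+))");

    // Turns a flat JSON object into key=value,key=value text.
    public static string ToKeyValue(string json)
    {
        return string.Join(",",
            Pair.Matches(json)
                .Cast<Match>()
                .Select(m =>
                {
                    // Exactly one of the two value groups takes part in each match.
                    Group value = m.Groups["quoted"].Success ? m.Groups["quoted"] : m.Groups["bare"];
                    return m.Groups["key"].Value + "=" + value.Value;
                }));
    }
}
```

Unlike a deserializer, this keeps the literal token null in the output, which happens to match the format asked for in the question.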
The second, and slightly deeper issue is: JSON isn't just a simple list of keys=values. You can have nesting. You can have non-string data. Simply saying you want to change the JSON result to key=value,key=value,key=value, etc - is fragile. How do you know the JSON structure will be what you're expecting? JSON Serialization will serialize successfully even if you've got nested structures, non string/int data, etc. And if you want solid code that doesn't easily break, you have to figure out: how do I handle this? Can I handle this?
The third, and final thing is: you're taking a standard data format schema and figuring out how to translate it to a nonstandard data format. 90% of the time someone does that, they deserve to be shot. Seriously, spend some solid time asking yourself whether you can use the JSON as-is, and whether the process wanting key=value,key=value,etc can be changed to use an actual standardized data format.
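Here is a simple solution which (1) parses the JSON into a Dictionary and (2) uses String.Join and LINQ Select to produce the desired output:
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
..
// Every value in this flat object is a string or null, so Dictionary<string, string> fits.
var dict = JsonConvert.DeserializeObject<Dictionary<string, string>>(json);
// A null value (such as "type") is rendered as an empty string.
var str = string.Join(",", dict.Select(r => $"{r.Key}={r.Value}"));
The str variable now contains: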
account=rf750,type=,amount=31,auth_type=5,balance=2.95,card=re0724
Well, thanks everyone for your time and responses. Your answers led me towards a solution, and I finally found the following, which resolved the issue perfectly.
var jObj = (JObject)JsonConvert.DeserializeObject(modelString);
modelString = String.Join("&",jObj.Children().Cast<JProperty>().Select(jp => jp.Name + "="+ HttpUtility.UrlEncode(jp.Value.ToString())));
The above code converts the JSON into a URL-encoded string of name=value pairs joined with "&", which removes the JSON formatting.
I want to send a JSON response to the browser that made a request to a REST service. I use something like this in a controller method, where the string includes a quote:
return Json("blah\"blah", JsonRequestBehavior.AllowGet);
I expect the result to be blah"blah, but it is blah\"blah and includes the backslash too! I want blah"blah in the response without any conversion on the client side. I know I need to do this in C# code, but how?
C# and JSON encode characters similarly. The string blah"blah is encoded in both C# and JSON as "blah\"blah". It's perfectly expected, then, that your raw JSON includes the backslash.
When you decode that string with a proper JSON library, it again becomes the string blah"blah.
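The round trip can be checked with a few lines of Json.NET. This is only an illustration of the encoding behaviour described above, with a made-up helper class. The Json(...) call in the controller goes through a serializer as well, so the same escaping applies there.

```csharp
using System;
using Newtonsoft.Json;

public static class QuoteRoundTrip
{
    // Serializes a string to its JSON text form (adds surrounding quotes and escapes).
    public static string Encode(string value) => JsonConvert.SerializeObject(value);

    // Parses JSON text back into the original string.
    public static string Decode(string json) => JsonConvert.DeserializeObject<string>(json);
}
```

Encode("blah\"blah") yields the JSON text "blah\"blah" with the backslash, and Decode of that text returns blah"blah again. The backslash exists only on the wire.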
I found the answer in the following two threads:
ASP.Net MVC: how to create a JsonResult based on raw Json Data
How to output Json string as JsonResult in MVC4?
So I need to use something like this:
return Content(jsonStringFromSomewhere, "application/json");
Bear in mind that JSONP is meant for AJAX requests to an external service or URL. I wanted to build a specific string to be read by a special parser, so I used JSON rather than JSONP, and the result was great.
Can you tell me how to parse data from this link?
http://www.e1.ru/business/job/resume.detail.php?id=956004
I tried something like this:
var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/table[5]/tbody[1]/tr[1]/td[2]/table[4]/tbody[1]/tr[1]/td[1]/table[1]");
but it is not a good approach.
Abbath, I recommend using a 3rd party tool that can extract data from HTML and then pull out the data you require, such as egrabber, rchilli and many more.
If you are looking for your own solution, then index the complete text, load it as XML, study the DOM structure and pick out the values you need.
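For the do-it-yourself route, a sketch with HtmlAgilityPack (the library the question's doc.DocumentNode.SelectNodes call appears to come from) could select nodes by attribute instead of by absolute position. The class attribute value and the helper name are hypothetical, and the real page's structure would have to be inspected first.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ResumeTableReader
{
    // Returns the trimmed text of every cell in tables carrying the given class.
    public static List<string> ReadCells(string html, string tableClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Relative XPath: find the table by attribute, not by its position in the page.
        // The "//" before td also steps over <tbody>, which browsers often add to the DOM
        // but which may be missing from the raw HTML.
        var cells = doc.DocumentNode.SelectNodes($"//table[@class='{tableClass}']//td");

        // SelectNodes returns null when nothing matches.
        if (cells == null) return new List<string>();

        return cells
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
            .ToList();
    }
}
```

An absolute path such as /html[1]/body[1]/table[5]/tbody[1]/... breaks as soon as the page layout shifts, or when the tbody element is not present in the downloaded markup, which is why this sketch avoids it.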