screen scraping

screen scraping - c#

I am screen scraping a website that is in Danish, and I am unable to scrape certain characters correctly, such as the å in "må".
Any idea how to solve this?
Thanks

Try UTF-8 or Windows-1252 charset.

If you are using a Web browser control, you can set the page encoding to whatever language that can show that character. Then just extract the page source.

I just used System.Web.HttpContext.Current.Server.HtmlDecode(),
and it works.

I use iso-8859-1 for decoding.
HTH

It's better to use the same encoding that the HttpWebResponse object reports.
The code below should work with most languages and characters, as long as the server declares its charset correctly.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string html = null;
if (response.StatusCode == HttpStatusCode.OK)
{
// CharacterSet is taken from the charset declared in the Content-Type header.
Encoding encoding = Encoding.GetEncoding(response.CharacterSet);
using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
{
html = reader.ReadToEnd();
}
}
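One caveat with the approach above: the charset reported by the response can be empty or a name that .NET does not recognize, and Encoding.GetEncoding then throws. A minimal sketch of a fallback follows; the helper name ResolveEncoding and the choice of fallback are assumptions for illustration, not part of the original answer.

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Hypothetical helper: maps a charset name taken from the response to an
    // Encoding, returning the fallback when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return fallback;

        try
        {
            // Some servers wrap the name in quotes, e.g. "utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

It could then be used as new StreamReader(response.GetResponseStream(), CharsetHelper.ResolveEncoding(response.CharacterSet, Encoding.UTF8)).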

Related

C# Get site source code with letters other than english

I'm trying to get a site's source in C# using
WebClient client = new WebClient();
string content = client.DownloadString(url);
And it gets it just fine.
However, the source code contains Hebrew characters, which show up as gibberish in the content variable.
What do I need to do for it to recognize it?

WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8; // added
string content = client.DownloadString(url);
You have to specify the encoding. WebClient is probably decoding with a default encoding that does not match the page, and the content could be in UTF-8. This example sets the encoding to UTF-8. If you are not sure which encoding the site uses, check the source manually first and then set the encoding accordingly. For more information, see the Remarks section of the documentation.

The problem is the Encoding of your WebClient. MSDN says:
... the method uses the encoding specified in the Encoding property to convert the resource to a String.
Solution: set a specific encoding, like
client.Encoding = Encoding.UTF8;
and try again:
string content = client.DownloadString(url);
UTF-8 should also handle the Hebrew characters.
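To see why the wrong encoding produces gibberish, here is a small self-contained sketch that needs no network access. It takes the UTF-8 bytes of a Hebrew word and decodes them twice: once with the matching encoding and once with ISO-8859-1. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class EncodingMismatchSketch
{
    // Decodes UTF-8 bytes with the matching encoding.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decodes the same bytes with a single-byte encoding, which turns
    // every byte into one (wrong) character.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

For the four-letter word "שלום", Encoding.UTF8.GetBytes yields 8 bytes. DecodeAsUtf8 returns the original four letters, while DecodeAsLatin1 returns eight unrelated characters, which is the kind of garbage the question describes.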

convert rtsp stream to http stream

In C#, is it possible to read an RTSP video stream using System.Net.HttpWebRequest? If not, please tell me an alternative.
// the URL to download the file from
string basepath = @"rtsp://ip.worldonetv.com:1935/live/ ";
// the path to write the file to
// string sFilePathToWriteFileTo = "d:\\Download";
// first, we need to get the exact size (in bytes) of the file we are downloading
Uri url = new Uri(basepath);
System.Net.HttpWebRequest request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
System.Net.HttpWebResponse response = (System.Net.HttpWebResponse)request.GetResponse();
response.Close();

You can formulate RtspRequests with my library.
You can then base64 encode the RtspRequest and put that in the body of the HttpRequest.
Add the content-length header, which would be equal to the length of the base64 encoded RTSP request in the body.
Add the header rtsp/x-tunneled to the HttpRequest and then send it along.
You should get back an HttpResponse with the body containing a base64 encoded RtspResponse.
Base64 decode the body of the HttpResponse and then use the RtspResponse class in my library to parse it.
The library is at http://net7mma.codeplex.com/
And there is a CodeProject article at http://www.codeproject.com/Articles/507218/Managed-Media-Aggregation-using-Rtsp-and-Rtp
If you need anything else, let me know!
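The base64 steps described above can be sketched without the library, using plain strings. This only illustrates the encoding and the content-length calculation; the class and method names are hypothetical, the example URL is a placeholder, and the library's own RtspRequest and RtspResponse classes are not shown because their API is not given here.

```csharp
using System;
using System.Text;

static class RtspTunnelSketch
{
    // Builds a plain-text RTSP OPTIONS request (illustrative only).
    public static string BuildOptionsRequest(string rtspUrl, int cseq)
    {
        return "OPTIONS " + rtspUrl + " RTSP/1.0\r\n" +
               "CSeq: " + cseq + "\r\n\r\n";
    }

    // Base64 encodes the RTSP request so it can travel as an HTTP body.
    public static string EncodeForHttpBody(string rtspRequest)
    {
        return Convert.ToBase64String(Encoding.ASCII.GetBytes(rtspRequest));
    }

    // Reverses the encoding on a base64 body received over HTTP.
    public static string DecodeFromHttpBody(string base64Body)
    {
        return Encoding.ASCII.GetString(Convert.FromBase64String(base64Body));
    }
}
```

Because base64 output is pure ASCII, the content-length value is simply the Length of the string returned by EncodeForHttpBody.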

There's no standard C# library to do this. You can't even do it with the various .NET DirectShow wrappers. I just had a coworker spend a month on this problem and he ended up writing his own C# wrapper on GStreamer. If you're planning to display the video, the easiest option is to embed the VLC ActiveX control.

Reading a web page in a foreign language with StreamReader

I'm trying to fetch a web page that is a mix of English and Korean. The browser can fetch and display the page just fine, but when I try to grab it programmatically I can't get the Korean characters to display properly.
I know that you can specify an Encoding in the StreamReader but I haven't found one that works yet.
This is the code that I'm using to read the response:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(response.CharacterSet));
response.CharacterSet returns UTF8. I've also tried all of the basic encoding options - ASCII, BigEndian, Default, Unicode, UTF32, UTF7, and manually adding Encoding.UTF8.
I've also tried going about it through the CultureInfo:
CultureInfo kr = CultureInfo.GetCultureInfo("ko");
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(kr.TextInfo.ANSICodePage));
using both "ko" and "ko-KR". I get varied results from all these different types, but none of them are correct.
I've also tried the code page directly:
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(949));
response.ContentEncoding returns an empty string. I'm running out of ideas.
Edit: Here is an example of what I'm expecting:
프로젝트:
and here is what I'm getting:
//ASCII == ??????
//BigEndian == à¤¼æ±©æ¹«â¨ç‰¥æ˜½âˆ¯æ©³â½¤ç°æ”
//Default == Ãâ€â€žÃ«Â¡Å“Ã¬Â ÂÃÅ Â¸:
//Unicode == Ãâ€â€žÃ«Â¡Å“Ã¬Â ÂÃÅ Â¸
//UTF32 == ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï
//UTF7 == ÃÂ”Â„Ã«Â¡ÂœÃ¬Â ÂÃÂŠÂ¸
//UTF8 == í”„ë¡œì íŠ¸

FWIW: a stream reader on its own is likely not going to work well.
Prefer using the HttpWebRequest class for browser-like requests, or you will soon regret it when you run into 302 responses, gzipped bodies, or chunked encoding.
I promoted this to an answer, as it might very well be the problem you're having already. I don't know what the response you are getting looks like, of course.
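One concrete way a raw StreamReader can go wrong is a gzip-compressed body: the reader decodes compressed bytes as text, and no choice of Encoding can fix that. The sketch below reproduces the situation locally, without a network call. The names are made up for illustration, and whether this is what actually happens with the page in the question is an assumption.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipBodySketch
{
    // Simulates a server that sends a gzip-compressed body.
    public static byte[] Gzip(string text, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = encoding.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the stream has been closed.
            return buffer.ToArray();
        }
    }

    // Decompresses first, then decodes the text with the given encoding.
    public static string ReadGzippedBody(byte[] body, Encoding encoding)
    {
        using (var gzip = new GZipStream(new MemoryStream(body), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With HttpWebRequest, setting the AutomaticDecompression property to DecompressionMethods.GZip | DecompressionMethods.Deflate makes the framework do the decompression step before the response stream is read.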

HttpWebRequest and Unicode characters

I am using this code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
string result = null;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
StreamReader reader = new StreamReader(resp.GetResponseStream());
result = reader.ReadToEnd();
reader.Close();
}
In result I get text like 003cbr /003e003cbr /003e (I think this should be 2 line breaks instead). I tried the 2- and 3-parameter versions of the StreamReader constructor, but the string was the same. (The request returns a JSON string.)
Why am I getting those characters, and how can I avoid them?

It's not really clear what that text is, but you're not specifying an encoding at the moment. What content encoding is the server using? StreamReader will default to UTF-8.
It sounds like actually you're getting some sort of oddly-encoded HTML, as U+003C is < and U+003E is >, giving <br /><br /> as the content. That's not JSON...
Two tests:
Use WebClient.DownloadString, which will detect the right encoding to use
See what gets shown using the same URL in a browser
EDIT: Okay, now that I've seen the text, it's actually got:
\u003cbr /\u003e
The \u part is important here - that's part of the JSON which states that the next four characters form the hex representation of a UTF-16 code unit.
Any JSON API used to parse that text should perform the unescaping for you.
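As a minimal sketch of that unescaping, assuming System.Text.Json is available (it ships with .NET Core 3.0 and later), parsing a JSON string literal turns \u003c and \u003e back into < and >. The helper name ReadJsonString is made up for illustration.

```csharp
using System.Text.Json;

static class JsonUnescapeSketch
{
    // Parses a JSON string literal (surrounding quotes included) and
    // returns its value with all \uXXXX escapes resolved.
    public static string ReadJsonString(string jsonLiteral)
    {
        using (JsonDocument doc = JsonDocument.Parse(jsonLiteral))
        {
            return doc.RootElement.GetString();
        }
    }
}
```

Passing the literal "\u003cbr /\u003e" (with its quotes) returns <br />. In a real response the string would normally be a property inside a larger JSON object, and the same unescaping happens when that property is read.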

Need help for parsing HTML in C#

For personal use, I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebResponse result = null;
WebRequest req = WebRequest.Create(Url);
result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
while (sr.Read() != -1)
{
Line = sr.ReadLine();
Line = Regex.Replace(Line, @"<(.|\n)*?>", " ");
Line = Line.Replace(" ", "");
Line = Line.TrimEnd();
Line = Line.TrimStart();
and then I really don't have a clue whether to take it line by line or the
whole stream at once, and how to retrieve only the team names with the number that follows, which would be the score.
In the end I want to put both teams with their scores in a list or XML to use with a phone application.
If anyone has an idea, that would be great. Thanks!

Take a look at Html Agility Pack

You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
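A minimal sketch of the LINQ to XML route, assuming the markup is well-formed and that each result is a tr row of td cells. That row layout is a guess for illustration and not the real structure of the page in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XmlTableSketch
{
    // Returns the trimmed text of every <td> cell, grouped by <tr> row.
    // XDocument.Parse throws an XmlException if the markup is not well-formed.
    public static List<string[]> ReadRows(string wellFormedHtml)
    {
        XDocument doc = XDocument.Parse(wellFormedHtml);
        return doc.Descendants("tr")
                  .Select(tr => tr.Elements("td")
                                  .Select(td => td.Value.Trim())
                                  .ToArray())
                  .ToList();
    }
}
```

For the input <table><tr><td>Lyon</td><td>2</td><td>1</td><td>Nice</td></tr></table>, ReadRows returns one row containing Lyon, 2, 1, Nice.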

You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).

You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.
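A minimal sketch of the Regex.Match idea, assuming each row has the shape home team, home goals, away goals, away team in four plain td cells. That layout is hypothetical, so the pattern would need adjusting to the real markup.

```csharp
using System.Text.RegularExpressions;

static class ScoreRegexSketch
{
    // Pattern for a hypothetical row: team, goals, goals, team.
    private static readonly Regex RowPattern = new Regex(
        @"<td>(?<home>[^<]+)</td>\s*<td>(?<homeGoals>\d+)</td>\s*" +
        @"<td>(?<awayGoals>\d+)</td>\s*<td>(?<away>[^<]+)</td>");

    // Returns e.g. "Lyon 2-1 Nice", or null when the row does not match.
    public static string Summarize(string rowHtml)
    {
        Match m = RowPattern.Match(rowHtml);
        if (!m.Success)
            return null;

        return m.Groups["home"].Value.Trim() + " " +
               m.Groups["homeGoals"].Value + "-" +
               m.Groups["awayGoals"].Value + " " +
               m.Groups["away"].Value.Trim();
    }
}
```

Named groups keep the extraction readable, and returning null for non-matching rows makes it easy to skip header or spacer rows while looping over the page.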

