Web scraper replacing some characters with question marks

I made a simple web scraper that scrapes lyrics for me and writes them to a database. Everything works, but for some reason some characters are being replaced with question marks, and when I view the data on a simple PHP web page I see a lot of mistakes in the lyrics.
I?m = I'm
Let?s = Let's
haven?t = haven't
Stuff like that.
I know the error is in my C# code because I put a breakpoint before it writes to the database and displayed the text in a rich text box. How would I get it to display these characters correctly?
public static string getSourceCode(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string sourceCode = sr.ReadToEnd();
sr.Close();
resp.Close();
return sourceCode;
}
........
string url = txbURL2.Text;
string sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf(" <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric

The problem is probably character encoding. My guess is that the web page you're scraping is encoded in UTF-8, but somewhere along the line the text is being converted to ASCII.
Check out the excellent article called "What every developer should know about character encoding" for more details.
Update
You could try this, although the StreamReader should default to UTF-8 anyway:
var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);
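
To make the ASCII guess concrete, here is a minimal sketch (the class and method names are made up for illustration). It pushes a string containing a typographic apostrophe (U+2019) through Encoding.ASCII, which substitutes '?' for every character outside 7-bit ASCII. If the scraped text shows exactly this symptom, some step between the download and the display is probably doing a similar lossy conversion.

```csharp
using System.Text;

public static class AsciiLossDemo
{
    // Round-trips text through ASCII. Characters outside 7-bit ASCII
    // are replaced with '?' by the encoder's default replacement fallback.
    public static string ThroughAscii(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    // Round-trips text through UTF-8, which can represent every Unicode character.
    public static string ThroughUtf8(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling AsciiLossDemo.ThroughAscii("I\u2019m") returns "I?m", the same damage shown in the question, while ThroughUtf8 returns the input unchanged.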

Check the encoding by searching for charset in the HTML code.
Your code snippet omits the actual load process, so it is impossible to tell where it goes wrong.
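
The charset search can also be done in code. The following is only a sketch: CharsetSniffer and FindDeclaredCharset are hypothetical names, and the regex covers just the two common meta tag forms, not everything a real HTML parser would handle.

```csharp
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Matches charset=... inside a <meta> tag, covering both
    // <meta charset="utf-8"> and the older http-equiv Content-Type form.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null if the markup declares none.
    public static string FindDeclaredCharset(string html)
    {
        Match match = MetaCharset.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

The returned name can then be passed to Encoding.GetEncoding when constructing the StreamReader. The server may also declare a charset in the Content-Type header, which HttpWebResponse exposes through its CharacterSet property.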

You can also try using the WebClient:
WebClient client = new WebClient { Encoding = Encoding.UTF8 };
string html = client.DownloadString(url);

Related

UWP - How to get website contents and store them in a string?

I am trying to make an app that can read text from a website and store it in a string.
For example my app could open this random generator website, which would generate a random number string and then my program would read it and store it in a string.
Is that even possible?

I didn't get your goal, but you can fetch the whole HTML page and parse it as you wish:
var httpClient = new HttpClient();
var htmlString = await httpClient.GetStringAsync(new Uri("http://google.com"));

You can also use this and specify the encoding:
string text = null;
using (WebResponse response = WebRequest.Create(url).GetResponse())
{
using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("iso-8859-1")))
{
text = reader.ReadToEnd();
reader.Close();
}
response.Close();
}

Get data from a Pastebin raw

On form load, I'm trying to count the number of lines in a Pastebin raw and return the value to a textbox. I've been racking my brains and still can't figure it out.
textBox1.Text = new WebClient().DownloadString("yourlink").

I'm expanding my comment to an answer.
As already mentioned, you need an HttpRequest or WebRequest to get the content of the page.
Maybe new WebClient().DownloadString(url);, but I prefer to use the WebRequest since it's also supported in .NET Core.
What you need to do is extract the content of the raw textarea element from the HTML. I know people will probably hate me for it, but I used a regex for that task. Alternatively, you can use an HTML parser.
The Raw data is contained within a textarea with following attributes:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
So the regex pattern looks like this:
private static string rgxPatternPasteBinRawContent = @"<textarea id=""paste_code"" class=""paste_code"" name=""paste_code"" onkeydown=""return catchTab\(this,event\)"">(.*)<\/textarea>";
Since the HTML code is spread over multiple lines, our regex has to be used with the Singleline option.
Regex rgx = new Regex(rgxPatternPasteBinRawContent, RegexOptions.Singleline);
Now find the match, that contains the RAW data:
string htmlContent = await GetHtmlContentFromPage("SomePasteBinURL");
//Possibly your new WebClient().DownloadString("SomePasteBinURL");
//await is not necessarily needed here!
Match match = rgx.Match(htmlContent);
string rawContent = "ERROR: No Raw content found!";
//Groups.Count is always at least 1 (group 0), so test Success instead
if (match.Success)
{
rawContent = match.Groups[1].Value;
}
//Split already returns one entry per line, so no + 1 is needed
int numberOfLines = rawContent.Split('\n').Length;
And you're done.
The WebRequest looks like this for me:
private static async Task<string> GetHtmlContentFromPage(string url)
{
WebRequest request = WebRequest.CreateHttp(url);
WebResponse response = await request.GetResponseAsync();
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = null;
readStream = new StreamReader(receiveStream);
string data = readStream.ReadToEnd();
response.Dispose();
readStream.Dispose();
return data;
}
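
For the line count itself, a StringReader-based helper avoids off-by-one surprises from Split, such as an extra empty entry after a trailing newline. This is a sketch, not part of the original answer; LineCounter and CountLines are hypothetical names.

```csharp
using System.IO;

public static class LineCounter
{
    // Counts lines the way StringReader.ReadLine sees them: "\n", "\r" and
    // "\r\n" all end a line, and a trailing newline does not add an empty line.
    public static int CountLines(string text)
    {
        int count = 0;
        using (var reader = new StringReader(text))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }
}
```

With this helper, textBox1.Text = LineCounter.CountLines(rawContent).ToString(); would fill the textbox from the question.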

HttpWebRequest wrong encoding determination

I'm trying to read the html page text from site - http://konungstvo.ru/ , which has utf-8 encoding.
var request = _requestCreator.Create(uri);
try
{
using (var response = request.GetResponse())
{
if (response.ContentType.Contains("text/html"))
{
using (var reader = new System.IO.StreamReader(response.GetResponseStream()))
{
string responseText = reader.ReadToEnd();
}
But I'm getting \u001f�\b\01V\u0002X\u0002��X�n\u001b�, and so on, although code works with other sites.

I think you need a character encoding for the Cyrillic alphabet, which could be ISO/IEC 8859-5 or, for example, Windows-1251:
var encoding = Encoding.GetEncoding("iso-8859-5");
using (var reader = new System.IO.StreamReader(response.GetResponseStream(), encoding))
Using this while reading the response stream yields some Cyrillic content, which unfortunately isn't the correct output either: https://dotnetfiddle.net/x8jnN8. So I'm sorry, but this isn't a real answer to your problem :/
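
One more thing worth checking: the output quoted in the question starts with \u001f, an undecodable byte, and \b, which matches the gzip header bytes 0x1F 0x8B 0x08. That suggests the body may be gzip-compressed rather than decoded with the wrong charset. The sketch below assumes the raw response bytes are already in memory; GzipBody and its methods are hypothetical names.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipBody
{
    // True if the buffer starts with the gzip magic bytes 0x1F 0x8B.
    public static bool LooksGzipped(byte[] body)
    {
        return body.Length >= 2 && body[0] == 0x1F && body[1] == 0x8B;
    }

    // Decompresses the body if it looks like gzip data, then decodes it as UTF-8.
    public static string DecodeUtf8(byte[] body)
    {
        if (!LooksGzipped(body))
        {
            return Encoding.UTF8.GetString(body);
        }
        using (var compressed = new MemoryStream(body))
        using (var gzip = new GZipStream(compressed, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

If request is an HttpWebRequest (the question does not show what _requestCreator returns), setting request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate; before calling GetResponse() lets the framework do the decompression instead.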

foreign characters; language translation proxy application

I am working on a C# application that sends an HTTP request to a language translation server that converts Norwegian to English and vice versa. The translation results I'm getting back suggest that certain Norwegian characters like "Ø" are being garbled. Stepping through the code, I found this character being garbled in the URI members of the HttpWebRequest class, so I've explicitly specified the URI (instead of just depending on the HttpWebRequest class to do it for me). But my translation result remains garbled. I now suspect that the StreamReader is the culprit, and have tried UTF-8, Unicode, and ASCII encodings on a Norwegian machine, but to no avail. Please advise. Thank you in advance. The code for the function is pasted below.
Other details:
Example of input content: følger;
Example of translation result: ? lger;
Input and output are in unicode text files.
public string Translate(string requestquery)
{
string translatedresult = "";
Uri uri = new Uri(requestquery, true); //this appears to have solved the garbled URI
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string charSet = response.CharacterSet;
Stream ReceiveStream = response.GetResponseStream();
//I've tried all sorts of encoding here and even changing the machine locale
string contentEncoding = response.ContentEncoding;
StreamReader sr = new StreamReader(ReceiveStream, Encoding.UTF8, false);
translatedresult = sr.ReadToEnd();
sr.Close();
return translatedresult;
}

C# to PHP base64 encode/decode

So I have this C# application that needs to ping my web server, which is running a Linux/PHP stack.
I am having problems with the C# way of base64-encoding bytes.
My C# code is like:
byte[] encbuff = System.Text.Encoding.UTF8.GetBytes("the string");
String enc = Convert.ToBase64String(encbuff);
and php side:
$data = $_REQUEST['in'];
$raw = base64_decode($data);
With larger strings (100+ chars) it fails.
I think this is due to C# adding '+'s in the encoding, but I'm not sure.
Any clues?

You should probably URL Encode your Base64 string on the C# side before you send it.
And URL Decode it on the php side prior to base64 decoding it.
C# side
byte[] encbuff = System.Text.Encoding.UTF8.GetBytes("the string");
string enc = Convert.ToBase64String(encbuff);
string urlenc = Server.UrlEncode(enc);
and php side:
$data = $_REQUEST['in'];
$decdata = urldecode($data);
$raw = base64_decode($decdata);

Note that + is a valid character in base64 encoding, but when used in URLs it is often translated back to a space. This space may be confusing your PHP base64_decode function.
You have two approaches to solving this problem:
1. Use %-encoding to encode the + character before it leaves your C# application.
2. In your PHP application, translate space characters back to + before passing to base64_decode.
The first option is probably your better choice.
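
A minimal sketch of the %-encoding approach on the C# side, using Uri.EscapeDataString so it also works outside ASP.NET (Base64Query and EncodeForUrl are hypothetical names):

```csharp
using System;
using System.Text;

public static class Base64Query
{
    // Base64-encodes the text as UTF-8, then percent-encodes the result so
    // that '+', '/' and '=' survive transport in a query string or form body.
    public static string EncodeForUrl(string text)
    {
        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
        return Uri.EscapeDataString(base64);
    }
}
```

PHP has normally decoded the value once by the time it appears in $_REQUEST, so it can be passed straight to base64_decode.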

This seems to work, replacing + with %2B:
private string HTTPPost(string URL, Dictionary<string, string> FormData)
{
UTF8Encoding UTF8encoding = new UTF8Encoding();
string postData = "";
foreach (KeyValuePair<String, String> entry in FormData)
{
postData += entry.Key + "=" + entry.Value + "&";
}
postData = postData.Remove(postData.Length - 1);
//urlencode replace (+) with (%2B) so it will not be changed to space ( )
postData = postData.Replace("+", "%2B");
byte[] data = UTF8encoding.GetBytes(postData);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = data.Length;
Stream strm = request.GetRequestStream();
// Send the data.
strm.Write(data, 0, data.Length);
strm.Close();
WebResponse rsp = null;
// Send the data to the webserver
rsp = request.GetResponse();
StreamReader rspStream = new StreamReader(rsp.GetResponseStream());
string response = rspStream.ReadToEnd();
return response;
}

Convert.ToBase64String doesn't seem to add anything extra as far as I can see. For instance:
byte[] bytes = new byte[1000];
Console.WriteLine(Convert.ToBase64String(bytes));
The above code prints out a load of AAAAs with == at the end, which is correct.
My guess is that $data on the PHP side doesn't contain what enc did on the C# side - check them against each other.

In C#,
this is a <B>long</b>string. and lets make this a3214 ad0-3214 0czcx 909340 zxci 0324#$@#$%%13244513123
turns into
dGhpcyBpcyBhIDxCPmxvbmc8L2I+c3RyaW5nLiBhbmQgbGV0cyBtYWtlIHRoaXMgYTMyMTQgYWQwLTMyMTQgMGN6Y3ggOTA5MzQwIHp4Y2kgMDMyNCMkQCMkJSUxMzI0NDUxMzEyMw==
for me. And I think that + is breaking it all.

The PHP side should be:
$data = $_REQUEST['in'];
// $decdata = urldecode($data);
$raw = base64_decode($data);
The $_REQUEST should already be URLdecoded.
