download a web page and save as UTF-8 text file - c#

I download a web page as follows. I want to save it as UTF-8 text. But how?
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    Encoding enc = Encoding.GetEncoding(resp.CharacterSet);
    Encoding utf8 = Encoding.UTF8;
    using (StreamWriter w = new StreamWriter(new FileStream(pathname, FileMode.Create), utf8))
    {
        using (StreamReader r = new StreamReader(resp.GetResponseStream()))
        {
            // This works, but it's bad because you read the whole response into memory:
            string s = r.ReadToEnd();
            w.Write(s);

            // This doesn't work :(
            char[] buffer = new char[1024];
            int n;
            while (!r.EndOfStream)
            {
                n = r.ReadBlock(buffer, 0, 1024);
                w.Write(utf8.GetChars(Encoding.Convert(enc, utf8, enc.GetBytes(buffer))));
            }
            // This means that r.ReadToEnd() is doing the transcoding to UTF-8 differently.
            // But how?!
        }
    }
    return resp.StatusCode;
}

You could simply use the WebClient class. It supports encodings and is easier to use:
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
webClient.DownloadFile(url, "file.txt");
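For completeness, the streaming loop in the original question fails for two reasons: StreamReader has already decoded the response bytes into chars, so re-encoding them with enc.GetBytes and Encoding.Convert is a double conversion, and it processes the whole 1024-char buffer even when ReadBlock returned fewer chars. A minimal sketch of the fix, assuming the reader is constructed with the response encoding (new StreamReader(resp.GetResponseStream(), enc)):
// Write the decoded chars straight to the UTF-8 StreamWriter; the writer
// performs the UTF-8 encoding, so no manual conversion is needed.
char[] buffer = new char[1024];
int n;
while ((n = r.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
    w.Write(buffer, 0, n); // only the n chars actually read
}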

Related

Download CSV file with WebClient in C# but the size of file is less than when download with browser

I have a link that returns a CSV file. When I open it in a browser (Chrome, Firefox, ...) the downloaded file is 86 KB, but when I download it with the code below, it is only 25 KB, and when I open the downloaded file the data is not correct (no columns, and the data can't be read).
You can try it in the browser and in code:
http://tsetmc.com/tsev2/data/Export-txt.aspx?t=i&a=1&b=0&i=43283802997035462
string url = "http://tsetmc.com/tsev2/data/Export-txt.aspx?t=i&a=1&b=0&i=43283802997035462";
WebClient wc = new WebClient();
wc.DownloadFile(url, "111.csv");
WebClient is returning a zip file instead of a plain text/CSV file. I changed the wc output file extension to .zip and it works; the zip contains the file that you specified in the argument. (Screenshot from a REST client omitted.)
As Akshay Sandhu pointed out, the downloaded file is compressed with the gzip encoding, which is why it appears corrupted when you try to open it as a CSV.
To download the file and automatically decompress it, refer to these two SO answers.
First, download the file using the HttpWebRequest class instead of the WebClient class, as done here:
How to Download the File using HttpWebRequest and HttpWebResponse class(Cookies,Credentials,etc.)
Then make sure the file is automatically decompressed. Check this out:
Automatically decompress gzip response via WebClient.DownloadData
Here is the working code:
string url = "http://tsetmc.com/tsev2/data/Export-txt.aspx?t=i&a=1&b=0&i=43283802997035462";
string path = "111.csv";
using (FileStream fileStream = new FileStream(path, System.IO.FileMode.OpenOrCreate, System.IO.FileAccess.Write))
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = WebRequestMethods.Http.Get;
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    const int BUFFER_SIZE = 16 * 1024;
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        using (var responseStream = response.GetResponseStream())
        {
            var buffer = new byte[BUFFER_SIZE];
            int bytesRead;
            do
            {
                bytesRead = responseStream.Read(buffer, 0, BUFFER_SIZE);
                fileStream.Write(buffer, 0, bytesRead);
            } while (bytesRead > 0);
        }
    }
}
You need to decompress the gzip before you write it to the file.
var url = new Uri("http://tsetmc.com/tsev2/data/Export-txt.aspx?t=i&a=1&b=0&i=43283802997035462");
var path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
var fileName = "111.csv";
using (WebClient wc = new WebClient())
using (Stream s = File.Create(Path.Combine(path, fileName)))
using (GZipStream gs = new GZipStream(wc.OpenRead(url), CompressionMode.Decompress))
{
    // Saves to C:\Users\[YourUser]\Desktop\111.csv
    gs.CopyTo(s);
}
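Alternatively, the approach behind the "Automatically decompress gzip response" link above can be sketched as a small WebClient subclass that enables automatic decompression on the underlying HttpWebRequest (a sketch, not the exact code from that answer):
// Sketch: a WebClient that transparently decompresses gzip/deflate responses.
public class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = (HttpWebRequest)base.GetWebRequest(address);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}
// Usage: new DecompressingWebClient().DownloadFile(url, "111.csv");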

c# files downloaded with httpwebrequest and cookies get corrupted

I am trying to make a program that downloads files by URI (URL) using HttpWebRequest and cookies (credential information to keep the login status).
I can download files with the following code, but they are corrupted after being downloaded.
When I download an xlsx file (on the web page) into a text file on my local drive, I can see some of the numbers and words from the original file in the corrupted file, so I assume I have reached the right file.
However, when I download the xlsx file (on the web page) into an xlsx file on my local drive, it fails to open, saying:
Excel cannot open the file 'filename.xlsx' because the file format or
file extension is not valid. Verify that the file has not been
corrupted and that the file extension matches the format of the file.
Is there any way I can keep the original file content fully intact after downloading?
I attach a part of the resulting content as well.
private void btsDownload_Click(object sender, EventArgs e)
{
    try
    {
        string filepath1 = @"PathAndNameofFile.txt";
        string sTmpCookieString = GetGlobalCookies(webBrowser1.Url.AbsoluteUri);

        HttpWebRequest fstRequest = (HttpWebRequest)WebRequest.Create(sLinkDwPage);
        fstRequest.Method = "GET";
        fstRequest.CookieContainer = new System.Net.CookieContainer();
        fstRequest.CookieContainer.SetCookies(webBrowser1.Document.Url, sTmpCookieString);
        HttpWebResponse fstResponse = (HttpWebResponse)fstRequest.GetResponse();
        StreamReader sr = new StreamReader(fstResponse.GetResponseStream());
        string sPageData = sr.ReadToEnd();
        sr.Close();

        string sViewState = ExtractInputHidden(sPageData, "__VIEWSTATE");
        string sEventValidation = this.ExtractInputHidden(sPageData, "__EVENTVALIDATION");
        string sUrl = ssItemLinkDwPage;

        HttpWebRequest hwrRequest = (HttpWebRequest)WebRequest.Create(sUrl);
        hwrRequest.Method = "POST";
        string sPostData = "__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=" + sViewState + "&__EVENTVALIDATION=" + sEventValidation + "&Name=test" + "&Button1=Button";
        ASCIIEncoding encoding = new ASCIIEncoding();
        byte[] bByteArray = encoding.GetBytes(sPostData);
        hwrRequest.ContentType = "application/x-www-form-urlencoded";

        Uri convertedURI = new Uri(ssDwPage);
        hwrRequest.CookieContainer = new System.Net.CookieContainer();
        hwrRequest.CookieContainer.SetCookies(convertedURI, sTmpCookieString);
        hwrRequest.ContentLength = bByteArray.Length;

        Stream sDataStream = hwrRequest.GetRequestStream();
        sDataStream.Write(bByteArray, 0, bByteArray.Length);
        sDataStream.Close();

        using (WebResponse response = hwrRequest.GetResponse())
        {
            using (sDataStream = response.GetResponseStream())
            {
                StreamReader reader = new StreamReader(sDataStream);
                {
                    string sResponseFromServer = reader.ReadToEnd();
                    FileStream fs = File.Open(filepath1, FileMode.OpenOrCreate, FileAccess.Write);
                    Byte[] info = encoding.GetBytes(sResponseFromServer);
                    fs.Write(info, 0, info.Length);
                    fs.Close();
                    reader.Close();
                    sDataStream.Close();
                    response.Close();
                }
            }
        }
    }
    catch
    {
        MessageBox.Show("Error");
    }
}
StreamReader is for text data; using it corrupts your binary data (the Excel file).
Write sDataStream directly to the file. For example:
sDataStream.CopyTo(fs);
PS: I prepared a test case (using similar logic) to show how your code corrupts the data:
var binaryData = new byte[] { 128, 255 };
var sr = new StreamReader(new MemoryStream(binaryData));
var str3 = sr.ReadToEnd();
var newData = new ASCIIEncoding().GetBytes(str3); // <-- 63, 63
Just compare binaryData with newData.
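Applied to the question's code, a minimal sketch of the corrected response-handling block (same request setup as above, with no StreamReader or re-encoding in between):
// Copy the response body to disk as raw bytes so binary files
// (xlsx, images, ...) survive the round trip intact.
using (WebResponse response = hwrRequest.GetResponse())
using (Stream responseStream = response.GetResponseStream())
using (FileStream fs = File.Open(filepath1, FileMode.Create, FileAccess.Write))
{
    responseStream.CopyTo(fs);
}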

WP8 Gzip compression

WP8 does not support Gzip compression, but there are third-party libraries that will allow you to do so. I have tried many, but I am not able to make it work. This is my latest try:
var handler = new HttpClientHandler();
if (handler.SupportsAutomaticDecompression)
{
    handler.AutomaticDecompression = DecompressionMethods.GZip |
                                     DecompressionMethods.Deflate;
}

Uri myUri = new Uri("http://www.web.com");
HttpClient client = new HttpClient(handler);
client.BaseAddress = myUri;
client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
client.DefaultRequestHeaders.Add("ubq-compression", "gzip");

HttpRequestMessage req = new HttpRequestMessage(HttpMethod.Post, myUri);
req.Content = new StringContent(finalURL, Encoding.UTF8);
HttpResponseMessage rm = client.SendAsync(req).Result;
string rst = await rm.Content.ReadAsStringAsync();
The API returns an array of bytes, but the first 300 bytes are not gzipped while everything after them is. I need to unzip everything that comes after those 300 bytes.
I am using http://www.nuget.org/packages/Microsoft.Net.Http
// I am splitting the array
byte[] hJ = res.Take(300).ToArray();
byte[] bJ = res.Skip(300).ToArray();
bJ is what needs to be decompressed. I am trying this:
MemoryStream stream = new MemoryStream();
stream.Write(bJ, 0, bJ.Length);
using (var inStream = new MemoryStream(bJ))
{
    var bigStreamsss = new GZipStream(inStream, CompressionMode.Decompress, true);
    using (var bigStreamOut = new MemoryStream())
    {
        bigStreamsss.CopyTo(bigStreamOut);
        output = Encoding.UTF8.GetString(bigStreamOut.ToArray(), 0, bigStreamOut.ToArray().Length);
    }
}
but it is always failing on this line:
var bigStreamsss = new GZipStream(inStream, CompressionMode.Decompress, true);
Any help would be much appreciated
If you are using the compression header, there's nothing you need to do: the server compresses and the client decompresses automatically, and you don't have to worry about anything. However, it sounds like you're using some proprietary content-compression scheme where only part of the body is compressed. If that's the case, don't mess with any compression settings on the HTTP client; instead, use a third-party decompression library. Just seek 300 bytes into the response stream, then pass that stream to the library. You should be able to use the inflater from my gzip library, which you can find here: https://github.com/dotMorten/SharpGIS.GZipWebClient/blob/master/src/SharpGIS.GZipWebClient/GZipDeflateStream.cs
It's extremely lightweight (it's just this one file). First call
myResultStream.Seek(300, SeekOrigin.Begin);
If the stream isn't seekable, you will need to read the first 300 bytes first. Then use my class to decompress the rest:
Stream gzipStream = new SharpGIS.GZipInflateStream(myResultStream);
You can now read gzipStream as if it were an uncompressed stream.
However, I really don't understand why you don't use standard HTTP compression and let the server compress everything, including the first 300 bytes. It's much easier and better for all the kittens out there.
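Putting those pieces together, a minimal sketch of the non-seekable case (myResultStream stands for the raw response stream, and the inflater class is the one from the linked library):
// Skip the 300-byte uncompressed prefix, then decompress the remainder.
byte[] header = new byte[300];
int read = 0;
while (read < header.Length)
{
    int n = myResultStream.Read(header, read, header.Length - read);
    if (n == 0) break; // stream ended before 300 bytes -- treat as an error
    read += n;
}
using (Stream gzipStream = new SharpGIS.GZipInflateStream(myResultStream))
using (var reader = new StreamReader(gzipStream, Encoding.UTF8))
{
    string payload = reader.ReadToEnd(); // the decompressed tail of the body
}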
You can use the http://www.componentace.com/zlib_.NET.htm library (available through NuGet). Part of the request handler:
try {
    var request = (HttpWebRequest)callbackResult.AsyncState;
    using (var response = request.EndGetResponse(callbackResult))
    using (var stream = response.GetResponseStream()) {
        bool zip = false;
        if (response.Headers.AllKeys.Contains("Content-Encoding") &&
            response.Headers[HttpRequestHeader.ContentEncoding].ToLower() == "gzip") zip = true;
        using (var reader = zip ?
#if NETFX_CORE
            new StreamReader(new GZipStream(stream, CompressionMode.Decompress))
#else
            new StreamReader(new zlib.ZOutputStream(stream), false)
#endif
            : new StreamReader(stream)) {
            var str = reader.ReadToEnd();
            result = new ObjectResult(str);
        }
    }
}
catch (WebException e) { ... }

Can't download complete image file from skydrive using REST API

I'm working on a quick wrapper for the SkyDrive API in C#, but I'm running into issues downloading a file. The first part of the file comes through fine, but then differences start to appear, and shortly thereafter everything becomes null. I'm fairly sure it's just me not reading the stream correctly.
This is the code I'm using to download the file:
public const string ApiVersion = "v5.0";
public const string BaseUrl = "https://apis.live.net/" + ApiVersion + "/";

public SkyDriveFile DownloadFile(SkyDriveFile file)
{
    string uri = BaseUrl + file.ID + "/content";
    byte[] contents = GetResponse(uri);
    file.Contents = contents;
    return file;
}

public byte[] GetResponse(string url)
{
    checkToken();
    Uri requestUri = new Uri(url + "?access_token=" + HttpUtility.UrlEncode(token.AccessToken));
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
    request.Method = WebRequestMethods.Http.Get;
    WebResponse response = request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    byte[] contents = new byte[response.ContentLength];
    responseStream.Read(contents, 0, (int)response.ContentLength);
    return contents;
}
This is the image file I'm trying to download, and this is the image I'm getting back (images omitted here).
These two images lead me to believe that I'm not waiting for the response to finish coming through, because the Content-Length matches the size of the image I'm expecting. But I'm not sure how to make my code wait for the entire response, or even whether that's the right approach.
Here's my test code in case it's helpful
[TestMethod]
public void CanUploadAndDownloadFile()
{
    var api = GetApi();
    SkyDriveFolder folder = api.CreateFolder(null, "TestFolder", "Test Folder");
    SkyDriveFile file = api.UploadFile(folder, TestImageFile, "TestImage.png");
    file = api.DownloadFile(file);
    api.DeleteFolder(folder);

    byte[] contents = new byte[new FileInfo(TestImageFile).Length];
    using (FileStream fstream = new FileStream(TestImageFile, FileMode.Open))
    {
        fstream.Read(contents, 0, contents.Length);
    }
    using (FileStream fstream = new FileStream(TestImageFile + "2", FileMode.CreateNew))
    {
        fstream.Write(file.Contents, 0, file.Contents.Length);
    }

    Assert.AreEqual(contents.Length, file.Contents.Length);
    bool sameData = true;
    for (int i = 0; i < contents.Length && sameData; i++)
    {
        sameData = contents[i] == file.Contents[i];
    }
    Assert.IsTrue(sameData);
}
It fails at Assert.IsTrue(sameData);
This is because you don't check the return value of responseStream.Read(contents, 0, (int)response.ContentLength);. Read doesn't guarantee that it reads response.ContentLength bytes; it returns the number of bytes actually read. You can use a loop, or stream.CopyTo, there.
Something like this:
WebResponse response = request.GetResponse();
MemoryStream m = new MemoryStream();
response.GetResponseStream().CopyTo(m);
byte[] contents = m.ToArray();
As LB already said, you need to continue to call Read() until you have read the entire stream.
Although Stream.CopyTo will copy the entire stream, it does not verify that the number of bytes read matches what you expected. The following method solves this, raising an IOException if it does not read the length specified:
public static void Copy(Stream input, Stream output, long length)
{
    byte[] bytes = new byte[65536];
    long bytesRead = 0;
    int len = 0;
    while (0 != (len = input.Read(bytes, 0, Math.Min(bytes.Length, (int)Math.Min(int.MaxValue, length - bytesRead)))))
    {
        output.Write(bytes, 0, len);
        bytesRead = bytesRead + len;
    }
    output.Flush();
    if (bytesRead != length)
        throw new IOException();
}
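For instance, wired into the question's GetResponse method (a sketch reusing the names from the question):
// Download exactly ContentLength bytes or fail loudly.
using (WebResponse response = request.GetResponse())
using (Stream responseStream = response.GetResponseStream())
using (MemoryStream m = new MemoryStream())
{
    Copy(responseStream, m, response.ContentLength); // throws IOException if short
    return m.ToArray();
}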

How to GET data from an URL and save it into a file in binary in C#.NET without the encoding mess?

In C#.NET, I want to fetch data from a URL and save it to a file in binary form.
Using HttpWebRequest/StreamReader to read into a string and saving with StreamWriter works fine for ASCII, but non-ASCII characters get mangled because the system thinks it has to worry about encodings, converting to or from Unicode or whatever.
What is the easiest way to GET data from a URL and save it to a file, binary, as-is?
// This code works, but for ASCII only
String url = "url...";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream ReceiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(ReceiveStream);
string contents = readStream.ReadToEnd();

string filename = @"...";

// create a writer and open the file
TextWriter tw = new StreamWriter(filename);
tw.Write(contents.Substring(5));
tw.Close();
Minimalist answer:
using (WebClient client = new WebClient()) {
    client.DownloadFile(url, filePath);
}
Or in PowerShell (suggested in an anonymous edit):
$client = New-Object System.Net.WebClient
$client.DownloadFile($URL, $Filename)
Just don't use any StreamReader or TextWriter. Save into a file with a raw FileStream.
String url = ...;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream ReceiveStream = response.GetResponseStream();

string filename = ...;
byte[] buffer = new byte[1024];
FileStream outFile = new FileStream(filename, FileMode.Create);
int bytesRead;
while ((bytesRead = ReceiveStream.Read(buffer, 0, buffer.Length)) != 0)
    outFile.Write(buffer, 0, bytesRead);
outFile.Close(); // or wrap these in a using statement instead
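And a minimal sketch of the using-based variant mentioned in the comment, so the streams are closed even if an exception occurs:
// The same raw byte copy with deterministic cleanup.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream receiveStream = response.GetResponseStream())
using (FileStream outFile = new FileStream(filename, FileMode.Create))
{
    receiveStream.CopyTo(outFile); // copies raw bytes until end of stream
}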
This is what I use:
sUrl = "http://your.com/xml.file.xml";
rssReader = new XmlTextReader(sUrl.ToString());
rssDoc = new XmlDocument();

WebRequest wrGETURL;
wrGETURL = WebRequest.Create(sUrl);

Stream objStream;
objStream = wrGETURL.GetResponse().GetResponseStream();
StreamReader objReader = new StreamReader(objStream, Encoding.UTF8);

WebResponse wr = wrGETURL.GetResponse();
Stream receiveStream = wr.GetResponseStream();
StreamReader reader = new StreamReader(receiveStream, Encoding.UTF8);
string content = reader.ReadToEnd();

XmlDocument content2 = new XmlDocument();
content2.LoadXml(content);
content2.Save("direct.xml");
