I have an ASP.NET MVC application. In my view I upload a text file and process it with a controller method with this signature:
[HttpPost]
public ActionResult FromCSV(HttpPostedFileBase file, string platform)
I get a stream from the uploaded file as file.InputStream and read it using a standard StreamReader
using (var sr = new StreamReader(file.InputStream))
{
...
}
The problem is that this only works for UTF-8 text files. When I have a text file in Windows-1250, the characters get messed up. I can work with Windows-1250-encoded text files when I explicitly specify the encoding:
using (var sr = new StreamReader(file.InputStream, Encoding.GetEncoding(1250)))
{
...
}
My problem is that I need to support both UTF-8 and Windows-1250 encoded files, so I need a way to detect the encoding of the submitted file.
With exception fallback, trying to decode a file encoded in Windows-1250 as UTF-8 is extremely likely to throw (and if it does not, the file only uses the ASCII subset, so it does not matter which of the two encodings you decode with). So you could do something like this:
Encoding[] encodings = new Encoding[] {
    Encoding.GetEncoding("UTF-8", new EncoderExceptionFallback(), new DecoderExceptionFallback()),
    Encoding.GetEncoding(1250, new EncoderExceptionFallback(), new DecoderExceptionFallback())
};

string result = null;
foreach (Encoding enc in encodings)
{
    try
    {
        result = enc.GetString(fileAsByteArray);
        break;
    }
    catch (DecoderFallbackException)
    {
        // Not valid in this encoding; try the next one.
    }
}
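Wrapped into a helper, the same idea looks like this (the class and method names here are mine, not from the answer; in the controller you would first copy file.InputStream into a byte array, for example via a MemoryStream, and pass it in):

```csharp
using System;
using System.Text;

static class EncodingGuesser
{
    // Strict UTF-8 decode first; if the bytes are not valid UTF-8, fall back
    // to Windows-1250. Pure-ASCII input decodes identically either way.
    // On .NET Core / .NET 5+, code page 1250 additionally requires:
    //   Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    public static string DecodeWithFallback(byte[] bytes)
    {
        var strictUtf8 = Encoding.GetEncoding("UTF-8",
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding(1250).GetString(bytes);
        }
    }
}
```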
To give you a bit of context I'm trying to download a bunch of attachments in one bulk operation. These attachments are normally downloaded via a website a file at a time and the MVC controller code which retrieves the attachment looks something like this:
var attachment = _attachmentsRepository.GetAttachment(websiteId, id);
if (attachment.FileStream == null || !attachment.FileStream.CanRead)
{
return new HttpResponseMessage(HttpStatusCode.BadRequest);
}
var content = new StreamContent(attachment.FileStream);
content.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment") { FileName = $"{id}.{attachment.Extension}" };
return new HttpResponseMessage(HttpStatusCode.OK) { Content = content };
What I'm trying to do is write a console application for the bulk operation which will save each file directly to disk, and so far what I've got for a single file save is:
var attachment = attachmentsRepository.GetAttachment(websiteId, resource.Id);
attachment.FileStream.Position = 0;
var reader = new StreamReader(attachment.FileStream);
var content = reader.ReadToEnd();
File.WriteAllText(someFilePath, content);
I've side-stepped any HTTP-specific framework classes since I just need to download directly to a file via code instead of the browser. This code successfully generates files, but when I open them Excel indicates that they're corrupted, which I suspect is an encoding issue. I'm currently playing around with the encoding, but so far without much luck, so any help is appreciated.
Don't use StreamReader for processing binary data. The StreamReader/StreamWriter classes are for reading and writing human-readable text in a Stream, and as such they attempt to perform text encoding/decoding, which can mangle binary data (I feel the classes should be renamed StreamTextReader/StreamTextWriter to reflect their inheritance hierarchy).
To read raw binary data, use methods on the Stream class directly (e.g. Read, Write, and CopyTo).
Try this:
var attachment = ...
using (FileStream fs = new FileStream(someFilePath, FileMode.CreateNew, FileAccess.Write, FileShare.None, bufferSize: 8 * 1024, useAsync: true))
{
    await attachment.FileStream.CopyToAsync(fs);
}
The using() block will ensure fs is flushed and closed correctly.
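The corruption in the question comes from the ReadToEnd/WriteAllText round trip, which decodes the attachment's bytes as text and re-encodes them. A minimal self-contained sketch of the raw alternative (the class and method names are mine):

```csharp
using System.IO;

static class RawCopy
{
    // Copy a stream to a file byte-for-byte. No text decoding is involved,
    // so binary content (zip, xlsx, images) survives intact.
    public static void SaveToFile(Stream source, string path)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(fs);
        }
    }
}
```

In the console application you would call something like RawCopy.SaveToFile(attachment.FileStream, someFilePath) for each attachment in the bulk loop.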
I am currently developing a Windows Phone 8 application in which I have to download a CSV file from a web service and convert the data to C# business objects (I do not use a library for this part).
Downloading the file and converting the data to business objects is not an issue using RestSharp.Portable, the StreamReader class, and the MemoryStream class.
The issue I am facing is the bad encoding of the string fields.
With the RestSharp.Portable library, I retrieve the CSV file content as a byte array and then convert the data to strings with the following code (where response is a byte array):
using (var streamReader = new StreamReader(new MemoryStream(response)))
{
    while (streamReader.Peek() >= 0)
    {
        var csvLine = streamReader.ReadLine();
    }
}
but instead of "Jérome", my csvLine variable contains "J�rome". I tried several things to obtain "Jérome", but without success, like:
using (var streamReader = new StreamReader(new MemoryStream(response), true))
or
using (var streamReader = new StreamReader(new MemoryStream(response), Encoding.UTF8))
When I open the CSV file with a simple text editor like Notepad++, I obtain "Jérome" only when the file is read as ANSI. But if I try the following code in C#:
using (var streamReader = new StreamReader(new MemoryStream(response), Encoding.GetEncoding("ANSI")))
I have the following exception :
'ANSI' is not a supported encoding name.
Can someone help me decode my CSV file correctly?
Thank you in advance for your help or advice!
You need to pick one of these.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
If you don't know, you can try to guess it. Guessing isn't a perfect solution, per the answer here.
You can't detect the codepage; you need to be told it. You can analyse the bytes and guess, but that can give some bizarre (sometimes amusing) results.
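One cheap, partial heuristic when guessing is to check for a byte order mark. Absence of a BOM proves nothing, but its presence is a strong hint. This sketch is mine, not from the answer above, and only covers the UTF-8 BOM (UTF-16 files start with FF FE or FE FF instead):

```csharp
static class BomSniffer
{
    // Returns true when the buffer starts with the UTF-8 byte order mark (EF BB BF).
    public static bool HasUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }
}
```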
From the link Lawtonfogle provided, I tried to use
using (var streamReader = new StreamReader(new MemoryStream(response), Encoding.GetEncoding("Windows-1252")))
But I had the following error :
'Windows-1252' is not a supported encoding name.
Searching for the reason on the internet, I finally found a thread with an answer that works for me.
So here is the working solution in my case:
using (var streamReader = new StreamReader(new MemoryStream(response), Encoding.GetEncoding("ISO-8859-1")))
{
    while (streamReader.Peek() >= 0)
    {
        var csvLine = streamReader.ReadLine();
    }
}
I am dealing with files in many formats, including Shift-JIS and UTF-8 without a BOM. Using a bit of language knowledge, I can detect whether the files are being interpreted correctly as UTF-8 or Shift-JIS, but if I detect that the file is not of the type I read in, I was wondering if there is a way to just reinterpret my in-memory array without having to re-read the file with a new encoding specified.
Right now, I read in the file assuming Shift-JIS as such:
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
    string line = sr.ReadToEnd();
    // Detection must be done AFTER you read from the file. Silly rabbit.
    fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    codingFromBOM = sr.CurrentEncoding;
}
and after I do my magic to determine that it is either a known format (has a BOM) or that the data makes sense as Shift-JIS, all is well. If the data is garbage, though, I re-read the file via:
using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
    string line = sr.ReadToEnd();
}
I am trying to avoid this re-read step and reinterpret the data in memory if possible.
Or is the magic already happening, and am I needlessly worrying about double I/O access?
Read the file as raw bytes once; then you can decode the same array with as many encodings as you like, with no second disk read. Decode as UTF-8 first and fall back to Shift-JIS (code page 932) if the result contains the Unicode replacement character:
var buf = File.ReadAllBytes(path);
var text = Encoding.UTF8.GetString(buf);
if (text.Contains("\uFFFD")) // Unicode replacement character
{
    text = Encoding.GetEncoding(932).GetString(buf);
}
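A self-contained check of the replacement-character trick (the sample text is mine; note that Encoding.UTF8 uses replacement fallback by default, so invalid sequences become U+FFFD instead of throwing):

```csharp
using System;
using System.Text;

class ReplacementCharDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, code page 932 requires:
        //   Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] shiftJis = Encoding.GetEncoding(932).GetBytes("日本語");

        // 0x93 (the first byte of 日 in Shift-JIS) is a bare continuation
        // byte in UTF-8, so the UTF-8 decode produces U+FFFD.
        string asUtf8 = Encoding.UTF8.GetString(shiftJis);
        Console.WriteLine(asUtf8.Contains("\uFFFD")); // True: not valid UTF-8

        string asShiftJis = Encoding.GetEncoding(932).GetString(shiftJis);
        Console.WriteLine(asShiftJis == "日本語");     // True: round trip succeeds
    }
}
```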
I downloaded a webpage, and it contains a paragraph with this kind of quotation mark:
“I simply extracted this line from an html page”
but when I write it to a file, the “ character is not shown properly.
WebClient wc = new WebClient();
Stream strm = wc.OpenRead("http://images.thenews.com.pk/21-08-2013/ethenews/t-24895.htm");
StreamReader sr = new StreamReader(strm);
StreamWriter sw = new StreamWriter("D://testsharp.txt");
string line;
Console.WriteLine(sr.CurrentEncoding);
while ((line = sr.ReadLine()) != null)
{
    sw.WriteLine(line);
}
sw.Close();
strm.Close();
If all you want to do is write the file to disk, then either use the Stream API directly, or (even easier) just use:
wc.DownloadFile("http://images.thenews.com.pk/21-08-2013/ethenews/t-24895.htm",
#"D:\testsharp.txt");
If you don't treat it as binary, then you need to worry about encodings - and it isn't enough just to look at sr.CurrentEncoding, because we can't be sure that it detected it correctly. It could be that the encoding was reported in the HTTP headers, which would be nice. It could also be that the encoding is specified in a BOM at the start of the payload. However, in the case of HTML the encoding could also be specified inside the HTML. In all three cases, treating the file as binary will improve things (for the BOM and inside-the-html cases, it will fix it entirely).
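For the reported-in-the-HTTP-headers case, the charset usually arrives as a parameter of the Content-Type header (available via wc.ResponseHeaders after the request). A deliberately simple sketch of pulling it out (the helper name is mine; a real client should use a proper media-type parser):

```csharp
using System;

static class CharsetHelper
{
    // Extract the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=windows-1252" -> "windows-1252".
    // Returns null when no charset is declared.
    public static string CharsetFromContentType(string contentType)
    {
        foreach (var part in contentType.Split(';'))
        {
            var p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                return p.Substring("charset=".Length).Trim('"');
        }
        return null;
    }
}
```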
I have a problem: the output string must be UTF-8 formatted. I am currently writing an ANSI string to a zip file without problems, like this:
StreamReader tr = new StreamReader("myutf8-file.xml");
string myfilecontent = tr.ReadToEnd();
ZipFile zip = new ZipFile();
zip.AddFileFromString("my.xml", "", myfilecontent);
How do I force the string (the my.xml file content) to be UTF-8?
Don't use the deprecated AddFileFromString method. Use AddEntry(string, string, string, Encoding) instead:
zip.AddEntry("my.xml", "", myfilecontent, Encoding.UTF8);
If you're actually reading a UTF-8 text file to start with though, why not just open the stream and pass that into AddEntry? There's no need to decode from UTF-8 and then re-encode...
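A sketch of that stream-based route against DotNetZip (the output path "myarchive.zip" is my placeholder; the input file name is from the question). Passing the open stream means the bytes are stored as-is, with no decode/re-encode round trip:

```csharp
using (var zip = new ZipFile())
using (var fs = File.OpenRead("myutf8-file.xml"))
{
    // AddEntry(entryName, stream) copies the bytes as-is, so the
    // existing UTF-8 content is stored unchanged. The stream must
    // stay open until Save() runs, which the using order guarantees.
    zip.AddEntry("my.xml", fs);
    zip.Save("myarchive.zip");
}
```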