I downloaded a webpage and it contains a paragraph with this type of quotation mark:
“I simply extracted this line from the html page”
but when I write it to a file, the “ character is not shown properly.
WebClient wc = new WebClient();
Stream strm = wc.OpenRead("http://images.thenews.com.pk/21-08-2013/ethenews/t-24895.htm");
StreamReader sr = new StreamReader(strm);
StreamWriter sw = new StreamWriter("D://testsharp.txt");
String line;
Console.WriteLine(sr.CurrentEncoding);
while ((line = sr.ReadLine()) != null) {
    sw.WriteLine(line);
}
sw.Close();
strm.Close();
If all you want to do is write the file to disk, then use the Stream API directly, or (even easier) just use:
wc.DownloadFile("http://images.thenews.com.pk/21-08-2013/ethenews/t-24895.htm",
    @"D:\testsharp.txt");
If you don't treat it as binary, then you need to worry about encodings - and it isn't enough just to look at sr.CurrentEncoding, because we can't be sure that it detected it correctly. It could be that the encoding was reported in the HTTP headers, which would be nice. It could also be that the encoding is specified in a BOM at the start of the payload. However, in the case of HTML the encoding could also be specified inside the HTML. In all three cases, treating the file as binary will improve things (for the BOM and inside-the-html cases, it will fix it entirely).
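For illustration, here is a rough sketch of that binary approach (the URL is the one from the question; the charset-sniffing fallback at the end is my own assumption about how you might decode the text afterwards, not something the original answer specifies):
WebClient wc = new WebClient();

// Download the raw bytes - no text decoding happens here, so a wrong
// encoding guess cannot corrupt anything.
byte[] data = wc.DownloadData("http://images.thenews.com.pk/21-08-2013/ethenews/t-24895.htm");

// Writing the same bytes back out preserves the page exactly,
// including the curly quote characters.
File.WriteAllBytes(@"D:\testsharp.txt", data);

// If you also need the text, check the charset the server reported (if any)
// before decoding, and fall back to UTF-8 otherwise.
string contentType = wc.ResponseHeaders[HttpResponseHeader.ContentType] ?? "";
Encoding enc = Encoding.UTF8;
int idx = contentType.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
if (idx >= 0)
{
    try { enc = Encoding.GetEncoding(contentType.Substring(idx + 8).Trim()); }
    catch (ArgumentException) { /* unknown charset name - keep the UTF-8 fallback */ }
}
string html = enc.GetString(data);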
Related
When downloading an XML response from a REST API, I cannot get .NET to download the full XML document on many requests. In each case, I'm missing the last several characters of the XML file which means I can't parse it. The requests work fine in a browser.
I have tried WebResponse.GetResponseStream() using a StreamReader. Within the StreamReader I have tried Read(...) with a buffer, ReadLine(), and ReadToEnd() to build a string for the response. Wondering if there was a bug in my code, I also tried WebClient.DownloadString(url) with the same result and XmlDocument.Load(url) which just throws an exception (unexpected end of file while parsing ____).
I know for a fact that this API has had some encoding issues in the past, so I've tried specifying multiple different encodings (e.g., UTF-8, iso-8859-1) for the StreamReader as well as letting .NET detect the encoding. Changing the encoding seems to result in a different number of characters that get left off the end.
Is there any way I can detect the proper encoding myself? How does a browser do it? Is there somewhere in any browser to see the actual encoding the response is using (not what the HTTP headers say it's returning)? Any other methods of getting a string response from a web site with an unknown encoding?
StreamReader sample code
StringBuilder sb = new StringBuilder();
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    using (Stream stream = resp.GetResponseStream())
    {
        using (StreamReader sr = new StreamReader(stream))
        {
            int charsRead = 1;
            char[] buffer = new char[4096];
            while (charsRead > 0)
            {
                charsRead = sr.Read(buffer, 0, buffer.Length);
                sb.Append(buffer, 0, charsRead);
            }
        }
    }
}
WebClient sample code
WebClient wc = new WebClient();
string text = wc.DownloadString(url);
XmlDocument sample code
XmlDocument doc = new XmlDocument();
doc.Load(url);
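As a side note (my own sketch, not an answer from the thread): if you hand the raw bytes straight to the XML parser, it resolves the encoding itself from the BOM or the <?xml ... encoding="..."?> declaration instead of relying on a StreamReader guess. The URL below is just a placeholder:
// Placeholder URL - substitute the real API endpoint.
byte[] raw = new WebClient().DownloadData("http://example.com/api/resource.xml");

XmlDocument doc = new XmlDocument();
using (MemoryStream ms = new MemoryStream(raw))
{
    // No StreamReader in the middle, so no premature text decoding:
    // the XML reader picks the encoding from the BOM / declaration itself.
    doc.Load(ms);
}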
Related
I am dealing with files in many formats, including Shift-JIS and UTF-8 without a BOM. Using a bit of language knowledge, I can detect whether the files are being interpreted correctly as UTF-8 or Shift-JIS, but if I detect that the file is not of the type I read in, I was wondering if there is a way to just reinterpret my in-memory array without having to re-read the file with a new encoding specified.
Right now, I read in the file assuming Shift-JIS as such:
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
    String line = sr.ReadToEnd();
    // Detection must be done AFTER you read from the file. Silly rabbit.
    fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    codingFromBOM = sr.CurrentEncoding;
}
and after I do my magic to determine that it is either a known format (has a BOM) or that the data makes sense as Shift-JIS, all is well. If the data is garbage, though, then I re-read the file via:
using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
    String line = sr.ReadToEnd();
}
I am trying to avoid this re-read step and reinterpret the data in memory if possible.
Or is magic already happening and I am needlessly worrying about double I/O access?
Read the file's bytes once and decode them in memory; if the UTF-8 decode produced the Unicode replacement character, fall back to Shift-JIS (code page 932):
var buf = File.ReadAllBytes(path);
var text = Encoding.UTF8.GetString(buf);
if (text.Contains("\uFFFD")) // Unicode replacement character: the bytes were not valid UTF-8
{
    text = Encoding.GetEncoding(932).GetString(buf); // 932 = Shift-JIS code page
}
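A variation on the same idea (my own sketch, not part of the original answer): since a legitimate file could itself contain U+FFFD, you can instead ask the UTF-8 decoder to throw on invalid bytes and catch that:
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
string text;
try
{
    text = strictUtf8.GetString(buf);                 // succeeds only if the bytes are valid UTF-8
}
catch (DecoderFallbackException)
{
    text = Encoding.GetEncoding(932).GetString(buf);  // fall back to Shift-JIS
}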
Related
I have an ASP.NET project and want to return a CSV file when an AJAX post is sent (yes, it works; see Handle file download from AJAX post).
The special thing is that I want to create the result in a MemoryStream and return it as a FileResult.
But my problem now is that German umlauts (ä, ö, ü) get corrupted. So here is my code:
public ActionResult Download(FormCollection form) {
    string[] v = new string[16];
    MemoryStream stream = new MemoryStream();
    StreamWriter writer = new StreamWriter(stream,
        System.Text.Encoding.GetEncoding("Windows-1252"));
    SqlCommand cmd = dbconn.CreateCommand();
    //create SQL command
    while (rs.Read()) {
        v = new string[16];
        v[0] = rs.GetString("IstAktiv");
        v[1] = rs.GetString("Haus");
        //cache all the values
        ...
        //write cached values
        for (int i = 0; i < v.Length; i++) {
            if (i > 0) writer.Write(";");
            writer.Write(v[i]);
            writer.Flush();
        }
        writer.Write("\r\n");
        writer.Flush();
    } //end while rs.Read()
    FileContentResult ret = new FileContentResult(stream.ToArray(), "text/csv");
    ret.FileDownloadName = "Kontakte.csv";
    writer.Close();
    return ret;
} //end method
So when I open the resulting file in Excel the umlauts are converted into something strange. For example the upper case letter "Ä" is changed to "�".
So is there any possibility to solve this issue?
Best regards
To have Excel read CSV files correctly, the file should be in the UTF-8 encoding (with a BOM).
So, without a doubt, your StreamWriter would have to be set this way:
StreamWriter writer = new StreamWriter(stream,
    System.Text.Encoding.GetEncoding("UTF-8"));
However, if that doesn't work for you, then it's very likely that it's because the characters are being corrupted before you even get a chance to write them to the stream. You may be facing an encoding conversion problem as you are reading the data from the database.
v = new string[16];
v[0] = rs.GetString("IstAktiv");
v[1] = rs.GetString("Haus");
To validate that, place a breakpoint as you read the values into the 'v' array, and check that the characters still look ok at this step. If they are corrupted, then you know that the problem is between the code and the database, and the writing to the CSV is not the problem.
EDIT: Here is an isolated test case you can use to prove that UTF-8 is the correct encoding to write CSVs. Perhaps you can try that first:
Encoding enc = Encoding.GetEncoding("UTF-8");
using (StreamWriter writer = new StreamWriter(@"d:\test\test.csv", false, enc))
{
    writer.Write(@"""hello ä, ö, ü world""");
}
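For completeness, here is a minimal sketch of the MemoryStream-to-FileContentResult flow from the question with an explicit UTF-8 (with BOM) writer. The two hard-coded rows are placeholders standing in for the database loop, and the leaveOpen flag keeps the MemoryStream usable after the writer is disposed:
public ActionResult Download(FormCollection form)
{
    var stream = new MemoryStream();

    // new UTF8Encoding(true) emits the BOM, which is what Excel looks for.
    using (var writer = new StreamWriter(stream, new UTF8Encoding(true), 1024, leaveOpen: true))
    {
        writer.WriteLine("IstAktiv;Haus");      // placeholder header row
        writer.WriteLine("1;Müllerstraße 12");  // umlauts survive the round trip
    }

    return new FileContentResult(stream.ToArray(), "text/csv")
    {
        FileDownloadName = "Kontakte.csv"
    };
}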
Related
I can't read those special characters.
I tried it like this:
1st way:
string xmlFile = File.ReadAllText(fileName);
2nd way:
FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
StreamReader r = new StreamReader(fs);
string s = r.ReadToEnd();
But neither statement handles those special characters.
How should I read?
UPDATE:
I also tried all the encodings with
string xmlFile = File.ReadAllText(fileName, Encoding. );
but it still doesn't handle those special characters.
There is no such thing as a "special character". What those likely are is extended ASCII characters from the Latin-1 set (ISO-8859-1).
You can read those by supplying the encoding explicitly to the stream reader (otherwise it will assume UTF-8):
using (StreamReader r = new StreamReader(fileName, Encoding.GetEncoding("iso-8859-1")))
{
    string s = r.ReadToEnd();
}
StreamReader sr = new StreamReader(stream, Encoding.UTF8);
This worked for me:
var json = System.IO.File.ReadAllText(@"././response/response.json", System.Text.Encoding.GetEncoding("iso-8859-1"));
You have to tell the StreamReader that you are reading Unicode, like so:
StreamReader sr = new StreamReader(stream, Encoding.Unicode);
If your file is of some other encoding, specify it as the second parameter.
I had to "find" the encoding of the file first:
//try to "find" the encoding; if not found, use UTF8
var enc = GetEncoding(filePath) ?? Encoding.UTF8;
var text = File.ReadAllText(filePath, enc);
(please refer to this answer to get the GetEncoding function)
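If you can't follow that link, a minimal BOM-sniffing version of such a GetEncoding helper might look roughly like this (my own sketch covering only the most common BOMs; the linked answer is more complete):
static Encoding GetEncoding(string filename)
{
    // Read the first few bytes and look for a byte order mark.
    var bom = new byte[4];
    using (var fs = File.OpenRead(filename))
    {
        fs.Read(bom, 0, 4);
    }

    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) return Encoding.UTF8;   // UTF-8 BOM
    if (bom[0] == 0xFF && bom[1] == 0xFE) return Encoding.Unicode;                  // UTF-16 little-endian
    if (bom[0] == 0xFE && bom[1] == 0xFF) return Encoding.BigEndianUnicode;         // UTF-16 big-endian

    return null; // no BOM found - caller falls back to a default (UTF8 above)
}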
If you can modify the file in question, you can save it with encoding.
I had a json file that I had created (normally) in VS, and I was having the same problem. Rather than specify the encoding when reading the file (I was using System.IO.File.ReadAllText which defaults to UTF8), I resaved the file (File->Save As) and on the Save button, I clicked the arrow and chose "Save with Encoding", then chose "Unicode (UTF-8 with signature) - Codepage 65001".
Problem solved, no need to specify the encoding when reading the file.
Related
I'm having a problem writing Norwegian characters into an XML file using C#. I have a string variable containing some Norwegian text (with letters like æøå).
I'm writing the XML using an XmlTextWriter, writing the contents to a MemoryStream like this:
MemoryStream stream = new MemoryStream();
XmlTextWriter xmlTextWriter = new XmlTextWriter(stream, Encoding.GetEncoding("ISO-8859-1"));
xmlTextWriter.Formatting = Formatting.Indented;
xmlTextWriter.WriteStartDocument(); //Start doc
Then I add my Norwegian text like this:
xmlTextWriter.WriteCData(myNorwegianText);
Then I write the file to disk like this:
FileStream myFile = new FileStream(myPath, FileMode.Create);
StreamWriter sw = new StreamWriter(myFile);
stream.Position = 0;
StreamReader sr = new StreamReader(stream);
string content = sr.ReadToEnd();
sw.Write(content);
sw.Flush();
myFile.Flush();
myFile.Close();
Now the problem is that in the resulting file on disk, all the Norwegian characters look funny.
I'm probably doing the above in some stupid way. Any suggestions on how to fix it?
Why are you writing the XML first to a MemoryStream and then writing that to the actual file stream? That's pretty inefficient. If you write directly to the FileStream it should work.
If you still want to do the double write, for whatever reason, do one of two things. Either
Make sure that the StreamReader and StreamWriter objects you use all use the same encoding as the one you used with the XmlWriter (not just the StreamWriter, like someone else suggested), or
Don't use StreamReader/StreamWriter. Instead just copy the stream at the byte level using a simple byte[] and Stream.Read/Write. This is going to be, btw, a lot more efficient anyway.
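As a rough sketch of that second option, using the stream and myFile variables from the question (on .NET 4 and later, stream.CopyTo(myFile) does the same thing in one call):
// Copy the MemoryStream to the file at the byte level - no text decoding involved.
byte[] buffer = new byte[8192];
int read;
stream.Position = 0;
while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    myFile.Write(buffer, 0, read);
}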
Both your StreamWriter and your StreamReader are using UTF-8, because you're not specifying the encoding. That's why things are getting corrupted.
As tomasr said, using a FileStream to start with would be simpler - but also MemoryStream has the handy "WriteTo" method which lets you copy it to a FileStream very easily.
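For example, reusing the stream and myPath names from the question:
// MemoryStream.WriteTo copies the entire buffer to another stream,
// regardless of the current Position.
using (var myFile = new FileStream(myPath, FileMode.Create))
{
    stream.WriteTo(myFile);
}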
I hope you've got a using statement in your real code, by the way - you don't want to leave your file handle open if something goes wrong while you're writing to it.
Jon
You need to set the encoding every time you write a string or read binary data as a string.
Encoding encoding = Encoding.GetEncoding("ISO-8859-1");
FileStream myFile = new FileStream(myPath, FileMode.Create);
StreamWriter sw = new StreamWriter(myFile, encoding);
stream.Position = 0;
StreamReader sr = new StreamReader(stream, encoding);
string content = sr.ReadToEnd();
sw.Write(content);
sw.Flush();
myFile.Flush();
myFile.Close();
As mentioned in the answers above, the biggest issue here is the Encoding, which is defaulted because it is left unspecified.
When you do not specify an Encoding for this kind of conversion, the default of UTF-8 is used - which may or may not match your scenario. You are also converting the data needlessly by pushing it into a MemoryStream and then out into a FileStream.
If your original data is not UTF-8, what will happen here is that the first transition into the MemoryStream will attempt to decode using default Encoding of UTF-8 - and corrupt your data as a result. When you then write out to the FileStream, which is also using UTF-8 as encoding by default, you simply persist that corruption into the file.
In order to fix the issue, you likely need to specify Encoding into your Stream objects.
You can actually skip the MemoryStream process entirely, also - which will be faster and more efficient. Your updated code might look something more like:
FileStream fs = new FileStream(myPath, FileMode.Create);
XmlTextWriter xmlTextWriter =
    new XmlTextWriter(fs, Encoding.GetEncoding("ISO-8859-1"));
xmlTextWriter.Formatting = Formatting.Indented;
xmlTextWriter.WriteStartDocument(); //Start doc
xmlTextWriter.WriteCData(myNorwegianText);
// No need to read the contents back out: just flush and close the writer,
// which also flushes and closes the underlying FileStream.
xmlTextWriter.Flush();
xmlTextWriter.Close();
Which encoding do you use for displaying the result file? If it is not in ISO-8859-1, it will not display correctly.
Is there a reason to use this specific encoding, instead of for example UTF8?
After investigating, this is what worked best for me:
var doc = new XDocument(new XDeclaration("1.0", "ISO-8859-1", ""));
using (XmlWriter writer = doc.CreateWriter())
{
    writer.WriteStartDocument();
    writer.WriteStartElement("Root");
    writer.WriteElementString("Foo", "value");
    writer.WriteEndElement();
    writer.WriteEndDocument();
}
doc.Save("dte.xml");
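(As far as I know, this works because XDocument.Save to a file honors the encoding declared in the XDeclaration, so dte.xml actually comes out as ISO-8859-1 rather than the UTF-8 default.)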