Encoding problems with dBase III .dbf files on different machines - c#

I'm using C# and .NET 3.5, trying to import some data from old dbf files using ODBC with Microsoft dBase Driver.
The dbf files are in dBase III format and use IBM850 encoding for strings.
When I run my program on my machine, the string data read from the OdbcDataReader comes back already decoded into .NET strings (UTF-16 internally); I save it as UTF-8 and everything is OK. But when I run the same program on an XP box, some characters are not converted to UTF-8 correctly, 'Õ' for example, and possibly others. Characters like 'Ä', 'Ö' and 'Ü' are fine. This is the problem.
Maybe ODBC or the driver uses some machine culture or codepage setting that messes everything up.
Is it possible to read the strings from the database as binary? Maybe with functions like CONVERT or CAST? And where could I find a reference for the SQL functions and syntax that work with this dBase driver, or other drivers? I searched around and couldn't find anything. I feel so blind when using ODBC and SQL.
Right now I'm using a temporary hack that replaces all σ's with Õ's.
Thanks!
Example code:
System.Data.Odbc.OdbcConnection oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = @"Driver={Microsoft dBase Driver (*.dbf)};DriverID=277;Dbq=" + dbPath + ";";
oConn.Open();
System.Data.Odbc.OdbcCommand oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";
System.Data.Odbc.OdbcDataReader reader = oCmd.ExecuteReader();
reader.Read();
// the string has already been decoded by the driver at this point
byte[] buf = Encoding.UTF8.GetBytes(reader.GetString(0));
BinaryWriter writer = new BinaryWriter(File.Open(@"C:\DBF\Test.txt", FileMode.Create));
writer.Write(buf);
Result:
E5 in dbf (Õ in 850)
Test.txt on pc1: C3 95 (Õ in UTF-8)
Test.txt on pc2: CF 83 (σ in UTF-8)

If you are still having a problem with these files, I may be able to help you.
What is in the "codepage byte" aka "language driver id" (LDID) at offset 29 (decimal) in the file?
I have a Python-based DBF reader which can read just about any field data type and just about any codepage -- it has a long list compiled from various sources of mappings from codepage byte to codepage number. Options are (1) believe the LDID, deliver Unicode (2) ignore the LDID, deliver undecoded bytes (3) override the LDID, decode with a specific codepage into Unicode. The Unicode can of course be then encoded into UTF-8.
The DBF reader also does a whole lot of reasonableness cross-checks, which may help in investigating why VFP thinks the file is corrupt.
How do you know that it's using IBM850? Another piece of Python code that I have is a prototype encoding detector, which unlike detectors like 'chardet' which are derived from Mozilla code is not web-centric and can happily recognise most old DOS codepages -- this may help.
An observation: the Greek lowercase letter sigma (σ) is 0xE5 in codepage 437, which was succeeded by codepage 850 -- "pc2" seems a little outdated ...
If you think I can be of any help, feel free to e-mail me at insert_punctuation("sjmachin", "lexicon", "net")
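If you want to check that codepage byte from C# before going further, here is a minimal sketch (the file path is only an example):
// Read the "codepage byte" / language driver ID (LDID) at offset 29 (decimal) of the .dbf header.
using System;
using System.IO;

class LdidCheck
{
    static void Main()
    {
        using (FileStream fs = File.OpenRead(@"C:\DBF\TABLE.DBF"))   // example path
        {
            fs.Seek(29, SeekOrigin.Begin);   // the LDID lives at offset 29 of the header
            int ldid = fs.ReadByte();
            // Commonly published LDID tables list e.g. 0x01 = cp437, 0x02 = cp850, 0x00 = not set.
            Console.WriteLine("LDID: 0x{0:X2}", ldid);
        }
    }
}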

Try this code.
var oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = "Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=" + dbPath;
oConn.Open();
var oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";
var reader = oCmd.ExecuteReader();
reader.Read();
// Get the raw bytes back using the code page the driver decoded with (the machine default here)...
byte[] A = Encoding.GetEncoding(Encoding.Default.CodePage).GetBytes(reader.GetString(0));
// ...then re-decode them as code page 850.
string p = Encoding.Unicode.GetString(Encoding.Convert(Encoding.GetEncoding(850), Encoding.Unicode, A));

When you read a dbf file you have to take three encodings into account:
1. The encoding the database provider uses to read the file. It depends on the provider and the current operating system, and it is the encoding you must use to get the byte array back. For example, on my PC:
with the connection string "Data Source={0}; Provider=Microsoft.JET.OLEDB.4.0;Extended Properties=DBase IV;User ID=;Password=;", strings are read using code page 866 (Russian MS-DOS);
with the connection string "Data Source={0}; Provider=vfpoledb.1;Exclusive=No;Collating Sequence=Machine", strings are read using Encoding.Default (code page 1251).
2. The encoding the strings were actually written to the dbf file with. It can be read from the byte at offset 29 of the dbf file, but in practice it doesn't matter how the file is marked; you just have to know which encoding was really used. This is the source encoding for the string conversion.
3. The encoding the strings should be converted to. This is usually UTF-8.
So the string conversion should look like this:
// codePage1: the code page the provider used to read the file (point 1)
byte[] bytes = Encoding.GetEncoding(codePage1).GetBytes(reader.GetString(0));
// codePage2: the code page the strings were actually written with (point 2)
string result = Encoding.UTF8.GetString(Encoding.Convert(Encoding.GetEncoding(codePage2), Encoding.UTF8, bytes));
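Applied to the situation in the question (a file written as cp850, read back through the driver, saved as UTF-8), the whole round trip might look like the sketch below. The provider code page is an assumption here (Encoding.Default on the reading machine), reader is the data reader from the question, and only you can know what your driver actually used:
using System.IO;
using System.Text;

static class DbfStringFix
{
    // providerCodePage: the code page the provider used to decode the raw bytes (point 1 above)
    // fileCodePage:     the code page the .dbf was actually written with (point 2 above)
    public static string Recode(string asReadByProvider, int providerCodePage, int fileCodePage)
    {
        // Undo the provider's decoding to recover the original bytes...
        byte[] raw = Encoding.GetEncoding(providerCodePage).GetBytes(asReadByProvider);
        // ...then decode them with the encoding the file was really written in.
        return Encoding.GetEncoding(fileCodePage).GetString(raw);
    }
}

// Usage for the original question (code pages are assumptions, not verified values):
// string fixedName = DbfStringFix.Recode(reader.GetString(0), Encoding.Default.CodePage, 850);
// File.WriteAllBytes(@"C:\DBF\Test.txt", Encoding.UTF8.GetBytes(fixedName));   // point 3: UTF-8 output
Note that this round trip only works if the provider's decoding didn't already lose characters, i.e. every byte in the file maps to some character in the provider's code page; reading undecoded bytes, as suggested above, is safer when it is available.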

Have you tried using the Visual FoxPro driver ("VFPOleDb") instead?
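Something along these lines, for example (a sketch only: the provider name is taken from the other answers in this thread, the folder path and table name are placeholders, and the code-page round trip described above may still be needed):
using System.Data.OleDb;

using (var conn = new OleDbConnection(@"Provider=vfpoledb.1;Data Source=C:\DBF\;Collating Sequence=Machine"))
using (var cmd = new OleDbCommand("SELECT name FROM TABLE", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            string name = reader.GetString(0);   // may still need re-decoding as cp850
        }
    }
}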

Related

OpenXml SDK excel accented French Chars (éèçà) [duplicate]


Reading Excel files with OleDb or ExcelDataReader returns special characters encoded wrongly

I need to read an EXCEL binary file (xls) in C#.
Basically it's working, but I have problems with wrongly encoded characters (German umlauts, for example).
Is there some way to specify an encoding? Do Excel files have something like an encoding at all?
The string in the Excel file is
Lydia Hömmerl
When reading with Jet or EDR I get:
Lydia HŠmmerl
I have tried OleDb and the Excel Data Reader project.
Here is the code I use to open and read the file:
var connectionString = string.Format(
    "Provider=Microsoft.Jet.OLEDB.4.0; data source={0}; " +
    "Extended Properties=\"Excel 8.0;HDR=no\";", filePath);
var adapter = new OleDbDataAdapter(
    "SELECT * FROM [Webshop orders$]", connectionString);
var ds = new DataSet();
adapter.Fill(ds, "Orders");
var data = ds.Tables["Orders"].AsEnumerable();
foreach (var row in data)
{
    var str = Convert.ToString(row[0]);
    // ...
}
Unicode with C#
There are still many people who don't understand the difference between binary and text, or know what a character encoding is, etc. It is for these people that this page has been written. It mentions a few advanced topics, but only to make the reader aware of their existence, rather than to give much guidance on the topic.
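If it helps to see exactly what the provider handed back, here is a small diagnostic sketch (not a fix) that can go inside the foreach loop from the question; it dumps the code points so you can work out which code page was used when the cell was decoded:
// e.g. prints the code points of "Lydia HŠmmerl" as returned by the provider
string str = Convert.ToString(row[0]);
foreach (char c in str)
    Console.Write("U+{0:X4} ", (int)c);
Console.WriteLine();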

Generating dBase II DBF File in C# 3.5

I'm generating a dbf file to be imported into legacy systems that only accept dBase II or III. My application is .NET 3.5. I initially started working with the VFPOLEDB.1 provider, but it only generates dbf files in dBase V format, which isn't backward compatible.
Does anyone know a component or driver to generate the dbf file in dBase II or III?
Thanks
Try issuing a call to execute a script that opens the file, then does
COPY TO {some file} type FOX2X
that should get you the output...
There was another post where something similar was done via C# through VFPOleDB, and I'll try to find it... Yup, with credit to @DaveB, here's a snippet of his post in Create .DBF in C# code that is readable from Excel (VFP or not):
string connectionString = @"Provider=VFPOLEDB.1;Data Source=C:\YourDirectory\";
using (OleDbConnection connection = new OleDbConnection(connectionString))
{
using (OleDbCommand scriptCommand = connection.CreateCommand())
{
connection.Open();
string vfpScript = @"USE TestDBF
COPY TO OldDBaseFormatFile TYPE Fox2x
USE";
scriptCommand.CommandType = CommandType.StoredProcedure;
scriptCommand.CommandText = "ExecScript";
scriptCommand.Parameters.Add("myScript", OleDbType.Char).Value = vfpScript;
scriptCommand.ExecuteNonQuery();
}
}
The original post was for someone to be able to open the file in Excel format.
I remember trying to do this very thing several years ago and failing. My solution was to take an existing dBase II file, empty all data, and keep that empty file as a template for when I needed to create a new database.
ESRI's Shapefile format uses dBase III for storing attribute data. There's a decent implementation in the SharpMap project which you should be able to use independently (careful of the license, though: it's LGPL).
http://code.google.com/p/sharpmapv2/source/browse/trunk/SharpMap.Data.Providers/ShapeFileProvider/DbaseFile.cs

c#, Excel + csv: how to get the correct encoding?

I've been trying this for quite a while now, but can't figure it out. I'm trying to export data to Excel via a *.csv file. It works great so far, but I have some encoding problems when opening the files in Excel.
(original string on the left, EXCEL result on the right):
Messwert(µm / m) ==> Messwert(Âµm / m)
Dümme Mässöng ==> DÃ¼mme MÃ¤ssÃ¶ng
Notepad++ tells me that the file is encoded "ANSI as UTF8"(WTF?)
So here are different ways I tried to get a valid result:
the obvious implementation:
tWriter.Write(";Messwert(µm /m)");
a more sophisticated one (I tried probably a dozen or more encoding combinations):
tWriter.Write(Encoding.Default.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
tWriter.Write(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(";Messwert(µm /m)")));
and so on
Whole source code for the method creating the data:
MemoryStream tStream = new MemoryStream();
StreamWriter tWriter = new StreamWriter(tStream);
tWriter.Write("\uFEFF");
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
tWriter.WriteLine(aMeasurement.Comment);
tWriter.WriteLine();
tWriter.WriteLine("Zeit in Minuten;Messwert(µm / m)");
TimeSpan tSpan;
foreach (IMeasuringPoint tPoint in aMeasurement)
{
tSpan = new TimeSpan(tPoint.Time - aMeasurement[0].Time);
tWriter.WriteLine(string.Format("{0};{1};", (int)tSpan.TotalMinutes, getMPString(tPoint)));
}
tWriter.Flush();
return tStream;
Generated CSV file:
Dümme Mössäng
Testmessung die erste
Zeit in Minuten;Messwert(µm / m)
0;-703;
0;-381;
1;1039;
1;1045;
2;1457;
2;1045;
This worked perfectly for me:
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
CSV encoding issues (Microsoft Excel)
try the following:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
It writes the proper UTF8 preamble to the file before writing the UTF8 encoded CSV.
This solution is written up as a fix for a Java application, but you should be able to do something similar in C#. You may also want to look at the documentation for the StreamWriter class; the remarks refer to the byte order mark (BOM).
"ANSI as UTF8"(WTF?)
Notepad++ is probably correct. The encoding is UTF-8 (i.e., it has a correct Unicode header), but the file only contains ANSI data (i.e., é is not encoded the correct UTF-8 way, which would mean two bytes).
Or it is the other way around: the file is ANSI (no BOM at the start), but the encoding of the individual characters is, or looks like, UTF-8. This would explain the ü and other characters expanding into more than one character. You can fix this by forcing the file to be read as Unicode.
If it's possible to post (part of) your CSV, we may be able to help fixing it at the source.
Edit
Now that we've seen your code: can you remove the StreamWriter and replace it with a TextWriter? Also, remove the hand-encoding of the BOM, it is not necessary. When you create a TextWriter, you can specify the encoding (don't use ASCII, try UTF8).
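A minimal sketch of that suggestion, assuming the rest of the original method stays unchanged: construct the writer with an explicit UTF-8 encoding and drop the hand-written BOM, because the writer emits the encoding's preamble itself when it starts at the beginning of the stream.
MemoryStream tStream = new MemoryStream();
// UTF8Encoding(true) has a preamble, so the BOM is written automatically;
// the manual tWriter.Write("\uFEFF") from the question is no longer needed.
StreamWriter tWriter = new StreamWriter(tStream, new UTF8Encoding(true));
tWriter.WriteLine(string.Format("{0}", aMeasurement.Name));
// ... rest of the original method unchanged ...
tWriter.Flush();
return tStream;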
Trevor Germain's answer helped me to save in the correctly encoded format:
using (var sw = File.Create(Path.Combine(txtPath.Text, "UTF8.csv")))
{
var preamble = Encoding.UTF8.GetPreamble();
sw.Write(preamble, 0, preamble.Length);
var data = Encoding.UTF8.GetBytes("懘荧,\"Hello\",text");
sw.Write(data, 0, data.Length);
}
I'd suggest you open up the text file in a hex editor and see what it really is. The BOM for UTF-16 is 0xFEFF, which the writing code is apparently writing to the stream, but the rest of the writing doesn't specify an encoding to use, so it falls back to the StreamWriter's default encoding, which is UTF-8. There appears to be a mix-up of encodings.
When you pop open the file in hex view, if you see lots of 0x00 between the characters, you're working with UTF-16, which is Encoding.Unicode in C#. If there are no 0x00 between chars, the encoding is probably UTF-8.
If the latter case, just fix up the BOM to be EF BB BF rather than FE FF, and read normally with UTF-8 encoding.
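If you'd rather not open a hex editor, the same BOM check can be done in a few lines (the file path is an example; this assumes using System and using System.IO):
byte[] head = new byte[3];
using (FileStream fs = File.OpenRead(@"C:\export\UTF8.csv"))   // example path
    fs.Read(head, 0, head.Length);

if (head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
    Console.WriteLine("UTF-8 BOM (EF BB BF)");
else if (head[0] == 0xFF && head[1] == 0xFE)
    Console.WriteLine("UTF-16 LE BOM (FF FE)");
else if (head[0] == 0xFE && head[1] == 0xFF)
    Console.WriteLine("UTF-16 BE BOM (FE FF)");
else
    Console.WriteLine("No BOM");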
For my scenario using StreamWriter, I found that explicitly passing UTF-8 encoding to the StreamWriter enabled Excel to read the file using the correct encoding.
See this answer for more details:
https://stackoverflow.com/a/22306937/999048

How do I load text files into database that are greater than the 64 kb buffersize limit?

I'm trying to load text files (.aspx, .cs, html, etc.) into a SQL Server 2008 database. So far I'm able to load all files that are smaller than 64 KB. I have two questions: how do I get around the 64 KB limit, and is the method I'm using the best way to do this?
Thanks for the help.
Database:
file_length int,
file_path varchar(250),
file_string varchar(MAX)
private static void Load_Files()
{
string source = HttpContext.Current.Server.MapPath("~/website/");
DirectoryInfo di = new DirectoryInfo(source);
FileInfo[] files = di.GetFiles();
foreach (FileInfo f in files)
{
string sourceFile = f.FullName;
FileStream fs_reader = new FileStream(sourceFile, FileMode.Open, FileAccess.Read);
StreamReader reader = new StreamReader(fs_reader);
string content = reader.ReadToEnd();
Int32 file_length = content.Length;
string CS = ConfigurationManager.ConnectionStrings["MCP_CS"].ConnectionString;
SqlConnection SQL_Conn_01 = new SqlConnection(CS);
string SQL_01 = "INSERT INTO Page_File_Store (file_length, file_path, file_string) VALUES (#file_length, #file_path, #file_string)";
SqlCommand SQL_File_Load = new SqlCommand(SQL_01, SQL_Conn_01);
SQL_File_Load.Parameters.Add(new SqlParameter("#file_length", file_length));
SQL_File_Load.Parameters.Add(new SqlParameter("#file_path", sourceFile));
//SQL_File_Load.Parameters.Add(new SqlParameter("#file_string", content));
SqlParameter contentParameter = new SqlParameter("#file_string", SqlDbType.VarChar, -1);
contentParameter.Value = content;
SQL_File_Load.Parameters.Add(contentParameter);
SQL_Conn_01.Open();
SQL_File_Load.ExecuteNonQuery();
SQL_Conn_01.Close();
reader.Close();
}
}
Please Note: this is a copy of a question I asked earlier and lost control of when I cleared my cookies. How do I load text files greater than the 64 kb buffersize limit?
There is no 64 KB limit in SQL Server. The limits for SQL strings are either 8000 bytes for in-row data types (char, varchar, nchar, nvarchar, binary and varbinary) or 2 GB for LOB types (varchar(max), nvarchar(max) and varbinary(max)). The 64 KB limitation you see must come from something else, most likely from IIS upload or ASP or CLR processing limitations.
But you're not going to be able to process arbitrary-length files like this. .NET will not load a large stream into a string using StreamReader.ReadToEnd() because the memory allocation won't succeed. You are going to have to load the file in chunks and append each chunk to the database, using the LOB-specific UPDATE table SET column.WRITE ... syntax.
P.S. Some responses recommend using the old LOB types like text, ntext and image. Don't use those types with SQL Server 2008; they are deprecated and discouraged.
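To make the chunked approach concrete, here is a sketch using the Page_File_Store schema from the question and addressing the row by file_path; the helper name and chunk size are arbitrary, and error handling is omitted:
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void LoadFileInChunks(string connectionString, string sourceFile)
{
    const int chunkChars = 32 * 1024;   // arbitrary chunk size

    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // Create the row with an empty string first: .WRITE cannot be applied to a NULL value.
        using (SqlCommand insert = new SqlCommand(
            "INSERT INTO Page_File_Store (file_length, file_path, file_string) " +
            "VALUES (0, @file_path, '')", conn))
        {
            insert.Parameters.AddWithValue("@file_path", sourceFile);
            insert.ExecuteNonQuery();
        }

        // Append the file chunk by chunk; @Offset = NULL appends to the end of the value.
        using (StreamReader reader = new StreamReader(sourceFile))
        using (SqlCommand append = new SqlCommand(
            "UPDATE Page_File_Store SET file_string.WRITE(@chunk, NULL, NULL) " +
            "WHERE file_path = @file_path", conn))
        {
            append.Parameters.AddWithValue("@file_path", sourceFile);
            SqlParameter chunkParam = append.Parameters.Add("@chunk", SqlDbType.VarChar, -1);

            char[] buffer = new char[chunkChars];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                chunkParam.Value = new string(buffer, 0, read);
                append.ExecuteNonQuery();
            }
        }
    }
}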
Use a TEXT column instead of a VARCHAR/CHAR column. If you need something even bigger than TEXT, or will be loading in binary files, look into BINARY/VARBINARY/IMAGE etc.
MSDN provides documentation on all of the available data types. For text files you'll probably want to use the TEXT type, for binary files use one of the Binary String types.
In addition to phoebus's response: if your working buffer is too small, or even smaller than 64 KB, you can read in the first segment, update the text field with that, read another buffer, update the text with text + new buffer, and repeat until all the data is loaded.
Databases are not made for storing big files. Store the files on the hard disk instead and store only the file names in the database.
If you still want to store them in the database anyway, you can use a compression library like #ziplib to decrease the file sizes (source code compresses very well) and use binary column types as phoebus proposes.
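The answer mentions #ziplib; just to illustrate the idea, here is a sketch using the framework's built-in GZipStream instead, which keeps the example dependency-free. Storing the result requires a binary column such as varbinary(max), as suggested above.
using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] CompressText(string content)
{
    byte[] raw = Encoding.UTF8.GetBytes(content);
    using (var output = new MemoryStream())
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
            gzip.Write(raw, 0, raw.Length);      // disposing the GZipStream flushes the compressed data
        return output.ToArray();                  // store this in a varbinary(max) column
    }
}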
