I need to convert special HTML entities to their decimal values using Visual C#. First I need to load an .html file, then replace all special character entities with their decimal values.
Example: &permil; ---> "&#8240;"
&reg; ---> "&#174;"
&Aring; ---> "&#197;"
So what is the most efficient way to replace all the characters with decimal values? I have a list of more than 1000 characters and entities.
You should use WebUtility.HtmlEncode Method (String)
Assuming you can comfortably fit your HTML file in a StringBuilder, you could take a couple of different approaches. First, I'm assuming you have all of your character replacements stored in a dictionary:
var replacements = new Dictionary<char,string> {
    { '®', "&#174;" },
    // ...etc
};
First, read your file into a StringBuilder:
var html = new StringBuilder( File.ReadAllText( filename ) );
The first approach is that you could use StringBuilder.Replace(string,string):
foreach( var c in replacements.Keys ) {
html.Replace( c.ToString(), replacements[c] );
}
The second approach would be to go through every character in the file and see if it needs replacing (note that we start backwards from the end of the file; if we went forwards, we'd constantly be having to modify our index value since we're adding length to the file):
for( int i=html.Length-1; i>=0; i-- ) { // i>=0 so the first character is checked too
var c = html[i];
if( replacements.ContainsKey( c ) ) {
html.Remove( i, 1 );
html.Insert( i, replacements[c] );
}
}
It's hard to say which would be more efficient without either having details about the implementation of StringBuilder.Replace(string,string) or doing some profiling, but I'll leave that up to you.
If it's not feasible to load your entire HTML file into a StringBuilder, you could use a variation of the second technique with a StreamReader reading the file one character at a time.
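A minimal sketch of that streaming variation, assuming the same kind of char-to-string dictionary as above. The names EntityStreamRewriter and Rewrite are made up for illustration. It reads one character at a time with TextReader.Read() and writes either the replacement or the original character, so the whole file never has to sit in memory:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class EntityStreamRewriter
{
    // Copies everything from reader to writer, swapping in the replacement
    // string for any character found in the map.
    // Assumption: keys are single UTF-16 chars, so characters outside the
    // Basic Multilingual Plane (surrogate pairs) are passed through unchanged.
    public static void Rewrite(TextReader reader, TextWriter writer,
                               IReadOnlyDictionary<char, string> replacements)
    {
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            string replacement;
            if (replacements.TryGetValue(c, out replacement))
            {
                writer.Write(replacement);
            }
            else
            {
                writer.Write(c);
            }
        }
    }
}

// For files, something along these lines should work (paths are placeholders):
//   using (var reader = new StreamReader(inputPath))
//   using (var writer = new StreamWriter(outputPath))
//       EntityStreamRewriter.Rewrite(reader, writer, replacements);
```

Because each character is looked up exactly once, this also avoids the repeated Remove/Insert shifting of the in-memory loop.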
I've seen lots of samples for parsing CSV files, but this one is an annoying kind of file.
So how do you parse this kind of CSV?
"1",1/2/2010,"The sample ("adasdad") asdada","I was pooping in the door "Stinky", so I'll be damn","AK"
The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.
So I rewrote for this case as a string extension. I think this is close.
Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.
The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.
But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!
Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.
namespace YourFavoriteNamespace
{
using System;
using System.Collections.Generic;
using System.Text;
public static class Extensions
{
public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"',
bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
{
Queue<string> qReturn = new Queue<string>();
StringBuilder stringBuilder = new StringBuilder();
bool bInEscapeVal = false;
for (int i = 0; i < valToSplit.Length; i++)
{
if (!bInEscapeVal)
{
// Escape values must come immediately after a split.
// abc,"b,ca",cab has an escaped comma.
// abc,b"ca,c"ab does not.
if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || (i == 0 || (i != 0 && splittingChar == valToSplit[i - 1]))))
{
bInEscapeVal = true; // not capturing escapeChar as part of value; easy enough to change if need be.
}
else if (splittingChar == valToSplit[i])
{
qReturn.Enqueue(stringBuilder.ToString());
stringBuilder = new StringBuilder();
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
else
{
// Can't use switch b/c we're comparing to a variable, I believe.
if (escapeChar == valToSplit[i])
{
// Repeated escape always reduces to one escape char in this logic.
// So if you wanted "I'm ""double quote"" crazy!" to come out with
// the double double quotes, you're toast.
if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
{
i++;
stringBuilder.Append(escapeChar);
}
else if (!strictEscapeToSplitEvaluation)
{
bInEscapeVal = false;
}
// *** STINKY CONDITION ***
// Kinda defense, since only `", ` really makes sense.
else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
{
i = i+2;
stringBuilder.Append("\", ");
}
// *** EO STINKY CONDITION ***
else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
{
bInEscapeVal = false;
}
else
{
stringBuilder.Append(escapeChar);
}
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
}
// NOTE: The `captureEndingNull` flag is not tested.
// Catch null final entry? "abc,cab,bca," could be four entries, with the last an empty string.
if ((captureEndingNull && valToSplit.Length > 0 && splittingChar == valToSplit[valToSplit.Length - 1]) || (stringBuilder.Length > 0)) // length check avoids indexing into an empty string
{
qReturn.Enqueue(stringBuilder.ToString());
}
return qReturn;
}
}
}
Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)
[Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.
If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.
Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.
Enjoy. ;^)
I very strongly recommend using TextFieldParser. Hand-coded parsers that use String.Split or regular expressions almost invariably mishandle things like quoted fields that have embedded quotes or embedded separators.
I would be surprised, though, if it handled your particular example. As others have said, that line is, at best, ambiguous.
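For reference, a minimal sketch of using TextFieldParser on a well-formed line (embedded quotes doubled). The wrapper names CsvDemo and ParseCsv are invented for this example. TextFieldParser, FieldType, SetDelimiters and HasFieldsEnclosedInQuotes come from Microsoft.VisualBasic.FileIO. On a line it cannot parse, ReadFields throws a MalformedLineException, which is probably what would happen with the line in the question:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvDemo
{
    // Parses comma-delimited text into rows of fields.
    // Quoted fields may contain commas and doubled quotes ("").
    public static List<string[]> ParseCsv(string text)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
            {
                // Throws MalformedLineException if the line cannot be parsed.
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

To read a file instead, the TextFieldParser constructor also accepts a file path.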
Split based on
",
I would use MyString.IndexOf("\",")
and then substring the parts. Other than that, I'm sure someone has written a CSV parser out there that can handle this :)
I found a way to parse this malformed CSV. I looked for a pattern and found it.... I first replace (",") with a character... like "¤" and then split it...
from this:
"Annoying","CSV File","poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby","yeah!"
to this:
"Annoying¤CSV File¤poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby¤yeah!"
then split it:
ArrayA[0]: "Annoying //this value will be trimmed by replace("\"","") same as the array[4]
ArrayA[1]: CSV File
ArrayA[2]: poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
ArrayA[3]: yeah!"
after splitting it, I will replace strings from ArrayA[2] ", and ," with ¤ and then split it again
from this
ArrayA[2]: poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
to this
ArrayA[2]: poop#mypants.com¤1999,01-20-2001¤oh,boy¤01-20-2001¤yeah baby
then split it again and would turn to this
ArrayB[0]: poop#mypants.com
ArrayB[1]: 1999,01-20-2001
ArrayB[2]: oh,boy
ArrayB[3]: 01-20-2001
ArrayB[4]: yeah baby
And lastly, I split ArrayB[1] on , to separate the year from the date into ArrayC.
It's tedious but there's no other way to do it...
There is another open source library, Cinchoo ETL, that handles quoted strings fine. Here is sample code.
string csv = @"""1"",1/2/2010,""The sample(""adasdad"") asdada"",""I was pooping in the door ""Stinky"", so I'll be damn"",""AK""";
using (var r = ChoCSVReader.LoadText(csv)
.QuoteAllFields()
)
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
Output:
[Count: 5]
Key: Column1 [Type: Int64]
Value: 1
Key: Column2 [Type: DateTime]
Value: 1/2/2010 12:00:00 AM
Key: Column3 [Type: String]
Value: The sample(adasdad) asdada
Key: Column4 [Type: String]
Value: I was pooping in the door Stinky, so I'll be damn
Key: Column5 [Type: String]
Value: AK
You could split the string on ",". It is recommended that each cell value in the CSV file be enclosed in quotes, like "1","2","3"...
I don't see how you could if each line is different. This line is malformed CSV. Quotes contained within a value must be doubled as shown below. I can't even tell for sure where the values should be terminated.
"1",1/2/2010,"The sample (""adasdad"") asdada","I was pooping in the door ""Stinky"", so I'll be damn","AK"
Here's my code to parse a CSV file but I don't see how any code would know how to handle your line because it's malformed.
You might want to give CsvReader a try. It handles quoted strings fine, so you will just have to remove the leading and trailing quotes.
It will fail if your strings contain a comma. To avoid this, the quotes need to be doubled, as said in other answers.
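To make the doubled-quote convention concrete, here is a minimal sketch of splitting one well-formed line by hand. It is not taken from any of the libraries mentioned, and the names SimpleCsv and SplitLine are made up. It assumes comma separators, double-quote delimiters, no line breaks inside fields, and that a doubled quote inside a quoted field stands for one literal quote:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SimpleCsv
{
    // Splits one well-formed CSV line into its fields.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);          // ordinary character inside quotes
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');        // "" -> one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;           // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;                // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // field boundary
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());         // last field
        return fields;
    }
}
```

Fed the corrected line with doubled quotes, this yields five fields. Fed the original malformed line, it splits in the wrong places, which is the point the answers here are making.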
As no (decent) .csv parser can parse non-csv-data correctly, the task isn't to parse the data, but to fix the file(s) (and then to parse the correct data).
To fix the data you need a list of bad rows (to be sent to the person responsible for the garbage for manual editing). To get such a list, you can
use Access with a correct import specification to import the file. You'll get a list of import failures.
write a script/program that opens the file via the OLEDB text driver.
Sample file:
"Id","Remark","DateDue"
1,"This is good",20110413
2,"This is ""good""",20110414
3,"This is ""good"","bad",and "ugly",,20110415
4,"This is ""good""" again,20110415
Sample SQL/Result:
SELECT * FROM [badcsv01.csv]
Id  Remark                DateDue
1   This is good          4/13/2011
2   This is "good"        4/14/2011
3   This is "good",       NULL
4   This is "good" again  4/15/2011
SELECT * FROM [badcsv01.csv] WHERE DateDue Is Null
Id  Remark                DateDue
3   This is "good",       NULL
First, do it for the column names:
DataTable pbResults = new DataTable();
OracleDataAdapter oda = new OracleDataAdapter(cmd);
oda.Fill(pbResults);
StringBuilder sb1 = new StringBuilder();
StringBuilder sb2 = new StringBuilder();
IEnumerable<string> columnNames = pbResults.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
sb1.Append(string.Join("\"" + "," + "\"", columnNames));
sb2.Append("\"");
sb2.Append(sb1);
sb2.AppendLine("\"");
Second, do it for each row:
foreach (DataRow row in pbResults.Rows)
{
IEnumerable<string> fields = row.ItemArray.Select(field => field.ToString());
sb2.Append("\"");
sb2.Append(string.Join("\"" + "," + "\"", fields));
sb2.AppendLine("\"");
}
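Note that the code above writes each value as-is, so a value that itself contains a double quote would produce exactly the kind of malformed line discussed earlier. A minimal sketch of a quoting helper that doubles embedded quotes before wrapping the field follows. CsvWriterSketch, QuoteField and ToCsvLine are hypothetical names, not part of any library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvWriterSketch
{
    // Wraps a value in double quotes, doubling any quotes it contains.
    // Convert.ToString turns a null object into an empty string.
    public static string QuoteField(object value)
    {
        string text = Convert.ToString(value) ?? string.Empty;
        return "\"" + text.Replace("\"", "\"\"") + "\"";
    }

    // Joins the quoted fields of one row with commas.
    public static string ToCsvLine(IEnumerable<object> values)
    {
        return string.Join(",", values.Select(QuoteField));
    }
}
```

With a helper like this, the row loop could append CsvWriterSketch.ToCsvLine(row.ItemArray) instead of joining the raw field strings.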
I'm having a problem getting streams for embedded resources. Most online samples show paths that can be translated directly by changing each slash of the path to a dot (MyFolder/MyFile.ext becomes MyNamespace.MyFolder.MyFile.ext). However, when a folder has a dot in its name or special characters are used, building the resource name by hand does not work. I'm trying to find a function that converts a path to a resource name the same way Visual Studio renames it when compiling.
These names from the solution ...
Content/jQuery.UI-1.8.2/jQuery.UI.css
Scripts/jQuery-1.5.2/jQuery.js
Scripts/jQuery.jPlayer-2.0.0/jQuery.jPlayer.js
Scripts/jQuery.UI-1.8.2/jQuery.UI.js
... are changed into these names in the resources ...
Content.jQuery.UI_1._8._2.jQuery.UI.css
Scripts.jQuery_1._5._2.jQuery.js
Scripts.jQuery.jPlayer_2._0._0.jQuery.jPlayer.js
Scripts.jQuery.UI_1._8._12.jQuery.UI.js
Slashes are translated to dots. However, when a dot is used in a folder name, the first dot is apparently considered an extension and the rest of the dots are changed to be prefixed with an underscore. This logic does not apply on the jQuery.js file, though, maybe because the 'extension' is a single number? Here's a function able to translate the issues I've had so far, but doesn't work on the jQuery.js path.
protected String _GetResourceName( String[] zSegments )
{
String zResource = String.Empty;
for ( int i = 0; i < zSegments.Length; i++ )
{
if ( i != ( zSegments.Length - 1 ))
{
int iPos = zSegments[i].IndexOf( '.' );
if ( iPos != -1 )
{
zSegments[i] = zSegments[i].Substring( 0, iPos + 1 )
+ zSegments[i].Substring( iPos + 1 ).Replace( ".", "._" );
}
}
zResource += zSegments[i].Replace( '/', '.' ).Replace( '-', '_' );
}
return String.Concat( _zAssemblyName, zResource );
}
Is there a function that can change the names for me? What is it? Or where can I find all the rules so I can write my own function? Thanks for any assistance you may be able to provide.
This is kinda a very late answer... But since this was the first hit on google, I'll post what I've found!
You can simply force the compiler to name the embedded resource the way you want, which sidesteps this problem from the start. You just have to edit your csproj file, which you normally do anyway if you want wildcards in it. Here is what I did:
<EmbeddedResource Include="$(SolutionDir)\somefolder\**">
<Link>somefolder\%(RecursiveDir)%(Filename)%(Extension)</Link>
<LogicalName>somefolder:\%(RecursiveDir)%(Filename)%(Extension)</LogicalName>
</EmbeddedResource>
In this case, I'm telling Visual Studio that I want all the files in "somefolder" to be imported as embedded resources. I also want them to be shown under "somefolder" in the VS Solution Explorer (this is the Link tag). And finally, when compiling them, I want them to be named with exactly the same name and path they had on my disk, with only a "somefolder:\" prefix. The last part is doing the magic.
This is what I came up with to solve the issue. I'm still open to better methods, as this is a bit of a hack (but it seems to be accurate with the current specifications). The function expects the segments of a Uri to process (LocalPath when dealing with web requests). An example call is below.
protected String _GetResourceName( String[] zSegments )
{
// Initialize the resource string to return.
String zResource = String.Empty;
// Initialize the variables for the dot- and find position.
int iDotPos, iFindPos;
// Loop through the segments of the provided Uri.
for ( int i = 0; i < zSegments.Length; i++ )
{
// Find the first occurrence of the dot character.
iDotPos = zSegments[i].IndexOf( '.' );
// Check if this segment is a folder segment.
if ( i < zSegments.Length - 1 )
{
// A dash in a folder segment will cause each following dot occurrence to be appended with an underscore.
if (( iFindPos = zSegments[i].IndexOf( '-' )) != -1 && iDotPos != -1 )
{
zSegments[i] = zSegments[i].Substring( 0, iFindPos + 1 ) + zSegments[i].Substring( iFindPos + 1 ).Replace( ".", "._" );
}
// A dash is replaced with an underscore when no underscores are in the name or a dot occurrence is before it.
//if (( iFindPos = zSegments[i].IndexOf( '_' )) == -1 || ( iDotPos >= 0 && iDotPos < iFindPos ))
{
zSegments[i] = zSegments[i].Replace( '-', '_' );
}
}
// Each slash is replaced by a dot.
zResource += zSegments[i].Replace( '/', '.' );
}
// Return the assembly name with the resource name.
return String.Concat( _zAssemblyName, zResource );
}
Example call:
var testResourceName = _GetResourceName( new String[] {
"/",
"Scripts/",
"jQuery.UI-1.8.12/",
"jQuery-_.UI.js"
});
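As an alternative to reproducing the renaming rules, one could look the name up among the names the assembly actually contains, which Assembly.GetManifestResourceNames() returns. The sketch below is only a heuristic under stated assumptions. ResourceLookup and FindResourceName are invented names, and the comparison simply ignores dashes and underscores and treats slashes as dots, so two resources that differ only in those characters would be ambiguous:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ResourceLookup
{
    // Reduces a path or resource name to a comparable form:
    // slashes become dots, dashes and underscores are dropped.
    private static string Normalize(string name)
    {
        return name.Replace('/', '.').Replace('\\', '.')
                   .Replace("-", "").Replace("_", "");
    }

    // Returns the first manifest name whose normalized form ends with the
    // normalized path (on a segment boundary), or null if none matches.
    public static string FindResourceName(IEnumerable<string> manifestNames, string relativePath)
    {
        string wanted = "." + Normalize(relativePath.TrimStart('/', '\\'));
        return manifestNames.FirstOrDefault(
            n => ("." + Normalize(n)).EndsWith(wanted, StringComparison.OrdinalIgnoreCase));
    }
}
```

A call might look like ResourceLookup.FindResourceName(Assembly.GetExecutingAssembly().GetManifestResourceNames(), "Scripts/jQuery-1.5.2/jQuery.js"), with the result then passed to GetManifestResourceStream.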
Roel,
Hmmm... This is a hack, but I guess it should work. Just define an empty "Marker" class in each directory which contains resources, then get the FullName of its type, remove the class name from the end and voilà: there's your decoded path.
string path = (new MarkerClass()).GetType().FullName.Replace(".MarkerClass", "");
I'm sure there's a "better" way to do it... with a LOT more lines of code; and this one has the advantage that Microsoft maintains it when they change stuff ;-)
Cheers. Keith.
A late answer here as well, I googled before I attempted this on my own and I eventually had to.
Here's the solution I came up with:
public string ProcessFolderDash(string path)
{
int dotCount = path.Split('/').Length - 1; // Gets the count of slashes
int dotCountLoop = 1; // Placeholder
string[] absolutepath = path.Split('/');
for (int i = 0; i < absolutepath.Length; i++)
{
if (dotCountLoop <= dotCount) // true for folder segments; the last segment is the file name
{
absolutepath[i] = absolutepath[i].Replace("-", "_");
}
dotCountLoop++;
}
return String.Join("/", absolutepath);
}
I am somehow unable to determine whether a string is a newline or not. The string I use is read from a file written by UltraEdit using DOS terminators (CR/LF). I assume this would equate to "\r\n" or Environment.NewLine in C#. However, when I perform a comparison like this, it always seems to return false:
if(str==Environment.NewLine)
Anyone with a clue on what's going on here?
How are the lines read? If you're using StreamReader.ReadLine (or something similar), the newline characters will not appear in the resulting string; an empty line comes back as String.Empty (i.e. "").
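A small sketch that shows this behaviour, using a StringReader so it runs without a file (StreamReader.ReadLine treats line terminators the same way). ReadLineDemo and ReadAllLines are names invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Reads text line by line; ReadLine strips the "\r\n" terminators.
    public static List<string> ReadAllLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

For the input "a\r\n\r\nb" this returns three strings, and the middle one is empty rather than equal to Environment.NewLine. A blank line is therefore better detected with line.Length == 0.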
Are you sure that the whole string only contains a NewLine and nothing more or less? Have you already tried str.Contains(Environment.NewLine)?
The most obvious troubleshooting step would be to check what the value of str actually is. Just view it in the debugger or print it out.
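Since newline and other invisible characters are hard to tell apart when printed, one option is to dump the code point of every char in the string. A minimal sketch, with StringInspector and DescribeChars as made-up names:

```csharp
using System;
using System.Linq;

public static class StringInspector
{
    // Lists each UTF-16 code unit of the string as U+XXXX, space-separated.
    public static string DescribeChars(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For "\r\n" this gives "U+000D U+000A". Anything else in the output, such as a stray space or a zero-width character, would explain why the comparison with Environment.NewLine fails.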
Newline is "\r\n", not "/r/n". Maybe there's more than just the newline.... what is the string value in Debug Mode?
You could use the new .NET 4.0 Method:
String.IsNullOrWhiteSpace
This is a very valid question.
Here is the answer. I have invented a kludge that takes care of it.
static bool StringIsNewLine(string s)
{
return (!string.IsNullOrEmpty(s)) &&
(!string.IsNullOrWhiteSpace(s)) &&
(((s.Length == 1) && (s[0] == 8203)) ||
((s.Length == 2) && (s[0] == 8203) && (s[1] == 8203)));
}
Use it like so:
foreach (var line in linesOfMyFile)
{
if (StringIsNewLine(line))
{
// Ignore reading new lines
continue;
}
// Do the stuff only for non-empty lines
...
}