I need to convert special HTML characters and entities to their decimal values using Visual C#. First I need to load an .html file and replace all special characters with their decimal entity values.
EX: ‰ ---> "&#8240;"
® ---> "&#174;"
Å ---> "&#197;"
So what is the most efficient way to replace all of these characters with their decimal values? I have a list of more than 1000 characters and entities.
You should use the WebUtility.HtmlEncode(String) method.
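A quick sketch of what it does, for illustration only (the sample string is made up; note that running it over a whole HTML document would also escape the markup itself, and characters outside the range HtmlEncode encodes numerically may be left untouched, so it may not cover your full list on its own):
using System;
using System.Net;

class HtmlEncodeDemo
{
    static void Main()
    {
        string text = "Café & ®";
        // Typically produces something like "Caf&#233; &amp; &#174;"
        Console.WriteLine( WebUtility.HtmlEncode( text ) );
    }
}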
Assuming you can comfortably fit your HTML file in a StringBuilder, you could take a couple of different approaches. First, I'm assuming you have all of your character replacements stored in a dictionary:
var replacements = new Dictionary<char,string> {
{ '®', "&#174;" },
// ...etc
};
First, read your file into a StringBuilder:
var html = new StringBuilder( File.ReadAllText( filename ) );
The first approach is that you could use StringBuilder.Replace(string,string):
foreach( var c in replacements.Keys ) {
html.Replace( c.ToString(), replacements[c] );
}
The second approach would be to go through every character in the file and see if it needs replacing (note that we start backwards from the end of the file; if we went forwards, we'd constantly be having to modify our index value since we're adding length to the file):
for( int i = html.Length - 1; i >= 0; i-- ) {
var c = html[i];
if( replacements.ContainsKey( c ) ) {
html.Remove( i, 1 );
html.Insert( i, replacements[c] );
}
}
It's hard to say which would be more efficient without either having details about the implementation of StringBuilder.Replace(string,string) or doing some profiling, but I'll leave that up to you.
If it's not feasible to load your entire HTML file into a StringBuilder, you could use a variation of the second technique with a StreamReader, reading the file one character at a time.
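A rough sketch of that streaming variation, reusing the replacements dictionary from above (the output file name is hypothetical):
using( var reader = new StreamReader( filename ) )
using( var writer = new StreamWriter( outputFilename ) ) // hypothetical output path
{
    int next;
    while( ( next = reader.Read() ) != -1 )
    {
        var c = (char)next;
        if( replacements.TryGetValue( c, out var entity ) )
            writer.Write( entity );
        else
            writer.Write( c );
    }
}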
I'm working on a Windows Forms project in C# that will be used to make calculations.
Basically the objective of this project is to replace a big Excel spreadsheet I've been using. As more people will use it, I need to make it foolproof.
An example of what I have to do:
A comboBox should show me the column names (so I should retrieve the column names into an IEnumerable list).
Then I should insert a value in a TextBox that corresponds to a value in the first column.
The program will use this value and the column name to look up the corresponding value in the table and save it in memory.
Tables I'm going to use are mostly from books so I might never (or not often) change these values. That's why I dismissed using a SQL database. I thought it was like buying a Ferrari for going to the mall.
Thanks so much in advance
The simplest storage for such data is a text file:
6;80;125
8;107;166
...
To store the data, write it line by line, with the values delimited by ';'.
To read the data, read the file content line by line (or use ReadAllText), then split each line on ';'.
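A minimal sketch of both directions, assuming each row holds three numeric values like the sample above (the file name and types are illustrative; this needs System.IO, System.Linq, System.Globalization and System.Collections.Generic):
// Writing: one line per row, values joined with ';'
var rows = new List<double[]> { new[] { 6.0, 80, 125 }, new[] { 8.0, 107, 166 } };
File.WriteAllLines( "data.txt",
    rows.Select( r => string.Join( ";", r.Select( v => v.ToString( CultureInfo.InvariantCulture ) ) ) ) );

// Reading: split each line back into its values
var loaded = File.ReadLines( "data.txt" )
    .Select( line => line.Split( ';' )
        .Select( f => double.Parse( f, CultureInfo.InvariantCulture ) )
        .ToArray() )
    .ToList();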
I agree with your point that an SQL database is probably overkill for your use case. Here's a relatively easy workaround.
In Excel, choose File, Save As, Text (Tab delimited).
Move that file into your C# project, add it to the project, then right-click it in Visual Studio, choose Properties, and set Build Action = Embedded Resource.
Then write some C# to parse data from the tab-delimited file. Here’s an example.
sealed class Entry
{
public readonly double diameter, al, adn;
Entry( string line )
{
string[] fields = line.Split( '\t' );
if( fields.Length != 3 )
throw new ArgumentException();
diameter = double.Parse( fields[ 0 ], CultureInfo.InvariantCulture );
al = double.Parse( fields[ 1 ], CultureInfo.InvariantCulture );
adn = double.Parse( fields[ 2 ], CultureInfo.InvariantCulture );
}
public static Entry[] loadTabSeparated( Stream stream )
{
using var reader = new StreamReader( stream );
// Ignore the first row with column names
reader.ReadLine();
// Parse the rest of the rows in the table
List<Entry> list = new List<Entry>();
while( true )
{
string? line = reader.ReadLine();
if( null == line )
break;
list.Add( new Entry( line ) );
}
return list.ToArray();
}
}
Usage example:
static Entry[] loadEntries()
{
const string resourceName = "DefaultNamespace.someData.tsv";
var ass = Assembly.GetExecutingAssembly();
using var stm = ass.GetManifestResourceStream( resourceName )
?? throw new ApplicationException( "Embedded resource is missing" );
return Entry.loadTabSeparated( stm );
}
If the text file is at least 100 kB and you want to reduce the size of the binary, compress that resource file with GZip. The C# code only needs a single line to decompress on the fly; see the GZipStream class.
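Roughly, that would be a single extra wrapper inside loadEntries (a sketch; it assumes the embedded resource itself was gzipped and reuses the variables from the method above):
using var stm = ass.GetManifestResourceStream( resourceName )
    ?? throw new ApplicationException( "Embedded resource is missing" );
// GZipStream lives in System.IO.Compression; it decompresses transparently as the parser reads
using var unzipped = new GZipStream( stm, CompressionMode.Decompress );
return Entry.loadTabSeparated( unzipped );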
Also, since that data is immutable, you probably should not load it on every use; instead, load it once and keep the array in memory for as long as your app is running. For instance, you could cache it in a static readonly field of some class.
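One way to do that lazily (a sketch; the class name is made up and it assumes the loadEntries method above is reachable from here):
static class ReferenceData
{
    // Loaded on first access, then kept for the lifetime of the app
    static readonly Lazy<Entry[]> entries = new Lazy<Entry[]>( loadEntries );
    public static Entry[] Entries => entries.Value;
}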
I have a simple problem, but I could not find a simple solution yet.
I have a string containing for example this
UNB+123UNH+234BGM+345DTM+456
The actual string is a lot larger, but you get the idea.
now I have a set of values I need to find in this string
for example UNH and BGM and DTM and so on
So I need to search in the large string, and find the position of the first set of values.
something like this (not existing but to explain the idea)
string[] chars = {"UNH", "BGM", "DTM" };
int pos = test.IndexOfAny(chars);
in this case pos would be 7 because, of the 3 substrings, UNH is the first occurrence in the variable test
What I'm actually trying to accomplish is splitting the large string into a list of strings, where the delimiter can be one of many values ("BGM", "UNH", "DTM").
So the result would be
UNB+123
UNH+234
BGM+345
DTM+456
I can of course build a loop that does IndexOf for each of the substrings and then remembers the smallest value, but that seems so inefficient. I am hoping for a better way to do this.
EDIT
the substrings to search for are always 3 letters, but the text in between can be anything at all with any length
EDIT
They are always 3 alphanumeric characters, and then anything can follow, including lots of + signs.
You will find more problems with EDI than just splitting it into the corresponding fields; what about conditions, multiple values, or lists? I recommend you take a look at EDI.net.
EDIT:
EDIFACT is too complex a format to handle with just a regex. As I mentioned before, you will have conditions for each format/field/process, and you will need to capture the whole field in order to really parse it. For example, DTM can have one specific date/time format in one EDI file and a totally different one in another.
However, this is the structure of a DTM field:
DTM DATE/TIME/PERIOD
Function: To specify date, and/or time, or period.
010  C507  DATE/TIME/PERIOD                                  M  1
     2005  Date or time or period function code qualifier    M  an..3
     2380  Date or time or period text                       C  an..35
     2379  Date or time or period format code                C  an..3
So you will always have something like 'DTM+d3:d35:d3' to search for.
Really, it isn't worth the struggle: use EDI.net, create your own POCO classes, and work from there.
Friendly reminder that EDIFACT changes every 6 months in Europe.
If the separators can be any one of UNB, UNH, BGM, or DTM, the following Regex could work:
foreach (Match match in Regex.Matches(input, @"(UNB|UNH|BGM|DTM).+?(?=(UNB|UNH|BGM|DTM)|$)"))
{
Console.WriteLine(match.Value);
}
Explanation:
(UNB|UNH|BGM|DTM) matches either of the separators
.+? matches any string with at least one character (but as short as possible)
(?=(UNB|UNH|BGM|DTM)|$) matches if either a separator follows or if the string ends there - the match is however not included in the value.
It sounds like the other answer recognises the format - you should definitely consider a library specifically for parsing this format!
If you're intent on parsing it yourself, you could simply find the index of your identifiers in the string, determine the first 2 by position, and use those positions to Substring the original input
var input = "UNB+123UNH+234BGM+345DTM+456";
var chars = new[]{"UNH", "BGM", "DTM" };
var indexes = chars.Select(c => new { Length = c.Length, Position = input.IndexOf(c) }) // Get the position and length of each identifier in the input
.Where(x => x.Position>-1) // where there is actually a match
.OrderBy(x =>x.Position) // put them in order of the position in the input
.Take(2) // only interested in first 2
.ToArray(); // make it an array
if(indexes.Length < 2)
throw new Exception("Did not find 2");
var result = input.Substring(indexes[0].Position + indexes[0].Length, indexes[1].Position - indexes[0].Position - indexes[0].Length);
Live example: https://dotnetfiddle.net/tDiQLG
There are already a lot of answers here, but I took the time to write mine, so I might as well post it even if it's not as elegant.
The code assumes all tags are accounted for in the chars array.
string str = "UNB+123UNH+234BGM+345DTM+456";
string[] chars = { "UNH", "BGM", "DTM" };
var locations = chars.Select(o => str.IndexOf(o)).Where(i => i > -1).OrderBy(o => o);
var resultList = new List<string>();
for(int i = 0;i < locations.Count();i++)
{
var nextIndex = locations.ElementAtOrDefault(i + 1);
nextIndex = nextIndex > 0 ? nextIndex : str.Length;
nextIndex = nextIndex - locations.ElementAt(i);
resultList.Add(str.Substring(locations.ElementAt(i), nextIndex));
}
This is a fairly efficient O(n) solution using a HashSet
It's extremely simple, low allocations, more efficient than regex, and doesn't need a library
Given
private static HashSet<string> _set;
public static IEnumerable<string> Split(string input)
{
    var last = 0;
    // Scan every position that could start a 3-character tag
    for (int i = 0; i <= input.Length - 3; i++)
    {
        if (!_set.Contains(input.Substring(i, 3))) continue;
        // A known tag starts here: emit everything since the previous tag
        yield return input.Substring(last, i - last);
        last = i;
    }
    // Emit the final segment
    yield return input.Substring(last);
}
Usage
_set = new HashSet<string>(new []{ "UNH", "BGM", "DTM" });
var results = Split("UNB+123UNH+234BGM+345DTM+456");
foreach (var item in results)
Console.WriteLine(item);
Output
UNB+123
UNH+234
BGM+345
DTM+456
Full Demo Here
Note: You could get this faster with a simple sorted tree, but it would require more effort.
I've been trying to understand how XML and CSV parsing work, without actually writing any code yet. I might have to parse a .csv file in the ongoing project and I'd like to be ready. (I'll have to convert them to .ofx files)
I'm also aware there are probably a thousand XML and CSV parsers out there, so I'm more curious than I am worried. I intend on using the XmlReader that I believe Microsoft provides.
Let's say I have the following .csv file
02/02/2016 ; myfirstname ; mylastname ; somefield ; 321654 ; commentary ; blabla
Sometimes a field will be missing, which means, for the sake of the example, that the lastname isn't mandatory and somefield could be right after the first name.
My questions are :
How do I avoid the confusion between somefield and lastname?
I could count the total number of fields, but in my situation two are optional, if there is only one missing, I can't be sure which one it is.
How do I avoid false "tags"? I mean, if the user's first comment includes a ;, how can I be sure it's part of his comment and not the start of the following tag?
Again, I could count the remaining fields and find out where I am, but that excludes the optional fields problem.
My questions also apply to XML: what can I do if the user starts writing XML in his form? Whether I decide to export the form as .csv or .xml, there can be trouble.
Right now I'm assuming that the C# XML reader/parser is robust enough to deal with it; and if it is, I'm really curious how.
Assuming the CSV/XML data has been exported properly, none of this will be a problem. Missing fields will be handled by repeated separators:
02/02/2016;myfirstname;;somefield
Semi-colons within a field will normally be handled by quoting:
02/02/2016;"myfirst;name";
Quotes are escaped within a string:
02/02/2016;"my""first""name";
With XML it's even less of an issue since the tags or attributes will all have names.
If your CSV data is NOT well-formed, then you have a much bigger problem, as it may be impossible to distinguish missing fields and non-quoted separators.
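As a rough illustration, a standard reader such as TextFieldParser takes care of the quoting and escaping rules above (a sketch; the sample line is just the question's example, and it needs a reference to Microsoft.VisualBasic):
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(
    new System.IO.StringReader("02/02/2016;\"my;first;\"\"name\"\"\";somefield")))
{
    parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
    parser.SetDelimiters(";");
    parser.HasFieldsEnclosedInQuotes = true;
    string[] fields = parser.ReadFields();
    // fields[1] is now: my;first;"name"
}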
How do I avoid false "tags"? String values should be quoted if they (can) contain separator characters. If you create the CSV file, quote and unquote all string values.
How do I avoid the confusion between somefield and lastname? There is no general solution for this; each case must be handled one by one. Can a general algorithm decide whether the first name or the last name is missing? No.
If you know what field(s) can be omitted, you can write an "intelligent" handling.
Use XML and all of your problems will be solved.
First
How do I avoid the confusion between somefield and lastname?
There is no way to do this without changing the logic of the file. For example, when "mylastname" is empty you may have a "" value, an empty string, or something like this: ;;
How do I avoid false "tags"? I mean, if the user first comment includes a ;, how can I be sure it's a part of his comment and not the start of the following tag?
It is simple: you have to format the file like this:
; - separator of columns
"" - delimiter of column values
value;value;"value;;;;value";value
To split this only on the separator ';', ignoring separators inside "", the following code does the job; it is tested and compiles:
public static string[] SplitWithDelimeter(this string line, char separator, char checkSeparator, bool eraseCheckSeparator)
{
    var separatorsIndexes = new List<int>();
    var open = false;
    // Record the positions of separators that are not inside a quoted section
    for (var i = 0; i < line.Length; i++)
    {
        if (line[i] == checkSeparator)
        {
            open = !open; // toggle the "inside quotes" state
        }
        if (!open && line[i] == separator)
        {
            separatorsIndexes.Add(i);
        }
    }
    separatorsIndexes.Add(line.Length); // sentinel so the last field is captured
    var result = new string[separatorsIndexes.Count];
    var first = 0;
    // Cut the line into fields between the recorded separator positions
    for (var j = 0; j < separatorsIndexes.Count; j++)
    {
        var tempLine = line.Substring(first, separatorsIndexes[j] - first);
        result[j] = eraseCheckSeparator ? tempLine.Replace(checkSeparator, ' ').Trim() : tempLine;
        first = separatorsIndexes[j] + 1;
    }
    return result;
}
Return would be:
value
value
"value;;;;value"
value
I want to find a delimiter being used to separate the columns in csv or text files.
I am using TextFieldParser class to read those files.
Below is my code,
String path = @"c:\abc.csv";
DataTable dt = new DataTable();
if (File.Exists(path))
{
using (Microsoft.VisualBasic.FileIO.TextFieldParser parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(path))
{
parser.TextFieldType = FieldType.Delimited;
if (path.Contains(".txt"))
{
parser.SetDelimiters("|");
}
else
{
parser.SetDelimiters(",");
}
parser.HasFieldsEnclosedInQuotes = true;
bool firstLine = true;
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields();
if (firstLine)
{
foreach (var val in fields)
{
dt.Columns.Add(val);
}
firstLine = false;
continue;
}
dt.Rows.Add(fields);
}
}
    lblCount.Text = "Count of total rows in the file: " + dt.Rows.Count.ToString();
    dgvTextFieldParser1.DataSource = dt;
}
Instead of passing the delimiters manually based on the file type, I want to read the delimiter from the file and then pass it.
How can I do that?
Mathematically correct but totally useless answer: It's not possible.
Pragmatic answer: It's possible, but it depends on how much you know about the file's structure. It boils down to a bunch of assumptions, and depending on which we make, the answer will vary. And if you can't make any assumptions, well... see the mathematically correct answer.
For instance, can we assume that the delimiter is one or any of the elements in the set below?
List<char> delimiters = new List<char>{' ', ';', '|'};
Or can we assume that the delimiter is such that it produces elements of equal length?
Should we try to find a delimiter that's a single character or can a word be one?
Etc.
Based on the question, I'll assume that it's the first option and that we have a limited set of possible characters, precisely one of which is the delimiter for a given file.
How about you count the number of occurrences of each such character and assume that the one that's occurring most frequently is the one? Is that sufficiently rigid or do you need to be more sure than that?
List<char> delimiters = new List<char>{' ', ';', '-'};
Dictionary<char, int> counts = delimiters.ToDictionary(key => key, value => 0);
foreach(char c in delimiters)
    counts[c] = textArray.Count(t => t == c); // textArray: the file's text as a char sequence, e.g. File.ReadAllText(path)
I'm not in front of a computer so I can't verify, but the last step would be returning the dictionary key whose value is maximal.
You'll also need to consider special cases, such as no delimiter being detected at all, or two delimiter types occurring equally often, etc.
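A possible sketch of that last step, under the same assumptions (ties between candidates still need a policy of their own):
var best = counts.OrderByDescending(kv => kv.Value).First();
if (best.Value == 0)
    throw new InvalidOperationException("No delimiter detected");
char delimiter = best.Key;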
Very simple guessing approach using LINQ:
static class CsvSeperatorDetector
{
private static readonly char[] SeparatorChars = {';', '|', '\t', ','};
public static char DetectSeparator(string csvFilePath)
{
string[] lines = File.ReadAllLines(csvFilePath);
return DetectSeparator(lines);
}
public static char DetectSeparator(string[] lines)
{
var q = SeparatorChars.Select(sep => new
{Separator = sep, Found = lines.GroupBy(line => line.Count(ch => ch == sep))})
.OrderByDescending(res => res.Found.Count(grp => grp.Key > 0))
.ThenBy(res => res.Found.Count())
.First();
return q.Separator;
}
}
What this does is read the file line by line (note that CSV fields may include line breaks), then check, for each potential separator, how often it occurs in each line.
Then we check which separator occurs on the most lines, and of those which occur on the same number of lines, we take the one with the most even distribution (e.g. 5 occurrences on every line ranks higher than one that occurs once in one line and 10 times in another line).
Of course you might have to tweak this for your own purposes, add error handling, fallback logic and so forth. I'm sure it's not perfect, but it's good enough for me.
You could probably take n bytes from the file, count the possible delimiter characters (or all characters found) using a hash map/dictionary, and then the character repeated most often is probably the delimiter you're looking for. It would make sense that the characters used as delimiters would be the ones used the most. When done you reset the stream, but since you're using a text reader you would probably have to initialize another text reader. This would get slightly hairier if the CSV used more than one delimiter. You would probably also have to ignore some characters, like alphanumerics.
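A rough sketch of that sampling idea (the buffer size and candidate set are arbitrary choices, not taken from the answer):
static char GuessDelimiter(string path)
{
    // Sample the first few KB of the file and count candidate delimiter characters
    var candidates = new[] { ',', ';', '|', '\t' };
    var buffer = new char[4096];
    int read;
    using (var reader = new StreamReader(path))
        read = reader.Read(buffer, 0, buffer.Length);
    return candidates
        .OrderByDescending(c => buffer.Take(read).Count(ch => ch == c))
        .First();
}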
In Python we can do this easily using the csv module's Sniffer class. It caters for text files, and it also works if you just read some bytes from the file.
So my method should in theory work; I'm just not getting the expected result back.
I have a function that creates a new TextReader, reads characters (as ints) from my text file and adds them to a list.
The textfile data looks like the following (48 x 30):
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111100000000001111111111000000111111111
111111110000000000000000000000000000000011111111
100000000000000000000000000000000000000001111111
000000000000001111111111111111111111000001111111
100000001111111111111111111112211221000001111111
100000111111122112211221122111111111000001111111
111111111221111111111111111112211110000011111111
111112211111111111111111111111111100000111221111
122111111111111122111100000000000000001111111111
111111111111111111100000000000000000011111111111
111111111111111111000000000000000001112211111111
111111111111221110000001111110000111111111111111
111111111111111100000111112211111122111111111111
111111112211110000001122111111221111111111111111
111122111111000000011111111111111111112211221111
111111110000000011111111112211111111111111111111
111111000000001111221111111111221122111100000011
111111000000011111111111000001111111110000000001
111111100000112211111100000000000000000000000001
111111110000111111100000000000000000000000000011
111111111000011100000000000000000000000011111111
111111111100000000000000111111111110001111111111
111111111110000000000011111111111111111111111111
111111111111100000111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
My method is as follows:
private void LoadReferenceMap(string FileName)
{
FileName = Path.Combine(Environment.CurrentDirectory, FileName);
List<int> ArrayMapValues = new List<int>();
if (File.Exists(FileName))
{
// Open the file for reading
using (TextReader reader = File.OpenText(FileName))
{
for (int i = 0; i < 48; i++)
{
for (int j = 0; j < 30; j++)
{
int x = reader.Read();
if (x == -1)
break;
ArrayMapValues.Add(x);
}
}
}
level.SetFieldMap(ArrayMapValues);
}
}
It returns the following: as you can see, once it reaches the end of the first line, Read() returns 13 and then 10 before moving on to the next row.
A different approach that avoids both problems: the conversion of chars to integers and the skipping of Environment.NewLine characters.
private void LoadReferenceMap(string FileName)
{
List<int> ArrayMapValues = new List<int>();
if (File.Exists(FileName))
{
foreach(string line in File.ReadLines(FileName))
{
var lineMap = line.ToCharArray()
.Select(x => Convert.ToInt32(x.ToString()));
ArrayMapValues.AddRange(lineMap);
}
level.SetFieldMap(ArrayMapValues);
}
}
The file is small, so it is convenient to read each line as a string (this removes the Environment.NewLine characters), then process the line by converting it to a char array and applying the conversion to an integer for each char. Finally the list of integers for a single line is added to your list of integers for the whole file.
I have not inserted any check on the length of a single line (48 chars) or the total number of lines (30) because you say that every file has this format. However, adding a small check on the total lines loaded and their lengths should be pretty simple.
This is because you need to convert the value you've got to a char, like this:
(char)sr.Read();
After that you can parse it as an int, for example:
int.Parse(((char)sr.Read()).ToString());
More information on MSDN.
As you can see, once it reaches the end of the first line, Read() returns 13 and then 10 before moving on to the next row.
A line break in .NET looks like this: \r\n, not just \n (check the Environment.NewLine property).
The actual text file has line breaks in it. This means that once you have read the first 48 characters, the next thing in the file is a line break. In this case it is a standard Windows newline, which is a Carriage Return (character 13) followed by a Line Feed (character 10).
You need to deal with these line breaks in your code somehow. My preferred way of doing this would be the method outlined by Steve above (using File.ReadLines). Alternatively, at the end of each set of 48 character reads you could just consume the 13/10 character combination. One thing of note, though, is that some systems use just a Line Feed without the Carriage Return to indicate new lines, so depending on the source of these files you may need to handle different line-break conventions. Using ReadLines (or reader.ReadLine()) lets something else deal with this issue for you.
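For instance, a minimal sketch of the skip-the-line-breaks variant, reusing FileName and ArrayMapValues from the question's method (the digit conversion is included too, since the raw Read() values are character codes):
using (TextReader reader = File.OpenText(FileName))
{
    int x;
    while ((x = reader.Read()) != -1)
    {
        char c = (char)x;
        if (c == '\r' || c == '\n')
            continue;                // ignore line-break characters entirely
        ArrayMapValues.Add(c - '0'); // '1' (code 49) becomes 1
    }
}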
If you are also unsure why it is returning 49 instead of 1, then you need to understand character encoding. The file is stored as bytes which are interpreted by the reading program. In this case you are reading out the values of the characters as integers (which is how .NET stores them internally). You need to convert this to a character; in this case you can just cast to char (i.e. (char)x), which will give you the char '1'. If you want this as an integer you would then need to use int.Parse to parse the text into an integer.