Count characters in HTML [duplicate]

Count characters in HTML [duplicate] - c#

I'm trying to figure out a way to count the number of characters in a string, truncate the string, then returns it. However, I need this function to NOT count HTML tags. The problem is that if it counts HTML tags, then if the truncate point is in the middle of a tag, then the page will appear broken.
This is what I have so far...
public string Truncate(string input, int characterLimit, string currID) {
string output = input;
// Check if the string is longer than the allowed amount
// otherwise do nothing
if (output.Length > characterLimit && characterLimit > 0) {
// cut the string down to the maximum number of characters
output = output.Substring(0, characterLimit);
// Check if the character right after the truncate point was a space
// if not, we are in the middle of a word and need to remove the rest of it
if (input.Substring(output.Length, 1) != " ") {
int LastSpace = output.LastIndexOf(" ");
// if we found a space then, cut back to that space
if (LastSpace != -1)
{
output = output.Substring(0, LastSpace);
}
}
// end any anchors
if (output.Contains("<a href")) {
output += "</a>";
}
// Finally, add the "..." and end the paragraph
output += "<br /><br />...<a href='Announcements.aspx?ID=" + currID + "'>see more</a></p>";
}
return output;
}
But I'm not happy with this. Is there a better way to do this? If you could provide a new solution to this, or perhaps suggestions on what to add to what I have so far, that would be great.
Disclaimer: I've never worked with C#, so I'm not familiar with the concepts related to the language... I'm doing this because I have to, not by choice.
Thanks,
Hristo

Use the right tool for the problem.
HTML is not a simple format to parse. I would advise that you use a proven, existing parser rather than rolling your own. If you know that you will only ever parse XHTML - then you could use an XML parser instead.
These are the only reliable ways to perform operations on HTML that will preserve the semantic representation.
Don't try to use regular expressions. HTML is not a regular language and you can only cause yourself grief and misery going in that direction.

Related

Complex string compare logic

I need help with some complex (for me anyway as I not too experienced) string comparison logic. Basically, I want to validate a string to make sure it matches a format rule. I am using C#, targeting .NET 4.5.2.
I am trying to work with an API which gives me the expected format of the string this way:
1:420+4:9#### (must have “420” starting in position 1 AND have a “9” in position 4 AND have numeric digits in positions 5-8
2:Z+14:&&+20:10,11,12 (must have a “Z” in position 2 AND and alpha letters in positions 14, 15 AND have either “10”, “11”, or “12” starting in position 20
Legend:
":" = position/valuelist separator
"," = value separator
"+" = test separator
"#" = numeric digit-only wildcard
"&" = alpha letter-only wildcard
Given this, my first thought is to do a series of substrings and splits of the input string and then do compare on each section? Or, I could do a for loop and iterate through each character one by one until I hit the end of the length of the input string.
Let's assume in this case that the input string is something like "420987435744585". Using rule number one, I should get a pass on this since the first three are 420, position 4 is a 9 and the next 5-8 are numeric.
So far, I have created a method that returns a bool if I pass/fail validation. The input string is passed in. I then started to split on + or - to get all of the and or not sections and then split on comma to get the groups of rules. But this is where I am stuck. It seems like it should be easy and maybe it is but I just can't seem to wrap my head around it and I am thinking I am going to end up with a ton of arrays, foreach loops, if statements, etc... Just to validate and return true/false if the input string matches my format.
Can somebody please assist and give some guidance?
Thank you!!!!

The best way to handle these conditions would be using Regular Expressions (Regex). At first, you may find it a bit complicated, but it's worth to put time on learning it to handle all types of string patterns in a simple non-verbose way.
You can start with these tutorials :
http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial
http://www.tutorialspoint.com/csharp/csharp_regular_expressions.htm
And use this one as a reference :
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

I think the best way is a custom function, it will be faster than RegEx, and it would be a lot of manual work to convert that format to RegEx.
I've made a start at the validation function, and it's testing ok for the samples you provided.
Here is the code:
static bool CheckFormat(string formatString, string value)
{
string[] tests = formatString.Split('+');
foreach(string test in tests)
{
string[] testElement = test.Split(':');
int startPos = int.Parse(testElement[0]);
string patterns = testElement[1];
string[] patternElements = patterns.Split(',');
foreach(string patternElement in patternElements)
{
//value string not long enough, so fail.
if(startPos + patternElement.Length > value.Length)
return false;
for (int i = 0; i < patternElement.Length; i++)
{
switch(patternElement[i])
{
case '#':
if (!Char.IsNumber(value[i]))
return false;
break;
case '&':
if (!Char.IsLetter(value[i]))
return false;
break;
default:
if(patternElement[i] != value[i])
return false;
break;
}
}
}
}
return true;
}
The dotnet fiddle is here if you want to play with it: https://dotnetfiddle.net/52olLQ.
Good luck.

Code an elegant way to strip strings

I am using C# and in one of the places i got list of all peoples names with their email id's in the format
name(email)\n
i just came with this sub string stuff just off my head. I am looking for more elegant, fast ( in the terms of access time, operations it performs), easy to remember line of code to do this.
string pattern = "jackal(jackal#gmail.com)";
string email = pattern.SubString(pattern.indexOf("("),pattern.LastIndexOf(")") - pattern.indexOf("("));
//extra
string email = pattern.Split('(',')')[1];
I think doing the above would do sequential access to each character until it finds the index of the character. Works ok now since name is short, but would struggle when having a large name ( hope people don't have one)

A dirty hack would be to let microsoft do it for you.
try
{
new MailAddress(input);
//valid
}
catch (Exception ex)
{
// invalid
}
I hope they would do a better job than a custom reg-ex.
Maintaining a custom reg-ex that takes care of everything might involve some effort.
Refer: MailAddress
Your format is actually very close to some supported formats.
Text within () are treated as comments, but if you replace ( with < and ) with > and get a supported format.

The second parameter in Substring() is the length of the string to take, not the ending index.
Your code should read:
string pattern = "jackal(jackal#gmail.com)";
int start = pattern.IndexOf("(") + 1;
int end = pattern.LastIndexOf(")");
string email = pattern.Substring(start, end - start);
Alternatively, have a look at Regular Expression to find a string included between two characters while EXCLUDING the delimiters

Parsing CSV File enclosed with quotes in C#

I've seen lots of samples in parsing CSV File. but this one is kind of annoying file...
so how do you parse this kind of CSV
"1",1/2/2010,"The sample ("adasdad") asdada","I was pooping in the door "Stinky", so I'll be damn","AK"

The best answer in most cases is probably #Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.
So I rewrote for this case as a string extension. I think this is close.
Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.
The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As #Vash points out, you're better off following some standard and coding a little more OFfensively.
But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to #Vash's comment!
Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.
namespace YourFavoriteNamespace
{
using System;
using System.Collections.Generic;
using System.Text;
public static class Extensions
{
public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"',
bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
{
Queue<string> qReturn = new Queue<string>();
StringBuilder stringBuilder = new StringBuilder();
bool bInEscapeVal = false;
for (int i = 0; i < valToSplit.Length; i++)
{
if (!bInEscapeVal)
{
// Escape values must come immediately after a split.
// abc,"b,ca",cab has an escaped comma.
// abc,b"ca,c"ab does not.
if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || (i == 0 || (i != 0 && splittingChar == valToSplit[i - 1]))))
{
bInEscapeVal = true; // not capturing escapeChar as part of value; easy enough to change if need be.
}
else if (splittingChar == valToSplit[i])
{
qReturn.Enqueue(stringBuilder.ToString());
stringBuilder = new StringBuilder();
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
else
{
// Can't use switch b/c we're comparing to a variable, I believe.
if (escapeChar == valToSplit[i])
{
// Repeated escape always reduces to one escape char in this logic.
// So if you wanted "I'm ""double quote"" crazy!" to come out with
// the double double quotes, you're toast.
if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
{
i++;
stringBuilder.Append(escapeChar);
}
else if (!strictEscapeToSplitEvaluation)
{
bInEscapeVal = false;
}
// *** STINKY CONDITION ***
// Kinda defense, since only `", ` really makes sense.
else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
{
i = i+2;
stringBuilder.Append("\", ");
}
// *** EO STINKY CONDITION ***
else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
{
bInEscapeVal = false;
}
else
{
stringBuilder.Append(escapeChar);
}
}
else
{
stringBuilder.Append(valToSplit[i]);
}
}
}
// NOTE: The `captureEndingNull` flag is not tested.
// Catch null final entry? "abc,cab,bca," could be four entries, with the last an empty string.
if ((captureEndingNull && splittingChar == valToSplit[valToSplit.Length-1]) || (stringBuilder.Length > 0))
{
qReturn.Enqueue(stringBuilder.ToString());
}
return qReturn;
}
}
}
Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)
[Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.
If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.
Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.
Enjoy. ;^)

I very strongly recommend using TextFieldParser. Hand-coded parsers that use String.Split or regular expressions almost invariably mishandle things like quoted fields that have embedded quotes or embedded separators.
I would be surprised, though, if it handled your particular example. As others have said, that line is, at best, ambiguous.

Split based on
",
I would use MyString.IndexOf("\","
And then substring the parts. Other then that im sure someone written a csv parser out there that can handle this :)

I found a way to parse this malformed CSV. I looked for a pattern and found it.... I first replace (",") with a character... like "¤" and then split it...
from this:
"Annoying","CSV File","poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby","yeah!"
to this:
"Annoying¤CSV File¤poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby¤yeah!"
then split it:
ArrayA[0]: "Annoying //this value will be trimmed by replace("\"","") same as the array[4]
ArrayA[1]: CSV File
ArrayA[2]: poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
ArrayA[3]: yeah!"
after splitting it, I will replace strings from ArrayA[2] ", and ," with ¤ and then split it again
from this
ArrayA[2]: poop#mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
to this
ArrayA[2]: poop#mypants.com¤1999,01-20-2001¤oh,boy¤01-20-2001¤yeah baby
then split it again and would turn to this
ArrayB[0]: poop#mypants.com
ArrayB[1]: 1999,01-20-2001
ArrayB[2]: oh,boy
ArrayB[3]: 01-20-2001
ArrayB[4]: yeah baby
and lastly... I'll split the Year only and the date from ArrayB[1] with , to ArrayC
It's tedious but there's no other way to do it...

There is one another open source library, Cinchoo ETL, handle quoted string fine. Here is sample code.
string csv = #"""1"",1/2/2010,""The sample(""adasdad"") asdada"",""I was pooping in the door ""Stinky"", so I'll be damn"",""AK""";
using (var r = ChoCSVReader.LoadText(csv)
.QuoteAllFields()
)
{
foreach (var rec in r)
Console.WriteLine(rec.Dump());
}
Output:
[Count: 5]
Key: Column1 [Type: Int64]
Value: 1
Key: Column2 [Type: DateTime]
Value: 1/2/2010 12:00:00 AM
Key: Column3 [Type: String]
Value: The sample(adasdad) asdada
Key: Column4 [Type: String]
Value: I was pooping in the door Stinky, so I'll be damn
Key: Column5 [Type: String]
Value: AK

You could split the string by ",". It is recomended that the csv file could each cell value should be enclosed in quotes like "1","2","3".....

I don't see how you could if each line is different. This line is a malformed for CSV. Quotes contained within a value must be doubled as shown below. I can't even tell for sure where the values should be terminated.
"1",1/2/2010,"The sample (""adasdad"") asdada","I was pooping in the door ""Stinky"", so I'll be damn","AK"
Here's my code to parse a CSV file but I don't see how any code would know how to handle your line because it's malformed.

You might want to give CsvReader a try. It will handle quoted string fine, so you just will have to remove leading and trailing quotes.
It will fail if your strings contains a coma. To avoid this, the quotes needs to be doubled as said in other answers.

As no (decent) .csv parser can parse non-csv-data correctly, the task isn't to parse the data, but to fix the file(s) (and then to parse the correct data).
To fix the data you need a list of bad rows (to be sent to the person responsible for the garbage for manual editing). To get such a list, you can
use Access with a correct import specification to import the file. You'll get a list of import failures.
write a script/program that opens the file via the OLEDB text driver.
Sample file:
"Id","Remark","DateDue"
1,"This is good",20110413
2,"This is ""good""",20110414
3,"This is ""good"","bad",and "ugly",,20110415
4,"This is ""good""" again,20110415
Sample SQL/Result:
SELECT * FROM [badcsv01.csv]
Id Remark DateDue
1 This is good 4/13/2011
2 This is "good" 4/14/2011
3 This is "good", NULL
4 This is "good" again 4/15/2011
SELECT * FROM [badcsv01.csv] WHERE DateDue Is Null
Id Remark DateDue
3 This is "good", NULL

First you will do it for the columns names:
DataTable pbResults = new DataTable();
OracleDataAdapter oda = new OracleDataAdapter(cmd);
oda.Fill(pbResults);
StringBuilder sb1 = new StringBuilder();
StringBuilder sb2 = new StringBuilder();
IEnumerable<string> columnNames = pbResults.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
sb1.Append(string.Join("\"" + "," + "\"", columnNames));
sb2.Append("\"");
sb2.Append(sb1);
sb2.AppendLine("\"");
Second you will do it for each row:
foreach (DataRow row in pbResults.Rows)
{
IEnumerable<string> fields = row.ItemArray.Select(field => field.ToString());
sb2.Append("\"");
sb2.Append(string.Join("\"" + "," + "\"", fields));
sb2.AppendLine("\"");
}

Replace & with & in C#

Ok I feel really stupid asking this. I see plenty of other questions that resemble my question, but none seem to be able to answer it.
I am creating an xml file for a program that is very picky about syntax. Sadly I am making the XML file from scratch. Meaning, I am placing each line in individually (lots of file.WriteLine(String)).
I know this is ugly, but its the only way I can get the logic to work out.
ANYWAY. I have a few strings that are coming through with '&' in them.
if (value.Contains("&"))
{
value.Replace("&", "&");
}
Does not seem to work. The value.Contains() seems to see it, but the replace does not work. I am using C# .Net 2.0 sp2. VS 2005.
Please help me out here.. Its been a long week..

If you really want to go that route, you have to assign the result of Replace (the method returns a new string because strings are immutable) back to the variable:
value = value.Replace("&", "&");
I would suggest rethinking the way you're writing your XML though. If you switch to using the XmlTextWriter, it will handle all of the encoding for you (not only the ampersand, but all of the other characters that need encoded as well):
using(var writer = new XmlTextWriter(#"C:\MyXmlFile.xml", null))
{
writer.WriteStartElement("someString");
writer.WriteText("This is < a > string & everything will get encoded");
writer.WriteEndElement();
}
Should produce:
<someString>This is < a > string &
everything will get encoded</someString>

You should really use something like Linq to XML (XDocument etc.) to solve it. I'm 100% sure you can do it without all your WriteLine´s ;) Show us your logic?
Otherwise you could use this which will be bullet proof (as opposed to .Replace("&")):
var value = "hej&hej<some>";
value = new System.Xml.Linq.XText(value).ToString(); //hej&hej<some>
This will also take care of < which you also HAVE TO escape :)
Update: I have looked at the code for XText.ToString() and internally it creates a XmlWriter + StringWriter and uses XNode.WriteTo. This may be overkill for a given application so if many strings should be converted, XText.WriteTo would be better. An alternative which should be fast and reliant is System.Web.HttpUtility.HtmlEncode.
Update 2: I found this System.Security.SecurityElement.Escape(xml) which may be the fastest and ensures max compatibility (supported since .Net 1.0 and does not require the System.Web reference).

you can also use HttpUtility.HtmlEncode class under System.Web namespace instead of doing the replacement yourself.
here you go: http://msdn.microsoft.com/en-us/library/73z22y6h.aspx

You can use Regex for replace char "&" only in node values:
input data example (string)
<select>
<option id="11">Gigamaster&Minimaster</option>
<option id="12">Black & White</option>
<option id="13">Other</option>
</select>
Replace with Regex
Regex rgx = new Regex(">(?<prefix>.*)&(?<sufix>.*)<");
data = rgx.Replace(data, ">${prefix}&${sufix}<");
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(data);
result data
<select>
<option id="11">Gigamaster&MiniMaster</option>
<option id="12">Black & White</option>
<option id="13">Other</option>
</select>

I'm Obviously very late to this, but the right answer is:
System.Text.RegularExpressions.Regex.Replace(input, "&(?!amp;)", "&");
Hope this helps somebody!

You can try:
value = value.Replace("&", "&");

Strings are immutable. You need to write:
value = value.Replace("&", "&");
Note that if you do this and your string contains "&", it's going to get changed to "&amp;".

I've created the following function to encode & and ' without messing up with already encoded & or &apos; or "
public static string encodeSelectXMLCharacters(string xmlString)
{
string returnValue = Regex.Replace(xmlString, "&(?!quot;|apos;|amp;|lt;|gt;#x?.*?;)|'",
delegate(Match m)
{
string encodedValue;
switch (m.Value)
{
case "&":
encodedValue = "&";
break;
case "'":
encodedValue = "&apos;";
break;
default:
encodedValue = m.Value;
break;
}
return encodedValue;
});
return returnValue;
}

not sure if this is useful to anyone... I was fighting this for a while... here is a glorious regex you can use to fix all your links, javascript, content. I had to deal with a ton of legacy content that nobody wanted to correct.
Add this to your Render override in your master page, control or recode to run a string through it. Please don't flame me for putting this in the wrong place:
// remove the & from href="blaw?a=b&b=c" and replace with &
//in urls - this corrects any unencoded & not just those in URL's
// this match will also ignore any matches it finds within <script> blocks AND
// it will also ignore the matches where the link includes a javascript command like
// blaw
html = Regex.Replace(html, "&(?!(?<=(?<outerquote>[\"'])javascript:(?>(?!\\k<outerquote>|[>]).)*)\\k<outerquote>?)(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\\d+);)(?!(?>(?:(?!<script|\\/script>).)*)\\/script>)", "&", RegexOptions.Singleline | RegexOptions.IgnoreCase);
Its a broad stroke for a rendered page but this can be adapted to many uses without blowing up your page.

What about
Value = Server.HtmlEncode(Value);

I am quite sure it will work if you "embrace" your value with CDATA, so the result is something like
<ampersandData><![CDATA[value with ampersands like …]]></ampersandData>
Hope it helps.
Michael

Very late here, but I want to share my solution which handles the cases where you have both & (incorrect xml) and & (valid xml) in the document in addition to other xml character entities.
This solution is only meant for cases where you cannot control generation of the xml, usually because it comes from some external source. If you control the xml generation please use XmlTextWriter as suggested by #Justin Niessner
It is also quite fast and handles all the different xml character entities/references
Predefined character entities:
& quot;
& amp;
& apos;
& lt;
& gt;
Numeric character entities/references:
& #nnnn;
& #xhhhh;
PS! The space after & should not be included in the entities/references, I just added it here to avoid it being encoded in the page rendering
Code
public static string CleanXml(string text)
{
int length = text.Length;
StringBuilder stringBuilder = new StringBuilder(length);
for (int i = 0; i < length; ++i)
{
if (text[i] == '&')
{
var remaining = Math.Abs(length - i + 1);
var subStrLength = Math.Min(remaining, 12);
var subStr = text.Substring(i, subStrLength);
var firstIndexOfSemiColon = subStr.IndexOf(';');
if (firstIndexOfSemiColon > -1)
subStr = subStr.Substring(0, firstIndexOfSemiColon + 1);
var matches = Regex.Matches(subStr, "&(?!quot;|apos;|amp;|lt;|gt;|#x?.*?;)|'");
if (matches.Count > 0)
stringBuilder.Append("&");
else
stringBuilder.Append("&");
}
else if (XmlConvert.IsXmlChar(text[i]))
{
stringBuilder.Append(text[i]);
}
else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
{
stringBuilder.Append(text[i]);
stringBuilder.Append(text[i + 1]);
++i;
}
}
return stringBuilder.ToString();
}

Need to pick up line terminators with StreamReader.ReadLine()

I wrote a C# program to read an Excel .xls/.xlsx file and output to CSV and Unicode text. I wrote a separate program to remove blank records. This is accomplished by reading each line with StreamReader.ReadLine(), and then going character by character through the string and not writing the line to output if it contains all commas (for the CSV) or all tabs (for the Unicode text).
The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these new lines (since it goes cell by cell) and write them as \x0A, and normal lines just use StreamWriter.WriteLine().
The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.
I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.

I would recommend that you change your architecture to work more like a parser in a compiler.
You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.
In your case the tokens would be:
Column data
Comma
End of Line
You would treat '\n' ('\x0a') by its self as an embedded new line, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.
This has the advantages of:
Doing only 1 pass over the data
Only storing a max of 1 lines worth of data
Reusing as much memory as possible (for the string builder and the list)
It's easy to change should your requirements change
Here's a sample of what the Lexer would look like:
Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.
enum TokenType
{
ColumnData,
Comma,
LineTerminator
}
class Token
{
public TokenType Type { get; private set;}
public string Data { get; private set;}
public Token(TokenType type)
{
Type = type;
}
public Token(TokenType type, string data)
{
Type = type;
Data = data;
}
}
private IEnumerable<Token> GetTokens(TextReader s)
{
var builder = new StringBuilder();
while (s.Peek() >= 0)
{
var c = (char)s.Read();
switch (c)
{
case ',':
{
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
yield return new Token(TokenType.Comma);
break;
}
case '\r':
{
var next = s.Peek();
if (next == '\n')
{
s.Read();
}
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
yield return new Token(TokenType.LineTerminator);
break;
}
default:
builder.Append(c);
break;
}
}
s.Read();
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
}
private string ExtractText(StringBuilder b)
{
var ret = b.ToString();
b.Remove(0, b.Length);
return ret;
}
Your "parser" code would then look like this:
public void ConvertXLS(TextReader s)
{
var columnData = new List<string>();
bool lastWasColumnData = false;
bool seenAnyData = false;
foreach (var token in GetTokens(s))
{
switch (token.Type)
{
case TokenType.ColumnData:
{
seenAnyData = true;
if (lastWasColumnData)
{
//TODO: do some error reporting
}
else
{
lastWasColumnData = true;
columnData.Add(token.Data);
}
break;
}
case TokenType.Comma:
{
if (!lastWasColumnData)
{
columnData.Add(null);
}
lastWasColumnData = false;
break;
}
case TokenType.LineTerminator:
{
if (seenAnyData)
{
OutputLine(lastWasColumnData);
}
seenAnyData = false;
lastWasColumnData = false;
columnData.Clear();
}
}
}
if (seenAnyData)
{
OutputLine(columnData);
}
}

You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.
I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.
It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.
x.Replace("\r\n", "\u0000") // Or some other unused character
.Replace("\n", "\\x0A") // Or whatever escaping you need
.Replace("\u0000", "\r\n") // Replace the real line breaks
I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.

Essentially, a hard-return in Excel (shift+enter or alt+enter, I can't remember) puts a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a newline (which I believe is \r\n).
The CSV is fine and comes out exactly how Excel would save it, the problem is when I read it into the blank record remover, I'm using ReadLine() which will treat a record with an embedded newline as a CRLF.
Here's an example of the file after I convert to CSV...
Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
As you can see, the first record has an embedded new-line after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan' and when I write that out, WriteLine() ends that line with a CRLF. I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.
I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically what I want to do is get the line terminator to determine if its \x0a or a CRLF, and then if its \x0A, I'll use Write() and insert that terminator.

I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most given.
If you are able to determine the column count which should be easy to do since the first line is usually the column titles, you can check your column count against the expected column count. If the column count doesn't equal the expected column count, you simply concatenate the current line with the previous unmatched lines. For example:
string sep = "\",\"";
int columnCount = 0;
while ((currentLine = sr.ReadLine()) != null)
{
if (lineCount == 0)
{
lineData = inLine.Split(new string[] { sep }, StringSplitOptions.None);
columnCount = lineData.length;
++lineCount;
continue;
}
string thisLine = lastLine + currentLine;
lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
if (lineData.Length < columnCount)
{
lastLine += currentLine;
continue;
}
else
{
lastLine = null;
}
......

Thank you so much with your code and some others I came up with the following solution! I have added a link at the bottom to some code I wrote that used some of the logic from this page. I figured I'd give honor where honor was due! Thanks!
Below is a explanation about what I needed:
Try This, I wrote this because I have some very large '|' delimited files that have \r\n inside of some of the columns and I needed to use \r\n as the end of the line delimiter. I was trying to import some files using SSIS packages but because of some corrupted data in the files I was unable to. The File was over 5 GB so it was too large to open and manually fix. I found the answer through looking through lots of Forums to understand how streams work and ended up coming up with a solution that reads each character in a file and spits out the line based on the definitions I added into it. this is for use in a Command Line Application, complete with help :). I hope this helps some other people out, I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.
https://stackoverflow.com/a/12640862/1582188

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Count characters in HTML [duplicate] - c#

Related

Complex string compare logic

Code an elegant way to strip strings

Parsing CSV File enclosed with quotes in C#

Replace & with & in C#

Need to pick up line terminators with StreamReader.ReadLine()

Categories

Resources