Converting multifasta parser from Python to C#

Converting multifasta parser from Python to C# - c#

I am trying to convert a multi fasta parser from Python to C#. For the input
>header1
ACTG
GCTA
>header2
GATTACA
it would return the dictionary {'header2': 'GATTACA', 'header1': 'ACTGGCTA'}
The original Python code looks like:
def fastaParser(handle):
""" Adapted from https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L39 """
fastaDict = {}
#Skip any text before the first record (e.g. blank lines, comments)
while True:
line = handle.readline()
if line == "":
return # Premature end of file, or just empty?
if line[0] == ">":
break
while True:
if line[0] != ">":
raise ValueError("Records in Fasta files should start with '>' character")
title = line[1:].rstrip()
lines = []
line = handle.readline()
while True:
if not line:
break
if line[0] == ">":
break
lines.append(line.rstrip())
line = handle.readline()
#Remove trailing whitespace, and any internal spaces
sequence = "".join(lines).replace(" ", "").replace("\r", "")
fastaDict[title] = sequence
if not line:
return fastaDict
if __name__ == '__main__':
with open('fasta.txt') as f:
print fastaParser(f)
What I have as C# code is (my code expects a string instead of an open filehandle):
public Dictionary<int, string> parseFasta(string multiFasta)
{
Dictionary<int, string> fastaDict = new Dictionary<int, string>();
using (System.IO.StringReader multiFastaReader = new System.IO.StringReader(multiFasta))
{
// Skip any text before the first record (e.g. blank lines, comments)
while (true)
{
string line = multiFastaReader.ReadLine();
if (line == "")
{
return fastaDict; // Premature end of file, or just empty?
}
if (line[0] == '>')
{
break;
}
}
while (true)
{
if (line[0] != '>') // <- Here I get the error: "the name 'line' does not exist in the current context
{
throw new Exception("Records in Fasta files should start with '>' character");
}
string title= line[1:].TrimEnd();
List<string> lines = new List<string>();
line = multiFastaReader.ReadLine();
while (true)
{
if (!line)
{
break;
}
if (line[0] == '>')
{
break;
}
lines.Add(line.TrimEnd());
line = multiFastaReader.ReadLine();
}
// Remove trailing whitespace, and any internal spaces
string sequence = String.Join("", lines).Replace(" ", "").Replace("\r", "");
fastaDict.Add(title, sequence);
if (!line)
{
return fastaDict;
}
}
}
}
The error that 'm getting is that Visual Studio says that the variables called line after the second while (true) don't exist in the current context.

I finally got it to work with this code:
public Dictionary<string, string> parseFasta(string multiFasta)
{
Dictionary<string, string> fastaDict = new Dictionary<string, string>();
using (System.IO.StringReader multiFastaReader = new System.IO.StringReader(multiFasta))
{
// Skip any text before the first record (e.g. blank lines, comments)
string line = multiFastaReader.ReadLine();
while (true)
{
if (line == "")
{
return fastaDict; // Premature end of file, or just empty?
}
if (line[0] == '>')
{
break;
}
}
while (true)
{
if (line[0] != '>')
{
throw new Exception("Records in Fasta files should start with '>' character");
}
string title= line.Substring(1, line.Length-1).TrimEnd();
List<string> lines = new List<string>();
line = multiFastaReader.ReadLine();
while (true)
{
if (line == "")
{
break;
}
if (line == null)
{
break;
}
if (line[0] == '>')
{
break;
}
lines.Add(line.TrimEnd());
line = multiFastaReader.ReadLine();
}
// Remove trailing whitespace, and any internal spaces
string sequence = String.Join("", lines).Replace(" ", "").Replace("\r", "");
fastaDict.Add(title, sequence);
if (line == null)
{
return fastaDict;
}
}
}
}

Related

Split a string if delimiter is between single quotes [duplicate]

This question already has answers here:
How to split csv whose columns may contain comma
(9 answers)
Closed 4 years ago.
I have the following comma-separated string that I need to split. The problem is that some of the content is within quotes and contains commas that shouldn't be used in the split.
String:
111,222,"33,44,55",666,"77,88","99"
I want the output:
111
222
33,44,55
666
77,88
99
I have tried this:
(?:,?)((?<=")[^"]+(?=")|[^",]+)
But it reads the comma between "77,88","99" as a hit and I get the following output:
111
222
33,44,55
666
77,88
,
99

Depending on your needs you may not be able to use a csv parser, and may in fact want to re-invent the wheel!!
You can do so with some simple regex
(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
This will do the following:
(?:^|,) = Match expression "Beginning of line or string ,"
(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group, this will select between 2 alternatives:
stuff in quotes
stuff between commas
This should give you the output you are looking for.
Example code in C#
static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
public static string[] SplitCSV(string input)
{
List<string> list = new List<string>();
string curr = null;
foreach (Match match in csvSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}
list.Add(curr.TrimStart(','));
}
return list.ToArray();
}
private void button1_Click(object sender, RoutedEventArgs e)
{
Console.WriteLine(SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
}
Warning As per #MrE's comment - if a rogue new line character appears in a badly formed csv file and you end up with an uneven ("string) you'll get catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) in your regex and your system will likely crash (like our production system did). Can easily be replicated in Visual Studio and as I've discovered will crash it. A simple try/catch will not trap this issue either.
You should use:
(?:^|,)(\"(?:[^\"])*\"|[^,]*)
instead

Fast and easy:
public static string[] SplitCsv(string line)
{
List<string> result = new List<string>();
StringBuilder currentStr = new StringBuilder("");
bool inQuotes = false;
for (int i = 0; i < line.Length; i++) // For each character
{
if (line[i] == '\"') // Quotes are closing or opening
inQuotes = !inQuotes;
else if (line[i] == ',') // Comma
{
if (!inQuotes) // If not in quotes, end of current string, add it to result
{
result.Add(currentStr.ToString());
currentStr.Clear();
}
else
currentStr.Append(line[i]); // If in quotes, just add it
}
else // Add any other character to current string
currentStr.Append(line[i]);
}
result.Add(currentStr.ToString());
return result.ToArray(); // Return array of all strings
}
With this string as input :
111,222,"33,44,55",666,"77,88","99"
It will return :
111
222
33,44,55
666
77,88
99

i really like jimplode's answer, but I think a version with yield return is a little bit more useful, so here it is:
public IEnumerable<string> SplitCSV(string input)
{
Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
yield return match.Value.TrimStart(',');
}
}
Maybe it's even more useful to have it like an extension method:
public static class StringHelper
{
public static IEnumerable<string> SplitCSV(this string input)
{
Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
yield return match.Value.TrimStart(',');
}
}
}

This regular expression works without the need to loop through values and TrimStart(','), like in the accepted answer:
((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))
Here is the implementation in C#:
string values = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";
MatchCollection matches = new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))").Matches(values);
foreach (var match in matches)
{
Console.WriteLine(match);
}
Outputs
111
222
33,44,55
666
77,88
99

None of these answers work when the string has a comma inside quotes, as in "value, 1", or escaped double-quotes, as in "value ""1""", which are valid CSV that should be parsed as value, 1 and value "1", respectively.
This will also work with the tab-delimited format if you pass in a tab instead of a comma as your delimiter.
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
var inQuotes = false;
var quoteIsEscaped = false; //Store when a quote has been escaped.
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new {val, index}))
{
if (character.val == delimiter) //We hit a delimiter character...
{
if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
{
Console.WriteLine(currentString);
yield return currentString.ToString();
currentString.Clear();
}
else
{
currentString.Append(character.val);
}
} else {
if (character.val != ' ')
{
if(character.val == '"') //If we've hit a quote character...
{
if(character.val == '\"' && inQuotes) //Does it appear to be a closing quote?
{
if (row[character.index + 1] == character.val) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
{
quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
}
else if (quoteIsEscaped)
{
quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
currentString.Append(character.val);
}
else
{
inQuotes = false;
}
}
else
{
if (!inQuotes)
{
inQuotes = true;
}
else
{
currentString.Append(character.val); //...It's a quote inside a quote.
}
}
}
else
{
currentString.Append(character.val);
}
}
else
{
if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
{
currentString.Append(character.val);
}
}
}
}
}

With minor updates to the function provided by "Chad Hedgcock".
Updates are on:
Line 26: character.val == '\"' - This can never be true due to the check made on Line 24. i.e. character.val == '"'
Line 28: if (row[character.index + 1] == character.val) added !quoteIsEscaped to escape 3 consecutive quotes.
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
var inQuotes = false;
var quoteIsEscaped = false; //Store when a quote has been escaped.
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new {val, index}))
{
if (character.val == delimiter) //We hit a delimiter character...
{
if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
{
//Console.WriteLine(currentString);
yield return currentString.ToString();
currentString.Clear();
}
else
{
currentString.Append(character.val);
}
} else {
if (character.val != ' ')
{
if(character.val == '"') //If we've hit a quote character...
{
if(character.val == '"' && inQuotes) //Does it appear to be a closing quote?
{
if (row[character.index + 1] == character.val && !quoteIsEscaped) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
{
quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
}
else if (quoteIsEscaped)
{
quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
currentString.Append(character.val);
}
else
{
inQuotes = false;
}
}
else
{
if (!inQuotes)
{
inQuotes = true;
}
else
{
currentString.Append(character.val); //...It's a quote inside a quote.
}
}
}
else
{
currentString.Append(character.val);
}
}
else
{
if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
{
currentString.Append(character.val);
}
}
}
}
}

For Jay's answer, if you use a 2nd boolean then you can have nested double-quotes inside single-quotes and vice-versa.
private string[] splitString(string stringToSplit)
{
char[] characters = stringToSplit.ToCharArray();
List<string> returnValueList = new List<string>();
string tempString = "";
bool blockUntilEndQuote = false;
bool blockUntilEndQuote2 = false;
int characterCount = 0;
foreach (char character in characters)
{
characterCount = characterCount + 1;
if (character == '"' && !blockUntilEndQuote2)
{
if (blockUntilEndQuote == false)
{
blockUntilEndQuote = true;
}
else if (blockUntilEndQuote == true)
{
blockUntilEndQuote = false;
}
}
if (character == '\'' && !blockUntilEndQuote)
{
if (blockUntilEndQuote2 == false)
{
blockUntilEndQuote2 = true;
}
else if (blockUntilEndQuote2 == true)
{
blockUntilEndQuote2 = false;
}
}
if (character != ',')
{
tempString = tempString + character;
}
else if (character == ',' && (blockUntilEndQuote == true || blockUntilEndQuote2 == true))
{
tempString = tempString + character;
}
else
{
returnValueList.Add(tempString);
tempString = "";
}
if (characterCount == characters.Length)
{
returnValueList.Add(tempString);
tempString = "";
}
}
string[] returnValue = returnValueList.ToArray();
return returnValue;
}

The original version
Currently I use the following regex:
public static Regex regexCSVSplit = new Regex(#"(?x:(
(?<FULL>
(^|[,;\t\r\n])\s*
( (?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>) |
(?<QUODAT> (?<DAT> [^""',;\s\r\n]* )) )
(?=\s*([,;\t\r\n]|$))
) |
(?<FULL>
(^|[\s\t\r\n])
( (?<QUODAT> (?<QUO>[""'])(?<DAT> [^""',;\s\t\r\n]* )\k<QUO>) |
(?<QUODAT> (?<DAT> [^""',;\s\t\r\n]* )) )
(?=[,;\s\t\r\n]|$)
)
))", RegexOptions.Compiled);
This solution can handle pretty chaotic cases too like below:
This is how to feed the result into an array:
var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
Select(x => x.Groups["DAT"].Value).ToArray();
See this example in action HERE
Note: The regular expression contains two set of <FULL> block and each of them contains two <QUODAT> block separated by "or" (|). Depending on your task you may only need one of them.
Note: That this regular expression gives us one string array, and works on single line with or without <carrier return> and/or <line feed>.
Simplified version
The following regular expression will already cover many complex cases:
public static Regex regexCSVSplit = new Regex(#"(?x:(
(?<FULL>
(^|[,;\t\r\n])\s*
(?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>)
(?=\s*([,;\t\r\n]|$))
)
))", RegexOptions.Compiled);
See this example in action: HERE
It can process complex, easy and empty items too:
This is how to feed the result into an array:
var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
Select(x => x.Groups["DAT"].Value).ToArray();
The main rule here is that every item may contain anything but the <quotation mark><separators><comma> sequence AND each item shall being and end with the same <quotation mark>.
<quotation mark>: <">, <'>
<comma>: <,>, <;>, <tab>, <carrier return>, <line feed>
Edit notes: I added some more explanation to make it easier to understand and replaces the text "CO" with "QUO".

Try this:
string s = #"111,222,""33,44,55"",666,""77,88"",""99""";
List<string> result = new List<string>();
var splitted = s.Split('"').ToList<string>();
splitted.RemoveAll(x => x == ",");
foreach (var it in splitted)
{
if (it.StartsWith(",") || it.EndsWith(","))
{
var tmp = it.TrimEnd(',').TrimStart(',');
result.AddRange(tmp.Split(','));
}
else
{
if(!string.IsNullOrEmpty(it)) result.Add(it);
}
}
//Results:
foreach (var it in result)
{
Console.WriteLine(it);
}

I know I'm a bit late to this, but for searches, here is how I did what you are asking about in C sharp
private string[] splitString(string stringToSplit)
{
char[] characters = stringToSplit.ToCharArray();
List<string> returnValueList = new List<string>();
string tempString = "";
bool blockUntilEndQuote = false;
int characterCount = 0;
foreach (char character in characters)
{
characterCount = characterCount + 1;
if (character == '"')
{
if (blockUntilEndQuote == false)
{
blockUntilEndQuote = true;
}
else if (blockUntilEndQuote == true)
{
blockUntilEndQuote = false;
}
}
if (character != ',')
{
tempString = tempString + character;
}
else if (character == ',' && blockUntilEndQuote == true)
{
tempString = tempString + character;
}
else
{
returnValueList.Add(tempString);
tempString = "";
}
if (characterCount == characters.Length)
{
returnValueList.Add(tempString);
tempString = "";
}
}
string[] returnValue = returnValueList.ToArray();
return returnValue;
}

Don't reinvent a CSV parser, try FileHelpers.

I needed something a little more robust, so I took from here and created this... This solution is a little less elegant and a little more verbose, but in my testing (with a 1,000,000 row sample), I found this to be 2 to 3 times faster. Plus it handles non-escaped, embedded quotes. I used string delimiter and qualifiers instead of chars because of the requirements of my solution. I found it more difficult than I expected to find a good, generic CSV parser so I hope this parsing algorithm can help someone.
public static string[] SplitRow(string record, string delimiter, string qualifier, bool trimData)
{
// In-Line for example, but I implemented as string extender in production code
Func <string, int, int> IndexOfNextNonWhiteSpaceChar = delegate (string source, int startIndex)
{
if (startIndex >= 0)
{
if (source != null)
{
for (int i = startIndex; i < source.Length; i++)
{
if (!char.IsWhiteSpace(source[i]))
{
return i;
}
}
}
}
return -1;
};
var results = new List<string>();
var result = new StringBuilder();
var inQualifier = false;
var inField = false;
// We add new columns at the delimiter, so append one for the parser.
var row = $"{record}{delimiter}";
for (var idx = 0; idx < row.Length; idx++)
{
// A delimiter character...
if (row[idx]== delimiter[0])
{
// Are we inside qualifier? If not, we've hit the end of a column value.
if (!inQualifier)
{
results.Add(trimData ? result.ToString().Trim() : result.ToString());
result.Clear();
inField = false;
}
else
{
result.Append(row[idx]);
}
}
// NOT a delimiter character...
else
{
// ...Not a space character
if (row[idx] != ' ')
{
// A qualifier character...
if (row[idx] == qualifier[0])
{
// Qualifier is closing qualifier...
if (inQualifier && row[IndexOfNextNonWhiteSpaceChar(row, idx + 1)] == delimiter[0])
{
inQualifier = false;
continue;
}
else
{
// ...Qualifier is opening qualifier
if (!inQualifier)
{
inQualifier = true;
}
// ...It's a qualifier inside a qualifier.
else
{
inField = true;
result.Append(row[idx]);
}
}
}
// Not a qualifier character...
else
{
result.Append(row[idx]);
inField = true;
}
}
// ...A space character
else
{
if (inQualifier || inField)
{
result.Append(row[idx]);
}
}
}
}
return results.ToArray<string>();
}
Some test code:
//var input = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";
var input =
"111, 222, \"99\",\"33,44,55\" , \"666 \"mark of a man\"\", \" spaces \"77,88\" \"";
Console.WriteLine("Split with trim");
Console.WriteLine("---------------");
var result = SplitRow(input, ",", "\"", true);
foreach (var r in result)
{
Console.WriteLine(r);
}
Console.WriteLine("");
// Split 2
Console.WriteLine("Split with no trim");
Console.WriteLine("------------------");
var result2 = SplitRow(input, ",", "\"", false);
foreach (var r in result2)
{
Console.WriteLine(r);
}
Console.WriteLine("");
// Time Trial 1
Console.WriteLine("Experimental Process (1,000,000) iterations");
Console.WriteLine("-------------------------------------------");
watch = Stopwatch.StartNew();
for (var i = 0; i < 1000000; i++)
{
var x1 = SplitRow(input, ",", "\"", false);
}
watch.Stop();
elapsedMs = watch.ElapsedMilliseconds;
Console.WriteLine($"Total Process Time: {string.Format("{0:0.###}", elapsedMs / 1000.0)} Seconds");
Console.WriteLine("");
Results
Split with trim
---------------
111
222
99
33,44,55
666 "mark of a man"
spaces "77,88"
Split with no trim
------------------
111
222
99
33,44,55
666 "mark of a man"
spaces "77,88"
Original Process (1,000,000) iterations
-------------------------------
Total Process Time: 7.538 Seconds
Experimental Process (1,000,000) iterations
--------------------------------------------
Total Process Time: 3.363 Seconds

I once had to do something similar and in the end I got stuck with Regular Expressions. The inability for Regex to have state makes it pretty tricky - I just ended up writing a simple little parser.
If you're doing CSV parsing you should just stick to using a CSV parser - don't reinvent the wheel.

Here is my fastest implementation based upon string raw pointer manipulation:
string[] FastSplit(string sText, char? cSeparator = null, char? cQuotes = null)
{
string[] oTokens;
if (null == cSeparator)
{
cSeparator = DEFAULT_PARSEFIELDS_SEPARATOR;
}
if (null == cQuotes)
{
cQuotes = DEFAULT_PARSEFIELDS_QUOTE;
}
unsafe
{
fixed (char* lpText = sText)
{
#region Fast array estimatation
char* lpCurrent = lpText;
int nEstimatedSize = 0;
while (0 != *lpCurrent)
{
if (cSeparator == *lpCurrent)
{
nEstimatedSize++;
}
lpCurrent++;
}
nEstimatedSize++; // Add EOL char(s)
string[] oEstimatedTokens = new string[nEstimatedSize];
#endregion
#region Parsing
char[] oBuffer = new char[sText.Length];
int nIndex = 0;
int nTokens = 0;
lpCurrent = lpText;
while (0 != *lpCurrent)
{
if (cQuotes == *lpCurrent)
{
// Quotes parsing
lpCurrent++; // Skip quote
nIndex = 0; // Reset buffer
while (
(0 != *lpCurrent)
&& (cQuotes != *lpCurrent)
)
{
oBuffer[nIndex] = *lpCurrent; // Store char
lpCurrent++; // Move source cursor
nIndex++; // Move target cursor
}
}
else if (cSeparator == *lpCurrent)
{
// Separator char parsing
oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex); // Store token
nIndex = 0; // Skip separator and Reset buffer
}
else
{
// Content parsing
oBuffer[nIndex] = *lpCurrent; // Store char
nIndex++; // Move target cursor
}
lpCurrent++; // Move source cursor
}
// Recover pending buffer
if (nIndex > 0)
{
// Store token
oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex);
}
// Build final tokens list
if (nTokens == nEstimatedSize)
{
oTokens = oEstimatedTokens;
}
else
{
oTokens = new string[nTokens];
Array.Copy(oEstimatedTokens, 0, oTokens, 0, nTokens);
}
#endregion
}
}
// Epilogue
return oTokens;
}

Try this
private string[] GetCommaSeperatedWords(string sep, string line)
{
List<string> list = new List<string>();
StringBuilder word = new StringBuilder();
int doubleQuoteCount = 0;
for (int i = 0; i < line.Length; i++)
{
string chr = line[i].ToString();
if (chr == "\"")
{
if (doubleQuoteCount == 0)
doubleQuoteCount++;
else
doubleQuoteCount--;
continue;
}
if (chr == sep && doubleQuoteCount == 0)
{
list.Add(word.ToString());
word = new StringBuilder();
continue;
}
word.Append(chr);
}
list.Add(word.ToString());
return list.ToArray();
}

This is Chad's answer rewritten with state based logic. His answered failed for me when it came across """BRAD""" as a field. That should return "BRAD" but it just ate up all the remaining fields. When I tried to debug it I just ended up rewriting it as state based logic:
enum SplitState { s_begin, s_infield, s_inquotefield, s_foundquoteinfield };
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
SplitState state = SplitState.s_begin;
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new { val, index }))
{
//Console.WriteLine("character = " + character.val + " state = " + state);
switch (state)
{
case SplitState.s_begin:
if (character.val == delimiter)
{
/* empty field */
yield return currentString.ToString();
currentString.Clear();
} else if (character.val == '"')
{
state = SplitState.s_inquotefield;
} else
{
currentString.Append(character.val);
state = SplitState.s_infield;
}
break;
case SplitState.s_infield:
if (character.val == delimiter)
{
/* field with data */
yield return currentString.ToString();
state = SplitState.s_begin;
currentString.Clear();
} else
{
currentString.Append(character.val);
}
break;
case SplitState.s_inquotefield:
if (character.val == '"')
{
// could be end of field, or escaped quote.
state = SplitState.s_foundquoteinfield;
} else
{
currentString.Append(character.val);
}
break;
case SplitState.s_foundquoteinfield:
if (character.val == '"')
{
// found escaped quote.
currentString.Append(character.val);
state = SplitState.s_inquotefield;
}
else if (character.val == delimiter)
{
// must have been last quote so we must find delimiter
yield return currentString.ToString();
state = SplitState.s_begin;
currentString.Clear();
}
else
{
throw new Exception("Quoted field not terminated.");
}
break;
default:
throw new Exception("unknown state:" + state);
}
}
//Console.WriteLine("currentstring = " + currentString.ToString());
}
This is a lot more lines of code than the other solutions, but it is easy to modify to add edge cases.

Superpower: match a string with tokenizer only if it begins a line

When tokenizing in superpower, how to match a string only if it is the first thing in a line (note: this is a different question than this one) ?
For example, assume I have a language with only the following 4 characters (' ', ':', 'X', 'Y'), each of which is a token. There is also a 'Header' token to capture cases of the following regex pattern /^[XY]+:/ (any number of Xs and Ys followed by a colon, only if they start the line).
Here is a quick class for testing (the 4th test-case fails):
using System;
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;
public enum Tokens { Space, Colon, Header, X, Y }
public class XYTokenizer
{
static void Main(string[] args)
{
Test("X", Tokens.X);
Test("XY", Tokens.X, Tokens.Y);
Test("X Y:", Tokens.X, Tokens.Space, Tokens.Y, Tokens.Colon);
Test("X: X", Tokens.Header, Tokens.Space, Tokens.X);
}
public static readonly Tokenizer<Tokens> tokenizer = new TokenizerBuilder<Tokens>()
.Match(Character.EqualTo('X'), Tokens.X)
.Match(Character.EqualTo('Y'), Tokens.Y)
.Match(Character.EqualTo(':'), Tokens.Colon)
.Match(Character.EqualTo(' '), Tokens.Space)
.Build();
static void Test(string input, params Tokens[] expected)
{
var tokens = tokenizer.Tokenize(input);
var i = 0;
foreach (var t in tokens)
{
if (t.Kind != expected[i])
{
Console.WriteLine("tokens[" + i + "] was Tokens." + t.Kind
+ " not Tokens." + expected[i] + " for '" + input + "'");
return;
}
i++;
}
Console.WriteLine("OK");
}
}

I came up with a custom Tokenizer based on the example found here. I added comments throughout the code so you can follow what's happening.
public class MyTokenizer : Tokenizer<Tokens>
{
protected override IEnumerable<Result<Tokens>> Tokenize(TextSpan input)
{
Result<char> next = input.ConsumeChar();
bool checkForHeader = true;
while (next.HasValue)
{
// need to check for a header when starting a new line
if (checkForHeader)
{
var headerStartLocation = next.Location;
var tokenQueue = new List<Result<Tokens>>();
while (next.HasValue && (next.Value == 'X' || next.Value == 'Y'))
{
tokenQueue.Add(Result.Value(next.Value == 'X' ? Tokens.X : Tokens.Y, next.Location, next.Remainder));
next = next.Remainder.ConsumeChar();
}
// only if we had at least one X or one Y
if (tokenQueue.Any())
{
if (next.HasValue && next.Value == ':')
{
// this is a header token; we have to return a Result of the start location
// along with the remainder at this location
yield return Result.Value(Tokens.Header, headerStartLocation, next.Remainder);
next = next.Remainder.ConsumeChar();
}
else
{
// this isn't a header; we have to return all the tokens we parsed up to this point
foreach (Result<Tokens> tokenResult in tokenQueue)
{
yield return tokenResult;
}
}
}
if (!next.HasValue)
yield break;
}
checkForHeader = false;
if (next.Value == '\r')
{
// skip over the carriage return
next = next.Remainder.ConsumeChar();
continue;
}
if (next.Value == '\n')
{
// line break; check for a header token here
next = next.Remainder.ConsumeChar();
checkForHeader = true;
continue;
}
if (next.Value == 'A')
{
var abcStart = next.Location;
next = next.Remainder.ConsumeChar();
if (next.HasValue && next.Value == 'B')
{
next = next.Remainder.ConsumeChar();
if (next.HasValue && next.Value == 'C')
{
yield return Result.Value(Tokens.ABC, abcStart, next.Remainder);
next = next.Remainder.ConsumeChar();
}
else
{
yield return Result.Empty<Tokens>(next.Location, $"unrecognized `AB{next.Value}`");
}
}
else
{
yield return Result.Empty<Tokens>(next.Location, $"unrecognized `A{next.Value}`");
}
}
else if (next.Value == 'X')
{
yield return Result.Value(Tokens.X, next.Location, next.Remainder);
next = next.Remainder.ConsumeChar();
}
else if (next.Value == 'Y')
{
yield return Result.Value(Tokens.Y, next.Location, next.Remainder);
next = next.Remainder.ConsumeChar();
}
else if (next.Value == ':')
{
yield return Result.Value(Tokens.Colon, next.Location, next.Remainder);
next = next.Remainder.ConsumeChar();
}
else if (next.Value == ' ')
{
yield return Result.Value(Tokens.Space, next.Location, next.Remainder);
next = next.Remainder.ConsumeChar();
}
else
{
yield return Result.Empty<Tokens>(next.Location, $"unrecognized `{next.Value}`");
next = next.Remainder.ConsumeChar(); // Skip the character anyway
}
}
}
}
And you can call it like this:
var tokens = new MyTokenizer().Tokenize(input);

Getting a loop to build a string correctly

I'm trying to pull the tournament age from this webpage:
http://www.reddishvulcans.com/uk_tournament_database.asp
I'm trying to create a string based on the valid ages for entry for each table.
For example, if the "Carling Cup" is enterable by 7 year olds, then a string would be generated like "U7", or if it's enterable by 7, 8, and 9 year olds, the resulting string will be "U7, U8, U9".
I've made a start, however my logic will break if the ages go like this "Under 7s, Gap here where no under 8s, Under 9s".
Here is my code:
public static List<Record> getRecords()
{
string url = "http://www.reddishvulcans.com/uk_tournament_database.asp";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
var root = doc.DocumentNode;
var ages = root.SelectNodes("//div[#class='infobox']/table/tr[5]/td/img");
List<String> tournamentAges = new List<String>();
String ageGroups = "";
List<String> ageString = new List<String>();
for (Int32 i = 0; i < ages.Count(); i++)
{
if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u6_Yes.gif")
{
if (!ageString.Contains(" U6 ")) {
ageString.Add(" U6 ");
continue;
}
}
else if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u6_.gif")
{
continue;
}
if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u7_Yes.gif")
{
if (!ageString.Contains(" U7 "))
{
ageString.Add(" U7 ");
continue;
}
}
else if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u7_.gif")
{
continue;
}
if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u8_Yes.gif")
{
if (!ageString.Contains(" U8 "))
{
ageString.Add(" U8 ");
continue;
}
}
else if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u8_.gif")
{
continue;
}
// Checks until u16.gif
foreach (String a in ageString)
{
if (a != "")
{
ageGroups += a;
}
}
ageString.Clear();
if (ageGroups != "")
{
tournamentAges.Add(ageGroups);
}
ageGroups = "";
}
}
}
To be clear, I'm having trouble with the loop logic.
The flow currently goes like this:
Loop through current list of images
If > u6_Yes.gif
Concatenate u6 to ageString
else
Continue
However it will continue back to the start and get stuck in an infinite loop, How can I make it gracefully handle when u6_.gif is gone, ignore it and go to the next?

why don't you just simplify your loop like this?
if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u6_Yes.gif")
{
if (!ageString.Contains(" U6 "))
{
ageString.Add(" U6 ");
continue;
}
}
if (ages[i].GetAttributeValue("src", "nope") == "images/2016/u7_Yes.gif")
{
if (!ageString.Contains(" U7 "))
{
ageString.Add(" U7 ");
continue;
}
}
}
(...)
just remove all those else if blocks...
Also, you should consider extraction src attribute from ages array. You can use Linq and make your loop a lot simplier. Something like this:
List<String> ageString = new List<String>();
List<string> imageSources = ages.Where(x => x.GetAttributeValue("src", "nope").StartsWith("images/2016/u") && x.GetAttributeValue("src", "nope").EndsWith("_Yes.gif")).ToList();
foreach (var src in imageSources)
{
ageString.Add(" " + src.Substring(11, 2).ToUpper() + " ");
}

Regex behaving strangely .net

I have some code which reads every row of a CSV file and if the value doesn't match the correct value, it will add it to the error list which is returned to the users screen. The problem I am having is with the regex itself.
protected void ReadData(string filePath, bool upload)
{
StringBuilder sb = new StringBuilder();
#region upload
if (upload == true) // CSV file upload chosen
{
using (CsvReader csv = new CsvReader(new StreamReader(filePath), true)) // Cache CSV file to memory
{
int fieldCount = csv.FieldCount; // Total number of fields per row
string[] headers = csv.GetFieldHeaders(); // Correct CSV headers stored in array
SortedList<int, string> errorList = new SortedList<int, string>(); // This list will contain error values
bool errorFlag = false;
int errorCount = 0;
// Check if headers are correct first before reading data
if (headers[0] != "first name" || headers[1] != "last name" || headers[2] != "job title" || headers[3] != "email address" || headers[4] != "telephone number" || headers[5] != "company" || headers[6] != "research manager" || headers[7] != "user card number")
{
sb.Append("Headers are incorrect");
}
else
{
while (csv.ReadNextRecord())
try
{
//Check csv obj data for valid values
for (int i = 0; i < fieldCount; i++)
{
if (i == 0 || i == 1) // FirstName and LastName
{
if (Regex.IsMatch(csv[i].ToString(), "[a-zA-Z]", RegexOptions.IgnoreCase)) //REGEX letters only min of 5 char max of 20
{
errorList.Add(errorCount, csv[i]);
errorCount += 1;
errorFlag = true;
string text = csv[i].ToString();
}
}
else if (i == 5) // Company name
{
string text = csv[i];
text.Replace("&", "and");
}
}
if (errorFlag == true)
{
sb.Append("<b>" + "Number of Error: " + errorCount + "</b>");
sb.Append("<ul>");
foreach (KeyValuePair<int, string> key in errorList)
{
sb.Append("<li>" + key.Value + "</li>");
}
}
else // All validation checks equaled to false. Create User
{
ORCLdap.CreateUserAccount(rootLDAPPath, svcUsername, svcPassword, csv[0], csv[1], csv[2], csv[3], csv[4], csv[5], csv[7]);
sb.Append("<b>New user data uploaded successfully</b>");
}
}// end of try
catch (Exception ex)
{
sb.Append(ex.ToString());
}
finally
{
lblMessage.Text = sb.ToString();
sb.Remove(0, sb.Length);
}
}
}
#endregion
The lblMessage.text contains this html:
Number of Error: 4
David1212
smith
Nick444
Gowdy333
When it should be 3 errors because smith doesnt contain a number.
Does anyone have suggestions for this?

You also have a logic error:
if (Regex.IsMatch(csv[i].ToString(), "[a-zA-Z]", RegexOptions.IgnoreCase)) //REGEX letters only min of 5 char max of 20
should be
if (!Regex.IsMatch(csv[i].ToString(), "^[a-zA-Z]+$", RegexOptions.IgnoreCase)) //REGEX letters only min of 5 char max of 20
because it is only an error if the name has other characters than [a-zA-Z] in it, right?
(and if you use RegexOptions.IgnoreCase you don't need [a-zA-Z], [a-z] would do)

You need to add word boundaries to your regex, or starting '^' and end '$'
i.e.
^[a-zA-Z]+$
http://regexr.com?3298g
Your current regex is incorrect, and will match any string which contains a-z or A-Z , any letter,at any position.
http://regexr.com?3298j

c# search string in txt file

I want to find a string in a txt file if string compares, it should go on reading lines till another string which I'm using as parameter.
Example:
CustomerEN //search for this string
...
some text which has details about the customer
id "123456"
username "rootuser"
...
CustomerCh //get text till this string
I need the details to work with them otherwise.
I'm using linq to search for "CustomerEN" like this:
File.ReadLines(pathToTextFile).Any(line => line.Contains("CustomerEN"))
But now I'm stuck with reading lines (data) till "CustomerCh" to extract details.

If your pair of lines will only appear once in your file, you could use
File.ReadLines(pathToTextFile)
.SkipWhile(line => !line.Contains("CustomerEN"))
.Skip(1) // optional
.TakeWhile(line => !line.Contains("CustomerCh"));
If you could have multiple occurrences in one file, you're probably better off using a regular foreach loop - reading lines, keeping track of whether you're currently inside or outside a customer etc:
List<List<string>> groups = new List<List<string>>();
List<string> current = null;
foreach (var line in File.ReadAllLines(pathToFile))
{
if (line.Contains("CustomerEN") && current == null)
current = new List<string>();
else if (line.Contains("CustomerCh") && current != null)
{
groups.Add(current);
current = null;
}
if (current != null)
current.Add(line);
}

You have to use while since foreach does not know about index. Below is an example code.
int counter = 0;
string line;
Console.Write("Input your search text: ");
var text = Console.ReadLine();
System.IO.StreamReader file =
new System.IO.StreamReader("SampleInput1.txt");
while ((line = file.ReadLine()) != null)
{
if (line.Contains(text))
{
break;
}
counter++;
}
Console.WriteLine("Line number: {0}", counter);
file.Close();
Console.ReadLine();

With LINQ, you could use the SkipWhile / TakeWhile methods, like this:
var importantLines =
File.ReadLines(pathToTextFile)
.SkipWhile(line => !line.Contains("CustomerEN"))
.TakeWhile(line => !line.Contains("CustomerCh"));

If you whant only one first string, you can use simple for-loop.
var lines = File.ReadAllLines(pathToTextFile);
var firstFound = false;
for(int index = 0; index < lines.Count; index++)
{
if(!firstFound && lines[index].Contains("CustomerEN"))
{
firstFound = true;
}
if(firstFound && lines[index].Contains("CustomerCh"))
{
//do, what you want, and exit the loop
// return lines[index];
}
}

I worked a little bit the method that Rawling posted here to find more than one line in the same file until the end. This is what worked for me:
foreach (var line in File.ReadLines(pathToFile))
{
if (line.Contains("CustomerEN") && current == null)
{
current = new List<string>();
current.Add(line);
}
else if (line.Contains("CustomerEN") && current != null)
{
current.Add(line);
}
}
string s = String.Join(",", current);
MessageBox.Show(s);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Converting multifasta parser from Python to C# - c#

Related

Split a string if delimiter is between single quotes [duplicate]

Superpower: match a string with tokenizer only if it begins a line

Getting a loop to build a string correctly

Regex behaving strangely .net

c# search string in txt file

Categories

Resources