Splitting CamelCase - c#

This is all asp.net c#.
I have an enum
public enum ControlSelectionType
{
NotApplicable = 1,
SingleSelectRadioButtons = 2,
SingleSelectDropDownList = 3,
MultiSelectCheckBox = 4,
MultiSelectListBox = 5
}
The numerical value of this is stored in my database. I display this value in a datagrid.
<asp:boundcolumn datafield="ControlSelectionTypeId" headertext="Control Type"></asp:boundcolumn>
The ID means nothing to a user so I have changed the boundcolumn to a template column with the following.
<asp:TemplateColumn>
<ItemTemplate>
<%# Enum.Parse(typeof(ControlSelectionType), DataBinder.Eval(Container.DataItem, "ControlSelectionTypeId").ToString()).ToString()%>
</ItemTemplate>
</asp:TemplateColumn>
This is a lot better... However, it would be great if there was a simple function I can put around the Enum to split it by Camel case so that the words wrap nicely in the datagrid.
Note: I am fully aware that there are better ways of doing all this. This screen is purely used internally and I just want a quick hack in place to display it a little better.

I used:
public static string SplitCamelCase(string input)
{
return System.Text.RegularExpressions.Regex.Replace(input, "([A-Z])", " $1", System.Text.RegularExpressions.RegexOptions.Compiled).Trim();
}
Taken from http://weblogs.asp.net/jgalloway/archive/2005/09/27/426087.aspx
vb.net:
Public Shared Function SplitCamelCase(ByVal input As String) As String
Return System.Text.RegularExpressions.Regex.Replace(input, "([A-Z])", " $1", System.Text.RegularExpressions.RegexOptions.Compiled).Trim()
End Function
Here is a dotnet Fiddle for online execution of the c# code.

A regex replace, as described in the other answer, is indeed the way to go. However, this might also be of use if you want to go in a different direction:
using System.ComponentModel;
using System.Reflection;
...
public static string GetDescription(System.Enum value)
{
FieldInfo fi = value.GetType().GetField(value.ToString());
DescriptionAttribute[] attributes = (DescriptionAttribute[])fi.GetCustomAttributes(typeof(DescriptionAttribute), false);
if (attributes.Length > 0)
return attributes[0].Description;
else
return value.ToString();
}
This will allow you to define your enums as
public enum ControlSelectionType
{
[Description("Not Applicable")]
NotApplicable = 1,
[Description("Single Select Radio Buttons")]
SingleSelectRadioButtons = 2,
[Description("Completely Different Display Text")]
SingleSelectDropDownList = 3,
}
Taken from
http://www.codeguru.com/forum/archive/index.php/t-412868.html
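As a rough usage sketch: the helper returns the attribute text when one is present and falls back to the member name otherwise. The class name EnumDescriptions, the null check, and the undecorated MultiSelectCheckBox member are assumptions added for illustration, not part of the original snippet.

```csharp
using System;
using System.ComponentModel;
using System.Reflection;

public enum ControlSelectionType
{
    [Description("Not Applicable")]
    NotApplicable = 1,
    [Description("Single Select Radio Buttons")]
    SingleSelectRadioButtons = 2,
    // Deliberately left without an attribute to show the fallback path.
    MultiSelectCheckBox = 4
}

// Hypothetical holder class; the original answer shows only the method.
public static class EnumDescriptions
{
    public static string GetDescription(Enum value)
    {
        FieldInfo fi = value.GetType().GetField(value.ToString());
        // fi is null when the value has no named member, e.g. (ControlSelectionType)99.
        if (fi == null) return value.ToString();

        var attributes = (DescriptionAttribute[])fi.GetCustomAttributes(typeof(DescriptionAttribute), false);
        return attributes.Length > 0 ? attributes[0].Description : value.ToString();
    }
}
```

With these definitions, EnumDescriptions.GetDescription(ControlSelectionType.NotApplicable) gives "Not Applicable", while the undecorated MultiSelectCheckBox comes back as "MultiSelectCheckBox".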

This regex (^[a-z]+|[A-Z]+(?![a-z])|[A-Z][a-z]+) can be used to extract all words from the camelCase or PascalCase name. It also works with abbreviations anywhere inside the name.
MyHTTPServer will contain exactly 3 matches: My, HTTP, Server
myNewXMLFile will contain 4 matches: my, New, XML, File
You could then join them into a single string using string.Join.
string name = "myNewUIControl";
string[] words = Regex.Matches(name, "(^[a-z]+|[A-Z]+(?![a-z])|[A-Z][a-z]+)")
.OfType<Match>()
.Select(m => m.Value)
.ToArray();
string result = string.Join(" ", words);
As @DanielB noted in the comments, that regex won't work for numbers or underscores, so here is an improved version that supports any identifier with words, acronyms, numbers, and underscores (a slightly modified version of @JoeJohnston's); see the online demo (fiddle):
([A-Z]+(?![a-z])|[A-Z][a-z]+|[0-9]+|[a-z]+)
Extreme example: __snake_case12_camelCase_TLA1ABC → snake, case, 12, camel, Case, TLA, 1, ABC
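A minimal sketch of how the improved pattern could be wrapped for reuse, combining it with the string.Join step shown earlier. The class name IdentifierWords and the method name ToWords are made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; only the pattern itself comes from the answer above.
public static class IdentifierWords
{
    private static readonly Regex WordPattern =
        new Regex("([A-Z]+(?![a-z])|[A-Z][a-z]+|[0-9]+|[a-z]+)");

    // Extracts every word, acronym, and number, then joins them with single spaces.
    public static string ToWords(string identifier) =>
        string.Join(" ", WordPattern.Matches(identifier).Cast<Match>().Select(m => m.Value));
}
```

Underscores never match any alternative, so they simply disappear from the joined result: IdentifierWords.ToWords("__snake_case12_camelCase_TLA1ABC") gives "snake case 12 camel Case TLA 1 ABC".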

Tillito's answer does not handle acronyms, or strings that already contain spaces, well. This fixes it:
public static string SplitCamelCase(string input)
{
return Regex.Replace(input, "(?<=[a-z])([A-Z])", " $1", RegexOptions.Compiled);
}
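To make the difference concrete, here is a small side-by-side sketch of the two patterns. The class and method names are invented for the comparison.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical names; the two patterns are the ones quoted in the answers.
public static class CamelCaseComparison
{
    // Inserts a space before every capital letter, then trims the ends.
    public static string SplitEveryCapital(string input) =>
        Regex.Replace(input, "([A-Z])", " $1").Trim();

    // Inserts a space only before a capital that directly follows a lowercase letter.
    public static string SplitAfterLowercase(string input) =>
        Regex.Replace(input, "(?<=[a-z])([A-Z])", " $1");
}
```

For "Not Applicable" the first version yields a double space between the words, while the second leaves the string unchanged. For "MyHTTPServer" the first yields "My H T T P Server" and the second "My HTTPServer". The lookbehind keeps the acronym together but does not separate it from the word that follows.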

If C# 3.0 is an option you can use the following one-liner to do the job:
Regex.Matches(YOUR_ENUM_VALUE_NAME, "[A-Z][a-z]+").OfType<Match>().Select(match => match.Value).Aggregate((acc, b) => acc + " " + b).TrimStart(' ');

Here's an extension method that handles numbers and multiple uppercase characters sanely, and also allows for upper-casing specific acronyms in the final string:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Web.Configuration;
namespace System
{
/// <summary>
/// Extension methods for the string data type
/// </summary>
public static class ConventionBasedFormattingExtensions
{
/// <summary>
/// Turn CamelCaseText into Camel Case Text.
/// </summary>
/// <param name="input"></param>
/// <returns></returns>
/// <remarks>Use AppSettings["SplitCamelCase_AllCapsWords"] to specify a comma-delimited list of words that should be ALL CAPS after split</remarks>
/// <example>
/// wordWordIDWord1WordWORDWord32Word2
/// Word Word ID Word 1 Word WORD Word 32 Word 2
///
/// wordWordIDWord1WordWORDWord32WordID2ID
/// Word Word ID Word 1 Word WORD Word 32 Word ID 2 ID
///
/// WordWordIDWord1WordWORDWord32Word2Aa
/// Word Word ID Word 1 Word WORD Word 32 Word 2 Aa
///
/// wordWordIDWord1WordWORDWord32Word2A
/// Word Word ID Word 1 Word WORD Word 32 Word 2 A
/// </example>
public static string SplitCamelCase(this string input)
{
if (input == null) return null;
if (string.IsNullOrWhiteSpace(input)) return "";
var separated = input;
separated = SplitCamelCaseRegex.Replace(separated, @" $1").Trim();
//Set ALL CAPS words
if (_SplitCamelCase_AllCapsWords.Any())
foreach (var word in _SplitCamelCase_AllCapsWords)
separated = SplitCamelCase_AllCapsWords_Regexes[word].Replace(separated, word.ToUpper());
//Capitalize first letter
var firstChar = separated.First(); //NullOrWhiteSpace handled earlier
if (char.IsLower(firstChar))
separated = char.ToUpper(firstChar) + separated.Substring(1);
return separated;
}
private static readonly Regex SplitCamelCaseRegex = new Regex(@"
(
(?<=[a-z])[A-Z0-9] (?# lower-to-other boundaries )
|
(?<=[0-9])[a-zA-Z] (?# number-to-other boundaries )
|
(?<=[A-Z])[0-9] (?# cap-to-number boundaries; handles a specific issue with the next condition )
|
(?<=[A-Z])[A-Z](?=[a-z]) (?# handles longer strings of caps like ID or CMS by splitting off the last capital )
)"
, RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace
);
private static readonly string[] _SplitCamelCase_AllCapsWords =
(WebConfigurationManager.AppSettings["SplitCamelCase_AllCapsWords"] ?? "")
.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
.Select(a => a.ToLowerInvariant().Trim())
.ToArray()
;
private static Dictionary<string, Regex> _SplitCamelCase_AllCapsWords_Regexes;
private static Dictionary<string, Regex> SplitCamelCase_AllCapsWords_Regexes
{
get
{
if (_SplitCamelCase_AllCapsWords_Regexes == null)
{
_SplitCamelCase_AllCapsWords_Regexes = new Dictionary<string,Regex>();
foreach(var word in _SplitCamelCase_AllCapsWords)
_SplitCamelCase_AllCapsWords_Regexes.Add(word, new Regex(@"\b" + word + @"\b", RegexOptions.Compiled | RegexOptions.IgnoreCase));
}
return _SplitCamelCase_AllCapsWords_Regexes;
}
}
}
}

You can use C# extension methods
public static string SpacesFromCamel(this string value)
{
if (value.Length > 0)
{
var result = new List<char>();
char[] array = value.ToCharArray();
foreach (var item in array)
{
if (char.IsUpper(item) && result.Count > 0)
{
result.Add(' ');
}
result.Add(item);
}
return new string(result.ToArray());
}
return value;
}
Then you can use it like
var result = "TestString".SpacesFromCamel();
Result will be
Test String

Using LINQ:
var chars = ControlSelectionType.NotApplicable.ToString().SelectMany((x, i) => i > 0 && char.IsUpper(x) ? new char[] { ' ', x } : new char[] { x });
Console.WriteLine(new string(chars.ToArray()));

I also have an enum which I had to separate. In my case this method solved the problem:
string SeparateCamelCase(string str)
{
for (int i = 1; i < str.Length; i++)
{
if (char.IsUpper(str[i]))
{
str = str.Insert(i, " ");
i++;
}
}
return str;
}

public enum ControlSelectionType
{
NotApplicable = 1,
SingleSelectRadioButtons = 2,
SingleSelectDropDownList = 3,
MultiSelectCheckBox = 4,
MultiSelectListBox = 5
}
public class NameValue
{
public string Name { get; set; }
public object Value { get; set; }
}
public static List<NameValue> EnumToList<T>(bool camelcase)
{
var array = (T[])(Enum.GetValues(typeof(T)).Cast<T>());
var array2 = Enum.GetNames(typeof(T)).ToArray<string>();
List<NameValue> lst = null;
for (int i = 0; i < array.Length; i++)
{
if (lst == null)
lst = new List<NameValue>();
string name = "";
if (camelcase)
{
name = array2[i].CamelCaseFriendly();
}
else
name = array2[i];
T value = array[i];
lst.Add(new NameValue { Name = name, Value = value });
}
return lst;
}
public static string CamelCaseFriendly(this string pascalCaseString)
{
Regex r = new Regex("(?<=[a-z])(?<x>[A-Z])|(?<=.)(?<x>[A-Z])(?=[a-z])");
return r.Replace(pascalCaseString, " ${x}");
}
//In your form
protected void Button1_Click1(object sender, EventArgs e)
{
DropDownList1.DataSource = GeneralClass.EnumToList<ControlSelectionType >(true); ;
DropDownList1.DataTextField = "Name";
DropDownList1.DataValueField = "Value";
DropDownList1.DataBind();
}

The solution from Eoin Campbell works well unless you have a web service.
In that case you would need to do the following, as the Description attribute is not serializable.
[DataContract]
public enum ControlSelectionType
{
[EnumMember(Value = "Not Applicable")]
NotApplicable = 1,
[EnumMember(Value = "Single Select Radio Buttons")]
SingleSelectRadioButtons = 2,
[EnumMember(Value = "Completely Different Display Text")]
SingleSelectDropDownList = 3,
}
public static string GetDescriptionFromEnumValue(Enum value)
{
EnumMemberAttribute attribute = value.GetType()
.GetField(value.ToString())
.GetCustomAttributes(typeof(EnumMemberAttribute), false)
.SingleOrDefault() as EnumMemberAttribute;
return attribute == null ? value.ToString() : attribute.Value;
}

And if you don't fancy using regex - try this:
public static string SeperateByCamelCase(this string text, char splitChar = ' ') {
var output = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
var c = text[i];
//if not the first and the char is upper
if (i > 0 && char.IsUpper(c)) {
var wasLastLower = char.IsLower(text[i - 1]);
if (i + 1 < text.Length) //is there a next
{
var isNextUpper = char.IsUpper(text[i + 1]);
if (!isNextUpper) //if next is not upper (start of a word).
{
output.Append(splitChar);
}
else if (wasLastLower) //last was lower, I'm upper and my next is upper (start of an acronym): 'abcdHTTP' -> 'abcd HTTP'
{
output.Append(splitChar);
}
}
else
{
//last letter - if it's upper and the previous letter was lower: 'abcdA' -> 'abcd A'
if (wasLastLower)
{
output.Append(splitChar);
}
}
}
output.Append(c);
}
return output.ToString();
}
It passes these tests. It doesn't handle numbers, but I didn't need it to.
[TestMethod()]
public void ToCamelCaseTest()
{
var testData = new string[] { "AAACamel", "AAA", "SplitThisByCamel", "AnA", "doesnothing", "a", "A", "aasdasdAAA" };
var expectedData = new string[] { "AAA Camel", "AAA", "Split This By Camel", "An A", "doesnothing", "a", "A", "aasdasd AAA" };
for (int i = 0; i < testData.Length; i++)
{
var actual = testData[i].SeperateByCamelCase();
var expected = expectedData[i];
Assert.AreEqual(expected, actual);
}
}

#JustSayNoToRegex
Takes a C# identifier, with underscores and numbers, and converts it to a space-separated string.
public static class StringExtensions
{
public static string SplitOnCase(this string identifier)
{
if (identifier == null || identifier.Length == 0) return string.Empty;
var sb = new StringBuilder();
if (identifier.Length == 1) sb.Append(char.ToUpperInvariant(identifier[0]));
else if (identifier.Length == 2) sb.Append(char.ToUpperInvariant(identifier[0])).Append(identifier[1]);
else {
if (identifier[0] != '_') sb.Append(char.ToUpperInvariant(identifier[0]));
for (int i = 1; i < identifier.Length; i++) {
var current = identifier[i];
var previous = identifier[i - 1];
if (current == '_' && previous == '_') continue;
else if (current == '_') {
sb.Append(' ');
}
else if (char.IsLetter(current) && previous == '_') {
sb.Append(char.ToUpperInvariant(current));
}
else if (char.IsDigit(current) && char.IsLetter(previous)) {
sb.Append(' ').Append(current);
}
else if (char.IsLetter(current) && char.IsDigit(previous)) {
sb.Append(' ').Append(char.ToUpperInvariant(current));
}
else if (char.IsUpper(current) && char.IsLower(previous)
&& (i < identifier.Length - 1 && char.IsUpper(identifier[i + 1]) || i == identifier.Length - 1)) {
sb.Append(' ').Append(current);
}
else if (char.IsUpper(current) && i < identifier.Length - 1 && char.IsLower(identifier[i + 1])) {
sb.Append(' ').Append(current);
}
else {
sb.Append(current);
}
}
}
return sb.ToString();
}
}
Tests:
[TestFixture]
static class HelpersTests
{
[Test]
public static void Basic()
{
Assert.AreEqual("Foo", "foo".SplitOnCase());
Assert.AreEqual("Foo", "_foo".SplitOnCase());
Assert.AreEqual("Foo", "__foo".SplitOnCase());
Assert.AreEqual("Foo", "___foo".SplitOnCase());
Assert.AreEqual("Foo 2", "foo2".SplitOnCase());
Assert.AreEqual("Foo 23", "foo23".SplitOnCase());
Assert.AreEqual("Foo 23 A", "foo23A".SplitOnCase());
Assert.AreEqual("Foo 23 Ab", "foo23Ab".SplitOnCase());
Assert.AreEqual("Foo 23 Ab", "foo23_ab".SplitOnCase());
Assert.AreEqual("Foo 23 Ab", "foo23___ab".SplitOnCase());
Assert.AreEqual("Foo 23", "foo__23".SplitOnCase());
Assert.AreEqual("Foo Bar", "Foo_bar".SplitOnCase());
Assert.AreEqual("Foo Bar", "Foo____bar".SplitOnCase());
Assert.AreEqual("AAA", "AAA".SplitOnCase());
Assert.AreEqual("Foo A Aa", "fooAAa".SplitOnCase());
Assert.AreEqual("Foo AAA", "fooAAA".SplitOnCase());
Assert.AreEqual("Foo Bar", "FooBar".SplitOnCase());
Assert.AreEqual("Mn M", "MnM".SplitOnCase());
Assert.AreEqual("AS", "aS".SplitOnCase());
Assert.AreEqual("As", "as".SplitOnCase());
Assert.AreEqual("A", "a".SplitOnCase());
Assert.AreEqual("_", "_".SplitOnCase());
}
}

Simple version similar to some of the above, but with logic to not auto-insert the separator (which is by default, a space, but can be any char) if there's already one at the current position.
Uses a StringBuilder rather than 'mutating' strings.
public static string SeparateCamelCase(this string value, char separator = ' ') {
var sb = new StringBuilder();
var lastChar = separator;
foreach (var currentChar in value) {
if (char.IsUpper(currentChar) && lastChar != separator)
sb.Append(separator);
sb.Append(currentChar);
lastChar = currentChar;
}
return sb.ToString();
}
Example:
Input : 'ThisIsATest'
Output : 'This Is A Test'
Input : 'This IsATest'
Output : 'This Is A Test' (Note: Still only one space between 'This' and 'Is')
Input : 'ThisIsATest' (with separator '_')
Output : 'This_Is_A_Test'

Try this:
using System;
using System.Linq;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
Console
.WriteLine(
SeparateByCamelCase("TestString") == "Test String" // True
);
}
public static string SeparateByCamelCase(string str)
{
return String.Join(" ", SplitByCamelCase(str));
}
public static IEnumerable<string> SplitByCamelCase(string str)
{
if (str.Length == 0)
return new List<string>();
return
new List<string>
{
Head(str)
}
.Concat(
SplitByCamelCase(
Tail(str)
)
);
}
public static string Head(string str)
{
return new String(
str
.Take(1)
.Concat(
str
.Skip(1)
.TakeWhile(IsLower)
)
.ToArray()
);
}
public static string Tail(string str)
{
return new String(
str
.Skip(
Head(str).Length
)
.ToArray()
);
}
public static bool IsLower(char ch)
{
return ch >= 'a' && ch <= 'z';
}
}
See sample online

Related

How to convert camel case to snake case with two capitals next to each other

I am trying to convert camel case to snake case.
Like this:
"LiveKarma" -> "live_karma"
"youGO" -> "you_g_o"
I cannot seem to get the second example working like that. It always outputs 'you_go'. How can I get it to output 'you_g_o'?
My code:
(Regex.Replace(line, "(?<=[a-z0-9])[A-Z]", "_$0", RegexOptions.Compiled)).ToLowerInvariant()
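One possible adjustment, sketched here as an assumption rather than taken from the answers: the lookbehind (?<=[a-z0-9]) only fires after a lowercase letter or digit, so the O in youGO is skipped. Swapping it for (?<!^), which only excludes the start of the string, puts an underscore before every other capital. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the adjusted pattern.
public static class SnakeCaseEveryCapital
{
    // Prefixes every capital except one at the very start with '_', then lower-cases.
    public static string Convert(string line) =>
        Regex.Replace(line, "(?<!^)[A-Z]", "_$0").ToLowerInvariant();
}
```

With this pattern, "LiveKarma" becomes "live_karma" and "youGO" becomes "you_g_o". Acronyms are split letter by letter, which is what the question asks for but may not suit other inputs.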
Here is an extension method that transforms the text into a snake case:
using System;
using System.Text;
public static string ToSnakeCase(this string text)
{
if(text == null) {
throw new ArgumentNullException(nameof(text));
}
if(text.Length < 2) {
return text;
}
var sb = new StringBuilder();
sb.Append(char.ToLowerInvariant(text[0]));
for(int i = 1; i < text.Length; ++i) {
char c = text[i];
if(char.IsUpper(c)) {
sb.Append('_');
sb.Append(char.ToLowerInvariant(c));
} else {
sb.Append(c);
}
}
return sb.ToString();
}
Put it into a static class somewhere (named for example StringExtensions) and use it like this:
string text = "LiveKarma";
string snakeCaseText = text.ToSnakeCase();
// snakeCaseText => "live_karma"
Since the option that converts abbreviations as separate words is not suitable for many, I found a complete solution in the EF Core codebase.
Here are a couple of examples of how the code works:
TestSC -> test_sc
testSC -> test_sc
TestSnakeCase -> test_snake_case
testSnakeCase -> test_snake_case
TestSnakeCase123 -> test_snake_case123
_testSnakeCase123 -> _test_snake_case123
test_SC -> test_sc
I rewrote it a bit so you can copy it as a ready-to-use string extension:
using System;
using System.Globalization;
using System.Text;
namespace Extensions
{
public static class StringExtensions
{
public static string ToSnakeCase(this string text)
{
if (string.IsNullOrEmpty(text))
{
return text;
}
var builder = new StringBuilder(text.Length + Math.Min(2, text.Length / 5));
var previousCategory = default(UnicodeCategory?);
for (var currentIndex = 0; currentIndex < text.Length; currentIndex++)
{
var currentChar = text[currentIndex];
if (currentChar == '_')
{
builder.Append('_');
previousCategory = null;
continue;
}
var currentCategory = char.GetUnicodeCategory(currentChar);
switch (currentCategory)
{
case UnicodeCategory.UppercaseLetter:
case UnicodeCategory.TitlecaseLetter:
if (previousCategory == UnicodeCategory.SpaceSeparator ||
previousCategory == UnicodeCategory.LowercaseLetter ||
previousCategory != UnicodeCategory.DecimalDigitNumber &&
previousCategory != null &&
currentIndex > 0 &&
currentIndex + 1 < text.Length &&
char.IsLower(text[currentIndex + 1]))
{
builder.Append('_');
}
currentChar = char.ToLower(currentChar, CultureInfo.InvariantCulture);
break;
case UnicodeCategory.LowercaseLetter:
case UnicodeCategory.DecimalDigitNumber:
if (previousCategory == UnicodeCategory.SpaceSeparator)
{
builder.Append('_');
}
break;
default:
if (previousCategory != null)
{
previousCategory = UnicodeCategory.SpaceSeparator;
}
continue;
}
builder.Append(currentChar);
previousCategory = currentCategory;
}
return builder.ToString();
}
}
}
You can find the original code here:
https://github.com/efcore/EFCore.NamingConventions/blob/main/EFCore.NamingConventions/Internal/SnakeCaseNameRewriter.cs
UPD 27.04.2022:
Also, you can use the Newtonsoft library if you're looking for a ready-to-use third-party solution. The output is the same as that of the code above.
// using Newtonsoft.Json.Serialization;
var snakeCaseStrategy = new SnakeCaseNamingStrategy();
var snakeCaseResult = snakeCaseStrategy.GetPropertyName(text, false);
A simple LINQ-based solution. No idea whether it's faster or not. It basically ignores consecutive uppercase letters.
public static string ToUnderscoreCase(this string str)
=> string.Concat((str ?? string.Empty).Select((x, i) => i > 0 && i < str.Length - 1 && char.IsUpper(x) && !char.IsUpper(str[i-1]) ? $"_{x}" : x.ToString())).ToLower();
Using the Newtonsoft package:
public static string? ToCamelCase(this string? str) => str is null
? null
: new DefaultContractResolver() { NamingStrategy = new CamelCaseNamingStrategy() }.GetResolvedPropertyName(str);
public static string? ToSnakeCase(this string? str) => str is null
? null
: new DefaultContractResolver() { NamingStrategy = new SnakeCaseNamingStrategy() }.GetResolvedPropertyName(str);
RegEx Solution
A quick internet search turned up this site which has an answer using RegEx, which I had to modify to grab the Value portion in order for it to work on my machine (but it has the RegEx you're looking for). I also modified it to handle null input, rather than throwing an exception:
public static string ToSnakeCase2(string str)
{
var pattern =
new Regex(@"[A-Z]{2,}(?=[A-Z][a-z]+[0-9]*|\b)|[A-Z]?[a-z]+[0-9]*|[A-Z]|[0-9]+");
return str == null
? null
: string
.Join("_", pattern.Matches(str).Cast<Match>().Select(m => m.Value))
.ToLower();
}
Non-RegEx Solution
For a non-regex solution, we can do the following:
1. Reduce each run of whitespace to a single separator by
   - using string.Split with an empty array as the first parameter, which splits on all whitespace
   - joining those parts back together with the '_' character
2. Prefix all upper-case characters with '_' and lower-case them
3. Split and re-join the resulting string on the '_' character to remove any instances of multiple consecutive underscores ("__") and to remove any leading or trailing instances of the character.
For example:
public static string ToSnakeCase(string str)
{
return str == null
? null
: string.Join("_", string.Concat(string.Join("_", str.Split(new char[] {},
StringSplitOptions.RemoveEmptyEntries))
.Select(c => char.IsUpper(c)
? $"_{c}".ToLower()
: $"{c}"))
.Split(new[] {'_'}, StringSplitOptions.RemoveEmptyEntries));
}
A rough sketch is below. In essence: check whether each char is upper case; if it is, add a _, then add the lower-cased char.
var newString = s.Substring(0, 1).ToLower();
foreach (char c in s.Substring(1))
{
if (char.IsUpper(c))
{
newString = newString + "_";
}
newString = newString + char.ToLower(c);
}
If you're into micro-optimizations and want to prevent unnecessary conversions wherever possible, this one might also work:
public static string ToSnakeCase(this string text)
{
static IEnumerable<char> Convert(CharEnumerator e)
{
if (!e.MoveNext()) yield break;
yield return char.ToLower(e.Current);
while (e.MoveNext())
{
if (char.IsUpper(e.Current))
{
yield return '_';
yield return char.ToLower(e.Current);
}
else
{
yield return e.Current;
}
}
}
return new string(Convert(text.GetEnumerator()).ToArray());
}
There is a well maintained EF Core community project that implements a number of naming convention rewriters called EFCore.NamingConventions. The rewriters don't have any internal dependencies, so if you don't want to bring in an EF Core related package you can just copy the rewriter code out.
Here is the snake case rewriter: https://github.com/efcore/EFCore.NamingConventions/blob/main/EFCore.NamingConventions/Internal/SnakeCaseNameRewriter.cs
Might as well toss this one out there. Very simple, and it worked for me.
public static string ToSnakeCase(this string text)
{
text = Regex.Replace(text, "(.)([A-Z][a-z]+)", "$1_$2");
text = Regex.Replace(text, "([a-z0-9])([A-Z])", "$1_$2");
return text.ToLower();
}
Testing it with some samples (borrowed from @GeekInside's answer):
var samples = new List<string>() { "TestSC", "testSC", "TestSnakeCase", "testSnakeCase", "TestSnakeCase123", "_testSnakeCase123", "test_SC" };
var results = new List<string>() { "test_sc", "test_sc", "test_snake_case", "test_snake_case", "test_snake_case123", "_test_snake_case123", "test_sc" };
for (int i = 0; i < samples.Count; i++)
{
Console.WriteLine("Test success: " + (samples[i].ToSnakeCase() == results[i] ? "true" : "false"));
}
Produced the following output:
Test success: true
Test success: true
Test success: true
Test success: true
Test success: true
Test success: true
Test success: true

C# - ignore Split character if it's inside parentheses

I'm writing the following method in C# to parse a CSV file and write values to a SQL Server database.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
public List<Entity> ParseEntityExclusionFile(List<string> entries, string urlFile)
{
entries.RemoveAt(0);
List<Entity> entities = new List<Entity>();
foreach (string line in entries)
{
Entity exclusionEntity = new Entity();
string[] lineParts = line.Split(',').Select(p => p.Trim('\"')).ToArray();
exclusionEntity.Id = 1000;
exclusionEntity.Names = lineParts[3];
exclusionEntity.Identifier = $"{lineParts[25]}" + $" | " + $"Classification: " + $"{lineParts[0]}";
entities.Add(exclusionEntity);
}
return entities;
}
The data in some columns of the csv file are comma-separated values inside a set of parentheses, meant to represent one value for that column. So for any values like that, I need to capture it as one value to go into one field in the database. How would I adjust/add-to the line of code string[] lineParts = line.Split(',').Select(p => p.Trim('\"')).ToArray(); to instruct the application that if it encounters a column with open parenthesis, capture all the data after the open parenthesis, including the commas, until the close parenthesis, all as one value?
EDIT: it seems the Select(p => p.Trim('\"')).ToArray(); part of the above line of code is confusing some folks - don't worry about that part - I just need to know how I would go about adding 'exception' code to create a condition where Split(',') is ignored if the commas happen to be in between a set of parentheses. One field in the csv file looks like this (1,2,3,4) - currently the code parses it as four fields, whereas I need that parsed as one field like 1,2,3,4 OR (1,2,3,4) it actually doesn't matter whether the resulting fields contain the parentheses or not.
EDIT 2: I appreciate the suggestions of using a .NET CSV library - however, everything is working perfectly in this project outside of this one field in the csv file containing a set of parentheses with comma-separated values inside - I feel as though it's a bit overkill to install and configure an entire library, including having to set up new models and properties, just for this one column of data.
Try this code:
public static class Ex
{
private static string Peek(this string source, int peek) => (source == null || peek < 0) ? null : source.Substring(0, source.Length < peek ? source.Length : peek);
private static (string, string) Pop(this string source, int pop) => (source == null || pop < 0) ? (null, source) : (source.Substring(0, source.Length < pop ? source.Length : pop), source.Length < pop ? String.Empty : source.Substring(pop));
public static string[] ParseCsvLine(this string line)
{
return ParseCsvLineImpl(line).ToArray();
IEnumerable<string> ParseCsvLineImpl(string l)
{
string remainder = line;
string field;
while (remainder.Peek(1) != "")
{
(field, remainder) = ParseField(remainder);
yield return field;
}
}
}
private const string DQ = "\"";
private static (string field, string remainder) ParseField(string line)
{
if (line.Peek(1) == DQ)
{
var (_, split) = line.Pop(1);
return ParseFieldQuoted(split);
}
else
{
var field = "";
var (head, tail) = line.Pop(1);
while (head != "," && head != "")
{
field += head;
(head, tail) = tail.Pop(1);
}
return (field, tail);
}
}
private static (string field, string remainder) ParseFieldQuoted(string line)
{
var field = "";
var head = "";
var tail = line;
while (tail.Peek(1) != "" && (tail.Peek(1) != DQ || tail.Peek(2) == DQ + DQ))
{
if (tail.Peek(2) == DQ + DQ)
{
(head, tail) = tail.Pop(2);
field += DQ;
}
else
{
(head, tail) = tail.Pop(1);
field += head;
}
}
if (tail.Peek(2) == DQ + ",")
{
(head, tail) = tail.Pop(2);
}
else if (tail.Peek(1) == DQ)
{
(head, tail) = tail.Pop(1);
}
return (field, tail);
}
}
It handles double-quotes, and double-double-quotes.
You can then do this:
string line = "45,\"23\"\",34\",66"; // 45,"23"",34",66
string[] fields = line.ParseCsvLine();
That produces:
45
23",34
66
Here's an updated version of my code that deals with ( and ) as delimiters. It deals with nested delimiters and treats them as part of the field string.
You would need to remove the " as you see fit - I'm not entirely sure why you are doing this.
Also, this is no longer CSV. Parentheses are not a normal part of CSV. I've changed the name of the method to ParseLine as a result.
public static class Ex
{
private static string Peek(this string source, int peek) => (source == null || peek < 0) ? null : source.Substring(0, source.Length < peek ? source.Length : peek);
private static (string, string) Pop(this string source, int pop) => (source == null || pop < 0) ? (null, source) : (source.Substring(0, source.Length < pop ? source.Length : pop), source.Length < pop ? String.Empty : source.Substring(pop));
public static string[] ParseLine(this string line)
{
return ParseLineImpl(line).ToArray();
IEnumerable<string> ParseLineImpl(string l)
{
string remainder = line;
string field;
while (remainder.Peek(1) != "")
{
(field, remainder) = ParseField(remainder);
yield return field;
}
}
}
private const string GroupOpen = "(";
private const string GroupClose = ")";
private static (string field, string remainder) ParseField(string line)
{
if (line.Peek(1) == GroupOpen)
{
var (_, split) = line.Pop(1);
return ParseFieldQuoted(split);
}
else
{
var field = "";
var (head, tail) = line.Pop(1);
while (head != "," && head != "")
{
field += head;
(head, tail) = tail.Pop(1);
}
return (field, tail);
}
}
private static (string field, string remainder) ParseFieldQuoted(string line) => ParseFieldQuoted(line, false);
private static (string field, string remainder) ParseFieldQuoted(string line, bool isNested)
{
var field = "";
var head = "";
var tail = line;
while (tail.Peek(1) != "" && tail.Peek(1) != GroupClose)
{
if (tail.Peek(1) == GroupOpen)
{
(head, tail) = tail.Pop(1);
(head, tail) = ParseFieldQuoted(tail, true);
field += GroupOpen + head + GroupClose;
}
else
{
(head, tail) = tail.Pop(1);
field += head;
}
}
if (tail.Peek(2) == GroupClose + ",")
{
(head, tail) = tail.Pop(isNested ? 1 : 2);
}
else if (tail.Peek(1) == GroupClose)
{
(head, tail) = tail.Pop(1);
}
return (field, tail);
}
}
It's used like this:
string line = "45,(23(Fo(,,(,)),(\"Bar\")o),34),66"; // 45,(23(Fo(,,(,)),("Bar")o),34),66
string[] fields = line.ParseLine();
Console.WriteLine(fields.All(f => line.Contains(f))); // True == maybe code is right, False == code is WRONG
And it gives me:
45
23(Fo(,,(,)),("Bar")o),34
66
First, calling line.Trim('\"') will not strip "any existing double-quotes"; it will only remove all leading and trailing instances of the '\"' char.
var line = "\"\"example \"goes here\"";
var trimmed = line.Trim('\"');
Console.WriteLine(trimmed); //output: example "goes here
Here's how you strip all of the '\"' char:
var line = "\"\"example \"goes here\"";
var trimmed = string.Join(string.Empty, line.Split('"'));
Console.WriteLine(trimmed); //output: example goes here
Notice you can also nix the escape because the " is inside of single quotes.
I'm assuming that your string inputs look like this:
"OneValue,TwoValue,(OneB,TwoB),FiveValue"
or, if you have quotes (I'm also assuming you won't actually have quotes inside, but we'll solve for that anyway):
"\"OneValue,TwoValue,(OneB,TwoB),FiveValue\"\""
And I'm expecting your final string[] lineparts variable to have the values in this hard declaration after processing:
var lineparts = new string[] { "OneValue", "TwoValue", "OneB,TwoB", "FiveValue" };
The first solution I can think of is to first split by '(', then iterate over the collection, conditionally splitting by ')' or ',', depending on which side of the opening parenthesis the current element is on. Pretty sure this is linear, so that's neat:
const string l = ",(";
const string r = "),";
const char c = ',';
const char a = '"';
var line = "\"One,Two,(OneB,TwoB),Five\"";
line = string.Join(string.Empty, line.Split(a)); //Strip "
var splitL = line.Split(l); //,(
var partsList = new List<string>();
foreach (var value in splitL)
{
if (value.Contains(r))//),
{
//inside of parentheses, so we keep the values before the ),
var splitR = value.Split(r);//),
//I don't like literal indexes, but we know we have at least one element because we have a value.
partsList.Add(splitR[0]);
//Everything else is after the closing parenthesis for this group, and before the parenthesis after that
//so we'll parse it all into different values.
//The literal index is safe here because split always returns two values if any value is found.
partsList.AddRange(splitR[1].Split(c));//,
}
else
{
//before the parentheses, so these are all different values
partsList.AddRange(value.Split(c));//,
}
}
var lineparts = partsList.ToArray();//{ "One", "Two", "OneB,TwoB", "Five" }
Here's a better example of a tighter integration with the code in your question, not considering the specific intended values of your Entity properties or the need to trim for quotations:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
public List<Entity> ParseEntityExclusionFile(List<string> entries, string urlFile)
{
entries.RemoveAt(0);
const char l = '(';
const char r = ')';
const char c = ',';
const char a = '"';
List<Entity> entities = new List<Entity>();
foreach (string line in entries)
{
var splitL = line.Split(l); //(
var partsList = new List<string>();
foreach (var value in splitL)
{
if (value.Contains(r))//)
{
var splitR = value.Split(r);//)
partsList.Add(splitR[0]);
if (!line.EndsWith(r))
{
partsList.AddRange(splitR[1].Remove(0, 1).Split(c));//,
}
}
else
{
if (!line.StartsWith(l))
{
partsList.AddRange(value.Remove(value.Length - 1).Split(c));//,
}
}
}
var lineParts = partsList.ToArray();//{ "One", "Two", "OneB,TwoB", "Five" }
entities.Add(new Entity
{
Id = 1000,
Names = lineParts[3],
Identifier = $"{lineParts[25]} | Classification: {lineParts[0]}"
});
}
return entities;
}
This solution may get hairy if your groups contain other groups, i.e...
"OneValue,TwoValue,(OneB,(TwoB, ThreeB)),SixValue"

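For nested groups such as (OneB,(TwoB, ThreeB)), one option is to stop splitting on fixed delimiters and instead track the parenthesis depth, splitting only on commas seen at depth zero. This is a minimal sketch, assuming the parentheses are balanced and that the outermost parentheses of each group should be dropped while inner ones are kept. `NestedSplitter` and `SplitTopLevel` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NestedSplitter
{
    // Splits on commas that are not enclosed in parentheses.
    public static List<string> SplitTopLevel(string line)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        int depth = 0;

        foreach (char ch in line)
        {
            if (ch == '(')
            {
                // Keep inner parentheses, drop the outermost one.
                if (depth > 0) current.Append(ch);
                depth++;
            }
            else if (ch == ')')
            {
                depth--;
                if (depth < 0) throw new FormatException("Unbalanced ')' in input.");
                if (depth > 0) current.Append(ch);
            }
            else if (ch == ',' && depth == 0)
            {
                // A comma outside every group ends the current value.
                parts.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(ch);
            }
        }

        if (depth != 0) throw new FormatException("Unbalanced '(' in input.");
        parts.Add(current.ToString());
        return parts;
    }
}
```

Calling `NestedSplitter.SplitTopLevel("OneValue,TwoValue,(OneB,(TwoB, ThreeB)),SixValue")` gives four parts, the third being `OneB,(TwoB, ThreeB)`. It is still a single pass over the input.
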
How to split string with date in c#

I have strings that start with a date, and I want to split each one into the date and the remaining text.
For example:
I have string data like this:
9/23/2013/marking abandoned based on notes below/DB
12/8/2012/I think the thid is string/SG
and I want to turn it into:
9/23/2013 marking abandoned based on notes below/DB
12/8/2012 I think the thid is string/SG
I don't know how to split these strings and store the parts in different columns of a table.
Please help.
string[] vals = { "9/23/2013/marking abandoned based on notes below/DB",
"12/8/2012/I think the thid is string/SG" };
var regex = @"(\d{1,2}/\d{1,2}/\d{4})/(.*)";
var matches = vals.Select(val => Regex.Match(val, regex));
foreach (var match in matches)
{
Console.WriteLine ("{0} {1}", match.Groups[1], match.Groups[2]);
}
prints:
9/23/2013 marking abandoned based on notes below/DB
12/8/2012 I think the thid is string/SG
(\d{1,2}/\d{1,2}/\d{4})/(.*) breaks down to
(\d{1,2}/\d{1,2}/\d{4}):
\d{1,2} - matches any one or two digit number
/ - matches to one / symbol
\d{4} - matches to four digit number
(...) - denotes first group
(.*) - matches everything else and creates second group
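Since the goal is to store the date and the text in separate columns, the first group can also be converted to a `DateTime`. This is a minimal sketch, assuming the dates are always in month/day/year order. `DatedNote` and `TrySplit` are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DatedNote
{
    private static readonly Regex Pattern =
        new Regex(@"^(\d{1,2}/\d{1,2}/\d{4})/(.*)$");

    // Returns false when the line does not start with a valid M/d/yyyy date.
    public static bool TrySplit(string line, out DateTime date, out string text)
    {
        date = default(DateTime);
        text = null;

        Match m = Pattern.Match(line);
        if (!m.Success) return false;

        // Assumption: month/day/year order, independent of the machine's culture.
        if (!DateTime.TryParseExact(m.Groups[1].Value, "M/d/yyyy",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out date))
            return false;

        text = m.Groups[2].Value;
        return true;
    }
}
```

The regex alone accepts strings such as 13/40/2013; the `TryParseExact` step rejects them, so only real calendar dates reach the table.
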
Another way to do it with LINQ:
var inputs = new[]{
"9/23/2013/marking abandoned based on notes below/DB",
"12/8/2012/I think the thid is string/SG"
};
foreach (var item in inputs)
{
int counter = 0;
var r = item.Split('/')
.Aggregate("", (a, b) =>
a + ((counter++ == 3) ? "\t" : ((counter == 1) ? "" : "/")) + b);
Console.WriteLine(r);
}
Or you may use the IndexOf and Substring methods:
foreach (var item in inputs)
{
var lastPos =
item.IndexOf('/',
1 + item.IndexOf('/',
1 + item.IndexOf('/')));
if (lastPos != -1)
{
var r = String.Join("\t",
item.Substring(0, lastPos),
item.Substring(lastPos + 1, item.Length - lastPos - 1));
Console.WriteLine(r);
}
}
Perhaps with pure string methods, the third slash separates the date and the text:
string line = "9/23/2013/marking abandoned based on notes below/DB";
int slashIndex = line.IndexOf('/');
if(slashIndex >= 0)
{
int slashCount = 1;
while(slashCount < 3 && slashIndex >= 0)
{
slashIndex = line.IndexOf('/', slashIndex + 1);
if(slashIndex >= 0) slashCount++;
}
if(slashCount == 3)
{
Console.WriteLine("Date:{0} Text: {1}"
, line.Substring(0, slashIndex)
, line.Substring(slashIndex +1));
}
}
For what it's worth, here is an extension method to "break" a string in two at the nth occurrence of a string:
public static class StringExtensions
{
public static string[] BreakOnNthIndexOf(this string input, string value, int breakOn, StringComparison comparison)
{
if (breakOn <= 0)
throw new ArgumentException("breakOn must be greater than 0", "breakOn");
if (value == null) value = " "; // fallback on white-space
int slashIndex = input.IndexOf(value, comparison);
if (slashIndex >= 0)
{
int slashCount = 1;
while (slashCount < breakOn && slashIndex >= 0)
{
slashIndex = input.IndexOf(value, slashIndex + value.Length, comparison);
if (slashIndex >= 0) slashCount++;
}
if (slashCount == breakOn)
{
return new[] {
input.Substring(0, slashIndex),
input.Substring(slashIndex + value.Length)
};
}
}
return new[]{ input };
}
}
Use it in this way:
string line1 = "9/23/2013/marking abandoned based on notes below/DB";
string line2 = "12/8/2012/I think the thid is string/SG";
string[] res1 = line1.BreakOnNthIndexOf("/", 3, StringComparison.OrdinalIgnoreCase);
string[] res2 = line2.BreakOnNthIndexOf("/", 3, StringComparison.OrdinalIgnoreCase);

C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas

In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?
using System ;
using System.Text.RegularExpressions ;
class Example
{
public static void Main ( )
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" ;
Console.WriteLine ( "\nmyString is ...\n\t" + myString + "\n" ) ;
Regex regex = new Regex ( "(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)" ) ;
Match match = regex.Match ( myString ) ;
int j = 0 ;
while ( match.Success )
{
Console.WriteLine ( j++ + " \t" + match ) ;
match = match.NextMatch() ;
}
}
}
Output (in part) appears as follows:
0 cat
1 dog
2 "0 = OFF
3 1 = ON"
4 lion
5 tiger
6 'R = red
7 G = green
8 B = blue'
9 bear
However, desired output is:
0 cat
1 dog
2 0 = OFF, 1 = ON
3 lion
4 tiger
5 R = red, G = green, B = blue
6 bear
Try with this Regex:
"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*
Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
Match matchResults = regexObj.Match(input);
while (matchResults.Success)
{
Console.WriteLine(matchResults.Value);
matchResults = matchResults.NextMatch();
}
Outputs (the pattern can also match empty strings between fields; those matches are omitted here):
cat
dog
"0 = OFF, 1 = ON"
lion
tiger
'R = red, G = green, B = blue'
bear
Note: This regex solution will work for your case, however I recommend you to use a specialized library like FileHelpers.
Why not heed the advice from the experts: don't roll your own CSV parser.
Your first thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.
It's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for CSV files. Yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey, it works.
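A minimal sketch of that approach for a single line, assuming a reference to the Microsoft.VisualBasic assembly is available. `QuotedCsv` and `ParseLine` are hypothetical names. Note that `TextFieldParser` only treats double quotes as field quotes, so the single-quoted value in the question would still be split on its commas.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedCsv
{
    // Parses one CSV line; only double quotes are treated as field quotes.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For example, `QuotedCsv.ParseLine("cat,dog,\"0 = OFF, 1 = ON\",lion")` returns four fields, with the quoted one kept whole and its quotes removed.
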
Ah, RegEx. Now you have two problems. ;)
I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.
This works, for example:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
class Program
{
static void Main(string[] args)
{
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear";
Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
CsvParser parser = new CsvParser(myString);
Int32 lineNumber = 0;
foreach (string s in parser)
{
Console.WriteLine(lineNumber++ + ": " + s); // increment so each value gets its own number
}
Console.ReadKey();
}
}
internal enum TokenType
{
Comma,
Quote,
Value
}
internal class Token
{
public Token(TokenType type, string value)
{
Value = value;
Type = type;
}
public String Value { get; private set; }
public TokenType Type { get; private set; }
}
internal class StreamTokenizer : IEnumerable<Token>
{
private TextReader _reader;
public StreamTokenizer(TextReader reader)
{
_reader = reader;
}
public IEnumerator<Token> GetEnumerator()
{
String line;
StringBuilder value = new StringBuilder();
while ((line = _reader.ReadLine()) != null)
{
foreach (Char c in line)
{
switch (c)
{
case '\'':
case '"':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Quote, c.ToString());
break;
case ',':
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
value.Length = 0;
}
yield return new Token(TokenType.Comma, c.ToString());
break;
default:
value.Append(c);
break;
}
}
// Thanks, dpan
if (value.Length > 0)
{
yield return new Token(TokenType.Value, value.ToString());
}
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
internal class CsvParser : IEnumerable<String>
{
private StreamTokenizer _tokenizer;
public CsvParser(Stream data)
{
_tokenizer = new StreamTokenizer(new StreamReader(data));
}
public CsvParser(String data)
{
_tokenizer = new StreamTokenizer(new StringReader(data));
}
public IEnumerator<string> GetEnumerator()
{
Boolean inQuote = false;
StringBuilder result = new StringBuilder();
foreach (Token token in _tokenizer)
{
switch (token.Type)
{
case TokenType.Comma:
if (inQuote)
{
result.Append(token.Value);
}
else
{
yield return result.ToString();
result.Length = 0;
}
break;
case TokenType.Quote:
// Toggle quote state
inQuote = !inQuote;
break;
case TokenType.Value:
result.Append(token.Value);
break;
default:
throw new InvalidOperationException("Unknown token type: " + token.Type);
}
}
if (result.Length > 0)
{
yield return result.ToString();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
var s = m.Value;
}
As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
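One thing to be aware of: for a quoted field, `m.Value` still includes the surrounding quotes and the doubled inner quotes. A minimal sketch of the post-processing, using the same pattern as above. `CsvLineRegex` and `Parse` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvLineRegex
{
    private static readonly Regex Field =
        new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");

    // Parses one line and returns the unquoted field values.
    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            string s = m.Value;
            // Strip the outer quotes and collapse "" into " for quoted fields.
            if (s.Length >= 2 && s[0] == '"' && s[s.Length - 1] == '"')
                s = s.Substring(1, s.Length - 2).Replace("\"\"", "\"");
            fields.Add(s);
        }
        return fields;
    }
}
```

With the sample line above, this yields nine fields, including the two empty ones, and the third field comes out as Hello, my name is "Joshua".
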
CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of csv parsing (unlikely, the MS one does not) then any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.
CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.
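To illustrate, here is a minimal sketch of such a state machine covering those four cases. It assumes only double quotes act as quote characters and is lenient about malformed input. `CsvStateMachine` and its members are hypothetical names, and this is not a replacement for a tested library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvStateMachine
{
    private enum State { FieldStart, Unquoted, Quoted, QuoteInQuoted }

    // Parses CSV text into records, each record being a list of fields.
    public static List<List<string>> Parse(string text)
    {
        var records = new List<List<string>>();
        var record = new List<string>();
        var field = new StringBuilder();
        var state = State.FieldStart;

        foreach (char c in text)
        {
            switch (state)
            {
                case State.Quoted:
                    if (c == '"') state = State.QuoteInQuoted;
                    else field.Append(c); // commas and newlines are data here
                    continue;
                case State.QuoteInQuoted:
                    if (c == '"')
                    {
                        // "" inside a quoted field is an escaped quote.
                        field.Append('"');
                        state = State.Quoted;
                        continue;
                    }
                    break; // the quote closed the field; handle c below
                case State.FieldStart:
                    if (c == '"') { state = State.Quoted; continue; }
                    break;
            }

            if (c == ',')
            {
                record.Add(field.ToString());
                field.Clear();
                state = State.FieldStart;
            }
            else if (c == '\n')
            {
                record.Add(field.ToString());
                field.Clear();
                records.Add(record);
                record = new List<string>();
                state = State.FieldStart;
            }
            else if (c != '\r')
            {
                field.Append(c);
                state = State.Unquoted;
            }
        }

        // Flush the last record if the text does not end with a newline.
        if (state != State.FieldStart || field.Length > 0 || record.Count > 0)
        {
            record.Add(field.ToString());
            records.Add(record);
        }
        return records;
    }
}
```

Every character is handled exactly once and the only memory is the current state, which is what makes quoted newlines and escaped quotes manageable here and awkward in a single regex.
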
As such you should probably just use someone else's CSV parser. I recommend CSVReader for .Net
Function:
private List<string> ParseDelimitedString (string arguments, char delim = ',')
{
bool inQuotes = false;
bool inNonQuotes = false; //used to trim leading WhiteSpace
List<string> strings = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in arguments)
{
if (c == '\'' || c == '"')
{
if (!inQuotes)
inQuotes = true;
else
inQuotes = false;
}else if (c == delim)
{
if (!inQuotes)
{
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (!char.IsWhiteSpace(c) || inQuotes || inNonQuotes) // skip only leading whitespace of an unquoted value
{
if (!inNonQuotes) inNonQuotes = true;
sb.Append(c);
}
}
strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
return strings;
}
Usage
string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear, text";
List<string> strings = ParseDelimitedString(myString);
foreach( string s in strings )
Console.WriteLine( s );
Output:
cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text
I found a few bugs in that version, for example, a non-quoted string that has a single quote in the value.
And I agree: use the FileHelpers library when you can. However, that library requires you to know what your data will look like, and I need a generic parser.
So I've updated the code to the following and thought I'd share...
static public List<string> ParseDelimitedString(string value, char delimiter)
{
bool inQuotes = false;
bool inNonQuotes = false;
bool secondQuote = false;
char curQuote = '\0';
List<string> results = new List<string>();
StringBuilder sb = new StringBuilder();
foreach (char c in value)
{
if (inNonQuotes)
{
// then quotes are just characters
if (c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inNonQuotes = false;
}
else
{
sb.Append(c);
}
}
else if (inQuotes)
{
// then quotes need to be double escaped
if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
{
if (secondQuote)
{
secondQuote = false;
sb.Append(c);
}
else
secondQuote = true;
}
else if (secondQuote && c == delimiter)
{
results.Add(sb.ToString());
sb.Remove(0, sb.Length);
inQuotes = false;
}
else if (!secondQuote)
{
sb.Append(c);
}
else
{
// bad,as,"user entered something like"this,poorly escaped,value
// just ignore until second delimiter found
}
}
else
{
// not yet parsing a field
if (c == '\'' || c == '"')
{
curQuote = c;
inQuotes = true;
inNonQuotes = false;
secondQuote = false;
}
else if (c == delimiter)
{
// blank field
inQuotes = false;
inNonQuotes = false;
results.Add(string.Empty);
}
else
{
inQuotes = false;
inNonQuotes = true;
sb.Append(c);
}
}
}
if (inQuotes || inNonQuotes)
results.Add(sb.ToString());
return results;
}
Since the question "Regex to to parse csv with nested quotes" redirects here and this one is much more generic, and since a regex is not really the proper way to solve this problem (I have had many issues with catastrophic backtracking: http://www.regular-expressions.info/catastrophic.html), here is a simple parser implementation in Python as well:
def csv_to_array(string):
    stack = []
    match = []
    matches = []
    for c in string:
        # do we have a double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)
    # flush the last field when the input does not end with a newline
    if match:
        matches.append("".join(match))
    return matches

Does C# have built-in support for parsing page-number strings?

Does C# have built-in support for parsing strings of page numbers? By page numbers, I mean the format you might enter into a print dialog that's a mixture of comma and dash-delimited.
Something like this:
1,3,5-10,12
What would be really nice is a solution that gave me back some kind of list of all page numbers represented by the string. In the above example, getting a list back like this would be nice:
1,3,5,6,7,8,9,10,12
I just want to avoid rolling my own if there's an easy way to do it.
Should be simple:
foreach( string s in "1,3,5-10,12".Split(',') )
{
// try and get the number
int num;
if( int.TryParse( s, out num ) )
{
yield return num;
continue; // skip the rest
}
// otherwise we might have a range
// split on the range delimiter
string[] subs = s.Split('-');
int start, end;
// now see if we can parse a start and end
if( subs.Length > 1 &&
int.TryParse(subs[0], out start) &&
int.TryParse(subs[1], out end) &&
end >= start )
{
// create a range between the two values
int rangeLength = end - start + 1;
foreach(int i in Enumerable.Range(start, rangeLength))
{
yield return i;
}
}
}
Edit: thanks for the fix ;-)
It doesn't have a built-in way to do this, but it would be trivial to do using String.Split.
Simply split on ',' then you have a series of strings that represent either page numbers or ranges. Iterate over that series and do a String.Split of '-'. If there isn't a result, it's a plain page number, so stick it in your list of pages. If there is a result, take the left and right of the '-' as the bounds and use a simple for loop to add each page number to your final list over that range.
Can't take but 5 minutes to do, then maybe another 10 to add in some sanity checks to throw errors when the user tries to input invalid data (like "1-2-3" or something.)
Keith's approach seems nice. I put together a more naive approach using lists. This has error checking so hopefully should pick up most problems:-
public List<int> parsePageNumbers(string input) {
if (string.IsNullOrEmpty(input))
throw new InvalidOperationException("Input string is empty.");
var pageNos = input.Split(',');
var ret = new List<int>();
foreach(string pageString in pageNos) {
if (pageString.Contains("-")) {
parsePageRange(ret, pageString);
} else {
ret.Add(parsePageNumber(pageString));
}
}
ret.Sort();
return ret.Distinct().ToList();
}
private int parsePageNumber(string pageString) {
int ret;
if (!int.TryParse(pageString, out ret)) {
throw new InvalidOperationException(
string.Format("Page number '{0}' is not valid.", pageString));
}
return ret;
}
private void parsePageRange(List<int> pageNumbers, string pageNo) {
var pageRange = pageNo.Split('-');
if (pageRange.Length != 2)
throw new InvalidOperationException(
string.Format("Page range '{0}' is not valid.", pageNo));
int startPage = parsePageNumber(pageRange[0]),
endPage = parsePageNumber(pageRange[1]);
if (startPage > endPage) {
throw new InvalidOperationException(
string.Format("Page number {0} is greater than page number {1}" +
" in page range '{2}'", startPage, endPage, pageNo));
}
pageNumbers.AddRange(Enumerable.Range(startPage, endPage - startPage + 1));
}
Below is the code I just put together to do this. You can enter input in a format like 1-2,5abcd,6,7,20-15,,,,,, and it is easy to extend for other formats.
private int[] ParseRange(string ranges)
{
string[] groups = ranges.Split(',');
return groups.SelectMany(t => GetRangeNumbers(t)).ToArray();
}
private int[] GetRangeNumbers(string range)
{
//string justNumbers = new String(text.Where(Char.IsDigit).ToArray());
int[] RangeNums = range
.Split('-')
.Select(t => new String(t.Where(Char.IsDigit).ToArray())) // Digits Only
.Where(t => !string.IsNullOrWhiteSpace(t)) // Only if has a value
.Select(t => int.Parse(t)) // digit to int
.ToArray();
return RangeNums.Length.Equals(2) ? Enumerable.Range(RangeNums.Min(), (RangeNums.Max() + 1) - RangeNums.Min()).ToArray() : RangeNums;
}
Here's something I cooked up for something similar.
It handles the following types of ranges:
1 single number
1-5 range
-5 range from (firstpage) up to 5
5- range from 5 up to (lastpage)
.. can use .. instead of -
;, can use both semicolon, comma, and space, as separators
It does not check for duplicate values, so the set 1,5,-10 will produce the sequence 1, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
public class RangeParser
{
public static IEnumerable<Int32> Parse(String s, Int32 firstPage, Int32 lastPage)
{
String[] parts = s.Split(' ', ';', ',');
Regex reRange = new Regex(@"^\s*((?<from>\d+)|(?<from>\d+)(?<sep>(-|\.\.))(?<to>\d+)|(?<sep>(-|\.\.))(?<to>\d+)|(?<from>\d+)(?<sep>(-|\.\.)))\s*$");
foreach (String part in parts)
{
Match maRange = reRange.Match(part);
if (maRange.Success)
{
Group gFrom = maRange.Groups["from"];
Group gTo = maRange.Groups["to"];
Group gSep = maRange.Groups["sep"];
if (gSep.Success)
{
Int32 from = firstPage;
Int32 to = lastPage;
if (gFrom.Success)
from = Int32.Parse(gFrom.Value);
if (gTo.Success)
to = Int32.Parse(gTo.Value);
for (Int32 page = from; page <= to; page++)
yield return page;
}
else
yield return Int32.Parse(gFrom.Value);
}
}
}
}
You can't be sure until you have test cases. In my case I would prefer the input to be whitespace delimited instead of comma delimited, which makes the parsing a little more complex.
[Fact]
public void ShouldBeAbleToParseRanges()
{
RangeParser.Parse( "1" ).Should().BeEquivalentTo( 1 );
RangeParser.Parse( "-1..2" ).Should().BeEquivalentTo( -1,0,1,2 );
RangeParser.Parse( "-1..2 " ).Should().BeEquivalentTo( -1,0,1,2 );
RangeParser.Parse( "-1..2 5" ).Should().BeEquivalentTo( -1,0,1,2,5 );
RangeParser.Parse( " -1 .. 2 5" ).Should().BeEquivalentTo( -1,0,1,2,5 );
}
Note that Keith's answer (or a small variation of it) will fail the last test, where there is whitespace around the range token. This requires a tokenizer and a proper parser with lookahead.
namespace Utils
{
public class RangeParser
{
public class RangeToken
{
public string Name;
public string Value;
}
public static IEnumerable<RangeToken> Tokenize(string v)
{
var pattern =
@"(?<number>-?[1-9]+[0-9]*)|" +
@"(?<range>\.\.)";
var regex = new Regex( pattern );
var matches = regex.Matches( v );
foreach (Match match in matches)
{
var numberGroup = match.Groups["number"];
if (numberGroup.Success)
{
yield return new RangeToken {Name = "number", Value = numberGroup.Value};
continue;
}
var rangeGroup = match.Groups["range"];
if (rangeGroup.Success)
{
yield return new RangeToken {Name = "range", Value = rangeGroup.Value};
}
}
}
public enum State { Start, Unknown, InRange}
public static IEnumerable<int> Parse(string v)
{
var tokens = Tokenize( v );
var state = State.Start;
var number = 0;
foreach (var token in tokens)
{
switch (token.Name)
{
case "number":
var nextNumber = int.Parse( token.Value );
switch (state)
{
case State.Start:
number = nextNumber;
state = State.Unknown;
break;
case State.Unknown:
yield return number;
number = nextNumber;
break;
case State.InRange:
int rangeLength = nextNumber - number+ 1;
foreach (int i in Enumerable.Range( number, rangeLength ))
{
yield return i;
}
state = State.Start;
break;
default:
throw new ArgumentOutOfRangeException();
}
break;
case "range":
switch (state)
{
case State.Start:
throw new ArgumentOutOfRangeException();
break;
case State.Unknown:
state = State.InRange;
break;
case State.InRange:
throw new ArgumentOutOfRangeException();
break;
default:
throw new ArgumentOutOfRangeException();
}
break;
default:
throw new ArgumentOutOfRangeException( nameof( token ) );
}
}
switch (state)
{
case State.Start:
break;
case State.Unknown:
yield return number;
break;
case State.InRange:
break;
default:
throw new ArgumentOutOfRangeException();
}
}
}
}
One line approach with Split and Linq
string input = "1,3,5-10,12";
IEnumerable<int> result = input.Split(',').SelectMany(x => x.Contains('-') ? Enumerable.Range(int.Parse(x.Split('-')[0]), int.Parse(x.Split('-')[1]) - int.Parse(x.Split('-')[0]) + 1) : new int[] { int.Parse(x) });
Here's a slightly modified version of lassevk's code that handles the string.Split operation inside of the Regex match. It's written as an extension method, and you can easily handle the duplicates problem using the Distinct() extension from LINQ.
/// <summary>
/// Parses a string representing a range of values into a sequence of integers.
/// </summary>
/// <param name="s">String to parse</param>
/// <param name="minValue">Minimum value for open range specifier</param>
/// <param name="maxValue">Maximum value for open range specifier</param>
/// <returns>An enumerable sequence of integers</returns>
/// <remarks>
/// The range is specified as a string in the following forms or combination thereof:
/// 5 single value
/// 1,2,3,4,5 sequence of values
/// 1-5 closed range
/// -5 open range (converted to a sequence from minValue to 5)
/// 1- open range (converted to a sequence from 1 to maxValue)
///
/// The value delimiter can be either ',' or ';' and the range separator can be
/// either '-' or ':'. Whitespace is permitted at any point in the input.
///
/// Any elements of the sequence that contain non-digit, non-whitespace, or non-separator
/// characters or that are empty are ignored and not returned in the output sequence.
/// </remarks>
public static IEnumerable<int> ParseRange2(this string s, int minValue, int maxValue) {
const string pattern = @"(?:^|(?<=[,;])) # match must begin with start of string or delim, where delim is , or ;
\s*( # leading whitespace
(?<from>\d*)\s*(?:-|:)\s*(?<to>\d+) # capture 'from <sep> to' or '<sep> to', where <sep> is - or :
| # or
(?<from>\d+)\s*(?:-|:)\s*(?<to>\d*) # capture 'from <sep> to' or 'from <sep>', where <sep> is - or :
| # or
(?<num>\d+) # capture lone number
)\s* # trailing whitespace
(?:(?=[,;\b])|$) # match must end with end of string or delim, where delim is , or ;";
Regex regx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match m in regx.Matches(s)) {
Group gpNum = m.Groups["num"];
if (gpNum.Success) {
yield return int.Parse(gpNum.Value);
} else {
Group gpFrom = m.Groups["from"];
Group gpTo = m.Groups["to"];
if (gpFrom.Success || gpTo.Success) {
int from = (gpFrom.Success && gpFrom.Value.Length > 0 ? int.Parse(gpFrom.Value) : minValue);
int to = (gpTo.Success && gpTo.Value.Length > 0 ? int.Parse(gpTo.Value) : maxValue);
for (int i = from; i <= to; i++) {
yield return i;
}
}
}
}
}
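The duplicate handling mentioned above can be sketched as follows. The input here stands in for the output of any of the parsers in this thread, for example the sequence 1, 5, 1, 2, ..., 10 produced by the set 1,5,-10. `PageListCleanup` and `Normalize` are hypothetical names, and sorting is an extra assumption that only makes sense when page order does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageListCleanup
{
    // Removes duplicate page numbers and sorts the rest in ascending order.
    public static List<int> Normalize(IEnumerable<int> pages)
    {
        return pages.Distinct().OrderBy(p => p).ToList();
    }
}
```

For instance, `PageListCleanup.Normalize(new[] { 1, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 })` returns the pages 1 through 10, each once.
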
The answer I came up with:
static IEnumerable<string> ParseRange(string str)
{
var numbers = str.Split(',');
foreach (var n in numbers)
{
if (!n.Contains("-"))
yield return n;
else
{
string startStr = String.Join("", n.TakeWhile(c => c != '-'));
int startInt = Int32.Parse(startStr);
string endStr = String.Join("", n.Reverse().TakeWhile(c => c != '-').Reverse());
int endInt = Int32.Parse(endStr);
var range = Enumerable.Range(startInt, endInt - startInt + 1)
.Select(num => num.ToString());
foreach (var s in range)
yield return s;
}
}
}
Regex is not as efficient as the following code. String methods are usually more efficient than Regex and should be preferred when practical.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string[] inputs = {
"001-005/015",
"009/015"
};
foreach (string input in inputs)
{
List<int> numbers = new List<int>();
string[] strNums = input.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string strNum in strNums)
{
if (strNum.Contains("-"))
{
int startNum = int.Parse(strNum.Substring(0, strNum.IndexOf("-")));
int endNum = int.Parse(strNum.Substring(strNum.IndexOf("-") + 1));
for (int i = startNum; i <= endNum; i++)
{
numbers.Add(i);
}
}
else
numbers.Add(int.Parse(strNum));
}
Console.WriteLine(string.Join(",", numbers.Select(x => x.ToString())));
}
Console.ReadLine();
}
}
}
My solution:
return list of integers
reversed/typo/duplicate possible: 1,-3,5-,7-10,12-9 => 1,3,5,7,8,9,10,12,11,10,9 (used when you want to extract, repeat pages)
option to set total of pages: 1,-3,5-,7-10,12-9 (Nmax=9) => 1,3,5,7,8,9,9
autocomplete: 1,-3,5-,8 (Nmax=9) => 1,3,5,6,7,8,9,8
public static List<int> pageRangeToList(string pageRg, int Nmax = 0)
{
List<int> ls = new List<int>();
int lb,ub,i;
foreach (string ss in pageRg.Split(','))
{
if(int.TryParse(ss,out lb)){
ls.Add(Math.Abs(lb));
} else {
var subls = ss.Split('-').ToList();
lb = (int.TryParse(subls[0],out i)) ? i : 0;
ub = (int.TryParse(subls[1],out i)) ? i : Nmax;
ub = ub > 0 ? ub : lb; // if ub=0, take 1 value of lb
for(i=0;i<=Math.Abs(ub-lb);i++)
ls.Add(lb<ub? i+lb : lb-i);
}
}
Nmax = Nmax > 0 ? Nmax : ls.Max(); // real Nmax
return ls.Where(s => s>0 && s<=Nmax).ToList();
}
