How to determine if a File Matches a File Mask? - c#

I need to decide whether file name fits to file mask. The file mask could contain * or ? characters. Is there any simple solution for this?
bool bFits = Fits("myfile.txt", "my*.txt");
private bool Fits(string sFileName, string sFileMask)
{
??? anything simple here ???
}

I appreciate finding Joel's answer--saved me some time as well ! I did, however, have to make a few changes to make the method do what most users would expect:
I removed the 'this' keyword preceding the first argument. It does nothing here (though it could be useful if the method is intended to be an extension method, in which case it needs to be public and contained within a static class and itself be a static method).
I made the regular expression case-independent to match standard Windows wildcard behavior (so e.g. "c*.*" and "C*.*" both return the same result).
I added starting and ending anchors to the regular expression, again to match standard Windows wildcard behavior (so e.g. "stuff.txt" would be matched by "stuff*" or "s*" or "s*.*" but not by just "s").
private bool FitsMask(string fileName, string fileMask)
{
Regex mask = new Regex(
'^' +
fileMask
.Replace(".", "[.]")
.Replace("*", ".*")
.Replace("?", ".")
+ '$',
RegexOptions.IgnoreCase);
return mask.IsMatch(fileName);
}
2009.11.04 Update: Match one of several masks
For even more flexibility, here is a plug-compatible method built on top of the original. This version lets you pass multiple masks (hence the plural on the second parameter name fileMasks) separated by lines, commas, vertical bars, or spaces. I wanted it so that I could let the user put as many choices as desired in a ListBox and then select all files matching any of them. Note that some controls (like a ListBox) use CR-LF for line breaks while others (e.g. RichTextBox) use just LF--that is why both "\r\n" and "\n" show up in the Split list.
private bool FitsOneOfMultipleMasks(string fileName, string fileMasks)
{
return fileMasks
.Split(new string[] {"\r\n", "\n", ",", "|", " "},
StringSplitOptions.RemoveEmptyEntries)
.Any(fileMask => FitsMask(fileName, fileMask));
}
2009.11.17 Update: Handle fileMask inputs more gracefully
The earlier version of FitsMask (which I have left in for comparison) does a fair job but since we are treating it as a regular expression it will throw an exception if it is not a valid regular expression when it comes in. The solution is that we actually want any regex metacharacters in the input fileMask to be considered literals, not metacharacters. But we still need to treat period, asterisk, and question mark specially. So this improved version of FitsMask safely moves these three characters out of the way, transforms all remaining metacharacters into literals, then puts the three interesting characters back, in their "regex'ed" form.
One other minor improvement is to allow for case-independence, per standard Windows behavior.
private bool FitsMask(string fileName, string fileMask)
{
string pattern =
'^' +
Regex.Escape(fileMask.Replace(".", "__DOT__")
.Replace("*", "__STAR__")
.Replace("?", "__QM__"))
.Replace("__DOT__", "[.]")
.Replace("__STAR__", ".*")
.Replace("__QM__", ".")
+ '$';
return new Regex(pattern, RegexOptions.IgnoreCase).IsMatch(fileName);
}
2010.09.30 Update: Somewhere along the way, passion ensued...
I have been remiss in not updating this earlier but these references will likely be of interest to readers who have made it to this point:
I embedded the FitsMask method as the heart of a WinForms user control aptly called a FileMask--see the API here.
I then wrote an article featuring the FileMask control published on Simple-Talk.com, entitled Using LINQ Lambda Expressions to Design Customizable Generic Components. (While the method itself does not use LINQ, the FileMask user control does, hence the title of the article.)

Try this:
private bool FitsMask(string sFileName, string sFileMask)
{
Regex mask = new Regex(sFileMask.Replace(".", "[.]").Replace("*", ".*").Replace("?", "."));
return mask.IsMatch(sFileName);
}

Many people don't know that, but .NET includes an internal class, called "PatternMatcher" (under the "System.IO" namespace).
This static class contains only 1 method:
public static bool StrictMatchPattern(string expression, string name)
This method is used by .net whenever it needs to compare files with wildcard (FileSystemWatcher, GetFiles(), etc)
Using reflector, I exposed the code here.
Didn't really go through it to understand how it works, but it works great,
So this is the code for anyone who doesn't want to work with the inefficient RegEx way:
public static class PatternMatcher
{
// Fields
private const char ANSI_DOS_QM = '<';
private const char ANSI_DOS_STAR = '>';
private const char DOS_DOT = '"';
private const int MATCHES_ARRAY_SIZE = 16;
// Methods
public static bool StrictMatchPattern(string expression, string name)
{
expression = expression.ToLowerInvariant();
name = name.ToLowerInvariant();
int num9;
char ch = '\0';
char ch2 = '\0';
int[] sourceArray = new int[16];
int[] numArray2 = new int[16];
bool flag = false;
if (((name == null) || (name.Length == 0)) || ((expression == null) || (expression.Length == 0)))
{
return false;
}
if (expression.Equals("*") || expression.Equals("*.*"))
{
return true;
}
if ((expression[0] == '*') && (expression.IndexOf('*', 1) == -1))
{
int length = expression.Length - 1;
if ((name.Length >= length) && (string.Compare(expression, 1, name, name.Length - length, length, StringComparison.OrdinalIgnoreCase) == 0))
{
return true;
}
}
sourceArray[0] = 0;
int num7 = 1;
int num = 0;
int num8 = expression.Length * 2;
while (!flag)
{
int num3;
if (num < name.Length)
{
ch = name[num];
num3 = 1;
num++;
}
else
{
flag = true;
if (sourceArray[num7 - 1] == num8)
{
break;
}
}
int index = 0;
int num5 = 0;
int num6 = 0;
while (index < num7)
{
int num2 = (sourceArray[index++] + 1) / 2;
num3 = 0;
Label_00F2:
if (num2 != expression.Length)
{
num2 += num3;
num9 = num2 * 2;
if (num2 == expression.Length)
{
numArray2[num5++] = num8;
}
else
{
ch2 = expression[num2];
num3 = 1;
if (num5 >= 14)
{
int num11 = numArray2.Length * 2;
int[] destinationArray = new int[num11];
Array.Copy(numArray2, destinationArray, numArray2.Length);
numArray2 = destinationArray;
destinationArray = new int[num11];
Array.Copy(sourceArray, destinationArray, sourceArray.Length);
sourceArray = destinationArray;
}
if (ch2 == '*')
{
numArray2[num5++] = num9;
numArray2[num5++] = num9 + 1;
goto Label_00F2;
}
if (ch2 == '>')
{
bool flag2 = false;
if (!flag && (ch == '.'))
{
int num13 = name.Length;
for (int i = num; i < num13; i++)
{
char ch3 = name[i];
num3 = 1;
if (ch3 == '.')
{
flag2 = true;
break;
}
}
}
if ((flag || (ch != '.')) || flag2)
{
numArray2[num5++] = num9;
numArray2[num5++] = num9 + 1;
}
else
{
numArray2[num5++] = num9 + 1;
}
goto Label_00F2;
}
num9 += num3 * 2;
switch (ch2)
{
case '<':
if (flag || (ch == '.'))
{
goto Label_00F2;
}
numArray2[num5++] = num9;
goto Label_028D;
case '"':
if (flag)
{
goto Label_00F2;
}
if (ch == '.')
{
numArray2[num5++] = num9;
goto Label_028D;
}
break;
}
if (!flag)
{
if (ch2 == '?')
{
numArray2[num5++] = num9;
}
else if (ch2 == ch)
{
numArray2[num5++] = num9;
}
}
}
}
Label_028D:
if ((index < num7) && (num6 < num5))
{
while (num6 < num5)
{
int num14 = sourceArray.Length;
while ((index < num14) && (sourceArray[index] < numArray2[num6]))
{
index++;
}
num6++;
}
}
}
if (num5 == 0)
{
return false;
}
int[] numArray4 = sourceArray;
sourceArray = numArray2;
numArray2 = numArray4;
num7 = num5;
}
num9 = sourceArray[num7 - 1];
return (num9 == num8);
}
}

None of these answers quite seem to do the trick, and msorens's is needlessly complex. This one should work just fine:
public static Boolean MatchesMask(string fileName, string fileMask)
{
String convertedMask = "^" + Regex.Escape(fileMask).Replace("\\*", ".*").Replace("\\?", ".") + "$";
Regex regexMask = new Regex(convertedMask, RegexOptions.IgnoreCase);
return regexMask.IsMatch(fileName);
}
This makes sure possible regex chars in the mask are escaped, replaces the \* and \?, and surrounds it all by ^ and $ to mark the boundaries.
Of course, in most situations, it's far more useful to simply make this into a FileMaskToRegex tool function which returns the Regex object, so you just got it once and can then make a loop in which you check all strings from your files list on it.
public static Regex FileMaskToRegex(string fileMask)
{
String convertedMask = "^" + Regex.Escape(fileMask).Replace("\\*", ".*").Replace("\\?", ".") + "$";
return new Regex(convertedMask, RegexOptions.IgnoreCase);
}

Use WildCardPattern class from System.Management.Automation available as NuGet package or in Windows PowerShell SDK.
WildcardPattern pattern = new WildcardPattern("my*.txt");
bool fits = pattern.IsMatch("myfile.txt");

From Windows 7 using P/Invoke (without 260 char count limit):
// UNICODE_STRING for Rtl... method
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct UNICODE_STRING
{
public ushort Length;
public ushort MaximumLength;
[MarshalAs(UnmanagedType.LPWStr)]
string Buffer;
public UNICODE_STRING(string buffer)
{
if (buffer == null)
Length = MaximumLength = 0;
else
Length = MaximumLength = unchecked((ushort)(buffer.Length * 2));
Buffer = buffer;
}
}
// RtlIsNameInExpression method from NtDll.dll system library
public static class NtDll
{
[DllImport("NtDll.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
[return: MarshalAs(UnmanagedType.U1)]
public extern static bool RtlIsNameInExpression(
ref UNICODE_STRING Expression,
ref UNICODE_STRING Name,
[MarshalAs(UnmanagedType.U1)]
bool IgnoreCase,
IntPtr Zero
);
}
public bool MatchMask(string mask, string fileName)
{
// Expression must be uppercase for IgnoreCase == true (see MSDN for RtlIsNameInExpression)
UNICODE_STRING expr = new UNICODE_STRING(mask.ToUpper());
UNICODE_STRING name = new UNICODE_STRING(fileName);
if (NtDll.RtlIsNameInExpression(ref expr, ref name, true, IntPtr.Zero))
{
// MATCHES !!!
}
}

Fastest version of the previously proposed function:
public static bool FitsMasks(string filePath, params string[] fileMasks)
// or
public static Regex FileMasksToRegex(params string[] fileMasks)
{
if (!_maskRegexes.ContainsKey(fileMasks))
{
StringBuilder sb = new StringBuilder("^");
bool first = true;
foreach (string fileMask in fileMasks)
{
if(first) first =false; else sb.Append("|");
sb.Append('(');
foreach (char c in fileMask)
{
switch (c)
{
case '*': sb.Append(#".*"); break;
case '?': sb.Append(#"."); break;
default:
sb.Append(Regex.Escape(c.ToString()));
break;
}
}
sb.Append(')');
}
sb.Append("$");
_maskRegexes[fileMasks] = new Regex(sb.ToString(), RegexOptions.IgnoreCase);
}
return _maskRegexes[fileMasks].IsMatch(filePath);
// or
return _maskRegexes[fileMasks];
}
static readonly Dictionary<string[], Regex> _maskRegexes = new Dictionary<string[], Regex>(/*unordered string[] comparer*/);
Notes:
Re-using Regex objects.
Using StringBuilder to optimize Regex creation (multiple .Replace() calls are slow).
Multiple masks, combined with OR.
Another version returning the Regex.

If PowerShell is available, it has direct support for wildcard type matching (as well as Regex).
WildcardPattern pat = new WildcardPattern("a*.b*");
if (pat.IsMatch(filename)) { ... }

I didn't want to copy the source code and like #frankhommers I came up with a reflection based solution.
Notice the code comment about the use of wildcards in the name argument I found in the reference source.
public static class PatternMatcher
{
static MethodInfo strictMatchPatternMethod;
static PatternMatcher()
{
var typeName = "System.IO.PatternMatcher";
var methodName = "StrictMatchPattern";
var assembly = typeof(Uri).Assembly;
var type = assembly.GetType(typeName, true);
strictMatchPatternMethod = type.GetMethod(methodName, BindingFlags.Static | BindingFlags.Public) ?? throw new MissingMethodException($"{typeName}.{methodName} not found");
}
/// <summary>
/// Tells whether a given name matches the expression given with a strict (i.e. UNIX like) semantics.
/// </summary>
/// <param name="expression">Supplies the input expression to check against</param>
/// <param name="name">Supplies the input name to check for.</param>
/// <returns></returns>
public static bool StrictMatchPattern(string expression, string name)
{
// https://referencesource.microsoft.com/#system/services/io/system/io/PatternMatcher.cs
// If this class is ever exposed for generic use,
// we need to make sure that name doesn't contain wildcards. Currently
// the only component that calls this method is FileSystemWatcher and
// it will never pass a name that contains a wildcard.
if (name.Contains('*')) throw new FormatException("Wildcard not allowed");
return (bool)strictMatchPatternMethod.Invoke(null, new object[] { expression, name });
}
}

For .net Core the way microsoft does.
private bool MatchPattern(ReadOnlySpan<char> relativePath)
{
ReadOnlySpan<char> name = IO.Path.GetFileName(relativePath);
if (name.Length == 0)
return false;
if (Filters.Count == 0)
return true;
foreach (string filter in Filters)
{
if (FileSystemName.MatchesSimpleExpression(filter, name, ignoreCase: !PathInternal.IsCaseSensitive))
return true;
}
return false;
}
The way microsoft itself seemed to do for .NET 4.6 is documented in github:
private bool MatchPattern(string relativePath) {
string name = System.IO.Path.GetFileName(relativePath);
if (name != null)
return PatternMatcher.StrictMatchPattern(filter.ToUpper(CultureInfo.InvariantCulture), name.ToUpper(CultureInfo.InvariantCulture));
else
return false;
}

My version, which supports ** wild card:
static Regex FileMask2Regex(string mask)
{
var sb = new StringBuilder(mask);
// hide wildcards
sb.Replace("**", "affefa0d52e84c2db78f5510117471aa-StarStar");
sb.Replace("*", "affefa0d52e84c2db78f5510117471aa-Star");
sb.Replace("?", "affefa0d52e84c2db78f5510117471aa-Question");
sb.Replace("/", "affefa0d52e84c2db78f5510117471aa-Slash");
sb.Replace("\\", "affefa0d52e84c2db78f5510117471aa-Slash");
sb = new StringBuilder(Regex.Escape(sb.ToString()));
// unhide wildcards
sb.Replace("affefa0d52e84c2db78f5510117471aa-StarStar", #".*");
sb.Replace("affefa0d52e84c2db78f5510117471aa-Star", #"[^/\\]*");
sb.Replace("affefa0d52e84c2db78f5510117471aa-Question", #"[^/\\]");
sb.Replace("affefa0d52e84c2db78f5510117471aa-Slash", #"[/\\]");
sb.Append("$");
// allowed to have prefix
sb.Insert(0, #"^(?:.*?[/\\])?");
return new Regex(sb.ToString(), RegexOptions.IgnoreCase);
}

How about using reflection to get access to the function in the .NET framework?
Like this:
public class PatternMatcher
{
public delegate bool StrictMatchPatternDelegate(string expression, string name);
public StrictMatchPatternDelegate StrictMatchPattern;
public PatternMatcher()
{
Type patternMatcherType = typeof(FileSystemWatcher).Assembly.GetType("System.IO.PatternMatcher");
MethodInfo patternMatchMethod = patternMatcherType.GetMethod("StrictMatchPattern", System.Reflection.BindingFlags.Static | System.Reflection.BindingFlags.Public);
StrictMatchPattern = (expression, name) => (bool)patternMatchMethod.Invoke(null, new object[] { expression, name });
}
}
void Main()
{
PatternMatcher patternMatcher = new PatternMatcher();
Console.WriteLine(patternMatcher.StrictMatchPattern("*.txt", "test.txt")); //displays true
Console.WriteLine(patternMatcher.StrictMatchPattern("*.doc", "test.txt")); //displays false
}

Related

Get Difference Between Two Strings in Terms of Remove and Insert Actions

So I have a text box and on the text changed event I have the old text and the new text, and want to get the difference between them. In this case, I want to be able to recreate the new text with the old text using one remove function and one insert function. That is possible because there are a few possibilities of the change that was in the text box:
Text was only removed (one character or more using selection) - ABCD -> AD
Text was only added (one character or more using paste) - ABCD -> ABXXCD
Text was removed and added (by selecting text and entering text in the same action) - ABCD -> AXD
So I want to have these functions:
Sequence GetRemovedCharacters(string oldText, string newText)
{
}
Sequence GetAddedCharacters(string oldText, string newText)
{
}
My Sequence class:
public class Sequence
{
private int start;
private int end;
public Sequence(int start, int end)
{
StartIndex = start; EndIndex = end;
}
public int StartIndex { get { return start; } set { start = value; Length = end - start + 1; } }
public int EndIndex { get { return end; } set { end = value; Length = end - start + 1; } }
public int Length { get; private set; }
public override string ToString()
{
return "(" + StartIndex + ", " + EndIndex + ")";
}
public static bool operator ==(Sequence a, Sequence b)
{
if(IsNull(a) && IsNull(b))
return true;
else if(IsNull(a) || IsNull(b))
return false;
else
return a.StartIndex == b.StartIndex && a.EndIndex == b.EndIndex;
}
public override bool Equals(object obj)
{
return base.Equals(obj);
}
public static bool operator !=(Sequence a, Sequence b)
{
if(IsNull(a) && IsNull(b))
return false;
else if(IsNull(a) || IsNull(b))
return true;
else
return a.StartIndex != b.StartIndex && a.EndIndex != b.EndIndex;
}
public override int GetHashCode()
{
return base.GetHashCode();
}
static bool IsNull(Sequence sequence)
{
try
{
return sequence.Equals(null);
}
catch(NullReferenceException)
{
return true;
}
}
}
Extra Explanation: I want to know which characters were removed and which characters were added to the text in order to get the new text so I can recreate this. Let's say I have ABCD -> AXD. 'B' and 'C' would be the characters that were removed and 'X' would be the character that was added. So the output from the GetRemovedCharacters function would be (1, 2) and the output from the GetAddedCharacters function would be (1, 1). The output from the GetRemovedCharacters function refers to indexes in the old text and the output from the GetAddedCharacters function refers to indexes in the old text after removing the removed characters.
EDIT: I've thought of a few directions:
This code I created* which returns the sequence that was affected - if characters were removed it returns the sequence of the characters that were removed in the old text; if characters were added it returns the sequence of the characters that were added in the new text. It does not return the right value (which I myself not sure what I want it to be) when removing and adding text.
Maybe the SelectionStart property in the text box could help - the position of the caret after the text was changed.
*
private static Sequence GetChangeSequence(string oldText, string newText)
{
if(newText.Length > oldText.Length)
{
for(int i = 0; i < newText.Length; i++)
if(i == oldText.Length || newText[i] != oldText[i])
return new Sequence(i, i + (newText.Length - oldText.Length) - 1);
return null;
}
else if(newText.Length < oldText.Length)
{
for(int i = 0; i < oldText.Length; i++)
if(i == newText.Length || oldText[i] != newText[i])
return new Sequence(i, i + (oldText.Length - newText.Length) - 1);
return null;
}
else
return null;
}
Thanks.
A simple string comparison wont do the job since you are asking for a algorithm which supports added and removed chars at the same time and is hence not easy to achive in a few lines of code. Id suggest to use a library instead of writing your own comparison algorithm.
Have a look at this project for example.
I quickly threw this together to give you an idea of what I did to solve your question. It doesn't use your classes but it does find an index so it's customizable for you.
There are also obvious limitations to this as it is just bare bones.
This method will spot out changes made to the original string by comparing it to the changed string
// Find the changes made to a string
string StringDiff (string originalString, string changedString)
{
string diffString = "";
// Iterate over the original string
for (int i = 0; i < originalString.Length; i++)
{
// Get the character to search with
char diffChar = originalString[i];
// If found char in the changed string
if (FindInString(diffChar, changedString, out int index))
{
// Remove from the changed string at the index as we don't want to match to this char again
changedString = changedString.Remove(index, 1);
}
// If not found then this is a difference
else
{
// Add to diff string
diffString += diffChar;
}
}
return diffString;
}
This method will return true at the first matching occurrence (an obvious limitation but this is more to give you an idea)
// Find char at first occurence in string
bool FindInString (char c, string search, out int index)
{
index = -1;
// Iterate over search string
for (int i = 0; i < search.Length; i++)
{
// If found then return true with index
if (c == search[i])
{
index = i;
return true;
}
}
return false;
}
This is a simple helper method to show you an example
void SplitStrings(string oldStr, string newStr)
{
Console.WriteLine($"Old : {oldStr}, New: {newStr}");
Console.WriteLine("Removed - " + StringDiff(oldStr, newStr));
Console.WriteLine("Added - " + StringDiff(newStr, oldStr));
}
I've done it.
static void Main(string[] args)
{
while(true)
{
Console.WriteLine("Enter the Old Text");
string oldText = Console.ReadLine();
Console.WriteLine("Enter the New Text");
string newText = Console.ReadLine();
Console.WriteLine("Enter the Caret Position");
int caretPos = int.Parse(Console.ReadLine());
Sequence removed = GetRemovedCharacters(oldText, newText, caretPos);
if(removed != null)
oldText = oldText.Remove(removed.StartIndex, removed.Length);
Sequence added = GetAddedCharacters(oldText, newText, caretPos);
if(added != null)
oldText = oldText.Insert(added.StartIndex, newText.Substring(added.StartIndex, added.Length));
Console.WriteLine("Worked: " + (oldText == newText).ToString());
Console.ReadKey();
Console.Clear();
}
}
static Sequence GetRemovedCharacters(string oldText, string newText, int caretPosition)
{
int startIndex = GetStartIndex(oldText, newText);
if(startIndex != -1)
{
Sequence sequence = new Sequence(startIndex, caretPosition + (oldText.Length - newText.Length) - 1);
if(SequenceValid(sequence))
return sequence;
}
return null;
}
static Sequence GetAddedCharacters(string oldText, string newText, int caretPosition)
{
int startIndex = GetStartIndex(oldText, newText);
if(startIndex != -1)
{
Sequence sequence = new Sequence(GetStartIndex(oldText, newText), caretPosition - 1);
if(SequenceValid(sequence))
return sequence;
}
return null;
}
static int GetStartIndex(string oldText, string newText)
{
for(int i = 0; i < Math.Max(oldText.Length, newText.Length); i++)
if(i >= oldText.Length || i >= newText.Length || oldText[i] != newText[i])
return i;
return -1;
}
static bool SequenceValid(Sequence sequence)
{
return sequence.StartIndex >= 0 && sequence.EndIndex >= 0 && sequence.EndIndex >= sequence.StartIndex;
}

Sort alphanumeric string containing decimal in C#, issue with decimal points [duplicate]

I have a string[] in which every elements ends with some numeric value.
string[] partNumbers = new string[]
{
"ABC10", "ABC1","ABC2", "ABC11","ABC10", "AB1", "AB2", "Ab11"
};
I am trying to sort the above array as follows using LINQ but I am not getting the expected result.
var result = partNumbers.OrderBy(x => x);
Actual Result:
AB1
Ab11
AB2
ABC1
ABC10
ABC10
ABC11
ABC2
Expected Result
AB1
AB2
AB11
..
That is because the default ordering for string is standard alpha numeric dictionary (lexicographic) ordering, and ABC11 will come before ABC2 because ordering always proceeds from left to right.
To get what you want, you need to pad the numeric portion in your order by clause, something like:
var result = partNumbers.OrderBy(x => PadNumbers(x));
where PadNumbers could be defined as:
public static string PadNumbers(string input)
{
return Regex.Replace(input, "[0-9]+", match => match.Value.PadLeft(10, '0'));
}
This pads zeros for any number (or numbers) that appear in the input string so that OrderBy sees:
ABC0000000010
ABC0000000001
...
AB0000000011
The padding only happens on the key used for comparison. The original strings (without padding) are preserved in the result.
Note that this approach assumes a maximum number of digits for numbers in the input.
If you want to sort a list of objects by a specific property using LINQ and a custom comparer like the one by Dave Koelle you would do something like this:
...
items = items.OrderBy(x => x.property, new AlphanumComparator()).ToList();
...
You also have to alter Dave's class to inherit from System.Collections.Generic.IComparer<object> instead of the basic IComparer so the class signature becomes:
...
public class AlphanumComparator : System.Collections.Generic.IComparer<object>
{
...
Personally, I prefer the implementation by James McCormack because it implements IDisposable, though my benchmarking shows that it is slightly slower.
You can use PInvoke to get fast and good result:
class AlphanumericComparer : IComparer<string>
{
[DllImport("shlwapi.dll", CharSet = CharSet.Unicode)]
static extern int StrCmpLogicalW(string s1, string s2);
public int Compare(string x, string y) => StrCmpLogicalW(x, y);
}
You can use it like AlphanumComparatorFast from the answer above.
You can PInvoke to StrCmpLogicalW (the windows function) to do this. See here: Natural Sort Order in C#
public class AlphanumComparatorFast : IComparer
{
List<string> GetList(string s1)
{
List<string> SB1 = new List<string>();
string st1, st2, st3;
st1 = "";
bool flag = char.IsDigit(s1[0]);
foreach (char c in s1)
{
if (flag != char.IsDigit(c) || c=='\'')
{
if(st1!="")
SB1.Add(st1);
st1 = "";
flag = char.IsDigit(c);
}
if (char.IsDigit(c))
{
st1 += c;
}
if (char.IsLetter(c))
{
st1 += c;
}
}
SB1.Add(st1);
return SB1;
}
public int Compare(object x, object y)
{
string s1 = x as string;
if (s1 == null)
{
return 0;
}
string s2 = y as string;
if (s2 == null)
{
return 0;
}
if (s1 == s2)
{
return 0;
}
int len1 = s1.Length;
int len2 = s2.Length;
int marker1 = 0;
int marker2 = 0;
// Walk through two the strings with two markers.
List<string> str1 = GetList(s1);
List<string> str2 = GetList(s2);
while (str1.Count != str2.Count)
{
if (str1.Count < str2.Count)
{
str1.Add("");
}
else
{
str2.Add("");
}
}
int x1 = 0; int res = 0; int x2 = 0; string y2 = "";
bool status = false;
string y1 = ""; bool s1Status = false; bool s2Status = false;
//s1status ==false then string ele int;
//s2status ==false then string ele int;
int result = 0;
for (int i = 0; i < str1.Count && i < str2.Count; i++)
{
status = int.TryParse(str1[i].ToString(), out res);
if (res == 0)
{
y1 = str1[i].ToString();
s1Status = false;
}
else
{
x1 = Convert.ToInt32(str1[i].ToString());
s1Status = true;
}
status = int.TryParse(str2[i].ToString(), out res);
if (res == 0)
{
y2 = str2[i].ToString();
s2Status = false;
}
else
{
x2 = Convert.ToInt32(str2[i].ToString());
s2Status = true;
}
//checking --the data comparision
if(!s2Status && !s1Status ) //both are strings
{
result = str1[i].CompareTo(str2[i]);
}
else if (s2Status && s1Status) //both are intergers
{
if (x1 == x2)
{
if (str1[i].ToString().Length < str2[i].ToString().Length)
{
result = 1;
}
else if (str1[i].ToString().Length > str2[i].ToString().Length)
result = -1;
else
result = 0;
}
else
{
int st1ZeroCount=str1[i].ToString().Trim().Length- str1[i].ToString().TrimStart(new char[]{'0'}).Length;
int st2ZeroCount = str2[i].ToString().Trim().Length - str2[i].ToString().TrimStart(new char[] { '0' }).Length;
if (st1ZeroCount > st2ZeroCount)
result = -1;
else if (st1ZeroCount < st2ZeroCount)
result = 1;
else
result = x1.CompareTo(x2);
}
}
else
{
result = str1[i].CompareTo(str2[i]);
}
if (result == 0)
{
continue;
}
else
break;
}
return result;
}
}
USAGE of this Class:
List<string> marks = new List<string>();
marks.Add("M'00Z1");
marks.Add("M'0A27");
marks.Add("M'00Z0");
marks.Add("0000A27");
marks.Add("100Z0");
string[] Markings = marks.ToArray();
Array.Sort(Markings, new AlphanumComparatorFast());
For those who likes a generic approach, adjust the AlphanumComparator to Dave Koelle : AlphanumComparator slightly.
Step one (I rename the class to non-abbreviated and taking a TCompareType generic type argument):
public class AlphanumericComparator<TCompareType> : IComparer<TCompareType>
The next adjustments is to import the following namespace:
using System.Collections.Generic;
And we change the signature of the Compare method from object to TCompareType:
public int Compare(TCompareType x, TCompareType y)
{ .... no further modifications
Now we can specify the right type for the AlphanumericComparator.
(It should actually be called AlphanumericComparer I think), when we use it.
Example usage from my code:
if (result.SearchResults.Any()) {
result.SearchResults = result.SearchResults.OrderBy(item => item.Code, new AlphanumericComparator<string>()).ToList();
}
Now you have an alphanumeric comparator (comparer) that accepts generic arguments and can be used on different types.
And here is an extension method for using the comparator:
/// <summary>
/// Returns an ordered collection by key selector (property expression) using alpha numeric comparer
/// </summary>
/// <typeparam name="T">The item type in the ienumerable</typeparam>
/// <typeparam name="TKey">The type of the key selector (property to order by)</typeparam>
/// <param name="coll">The source ienumerable</param>
/// <param name="keySelector">The key selector, use a member expression in lambda expression</param>
/// <returns></returns>
public static IEnumerable<T> OrderByMember<T, TKey>(this IEnumerable<T> coll, Func<T, TKey> keySelector)
{
var result = coll.OrderBy(keySelector, new AlphanumericComparer<TKey>());
return result;
}
Well looks like its doing a Lexicographical Ordering irrespective to small or capital chars.
You can try using some custom expression in that lambda to do that.
There's no natural way to do this in .NET, but have a look at this blog post on natural sorting
You could put this into an extension method and use that instead of OrderBy
Looks like Dave Koelle's code link is dead. I got the last version from WebArchive.
/*
* The Alphanum Algorithm is an improved sorting algorithm for strings
* containing numbers. Instead of sorting numbers in ASCII order like
* a standard sort, this algorithm sorts numbers in numeric order.
*
* The Alphanum Algorithm is discussed at http://www.DaveKoelle.com
*
* Based on the Java implementation of Dave Koelle's Alphanum algorithm.
* Contributed by Jonathan Ruckwood <jonathan.ruckwood#gmail.com>
*
* Adapted by Dominik Hurnaus <dominik.hurnaus#gmail.com> to
* - correctly sort words where one word starts with another word
* - have slightly better performance
*
* Released under the MIT License - https://opensource.org/licenses/MIT
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included
* in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
* USE OR OTHER DEALINGS IN THE SOFTWARE.
*
*/
using System;
using System.Collections;
using System.Text;
/*
* Please compare against the latest Java version at http://www.DaveKoelle.com
* to see the most recent modifications
*/
namespace AlphanumComparator
{
public class AlphanumComparator : IComparer
{
private enum ChunkType {Alphanumeric, Numeric};
private bool InChunk(char ch, char otherCh)
{
ChunkType type = ChunkType.Alphanumeric;
if (char.IsDigit(otherCh))
{
type = ChunkType.Numeric;
}
if ((type == ChunkType.Alphanumeric && char.IsDigit(ch))
|| (type == ChunkType.Numeric && !char.IsDigit(ch)))
{
return false;
}
return true;
}
public int Compare(object x, object y)
{
String s1 = x as string;
String s2 = y as string;
if (s1 == null || s2 == null)
{
return 0;
}
int thisMarker = 0, thisNumericChunk = 0;
int thatMarker = 0, thatNumericChunk = 0;
while ((thisMarker < s1.Length) || (thatMarker < s2.Length))
{
if (thisMarker >= s1.Length)
{
return -1;
}
else if (thatMarker >= s2.Length)
{
return 1;
}
char thisCh = s1[thisMarker];
char thatCh = s2[thatMarker];
StringBuilder thisChunk = new StringBuilder();
StringBuilder thatChunk = new StringBuilder();
while ((thisMarker < s1.Length) && (thisChunk.Length==0 ||InChunk(thisCh, thisChunk[0])))
{
thisChunk.Append(thisCh);
thisMarker++;
if (thisMarker < s1.Length)
{
thisCh = s1[thisMarker];
}
}
while ((thatMarker < s2.Length) && (thatChunk.Length==0 ||InChunk(thatCh, thatChunk[0])))
{
thatChunk.Append(thatCh);
thatMarker++;
if (thatMarker < s2.Length)
{
thatCh = s2[thatMarker];
}
}
int result = 0;
// If both chunks contain numeric characters, sort them numerically
if (char.IsDigit(thisChunk[0]) && char.IsDigit(thatChunk[0]))
{
thisNumericChunk = Convert.ToInt32(thisChunk.ToString());
thatNumericChunk = Convert.ToInt32(thatChunk.ToString());
if (thisNumericChunk < thatNumericChunk)
{
result = -1;
}
if (thisNumericChunk > thatNumericChunk)
{
result = 1;
}
}
else
{
result = thisChunk.ToString().CompareTo(thatChunk.ToString());
}
if (result != 0)
{
return result;
}
}
return 0;
}
}
}
Since the number of characters at the beginning is variable, a regular expression would help:
var re = new Regex(#"\d+$"); // finds the consecutive digits at the end of the string
var result = partNumbers.OrderBy(x => int.Parse(re.Match(x).Value));
If there were a fixed number of prefix characters, then you could use the Substring method to extract starting from the relevant characters:
// parses the string as a number starting from the 5th character
var result = partNumbers.OrderBy(x => int.Parse(x.Substring(4)));
If the numbers might contain a decimal separator or thousands separator, then the regular expression needs to allow those characters as well:
var re = new Regex(#"[\d,]*\.?\d+$");
var result = partNumbers.OrderBy(x => double.Parse(x.Substring(4)));
If the string returned by the regular expression or Substring might be unparseable by int.Parse / double.Parse, then use the relevant TryParse variant:
var re = new Regex(#"\d+$"); // finds the consecutive digits at the end of the string
var result = partNumbers.OrderBy(x => {
int? parsed = null;
if (int.TryParse(re.Match(x).Value, out var temp)) {
parsed = temp;
}
return parsed;
});
Just extending #Nathan's answer here.
var maxStringLength = partNumbers.Max(x => x).Count();
var result = partNumbers.OrderBy(x => PadNumbers(x, maxStringLength));
Then pass the param to the PadNumbers function will be dynamic.
public static string PadNumbers(string input, int maxStringLength)
{
return Regex.Replace(input, "[0-9]+", match => match.Value.PadLeft(maxStringLength, '0'));
}
I don´t know how to do that in LINQ, but maybe you like this way to:
Array.Sort(partNumbers, new AlphanumComparatorFast());
// Display the results
foreach (string h in partNumbers )
{
Console.WriteLine(h);
}

Determine if string has all unique characters

I'm working through an algorithm problem set which poses the following question:
"Determine if a string has all unique characters. Assume you can only use arrays".
I have a working solution, but I would like to see if there is anything better optimized in terms of time complexity. I do not want to use LINQ. Appreciate any help you can provide!
static void Main(string[] args)
{
FindDupes("crocodile");
}
static string FindDupes(string text)
{
if (text.Length == 0 || text.Length > 256)
{
Console.WriteLine("String is either empty or too long");
}
char[] str = new char[text.Length];
char[] output = new char[text.Length];
int strLength = 0;
int outputLength = 0;
foreach (char value in text)
{
bool dupe = false;
for (int i = 0; i < strLength; i++)
{
if (value == str[i])
{
dupe = true;
break;
}
}
if (!dupe)
{
str[strLength] = value;
strLength++;
output[outputLength] = value;
outputLength++;
}
}
return new string(output, 0, outputLength);
}
If time complexity is all you care about you could map the characters to int values, then have an array of bool values which remember if you've seen a particular character value previously.
Something like ... [not tested]
bool[] array = new bool[256]; // or larger for Unicode
foreach (char value in text)
if (array[(int)value])
return false;
else
array[(int)value] = true;
return true;
try this,
string RemoveDuplicateChars(string key)
{
string table = string.Empty;
string result = string.Empty;
foreach (char value in key)
{
if (table.IndexOf(value) == -1)
{
table += value;
result += value;
}
}
return result;
}
usage
Console.WriteLine(RemoveDuplicateChars("hello"));
Console.WriteLine(RemoveDuplicateChars("helo"));
Console.WriteLine(RemoveDuplicateChars("Crocodile"));
output
helo
helo
Crocdile
public boolean ifUnique(String toCheck){
String str="";
for(int i=0;i<toCheck.length();i++)
{
if(str.contains(""+toCheck.charAt(i)))
return false;
str+=toCheck.charAt(i);
}
return true;
}
EDIT:
You may also consider to omit the boundary case where toCheck is an empty string.
The following code works:
static void Main(string[] args)
{
isUniqueChart("text");
Console.ReadKey();
}
static Boolean isUniqueChart(string text)
{
if (text.Length == 0 || text.Length > 256)
{
Console.WriteLine(" The text is empty or too larg");
return false;
}
Boolean[] char_set = new Boolean[256];
for (int i = 0; i < text.Length; i++)
{
int val = text[i];//already found this char in the string
if (char_set[val])
{
Console.WriteLine(" The text is not unique");
return false;
}
char_set[val] = true;
}
Console.WriteLine(" The text is unique");
return true;
}
If the string has only lower case letters (a-z) or only upper case letters (A-Z) you can use a very optimized O(1) solution.Also O(1) space.
c++ code :
bool checkUnique(string s){
if(s.size() >26)
return false;
int unique=0;
for (int i = 0; i < s.size(); ++i) {
int j= s[i]-'a';
if(unique & (1<<j)>0)
return false;
unique=unique|(1<<j);
}
return true;
}
Remove Duplicates in entire Unicode Range
Not all characters can be represented by a single C# char. If you need to take into account combining characters and extended unicode characters, you need to:
parse the characters using StringInfo
normalize the characters
find duplicates amongst the normalized strings
Code to remove duplicate characters:
We keep track of the entropy, storing the normalized characters (each character is a string, because many characters require more than 1 C# char). In case a character (normalized) is not yet stored in the entropy, we append the character (in specified form) to the output.
public static class StringExtension
{
public static string RemoveDuplicateChars(this string text)
{
var output = new StringBuilder();
var entropy = new HashSet<string>();
var iterator = StringInfo.GetTextElementEnumerator(text);
while (iterator.MoveNext())
{
var character = iterator.GetTextElement();
if (entropy.Add(character.Normalize()))
{
output.Append(character);
}
}
return output.ToString();
}
}
Unit Test:
Let's test a string that contains variations on the letter A, including the Angstrom sign Å. The Angstrom sign has unicode codepoint u212B, but can also be constructed as the letter A with the diacritic u030A. Both represent the same character.
// ÅÅAaA
var input = "\u212BA\u030AAaA";
// ÅAa
var output = input.RemoveDuplicateChars();
Further extensions could allow for a selector function that determines how to normalize characters. For instance the selector (x) => x.ToUpperInvariant().Normalize() would allow for case-insensitive duplicate removal.
public static bool CheckUnique(string str)
{
int accumulator = 0;
foreach (int asciiCode in str)
{
int shiftedBit = 1 << (asciiCode - ' ');
if ((accumulator & shiftedBit) > 0)
return false;
accumulator |= shiftedBit;
}
return true;
}

Alphanumeric sorting using LINQ

I have a string[] in which every elements ends with some numeric value.
string[] partNumbers = new string[]
{
"ABC10", "ABC1","ABC2", "ABC11","ABC10", "AB1", "AB2", "Ab11"
};
I am trying to sort the above array as follows using LINQ but I am not getting the expected result.
var result = partNumbers.OrderBy(x => x);
Actual Result:
AB1
Ab11
AB2
ABC1
ABC10
ABC10
ABC11
ABC2
Expected Result
AB1
AB2
AB11
..
That is because the default ordering for string is standard alpha numeric dictionary (lexicographic) ordering, and ABC11 will come before ABC2 because ordering always proceeds from left to right.
To get what you want, you need to pad the numeric portion in your order by clause, something like:
var result = partNumbers.OrderBy(x => PadNumbers(x));
where PadNumbers could be defined as:
public static string PadNumbers(string input)
{
return Regex.Replace(input, "[0-9]+", match => match.Value.PadLeft(10, '0'));
}
This pads zeros for any number (or numbers) that appear in the input string so that OrderBy sees:
ABC0000000010
ABC0000000001
...
AB0000000011
The padding only happens on the key used for comparison. The original strings (without padding) are preserved in the result.
Note that this approach assumes a maximum number of digits for numbers in the input.
If you want to sort a list of objects by a specific property using LINQ and a custom comparer like the one by Dave Koelle you would do something like this:
...
items = items.OrderBy(x => x.property, new AlphanumComparator()).ToList();
...
You also have to alter Dave's class to inherit from System.Collections.Generic.IComparer<object> instead of the basic IComparer so the class signature becomes:
...
public class AlphanumComparator : System.Collections.Generic.IComparer<object>
{
...
Personally, I prefer the implementation by James McCormack because it implements IDisposable, though my benchmarking shows that it is slightly slower.
You can use PInvoke to get fast and good result:
class AlphanumericComparer : IComparer<string>
{
[DllImport("shlwapi.dll", CharSet = CharSet.Unicode)]
static extern int StrCmpLogicalW(string s1, string s2);
public int Compare(string x, string y) => StrCmpLogicalW(x, y);
}
You can use it like AlphanumComparatorFast from the answer above.
You can PInvoke to StrCmpLogicalW (the windows function) to do this. See here: Natural Sort Order in C#
public class AlphanumComparatorFast : IComparer
{
List<string> GetList(string s1)
{
List<string> SB1 = new List<string>();
string st1, st2, st3;
st1 = "";
bool flag = char.IsDigit(s1[0]);
foreach (char c in s1)
{
if (flag != char.IsDigit(c) || c=='\'')
{
if(st1!="")
SB1.Add(st1);
st1 = "";
flag = char.IsDigit(c);
}
if (char.IsDigit(c))
{
st1 += c;
}
if (char.IsLetter(c))
{
st1 += c;
}
}
SB1.Add(st1);
return SB1;
}
public int Compare(object x, object y)
{
string s1 = x as string;
if (s1 == null)
{
return 0;
}
string s2 = y as string;
if (s2 == null)
{
return 0;
}
if (s1 == s2)
{
return 0;
}
int len1 = s1.Length;
int len2 = s2.Length;
int marker1 = 0;
int marker2 = 0;
// Walk through two the strings with two markers.
List<string> str1 = GetList(s1);
List<string> str2 = GetList(s2);
while (str1.Count != str2.Count)
{
if (str1.Count < str2.Count)
{
str1.Add("");
}
else
{
str2.Add("");
}
}
int x1 = 0; int res = 0; int x2 = 0; string y2 = "";
bool status = false;
string y1 = ""; bool s1Status = false; bool s2Status = false;
//s1status ==false then string ele int;
//s2status ==false then string ele int;
int result = 0;
for (int i = 0; i < str1.Count && i < str2.Count; i++)
{
status = int.TryParse(str1[i].ToString(), out res);
if (res == 0)
{
y1 = str1[i].ToString();
s1Status = false;
}
else
{
x1 = Convert.ToInt32(str1[i].ToString());
s1Status = true;
}
status = int.TryParse(str2[i].ToString(), out res);
if (res == 0)
{
y2 = str2[i].ToString();
s2Status = false;
}
else
{
x2 = Convert.ToInt32(str2[i].ToString());
s2Status = true;
}
//checking --the data comparision
if(!s2Status && !s1Status ) //both are strings
{
result = str1[i].CompareTo(str2[i]);
}
else if (s2Status && s1Status) //both are intergers
{
if (x1 == x2)
{
if (str1[i].ToString().Length < str2[i].ToString().Length)
{
result = 1;
}
else if (str1[i].ToString().Length > str2[i].ToString().Length)
result = -1;
else
result = 0;
}
else
{
int st1ZeroCount=str1[i].ToString().Trim().Length- str1[i].ToString().TrimStart(new char[]{'0'}).Length;
int st2ZeroCount = str2[i].ToString().Trim().Length - str2[i].ToString().TrimStart(new char[] { '0' }).Length;
if (st1ZeroCount > st2ZeroCount)
result = -1;
else if (st1ZeroCount < st2ZeroCount)
result = 1;
else
result = x1.CompareTo(x2);
}
}
else
{
result = str1[i].CompareTo(str2[i]);
}
if (result == 0)
{
continue;
}
else
break;
}
return result;
}
}
USAGE of this Class:
List<string> marks = new List<string>();
marks.Add("M'00Z1");
marks.Add("M'0A27");
marks.Add("M'00Z0");
marks.Add("0000A27");
marks.Add("100Z0");
string[] Markings = marks.ToArray();
Array.Sort(Markings, new AlphanumComparatorFast());
For those who likes a generic approach, adjust the AlphanumComparator to Dave Koelle : AlphanumComparator slightly.
Step one (I rename the class to non-abbreviated and taking a TCompareType generic type argument):
public class AlphanumericComparator<TCompareType> : IComparer<TCompareType>
The next adjustments is to import the following namespace:
using System.Collections.Generic;
And we change the signature of the Compare method from object to TCompareType:
public int Compare(TCompareType x, TCompareType y)
{ .... no further modifications
Now we can specify the right type for the AlphanumericComparator.
(It should actually be called AlphanumericComparer I think), when we use it.
Example usage from my code:
if (result.SearchResults.Any()) {
result.SearchResults = result.SearchResults.OrderBy(item => item.Code, new AlphanumericComparator<string>()).ToList();
}
Now you have an alphanumeric comparator (comparer) that accepts generic arguments and can be used on different types.
And here is an extension method for using the comparator:
/// <summary>
/// Returns an ordered collection by key selector (property expression) using alpha numeric comparer
/// </summary>
/// <typeparam name="T">The item type in the ienumerable</typeparam>
/// <typeparam name="TKey">The type of the key selector (property to order by)</typeparam>
/// <param name="coll">The source ienumerable</param>
/// <param name="keySelector">The key selector, use a member expression in lambda expression</param>
/// <returns></returns>
public static IEnumerable<T> OrderByMember<T, TKey>(this IEnumerable<T> coll, Func<T, TKey> keySelector)
{
var result = coll.OrderBy(keySelector, new AlphanumericComparer<TKey>());
return result;
}
Well looks like its doing a Lexicographical Ordering irrespective to small or capital chars.
You can try using some custom expression in that lambda to do that.
There's no natural way to do this in .NET, but have a look at this blog post on natural sorting
You could put this into an extension method and use that instead of OrderBy
Looks like Dave Koelle's code link is dead. I got the last version from WebArchive.
/*
* The Alphanum Algorithm is an improved sorting algorithm for strings
* containing numbers. Instead of sorting numbers in ASCII order like
* a standard sort, this algorithm sorts numbers in numeric order.
*
* The Alphanum Algorithm is discussed at http://www.DaveKoelle.com
*
* Based on the Java implementation of Dave Koelle's Alphanum algorithm.
* Contributed by Jonathan Ruckwood <jonathan.ruckwood#gmail.com>
*
* Adapted by Dominik Hurnaus <dominik.hurnaus#gmail.com> to
* - correctly sort words where one word starts with another word
* - have slightly better performance
*
* Released under the MIT License - https://opensource.org/licenses/MIT
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included
* in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
* DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
* OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
* USE OR OTHER DEALINGS IN THE SOFTWARE.
*
*/
using System;
using System.Collections;
using System.Text;
/*
* Please compare against the latest Java version at http://www.DaveKoelle.com
* to see the most recent modifications
*/
namespace AlphanumComparator
{
public class AlphanumComparator : IComparer
{
private enum ChunkType {Alphanumeric, Numeric};
private bool InChunk(char ch, char otherCh)
{
ChunkType type = ChunkType.Alphanumeric;
if (char.IsDigit(otherCh))
{
type = ChunkType.Numeric;
}
if ((type == ChunkType.Alphanumeric && char.IsDigit(ch))
|| (type == ChunkType.Numeric && !char.IsDigit(ch)))
{
return false;
}
return true;
}
public int Compare(object x, object y)
{
String s1 = x as string;
String s2 = y as string;
if (s1 == null || s2 == null)
{
return 0;
}
int thisMarker = 0, thisNumericChunk = 0;
int thatMarker = 0, thatNumericChunk = 0;
while ((thisMarker < s1.Length) || (thatMarker < s2.Length))
{
if (thisMarker >= s1.Length)
{
return -1;
}
else if (thatMarker >= s2.Length)
{
return 1;
}
char thisCh = s1[thisMarker];
char thatCh = s2[thatMarker];
StringBuilder thisChunk = new StringBuilder();
StringBuilder thatChunk = new StringBuilder();
while ((thisMarker < s1.Length) && (thisChunk.Length==0 ||InChunk(thisCh, thisChunk[0])))
{
thisChunk.Append(thisCh);
thisMarker++;
if (thisMarker < s1.Length)
{
thisCh = s1[thisMarker];
}
}
while ((thatMarker < s2.Length) && (thatChunk.Length==0 ||InChunk(thatCh, thatChunk[0])))
{
thatChunk.Append(thatCh);
thatMarker++;
if (thatMarker < s2.Length)
{
thatCh = s2[thatMarker];
}
}
int result = 0;
// If both chunks contain numeric characters, sort them numerically
if (char.IsDigit(thisChunk[0]) && char.IsDigit(thatChunk[0]))
{
thisNumericChunk = Convert.ToInt32(thisChunk.ToString());
thatNumericChunk = Convert.ToInt32(thatChunk.ToString());
if (thisNumericChunk < thatNumericChunk)
{
result = -1;
}
if (thisNumericChunk > thatNumericChunk)
{
result = 1;
}
}
else
{
result = thisChunk.ToString().CompareTo(thatChunk.ToString());
}
if (result != 0)
{
return result;
}
}
return 0;
}
}
}
Since the number of characters at the beginning is variable, a regular expression would help:
var re = new Regex(#"\d+$"); // finds the consecutive digits at the end of the string
var result = partNumbers.OrderBy(x => int.Parse(re.Match(x).Value));
If there were a fixed number of prefix characters, then you could use the Substring method to extract starting from the relevant characters:
// parses the string as a number starting from the 5th character
var result = partNumbers.OrderBy(x => int.Parse(x.Substring(4)));
If the numbers might contain a decimal separator or thousands separator, then the regular expression needs to allow those characters as well:
var re = new Regex(#"[\d,]*\.?\d+$");
var result = partNumbers.OrderBy(x => double.Parse(x.Substring(4)));
If the string returned by the regular expression or Substring might be unparseable by int.Parse / double.Parse, then use the relevant TryParse variant:
var re = new Regex(#"\d+$"); // finds the consecutive digits at the end of the string
var result = partNumbers.OrderBy(x => {
int? parsed = null;
if (int.TryParse(re.Match(x).Value, out var temp)) {
parsed = temp;
}
return parsed;
});
Just extending #Nathan's answer here.
var maxStringLength = partNumbers.Max(x => x).Count();
var result = partNumbers.OrderBy(x => PadNumbers(x, maxStringLength));
Then pass the param to the PadNumbers function will be dynamic.
public static string PadNumbers(string input, int maxStringLength)
{
return Regex.Replace(input, "[0-9]+", match => match.Value.PadLeft(maxStringLength, '0'));
}
I don´t know how to do that in LINQ, but maybe you like this way to:
Array.Sort(partNumbers, new AlphanumComparatorFast());
// Display the results
foreach (string h in partNumbers )
{
Console.WriteLine(h);
}

Algorithm for expanding a character set?

Are there any ready-made functions for expanding a C# regex-style character set?
For example, expand("a-z1") would return a string containing all the characters a to z, followed by the number 1.
Here's what I've got so far:
public static string ExpandCharacterSet(string set)
{
var sb = new StringBuilder();
int start = 0;
while (start < set.Length - 1)
{
int dash = set.IndexOf('-', start + 1);
if (dash <= 0 || dash >= set.Length - 1)
break;
sb.Append(set.Substring(start, dash - start - 1));
char a = set[dash - 1];
char z = set[dash + 1];
for (var i = a; i <= z; ++i)
sb.Append(i);
start = dash + 2;
}
sb.Append(set.Substring(start));
return sb.ToString();
}
Is there anything I'm overlooking?
PS: Let's ignore negative character sets for now.
Thought my example was quite clear... let's try that again. This is what I want:
ExpandCharacterSet("a-fA-F0-9") == "abcdefABCDEF0123456789"
It took a bit of work to get this but here's what I was able to muster. Of course this is not going to be portable since I'm messing with internals. But it works well enough for simple test cases. It will accept any regex character class but will not work for negated classes. The range of values is way too broad without any restrictions. I don't know if it will be correct for all cases and it doesn't handle repetition at all but it's a start. At least you won't have to roll out your own parser. As of .NET Framework 4.0:
public static class RegexHelper
{
public static string ExpandCharClass(string charClass)
{
var regexParser = new RegexParser(CultureInfo.CurrentCulture);
regexParser.SetPattern(charClass);
var regexCharClass = regexParser.ScanCharClass(false);
int count = regexCharClass.RangeCount();
List<string> ranges = new List<string>();
// range 0 can be skipped
for (int i = 1; i < count; i++)
{
var range = regexCharClass.GetRangeAt(i);
ranges.Add(ExpandRange(range));
}
return String.Concat(ranges);
}
static string ExpandRange(SingleRange range)
{
char first = range._first;
char last = range._last;
return String.Concat(Enumerable.Range(first, last - first + 1).Select(i => (char)i));
}
internal class RegexParser
{
static readonly Type RegexParserType;
static readonly ConstructorInfo RegexParser_Ctor;
static readonly MethodInfo RegexParser_SetPattern;
static readonly MethodInfo RegexParser_ScanCharClass;
static RegexParser()
{
RegexParserType = Assembly.GetAssembly(typeof(Regex)).GetType("System.Text.RegularExpressions.RegexParser");
var flags = BindingFlags.NonPublic | BindingFlags.Instance;
RegexParser_Ctor = RegexParserType.GetConstructor(flags, null, new[] { typeof(CultureInfo) }, null);
RegexParser_SetPattern = RegexParserType.GetMethod("SetPattern", flags, null, new[] { typeof(String) }, null);
RegexParser_ScanCharClass = RegexParserType.GetMethod("ScanCharClass", flags, null, new[] { typeof(Boolean) }, null);
}
private readonly object instance;
internal RegexParser(CultureInfo culture)
{
instance = RegexParser_Ctor.Invoke(new object[] { culture });
}
internal void SetPattern(string pattern)
{
RegexParser_SetPattern.Invoke(instance, new object[] { pattern });
}
internal RegexCharClass ScanCharClass(bool caseInsensitive)
{
return new RegexCharClass(RegexParser_ScanCharClass.Invoke(instance, new object[] { caseInsensitive }));
}
}
internal class RegexCharClass
{
static readonly Type RegexCharClassType;
static readonly MethodInfo RegexCharClass_RangeCount;
static readonly MethodInfo RegexCharClass_GetRangeAt;
static RegexCharClass()
{
RegexCharClassType = Assembly.GetAssembly(typeof(Regex)).GetType("System.Text.RegularExpressions.RegexCharClass");
var flags = BindingFlags.NonPublic | BindingFlags.Instance;
RegexCharClass_RangeCount = RegexCharClassType.GetMethod("RangeCount", flags, null, new Type[] { }, null);
RegexCharClass_GetRangeAt = RegexCharClassType.GetMethod("GetRangeAt", flags, null, new[] { typeof(Int32) }, null);
}
private readonly object instance;
internal RegexCharClass(object regexCharClass)
{
if (regexCharClass == null)
throw new ArgumentNullException("regexCharClass");
if (regexCharClass.GetType() != RegexCharClassType)
throw new ArgumentException("not an instance of a RegexCharClass object", "regexCharClass");
instance = regexCharClass;
}
internal int RangeCount()
{
return (int)RegexCharClass_RangeCount.Invoke(instance, new object[] { });
}
internal SingleRange GetRangeAt(int i)
{
return new SingleRange(RegexCharClass_GetRangeAt.Invoke(instance, new object[] { i }));
}
}
internal struct SingleRange
{
static readonly Type RegexCharClassSingleRangeType;
static readonly FieldInfo SingleRange_first;
static readonly FieldInfo SingleRange_last;
static SingleRange()
{
RegexCharClassSingleRangeType = Assembly.GetAssembly(typeof(Regex)).GetType("System.Text.RegularExpressions.RegexCharClass+SingleRange");
var flags = BindingFlags.NonPublic | BindingFlags.Instance;
SingleRange_first = RegexCharClassSingleRangeType.GetField("_first", flags);
SingleRange_last = RegexCharClassSingleRangeType.GetField("_last", flags);
}
internal char _first;
internal char _last;
internal SingleRange(object singleRange)
{
if (singleRange == null)
throw new ArgumentNullException("singleRange");
if (singleRange.GetType() != RegexCharClassSingleRangeType)
throw new ArgumentException("not an instance of a SingleRange object", "singleRange");
_first = (char)SingleRange_first.GetValue(singleRange);
_last = (char)SingleRange_last.GetValue(singleRange);
}
}
}
// usage:
RegexHelper.ExpandCharClass(#"[\-a-zA-F1 5-9]");
// "-abcdefghijklmnopqrstuvwxyzABCDEF1 56789"
Seems like a pretty unusual requirement, but since there are only about 96 characters that you can match (unless you include high chars), you might as well just test your regular expression against all of them, and output the matches:
public static string expando(string input_re) {
// add more chars in s as needed, such as ,.?/|=+_-éñ etc.
string s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
string output = "";
Regex exp = new Regex(input_re);
for (int i = 0; i < s.Length; i++) {
if (exp.IsMatch(s.Substring(i, 1))) {
output += s[i];
}
}
return output;
}
By using an actual regex to determine your character class, you can expand whatever regular expression you want, [^A-B]|[0123a-cg-h], for example.
Something like this?
var input = "a-fA-F0-9!";
var matches = Regex.Matches(input,#".-.|.");
var list = new StringBuilder();
foreach (Match m in matches)
{
var value = m.Value;
if (value.Length == 1)
list.Append(value);
else
{
if (value[2] < value[0]) throw new ArgumentException("invalid format"); // or switch, if you want.
for (char c = value[0]; c <= value[2]; c++)
list.Append(c);
}
}
Console.WriteLine(list);
Output:
abcdefABCDEF0123456789!
The moral, of course, is to solve your regex problems with more regex!
Here's a version with support for escape characters. It all depends how robust you want it to be... for example, I don't do anything special here to handle surrogates, so that probably won't work. Also, if you're trying to match the performance of a current regex engine exactly you'll need to know exactly what all the parameters are, which would be a fairly big job.
void Main()
{
//these are all equivalent:
var input = #"\x41-\0x46\u41";
var input2 = #"\65-\70\65";
var input3 = "A-FA";
// match hex as \0x123 or \x123 or \u123, or decimal \412, or the escapes \n\t\r, or any character
var charRegex = #"(\\(0?x|u)[0-9a-fA-F]+|\\[0-9]+|\\[ntr]|.)";
var matches = Regex.Matches(input, charRegex + "-" + charRegex + "|" + charRegex);
var list = new StringBuilder();
foreach (Match m in matches)
{
var dashIndex = m.Value.IndexOf('-', 1); //don't look at 0 (in case it's a dash)
if (dashIndex > 0) // this means we have two items: a range
{
var charLeft = Decode(m.Value.Substring(0,dashIndex));
var charRight = Decode(m.Value.Substring(dashIndex+1));
if (charRight < charLeft) throw new ArgumentException("invalid format (left bigger than right)"); // or switch, if you want.
for (char c = charLeft; c <= charRight; c++)
list.Append(c);
}
else // just one item
{
list.Append(Decode(m.Value));
}
}
Console.WriteLine(list);
}
char Decode(string s)
{
if (s.Length == 1)
return s[0];
// here, s[0] == '\', because of the regex
if (s.Length == 2)
switch (s[1])
{
// incomplete; add more as wished
case 'n': return '\n';
case 't': return '\t';
case 'r': return '\r';
default: break;
}
if (s[1] == 'u' || s[1] == 'x')
return (char)Convert.ToUInt16(s.Substring(2), 16);
else if (s.Length > 2 && s[1] == '0' && s[2] == 'x')
return (char)Convert.ToUInt16(s.Substring(3), 16);
else
return (char)Convert.ToUInt16(s.Substring(1)); // will fail from here if invalid escape (e.g. \g)
}
private static readonly IEnumerable<char> CharacterSet = Enumerable.Range(0, char.MaxValue + 1).Select(Convert.ToChar).Where(c => !char.IsControl(c));
public static string ExpandCharacterSet(string set)
{
var sb = new StringBuilder();
int start = 0;
bool invertSet = false;
if (set.Length == 0)
return "";
if (set[0] == '[' && set[set.Length - 1] == ']')
set = set.Substring(1, set.Length - 2);
if (set[0] == '^')
{
invertSet = true;
set = set.Substring(1);
}
while (start < set.Length - 1)
{
int dash = set.IndexOf('-', start + 1);
if (dash <= 0 || dash >= set.Length - 1)
break;
sb.Append(set.Substring(start, dash - start - 1));
char a = set[dash - 1];
char z = set[dash + 1];
for (var i = a; i <= z; ++i)
sb.Append(i);
start = dash + 2;
}
sb.Append(set.Substring(start));
if (!invertSet) return sb.ToString();
var A = new HashSet<char>(CharacterSet);
var B = new HashSet<char>(sb.ToString());
A.ExceptWith(B);
return new string(A.ToArray());
}

Categories

Resources