C# Regex sorting letter and number strings

I have a list that needs ordering say:
R1-1
R1-11
R2-2
R1-2
this needs to be ordered:
R1-1
R1-2
R1-11
R2-2
Currently I am using the C# Regex.Replace method and adding a 0 before the occurrence of single numbers at the end of a string with something like:
Regex.Replace(inString, @"([1-9]$)", @"0$1")
I'm sure there is a nicer way to do this which I just can't figure out.
Does anyone have a nice way of sorting letter and number strings with regex?
I have used Greg's method below to complete this and just thought I should add the code I am using for completeness:
public static List<Rack> GetRacks(Guid aisleGUID)
{
log.Debug("Getting Racks with aisleId " + aisleGUID);
List<Rack> result = dataContext.Racks.Where(
r => r.aisleGUID == aisleGUID).ToList();
return result.OrderBy(r => r.rackName, new NaturalStringComparer()).ToList();
}

I think what you're after is natural sort order, like Windows Explorer does? If so then I wrote a blog entry a while back showing how you can achieve this in a few lines of C#.
Note: I just checked and using the NaturalStringComparer in the linked entry does return the order you are looking for with the example strings.
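The linked blog entry is not reproduced here, so the following is only a minimal sketch of what a natural-order comparer could look like, not the code from that entry. It reuses the name NaturalStringComparer from the question; the digit-run logic and the IsAsciiDigit helper are assumptions made for illustration. Runs of digits are compared by numeric value and everything else by ordinal character value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the comparer mentioned above (not the blog's code).
public sealed class NaturalStringComparer : IComparer<string>
{
    private static bool IsAsciiDigit(char c) { return c >= '0' && c <= '9'; }

    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;
        if (y == null) return 1;

        int i = 0, j = 0;
        while (i < x.Length && j < y.Length)
        {
            if (IsAsciiDigit(x[i]) && IsAsciiDigit(y[j]))
            {
                // Skip leading zeros so "007" and "7" count as the same number.
                while (i < x.Length && x[i] == '0') i++;
                while (j < y.Length && y[j] == '0') j++;
                int startX = i, startY = j;
                while (i < x.Length && IsAsciiDigit(x[i])) i++;
                while (j < y.Length && IsAsciiDigit(y[j])) j++;
                int lenX = i - startX, lenY = j - startY;
                // With leading zeros gone, the longer digit run is the larger number.
                if (lenX != lenY) return lenX.CompareTo(lenY);
                // Same length: ordinal order of the digits equals numeric order.
                int byDigits = string.CompareOrdinal(x, startX, y, startY, lenX);
                if (byDigits != 0) return byDigits;
            }
            else
            {
                int byChar = x[i].CompareTo(y[j]);
                if (byChar != 0) return byChar;
                i++;
                j++;
            }
        }
        // Whichever string still has characters left sorts last.
        return (x.Length - i).CompareTo(y.Length - j);
    }
}
```

With this sketch, sorting R1-1, R1-11, R2-2, R1-2 gives R1-1, R1-2, R1-11, R2-2. Note that it treats "R01" and "R1" as equal, which may or may not be what you want.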

You can write your own comparator and use regular expressions to compare the number between "R" and "-" first, followed by the number after "-", if the first numbers are equal.
Sketch:
public int Compare(string x, string y)
{
int releaseX = ...;
int releaseY = ...;
int revisionX = ...;
int revisionY = ...;
if (releaseX == releaseY)
{
return revisionX - revisionY;
}
else
{
return releaseX - releaseY;
}
}
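One way the elided parts of the sketch above might be filled in is shown below. This is only an illustration: the class name RackNameComparer, the pattern, and the ordinal fallback for names that do not fit the pattern are all assumptions, and null inputs are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative comparer; the name and the pattern are assumptions.
public sealed class RackNameComparer : IComparer<string>
{
    // Assumed format: "R", a number, "-", a number (for example "R1-11").
    private static readonly Regex Pattern = new Regex(@"^R([0-9]+)-([0-9]+)$");

    public int Compare(string x, string y)
    {
        Match mx = Pattern.Match(x);
        Match my = Pattern.Match(y);
        // Names that do not fit the assumed format fall back to ordinal order.
        if (!mx.Success || !my.Success) return string.CompareOrdinal(x, y);

        int releaseX = int.Parse(mx.Groups[1].Value);
        int releaseY = int.Parse(my.Groups[1].Value);
        int revisionX = int.Parse(mx.Groups[2].Value);
        int revisionY = int.Parse(my.Groups[2].Value);

        // CompareTo avoids the overflow that subtraction can hit with large values.
        int byRelease = releaseX.CompareTo(releaseY);
        return byRelease != 0 ? byRelease : revisionX.CompareTo(revisionY);
    }
}
```

It can be passed to OrderBy or List<string>.Sort in the same way as any other IComparer<string>.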

Related

String array: search for text, return line number of first occurrence

In my string array, I want to look up some text and return the line number of the first occurrence as an int.
This is working;
public static int LookUpLineNumber(String[] item, string TextToLookUp)
{
int m;
for (m = 0; m < item.Count(); m++)
{
if (item[m].Contains(TextToLookUp))
{
break;
}
}
return m++;
}
However, I am wondering if there is any way to optimize it for efficiency and length?
Speed comparison:
(average time over 10.000 runs with a string array of size 10.000)
Using my code:
1,259ms
Using Habib's code: Array.FindIndex<string>(item, r => r.Contains(TextToLookUp));
0,906ms
Your current solution looks OK. You can have return m; instead of return m++.
If you want to shorten your code you can use Array.FindIndex<T> like:
public static int LookUpLineNumber(String[] item, string TextToLookUp)
{
return Array.FindIndex<string>(item, r => r.Contains(TextToLookUp));
}
Not really sure if it would give you any performance gain.
If you need to do this multiple times, a suffix tree built out of the array would be the fastest approach:
http://en.wikipedia.org/wiki/Suffix_tree
However, if you are not re-using the array, then I think the approach you have is likely fastest, short of using a regex to do the contains, which may be faster if the regex is pre-compiled.
You can also do the following :-
Array.FindIndex(item,i=>i.Contains(TextToLookUp));
The above would work even if it is not sorted.
The above can be further optimized by using IndexOf instead of Contains and passing StringComparison.OrdinalIgnoreCase (which also makes the search case-insensitive). You then have to check that the result is greater than or equal to 0.
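As a rough sketch of the IndexOf variant just described (the class name LineLookup is made up for illustration): note that, unlike the loop in the question, Array.FindIndex returns -1 rather than the array length when nothing matches.

```csharp
using System;

public static class LineLookup
{
    // Returns the zero-based index of the first line containing the text,
    // or -1 if no line contains it. The search ignores case.
    public static int LookUpLineNumber(string[] lines, string textToLookUp)
    {
        return Array.FindIndex(lines,
            line => line.IndexOf(textToLookUp, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

Whether this is actually faster than Contains depends on the runtime and the data, so it is worth measuring before relying on it.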

Finding duplicates in List<string>

In a list with some hundred thousand entries, how does one go about comparing each entry with the rest of the list for duplicates?
For example, the list fileNames contains both "00012345.pdf" and "12345.pdf", which are considered duplicates. What is the best strategy for flagging this kind of duplicate?
Thanks
Update: The naming of files is restricted to numbers. They are padded with zeros. Duplicates are where the padding is missing. Thus, "123.pdf" & "000123.pdf" are duplicates.
You probably want to implement your own substring comparer to test equality based on whether a substring is contained within another string.
This isn't necessarily optimised, but it will work. You could also possibly consider using Parallel Linq if you are using .NET 4.0.
EDIT: Answer updated to reflect refined question after it was edited
void Main()
{
List<string> stringList = new List<string> { "00012345.pdf","12345.pdf","notaduplicate.jpg","3453456363234.jpg"};
IEqualityComparer<string> comparer = new NumericFilenameEqualityComparer ();
var duplicates = stringList.GroupBy (s => s, comparer).Where(grp => grp.Count() > 1);
// do something with grouped duplicates...
}
// Not null-safe!
// NB: do your own parameter / null checks / string-case options etc.!
public class NumericFilenameEqualityComparer : IEqualityComparer<string> {
private static Regex digitFilenameRegex = new Regex(@"\d+", RegexOptions.Compiled);
// Numeric value of the first run of digits; names without digits all map to long.MaxValue.
private static long NumericValue(string name) {
Match digitsMatch = digitFilenameRegex.Match(name);
return digitsMatch.Success ? long.Parse(digitsMatch.Value) : long.MaxValue;
}
public bool Equals(string left, string right) {
return NumericValue(left) == NumericValue(right);
}
public int GetHashCode(string value) {
// Hash the same value that Equals compares, so GroupBy does not put every name in one bucket.
return NumericValue(value).GetHashCode();
}
}
I understand you are looking for duplicates in order to remove them?
One way to go about it could be the following:
Create a class MyString which takes care of duplication rules. That is, overrides Equals and GetHashCode to recreate exactly the duplication rules you are considering. (I'm understanding from your question that 00012345.pdf and 12345.pdf should be considered duplicates?)
Make this class explicitly or implicitly convertible to string (or override ToString() for that matter).
Create a HashSet<MyString> and fill it up iterating through your original List<String> checking for duplicates.
Might be dirty but it will do the trick. The only "hard" part here is correctly implementing your duplication rules.
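A simplified sketch of the same idea follows. Instead of a MyString wrapper class it stores a normalized key in a HashSet<string>, which is a swapped-in shortcut rather than the approach described above. The duplication rule (names that differ only by leading zeros, compared case-insensitively) is taken from the question's update; the names DuplicateFinder, Normalize and FindDuplicates are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateFinder
{
    // Assumed rule: two names are duplicates if they differ only by leading zeros.
    private static string Normalize(string fileName)
    {
        string trimmed = fileName.TrimStart('0');
        // Keep one zero for names such as "000.pdf" so the key is not just ".pdf".
        return (trimmed.Length == 0 || trimmed[0] == '.') ? "0" + trimmed : trimmed;
    }

    // Returns every entry whose normalized form already appeared earlier in the list.
    public static List<string> FindDuplicates(IEnumerable<string> fileNames)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var duplicates = new List<string>();
        foreach (string name in fileNames)
        {
            // HashSet.Add returns false when the key is already present.
            if (!seen.Add(Normalize(name)))
                duplicates.Add(name);
        }
        return duplicates;
    }
}
```

Each entry is hashed once, so the whole pass is roughly linear in the number of entries, which matters for a list with some hundred thousand names.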
I have a simple solution for finding duplicate words and characters in a string.
For word
public class Test {
public static void main(String[] args) {
findDuplicateWords("i am am a a learner learner learner");
}
private static void findDuplicateWords(String string) {
HashMap<String,Integer> hm=new HashMap<>();
String[] s=string.split(" ");
for(String tempString:s){
if(hm.get(tempString)!=null){
hm.put(tempString, hm.get(tempString)+1);
}
else{
hm.put(tempString,1);
}
}
System.out.println(hm);
}
}
For characters, use a for loop over the string's length and read each character with charAt().
Maybe something like this:
List<string> theList = new List<string>() { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };
theList.GroupBy(txt => txt)
.Where(grouping => grouping.Count() > 1)
.ToList()
.ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
groupItem.Key,
groupItem.Count(),
string.Join(" ", groupItem.ToArray())));

Generate a unique string based on a pair of strings

I've two strings StringA, StringB. I want to generate a unique string to denote this pair.
i.e.
f(x, y) should be unique for every x, y and f(x, y) = f(y, x) where x, y are strings.
Any ideas?
Compute a message digest of both strings and XOR the values:
MD5(x) ^ MD5(y)
The message digest gives you a practically unique value for each string, and because XOR is commutative, f(x, y) equals f(y, x).
EDIT: As @Phil H observed, you have to handle the case in which you receive two equal strings as input, which would generate 0 after the XOR. You could return something like MD5(x+y) if x and y are the same, and MD5(x) ^ MD5(y) for all other values.
Just create a new class and override Equals & GetHashCode:
class StringTuple
{
public string StringA { get; set; }
public string StringB { get; set; }
public override bool Equals(object obj)
{
var stringTuple = obj as StringTuple;
if (stringTuple == null)
return false;
return (StringA.Equals(stringTuple.StringA) && StringB.Equals(stringTuple.StringB)) ||
(StringA.Equals(stringTuple.StringB) && StringB.Equals(stringTuple.StringA));
}
public override int GetHashCode()
{
// Order of operands is irrelevant when using *
return StringA.GetHashCode() * StringB.GetHashCode();
}
}
Just find a unique way of ordering them and concatenate with a separator.
def uniqueStr(strA,strB,sep):
    if strA <= strB:
        return strA+sep+strB
    else:
        return strB+sep+strA
For arbitrarily long lists of strings, either sort the list or generate a set, then concatenate with a separator:
def uniqueStr(sep,strList):
    # sort the de-duplicated strings so the result does not depend on input order
    return sep.join(sorted(set(strList)))
Preferably, if the strings are long or the separator choice is a problem, use the hashes and hash the result:
def uniqueStr(sep,strList):
    # hash() returns an int, so convert each hash to a string before joining
    return hash(''.join(str(hash(s)) for s in sorted(set(strList))))
I think the following should yield unique strings:
String f = Replace(StringA<StringB?StringA:StringB,"#","##") + "}#{" + Replace(StringA<StringB?StringB:StringA,"#","##")
(That is, there's only one place in the string where a single "#" sign can appear, and we don't have to worry about a run of "#"s at the end of StringA being confused with a run of "#"s at the start of StringB.)
You can use x.GetHashCode(). This does not guarantee uniqueness, but it comes close. See more information in this question.
For example:
public int GetUniqueValue(string x, string y)
{
unchecked {
var result = x.GetHashCode() * y.GetHashCode();
return result;
}
}
Well take into consideration the first letter of each string before combining them? So if it is alphabetically ordered f(x, y) = f(y, x) will be true.
if(x > y)
c = x + y;
else
c = y + x;
What about StringC = StringA + StringB;.
That is guaranteed to be unique for any combination of StringA or StringB. Or did you have some other considerations for the string also?
You can for example combine the strings and take the MD5 hash of it. Then you will get a string that is probably "unique enough" for your needs, but you cannot reverse the hash back into the strings again, but you can take the same strings and be sure that the generated hash will be the same the next time.
EDIT
I saw your edit now, but I feel it's only a matter of sorting the strings first in that case. So something like
StringC = StringA.CompareTo(StringB) < 0 ? StringA + StringB : StringB + StringA;
You could just sort them and concatenate them, along with, let's say, the length of the first word.
That way f("one","two") = "onetwo3", f("two","one") = "onetwo3", and no other combination would produce that string, as e.g. "onet", "wo" would yield "onetwo4".
However, this will be an abysmal solution for reasonably long strings.
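The sort-and-concatenate idea can be sketched as follows. The class and method names (PairKey, Create) are invented for this example. The length goes at the front with a ":" separator rather than at the end, which is a small deviation from the description above that keeps the key easy to decode unambiguously. Null inputs are not handled.

```csharp
using System;

public static class PairKey
{
    // Symmetric key: Create(a, b) equals Create(b, a).
    public static string Create(string a, string b)
    {
        // Ordinal comparison keeps the ordering independent of the current culture.
        if (string.CompareOrdinal(a, b) > 0)
        {
            string tmp = a;
            a = b;
            b = tmp;
        }
        // The length prefix marks where the first string ends, so
        // ("one", "two") and ("onet", "wo") produce different keys.
        return a.Length + ":" + a + b;
    }
}
```

For example, both Create("one", "two") and Create("two", "one") give "3:onetwo", while Create("onet", "wo") gives "4:onetwo". The key is as long as both inputs combined, which is the drawback mentioned above for long strings.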
You could also do some sort of hash code calculation, like this
first.GetHashCode() ^ second.GetHashCode()
that would be reasonably unique, however, you can't guarantee uniqueness.
It would be nice if the OP provided a little more context, because this does not sound like a sound solution to any problem.
public static String getUniqString(String x,String y){
return (x.compareTo(y)<0)?(x+y):(y+x);
}

C# Array contains partial

How to find whether a string array contains some part of string?
I have array like this
String[] stringArray = new [] { "abc@gmail.com", "cde@yahoo.com", "@gmail.com" };
string str = "coure06@gmail.com";
if (stringArray.Any(x => x.Contains(str)))
{
//this if condition is never true
}
I want to run this if block when str contains a string that is all or part of any of the array's items.
Assuming you've got LINQ available:
bool partialMatch = stringArray.Any(x => str.Contains(x));
Even without LINQ it's easy:
bool partialMatch = Array.Exists(stringArray, x => str.Contains(x));
or using C# 2:
bool partialMatch = Array.Exists(stringArray,
delegate(string x) { return str.Contains(x); });
If you're using C# 1 then you probably have to do it the hard way :)
If you're checking whether a particular string in your array is just "@gmail.com" rather than "abc@gmail.com", you have a couple of options.
On the input side, there are a variety of questions here on SO which will point you in the direction you need to go to validate that your input is a valid email address.
If you can only check on the back end, I'd do something like:
string emailStr = "@gmail.com";
if(str.Contains(emailStr) && str.Length == emailStr.Length)
{
//your processing here
}
You can also use Regex matching, but I'm not nearly familiar enough with that to tell you what pattern you'd need.
If you're looking for just anything containing "@gmail.com", Jon's answer is your best bet.

How to split this string

I have some strings, entered by users, that may look like this:
++7
7++
1++7
1+7
1++7+10++15+20+30++
Those are to mean:
Anything up to and including 7
Anything from 7 and up
1 and 7 and anything in between
1 and 7 only
1 to 7, 10 to 15, 20 and 30 and above
I need to parse those strings into actual ranges. That is I need to create a list of objects of type Range which have a start and an end. For single items I just set the start and end to the same, and for those that are above or below, I set start or end to null. For example for the first one I would get one range which had start set to null and end set to 7.
I currently have a kind of messy method using a regular expression to do this splitting and parsing and I want to simplify it. My problem is that I need to split on + first, and then on ++. But if I split on + first, then the ++ instances are ruined and I end up with a mess.
Looking at those strings it should be really easy to parse them, I just can't come up with a smart way to do it. There just has to be an easier (cleaner, easier to read) way. Probably involving some easy concept I just haven't heard about before :P
The regular expression looks like this:
private readonly Regex Pattern = new Regex(@" ( [+]{2,} )?
([^+]+)
(?:
(?: [+]{2,} [^+]* )*
[+]{2,} ([^+]+)
)?
( [+]{2,} )? ", RegexOptions.IgnorePatternWhitespace);
That is then used like this:
public IEnumerable<Range<T>> Parse(string subject, TryParseDelegate<string, T> itemParser)
{
if (string.IsNullOrEmpty(subject))
yield break;
for (var item = RangeStringConstants.Items.Match(subject); item.Success; item = item.NextMatch())
{
var startIsOpen = item.Groups[1].Success;
var endIsOpen = item.Groups[4].Success;
var startItem = item.Groups[2].Value;
var endItem = item.Groups[3].Value;
if (endItem == string.Empty)
endItem = startItem;
T start, end;
if (!itemParser(startItem, out start) || !itemParser(endItem, out end))
continue;
yield return Range.Create(startIsOpen ? default(T) : start,
endIsOpen ? default(T) : end);
}
}
It works, but I don't think it is particularly readable or maintainable. For example changing the '+' and '++' into ',' and '-' would not be that trivial to do.
"My problem is that I need to split on + first, and then on ++. But if I split on + first, then the ++ instances are ruined and I end up with a mess."
You could split on this regex first:
(?<!\+)\+(?!\+)
That way, only the 'single' +'s are being split on, leaving you to parse the ++'s. Note that I am assuming that there cannot be three successive +'s.
The regex above simply says: "split on the '+' only if there's no '+' ahead or behind it".
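A short sketch of that split, assuming input with at most two successive +'s; the class and method names here are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlusSplitter
{
    // Matches a '+' that has no '+' directly before or after it.
    private static readonly Regex SinglePlus = new Regex(@"(?<!\+)\+(?!\+)");

    // Splits on single '+' characters only, leaving every "++" intact.
    public static string[] SplitOnSinglePlus(string input)
    {
        return SinglePlus.Split(input);
    }
}
```

For "1++7+10++15+20+30++" this yields the four parts "1++7", "10++15", "20" and "30++", each of which can then be parsed as a single value or a range.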
Edit:
After reading that there can be more than 2 successive +'s, I recommend writing a small grammar and letting a parser-generator create a lexer+parser for your little language. ANTLR can generate C# source code as well.
Edit 2:
But before implementing any solution (parser or regex) you'd first have to define what is and what isn't valid input. If you're going to let more than two successive +'s be valid, ie. 1+++++5, which is [1++, +, ++5], I'd write a little grammar. See this tutorial how that works: http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required
And if you're going to reject input of more than 2 successive +'s, you can use either Lasse's or my (first) regex-suggestion.
Here's some code that uses regular expressions.
Note that the issue raised by Bart in the comments to your question, ie. "How do you handle 1+++5", is not handled at all.
To fix that, unless your code is already out in the wild and not subject to change of behaviour, I would suggest you change your syntax to the following:
use .. to denote ranges
allow both + and - for numbers, for positive and negative numbers
use comma and/or semicolon to separate distinct numbers or ranges
allow whitespace
Look at the difference between the two following strings:
1++7+10++15+20+30++
1..7, 10..15, 20, 30..
The second string is much easier to parse, and much easier to read.
It would also remove all ambiguity:
1+++5 = 1++ + 5 = 1.., 5
1+++5 = 1 + ++5 = 1, ..5
There's no way to misparse the second syntax.
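To illustrate how little is needed for the suggested ".." syntax, here is a rough sketch of a parser for it. This is separate from the regex-based code for the original "++" syntax, and everything in it (the DotRangeParser name, returning Tuple<int?, int?> with null for an open end, throwing on malformed numbers via int.Parse) is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class DotRangeParser
{
    // Returns (start, end) pairs; null means the range is open on that side.
    public static List<Tuple<int?, int?>> Parse(string input)
    {
        var result = new List<Tuple<int?, int?>>();
        string[] pieces = input.Split(new[] { ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in pieces)
        {
            string part = raw.Trim();
            if (part.Length == 0) continue;

            int dots = part.IndexOf("..", StringComparison.Ordinal);
            if (dots < 0)
            {
                // A single number is a range whose start and end are the same.
                int single = int.Parse(part);
                result.Add(Tuple.Create<int?, int?>(single, single));
                continue;
            }

            string left = part.Substring(0, dots).Trim();
            string right = part.Substring(dots + 2).Trim();
            int? start = left.Length == 0 ? (int?)null : int.Parse(left);
            int? end = right.Length == 0 ? (int?)null : int.Parse(right);
            result.Add(Tuple.Create(start, end));
        }
        return result;
    }
}
```

Parsing "1..7, 10..15, 20, 30.." with this sketch gives four ranges: 1 to 7, 10 to 15, 20 to 20, and 30 with an open end.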
Anyway, here's my code. Basically it works by adding four regex patterns for the four types of patterns:
num
num++
++num
num++num
For "num", it will handle negative numbers with a leading minus sign, and one or more digits. It does not, for obvious reasons, handle the plus sign as part of the number.
I've interpreted "and up" to mean "up to Int32.MaxValue" and same for down to Int32.MinValue.
public class Range
{
public readonly Int32 From;
public readonly Int32 To;
public Range(Int32 from, Int32 to)
{
From = from;
To = to;
}
public override string ToString()
{
if (From == To)
return From.ToString();
else if (From == Int32.MinValue)
return String.Format("++{0}", To);
else if (To == Int32.MaxValue)
return String.Format("{0}++", From);
else
return String.Format("{0}++{1}", From, To);
}
}
public static class RangeSplitter
{
public static Range[] Split(String s)
{
if (s == null)
throw new ArgumentNullException("s");
String[] parts = new Regex(@"(?<!\+)\+(?!\+)").Split(s);
List<Range> result = new List<Range>();
var patterns = new Dictionary<Regex, Action<Int32[]>>();
patterns.Add(new Regex(@"^(-?\d+)$"),
values => result.Add(new Range(values[0], values[0])));
patterns.Add(new Regex(@"^(-?\d+)\+\+$"),
values => result.Add(new Range(values[0], Int32.MaxValue)));
patterns.Add(new Regex(@"^\+\+(-?\d+)$"),
values => result.Add(new Range(Int32.MinValue, values[0])));
patterns.Add(new Regex(@"^(-?\d+)\+\+(-?\d+)$"),
values => result.Add(new Range(values[0], values[1])));
foreach (String part in parts)
{
foreach (var kvp in patterns)
{
Match ma = kvp.Key.Match(part);
if (ma.Success)
{
Int32[] values = ma.Groups
.OfType<Group>()
.Skip(1) // group 0 is the entire match
.Select(g => Int32.Parse(g.Value))
.ToArray();
kvp.Value(values);
}
}
}
return result.ToArray();
}
}
Unit-tests:
[TestFixture]
public class RangeSplitterTests
{
[Test]
public void Split_NullString_ThrowsArgumentNullException()
{
Assert.Throws<ArgumentNullException>(() =>
{
var result = RangeSplitter.Split(null);
});
}
[Test]
public void Split_EmptyString_ReturnsEmptyArray()
{
Range[] result = RangeSplitter.Split(String.Empty);
Assert.That(result.Length, Is.EqualTo(0));
}
[TestCase(01, "++7", Int32.MinValue, 7)]
[TestCase(02, "7", 7, 7)]
[TestCase(03, "7++", 7, Int32.MaxValue)]
[TestCase(04, "1++7", 1, 7)]
public void Split_SinglePatterns_ProducesExpectedRangeBounds(
Int32 testIndex, String input, Int32 expectedLower,
Int32 expectedUpper)
{
Range[] result = RangeSplitter.Split(input);
Assert.That(result.Length, Is.EqualTo(1));
Assert.That(result[0].From, Is.EqualTo(expectedLower));
Assert.That(result[0].To, Is.EqualTo(expectedUpper));
}
[TestCase(01, "++7")]
[TestCase(02, "7++")]
[TestCase(03, "1++7")]
[TestCase(04, "1+7")]
[TestCase(05, "1++7+10++15+20+30++")]
public void Split_ExamplesFromQuestion_ProducesCorrectResults(
Int32 testIndex, String input)
{
Range[] ranges = RangeSplitter.Split(input);
String rangesAsString = String.Join("+",
ranges.Select(r => r.ToString()).ToArray());
Assert.That(rangesAsString, Is.EqualTo(input));
}
[TestCase(01, 10, 10, "10")]
[TestCase(02, 1, 10, "1++10")]
[TestCase(03, Int32.MinValue, 10, "++10")]
[TestCase(04, 10, Int32.MaxValue, "10++")]
public void RangeToString_Patterns_ProducesCorrectResults(
Int32 testIndex, Int32 lower, Int32 upper, String expected)
{
Range range = new Range(lower, upper);
Assert.That(range.ToString(), Is.EqualTo(expected));
}
}
