Extract version number

I am trying to get just the version number from an HTML link.
Take this for example:
firefox-10.0.2.bundle
I have got it to take everything after the - with
string versionNum = name.Split('-').Last();
versionNum = Regex.Replace(versionNum, "[^0-9.]", "");
which gives you an output of
10.0.2
However, if the link is like this
firefox-10.0.2.source.tar.bz2
the output will look like
10.0.2...2
How can I make it chop off everything after the third dot? Or can I make it cut at the first letter it detects, dropping that letter and everything that follows?

You could solve this with a single regex match.
Here is an example:
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // Dots are escaped so they match a literal '.' rather than any character.
        Regex regex = new Regex(@"\d+\.\d+\.\d+");
        Match match = regex.Match("firefox-10.0.2.source.tar.bz2");
        if (match.Success)
        {
            Console.WriteLine(match.Value); // 10.0.2
        }
    }
}

After you split "firefox-10.0.2.source.tar.bz2" down to "10.0.2.source.tar.bz2", split it again on the dots and keep the leading numeric parts:
string a = "10.0.2.source.tar.bz2";
string[] list = a.Split('.');
string output = "";
foreach (var item in list)
{
    int number;
    if (!int.TryParse(item, out number)) break; // stop at the first non-numeric part
    output += item + ".";
}
output = output.TrimEnd('.'); // remove the trailing dot, leaving 10.0.2

Although late, I feel that this answer would be much more apt:
Regex r = new Regex(@"[\d\.]+(?![a-zA-Z\-])");
Match m = r.Match(name);
Console.WriteLine(m.Value);
Improvements -
Though @Samuel's answer works, what happens if the build is 10.2.2.3? His regex would give 10.2.2 - a partial answer, and therefore, wrong.
With the regex I have posted, the match would be complete.
Explanation -
[\d\.]+ matches all the combination of numbers and dots such as 10.2.2.34.56.78 and even just 10 if the build is 10.bundle
(?![a-zA-Z\-]) is a negative look-ahead which ensures that the match is not followed by any letter or dash.
Being robust is absolutely vital to any code, so my posted answer should work pretty well under any circumstances (because the link could be anything).
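To make the comparison concrete, here is a minimal sketch that runs both patterns side by side. The class and method names (VersionPatterns, MatchThreePart, MatchDotsAndDigits) are made up for illustration and are not from either answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class VersionPatterns
{
    // Fixed shape: exactly three dot-separated numbers.
    private static readonly Regex ThreePart = new Regex(@"\d+\.\d+\.\d+");

    // Any run of digits and dots that is not followed by a letter or a dash.
    private static readonly Regex DotsAndDigits = new Regex(@"[\d\.]+(?![a-zA-Z\-])");

    public static string MatchThreePart(string name)
    {
        return ThreePart.Match(name).Value;
    }

    public static string MatchDotsAndDigits(string name)
    {
        return DotsAndDigits.Match(name).Value;
    }
}
```

For "firefox-10.2.2.3.source.tar.bz2" the first method returns 10.2.2 and the second returns 10.2.2.3. One caveat worth knowing: when a letter directly follows the last digit, as in "firefox-10.0.2source.tar.bz2", the look-ahead makes the second pattern back off to "10.0." (with a trailing dot), so it may still need a final trim or a stricter pattern.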

Here's a version which can handle 1-4 numbers (not just digits) in the input string, and returns a version number:
public static Version ExtractVersionNumber(string input)
{
    Match match = Regex.Match(input, @"(\d+\.){0,3}\d+");
    if (match.Success)
    {
        string value = match.Value;
        // System.Version needs at least major.minor, so pad a lone number.
        if (!value.Contains(".")) value += ".0";
        return new Version(value);
    }
    return null;
}
void Main()
{
    Console.WriteLine(ExtractVersionNumber("firefox-10.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.6.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5source.tar.bz2"));
    Console.WriteLine(ExtractVersionNumber("firefox-10.0.2.5.6source.tar.bz2"));
}
Explanation
There are essentially 2 parts:
(\d+\.){0,3} - match a number (uninterrupted sequence of 1 or more digits) immediately followed by a dot. Match this 0 to 3 times.
\d+ - match a number (sequence of 1 or more digits).
These work as follows:
When there's only 1 number (or even if there's only 1 number followed by a dot), the first part will match nothing and the second part will match the number.
When there are 2 numbers separated by a dot, the first part matches the first number and the dot, and the second part matches the second number.
For 3 numbers separated by dots, the first part gets the first 2 numbers and dots, and the second part gets the third number.
For 4 or more numbers separated by dots, the first part gets the first 3 numbers and dots, and the second part gets the fourth number. Any subsequent numbers and dots are ignored.
PS: If you wanted to ensure that you only got the number after the hyphen (e.g. to avoid getting 4.0.1 given the string "firefox4.0.1-10.0.2.source.tar.bz2") you could add a positive lookbehind to say "the character immediately before the version number must be a hyphen": (?<=-)(\d+\.){0,3}\d+.
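As a rough illustration of the lookbehind variant, the sketch below wraps it in a small helper. HyphenVersion and Extract are hypothetical names, and the helper returns the matched text rather than a Version.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class HyphenVersion
{
    // (?<=-) requires a hyphen immediately before the first digit of the match.
    private static readonly Regex AfterHyphen = new Regex(@"(?<=-)(\d+\.){0,3}\d+");

    // Returns the matched version text, or null when nothing follows a hyphen.
    public static string Extract(string input)
    {
        Match match = AfterHyphen.Match(input);
        return match.Success ? match.Value : null;
    }
}
```

With "firefox4.0.1-10.0.2.source.tar.bz2" this skips 4.0.1, because no hyphen precedes it, and returns 10.0.2.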

Related

Regex to text before set of numbers

I have text like this
Inc12345_Month
Ted12345_Month
J8T12345_Month
What I need to do is extract the 12345 and also remove everything before it. This will be done in C#
.+?(?=\d_Monthly) was working in a regex tester online but when I put it in my code it only returned 5_Month.
Edit: the 12345 could be a variable length, so I cannot just repeat [0-9] a fixed number of times.
Edit2: Code. This was just to try and remove everything before the 12345:
string text = /* the above text pulled in from a file */;
Regex reg = new Regex(@".+?(?=\d+_Monthly)");
text = reg.Replace(text, "");

You can use this function to strip it:
private static Regex getNumberAndBeyondRegex = new Regex(@"^.{2}\D+(\d.*)$", RegexOptions.Compiled);
public static string GetNumberAndBeyond(string input)
{
    var match = getNumberAndBeyondRegex.Match(input);
    if (!match.Success) throw new ArgumentException("String isn't in the correct format.", "input");
    return match.Groups[1].Value;
}
The regex at work is ^.{2}\D+(\d.*)$
It works by grabbing everything from the first digit that follows at least one non-digit character. It'll match not only _Month but also other endings.
The regex consists of a few parts:
^ matches the beginning of the string
.{2} matches any two characters, to prevent a digit from matching if it's the first or second character; you can increase this number to the minimum prefix length - 1
\D+ matches at least one character that isn't a digit
( starts a capturing group
\d.* matches one digit and anything beyond it
) closes the capturing group
$ matches the end of the string
There are a lot of different regex flavors, and many of them differ slightly in escaping, capturing, replacing and other details.
For testing .NET regexes online I use the free version of the tool RegexHero. It shows a popup every now and then, but it makes up for that by showing live results, capture groups and instant replacing, next to quite a lot of other features.
If you want to match anywhere within the string, you can use the regex \d+_Month, which is very similar to your original regex. In code:
new Regex(@"\d+_Month").Match(input).Value
Edit:
Based on the format you supplied in the comment I've created a regex and function to parse the entire file name:
private static Regex parseFileNameRegex = new Regex(@"^.*\D(\d+)_Month_([a-zA-Z]+)\.(\w+)$", RegexOptions.Compiled);
public static bool TryParseFileName(string fileName, out int id, out string month, out string fileExtension)
{
    id = 0; month = null; fileExtension = null;
    if (fileName == null) return false;
    var match = parseFileNameRegex.Match(fileName);
    if (!match.Success) return false;
    if (!int.TryParse(match.Groups[1].Value, out id) || id < 1) return false; // Convert the ID into a number
    month = match.Groups[2].Value;
    fileExtension = match.Groups[3].Value;
    return true;
}
In the parse function it requires the ID to be at least 1, 0 isn't accepted (and negative numbers won't match the regex), if you don't want this restriction, simply remove || id < 1 from the function.
Using the function would look like:
int id; string month, fileExtension;
if (!TryParseFileName("CompanyName_ClientName12345_Month_Nov.pdf", out id, out month, out fileExtension))
throw new FormatException("File name is incorrectly formatted."); // Do whatever you want when you get an invalid filename
// Use id, month and fileExtension here :)
The regex ^.*\D(\d+)_Month_([a-zA-Z]+)\.(\w+)$ works like:
^ matches the beginning of the string
.*\D matches any characters, ending in one non-digit character
(\d+) captures at least 1 digit, this is the ID
_Month_ is the literal text in between
([a-zA-Z]+) matches and captures at least 1 letter, this is the month
\. matches a . character
(\w+) matches and captures any alphanumeric (letters and numbers), this is the file extension
$ matches the end of the string

Using:
Regex reg = new Regex(@"\D+(?=(\d+)_Monthly)");
is more explicit; the result is in Groups[1].
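A small sketch of reading that group, assuming the file names really end in _Monthly as in the pattern (the sample text in the question shows _Month). MonthlyId and ExtractId are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class MonthlyId
{
    // \D+ consumes the non-digit prefix; the look-ahead captures the digits without consuming them.
    private static readonly Regex Pattern = new Regex(@"\D+(?=(\d+)_Monthly)");

    // Returns the digit run before "_Monthly", or null when the input does not match.
    public static string ExtractId(string input)
    {
        Match match = Pattern.Match(input);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Both "Inc12345_Monthly" and "J8T12345_Monthly" give 12345 here; in the second case the match starts at the T, since the 8 is not directly followed by _Monthly.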

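Part by part:
.+?
Match one or more of any character, lazily (as few as possible). Note that this is not equivalent to ".*", which is greedy and also allows an empty match.
(?=
start a look-ahead group; it checks what follows without consuming it
\d
Match exactly 1 decimal digit, which explains what you are seeing: the rest of the number is matched by .+?, which is outside the group
_Monthly
match the literal text
)
end group
I think what you want is:
^.+?(?=\d+_Monthly)
The ^ anchor matters when the pattern is used with Replace, because otherwise Replace keeps going after the prefix and removes the leading digits one match at a time.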

I guess you are missing the + sign after \d
.+?(?=\d+_Monthly)
This should ask for one or more digits.
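One thing to check when trying this out: with Regex.Replace, the + alone may not be enough, because Replace removes every match and each leading digit followed by more digits matches on its own. The sketch below compares the unanchored and anchored forms on a single-line input; PrefixStripper and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PrefixStripper
{
    // Unanchored: after the prefix is removed, "1", "2", "3" and "4" each match in turn.
    public static string StripUnanchored(string text)
    {
        return Regex.Replace(text, @".+?(?=\d+_Monthly)", "");
    }

    // Anchored at the start: only the prefix before the digit run can match.
    public static string StripAnchored(string text)
    {
        return Regex.Replace(text, @"^.+?(?=\d+_Monthly)", "");
    }
}
```

For "Inc12345_Monthly" the unanchored version returns 5_Monthly while the anchored one returns 12345_Monthly. For text with several lines, RegexOptions.Multiline would be needed so that ^ matches at each line start.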

If you don't need anything before the number, this should work:
(\d+_Month)
I use Derek Slager's regex tester when I'm working with C# regex.
Better dotnet regular expression tester

Replace all characters and first 0's (zeroes)

I am trying to use a Regular Expression to remove every character except the number, and the number should not start with 0.
How can I achieve this using a Regular Expression?
I have tried multiple things like @"^([1-9]+)(0+)(\d*)" and "(?<=[1-9])0+", but those do not work.
Some examples of the text could be hej:\\\\0.0.0.22, hej:22, hej:\\\\?022 and hej:\\\\?22, and the result should in all places be 22

Rather than replace, try and match against [1-9][0-9]*$ on your string. Grab the matched text.
Note that as .NET regexes match Unicode number characters if you use \d, here the regex restricts what is matched to a simple character class instead.
(note: regex assumes matches at end of line only)
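A minimal sketch of the match-instead-of-replace idea, using the sample strings from the question. TrailingNumber and Extract are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TrailingNumber
{
    // A non-zero digit followed by any digits, anchored at the end of the input.
    public static string Extract(string input)
    {
        Match match = Regex.Match(input, @"[1-9][0-9]*$");
        return match.Success ? match.Value : "";
    }
}
```

Called with @"hej:\\\\0.0.0.22", "hej:22" or @"hej:\\\\?022", this returns 22 each time, since the leading zero in 022 cannot start the match.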

According to one of your comments hej:\\\\0.011.0.022 should yield 110022. First select the relevant string part from the first non zero digit up to the last number not being zero:
([1-9].*[1-9]\d*)|[1-9]
[1-9] is the first non zero digit
.* are any number of any characters
[1-9]\d* are numbers, starting at the first non-zero digit
|[1-9] includes cases consisting of only one single non zero digit
Then remove all non digits (\D)
Match match = Regex.Match(input, @"([1-9].*[1-9]\d*)|[1-9]");
if (match.Success) {
    result = Regex.Replace(match.Value, @"\D", "");
} else {
    result = "";
}

Use following
[1-9][0-9]*$
You don't need to do any recursion, just match that.

Here is something you can try, The87Boy; you can play around with it or add to the pattern as you like.
string strTargetString = @"hej:\\\\*?0222\";
string pattern = "[\\\\hej:0.?*]";
string replacement = " ";
Regex regEx = new Regex(pattern);
string newRegStr = Regex.Replace(regEx.Replace(strTargetString, replacement), @"\s+", " ");
Result from the above example = 222, with a single space on either side (call Trim() to remove them)

How to do this Regex in C#?

I've been trying to do this for quite some time but for some reason never got it right.
There will be texts like these:
12325 NHGKF
34523 KGJ
29302 MMKSEIE
49504EFDF
The rule is: EXACTLY 5 digits (no more, no less), then 1 SPACE (or no space at all), then some text, as shown above. I would like to have a MATCH using a regex pattern and extract THE NUMBER, the SPACE and THE TEXT.
Is this possible? Thank you very much!

Since from your wording you seem to need to be able to get each component part of the input text on a successful match, then here's one that'll give you named groups number, space and text so you can get them easily if the regex matches:
(?<number>\d{5})(?<space>\s?)(?<text>\w+)
On the returned Match, if Success == true then you can do:
string number = match.Groups["number"].Value;
string text = match.Groups["text"].Value;
bool hadSpace = match.Groups["space"].Length > 0; // the group always succeeds, so check whether it captured anything

The expression is relatively simple:
^([0-9]{5}) ?([A-Z]+)$
That is, 5 digits, an optional space, and one or more upper-case letters. The anchors at both ends ensure that the entire input is matched.
The parentheses around the digits pattern and the letters pattern designate capturing groups one and two. Access them to get the number and the word.
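Reading the two groups could look like the sketch below. CodeLine and TryParse are hypothetical names chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CodeLine
{
    private static readonly Regex Pattern = new Regex(@"^([0-9]{5}) ?([A-Z]+)$");

    // Returns true and fills number/text when the whole line fits the pattern.
    public static bool TryParse(string line, out string number, out string text)
    {
        Match match = Pattern.Match(line);
        number = match.Success ? match.Groups[1].Value : null;
        text = match.Success ? match.Groups[2].Value : null;
        return match.Success;
    }
}
```

"12325 NHGKF" and "49504EFDF" both parse, while "123456 ABC" is rejected because the sixth digit cannot be matched by either the optional space or the letters.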

string test = "12345 SOMETEXT";
string[] result = Regex.Split(test, @"(\d{5})\s*(\w+)");

You could use the Split method:
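using System;
using System.Text.RegularExpressions;

public class Program
{
    static void Main()
    {
        var values = new[]
        {
            "12325 NHGKF",
            "34523 KGJ",
            "29302 MMKSEIE",
            "49504EFDF"
        };
        foreach (var value in values)
        {
            // Captured groups are included in the Split result: tokens[1] is the number, tokens[2] the text.
            var tokens = Regex.Split(value, @"(\d{5})\s*(\w+)");
            Console.WriteLine("key: {0}, value: {1}", tokens[1], tokens[2]);
        }
    }
}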

named groups splitting regardless of position of match

Having a hard time explaining what I mean, so here is what I want to do
I want any sentence to be parsed along the pattern of
text #something a few words [someothertext]
for this, the matching sentence would be
Jeremy is trying #20 times to [understand this]
And I would name 4 groups, as text, time, who, subtitle
However, I could also write
#20 Jeremy is trying [understand this] times to
and still get the tokens
#20
Jeremy is trying
times to
understand this
corresponding to the right groups
As long as the delimited tokens can separate the 2 text only tokens, I'm fine.
Is this even possible? I've tried a few regex's and failed miserably (am still experimenting but finding myself spending way too much time learning it)
Note: The order of the tokens can be random. If this isn't possible with regex then I guess I can live with a fixed order.
edit: fixed a typo. clarified further what I wanted.

You can alternate on the different types of text. Using named groups means that one group would have a Success value equal to true for each match.
This pattern should do what you need:
@"(?<Number>#\d+\b)|(?<Subtitle>\[.+?])|\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s*"
(?<Number>#\d+\b) - matches # followed by one or more digits, up to a word boundary
(?<Subtitle>\[.+?]) - non-greedy matching of text between square brackets
\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s* - skips spaces on either side of the text; the named capture group matches one character at a time, as long as the negative look-ahead does not find text right after it that would match the other 2 patterns of interest (numbers and subtitles).
Example usage:
var inputs = new[]
{
    "Jeremy is trying #20 times to [understand this]",
    "#20 Jeremy is trying [understand this] times to"
};
string pattern = @"(?<Number>#\d+\b)|(?<Subtitle>\[.+?])|\s*(?<Text>(?:.(?!#\d+\b|\[.*?]))+)\s*";
foreach (var input in inputs)
{
    Console.WriteLine("Input: " + input);
    foreach (Match m in Regex.Matches(input, pattern))
    {
        // skip first group, which is the entire matched text
        var group = m.Groups.Cast<Group>().Skip(1).First(g => g.Success);
        Console.WriteLine(group.Value);
    }
    Console.WriteLine();
}
Alternately, this example demonstrates how to pair the named groups to the matches:
var re = new Regex(pattern);
foreach (var input in inputs)
{
    Console.WriteLine("Input: " + input);
    var query = from Match m in re.Matches(input)
                from g in re.GetGroupNames().Skip(1)
                where m.Groups[g].Success
                select new
                {
                    GroupName = g,
                    Value = m.Groups[g].Value
                };
    foreach (var item in query)
    {
        Console.WriteLine("{0}: {1}", item.GroupName, item.Value);
    }
    Console.WriteLine();
}

So if I understand this correctly, you're looking for four phrases:
1) 1+ words of normal text
2) 1 word of text prefixed by a #
3) 1+ words of normal text
4) 1+ words of text wrapped by [ ]
My (admittedly slow and regex-less) suggestion would be to find the indexes of the #, [, and ] characters, then use several calls to string.Substring().
This would be acceptable for relatively small strings and a relatively small number of iterations, although with much larger strings this would be extremely slow.
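A rough sketch of that index-based approach follows. It assumes one # token that ends at the next space, one [ ] pair, and that the two do not overlap. TokenSplitter and Split are hypothetical names, and the output order (number, bracketed text, then the plain-text pieces) is a choice made for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class TokenSplitter
{
    // Returns the "#..." token, the bracketed text, then the remaining plain-text
    // pieces in order, or null if a delimiter is missing or the tokens overlap.
    public static List<string> Split(string sentence)
    {
        int hash = sentence.IndexOf('#');
        int open = sentence.IndexOf('[');
        int close = open < 0 ? -1 : sentence.IndexOf(']', open);
        if (hash < 0 || close < 0) return null;

        // The "#..." token runs up to the next space (or the end of the string).
        int hashEnd = sentence.IndexOf(' ', hash);
        if (hashEnd < 0) hashEnd = sentence.Length;

        // Give up if the two delimited tokens overlap.
        bool overlaps = hash < open ? hashEnd > open : hash < close;
        if (overlaps) return null;

        var tokens = new List<string>
        {
            sentence.Substring(hash, hashEnd - hash),
            sentence.Substring(open + 1, close - open - 1)
        };

        // Cut both delimited ranges out and keep whatever text is left around them.
        bool hashFirst = hash < open;
        int firstStart = hashFirst ? hash : open;
        int firstEnd = hashFirst ? hashEnd : close + 1;
        int secondStart = hashFirst ? open : hash;
        int secondEnd = hashFirst ? close + 1 : hashEnd;

        string[] rest =
        {
            sentence.Substring(0, firstStart),
            sentence.Substring(firstEnd, secondStart - firstEnd),
            sentence.Substring(secondEnd)
        };
        foreach (string piece in rest)
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0) tokens.Add(trimmed);
        }
        return tokens;
    }
}
```

Both sample sentences from the question produce the same four tokens: #20, understand this, Jeremy is trying, times to.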

Is there a way to get a string up until a year value?

Basically I have some filenames where there is a year in the middle. I am only interested in getting any letter or number up until the year value, but only letters and numbers, not commas, dots, underscores, etc. Is it possible? Maybe with Regex?
For instance:
"A-Good-Life-2010-For-Archive"
"Any.Chararacter_Can+Come.Before!2011-RedundantInfo"
"WhatyouseeIsWhatUget.2012-Not"
"400-Gestures.In1.2000-Communication"
where I want:
"AGoodLife"
"AnyChararacterCanComeBefore"
"WhatyouseeIsWhatUget"
"400GesturesIn1"
By numbers I mean any number that doesn't look like a year, i.e. 1 digit, 2 digits, 3 digits, 5 digits, and so on. I only want to recognize 4 digit numbers as years.

You'll have to do this in two parts -- first to remove the symbols you don't want, and second to grab everything up to the year (or vice versa).
To grab everything up to the year, you can use:
Match match = Regex.Match(movieTitle, @"(.*)(?<!\d)(?:19|20)[0-9]{2}(?!\d)");
// if match.Success, result is in match.Groups[1].Value
I've made the year regex so it only matches things in the 1900s or 2000s, to make sure you don't match four-digit numbers as year if they're not a year (e.g. "Ali-Baba-And-the-1234-Thieves.2011").
However, if your movie title involves a year, then this won't really work ("2001:-Space-Odyssey(1968)").
To then remove all the non-alphanumeric characters, you can replace "[^a-zA-Z0-9]" with "". (I've allowed digits because a movie might have legitimate numbers in the title.)
UPDATED from comments below:
If you search from the end to find the year you might do better, i.e. take the last occurring year-candidate as the year. Hence, I've changed a .*? to .* in the regex so that the title is as greedy as possible and only the last year-candidate is used as the year.
Added a (?!\d) to the end of the year regex and a (?<!\d) to the start so that four digits inside a longer number are not taken as a year, e.g. "2001" inside "My-title-120012-fdsa", which would otherwise give the title "My-title-1". (I didn't add the boundary \b because the title might be "A-Good-Life2010", which has no boundary around the year.)
Changed the string to a verbatim string (@"...") so I don't need to worry about escaping backslashes in the regex because of C# interpreting backslashes.
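Putting the two steps together might look like the sketch below. TitleExtractor and Extract are hypothetical names, and returning null when no year is found is an assumption made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TitleExtractor
{
    // Group 1 is everything before the last 19xx/20xx candidate not touching other digits.
    private static readonly Regex BeforeYear =
        new Regex(@"(.*)(?<!\d)(?:19|20)[0-9]{2}(?!\d)");

    // Returns the letters and digits before the year, or null when no year is found.
    public static string Extract(string fileName)
    {
        Match match = BeforeYear.Match(fileName);
        if (!match.Success) return null;
        return Regex.Replace(match.Groups[1].Value, "[^a-zA-Z0-9]", "");
    }
}
```

For the four file names in the question this gives AGoodLife, AnyChararacterCanComeBefore, WhatyouseeIsWhatUget and 400GesturesIn1.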

you can try like this
/\b\d{4}\b/
\d{4}\b matches four digits that end at a word boundary. Depending on the input data you may also want a word boundary (\b) at the beginning, as in the pattern above.

using System.Text.RegularExpressions;
string GoodParts(string input) {
    Regex re = new Regex(@"^(.*\D)\d{4}(\D|$)");
    var match = re.Match(input);
    string result = Regex.Replace(match.Groups[1].Value, "[^0-9a-zA-Z]+", "");
    return result;
}

You can use Regex.Split() to make the code a bit terser (and possibly faster due to the simpler regex):
var str = "400-Gestures.In1.2000-Communication";
var re = new Regex(@"(^|\D)\d{4}(\D|$)");
var start = re.Split(str)[0];
// remove nonalphanumerics
var result = new string(start.Where(c => Char.IsLetterOrDigit(c)).ToArray());

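I suppose you want a fancy regular expression?
Why not a simple for loop?
// Find the first run of exactly four digits and keep what comes before it.
string prefix = filename; // the whole name if no year is found
int digitCount = 0;
for (int i = 0; i <= filename.Length; i++)
{
    if (i < filename.Length && char.IsDigit(filename[i]))
    {
        digitCount++;
    }
    else
    {
        if (digitCount == 4)
        {
            prefix = filename.Substring(0, i - 4); // text before the year
            break;
        }
        digitCount = 0;
    }
}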
