Getting rid of multiple periods in a filename using RegEx

I have an application that requires me to "clean" "dirty" filenames.
I was wondering if anybody knew how to handle files that are named like:
1.0.1.21 -- Confidential...doc
or
Accounting.Files.doc
Basically there's no guarantee that the periods will be in the same place for every file name. I was hoping to recurse through a drive, search for periods in the filename itself (minus the extension), remove the period and then append the extension onto it.
Does anybody know either a better way to do this, or how to do what I'm hoping to do?
As a note, RegEx is a REQUIREMENT for this project.
EDIT: Instead of 1.0.1.21 -- Confidential...doc, I'd like to see: 10121 -- Confidential.doc
For the other filename, instead of Accounting.Files.doc, I'd like to see AccountingFiles.doc

You could do it with a regular expression:
string s = "1.0.1.21 -- Confidential...doc";
s = Regex.Replace(s, @"\.(?=.*\.)", "");
Console.WriteLine(s);
Result:
10121 -- Confidential.doc
The regular expression can be broken down as follows:
\. match a literal dot
(?= start a lookahead
.* any characters
\. another dot
) close the lookahead
Or in plain English: remove every dot that has at least one dot after it.
It would be cleaner to use the built-in methods for handling file names and extensions, so if you can drop the requirement that it must be regular expressions, I think the solution would be even better.
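The two ideas can also be combined: keep the regular expression for the dots, but let System.IO.Path isolate the file name so that dots in directory names are left alone. The following is only a sketch; FileNameCleaner and CleanPath are hypothetical names, not part of the question or of any library.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class FileNameCleaner
{
    // Matches any dot that still has another dot somewhere after it.
    private static readonly Regex ExtraDots = new Regex(@"\.(?=.*\.)");

    // Hypothetical helper: cleans only the file-name part of a path,
    // leaving the directory part untouched.
    public static string CleanPath(string fullPath)
    {
        string dir = Path.GetDirectoryName(fullPath) ?? string.Empty;
        string name = Path.GetFileName(fullPath);
        return Path.Combine(dir, ExtraDots.Replace(name, ""));
    }
}
```

To rename files on disk, one option would be to feed each path from Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories) through CleanPath and call File.Move when the result differs, after checking that the target name is not already taken.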

Here is an alternate solution that doesn't use regular expressions -- perhaps it is more readable:
string s = "1.0.1.21 -- Confidential...doc";
int extensionPoint = s.LastIndexOf(".");
if (extensionPoint < 0) {
    extensionPoint = s.Length;
}
string nameWithoutDots = s.Substring(0, extensionPoint).Replace(".", "");
string extension = s.Substring(extensionPoint);
Console.WriteLine(nameWithoutDots + extension);

I'd do this without regular expressions*. (Disclaimer: I'm not good with regular expressions, so that might be why.)
Consider this option.
string RemovePeriodsFromFilename(string fullPath)
{
    string dir = Path.GetDirectoryName(fullPath);
    string filename = Path.GetFileNameWithoutExtension(fullPath);
    string sanitized = filename.Replace(".", string.Empty);
    string ext = Path.GetExtension(fullPath);
    return Path.Combine(dir, sanitized + ext);
}
* Whoops, looks like you said using regular expressions was a requirement. Never mind! (Though I have to ask: why?)

Related

C# Request path comparison with router path

I'm trying to compare below two strings using c# asp.net core. The motivation is to compare two paths except path parameter (without manually splitting and comparing one by one). Is it possible to do this in single line using any in-build method?
Requested: /api/v1/schedules/S210715001/comments
Original: /api/v1/schedules/{id}/comments
Thanks in advance.

You could create a regex pattern of the original string containing curly brackets and then match against the requested string.
string original = @"/api/v1/schedules/{id}/comments";
string requested = @"/api/v1/schedules/S210715001/comments";
string originalPattern = Regex.Replace(original, @"\{[^\}]*\}", @"\w*");
var isMatch = Regex.Match(requested, $"^{originalPattern}$", RegexOptions.IgnoreCase).Success;
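One caveat with building the pattern directly from the template is that characters such as "." in the literal parts are treated as regex syntax, and \w* also accepts an empty parameter. A possible variation, shown here only as a sketch, escapes the literal pieces with Regex.Escape and requires a non-empty path segment for each placeholder. RouteMatcher, ToRegex and IsMatch are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RouteMatcher
{
    // Hypothetical helper: turns "/a/{id}/b" into an anchored regex.
    public static Regex ToRegex(string template)
    {
        // Split on {placeholders}; the pieces in between are literal text.
        string[] literals = Regex.Split(template, @"\{[^{}]*\}");
        // Escape every literal piece and join them with "one non-empty segment".
        string pattern = string.Join("[^/]+", literals.Select(s => Regex.Escape(s)));
        return new Regex("^" + pattern + "$", RegexOptions.IgnoreCase);
    }

    public static bool IsMatch(string template, string requested)
    {
        return ToRegex(template).IsMatch(requested);
    }
}
```

With this sketch, RouteMatcher.IsMatch("/api/v1/schedules/{id}/comments", "/api/v1/schedules/S210715001/comments") is true, while a request with an empty id segment is rejected.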

There is a built-in function called Path.GetFileName in System.IO that lets you extract a file name from a whole path.
Here is how to use it:
var original = "/api/v1/schedules/S210715001/comments";
var requested = "api/v1/schedules/{id}/comments";
Console.WriteLine("original: " + Path.GetFileName(original));
Console.WriteLine("requested: " + Path.GetFileName(requested));
output:
original: comments
requested: comments
Note: there are some subtleties in what is considered a directory separator (backslash, forward slash, etc.; see here), but I think it's easier to use than regular expressions.

Regular expression to extract file name from p4 path

From path "//source/project/file.cs#232", I need to match file.cs
Match myMatch = Regex.Match(path, @"(\w+\.\w+)[^/]*$");
This would give file.cs in groups[1].
But for paths with dots in the file name, this doesn't work.
path "//source/project/file.initial.config.cs#232"
How could I modify this to work to give file.initial.config.cs?

Try this regex -- also into group 1, and assuming the extension can only be letters, numbers or the underscore:
.*/((?:.*?\.)+\w+)
This could be made more robust, if necessary, with knowledge of the allowable characters and suffixes for file naming, as well as details about the text in which (if) this file name is embedded. For example, if spaces were not allowed as part of the name
.*/((?:\S*?\.)+\w+)
or if ONLY letters, digits or the underscore are allowed:
.*/((?:\w*?\.)+\w+)
If we could be assured that there will be no dots or spaces after the last dot in the sequence, and spaces not allowed in the filename, it could be shortened further to:
.*/(\S*\.\w+)
to pick up everything between the last "/" and the last "." as well as any word characters after the last "."
etc

A number of non-'/' before '#':
/([^/]+)#

This should allow you to do what you want, or at least give you a better idea of how to achieve it:
/(\w+)(?:\..*)(\w{2,3})\#
• example: http://regex101.com/r/wQ9jG2

Can you not simply modify your regex from (\w+\.\w+)[^/]*$ to (\w+(\.\w+)+)[^/]*$, to allow multiple occurrences of .words?
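As a quick check of that suggestion, the sketch below wraps the modified pattern in a small helper and reads group 1, as in the question. P4Path and GetFileName are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class P4Path
{
    // The question's pattern with the ".word" part made repeatable.
    private static readonly Regex FileName = new Regex(@"(\w+(\.\w+)+)[^/]*$");

    // Returns the file name from group 1, or an empty string when nothing matches.
    public static string GetFileName(string path)
    {
        Match m = FileName.Match(path);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Because the pattern only allows \w characters between the dots, a name such as my-file.cs would come back as file.cs, so this fits only when file names are limited to letters, digits, underscores and dots.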

Why use regex, when you can do it in c# ?
I've created a function for you:
public static class FileNameHelper
{
    public static string GetFileNameFromPath(string path, string extWithoutdot = "cs")
    {
        var startIndex = path.LastIndexOf('/') + 1;
        var stringg = path.Substring(startIndex);
        var remIndex = stringg.LastIndexOf("." + extWithoutdot) + extWithoutdot.Length + 1;
        return stringg.Remove(remIndex);
    }
}
How to use it:
string filename = FileNameHelper.GetFileNameFromPath("//source/project/file.initial.config.cs#232", "cs");
Remember to pass the extension without the leading dot.
This has several advantages over regex:
It's not regex!
It's fast and efficient.
It's readable and pure C#.
Note: Don't use regex in C# for trivial things. It's definitely a hit on performance. First think of ways to achieve it in plain C#. Regex should be a last resort. Of course, if performance doesn't matter, use whatever you like!
By the way, mark it as the answer if it helps. I know it'll help :)

If you're not averse to avoiding regular expressions, you could do this with just a small bit of string manipulation:
string mypath = "//source/project/file.initial.config.cs#232";
string filename = GetFileName(mypath);
static string GetFileName(string path)
{
    var pathPieces = path.Split('/').Last().Split('#');
    var filename = pathPieces.Take(pathPieces.Length - 1);
    return String.Join("#", filename);
}
Easier, and works with any arbitrary filename (even those with spaces or # characters).
EDIT: Now works with filenames with # characters in them, although those are highly discouraged in Perforce.

(?<=/)[^/]+(?=#)
Using lookaround, it matches only the filename.

What actually happen behind Path.Combine

I have :
string Combine = Path.Combine("shree\\", "file1.txt");
string Combine1 = Path.Combine("shree", "file1.txt");
Both give the same result:
shree\file1.txt
What actually happens inside Path.Combine? Which of the two is the better coding practice? Please clarify. Thanks.

If the first path (shree or shree\\) does not end with a valid separator character (e.g. DirectorySeparatorChar) it is appended to the path before concatenation.
So
string path1 = "shree";
string path2 = "file1.txt";
string combined = Path.Combine(path1, path2);
will result in "shree\file1.txt", while
string path1 = "shree\\";
already contains a valid separator character, so the Combine method will not add another one.
Here you typed two backslashes in the string variable (path1). The first one just acts as an escape character for the second one. This is the same as using a verbatim string literal:
string path1 = @"shree\";
More information on the Combine method can be found on MSDN:
http://msdn.microsoft.com/en-us/library/fyy7a5kt.aspx
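The behaviour described above can be checked with a few lines. This is a minimal sketch that uses Path.DirectorySeparatorChar instead of a hard-coded backslash so it gives the same answers on Windows and on Unix-like systems; CombineDemo and its members are hypothetical names. The last method also illustrates a related documented rule: when the second argument is itself rooted, Path.Combine returns it and discards the first.

```csharp
using System;
using System.IO;

public static class CombineDemo
{
    // '\' on Windows, '/' on Unix-like systems.
    public static readonly char Sep = Path.DirectorySeparatorChar;

    // First part already ends with a separator: nothing is added.
    public static string WithTrailingSeparator()
    {
        return Path.Combine("shree" + Sep, "file1.txt");
    }

    // First part has no trailing separator: exactly one is inserted.
    public static string WithoutTrailingSeparator()
    {
        return Path.Combine("shree", "file1.txt");
    }

    // Second part is rooted: the first part is dropped entirely.
    public static string RootedSecondPart()
    {
        return Path.Combine("shree", Sep + "file1.txt");
    }
}
```

The first two methods return the same string, which is why either form in the question works; the second form is simply less to type and does not hard-code a separator.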

Use the second one. This way you don't care about what is the directory separator.

What actually happen behind Path.Combine?
It builds a path for you, so it doesn't matter which of the two you use, but the \\ is redundant.
If you're interested in micro-optimization, write a test to see which of the two is faster.

Conversion to a double-escaped string

In C#, I have a filename that needs to be converted to be double-escaped (because I feed this string into a regex).
In other words, if I have:
FileInfo file = new FileInfo(@"c:\windows\foo.txt");
string fileName = file.FullName;
fileName is: c:\\\\windows\\\\foo.txt
But I need to convert this to have sequences of two literal backslashes \\ in the fileName.
fileName needs to be @"c:\\\\windows\\\\foo.txt", or "c:\\\\\\\\windows\\\\\\\\foo.txt".
Is there an easy way to make this conversion?

I think you're looking for Regex.Escape
Regex.Escape(@"c:\test.txt") == @"c:\\test\.txt"
notice how it also escapes '.'
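To show where the escaped string would be used, here is a small sketch that builds a regex matching a path as literal text. EscapeDemo and LiteralPathRegex are hypothetical names, and the case-insensitive option is an assumption based on Windows paths usually being compared that way.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Builds a regex that matches the given path as literal text,
    // so '\' and '.' in the path are not treated as regex syntax.
    public static Regex LiteralPathRegex(string path)
    {
        return new Regex("^" + Regex.Escape(path) + "$", RegexOptions.IgnoreCase);
    }
}
```

For the path in the question, Regex.Escape(@"c:\windows\foo.txt") yields c:\\windows\\foo\.txt, which is the form a regex pattern needs.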

simplest without resorting to regex for this part:
string fileName = file.FullName.Replace(@"\", @"\\\\");
based on OP, but I think you really want this:
string fileName = file.FullName.Replace(@"\", @"\\");
That being said, I can't see how you want to use it... it shouldn't need escaping at all... maybe you should post more code?

C# Extracting a name from a string

I want to extract 'James\, Brown' from the string below, but I don't always know what the name will be. The comma is causing me some difficulty, so what would you suggest to extract James\, Brown?
OU=James\, Brown,OU=Test,DC=Internal,DC=Net
Thanks

A regex is likely your best approach
static string ParseName(string arg) {
    var regex = new Regex(@"^OU=([a-zA-Z\\]+\,\s+[a-zA-Z\\]+)\,.*$");
    var match = regex.Match(arg);
    return match.Groups[1].Value;
}

You can use a regex:
string input = @"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
Match m = Regex.Match(input, "^OU=(.*?),OU=.*$");
Console.WriteLine(m.Groups[1].Value);

A quite brittle way to do this might be...
string name = @"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
string[] splitUp = name.Split("=".ToCharArray(),3);
string namePart = splitUp[1].Replace(",OU","");
Console.WriteLine(namePart);
I wouldn't necessarily advocate this method, but I've just come back from a departmental Christmas lunch and my brain is not fully engaged yet.

I'd start off with a regex to split up the groups:
Regex rx = new Regex(@"(?<!\\),");
String test = "OU=James\\, Brown,OU=Test,DC=Internal,DC=Net";
String[] segments = rx.Split(test);
From there I would split each segment manually, so that the regex does not depend on anything other than the separator character. Since this looks like an LDAP query, it might not matter if you always look at segments[0], but there is a chance that the name is set as "CN=" rather than "OU=". You can cover both cases by reading the segment like this:
String name = segments[0].Split('=', 2)[1];
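Put together, the two steps might look like the sketch below. DnParser and FirstValue are hypothetical names; the array overload of Split is used only because it is available on older .NET Framework versions as well. Note the assumption that a backslash before a comma always escapes that comma, which would mis-split a value that ends in an escaped backslash.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DnParser
{
    // A comma that is not preceded by a backslash separates two components.
    private static readonly Regex UnescapedComma = new Regex(@"(?<!\\),");

    // Hypothetical helper: returns the value of the first component,
    // e.g. "James\, Brown", regardless of whether its key is OU or CN.
    public static string FirstValue(string dn)
    {
        string first = UnescapedComma.Split(dn)[0];
        string[] keyValue = first.Split(new[] { '=' }, 2);
        return keyValue.Length == 2 ? keyValue[1] : string.Empty;
    }
}
```

If the display form is wanted, the escape can be removed afterwards with value.Replace(@"\,", ",").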

That looks suspiciously like an LDAP or Active Directory distinguished name formatted according to RFC 2253/4514.
Unless you're working with well known names and/or are okay with a fragile hackaround (like the regex solutions) - then you should start by reading the spec.
If you, like me, generally hate implementing code according to RFCs - then hope this guy did a better job following the spec than you would. At least he claims to be 2253 compliant.

If the slash is always there, I would look at potentially using RegEx to do the match, you can use a match group for the last and first names.
^OU=([a-zA-Z]+)\\,\s([a-zA-Z]+)
That RegEx will match names that include characters only, you will need to refine it a bit for better matching for the non-standard names. Here is a RegEx tester to help you along the way if you go this route.

Replace \, with your own preferred magic string (perhaps &#44;), split on the remaining commas or search until the first comma, then replace your magic string with a single comma.
i.e. something like:
string originalStr = @"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
string replacedStr = originalStr.Replace(@"\,", "&#44;");
string name = replacedStr.Substring(0, replacedStr.IndexOf(","));
Console.WriteLine(name.Replace("&#44;", ","));

Assuming you're running in Windows, use PInvoke with DsUnquoteRdnValueW. For code, see my answer to another question: https://stackoverflow.com/a/11091804/628981

If the format is always the same:
string line = GetStringFromWherever();
int start = line.IndexOf("=") + 1;//+1 to get start of name
int end = line.IndexOf("OU=",start) -1; //-1 to remove comma
string name = line.Substring(start, end - start);
Forgive if syntax is not quite right - from memory. Obviously this is not very robust and fails if the format ever changes.
