Parsing a string to extract a URL or folder path

Parsing a string to extract a URL or folder path - c#

I asked a similar question recently about using regex to retrieve a URL or folder path from a string. I was looking at this comment by Dour High Arch, where he says:
"I recommend you do not use regexes at all; use separate code paths
for URLs, using the Uri class, and file paths, using the FileInfo
class. These classes already handle parsing, matching, extracting
components, and so on."
I never really tried this, but now I am looking into it and can't figure out if what he said actually is useful to what I'm trying to accomplish.
I want to be able to parse a string message that could be something like:
"I placed the files on the server at http://www.thewebsite.com/NewStuff, they can also
be reached on your local network drives at J:\Downloads\NewStuff"
And extract out the two strings http://www.thewebsite.com/ and J:\Downloads\NewStuff. I don't see any methods on the Uri or FileInfo class that parse a Uri or FileInfo object from a string like I think Dour High Arch was implying.
Is there something I'm missing about using the Uri or FileInfo class that will allow this behavior? If not is there some other class in the framework that does this?

I'd say the easiest way is splitting the strings into parts first.
First delimiter would be spaces, for each word - second would be qoutes (double and single)
Then use Uri.IsWellFormedUriString on each token.
So something like:
foreach(var part in String.Split(new char[]{''', '"', ' '}, someRandomText))
{
if(Uri.IsWellFormedUriString(part, UriKind.RelativeOrAbsolute))
doSomethingWith(part);
}
Just saw at URI.IseWellFormedURIString that this is a bit to strickt to suit your needs maybe.
It returns false if www.Whatever.com is missing the http://

It was not clear from your earlier question that you wanted to extract URL and file path substrings from larger strings. In that case, neither Uri.IsWellFormedUriString nor rRegex.Match will do what you want. Indeed, I do not think any simple method can do what you want because you will have to define rules for ambiguous strings like httX://wasThatAUriScheme/andAre/these part/of/aURL or/are they/separate.strings?andIsThis%20a%20Param?
My suggestion is to define a recursive descent parser and create states for each substring you need to distinguish.

U can use :
(?<type>[^ ]+?:)(?<path>//[^ ]*|\\.+\\[^ ]*)
that will give you 2 groups on each result
type : "http:"
path : //www.thewebsite.com/NewStuff
and
type : "J:"
path : \Downloads\NewStuff
out of the string
"I placed the files on the server at
http://www.thewebsite.com/NewStuff, they can also be reached on your
local network drives at J:\Downloads\NewStuff"
you can use the "type" group to see if the type is http:or not and set action on that.
EDIT
or use regex below if you are sure there is no whitespace in your filepath :
(?<type>[^ ]+?:)(?<path>//[^ ]*|\\[^ ]*)

Try \w+:\S+ and see how well that fits your purposes.

Related

Use Path.GetFileNameWithoutExtension method in C# to get the file name but the display is not complete

This is the simple C# code I wrote :
string file = #"D:\test(2021/02/10).docx";
var fileName = Path.GetFileNameWithoutExtension(file);
Console.WriteLine(fileName);
I thought I would get the string "test(2021/02/10)" , but I got this result "10)".
How can I solve such a problem?

I just wonder why would you want such behavior. On windows slashes are treated as separator between directory and subdirectory (or file).
So, basically you are not able to create such file name.
And since slashes are treated as described, it is very natural that method implementation just checks what's after last slash and extracts just filename.
If you are interested on how the method is implemented take a look at source code

Regex to match a fragment of the URL

I have URL's like:
http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd
or
http://127.0.0.1:81/controller/verbTwo/NXw4fDF8MXwxfDQ1
I'd like to extract that part in bold. The host and port can change to anything (when I publish it to a live server it will change). The controller never changes. And for the verb part, there are 2 possibilities.
Can anyone help me with the regex?
Thanks

Instead of using a regex you could use the built in functionality of Uri
Uri uri = new Uri("http://127.0.0.1:81/controller/verbOne/NXw4fDF8MXwxfDQ1?source=dddd");
var lastSegment = uri.Segments.Last();

You're looking for the Uri and Path classes:
Path.GetFileName(new Uri(str).AbsolutePath)

Why do you look for a regex? you can look for the two string elements "verbOne/" or "verbTwo/" and make a substring from the end. And then you can look for the rest and substrakt the part with the '?'
I think this is faster then a regex.
krikit

Though everyone else here is correct that regex is not the best solution, because it could fail when parsers already exist that should never fail due to their specialization, I believe you could use the following regex:
(?<=http://127\.0\.0\.1:81/controller/verb(One|Two)/)[a-zA-Z0-9]*

Regex to match subdomain?

I have the following so far:
^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$
Been testing against these:
https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
https://google.com:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
http://www.foo.com
http://www.foo.com/
http://blog.foo.com/
http://blog.foo.com.ar/
http://foo.com
http://blog.foo.com
http://foo.com.ar
I'm using the following tool to test the regexes: regex tester
So far I've been able to yield the following groups:
full protocol
reduced protocol
full domain name
subdomain?
top level domain
port
port number
rest of the url
rest of the "directory"
no idea how to drop this group
page name
argument string
argument string
hash tag
hash tag
I will be using this regex to change the subdomain for my application for cross-domain redirect hyperlinks.
Using Request.Url as a parameter, I want to redirect from
http://example.com or http://www.example.com to http://blog.example.com
How can I achieve this?
I can't really tell what, if any, the current subdomain ( either nothing, www, blog, or forum, for instance) actually is...
What would be the best way to make this replacement?
What I actually need is some way to find out what the top level domain is. in either http://www.example.com, http://blog.example.com, or http://example.com I want to get example.com.

What would be the best way to make this replacement?
This may not be the answer you're looking for... but IMO the best way would be to make use of the System.Uri class.
The Uri class will easily extract the Host for you - and you can then split the host on "." delimiter - that should easily give you access to the current subdomain.
This is just my opinion - and its especially formed because I find it hard to maintain regex code like ^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?(((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?)?$

You can use the Uri class to parse the strings. There are many properties available in addition to Segments:
Uri MyUri = new Uri("https://www.google.com.ar:8080/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash");
foreach (String Segment in MyUri.Segments)
Response.Write(Segment + "<br />");

I think you should reconsider whether usage of a RegEx is really needed in this case;
I think extracting the top level domain from an URL is quite simple; in case of "http://www.example.com/?blah=111" you can simply take the part before the 3rd slash and perform a String.Split('.') and concat the last two array items. In case of "http://www.example.com", even easier.
Regex-patterns are very error-prone and quite hard to maintain and according to me you won't get any advantage of it. I recommend you to get rid off the Regex. Perhaps the result will be 2 - 3 more lines of code, but it will work, your code will be much better readable and easier to understand.

Validating File Path w/Spaces in C#

I'm something of a n00b at C# and I'm having trouble finding an answer to this, so if it's already been answered somewhere feel free to laugh at me (provided you also share the solution). :)
I'm reading an XML file into a GUI form, where certain elements are paths to files that are entered into TextBox objects. I'm looping through the controls on the form, and for each file path in each TextBox (lol there's like 20 of them on this form), I want to use File.Exists() to ensure it's a valid file.
The problem with this is that the file path can potentially contain spaces, and can potentially be valid; however File.Exists() is telling me it's invalid, based entirely on the spaces. Obviously I can't hard-code them and do something like
if (File.Exists(#"c:\Path To Stuff"))
and I tried surrounding the path with ", like
if (File.Exists("\"" + contentsOfTextBox + "\""))
but that didn't make a difference. Is there some way to do this? Can I escape the spaces somehow?
Thank you for your time. :)

File.Exists works just fine with spaces. There is something else giving you a problem I'll wager.
Make sure your XML reader isn't failing to read the filename (parts of XML do not allow spaces and some readers will throw an exception if they encounter one).

#"c:\Path To Stuff"
The above could be a directory not a file!
Hence you would want to use Directory.Exists!
#"c:\Path To Stuff\file.txt"
If you did have a file on the end of the path then you would use File.Exists!

As the answer said, File.Exists works with spaces, if you are checking for existence of a Directory however, you should be using Directory.Exists

What is the exact error that you get when File.Exists says it is invalid?
I suspect that you are passing a path to a directory and not a file, which will return false. If so, to check the presence of a directory, use Directory.Exists.

To echo Ron Warholic: make sure the process has permissions over the target folder. I just ran into the same "bug" and it turned out to be a permissions issue.

Did you remember to replace \ with \\ ?

You need to use youtStringValue.Trim() to remove spaces leading/trailing, and Replace to remove spaces in the string you do not want.
Also, rather use System.IO.Path.Combine rather to combine these strings.

You can use # on string variables:
string aPath = "c:\Path To Stuff\text.txt";
File.Exists(#aPath);
That should solve any escape character problems because I don't think this really looks like the spaces being the problem.

hi this is not difficult if you can convert the name of the path to a string array then go through one by one and remove the spaces
once that is done just write() to the screen where you have the files, if it is xml then your xmlmapper will suffice
file.exists() should only be used in certain circumstances if you know that it does exist but not when there can be space chars or any other possible user input

Is file path a url?

I have a variable in code that can have file path or url as value. Examples:
http://someDomain/someFile.dat
file://c:\files\someFile.dat
c:\files\someFile.dat
So there are two ways to represent a file and I can't ignore any of them.
What is the correct name for such a variable: path, url, location?
I'm using a 3rd party api so I can't change semantics or separate to more variables.

The first two are URLs, the third is a file path. Of course, the file:/// protocol is only referring to a file also.
When using the Uri class, you can use the IsFile and the LocalPath properties to handle file:/// Uris, and in that case you should also name it like that.

Personally, I'd call the variable in question "fileName"

in fact a formal URL will be file:///c|/files/someFile.dat
urls always starts with protocol:// and then path + names, with '/' as seperator.
evil windows IE sometimes use '\' to replace '/', but the formal usage is '/'.

Pick one that you'll be using internally to start with. If you need to support URLs, use URLs internally everywhere, and have any method that can set the variable check if it got a file path, and coerce it to an URL immediately.

If the values are not opaque to your application you may find it better to model them as a class. Otherwise, whenever you are going to act upon the values you may find yourself writing code like this:
if (variable.StartsWith("http://") || variable.StartsWith("file://")) {
// Handle url
}
else {
// Handle file path
}
You may fold some of the functionality regarding treatment of the values into your class, but it is properly better to treat it as an immutable value type.
Use a descriptive name for your class like FileLocation or whatever fits your nomenclature. It will then be very natural to declare FileLocation variables named fileLocation or inputFileLocation or even fl if you are sloppy.

if the path you are using includes the protocol "file://" then it is in fact a url.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Parsing a string to extract a URL or folder path - c#

Try \w+:\S+ and see how well that fits your purposes.

Related

Use Path.GetFileNameWithoutExtension method in C# to get the file name but the display is not complete

Regex to match a fragment of the URL

Regex to match subdomain?

Validating File Path w/Spaces in C#

Is file path a url?

Categories

Resources