What regular expression is good for extracting URLs from HTML? - c#

I have tried using my own and the top ones here on StackOverflow, but most of them matched more than was desired.
For instance, some would extract http://foo.com/hello?world<br (note the <br at the end) from the input ...http://foo.com/hello?world<br>....
Is there a pattern that can match just the URL more reliably?
This is the current pattern I am using:
@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&^]*)"

The most secure regex is to not use a regex at all and use the System.Uri class.
Uri uri = new Uri("http://myUrl/%2E%2E/%2E%2E");
Console.WriteLine(uri.AbsoluteUri);
Console.WriteLine(uri.PathAndQuery);
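One possible way to apply this suggestion to text that contains markup is to split the input into candidate tokens and keep only those that System.Uri accepts. The sketch below is an assumption about how that might look, not part of the answer above: the class name UriExtractor, the method name ExtractHttpUrls and the separator list are all hypothetical, and it deliberately keeps only http and https URLs.

```csharp
using System;
using System.Collections.Generic;

public static class UriExtractor
{
    // Hypothetical helper: splits the input on whitespace, angle brackets and quotes,
    // then keeps the tokens that System.Uri parses as absolute http/https URLs.
    public static List<string> ExtractHttpUrls(string html)
    {
        var separators = new[] { ' ', '\t', '\r', '\n', '<', '>', '"', '\'' };
        var urls = new List<string>();
        foreach (var token in html.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (Uri.TryCreate(token, UriKind.Absolute, out Uri uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                urls.Add(token);
            }
        }
        return urls;
    }
}
```

Because the split happens on < and >, a trailing <br> ends up in its own token and is rejected by Uri.TryCreate instead of being glued to the URL.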

Your regex needs an escape for the dash "-" in the last character group:
@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+\-=\\\.&^]*)"
Essentially, you were allowing every character in the range from + through =, which includes <.
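The effect of the unescaped dash can be checked in isolation. The following is a minimal sketch; the class and method names are made up for illustration, and the two character classes contain only the +, - and = part of the original pattern.

```csharp
using System.Text.RegularExpressions;

public static class DashRangeDemo
{
    // Unescaped: "\+-=" inside [...] is a range from '+' (0x2B) to '=' (0x3D).
    public static bool UnescapedClassMatches(char c) =>
        Regex.IsMatch(c.ToString(), @"^[\+-=]$");

    // Escaped: "\+\-=" lists exactly three literal characters: '+', '-' and '='.
    public static bool EscapedClassMatches(char c) =>
        Regex.IsMatch(c.ToString(), @"^[\+\-=]$");
}
```

With these helpers, UnescapedClassMatches('<') is true because '<' (0x3C) lies inside the range, while EscapedClassMatches('<') is false.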

Try this:
public static string[] Parse(string pattern, string groupName, string input)
{
    var list = new List<string>();
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);
    for (var match = regex.Match(input); match.Success; match = match.NextMatch())
    {
        list.Add(string.IsNullOrWhiteSpace(groupName) ? match.Value : match.Groups[groupName].Value);
    }
    return list.ToArray();
}
public static string[] ParseUri(string input)
{
    const string pattern = @"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*";
    return Parse(pattern, string.Empty, input);
}

Related

Get values from a string based on a format

I am trying to get some individual values from a string based on a format. This format can change, so ideally I want to specify it using another string.
For example, let's say my input is 1. Line One - Part Two (Optional Third Part). I would want to specify the format to match as %number%. %first% - %second% (%third%), and then get these values as variables.
The only way I could think of doing this was using regex groups, and I have very nearly got the regex working.
var input = "1. Line One - Part Two (Optional Third Part)";
var formatString = "%number%. %first% - %second% (%third%)";
var expression = new Regex("(?<Number>[^.]+). (?<First>[^-]+) - (?<Second>[^\\(]+) ((?<Third>[^)]+))");
var match = expression.Match(input);
Console.WriteLine(match.Groups["Number"].ToString().Trim());
Console.WriteLine(match.Groups["First"].ToString().Trim());
Console.WriteLine(match.Groups["Second"].ToString().Trim());
Console.WriteLine(match.Groups["Third"].ToString().Trim());
This results in the following output, so all good apart from that opening bracket.
1
Line One
Part Two
(Optional Third Part
I'm now a bit lost as to how I could translate my format string into a regular expression. There are no rules on this format, but it would need to be fairly easy for a user.
Any advice is greatly appreciated, or perhaps there is another way not involving Regex?
You included a couple of special characters (such as .) in your pattern without escaping them, so the regex does not treat them as literal characters.
Here is your code, corrected:
using System.Text.RegularExpressions;
var input = "1. Line One - Part Two (Optional Third Part)";
var pattern = string.Format(
"(?<Number>{0})\\. (?<First>{1}) - (?<Second>{2}) \\((?<Third>{3})\\)",
"[^\\.]+",
"[^\\-]+",
"[^\\(]+",
"[^\\)]+");
var match = Regex.Match(input, pattern);
Console.WriteLine(match.Groups["Number"]);
Console.WriteLine(match.Groups["First"]);
Console.WriteLine(match.Groups["Second"]);
Console.WriteLine(match.Groups["Third"]);
Sample output:
1
Line One
Part Two
Optional Third Part
If you want to keep your syntax, you can leverage the Regex.Escape method. I have also written some code that parses all parameters enclosed in %:
using System.Text.RegularExpressions;
var input = "1. Line One - Part Two (Optional Third Part)";
var formatString = "%number%. %first% - %second% (%third%)";
formatString = Regex.Escape(formatString);
var parameters = new List<string>();
formatString = Regex.Replace(formatString, "%([^%]+)%", match =>
{
var paramName = match.Groups[1].Value;
var groupPattern = "(?<" + paramName + ">{" + parameters.Count + "})";
parameters.Add(paramName);
return groupPattern;
});
var pattern = string.Format(
formatString,
"[^\\.]+",
"[^\\-]+",
"[^\\(]+",
"[^\\)]+");
var match = Regex.Match(input, pattern);
foreach (var paramName in parameters)
{
Console.WriteLine(match.Groups[paramName]);
}
Further notes
You need to adjust the part where you specify the pattern for each group; currently it is not generic and does not take into account how many parameters there are.
So finally, taking it all into account and cleaning up the code a little, you can use a solution like this:
public static class FormatBasedCustomRegex
{
    public static string GetPattern(this string formatString,
        string[] subpatterns,
        out string[] parameters)
    {
        formatString = Regex.Escape(formatString);
        formatString = formatString.ReplaceParams(out var @params);
        if (@params.Length != subpatterns.Length)
        {
            throw new InvalidOperationException();
        }
        parameters = @params;
        return string.Format(
            formatString,
            subpatterns);
    }
    private static string ReplaceParams(
        this string formatString,
        out string[] parameters)
    {
        var @params = new List<string>();
        var outputPattern = Regex.Replace(formatString, "%([^%]+)%", match =>
        {
            var paramName = match.Groups[1].Value;
            var groupPattern = "(?<" + paramName + ">{" + @params.Count + "})";
            @params.Add(paramName);
            return groupPattern;
        });
        parameters = @params.ToArray();
        return outputPattern;
    }
}
and main method would look like:
var input = "1. Line One - Part Two (Optional Third Part)";
var pattern = "%number%. %first% - %second% (%third%)".GetPattern(
new[]
{
"[^\\.]+",
"[^\\-]+",
"[^\\(]+",
"[^\\)]+",
},
out var parameters);
var match = Regex.Match(input, pattern);
foreach (var paramName in parameters)
{
Console.WriteLine(match.Groups[paramName]);
}
But it's up to you how you define the particular methods and what signatures they should have to get the best code :)
You may use this regex:
^(?<Number>[^.]+)\. (?<First>[^-]+) - (?<Second>[^(]+)(?: \((?<Third>[^)]+)\))?$
RegEx Details:
^: Start
(?<Number>[^.]+): Match and capture 1+ of any char that is not .
\. : Match ". "
(?<First>[^-]+): Match and capture 1+ of any char that is not -
 - : Match " - "
(?<Second>[^(]+): Match and capture 1+ of any char that is not (
(?:: Start a non-capture group
 \(: Match a space followed by (
(?<Third>[^)]+): Match and capture 1+ of any char that is not )
\): Match )
)?: End the optional non-capture group
$: End
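To see what the optional group means in practice, the regex above can be wrapped in a small helper. This is only a sketch: the class name OptionalPartDemo, the method GetThird and the choice to return null for a missing part are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class OptionalPartDemo
{
    private static readonly Regex LineRegex = new Regex(
        @"^(?<Number>[^.]+)\. (?<First>[^-]+) - (?<Second>[^(]+)(?: \((?<Third>[^)]+)\))?$");

    // Returns the captured "Third" value, or null when the optional part is absent
    // (or when the input does not match at all).
    public static string GetThird(string input)
    {
        Match match = LineRegex.Match(input);
        Group third = match.Groups["Third"];
        return third.Success ? third.Value : null;
    }
}
```

Checking Group.Success rather than comparing Value to an empty string makes the difference between "part missing" and "part present" explicit.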
Your format contains special characters that are becoming part of the regular expression. You can use the Regex.Escape method to handle that. After that, you can just use a Regex.Replace with a delegate to transform the format into a regular expression:
var input = "1. Line One - Part Two (Optional Third Part)";
var fmt = "%number%. %first% - %second% (%third%)";
var templateRE = new Regex(@"%([a-z]+)%", RegexOptions.Compiled);
var pattern = templateRE.Replace(Regex.Escape(fmt), m => $"(?<{m.Groups[1].Value}>.+?)");
var ansRE = new Regex(pattern);
var ans = ansRE.Match(input);
Note: You may want to place ^ and $ at the beginning and end of the pattern respectively, to ensure the format must match the entire input string.
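The anchoring mentioned in the note could look roughly like this. The helper below is a sketch, not the answerer's code: AnchoredFormat and BuildPattern are hypothetical names, and placeholders are assumed to consist of lowercase letters only, as in the template regex above.

```csharp
using System.Text.RegularExpressions;

public static class AnchoredFormat
{
    private static readonly Regex Placeholder = new Regex("%([a-z]+)%");

    // Escapes the format, turns each %name% into a lazy named group,
    // and anchors the result so it must match the entire input.
    public static string BuildPattern(string format)
    {
        string body = Placeholder.Replace(
            Regex.Escape(format),
            m => $"(?<{m.Groups[1].Value}>.+?)");
        return "^" + body + "$";
    }
}
```

With the anchors in place, an input that has extra text after the closing parenthesis no longer matches, whereas the unanchored pattern would still find a match inside it.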

Regex Replace - Based on Char Input

Let's say we have the string:
Hello
The user enters the char input "e".
What is the correct way of returning the string as the following using a regex method:
-e---
Code tried:
public static string updatedWord(char guess, string word)
{
string result = Regex.Replace(word, guess, "-");
console.writeline(result);
return result;
}
Assuming the input were e, you could build the following regex pattern:
[^e]
Then, do a global replacement on this pattern, which matches any single character which is not e, and replace it with a single dash.
string word = "Hello";
char guess = 'e';
string regex = "[^" + guess + "]";
string result = Regex.Replace(word, regex, "-");
Console.WriteLine(result);
This prints:
-e---
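Note that to handle regex metacharacters correctly, should they be allowed as inputs, we can pass the guessed character through Regex.Escape before building the pattern. Escaping the whole pattern would turn the brackets themselves into literals, so only the guess is escaped:
string regex = "[^" + Regex.Escape(guess.ToString()) + "]";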
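Put together, the replacement might be wrapped in a method like the one below. It is a sketch under assumptions: WordMasker and MaskExcept are invented names, Regex.Escape is applied to the guessed character only, and the behaviour has not been checked for every character that is special inside a character class.

```csharp
using System.Text.RegularExpressions;

public static class WordMasker
{
    // Replaces every character of word that is not the guessed character with '-'.
    public static string MaskExcept(string word, char guess)
    {
        // Escape only the guessed character, then place it inside the negated class.
        string pattern = "[^" + Regex.Escape(guess.ToString()) + "]";
        return Regex.Replace(word, pattern, "-");
    }
}
```

For example, WordMasker.MaskExcept("Hello", 'e') gives -e---.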
This can be done without regex: you need to "loop" over all characters of the secret word and replace the characters not yet guessed with -. Regex will loop over the letters too, but plain C# methods are more comprehensible ;)
You need to keep a collection of already guessed letters.
using System.Collections.Generic;
using System.Linq;
public class Guess
{
    private readonly string _word;
    private readonly HashSet<char> _guessed;
    public Guess(string word)
    {
        _word = word;
        _guessed = new HashSet<char>();
    }
    public string Try(char letter)
    {
        // Remember the guess, then show guessed letters and mask the rest.
        _guessed.Add(letter);
        var maskedLetters = _word.Select(c => _guessed.Contains(c) ? c : '-').ToArray();
        return new string(maskedLetters);
    }
}
Usage
var game = new Guess("Hello");
var result = game.Try('e');
Console.WriteLine(result); // "-e---"

Multiple strings have special characters in regex

I am new to regular expressions, and I have a requirement to find "/./" or "/../" in a string. My program looks as follows:
String Path1 = "https://18.56.199.56/Directory1/././Directory2/filename.txt";
String Path2 = "https://18.56.199.56/Directory1/../../Directory2/filename.txt";
String Path3 = "https://18.56.199.56/Directory1/Directory2/filename.txt";
Regex nameRegex = new Regex(@"[/./]+[/../]");
bool b = nameRegex.IsMatch(OrginalURL);
This code also gives true for Path3, which does not contain any "/./" or "/../" strings.
It seems the expression Regex nameRegex = new Regex(@"[/./]+[/../]"); is not correct. Kindly correct this expression.
The regex match should succeed for Path1 and Path2, but not for Path3.
Your [/./]+[/../] (=[/.]+[/.]) regex matches one or more / or . chars followed by a single / or . char. It can thus match ....../, /////////////, and certainly the // in the protocol part.
If you do not have to use a regex you may simply use .Contains:
if (s.Contains("/../") || s.Contains("/./")) { ... }
You may use the following regex, too:
bool b = Regex.IsMatch(OrginalURL, @"/\.{1,2}/");
Details
/ - a / char
\.{1,2} - 1 or 2 dots
/ - a / char.
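Both suggestions can be compared side by side on the three paths from the question. The sketch below uses made-up names (DotSegmentCheck, ContainsDotSegment, MatchesDotSegment); only the Contains calls and the regex come from the answer.

```csharp
using System.Text.RegularExpressions;

public static class DotSegmentCheck
{
    // Plain string search: true if the URL contains "/./" or "/../".
    public static bool ContainsDotSegment(string url) =>
        url.Contains("/./") || url.Contains("/../");

    // Regex equivalent: a slash, one or two dots, then a slash.
    public static bool MatchesDotSegment(string url) =>
        Regex.IsMatch(url, @"/\.{1,2}/");
}
```

Both methods return true for Path1 and Path2 and false for Path3, because the // after https: has no dots between the slashes.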
While this would not be the best way to do this task, an expression similar to:
\/\.{1,2}(?=\/)
might work.
The forward slashes are escaped only for demo purposes; you can remove those escapes in C#.
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"\/\.{1,2}(?=\/)";
        string input = @"https://18.56.199.56/Directory1/./Directory2/filename.txt
https://18.56.199.56/Directory1/././Directory2/filename.txt
https://18.56.199.56/Directory1/../../../Directory2/filename.txt
https://18.56.199.56/Directory1/./../.../Directory2/filename.txt
https://18.56.199.56/Directory1/Directory2/filename.txt";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

Get values between curly braces c#

I have never used regex before. I was able to find similar questions in the forum, but not exactly what I'm looking for.
I have a string like the following and need to get the values between the curly braces.
Ex: "{name}{name@gmail.com}"
And I need to get the following split strings:
name and name@gmail.com
I tried the following and it gives me back the same string.
string s = "{name}{name@gmail.com}";
string pattern = "({})";
string[] result = Regex.Split(s, pattern);
Use Matches of Regex rather than Split to accomplish this easily:
string input = "{name}{name@gmail.com}";
var regex = new Regex("{(.*?)}");
var matches = regex.Matches(input);
foreach (Match match in matches) //you can loop through your matches like this
{
var valueWithoutBrackets = match.Groups[1].Value; // name, name@gmail.com
var valueWithBrackets = match.Value; // {name}, {name@gmail.com}
}
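If the values are needed as a collection rather than inside the loop, the same idea can be wrapped in a method. This is a minimal sketch with invented names (BraceValues, Extract); it escapes the braces in the pattern, which is optional here but makes the intent explicit.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BraceValues
{
    // Returns the text found between each pair of curly braces, in order.
    public static List<string> Extract(string input)
    {
        var values = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\{(.*?)\}"))
        {
            values.Add(match.Groups[1].Value);
        }
        return values;
    }
}
```

The lazy quantifier .*? stops at the first closing brace, so adjacent pairs such as {name}{name@gmail.com} are returned as two separate values.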
Is using regex a must? In this particular example I would write:
s.Split(new char[] { '{', '}' }, StringSplitOptions.RemoveEmptyEntries)
Here you go:
string s = "{name}{name@gmail.com}";
s = s.Substring(1, s.Length - 2);// remove first and last characters
string pattern = "}{";// split pattern "}{"
string[] result = Regex.Split(s, pattern);
or
string s = "{name}{name@gmail.com}";
s = s.TrimStart('{');
s = s.TrimEnd('}');
string pattern = "}{";
string[] result = Regex.Split(s, pattern);

A More Efficient Way to Parse a String in C#

I have this code that reads a file and creates Regex groups. Then I walk through the groups and use other matches on keywords to extract what I need. I need the stuff between each keyword and the next space or newline. I am wondering if there is a way using the Regex keyword match itself to discard what I don't want (the keyword).
//create the pattern for the regex
String VSANMatchString = @"vsan\s(?<number>\d+)[:\s](?<info>.+)\n(\s+name:(?<name>.+)\s+state:(?<state>.+)\s+\n\s+interoperability mode:(?<mode>.+)\s\n\s+loadbalancing:(?<loadbal>.+)\s\n\s+operational state:(?<opstate>.+)\s\n)?";
//run the match
MatchCollection VSANInfoList = Regex.Matches(block, VSANMatchString);
// set up the keyword matches
Regex VSANNum = new Regex(@" \d* ");
Regex VSANName = new Regex(@"name:\S*");
Regex VSANState = new Regex(@"operational state\S*");
//now we can extract what we need since we know all the VSAN info will be matched to the correct VSAN
//match each keyword (name, state, etc), then split and extract the value
foreach (Match m in VSANInfoList)
{
    string num=String.Empty;
    string name=String.Empty;
    string state=String.Empty;
    string s = m.ToString();
    if (VSANNum.IsMatch(s)) { num=VSANNum.Match(s).ToString().Trim(); }
    if (VSANName.IsMatch(s))
    {
        string totrim = VSANName.Match(s).ToString().Trim();
        string[] strsplit = Regex.Split (totrim, "name:");
        name=strsplit[1].Trim();
    }
    if (VSANState.IsMatch(s))
    {
        string totrim = VSANState.Match(s).ToString().Trim();
        string[] strsplit=Regex.Split (totrim, "state:");
        state=strsplit[1].Trim();
    }
}
It looks like your single regex should be able to gather all you need. Try this:
string name = m.Groups["name"].Value; // named groups are read through Match.Groups, not Match.Captures
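As a rough illustration of reading everything from the named groups, here is a sketch with a deliberately simplified pattern. The pattern, the class name VsanParser and the sample text used to try it out are all assumptions, since the real input format is not shown in the question; only the idea of using m.Groups[...] comes from the answer.

```csharp
using System.Text.RegularExpressions;

public static class VsanParser
{
    // Simplified, hypothetical pattern: only number, name and state are captured.
    // Singleline lets ".*?" cross the line break between the header and the name line.
    private static readonly Regex VsanRegex = new Regex(
        @"vsan\s(?<number>\d+).*?name:(?<name>\S+)\s+state:(?<state>\S+)",
        RegexOptions.Singleline);

    // Reads the named groups straight from the match; no second regex or Split is needed.
    public static (string Number, string Name, string State) Parse(string block)
    {
        Match m = VsanRegex.Match(block);
        return (m.Groups["number"].Value, m.Groups["name"].Value, m.Groups["state"].Value);
    }
}
```

Because the keyword (name:, state:) sits outside the capturing group, the captured value never contains it, which is what the question asks for.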
