Regex matching dynamic words within an html string

Regex matching dynamic words within an html string - c#

I have an html string to work with as follows:
string html = new MvcHtmlString(item.html.ToString()).ToHtmlString();
There are two different types of text I need to match although very similar. I need the initial ^^ removed and the closing |^^ removed. Then if there are multiple clients I need the ^ separating clients changed to a comma(,).
^^Client One- This text is pretty meaningless for this task, but it will exist in the real document.|^^
^^Client One^Client Two^Client Three- This text is pretty meaningless for this task, but it will exist in the real document.|^^
I need to be able to match each client and make it bold.
Client One- This text is pretty meaningless for this task, but it will exist in the real document.
Client One, Client Two, Client Three- This text is pretty meaningless for this task, but it will exist in the real document.
A nice stack over flow user provided the following but I could not get it to work or find any matches when I tested it on an online regex tester.
const string pattern = #"\^\^(?<clients>[^-]+)(?<text>-.*)\|\^\^";
var result = Regex.Replace(html, pattern,
m =>
{
var clientlist = m.Groups["clients"].Value;
var newClients = string.Join(",", clientlist.Split('^').Select(s => string.Format("<strong>{0}</strong>", s)));
return newClients + m.Groups["text"];
});
I am very new to regex so any help is appreciated.

I'm new to C# so forgive me if I make rookie mistakes :)
const string pattern = #"\^\^([^-]+)(-[^|]+)\|\^\^";
var temp = Regex.Replace(html, pattern, "<strong>$1</strong>$2");
var result = Regex.Replace(temp, #"\^", "</strong>, <strong>");
I'm using $1 even though MSDN is vague about using that syntax to reference subgroups.
Edit: if it's possible that the text after - contains a ^ you can do this:
var result = Regex.Replace(temp, #"\^(?=.*-)", "</strong>, <strong>");

Related

Extract substring between startsequence and endsequence in C# using LINQ

I have an XML instance that contains processing instructions. I want a specific one (the schematron declaration):
<?xml-model href="../../a/b/c.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>
There may or may not be more than these very processing instructions present, so I can't rely on its position in the DOM; it is guaranteed, on the other hand, that there will be only one (or none) such Schematron file reference. Thus, I get it like so:
XProcessingInstruction p = d.Nodes().OfType<XProcessingInstruction>()
.Where(x => x.Target.Equals("xml-model") &&
x.Data.Contains("schematypens=\"http://purl.oclc.org/dsdl/schematron\""))
.FirstOrDefault();
In the example given, the content of p.Data is the string
href="../../a/b/c.sch" schematypens="http://purl.oclc.org/dsdl/schematron"
I need to extract the path specified via #href (i. e. in this example I would want the string ../../a/b/c.sch) without double quotes. In other words: I need the substring after href=" and before the next ". I'm trying to achieve my goal with LINQ:
var a = p.Data.Split(' ').Where(s => s.StartsWith("href=\""))
.Select(s => s.Substring("href=\"".Length))
.Select(s => s.TakeWhile(c => c != '"'));
I would have thought this gave me a IEnumerable<char> which I could then convert to a string in one of the ways described here, but that's not the case: According to LINQPad, I seem to be getting a IEnumerabale<IEnumerable<char>> which I can't manage to make into a string.
How could this be done correctly using LINQ? Maybe I'd better be using Regex within LINQ?
Edit: After typing this down, I came up with a working solution, but it seems very inelegant:
string a = new string
(
p.Data.Substring(p.Data.IndexOf("href=\"") + "href=\"".Length)
.TakeWhile(c => c != '"').ToArray()
);
What would be a better way?

Try this:
var input = #"<?xml-model href=""../../a/b/c.sch"" schematypens=""http://purl.oclc.org/dsdl/schematron""?>";
var match = Regex.Match(input, #"href=""(.*?)""");
var url = match.Groups[1].Value;
That gives me ../../a/b/c.sch in url.
Please don't use Regex for general XML parsing, but for this situation it's fine.

Get specific char from a string request

I need to get 4 variable from a string
That's a resquest someone can make to a server :
String request= "Jouer_un_bulletin <nbT> <mise> <numeros_grilleA> <numeros_grilleB>"
I need to get nbt, mise, numeros_grilleA and numeros_grilleB

You can try using regular expressions:
String request = "Jouer_un_bulletin <nbT> <mise> <numeros_grilleA> <numeros_grilleB>";
String pattern = #"(?<=<)[^<>]*(?=>)";
String[] prms = Regex
.Matches(request, pattern)
.OfType<Match>()
.Select(match => match.Value)
.ToArray();
Test:
// nbT
// mise
// numeros_grilleA
// numeros_grilleB
Console.Write(String.Join(Environment.NewLine, prms));

You need to parse the request. How to do this is entirely based on what you expect to receive and how you validate it. Assuming (and this is a big assumption) that you will always get the format above, you can use String.Split to split the string on the open bracket character, and then take the component pieces (ignoring the first one) and trim off the close bracket and any additional spaces. These will be your variables. This is not, by any means, a good way of sending data, and you should at a minimum validate this data a lot before using it.
The very basic (and by all means this is terrible code you shouldn't use) concept is:
String request= "Jouer_un_bulletin <nbT> <mise> <numeros_grilleA> <numeros_grilleB>";
var pieces = request.Split('<');
var strList = new List<string>();
for(int i = 1 ; i < pieces.Length; i++)
{
strList.Add(pieces[i].Trim(' ','>'));
}

Parse Line and Break it into Variables

I have a text file that contain only the FULL version number of an application that I need to extract and then parse it into separate Variables.
For example lets say the version.cs contains 19.1.354.6
Code I'm using does not seem to be working:
char[] delimiter = { '.' };
string currentVersion = System.IO.File.ReadAllText(#"C:\Applicaion\version.cs");
string[] partsVersion;
partsVersion = currentVersion.Split(delimiter);
string majorVersion = partsVersion[0];
string minorVersion = partsVersion[1];
string buildVersion = partsVersion[2];
string revisVersion = partsVersion[3];

Altough your problem is with the file, most likely it contains other text than a version, why dont you use Version class which is absolutely for this kind of tasks.
var version = new Version("19.1.354.6");
var major = version.Major; // etc..

What you have works fine with the correct input, so I would suggest making sure there is nothing else in the file you're reading.
In the future, please provide error information, since we can't usually tell exactly what you expect to happen, only what we know should happen.
In light of that, I would also suggest looking into using Regex for parsing in the future. In my opinion, it provides a much more flexible solution for your needs. Here's an example of regex to use:
var regex = new Regex(#"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9])");
var match = regex.Match("19.1.354.6");
if (match.Success)
{
Console.WriteLine("Match[1]: "+match.Groups[1].Value);
Console.WriteLine("Match[2]: "+match.Groups[2].Value);
Console.WriteLine("Match[3]: "+match.Groups[3].Value);
Console.WriteLine("Match[4]: "+match.Groups[4].Value);
}
else
{
Console.WriteLine("No match found");
}
which outputs the following:
// Match[1]: 19
// Match[2]: 1
// Match[3]: 354
// Match[4]: 6

How to retrieve the locale(country) code from URL?

I have a URL, which is like http://example.com/UK/Deal.aspx?id=322
My target is to remove the locale(country) part, to make it like http://example.com/Deal.aspx?id=322
Since the URL may have other similar formats like: https://ssl.example.com/JP/Deal.aspx?id=735, using "substring" function is not a good idea.
What I can think about is to use the following method for separating them, and map them back later.
HttpContext.Current.Request.Url.Scheme
HttpContext.Current.Request.Url.Host
HttpContext.Current.Request.Url.AbsolutePath
HttpContext.Current.Request.Url.Query
And, suppose HttpContext.Current.Request.Url.AbsolutePath will be:
/UK/Deal.aspx?id=322
I am not sure how to deal with this since my boss asked me not to use "regular expression"(he thinks it will impact performance...)
Except "Regular Expression", is there any other way to remove UK from it?
p.s.: the UK part may be JP, DE, or other country code.
By the way, for USA, there is no country code, and the url will be http://example.com/Deal.aspx?id=322
Please also take this situation into consideration.
Thank you.

Assuming that you'll have TwoLetterCountryISOName in the Url. yYou can use UriBuilder class to remove the path from Uri without using the Regex.
E.g.
var originalUri = new Uri("http://example.com/UK/Deal.aspx?id=322");
if (IsLocaleEnabled(sourceUri))
{
var builder = new UriBuilder(sourceUri);
builder.Path
= builder.Path.Replace(sourceUri.Segments[1] /* remove UK/ */, string.Empty);
// Construct the Uri with new path
Uri newUri = builder.Uri;;
}
Update:
// Cache the instance for performance benefits.
static readonly Regex regex = new Regex(#"^[aA-zZ]{2}\/$", RegexOptions.Compiled);
/// <summary>
/// Regex to check if Url segments have the 2 letter
/// ISO code as first ocurrance after root
/// </summary>
private bool IsLocaleEnabled(Uri sourceUri)
{
// Update: Compiled regex are way much faster than using non-compiled regex.
return regex.IsMatch(sourceUri.Segments[1]);
}
For performance benefits you must cache it (means keep it in static readonly field). There's no need to parse a pre-defined regex on every request. This way you'll get all the performance benefits you can get.
Result - http://example.com/Deal.aspx?id=322

It all depends on whether the country code always has the same position. If it's not, then some more details on the possible formats are required.. Maybe you could check, if the first segment has two chars or something, to be sure it really is a country code (not sure if this is reliable though). Or you start with the filename, if it's always in the format /[optionalCountryCode]/deal.aspx?...
How about these two approaches (on string level):
public string RemoveCountryCode()
{
Uri originalUri = new Uri("http://example.com/UK/Deal.aspx?id=322");
string hostAndPort = originalUri.GetLeftPart(UriPartial.Authority);
// v1: if country code is always there, always has same position and always
// has format 'XX' this is definitely the easiest and fastest
string trimmedPathAndQuery = originalUri.PathAndQuery.Substring("/XX/".Length);
// v2: if country code is always there, always has same position but might
// not have a fixed format (e.g. XXX)
trimmedPathAndQuery = string.Join("/", originalUri.PathAndQuery.Split('/').Skip(2));
// in both cases you need to join it with the authority again
return string.Format("{0}/{1}", hostAndPort, trimmedPathAndQuery);
}

If the AbsolutePath will always have the format /XX/...pagename.aspx?id=### where XX is the two letter country code, then you can just strip off the first 3 characters.
Example that removes the first 3 characters:
var targetURL = HttpContext.Current.Request.Url.AbsolutePath.Substring(3);
If the country code could be different lengths, then you could find the index of the second / character and start the substring from there.
var sourceURL = HttpContext.Current.Request.Url.AbsolutePath;
var firstOccurance = sourceURL.IndexOf('/')
var secondOccurance = sourceURL.IndexOf('/', firstOccurance);
var targetURL = sourceURL.Substring(secondOccurance);

The easy way would be to treat as string, split it by the "/" separator, remove the fourth element, and then join them back with the "/" separator again:
string myURL = "https://ssl.example.com/JP/Deal.aspx?id=735";
List<string> myURLsplit = myURL.Split('/').ToList().RemoveAt(3);
myURL = string.Join("/", myURLsplit);
RESULT: https://ssl.example.com/Deal.aspx?id=735

Code an elegant way to strip strings

I am using C# and in one of the places i got list of all peoples names with their email id's in the format
name(email)\n
i just came with this sub string stuff just off my head. I am looking for more elegant, fast ( in the terms of access time, operations it performs), easy to remember line of code to do this.
string pattern = "jackal(jackal#gmail.com)";
string email = pattern.SubString(pattern.indexOf("("),pattern.LastIndexOf(")") - pattern.indexOf("("));
//extra
string email = pattern.Split('(',')')[1];
I think doing the above would do sequential access to each character until it finds the index of the character. Works ok now since name is short, but would struggle when having a large name ( hope people don't have one)

A dirty hack would be to let microsoft do it for you.
try
{
new MailAddress(input);
//valid
}
catch (Exception ex)
{
// invalid
}
I hope they would do a better job than a custom reg-ex.
Maintaining a custom reg-ex that takes care of everything might involve some effort.
Refer: MailAddress
Your format is actually very close to some supported formats.
Text within () are treated as comments, but if you replace ( with < and ) with > and get a supported format.

The second parameter in Substring() is the length of the string to take, not the ending index.
Your code should read:
string pattern = "jackal(jackal#gmail.com)";
int start = pattern.IndexOf("(") + 1;
int end = pattern.LastIndexOf(")");
string email = pattern.Substring(start, end - start);
Alternatively, have a look at Regular Expression to find a string included between two characters while EXCLUDING the delimiters

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex matching dynamic words within an html string - c#

Related

Extract substring between startsequence and endsequence in C# using LINQ

Get specific char from a string request

Parse Line and Break it into Variables

How to retrieve the locale(country) code from URL?

Code an elegant way to strip strings

Categories

Resources