Split substrings within a string in C#

I need to check a string that contains a list of e-mails. These emails are usually separated by commas, but I need to check if somewhere in that list there is a delimiter other than a comma. Here's an example:
email1#email.com,email2#email.com,email3#email.com#email4#email.com
I need to identify that different character and replace it with a comma.
I cannot just use a regex to identify special characters other than the comma and replace them, because e-mails may contain some of these characters. So I need to find whatever sits between two e-mails.
I made the following regex to identify an e-mail and I believe it will cover most of the emails:
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#[a-z0-9]+(\.[a-z0-9]+)+$
But I'm a little lost on how to use it to solve my problem in C#. I need to capture whatever lies between two matches of this regex and replace it with a comma.
Could anyone help me?
Thank you.

Your problem is unsolvable in general, because the delimiter cannot always be determined, even by a human.
Consider this input where the delimiter is a .:
user#server.co.uk.user#otherServer.com
Is this:
user#server.co | uk.user#otherServer.com
or is it:
user#server.co.uk | user#otherServer.com
Or this input:
user#server.intuser#otherServer.com
Is it delimiter u:
user#server.int | ser#otherServer.com
Or delimiter t:
user#server.in | user#otherServer.com
If you're not willing to accept a certain percentage of failures, you're better off looking for ways not to receive this input to begin with.

([^#,]+#[^.]+\.\w{3}(?!,|$)).
Try this. Replace with $1, and see the demo:
http://regex101.com/r/tF4jD3/15
P.S. This will work for e-mail addresses of the format something#something.com.
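A minimal C# sketch of applying that pattern with Regex.Replace. The class and method names (DelimiterFixer, Normalize) are made up for illustration, and the sample data keeps the separator exactly as it is written in this thread. Note the assumptions baked into the pattern: a three-character TLD and exactly one stray delimiter character, so two addresses glued together with nothing in between would lose a character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterFixer
{
    // Pattern from the answer above: an address with a 3-character TLD that is
    // NOT followed by a comma or end of input, plus the one character after it.
    private static readonly Regex StrayDelimiter =
        new Regex(@"([^#,]+#[^.]+\.\w{3}(?!,|$)).");

    // Hypothetical helper: keeps the address (group 1) and swaps the
    // single character that followed it for a comma.
    public static string Normalize(string list)
    {
        return StrayDelimiter.Replace(list, "$1,");
    }
}
```

Calling DelimiterFixer.Normalize on the list from the question should turn the stray separator between the third and fourth address into a comma and leave the existing commas alone.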

I can't think of an elegant way to achieve this. If you don't mind an inelegant solution, you can replace any top level domain plus one character with the same TLD plus comma.
You'll end up replacing ".com#" with ".com,", ".eu*" with ".eu," and so on. The replacement could be done with Regex, so you will need one pass per TLD you want to handle.
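A rough sketch of that idea, assuming you supply the list of TLDs yourself. TldDelimiterFixer and Normalize are hypothetical names. It is as inelegant as advertised: a domain such as "x.company.org" would be broken by the ".com" pass, and a short TLD that is a prefix of a longer one must not be in the same list.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TldDelimiterFixer
{
    // Hypothetical helper: one Regex.Replace pass per TLD.
    public static string Normalize(string list, IEnumerable<string> tlds)
    {
        foreach (string tld in tlds)
        {
            // The TLD followed by any single character that is not already a comma.
            string pattern = Regex.Escape(tld) + "[^,]";
            list = Regex.Replace(list, pattern, tld + ",");
        }
        return list;
    }
}
```

A TLD at the very end of the string has no following character, so it is left untouched.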

One option you could try is to split the incoming string on the # symbol and check that each part of the resulting array has a comma in it, except the first and last.
If you find one that is missing the comma, search for the .com, .net or .org in that element and insert a comma right after it.
Lastly, join the parts back together with the # symbol.

Thanks for the replies.
The string must have only commas as the delimiter.
The example I mentioned was just an illustration. The list was generated by a jQuery plugin with a flaw that was noticed only after it had allowed entries such as "email1#email.comemail2#email.com", or any other combination that deviates from the standard "email1#email.com,email2#email.com", to be saved in the list.
My main concern is cases like "email1#email.com/email2#email.com"
I'm trying to automate a search for this kind of inconsistency, as prevention.
I thought about using regex but I really do not know if it is the best approach.
I am now thinking, as it is not a critical part of the system, it would be a simpler way just to use a list of invalid characters to make the replace.
But I will try vks's solution.
Thank you all.

Related

How to Match a Comma Separated List and End with a Different Character

One project I am currently working on involves writing a parser in C#.
I chose to use Regex to extract the parts of each line. Only one problem... I have very little Regex experience.
My current issue is that I can't get argument lists to work. More specifically, I can't match comma separated lists. After two hours of being stuck, I've turned to SO.
My closest regex so far is:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+\s*)*\)
Obviously, the actual code part is not matched. Only the listed types are wanted.
I removed any and all comma detection code, as it all broke.
I want to make it match void FunctionName(int a, string b) or the equivalent with other spacing.
How can I make this happen?
Please suggest edits before voting to close, I'm bad at Stack Overflowing.
Try it like this:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+(?(?=\s*,\s*\w)\s*,\s*|\s*))*\)
Demo
Explanation:
the crucial part here is the if-else regex a la (?(?=regex)then|else):
(?(?=\s*,\s*\w)\s*,\s*|\s*)
which means: if a type-param pair is followed by a comma and then another word character, consume the comma (with any surrounding whitespace); otherwise consume only optional whitespace.
However, I feel that using regex could turn out to be the wrong choice for your task at hand. There are some lightweight parser frameworks available, e.g. Sprache.
You're actually very close:
(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)
The only difference is the ,? close to the end of the regex, which means an optional comma and will match the comma between variables.
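For what it's worth, here is a small sketch of how the ,? version could be used from C#. SignatureParser and TryParse are invented names. It relies on the fact that .NET keeps every repetition of a group in Groups[n].Captures, so each "type name" pair can be read back individually. Keep in mind that the optional comma also lets through sloppy input such as a trailing comma or a missing comma between parameters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SignatureParser
{
    // The ",?" pattern from the answer above.
    private static readonly Regex Signature = new Regex(
        @"(?:\s|^)(bool|int|string|float|void)\s+(\w+)\s*\(((?:bool|int|string|float)\s+\w+,?\s*)*\)");

    // Hypothetical helper: returns false when the line is not a signature.
    public static bool TryParse(string line, out string returnType,
                                out string name, out List<string> parameters)
    {
        returnType = null;
        name = null;
        parameters = new List<string>();

        Match m = Signature.Match(line);
        if (!m.Success) return false;

        returnType = m.Groups[1].Value;
        name = m.Groups[2].Value;

        // Group 3 repeats once per parameter; each repetition is one Capture.
        foreach (Capture c in m.Groups[3].Captures)
        {
            parameters.Add(c.Value.Trim().TrimEnd(','));
        }
        return true;
    }
}
```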

Prevent Regex from devouring optional part of the match

I've searched extensively but I can't find a simple answer to this and my Regex experience is limited. I'd appreciate a simple solution that is explained, please.
I have a very large string and I need to substitute certain words in it as follows:
Example: wherever you find the string "LINK-ABC" make it "LINK_ABC".
I wrote my Regex Match and Replace strings:
@"LINK-ABC", @"LINK_ABC" and it worked.
But there were a couple of things I had not recognized.
There COULD be words in the file like this:
LINK-ABC-DEF LINK-ABC-GHI-JKL ... and so on.
So I get "LINK_ABC-DEF" etc. (which is NOT what I want; this should have remained intact...)
Once I realized the problem, it seemed that what I REALLY wanted was to recognize ONLY the word being matched and leave unchanged any cases where it was in combination with something else. It seemed to me that if I checked for a space or period after the Match word, that should do it, so...
@"LINK-ABC[ |\\.]",@"LINK_ABC"
... and now I have stumbled.
Sample string:
link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx.
Match/Replace string:
link-xxx[ |\\.],link_xxx
Result string:
link_xxxlink-aaa-sss link-xxx-bbb link_xxxlink_xxx
The replacements are correct, BUT the trailing space or period has been "devoured" and so the result string is wrong.
Is there a way that I can match so that if it matches on space, the replacement will have a space and if it matches on a period, the replacement will have a period? I s'pose I could do 2 separate matches but I'd like to increase my understanding of Regex and do it more elegantly if it is possible.
You should be able to achieve the behavior you want with "capture groups":
var matchstring = @"link-xxx([ \.]|$)";
var fixstr = @"link_xxx$1";
The parentheses around the last part of the matchstring will retain whatever matched inside them, and the $1 in the fixstr will substitute whatever was captured by that group.
I've also modified your punctuation section a little bit, presuming you want to replace a match if it happens to be the last word in the input (by adding the |$). A | inside a character class [] is a literal | character, so I removed it assuming you don't actually expect that in your input.
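To see the capture group approach end to end, here is a small sketch using the sample string from the question. LinkFixer and Fix are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkFixer
{
    // Hypothetical helper: group 1 holds the space, the period, or the empty
    // match at end of input, and $1 writes it back after the replaced word.
    public static string Fix(string text)
    {
        return Regex.Replace(text, @"link-xxx([ \.]|$)", "link_xxx$1");
    }
}
```

With the sample string, the three standalone occurrences of link-xxx are rewritten with their trailing space or period preserved, while link-xxx-bbb is left alone because the next character is a hyphen.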

RegEx to split and extract based strict requirement

I’m using Nintex Workflows with a RegEx action. I believe the RegEx is based on .NET. I need to perform a RegEx on some data that users send me in different formats, depending on the person writing the data.
Test: A-BC12 (1,2,3,4,5,6,7,8,9);
Test: A-DE34 (1,2,3,4, words, 5,6,7,8,9);
Test: AFG56 (1,2,3,4 word, 5);
STOP some extra
My goal is this.
Start the extract after Test:
Capture the last 4 of the alpha numeric before the parenthesis
Capture the numbers only inside the parenthesis
Split each data based on ;
End the whole capture when the word STOP is found.
End results
BC12 (1,2,3,4,5,6,7,8,9);
DE34 (1,2,3,4,5,6,7,8,9);
FG56 (1,2,3,4,5);
I have tried splitting the data, forward lookup and exclude and I can’t seem to get everything to work together. If I have to execute multiple RegEx to achieve my results I’m ok with that.
I’ve tried the following to achieve each one of my goals
(?s)(?<=^.*?Test:\s)[a-zA-Z0-9]+ this only capture the first ABC12 or A-BC12 then stops
[,;] split the data so it is easier to maintain. However the word Test: is captured.
I feel I'm going in the right direction, however I'm missing something or taking the wrong approach. Any help would be greatly appreciated.
If you need to omit the first group you can use this regex: Test:\s*A[^;]*;(.*?)STOP.
That way, you can take $1 and split it on ;.
Edit: Clarifications have rendered the above solution obsolete. I've made new stuff that will directly address your steps:
a. Start the extract after Test:
b. Capture the last 4 of the alpha numeric before the parenthesis
c. Capture the numbers only inside the parenthesis
d. Split each data based on ;
e. End the whole capture when the word STOP is found.
You're actually looking for something like:
Use Test:\s*(.*?)STOP. This addresses steps a and e.
Take $1 and use [A-Z0-9]{4}\s*\(([^)]*)\);. This addresses steps b and d.
Take the $1 from the previous step, and use ([0-9]+) to get the numbers. This will get all the numbers, and if given: 9,10 it will produce two matches: 9 and 10.
You may need to use modifiers, like i for case insensitive, s for single line, and g for global.
I hope this is finally what you're looking for!
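The three steps can be sketched in plain C# roughly as follows. This is not Nintex-specific; TestLineExtractor and Extract are invented names, and one thing differs from the patterns above: the four-character code gets its own capture group so it can be written back out in the result. RegexOptions.Singleline plays the role of the s modifier so that .*? can span the line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TestLineExtractor
{
    // Hypothetical helper returning lines such as "BC12 (1,2,3);".
    public static List<string> Extract(string input)
    {
        var results = new List<string>();

        // Steps a and e: everything after the first "Test:" up to "STOP".
        Match body = Regex.Match(input, @"Test:\s*(.*?)STOP", RegexOptions.Singleline);
        if (!body.Success) return results;

        // Steps b and d: last 4 alphanumerics before "(", one entry per ";".
        // Assumption: the code is captured too (extra group 1).
        foreach (Match entry in Regex.Matches(body.Groups[1].Value,
                                              @"([A-Z0-9]{4})\s*\(([^)]*)\);"))
        {
            // Step c: keep only the numbers inside the parentheses.
            IEnumerable<string> numbers = Regex.Matches(entry.Groups[2].Value, @"[0-9]+")
                                               .Cast<Match>()
                                               .Select(m => m.Value);
            results.Add(entry.Groups[1].Value + " (" + string.Join(",", numbers) + ");");
        }
        return results;
    }
}
```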

RegEx for a specific string pattern

Using C#, I will be handling character arrays of info, looking for the following pattern:
a pipe (0x7C), 2 to 7 pairs of characters, followed by another pipe (0x7C).
Stated another way:
|1122[33][44][55][66][77]|
The character pairs consist of characters whose range is from 33-124 decimal ( '!' to '|').
Pairs 3 through 7 are optional, but occur in order, if they occur, so you could have
|1122| <---shortest
|112233|
|11223344|
|1122334455|
|112233445566|
|11223344556677| <---longest
I want to 1) find out if this pattern exists in the character array, 2) extract the individual pairs. These tasks can be separate. I think the best approach to this would be a RegEx, but so far I haven't been able to dream-up an expression to get the job done.
Is a RegEx the way to go and what would a solution for the RegEx itself be?
Is there a better way?
Chuck
If I understand your question correctly the correct pattern would be:
\|([!-|]{2}){2,7}\|
Or to capture each set
\|([!-|]{2})([!-|]{2})([!-|]{2})?([!-|]{2})?([!-|]{2})?([!-|]{2})?([!-|]{2})?\|
Not sure if the range will work directly like that or not, so you may need to do [A-Za-z!@#$......] if the simplified range doesn't work
Also, I think you don't want to include pipe(|) in the range as it could mess up the rest so [!-{] might be better
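A possible C# sketch, using the [!-{] range suggested above (which excludes the pipe) and assuming the character array is first turned into a string with new string(chars). PairExtractor and ExtractPairs are made-up names. In .NET the short {2,7} form is enough to get the individual pairs, because every repetition of the group is kept in Groups[1].Captures.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairExtractor
{
    // A pipe, 2 to 7 two-character pairs, and a closing pipe.
    private static readonly Regex Frame = new Regex(@"\|([!-{]{2}){2,7}\|");

    // Hypothetical helper: returns the pairs of the first frame found,
    // or an empty list when the pattern does not occur.
    public static List<string> ExtractPairs(string input)
    {
        var pairs = new List<string>();
        Match m = Frame.Match(input);
        if (!m.Success) return pairs;

        foreach (Capture c in m.Groups[1].Captures)
        {
            pairs.Add(c.Value);
        }
        return pairs;
    }
}
```

An empty result answers the "does the pattern exist" question, and a non-empty one gives the pairs in order.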

I need a regular expression to convert US tel number to link

Basically, the input field is just a string. People input their phone number in various formats. I need a regular expression to find and convert those numbers into links.
Input examples:
(201) 555-1212
(201)555-1212
201-555-1212
555-1212
Here's what I want:
(201) 555-1212 - Notice the space is gone
(201)555-1212
201-555-1212
555-1212
I know it should be more robust than just removing spaces, but it is for an internal web site that my employees will be accessing from their iPhone. So, I'm willing to "just get it working."
Here's what I have so far in C# (which should show you how little I know about regular expressions):
strchk = Regex.Replace(strchk, @"\b([\d{3}\-\d{4}|\d{3}\-\d{3}\-\d{4}|\(\d{3}\)\d{3}\-\d{4}])\b", "<a href='tel:$&'>$&</a>", RegexOptions.IgnoreCase);
Can anyone help me by fixing this or suggesting a better way to do this?
EDIT:
Thanks everyone. Here's what I've got so far:
strchk = Regex.Replace(strchk, @"\b(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b", "<a href='tel:$1'>$1</a>", RegexOptions.IgnoreCase);
It is picking up just about everything EXCEPT those with (nnn) area codes, with or without spaces between it and the 7 digit number. It does pick up the 7 digit number and link it that way. However, if the area code is specified it doesn't get matched. Any idea what I'm doing wrong?
Second Edit:
Got it working now. All I did was remove the \b from the start of the string.
Remove the [] and add \s* (zero or more whitespace characters) around each \-.
Also, you don't need to escape the -. (You can take out the \ from \-)
Explanation: [abcA-Z] is a character group, which matches a, b, c, or any character between A and Z.
It's not what you're trying to do.
Edits
In response to your updated regex:
Change [-\.\s] to [-\.\s]+ to match one or more of any of those characters (eg, a - with spaces around it)
The problem is that \b doesn't match the boundary between a space and a (.
Afaik, no phone enters the other characters, so why not replace [^0-9] with '' ?
Here's a regex I wrote for finding phone numbers:
(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}
It's pretty flexible... allows a variety of formats.
Then, instead of killing yourself trying to replace it w/out spaces using a bunch of back references, instead pass the match to a function and just strip the spaces as you wanted.
C#/.net should have a method that allows a function as the replace argument...
Edit: They call it a MatchEvaluator. That example uses a delegate, but I'm pretty sure you could use the slightly less verbose
m => m.Value.Replace(" ", "")
or something. working from memory here.
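A rough sketch of the MatchEvaluator idea. It uses the asker's final pattern with the leading \b removed rather than the more flexible one above, and it assumes the goal is a tel: href without whitespace while the visible link text stays as typed. PhoneLinker and Linkify are invented names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneLinker
{
    // The asker's pattern from the edit above, without the leading \b.
    private static readonly Regex Phone = new Regex(
        @"(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b");

    // Hypothetical helper: the lambda is converted to a MatchEvaluator and
    // is called once per match to build the replacement text.
    public static string Linkify(string text)
    {
        return Phone.Replace(text, m =>
        {
            // Assumption: only whitespace is stripped for the href.
            string dial = Regex.Replace(m.Value, @"\s+", "");
            return "<a href='tel:" + dial + "'>" + m.Value + "</a>";
        });
    }
}
```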
