I have a data source that is comma-delimited, and quote-qualified. A CSV. However, the data source provider sometimes does some wonky things. I've compensated for all but one of them (we read in the file line-by-line, then write it back out after cleansing), and I'm looking to solve the last remaining problem when my regex-fu is pretty weak.
Matching a Quoted String inside of another Quoted String
So here is our example string...
"foobar", 356, "Lieu-dit "chez Métral", Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I am looking to match the substring "chez Métral", in order to replace it with the substring chez Métral. Ideally, in as few lines of code as possible. The final goal is to write the line back out (or return it as a method return value) with the replacement already done.
So our example string would end up as...
"foobar", 356, "Lieu-dit chez Métral, Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I know I could define a pattern such as (?<quotedstring>\"\w+[^,]+\") to match quoted strings, but my regex-fu is weak (database developer, almost never use C#), so I'm not sure how to match another quoted string within the named group quotedstring.
FYI: For those noticing the large integer that is formatted with commas but not quote-qualified, that's already handled. As is the random use of row-delimiters (sometimes CR, sometimes LF). As other problems...
Match it with this regex:
(?<!,\s*|^)"([^",]*)"
and replace each match with $1.
In a C# verbatim string, where " is escaped as "", the pattern becomes
(?<!,\s*|^)""([^"",]*)""
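For anyone who wants to try the pattern above from C#, here is a minimal sketch. The class and method names (NestedQuoteCleaner, Clean) are made up for illustration; only the regex and the sample line come from this thread. It assumes the inner quoted text never contains a comma, which is what the pattern relies on.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedQuoteCleaner
{
    // A quote that is NOT at the start of the line and NOT preceded by a comma
    // (plus optional whitespace), followed by text without quotes or commas,
    // followed by a closing quote. The inner text is captured in group 1.
    private static readonly Regex InnerQuotes =
        new Regex(@"(?<!,\s*|^)""([^"",]*)""");

    // Replaces each inner quoted string with its unquoted content.
    public static string Clean(string line)
    {
        return InnerQuotes.Replace(line, "$1");
    }

    public static void Main()
    {
        string line = "\"foobar\", 356, \"Lieu-dit \"chez Métral\", Chilly, FR\", " +
                      "\"-1,000.09\", 467, \"barfoo\", 1,345,456,235,231, \"935.18\"";
        Console.WriteLine(Clean(line));
    }
}
```

The variable-length lookbehind is fine in .NET, but many other regex engines reject it, so the pattern may not port as-is.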
I am trying to see if a large string contains this line of HTML:
<label ng-class="choiceCaptionClass" class="ng-binding choice-caption">Was this information helpful?</label>
As you can see, this snippet has quotations in multiple places and it's causing problems when I do something like this:
Assert.IsTrue(responseContent.Contains("<label ng-class="choiceCaptionClass" class="ng - binding choice - caption">Was this information helpful?</label>"));
I've tried both of these ways of defining the string:
@"<label ng-class=""choiceCaptionClass"" class=""ng - binding choice - caption"">Was this information helpful?</label>"
and
"<label ng-class=\"choiceCaptionClass\" class=\"ng - binding choice - caption\">Was this information helpful?</label>"
But in each case the Contains() method looks for the literal string with either the double quotes or the backslashes. Is there another way I could define this string so I can correctly search for it?
Escaping the double-quotes with backslashes is the proper thing to do.
The reason your search may be failing is that the strings don't actually match. For example, in your version with backslashes, you have spaces around some of the dashes but your HTML string does not.
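To convince yourself that neither form puts backslashes or doubled quotes into the actual string, here is a small sketch. The class name and the responseContent value are made up for illustration; the label markup is the one from the question, written without the stray spaces around the dashes.

```csharp
using System;

public static class QuoteEscapingDemo
{
    // Regular literal: each quote is escaped with a backslash.
    public const string Escaped =
        "<label ng-class=\"choiceCaptionClass\" class=\"ng-binding choice-caption\">Was this information helpful?</label>";

    // Verbatim literal: each quote is escaped by doubling it.
    public const string Verbatim =
        @"<label ng-class=""choiceCaptionClass"" class=""ng-binding choice-caption"">Was this information helpful?</label>";

    public static void Main()
    {
        // Hypothetical response body that contains the label.
        string responseContent = "<div>" + Escaped + "</div>";

        Console.WriteLine(Escaped == Verbatim);                // both literals produce the same string
        Console.WriteLine(responseContent.Contains(Verbatim)); // so either one can be searched for
        Console.WriteLine(Escaped.Contains("\\"));             // no backslash ends up in the string
    }
}
```

The escapes only exist in the source code; the compiled string holds plain double quotes either way.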
Try using regular expressions. I made this one for you, but you can test your own regex in an online regex tester.
var regex = new Regex(@"<label\s+ng-class\s*=\s*""choiceCaptionClass""\s+class\s*=\s*""ng-binding choice-caption""\s*>\s*Was this information helpful\?\s*</label>", RegexOptions.IgnoreCase);
Assert.IsTrue(regex.IsMatch(responseContent));
If this does not work, use the tester tool to figure out which part of the pattern is off.
Hope this helps!
I use Html-Agility-Pack to extract information from some websites. In the process I get data in the form of string and I use that data in my program.
Sometimes the data I get includes multiple details in the single string. As the name of this Movie "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)" rather than the first one.
Above is the one example from many. In order to correct the issue I tried to use string.Distinct() method but it would remove all the duplicate characters in the string as in above example it would return "Dog Eats (2012)". Now it solved my initial problem by removing the 2nd (2012) but created a new one by changing the actual title.
I thought my problem could be solved with Regex but I have no idea as to how I can use it here. As far as I know if I use Regex it would tell me that there are duplicate items in the string according to the defined Regex code.
But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)".
Now the actual title is "Meme 2013" with year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.
The duplicate year always comes in the end of the string. So what should be the Regex that I would use to determine that the string actually has two years in it, like (2012) (2012)?
If I can correctly identify the string contains duplicate maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is any better way to do it please let me know.
Thanks.
The right regex is "( \(\d{4}\))\1+".
string pattern = @"( \(\d{4}\))\1+";
string result = new Regex(pattern).Replace(s, "$1");
Example here : https://repl.it/Evcy/2
Explanation:
Capture one " (dddd)" block, and remove all following identical ones.
( \(\d{4}\)) does the capture, \1+ finds any non empty sequence of that captured block
Finally, replace the initial block and its copies by the initial block alone.
This regex will allow for any pattern of whitespace, even none, as in (2013)(2013)
`@"(\(\d{4}\))(?:\s*\1)+"`
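Here is a short sketch of the whitespace-tolerant pattern in use. The class and method names are made up for illustration; the pattern and the sample titles come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DuplicateYearRemover
{
    // Group 1 captures one "(dddd)" block; the non-capturing group then
    // matches one or more repeats of it, each preceded by optional whitespace.
    private static readonly Regex RepeatedYear =
        new Regex(@"(\(\d{4}\))(?:\s*\1)+");

    // Replaces the block and its repeats with the block alone.
    public static string Collapse(string title)
    {
        return RepeatedYear.Replace(title, "$1");
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("Dog Eats Dog (2012) (2012)"));
        Console.WriteLine(Collapse("Meme 2013 (2013)(2013)"));
    }
}
```

The bare 2013 in "Meme 2013" is left alone because the pattern only matches a parenthesised year that is immediately repeated.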
I have a list like :
george fg
michel fgu
yasser fguh
I would like to replace fg, fgu, and fguh by "fguhCool" I already tried something like this :
foreach (var ignore in NameToPoulate)
{
    tempo = ignore.Replace("fg", "fguhCool");
    NameToPoulate_t.Add(tempo);
}
But then "fgu" becomes "fguhCoolu" and "fguh" becomes "fguhCooluh". Is there a better idea?
Thanks for your help.
I assume that this is a homework assignment and that you are being tested on the specific algorithm rather than on any code that does the job.
This is probably what your teacher has in mind:
Students will realize that the code should check for "fguh" first, then "fgu" then "fg". The order is important because replacing "fg" will, as you have noticed, destroy a "fguh".
This will by some students be implemented as a loop with if-else conditions in them. So that you will not replace a "fg" that is within an already replaced "fguhCool".
But then you will find that the algorithm breaks down if "fg" and "fgu" are both within the same string. The presence of "fgu" must not prevent you from checking for "fg" in a different part of the string.
The answer that your teacher is looking for is probably that you should first locate "fguh", "fgu" and "fg" (in that order) and replace them with an intermediary string that doesn't contain "fg". Then after you have done that, you can search for that intermediary string and replace it with "fguhCool".
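The intermediary-string idea could look roughly like this. It is only a sketch: the class name, the method name and the choice of '\u0001' as the placeholder are my own assumptions, and it only works if that placeholder character never appears in the input.

```csharp
using System;

public static class PlaceholderReplace
{
    // Assumption: this control character never occurs in the real data.
    private const string Placeholder = "\u0001";

    public static string Normalize(string name)
    {
        return name
            .Replace("fguh", Placeholder)      // longest token first
            .Replace("fgu", Placeholder)
            .Replace("fg", Placeholder)
            .Replace(Placeholder, "fguhCool"); // final value goes in last
    }

    public static void Main()
    {
        foreach (var name in new[] { "george fg", "michel fgu", "yasser fguh" })
        {
            Console.WriteLine(Normalize(name));
        }
    }
}
```

Because the placeholder contains no "fg", the later replacements cannot touch text that an earlier step has already handled.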
You could use regular expressions:
tempo = Regex.Replace(ignore, @"\bfg\b", "fguhCool");
The \b matches a so-called word boundary, which means it matches the beginning or end of a word (roughly, but close enough for this purpose).
Use a regular expression:
tempo = Regex.Replace(ignore, "fg(uh?)?", "fguhCool");
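A runnable sketch of the single-pattern approach, for reference. The static Regex.Replace takes the input string as its first argument; the class and method names here are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SuffixNormalizer
{
    // "fg", optionally followed by "u", optionally followed by "h".
    // The optional parts are greedy, so the longest form is matched in one go.
    public static string Normalize(string name)
    {
        return Regex.Replace(name, "fg(uh?)?", "fguhCool");
    }

    public static void Main()
    {
        var names = new List<string> { "george fg", "michel fgu", "yasser fguh" };
        foreach (var name in names)
        {
            Console.WriteLine(Normalize(name));
        }
    }
}
```

Note that this is not idempotent: running it again on "fguhCool" matches the leading "fguh" and gives "fguhCoolCool", so apply it once per string.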
An alternative would be replacing the long words for the short ones first, then replacing the short for the end value (I'm assuming all words - "fg", "fgu" and "fguh" - would map to the same value "fguhCool", right?)
tempo = ignore
.Replace("fguh", "fg")
.Replace("fgu", "fg")
.Replace("fg", "fguhCool");
Note: That assumes those words can appear anywhere in the string. If you're worried about whole words (i.e. cases where those words are not substrings of a bigger word), then see @Joey's answer (in this case, simple substitutions won't do, regexes are really the best option).
Let's say I have this string:
"param1,r:1234,p:myparameters=1,2,3"
...and I would like to split it into:
param1
r:1234
p:myparameters=1,2,3
I've used the split function and of course it splits it at every comma. Is there a way to do this using regex or will I have to write my own split function?
Personally, I would try something like this:
,(?=[^,]+:.*?)
Basically, use a positive look-ahead to find a comma followed by a "key-value" pair (defined here as a key, a colon, and more information [data], including other commas). This should disqualify the commas between the numbers, too.
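In C# the look-ahead pattern can be handed straight to Regex.Split. A minimal sketch follows; the class and method names are made up, and it assumes that every item after the first starts with a comma-free key followed by a colon.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyValueSplitter
{
    // Split only on commas that are followed by "key:" where the key has no comma.
    // The look-ahead consumes nothing, so the key stays in the next item.
    public static string[] SplitPairs(string input)
    {
        return Regex.Split(input, @",(?=[^,]+:.*?)");
    }

    public static void Main()
    {
        foreach (var part in SplitPairs("param1,r:1234,p:myparameters=1,2,3"))
        {
            Console.WriteLine(part);
        }
    }
}
```

The trailing .*? in the pattern matches nothing useful and could be dropped. Also note that a value which itself contains a colon after a comma (say p:x=1,b:2) would be split in the wrong place.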
You could use ; to separate the values, which makes the string easy to work with.
Since , is used both as the separator and inside the values, it is difficult to split.
You have
string str = "param1,r:1234,p:myparameters=1,2,3";
I recommend using
string str = "param1;r:1234;p:myparameters=1,2,3";
which can be split as
var strArray = str.Split(';');
strArray[0]; // contains param1
strArray[1]; // r:1234
strArray[2]; // p:myparameters=1,2,3
I'm not sure how you would write a split that knew which commas to split on there, honestly.
Unless it's a fixed number each time, in which case just use the String.Split overload that takes an int specifying the maximum number of substrings to return.
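If the number of items really is fixed, the count overload looks like this. It is a sketch with made-up class and method names; the count of 3 is an assumption taken from the sample string.

```csharp
using System;

public static class FixedCountSplitter
{
    // Split(char[], int) stops splitting once maxParts substrings exist;
    // the last substring keeps the rest of the input, commas included.
    public static string[] SplitFirst(string input, int maxParts)
    {
        return input.Split(new[] { ',' }, maxParts);
    }

    public static void Main()
    {
        foreach (var part in SplitFirst("param1,r:1234,p:myparameters=1,2,3", 3))
        {
            Console.WriteLine(part);
        }
    }
}
```

This only helps when the field that may contain commas is the last one.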
If you're going to have comma-delimited data that's not always a fixed number of items and it could have literal commas in the data itself, they really should be quoted. If you can control the input in any way, you should encourage that, and use an actual CSV parser instead of String.Split
That depends. You can't parse it with regex (or anything else) unless you can identify a consistent rule separating one group from another. Based on your sample, I can't clearly identify such a rule (though I have some guesses). How does the system know that p:myparameters=1,2,3 is a single item? For example, if there were another item after it, what would be the difference between that and the 1,2,3? Figure that out and you'll be pretty close to a solution.
If you're able to change the format of the input string, why not decide on a consistent delimiter between your groups? ; would be a good choice. Use an input like param1;r:1234;p:myparameters=1,2,3 and there will be no ambiguity where the groups are, plus you can just split on ; and you won't need regex.
The simplest approach would be changing your delimiter from "," to something like "|". Then you can split on "|" no problem. However if you can't change the delimiting character then maybe you could encode the sections in a fashion similar to CSV.
CSV files have the same issue... the standard there is to put double quotes "" around columns.
For example, your string would be "param1","r:1234","p:myparameters=1,2,3".
Then you could use Microsoft.VisualBasic.FileIO.TextFieldParser to split/parse. You can use it from C# even though it's in the VisualBasic namespace.
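A minimal sketch of TextFieldParser on a single quoted line. The wrapper class and method are made up for illustration, and the project needs a reference to the Microsoft.VisualBasic assembly.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedFieldSplitter
{
    // Parses one comma-delimited line in which fields may be wrapped in
    // double quotes; commas inside quotes stay part of the field.
    public static string[] Split(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }

    public static void Main()
    {
        string line = "\"param1\",\"r:1234\",\"p:myparameters=1,2,3\"";
        foreach (var field in Split(line))
        {
            Console.WriteLine(field);
        }
    }
}
```

For whole files you would normally construct the parser from a path or stream and loop while EndOfData is false.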
Do you mean this?
string[] str = System.Text.RegularExpressions.Regex.Split("param1,r:1234,p:myparameters=1,2,3", @"\,");
I'd like to String.Split() the following string using a comma as the delimiter:
John,Smith,123 Main Street,212-555-1212
The above content is entered by a user. If they enter a comma in their address, the resulting string would cause problems to String.Split() since you now have 5 fields instead of 4:
John,Smith,123 Main Street, Apt 101,212-555-1212
I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:
value = value.Replace(",", "*");
However, this can still be fooled if a user happens to use the placeholder delimiter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.
I see solutions online for dealing with escaped delimiters, but I haven't found a solution for this seemingly common situation. What am I missing?
EDIT: This is called delimiter collision.
This is a common scenario: you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in the structure around them.
You have several options:
Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallow commas.
Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output)
Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.
If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)
This may not be an option for you, but would it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first place?
If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.
John,Smith,"123 Main Street, Apt. 6",212-555-1212
One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.
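A rough sketch of the Base64 round trip, assuming UTF-8 text fields. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Base64Fields
{
    // Encode every value, then join with commas. The standard Base64
    // alphabet (A-Z, a-z, 0-9, +, /, =) contains no comma.
    public static string Join(params string[] values)
    {
        return string.Join(",",
            values.Select(v => Convert.ToBase64String(Encoding.UTF8.GetBytes(v))));
    }

    // Split on commas, then decode every part back to the original text.
    public static string[] Split(string line)
    {
        return line.Split(',')
            .Select(p => Encoding.UTF8.GetString(Convert.FromBase64String(p)))
            .ToArray();
    }

    public static void Main()
    {
        string line = Join("John", "Smith", "123 Main Street, Apt 101", "212-555-1212");
        Console.WriteLine(line);
        foreach (var field in Split(line))
        {
            Console.WriteLine(field);
        }
    }
}
```

The trade-off is that the stored line is no longer human-readable and grows by roughly a third.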
You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.
This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.
The best solution is to define the valid characters, reject invalid characters somehow, and then use an invalid character (which will not appear in the input, since it is "banned") as your delimiter.
Don't allow the user to enter the character that you are using as a delimiter. I personally feel this is the best way.
Funny solution (works if the address is the only field with a comma):
Split the string by comma. The first two pieces will be the first name and last name; the last piece is the telephone number. Take those away. Join the rest back together with commas, and that would be the address ;)
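That idea could be sketched like this. The class and method names are made up, and it assumes exactly the four-field layout from the question, with the address as the only field that may contain commas.

```csharp
using System;

public static class AddressInTheMiddle
{
    // Returns { first name, last name, address, phone }.
    public static string[] Parse(string line)
    {
        string[] parts = line.Split(',');
        if (parts.Length < 4)
        {
            throw new FormatException("Expected at least 4 comma-separated pieces.");
        }

        // Everything between the first two pieces and the last one is the address.
        string address = string.Join(",", parts, 2, parts.Length - 3);
        return new[] { parts[0], parts[1], address, parts[parts.Length - 1] };
    }

    public static void Main()
    {
        foreach (var field in Parse("John,Smith,123 Main Street, Apt 101,212-555-1212"))
        {
            Console.WriteLine(field);
        }
    }
}
```

It falls apart as soon as any other field contains a comma.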
In a sense, the user is already "escaping" the comma with the space afterward.
So, try this:
string[] values = Regex.Split(value, ",(?![ ])");
The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.
One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. The user can still break it if they are lazy, though what they'll actually break is the fields after Address 2.
Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?
The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.
Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).