Let's say I have this string:
"param1,r:1234,p:myparameters=1,2,3"
...and I would like to split it into:
param1
r:1234
p:myparameters=1,2,3
I've used the split function and of course it splits it at every comma. Is there a way to do this using regex or will I have to write my own split function?
Personally, I would try something like this:
,(?=[^,]+:.*?)
Basically, use a positive look-ahead to find a comma that is followed by a "key-value" pair, defined here as a key, a colon, and then more data (which may include other commas). This also disqualifies the commas between the numbers, because no colon follows them before the next comma.
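A minimal sketch of how that look-ahead could be applied with Regex.Split. The class and method names (KeyValueSplitter, Split) are made up for illustration, and the trailing .*? is dropped because it does not change which commas match. It assumes that keys never contain commas and that plain values never contain a colon.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class KeyValueSplitter
{
    // Splits on a comma only when the text after it looks like "key:...",
    // i.e. one or more non-comma characters followed by a colon.
    public static string[] Split(string input)
    {
        return Regex.Split(input, @",(?=[^,]+:)");
    }
}
```

Calling KeyValueSplitter.Split on the sample string should give the three items from the question; a value such as 1,2:3 would still be split in the wrong place.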
You can use ; to separate the values, which makes the string easy to work with.
Since , is used both as the separator and inside the values, the string is difficult to split.
You have
string str = "param1,r:1234,p:myparameters=1,2,3";
I recommend using
string str = "param1;r:1234;p:myparameters=1,2,3";
which can be split as
var strArray = str.Split(';');
strArray[0]; // contains param1
strArray[1]; // r:1234
strArray[2]; // p:myparameters=1,2,3
I'm not sure how you would write a split that knew which commas to split on there, honestly.
Unless it's a fixed number each time, in which case just use the String.Split overload that takes an int specifying the maximum number of substrings to return.
If you're going to have comma-delimited data that's not always a fixed number of items and it could have literal commas in the data itself, they really should be quoted. If you can control the input in any way, you should encourage that, and use an actual CSV parser instead of String.Split.
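As a rough illustration of the count-limited overload, assuming the input always has exactly three items and only the last one may contain commas (LimitedSplit and SplitAtMost are hypothetical names):

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class LimitedSplit
{
    // Uses String.Split(char[], int): at most maxParts substrings are
    // returned, and the last one keeps the rest of the input unsplit.
    public static string[] SplitAtMost(string input, int maxParts)
    {
        return input.Split(new[] { ',' }, maxParts);
    }
}
```

With maxParts set to 3, the sample string from the question yields param1, r:1234 and p:myparameters=1,2,3. This only helps when the field that contains commas is the last one.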
That depends. You can't parse it with regex (or anything else) unless you can identify a consistent rule separating one group from another. Based on your sample, I can't clearly identify such a rule (though I have some guesses). How does the system know that p:myparameters=1,2,3 is a single item? For example, if there were another item after it, what would be the difference between that and the 1,2,3? Figure that out and you'll be pretty close to a solution.
If you're able to change the format of the input string, why not decide on a consistent delimiter between your groups? ; would be a good choice. Use an input like param1;r:1234;p:myparameters=1,2,3 and there will be no ambiguity where the groups are, plus you can just split on ; and you won't need regex.
The simplest approach would be changing your delimiter from "," to something like "|". Then you can split on "|" no problem. However if you can't change the delimiting character then maybe you could encode the sections in a fashion similar to CSV.
CSV files have the same issue... the standard there is to put double quotes "" around columns.
For example, your string would be "param1","r:1234","p:myparameters=1,2,3".
Then you could use Microsoft.VisualBasic.FileIO.TextFieldParser to split/parse it. You can use this from C# even though it's in the VisualBasic namespace.
TextFieldParser
Do you mean this: string[] str = System.Text.RegularExpressions.Regex.Split("param1,r:1234,p:myparameters=1,2,3", @"\,");
I want to import data from a CSV file, But some cells contain comma in string value. How can I recognize which comma is for separate and which is in cell content?
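Use TextFieldParser. Usage:
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll
...
// reader is a TextReader over the CSV data, e.g. a StreamReader
using (var csvReader = new TextFieldParser(reader))
{
    csvReader.SetDelimiters(new string[] { "," });
    csvReader.HasFieldsEnclosedInQuotes = true;
    while (!csvReader.EndOfData)
    {
        // each call returns the fields of one row
        string[] fields = csvReader.ReadFields();
    }
}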
In general, do not bother writing the import yourself.
I have good experiences with the FileHelpers lib.
http://www.filehelpers.com/
And indeed, I hope your fields are quoted. FileHelpers supports this out of the box.
Otherwise there is not much you can do.
Unless you have quotes around the strings you are pretty much hosed, hence the "quote and comma" delimiter style. If you have control of the export facility, then you must select "enclose strings in quotes" or change the delimiter to something like a tilde or caret symbol.
If not, then you have to write some code. If you detect "a..z", start counting commas and keep working through the string until you detect [0..9], and even then this is going to be problematic, since people can put a [0..9] in their text. At best this is going to be a best-effort process. You're going to have to know when you are in chars and when you are not. I doubt even regex will help you much on this.
The only other thing I can think of is to run through the data and look for commas, then look at what comes before and after each comma. If it is surrounded by chars, replace the comma with an alternate char like the caret "^" or the tilde "~". Then process the file as normal, and afterwards go back and replace the alternate char with a comma.
Good luck.
Using FileHelpers is definitely the way to go. They have done a great job building all the logic for you. I had the same issue, where I had to parse a CSV file with commas as part of a field, and this utility did the job very well. All you have to do is put the following attribute on the field:
[FieldQuoted('"', QuoteMode.OptionalForBoth)]
For details http://www.filehelpers.com/forums/viewtopic.php?f=12&t=391
We can also use Regex, as below.
Regex CSVParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String[] Fields = CSVParser.Split(Test);
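To make the behaviour of that pattern concrete, here is a small sketch that wraps it in a hypothetical helper (QuotedCsvRegex.Split) and strips the surrounding quotes afterwards. The pattern only matches a comma that is followed by an even number of double quotes, so it assumes the quotes in the line are balanced and it does not handle doubled quotes inside a field.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class QuotedCsvRegex
{
    // Same pattern as above, written as a verbatim string.
    private static readonly Regex CsvParser =
        new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

    public static string[] Split(string line)
    {
        // Split on commas outside quotes, then drop the enclosing quotes.
        return CsvParser.Split(line).Select(f => f.Trim('"')).ToArray();
    }
}
```

For a line such as Dog,Cat,"100,100",Fish this should return four fields, with 100,100 kept together. A dedicated CSV parser remains the more robust choice.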
I'm working on a program that takes a CSV file and splits each line on ",". The issue I have is that there are thousands separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In the CSV file, there are four cells, with the values "Dog", "Cat", "100,100", "Fish". When I split on the "," into an array of strings, it contains 5 elements, when what I want is 4. Does anyone know a way to work around this?
Thanks
There are two common mistakes made when reading CSV data: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.FileIO.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implementation here on Stack Overflow.
Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, ie: the TextFieldParser class or a 3rd party library), and avoid this issue.
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.
You might want to have a look at the free open-source project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format.
Well, you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements within "".
Don't just split on the , split on ", ".
Better still, use a CSV library from google or codeplex etc
Reading a CSV file in .NET?
You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);
I ran into a similar issue with fields containing line feeds. I'm not convinced this is elegant, but... I basically chopped the input into lines, then if a line didn't start with a text delimiter, I appended it to the line above.
You could try something like this: step through each field; if the field has an end text delimiter, move to the next; if not, grab the next field, append it, and rinse and repeat until you do have an end delimiter (this allows for 1,000,000,000 etc.).
(I'm caffeine deprived and hungry. I did write some code, but it was so ugly I didn't even post it.)
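The stepping-through idea could look roughly like the sketch below. It is not the code the answer mentions; QuotedFieldMerger and Merge are hypothetical names, and it assumes the text delimiter is a double quote and that quotes are not doubled or escaped inside a field.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class QuotedFieldMerger
{
    public static List<string> Merge(string line)
    {
        var result = new List<string>();
        string pending = null; // a quoted field that has not been closed yet

        foreach (string piece in line.Split(','))
        {
            if (pending == null)
            {
                // A piece that opens a quote without closing it is incomplete.
                bool opens = piece.StartsWith("\"", StringComparison.Ordinal);
                bool closes = piece.Length > 1 && piece.EndsWith("\"", StringComparison.Ordinal);
                if (opens && !closes) pending = piece;
                else result.Add(piece.Trim('"'));
            }
            else
            {
                // Put back the comma that Split removed and keep appending.
                pending += "," + piece;
                if (piece.EndsWith("\"", StringComparison.Ordinal))
                {
                    result.Add(pending.Trim('"'));
                    pending = null;
                }
            }
        }

        // An unterminated quote is kept as one trailing field.
        if (pending != null) result.Add(pending.Trim('"'));
        return result;
    }
}
```

For Dog,Cat,"1,000,000",Fish this should produce four fields, with the number reassembled as 1,000,000.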
Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.
I want to build a comma-separated list so that I can split on the comma later to get an array of the values. However, the values may have commas in them. In fact, they may have any normal keyboard character in them (they are supplied by a user). What is a good strategy for choosing a character you are sure will not collide with the values?
In case this matters in a language dependent way, I am building the "some character" separated list in C# and sending it to a browser to be split in javascript.
If JavaScript is consuming the list, why not send it in the form of a JavaScript array? It already has an established and reliable method for representing a list and escaping characters.
["Value 1", "Value 2", "Escaped \"Quotes\"", "Escaped \\ Backslash"]
You could split it by a null character, and terminate your list with a double null character.
I always use | but if you still think that it can contain it, you can use combinations like #|#. For example:
"string one#|#string two#|#...#|#last string"
Eric S. Raymond wrote a book chapter on this that you might find useful. It is directed toward Unix users but should still apply.
As for your question, if you will have commas within cells, then you will need some form of escaping. Using \, is a standard way, but you will also have to escape backslashes, which are also common.
Alternatively, use another character such as the pipe (|), tab, or something else of your choice. If users need to work with the data using a spreadsheet program, you can usually add filter rules to split cells on the delimiter of your choice. If this is a concern, it's probably best to choose a delimiter that users can easily type, which excludes the nul char, among others.
You could also use quoting:
"value1", "value2", "etc"
In which case, you will only need to escape quotes (and backslashes). This should also be accepted by spreadsheets given the correct filter options.
There are several ways to do this. The first is to select a separator character that would not normally be input from the keyboard; NULL or TAB are normally good. The second is to use a character sequence as a separator. Excel CSV files are a good example, where the cell values are enclosed in quotes with commas separating the cells.
The answer is dependent on whether you want to reinvent the wheel or not.
If there is potential for any splitting character to appear in your strings, then I would suggest that you write a script element to your output with a JavaScript array definition in it. For example:
<script>
var myVars=new Array();
myVars[0]="abc|#123$";
myVars[1]="123*456";
myVars[2]="blah|blah";
</script>
Your javascript can then reference that array
Doing this also avoids the need to create a comma-separated string from your C# string array.
The only gotcha I can think of is strings that contain quotes; in that case you would have to escape them in C# when writing them out to the myVars output.
There is an RFC which documents the CSV format. Follow the standards and you will avoid reinventing the wheel and creating a mess for the next guy to come along and maintain your code. The nice thing is that there are libraries available to import/export CSV for just about any platform you can imagine.
That said, if you are serialising data to send to a browser, JSON is really the way to go. It too is documented in an RFC, and you can get libraries for just about any platform, such as JSON.NET.
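As a sketch of the JSON route on the C# side, the example below uses System.Text.Json, which ships with .NET Core 3.0 and later; that choice is an assumption, and JSON.NET would work in much the same way. JsonList, ToJson and FromJson are hypothetical names.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, not part of any library.
public static class JsonList
{
    // Serializes the values as a JSON array; the serializer takes care of
    // escaping commas, quotes and backslashes inside each value.
    public static string ToJson(string[] values)
    {
        return JsonSerializer.Serialize(values);
    }

    // Reads the array back, mainly to show that the round trip is lossless.
    public static string[] FromJson(string json)
    {
        return JsonSerializer.Deserialize<string[]>(json);
    }
}
```

On the browser side, JSON.parse on the resulting string gives an array directly, so no delimiter has to be chosen at all.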
I'd like to String.Split() the following string using a comma as the delimiter:
John,Smith,123 Main Street,212-555-1212
The above content is entered by a user. If they enter a comma in their address, the resulting string would cause problems to String.Split() since you now have 5 fields instead of 4:
John,Smith,123 Main Street, Apt 101,212-555-1212
I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:
value = value.Replace(",", "*");
However, this can still be fooled if a user happens to use the placeholder delimiter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.
I see solutions online for dealing with escaped delimiters, but I haven't found a solution for this seemingly common situation. What am I missing?
EDIT: This is called delimiter collision.
This is a common scenario — you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in structure around them.
You have several options:
Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallow commas.
Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output)
Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.
If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)
This may not be an option for you, but would it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first place?
If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.
John,Smith,"123 Main Street, Apt. 6",212-555-1212
One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.
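A minimal sketch of the Base64 idea, assuming the values are encoded as UTF-8 before conversion (Base64List, Join and Split are hypothetical names). The standard Base64 alphabet contains no comma, so the delimiter cannot collide with the encoded values.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of any library.
public static class Base64List
{
    // Encodes each value separately, then joins the results with commas.
    public static string Join(params string[] values)
    {
        return string.Join(",", values.Select(
            v => Convert.ToBase64String(Encoding.UTF8.GetBytes(v))));
    }

    // Splits on commas and decodes each piece back to the original text.
    public static string[] Split(string joined)
    {
        return joined.Split(',')
            .Select(s => Encoding.UTF8.GetString(Convert.FromBase64String(s)))
            .ToArray();
    }
}
```

The cost is that the stored string is no longer human-readable and grows by roughly a third.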
You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.
This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.
The best solution is to define valid characters and reject invalid characters somehow, then use an invalid character (which will not appear in the input, since it is "banned") as your delimiter.
Don't allow the user to enter the character that you are using as a delimiter. I personally feel this is the best way.
Funny solution (works if the address is the only field with a comma):
Split the string by comma. The first two pieces will be the first name and last name; the last piece is the telephone number. Take those away. Join the rest back together with commas, and that is the address ;)
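That recombination could be sketched as follows, assuming exactly four logical fields in the order first name, last name, address, phone, with the address as the only field that may contain commas (ContactLine and Parse are hypothetical names):

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class ContactLine
{
    public static string[] Parse(string line)
    {
        string[] pieces = line.Split(',');
        if (pieces.Length < 4)
            throw new FormatException("Expected at least four comma-separated pieces.");

        // Everything between the first two pieces and the last one is the address.
        string address = string.Join(",", pieces, 2, pieces.Length - 3);
        return new[] { pieces[0], pieces[1], address, pieces[pieces.Length - 1] };
    }
}
```

For the example input with "Apt 101", the third element comes back as 123 Main Street, Apt 101.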
In a sense, the user is already "escaping" the comma with the space afterward.
So, try this:
string[] values = Regex.Split(value, ",(?![ ])");
The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.
One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. Users can still break it if they are lazy, though what they'll actually break is the fields after Address 2.
Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?
The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.
Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).
For a one-shot operation, I need to parse the contents of an XML string and change the numbers of the "ID" field. However, I cannot risk changing anything else in the string, e.g. whitespace, line feeds, etc. MUST remain as they are!
Since my experience is that XmlReader tends to mess up whitespace and may even reformat the XML, I don't want to use it (but feel free to convince me otherwise). This also screams for regex, but I'm not good at regex, particularly not with the .NET implementation.
Here's a short part of the string; the number of the ID field needs to be updated in some cases. There can be many such VAR entries in the string. So I need to convert each ID to Int32, compare and modify it, then put it back into the string.
<VAR NAME="sf_name" ID="1001210">
I am looking for the simplest (in terms of coding time) and safest way to do this.
The regex pattern you are looking for is:
ID="(\d+)"
Match group 1 would contain the number. Use a MatchEvaluator Delegate to replace matches with dynamically calculated replacements.
Regex r = new Regex("ID=\"(\\d+)\"");
string outputXml = r.Replace(inputXml, new MatchEvaluator(ReplaceFunction));
where ReplaceFunction is something like this:
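public string ReplaceFunction(Match m)
{
    // m.Groups[1] holds the digits captured by (\d+)
    int id = int.Parse(m.Groups[1].Value);
    // compare and modify id here as needed
    // the whole match is replaced, so rebuild the ID="..." text around the number
    return "ID=\"" + id + "\"";
}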
If you need I can expand the Regex to match more specifically. Currently all ID values (that contain numbers only) are replaced. You can also build that bit of "extra intelligence" into the match evaluator function and make it return the match unchanged if you don't want to change it.
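Putting the pieces together, a self-contained sketch might look like this. IdRewriter, Renumber and the map parameter are hypothetical names; the actual rule for deciding which IDs change is not given in the question, so it is passed in by the caller.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class IdRewriter
{
    // Rewrites every ID="digits" occurrence and leaves all other text,
    // including whitespace and line breaks, exactly as it was.
    public static string Renumber(string xml, Func<int, int> map)
    {
        return Regex.Replace(xml, "ID=\"(\\d+)\"", m =>
        {
            int oldId = int.Parse(m.Groups[1].Value);
            // The whole match is replaced, so the ID="..." wrapper is rebuilt.
            return "ID=\"" + map(oldId) + "\"";
        });
    }
}
```

For example, IdRewriter.Renumber(xml, id => id > 1000000 ? id + 1 : id) would bump only the IDs above one million. Note that leading zeros in an ID are lost by the int conversion, and an ID that does not fit in an Int32 would make int.Parse throw.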
Take a look at the PreserveWhitespace property of the XmlDocument class.
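A rough sketch of the XmlDocument route, assuming the string is a well-formed XML document and that every ID attribute on a VAR element is numeric (XmlIdUpdater, AddOffset and the offset rule are hypothetical). PreserveWhitespace keeps the whitespace between elements; whitespace inside a tag and the quote style of attributes may still be normalized when the document is written back out, so the regex approach is the safer one if the output must match byte for byte.

```csharp
using System;
using System.Linq;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XmlIdUpdater
{
    public static string AddOffset(string xml, int offset)
    {
        // PreserveWhitespace must be set before the document is loaded.
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(xml);

        // Copy the live node list first so that edits cannot disturb the loop.
        var vars = doc.GetElementsByTagName("VAR").Cast<XmlElement>().ToList();
        foreach (XmlElement v in vars)
        {
            if (!v.HasAttribute("ID")) continue;
            int id = int.Parse(v.GetAttribute("ID"));
            v.SetAttribute("ID", (id + offset).ToString());
        }
        return doc.OuterXml;
    }
}
```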