C# Regex Split To Java Pattern split

I have to port some C# code to Java and I am having some trouble converting a string splitting command.
While the actual regex is still correct, when splitting in C# the regex tokens are part of the resulting string[], but in Java the regex tokens are removed.
What is the easiest way to keep the split-on tokens?
Here is an example of C# code that works the way I want it:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        String[] values = Regex.Split("5+10", @"([\+\-\*\(\)\^\\/])");
        foreach (String value in values)
            Console.WriteLine(value);
    }
}
Produces:
5
+
10

I don't know how C# does it, but to accomplish it in Java, you'll have to approximate it. Look at how this code does it:
// Assumes the enclosing class has two fields: a compiled java.util.regex.Pattern
// named pattern and a boolean named keep_delimiters.
public String[] split(String text) {
    if (text == null) {
        text = "";
    }
    int last_match = 0;
    LinkedList<String> splitted = new LinkedList<String>();
    Matcher m = this.pattern.matcher(text);
    // Iterate through each match
    while (m.find()) {
        // Text since last match
        splitted.add(text.substring(last_match, m.start()));
        // The delimiter itself
        if (this.keep_delimiters) {
            splitted.add(m.group());
        }
        last_match = m.end();
    }
    // Trailing text
    splitted.add(text.substring(last_match));
    return splitted.toArray(new String[splitted.size()]);
}

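This is because your pattern wraps the split token in a capturing group. In .NET, Regex.Split includes the text of any captured groups in the resulting array, so the token itself is retained. Java's String.split and Pattern.split do not support this: they always discard the matched delimiter, whether or not it is captured.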
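One way to approximate the C# result in Java without a manual Matcher loop is to split on zero-width lookarounds, so the split happens next to each operator without consuming it. The following is only a minimal sketch: the class name KeepDelimiters and the method splitKeepingDelimiters are made up for illustration, and the operator set is assumed to be the same one used in the question's pattern.

```java
import java.util.regex.Pattern;

public class KeepDelimiters {
    // Assumed operator set, copied from the question's C# pattern: + - * ( ) ^ \ /
    private static final String OPS = "[-+*()^\\\\/]";

    // Zero-width boundary: just before an operator (lookahead)
    // or just after one (lookbehind). Nothing is consumed by the split.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?=" + OPS + ")|(?<=" + OPS + ")");

    // Hypothetical helper: returns operands and operators as separate tokens.
    public static String[] splitKeepingDelimiters(String text) {
        return BOUNDARY.split(text);
    }

    public static void main(String[] args) {
        // Prints 5, +, 10 on separate lines
        for (String token : splitKeepingDelimiters("5+10")) {
            System.out.println(token);
        }
    }
}
```

This relies on the Java 8 and later behaviour where a zero-width match at the very start of the input does not produce a leading empty string. On older runtimes an input that begins with an operator may yield an extra empty first element, in which case the Matcher loop shown above is the safer choice.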

Related

Remove characters from List<string> in between separators (from text file)

Fast way to replace text in text file.
From this: somename#somedomain.com:hello_world
To This: somename:hello_world
It needs to be FAST and handle a text file with many lines.
I tried splitting each line into three parts, but it seems slow. Example in the code below.
public static void Conversion()
{
    List<string> list = File.ReadAllLines("ETU/Tut.txt").ToList();
    Console.WriteLine("Please wait, converting in progress !");
    foreach (string combination in list)
    {
        if (combination.Contains("#"))
        {
            write: try
            {
                using (StreamWriter sw = new StreamWriter("ETU/UPCombination.txt", true))
                {
                    sw.WriteLine(combination.Split('#', ':')[0] + ":"
                        + combination.Split('#', ':')[2]);
                }
            }
            catch
            {
                goto write;
            }
        }
        else
        {
            Console.WriteLine("At least one line doesn't contain #");
        }
    }
}
So I need a fast way to convert every line in a text file from
somename#somedomain.com:hello_world
to
somename:hello_world
and then save the result to a different text file.
Remember, the domain part is different on every line!

Most likely not the fastest, but it is pretty fast with an expression similar to,
#[^:]+
and replace that with an empty string.
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"#[^:]+";
        string substitution = @"";
        // Verbatim string: the second line starts at column 0 so it gets no leading spaces.
        string input = @"somename#somedomain.com:hello_world1
somename#some_other_domain.com:hello_world2";
        RegexOptions options = RegexOptions.Multiline;
        Regex regex = new Regex(pattern, options);
        string result = regex.Replace(input, substitution);
        Console.WriteLine(result);
    }
}
If you wish to simplify, modify, or explore the expression, regex101.com explains each part of it in its top-right panel and shows how it matches against sample inputs.

Delimit a string by character unless within quotation marks C#

I need to delimit text by a single character, a comma, but I only want a comma to act as a delimiter if it is not enclosed in quotation marks.
An example:
Method,value1,value2
Would contain three values: Method, value1 and value2
But:
Method,"value1,value2"
Would contain two values: Method and "value1,value2"
I'm not really sure how to go about this as when splitting a string I would use:
String.Split(',');
But that would split on ALL commas. Is this possible without getting overly complicated and having to manually check every character of the string?
Thanks in advance

Copied from my comment: use an available CSV parser such as Microsoft.VisualBasic.FileIO.TextFieldParser.
As requested, here is an example for the TextFieldParser:
var allLineFields = new List<string[]>();
string sampleText = "Method,\"value1,value2\"";
var reader = new System.IO.StringReader(sampleText);
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
    parser.Delimiters = new string[] { "," };
    parser.HasFieldsEnclosedInQuotes = true; // <--- !!!
    string[] fields;
    while ((fields = parser.ReadFields()) != null)
    {
        allLineFields.Add(fields);
    }
}
This list now contains a single string[] with two strings. I have used a StringReader because this sample uses a string; if the source is a file, use a StreamReader (for example via File.OpenText).

You can try Regex.Split() to split the data up using the pattern
",|(\"[^\"]*\")"
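The pattern matches either a comma or a whole quoted field. Because the quoted field is in a capturing group, Regex.Split keeps it in the output; the empty strings that appear between adjacent matches are filtered out afterwards.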
Code Sample:
using System;
using System.Linq;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string data = "Method,\"value1,value2\",Method2";
        string[] pieces = Regex.Split(data, ",|(\"[^\"]*\")").Where(exp => !String.IsNullOrEmpty(exp)).ToArray();
        foreach (string piece in pieces)
        {
            Console.WriteLine(piece);
        }
    }
}
Results:
Method
"value1,value2"
Method2

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like to store each token that is enclosed in double quotes in an array. Is there a safe way to do this instead of using the String split() function? Currently I load the file into a RichTextBox, and then, using its Lines[] property, I loop over each Lines[] element and do this:
string[] line = s.Split(',');
s is a reference to an element of RichTextBox.Lines[].
As you can clearly see, a comma inside a token easily messes up the split() function. So instead of ending up with three tokens as I want, I end up with six.
Any help will be appreciated!

You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = @"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
    input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979

I've done this with my own method. It simply counts the number of " and ' characters and splits on a comma only when that count is even, that is, outside quotes.
Adapt this to your needs.
public List<string> SplitCsvLine(string s) {
    int i;
    int a = 0;      // start index of the current field
    int count = 0;  // number of quote characters seen so far
    List<string> str = new List<string>();
    for (i = 0; i < s.Length; i++) {
        switch (s[i]) {
            case ',':
                if ((count & 1) == 0) {
                    str.Add(s.Substring(a, i - a));
                    a = i + 1;
                }
                break;
            case '"':
            case '\'': count++; break;
        }
    }
    str.Add(s.Substring(a));
    return str;
}

It's not an exact answer to your question, but why not use an existing library for handling CSV files? A good example is LinqToCsv. CSV files can be delimited with various punctuation characters, and there are gotchas that library authors have already addressed, such as handling the header row, dealing with different date formats, and mapping rows to C# objects.

You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');

If your CSV line is tightly packed, it's easiest to strip the leading and trailing quotes as mentioned earlier and then do a simple split on the joining string:
string[] tokens = input.Substring(1, input.Length - 2).Split(new[] { "\",\"" }, StringSplitOptions.None);
This only works if ALL fields are double-quoted, even those that don't (officially) need to be. It will be faster than a regex, but only under those conditions.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"

Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader))
{
    while (true)
    {
        string[] line = parser.Read();
        if (line != null)
        {
            // do something
        }
        else
        {
            break;
        }
    }
}
More discussion / same question:
Dealing with commas in a CSV file

C# Best way to retrieve strings that's in quotation mark?

Suppose I am given the following text (in a string array):
engine.STEPCONTROL("00000000","02000001","02000043","02000002","02000007","02000003","02000008","02000004","02000009","02000005","02000010","02000006","02000011");
if("02000001" == 1){
dimlevel = 1;
}
if("02000001" == 2){
dimlevel = 3;
}
I'd like to extract the strings between the quotation marks and put them in a separate string array. For instance, string[] extracted would contain 00000000, 02000001, 02000043, and so on.
What is the best approach for this? Should I use a regular expression to parse those lines and split them?

Personally I don't think a regular expression is necessary. If you can be sure that the input string is always as described and will not have any escape sequences in it or vary in any other way, you could use something like this:
public static string[] ExtractNumbers(string[] originalCodeLines)
{
    List<string> extractedNumbers = new List<string>();
    // Only the first line is examined; splitting on " puts the quoted values in every other element.
    string[] codeLineElements = originalCodeLines[0].Split('"');
    foreach (string element in codeLineElements)
    {
        int result = 0;
        // Keep only the elements that parse as integers, which drops the code and commas in between.
        if (int.TryParse(element, out result))
        {
            extractedNumbers.Add(element);
        }
    }
    return extractedNumbers.ToArray();
}
It's not necessarily the most efficient implementation, but it's quite short and it's easy to see what it does.

That could be:
string data = "\"00000000\",\"02000001\",\"02000043\"".Replace("\"", string.Empty);
string[] myArray = data.Split(',');
Or in one line:
string[] data = "\"00000000\",\"02000001\",\"02000043\"".Replace("\"", string.Empty).Split(',');

C# preg_replace?

What is the PHP preg_replace in C#?
I have an array of strings that I would like to replace with another array of strings. Here is an example in PHP. How can I do something like that in C# without chaining .Replace("old","new") calls?
$patterns[0] = '/=C0/';
$patterns[1] = '/=E9/';
$patterns[2] = '/=C9/';
$replacements[0] = 'à';
$replacements[1] = 'é';
$replacements[2] = 'é';
return preg_replace($patterns, $replacements, $text);

Real men use regular expressions, but here is an extension method that adds it to String if you wanted it:
public static class ExtensionMethods
{
    public static String PregReplace(this String input, string[] pattern, string[] replacements)
    {
        if (replacements.Length != pattern.Length)
            throw new ArgumentException("Replacement and Pattern Arrays must be balanced");
        for (var i = 0; i < pattern.Length; i++)
        {
            input = Regex.Replace(input, pattern[i], replacements[i]);
        }
        return input;
    }
}
You use it like this:
class Program
{
    static void Main(string[] args)
    {
        String[] pattern = new String[4];
        String[] replacement = new String[4];
        pattern[0] = "Quick";
        pattern[1] = "Fox";
        pattern[2] = "Jumped";
        pattern[3] = "Lazy";
        replacement[0] = "Slow";
        replacement[1] = "Turtle";
        replacement[2] = "Crawled";
        replacement[3] = "Dead";
        String DemoText = "The Quick Brown Fox Jumped Over the Lazy Dog";
        Console.WriteLine(DemoText.PregReplace(pattern, replacement));
    }
}

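You can use LINQ (in .NET 3.5 and C# 3) to apply each pattern/replacement pair to the text in turn. Here patterns and replacements are string arrays of equal length and text is the input string:
string result = Enumerable.Range(0, patterns.Length)
    .Aggregate(text, (s, i) => s.Replace(patterns[i], replacements[i]));
You don't need regexp support; you just want an easy way to iterate over the arrays.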

public static class StringManipulation
{
    public static string PregReplace(string input, string[] pattern, string[] replacements)
    {
        if (replacements.Length != pattern.Length)
            throw new ArgumentException("Replacement and Pattern Arrays must be balanced");
        for (int i = 0; i < pattern.Length; i++)
        {
            input = Regex.Replace(input, pattern[i], replacements[i]);
        }
        return input;
    }
}
Here is what I will use. It is based on Jonathan Holland's code, but written for C# 2.0 instead of C# 3.5 :)
Thx all.

You are looking for the System.Text.RegularExpressions namespace:
using System.Text.RegularExpressions;
Regex r = new Regex("=C0");
string output = r.Replace(text, "à");
To get PHP's array behaviour the way you have it, you need multiple instances of Regex, one per pattern.
However, in your example you'd be much better served by .Replace(old, new); it's much faster than compiling state machines.

Edit: Uhg I just realized this question was for 2.0, but I'll leave it in case you do have access to 3.5.
Just another take on the Linq thing. I used List<Char> instead of Char[], but that's just to make it look a little cleaner. Arrays have no instance IndexOf method (only the static Array.IndexOf), but List<T> does. Why did I need this? As far as I can tell, the only link between the replacement list and the list of characters to be replaced is the index.
So with that in mind, you can do this with Char[] just fine. But where you see the IndexOf method, you have to add a .ToList() before it.
Like this: someArray.ToList().IndexOf
using System;
using System.Collections.Generic;
using System.Linq;

String text;
List<Char> patternsToReplace;
List<Char> patternsToUse;
patternsToReplace = new List<Char>();
patternsToReplace.Add('a');
patternsToReplace.Add('c');
patternsToUse = new List<Char>();
patternsToUse.Add('X');
patternsToUse.Add('Z');
text = "This is a thing to replace stuff with";
var allAsAndCs = text.ToCharArray()
    .Select
    (
        currentItem => patternsToReplace.Contains(currentItem)
            ? patternsToUse[patternsToReplace.IndexOf(currentItem)]
            : currentItem
    )
    .ToArray();
text = new String(allAsAndCs);
This converts the text to a character array and projects each character in turn. If the current character is not in the list of characters to replace, it is returned as is. If it is in that list, the character at the same index in the replacement list is returned instead. The last step creates a string from the resulting character array.
