Regular expression to split by delimiter and text qualifier (C#)

I need to split a text file whose values are separated by commas and wrapped in a text qualifier like ¨|¨.
I was trying to use this function:
public string[] Split(string expression, string delimiter,
string qualifier, bool ignoreCase)
{
string _Statement = String.Format
("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
Regex.Escape(delimiter), Regex.Escape(qualifier));
RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;
Regex _Expression = new Regex(_Statement, _Options);
return _Expression.Split(expression);
}
to process a text file with rows like this one:
¨|¨column 1¨|¨,¨|¨column 2¨|¨,¨|¨column 3¨|¨,¨|¨column 4¨|¨
But my regex expression is not working...
Any ideas that could help me to make this work?
Thanks in advance

You can do this without a regex: split the string by ¨|¨, skip the bare commas left between the qualified values, then split each item by a space to get the individual tokens, e.g.
foreach (var item in str.Split(new[] { "¨|¨" }, StringSplitOptions.RemoveEmptyEntries))
{
    if (item == ",") continue; // the delimiter between two qualified values
    var tokens = item.Split(' ');
    Console.WriteLine(tokens[0]);
    Console.WriteLine(tokens[1]);
}

Not really sure why you need a regex for something like this; string.Split can give you the output you need:
string str = "¨|¨column 1¨|¨,¨|¨column 2¨|¨,¨|¨column 3¨|¨,¨|¨column 4¨|¨";
string[] splitArray = str.Split(new[] { "¨|¨,", "¨|¨" }
, StringSplitOptions.RemoveEmptyEntries);
For output:
foreach (var item in splitArray)
{
Console.WriteLine(item);
}
Output:
column 1
column 2
column 3
column 4
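If a regex is still preferred (for example because a comma may appear inside a qualified value), one possible sketch is to match each qualified field instead of splitting on the delimiter. The original pattern puts the qualifier inside a character class, which only works for single-character qualifiers; here the escaped multi-character qualifier is used as a literal on both sides of a lazy capture. The class and method names are hypothetical, and the sketch assumes every field is qualified and that the qualifier never occurs inside a value.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QualifiedSplitter
{
    // Hypothetical helper (not from the original posts): returns the text
    // found between each opening and closing qualifier in the row.
    public static string[] SplitQualified(string row, string qualifier)
    {
        // Escape the qualifier so characters such as '|' are treated literally.
        string q = Regex.Escape(qualifier);

        // Lazily capture everything between two qualifiers.
        string pattern = q + "(.*?)" + q;

        return Regex.Matches(row, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }
}
```

Calling QualifiedSplitter.SplitQualified(row, "¨|¨") on the sample row should yield the four column values; the commas between fields are simply never captured.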

In .net, we can do this! :)
I just pushed through it and feel like sharing.
This is a pretty full regex solution to splitting a delimited file row:
private bool RowMe(string strColumnDelimiter, string strTextQualifier, string strInput, out string[] strSplitOutput, out string strResultMessage)
{
string[] retVal = null;
bool blnResult = false;
strResultMessage = "";
//---- We need to escape at least some of the most common
// special characters for both delimiter & qualifier ----
switch (strColumnDelimiter) {
case "|":
strColumnDelimiter = "\\|";
break;
case "\\":
strColumnDelimiter = "\\\\";
break;
}
switch (strTextQualifier)
{
case "\"":
strTextQualifier = "\\\"";
break;
}
//---- Let's have our delimited row splitter regex! ----
string strPattern = String.Concat(
"^"
,"(?:"
,"("
, "[^\\S" + strColumnDelimiter + strTextQualifier + "]*" // allow leading whitespace, not counting our delimiter & qualifier
,"(?:"
,"(?:[^" + strColumnDelimiter + strTextQualifier +"]*)" // any amount of characters not colum-delimiter or text-qualifier
,"|"
, "(?:" + strTextQualifier + "(?:(?:[^" + strTextQualifier + "])|(?:" + strTextQualifier + strTextQualifier + "))*" + strTextQualifier + ")" // any amount of characters not text-qualifier OR doubled-text-qualifier inside leading & trailing text-qualifier (allow even colum-delimiter inside text qualifier)
,"|"
,"(?:(?:[^" + strColumnDelimiter + strTextQualifier + "]{1})(?:[^" + strColumnDelimiter + "]*)(?:[^" + strColumnDelimiter + strTextQualifier + "]{1}))" // any amount of characters not column-delimiter inside other leading & trailing characters not column-delimiter or text-qualifier (allow text-qualifier inside value if it is not leading or trailing)
,")"
, "[^\\S" + strColumnDelimiter + strTextQualifier + "]*" // allow trailing whitespace, not counting our delimiter & qualifier
,")"
, "){0,1}"
//-- note how this second section is almost the same as the first but with a leading delimiter...
// the first column must not have a leading delimiter, and any subsequent ones must
, "(?:"
,"(?:"
, strColumnDelimiter // << :)
,"(?:"
, "("
, "[^\\S" + strColumnDelimiter + strTextQualifier + "]*" // allow leading whitespace, not counting our delimiter & qualifier
, "(?:"
, "(?:[^" + strColumnDelimiter + strTextQualifier + "]*)" // any amount of characters not colum-delimiter or text-qualifier
, "|"
, "(?:" + strTextQualifier + "(?:(?:[^" + strTextQualifier + "])|(?:" + strTextQualifier + strTextQualifier + "))*" + strTextQualifier + ")" // any amount of characters not text-qualifier OR doubled-text-qualifier inside leading & trailing text-qualifier (allow even colum-delimiter inside text qualifier)
, "|"
, "(?:(?:[^" + strColumnDelimiter + strTextQualifier + "]{1})(?:[^" + strColumnDelimiter + "]*)(?:[^" + strColumnDelimiter + strTextQualifier + "]{1}))" // any amount of characters not column-delimiter inside other leading & trailing characters not column-delimiter or text-qualifier (allow text-qualifier inside value if it is not leading or trailing)
, ")"
, "[^\\S" + strColumnDelimiter + strTextQualifier + "]*" // allow trailing whitespace, not counting our delimiter & qualifier
, ")"
,")"
,")"
, "){0,}"
,"$"
);
//---- And do the regex Match-ing ! ----
System.Text.RegularExpressions.Regex objRegex = new System.Text.RegularExpressions.Regex(strPattern);
System.Text.RegularExpressions.MatchCollection objMyMatches = objRegex.Matches(strInput);
//---- So what did we get? ----
if (objMyMatches.Count != 1) {
blnResult = false;
strResultMessage = "--NO-- no overall match";
}
else if (objMyMatches[0].Groups.Count != 3) {
blnResult = false;
strResultMessage = "--NO-- pattern not correct";
throw new ApplicationException("ERROR SPLITTING FLAT FILE ROW! The hardcoded regular expression appears to be broken. This should not happen!!! What's up??");
}
else {
int cnt = (1 + objMyMatches[0].Groups[2].Captures.Count);
retVal = new string[cnt];
retVal[0] = objMyMatches[0].Groups[1].Captures[0].Value;
for (int i = 0; i < objMyMatches[0].Groups[2].Captures.Count; i++) {
retVal[i+1] = objMyMatches[0].Groups[2].Captures[i].Value;
}
blnResult = true;
strResultMessage = "SUCCESS";
}
strSplitOutput = retVal;
return blnResult;
}

Related

Moving a whole line string to the right

const string Duom = "Text.txt";
char[] seperators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
string[] lines = File.ReadAllLines(Duom, Encoding.GetEncoding(1257));
for (int i = 0; i < lines.Length; i++)
{
string GLine = " " + lines[i];
GLine = Regex.Replace(GLine, @"\s+", " ");
GLine = GLine.PadRight(5, ' ');
Console.WriteLine(GLine);
}
The code reads a text file and, for each line, adds a space at the start and collapses every run of whitespace into a single space. I want to move the line to the right, but PadRight doesn't do anything.
Result :
Expected Result:
PadLeft and PadRight don't add characters to the start/end of your string if the specified length has already been reached.
From the docs for String.PadRight (emphasis mine):
Returns a new string that left-aligns the characters in this string by padding them on the right with a specified Unicode character, for a specified total length.
All of your strings are longer than 5 characters, the specified total length, so PadRight/PadLeft won't do anything.
"Padding" a string means adding spaces (or some other character) until the new string is at least as long as the width you specify.
Instead, just manually add 5 spaces before your string.
GLine = "     " + GLine;
Or more programmatically:
GLine = new string(' ', 5) + GLine;
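As a small sketch of the difference described above (the class and method names are made up for illustration): padding to a total width is a no-op once the string is already that long, whereas prepending spaces, or padding to the current length plus an offset, always shifts the text.

```csharp
using System;

public static class PaddingDemo
{
    // Shifts a line right by prepending `count` spaces.
    public static string ShiftRight(string line, int count)
    {
        return new string(' ', count) + line;
    }

    // Same result via PadLeft: the argument is the TOTAL width,
    // so it has to be the current length plus the desired shift.
    public static string ShiftRightWithPadLeft(string line, int count)
    {
        return line.PadLeft(line.Length + count);
    }
}
```

For instance, "hello world".PadRight(5) returns the string unchanged because it is already longer than 5 characters, while PaddingDemo.ShiftRight("hello world", 5) returns it with five leading spaces.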
You could replace the body of your loop like this:
string GLine = new string(' ', 1 + i * 5) + Regex.Replace(lines[i], @"\s+", " ");
Console.WriteLine(GLine);
This will add 1 space and then 5 more spaces for each line.
for (int i = 0; i < lines.Count(); i++)
{
string GLine = new string(' ',5*i) + lines[i];
Console.WriteLine(GLine);
}
This should add 5 extra spaces for each line you have, which I believe is what you are trying to accomplish, if I understand correctly.
You need to left pad a tab depending on how many lines of text you have. The best increment to use is the i variable.
string GLine = " " + lines[i];
change this to
string GLine = new String('\t', i) + lines[i];
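By the way, PadLeft also works, but keep in mind that its argument is the total width, so it has to grow with the line length (for example, lines[i].PadLeft(lines[i].Length + i, '\t')).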

Converting boundary points from a .kml using regex

I have this boundary that I received from a kml, I was able to dig down the xml and grab just the boundary points. I need to convert the points from this :
-92.25968002689014,30.7180061776264,0 -92.25976564548085,30.71751889774971,0 -92.25992462712097,30.71670626485147,0 -92.26006418327708,30.71604891951008,0 -92.26018466460856,30.71558863525373,0 -92.26037301574165,30.71498469610939,0 -92.26054805030229,30.71444051930294,0 -92.26065861561004,30.71411636559884,0
To This:
POLYGON((-92.25968002689014 30.7180061776264, -92.25976564548085,30.71751889774971, -92.25992462712097 30.71670626485147, -92.26006418327708,30.71604891951008, -92.26018466460856 30.71558863525373, -92.26037301574165,30.71498469610939, -92.26054805030229 30.71444051930294, -92.26065861561004,30.71411636559884))
The regex pattern I am using is : ",[0-9.-]* *"
My plan was to use a regex replace to turn every comma that is followed by any number of digits, periods, or minus signs and then one or more spaces into some placeholder character like a colon, then replace all remaining commas with spaces, and finally replace all colons with commas. But for some reason I can't get it to work. Any advice would be greatly appreciated.
You can try this:
([-\d.]+),([-\d.]+),([-\d.]+)\s+([-\d.]+),([-\d.]+),([-\d.]+)\s*
Sample c# code:
String polygon(String input)
{
string pattern = @"([-\d.]+),([-\d.]+),([-\d.]+)\s+([-\d.]+),([-\d.]+),([-\d.]+)\s*";
RegexOptions options = RegexOptions.Singleline | RegexOptions.Multiline;
String finalString = "POLYGON((";
int count = 0;
foreach (Match m in Regex.Matches(input, pattern, options))
{
if (count > 0)
finalString += ",";
finalString += m.Groups[1] + " " + m.Groups[2] + ", " + m.Groups[4] + "," + m.Groups[5];
count = 1;
}
finalString += "))";
return finalString;
}
output:
POLYGON((-92.25968002689014 30.7180061776264, -92.25976564548085,30.71751889774971,-92.25992462712097 30.71670626485147, -92.26006418327708,30.71604891951008,-92.26018466460856 30.71558863525373, -92.26037301574165,30.71498469610939,-92.26054805030229 30.71444051930294, -92.26065861561004,30.71411636559884))
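The output above mirrors the mixed separators of the expected string in the question. If the goal is the usual WKT layout, where every point is written as "x y" and points are separated by commas, a regex may not be needed at all. The following is only a sketch under that assumption: it splits the KML coordinate list on whitespace, splits each tuple on commas, and drops the third (altitude) value. The class and method names are hypothetical, and the sketch does not check that the ring is closed (first point repeated as last), which some WKT consumers expect.

```csharp
using System;
using System.Linq;

public static class KmlToWkt
{
    // Hypothetical helper: turns "lon,lat,alt lon,lat,alt ..." into
    // "POLYGON((lon lat, lon lat, ...))".
    public static string ToPolygon(string coordinates)
    {
        var points = coordinates
            // A null separator array makes Split use whitespace as the delimiter.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(tuple => tuple.Split(','))
            // Keep longitude and latitude, ignore the altitude component.
            .Select(parts => parts[0] + " " + parts[1]);

        return "POLYGON((" + string.Join(", ", points) + "))";
    }
}
```

Because the numbers are never parsed, their original precision is carried through as text.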

Getting only the last entry on counting

I'm only getting the last entry of the count when the code is written like this:
public string ZodziuSkaiciavimas()
{
foreach (var sentence in Sakiniai.TrimEnd('.').Split('.'))
{
Rezultatas=(eilute.ToString() + " sakinyje zodziu:" + (sentence.Trim().Split(' ').Count() + sentence.Trim().Split('-').Count() + sentence.Trim().Split(';').Count() + sentence.Trim().Split(':').Count() + sentence.Trim().Split(',').Count() - 4));
eilute++;
}
return Rezultatas;
}
I need the method to return the answer.
If I write the code like this, then I get what I want, but nothing is returned:
public string ZodziuSkaiciavimas()
{
foreach (var sentence in Sakiniai.TrimEnd('.').Split('.'))
{
Console.WriteLine(eilute.ToString() + " sakinyje zodziu:" + (sentence.Trim().Split(' ').Count() + sentence.Trim().Split('-').Count() + sentence.Trim().Split(';').Count() + sentence.Trim().Split(':').Count() + sentence.Trim().Split(',').Count() - 4));
eilute++;
}
return Rezultatas;
}
Why aren't you appending your results, as below?
Rezultatas +=(eilute.ToString() + " sakinyje zodziu:" + (sentence.Trim().Split(' ').Count() + sentence.Trim().Split('-').Count() + sentence.Trim().Split(';').Count() + sentence.Trim().Split(':').Count() + sentence.Trim().Split(',').Count() - 4)) + "\n";
It looks like you want to return multiple numbers from your method, but Rezultatas is a single string. You can fix it by changing the return type to List<int>, and returning a list:
public List<int> ZodziuSkaiciavimas() {
var Rezultatas = new List<int>();
foreach (var sentence in Sakiniai.TrimEnd('.').Split('.')) {
var res = sentence.Trim().Split(' ', '-', ';', ':', ',').Length;
Rezultatas.Add(res);
}
return Rezultatas;
}
When callers decide to print the Rezultatas they get back from your method, they can decide what character to put between the numbers (say, a comma ',') and print it like this:
var numbers = ZodziuSkaiciavimas();
Console.WriteLine(string.Join(", ", numbers));
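One detail worth checking in either approach is how consecutive separators are counted: "One, two" split on both ',' and ' ' produces an empty entry between the comma and the space. A possible variant, sketched here with hypothetical names and taking the text as a parameter instead of reading the Sakiniai field, passes StringSplitOptions.RemoveEmptyEntries so that only real words are counted.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    private static readonly char[] Separators = { ' ', '-', ';', ':', ',' };

    // Hypothetical helper: returns the number of words in each sentence,
    // where sentences are separated by '.'.
    public static List<int> CountWordsPerSentence(string text)
    {
        var counts = new List<int>();
        foreach (var sentence in text.TrimEnd('.').Split('.'))
        {
            // RemoveEmptyEntries drops the empty strings produced by
            // adjacent separators such as ", " or a leading space.
            int words = sentence
                .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
                .Length;
            counts.Add(words);
        }
        return counts;
    }
}
```

The returned list can be printed with string.Join in the same way as shown above.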

Count the number of rows in each column

How can I count number of rows in specified column in a Excel sheet?
For example I have 2 columns in a spreadsheet:
A B
--- -----
abc hi
fff hello
ccc hi
hello
The result should look like:
count of A column is 3
count of B column is 4
How can I do this using Microsoft Interop?
The approach suggested by Doug Glancy is accurate and simple to implement. You can write the formula into a cell that the user will not see (ZZ1000, for example) and read the value back. The code is straightforward:
Range notUsed = curSheet.get_Range("ZZ1000", "ZZ1000");
string targetCol = "A";
notUsed.Value2 = "=COUNTA(" + targetCol + ":" + targetCol + ")";
int totRows = Convert.ToInt32(notUsed.Value2);
notUsed.Value2 = "";
UPDATE ---
From your example I understood that you were looking for the total number of non-empty cells, which is what COUNTA delivers. Apparently this is not the case: you want the row number of the last non-empty cell. A more descriptive example:
C
---
abc
fff
ccc
hello
You don't want to count the number of non-empty cells (4 in this case; what COUNTA delivers), but the position of "hello", that is, 5.
I don't like relying on Excel formulas too much, except for clearly defined problems (like yours, as I initially understood it). Still, Excel formulas deliver the best solution for what you really want, although the complexity is close to the limit of what I would accept. To account for the situation described above, you can rely on MATCH. If your cells contain text (at least one letter per cell), the code can be changed into:
notUsed.Value2 = "=MATCH(REPT(\"z\",255)," + targetCol + ":" + targetCol + ")";
In case of having numeric values (not a single letter in the cell):
notUsed.Value2 = "=MATCH(LOOKUP(" + Int32.MaxValue.ToString() + "," + targetCol + ":" + targetCol + ")," + targetCol + ":" + targetCol + ")";
If you want to account for both options, you would have to combine these equations: you can create a new formula including both; or you might rely on C# code (e.g., get the values from both equations and consider only the bigger one).
Bear in mind, too, that you have to account for cases where no match is found. Here is code that accounts for both situations (letters and numbers, combined in C#) and for no matches:
notUsed.Value2 = "=MATCH(REPT(\"z\",255)," + targetCol + ":" + targetCol + ")";
int lastLetter = Convert.ToInt32(notUsed.Value2);
if (lastLetter == -2146826246)
{
lastLetter = 0;
}
totRows = lastLetter;
notUsed.Value2 = "=MATCH(LOOKUP(" + Int32.MaxValue.ToString() + "," + targetCol + ":" + targetCol + ")," + targetCol + ":" + targetCol + ")";
int lastNumber = Convert.ToInt32(notUsed.Value2);
if (lastNumber == -2146826246)
{
lastNumber = 0;
}
if (lastNumber > totRows)
{
totRows = lastNumber;
}
This should do it:
private static int GetRowsInColumnOnWorkSheetInWorkbook(string workbookName, int worksheetNumber, int workSheetColumn)
{
return new Excel.Application().Workbooks.Open(workbookName)
.Sheets[worksheetNumber]
.UsedRange
.Columns[workSheetColumn]
.Rows
.Count;
}
You could also add the following overload:
private static int GetRowsInColumnOnWorkSheetInWorkbook(string workbookName, string worksheetName, int workSheetColumn)
{
return new Excel.Application().Workbooks.Open(workbookName)
.Sheets[worksheetName]
.UsedRange
.Columns[workSheetColumn]
.Rows
.Count;
}
It's slightly longer than the other answer, but I think this is more readable, and simpler.

Find string in text, remove the first and last char

I search in a text for some strings and want to remove the first and last char in those strings.
Example :
...
...
OK 125 ab_D9 "can be "this" or; can not be "this" ";
...
OK 673 e_IO1_ "hello; is strong
or maybe not strong";
...
So I use this code to find all strings that begin with OK and to remove the surrounding quotes "..." from group 4:
tmp = fin.ReadToEnd();
var matches = Regex.Matches(tmp, "(OK) ([0-9]+) ([A-Za-z_0-9]+) (\"(?:(?!\";).)*\");", RegexOptions.Singleline);
for (int i = 0; i < matches.Count; i++)
{
matches[i].Groups[4].Value.Remove(0);
matches[i].Groups[4].Value.Remove(matches[i].Groups[4].Value.ToString().Length - 1);
Console.WriteLine(matches[i].Groups[1].Value + "\r\n" + "\r\n" + "\r\n" + matches[i].Groups[2].Value + "\r\n" + "\r\n" + matches[i].Groups[3].Value + "\r\n" + "\r\n" + "\r\n" + matches[i].Groups[4].Value);
Console.WriteLine(" ");
}
But it doesn't remove first and last char from Group 4. What did I do wrong?
My Result should be:
OK
125
ab_D9
can be "this" or; can not be "this"
OK
673
e_IO1
hello; is strong
or maybe not strong
There is no need to remove things. Just don't capture the quotes in the first place. So move the parentheses one character inward.
"(OK) ([0-9]+) ([A-Za-z_0-9]+) \"((?:(?!\";).)*)\";"
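To see the effect of moving the parentheses, here is a small sketch that applies this pattern and returns the four groups of the first match. The class and method names are hypothetical; RegexOptions.Singleline is kept from the question so that a quoted value may span several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OkLineParser
{
    // Same pattern as above: the quotes sit outside capture group 4.
    public const string Pattern =
        "(OK) ([0-9]+) ([A-Za-z_0-9]+) \"((?:(?!\";).)*)\";";

    // Returns groups 1 to 4 of the first match, or null if nothing matches.
    public static string[] ParseFirst(string text)
    {
        Match m = Regex.Match(text, Pattern, RegexOptions.Singleline);
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups[1].Value,
            m.Groups[2].Value,
            m.Groups[3].Value,
            m.Groups[4].Value
        };
    }
}
```

Note that group 4 contains exactly what lies between the opening quote and the closing ";, so any whitespace just before the closing quote is preserved; call Trim() on it if that is not wanted.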
You should assign the result of the Substring() and Remove() methods. They do not change the existing string; they return the changed string, which you need to assign to the same or another string variable. Check the code:
tmp = fin.ReadToEnd();
var matches = Regex.Matches(tmp, "(OK) ([0-9]+) ([A-Za-z_0-9]+) (\"(?:(?!\";).)*\");", RegexOptions.Singleline);
for (int i = 0; i < matches.Count; i++)
{
string str = matches[i].Groups[4].Value.Substring(1); // drop the leading quote
str = str.Remove(str.Length - 1);
Console.WriteLine(matches[i].Groups[1].Value + "\r\n" + "\r\n" + "\r\n" + matches[i].Groups[2].Value + "\r\n" + "\r\n" + matches[i].Groups[3].Value + "\r\n" + "\r\n" + "\r\n" + str);
Console.WriteLine(" ");
}
P.S. You should use Environment.NewLine instead of "\r\n", it's the better approach.
