Easier method to group by multiple newlines - c#

Today I faced a problem that looked easy at first glance but turned out not to be so simple. My task is to group multiple strings separated by two newlines. Example:
a
b

c
d

e
f
g

h
into:
[ [a, b], [c, d], [e, f, g], [h] ]
At first I thought about getting the groups out of a regular expression, but I couldn't find the right one to separate and give me the strings grouped. Then I decided to look at it with LINQ, but I couldn't manage to get anything useful either. Any tips?

You can use String.Split with two concatenated Environment.NewLine:
string[][] result = text.Split(new [] { Environment.NewLine + Environment.NewLine }, StringSplitOptions.None)
.Select(token => token.Split(new []{Environment.NewLine}, StringSplitOptions.None))
.ToArray();
https://dotnetfiddle.net/NinneE
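The answer above relies on the text using Environment.NewLine throughout. As a minimal sketch, assuming the input may mix "\r\n" and "\n" line endings, the same grouping can be done with Regex.Split. The names BlankLineGrouping and GroupByBlankLines are hypothetical and not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlankLineGrouping
{
    // Hypothetical helper: splits the text into blocks at runs of two or more
    // line breaks, then splits each block into its individual lines.
    public static string[][] GroupByBlankLines(string text)
    {
        return Regex.Split(text.Trim(), @"(?:\r?\n){2,}")
            .Select(block => Regex.Split(block, @"\r?\n"))
            .ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        // Mixed line endings on purpose.
        string text = "a\nb\n\nc\nd\r\n\r\ne\nf\ng\n\nh";
        string[][] groups = BlankLineGrouping.GroupByBlankLines(text);

        // Prints: [a, b] [c, d] [e, f, g] [h]
        Console.WriteLine(string.Join(" ",
            groups.Select(g => "[" + string.Join(", ", g) + "]")));
    }
}
```

The groups in the pattern are non-capturing on purpose, because Regex.Split adds captured text to its result.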

Splitting input by new line is a classic from Advent of Code (AoC).
Here is part of my extension class (.NET 7):
public static class Extention{
/// <summary>
/// Splits a text into lines.
/// </summary>
public static IEnumerable<string> Lines(this string text) => text.Split(Environment.NewLine, StringSplitOptions.RemoveEmptyEntries);
/// <summary>
/// Splits a text into blocks of lines. Split occurs on each empty line.
/// </summary>
public static IEnumerable<string> Blocks(this string text) => text.Trim().Split(Environment.NewLine + Environment.NewLine, StringSplitOptions.RemoveEmptyEntries);
}
The code then becomes:
var result = text.Blocks()
.Select(b => b.Lines());
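NB: the Split(string, StringSplitOptions) overload used in .Split(Environment.NewLine + Environment.NewLine, ...) is not available on .NET Framework. It requires .NET Core 2.0 or later, so it works on .NET 7.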

Related

C# Concat multiline string

I got two strings A and B.
string A = @"Hello_
Hello_
Hello_";
string B = @"World
World
World";
I want to add these two strings together with a function call which could look like this:
string AB = ConcatMultilineString(A, B)
The function should return:
@"Hello_World
Hello_World
Hello_World"
The best way I found was to split the strings into arrays of lines, concatenate the lines pairwise, join them with "\r\n", and return the result. But that seems like bad practice to me, since multiple lines are not always separated by "\r\n".
Is there a way to do this which is more reliable?
For a one line solution:
var output = string.Join(System.Environment.NewLine, A.Split('\n')
.Zip(B.Split('\n'), (a,b) => string.Join("", a, b)));
We split on \n because both \r\n and \n line endings contain \n. With \r\n input a leftover \r remains at the end of each piece, so add a call to TrimEnd('\r') on a and b if the input may use Windows line endings.
Environment.NewLine is a platform-agnostic alternative to using "\r\n".
Environment.NewLine might be helpful to resolve the "mulitple lines are not always indicated with "\r\n""-issue
https://msdn.microsoft.com/de-de/library/system.environment.newline(v=vs.110).aspx
Edit:
If you don't know whether lines are separated by "\n" or "\r\n", this might help:
input.Split(new string[] {"\n", "\r\n"}, StringSplitOptions.RemoveEmptyEntries);
Empty lines are removed. If you don't want this, use StringSplitOptions.None instead.
See also here: How to split strings on carriage return with C#?
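Putting the pieces of this answer together, here is a minimal sketch of the requested function built on String.Split with several separators. It assumes that "\r\n", "\n" and "\r" are the only line breaks that matter; the names MultilineConcat and ConcatLines are hypothetical.

```csharp
using System;
using System.Linq;

public static class MultilineConcat
{
    // "\r\n" is listed first so that it is treated as one break, not two.
    private static readonly string[] LineBreaks = { "\r\n", "\n", "\r" };

    // Hypothetical helper: joins the lines of a and b pairwise.
    // Zip stops at the shorter input, so surplus lines are dropped.
    public static string ConcatLines(string a, string b)
    {
        string[] aLines = a.Split(LineBreaks, StringSplitOptions.None);
        string[] bLines = b.Split(LineBreaks, StringSplitOptions.None);
        return string.Join(Environment.NewLine, aLines.Zip(bLines, (x, y) => x + y));
    }
}

public static class Program
{
    public static void Main()
    {
        string a = "Hello_\rHello_\r\nHello_";
        string b = "World\r\nWorld\nWorld";
        // Prints Hello_World on three lines.
        Console.WriteLine(MultilineConcat.ConcatLines(a, b));
    }
}
```

StringSplitOptions.None is used here so that empty lines keep their position and the two inputs stay aligned line by line.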
This does as you asked:
static string ConcatMultilineString(string a, string b)
{
string splitOn = "\r\n|\r|\n";
string[] p = Regex.Split(a, splitOn);
string[] q = Regex.Split(b, splitOn);
return string.Join("\r\n", p.Zip(q, (u, v) => u + v));
}
static void Main(string[] args)
{
string A = "Hello_\rHello_\r\nHello_";
string B = "World\r\nWorld\nWorld";
Console.WriteLine(ConcatMultilineString(A, B));
Console.ReadLine();
}
Outputs:
Hello_World
Hello_World
Hello_World
I think a fully generic way is impossible if you load strings created on different platforms (Linux, Mac, Windows), or even get strings that contain HTML line breaks or the like.
I think you will have to define the line breaks yourself.
string a = getA();
string b = getB();
// Normalize every line-break style to "\n". "\r\n" is replaced first
// so that it does not turn into two separate breaks.
string[] lineBreakers = { "\r\n", "\r" };
const string replaceWith = "\n";
foreach (string lineBreaker in lineBreakers)
{
    // Strings are immutable, so the result of Replace must be assigned.
    a = a.Replace(lineBreaker, replaceWith);
    b = b.Replace(lineBreaker, replaceWith);
}
string[] aLines = a.Split('\n');
string[] bLines = b.Split('\n');
string newLine = Environment.NewLine;
if (aLines.Length == bLines.Length)
{
    return string.Join(newLine, aLines.Zip(bLines, (p, q) => p + q));
}

Ignore case of the delimiter in C# Split function [duplicate]

I need to split a string, say "asdf aA asdfget aa uoiu AA", using "aa" as the separator while ignoring case, into:
"asdf "
"asdfget "
"uoiu "
There's no easy way to accomplish this using string.Split. (Well, except for specifying all the permutations of the split string for each char lower/upper case in an array - not very elegant I think you'll agree.)
However, Regex.Split should do the job quite nicely.
Example:
var parts = Regex.Split(input, "aa", RegexOptions.IgnoreCase);
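One thing to keep in mind is that Regex.Split treats the separator as a pattern. A minimal sketch, assuming the separator should always be matched literally, wraps it in Regex.Escape; the names IgnoreCaseSplit and SplitIgnoreCase are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseSplit
{
    // Hypothetical helper: splits on a literal separator, ignoring case.
    public static string[] SplitIgnoreCase(string input, string separator)
    {
        // Regex.Escape makes characters such as '.' or '+' match literally.
        return Regex.Split(input, Regex.Escape(separator), RegexOptions.IgnoreCase);
    }
}

public static class Program
{
    public static void Main()
    {
        string[] parts = IgnoreCaseSplit.SplitIgnoreCase("asdf aA asdfget aa uoiu AA", "aa");
        // The last element is an empty string because the input ends with the separator.
        foreach (string part in parts)
        {
            Console.WriteLine("'" + part + "'");
        }
    }
}
```

The pieces keep their surrounding spaces and the trailing empty entry, so call Trim or filter the result if you need exactly the output shown in the question.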
If you don't care about case, then the simplest thing to do is force the string to all uppercase or lowercase before using Split.
var stringbits = datastring.ToLower().Split(new[] { "aa" }, StringSplitOptions.None);
If you care about case for the interesting bits of the string but not the separators, then I would use String.Replace to force all the separators to a specific case (upper or lower, doesn't matter) and then call String.Split using the matching case for the separator.
var stringbits = datastring.Replace("aA", "aa").Replace("Aa", "aa").Replace("AA", "aa").Split(new[] { "aa" }, StringSplitOptions.None);
In your algorithm, you can use the String.IndexOf method and pass in OrdinalIgnoreCase as the StringComparison parameter.
My answer isn't as good as Noldorin's, but I'll leave it so people can see the alternative method. This isn't as good for simple splits, but it is more flexible if you need to do more complex parsing.
using System.Text.RegularExpressions;
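string data = "asdf aA asdfget aa uoiu AA";
string aaRegex = "(.+?)[aA]{2}";
MatchCollection mc = Regex.Matches(data, aaRegex);
foreach (Match m in mc)
{
    // Group 1 holds the text before the separator; m.Value would also include the separator.
    Console.WriteLine(m.Groups[1].Value);
}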
Use my method to split:
public static string[] Split(this string s, string word, StringComparison stringComparison)
{
    List<string> tmp = new List<string>();
    int wordSt;
    // Look up each occurrence of the separator once per iteration.
    while ((wordSt = s.IndexOf(word, stringComparison)) > -1)
    {
        tmp.Add(s.Substring(0, wordSt));
        s = s.Substring(wordSt + word.Length);
    }
    // Whatever is left after the last separator is the final piece.
    tmp.Add(s);
    return tmp.ToArray();
}
It's not the prettiest version, but it also works:
"asdf aA asdfget aa uoiu AA".Split(new[] { "aa", "AA", "aA", "Aa" }, StringSplitOptions.RemoveEmptyEntries);
public static List<string> _Split(this string input, string[] splt)
{
    List<string> _Result = new List<string>();
    if (splt.Length == 0)
    {
        // Nothing to split on: return the input unchanged.
        _Result.Add(input);
        return _Result;
    }
    // Split on the first separator, then split every piece on the remaining ones.
    string[] pieces = Regex.Split(input, splt[0], RegexOptions.IgnoreCase);
    string[] rest = splt.Skip(1).ToArray();
    foreach (string piece in pieces)
    {
        _Result.AddRange(piece._Split(rest));
    }
    return _Result;
}
Then use this function as below:
public frmThematicConversation()
{
InitializeComponent();
string str = "a b c d e f g h a b c f a d c b f";
string[] splt = { "a", "b" };
List<string> _result = str._Split(splt);
}
I have had good success with this extension method I wrote, which uses string.Replace to find and normalize the casing of the separator.
You call it as follows:
var result = source.Split(prefix, StringComparison.InvariantCultureIgnoreCase);
The extension method is defined as follows.
public static string[] Split(this string source, string separator,
StringComparison comparison = StringComparison.CurrentCulture,
StringSplitOptions splitOptions = StringSplitOptions.None)
{
if (source is null || separator is null)
return null;
// Pass-through the default case.
if (comparison == StringComparison.CurrentCulture)
return source.Split(new string[] { separator }, splitOptions);
// Use Replace to deal with the non-default comparison options.
return source
.Replace(separator, separator, comparison)
.Split(new string[] { separator }, splitOptions);
}
NOTE: This method deals with my default case where I am usually passing a single string separator.
Dim arr As String() = Strings.Split("asdf aA asdfget aa uoiu AA",
"aa" ,, CompareMethod.Text)
CompareMethod.Text ignores case.
Building on the answer from @Noldorin, I made this extension method.
It takes more than one separator string and mimics the behavior of string.Split(..) when you supply several separator strings. It uses the invariant ('culture-unspecific') culture and, of course, ignores case.
/// <summary>
/// <see cref="string.Split(char[])"/> has no option to ignore casing.
/// This functions mimics <see cref="string.Split(char[])"/> but also ignores casing.
/// When called with <see cref="StringSplitOptions.RemoveEmptyEntries"/> <see cref="string.IsNullOrWhiteSpace(string)"/> is used to filter 'empty' entries.
/// </summary>
/// <param name="input">String to split</param>
/// <param name="separators">Array of separators</param>
/// <param name="options">Additional options</param>
/// <returns></returns>
public static IEnumerable<string> SplitInvariantIgnoreCase(this string input, string[] separators, StringSplitOptions options)
{
if (separators == null) throw new ArgumentNullException(nameof(separators));
if (separators.Length <= 0) throw new ArgumentException("Value cannot be an empty collection.", nameof(separators));
if (string.IsNullOrWhiteSpace(input)) throw new ArgumentException("Value cannot be null or whitespace.", nameof(input));
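// Build a regex pattern of all the separators; this looks like aa|bb|cc.
// The pipe character '|' means alternation. Regex.Escape keeps separators
// that contain regex metacharacters (such as '.' or '|') literal.
var regexPattern = string.Join("|", separators.Select(s => Regex.Escape(s)));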
var regexSplitResult = Regex.Split(input, regexPattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
// NOTE To be honest - i don't know the exact behaviour of Regex.Split when it comes to empty entries.
// Therefore i doubt that filtering null values even matters - however for consistency i decided to code it in anyways.
return options.HasFlag(StringSplitOptions.RemoveEmptyEntries) ?
regexSplitResult.Where(c => !string.IsNullOrWhiteSpace(c))
: regexSplitResult;
}

String is not splitting correctly

I am trying to split a string into a string[] made up of the words the string originally held, using the following code.
private string[] ConvertWordsFromFile(String NewFileText)
{
char[] delimiterChars = { ' ', ',', '.', ':', '/', '|', '<' , '>','/','#','#','$','%','^','&','*','"','(',')',';'};
string[] words = NewFileText.Split(delimiterChars);
return words;
}
I then use this to add the words to a dictionary that keeps track of word keys and their frequency values. Duplicated words are not added as new keys; only their value is incremented. However, the last word is counted as a different word and is therefore made into a new key. How can I fix this?
This is the code I have for adding words to the dictionary :
public void AddWord(String newWord)
{
newWord = newWord.ToLower();
try
{
MyWords.Add(newWord, 1);
}
catch (ArgumentException)
{
MyWords[newWord]++;
}
}
To clarify, the problem I am having is that even if the word at the end of the string is a duplicate, it is still treated as a new word and therefore gets a new key.
A guess: a space (or another delimiter) at the end of the text produces an empty word that you don't expect. If so, pass the appropriate option to Split:
var words = NewFileText.Split(delimiterChars,
    StringSplitOptions.RemoveEmptyEntries);
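To make the effect visible, here is a minimal sketch that shows the extra empty entry and then counts words without relying on exceptions. The shortened delimiter list and the names WordCounter and CountWords are assumptions for illustration, not the asker's code.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Shortened delimiter list, assumed for the example.
    private static readonly char[] Delimiters = { ' ', ',', '.' };

    // Hypothetical helper: counts lower-cased words, skipping empty entries.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
        {
            string key = word.ToLowerInvariant();
            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }
        return counts;
    }
}

public static class Program
{
    public static void Main()
    {
        // A trailing delimiter yields a trailing empty entry by default.
        Console.WriteLine("one two.".Split(new[] { ' ', '.' }).Length); // 3

        foreach (var pair in WordCounter.CountWords("The cat, the dog. The"))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

If the last word is still counted separately after this change, check whether the text ends with a character that is not in the delimiter list, such as a line break.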
Split is not the best choice for what you want to do, because you end up with this kind of problem and you also have to specify all the delimiters.
A much better option is to use a regular expression instead of your ConvertWordsFromFile method, as follows:
Regex.Split(theTextToBeSplitted, @"\W+")
This line returns an array containing all the 'words' (plus an empty string at either end if the text starts or ends with a non-word character). Once you have that, the next step is to create your dictionary. If you can use LINQ in your code, the easiest and cleanest way to do what you want is this:
var theTextToBeSplitted = "#Hi, this is a 'little' test: <I hope it is useful>";
var myDictionary = Regex.Split(theTextToBeSplitted, @"\W+")
    .Where(x => x.Length > 0) // drop the empty entries produced at the ends
    .GroupBy(x => x)
    .ToDictionary(x => x.Key, x => x.Count());
That's all you need.
Good luck!

Regex.Split adding empty strings to result array

I have a Regex to split out words, operators and brackets in simple logic statements (e.g. "WORD1 & WORD2 | (WORd_3 & !word_4 )"). The Regex I've come up with is "([A-Za-z0-9_]+)|([&!\|()]{1})". Here is a quick test program.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("* Test Project *");
string testExpression = "!(LIONV6 | NOT_superCHARGED) &RHD";
string removedSpaces = testExpression.Replace(" ", "");
string[] expectedResults = new string[] { "!", "(", "LIONV6", "|", "NOT_superCHARGED", ")", "&", "RHD" };
string[] splits = Regex.Split(removedSpaces, @"([A-Za-z0-9_]+)|([&!\|()]{1})");
Console.WriteLine("Expected\n{0}\nActual\n{1}", expectedResults.AllElements(), splits.AllElements());
Console.WriteLine("*** Any Key to finish ***");
Console.ReadKey();
}
}
public static class Extensions
{
public static string AllElements(this string[] str)
{
string output = "";
if (str != null)
{
foreach (string item in str)
{
output += "'" + item + "',";
}
}
return output;
}
}
}
The Regex does the required job of splitting out words and operators into an array in the right sequence, but the result array contains many empty elements, and I can't work out why. It's not a serious problem, as I just ignore empty elements when consuming the array, but I'd like Regex to do all the work if possible, including ignoring spaces.
Try this:
string[] splits = Regex.Split(removedSpaces, @"([A-Za-z0-9_]+)|([&!\|()]{1})").Where(x => x != String.Empty).ToArray();
The empty strings are just because of the way Split works. From the help page:
If multiple matches are adjacent to one another, an empty string is inserted into the array.
By default, Split treats your matches as delimiters. So in effect what would normally be returned is a lot of empty strings between the adjacent matches (as a comparison, imagine what you might expect if you split ",,,," on ","; you'd probably expect all the gaps).
Also from that help page though is:
If capturing parentheses are used in a Regex.Split expression, any
captured text is included in the resulting string array.
This is the reason you are getting what you actually want in there at all. So effectively it is now showing you the text that has been split (all the empty strings) with the delimiters too.
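Both quoted behaviours can be seen on a small input. The following is a minimal sketch with a made-up string, not the asker's expression:

```csharp
using System;
using System.Text.RegularExpressions;

public static class Program
{
    public static void Main()
    {
        // Without a capturing group the delimiters are dropped, and adjacent
        // delimiters leave an empty string between them.
        string[] plain = Regex.Split("a,,b", ",");
        Console.WriteLine(string.Join("|", plain));    // a||b

        // With a capturing group the delimiters are included as well,
        // in addition to the empty string between them.
        string[] captured = Regex.Split("a,,b", "(,)");
        Console.WriteLine(string.Join("|", captured)); // a|,||,|b
    }
}
```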
What you are doing may well be better done by just matching the regular expression (with Regex.Matches), since what is in your regular expression is actually what you want to match.
Something like this (using some linq to convert to a string array):
Regex.Matches(testExpression, @"([A-Za-z0-9_]+)|([&!\|()]{1})")
.Cast<Match>()
.Select(x=>x.Value)
.ToArray();
Note that because this is taking positive matches it doesn't need the spaces to be removed first.
var matches = Regex.Matches(removedSpaces, @"(\w+|[&!|()])");
foreach (var match in matches)
Console.Write("'{0}', ", match); // '!', '(', 'LIONV6', '|', 'NOT_superCHARGED', ')', '&', 'RHD',
Actually, you don't need to delete spaces before extracting your identifiers and operators, the regex I proposed will ignore them anyway.

How to specify Lucene.net boolean AND, OR , not operator from normal and, or and not variables?

In my project I am implementing full-text index search using Lucene. While doing this I got stuck on how to differentiate Lucene boolean operators from the ordinary words and, or, not.
For example, suppose we search for "I want a pen and pencil". By default Lucene.net combines terms with OR, so it searches for "I OR want OR a OR pen OR pencil" rather than what I would like, "I OR want OR a OR pen OR and OR pencil". So how can we differentiate an ordinary and, or, not from the Lucene operators?
For this I have written a helper method, which looks like this:
/// <summary>
/// Method to get search predicates
/// </summary>
/// <param name="searchTerm">Search term</param>
/// <returns>List of predicates</returns>
public static IList<string> GetPredicates(string searchTerm)
{
//// Remove unwanted characters
//searchTerm = Regex.Replace(searchTerm, "[<(.|\n)*?!'`>]", string.Empty);
string exactSearchTerm = string.Empty,
keywordOrSearchTerm = string.Empty,
andSearchTerm = string.Empty,
notSearchTerm = string.Empty,
searchTermWithOutKeywords = string.Empty;
//// Exact search term
exactSearchTerm = "\"" + searchTerm.Trim() + "\"";
//// Search term without keywords
searchTermWithOutKeywords = Regex.Replace(
searchTerm, " and not | and | or ", " ", RegexOptions.IgnoreCase);
//// Split keywords
string[] splittedKeywords = searchTermWithOutKeywords.Trim().Split(
new char[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);
//// Or search term
keywordOrSearchTerm = string.Join(" OR ", splittedKeywords);
//// And search term
andSearchTerm = string.Join(" AND ", splittedKeywords);
//// not search term
int index = 0;
List<string> searchTerms = (from term in Regex.Split(
searchTerm, " and not ", RegexOptions.IgnoreCase)
where index++ != 0
select term).ToList();
searchTerms = (from term in searchTerms
select Regex.IsMatch(term, " and | or ", RegexOptions.IgnoreCase) ?
Regex.Split(term, " and | or ", RegexOptions.IgnoreCase).FirstOrDefault() :
term).ToList();
notSearchTerm = searchTerms.Count > 0 ? string.Join(" , ", searchTerms) : "\"\"";
return new List<string> { exactSearchTerm, andSearchTerm, keywordOrSearchTerm, notSearchTerm };
}
But it returns four results, so I have to search my index four times, which seems very inefficient. Can anybody help me do this in a single pass?
As @Matt Warren suggested, Lucene has what are called "stop words", which usually add little value to search quality but bloat the index. Stop words like "a, and, or, the, an" are usually filtered out of your text automatically as it is indexed, and then filtered out of your query when it is parsed. The StopFilter is responsible for this behavior in both cases, but you can pick an analyzer that does not use the StopFilter.
The other issue is query parsing. If I remember correctly, the Lucene query parser treats only capitalized OR, AND and NOT as operators, so if the user types those words in capital letters, you'll need to lower-case them so they are not treated as operators. Here's some Regex.Replace code for that:
string queryString = "the red pencil and blue pencil are both not green or brown";
queryString =
Regex.Replace (
queryString,
@"\b(?:OR|AND|NOT)\b",
m => m.Value.ToLowerInvariant ());
The built-in StandardAnalyzer will strip out common words for you, see this article for an explanation.
