Extract values from a string into arrays - c#

I have a string like this:
john "is my best buddy" and he loves "strawberry juice"
I want to:
Extract texts within double-quotes into a string array array1
Split texts outside of double-quotes by spaces and then insert them into another string array (array2).
Output:
array1[0]: is my best buddy
array1[1]: strawberry juice
array2[0]: john
array2[1]: and
array2[2]: he
array2[3]: loves
Any help is appreciated.

Clearly, this is a call for Regular Expressions:
var str = @"john ""is my best buddy"" and he loves ""strawberry juice""";
var regex = new Regex("(\"(?'quoted'[^\"]+)\")|(?'word'\\w+)",
RegexOptions.Singleline|RegexOptions.Compiled);
var matches = regex.Matches(str);
var quotes = matches.Cast<Match>()
.SelectMany(m => m.Groups.Cast<Group>())
.Where(g => g.Name == "quoted" && g.Success)
.Select(g => g.Value)
.ToArray();
var words = matches.Cast<Match>()
.SelectMany(m => m.Groups.Cast<Group>())
.Where(g => g.Name == "word" && g.Success)
.Select(g => g.Value)
.ToArray();
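A more compact variant of the same idea, as a minimal sketch: instead of flattening every group and filtering by name, it reads the two named groups directly from each match. The helper name SplitQuotedAndWords is hypothetical, and the sketch assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Returns the quoted phrases and the bare words of the input as two arrays.
(string[] Quoted, string[] Words) SplitQuotedAndWords(string input)
{
    // Either a double-quoted phrase (captured without the quotes) or a bare word.
    var matches = Regex.Matches(input, "\"(?<quoted>[^\"]+)\"|(?<word>\\w+)")
                       .Cast<Match>()
                       .ToList();

    var quoted = matches.Where(m => m.Groups["quoted"].Success)
                        .Select(m => m.Groups["quoted"].Value)
                        .ToArray();
    var words = matches.Where(m => m.Groups["word"].Success)
                       .Select(m => m.Groups["word"].Value)
                       .ToArray();
    return (quoted, words);
}

var (array1, array2) = SplitQuotedAndWords(
    "john \"is my best buddy\" and he loves \"strawberry juice\"");
Console.WriteLine(string.Join(" | ", array1)); // is my best buddy | strawberry juice
Console.WriteLine(string.Join(" | ", array2)); // john | and | he | loves
```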

Related

The most common word in spaceless string

I have a very long string of text that is many words separated by camelCase like so:
AedeagalAedilityAedoeagiAefaldnessAegeriidaeAeginaAeipathyAeneolithicAeolididaeAeonialAerialityAerinessAerobia
I need to find the most common word and the number of times it has been used. I am not sure how to do this because of the lack of spaces and because I am new to C#.
I have tried many methods but none seem to work; I'd be very grateful for any advice.
I have a github repo with the file being downloaded and a few tests already done here: https://github.com/Imstupidpleasehelp/C-code-test
Thank you.
You can try querying the string with the help of regular expressions and Linq:
string source = ...
var result = Regex
.Matches(source, "[A-Z][a-z]*")
.Cast<Match>()
.Select(match => match.Value)
.GroupBy(word => word)
.Select(group => (word : group.Key, count : group.Count()))
.OrderByDescending(pair => pair.count)
.First();
Console.Write($"{result.word} appears {result.count} time(s)");
string[] split = Regex.Split(exampleString, "(?<=[A-Za-z])(?=[A-Z][a-z])");
var result = split.GroupBy(s => s)
.Where(g=> g.Count()>=1 )
.OrderByDescending(g => g.Count())
.Select(g => new{ Word = g.Key, Occurrences = g.Count()});
var result will contain pairs of (Word, Occurrences) for all words.
If you want just the first one (the one with the most occurrences) use
var result = split.GroupBy(s => s)
.Where(g=> g.Count()>=1 )
.OrderByDescending(g => g.Count())
.Select(g => new{ Word = g.Key, Occurrences = g.Count()}).First();
Keep in mind that two or more words can share the same number of occurrences, so First() would return only one of them.
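If ties matter, one option is to compute the highest count first and then keep every word that reaches it. The following is a minimal sketch under that assumption; the helper name MostCommonWords and the sample input are hypothetical, and the word pattern is the [A-Z][a-z]* expression from the first answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Returns every word that shares the highest occurrence count.
List<(string Word, int Count)> MostCommonWords(string source)
{
    var counts = Regex.Matches(source, "[A-Z][a-z]*")
                      .Cast<Match>()
                      .GroupBy(m => m.Value)
                      .Select(g => (Word: g.Key, Count: g.Count()))
                      .ToList();
    if (counts.Count == 0)
        return counts;               // no words found

    int max = counts.Max(c => c.Count);
    return counts.Where(c => c.Count == max).ToList();
}

foreach (var (word, count) in MostCommonWords("AlphaBetaAlphaGammaBeta"))
    Console.WriteLine($"{word}: {count}");
// prints "Alpha: 2" and "Beta: 2"
```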
A non-LINQ approach using a for loop and char.IsUpper to separate the words:
string data = "AedeagalAedilityAedoeagiAefaldness";
var words = new List<string>();
var temp = new StringBuilder();
for(int i = 0;i < data.Length;i++)
{
temp.Append(data[i]);
if (i == data.Length-1 || char.IsUpper(data[i+1]))
{
words.Add(temp.ToString());
temp.Clear();
}
}
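The loop above only separates the words. To answer the original question it still needs a tally, which could be sketched with a Dictionary as follows. The helper name CountWords and the shortened sample string are hypothetical; the splitting logic is the same as in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Splits at each upper-case letter and tallies the words in one pass.
Dictionary<string, int> CountWords(string data)
{
    var counts = new Dictionary<string, int>();
    var temp = new StringBuilder();
    for (int i = 0; i < data.Length; i++)
    {
        temp.Append(data[i]);
        // A word ends at the end of the input or just before the next upper-case letter.
        if (i == data.Length - 1 || char.IsUpper(data[i + 1]))
        {
            string word = temp.ToString();
            counts.TryGetValue(word, out int seen);   // seen stays 0 when the word is new
            counts[word] = seen + 1;
            temp.Clear();
        }
    }
    return counts;
}

var tally = CountWords("AedeagalAedilityAedeagalAefaldness");

// Pick the entry with the highest count (the first one found wins a tie).
var best = new KeyValuePair<string, int>("", 0);
foreach (var pair in tally)
    if (pair.Value > best.Value)
        best = pair;

Console.WriteLine($"{best.Key} appears {best.Value} time(s)");
// prints "Aedeagal appears 2 time(s)"
```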

Linq select with regex

I want to extract the strings like aaa.a1 and aaa.a2 from my list. All these strings contain "aaa.".
How can I combine Regex with Linq?
var inputList = new List<string>() { "bbb aaa.a1 bbb", "ccc aaa.a2 ccc" };
var result = inputList.Where(x => x.Contains(@"aaa.")).Select(x => x ???? ).ToList();
You may use
var inputList = new List<string>() { "bbb aaa.a1 bbb", "ccc aaa.a2 ccc" };
var result = inputList
.Select(i => Regex.Match(i, @"\baaa\.\S+")?.Value)
.Where(x => !string.IsNullOrEmpty(x))
.ToList();
foreach (var s in result)
Console.WriteLine(s);
Output:
aaa.a1
aaa.a2
The Regex.Match(i, @"\baaa\.\S+")?.Value part tries to match the following pattern in each item:
\b - a word boundary
aaa\. - an aaa. substring
\S+ - 1+ non-whitespace chars.
The .Where(x => !string.IsNullOrEmpty(x)) will discard empty items that result from the items with no matching strings.
You could try a slightly different solution, which keeps only the items that contain a match:
var result = inputList
.Where(i => Regex.IsMatch(i, @"\baaa\.[a-z0-9]+"))
// or even
// .Where(i => Regex.IsMatch(i, @"\ba+\.[a-z0-9]+"))
.ToList();

Check if String Contains Match in Enumerable.Range Filter List

I want to check if a string contains a word or number from a list and remove it from the string.
I want to use Enumerable.Range() to create the filter list and use it to filter many different strings.
I'm trying to combine two previous answers:
https://stackoverflow.com/a/49733139/6806643
https://stackoverflow.com/a/49740832/6806643
The sentence I want to filter:
This is a A05B09 hello 02 100 test
Filter
A00B00-A100B100, 01-100, 000-100, hello
Should read:
This is a test
Old Way
For Loop - Works
http://rextester.com/BJL70824
New Way
Enumerable Range List - Does not work
http://rextester.com/ZSCM64375
C#
List<List<string>> filters = Enumerable.Range(0, 101)
.SelectMany(a => Enumerable.Range(0, 101).Select(b => "A{0:00}B{1:00}"))
.Select(i => Enumerable.Range(0, 10).Select(c => string.Empty).ToList())
.SelectMany(a => Enumerable.Range(0, 101).Select(b => "{0:000}"))
.SelectMany(a => Enumerable.Range(0, 101).Select(b => "{0:00}"))
.SelectMany(a => Enumerable.Range(0, 1).Select(b => "hello"))
.ToList();
List<string> matches = new List<string>();
// Sentence
string sentence = "This is a A05B09 hello 02 100 test";
string newSentence = string.Empty;
// Find Matches
for (int i = 0; i < filters.Count; i++)
{
// Add to Matches List
if (sentence.Contains(filters[i].ToString()))
{
matches.Add(filters[i]);
}
}
// Filter Sentence
newSentence = Regex.Replace(
sentence
, @"(?<!\S)(" + string.Join("|", matches) + @")(?!\S)"
, ""
, RegexOptions.IgnoreCase
);
// Display New Sentence
Console.WriteLine(newSentence);
I think creating a list of all possible combinations is a very bad approach. You are creating huge lists which will make your process use a lot of RAM and be very slow without any good reason. Why not just create a good Regex? For example, with this expression, you get your desired string:
\b(A\d\dB\d\d|A100B100|0?\d\d|100|hello)\b\s*
That is assuming you don't want to replace stuff like A101B101 or 123.
If you want to replace those as well, the regex is a bit simpler:
\b(A\d\d\d?B\d\d\d?|\d\d\d?|hello)\b\s*
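A minimal sketch of applying the first expression with Regex.Replace. The helper name RemoveFiltered is hypothetical, and RegexOptions.IgnoreCase is carried over from the question's code rather than required by the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Removes every filtered token together with the whitespace that follows it.
string RemoveFiltered(string sentence)
{
    const string pattern = @"\b(A\d\dB\d\d|A100B100|0?\d\d|100|hello)\b\s*";
    return Regex.Replace(sentence, pattern, "", RegexOptions.IgnoreCase);
}

Console.WriteLine(RemoveFiltered("This is a A05B09 hello 02 100 test"));
// prints "This is a test"
```

Because of the word boundaries, tokens outside the ranges, such as A101B101 or 123, are left untouched.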
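This line does not seem to meet your requirements: .SelectMany(a => Enumerable.Range(0, 101).Select(b => "A{0:00}B{1:00}")). The format string is never given the values of a and b, so every item is the literal text A{0:00}B{1:00}.
Can you try this LINQ query instead?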
List<string> filters = Enumerable.Range(0, 101)
.SelectMany(a => Enumerable.Range(0, 101).Select(b => $"A{a:00}B{b:00}"))
.Union(Enumerable.Range(0, 101).Select(b => $"{b:000}"))
.Union(Enumerable.Range(0, 101).Select(b => $"{b:00}"))
.Union(new List<string> {"hello"})
.ToList();
This version gives the expected result on rextester:
List<string> filters = Enumerable.Range(0, 101)
.SelectMany(a => Enumerable.Range(0, 101).Select(b => string.Format("A{0:00}B{1:00}", a, b)))
.Union(Enumerable.Range(0, 101).Select(b => string.Format("{0:000}", b)))
.Union(Enumerable.Range(0, 101).Select(b => string.Format("{0:00}", b)))
.Union(new List<string> { "hello" })
.ToList();

Check value in datatable with Linq

I am creating a Word Cloud and so I am splitting my sentences in Linq using Regex and grouping the words and taking the count of them. However, I don't want some blacklist words to appear in my cloud, so I get those words in a datatable (dtBlackList) and check with Linq as shown in the code below
var result = (Regex.Split(StringsForWordCloud, @"\W+")
.GroupBy(s => s, StringComparer.InvariantCultureIgnoreCase)
.Where(q => q.Key.Trim() != "")
.Where(q => (dtBlackList.Select("blacklistword = '" + q.Key.Trim() + "'").Count() == 0))
.OrderByDescending(g => g.Count())
.Select(p => new { Word = p.Key, Count = p.Count() })
).Take(200);
Will this query affect my performance badly? Is this the right way to check against a datatable?
A LINQ query like this one will execute a DataTable query for each distinct word found by the Regex.Split operation. I'm referring to this line of code:
.Where(q => (dtBlackList.Select("blacklistword = '" + q.Key.Trim() + "'").Count() == 0))
I've had to deal with a lot of performance problems on the project I'm working on right now, caused by situations similar to this one.
In general, running a separate query per item to check or complete data you have already extracted is not good practice.
In your case, I think it's much better to write a single query that will extract the blacklist words and then exclude that list from the dataset you have just extracted. As follows:
var words = Regex.Split(StringsForWordCloud, @"\W+")
.GroupBy(s => s, StringComparer.InvariantCultureIgnoreCase)
.Where(q => q.Key.Trim() != "")
.OrderByDescending(g => g.Count())
.Select(p => new { Word = p.Key, Count = p.Count() });
// Now extract all the words in the blacklist
IEnumerable<string> blackList = dtBlackList...
// Now exclude them from the set of words all at once
var result = words.Where(w => !blackList.Contains(w.Word))
.OrderByDescending(w => w.Count)
.Take(200);
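One way to fill in the blacklist step, as a sketch only: read the column once into a HashSet so that each lookup is a constant-time check instead of a DataTable query. The table contents and the sample text below are made up for illustration; only the column name blacklistword comes from the question. The case-insensitive comparer is an assumption chosen to match the grouping comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the real dtBlackList; only the column name comes from the question.
var dtBlackList = new DataTable();
dtBlackList.Columns.Add("blacklistword", typeof(string));
dtBlackList.Rows.Add("the");
dtBlackList.Rows.Add("a");

// Read the blacklist once into a case-insensitive set for fast lookups.
var blackList = new HashSet<string>(
    dtBlackList.Rows.Cast<DataRow>().Select(r => (string)r["blacklistword"]),
    StringComparer.InvariantCultureIgnoreCase);

string text = "The cat and the dog and a bird";   // sample input, not from the question
var result = Regex.Split(text, @"\W+")
    .Where(w => w.Length > 0 && !blackList.Contains(w))   // drop empties and blacklisted words
    .GroupBy(w => w, StringComparer.InvariantCultureIgnoreCase)
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count)
    .Take(200)
    .ToList();

foreach (var item in result)
    Console.WriteLine($"{item.Word}: {item.Count}");
```

Filtering before grouping also means blacklisted words are never grouped or counted at all.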

LINQ C# Selecting characters from string

I have a string that I convert to a char array. I then use LINQ to select the distinct characters and order them by descending frequency, but I only want the letters, not the punctuation marks etc.
Here is the code:
string inputString = "The black, and, white cat";
var something = inputString.ToCharArray();
var txtEntitites = something.GroupBy(c => c)
.OrderByDescending(g => g.Count())
.Where(e => Char.IsLetter(e)).Select(t=> t.Key);
And the error message I get:
Error CS1502: The best overloaded method match for `char.IsLetter(char)' has some invalid arguments (CS1502)
Error CS1503: Argument '#1' cannot convert 'System.Linq.IGrouping<char,char>' expression to type `char' (CS1503)
Any ideas? Thanks :)
Try this:
string inputString = "The black, and, white cat";
var something = inputString.ToCharArray();
var txtEntitites = something.GroupBy(c => c)
.OrderByDescending(g => g.Count())
.Where(e => Char.IsLetter(e.Key))
.Select(t=> t.Key);
Note the Char.IsLetter(e.Key).
Another idea is to rearrange your query:
var inputString = "The black, and, white cat";
var txtEntitites = inputString.GroupBy(c => c)
.OrderByDescending(g => g.Count())
.Select(t=> t.Key)
.Where(e => Char.IsLetter(e));
Also note you don't need the call to inputString.ToCharArray() since String is already an IEnumerable<Char>.
In your where clause, e in that context is your grouping, not the character. If you want to check if the character is a letter, you should be testing your key.
//...
.Where(g => Char.IsLetter(g.Key))
List<char> charArray = (
from c in inputString
// char.IsLetter avoids the punctuation ([, \, ], ^, _, `) that lies between 'Z' and 'a'
where char.IsLetter(c)
orderby c
select c
).Distinct()
.ToList();
I think this is what you are looking for
string inputString = "The black, and, white cat";
var something = inputString.ToCharArray();
var txtEntitites = something.Where(e => Char.IsLetter(e))
.GroupBy(c => c)
.OrderByDescending(g => g.Count()).Select(t=> t.Key);
