I'm trying to remove all conjunctions and pronouns from any array of strings(let call that array A), The words to be removed are read from a text file and converted into an array of strings(lets call that array B).
What I need is to Get the first element of array A and compare it to every word in array B, if the word matches I want to delete the word of array A.
For example:
array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
array B = [0]I [1]and [2]go [3]to
Result= array A = [0]want [1]home [2]sleep
//remove any duplicates,conjunctions and Pronouns
public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
{
//get words to be removed
string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
//split word into array of strings
string[] wordsToBeRemoved = text.Split(',');
//all articles
foreach (var article in myArticles)
{
//split articles into words
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
//loop through array of articles words
foreach (var y in articleSplit)
{
//loop through words to be removed from articleSplit
foreach (var x in wordsToBeRemoved)
{
//if word of articles matches word to be removed, remove word from article
if (y == x)
{
//get index of element in array to be removed
int g = Array.IndexOf(articleSplit,y);
//assign elemnt to ""
articleSplit[g] = "";
}
}
}
//re-assign splitted article to string
article.ArticleContent = articleSplit.ToString();
}
return myArticles;
}
If it is possible as well, I need array A to have no duplicates/distinct values.
You are looking for IEnumerable.Except, where the passed parameter is applied to the input sequence and every element of the input sequence not present in the parameter list is returned only once
For example
string inputText = "I want this string to be returned without some words , but words should have only one occurence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};
var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
Console.WriteLine(s);
// ---- Output ----
want
this
string
returned
without
words <<-- This appears only once
only
occurence
And applied to your code is:
string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
var result = articleSplit.Except(wordsToBeRemoved);
article.ArticleContent = string.Join(" ", result);
}
You may have your answer already in your code. I am sure your code could be cleaned up a bit, as all our code could be. You loop through articleSplit and pull out each word. Then compare that word to the words in the wordsToBeRemoved array in a loop one by one. You use your conditional to compare and when true you remove the items from your original array, or at least try.
I would create another array of the results and then display, use or what ever you'd like with the array minus the words to exclude.
Loop through articleSplit
foreach x in arcticle split
foreach y in wordsToBeRemoved
if x != y newArray.Add(x)
However this is quite a bit of work. You may want to use array.filter and then add that way. There is a hundred ways to achieve this.
Here are some helpful articles:
filter an array in C#
https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx
These will save you from all that looping.
You want to remove stop words. You can do it with a help of Linq:
...
string filePath = #"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions & ProNouns.txt";
// Hashset is much more efficient than array in the context
HashSet<string> stopWords = new HashSet<string>(File
.ReadLines(filePath), StringComparer.OrdinalIgnoreCase);
foreach (var article in myArticles) {
// read article, split into words, filter out stop words...
var cleared = article
.ArticleContent
.Split(' ')
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
...
Please, notice, that I've preserved Split() which you've used in your code and so you have a toy implementation. In real life you have at least to take punctuation into consideration, and that's why a better code uses regular expressions:
foreach (var article in myArticles) {
// read article, extract words, filter out stop words...
var cleared = Regex
.Matches(article.ArticleContent, #"\w+") // <- extract words
.OfType<Match>()
.Select(match => match.Value)
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
Related
So I have a server that receives a connection with the message being converted to a string, I then have this string split between by the spaces
So you have a line:
var line = "hello world my name is bob";
And you don't want "world" or "is", so you want:
"hello my name bob"
If you split to a list, remove the things you don't want and recombine to a line, you won't have extraneous spaces:
var list = line.Split().ToList();
list.Remove("world");
list.Remove("is");
var result = string.Join(" ", list);
Or if you know the exact index positions of your list items, you can use RemoveAt, but remove them in order from highest index to lowest, because if you e.g. want to remove 1 and 4, removing 1 first will mean that the 4 you wanted to remove is now in index 3.. Example:
var list = line.Split().ToList();
list.RemoveAt(4); //is
list.RemoveAt(1); //world
var result = string.Join(" ", list);
If you're seeking a behavior that is like string.Replace, which removes all occurrences, you can use RemoveAll:
var line = "hello is world is my is name is bob";
var list = line.Split().ToList();
list.RemoveAll(w => w == "is"); //every occurence of "is"
var result = string.Join(" ", list);
You could remove the empty space using TrimStart() method.
Something like this:
string text = "Hello World";
string[] textSplited = text.Split(' ');
string result = text.Replace(textSplited[0], "").TrimStart();
Assuming that you only want to remove the first word and not all repeats of it, a much more efficient way is to use the overload of split that lets you control the maximum number of splits (the argument is the maximum number of results, which is one more than the maximum number of splits):
string[] arguments = line.Split(new[] { ' ' }, 2, StringSplitOptions.RemoveEmptyEntries); // split only once
User.data = arguments.Skip(1).FirstOrDefault();
arguments[1] does the right thing when there are "more" arguments, but throw IndexOutOfRangeException if the number of words is zero or one. That could be fixed without LINQ by (arguments.Length > 1)? arguments[1]: string.Empty
If you're just removing the first word of a string, you don't need to use Split at all; doing a Substring after you found the space will be more efficient.
var line = ...
var idx = line.IndexOf(' ')+1;
line = line.Substring(idx);
or in recent C# versions
line = line[idx..];
Question:
I have a array of strings and I am trying to find the closest match to a provided string. I have made a few attempts below as well as checking into some other solutions such as Levenshtein Distance which seems to only work if all the strings are of similar sizes.
Expetation:
If I were to use "two are better" as the match string is that it would match with "Two are better than one".
Thought:
I was wondering if breaking apart the stringToMatch string where there are spaces and then seeing if each of those parts of the stringToMatch string are found in the current iteration of the array ( arrayOfStrings[i] ) would be helpful at all?
// Test array and string to search
string[] arrayOfStrings = new string[] { "A hot potato", "Two are better than one", "Best of both worlds", "Curiosity killed the cat", "Devil's Advocate", "It takes two to tango", "a twofer" };
string stringToMatch = "two are better";
// Contains attempt
List<string> likeNames = new List<string>();
for (int i = 0; i < arrayOfStrings.Count(); i++)
{
if (arrayOfStrings[i].Contains(stringToMatch))
{
Console.WriteLine("Hit1");
likeNames.Add(arrayOfStrings[i]);
}
if (stringToMatch.Contains(arrayOfStrings[i]))
{
Console.WriteLine("Hit2");
likeNames.Add(arrayOfStrings[i]);
}
}
// StringComparison attempt
var matches = arrayOfStrings.Where(s => s.Equals(stringToMatch, StringComparison.InvariantCultureIgnoreCase)).ToList();
// Display matched array items
Console.WriteLine("List likeNames");
likeNames.ForEach(Console.WriteLine);
Console.WriteLine("\n");
Console.WriteLine("var matches");
matches.ForEach(Console.WriteLine);
You can try below code.
I have created List<string> based on your stringToMatch and checked if strings in array of strings contains every string present in toMatch, if yes then selected that string into match.
List<string> toMatch = stringToMatch.Split(' ').ToList();
List<string> match = arrayOfStrings.Where(x =>
!toMatch.Any(ele => !x.ToLower()
.Contains(ele.ToLower())))
.ToList();
For your implementation, I have split the stringToMatch and then took the count for matchings.
The below code will give you Order list with count with ordered with Highest string match count.
string[] arrayOfStrings = new string[] { "A hot potato", "Two are better than one", "Best of both worlds", "Curiosity killed the cat", "Devil's Advocate", "It takes two to tango", "a twofer" };
string stringToMatch = "two are better";
var matches = arrayOfStrings
.Select(s =>
{
int count = 0;
foreach (var item in stringToMatch.Split(' '))
{
if (s.Contains(item))
count++;
}
return new { count, s };
}).OrderByDescending(d => d.count);
I have used very simple string comparison to verify. The algorithm can vary as per exact requirement(Like sequence of matching string, etc)
I read the *.txt file from c# and displayed in the console.
My text file looks like a table.
diwas hey
ivonne how
pokhara d kd
lekhanath when
dipisha dalli hos
dfsa sasf
Now I want to search for a string "pokhara" and if it is found then it should display the "d kd" and if not found display "Not found"
What I tried?
string[] lines = System.IO.ReadAllLines(#"C:\readme.txt");
foreach(string line in lines)
{
string [] words = line.Split();
foreach(string word in words)
{
if (word=="pokhara")
{
Console.WriteLine("Match Found");
}
}
}
My Problem:
Match was found but how to display the next word of the line. Also sometimes
in second row some words are split in two with a space, I need to show both words.
I guess your delimiter is the tab-character, then you can use String.Split and LINQ:
var lineFields = System.IO.File.ReadLines(#"C:\readme.txt")
.Select(l => l.Split('\t'));
var matches = lineFields
.Where(arr => arr.First().Trim() == "pokhara")
.Select(arr => arr.Last().Trim());
// if you just want the first match:
string result = matches.FirstOrDefault(); // is null if not found
If you don't know the delimiter as suggested by your comment you have a problem. If you don't even know the rules of how the fields are separated it's very likely that your code is incorrect. So first determine the business logic, ask the people who created the text file. Then use the correct delimiter in String.Split.
If it's a space you can either use string.Split()(without argument), that includes spaces, tabs and new-line characters or use string.Split(' ') which only includes the space. But note that is a bad delimiter if the fields can contain spaces as well. Then either use a different or wrap the fields in quoting characters like "text with spaces". But then i suggest a real text-parser like the Microsoft.VisualBasic.FileIO.TextFieldParser which can also be used in C#. It has a HasFieldsEnclosedInQuotes property.
This works ...
string[] lines = System.IO.ReadAllLines(#"C:\readme.txt");
string stringTobeDisplayed = string.Empty;
foreach(string line in lines)
{
stringTobeDisplayed = string.Empty;
string [] words = line.Split();
//I assume that the first word in every line is the key word to be found
if (word[0].Trim()=="pokhara")
{
Console.WriteLine("Match Found");
for(int i=1 ; i < words.Length ; i++)
{
stringTobeDisplayed += words[i]
}
Console.WriteLine(stringTobeDisplayed);
}
}
Currently I have large entries (in array) of Pinyin tone notation, some string are combined, for example Diànnǎo = Diàn + nǎo
Now problem is I want replace a string that contain 2 or more, for example:
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie, "shien" };
string[] Input = "xiaguo";
for (int i = 0; i < Py.Length; i++)
if (Input.Contains(Py[i]))
Input = Input.Replace(Py[i], Km[i]);
Code above have a problem due to loop index, xiaguo contains xi become true (shiaguo) not (shieguo) since xi get first before xia
How do I achieve this? and make sure get xia instead of xi
Full code I posted on GitHub: https://github.com/Anime4000/py2km/blob/beta/py2km.api/Converter.cs#L15
Assuming longer tokens take precedence over shorter tokens, the 2 arrays can be converted to a dictionary and then sorted by the length of the key:
var dic = new Dictionary<string, string>
{
{"xi","shi"},
{"xia","shie"},
{"xian","shien"},
}.OrderByDescending(x => x.Key.Length)
.ThenBy(x => x.Key)
.ToDictionary(x => x.Key, x => x.Value);
string input = "xiaguo";
foreach(var d in dic)
input = input.Replace(d.Key, d.Value);
Console.WriteLine(input);
The above example with sort the dictionary:
by the length of the key
then by the alpha sort of the key
finally, the LINQ query is converted back to a dictionary.
From there, just iterate over the dictionary and replace all the tokens; there's no need to check to see if the key/token exists.
you could use a regular expresion for this.
I modified your code so the regex wil only match xi and not xia.
the regex "xi\b" matches xi and the \b means word boundary so it only matches that exact word.
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie, "shien" };
string[] Input = "xiaguo";
string pattern = "xi\b"
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
for (int i = 0; i < Py.Length; i++)
{
MatchCollection matches = rgx.Matches(Py[i]);
if (matches.Count > 0)
{
Input = Input.Replace(Py[i], Km[i]);
}
}
The tone/language specifics could not have an easy structure, so you may assume some pattern and then find out later that it's not right for some 'word'.
Anyway, to handle the informed scenario, you must order the target tone by descending length, and then perform only a single replacement for each 'word' (this will avoid replacing xi, xia when processing xian.
The steps would be:
for each replacement ordered by length descending
try to find tone
if found: replace and mark as done (jump to next 'word')
The idea here is the same as when replacing a two numbers in a list, say 2 to 1 and 3 to 2, for example. The order really matter, if you replace 3 by 2 then you will be replacing both 3 and 2 to 1 after all.
I am trying to use regex to split the string into 2 arrays to turn out like this.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
How do I split str1 to break off into 2 arrays that look like this:
ary1 = ['First Second','Third Forth','Fifth'];
ary2 = ['insideFirst','insideSecond'];
here is my solution
string str = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
MatchCollection matches = Regex.Matches(str,#"\[.*?\]");
string[] arr = matches.Cast<Match>()
.Select(m => m.Groups[0].Value.Trim(new char[]{'[',']'}))
.ToArray();
foreach (string s in arr)
{
Console.WriteLine(s);
}
string[] arr1 = Regex.Split(str,#"\[.*?\]")
.Select(x => x.Trim())
.ToArray();
foreach (string s in arr1)
{
Console.WriteLine(s);
}
Output
insideFirst
insideSecond
First Second
Third Forth
Fifth
Plz Try below code. Its working fine for me.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var output = String.Join(";", Regex.Matches(str1, #"\[(.+?)\]")
.Cast<Match>()
.Select(m => m.Groups[1].Value));
string[] strInsideBreacket = output.Split(';');
for (int i = 0; i < strInsideBreacket.Count(); i++)
{
str1 = str1.Replace("[", ";");
str1 = str1.Replace("]", "");
str1 = str1.Replace(strInsideBreacket[i], "");
}
string[] strRemaining = str1.Split(';');
Plz look at below screen shot of output while debugging code:
Here,
strInsideBreacket is array of breacket value like insideFirst andinsideSecond
and strRemaining is array of First Second,Third Forth and Fifth
Thanks
Try this solution,
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var allWords = str1.Split(new char[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries);
var result = allWords.GroupBy(x => x.Contains("inside")).ToArray();
The idea is that, first get all words and then the group it.
It seems to me that "user2828970" asked a question with an example, not with literal text he wanted to parse. In my mind, he could very well have asked this question:
I am trying to use regex to split a string like so.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, #"\d+");
The result of regexSplit is: I had, birds but, of them flew away.
However, I also want to know the value which resulted in the second string splitting away from its preceding text, and the value which resulted in the third string splitting away from its preceding text. i.e.: I want to know about 185 and 20.
The string could be anything, and the pattern to split by could be anything. The answer should not have hard-coded values.
Well, this simple function will perform that task. The code can be optimized to compile the regex, or re-organized to return multiple collections or different objects. But this is (nearly) the way I use it in production code.
public static List<Tuple<string, string>> RegexSplitDetail(this string text, string pattern)
{
var splitAreas = new List<Tuple<string, string>>();
var regexResult = Regex.Matches(text, pattern);
var regexSplit = Regex.Split(text, pattern);
for (var i = 0; i < regexSplit.Length; i++)
splitAreas.Add(new Tuple<string, string>(i == 0 ? null : regexResult[i - 1].Value, regexSplit[i]));
return splitAreas;
}
...
var result = exampleSentence.RegexSplitDetail(#"\d+");
This would return a single collection which looks like this:
{ null, "I had "}, // First value, had no value splitting it from a predecessor
{"185", " birds but "}, // Second value, split from the preceding string by "185"
{ "20", " of them flew away"} // Third value, split from the preceding string by "20"
Being that this is a .NET Question and, apart from my more favoured approach in my other answer, you can also capture the Split Value another VERY Simple way. You just then need to create a function to utilize the results as you see fit.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, #"(\d+)");
The result of regexSplit is: I had, 185, birds but, 20, of them flew away. As you can see, the split values exist within the split results.
Note the subtle difference compared to my other answer. In this regex split, I used a Capture Group around the entire pattern (\d+) You can't do that!!!?.. can you?
Using a Capture Group in a Split will force all capture groups of the Split Value between the Split Result Capture Groups. This can get messy, so I don't suggest doing it. It also forces somebody using your function(s) to know that they have to wrap their regexes in a capture group.