How to perform tokenization and stopword removal in C#?

Basically, I want to tokenize each word of a paragraph and then remove the stopwords. The result will be the preprocessed data for my algorithm.

You can remove all punctuation and split the string on whitespace.
string s = "This is, a sentence.";
s = s.Replace(",", "").Replace(".", "");
string[] words = s.Split(' ');

If you read from a text file or any other text, you can:
char[] dele = { ' ', ',', '.', '\t', ';', '#', '!' };
// path is the location of your text file
List<string> list = File.ReadAllText(path).Split(dele).ToList();
Then you can load the stop-words into a dictionary, keep your document in the list, and remove them:
foreach (KeyValuePair<string, string> word in StopWords)
{
if (list.Contains(word.Key))
list.RemoveAll(s => s == word.Key);
}

You can store all separator characters and stopwords in constants or in a database:
public static readonly char[] WordsSeparators = {
' ', '\t', '\n', '\r', '\u0085'
};
public static readonly string[] StopWords = {
"stop", "word", "is", "here"
};
Remove all punctuation, then split the text and filter:
var words = new List<string>();
var stopWords = new HashSet<string>(TextOperationConstants.StopWords);
foreach (var term in text.Split(TextOperationConstants.WordsSeparators))
{
if (String.IsNullOrWhiteSpace(term)) continue;
if (stopWords.Contains(term)) continue;
words.Add(term);
}
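The answers above can be combined into one small helper. This is only a sketch under a few assumptions: the class name Tokenizer, the method name Tokenize, and the sample stopword list are hypothetical, and punctuation is handled by treating every character that is not a letter or digit as a separator, which may be too aggressive for text with apostrophes or hyphens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Tokenizer
{
    // Hypothetical sample list; replace it with your real stopwords.
    // The comparer makes the lookup case-insensitive.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(new[] { "this", "is", "a" }, StringComparer.OrdinalIgnoreCase);

    public static List<string> Tokenize(string text)
    {
        // Turn every character that is not a letter or digit into a space.
        var cleaned = new string(
            text.Select(c => char.IsLetterOrDigit(c) ? c : ' ').ToArray());

        // Split on spaces, drop empty entries, then drop stopwords.
        return cleaned
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !StopWords.Contains(word))
            .ToList();
    }
}
```

With this sample list, Tokenizer.Tokenize("This is, a sentence.") leaves only the token "sentence".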

Related

Removing words from text with separators in front(using Regex or List)

I need to remove words from the text together with the separators next to them. The problem is that the program only removes one separator after the word, but there are often many of them. Any suggestions on how to remove the other separators?
Also, I need to make sure that the word is not connected to other letters. For example, if the word is fHouse or Housef, it should not be removed.
At the moment I have:
public static void Process(string fin, string fout)
{
using (var foutv = File.CreateText(fout)) //fout - OutPut.txt
{
using (StreamReader reader = new StreamReader(fin)) // fin - InPut.txt
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] WordsToRemove = { "Home", "House", "Room" };
char[] seperators = {';', ' ', '.', ',', '!', '?', ':'};
foreach(string word in WordsToRemove)
{
foreach (char seperator in seperators)
{
line = line.Replace(word + seperator, string.Empty);
}
}
foutv.WriteLine(line);
}
}
}
}
Input I have:
fhgkHouse!House!Dog;;;!!Inside!C!Room!Home!House!Room;;;;;;;;;;!Table!London!Computer!Room;..;
Results I get:
fhgkDog;;;!!Inside!C!;;;;;;;;;!Table!London!Computer!..;
The results should be:
fhgkHouse!Dog;;;!!Inside!C!Table!London!Computer!
Try this regex: \b(Home|House|Room)(!|;)*\b|;+\.\.;+
See it at: https://regex101.com/r/LUsyM8/1
There, I replace the matched words and special characters with an empty string.
It produces the expected result for your sample input.
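For reference, the same pattern can be applied in C# with Regex.Replace. This is a minimal sketch: the class name WordRemover and the method name Clean are hypothetical, and the pattern is copied unchanged from the answer, so it is tailored to the sample input (including the literal ;..; tail) and is not a general solution.

```csharp
using System.Text.RegularExpressions;

public static class WordRemover
{
    // Pattern taken verbatim from the answer above.
    // \b keeps words such as "fhgkHouse" intact because there is no
    // word boundary between "k" and "H".
    private static readonly Regex Pattern =
        new Regex(@"\b(Home|House|Room)(!|;)*\b|;+\.\.;+");

    // Removes every match by replacing it with an empty string.
    public static string Clean(string line)
    {
        return Pattern.Replace(line, string.Empty);
    }
}
```

Inside the Process method from the question, the two nested foreach loops could then be replaced by line = WordRemover.Clean(line);.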

How to insert a text file into a Binary Search Tree?

I know that a loop is involved in order to insert each word to the BST, but I'm not sure how to implement it.
You created an insert function in your binary search tree, so use it.
class Program
{
static BSTree<string> myTree = new BSTree<string>();
static void Main(string[] args)
{
readFile("textfile.txt");
string buffer = "";
myTree.InOrder(ref buffer);
Console.WriteLine(buffer);
}
static void readFile(string filename)
{
const int MAX_FILE_LINES = 50000;
string[] AllLines = new string[MAX_FILE_LINES];
//reads from bin/DEBUG subdirectory of project directory
AllLines = File.ReadAllLines(filename);
foreach (string line in AllLines)
{
//split words using space , . ?
string[] words = line.Split(' ', ',', '.', '?', ';', ':', '!');
foreach (string word in words)
{
if (word != "")
myTree.InsertItem(word.ToLower());
}
}
}
}
On another note, your InOrder function will return a string that starts with a ',' character. I am not sure whether this was intended. Also, for various reasons, you may want to use a StringBuilder instead of manipulating a string.

Replace all occurrences of a string (in array) with a single value

I have a string array:
string[] arr2 = { "/", "#", "&" };
I have another string (i.e. strValue). Is there a clean way to replace all instances of the array contents with a single value (i.e. an underscore)? So before:
strValue = "a/ new string, with some# values&"
And after:
strValue = "a_ new string, with some_ values_"
I considered doing this:
strValue = strValue.Replace("/", "_");
strValue = strValue.Replace("#", "_");
strValue = strValue.Replace("&", "_");
But my array of characters to replace may become a lot bigger.
Instead of calling Replace over and over, you could just write your own. This might even be a performance gain, since you mentioned: "But my array may get a lot bigger."
public string Replace(string original, char replacement, params char[] replaceables)
{
StringBuilder builder = new StringBuilder(original.Length);
HashSet<char> replaceable = new HashSet<char>(replaceables);
foreach(Char character in original)
{
if (replaceable.Contains(character))
builder.Append(replacement);
else
builder.Append(character);
}
return builder.ToString();
}
public string Replace(string original, char replacement, string replaceables)
{
return Replace(original, replacement, replaceables.ToCharArray());
}
Can be called like this:
Debug.WriteLine(Replace("a/ new string, with some# values&", '_', '/', '#', '&'));
Debug.WriteLine(Replace("a/ new string, with some# values&", '_', new[] { '/', '#', '&' }));
Debug.WriteLine(Replace("a/ new string, with some# values&", '_', existingArray));
Debug.WriteLine(Replace("a/ new string, with some# values&", '_',"/#&"));
Output:
a_ new string, with some_ values_
a_ new string, with some_ values_
a_ new string, with some_ values_
a_ new string, with some_ values_
As @Sebi pointed out, this would also work as an extension method:
public static class StringExtensions
{
public static string Replace(this string original, char replacement, params char[] replaceables)
{
StringBuilder builder = new StringBuilder(original.Length);
HashSet<Char> replaceable = new HashSet<char>(replaceables);
foreach (Char character in original)
{
if (replaceable.Contains(character))
builder.Append(replacement);
else
builder.Append(character);
}
return builder.ToString();
}
public static string Replace(this string original, char replacement, string replaceables)
{
return Replace(original, replacement, replaceables.ToCharArray());
}
}
Usage:
"a/ new string, with some# values&".Replace('_', '/', '#', '&');
existingString.Replace('_', new[] { '/', '#', '&' });
// etc.
This is how I'd do it: build a regex clause from the list of delimiters and replace every match with an underscore.
string[] delimiters = { "/", "#", "&" };
string clause = $"[{string.Join("]|[", delimiters)}]";
string strValue = "a/ new string, with some# values&";
Regex chrsToReplace = new Regex(clause);
string output = chrsToReplace.Replace(strValue, "_");
You'll probably want to wrap this in if (delimiters.Any()); otherwise it will throw if the array is empty.
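A slightly more defensive variant of this regex approach is sketched below. The class name DelimiterReplacer and the method name ReplaceAll are hypothetical. It assumes the delimiters are non-empty literal strings, escapes them with Regex.Escape so that characters such as ] or ^ cannot break the pattern, and returns the input unchanged when there are no delimiters. Note that the replacement argument is still interpreted as a regex replacement pattern, so a $ in it has special meaning.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimiterReplacer
{
    public static string ReplaceAll(string input, string[] delimiters, string replacement)
    {
        // Nothing to replace: avoid building an invalid or empty pattern.
        if (delimiters == null || delimiters.Length == 0)
            return input;

        // Escape each delimiter so it is matched literally, then join with "|".
        string pattern = string.Join("|", delimiters.Select(Regex.Escape));
        return Regex.Replace(input, pattern, replacement);
    }
}
```

For the example from the question, DelimiterReplacer.ReplaceAll(strValue, new[] { "/", "#", "&" }, "_") gives "a_ new string, with some_ values_".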
Sure. Here's one approach:
var newString = arr2.Aggregate(strValue, (net, curr) => net.Replace(curr, "_"));
If you're only substituting individual characters and have large enough input sizes to need optimization, you can create a set from which to substitute:
var substitutions = new HashSet<char>() { '/', '#', '&' };
var strValue = "a/ new string, with some# values&";
var newString = new string(strValue.Select(c => substitutions.Contains(c) ? '_' : c).ToArray());
Maybe not the fastest but the easiest would be a Select with a Contains.
Something like this: source.Select(c => blacklist.Contains(c) ? letter : c)
Demo on .NetFiddle.
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var strValue = "a/ new string, with some# values&";
Console.WriteLine(strValue.Replace("/#&", '_'));
}
}
public static class Extensions {
public static string Replace(this string source, string blacklist, char letter) =>
new string(source.Select(c => blacklist.Contains(c) ? letter : c).ToArray());
}
You can split your string on your string[] array of delimiters:
string[] arr2 = { "/", "#", "&" };
string strValue = "a/ new string, with some# values&";
string Output = null;
string[] split = strValue.Split(arr2, StringSplitOptions.RemoveEmptyEntries);
foreach (var item in split)
{
Output += item + "_";
}
Console.WriteLine(Output);
//-> a_ new string, with some_ values_
Updated answer following @aloisdg's comment (interesting article, thank you).
string[] arr2 = { "/", "#", "&" };
string strValue = "a/ new string, with some# values&";
string[] split = strValue.Split(arr2, StringSplitOptions.RemoveEmptyEntries);
StringBuilder Output = new StringBuilder();
foreach (var item in split)
{
Output.Append(item + "_");
}
Console.WriteLine(Output);
//-> a_ new string, with some_ values_
You could use a foreach in a single line to achieve what you want:
arr2.ToList().ForEach(x => strValue = strValue.Replace(x, "_"));

TrimStart and TrimEnd not working as wanted

I am trying to cut a string in C#, but I am not getting the correct results.
It still shows the full text of exactString.
String exactString = "ABC##^^##DEF";
char[] Delimiter = { '#', '#', '^', '^', '#', '#' };
string getText1 = exactString.TrimEnd(Delimiter);
string getText2 = exactString.TrimStart(Delimiter);
MessageBox.Show(getText1);
MessageBox.Show(getText2);
OUTPUT:
ABC##^^##DEF for both getText1 and getText2.
Correct OUTPUT should be
ABC for getText1 and DEF for getText2.
How do I fix it?
Thanks.
You want to split your string, not trim it. Thus, the correct method to use is String.Split:
String exactString = "ABC##^^##DEF";
var result = exactString.Split(new string[] {"##^^##"}, StringSplitOptions.None);
Console.WriteLine(result[0]); // outputs ABC
Console.WriteLine(result[1]); // outputs DEF
You are looking for String.Replace, not Trim.
char[] Delimiter = { '#', '^' };
string getText1 = exactString;
foreach (char c in Delimiter)
getText1 = getText1.Replace(c.ToString(), "");
Trim only removes characters at the start and end of the string; Replace looks through the whole string.
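To make the difference concrete, here is a small sketch of how TrimEnd and TrimStart treat their argument: the char array is a set of characters to strip from one end of the string, not a substring to search for. The class name TrimDemo and its method names are hypothetical.

```csharp
public static class TrimDemo
{
    // The array is a set of characters; repeating '#' or '^' adds nothing.
    private static readonly char[] Delimiters = { '#', '^' };

    // Strips '#' and '^' from the end until another character is reached.
    public static string TrimTail(string s)
    {
        return s.TrimEnd(Delimiters);
    }

    // Strips '#' and '^' from the start until another character is reached.
    public static string TrimHead(string s)
    {
        return s.TrimStart(Delimiters);
    }
}
```

TrimDemo.TrimTail("ABC##^^##DEF") returns the string unchanged because it ends with 'F', which is not in the set. TrimDemo.TrimTail("ABC##^^##") returns "ABC", and TrimDemo.TrimHead("##^^##DEF") returns "DEF".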
You can split strings up in 2 pieces using the (conveniently named) String.Split method.
char[] Delimiter = { '#', '^' };
string[] text = exactString.Split(Delimiter, StringSplitOptions.RemoveEmptyEntries);
// text[0] = "ABC", text[1] = "DEF"
You can use the String.Split method:
String exactString = "ABC##^^##DEF";
string[] splits = exactString.Split(new string[]{"##^^##"}, StringSplitOptions.None);
string getText1 = splits[0];
string getText2 = splits[1];
MessageBox.Show(getText1);
MessageBox.Show(getText2);

How would you remove the blank entry from an array?

How would you remove the blank item from the array?
Iterate and assign non-blank items to new array?
String test = "John, Jane";
//Without using the test.Replace(" ", "");
String[] toList = test.Split(',', ' ', ';');
Use the overload of string.Split that takes a StringSplitOptions:
String[] toList = test.Split(new []{',', ' ', ';'}, StringSplitOptions.RemoveEmptyEntries);
You would use the overload of string.Split which allows the suppression of empty items:
String test = "John, Jane";
String[] toList = test.Split(new char[] { ',', ' ', ';' },
StringSplitOptions.RemoveEmptyEntries);
Or even better, you wouldn't create a new array each time:
private static readonly char[] Delimiters = { ',', ' ', ';' };
// Alternatively, if you find it more readable...
// private static readonly char[] Delimiters = ", ;".ToCharArray();
...
String[] toList = test.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
Split doesn't modify the delimiter array, so sharing it should be fine.
string[] result = toList.Where(s => !string.IsNullOrWhiteSpace(s)).ToArray();
Try this out using Array.FindAll on the split result:
var n = Array.FindAll(toList, str => str.Trim() != string.Empty);
You can put them in a list and then call the ToArray method of the list, or with LINQ you could just select the non-blank items and call ToArray.
If the separator is followed by a space, you can just include it in the separator:
String[] toList = test.Split(
new string[] { ", ", "; " },
StringSplitOptions.None
);
If the separator also occurs without the trailing space, you can include those too:
String[] toList = test.Split(
new string[] { ", ", "; ", ",", ";" },
StringSplitOptions.None
);
Note: If the string contains truly empty items, they will be preserved. That is, "Dirk, , Arthur" will not give the same result as "Dirk, Arthur".
string[] toList = test.Split(',', ' ', ';').Where(v => !string.IsNullOrEmpty(v.Trim())).ToArray();
