How to split lines but keep those with tabs? - c#

I know how to split a long string by new lines
string[] lines = d.Split(new string[] { Environment.NewLine },
StringSplitOptions.RemoveEmptyEntries);
But I would like to
Split in new lines like this
and in new lines + tabs
	like this, in order to have
	an array which contains the first line and this second block.
Basically I would like to split not by one rule, but by two.
The resulting array should contain the following strings:
[0] Split in new lines like this
[1] and in new lines + tabs
	like this, in order to have
	an array which contains the first line and this second block.

This trick should work: temporarily replace "\n\t" with a placeholder ("\r"), split on "\n", and then restore "\n\t" in each piece. The resulting array will have the desired number of strings. Note that this assumes the text uses "\n" line endings and contains no "\r" of its own; with "\r\n" line endings, pick a different placeholder character.
d = d.Replace("\n\t", "\r");
string[] lines = d.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
lines = lines.Select(line => line.Replace("\r", "\n\t")).ToArray();

What is a tab?
If it is 4 spaces, use the following pattern:
string pattern = Environment.NewLine + @"(?!\s{4})";
If it is a tab character, use the following pattern:
string pattern = Environment.NewLine + @"(?!\t)";
Next, split with a regular expression:
string[] lines = Regex.Split(text, pattern);
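As a rough, self-contained sketch of the lookahead idea above: the class and method names are made up for illustration, and the pattern uses "\r?\n" instead of Environment.NewLine (an assumption on my part) so that both Windows and Unix line endings are handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSplit
{
    // Splits on a line break only when the character after it is NOT a tab.
    // "\r?\n" is an assumption so that both "\r\n" and "\n" endings work;
    // the answer above builds the pattern from Environment.NewLine instead.
    public static string[] SplitKeepingTabbedLines(string text)
    {
        return Regex.Split(text, @"\r?\n(?!\t)");
    }
}
```

Calling LookaheadSplit.SplitKeepingTabbedLines on the question's text should give two elements, with the tab-indented lines still attached to the second one. The negative lookahead consumes nothing, so the tabs themselves stay in the output.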

One should use the right tool for the right situation. Avoiding a tool that is available to you (and regex is available in practically every programming language) is foolish.
Regex is best when a discernible pattern has to be expressed that can't be handled, or can't easily be handled, by the string functions. Below I use each tool in the situation it was designed for.
The following is a three-stage operation using regex, a string operation, and LINQ:
1. Identify which lines have to be kept together because they are indented. So that they are not lost in the main split, each \r\n\t is replaced with a pipe character followed by a tab (|\t) to mark it. This is done with regex because it lets us group and process the replacements with minimal overhead. It assumes the text itself contains no | characters.
2. Split the remaining text on the newline sequence, which yields the final array of lines.
3. Project (change) each line via LINQ's Select to turn each | back into \r\n.
Code
Regex.Replace(text, "\r\n\t", "|\t")
.Split(new string[] { Environment.NewLine }, StringSplitOptions.None )
.Select (rLine => rLine.Replace("|", Environment.NewLine));
Full code with before and after results, as run in LINQPad. Note that .Dump() is only available in LINQPad to show results and is not a LINQ extension.
Full code
string text = string.Format("{1}{0}{2}{0}\t\t{3}{0}\t\t{4}",
Environment.NewLine,
"Split in new lines like this",
"and in new lines + tabs",
"like this, in order to have",
"an array which contains the first line and this second block.");
text.Dump("Before");
var result = Regex.Replace(text, "\r\n\t", "|\t")
.Split(new string[] { Environment.NewLine }, StringSplitOptions.None )
.Select (rLine => rLine.Replace("|", Environment.NewLine));
result.Dump("after");

After you split all lines, you can just "join" tabbed lines like so:
List<string> lines = d.Split(new string[] { Environment.NewLine }, StringSplitOptions.None)
    .ToList();
// loop through all lines, but skip the first (let's assume it isn't tabbed)
for (int i = 1; i < lines.Count; i++)
{
    // current line starts with a tab (the length check keeps empty lines from throwing)
    if (lines[i].Length > 0 && lines[i][0] == '\t')
    {
        lines[i - 1] += Environment.NewLine + lines[i]; // append it to the previous line
        lines.RemoveAt(i); // remove current line from list
        i--; // and decrement i so you don't skip an item
    }
}
You could add more complex logic to handle different numbers of tabs if you wanted, but this should get you started.
If you expect many tabbed lines to be grouped together, you might want to use StringBuilder instead of string for better performance when appending the lines back together.
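The StringBuilder suggestion could look roughly like the sketch below. TabbedLineGrouper and Group are hypothetical names, and splitting on both "\r\n" and "\n" is an assumption added here so the sketch does not depend on the platform's newline.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TabbedLineGrouper
{
    // Groups each tab-indented line with the closest preceding non-indented line.
    public static List<string> Group(string text)
    {
        var blocks = new List<string>();
        StringBuilder current = null;

        // Assumption: accept both Windows and Unix line endings.
        foreach (string line in text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None))
        {
            bool isTabbed = line.Length > 0 && line[0] == '\t';
            if (current != null && isTabbed)
            {
                // Continuation line: append to the block being built.
                current.Append(Environment.NewLine).Append(line);
            }
            else
            {
                // A new block starts; flush the previous one first.
                if (current != null) blocks.Add(current.ToString());
                current = new StringBuilder(line);
            }
        }

        if (current != null) blocks.Add(current.ToString());
        return blocks;
    }
}
```

Unlike the list-based loop, this never removes items from the middle of a list, so it makes a single pass over the lines. Whether that matters in practice depends on how large the input is.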

Related

How to take a string, split it into an array and then join it back together

I have some code that takes a string from the console in [0,0,0,0] format. I then want to split it into an array, dropping the [] and keeping only the numbers. How do I do this? This is what I have. I thought I could split it all and remove the brackets afterwards, but the split doesn't seem to return the brackets; it just leaves an empty entry. Is there a way to split from index 1 to -1?
input = Console.ReadLine();
Char[] splitChars = {'[',',',']'};
List<string> splitString = new List<string>(input.Split(splitChars));
Console.WriteLine("[" + String.Join(",", splitString) + "]");
Console.ReadKey();
I love using LinqPad for such tasks, because you can use Console.WriteLine() to inspect the details of a result.
It becomes obvious that you have empty entries, i.e. "", at the beginning and the end. You can remove those with the overload that takes StringSplitOptions.RemoveEmptyEntries [MSDN]:
List<string> splitString = new List<string>(input.Split(splitChars, StringSplitOptions.RemoveEmptyEntries));
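To make the effect of RemoveEmptyEntries concrete, here is a small sketch. The BracketList class and its method names are invented for illustration; only String.Split, StringSplitOptions and String.Join come from the framework.

```csharp
using System;

public static class BracketList
{
    private static readonly char[] SplitChars = { '[', ',', ']' };

    // Plain split: the leading '[' and trailing ']' each produce an empty entry.
    public static string[] SplitRaw(string input)
    {
        return input.Split(SplitChars);
    }

    // Split that drops the empty entries, leaving only the items.
    public static string[] ParseItems(string input)
    {
        return input.Split(SplitChars, StringSplitOptions.RemoveEmptyEntries);
    }

    // Puts the items back into the bracketed, comma-separated form.
    public static string Format(string[] items)
    {
        return "[" + string.Join(",", items) + "]";
    }
}
```

For "[0,0,0,0]", SplitRaw returns six strings (an empty one at each end) while ParseItems returns the four numbers, and Format turns them back into "[0,0,0,0]". One thing to keep in mind is that RemoveEmptyEntries also drops genuinely empty items, so "[1,,2]" would come back with two entries rather than three.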

How to save the strings in array and display the next string array if match found?

I read a *.txt file in C# and display it in the console.
My text file looks like a table.
diwas hey
ivonne how
pokhara d kd
lekhanath when
dipisha dalli hos
dfsa sasf
Now I want to search for a string "pokhara" and if it is found then it should display the "d kd" and if not found display "Not found"
What I tried?
string[] lines = System.IO.File.ReadAllLines(@"C:\readme.txt");
foreach (string line in lines)
{
    string[] words = line.Split();
    foreach (string word in words)
    {
        if (word == "pokhara")
        {
            Console.WriteLine("Match Found");
        }
    }
}
My Problem:
A match is found, but how do I display the next word on the line? Also, sometimes
the second column contains two words separated by a space, and I need to show both words.
I guess your delimiter is the tab-character, then you can use String.Split and LINQ:
var lineFields = System.IO.File.ReadLines(@"C:\readme.txt")
    .Select(l => l.Split('\t'));
var matches = lineFields
    .Where(arr => arr.First().Trim() == "pokhara")
    .Select(arr => arr.Last().Trim());
// if you just want the first match:
string result = matches.FirstOrDefault(); // is null if not found
If you don't know the delimiter, as your comment suggests, you have a problem. If you don't even know the rules for how the fields are separated, it's very likely that your code is incorrect. So first determine the business logic: ask the people who created the text file. Then use the correct delimiter in String.Split.
If it's a space, you can either use string.Split() (without arguments), which splits on all white-space characters including spaces, tabs and new-line characters, or use string.Split(' '), which splits only on the space. But note that a space is a bad delimiter if the fields can contain spaces as well. In that case either use a different delimiter or wrap the fields in quoting characters, like "text with spaces". But then I suggest a real text parser like Microsoft.VisualBasic.FileIO.TextFieldParser, which can also be used in C#. It has a HasFieldsEnclosedInQuotes property.
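Assuming the columns really are tab-separated, the lookup with the requested "Not found" fallback might be sketched like this. TableLookup and Find are hypothetical names, and the method takes the lines as an array so it can be tried without a file.

```csharp
using System;
using System.Linq;

public static class TableLookup
{
    // Returns the second column of the first row whose first column equals key,
    // or "Not found" when no row matches.
    // Assumption: columns are separated by a single tab character.
    public static string Find(string[] lines, string key)
    {
        string match = lines
            .Select(line => line.Split('\t'))
            .Where(fields => fields.Length > 1 && fields[0].Trim() == key)
            .Select(fields => fields[1].Trim())
            .FirstOrDefault();

        return match ?? "Not found";
    }
}
```

Because the split is on the tab only, a second column such as "d kd" comes back whole, spaces included. With a file, the lines could come from System.IO.File.ReadAllLines.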
This works:
string[] lines = System.IO.File.ReadAllLines(@"C:\readme.txt");
foreach (string line in lines)
{
    // split on any white space, dropping empty entries
    string[] words = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    // I assume that the first word in every line is the key word to be found
    if (words.Length > 0 && words[0] == "pokhara")
    {
        Console.WriteLine("Match Found");
        // join the remaining words, separated by spaces
        string stringToBeDisplayed = string.Join(" ", words, 1, words.Length - 1);
        Console.WriteLine(stringToBeDisplayed);
    }
}

How do I know which delimiter was used when delimiting a string on multiple delimiters? (C#)

I read strings from a file and they come in various styles:
item0 item1 item2
item0,item1,item2
item0_item1_item2
I split them like this:
string[] split_line = lines[i].Split(new char[] { ' ', ',', '_' });
I change an item (column) and then I stitch the string back together using a StringBuilder.
But now, when putting the string back together, I have to use the right delimiter.
Is it possible to know which delimiter was used when splitting the string?
UPDATE
The caller will pass me the first item so that I only change that line.
Unless you keep track of the splitting yourself (one delimiter at a time), you can't.
Otherwise, you could create a regular expression that captures both the items and the delimiters, and go from there.
Instead of passing in an array of characters, you can use a Regex to split the string. The advantage of doing this is that you can capture the splitting character. Regex.Split inserts any captures between the elements of the array, like so:
string[] space = Regex.Split("123 456 789", @"([,_ ])");
// Results in { "123", " ", "456", " ", "789" }
string[] comma = Regex.Split("123,456,789", @"([,_ ])");
// Results in { "123", ",", "456", ",", "789" }
string[] underscore = Regex.Split("123_456_789", @"([,_ ])");
// Results in { "123", "_", "456", "_", "789" }
Then you can edit all items in the array with something like
for (int x = 0; x < space.Length; x += 2)
space[x] = space[x] + "x";
Console.WriteLine(String.Join("", space));
// Will print: 123x 456x 789x
One thing to be wary of when dealing with multiple separators is if there are any lines that have spaces, commas and underscores in them. e.g.
37,hello world,238_3
This code will preserve all the distinct separators but your results might not be expected. e.g. the output of the above would be:
37x,hellox worldx,238x_3x
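Building on the capture-group split, changing a single column while keeping whatever delimiters the line used could be sketched as follows. DelimiterPreservingEdit and ReplaceItem are illustrative names, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterPreservingEdit
{
    // Replaces the item at itemIndex (0-based) and keeps every original delimiter.
    // Regex.Split with a capture group returns items at even positions and
    // the captured delimiters at odd positions.
    public static string ReplaceItem(string line, int itemIndex, string newValue)
    {
        string[] parts = Regex.Split(line, @"([,_ ])");
        int position = itemIndex * 2;
        if (itemIndex < 0 || position >= parts.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(itemIndex));
        }

        parts[position] = newValue;
        return string.Concat(parts);
    }
}
```

Since the delimiters travel with the array, there is no need to remember which one was used. The caveat from above still applies: a line that mixes spaces, commas and underscores is split on all of them.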
As I mentioned, the caller passes me the first item, so I tried something like this:
// find the right row
if (lines[i].ToLower().StartsWith(rowID))
{
    // we have to know which delim was used to split the string since this will be
    // used when stitching the string back together.
    for (int delim = 0; delim < delims.Length; delim++)
    {
        // we split the line into an array and then use the array index as our column index
        split_line = lines[i].Trim().Split(delims[delim]);
        // we found the right delim
        if (split_line.Length > 1)
        {
            delim_used = delims[delim];
            break;
        }
    }
}
Basically, I try each delimiter on the line in turn and check the length of the resulting array. If it is > 1, that delimiter worked; otherwise I skip to the next one. This relies on the documented behavior of Split: "If this instance does not contain any of the characters in separator, the returned array consists of a single element that contains this instance."

Auto-numbering each line in textbox

I want to auto number each line that a user puts into a textbox and display the result in another textbox.
Turn this
blah blah blah
some stuff to be numbered
more stuff to number
to this
1 blah blah blah
2 some stuff to be numbered
3 more stuff to number
So far I have
output.Text = Regex.Replace(input.Text, input.Text, @"{1,}+");
But this is replacing all text with {1,}
I can't seem to figure out how to loop over each line and put a number and a space in front of it.
(I am new to C#)
Any suggestions?
It might be simpler to implement a non-Regex solution:
var numberedLines = input.Text
.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None)
.Select ((line, index) => (index + 1) + " " + line)
.ToArray();
var result = string.Join(Environment.NewLine, numberedLines);
output.Text = result;
The first line uses string.Split() to split up the string around line returns into an array. Then I use LINQ .Select method to apply a function to each element in the array - in this case, adding line number and space at the beginning of each line (index + 1 is necessary because the index values are 0-based). Then I use string.Join method to put the array back together into a single string.
Demo: http://ideone.com/DrFTfl
It can actually be done with a Regular Expression if you use a MatchEvaluator delegate to apply the line numbering:
var index = 1;
output.Text = Regex.Replace(input.Text, "^",
(Match m) => (index++).ToString() + " ",
RegexOptions.Multiline);
The pattern ^ normally matches the beginning of the input string. However, with RegexOptions.Multiline, it matches the beginning of each line. For the replacement, I use a delegate (anonymous function) that inserts the current line number plus a space at the beginning of the line, and then increments the index counter for the next line.
Demo: http://ideone.com/9LD0ZY
Why not just split by \r\n, concatenate each line of the string[] with an incremented number and a space, and then join by \r\n ?
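That suggestion could be written roughly as below. LineNumberer and NumberLines are made-up names, and the sketch assumes the text box content uses \r\n line breaks, as the suggestion does.

```csharp
using System;

public static class LineNumberer
{
    // Splits on "\r\n", prefixes each line with its 1-based number and a space,
    // then joins the lines back together with "\r\n".
    public static string NumberLines(string text)
    {
        string[] lines = text.Split(new[] { "\r\n" }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = (i + 1) + " " + lines[i];
        }
        return string.Join("\r\n", lines);
    }
}
```

In the form this would be used as output.Text = LineNumberer.NumberLines(input.Text);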
If you're new to C#, let's keep it as simple as possible.
By the looks of it, you already have all your strings. So it boils down to this:
// store all lines in a list
// ...
var list = new List<string> { "blah", "blah2", "and some more blah" };
var list2 = new List<string>();
var i = 1;
foreach (var str in list)
{
    list2.Add(string.Format("{0} {1}", i, str));
    i++;
}
// write contents of list2 back to wherever you want them visualized

remove stop words from text C#

I want to remove an array of stop words from an input string, and I have the following procedure:
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
string input = "Did you try this yourself before asking";
foreach (string word in arrToCheck)
{
    input = input.Replace(word, "");
}
Is this the best way to do this, especially when I have 450 stop words and the input string is long? I prefer using the Replace method because I want to remove the stop words when they appear in different morphological forms. For example, if the stop word is "do", then delete "do" from "doing", "does" and so on. Are there any suggestions for better and faster processing? Thanks in advance.
May I suggest a StringBuilder?
http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
StringBuilder input = new StringBuilder("Did you try this yourself before asking");
foreach (string word in arrToCheck)
{
    input.Replace(word, "");
}
Because it does all its processing inside its own data structure and doesn't allocate hundreds of new strings, I believe you will find it to be far more memory efficient.
There are a few aspects to this
Premature optimization
The method given works and is easy to understand/maintain. Is it causing a performance problem?
If not, then don't worry about it. If it ever causes a problem, then look at it.
Expected Results
In the example, what do you want the output to be?
"Did you this asking"
or
"Did you this  asking" (with the extra space left behind by a removed word)
You have added spaces to the end of "try" and "before" but not "yourself". Why? Typo?
string.Replace() is case-sensitive. If you care about casing, you need to modify the code.
Working with partials is messy.
Words change in different tenses. The example of 'do' being removed from 'doing' words, but how about 'take' and 'taking'?
The order of the stop words matters because you are changing the input. It is possible (I've no idea how likely but possible) that a word which was not in the input before a change 'appears' in the input after the change. Do you want to go back and recheck each time?
Do you really need to remove the partials?
Optimizations
The current method works its way through the input string n times, where n is the number of words to be redacted, creating a new string each time a replacement occurs. This is slow.
Using StringBuilder (see akatakritos's answer above) will speed that up somewhat, so I would try this first. Retest to see if this makes it fast enough.
LINQ can also be used.
EDIT
Just splitting by ' ' to demonstrate. You would need to allow for punctuation marks as well and decide what should happen with them.
END EDIT
[TestMethod]
public void RedactTextLinqNoPartials() {
    var arrToCheck = new string[] { "try", "yourself", "before" };
    var input = "Did you try this yourself before asking";
    var output = string.Join(" ", input.Split(' ').Where(wrd => !arrToCheck.Contains(wrd)));
    Assert.AreEqual("Did you this asking", output);
}
This removes all the whole words (and the spaces around them, so it is not possible to see where the words were removed), but without some benchmarking I would not say that it is faster.
Handling partials with LINQ becomes messy, but it can work if we only want one pass (no checking for 'discovered' words):
[TestMethod]
public void RedactTextLinqPartials() {
    var arrToCheck = new string[] { "try", "yourself", "before", "ask" };
    var input = "Did you try this yourself before asking";
    var output = string.Join(" ", input.Split(' ').Select(wrd => {
        var found = arrToCheck.FirstOrDefault(chk => wrd.IndexOf(chk) != -1);
        return found != null
            ? wrd.Replace(found, "")
            : wrd;
    }).Where(wrd => wrd != ""));
    Assert.AreEqual("Did you this ing", output);
}
Just from looking at this I would say that it is slower than the string.Replace() but without some numbers there is no way to tell. It is definitely more complicated.
Bottom Line
The String.Replace() approach (modified to use string builder and to be case insensitive) looks like a good first cut solution. Before trying anything more complicated I would benchmark it under likely performance conditions.
hth,
Alan.
Here you go:
var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
string input = "Did you try this yourself before asking";
string output = string.Join(
    " ",
    input
        .Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
        .Where(word => !words_to_remove.Contains(word))
);
Console.WriteLine(output);
This prints:
Did you this asking
The HashSet provides extremely quick lookups, so 450 elements in words_to_remove should be no problem at all. Also, we are traversing the input string only once (instead of once per word to remove as in your example).
However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
To remove not just "do" but "doing", "does" etc... you'll have to include all these variants in the words_to_remove. If you wanted to remove prefixes in a general way, this would be possible to do (relatively) efficiently using a trie of words to remove (or alternatively a suffix tree of input string), but what to do when "do" is not a prefix of something that should be removed, such as "did"? Or when it is prefix of something that shouldn't be removed, such as "dog"?
BTW, to remove words no matter their case, simply pass the appropriate case-insensitive comparer to HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
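A minimal sketch of the case-insensitive variant follows. StopWordFilter and Remove are hypothetical names, and StringComparer.OrdinalIgnoreCase is used here as one reasonable choice of comparer; the culture-aware comparer mentioned above would be passed the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Removes whole words that appear in stopWords, ignoring case.
    public static string Remove(string input, IEnumerable<string> stopWords)
    {
        // The comparer passed to the constructor decides how Contains matches.
        var lookup = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        var kept = input
            .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !lookup.Contains(word));

        return string.Join(" ", kept);
    }
}
```

The input is still traversed once, and each word costs a single hash lookup regardless of how many stop words there are. Punctuation attached to a word (for example "try,") is not handled by this sketch.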
--- EDIT ---
Here is another alternative:
var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
string input = "Did you try this yourself before asking";
string output = string.Join(
    " ",
    input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
);
I'm guessing it should be slower (unless string.Split uses a hashtable internally), but is nice and tidy ;)
For a simple way to remove a list of strings from your sentence, and aggregate the results back together, you can do the following:
var input = "Did you try this yourself before asking";
var arrToCheck = new [] { "try ", "yourself", "before " };
var result = input.Split(arrToCheck,
arrToCheck.Count(),
StringSplitOptions.None)
.Aggregate((first, second) => first + second);
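This will break your original string apart at your word delimiters and create one final string from the pieces of the split array.
The result will be "Did you this  before asking": the count argument (3 here) caps the number of substrings returned, so everything after the second match, including "before ", is left untouched. To remove every stop word, omit the count argument.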
Shorten your code, and use LINQ:
string[] arrToCheck = new string[] { "try ", "yourself", "before " };
var test = new StringBuilder("Did you try this yourself before asking");
Array.ForEach(arrToCheck, x => test.Replace(x, ""));
Console.WriteLine(test.ToString());
// stop is the collection of stop words; it is not declared in the original snippet
String.Join(" ", input.Split(' ')
    .Where(w => stop.Where(sW => sW == w).FirstOrDefault() == null)
    .ToArray());
