I have a large string read from a text file (e.g. a 1 MB text file) and I want to process it. Processing takes nearly 10 minutes.
The string is read character by character, and a counter is incremented for each character. Space, comma, colon and semicolon are all counted as a space, so they increment the space counter; all other non-alphabetic characters are ignored.
Code:
string fileContent = "....." // a large string
int min = 0;
int max = fileContent.Length;
Dictionary<char, int> occurrence // example c=>0, m=>4, r=>8 etc....
// Note: occurrence has only a-z letters and a space. Comma, colon and semicolon are counted as space; all other characters are ignored.
for (int i = min; i <= max; i++) // run loop to end
{
try // increment counter for alphabets and space
{
occurrence[fileContent[i]] += 1;
}
catch (Exception e) // comma, colon and semi-colon are spaces
{
if (fileContent[i] == ' ' || fileContent[i] == ',' || fileContent[i] == ':' || fileContent[i] == ';')
{
occurrence[' '] += 1;
//new_file_content += ' ';
}
else continue;
}
totalFrequency++; // increment total frequency
}
Try this:
string input = "test string here";
Dictionary<char, int> charDict = new Dictionary<char, int>();
foreach(char c in input.ToLower()) {
    if(c < 97 || c > 122) { // not a-z
        if(c == ' ' || c == ',' || c == ':' || c == ';') {
            // add 1 to the existing count, or start at 1 for the first occurrence
            charDict[' '] = charDict.ContainsKey(' ') ? charDict[' '] + 1 : 1;
        }
    } else {
        charDict[c] = charDict.ContainsKey(c) ? charDict[c] + 1 : 1;
    }
}
Because the loop runs over a large number of characters, you want to minimize the checks inside it and remove the catch, as pointed out in the comments. There should never be a reason to control program flow with a try/catch block. I assume you initialize the dictionary first so that every counted character starts at 0; otherwise you have to add the character to the dictionary when it is not there yet. In the loop you can test the character with something like char.IsLetter() or the other checks D. Stewart suggests. I would not call ToLower on the large string if you are going to iterate over every character anyway (that iterates the string twice). You can do that conversion inside the loop if needed.
Try something like the below code. You could also initialize all 256 possible characters in the dictionary and completely remove the if statement and then remove items you don't care about and add the 4 space items to the space character dictionary after the counting is complete.
foreach (char c in fileContent)
{
if (char.IsLetter(c))
{
occurrence[c] += 1;
}
else
{
if (c == ' ' || c == ',' || c == ':' || c == ';')
{
occurrence[' '] += 1;
}
}
}
You could initialize the entire dictionary in advance like this also:
for (int i = 0; i < 256; i++)
{
occurrence.Add((char)i, 0);
}
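The two suggestions above (a pre-initialized dictionary and lowercasing inside the loop) can be combined. The following is only a sketch: the class and method names are made up for illustration, and it assumes the dictionary should hold exactly a-z plus a space, with uppercase letters folded into lowercase.

```csharp
using System.Collections.Generic;

public static class LetterCounter
{
    // Hypothetical helper, not from the question: counts a-z and "space" characters.
    public static Dictionary<char, int> CountLetters(string text)
    {
        var occurrence = new Dictionary<char, int>();
        // Pre-initialize every key we care about so the loop never has to add one
        for (char c = 'a'; c <= 'z'; c++)
            occurrence[c] = 0;
        occurrence[' '] = 0;

        foreach (char raw in text)
        {
            // Lowercase one character at a time instead of copying the whole string
            char c = char.ToLowerInvariant(raw);
            if (c == ',' || c == ':' || c == ';')
                c = ' '; // these count as a space
            if (occurrence.ContainsKey(c))
                occurrence[c]++; // every other character is ignored
        }
        return occurrence;
    }
}
```

No exception is thrown for characters that are not counted; they simply fail the ContainsKey check.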
There are several issues with that code snippet (i <= max, accessing a dictionary entry without initializing it, etc.), but the performance bottleneck is the reliance on exceptions: throwing and catching exceptions is extremely slow, especially inside an inner loop.
I would start with putting the counts into a separate array.
Then I would either prepare a char to count index map and use it inside the loop w/o any ifs:
var indexMap = new Dictionary<char, int>();
int charCount = 0;
// Map the valid characters to be counted
for (var ch = 'a'; ch <= 'z'; ch++)
indexMap.Add(ch, charCount++);
// Map the "space" characters to be counted
foreach (var ch in new[] { ' ', ',', ':', ';' })
indexMap.Add(ch, charCount);
charCount++;
// Allocate count array
var occurences = new int[charCount];
// Process the string
foreach (var ch in fileContent)
{
int index;
if (indexMap.TryGetValue(ch, out index))
occurences[index]++;
}
// Not sure about this, but including it for consistency
totalFrequency = occurences.Sum();
or not use dictionary at all:
// Allocate array for char counts
var occurences = new int['z' - 'a' + 1];
// Separate count for "space" chars
int spaceOccurences = 0;
// Process the string
foreach (var ch in fileContent)
{
if ('a' <= ch && ch <= 'z')
occurences[ch - 'a']++;
else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
spaceOccurences++;
}
// Not sure about this, but including it for consistency
totalFrequency = spaceOccurences + occurences.Sum();
The former is more flexible (you can add more mappings); the latter is a bit faster. Both are fast enough (they complete in milliseconds for a 1M-character string).
OK, it's a little late, but this should be the fastest solution:
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApplication99
{
class Program
{
static void Main(string[] args)
{
string fileContent = "....."; // a large string
// --- high perf section to count all chars ---
var charCounter = new int[char.MaxValue + 1];
for (int i = 0; i < fileContent.Length; i++)
{
charCounter[fileContent[i]]++;
}
// --- combine results with linq (all actions consume less than 1 ms) ---
var allResults = charCounter.Select((count, index) => new { count, charValue = (char)index }).Where(c => c.count > 0).ToArray();
var spaceChars = new HashSet<char>(" ,:;");
int countSpaces = allResults.Where(c => spaceChars.Contains(c.charValue)).Sum(c => c.count);
var usefulChars = new HashSet<char>("abcdefghijklmnopqrstuvwxyz");
int countLetters = allResults.Where(c => usefulChars.Contains(c.charValue)).Sum(c => c.count);
}
}
}
For very large text files, it's better to use a StreamReader...
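A rough sketch of what the StreamReader idea could look like, so that the whole file never has to sit in memory as one string. The class name, the buffer size and the layout of the result array are assumptions made for this example, not part of the answer above.

```csharp
using System.IO;

public static class StreamingCounter
{
    // Hypothetical helper: counts[0..25] hold a-z, counts[26] holds the "space" characters.
    public static int[] Count(TextReader reader)
    {
        var counts = new int[27];
        var buffer = new char[4096]; // arbitrary buffer size
        int read;
        // Read returns 0 once the end of the input is reached
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char ch = buffer[i];
                if ('a' <= ch && ch <= 'z')
                    counts[ch - 'a']++;
                else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
                    counts[26]++;
            }
        }
        return counts;
    }
}
```

Because the method takes a TextReader, it can be called with a StreamReader over a file (for example `new StreamReader(path)`, where `path` is whatever file you are processing) or with a StringReader for testing.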
Given a string with words separated by spaces, how would you merge two words if one of them consists of a single character? An example should clarify:
"a bcd tttt" => "abcd tttt"
"abc d hhhh" => "abcd hhhh"
I would like to merge the single-character word with the word on its left in all cases, except when it is the first word in the string; in that case I would like to merge it with the word on its right.
I am trying to loop through the string and build the logic myself, but it turned out to be more complex than I expected.
Try the below program's approach:
using System;
using System.Text;
public class Program
{
public static void Main()
{
var delimiter=new char[]{' '};
var stringToMerge="abc d hhhh";
var splitArray=stringToMerge.Split(delimiter);
var stringBuilder=new StringBuilder();
for(int wordIndex=0;wordIndex<splitArray.Length;wordIndex++)
{
var word=splitArray[wordIndex];
if(wordIndex!=0 && word.Length>1)
{
stringBuilder.Append(" ");
}
stringBuilder.Append(word);
}
Console.WriteLine(stringBuilder.ToString());
}
}
Basically, you split the string to words, then using StringBuilder, build a new string, inserting a space before a word only if the word is larger than one character.
One way to approach this is to first use string.Split(' ') to get an array of words, which is easier to deal with.
Then you can loop through the words, handling single-character words by concatenating them with the previous word, with special handling for the first word.
One such approach:
public static void Main()
{
string data = "abcd hhhh";
var words = data.Split(' ');
var sb = new StringBuilder();
for (int i = 0; i < words.Length; ++i)
{
var word = words[i];
if (word.Length == 1)
{
sb.Append(word);
if (i == 0 && i < words.Length - 1) // Single character first word is special case: Merge with next word.
sb.Append(words[++i]); // Note the "++i" to increment the loop counter, skipping the next word.
}
else
{
sb.Append(' ' + word);
}
}
var result = sb.ToString();
Console.WriteLine(result);
}
Note that this will concatenate multiple instances of single-letter words, so that "a b c d e" will result in "abcde" and "ab c d e fg" will result in "abcde fg". You don't actually specify what should happen in this case.
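For comparison, here is a sketch of the same split-and-merge idea that collects the merged words in a list, which avoids the leading space. The names are invented for the example, and it assumes the same merging rules as above: consecutive single-character words are all merged, and leading single-character words attach to the first longer word on their right.

```csharp
using System;
using System.Collections.Generic;

public static class WordMerger
{
    // Hypothetical helper illustrating the approach; not the original answer's code.
    public static string MergeSingleCharWords(string input)
    {
        var words = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var merged = new List<string>();
        string prefix = ""; // leading single-character words waiting for a word on their right

        foreach (var word in words)
        {
            if (word.Length == 1 && merged.Count == 0)
                prefix += word;                   // nothing on the left yet: hold it
            else if (word.Length == 1)
                merged[merged.Count - 1] += word; // merge with the word on the left
            else
            {
                merged.Add(prefix + word);        // attach any held prefix
                prefix = "";
            }
        }
        if (prefix.Length > 0)
            merged.Add(prefix);                   // input consisted of single characters only
        return string.Join(" ", merged);
    }
}
```

With this sketch, "a bcd tttt" becomes "abcd tttt" and "abc d hhhh" becomes "abcd hhhh".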
If you want to do it with a plain for loop, walking the string:
using System;
using System.Text;
public class Program
{
public static void Main()
{
Console.WriteLine(MergeOrphant("bcd a tttt") == "bcda tttt");
Console.WriteLine(MergeOrphant("bcd a tttt a") == "bcda tttta");
Console.WriteLine(MergeOrphant("a bcd tttt") == "abcd tttt");
Console.WriteLine(MergeOrphant("a b") == "ab");
}
private static string MergeOrphant(string source)
{
var stringBuilder = new StringBuilder();
for (var i = 0; i < source.Length; i++)
{
if (i == 1 && char.IsWhiteSpace(source[i]) && char.IsLetter(source[i - 1])) {
i++;
}
if (i > 0 && char.IsWhiteSpace(source[i]) && char.IsLetter(source[i - 1]) && char.IsLetter(source[i + 1]) && (i + 2 == source.Length || char.IsWhiteSpace(source[i + 2])) )
{
i++;
}
stringBuilder.Append(source[i]);
}
return stringBuilder.ToString();
}
}
Quite short with Regex.
string foo = "a bcd b tttt";
foo = Regex.Replace(foo, @"^(\w) (\w{2,})", "$1$2");
foo = Regex.Replace(foo, @"(\w{2,}) (\w)\b", "$1$2");
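Be aware that \w matches [a-zA-Z0-9_] (and, in .NET, other Unicode letters and digits as well, unless RegexOptions.ECMAScript is used). If you need a different definition, you have to define your own character class.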
My answer may not be best practice, but it works for your second case. You should still be clear about the letter-merging rules.
public static void Main()
{
Console.WriteLine(Edit("abc d hhhh") == "abcd hhhh");
Console.WriteLine(Edit("abc d hhhh a") == "abcd hhhha");
Console.WriteLine(Edit("abc d hhhh a b") == "abcd hhhhab");
Console.WriteLine(Edit("abc d hhhh a def g") == "abcd hhhha defg");
}
public static string Edit(string str)
{
var result = string.Empty;
var split = str.Split(' ', StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < split.Length; i++)
{
if(i == 0)
result += split[i];
else
{
if (i > 0 && split[i].Length == 1)
{
result += split[i];
}
else
{
result += $" {split[i]}";
}
}
}
return result;
}
As mentioned above, this does not work for your first case: Edit("a bcd") would not produce "abcd".
Expanding on Matthew's answer: if you don't want the extra space in the output, you can change the last line to:
Console.WriteLine(result.TrimStart(' '));
Hi, I want to search for a character in a string array, but I need to search only between two indices, for example between index 2 and 10. How can I do that?
foreach (var item in currentline[2 to 10])
{
if (item == ',' || item == ';')
{
c++;
break;
}
else
{
data += item;
c++;
}
}
As you can see, foreach enumerates over a collection or any IEnumerable.
As the comments say, you can use a for loop instead, and pick out the elements you want.
Alternatively, since you want to search for a character in a string, you can use IndexOf, using the start index and count overload to find where a character is.
As c++ is not used for anything in your code, I will assume it is a leftover.
You can simply address your issue like this:
In currentline, take the chars in the index range you are interested in.
Stop at the first char you don't want.
Concatenate the resulting char array into a string.
Resulting Code:
var data = "##";//01234567891 -- index for the string below.
var currentline= "kj[abcabc;z]Selected data will be between: '[]';";
var exceptChar = ",;";
data += new string(
currentline.Skip(3)
.Take(8)
.TakeWhile(x=> !exceptChar.Contains(x))
.ToArray()
);
There is a string method called string.IndexOfAny() which will allow you to pass an array of characters to search for, a start index and a count. For your example, you would use it like so:
string currentLine = ",;abcde;,abc";
int index = currentLine.IndexOfAny(new[] {',', ';'}, 2, 10-2);
Console.WriteLine(index);
Note that the last parameter is the number of characters to search, starting at the specified index. With a start of 2 and a count of 10-2, the search covers indexes 2 through 9; use finish-start+1 if the character at index 10 itself should be included.
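To connect IndexOfAny back to the original goal of collecting the data up to the first delimiter, here is a small sketch. The class and method names are hypothetical, `end` is treated as inclusive, and the caller is assumed to pass indexes that lie inside the string (IndexOfAny and Substring throw ArgumentOutOfRangeException otherwise).

```csharp
public static class RangeSearch
{
    // Hypothetical helper: returns the characters from start up to (but not including)
    // the first ',' or ';' found in [start, end], or the whole range if there is none.
    public static string TakeUntilDelimiter(string line, int start, int end)
    {
        int count = end - start + 1; // +1 because end is inclusive here
        int hit = line.IndexOfAny(new[] { ',', ';' }, start, count);
        int stop = hit < 0 ? end + 1 : hit; // no delimiter: take the full range
        return line.Substring(start, stop - start);
    }
}
```

For example, `RangeSearch.TakeUntilDelimiter("kj[abcabc;z]", 3, 10)` stops at the ';' at index 9 and yields "abcabc".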
You can search for characters in strings and get their indexes with this LINQ solution:
string str = "How; are, you; Good ,bye";
char[] charArr = { ',', ';' };
int startIndex = 2;
int endIndex = 10;
var indexes = Enumerable.Range(startIndex, endIndex - startIndex + 1)
.Where(i=>charArr.Contains(str[i]))
.ToArray();
In this case we get Enumerable.Range(2, 9) which generates a sequence between 2 and 10 and the Where clause filters the indexes of the characters in str that are matching one of the characters inside charArr.
Thanks, everyone. I finally fixed it with your guidance. Thanks, all.
myarr = new mytable[50];
number_of_records = 0;
number_of_records = fulllines.Length;
for (int line = 1; line < fulllines.Length; line++)
{
int c = 0;
for (int i = 0; i < record_lenth; i++)
{
string data = "";
string currentline = fulllines[line];
string value = "";
for (int x = c; x < fulllines[line].Length; x++)
{
value += currentline[x];
}
foreach (var item in value)
{
if (item == ',' || item == ';')
{
c++;
break;
}
else
{
data += item;
c++;
}
}
}
}
I have a task and I have to check how many odd numbers there are. For example:
cw(string[54]); //37 42 44 61 62
From this I need to get how many odd numbers are in this string. The only way I figured out was to cut the string into 5 ints so int 1 is 37, 2 is 42 and so on. But that is a really long and slow process even with methods.
Any help, or shall I stick with the "cutting" which looks something like this:
for (int y = 0; y < all_number.Length; y++)
{
for (int x = 0; x < 5; x++)
{
cutter = all_number[y];
placeholder = cutter.IndexOf(" ");
final[x] = Convert.ToInt32(cutter.Remove(placeholder));
}
}
This one is for the first numbers, so at 37 42 44 61 62 final would be 37.
I'd start by using an outer foreach loop rather than referring array values by index, unless the index is important (which it doesn't look like it is here).
I'd then use string.Split to split each string by spaces, and then LINQ to sum the odd numbers.
For example:
foreach (string line in lines)
{
var oddSum = line.Split(' ')
.Select(int.Parse) // Parse each chunk
.Where(number => (number & 1) == 1) // Filter out even values
.Sum(); // Sum all the odd values
// Do whatever you want with the sum of the odd values for this line
}
If you actually only want to count the odd numbers, you can use the overload of Count that accepts a predicate:
foreach (string line in lines)
{
var oddCount = line.Split(' ')
.Select(int.Parse) // Parse each chunk
.Count(number => (number & 1) == 1); // Count the odd values
// Do whatever you want with the count of the odd values for this line
}
Note that this will throw an exception (in int.Parse) at the first non-integer value encountered. That may well be fine, but you can use int.TryParse to avoid the exception. That's harder to use with LINQ though; please specify how you want them handled if you need this functionality.
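If silently skipping invalid chunks is acceptable, one way to combine int.TryParse with LINQ is to call it inside the predicate. This is only a sketch under that assumption; the class and method names are made up, and it needs C# 7 or later for the inline `out` variable.

```csharp
using System;
using System.Linq;

public static class OddCounter
{
    // Hypothetical helper: counts the chunks that parse as an int and are odd.
    // Chunks that do not parse are ignored rather than causing an exception.
    public static int CountOdd(string line)
    {
        return line
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Count(chunk => int.TryParse(chunk, out int number) && (number & 1) == 1);
    }
}
```

For the example line "37 42 44 61 62" this counts 37 and 61. If invalid values should be logged or rejected instead of skipped, a plain foreach loop is probably clearer.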
First of all, use the built in tools you have available.
To split a string by a predefined character, use Split:
var numbers = allNumbersString.Split(' ');
Now you have an array of strings, each holding a string representation of what we hope is a number.
Now we need to extract numbers out of each string. The safest way to do this is using int.TryParse:
foreach (var n in numbers)
{
if (int.TryParse(n, out var number))
{
//ok we got a number
}
else
{
//we don't. Do whatever is appropriate:
//ignore invalid number, log parse failure, throw, etc.
}
}
And now, simply return those that are odd: number % 2 != 0;
Putting it all together:
public static IEnumerable<int> ExtractOddNumbers(
    string s,
    char separator)
{
    if (s == null)
        throw new ArgumentNullException(nameof(s));
    foreach (var n in s.Split(separator))
    {
        // Skip anything that is not a valid integer
        if (int.TryParse(n, out var number))
        {
            if (number % 2 != 0)
                yield return number;
        }
    }
}
So, if you want to know how many odd numbers there are in a given string, you would do:
var countOfOddNumbers = ExtractOddNumbers(s, ' ').Count();
The good thing about this approach is that it's now easily extensible. A small modification to our current method makes it a whole lot more powerful:
public static IEnumerable<int> ExtractNumbers(
    string s,
    char separator,
    Func<int, bool> predicate)
{
    if (s == null)
        throw new ArgumentNullException(nameof(s));
    foreach (var n in s.Split(separator))
    {
        // Skip anything that is not a valid integer
        if (int.TryParse(n, out var number))
        {
            if (predicate(number))
                yield return number;
        }
    }
}
See what we've done? We've made the filtering criteria one more argument of the method call; now you can extract numbers based on any condition. Odd numbers? ExtractNumbers(s, ' ', n => n % 2 != 0). Multiples of 7? ExtractNumbers(s, ' ', n => n % 7 == 0). Greater than 100? ExtractNumbers(s, ' ', n => n > 100), etc.
As mentioned by others, the Split method is what you're after.
if you want the count of odd numbers then you can accomplish the task like so:
var oddCount = lines.SelectMany(line => line.Split(' ')) // flatten
.Select(int.Parse) // parse the strings
.Count(n => n % 2 != 0); // count the odd numbers
or if you want the summation you can do:
var oddSum = lines.SelectMany(line => line.Split(' '))// flatten
.Select(int.Parse) // parse the strings
.Where(n => n % 2 != 0)// retain the odd numbers
.Sum();// sum them
This assumes there will be no invalid characters in the string, otherwise, you'll need to perform checks with the Where clause prior to proceeding.
An alternative would be to loop over the string's characters and, if the current character is a space or the end of the string and the previous character is '1', '3', '5', '7' or '9' (odd numbers end with an odd figure), increase the count.
This allows the string to contain numbers that are much bigger than ints, does not allocate new memory (as with String.Split) and doesn't require the parsing of ints. It does assume a valid string with valid numbers:
var count = 0;
for(var i = 1; i < cw.Length; i++)
{
int numberIndex = -1;
if(i == cw.Length - 1) numberIndex = i;
if(cw[i] == ' ') numberIndex = i - 1;
if(numberIndex != -1)
{
if(cw[numberIndex] == '1' || cw[numberIndex] == '3' ||
cw[numberIndex] == '5' || cw[numberIndex] == '7' ||
cw[numberIndex] == '9')
{
count++;
}
}
}
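The loop above starts at index 1, so a string consisting of a single digit is never examined. A possible variant of the same last-digit idea that also covers that case is sketched below. The names are hypothetical, and like the original it assumes the input contains only digits and single spaces.

```csharp
public static class LastDigitCounter
{
    // Hypothetical helper: counts numbers whose last digit is odd, without Split or int.Parse.
    public static int CountOdd(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A number ends where a non-space character is followed by a space or the end
            bool endOfNumber = s[i] != ' ' && (i == s.Length - 1 || s[i + 1] == ' ');
            // For digit characters, (s[i] - '0') is the digit's numeric value
            if (endOfNumber && (s[i] - '0') % 2 == 1)
                count++;
        }
        return count;
    }
}
```

Since only the last digit of each number is inspected, the numbers can be longer than any built-in integer type allows.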
How do I convert numbers to their equivalent alphabet characters, and alphabet characters to their numeric values, in a string (except 0; 0 should stay 0 for obvious reasons)?
So basically if there is a string
string content="D93AK0F5I";
How can I convert it to ?
string new_content="4IC11106E9";
I'm assuming you're aware this is not reversible, and that you're only using upper case and digits. Here you go...
private string Transpose(string input)
{
StringBuilder result = new StringBuilder();
foreach (var character in input)
{
if (character == '0')
{
result.Append(character);
}
else if (character >= '1' && character <= '9')
{
int offset = character - '1';
char replacement = (char)('A' + offset);
result.Append(replacement);
}
else if (character >= 'A' && character <= 'Z') // I'm assuming upper case only; feel free to duplicate for lower case
{
int offset = character - 'A' + 1;
result.Append(offset);
}
else
{
throw new ApplicationException($"Unexpected character: {character}");
}
}
return result.ToString();
}
Well, if you are only going to need a one way translation, here is quite a simple way to do it, using linq:
string convert(string input)
{
var chars = "0abcdefghijklmnopqrstuvwxyz";
return string.Join("",
input.Select(
c => char.IsDigit(c) ?
chars[int.Parse(c.ToString())].ToString() :
(chars.IndexOf(char.ToLowerInvariant(c))).ToString())
);
}
You can see a live demo on rextester.
You can use an ArrayList of alphabet letters. For example:
ArrayList alphabets = new ArrayList();
alphabets.Add("A");
alphabets.Add("B");
and so on.
Now parse your string character by character:
string s = "1BC34D";
char[] characters = s.ToCharArray();
for (int i = 0; i < characters.Length; i++)
{
    if (characters[i] >= '1' && characters[i] <= '9')
    {
        // '1' maps to index 0 ("A"), '2' to index 1 ("B"), and so on
        var index = characters[i] - '1';
        var stringAlphabet = alphabets[index];
    }
    else if (Char.IsLetter(characters[i]))
    {
        // "A" is stored at index 0, so add 1 to get its numeric value
        var digitCharacter = alphabets.IndexOf(characters[i].ToString()) + 1;
    }
}
This way you can get the letter for each digit and the numeric value for each letter.
I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here are the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C#) would be, basically:
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.
A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
    # states
    parsing = 0
    escaped = 1
    state = parsing
    found = []
    parsed = ""
    for c in input:
        if state == parsing:
            if c == delim:
                found.append(parsed)
                parsed = ""
            elif c == escape:
                state = escaped
            else:
                parsed += c
        else: # state == escaped
            parsed += c
            state = parsing
    if parsed:
        found.append(parsed)
    return found
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}
private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}
The implementation of this kind of tokenizer in terms of an FSM is fairly straightforward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input)        action
==========================================
BEGIN(*):           token.clear(); state=START;
END(*):             return;
*(\n\0):            token.emit(); state=END;
START(DELIMITER):   ; // NB: the input is *not* added to the token!
START(ESCAPE):      state=ESC; // NB: the input is *not* added to the token!
START(*):           token.append(input); state=NORM;
NORM(DELIMITER):    token.emit(); token.clear(); state=START;
NORM(ESCAPE):       state=ESC; // NB: the input is *not* added to the token!
NORM(*):            token.append(input);
ESC(*):             token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can easily be extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.append(TAB)).
Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.
Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
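To illustrate the filtering step, the sketch below repeats the same state machine in compact form so that it compiles on its own, and adds a wrapper that drops empty elements. The class name and the SplitNonEmpty wrapper are hypothetical additions for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class EscapedSplitDemo
{
    // Same logic as SplitAndUnescape above, repeated here only to keep the sketch self-contained.
    public static IEnumerable<string> SplitAndUnescape(string encoded, char separator, char escape)
    {
        var inEscapeSequence = false;
        var token = new StringBuilder();
        foreach (var c in encoded)
        {
            if (inEscapeSequence) { token.Append(c); inEscapeSequence = false; }
            else if (c == escape) { inEscapeSequence = true; }
            else if (c == separator) { yield return token.ToString(); token.Clear(); }
            else { token.Append(c); }
        }
        yield return token.ToString();
    }

    // Hypothetical wrapper: removes empty elements after parsing, as suggested in the note above.
    public static string[] SplitNonEmpty(string encoded, char separator, char escape)
    {
        return SplitAndUnescape(encoded, separator, escape)
            .Where(item => item.Any())
            .ToArray();
    }
}
```

With '/' as the escape and ',' as the delimiter, the tricky input from the question, `////////,`, comes back as the single element `////`, and `a/,b,c` comes back as `a,b` and `c`.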
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.