I need to convert a byte array of a text file to its string character representation.
For example, if I have a text file that has:
hello (tab) there (newline) friend
I would like to convert that to an array:
my_array = {'h', 'e' ,'l','l','o', '\t', 't', 'h','e','r','e', '\r','\n', 'f', 'r' ,'i','e','n', 'd'};
I'm having trouble with converting the control characters to their escaped strings, i.e.:
0x09 = '\t';
0x0D = '\r';
0x0A = '\n';
I have tried this, but the tabs and new lines aren't represented here:
byte[] text_bytes = File.ReadAllBytes("ok.txt");
char[] y = Encoding.ASCII.GetChars(text_bytes);
I know I can just loop through each byte and have a condition to look for 0x09 and if I find it, then replace with "\t", but I'm wondering if there is something built in.
There are several ways you could do it. The simplest would be to load the entire file into memory:
string theText = File.ReadAllText(filename);
Then use string.Replace to replace the items you're interested in:
// "escaping" the '\t' with '\\t' makes it write the literal characters '\' and 't'
theText = theText.Replace("\t", "\\t");
theText = theText.Replace("\r", "\\r");
theText = theText.Replace("\n", "\\n");
Then you can create your array. If you're sure that it's all ASCII text, you can get a byte array with Encoding.ASCII:
byte[] theChars = Encoding.ASCII.GetBytes(theText);
Or, if you want a character array:
char[] theChars = theText.ToCharArray();
That's probably going to be fast enough for your purposes. You might be able to speed it up by making a single pass through the string, reading character by character and copying to a StringBuilder:
StringBuilder sb = new StringBuilder(theText.Length);
foreach (char c in theText)
{
switch (c)
{
case '\t' : sb.Append("\\t"); break;
case '\r' : sb.Append("\\r"); break;
case '\n' : sb.Append("\\n"); break;
default : sb.Append(c); break;
}
}
byte[] theChars = Encoding.ASCII.GetBytes(sb.ToString());
If you want to escape all control characters then you can use Regex.Escape.
string myText = File.ReadAllText("ok.txt");
//to optimize, you could remove characters that you know won't be there (e.g. \a)
Regex rx = new Regex(@"[\a\e\f\n\r\t\v]", RegexOptions.Compiled);
myText = rx.Replace(myText, m => { return Regex.Escape(m.Value); });
Console.WriteLine(myText);
You can't convert it to a char array in the way you've posted because an escaped control character would count as two characters (\ and t). But if you don't mind each character being separate, you can simply do
char[] myCharArray = myText.ToCharArray();
In the "y" array, the "escaped characters" will have their actual values (0x09, 0x0D, etc.) with an unprintable character as the "text".
When you write \t, \n, \r, etc. you could have written (char)0x09, (char)0x0D and this is what the data gets written as. In other words the "\t" character doesn't exist!
Whether you roll your own, or use an existing library, someone is going to have to map 0x09 to the "\t" escape sequence and inject it into your string.
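To make that point concrete, here is a minimal sketch (the class and method names are made up for illustration, they are not part of any library). It shows that the escape sequences in source code are just another spelling of the raw values, and that a real tab is one character while the text backslash-t is two:

```csharp
using System;

// Hypothetical demo class, for illustration only.
public static class ControlCharDemo
{
    // '\t', '\r' and '\n' in source code are the chars 0x09, 0x0D and 0x0A.
    public static bool LiteralsEqualRawValues()
    {
        return '\t' == (char)0x09
            && '\r' == (char)0x0D
            && '\n' == (char)0x0A;
    }

    // A string holding a real tab contains a single character.
    public static int LengthOfRealTab()
    {
        return "\t".Length;
    }

    // A string holding a backslash followed by 't' contains two characters.
    public static int LengthOfEscapedTab()
    {
        return "\\t".Length;
    }
}
```

So whatever approach you pick, the step from the one-character tab to the two-character text has to be an explicit mapping somewhere.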
If you don't mind it being somewhat slower than a hand-rolled solution, then you could use a CodeDomProvider (which would probably be fast enough).
I found sample code here: http://code.google.com/p/nbehave-cf/source/browse/trunk/CustomTool/StringExtensions.cs?spec=svn5&r=5
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
namespace CustomTool
{
public static class StringExtensions
{
public static String ToLiteral(this String input)
{
using (var writer = new StringWriter())
{
using (var provider = CodeDomProvider.CreateProvider("CSharp"))
{
provider.GenerateCodeFromExpression(new CodePrimitiveExpression(input), writer, null);
return writer.ToString();
}
}
}
}
}
You would use it by reading the string using Encoding.ASCII.GetString(), then calling .ToLiteral() to convert it to an escaped string, then .ToCharArray() to get the final result.
This gives the correct result with, for example:
// You would do (using your sample code):
// string test = Encoding.ASCII.GetString(text_bytes);
string test = "hello\tthere\nfriend";
char[] result = test.ToLiteral().ToCharArray();
If you inspect result you will see that it has the correct characters.
However, I'd just use a loop and a switch statement to convert the characters. It's easy to write and understand, and it'd be much more efficient.
I need to convert non alpha-numeric glyphs in a string to their unicode value, while preserving the alphanumeric characters. Is there a method to do this in C#?
As an example, I need to convert this string:
"hello world!"
To this:
"hello_x0020_world_x0021_"
To get a string that is safe to use as an XML node name, you should use XmlConvert.EncodeName.
Note that if you need to encode all non-alphanumeric characters you'd need to write it yourself as "_" is not encoded by that method.
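A small sketch of the XmlConvert.EncodeName route, assuming a reference to System.Xml is available (the wrapper class name is made up for illustration):

```csharp
using System.Xml;

// Hypothetical wrapper, for illustration only.
public static class XmlNameDemo
{
    public static string Encode(string name)
    {
        // Characters that are not valid in an XML name come back as _xHHHH_.
        return XmlConvert.EncodeName(name);
    }
}
```

XmlNameDemo.Encode("hello world!") should give the "hello_x0020_world_x0021_" from the question, while an input such as "a_b" comes back unchanged, which is the underscore caveat mentioned above.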
You could start with this code using LINQ Select extension method:
string str = "hello world!";
string a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
a += a.ToLower();
char[] alphabet = a.ToCharArray();
str = string.Join("",
str.Select(ch => alphabet.Contains(ch) ?
ch.ToString() : String.Format("_x{0:x4}_", (int)ch)).ToArray()
);
Now clearly it has some problems:
it does a linear search in the list of characters
it misses numeric characters
if we add numerics, we need to decide whether it is ok for the first character to be a digit (assuming yes)
the code creates a large number of strings that are immediately discarded (one per character)
alphanumeric is limited to ASCII (assuming that is ok; if not, Char.IsLetterOrDigit can help)
it does too much work for pure alphanumeric strings
The first two are easy - we can use a HashSet (O(1) Contains) initialized with the full list of characters (if any alphanumeric characters are ok, it is more readable to use the existing method Char.IsLetterOrDigit):
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
To avoid ch.ToString() that really pointlessly produces strings for immediate GC we need to figure out how to construct string from mix of char and string. String.Join does not work because it wants strings to start with, regular new string(...) does not have option for mix of char and string. So we are left with StringBuilder that happily takes both to Append. Consider starting with initial size str.Length if most strings don't have other characters.
So for each character we just need to either builder.Append(ch) or builder.AppendFormat("_x{0:x4}_", (int)ch). To perform the iteration it is easier to just use a regular foreach, but if one really wants LINQ, Enumerable.Aggregate is the way to go.
string ReplaceNonAlphaNum(string str)
{
var builder = new StringBuilder();
foreach (var ch in str)
{
if (asciiAlphaNum.Contains(ch))
builder.Append(ch);
else
builder.AppendFormat("_x{0:x4}_", (int)ch);
}
return builder.ToString();
}
string ReplaceNonAlphaNumLinq(string str)
{
return str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
To the last point - we don't really need to do anything if there is nothing to convert - so a check like the one in "Check alphanumeric characters in string in C#" would help to avoid extra strings.
Thus final version (LINQ as it is a bit shorter and fancier):
private static readonly Regex asciiAlphaNumRx = new Regex(@"^[a-zA-Z0-9]*$");
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
string ReplaceNonAlphaNumLinq(string str)
{
return asciiAlphaNumRx.IsMatch(str) ? str :
str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
Alternatively whole thing could be done with Regex - see Regex replace: Transform pattern with a custom function for starting point.
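As a rough sketch of that Regex-only route (the class and field names are invented for this example, and it assumes "alphanumeric" means ASCII letters and digits as above), Regex.Replace with a match evaluator could look like this:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RegexEncodeDemo
{
    // Matches any single UTF-16 code unit that is not an ASCII letter or digit.
    private static readonly Regex NonAlphaNum = new Regex("[^A-Za-z0-9]");

    public static string ReplaceNonAlphaNum(string str)
    {
        // The evaluator runs once per match and returns the replacement text.
        return NonAlphaNum.Replace(
            str,
            m => string.Format("_x{0:x4}_", (int)m.Value[0]));
    }
}
```

RegexEncodeDemo.ReplaceNonAlphaNum("hello world!") returns "hello_x0020_world_x0021_". If nothing matches, Replace has nothing to substitute, so the up-front IsMatch check is not needed in this variant.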
Have an assignment to allow a user to input a word in C# and then display that word with the first and third characters changed to uppercase. Code follows:
namespace Capitalizer
{
class Program
{
static void Main(string[] args)
{
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
string Upper = text.ToUpper();
Console.WriteLine(Upper);
Console.ReadKey();
}
}
}
This of course generates the entire word in uppercase, which is not what I want. I can't seem to make text.ToUpper(0,2) work, and even then that'd capitalize the first three letters. Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
The simplest way I can think of to address your exact question as described — to convert to upper case the first and third characters of the input — would be something like the following:
StringBuilder sb = new StringBuilder(text);
sb[0] = char.ToUpper(sb[0]);
sb[2] = char.ToUpper(sb[2]);
text = sb.ToString();
The StringBuilder class is essentially a mutable string object, so it is the most fluid way to approach these kinds of operations: it provides the most straightforward conversions to and from string, as well as the full range of string operations. Changing individual characters is easy in many data structures, but insertions, deletions, appending, formatting, etc. all also come with StringBuilder, so it's a good habit to use that versus other approaches.
But frankly, it's hard to see how that's a useful operation. I can't help but wonder if you have stated the requirements incorrectly and there's something more to this question than is seen here.
You could use LINQ:
var upperCaseIndices = new[] { 0, 2 };
var message = "hello";
var newMessage = new string(message.Select((c, i) =>
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c).ToArray());
Here is how it works. message.Select (inline LINQ query) selects characters from message one by one and passes into selector function:
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c
written as C# ?: shorthand syntax for if. It reads as "If index is present in the array, then select upper case character. Otherwise select character as is."
(c, i) => condition
is a lambda expression. See also:
Understand Lambda Expressions in 3 minutes
The rest is very simple - represent result as array of characters (.ToArray()), and create a new string based off that (new string(...)).
Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
That seems a lot more complicated than necessary. Once you have a character array, you can simply change the elements of that character array. In a separate function, it would look something like
string MakeFirstAndThirdCharacterUppercase(string word) {
var chars = word.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
chars[2] = char.ToUpper(chars[2]);
return new string(chars);
}
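One thing to watch with the character-array approach: indexing position 2 throws for a word shorter than three characters. A guarded variant might look like the sketch below (the class name, method name and the params signature are my own choices, not something from the assignment):

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class CapitalizeDemo
{
    // Upper-cases the characters at the given positions, silently
    // skipping any position that falls outside the word.
    public static string UpperAt(string word, params int[] indices)
    {
        if (string.IsNullOrEmpty(word))
            return word;

        char[] chars = word.ToCharArray();
        foreach (int i in indices)
        {
            if (i >= 0 && i < chars.Length)
                chars[i] = char.ToUpper(chars[i]);
        }
        return new string(chars);
    }
}
```

CapitalizeDemo.UpperAt("hello", 0, 2) gives "HeLlo", and CapitalizeDemo.UpperAt("hi", 0, 2) gives "Hi" instead of throwing.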
My simple solution:
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
foreach (string s in words)
{
char[] chars = s.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
if (chars.Length > 2)
{
chars[2] = char.ToUpper(chars[2]);
}
Console.Write(new string(chars));
Console.Write(' ');
}
Console.ReadKey();
I have a Regex to split out words, operators and brackets in simple logic statements (e.g. "WORD1 & WORD2 | (WORd_3 & !word_4 )"). The Regex I've come up with is "(?[A-Za-z0-9_]+)|(?[&!\|()]{1})". Here is a quick test program.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("* Test Project *");
string testExpression = "!(LIONV6 | NOT_superCHARGED) &RHD";
string removedSpaces = testExpression.Replace(" ", "");
string[] expectedResults = new string[] { "!", "(", "LIONV6", "|", "NOT_superCHARGED", ")", "&", "RHD" };
string[] splits = Regex.Split(removedSpaces, @"(?[A-Za-z0-9_]+)|(?[&!\|()]{1})");
Console.WriteLine("Expected\n{0}\nActual\n{1}", expectedResults.AllElements(), splits.AllElements());
Console.WriteLine("*** Any Key to finish ***");
Console.ReadKey();
}
}
public static class Extensions
{
public static string AllElements(this string[] str)
{
string output = "";
if (str != null)
{
foreach (string item in str)
{
output += "'" + item + "',";
}
}
return output;
}
}
}
The Regex does the required job of splitting out words and operators into an array in the right sequence, but the result array contains many empty elements, and I can't work out why. It's not a serious problem as I just ignore empty elements when consuming the array, but I'd like Regex to do all the work if possible, including ignoring spaces.
Try this:
string[] splits = Regex.Split(removedSpaces, @"(?[A-Za-z0-9_]+)|(?[&!\|()]{1})").Where(x => x != String.Empty).ToArray();
The empty strings are just because of the way Split works. From the help page:
If multiple matches are adjacent to one another, an empty string is inserted into the array.
What Split does by default is treat your matches as delimiters. So in effect what would normally be returned is a lot of empty strings between the adjacent matches (as a comparison, imagine what you would expect if you split ",,,," on ","; you'd probably expect all the gaps).
Also from that help page though is:
If capturing parentheses are used in a Regex.Split expression, any
captured text is included in the resulting string array.
This is the reason you are getting what you actually want in there at all. So effectively it is now showing you the text that has been split (all the empty strings) with the delimiters too.
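The two quoted behaviours are easy to see in isolation. This is only a sketch with a made-up class name, using a comma as the delimiter rather than the pattern from the question:

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class SplitDemo
{
    // Adjacent delimiters: only the (empty) gaps between them are returned.
    public static string[] WithoutCapture()
    {
        return Regex.Split(",,", ",");
    }

    // Capturing parentheses: the delimiters are returned as well,
    // interleaved with the same empty gaps.
    public static string[] WithCapture()
    {
        return Regex.Split(",,", "(,)");
    }
}
```

WithoutCapture() yields three empty strings, and WithCapture() yields five elements: empty, ",", empty, ",", empty. The array in the question has the same shape, with the words and operators in the delimiter slots.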
What you are doing may well be better done by just matching the regular expression (with Regex.Matches), since what is in your regular expression is actually what you want to match.
Something like this (using some linq to convert to a string array):
Regex.Matches(testExpression, @"([A-Za-z0-9_]+)|([&!\|()]{1})")
.Cast<Match>()
.Select(x=>x.Value)
.ToArray();
Note that because this is taking positive matches it doesn't need the spaces to be removed first.
var matches = Regex.Matches(removedSpaces, @"(\w+|[&!|()])");
foreach (var match in matches)
Console.Write("'{0}', ", match); // '!', '(', 'LIONV6', '|', 'NOT_superCHARGED', ')', '&', 'RHD',
Actually, you don't need to delete spaces before extracting your identifiers and operators, the regex I proposed will ignore them anyway.
I'm using a StreamReader to open a text file and grab its contents. I need to grab just the text from the file without any escape characters ( \n, \r, \", etc ). Google is failing me right now. Any ideas?
There are no escape characters in a text that you read from a file. Escape characters are used when you write a string literal, for example in program code. I assume that you mean that you want to replace any white space characters with plain spaces.
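If it helps, here is a small sketch of that distinction (hypothetical class and method names; the temp-file round trip is just a convenient way to show it). The backslash only exists in source code; once the string is written to a file and read back, there is a line feed character and nothing else:

```csharp
using System;
using System.IO;

// Hypothetical demo class, for illustration only.
public static class EscapeLiteralDemo
{
    // "a\nb" is three characters: 'a', a line feed, 'b'.
    public static int RegularLiteralLength()
    {
        return "a\nb".Length;
    }

    // @"a\nb" is four characters: 'a', a backslash, 'n', 'b'.
    public static int VerbatimLiteralLength()
    {
        return @"a\nb".Length;
    }

    // Writes "a\nb" to a temporary file, reads it back and reports
    // whether the text read from disk contains a backslash.
    public static bool RoundTripHasBackslash()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "a\nb");
            return File.ReadAllText(path).Contains("\\");
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

RoundTripHasBackslash() returns false: the debugger may display the text with \n in it, but that is only how it renders the line feed.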
You can use a regular expression to match white space characters and replace them with spaces. It's easier to use the File.ReadAllText to read the text from the file:
string text = Regex.Replace(File.ReadAllText(fileName), @"[\r\n\t ]+", " ");
Why don't you just call ReadToEnd and then Split the string?
// using statement and whatever code here
var rawContent = sr.ReadToEnd();
var usefulContent = rawContent.Split(new []{ "\r\n", "\\" },
StringSplitOptions.RemoveEmptyEntries);
Note: you'll want to tweak the separators in the Split method; this is just an example.
You could also simply Replace the unwanted characters:
// using statement and whatever code here
var rawContent = sr.ReadToEnd();
var usefulContent = rawContent
.Replace("\r\n", "" )
.Replace("\\", "");
If you're trying to do it as you stream, call StreamReader.Read() in a while loop and test the characters one by one.
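A sketch of that streaming loop, assuming "undesirable" means control characters such as \r, \n and \t (swap in whatever test you need). It takes a TextReader so that a StreamReader or a StringReader can be passed in; the class and method names are made up:

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StreamFilterDemo
{
    public static string ReadWithoutControlChars(TextReader reader)
    {
        var sb = new StringBuilder();
        int next;

        // Read() returns -1 once the end of the input is reached.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (!char.IsControl(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

For example, StreamFilterDemo.ReadWithoutControlChars(new StringReader("one\r\ntwo\tthree")) returns "onetwothree".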
If you're able to grab the entire file contents into a string, use a regular expression to strip the undesirable characters. Check out RegexHero: http://regexhero.net/tester/
Assume you have read the entire file in a string s
for (int i = 0; i < s.Length; i++)
{
if (char.IsLetterOrDigit(s, i)) // or if (!char.IsWhiteSpace(s, i))
{
// append to StringBuilder
}
}
If IsLetterOrDigit or IsWhiteSpace don't fit your needs you can create your own method and call it.
You can use a general-purpose function that skips all the characters you don't need:
public string SkipChars(string InputString, char[] CharsToSkip)
{
string result = InputString;
foreach (var chr in CharsToSkip)
{
result = result.Replace(chr.ToString(), "");
}
return result;
}
usage:
string test = "one\ntwo\tthree";
MessageBox.Show(SkipChars(test, new char[] { '\n', '\t' }));
I am formatting numbers to strings using the format string "# #.##". At some point I need to turn these number strings (like 1 234 567) back into something like 1234567. I am trying to strip out the empty chars, but found that
value = value.Replace(" ", "");
does nothing for some reason, and the string remains 1 234 567. After looking at the string I found that
value[1] is 160.
I was wondering what the value 160 means?
The answer is to look in Unicode Code Charts - where you'll find the Latin-1 supplement chart; this shows that U+00A0 (160 as per your title, not 167 as per the body) is a non-breaking space.
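A quick sketch to confirm this from code (the class and member names are invented for the example): the character with code 160 is '\u00A0', .NET classifies it as white space, and replacing an ordinary space never touches it.

```csharp
using System;

// Hypothetical demo class, for illustration only.
public static class NbspDemo
{
    public const char Nbsp = '\u00A0';

    // U+00A0 is the same value as character code 160.
    public static bool IsCode160()
    {
        return Nbsp == (char)160;
    }

    // char.IsWhiteSpace treats the non-breaking space as white space.
    public static bool CountsAsWhiteSpace()
    {
        return char.IsWhiteSpace(Nbsp);
    }

    // Replacing U+0020 leaves non-breaking spaces in place.
    public static string RemovePlainSpaces(string value)
    {
        return value.Replace(" ", "");
    }

    // Replacing U+00A0 is what actually strips them.
    public static string RemoveNbsp(string value)
    {
        return value.Replace("\u00A0", "");
    }
}
```

With the input "1\u00A0234\u00A0567", RemovePlainSpaces returns the string unchanged and RemoveNbsp returns "1234567".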
Char code 160 would be &nbsp; (the HTML non-breaking space entity).
Maybe you could use a regex to replace those empty chars:
Regex.Replace(input, @"\p{Z}", "");
This will remove "any kind of whitespace or invisible separator".
value.Replace(Convert.ToChar(160).ToString(),"")
This is a fast (and fairly readable) way of removing any characters classified as white space using Char.IsWhiteSpace:
StringBuilder sb = new StringBuilder (value.Length);
foreach (char c in value)
{
if (!char.IsWhiteSpace (c))
sb.Append (c);
}
value = sb.ToString();
As dbemerlin points out, if you know you will only need numbers from your data, you would be better off using Char.IsNumber or the even more restrictive Char.IsDigit:
StringBuilder sb = new StringBuilder (value.Length);
foreach (char c in value)
{
if (char.IsNumber(c))
sb.Append (c);
}
value = sb.ToString();
If you need numbers and decimal separators, something like this should suffice:
StringBuilder sb = new StringBuilder (value.Length);
foreach (char c in value)
{
if (char.IsNumber(c) || c.ToString() == System.Globalization.NumberFormatInfo.CurrentInfo.NumberDecimalSeparator)
sb.Append (c);
}
value = sb.ToString();
I would suggest using the char overload version:
value = value.Replace(Convert.ToChar(160), ' ');
Solution with extension methods:
public static class ExtendedMethods
{
public static string NbspToSpaces(this string text)
{
return text.Replace(Convert.ToChar(160), ' ');
}
}
And it can be used with this code:
value = value.NbspToSpaces();
Wouldn't the preferred method be to replace all empty characters (which is what the questioner wanted to do) with the Regex method that Rubens already posted?
Regex.Replace(input, @"\p{Z}", "");
or what Expresso suggests:
Regex.Replace(input, @"\p{Zs}", "");
The difference here is that \p{Z} replaces any kind of whitespace or invisible separator whereas the \p{Zs} replaces a whitespace character that is invisible, but does take up space.
You can read it here (Section Unicode Categories):
http://www.regular-expressions.info/unicode.html
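To see the difference between the two categories, here is a small sketch (the class and method names are made up). The sample input mixes a normal space, a non-breaking space (U+00A0), a line separator (U+2028) and a tab:

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class SeparatorDemo
{
    // \p{Z}: all separator categories (Zs, Zl and Zp).
    public static string StripZ(string input)
    {
        return Regex.Replace(input, @"\p{Z}", "");
    }

    // \p{Zs}: space separators only.
    public static string StripZs(string input)
    {
        return Regex.Replace(input, @"\p{Zs}", "");
    }
}
```

For "a b\u00A0c\u2028d\te", StripZ returns "abcd\te" and StripZs returns "abc\u2028d\te". Note that the tab survives both, because it is a control character rather than a separator; if tabs and line breaks should go as well, \s or Char.IsWhiteSpace is the broader test.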
Using Regex has the advantage that a single command also replaces normal whitespace, not only the non-breaking space as in some of the answers above.
If performance is the way to go then of course other methods should be considered but this is out of scope here.