I have a binary file (i.e., it contains bytes with values between 0x00 and 0xFF). There are also ASCII strings in the file (e.g., "Hello World") that I want to find and edit using Regex. I then need to write out the edited file so that it's exactly the same as the old one but with my ASCII edits having been performed. How?
byte[] inbytes = File.ReadAllBytes(wfile);
string instring = Encoding.UTF8.GetString(inbytes);
// use Regex to find/replace some text within instring
byte[] outbytes = Encoding.UTF8.GetBytes(instring);
File.WriteAllBytes(outfile, outbytes);
Even if I don't do any edits, the output file is different from the input file. What's going on, and how can I do what I want?
EDIT: Ok, I'm trying to use the offered suggestion and am having trouble understanding how to actually implement it. Here's my sample code:
string infile = @"C:\temp\in.dat";
string outfile = @"C:\temp\out.dat";
Regex re = new Regex(@"H[a-z]+ W[a-z]+"); // looking for "Hello World"
byte[] inbytes = File.ReadAllBytes(infile);
string instring = new SoapHexBinary(inbytes).ToString();
Match match = re.Match(instring);
if (match.Success)
{
// do work on 'instring'
}
File.WriteAllBytes(outfile, SoapHexBinary.Parse(instring).Value);
Obviously, I know I'll not get a match doing it that way, but if I convert my Regex to a string (or whatever), then I can't use Match, etc. Any ideas? Thanks!
Not all binary strings are valid UTF-8 strings. When you try to interpret the binary as a UTF-8 string, the bytes that can't be thus interpreted are probably getting mangled. Basically, if the whole file is not encoded text, then interpreting it as encoded text will not yield sensible results.
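To see the mangling for yourself, here is a minimal sketch that decodes some bytes and encodes them again. The class and method names (RoundTripDemo, RoundTrip, SurvivesRoundTrip) are made up for illustration; only the Encoding calls are real .NET APIs.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decodes the bytes with the given encoding, then encodes the string again.
    public static byte[] RoundTrip(byte[] input, Encoding encoding)
    {
        return encoding.GetBytes(encoding.GetString(input));
    }

    // True only when decoding and re-encoding reproduces the input exactly.
    public static bool SurvivesRoundTrip(byte[] input, Encoding encoding)
    {
        return RoundTrip(input, encoding).SequenceEqual(input);
    }
}
```

With a sample such as { 0x48, 0x69, 0xFF, 0x00 }, the lone 0xFF is not valid UTF-8, so Encoding.UTF8 replaces it with U+FFFD and writes back three different bytes. A single-byte encoding such as Encoding.GetEncoding("ISO-8859-1") maps every byte value to exactly one character, so the same sample comes back unchanged.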
An alternative to working on the raw bytes is to convert the file to a hex string, work on that string (Regex can be used here), and then convert it back and save it:
byte[] buf = File.ReadAllBytes(file);
var str = new SoapHexBinary(buf).ToString();
//str=89504E470D0A1A0A0000000D49484452000000C8000000C808030000009A865EAC00000300504C544......
//Do your work
File.WriteAllBytes(file,SoapHexBinary.Parse(str).Value);
PS: SoapHexBinary lives in the namespace System.Runtime.Remoting.Metadata.W3cXsd2001
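As far as I know, SoapHexBinary only exists in the .NET Framework, so here is a rough sketch of the same hex round trip using BitConverter and Convert instead. HexSearch and its methods are hypothetical names. It also shows the catch with searching hex text: each byte is two characters, so a hit at an odd position straddles two bytes and has to be skipped.

```csharp
using System;
using System.Text;

public static class HexSearch
{
    // "Hi" -> "4869" (uppercase hex, two characters per byte).
    public static string ToHex(byte[] data)
    {
        return BitConverter.ToString(data).Replace("-", "");
    }

    // Inverse of ToHex; assumes an even number of valid hex digits.
    public static byte[] FromHex(string hex)
    {
        var result = new byte[hex.Length / 2];
        for (int i = 0; i < result.Length; i++)
            result[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return result;
    }

    // Returns the byte offset of the first occurrence of an ASCII literal, or -1.
    public static int IndexOfAscii(byte[] data, string asciiText)
    {
        string hex = ToHex(data);
        string needle = ToHex(Encoding.ASCII.GetBytes(asciiText));
        int pos = hex.IndexOf(needle, StringComparison.Ordinal);
        // An odd position starts in the middle of a byte, so it is not a real match.
        while (pos >= 0 && pos % 2 != 0)
            pos = hex.IndexOf(needle, pos + 1, StringComparison.Ordinal);
        return pos < 0 ? -1 : pos / 2;
    }
}
```

A regular expression written against the hex text needs the same care: the pattern has to be expressed in hex digits (for example 48 for 'H'), and matches that begin at an odd index should be rejected.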
I got it! Check out the code:
string infile = @"C:\temp\in.dat";
string outfile = @"C:\temp\out.dat";
Regex re = new Regex(@"H[a-z]+ W[a-z]+"); // looking for "Hello World"
string repl = @"Hi there";
Encoding ascii = Encoding.ASCII;
byte[] inbytes = File.ReadAllBytes(infile);
string instr = ascii.GetString(inbytes);
Match match = re.Match(instr);
int beg = 0;
bool replaced = false;
List<byte> newbytes = new List<byte>();
while (match.Success)
{
replaced = true;
for (int i = beg; i < match.Index; i++)
newbytes.Add(inbytes[i]);
foreach (char c in repl)
newbytes.Add(Convert.ToByte(c));
Match nmatch = match.NextMatch();
int end = (nmatch.Success) ? nmatch.Index : inbytes.Length;
for (int i = match.Index + match.Length; i < end; i++)
newbytes.Add(inbytes[i]);
beg = end;
match = nmatch;
}
if (replaced)
{
var newarr = newbytes.ToArray();
File.WriteAllBytes(outfile, newarr);
}
else
{
File.WriteAllBytes(outfile, inbytes);
}
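The loop above works because ASCII decoding yields exactly one character per byte, so match indexes line up with byte offsets. A shorter variant, sketched here under the assumption that both the pattern and the replacement only contain characters up to U+00FF, is to decode with ISO-8859-1 (Latin-1), which maps each byte to the character with the same value, and let Regex.Replace do the splicing. BinaryRegex is a made-up name.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class BinaryRegex
{
    // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same value,
    // so decoding and re-encoding does not lose or change any byte.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Applies a regex replacement to the bytes and leaves everything else untouched.
    public static byte[] Replace(byte[] input, string pattern, string replacement)
    {
        string text = Latin1.GetString(input);
        string edited = Regex.Replace(text, pattern, replacement);
        return Latin1.GetBytes(edited);
    }
}
```

Usage would be something like File.WriteAllBytes(outfile, BinaryRegex.Replace(File.ReadAllBytes(infile), @"H[a-z]+ W[a-z]+", "Hi there")). Keep in mind that classes like \w can also match characters above 0x7F in this decoding, and that a replacement of a different length shifts everything after it, which may matter if the file format stores offsets.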
Related
I have this string, and I want to get the part after the date. The part up to the date always remains the same. I had hoped to use the index of the date, but it always changes, so I can't rely on it.
var str = @"c:\ somefolder\ download\ 2019-14-11 merchandise of today";
char[] separator = { ' ' };
var _split = str.Split(separator);
Here I have all the words broken down according to spaces.
How do I get the 'merchandise of today'?
You can try the following code, which uses a regular expression:
var str = #"c:\ somefolder\ download\ 2019-14-11 merchandise of today";
var reg = new Regex(#".+\d{4}-\d{2}-\d{2}");
var result = reg.Replace(str, string.Empty).Trim();
If the format of the date does not change, you could do the following:
var path = #"C:\Example\Download\2019-14-11 Merchandise Of Today";
var merchandise = path.Substring(path.LastIndexOf('\\') + 1).Trim();
OR
var merchandiseWithoutDate = path.Substring(path.IndexOf(' ')).Trim();
The first line outputs the date followed by "Merchandise Of Today". The second searches for the first space instead of the last backslash, so it returns only the text after the date (this assumes the path contains no spaces before the date).
Substring lets you choose the index where the result starts and, optionally, its length, so you control how much of the text you get back.
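If the folder names can contain spaces (as in the question's string), searching for the first space will not work. Here is a small sketch that anchors on the date instead and captures whatever follows it. AfterDate.Extract is a hypothetical helper, and the pattern assumes the date always looks like four digits, two digits, two digits separated by hyphens.

```csharp
using System.Text.RegularExpressions;

public static class AfterDate
{
    // Returns the text that follows the first dddd-dd-dd token, or null if none is found.
    public static string Extract(string input)
    {
        Match m = Regex.Match(input, @"\d{4}-\d{2}-\d{2}\s*(.*)$");
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

Called with @"c:\ somefolder\ download\ 2019-14-11 merchandise of today", this returns "merchandise of today".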
// Thanks, guys!
// This is what I did before I saw the two answers. I know it is quite messy, and I am going to
// try the Substring method now, but this was a temporary fix for me. Thanks again!
var str = @"c:\ somefolder\ download\ 2019-14-11 merchandise of today";
char[] separator = { ' ' };
var _split = str.Split(separator);
int len = _split.Length - 1;
for (int i = 0; i < _split.Length; i++)
{
    var getStr = _split[i];
    Match search = Regex.Match(getStr, "my regex pattern for date");
    if (search.Success)
    {
        // Find the date token, then join every word after it
        var converted = search.ToString();
        int index = Array.IndexOf(_split, converted);
        index++;
        string temp = "";
        while (index <= len)
        {
            temp = temp + " " + _split[index];
            index++;
        }
        Console.WriteLine("String is {0}", temp);
    }
}
I have a C# script which takes in two CSV files as input, combines the two files, performs numerous calculations on them, and writes the result in a new CSV file.
These two input CSV file names are declared as variables and are used in the C# script by accessing those variable names.
The data in the input CSV files looks like this:
Since the data has values in the thousands and millions, the line splits in the C# code break those values at their embedded commas. For instance, a value of 11,861 appears only as 11, and 861 goes into the next column.
Is there any way in C#, by which I can specify a text qualifier (" in this case) for the two files ?
Here is the C# code snippet:
string[,] filesToProcess = new string[2, 2] { {(String)Dts.Variables["csvFileNameUSD"].Value,"USD" }, {(String)Dts.Variables["csvFileNameCAD"].Value,"CAD" } };
string headline = "CustType,CategoryType,CategoryValue,DataType,Stock QTY,Stock Value,Floor QTY,Floor Value,Order Count,Currency";
string outPutFile = Dts.Variables["outputFile"].Value.ToString();
//Declare Output files to write to
FileStream sw = new System.IO.FileStream(outPutFile, System.IO.FileMode.Create);
StreamWriter w = new StreamWriter(sw);
w.WriteLine(headline);
//Loop Through the files one by one and write to output Files
for (int x = 0; x < filesToProcess.GetLength(0); x++) // dimension 0 is the number of files
{
if (System.IO.File.Exists(filesToProcess[x, 0]))
{
string categoryType = "";
string custType = "";
string dataType = "";
string categoryValue = "";
//Read the input file in memory and close after done
StreamReader sr = new StreamReader(filesToProcess[x, 0]);
string fileText = sr.ReadToEnd();
string[] lines = fileText.Split(Convert.ToString(System.Environment.NewLine).ToCharArray());
sr.Close();
where csvFileNameUSD and csvFileNameCAD are variables with values pointing to their locations.
Well, based on the questions you have answered, this ought to do what you want to do:
public void SomeMethodInYourCodeSnippet()
{
string[] lines;
using (StreamReader sr = new StreamReader(filesToProcess[x, 0]))
{
//Read the input file in memory and close after done
string fileText = sr.ReadToEnd();
lines = fileText.Split(Convert.ToString(System.Environment.NewLine).ToCharArray());
sr.Close(); // redundant due to using, but just to be safe...
}
foreach (var line in lines)
{
string[] columnValues = GetColumnValuesFromLine(line);
// Do whatever with your column values here...
}
}
private string[] GetColumnValuesFromLine(string line)
{
// Split on ","
var values = line.Split(new string [] {"\",\""}, StringSplitOptions.None);
if (values.Count() > 0)
{
// Trim leading double quote from first value
var firstValue = values[0];
if (firstValue.Length > 0)
values[0] = firstValue.Substring(1);
// Trim the trailing double quote from the last value
var lastValue = values[values.Length - 1];
if (lastValue.Length > 0)
values[values.Length - 1] = lastValue.Substring(0, lastValue.Length - 1);
}
return values;
}
Give that a try and let me know how it works!
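Another option, if a library class is acceptable, is TextFieldParser from the Microsoft.VisualBasic assembly, which understands quoted fields directly. This is only a sketch: QuotedCsv.Parse is a made-up wrapper, and in the script you would pass it a StreamReader opened on filesToProcess[x, 0].

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedCsv
{
    // Reads comma-delimited rows, treating double quotes as the text qualifier,
    // so a field like "11,861" stays in one column.
    public static List<string[]> Parse(TextReader reader)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```

Unlike splitting on the literal "," sequence, this also copes with rows where only some fields are quoted.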
You posted a very similar-looking question a few days ago. Did that solution not help you?
If not, what issues are you facing with it? We can probably help you troubleshoot that as well.
Here is my problem:
I have a string that I think is binary:
zv�Q6��.�����E3r
I want to convert this string to something which can be read. How I can do this in C#?
You may try enumerating (testing) all available encodings to find the one
that produces reasonable text. Unfortunately, this is not a foolproof solution:
information may already have been lost in the erroneous conversion.
// Requires: using System.Collections.Generic; using System.Reflection; using System.Text;
public static String GetAllEncodings(String value) {
    List<Encoding> encodings = new List<Encoding>();
    StringBuilder Sb = new StringBuilder();

    // Ordinary code pages
    foreach (EncodingInfo info in Encoding.GetEncodings())
        encodings.Add(Encoding.GetEncoding(info.CodePage));

    // Special encodings, that could have no code page
    foreach (PropertyInfo pi in typeof(Encoding).GetProperties(BindingFlags.Static | BindingFlags.Public))
        if (pi.CanRead && pi.PropertyType == typeof(Encoding))
            encodings.Add(pi.GetValue(null) as Encoding);

    foreach (Encoding encoding in encodings) {
        // Undo the (assumed) erroneous UTF-8 decoding, then decode with the candidate encoding
        Byte[] data = Encoding.UTF8.GetBytes(value);
        String test = encoding.GetString(data).Replace('\0', '?');

        if (Sb.Length > 0)
            Sb.AppendLine();

        Sb.Append(encoding.WebName);
        Sb.Append(" (code page = ");
        Sb.Append(encoding.CodePage);
        Sb.Append(")");
        Sb.Append(" -> ");
        Sb.Append(test);
    }

    return Sb.ToString();
}
...
// Test / usage
String St = "Некий русский текст"; // <- Some Russian Text
Byte[] d = Encoding.UTF32.GetBytes(St); // <- Was encoded as UTF 32
St = Encoding.UTF8.GetString(d); // <- And erroneously read as UTF 8
// Let's see all the encodings:
myTextBox.Text = GetAllEncodings(St);
// In the myTextBox.Text you can find the solution:
// ....
// utf-32 (code page = 12000) -> Некий русский текст
// ....
byte[] hexbytes = System.Text.Encoding.Unicode.GetBytes(str); // str is the string you want to inspect
This gives you the bytes of the string, but you have to know the encoding of your string and replace 'Unicode' with it.
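To actually look at those bytes as hex, BitConverter.ToString is probably the shortest route. A small sketch, where HexDump.Of is a hypothetical helper name:

```csharp
using System;
using System.Text;

public static class HexDump
{
    // Encodes the text with the given encoding and formats the bytes as "48-69-...".
    public static string Of(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

For instance, HexDump.Of("Hi", Encoding.UTF8) gives "48-69", while Encoding.Unicode (UTF-16, little-endian) gives "48-00-69-00", which shows how much the result depends on picking the right encoding.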
I am reading strings from a binary file. Each string is null-terminated. The encoding is UTF-8. In Python I simply read a byte, check if it's 0, append it to a byte array, and continue reading bytes until I see a 0. Then I convert the byte array into a string and move on. All of the strings were read correctly.
How can I read this in C#? I don't think I have the luxury of simply appending bytes to an array since the arrays are fixed size.
The following should get you what you are looking for. All of the text ends up in the myText list.
var data = File.ReadAllBytes("myfile.bin");
List<string> myText = new List<string>();
int lastOffset = 0;
for (int i = 0; i < data.Length; i++)
{
if (data[i] == 0)
{
myText.Add(System.Text.Encoding.UTF8.GetString(data, lastOffset, i - lastOffset));
lastOffset = i + 1;
}
}
I assume you're using a StreamReader instance:
StringBuilder sb = new StringBuilder();
using(StreamReader rdr = OpenReader(...)) {
Int32 nc;
while((nc = rdr.Read()) != -1) {
Char c = (Char)nc;
if( c != '\0' ) sb.Append( c );
}
}
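You can either use a List<byte>:
List<byte> list = new List<byte>();
while(reading){ //or whatever your condition is
    list.Add(readByte);
}
string output = Encoding.UTF8.GetString(list.ToArray());
Or you could use a StringBuilder (only safe when every byte is a single ASCII character, because each byte is cast straight to a char):
StringBuilder builder = new StringBuilder();
while(reading){
    builder.Append((char)readByte); // Append(readByte) without the cast would append the number as decimal digits
}
string output = builder.ToString();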
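Filling in the placeholders from the List<byte> version, a self-contained sketch that mirrors the Python approach from the question could look like this. NullTerminated, ReadString and ReadAll are names invented for the example; the stream can be a FileStream from File.OpenRead or anything else derived from Stream.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class NullTerminated
{
    // Reads bytes up to the next 0x00 (or the end of the stream) and decodes them as UTF-8.
    // Returns null when the stream is already at its end.
    public static string ReadString(Stream stream)
    {
        var buffer = new List<byte>();
        int b = stream.ReadByte();
        if (b == -1) return null;
        while (b > 0) // stops on the terminator (0) and on end of stream (-1)
        {
            buffer.Add((byte)b);
            b = stream.ReadByte();
        }
        return Encoding.UTF8.GetString(buffer.ToArray());
    }

    // Reads every null-terminated string until the stream is exhausted.
    public static List<string> ReadAll(Stream stream)
    {
        var result = new List<string>();
        string s;
        while ((s = ReadString(stream)) != null)
            result.Add(s);
        return result;
    }
}
```

Collecting the raw bytes first and decoding once per string keeps multi-byte UTF-8 sequences intact, which a byte-by-byte cast to char would not.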
If your "binary file" only contains null terminated UTF8 strings, then for .NET it isn't a "binary file" but just a text file because null characters are characters too. So you could just use a StreamReader to read the text and split it on the null characters.
(Six years later "you" would presumably be some new reader and not the OP.)
A one line (ish) solution would be:
using (var rdr = new StreamReader(path))
    return rdr.ReadToEnd().Split(new char[] { '\0' });
But that will give you a trailing empty string if the last string in the file was "properly" terminated.
A more verbose solution that might perform differently for very large files, expressed as an extension method on StreamReader, would be:
// Extension methods must be static and declared inside a static class
public static List<string> ReadAllNullTerminated(this System.IO.StreamReader rdr)
{
var stringsRead = new System.Collections.Generic.List<string>();
var bldr = new System.Text.StringBuilder();
int nc;
while ((nc = rdr.Read()) != -1)
{
Char c = (Char)nc;
if (c == '\0')
{
stringsRead.Add(bldr.ToString());
bldr.Length = 0;
}
else
bldr.Append(c);
}
// Optionally return any trailing unterminated string
if (bldr.Length != 0)
stringsRead.Add(bldr.ToString());
return stringsRead;
}
Or for reading just one at a time (like ReadLine)
public static string ReadNullTerminated(this System.IO.StreamReader rdr)
{
var bldr = new System.Text.StringBuilder();
int nc;
while ((nc = rdr.Read()) > 0)
bldr.Append((char)nc);
return bldr.ToString();
}
For example, I have this:
"Was? Wo war ich? Ach ja.<pa>">
I need to create a new text file that will contain only:
Was? Wo war ich? Ach ja.
I have a big file of about 43 MB, and I need to scan the whole file, find only the places that start with " and end with <pa>", and get the string between these tags.
I did this code so far:
private void retrivingTestText()
{
w = new StreamWriter(retrivedTextFile);
string startTag = "\"";
string endTag = "<pa>";
int startTagWidth = startTag.Length;
int endTagWidth = endTag.Length;
string text = "\"Was? Wo war ich? Ach ja.<pa>\">";
int begin = text.IndexOf(startTag);
int end = text.IndexOf(endTag, begin + 1);
string result = text.Substring(begin + 1, end - begin - 1); // the second argument is a length, not an end index
w.WriteLine(result);
w.Close();
}
But now I need to do this on a big 43 MB XML file.
So in the constructor I already declared StreamReader r;
and string f;
Then I did:
r = new StreamReader(@"D:\New folder (22)\000004aa.xml");
f = r.ReadToEnd();
Now I need to use it with the code above to extract all the strings in the big file between the startTag and endTag, not only one specific text.
Second, I need another function so that, after I make changes, it puts all the extracted strings back in the right places, where they were before, between the startTag and the endTag.
Thanks.
You can use the following approach to extract the data.
string word = "\"Was? Wo war ich? Ach ja<pa>\"Jain\"Romil<pa>\"";
string[] stringSeparators = new string[] { "<pa>\"" };
string ans=String.Empty;
string[] text = word.Split(stringSeparators, StringSplitOptions.None);
foreach (string s in text)
{
if (s.IndexOf("\"") >= 0)
{
ans += s.Substring(s.IndexOf("\"")+1);
}
}
return ans;
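For the whole file, and for putting the edited strings back, a regular expression may be easier than splitting. This is a sketch under the assumption that the strings themselves never contain a double quote; PaStrings, Extract and PutBack are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PaStrings
{
    // A quote, then any run of non-quote characters, then <pa> and the closing quote.
    private static readonly Regex Pattern = new Regex("\"([^\"]*)<pa>\"");

    // Returns every string found between a quote and <pa>", in file order.
    public static List<string> Extract(string xml)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(xml))
            found.Add(m.Groups[1].Value);
        return found;
    }

    // Writes the edited strings back into the same places, in the same order.
    // Throws if 'edited' has fewer entries than there are matches.
    public static string PutBack(string xml, IList<string> edited)
    {
        int i = 0;
        return Pattern.Replace(xml, m => "\"" + edited[i++] + "<pa>\"");
    }
}
```

With the file contents in f, Extract(f) gives the list to write to the text file, and PutBack(f, editedList) rebuilds the XML, as long as the edited list keeps the same number of entries in the same order.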
There is a similar post on how to remove HTML tags using regular expressions. Here is the link.
And another one that you can tweak, here.