How to find whether a stream contains Unicode


I have a file named "Connecticut is now 2 °C.txt". The name contains a Unicode character, but the file contents are just normal characters. Previously, the code checked whether the file name contained Unicode and, if so, wrote the file header with the Unicode details. This implementation leads to conflicts in the output file. Can anyone suggest how to find out whether the file stream itself has Unicode in it?
Thanks in advance,
Lokesh.

By far the simplest strategy is to decide on an encoding for a particular file, e.g. UTF-8, and use it exclusively, both when you write it and when you read it. Trying to detect what encoding is in use is decidedly error-prone, so it's best not to have to do this detection.
UPDATE
In the comments below you clarify that you wish to write to a file that is created by somebody else with an unknown encoding.
In full generality this is impossible to do with 100% reliability.
If you are lucky then you may find that the file comes with a Byte Order Mark (BOM). In which case you can read the BOM and thus infer the encoding. There's no requirement for a text file to contain a BOM and they frequently don't.
However, I would urge you to agree an interchange format with whoever is creating these files. Pick a single encoding and always use it.
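As a rough illustration of the BOM check mentioned above, here is a minimal sketch. The class and method names (BomSniffer, DetectBom) are made up for this example, and it only recognizes the UTF-8, UTF-16 and UTF-32 marks. A null result means "no BOM found", not "not Unicode".

```csharp
using System;
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: returns the encoding implied by a leading
    // byte order mark, or null if the bytes do not start with one.
    public static Encoding DetectBom(byte[] head)
    {
        if (head == null) return null;

        // UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return Encoding.UTF32;
        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big endian

        return null; // no BOM; the encoding is still unknown
    }
}
```

To use it on a stream, read the first four bytes (or fewer if the file is shorter), pass them in, and then seek back to the start before reading the contents.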

I think this link would be helpful for you. Pay attention to IsTextUnicode Function

Related

Data concatenation as steganography technique

For example, I recorded a video using my camera and saved it as my_vacation.mp4, whose size is 50 MB. Using File.ReadAllBytes() in C# (in Visual Studio), I read the video file and an encrypted file called secret_message.dat, concatenated both byte arrays, and then saved the result as my_vacation_2.mp4.
The program I created for testing purposes saves the byte index where the hidden file begins, and I want to use it as the key to extract that hidden file later.
Now I can play that video file normally, without any error. Total file size is 65MB. Suppose no one could access the original file, of course no one would know that the last 15MB part of that video file is actually another file, right?
What might be the flaw of this technique? Is this also a valid steganography technique?
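For reference, the procedure described in the question amounts to roughly the following. This is only a sketch of the idea, not the asker's actual program; the names AppendDemo, Append and Extract are invented for illustration, and it works on byte arrays so it is independent of any particular file.

```csharp
using System;

static class AppendDemo
{
    // Returns carrier followed by payload. 'offset' receives the index
    // at which the payload starts (the "key" mentioned in the question).
    public static byte[] Append(byte[] carrier, byte[] payload, out int offset)
    {
        offset = carrier.Length;
        var combined = new byte[carrier.Length + payload.Length];
        Buffer.BlockCopy(carrier, 0, combined, 0, carrier.Length);
        Buffer.BlockCopy(payload, 0, combined, offset, payload.Length);
        return combined;
    }

    // Recovers everything from 'offset' to the end of the combined data.
    public static byte[] Extract(byte[] combined, int offset)
    {
        var payload = new byte[combined.Length - offset];
        Buffer.BlockCopy(combined, offset, payload, 0, payload.Length);
        return payload;
    }
}
```

With real files, the arrays would come from File.ReadAllBytes() and the result would be written with File.WriteAllBytes().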

Is this a valid steganography technique?
Yes, it is. The definition of steganography is hiding information in another medium without someone suspecting its presence or existence. Just because it may be a bad approach doesn't change its intentions at all. If anything, a multitude of papers on steganography mention this technique in their introduction section as an example of how steganography can be applied.
What might be the flaw of this technique?
There are two main flaws: it is trivial to detect, and it is extremely fragile to modification attacks.
Many formats encode their data either with a header that says in advance how many bytes to read, or with an end-of-file marker, which means to keep on reading data until the marker is encountered. By attaching your data after that, you ensure it won't be read by the appropriate format decoder. This can fool your 11-year-old cousin who knows nothing about that sort of stuff, but anyone mildly experienced can load the file and count how many bytes were read. If there are unaccounted bytes in the physical file, that will instantly raise red flags.
Even worse, it's trivial to fully extract your secret. You may argue it's encrypted, but remember, the aim of steganography is to not raise any suspicion. Most steganalysis approaches attach a statistical number to it, e.g., a 60% chance that there is a message hidden in medium X. A few others can go a bit further and guess the approximate length of the embedded secret. In comparison, you're already caught red-handed.
Talking about length, a file of X bitrate/compression and Y duration results in a file of approximately size Z. Even an unsavvy observer will know what's up when the size is 30% larger than expected.
Now, imagine your file is communicated through an insecure channel where a warden inspects its contents and, if he suspects foul play, can modify the file so that the recipient doesn't get the message. In this case, it's as simple as loading the file and resaving it. In fact, your method is so fragile it can be destroyed by even the most unintentional of attacks. If you merely upload your file to a site for playback, the site may re-encode it for higher compression, just because it makes sense, and unwittingly destroy the message.

Suppose no one could access the original file, of course no one would know that the last 15MB part of that video file is actually another file, right?
No. Your secret file is encrypted, so that probably rules out any headers showing up in a hex editor, but there is a problem: the MP4 container format and its structure are well known.
You can extract all video/audio tracks and what you are left with is some metadata and your secret message, so it will be obvious that it's not supposed to be there.
It is a valid technique, just not a very effective one.
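To make the detection argument concrete, here is a hedged sketch of how an inspector might look for appended data. It relies on the fact that an MP4 file is a sequence of top-level boxes, each starting with a 4-byte big-endian size and a 4-byte type code (a size of 1 means a 64-bit size follows, a size of 0 means the box runs to the end of the file). The names Mp4Scan and EndOfBoxes are made up, and the "type must be printable ASCII" test is a heuristic assumed for this example, not part of the format.

```csharp
using System;

static class Mp4Scan
{
    // Walks top-level boxes and returns the offset at which
    // well-formed boxes stop. Anything after that offset is suspicious.
    public static long EndOfBoxes(byte[] data)
    {
        long pos = 0;
        while (pos + 8 <= data.Length)
        {
            long size = ReadBigEndian(data, pos, 4);
            long header = 8;
            if (size == 1)
            {
                // 64-bit "largesize" follows the type field.
                if (pos + 16 > data.Length) break;
                size = ReadBigEndian(data, pos + 8, 8);
                header = 16;
            }
            else if (size == 0)
            {
                return data.Length; // last box extends to end of file
            }
            if (size < header || size > data.Length - pos) break;
            if (!LooksLikeType(data, pos + 4)) break;
            pos += size;
        }
        return pos;
    }

    static long ReadBigEndian(byte[] data, long offset, int count)
    {
        long value = 0;
        for (int i = 0; i < count; i++)
            value = (value << 8) | data[offset + i];
        return value;
    }

    // Heuristic (assumption): box types are four printable ASCII characters.
    static bool LooksLikeType(byte[] data, long offset)
    {
        for (int i = 0; i < 4; i++)
            if (data[offset + i] < 0x20 || data[offset + i] > 0x7E) return false;
        return true;
    }
}
```

If the returned offset is smaller than the file length, the remainder is not part of any box. Random-looking (encrypted) trailing bytes could occasionally pass these checks by chance, so treat the result as an indicator rather than proof.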

Why doesn't changing a few bytes in a file corrupt the file?

In C#, I have a ZIP file that I want to corrupt by XORing or Nulling its bytes.
(by Nulling I mean make all the bytes in the file zeros)
XORing its bytes requires me to first, read the bytes to a byte array, XOR the bytes in the array with some value, then write the bytes back to the file.
Now, if I XOR/null all (or half) of the file's bytes, it gets corrupted, but if I XOR/null just some of the bytes, say the first few bytes (or any small number of bytes at any position in the file), it doesn't get corrupted. By that I mean that I can still access the file as if nothing really happened.
The same thing happened with MP3 files.
Why isn't the file getting corrupted?
And is there a "FAST" way to corrupt a file?
The problem is that the ZIP file I'm dealing with is big, so XORing/nulling even half of its bytes takes a couple of seconds.
Thank You So Much In Advance .. :)

Just read all the files completely and you will probably get read errors.
But of course, if you want to keep something 'secret', use encryption.
A zip contains a small header, a directory structure (at the end) and, in between, the individual files. See Wikipedia for details.
Corrupting the first bytes is sure to corrupt the file but it is also very easily repaired. The reader won't be able to find the directory block at the end.
Damaging the last block has the same effect: the reader will give up immediately but it is repairable.
Changing a byte in the middle will corrupt 1 file. The CRC will fail.

It depends on the file format you are trying to "corrupt". It also depends on what portion of the file you are trying to modify. Lastly, it depends how you are verifying if it is corrupted. Most file formats have some type of error detection.
The other thing working against you is that the zip file format uses a CRC algorithm to detect corruption. In addition, there are two copies of the directory structure, so you need to corrupt both.
I would suggest you corrupt the directory structure at the end and then modify some of the bytes in the front.
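A quick way to do that without reading the whole archive is to open the file for writing and overwrite only its ends in place. The following is a minimal sketch under that assumption; ZipBreaker and ZeroEnds are hypothetical names, and how many bytes are needed to actually cover the directory at the end depends on the archive.

```csharp
using System;
using System.IO;

static class ZipBreaker
{
    // Overwrites the first and last 'count' bytes of the file with zeros,
    // in place. FileMode.Open does not truncate the existing file.
    public static void ZeroEnds(string path, int count)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            int n = (int)Math.Min(count, fs.Length);
            var zeros = new byte[n];

            fs.Seek(0, SeekOrigin.Begin);   // front of the file
            fs.Write(zeros, 0, n);

            fs.Seek(-n, SeekOrigin.End);    // tail of the file (directory area)
            fs.Write(zeros, 0, n);
        }
    }
}
```

Because only 2 × count bytes are written, the cost does not grow with the size of the archive. Everything between the two ends is left intact, so this does not make the contents unrecoverable.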

I could just lock the zip entries with a pass, but I don't want anybody to even open it up and see what's in it
That makes it sound as if you're looking for a method of secure deletion. If you simply didn't want someone to read the file, delete it. Otherwise, unless you do something extreme like go over it a dozen times with different values or apply some complex algorithm over it a hundred times, there are still going to be ways to read the data, even if the format is 'corrupt'.
On the other hand, breaking a file simply to stop someone else accessing it conventionally just seems overkill. If it's a zip, you can read it in (there are plenty of questions here for handling archive files), encrypt it with a password and then write it back out. If it's a different type of file, there are literally a million different questions and solutions for encrypting, hiding or otherwise preventing access to data. Breaking a file isn't something you should be going out of your way to do, unless this is to help test some sort of un-zip-corrupting program or something similar, but your comments imply this is to prevent access. Perhaps a bit more background on why you want to do this could help us provide a better answer?

How to match and erase a (potentially) large portion of text between certain points in C#?

I'm trying to find a way to clear out links in a .txt document loaded into the project as a string via StreamReader.
Firstly I need to identify that there is a link (it could be inside of tags, or it could just be out by itself in the middle of a sentence, like http://www.somesite.com )
I found a neat class online called GetStringInBetween which allows me to find all the links in the document. However I'm struggling in using the same class to then match both the found link(s) AND another point - I was trying to go for a linebreak so that I'm able to replace everything between a linebreak and the end of the url - effectively erasing chunks of text surrounding the url; they typically say something like "you can visit our site at http:/", etc.
What is the best way to a) identify links in an extremely long string and b) how to erase them AND some text around them?
I'd also like to note that unless I specify to use Encoding.UTF7 the text comes out all garbled when it's read from the text files. I don't know if this might be a source of the matching issues.
Thanks ladies and gents :)

First of all - how big is the file that you're trying to parse? If it's just on the order of a few hundred MB, then you can load it in RAM entirely which simplifies things.
The UTF-7 encoding should not bother you, because all .NET strings are internally UTF-16 and .NET converts from UTF-7 to UTF-16 when reading the file so you don't have to worry about encodings anymore.
After you have it in one big string, your best bet is to proceed with using regexps on it. They allow replacing text as well, so you might be able to "clean" your file in one line of code! Of course, regexps for matching URLs will never be perfect (and even less so for parsing HTML), so you can expect that some parts of more exotic URLs might escape now and then. But if you want perfection, then it might get REALLY tricky.
Alternatively, if the file is large and you only care about removing one line at a time, you might try reading the file line by line and processing each line separately. If you find a URL in it, discard the line. If there is no URL, write the line to the target file. That's also very simple to write. You'd still be dependent on regexps for finding URLs though.
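Both approaches can be sketched in a few lines. The pattern below is an assumption chosen for brevity (anything starting with http://, https:// or www. up to the next whitespace); as noted above, no URL regexp is perfect, and the names LinkFilter, DropLinesWithUrls and StripUrls are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LinkFilter
{
    // Deliberately loose pattern (an assumption, not a complete URL grammar).
    static readonly Regex Url = new Regex(
        @"(https?://|www\.)\S+",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Line-by-line variant: keeps only the lines that contain no URL.
    public static IEnumerable<string> DropLinesWithUrls(IEnumerable<string> lines)
    {
        foreach (var line in lines)
            if (!Url.IsMatch(line))
                yield return line;
    }

    // Whole-string variant: removes just the URLs and leaves the rest.
    public static string StripUrls(string text)
    {
        return Url.Replace(text, "");
    }
}
```

For a large file, DropLinesWithUrls can be fed from File.ReadLines() and its output passed to File.WriteAllLines(), so the whole text never has to be held in memory at once.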

How to determine if a CSV file is unicode or not

I am using C#, I have a comma delimited csv file with different strings in different languages.
My app should only open the CSV if it's unicode.
Is there an easy way to determine this in code ?

When you say "Unicode" I assume you mean UTF-8. Unicode is not an encoding and a file can't be "Unicode".
You could use a library, for example, ude is a C# library that attempts to determine what encoding a file uses. It uses the algorithm described here. It is not 100% foolproof.

The CSV specification does not provide a way to provide metadata describing the encoding format. The specification itself uses ASCII encoding for separators. But the data tokens between separators can be anything.
You will have to read through the data itself and infer the coding type based on that.
If you are in control of the output and input, you could modify the format for your own needs by adding your own metadata, but then it would no longer fit the CSV file format, if that matters in your case.
So no, there isn't an "easy" way to determine the encoding.
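If "Unicode" is taken to mean UTF-8, as the first answer assumes, one cheap inference is to check whether the bytes decode as strict UTF-8. The sketch below is only a heuristic under that assumption, and Utf8Check and IsValidUtf8 are made-up names. Plain ASCII also passes the check, and passing does not prove that the author intended UTF-8.

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder raise an exception
        // instead of silently substituting replacement characters.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Text saved in a legacy single-byte code page usually fails this test as soon as it contains a non-ASCII character, which is what makes the check useful in practice.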

How to convert the encoding of a string to UTF-8 without knowing the original encoding in C#?

I'm reading a CSV file with Fast CSV Reader (on CodeProject). When I print the content of the fields, the console shows the character '?' in some words. How can I fix it?

The short version is that you have to know the encoding of any text file you're going to read up front. You could use things like byte order marks and other heuristics if you really aren't going to know, but you should always allow for the value to be tweaked (in the same way that Excel does if you're importing CSV).
It's also worth double checking the values in the debugger, as it may be that it is the output that is wrong, as opposed to the reading -- bear in mind that all strings are Unicode internally, and conversion to '?' sounds like it is failing converting the unicode to the relevant code page for the console.
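The '?' effect on output can be reproduced in isolation. This small sketch (QuestionMarkDemo and RoundTrip are hypothetical names) pushes a string through a target encoding and back, which is roughly what happens when text is written to a console whose code page cannot represent a character.

```csharp
using System;
using System.Text;

static class QuestionMarkDemo
{
    // Encodes the text with 'target' and decodes it again. Characters the
    // encoding cannot represent are replaced by its fallback character.
    public static string RoundTrip(string text, Encoding target)
    {
        return target.GetString(target.GetBytes(text));
    }
}
```

For example, RoundTrip("2 °C", Encoding.ASCII) yields "2 ?C" even though the original string was read correctly, while the same call with Encoding.UTF8 returns the text unchanged. If the debugger shows correct values, setting Console.OutputEncoding to Encoding.UTF8 may be worth trying, provided the console font can display the characters.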
