Find Special Character Sequence in file - c#

I have a file that uses NUL and SOH characters as markers. I need to look at the pattern of those special characters in order to parse out what I need. For example, when the file is viewed in Notepad++:
NULNULNUL Boom BoomSOHNULNULNULNULDLENULNULNULSIJohn Lee HookerSOH
I would like to extract the "Boom Boom" and the "John Lee Hooker". Those values will change (these are music files) with each file.
I was thinking of using the "NULNULNUL" pattern to find the first section and the "NULSI" pattern to find the second part.
I tried a FileStream to read in the bytes, but I don't know how to detect the special characters.
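One way to approach this, as a rough sketch: load the file with File.ReadAllBytes and search the byte array directly, since NUL is byte 0x00, SOH is 0x01 and SI is 0x0F. The class and method names below are made up for illustration, and the sketch assumes the text between the markers is plain ASCII and that each value is terminated by SOH, as in the sample line.

```csharp
using System;
using System.Text;

public static class MarkerParser
{
    // Returns the index of the first occurrence of pattern in data
    // at or after start, or -1 if it is not found.
    public static int IndexOf(byte[] data, byte[] pattern, int start = 0)
    {
        for (int i = Math.Max(start, 0); i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Finds prefix at or after start, then returns the text between the end
    // of prefix and the next terminator byte. next receives the index just
    // past the terminator, or -1 if nothing was found.
    public static string ExtractAfter(byte[] data, byte[] prefix, byte terminator,
                                      int start, out int next)
    {
        next = -1;
        int at = IndexOf(data, prefix, start);
        if (at < 0) return null;

        int begin = at + prefix.Length;
        int end = Array.IndexOf(data, terminator, begin);
        if (end < 0) return null;

        next = end + 1;
        // Assumption: the payload is ASCII text; surrounding spaces are trimmed.
        return Encoding.ASCII.GetString(data, begin, end - begin).Trim();
    }
}

// Hypothetical usage:
//   byte[] data = System.IO.File.ReadAllBytes(path);
//   int next;
//   string title  = MarkerParser.ExtractAfter(data, new byte[] { 0, 0, 0 }, 0x01, 0, out next);
//   string artist = MarkerParser.ExtractAfter(data, new byte[] { 0, 0x0F }, 0x01, next, out next);
```

Searching for the second pattern from the position where the first value ended avoids accidentally matching bytes that come before the title.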

Related

Read multiple files with different encoding, preserving all characters

I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How can I do this? Do I need to detect the input file's encoding (that seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters "�".
I am looking for a way to read the file regardless of which of the two encodings it uses, and write it correctly without knowing the input file's encoding beforehand.
File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 and ANSI file and writes the output as ANSI with all chars preserved so I'm looking to do this but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads ANSI file and writes output as UTF-8 but
there is some gibberish characters "�".
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind its so hard to do something in C# that command
prompt can do easy
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If the "�" you posted (hex 0xefbfbd) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of
the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The Replacement character � (often displayed as a black rhombus with a
white question mark) is a symbol found in the Unicode standard at code
point U+FFFD in the Specials table. It is used to indicate problems
when a system is unable to render a stream of data to a correct
symbol.[4] It is usually seen when the data is invalid and does not
match any character
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar
character.
UPDATE:
The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
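If the input really can be either encoding, one possible approach (a sketch, not something the answer above proposed) is to try a strict UTF-8 decode first and fall back to "iso-8859-1" only when the bytes are not valid UTF-8. The class and method names are hypothetical. The sketch assumes the "ANSI" files are Latin-1 compatible; a file in another single-byte code page would need a different fallback encoding.

```csharp
using System.IO;
using System.Text;

public static class EncodingFallback
{
    // Decodes bytes as UTF-8 if they are valid UTF-8, otherwise as ISO-8859-1.
    public static string DecodeUtf8OrLatin1(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes invalid sequences throw
        // instead of silently turning into U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading byte order mark as U+FEFF; drop it.
            if (text.Length > 0 && text[0] == '\uFEFF') text = text.Substring(1);
            return text;
        }
        catch (DecoderFallbackException)
        {
            // Assumption: anything that is not valid UTF-8 is Latin-1.
            return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        }
    }

    // Reads a file in either encoding and writes it back out as UTF-8.
    public static void Convert(string inputPath, string outputPath)
    {
        string text = DecodeUtf8OrLatin1(File.ReadAllBytes(inputPath));
        File.WriteAllText(outputPath, text, Encoding.UTF8);
    }
}
```

Pure ASCII files decode identically either way, so the fallback only matters for bytes above 0x7F.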

Regex - Extracting File Paths

I need to be able to extract the full file path out of this string (without whatever is after the file extension):
$/FilePath/FilePath/KeepsGoing/Folder/Script.sql (CS: 123456)
A simple pattern such as the following would work for this case, but it is limited to file extensions with exactly 3 characters:
(\$.*\..{3})
However, I find problems with this when the file contains multiple dots:
$/FilePath/FilePath/File.Setup.Task.exe.config (CS: 123456)
I need to be able to capture the full file path (from $ to the end of whatever the file extension is, which can be any number of things). I need to be able to get this no matter how many dots are in the name of the file. In some cases there are spaces in the name of the file too, so I need to be able to incorporate that.
Edit: The ending (CS....) in this case is not standard. All kinds of stuff can follow the path so I cannot predict what will come after the path, but the path will always be first. Sometimes spaces do exist in the file name.
Any suggestions?
Try this:
(\$.*\.[\w.-]+)
However, it will not properly match files with spaces or special characters in the file extension. If you need to match those, you'll need to elaborate on the input (is it quoted? is it escaped?).
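For reference, here is how the suggested pattern could be applied in C#, using the two sample strings from the question. The class and method names are only illustrative. Note that `.*` is greedy, so the match runs to the last dot that is followed by word characters; a dot in whatever text follows the path (for example a version number) would extend the match too far.

```csharp
using System.Text.RegularExpressions;

public static class PathExtractor
{
    // The pattern from the answer above: "$", anything, then a final
    // ".extension" made of word characters, dots or hyphens.
    private static readonly Regex PathPattern = new Regex(@"(\$.*\.[\w.-]+)");

    // Returns the captured path, or null when the input contains no match.
    public static string Extract(string input)
    {
        Match m = PathPattern.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }
}

// Hypothetical usage:
//   PathExtractor.Extract("$/FilePath/FilePath/File.Setup.Task.exe.config (CS: 123456)")
//   yields "$/FilePath/FilePath/File.Setup.Task.exe.config"
```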

Extracting data from a large file with regex

I have a file of close to 800 MB which consists of several records, each a header followed by content.
The header looks something like this: M=013;X=rast;645.jpg, while the content is the binary data of the jpg file.
So the file looks something like this:
M=013;X=rast;645.jpgNULœDüŠˆ.....M=217;X=rast;113.jpgNULÿñÿÿ&åbÿås....M=217;X=rast;1108.jpgNUL]_ÿ×ÉcË/...
The header can occur in one line or across two lines.
I need to parse this file and basically pop out the several jpg images.
Since the file is so big, could you suggest an efficient way to do this? I was hoping to use StreamReader, but I do not have much experience using regular expressions with it.
RegEx:
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=(?1)|$))/gs *with recursion (not supported in .NET)
.NET RegEx workaround:
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))/gs
This replaces the (?1) recursion group with the contents of the 1st capture group.
Live demo and Explanation of RegExp: http://regex101.com/r/nQ3pE0/1
You'll want to use the 2nd capture group for the binary contents; the 1st group matches the header, and the expression needs it to know where to stop.
*edited in italic
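A minimal sketch of how the .NET workaround could be used from C#. The `/gs` flags correspond to calling `Regex.Matches` with `RegexOptions.Singleline`. The bytes are decoded as "iso-8859-1" so that every byte maps to exactly one character and converts back unchanged. The class and method names are hypothetical, and this version holds the whole input in memory, so treat it as an illustration of the pattern rather than a tuned solution for an 800 MB file. As with the pattern itself, a stray "M=" inside the image data could cause a wrong split.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class JpgSplitter
{
    // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Group 1 is the header, group 2 is everything up to the next header
    // or the end of the input.
    private static readonly Regex Record = new Regex(
        @"(M=.+?;X=.+?;.+?\.jpg)(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))",
        RegexOptions.Singleline);

    // Returns (header, content bytes) pairs in file order.
    public static List<KeyValuePair<string, byte[]>> Split(byte[] data)
    {
        string text = Latin1.GetString(data);
        var records = new List<KeyValuePair<string, byte[]>>();
        foreach (Match m in Record.Matches(text))
        {
            // The content still starts with the NUL that follows the header.
            byte[] content = Latin1.GetBytes(m.Groups[2].Value);
            records.Add(new KeyValuePair<string, byte[]>(m.Groups[1].Value, content));
        }
        return records;
    }
}
```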

Why am I getting "�" characters?

I've written a quick-and-dirty utility to parse a text file, but in some cases it's writing out a "�" character. My utility reads from a .txt file which contains "records" in this format:
Biography
Title:George F. Kennan: An American Life
Author:John Lewis Gaddis
Kindle: B0054TVO1G
Hardcover: B007R93I1U
Paperback: 0143122150
Image link: <img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...and writes out lines from that to a CSV file such as:
Biography,"George F. Kennan: An American Life","John Lewis Gaddis",B0054TVO1G,B007R93I1U,0143122150,<img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...but in several cases, as mentioned, that weird character is appending itself to an author's name. In most cases where this is happening, it's what appears to be a space character in the .txt file. I'm trimming the author's name prior to writing it out to the CSV file, so it's obviously not being seen as a space, though.
When I save the text file with these characters, I get the message about non-unicode characters, etc.
What could be the cause of that? And better yet, how can I delete them with a search and replace operation? In Notepad, they are not found, so I have to delete them one-by-one.
Prior to being in the .txt file, this data was in an Open Office/.odt file, if that means anything to anyone.
BTW, I have no idea how that "stackoverflow" got into the href above; it's not in the original text I pasted in...
UPDATE
I am curious how that character got in my files. I sure didn't put it there (deliberately), any more than I added the "stackoverflow" to the URL above. Could it be that a call to Environment.NewLine would add that?
Here was my process:
1) Copy and paste info from the interwebs into an Open Office/.odt file
2) Copy and paste that into a text (Notepad) file
3) Open that text file programmatically and loop through it, writing to a new "csv"/.txt file.
UPDATE 2
Silly me - all I had to do was save the file (which wouldn't save those weird characters), then open it again. IOW, when I opened it today (at home, after work) those were gone.
UPDATE 3
I wrote too soon - it replaced the weird character with a question mark (a "normal" one, not a stylized one).
They are almost certainly non-breaking spaces, U+00A0 (although there are other fixed-width space characters which are also possible.) These won't be trimmed as spaces, but will be rendered as spaces if the encoding of the file matches the encoding of the output device.
My guess is that your text file is in CP-1252 (i.e., Windows default one-byte coding) but your output is being rendered as though it were UTF-8.
Normally you would type these characters as AltGr+Space. You might try that with Notepad, but no guarantees.
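If the utility is written in C#, the guess above could be checked with something like the sketch below. It assumes the file is in a single-byte encoding where byte 0xA0 is the non-breaking space; "iso-8859-1" is used as a stand-in because it agrees with CP-1252 for that byte. The names are hypothetical. One detail worth knowing: in .NET, Trim() does treat a correctly decoded U+00A0 as whitespace, so a surviving "�" usually means the byte was decoded as UTF-8 and had already become U+FFFD before any trimming happened.

```csharp
using System.Text;

public static class NbspCleaner
{
    // Decodes single-byte text and turns non-breaking spaces into
    // ordinary spaces before trimming.
    public static string DecodeAndClean(byte[] raw)
    {
        // Assumption: 0xA0 in the input means NO-BREAK SPACE (U+00A0).
        string text = Encoding.GetEncoding("iso-8859-1").GetString(raw);
        return text.Replace('\u00A0', ' ').Trim();
    }
}

// Hypothetical usage:
//   string author = NbspCleaner.DecodeAndClean(System.IO.File.ReadAllBytes("author.txt"));
```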

C#: Search and replace txt line

I am looking for a way to search a comma-separated txt file for a keyword, and then replace another keyword on that exact line. For example, if I have the following line in a big txt file:
Help, 0
I want to find this line in the txt (by telling the program to look for the first word 'help') and replace the 0 with 1 to indicate that I have read it once, so it looks like:
Help, 1
Thanks
It is generally a very bad idea to try and overwrite data in the same file: if your code throws an exception, you'll be left with a partially processed file; if your search target and replacement value have different lengths, you have to re-write the rest of the file. Note that these don't apply in your specific situation - but it's best not to let it become habit.
My recommendation:
1) Open both the input file and a temporary file (Path.GetTempFileName)
2) Process and write each line (StreamReader.ReadLine)
3) When finished with no errors, rename the original file to something like origFile.old
4) Rename the temporary file to the original file name.
If something goes wrong, delete the temporary file and exit. This way the original file is left intact in the event of an error.
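The recommendation above might look roughly like this in C#. It is a sketch under a few assumptions: each line has the form "Keyword, N", the keyword comparison ignores case, and the backup is named by appending ".old" to the original path. The class and method names are hypothetical.

```csharp
using System;
using System.IO;

public static class LineFlagUpdater
{
    // Rewrites the file at path, changing "<keyword>, 0" to "<keyword>, 1".
    // The original is kept as path + ".old".
    public static void MarkAsRead(string path, string keyword)
    {
        string temp = Path.GetTempFileName();
        try
        {
            using (var reader = new StreamReader(path))
            using (var writer = new StreamWriter(temp))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string[] parts = line.Split(',');
                    bool isTarget = parts.Length == 2
                        && parts[0].Trim().Equals(keyword, StringComparison.OrdinalIgnoreCase)
                        && parts[1].Trim() == "0";
                    writer.WriteLine(isTarget ? parts[0] + ", 1" : line);
                }
            }

            string backup = path + ".old";
            File.Delete(backup);      // does nothing if there is no old backup
            File.Move(path, backup);  // keep the original
            File.Move(temp, path);    // put the processed copy in its place
        }
        catch
        {
            File.Delete(temp);        // leave the original file untouched
            throw;
        }
    }
}
```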
If you want to do the replacement "in place" (meaning you don't want to use another, temporary, file) then you would do so with a FileStream.
You have a couple of options. You can Read through the file stream until you find the text that you're looking for, then issue a Write. Keep in mind that FileStream works at the byte level, so you'll need to take character encoding into consideration. Encoding.GetString will do the conversion.
Alternatively, you can search for the text, and note its position. Then you can open a FileStream and just Seek to that position. Then you can issue the Write.
This may be the most efficient way, but it's definitely more challenging than the naive option. With the naive implementation, you:
Read the entire file into memory (File.ReadAllText)
Perform the replace (Regex.Replace)
Write it back to disk (File.WriteAllText)
There's no second file, but you are bound by the amount of memory the system has. If you know you're always dealing with small files, then this could be an option. Otherwise, you need to read up on character encoding and file streams.
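For small files, the three naive steps could be sketched as follows. The pattern is my own guess at what "find the line starting with the keyword and flip 0 to 1" means: the keyword must be the first field, matching ignores case, and the names are hypothetical.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class NaiveReplace
{
    // Returns text with "<keyword>, 0" lines changed to "<keyword>, 1".
    public static string MarkAsRead(string text, string keyword)
    {
        // Group 1 keeps the keyword and comma as written; group 2 keeps any
        // trailing whitespace such as the "\r" of a Windows line ending.
        string pattern = @"^(\s*" + Regex.Escape(keyword) + @"\s*,\s*)0(\s*)$";
        return Regex.Replace(text, pattern, "${1}1${2}",
                             RegexOptions.Multiline | RegexOptions.IgnoreCase);
    }

    // Read the whole file, replace in memory, write it back.
    public static void MarkAsReadInFile(string path, string keyword)
    {
        string text = File.ReadAllText(path);
        File.WriteAllText(path, MarkAsRead(text, keyword));
    }
}
```

The braces in "${1}1${2}" matter: written as "$11", the replacement would refer to a group number 11 instead of group 1 followed by the digit 1.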
Here's another SO question on the topic (including sample code): Editing a text file in place through C#
