I've written a quick-and-dirty utility to parse a text file, but in some cases it's writing out a "�" character. My utility reads from a .txt file which contains "records" in this format:
Biography
Title:George F. Kennan: An American Life
Author:John Lewis Gaddis
Kindle: B0054TVO1G
Hardcover: B007R93I1U
Paperback: 0143122150
Image link: <img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...and writes out lines from that to a CSV file such as:
Biography,"George F. Kennan: An American Life","John Lewis Gaddis",B0054TVO1G,B007R93I1U,0143122150,<img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...but in several cases, as mentioned, that weird character is being appended to an author's name. In most cases where this happens, it corresponds to what appears to be a space character in the .txt file. I'm trimming the author's name before writing it out to the CSV file, though, so it's obviously not being treated as a space.
When I save the text file containing these characters, I get the message about non-Unicode characters, etc.
What could be the cause of that? And better yet, how can I delete them with a search-and-replace operation? In Notepad, searching for them finds nothing, so I have to delete them one by one.
Prior to being in the .txt file, this data was in an Open Office/.odt file, if that means anything to anyone.
BTW, I have no idea how that "stackoverflow" got into the href above; it's not in the original text I pasted in...
UPDATE
I am curious how that character got into my files. I sure didn't put it there (deliberately), any more than I added the "stackoverflow" to the URL above. Could a call to Environment.NewLine have added it?
Here was my process:
1) Copy and paste info from the interwebs into an Open Office/.odt file
2) Copy and paste that into a text (Notepad) file
3) Open that text file programmatically and loop through it, writing to a new "csv"/.txt file.
UPDATE 2
Silly me - all I had to do was save the file (which wouldn't save those weird characters), then open it again. IOW, when I opened it today (at home, after work) those were gone.
UPDATE 3
I wrote too soon - it replaced the weird character with a question mark (a "normal" one, not a stylized one).
They are almost certainly non-breaking spaces, U+00A0 (although there are other fixed-width space characters which are also possible.) These won't be trimmed as spaces, but will be rendered as spaces if the encoding of the file matches the encoding of the output device.
My guess is that your text file is in CP-1252 (i.e., Windows default one-byte coding) but your output is being rendered as though it were UTF-8.
Normally you would type these characters as AltGr+Space. You might try that with Notepad, but no guarantees.
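If you would rather strip the character in your utility than hunt it down in Notepad, here is a minimal sketch, assuming the culprit really is U+00A0 (the CleanField helper and the sample author value are just illustrations). Explicitly replacing the character works regardless of whether your trim call treats it as whitespace:

using System;

class Cleaner
{
    // Replace non-breaking spaces (U+00A0) with ordinary spaces, then trim.
    static string CleanField(string field)
    {
        return field.Replace('\u00A0', ' ').Trim();
    }

    static void Main()
    {
        string author = "John Lewis Gaddis\u00A0";           // trailing non-breaking space
        Console.WriteLine("[" + CleanField(author) + "]");   // prints [John Lewis Gaddis]
    }
}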
Related
Why does the following code
string filePath = @"C:\test\upload.pdf";
FileStream fs = File.OpenRead(filePath);
raise the following exception?
The filename, directory name, or volume label syntax is incorrect :
'C:\ConsoleApp\bin\Debug\netcoreapp2.1\C:\test\upload.pdf'
Where does the C:\ConsoleApp\bin\Debug\netcoreapp2.1\ directory come from?
Update:
The File.OpenRead() call in my case lives inside a DLL, and the filePath (C:\test\upload.pdf) is passed in by the application that uses the DLL.
The string starts with an invisible character, so it's not a valid path. This
(int)@"C:\test\upload.pdf"[0]
returns
8234
or hex 202A. That's the LEFT-TO-RIGHT EMBEDDING control character.
UPDATE
Raymond Chen posted an article Why is there an invisible U+202A at the start of my file name?.
We saw some time ago that you can, as a last resort, insert the character U+202B (RIGHT-TO-LEFT EMBEDDING) to force text to be interpreted as right-to-left. The converse character is U+202A (LEFT-TO-RIGHT EMBEDDING), which forces text to be interpreted as left-to-right.
The Security dialog box inserts that control character in the file name field in order to ensure that the path components are interpreted in the expected manner. Unfortunately, it also means that if you try to copy the text out of the dialog box, the Unicode formatting control character comes along for a ride. Since the character is normally invisible, it can create all sorts of silent confusion.
(We’re lucky that the confusion was quickly detected by Notepad and the command prompt. But imagine if you had pasted the path into the source code to a C program!)
In the four years since that article, Notepad has gained UTF-8 support, so the character is no longer replaced by a question mark. Pasting into the current Windows console, with its incomplete UTF-8 support, still replaces the character.
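If you cannot control where the path comes from, one workaround (a sketch, not something from the original answer) is to strip Unicode format characters such as U+202A from the string before using it:

using System;
using System.Globalization;
using System.Linq;

class PathSanitizer
{
    // Remove Unicode "format" characters (category Cf), which includes
    // U+202A LEFT-TO-RIGHT EMBEDDING, from a path string.
    static string StripFormatChars(string path) =>
        new string(path.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format).ToArray());

    static void Main()
    {
        // Simulates a path pasted from the Security dialog, with the invisible prefix.
        string filePath = "\u202A" + @"C:\test\upload.pdf";

        Console.WriteLine((int)filePath[0]);   // 8234 (0x202A)
        string clean = StripFormatChars(filePath);
        Console.WriteLine((int)clean[0]);      // 67 ('C')

        // The cleaned string is now safe to pass to File.OpenRead(clean).
    }
}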
The File.OpenRead() in my case exists within a DLL.
Set Copy Local = true in the properties of the DLL in which File.OpenRead exists.
I need to be able to extract the full file path out of this string (without whatever is after the file extension):
$/FilePath/FilePath/KeepsGoing/Folder/Script.sql (CS: 123456)
A simple solution such as the following would work for this case; however, it is limited to file extensions with exactly 3 characters:
(\$.*\..{3})
However, I find problems with this when the file contains multiple dots:
$/FilePath/FilePath/File.Setup.Task.exe.config (CS: 123456)
I need to be able to capture the full file path (from $ to the end of whatever the file extension is, which can be any number of things). I need to be able to get this no matter how many dots are in the name of the file. In some cases there are spaces in the name of the file too, so I need to be able to incorporate that.
Edit: The ending (CS....) in this case is not standard. All kinds of stuff can follow the path so I cannot predict what will come after the path, but the path will always be first. Sometimes spaces do exist in the file name.
Any suggestions?
Try this:
(\$.*\.[\w.-]+)
But! It will not properly match files with spaces or special chars in the file extension. If you need to match files that might have special chars in the file extension, you'll need to elaborate on the input (is it quoted? is it escaped?).
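For what it's worth, here is a small sketch of that pattern in C# against the two sample paths from the question (the Demo class is just for illustration):

using System;
using System.Text.RegularExpressions;

class Demo
{
    static void Main()
    {
        var pattern = new Regex(@"(\$.*\.[\w.-]+)");
        string[] inputs =
        {
            "$/FilePath/FilePath/KeepsGoing/Folder/Script.sql (CS: 123456)",
            "$/FilePath/FilePath/File.Setup.Task.exe.config (CS: 123456)"
        };

        foreach (string input in inputs)
        {
            // Group 1 holds the path; the match stops at the first character
            // after the extension that isn't a word character, dot, or hyphen.
            Console.WriteLine(pattern.Match(input).Groups[1].Value);
        }
    }
}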
I've seen questions where the two characters are the same, but nothing that relates to this specific question, so here goes.
I'm running a C# console app that reads an input file of variable-length records. Each record has variable-length fields. I've got everything working in terms of parsing out each individual field within each record; not a problem. Except that today I came across the Ã± character sequence in the input file. Now I know this translates to ñ, so I'm OK with it. However, because the input file sees Ã± as 2 characters, the record length changes in the C# app, since the app is interpreting those 2 characters as a single ñ. This causes my record length to change from 154 characters to 153, and then, during the parsing, messes up the individual fields.
I'm OK with the ñ character being stored in my DB. But my question is this.
Prior to parsing the fields out of the record, how can I go about easily (without checking every single character) detecting that the ñ exists and triggering a change in the parsing logic? Should I simply do an IndexOf on the character and code it that way? I would think that would add a bit of overhead if I had to put that logic on every single field, although it seems like the easiest way. I would think there's a better way to handle it overall, but I've not encountered this before. Most of the posts I have found are more about handling the Ã± character in text, as opposed to text being converted (properly) from Ã± to ñ.
Ideas?
the streamreader open I am using is as follows:
System.IO.StreamReader concatenatedFile = new System.IO.StreamReader(@"c:\Testing\test.txt", System.Text.Encoding.UTF8);
The record length changes from 154 characters on the input to 153 interpreted characters.
You must always read a text file in the encoding it was written in. Of course, sometimes you don't know which encoding that was...
Think of the input file as a stream of bytes. Most are 1 byte = 1 ASCII character, but there are 2 bytes (probably) that can be interpreted differently depending on the encoding:
UTF-8 - 1 character, ñ
some other (single-byte ANSI) encoding - 2 characters, Ã±
Since you say "the input file sees Ã± as 2 characters", that second interpretation is probably the encoding intended by whoever produces the file.
So, you should find out which encoding was originally meant, and use that - it's probably some ANSI encoding. You could try System.Text.Encoding.Default, but beware that this changes on different machines, so your code will now depend on the machine's default encoding.
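Here is a small sketch of the difference, assuming the two bytes in question are 0xC3 0xB1 and the intended ANSI code page is Windows-1252:

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, Windows-1252 needs the System.Text.Encoding.CodePages
        // package plus Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        byte[] bytes = { 0xC3, 0xB1 };   // UTF-8 byte sequence for ñ

        string asUtf8 = Encoding.UTF8.GetString(bytes);              // "ñ"  - 1 character
        string asAnsi = Encoding.GetEncoding(1252).GetString(bytes); // "Ã±" - 2 characters

        Console.WriteLine(asUtf8.Length);   // 1
        Console.WriteLine(asAnsi.Length);   // 2

        // To keep the 154-character record length, open the file with the encoding
        // it was actually written in, e.g.:
        // var reader = new StreamReader(@"c:\Testing\test.txt", Encoding.GetEncoding(1252));
    }
}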
You should set the StreamReader you use to read your input file to UTF-8 encoding. I don't believe for a second the original input was meant to be Ã±, so why do you care how many bytes the original input was - you care about character length, right?
Refer to this article to understand what's what in text encoding: http://www.joelonsoftware.com/articles/Unicode.html .
I have an Excel file which contains some data. When I save that file as CSV, some weird ? marks appear at the start and end of the text. Will anyone please tell me how I can resolve this issue?
?XXXXXX-XXX?
Here is the link to download the Excel file: http://www.filedropper.com/book1_5
In this file, in the column C you've got following data:
"0000468750-IN"
"0000468750-IN"
"0000843576AB"
"0000843576AB"
It is not really visible now, but at the start and end of every number there is an additional invisible whitespace character. You can see it for yourself: just edit that cell and move through the text with the arrow keys - the cursor will make a little pause when moving over that invisible character. If I replace it with an underscore, it looks like this:
"_0000468750-IN_"
"_0000468750-IN_"
"_0000843576AB_"
"_0000843576AB_"
If my text editor doesn't cheat on me, that character has code 0x00, and it's called null-character.
When converting to CSV, Excel didn't know what to do with that character. CSV is a textfile and must follow some encoding rules. For example, if you saved it as CSV/ANSI, then it's not possible to store some Unicode characters like ąęćżń. Similarly, it's usually not possible to store a 0x00 character in a textfile at all, because this character is special in most encodings. With this character inside, such textfile could be detected as "binary file" by readers and rejected.
Excel simply replaced that odd character with a "?" to make the data safe for the CSV format. Rather than silently erasing the 0x00 character, Excel used the "?" to let you know that there was something odd in the original data.
It's very strange to see it in textual data. If this XLSX was generated by a computer program, it might indicate that this program has some bugs. I highly doubt this file was created manually; it's really hard to type a 0x00 character by hand. One way I can think of that you could get this manually is by using a crappy barcode reader and scanning the codes right into the Excel sheet. Barcode-scanning software sometimes leaks control characters into the text data stream. If that's the case, change the reader or write a filter that will cut those characters out.
Btw, you should be able to just find & replace all of those strange characters. Edit one of the cells (F2 key), go to the end of the text (END key), select the LAST character of the text (Shift + LeftArrow, ONCE), copy that character (Ctrl + C), then open the Find & Replace window (Ctrl + H), paste that character into "Find", and press "Replace All".
On my Excel this resulted in finding/replacing 8 such characters, so it works.
Note that after the END key you must press Shift + LeftArrow exactly ONCE. The cursor will not appear to move and nothing will seem to happen - no selection will show up. That's because the character is invisible. But it is there, and it will be selected and copied.
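If you would rather clean the data in code before or after the CSV export, here is a rough sketch in C# (the file names are placeholders) that strips control characters such as 0x00 from every line:

using System.IO;
using System.Linq;

class CsvCleaner
{
    static void Main()
    {
        // char.IsControl covers 0x00 and the other C0/C1 control characters
        // (note it would also remove tabs, if any were present).
        var cleanedLines = File.ReadLines("book1.csv")
            .Select(line => new string(line.Where(c => !char.IsControl(c)).ToArray()));

        File.WriteAllLines("book1_clean.csv", cleanedLines);
    }
}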
I am looking for a way to search a comma-separated txt file for a keyword, and then replace another keyword on that exact line. For example, if I have the following line in a big txt file:
Help, 0
I want to find this line in the txt file (by telling the program to look for the first word, 'Help') and replace the 0 with a 1 to indicate that I have read it once, so it looks like:
Help, 1
Thanks
It is generally a very bad idea to try and overwrite data in the same file: if your code throws an exception, you'll be left with a partially processed file; if your search target and replacement value have different lengths, you have to re-write the rest of the file. Note that these don't apply in your specific situation - but it's best not to let it become habit.
My recommendation:
Open both the input file and a temporary file (Path.GetTempFileName)
Process each line and write it to the temporary file (StreamReader.ReadLine)
When finished with no errors, rename the original file to something like origFile.old
Rename the temporary file to the original file name.
If something goes wrong, delete the temporary file and exit. This way the original file is left intact in the event of an error.
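A minimal sketch of that temp-file approach, using the "Help, 0" line from the question (data.txt is a placeholder name, and error handling is kept to a minimum):

using System;
using System.IO;

class ReplaceInFile
{
    static void Main()
    {
        string original = "data.txt";            // the big comma-separated file
        string temp = Path.GetTempFileName();

        using (var reader = new StreamReader(original))
        using (var writer = new StreamWriter(temp))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // If the line starts with the keyword, rewrite the counter.
                if (line.StartsWith("Help,", StringComparison.OrdinalIgnoreCase))
                    line = "Help, 1";
                writer.WriteLine(line);
            }
        }

        // Keep the original as a backup, then move the temp file into place
        // (assumes no .old backup already exists).
        File.Move(original, original + ".old");
        File.Move(temp, original);
    }
}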
If you want to do the replacement "in place" (meaning you don't want to use another, temporary, file) then you would do so with a FileStream.
You have a couple of options, you can Read through the file stream until you find the text that you're looking for, then issue a Write. Keep in mind that FileStream works at the byte level, so you'll need to take character encoding into consideration. Encoding.GetString will do the conversion.
Alternatively, you can search for the text, and note its position. Then you can open a FileStream and just Seek to that position. Then you can issue the Write.
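Here is a rough sketch of that seek-and-overwrite idea. It only works this cleanly because "0" and "1" are the same length, and it assumes a single-byte encoding such as ASCII so that character offsets equal byte offsets (the file name is a placeholder):

using System;
using System.IO;
using System.Text;

class OverwriteInPlace
{
    static void Main()
    {
        string path = "data.txt";

        // Locate the '0' that follows the "Help," keyword.
        string text = File.ReadAllText(path, Encoding.ASCII);
        int lineStart = text.IndexOf("Help,", StringComparison.OrdinalIgnoreCase);
        if (lineStart < 0) return;
        int valuePos = text.IndexOf('0', lineStart);
        if (valuePos < 0) return;

        // Overwrite that single byte in place.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            fs.Seek(valuePos, SeekOrigin.Begin);
            fs.Write(Encoding.ASCII.GetBytes("1"), 0, 1);
        }
    }
}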
This may be the most efficient way, but it's definitely more challenging than the naive option. With the naive implementation, you:
Read the entire file into memory (File.ReadAllText)
Perform the replace (Regex.Replace)
Write it back to disk (File.WriteAllText)
There's no second file, but you are bound by the amount of memory the system has. If you know you're always dealing with small files, then this could be an option. Otherwise, you need to read up on character encoding and file streams.
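For completeness, a sketch of that naive version, again using the "Help" keyword from the question and a placeholder file name:

using System.IO;
using System.Text.RegularExpressions;

class NaiveReplace
{
    static void Main()
    {
        string path = "data.txt";

        // Read the whole file, flip "Help, 0" to "Help, 1", and write it back.
        string text = File.ReadAllText(path);
        text = Regex.Replace(text, @"^Help,\s*0", "Help, 1",
                             RegexOptions.Multiline | RegexOptions.IgnoreCase);
        File.WriteAllText(path, text);
    }
}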
Here's another SO question on the topic (including sample code): Editing a text file in place through C#