Save text with emoji to file become '?' - c#

What I have to do
I have to create a text file (.txt, .doc, ...) with the exact text passed (so with emojis) by a .net WebaApi (and attach it to an email).
Situation:
I have a project with .net webapi. One of my routes consist of creating a text file and attach it to an email, with some text passed by a device that may contain emojis.
I can't figure out how to save emojis correctly. If I copy-paste an emoji into a word or notepad file it works, but if I save it through my code it doesn't. I suppose it is due to formatting, but I tried Unicode, UTF-32, UTF-8, ASCII,...
I tried many solutions found here on SO, but none of them worked for me.
For example this emoji (copy-pasted from .net debugger) --> 🎶 is converted into quotation mark or ¶ó, based on encoding used.
How can I save emoji as text into a file so that they can be read by the receivers?
This is what I've done:
//smsText is a string containing emojis
byte[] bytes = Encoding.Unicode.GetBytes(smsText);
Attachment attachment = new Attachment(new MemoryStream(bytes), tokenKey + ".doc");
attachment.ContentType = new ContentType("application/ms-word");
List <Attachment> attachments = new List<Attachment>();
attachments.Add(attachment);
//send email with attachments
Note that smsText, with debugger, contains the 🎶 correctly displayed.
The email correctly reach the receiver, with the .doc attachment, but the attachment doesn't contains the emojis

Your smsText contains a plaintext string. You can't just write that string into a stream or file that you then call a Word file*.
Word files are binary files with a specific format. You need to use a library that can write this format, or use Interop to interoperate with an existing Word installation.
See for example Free library to MS Word.
And if you're fine with plaintext files, just write the text's bytes to a stream and propagate the appropriate encoding (in this case Unicode, being UTF-16 on .NET).
*: yes you can, just like that Excel tries its best to format an HTML table as an Excel document, but you shouldn't.

Related

How to set the BOM for a file being read

I have been having issues reading a file that contains a mix of Arabic and Western text. I read the file into a TextBox as follows:
tbx1.Text = File.ReadAllText(fileName.Text, Encoding.UTF8);
No matter what value I tried instead of "Encoding.UTF8" I got garbled characters displayed in place of the Arabic. The western text was displayed fine.
I thought it might have been an issue with the way the TextBox was defined, but on start up I write some mixed Western/Arabic text to the textbox and this displays fine:
tbx1.Text = "Start السلا عليكم" + Environment.NewLine + "Here";
Then I opened Notepad and copied the above text into it, then saved the file, at which point Notepad save dialogue asked for which encoding to use.
I then presented the saved file to my code and it displayed all the content correctly.
I examined the file and found 3 binary bytes at the beginning (not visible in Notepad):
The 3 bytes, I subsequently found through research represent the BOM, and this enables the C# "File.ReadAllText(fileName.Text, Encoding.UTF8);" to read/display the data as desired.
What puzzles me is specifying the " Encoding.UTF8" value should take care of this.
The only way I can think is to code up a step to add this data to a copy of teh file, then process that file. But this seems rather long-winded. Just wondering if there is a better way to do, or why the Encoding.UTF8 is not yielding the desired result.
Edit:
Still no luck despite trying the suggestion in the answer.
I cut the test data down to containing just Arabic as follows:
Code as follows:
FileStream fs = new FileStream(fileName.Text, FileMode.Open);
StreamReader sr = new StreamReader(fs, Encoding.UTF8, false);
tbx1.Text = sr.ReadToEnd();
sr.Close();
fs.Close();
Tried with both "true" and "false" on the 2nd line, but both give the same result.
If I open the file in Notepad++, and specify the Arabic ISO-8859-6 Character set it displays fine.
Here is what is looks like in Notepad++ (and what I would liek the textbox to display):
Not sure if the issue is in the reading from file, or the writing to the textbox.
I will try inspecting the data post read to see. But at the moment, I'm puzzled.
The StreamReader class has a constructor that will take care of testing for the BOM for you:
using (var stream = new FileStream(fileName.Text, FileAccess.Read))
{
using (var sr = new StreamReader(stream, Encoding.UTF8, true))
{
var text = sr.ReadToEnd();
}
}
The final true parameter is detectEncodingFromByteOrderMark:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes:
UTF-8
little-endian Unicode
and big-endian Unicode text
if the file
starts with the appropriate byte order marks. Otherwise, the
user-provided encoding is used. See the Encoding.GetPreamble method
for more information.

C# Web scraper copying text

I have a web scraper written in C# for extracting data. I want to copy text from the web browser control and paste it into a Word file programmatically. When I try to extract rich text box content using its ID and InnerText, the text contains encoded characters like %2c.
I need to get the text with all formatting but I can't find any way. I have tried Encoding, HTTPUtility.UrlDecode, SendKeys and elem.InvokeMember() without success.
How can I programmatically copy and paste text from web browser control preserving formatting?
Here is the sample data to extract:
Description
The Advance Concepts Engineering team designs and develops new vehicles which will meet future regulatory requirements and customer competitive requirements. A qualified candidate will be responsible for the total vehicle packaging. The candidate will identify and resolve adaptation and packaging issues as the vehicle moves toward production. They will lead cross functional team meetings working with Systems & Components, Advance Manufacturing, Service, etc. to ensure that the solutions are optimized for all stages of the vehicle's life.
HtmlElement elem = wb.Document.GetElementById("ctl00_contplhDynamic_txtDescrContentHiddenTextarea");
if (elem == null) return;
elem.InvokeMember("Click");
//elem.InvokeMember("Select All");
//elem.InvokeMember("Copy");
SendKeys.SendWait("^a");
SendKeys.SendWait("^c");
Clipboard.Clear();
elem.Focus();
elem.InvokeMember("Right Click");
elem.InvokeMember("Select All");
elem.InvokeMember("Copy");
Clipboard.SetText(elem.InnerText);
string clipbrdText = Clipboard.GetText();
string data = elem.InnerText;
richTextBox1.Text = data;
string temp = System.Web.HttpUtility.UrlDecode(data);
Encoding iso = Encoding.GetEncoding("windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(data);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);
The text with "%2c" etc has been encoded. If you are getting the content of a web page, you are decoding the HTML, not the URL. You can use HttpUtility.HtmlDecode, or if you are using .NET 4.0 or above you can also use WebUtility.HtmlDecode - this is available within the System.Net namespace.
You should note that Word does not use HTML for its formatting, so you won't be able to paste HTML tags and expect it to recognise them. i.e. <strong>Description</strong> will not result in bold text if you type that into Word.
EDIT:
It looks like you are mixing two different ways to copy the text in the code you pasted - both SendKeys.SendWait("^c"); and elem.InvokeMember("Copy");. I presume both of these methods work?
I think the problem you are having lies in the way you are getting the text. I see you're using Clipboard.GetText() to get the text. Try specifying that it is formatted text using Clipboard.GetText(TextDataFormat.Rtf) or Clipboard.GetText(TextDataFormat.Html). This should hopefully copy the string preserving the formatting.

How to get the contents of attachments of a file

I am trying to get the content of attachment. It may be an excel file, Document file or text file whatever it is but I want to store it in database so here I am using this code: -
foreach (FileAttachment file in em.Attachments)// Here em is type of EmailMessage class
{
Console.Write("Hello friends" + file.Name);
file.Load();
var stream = new System.IO.MemoryStream(file.Content);
var reader = new System.IO.StreamReader(stream, UTF8Encoding.UTF8);
var text = reader.ReadToEnd();
reader.Close();
Console.Write("Text Document" + text);
}
So By printing file.name is showing attachment file name but while printing 'text' on the console it is working if the attachment is .txt type but if it is .doc or .xls type then it is showing some symbolic result. I am not getting any text result. Am I doing something wrong or missing something. I want text result of any kind of file attachment . Please help me , I am beginner in C#
What you are seeing is what is actually in the file. Try opening one with Notepad.
There is no built-in way in .NET to show the "text contents" of arbitrary file formats. You'll have to create (preferably using third-party libraries that already solve this problem) some kind of logic that extracts plaintext from rich text documents.
See for example How to extract text from Pdf, Word and Excel documents?, Extract text from pdf and word files, and so on.
First, what do you expect when reading a binary file?
Your result is exactly what is expected. A text file can be shown as a string, but a doc or xls file is a binary file. You will see the binary content of the file. You will need to use a tool/lib to get the text/content from a binary file in human readable format.
TXT type is simple,DOC or XLS are much more complex.You can see TXT because is just text,DOC or XLS or PPT or something else needs to be interpreted by other mechanism.
See,for example,you have different colors or font sizes on a Word document,or a chart in an Excel document,how can you show that in a simple TextBox or RichTextBox?Short answer,you can't.

Parsing emails for TIFF attachments in C#

I built an email parser that extracts TIFF attachments from emails sent by 2 different fax providers, RingCentral and eFax.
The application uses Pop3 to retrieve the email as a text stream and then parse the text to identify the section that represents the Tiff image.
By converting that section of text to a byte array and using a BinaryWriter, I'm able to create the TIFF file on my local hard drive.
public void SaveToFile(string filepath)
{
BinaryWriter bw = new BinaryWriter(new FileStream(filepath, FileMode.Create));
bw.Write(this.Data);
bw.Flush();
bw.Close();
}
The issue is that the eFax email attachments cause runtime errors when converting the text to a byte array.
//_data is a byte array
//RawData is a string
_data = Convert.FromBase64String(RawData); //fails on this line
I get the following error:
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters.
I assume it has something to do with the encoding/decoding of the string, but I've tried various encoding types and still get the error.
Some additional information:
Programming Language: C#
Email Host: GMail
If I manually forward the email back to myself, the parser works, but will not work against the original.
I even tried auto-forwarding in GMail but this did not work.
I'm responding here to the first comment below and thanks for your response.
The TIFF file is created by taking the section of text from the email that is associated to the TIFF file attachment, converting it to a byte array, and saving the file with a .tiff file extension. This works fine for all RingCentral emails. For example, the RingCentral email section header looks like this:
------=_NextPart_3327195283162919167883
Content-Type: image/tiff; name="18307730038-0803-141603-326.tif"
Content-Transfer-Encoding: base64
Content-Description: 18307730038-0803-141603-326.tif
Content-Disposition: attachment; filename="18307730038-0803-141603-326.tif"
Please note the Content-Transfer-Encoding value of base64. This explains why I use the following C# conversion code:
_data = Convert.FromBase64String(tiffEmailString);
_data is the private variable and is used as the return value in the SaveToFile method above (i.e. _data is returned when this.Data property value was used).
Now for the eFax (the email the fails) section header:
Content-Type: image/tiff; name=FAX_20130802_1375447833_61.tif
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="FAX_20130802_1375447833_61.tif"
Content-MD5: 1B2M2Y8AsgTpgAmY7PhCfg==
It too shows base64. So shouldn't Convert.FromBase64String() method call work?
I'm also going to check whether my parser is grabbing additional text. But if I'm missing something, please point it out. Thanks.
Latest update:
As it turns out the issue was not the encoding but my parser! I was inadvertently including an additional header value in the attachment text. It's working now. Thanks.

Read UNIX encoded file with C#

I have c# program we use to replace some Values with others, to be used after as parameters. Like 'NAME1' replaced with &1, 'NAME2' with &2, and so on.
The problem is that the data to modify is on a text file encoded on UNIX, and special characters like í, which even on memory, gets read as a square(Invalid char). Due specifications that are out of my control, the file can't be changed and have no other choice than read it like that.
I have tryed to read with most of the 130 Encodings c# offers me with:
EncodingInfo[] info = System.Text.Encoding.GetEncodings();
string text;
for (int a = 0; a < info.Length; ++a)
{
text = File.ReadAllText(fn, info[a].GetEncoding());
File.WriteAllText(fn + a, text, info[a].GetEncoding());
}
fn is the file path to read. Have checked all the made files(like 130), no one of them writes properly the í so im out of ideas and im unable to find anything on internet.
SOLUTION:
Looks like finally this code made the work to get the text properly, also, had to fix the same encoder for the Writing part:
System.Text.Encoding encoding = System.Text.Encoding.GetEncodings()[41].GetEncoding();
String text = File.ReadAllText(fn, encoding); // get file text
// DO ALL THE STUFF I HAD TO
File.WriteAllText(fn, text, encoding) System.Text.Encoding.GetEncodings()[115].GetEncoding(); //Latin 9 (ISO)
/* ALL THIS ENCODINGS WORKED APARENTLY FOR ME WITH ALL WEIRD CHARS I WAS ABLE TO WRITE :P
System.Text.Encoding.GetEncodings()[108].GetEncoding(); //Baltic (ISO)
System.Text.Encoding.GetEncodings()[107].GetEncoding(); //Latin 3 (ISO)
System.Text.Encoding.GetEncodings()[106].GetEncoding(); //Central European (ISO)
System.Text.Encoding.GetEncodings()[105].GetEncoding(); //Western European (ISO)
System.Text.Encoding.GetEncodings()[49].GetEncoding(); //Vietnamese (Windows)
System.Text.Encoding.GetEncodings()[45].GetEncoding(); //Turkish (Windows)
System.Text.Encoding.GetEncodings()[41].GetEncoding(); //Central European (Windows) <-- Used this one
*/
Thank you very much for your help
Noman(1)
you have to get the proper encoding format. try
use file -i. That will output MIME-type information for the file,
which will also include the character-set encoding. I found a
man-page for it, too :)
Or try enca
It can guess and even convert between encodings. Just look at
the man page.
If you have the proper encoding format, look for a way to apply it to your file reading.
Quotes: How to find encoding of a file in Unix via script(s)

Categories

Resources