katex convert to UTF-8 in C# - c#

I have a third-party API that returns mathematical equations in KaTeX form. Is there a way to convert KaTeX into UTF-8 in C#?
Thanks.

No, you can't, and there is a very good reason why.
Let's start with what KaTeX actually is (I'm quoting the katex tag): KaTeX is a fast, easy-to-use JavaScript library for TeX math rendering on the web.
So it is a renderer (written in JavaScript, but that is irrelevant) that transforms TeX commands (and some literal text) into formatted text. Many of those TeX commands print a character (\xi prints ξ, for example). Others control the formatting, similarly to the formatting tags of HTML. The first kind could be extracted and converted: we look for \xi and we print ξ. The second kind we can't convert, in the same way that we can't put "bold" formatting into a C# string. Formatting is part of the presentation.
For example:
\widetilde{\f\relax{x} = \int_{-\infty}^\infty \f\hat\xi\,e^{2 \pi i \xi x}\,d\xi}
is formatted by KaTeX to
If we convert all the TeX commands that simply print a character, we obtain:
f(x)=∫−∞∞f^(ξ) e2πiξx dξ
I don't think that would be very useful. We have lost the formatting, and we have lost the undulated line that was above the equation (the equation was taken from the homepage of the KaTeX project, but I added the \widetilde{} undulated line to make this point). That line isn't a Unicode character; it is drawn with SVG.
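The first kind of command can be handled with a lookup table. A minimal C# sketch, where the symbol table is a tiny illustrative subset (not the real KaTeX symbol list) and real input would also need handling of braces, macros and sub/superscripts:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

Console.WriteLine(TexToUnicode.Convert(@"\int_{-\infty}^\infty e^{2 \pi i \xi x} d\xi"));

static class TexToUnicode
{
    // Illustrative subset only; a real converter needs the full symbol list.
    static readonly Dictionary<string, string> Symbols = new()
    {
        [@"\xi"] = "ξ",
        [@"\pi"] = "π",
        [@"\infty"] = "∞",
        [@"\int"] = "∫",
    };

    // Replace character-printing commands, drop unknown (formatting) commands.
    public static string Convert(string tex) =>
        Regex.Replace(tex, @"\\[a-zA-Z]+", m =>
            Symbols.TryGetValue(m.Value, out var s) ? s : "");
}
```

This recovers only the printable symbols; everything the formatting commands expressed (the tilde, the layout of limits and exponents) is lost, which is exactly the answer's point.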

Why C# Unicode range cover limited range (up to 0xFFFF)?

I'm getting confused about C# UTF8 encoding...
Assuming those "facts" are right:
Unicode is the "protocol" which defines each character.
UTF-8 defines the "implementation": how to store those characters.
Unicode defines the character range from 0x0000 to 0x10FFFF (source)
According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens with the other characters, above 0xFFFF, which are defined in the Unicode protocol.
In contrast to C#, when I use Python to write UTF-8 text, it covers the whole expected range (0x0000 to 0x10FFFF). For example:
u"\U00010000" #WORKING!!!
which doesn't work in C#. What's more, when I write the string u"\U00010000" (a single character) from Python to a text file and then read it from C#, this single-character document becomes 2 characters in C#!
# Python (write):
import codecs
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text)  # len(text) -> 1
// C# (read):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from file.
Console.WriteLine(text.Length); // 2
Why? How to fix?
According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens with the other characters, above 0xFFFF, which are defined in the Unicode protocol.
Unfortunately, a C#/.NET char does not represent a Unicode character.
A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one "UTF-16 code unit". Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.
The less-often-used characters, in the range U+10000 to U+10FFFF, are squeezed into the remaining space 0xD800–0xDFFF by representing each such character as two UTF-16 code units together (a surrogate pair), so the equivalent of the Python string "\U00010000" is the C# string "\uD800\uDC00".
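You can see this directly in C#; a string's Length counts code units, while StringInfo counts text elements:

```csharp
using System;
using System.Globalization;

// U+10000, the first code point outside the BMP, needs two UTF-16 code units.
string s = char.ConvertFromUtf32(0x10000);

Console.WriteLine(s.Length);                               // 2 (code units)
Console.WriteLine(s == "\uD800\uDC00");                    // True: a surrogate pair
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 "character"
```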
Why?
The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. WinNT chose that encoding—at the time thought of as UCS-2 and without any of the pesky surrogate code unit pairs—because in the early days Unicode only had characters up to U+FFFF, and the thinking was that was going to be all anyone was going to need.
How to fix?
There isn't really a good fix. Some other languages that were unfortunate enough to base their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings that operate one code point at a time; but there is no such functionality in .NET at present.
Often you don't actually need to count/find/split/order/etc strings using proper code-point items and indexes. But when you really do, in .NET, you're in for a bad time. You end up re-implementing each normally-trivial method by manually walking over each char and checking whether it is part of a two-char surrogate pair, or by converting the string to an array of code-point ints and back. This isn't a lot of fun either way.
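The manual walk described above looks roughly like this (note that later .NET versions added System.Text.Rune and string.EnumerateRunes, which do the same job):

```csharp
using System;
using System.Collections.Generic;

string sample = "a" + char.ConvertFromUtf32(0x10000) + "b";
foreach (int cp in CodePoints(sample))
    Console.WriteLine($"U+{cp:X4}");

// Walk a string one code point at a time, pairing surrogates manually.
static IEnumerable<int> CodePoints(string text)
{
    for (int i = 0; i < text.Length; i++)
    {
        if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
        {
            yield return char.ConvertToUtf32(text[i], text[i + 1]);
            i++; // consume the low surrogate too
        }
        else
        {
            yield return text[i];
        }
    }
}
```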
A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.
Unicode has so-called planes (wiki).
As you can see, C#'s char type only supports the first plane, plane 0, the basic multilingual plane.
I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised that the char datatype doesn't support code points beyond the first plane. (I haven't run into this issue myself...)
This is an artificial restriction in char's implementation, but one that's understandable. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard did not survive (it had already superseded others). This is just my guess, of course. C# just "uses" UTF-16 for its in-memory representation.
UTF-16 uses a trick to squash code points higher than 0xFFFF into pairs of 16-bit code units, as you can read about here. Technically those code points consist of 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.
You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.

Something like translit.net but on autohotkey

I want to write something like Translit.net, but in AutoHotkey. I have successfully handled the part where there is only one letter:
:*:a::а
:*:b::б
:*:v::в
:*:g::г
:*:d::д
...
But now I have a problem with translating "shh" to "щ" and other two-to-one character translations. When I start typing shh, I get схх back, but I want to get щ. What can I do?
My current idea: when I press a key, the script writes the letter and also adds the untranslated letter to a 3-element buffer, then checks whether the buffer elements form shh, ch, sh, or any other combination longer than one letter. Then I could remove the last 2 or 3 typed letters and send the Russian letter I need. Maybe someone knows an easier way to do that. I want my script to work exactly like the page I posted. A solution in C or C# instead of AutoHotkey would help me too.
I had the same problem while using the Unicode version of AutoHotkey, but only if the file is saved in UTF-8 without BOM format.
Saving the file as UNICODE (UCS-2, must be little-endian) solves the problem.
It also works with UTF-8 with BOM, so apparently AutoHotkey has trouble detecting the encoding on its own.
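For the C# route the question mentions, the two-to-one translations come down to longest-match lookup: try the longest key first so "shh" wins over "s". A sketch, with an illustrative subset of the mapping table:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

Console.WriteLine(Translit.Convert("shh"));

static class Translit
{
    // Illustrative subset of the translit.net table.
    static readonly Dictionary<string, string> Map = new()
    {
        ["shh"] = "щ", ["sh"] = "ш", ["ch"] = "ч",
        ["a"] = "а", ["b"] = "б", ["v"] = "в",
        ["g"] = "г", ["d"] = "д", ["s"] = "с", ["h"] = "х",
    };
    static readonly int MaxKey = Map.Keys.Max(k => k.Length);

    public static string Convert(string latin)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < latin.Length)
        {
            // Try the longest possible key first, shrinking until a match.
            int len = Math.Min(MaxKey, latin.Length - i);
            for (; len > 1; len--)
                if (Map.ContainsKey(latin.Substring(i, len))) break;
            string key = latin.Substring(i, len);
            sb.Append(Map.TryGetValue(key, out var ru) ? ru : key);
            i += len;
        }
        return sb.ToString();
    }
}
```

The same longest-match idea is what the 3-element buffer in the question implements incrementally, one keystroke at a time.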

C# / Python Encoding difference

Basically I am converting PDFs into text, then analyzing and clipping parts of that text using a library in Python. The Python "clipping" doesn't actually cut the text into separate files; it just produces a start and end character position for string extraction. For example:
the quick brown fox jumped over the lazy dog
My Python code might cut out "quick" by specifying 4, 9. Then I use C# for a GUI program and try to take these values produced by Python, and it works... for the most part. It appears the optical character recognition program that turned the PDF into a text file included some odd UTF characters, which change the counts on the C# side.
The odd characters from the PDF-to-text conversion include a "fi" ligature character instead of separate "f" and "i" characters (possibly other characters too; they are large files). Now this wouldn't be a problem, except C# says this is one character while Python (as well as Notepad++) counts it as 3 character positions.
C#: "fi" length = 1 character.
Python/Notepad++: "fi" length = 3 characters.
What this ends up doing is giving me an offset clip due to the difference in character counts. Like I said, when I run it in Python (Linux) and output the clipping, it's perfect; and when I transfer the text file to Windows, Notepad++ confirms the positions are correct. C# counts the "fi" as one character while Notepad++ and Python count it as 3 characters for some reason.
I need a way to bridge this discrepancy from the Python side OR the C# side.
You have to distinguish between characters and bytes. UTF-8 is a character encoding in which one character can take up to 4 bytes. So Notepad++ probably displays byte positions, while Python can work with both byte strings and character strings. In C# you have probably read the file as a text file, which also produces character strings.
To read character strings in python use:
import codecs
with codecs.open(filename, encoding="utf-8") as inp:
    text = inp.read()
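On the C# side, the mismatch can be bridged by converting byte offsets to char indexes: decode the UTF-8 prefix up to the byte offset and take its length. A sketch, assuming the positions coming from Python are byte offsets into the UTF-8 file:

```csharp
using System;
using System.Text;

string text = "the ﬁre"; // contains the ligature U+FB01, not "f" + "i"
byte[] utf8 = Encoding.UTF8.GetBytes(text);

// Map a byte offset (as Python/Notepad++ count) to a C# char index.
int CharIndexFromByteOffset(int byteOffset) =>
    Encoding.UTF8.GetString(utf8, 0, byteOffset).Length;

Console.WriteLine(Encoding.UTF8.GetByteCount("ﬁ")); // 3 bytes, but "ﬁ".Length == 1
Console.WriteLine(CharIndexFromByteOffset(7));      // 4 ASCII bytes + 3 ligature bytes -> char index 5
```

Note this assumes offsets always land on character boundaries; an offset in the middle of a multi-byte sequence would decode to a replacement character and throw the count off.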

Right To Left Language Bracket Reversed

I am using a StringBuilder in C# to append some text, which can be English (left-to-right) or Arabic (right-to-left):
stringBuilder.Append("(");
stringBuilder.Append(text);
stringBuilder.Append(") ");
stringBuilder.Append(text);
If text = "A", then output is "(A) A"
But if text = "بتث", then output is "(بتث) بتث"
Any ideas?
This is a well-known flaw in the Windows text rendering engine when it is asked to render right-to-left text, Arabic or Hebrew. It has a difficult problem to solve: people often fall back to Western words and punctuation when there is no good alternative word available in the language (brand and company names, for example). The renderer tries to guess the proper render order by looking at the code points; characters in the Latin character set clearly have to be rendered left-to-right.
But it fumbles on punctuation, with brackets being the most visible case. You have to be explicit so it knows what to do: use the Unicode right-to-left mark, U+200F (\u200f in C# code). Conversely, use the left-to-right mark, U+200E, if you know you need LTR rendering.
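A sketch of the fix in the question's code using the mark (the exact placement depends on the surrounding text, so treat this as a starting point, not the definitive rule):

```csharp
using System;
using System.Text;

const string Rlm = "\u200F"; // Right-to-left mark, U+200F
string text = "بتث";

// Attach the mark to the direction-neutral brackets so they inherit RTL.
var sb = new StringBuilder();
sb.Append(Rlm).Append('(').Append(text).Append(')').Append(Rlm).Append(' ').Append(text);
Console.WriteLine(sb.ToString());
```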
Use AppendFormat instead of just Append:
stringBuilder.AppendFormat("({0}) {0}", text);
This may fix the issue, but it may not; you need to look at the text value. It probably has LTR/RTL marker characters embedded, which need to be either removed or corrected in the value.
I had a similar issue and managed to solve it by creating a function that checks each char's Unicode value. If it is from page FE (the Arabic Presentation Forms block), I add U+202C (Pop Directional Formatting) after it, as shown below. Without this, RTL and LTR got mixed for what I wanted.
string us = "\uFE9E\u202C\uFE98\u202C\uFEB8\u202C\uFEC6\u202C\uFEEB\u202C\u0020\u0660\u0662\u0664\u0668 Aa1";

Outputting Programmatically to MSword; sensing end of line

I'm trying to use the MS Word Interop library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need the words to stay on the same line, without wrapping, which is not the default behavior. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know the text has wrapped, and you can remove the last character or reformat accordingly.
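The loop behind that suggestion looks roughly like this. This is an untested sketch: it assumes a reference to Microsoft.Office.Interop.Word, an already-open document, and the interop names shown (Range.InsertAfter, Range.ComputeStatistics, Range.Characters):

```csharp
using Word = Microsoft.Office.Interop.Word;

static class WrapDetector
{
    // Append characters one at a time and watch the line count; when it
    // increases, the last character caused a wrap, so remove it and stop.
    public static void AppendWithoutWrapping(Word.Range range, string letters)
    {
        int lines = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
        foreach (char c in letters)
        {
            range.InsertAfter(c.ToString());
            int newLines = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
            if (newLines > lines)
            {
                range.Characters.Last.Delete(); // undo the character that wrapped
                break;                          // or start a new line/paragraph here
            }
            lines = newLines;
        }
    }
}
```

Recomputing statistics after every character is slow on long runs; batching a few characters between checks and bisecting when the count jumps would be cheaper.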
Actually I don't know how this behaves today, but I once wrote something against the MS Word API and ran into a somewhat weird fact: you can't find that out. In MS Word, text in a document is always in paragraphs.
If you input text into your document, you won't get it on a page only; the page will at least contain a paragraph for the text you wrote into it.
Unfortunately I can't verify this again, because I don't have an MS Word license these days.
Give it a try and look at the problem again this way.
Hope this helps; if not, please provide the code that generates the input and the exact version of MS Word.
Greetings,
Kjellski
I'm not sure exactly what "Arabic letters of the word isolated with spaces" means, but I assume a non-breaking space (U+00A0) is what you need.
Here's more details.
