Unicode string display in XNA - c#

In the game I'm making, I'd like to be able to display and have the user input Unicode characters. However, I'm having problems with using SpriteFonts to handle this task. Including all of the Unicode characters uses up WAY too many resources (it even causes VS2010 to crash!), so that's out of the question. But I'm not sure what other options I may have.
I know there are ways to dynamically load Unicode code points on an as-needed basis, but these methods seem to be geared towards string tables and other static text. All of my text is provided by the user, so a static approach wouldn't work here. Any ideas/help?

The XNA Forums are a good place to ask XNA questions. Here is a question similar to yours:
http://forums.xna.com/forums/p/3302/16475.aspx#16475
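For user-supplied text, one possible shape for the "as-needed" idea from the question is a small cache that rasterizes a glyph the first time a character appears and evicts the least recently used one when it is full. The sketch below is in Python only to keep it short; it is not XNA code. `GlyphCache` and the `rasterize` callback are hypothetical names, and in a real game the callback would be whatever draws a single character into a texture.

```python
from collections import OrderedDict


class GlyphCache:
    """Keeps at most `capacity` rasterized glyphs, evicting the least recently used."""

    def __init__(self, rasterize, capacity=256):
        self._rasterize = rasterize   # hypothetical callback: one character -> glyph
        self._capacity = capacity
        self._glyphs = OrderedDict()  # insertion order doubles as usage order

    def get(self, char):
        if char in self._glyphs:
            self._glyphs.move_to_end(char)    # mark as most recently used
            return self._glyphs[char]
        glyph = self._rasterize(char)         # first use: render on demand
        self._glyphs[char] = glyph
        if len(self._glyphs) > self._capacity:
            self._glyphs.popitem(last=False)  # drop the least recently used glyph
        return glyph

    def layout(self, text):
        """Return the glyphs for a whole string, rasterizing only unseen characters."""
        return [self.get(ch) for ch in text]
```

Only characters the user actually types ever get rendered, so memory use is bounded by `capacity` rather than by the size of Unicode.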

Related

Math equation to image

I'm developing a Windows Phone application that presents different kinds of math problems, such as hard equations. I want the app to look polished, so obviously I can't use * (the asterisk) as the multiplication symbol or ^ (the caret) as the exponent symbol. What I want instead is to render equations as nicely typeset images. I mean like this:
I know that there are several ways to describe math, including LaTeX, MathML and many more, and I know that there are many JavaScript-based strategies for creating images like this, but I still haven't found one that is compatible with Windows Phone 7 and C#. I also don't want anything server-based; the conversion from equation to image should be done on the client.
If you have any suggestions, please leave an answer.
So the "*" becomes a "."? Not sure whether that's cooler or not but you still don't need an image generator, a nice font will do.
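To illustrate the "a nice font will do" idea: simple expressions can be prettified by swapping ASCII operators for Unicode characters, with no image generation. This is a minimal Python sketch under that assumption; `prettify` is a made-up helper, it only handles plain numeric exponents, and the device font must contain the superscript and dot glyphs.

```python
import re

# ASCII digits and signs mapped to their Unicode superscript forms.
SUPERSCRIPTS = str.maketrans(
    "0123456789+-",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207a\u207b",
)


def prettify(expr):
    """Replace ^<integer> with superscripts and * with a middle dot."""
    expr = re.sub(
        r"\^(-?\d+)",
        lambda m: m.group(1).translate(SUPERSCRIPTS),
        expr,
    )
    return expr.replace("*", "\u00b7")  # use "\u00d7" instead for a cross


print(prettify("3*x^2 + 2*x^-1"))
```

Anything beyond this (fractions, roots, nested exponents) needs real layout, which is where a renderer for LaTeX or MathML would come in.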

Winform character spacing

I am trying to use Graphics.DrawString and TextRenderer.DrawText to lay out strings with a variable number of characters in a fixed rectangle.
However, even using the GDI+ wrapping methods, I am not satisfied with the result: I need to control the font kerning (or character spacing) so that strings with many characters can be packed in.
I read about FontStretches, but I do not know how to use it in WinForms. Another option is Typography.SetKerning, but again I have no idea how to use it.
Can someone help?!
Round 2:
I know it could be hard. The Win32 API has FreeType support, which could be the solution to the issue.
In practice my aim is to do something similar to "http://stackoverflow.com/questions/4582545/kerning-problems-when-drawing-text-character-by-character" in .NET. Note that I am working with pre-formed strings in Arabic, not user character input.
My problem is:
(1) identify which library has the kerning function I want (most probably gdi32.dll), (2) build a safe C# environment for the DLL calls, (3) implement a call to the DLL that works in C#.
Can someone help?
Thank you for answering.
If you look at the documentation, it's quite easy to find out which does what, and how to use it.
The method Typography.SetKerning is a WPF-only thing, so you won't be able to use it in WinForms.
A quick Google search found this article, which shows how to modify kerning values for GDI text.
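Whatever API ends up applying the spacing, the arithmetic for squeezing a run of text into a fixed width is the same. The Python sketch below only computes numbers: `extra_spacing` and `pen_positions` are hypothetical helpers, and the advance widths are assumed to come from whatever text-measuring call you use. As far as I know, gdi32.dll exposes SetTextCharacterExtra, which takes a per-character value like this one, but check its documentation (in particular for negative values) before relying on it. For a cursive script such as Arabic, changing the spacing between characters can break the joins between letters, so treat this as an illustration of the calculation only.

```python
import math


def extra_spacing(advances, target_width):
    """Per-character spacing (zero or negative) so the run fits target_width."""
    natural = sum(advances)
    if not advances or natural <= target_width:
        return 0  # already fits, leave the spacing alone
    # Round the reduction up so the tightened run never exceeds the target.
    return -math.ceil((natural - target_width) / len(advances))


def pen_positions(advances, extra):
    """X offset of each character when `extra` is added after every advance."""
    x, positions = 0, []
    for advance in advances:
        positions.append(x)
        x += advance + extra
    return positions
```

For example, four characters that are each 10 units wide need -2 units of extra spacing per character to fit into 34 units.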

Printing a line instead of "--------"

For a Windows CE project in which we print slips, we have a new request asking whether it is possible to print a solid line instead of printing "-----------" all the way across.
Is this possible without printing an image?
c# / .net 3.5
Thank you
On your desktop run charmap.exe. Tick "Advanced view" and type "box" in the Search box. You'll get the Unicode codepoints that you can use to draw lines and boxes. Copy and paste them into your code. Whether they actually show up properly on your device depends on the font support. Odds are decent since they've been around since the first IBM PC. You'll have to try.
There are extended ASCII values for this (196), but it really depends on the printer.
Or, as quppa comments, use _, but that will not be adequate if you want to box in a title or similar.
Wikipedia has an article on box-drawing characters.
Since ─ (U+2500) didn't work for you, it's unlikely ━ (U+2501) will work either, but it's perhaps worth a shot. There is also no guarantee that there won't be spaces between these characters, given that spaces appear between underscores.
The issue is not Windows CE supporting Unicode but finding a font that you can use that has the box-drawing characters. Given the likely size limitations (fonts with lots of characters are tens of megabytes big), this might be a challenge.
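A small sketch of what the answers above describe, written in Python for brevity: a rule built from U+2500, a boxed title built from the corner and vertical-bar characters, and a check that byte 196 is the same horizontal line in code page 437 (the original IBM PC character set). `rule` and `boxed` are hypothetical helpers, and whether the characters print correctly still depends on the printer and its font.

```python
def rule(width, char="\u2500"):
    """A horizontal line of `width` box-drawing characters."""
    return char * width


def boxed(title):
    """The title surrounded by a single-line box."""
    inner = "\u2500" * (len(title) + 2)
    return "\n".join([
        "\u250c" + inner + "\u2510",         # top border with corners
        "\u2502 " + title + " \u2502",       # title between vertical bars
        "\u2514" + inner + "\u2518",         # bottom border with corners
    ])


# Byte 196 in code page 437 decodes to the same character as U+2500.
CP437_HLINE = bytes([196]).decode("cp437")

print(rule(20))
print(boxed("RECEIPT"))
```

If the printer expects a single-byte code page rather than Unicode, sending byte 196 directly may work where U+2500 did not.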

Tesseract OCR Library - Learning Font

Well, I'm using a compiled .NET version of this OCR engine, which can be found at http://www.pixel-technology.com/freeware/tessnet2/
I have it working. However, the aim is to read license plates, and sadly the engine doesn't recognize some characters accurately. For example, here's an image I scanned to identify the problem characters:
Result:
12345B7B9U
ABCDEFGHIJKLMNUPIJRSTUVHXYZ
Therefore the following characters are being translated incorrectly:
1, O, Q, W
This doesn't seem too bad, however on my license plates, the result isn't so great:
= H4 ODM
= LDH IFW
Fake Test
= NR4 y2k
As you might be able to tell, I've tried noise reduction, increasing contrast, and remove pixels that aren't absolute black, with no real improvements.
Apparently you can 'teach' the engine new fonts, but I think I would need to recompile the library for .NET. It also seems that training is performed on Linux, which I don't have.
http://www.scribd.com/doc/16747664/Tesseract-Trainingfor-Khmer-LanguageFor-Posting
So I'm stuck as to what to try next. I've written a quick console application purely for testing purposes, if anyone wants to try it. If anyone has any ideas, image manipulation tips, or library suggestions, I'd appreciate hearing them.
I used Tesseract via Tessnet2 recently (Tessnet2 is a VS2008 C++ wrapper around Tesseract 2.0 made by Rémy Thomas, if I remember correctly). Let me try to help with the little knowledge I have of this tool:
First, as I said above, this wrapper only covers Tesseract 2.0, while the newest Tesseract version on Google Code is 3.00 (the code is no longer hosted on SourceForge). There are regular contributors, and I saw that version 3.01 or so is planned. So with Tessnet2 you don't benefit from the latest enhancements, including page layout analysis, which may help when your license plates are not perfectly horizontal.
I asked Rémy for a Tessnet2 .NET wrapper around version 3, but he doesn't plan one for now. So, as I did, you'll have to do it yourself!
If you want the latest version of the sources, you can download them from the Subversion repository (everything is described on the dedicated site page). You'll be able to compile them if you have Visual Studio 2008, since the sources contain a VS2008 solution in the vs2008 sub-folder. This solution is made of VS2008 C++ projects, so to get results in C# you'll have to use .NET P/Invoke with the tessDll built by the project. Again, if you need this, I have code examples that may interest you, but you may prefer to stay with C++ and create your own new WinForms projects, for instance!
Once you have managed to compile (there should not be major problems, but tell me if you hit any, I may have hit them too :-) ), the output will include several binaries that let you do a specific training! Again, there is a page dedicated to Tesseract 3 training. Thanks to this training, you can:
restrict your set of characters, which will automatically remove the punctuation ('/-\' instead of 'A', for instance)
indicate the ambiguities you have detected ('D' instead of 'O' as you could see, 'B' instead of '8', etc.) so that they are taken into account when you use your training.
I also saw that Tesseract results are better if you crop the image to the zone where the letters are located (i.e. no face, no landscape around). In my case, I needed to recognize only a specific zone of card photos taken from a webcam, so I used image processing to isolate the zone. That took a long time, of course, but my images came from many different sources, so I had no choice. If you can get images that are already cropped to the minimum, that will be great!
I hope this was of some help. Do not hesitate to send me your remarks and questions!
Hi I've done lots of ocr with tesseract, and I have had some of your problems, too. You ask about IMAGE PROCESSING tools, and I'd recommend "unpaper" (there are windows ports too, see google) That's a nice de-skew, unrotate, remove-borders-and-noise and-so-on program. Great for running before ocr'ing.
If you have a (somewhat) variable background color on your images, I'd recommend the "textcleaner" imagemagick script
I think it's edge detecting and whitening out all non-edgy stuff.
And if you have complex text then "ocropus" could be of use.
Syntax is (on linux): "ocroscript rec-tess "
My setup is
1. textcleaner
2. unpaper
3. ocropus
With these three steps I can read almost anything. Even quite blurry+noisy images taken in uneven lighting, with two columns of tightly packed text comes out very readable. OK maybe your needs aren't that much text, but step 1) & 2) could be of use to you.
I'm currently building a license plate recognition engine for ispy - I got much better results from tesseract when I split the license plate into individual characters and built a new image displayed vertically with white space around them like:
W
4
O
O
M
I think a big problem that tesseract has is it tries to make words out of the horizontal letters and numbers and in the case of license plates with letters and numbers mixed up it will decide that a number is a letter or vice versa. Entering an image with the characters spaced vertically makes it treat them as individual characters instead of text.
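The stacking step from this answer can be sketched without any imaging library by treating each segmented character as a 2D list of grayscale pixels. This is a minimal Python illustration only: `stack_vertically` is a hypothetical helper, the character segmentation is assumed to have happened already, and in practice you would do the same with your bitmap class of choice.

```python
WHITE = 255


def stack_vertically(glyphs, pad=2):
    """Stack character bitmaps top to bottom with white space around each one.

    glyphs: list of 2D lists (rows of grayscale pixel values, 0 = black).
    Returns a single 2D list in which every row has the same width.
    """
    width = max(len(g[0]) for g in glyphs) + 2 * pad
    out = [[WHITE] * width for _ in range(pad)]          # top margin
    for glyph in glyphs:
        for row in glyph:
            left = (width - len(row)) // 2               # center horizontally
            right = width - left - len(row)
            out.append([WHITE] * left + list(row) + [WHITE] * right)
        out.extend([WHITE] * width for _ in range(pad))  # gap below each character
    return out
```

The resulting image has one character per "line", which is what keeps the OCR engine from trying to read the plate as a word.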
A great read! http://robotics.usc.edu/publications/downloads/pub/635/
About your skew problem for license plates:
Issue: When OCR input is taken from a hand-held camera or other imaging device whose perspective is not fixed like a scanner, text lines may get skewed from their original orientation [13]. Based on our experiments, feeding such a rotated image to our OCR engine produces extremely poor results.
Proposed Approach: A skew detection process is needed before calling the recognition engine. If any skew is detected, an auto-rotation procedure is performed to correct the skew before processing text further. While identifying the algorithm to be used for skew detection, we found that many approaches, such as the one mentioned in [13], are based on the assumptions that documents have set margins. However, this assumption does not always hold in our application. In addition, traditional methods based on morphological operations and projection methods are extremely slow and tend to fail in presence of camera-captured images. In this work, we choose a more robust approach based on Branch-and-Bound text line finding algorithm (RAST algorithm) [25] for skew detection and auto-rotation. The basic idea of this algorithm is to identify each line independently and use the slope of the best scoring line as the skew angle for the entire text segment. After detecting the skew angle, rotation is performed accordingly. Based on our experiments, we found this algorithm to be highly robust and extremely efficient and fast. However, it suffered from one minor limitation in the sense that it failed to detect rotation greater than 30°. We also tried an alternate approach, which could detect any angle of skew up to 90°. However, this approach was based on presence of some sort of cross on the image. Due to the lack of extensibility, we decided to stick with RAST algorithm.
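The excerpt's core idea, using the slope of a text line as the skew angle, can be illustrated with something much simpler than RAST. The Python sketch below is not the RAST algorithm: it is a plain least-squares line fit, and it assumes you already have a handful of (x, y) points along one text line, for example the bottom centers of the character bounding boxes on a plate. `skew_angle_degrees` is a hypothetical helper.

```python
import math


def skew_angle_degrees(points):
    """Angle of the least-squares line through `points`, in degrees.

    points: list of (x, y) pairs sampled along one text line.
    A horizontal line gives 0; the sign follows the image's y axis.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # atan2 of (covariance, variance) is the angle of the fitted slope.
    return math.degrees(math.atan2(sxy, sxx))
```

Rotating the image by the negative of this angle before OCR is the auto-rotation step the excerpt refers to.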
Tesseract 3.0x, by default, penalizes combinations that aren't words and aren't common words. The FAQ describes a method to increase its aversion to such nonsense. You might find it helpful to turn off the penalty for rare or nonexistent words, as described (inversely) here:
http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?
If anyone from the future comes across this question, there is a tool called jTessBoxEditor that makes training Tesseract a breeze. All you do is point it at a folder containing sample images, then click a button, and it creates your *.traineddata file for you.
ABCocr .NET uses Tesseract3 so that might be appropriate if you need the latest code under .NET.

Localization: How to map culture info to a script name or Unicode character range?

I need some information about localization. I am using .net 2.0 with C# 2.0 which takes care of most of the localization related issues. However, I need to manually draw the alphabets corresponding to the current culture on the screen in one particular screen.
This would be similar to the Contacts screen in Microsoft Outlook (Address Cards view or Detailed Address Cards view under Contacts), so it needs a column of buttons at the right end, one for each letter.
I am trying to emulate that, but I don't want to ask the user to choose the script. If the current culture is say, Chinese, I want to draw Chinese alphabets. When the user changes the culture info to English (and when he restarts the application) I want to draw English alphabets instead. Hope you understand where I am going with this query.
I can determine the culture of the current user (Application.CurrentCulture or System.Globalization.CultureInfo.CurrentCulture will give the culture related information). I also have all the scripts to render the alphabets. However, the problem is that I don't know how to map the culture info to the name of a script.
In other words, is there a way to determine the script name corresponding to a culture? Or is it possible to determine the range of Unicode character values corresponding to a culture? Either of them would allow me to render the alphabets on the button properly.
Any suggestions or guidance regarding this is truly appreciated. If there is something fundamentally wrong with my approach (or with what I am trying to achieve), please point out that as well. Thanks for your time.
PS: I know the easiest solution is to either configure the script name as part of user preferences or display a list of languages for the user to choose from (a la Contact in Outlook 2007). But I am just trying to see whether I can render the alphabets corresponding to the culture without the user having to do anything.
In native code there's LOCALE_SSCRIPTS for GetLocaleInfoEx() (Vista & above) that shows you what scripts are expected for a locale. There isn't a similar concept for .Net at this time.
Chinese has thousands of characters, so it might not be feasible to show all the characters in their character set. There's no native concept of 'alphabet' in Chinese, and I don't think Chinese has a syllabary like Japanese does.
Pinyin (Chinese written in roman alphabet) can be used to represent the Chinese characters, and that might help you index them. I know this doesn't answer your question, but I hope it's helpful.
I fully agree with mikiemacman. In addition, a given language doesn't necessarily use all the letters of a script.
Anyway, the closest I can think of is CultureInfo.TextInfo.ANSICodePage. There are only a handful of ANSI code pages, so you could create a table (or a switch() statement, whatever) that lists the script for each ANSI code page.
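Such a table could look like the Python sketch below (in C# it would be a Dictionary<int, string> or a switch). The code page numbers are the standard Windows ones; the script labels and the `script_for_codepage` helper are my own naming. Note that, as far as I know, Unicode-only cultures such as Hindi or Tamil report ANSI code page 0, so the table cannot answer for them.

```python
# Windows ANSI code page -> script used by the cultures that report it.
ANSI_CODEPAGE_SCRIPT = {
    874: "Thai",
    932: "Japanese",
    936: "Chinese (Simplified)",
    949: "Korean",
    950: "Chinese (Traditional)",
    1250: "Latin",     # Central European
    1251: "Cyrillic",
    1252: "Latin",     # Western European
    1253: "Greek",
    1254: "Latin",     # Turkish
    1255: "Hebrew",
    1256: "Arabic",
    1257: "Latin",     # Baltic
    1258: "Latin",     # Vietnamese
}


def script_for_codepage(codepage):
    """Return the script label, or None for unknown or Unicode-only cultures."""
    return ANSI_CODEPAGE_SCRIPT.get(codepage)
```

Returning None for an unknown code page lets the caller fall back to asking the user for a script.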
Proto, wait! There's a much more accurate solution. It's an unmanaged one, hence you may have to P/Invoke.
GetLocaleInfoW(MAKELCID(wLangId, SORT_DEFAULT), LOCALE_FONTSIGNATURE, wcBuf, MAXWCBUF);
This gives you a LOCALESIGNATURE structure. The answer is in the lsUsb field: the Unicode subsets bitfield. Rats! The MS page for this structure is empty. But look it up in your MSDN copy. It's fully documented there: a whole set of flags that describe which scripts are supported. And yes, there's a flag for Tamil ;-)
HTH.
EDIT: Oops! Hadn't seen Shawne's answer. Wow! Answer from an in-house expert! ;-) Anyway, you may still be interested in a Pre-Vista compatible answer.
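Once the lsUsb field has been read (it is an array of four 32-bit words, 128 bits in total), decoding it is plain bit testing. The Python sketch below shows only that step and does not call GetLocaleInfoW. The bit numbers are the ones I remember from the Unicode subset bitfield table and cover just a few scripts, so verify them against the documentation before use; `scripts_from_usb` is a hypothetical helper.

```python
# Bit number in the 128-bit Unicode subset bitfield -> script name.
# Assumption: bit assignments as recalled from the documented table; verify them.
UNICODE_SUBSET_BITS = {
    0: "Basic Latin",
    7: "Greek",
    9: "Cyrillic",
    11: "Hebrew",
    13: "Arabic",
    15: "Devanagari",
    20: "Tamil",
    24: "Thai",
}


def scripts_from_usb(ls_usb):
    """ls_usb: the four 32-bit words of lsUsb, lowest word first."""
    names = []
    for bit, name in sorted(UNICODE_SUBSET_BITS.items()):
        word, offset = divmod(bit, 32)   # which word, and which bit inside it
        if (ls_usb[word] >> offset) & 1:
            names.append(name)
    return names
```

For example, a first word with bits 0 and 20 set would decode to Basic Latin plus Tamil.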
Fascinating topic. While it might not answer your question, Omniglot is a good resource.
The correct answer is likely to be complex and to depend on the exact problem you're solving. Assuming your goal is to show only the letters used in a particular language to separate phonebook sections (as in Outlook), a few of the issues are:
People who have contact names spanning several scripts/languages.
2-glyph letters (e.g. 'Lj' in Serbian). It is one phoneme, always treated as a single letter although it has 2 Unicode symbols. It would have its own section in the phonebook (separate from 'L').
Too many glyphs to list (e.g. Chinese)
Unorthodox ordering (e.g. Thai -- a phone book would be separated by consonants only, ignoring the vowels).
Uppercase / lowercase distinction (presumably you'd only want one case for languages that support it -- which breaks down in minor ways, e.g. with the Turkish 'i').
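The 2-glyph letter issue from the list above can be made concrete with a short Python sketch that picks the phonebook section for a name. `section_letter` is a hypothetical helper, the default digraph list is the Serbian Latin one (Lj, Nj, Dž), and a real implementation would take the digraphs from per-language data.

```python
def section_letter(name, digraphs=("Lj", "Nj", "D\u017e")):
    """Return the phonebook section for `name`, treating digraphs as one letter."""
    for digraph in digraphs:
        # Compare case-insensitively so "ljubica" and "LJUBICA" both match "Lj".
        if name[:len(digraph)].lower() == digraph.lower():
            return digraph
    return name[:1].upper()


print(section_letter("Ljubica"))  # Lj
print(section_letter("Lazar"))    # L
```

The plain `upper()` fallback is also where the Turkish 'i' problem shows up: it maps 'i' to 'I', whereas Turkish expects the dotted capital, so the fallback needs culture-aware casing as well.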
