I am building a file parsing tool in WPF that lets me adjust the line length until the data lines up. See this video around 2:10: https://www.youtube.com/watch?v=OMeghA82kSk
I really need to fix it so that the text has a fixed width. I had thought about maybe doing a DataGridView and having each cell be a character, but that seems slow and kinda silly. Since it is recreating the view constantly, it needs to perform rather quickly.
I don't think what I am asking is that unusual. I have tried all the fixed-width fonts, but once the text contains control characters outside the normal range, the alignment doesn't hold up.
I see other applications such as v64 that do exactly what I am looking for (see below). Do I need to use something other than a TextBox? What would be the ideal way to do this?
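OK, so I found the issue. First off, you HAVE to specify the file encoding, or else some bytes will not survive being read (without an encoding argument, File.ReadAllText assumes UTF-8 unless it finds a byte order mark). In my case it was skipping \x86, which threw everything off.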
The only way I figured that out was by doing:
string shortText = File.ReadAllText("Original.dat");
File.WriteAllText("New.dat", shortText);
And then doing a byte by byte analysis. The right way is to do the following:
string shortText = File.ReadAllText("Wrapped.dat", Encoding.ASCII);
Even then, and even with a monospaced font, it won't look correct. That is because most TTF fonts don't define glyphs for things that aren't alphanumeric, so add a regular expression that replaces everything else with a space, and it works.
shortText = Regex.Replace(shortText, @"[^\w\n',|\.#-]", " ", RegexOptions.None);
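As a hedged alternative to decoding and then stripping with a regular expression, the sketch below works on the raw bytes, so every byte becomes exactly one display character. It swaps in a different technique from the answer above: anything outside printable ASCII is shown as '.', the way hex viewers usually do it, and a line break is inserted every lineLength bytes. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class FixedWidthText
{
    // Maps every byte to exactly one character so columns stay aligned.
    // Bytes outside printable ASCII (0x20..0x7E) are shown as '.'.
    // A line break is inserted after every lineLength bytes.
    public static string ToDisplayText(byte[] data, int lineLength)
    {
        if (lineLength <= 0)
            throw new ArgumentOutOfRangeException(nameof(lineLength));

        var sb = new StringBuilder(data.Length + data.Length / lineLength + 1);
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            sb.Append(b >= 0x20 && b <= 0x7E ? (char)b : '.');

            // Break the line once lineLength bytes have been written.
            if ((i + 1) % lineLength == 0)
                sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

Usage would be something like FixedWidthText.ToDisplayText(File.ReadAllBytes("Original.dat"), 80), re-run each time the line length is adjusted. Note that newline bytes in the file are also shown as '.', since the wrapping here is driven by lineLength alone.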
I am fairly new to programming and I just wrote a simple application in C# .NET to retrieve information about system drive space. The program functions fine but I'm struggling with formatting the output.
See output:
I'm trying to use padding to get the text to line up in a column format within a rich text box, but the output doesn't line up: if there are multiple drives, the drive names are different lengths, which throws off the padding. Even if one drive letter comes back as M: and the other as I:, the difference in the width of the letter is enough to throw off the alignment while padding.
I am wondering if there is a way to force each string value to a specific length so the padding is applied evenly or if maybe there's an even better way to format my output. Thank you in advance for your time and let me know if any further information would be helpful!
Note: One of the comments asked an important question, regarding whether the question refers to the System.Windows.Forms.RichTextBox (WinForms) or the System.Windows.Controls.RichTextBox (WPF) control. This answer applies only to the WinForms version of RichTextBox, so if you're using WPF, this doesn't apply.
The most important thing, and this was mentioned in the comments, is that you'll need to use a Monospaced font.
Since you stated you're using a RichTextBox, you'll need to know how to set it to use whatever monospaced font you've chosen.
To do that, you can use the RichTextBox.SelectionFont property.
For more general instructions, refer to this MSDN article: Setting Font Attributes for the Windows Forms RichTextBox Control
Once you set the RichTextBox.SelectionFont property, only text added to the control afterwards will use the specified font. To apply the font to existing text (i.e. you populate the RichTextBox and then change the font to an appropriate monospaced font), take a look at this answer, which tells you precisely what to do.
Once that's done, there remains the simple matter of adding the appropriate amount of whitespace to the end of each string, such that the next piece of data appears at the appropriate position. You'll probably be using String.PadRight, but for more general information about padding strings, check out this MSDN article: Padding Strings in the .NET Framework
Here is a string formatting example:
string varOne = "Line One";
double varTwo = 15.0 / 100; // 15/100 would be integer division and yield 0
string output = String.Format("{0,-10} {1,5:P1}", varOne, varTwo);
//expected output (the spacing before % depends on the current culture):
//Line One   15.0 %
where the formatting properties in curly brackets are:
{index[,alignment][:formatString]}
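To make the padding advice concrete, here is a minimal sketch that pads every drive name to the width of the longest one with String.PadRight and right-aligns the size with String.PadLeft. The DriveTable class, the tuple shape, and the fixed size width of 8 are assumptions for illustration, not part of the question's code. The columns only line up visually when the control uses a monospaced font.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class DriveTable
{
    // Builds one line per drive: name padded to the longest name,
    // free space right-aligned in an 8-character column.
    public static string Format(IEnumerable<(string Name, long FreeBytes)> drives)
    {
        var rows = drives.ToList();
        int nameWidth = rows.Count == 0 ? 0 : rows.Max(r => r.Name.Length);

        var sb = new StringBuilder();
        foreach (var (name, freeBytes) in rows)
        {
            // Invariant culture keeps the decimal separator stable.
            string gigabytes = (freeBytes / 1e9).ToString("F1", CultureInfo.InvariantCulture);
            sb.Append(name.PadRight(nameWidth))
              .Append("  ")
              .Append(gigabytes.PadLeft(8))
              .Append(" GB\n");
        }
        return sb.ToString();
    }
}
```

Because the name width is computed from the data, a drive called "Backup" and one called "C:\" produce rows of the same length.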
I'm using a C# wrapper for the Tesseract library (3.02 if I'm not mistaken) (https://github.com/charlesw/tesseract). I've got it running and giving output, but that output is essentially garbage. Often it gives nothing, and when it does give something it's often a mess. I know it's theoretically working because I've tried it on some really perfect images and it works. I'm wondering if someone can help me diagnose the issues and suggest some ways I can improve Tesseract accuracy. I've already converted all the images to black and white and the resolution is set at 300x300. I don't do any line straightening programmatically, but as you can see below they're pretty straight.
This image works perfectly
This one does not work at all, producing either gibberish or nothing at all
I tried flipping the colors on some examples, thinking that it might give greater contrast (since most text is black on a white background, whereas the working ones were white text on black background). But:
Does not work at all, whereas
Again works perfectly.
I suspect this has something to do with the additional spacing between the letters in "INVOICE." But there must be some way to get decent results with a tighter font. Any suggestions are welcome, I'm a relative noob here.
If possible, you should consider using pictures with a higher resolution. The other problem with the Payments image is probably that the gap between the letters is too small. Tesseract cannot detect single letters if they are (almost) connected to the next letter of the word.
I would suggest an image processing library like OpenCV to improve your results.
You could try erosion/dilation. This will separate the letters if the right parameters are used for the kernel. Try different kernels to see what works best for you.
Mat element = getStructuringElement(erosion_type,
Size(2 * erosion_size + 1, 2 * erosion_size + 1),
Point(erosion_size, erosion_size));
erode(src, erosion_dst, element);
What was helping me a lot when I was working on my project was using an adaptive threshold. I found this to be way more effective than just turning it into a grayscale or binary image.
Note: Java Code, should be very similar in C though.
Imgproc.adaptiveThreshold(cropedIm, cropedIm, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C, Imgproc.THRESH_BINARY, 29, 10);
This is what I get after selecting one of your images in Pixtern, an Android project of mine (source code on GitHub). I was using the adaptive threshold but no dilation/erosion, and the result is already quite good.
For the Payments image and similar ones:
Try using a normal threshold and inverting the image (black font, white background). Again, dilation/erosion can be used afterwards. Java code:
//results in binary image
Imgproc.threshold(cropedIm, cropedIm, 127, 255, Imgproc.THRESH_BINARY);
//Inverting image
Core.bitwise_not(cropedIm, cropedIm);
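For readers working in C# without OpenCV, the two steps above can be sketched on a plain 8-bit grayscale buffer. This is only an illustration of what a binary threshold followed by an inversion does per pixel; the Binarize class and the flat byte[] layout are assumptions, and a real pipeline would more likely use an OpenCV binding.

```csharp
public static class Binarize
{
    // Binary threshold (pixels above the threshold become 255, the rest 0),
    // followed by an inversion so light text on a dark background becomes
    // dark text on a light background.
    public static byte[] ThresholdAndInvert(byte[] gray, byte threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            byte binary = gray[i] > threshold ? (byte)255 : (byte)0;
            result[i] = (byte)(255 - binary);
        }
        return result;
    }
}
```

With a threshold of 127, as in the Java snippet, a pixel value of 128 ends up black (0) and a pixel value of 127 ends up white (255).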
Tesseract expects whole pages or rather it was trained on those.
If you give it one or two characters or words it won't work well.
I assume you have more of these images. Stitch them together as lines of text: like each image is a line of text after the previous and it should work much better.
Furthermore, make sure you set the psm-parameter right when using tesseract. More on this: https://www.pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
I'm trying to display all glyphs in a font. I'm using GetFontUnicodeRanges to get the available characters, then I create a bitmap with all the available characters and their index next to each one.
I used the font "Wingdings 2" as a test case, and compared it to what I see in Windows' charmap.exe. I see that while all the characters appear, some characters appear more than once (total of 480 glyphs in that non-unicode font), and the positions are not the same as in charmap (for instance, medium sized circle glyph, in charmap located as 0x97, and in the font it is glyph 0xF097 and I also think it is the one in 0x2014).
I want to use the font in the "regular" way, meaning I want to see the same data as in charmap.exe (and, as a side note, I would also like to know whether a font is a Unicode font or an ASCII font, as charmap shows). Basically, you can say I am trying to write my own charmap from scratch.
How can I fill in that missing data? I was looking through the Windows' fonts and text APIs, but couldn't find anything to help me, so I must be missing some relevant APIs. What are they?
After struggling a lot with GetFontData and the lack of documentation (well, not exactly a lack, but it is really not well organized, and some data is indeed missing), I found a way to write my own CharMap. Here's what I found during development:
The documentation will tell you to use a "trick" that is possible because the glyph location data comes right after the arrays in the cmap table. That doesn't mean it is IN the cmap table. Actually, it is in the loca table.
You also need to read the head table for the location format flag (indexToLocFormat, at byte offset 50) and the maxp table for the number-of-glyphs field (offset 4).
It seems that in symbol fonts (you can tell a font is a symbol font if the cmap header encoding ID is 0, at least in TTF format 4, which is the Microsoft format) 0xF000 is added to each character's actual index, so instead of the regular ASCII codes you get a Unicode value at the far end of the Unicode table. I subtracted 0xF000 from each character code, tested on the Wingdings[2,3] and Webdings fonts, and it worked just fine.
I used the official documentation a lot: www.microsoft.com/typography/tt/ttf_spec/ttch02.doc, and the reference code: http://support.microsoft.com/kb/241020.
The reference code is written in C, so in order to write it in C# I read all the data to byte[] buffers, and "manually" read each element from it.
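As a rough illustration of reading elements "manually" from a byte[] buffer, the sketch below looks up a table in the font's table directory and reads the number-of-glyphs field from maxp at offset 4. It assumes the buffer holds a whole single-font .ttf file (not a .ttc collection), whereas GetFontData can also return one table at a time. The SfntReader class and its method names are hypothetical. All multi-byte values in the font are big-endian.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

public static class SfntReader
{
    public static ushort ReadUInt16(byte[] data, int offset) =>
        BinaryPrimitives.ReadUInt16BigEndian(data.AsSpan(offset, 2));

    public static uint ReadUInt32(byte[] data, int offset) =>
        BinaryPrimitives.ReadUInt32BigEndian(data.AsSpan(offset, 4));

    // Returns the byte offset of the named table, or -1 if it is absent.
    // The directory starts with a 12-byte header (numTables at offset 4),
    // followed by 16-byte records: tag, checksum, offset, length.
    public static long FindTable(byte[] font, string tag)
    {
        int numTables = ReadUInt16(font, 4);
        for (int i = 0; i < numTables; i++)
        {
            int record = 12 + i * 16;
            if (Encoding.ASCII.GetString(font, record, 4) == tag)
                return ReadUInt32(font, record + 8);
        }
        return -1;
    }

    // Reads numGlyphs, a 16-bit value at offset 4 of the maxp table.
    public static int ReadGlyphCount(byte[] font)
    {
        long maxp = FindTable(font, "maxp");
        if (maxp < 0)
            throw new InvalidDataException("maxp table not found");
        return ReadUInt16(font, (int)maxp + 4);
    }
}
```

Looking tables up by tag rather than by position avoids depending on the order in which a particular font stores them.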
I went through this nightmare years ago too and I know a lot about all this stuff now. I figured I should pitch in and provide some answers.
1) You can not assume that 'loca' is following the 'cmap'. The order can vary by font. The location of each block is defined by the OffsetTable which begins generally at byte 0 of the font file. (http://www.microsoft.com/typography/otspec/otff.htm)
2) You can not assume that "cmap header encoding id is 0, at least in TTF format 4" means a symbol font. I know for a fact that certain old Arabic fonts also use that encoding. To this day, I still do not know how to differentiate them. Windows does it, but I do not know how, so I cannot tell for sure that a font is a symbol font. Even checking the OS/2 table for the code page bit 32 isn't enough in many cases.
3) You can not simply take the magic number 0xF000 and add it to your small 0-255 number to get the character that will give you the glyph mapping you are after. That is because those small 0 to 255 "ASCII" codes vary depending on your system locale.
Symbol fonts are special in the way that Windows processes them.
Unlike a normal font, where the mapping between glyphs and characters is static, a symbol font's mapping varies with the system default code page for non-Unicode applications, aka CP_ACP.
For example, pretend your symbol font has this glyph: '%'. If your system uses CP 1252 by default, then to render this glyph you have to render, say, the character value '0xC2'.
If your system uses CP 1251 by default, then to render the same glyph you have to render, say, the character value '0x416', which is entirely different.
Put another way, the font's Unicode ranges vary with the default non-Unicode code page!
After investigation, we discovered that the valid character values for such fonts are the values obtained by converting 0 through 255 to Unicode as if they were CP_ACP values.
What does this mean? This means that you want to use MultiByteToWideChar with CP_ACP to get the mapping from values 0 to 255 to their localized Unicode value based on your system locale (CP_ACP).
So, doing that will give you a map like :
ASCII -> localized non-static UNICODE
0x00 -> 0x00
0x01 -> 0x01
0x02 -> 0x02
...
0xC2 -> 0x416 <----- This is correct : the value will be different in some cases.
...
0xE3 -> 0xE3
The 0xF000 to 0xF0FF values are the static UNICODE values : they never change.
So to get the glyph ID for a "localized non-static UNICODE", you first use your map above to find the corresponding ASCII value and then you add 0xF000 to that and then you get the glyph id for that.
Of course, none of this nonsense is documented by MS... or at least I could never find it.
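The mapping described above can be sketched in managed code. This is an approximation under stated assumptions: it uses Encoding.GetEncoding with an explicit code page number as a stand-in for MultiByteToWideChar with CP_ACP, it assumes a single-byte code page, and it assumes .NET Core or .NET 5+, where code page encodings must be registered through CodePagesEncodingProvider. The SymbolFontMap class is hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class SymbolFontMap
{
    // Maps each localized Unicode character (byte values 0..255 decoded with
    // the given code page) to its static 0xF000-based symbol code.
    public static Dictionary<char, int> Build(int codePage)
    {
        // Needed on .NET Core / .NET 5+ before legacy code pages can be used.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(codePage);

        var map = new Dictionary<char, int>();
        var single = new byte[1];
        for (int value = 0; value <= 0xFF; value++)
        {
            single[0] = (byte)value;
            char[] decoded = encoding.GetChars(single);
            if (decoded.Length != 1)
                continue; // not a single-byte mapping; skipped in this sketch

            map[decoded[0]] = 0xF000 + value;
        }
        return map;
    }
}
```

For code page 1252, for instance, the em dash U+2014 decodes from byte 0x97, so it maps to 0xF097, which lines up with the 0x97, 0xF097, and 0x2014 values reported in the question.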
I've never looked at "WingDings 2" in detail, but it's very common for glyphs to be reused for different characters. For example, uppercase Roman A and uppercase Greek alpha are frequently the same glyph.
However, I guess the equality of 0x97, 0xF097 and 0x2014 is some kind of hack to deal with windows-1252. In the windows-1252 codepage, 0x97 is an em-dash, which is 0x2014 in Unicode. 0xF097 is in the private use area; I guess it is providing a Unicode-compatible (and reversible) way of encoding the windows-1252 0x97.
In my experience, the most reliable way to get an unambiguous list of the unicode characters supported by a font is to parse the cmap table from the ttf file. This is a bit of a chore (cmap supports something like six different encodings) but it is documented online. You can use the GetFontData function to get the raw data, or parse the ttf directly.
charmap uses the GetFontData function and the code includes the string "cmap", suggesting that charmap is also doing this.
The Windows SDK Debugging Tools include logger.exe, which records all the APIs used by an app. You can use this if you want to be really sure what charmap is doing.
Let me start with this: I can't zip it or anything similar.
What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a space to divide the numbers. The 12 means 12 pixels in a row use that color. The h is to divide the numbers again.)
The problem I'm having is that the blocks aren't all the same length. They range from 0g1h to 2546g115h. Basically I want to create a palette of common patterns to hopefully save space. Say I have: 12g345h19g12h190g11h occurring at least three times, then I could save space if I had something like: a=12g345h19g12h190g11h in the palette array and just put 'a' in the string. Or even not look at the data blocks, as you see in the attached file you get g640h a ton of times.
I could be wrong, but I'm pretty sure this could work. If you have a better idea how I could save space and not lose data, I'm more than open to the ideas.
Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.
You could use a dictionary (probably Dictionary<string, int>) and just count how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements.
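A minimal sketch of that counting step, assuming the blocks always have the form digits, 'g', digits, 'h' as in the question. The PatternCounter class and the blocksPerPattern parameter are hypothetical, and overlapping occurrences are counted, so the numbers are an upper bound on how many replacements are actually possible.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternCounter
{
    // Splits the data into blocks such as "0g12h" and counts every run of
    // blocksPerPattern consecutive blocks.
    public static Dictionary<string, int> CountPatterns(string data, int blocksPerPattern)
    {
        string[] blocks = Regex.Matches(data, @"\d+g\d+h")
                               .Cast<Match>()
                               .Select(m => m.Value)
                               .ToArray();

        var counts = new Dictionary<string, int>();
        for (int i = 0; i + blocksPerPattern <= blocks.Length; i++)
        {
            string pattern = string.Join("", blocks, i, blocksPerPattern);
            counts.TryGetValue(pattern, out int seen); // seen is 0 when absent
            counts[pattern] = seen + 1;
        }
        return counts;
    }
}
```

Patterns whose count reaches the threshold from the question (at least three occurrences) and that are longer than their replacement key are the candidates for the palette array.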
However, I would recommend that you read up a little on compression algorithms. What you are implementing appears to be a Run Length Encoding (RLE) scheme, and you are then trying to compress again on top of that. Consider looking at how sliding-window compression works (which is what GZIP does) as an alternative to your RLE, or look at Huffman encoding as a mechanism to reduce the amount of space needed for the codewords that you are creating. (In simple terms, Huffman encoding uses shorter symbols for more frequent patterns and longer symbols for less frequent patterns in an 'optimal' way.)
This is a fun problem space to play in! Good Luck!
I'm rendering text using FormattedText, but there does not appear to be any way to perform per-char hit testing on the rendered output. The text is read-only, so I basically only need selection, no editing.
I'd use RichTextBox or similar, but I need to output text based on control codes embedded in the text itself, and they don't always nest, which makes building the right Inline elements very complex. I'm also a bit worried about performance with that solution; I have a large number of lines, and new lines are appended often.
I've looked at GlyphRun, it appears I could get hit-testing from it or a related class, but I'd be reimplementing a lot of functionality, and it seems like there should be a simpler way...
Does anyone know of a good way to implement this?
You can get the geometry of each character from a FormattedText object and use the bounds of each character to do your hit testing.
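// Returns the index of the glyph geometry under 'point', or -1 if none is hit.
var geometry = (GeometryGroup)((GeometryGroup)text.BuildGeometry(new Point(0, 0))).Children[0];
int index = 0;
foreach (var c in geometry.Children)
{
    if (c.Bounds.Contains(point))
        return index;
    index++;
}
return -1;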
In OnRender you can render these geometry objects instead of the formatted text.
The best way is to design a good data structure for storing your text and which also considers hit-testing. One example could be to split the text into blocks (words, lines or paragraphs depending on what you need). Then each such block should have a bounding-box which should be recomputed in any formatting operations. Also consider caret positions in your design.
Once you have such a facility, hit-testing becomes very easy: just use the bounding boxes. It will also help in subsequent operations like highlighting a particular portion of text.
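One possible shape for such a data structure is sketched below. It is only an illustration: Box is a hypothetical stand-in for System.Windows.Rect so the sketch has no WPF dependency, TextBlockIndex is an invented name, and the linear scan would likely need replacing with something smarter (for example, a per-line lookup) for a large number of lines.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for System.Windows.Rect.
public readonly struct Box
{
    public Box(double x, double y, double width, double height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public double Height { get; }

    public bool Contains(double px, double py) =>
        px >= X && px < X + Width && py >= Y && py < Y + Height;
}

public sealed class TextBlockIndex
{
    private readonly List<(Box Bounds, int Start, int Length)> _blocks =
        new List<(Box Bounds, int Start, int Length)>();

    // Registers a block of text: its bounding box and the character range it covers.
    public void Add(Box bounds, int start, int length) =>
        _blocks.Add((bounds, start, length));

    // Returns the character range of the block under the point, or null if nothing is hit.
    public (int Start, int Length)? HitTest(double x, double y)
    {
        foreach (var block in _blocks)
        {
            if (block.Bounds.Contains(x, y))
                return (block.Start, block.Length);
        }
        return null;
    }
}
```

The bounding boxes would be recomputed whenever the text is re-laid out, and the returned character range can drive both selection and highlighting.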
Completely agree with Sesh - the easiest way you're going to get away with not re-implementing a whole load of FormattedText functionality is going to be by splitting up the individual items you want to hit-test into their own controls/inlines.
Consider using a TextBlock and adding each word as its own Inline, then either bind to the inline's IsMouseDirectlyOver property, or add delegates to the MouseEnter & MouseLeave events.
If you want to do pixel-level hit testing of the actual glyphs (i.e. is the mouse exactly in the dot of this 'i'), then you'll need to use GlyphRuns and do manual hit testing on the glyphs (read: hard work).
I'm very late to the party--if the party is not over, and you don't need the actual character geometry, I found something like this useful:
for (int i = 0; i < FormattedText.Text.Length; i++)
{
characterHighlightGeometry = FormattedText.BuildHighlightGeometry(new Point(), i, 1);
CharacterHighlightGeometries.Children.Add(characterHighlightGeometry);
}
BuildGeometry() only includes the actual path geometry of a character. BuildHighlightGeometry() generates the outer bounds of all characters, including spaces, so an index to a space can be located by:
foreach (var c in CharacterHighlightGeometries.Children)
{
if (c.Bounds.Contains(centerpoint))
{
q = c;
cpos = index;
break;
}
index++;
}
Hope this helps.