I am still new to C# and its conventions in general, so I may have a hard time explaining what I'm looking for. But is there a way to read in a text file (.txt) and pull only certain lines of data, specifically lines containing more than 8 commas, and ignore the rest?
Trying to capture data on a control box, and the data it sends out in debug has everything from motor position, velocity, axis, and more. But it also includes data I don't need to graph in Excel, such as extra comments like "Motor has stopped", "Acknowledge", and so on. Those lines tend to have only 1-3 commas, whereas the lines I'd like to extract for analysis contain 8+ commas for all the details given.
Thanks for any tips/advice on this, or for pointing me in the right direction with the terminology I'm lacking for what I'm looking for. I've figured out the File.ReadAllText function so far, and I think that's a good start, but I'm not really sure which terms or functions express the rest of what I need.
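Something like this minimal sketch is what I'm imagining (File.ReadLines streams the file a line at a time; the file name here is just a placeholder):

// Minimal sketch: keep only the lines containing more than 8 commas.
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        var dataLines = File.ReadLines("debuglog.txt") // placeholder file name
                            .Where(line => line.Count(c => c == ',') > 8);

        foreach (string line in dataLines)
            Console.WriteLine(line); // or write to a .csv for Excel
    }
}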
I am new around here and am learning C# programming.
I am working on modding the game Train Simulator, and one of the tasks I've taken on is renaming files and their contents for the routes used in the game. All the files contain codes that tell the game what objects should be placed where, etc.
The first step I am asking about here is renaming the files themselves. The files are arranged like a coordinate system:
a list of files named +000000+000000.bin and counting upward.
As you can see, the file names use signed 6-digit numbers, starting with a + or -, like an XY grid relative to an origin point. My goal here is to mathematically increase or decrease the numbers in the XY grid and make sure all the files are renamed according to the input provided by the user.
EDIT: The user would be moving point 0 somewhere else. If the file +000000+000000.bin should be renamed +000025-000067.bin, then the file +000000+000001.bin should be renamed +000025-000066.bin, -000003-000006.bin should become +000022-000073.bin, and so on: adding 25 to the first half of the filename and adding -67 to the other half.
If someone out there has a good suggestion about how to do this, I would greatly appreciate it. English isn't my first language and C# is pretty new to me.
It is generally recommended to make an honest effort to find a solution before posting a question. If you have not already, you might want to read "how to ask a good question".
Some pointers to get you started:
List all the files. See the GetFiles method.
Split the filename into the x and y coordinates. See Substring.
Parse the x and y coordinates. See Parse.
Apply the offset to the x and y coordinates; I trust you can manage this part.
Format the new filename. See String.Format, format strings, and how to include the sign.
Finally, rename the file. See File.Move.
This should give you enough information to start on a solution; a rough sketch tying the steps together follows. If you run into specific issues, it might be a good idea to post a more specific question about whatever you have an issue with.
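A minimal sketch of those steps, assuming every file matches the +######+######.bin pattern (the folder path and offsets below are placeholders):

// Minimal sketch: shift the XY grid in the file names by a user-supplied offset.
using System;
using System.IO;

class RenameTiles
{
    static void Main()
    {
        string folder = @"C:\Route\Tiles"; // placeholder path
        int dx = 25, dy = -67;             // offsets chosen by the user

        foreach (string path in Directory.GetFiles(folder, "*.bin"))
        {
            string name = Path.GetFileNameWithoutExtension(path); // e.g. "+000000+000001"
            int x = int.Parse(name.Substring(0, 7)); // sign + 6 digits
            int y = int.Parse(name.Substring(7, 7));

            // The format "+000000;-000000" forces a sign and pads to 6 digits.
            string newName = string.Format(
                "{0:+000000;-000000}{1:+000000;-000000}.bin", x + dx, y + dy);
            File.Move(path, Path.Combine(folder, newName));
        }
    }
}

Note that renaming in place can collide with a file that has not been renamed yet, so in practice you may want to move the renamed files into a temporary folder first.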
I'm currently trying to write an algorithm that can find individual groups of connected lines within a larger set of lines. The images below should explain this a bit more clearly.
In the first image you can see a set of lines. What I'm trying to do is split those lines into 3 groups, as seen in the second image. The red and green groups share a line.
I can assume that each line has a start and an end coordinate, and each line can belong to one or more groups.
I'm currently trying to write a recursive function that follows each line until it reaches an end point where there are one or more further lines it can follow. At that point the function calls itself until it has followed the lines back around to the split point. However, this is proving unsuccessful.
The output of this example, as shown in the second image, should be 3 separate groups of lines, stored in a list. I'm currently using C#; however, I should be able to adapt a suitable algorithm from any language, including pseudocode. I know there must be an algorithm that can achieve this, but I cannot seem to work it out or find it online.
In the language of graph theory (where vertices are all the line endpoints and each edge is a line), your problem is to find all the faces of a planar graph. This is sometimes called planar face traversal. There are some resources you could consult for information on this, including this mathoverflow question. Though it is a C++ library and not C#, the Boost Graph Library has an API for planar face traversal, and consulting its documentation could be helpful.
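If you want to experiment in C#, here is a rough sketch of one standard face-traversal technique: sort the edges around each vertex by angle, then walk every directed edge, always leaving a vertex by the edge just after the one you arrived on in counterclockwise order. It assumes the lines form a valid planar drawing (lines meet only at shared endpoints) and uses C# 7 tuples; one of the returned faces is the unbounded outer face, which you would discard.

// Rough sketch of planar face traversal via a rotation system.
using System;
using System.Collections.Generic;

static class FaceFinder
{
    // points: vertex coordinates; edges: index pairs into points.
    public static List<List<int>> FindFaces(
        (double X, double Y)[] points, (int A, int B)[] edges)
    {
        // 1. For each vertex, list its neighbours sorted counterclockwise by angle.
        var ring = new Dictionary<int, List<int>>();
        foreach (var (a, b) in edges)
        {
            if (!ring.ContainsKey(a)) ring[a] = new List<int>();
            if (!ring.ContainsKey(b)) ring[b] = new List<int>();
            ring[a].Add(b);
            ring[b].Add(a);
        }
        foreach (var pair in ring)
        {
            int v = pair.Key;
            pair.Value.Sort((p, q) =>
                Math.Atan2(points[p].Y - points[v].Y, points[p].X - points[v].X).CompareTo(
                Math.Atan2(points[q].Y - points[v].Y, points[q].X - points[v].X)));
        }

        // 2. Walk each directed edge exactly once; every orbit is one face.
        var faces = new List<List<int>>();
        var visited = new HashSet<(int, int)>();
        foreach (var (a, b) in edges)
            foreach (var (s, t) in new[] { (a, b), (b, a) })
            {
                int u = s, v = t;
                if (visited.Contains((u, v))) continue;
                var face = new List<int>();
                while (visited.Add((u, v)))
                {
                    face.Add(u);
                    // At v, leave via the neighbour just after u in CCW order.
                    List<int> n = ring[v];
                    int w = n[(n.IndexOf(u) + 1) % n.Count];
                    u = v; v = w;
                }
                faces.Add(face);
            }
        return faces;
    }
}

An edge shared by two faces (like the line your red and green groups share) shows up in both orbits, since each of its two directions belongs to a different face.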
I want my C# (winforms) application to be multilingual. My idea is:
I will have my translations in some text file(s); each "sentence" or phrase will have its own unique ID (integer)
at the start-up of the app I will iterate through all controls on all the forms I have in my app (I suppose this should be done in each form's 'Load' event handler) and test each control for its type
i.e. if it is a button or menu item, I will read its default 'Text' property, locate this phrase in one text file, read its unique ID, and through this ID locate the translated phrase in the (other) text file
then I will overwrite that 'Text' property of the control with the translated phrase
This enables me to have a separate text file with phrases for each and every language (easy to maintain individual translations in the future - only 1 txt file)
I would like to hear from you professionals whether there is some better / easier / faster / more 'pro' way to accomplish this.
What format of translation text file should I use (plain text, XML, INI...)? It should be human readable. I don't know whether finding a phrase in XML would be faster in C# than going line by line through a plain text file searching for a given phrase/string.
EDIT - I want users (the community) to be able to translate my app into their native language themselves, without my interaction (which means Microsoft's resources are out of the game)
Thank you very much in advance.
CLOSED - My solution:
Looks like I'm staying with my original concept - every phrase will be on a separate line of a plain text file in Unicode encoding (with the ID at the beginning of the line). I was thinking about deleting the IDs too and using only the line numbers, but that would need an advanced text editor (Notepad shows no line numbers), and if somebody accidentally hit a shortcut for "Delete line" and didn't notice, the whole app would go crazy :)
//sample of my translation text file for one language
0001:Text of my first button
0002:Text of my first label
0003:MessageBox title text
...etc etc
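And a minimal sketch of how I load that format into a lookup table (the file name is a placeholder; File.ReadAllLines detects BOM-marked Unicode automatically):

// Minimal sketch: load "ID:phrase" lines into a dictionary keyed by ID.
using System.Collections.Generic;
using System.IO;

static class Translations
{
    public static Dictionary<int, string> Load(string path)
    {
        var phrases = new Dictionary<int, string>();
        foreach (string line in File.ReadAllLines(path))
        {
            int colon = line.IndexOf(':');
            if (colon <= 0) continue; // skip malformed lines
            phrases[int.Parse(line.Substring(0, colon))] = line.Substring(colon + 1);
        }
        return phrases;
    }
}
// usage: var t = Translations.Load("english.txt"); button1.Text = t[1];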
Why not use Microsoft's resource file method? You won't need to write any complex custom code this way.
It sounds like you are somewhat invested in the "one text file" idea, or else you would probably lean towards the standard way and use Microsoft's resource files. Handling for resource files is built in, and the controls are already keyed to support it. But, as you are probably aware, each translation goes into its own resource file, so you are left juggling multiple files to distribute with your app.
With a custom, roll-your-own solution, you can probably trim it down to one Unicode file. But you will have to loop through the controls to set the text, and then look up the text for each one. As you add control types, you will have to add support for them in your code. Also, your text file will grow in large chunks as you add languages, so you will have to account for that as well.
I still lean towards using the resource files, but your phrasing suggests you already don't like that solution, so I don't think I have changed your mind.
Edit:
Since you want the solution separated from the app to avoid having to recompile, you could distribute SQL-CE database files for each language type. You can store the text values in NVARCHAR fields.
That will make your querying easier, but raises the self-editing requirements. You would have to provide a mechanism for users to add their own translation files, as well as edit screens.
Edit 2:
Driving towards a solution. :)
You can use a simple delimited text file, encoded in Unicode, with a convention based naming system. For example:
en-US.txt
FormName,ControlName,Text
"frmMain","btnSubmit","Save"
"frmMain","lblDescription","Description"
Then you can use the CurrentUICulture to determine which text file to load for localization, falling back to en-US if no file is found. This lets the users create (and also change!) their own localization files using common text editors and without any steep learning curve.
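A rough sketch of applying such a file at form load (the parsing here is deliberately naive and assumes the simple quoting shown above, with no embedded commas):

// Rough sketch: apply "FormName","ControlName","Text" rows to a form.
using System.Globalization;
using System.IO;
using System.Windows.Forms;

static class Localizer
{
    public static void Apply(Form form)
    {
        string file = CultureInfo.CurrentUICulture.Name + ".txt"; // e.g. "en-US.txt"
        if (!File.Exists(file)) file = "en-US.txt";               // fallback
        if (!File.Exists(file)) return;

        foreach (string line in File.ReadAllLines(file))
        {
            string[] parts = line.Split(',');
            if (parts.Length != 3) continue; // also skips the header row
            if (parts[0].Trim('"') != form.Name) continue;

            Control[] hits = form.Controls.Find(parts[1].Trim('"'), true);
            if (hits.Length > 0) hits[0].Text = parts[2].Trim('"');
        }
    }
}
// usage, in each form's Load handler: Localizer.Apply(this);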
If you want the users to edit the translations through your application while keeping things simple and quick, a resource file is best. If you don't like that, the second-best option is an XML file.
Still, to answer your question on how to do it best with a text file: it is pretty straightforward. You just make sure that your unique identifiers (probably ints) are in order (validate this before using the file). Then, to search quickly, you use binary search:
you look for number X, so you go to the file's middle line; if the ID there is greater than X, you go to the line at ¼ of the file, and so on,
cutting the range in half until you get to the right line. This is the fastest known way to search sorted data.
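A small sketch of that lookup (it assumes the lines are sorted by ID and use the "ID:phrase" format shown earlier):

// Binary search over sorted "ID:phrase" lines.
using System.IO;

static class PhraseFile
{
    public static string Find(string[] lines, int id)
    {
        int lo = 0, hi = lines.Length - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            string line = lines[mid];
            int colon = line.IndexOf(':');
            int midId = int.Parse(line.Substring(0, colon));
            if (midId == id) return line.Substring(colon + 1);
            if (midId < id) lo = mid + 1; else hi = mid - 1;
        }
        return null; // not found
    }
}
// usage: string text = PhraseFile.Find(File.ReadAllLines("english.txt"), 2);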
NOTE: Beware of the things that are external to the application but need translation: External file items, information contained in a database, etc.
In the purpose of practicing for an upcoming programming contest, I'm making a very basic search engine in C# that takes a query from the user (e.g. "Markov Decision Process") and searches through a couple of files to find the most relevant one to the query.
The application seems to be working (I used a term-document matrix algorithm).
But now I'd like to test the functionality of the search engine to see if it really is working properly. I tried taking a couple of Wikipedia articles, saving them as .txt files, and testing it out, but I just can't tell if it's working fast enough (even with some timers).
My question is, is there a website that shows a couple of files to test a search engine on (along with the logically expected result)?
I'm testing with common sense so far, but it would be great to be sure of my results.
Also, how can I get a collection of .txt files (maybe 10 000+ files) about various subjects to see if my application runs fast enough?
I tried copying a few Wikipedia articles, but it would take way too much time to do. I also thought about making a script of some sort to do it for me, but I really don't know how to do that.
So, where can I find a lot of files with separated subjects?
Otherwise, how can I benchmark my application?
Note: I guess a simple big .txt file where each line represents a "file" about a subject would do the job too.
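For the timing part, this is the kind of minimal harness I have in mind, built on System.Diagnostics.Stopwatch (Search is just a placeholder for my engine's entry point):

// Minimal timing harness; Search(...) stands in for the engine's query method.
using System;
using System.Diagnostics;

class Benchmark
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100; i++)
            Search("Markov Decision Process"); // placeholder call
        sw.Stop();
        Console.WriteLine("average query time: {0} ms", sw.ElapsedMilliseconds / 100.0);
    }

    static void Search(string query) { /* the engine goes here */ }
}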
One source of text files would be Project Gutenberg. They supply CD/DVD images if you want to download thousands of files at once. (The page doesn't state it, but I would imagine they are in .txt format inside the CD/DVD ISO.)
You can get Wikipedia pages by using a recursive function, loading the HTML from every page linked to by one starting page.
If you have some experience with C#, this should help you:
http://www.csharp-station.com/HowTo/HttpWebFetch.aspx
Then loop through the text, collect all the instances of the text "<a href=\"", and recursively call that method for each link. You should also use a counter to limit the recursion depth.
Also, to prevent OutOfMemory exceptions, you should stop every so many iterations, write everything collected so far to a file, and flush the old data from the string.
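A rough sketch of that idea (the seed URL, depth limit, and page cap are placeholders, and a real crawler should respect robots.txt and rate limits):

// Rough sketch: recursively fetch pages and append their text to one file.
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class Crawler
{
    static readonly HashSet<string> Seen = new HashSet<string>();

    static void Crawl(string url, int depth, StreamWriter output)
    {
        if (depth == 0 || Seen.Count > 10000 || !Seen.Add(url)) return;

        string html;
        try { using (var wc = new WebClient()) html = wc.DownloadString(url); }
        catch (WebException) { return; }

        output.WriteLine(html); // write now instead of growing one big string
        foreach (Match m in Regex.Matches(html, "<a href=\"(/wiki/[^\"#:]+)\""))
            Crawl("https://en.wikipedia.org" + m.Groups[1].Value, depth - 1, output);
    }

    static void Main()
    {
        using (var output = new StreamWriter("corpus.txt"))
            Crawl("https://en.wikipedia.org/wiki/Markov_decision_process", 2, output);
    }
}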
You can use the datasets from GroupLens Research's site.
Some samples: movies, books
Well, I'm using a compiled .NET version of this OCR, which can be found at http://www.pixel-technology.com/freeware/tessnet2/
I have it working; however, the aim of this is to read license plates, and sadly the engine really doesn't recognize some letters accurately. For example, here's an image I scanned to determine the character problems:
Result:
12345B7B9U
ABCDEFGHIJKLMNUPIJRSTUVHXYZ
Therefore the following characters are being translated incorrectly:
1, O, Q, W
This doesn't seem too bad; however, on my license plates the results aren't so great (the plate images are omitted here):
first plate = H4 ODM
second plate = LDH IFW
fake test plate = NR4 y2k
As you might be able to tell, I've tried noise reduction, increasing the contrast, and removing pixels that aren't absolute black, with no real improvement.
Apparently you can teach the engine new fonts, but I think I would need to recompile the library for .NET; it also seems this is performed on Linux, which I don't have.
http://www.scribd.com/doc/16747664/Tesseract-Trainingfor-Khmer-LanguageFor-Posting
So I'm stuck as to what to try next. I've written a quick console application purely for testing purposes, if anyone wants to try it. If anyone has any ideas / graphic-manipulation / library thoughts, I'd appreciate hearing them.
I used Tesseract via Tessnet2 recently (Tessnet2 is a VS2008 C++ wrapper around Tesseract 2.0 made by Rémy Thomas, if I remember well). Let me try to help you with the little knowledge I have concerning this tool:
First, as I said above, this wrapper is only for Tesseract 2.0, and the newest Tesseract version on Google Code is 3.00 (the code is no longer hosted on SourceForge). There are regular contributors: I saw that a version 3.01 or so is planned. So you don't benefit from the latest enhancements, including page layout analysis, which may help when your license plates are not 100% horizontal.
I asked Rémy for a Tessnet2 .NET wrapper around version 3; he doesn't plan any for now. So, as I did, you'll have to do it yourself!
So if you want to get the latest version of the sources, you can download them from the Subversion repository (everything is described on the dedicated site page), and you'll be able to compile them if you have Visual Studio 2008, since the sources contain a VS2008 solution in the vs2008 sub-folder. This solution is made of VS2008 C++ projects, so to get results in C# you'll have to use .NET P/Invoke with the tessDll built by the project. Again, if you need this, I have code examples that may interest you, but you may want to stay with C++ and do your own new WinForms projects, for instance!
Once you have managed to compile (there should not be major problems there, but tell me if you meet some, I may have met them too :-) ), the build will output several binaries that will allow you to do specific training! Again, there is a page specially dedicated to Tesseract 3 training. Thanks to this training, you can:
restrict your set of characters, which will automatically remove punctuation misreads ('/-\' instead of 'A', for instance) - a small character-whitelist sketch follows this list
indicate the ambiguities you have detected ('D' instead of 'O' as you saw, 'B' instead of '8', etc.), which will be taken into account when you use your training.
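On the character-set point: even with the stock Tessnet2 wrapper (before any retraining) you can already restrict recognition to plate characters through a Tesseract variable. A minimal sketch following the usual Tessnet2 sample pattern (the tessdata path and image name are placeholders):

// Minimal Tessnet2 sketch: whitelist plate characters before recognizing.
using System;
using System.Drawing;

class PlateOcr
{
    static void Main()
    {
        var ocr = new tessnet2.Tesseract();
        ocr.SetVariable("tessedit_char_whitelist",
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789");
        ocr.Init(@"c:\tessdata", "eng", false); // folder holding the language data
        var words = ocr.DoOCR(new Bitmap("plate.png"), Rectangle.Empty);
        foreach (tessnet2.Word word in words)
            Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
    }
}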
I also saw that Tesseract results are better if you restrict the image to the zone where the letters are located (i.e. no face, no landscape around): in my case, I needed to recognize only a specific zone of card photos taken from a webcam, so I used image processing to isolate the zone. That took a while, of course, but my images came from many different sources, so I had no choice. If you can get images that are cropped down to the minimum, that will be great!
I hope this was of some help; do not hesitate to send me your remarks and questions!
Hi, I've done lots of OCR with Tesseract, and I have had some of your problems too. You asked about IMAGE PROCESSING tools, and I'd recommend "unpaper" (there are Windows ports too, see Google). That's a nice de-skew, unrotate, remove-borders-and-noise and-so-on program. Great for running before OCR'ing.
If you have a (somewhat) variable background color on your images, I'd recommend the "textcleaner" ImageMagick script.
I think it does edge detection and whitens out all the non-edgy stuff.
And if you have complex text then "ocropus" could be of use.
The syntax is (on Linux): "ocroscript rec-tess <image>"
My setup is:
1. textcleaner
2. unpaper
3. ocropus
With these three steps I can read almost anything. Even quite blurry and noisy images taken in uneven lighting, with two columns of tightly packed text, come out very readable. OK, maybe your needs aren't that much text, but steps 1 and 2 could be of use to you.
I'm currently building a license plate recognition engine for iSpy - I got much better results from Tesseract when I split the license plate into individual characters and built a new image with them laid out vertically with white space around them, like:
W
4
O
O
M
I think a big problem Tesseract has is that it tries to make words out of horizontal letters and numbers, and in the case of license plates with letters and numbers mixed up, it will decide that a number is a letter or vice versa. Feeding it an image with the characters spaced vertically makes it treat them as individual characters instead of text.
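Here's a rough sketch of how you might build that vertical image with System.Drawing (it assumes you've already segmented the plate into individual character bitmaps):

// Rough sketch: stack pre-segmented character images vertically with white
// space around them, so Tesseract sees one character per "line".
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

static class PlateImage
{
    public static Bitmap StackVertically(IList<Bitmap> chars, int padding)
    {
        int width = chars.Max(c => c.Width) + 2 * padding;
        int height = chars.Sum(c => c.Height + padding) + padding;

        var result = new Bitmap(width, height);
        using (var g = Graphics.FromImage(result))
        {
            g.Clear(Color.White); // white background around every character
            int y = padding;
            foreach (Bitmap c in chars)
            {
                g.DrawImage(c, (width - c.Width) / 2, y); // centered horizontally
                y += c.Height + padding;
            }
        }
        return result;
    }
}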
A great read! http://robotics.usc.edu/publications/downloads/pub/635/
About your skew problem for license plates:
Issue: When OCR input is taken from a hand-held camera or other imaging device whose perspective is not fixed like a scanner, text lines may get skewed from their original orientation [13]. Based on our experiments, feeding such a rotated image to our OCR engine produces extremely poor results.
Proposed Approach: A skew detection process is needed before calling the recognition engine. If any skew is detected, an auto-rotation procedure is performed to correct the skew before processing text further. While identifying the algorithm to be used for skew detection, we found that many approaches, such as the one mentioned in [13], are based on the assumption that documents have set margins. However, this assumption does not always hold in our application. In addition, traditional methods based on morphological operations and projection methods are extremely slow and tend to fail in the presence of camera-captured images. In this work, we choose a more robust approach based on the Branch-and-Bound text line finding algorithm (RAST algorithm) [25] for skew detection and auto-rotation. The basic idea of this algorithm is to identify each line independently and use the slope of the best scoring line as the skew angle for the entire text segment. After detecting the skew angle, rotation is performed accordingly. Based on our experiments, we found this algorithm to be highly robust and extremely efficient and fast. However, it suffered from one minor limitation in the sense that it failed to detect rotation greater than 30°. We also tried an alternate approach, which could detect any angle of skew up to 90°. However, this approach was based on the presence of some sort of cross on the image. Due to the lack of extensibility, we decided to stick with the RAST algorithm.
Tesseract 3.0x, by default, penalizes combinations that aren't words and aren't common words. The FAQ describes a method to increase its aversion to such nonsense; you might find it helpful to do the opposite and turn off the penalty for rare or nonexistent words, as described (inversely) here:
http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?
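In config-file terms, I believe the inverse boils down to zeroing Tesseract 3's dictionary-penalty parameters; a sketch (verify the exact names against the FAQ for your Tesseract version):

language_model_penalty_non_freq_dict_word 0
language_model_penalty_non_dict_word 0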
If anyone from the future comes across this question: there is a tool called jTessBoxEditor that makes teaching Tesseract a breeze. All you do is point it at a folder containing sample images, then click a button, and it creates your *.traineddata file for you.
ABCocr .NET uses Tesseract 3, so that might be appropriate if you need the latest code under .NET.