Text files to test the functionality of a search engine - C#

To practice for an upcoming programming contest, I'm making a very basic search engine in C# that takes a query from the user (e.g. "Markov Decision Process") and searches through a couple of files to find the one most relevant to the query.
The application seems to be working (I used a term-document matrix algorithm).
But now I'd like to test the functionality of the search engine to see if it really is working properly. I tried saving a couple of Wikipedia articles as .txt files and testing on those, but I just can't tell whether it's working fast enough (even with some timers).
My question is: is there a website that provides a set of files to test a search engine on (along with the logically expected results)?
I'm testing with common sense so far, but it would be great to be sure of my results.
Also, how can I get a collection of .txt files (maybe 10 000+ files) about various subjects to see if my application runs fast enough?
I tried copying a few Wikipedia articles, but it would take way too much time to do. I also thought about making a script of some sort to do it for me, but I really don't know how to do that.
So, where can I find a lot of files on separate subjects?
Otherwise, how can I benchmark my application?
Note: I guess a simple big .txt file where each line represents a "file" about a subject would do the job too.

One source of text files would be Project Gutenberg. They supply CD/DVD images if you want to download thousands of files at once. (The page doesn't state it, but I would imagine they are plain .txt files inside the CD/DVD ISO.)

You can get Wikipedia pages by using a recursive function that loads the HTML of every page linked to from one starting page.
If you have some experience with C#, this should help you:
http://www.csharp-station.com/HowTo/HttpWebFetch.aspx
Then loop through the text, collect every instance of the string "<a href=\"", and recursively call the same method on each link. You should also use a counter to limit the recursion depth.
Also, to prevent OutOfMemoryExceptions, stop every so many iterations, write everything collected so far to a file, and then flush the old data from the string.
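A rough sketch of that idea, assuming WebClient for the download; the start URL, the output file name and the page limit are placeholders, and things like a user agent or error handling are left out:

// Rough sketch of the crawler described above. The start URL, output file
// name and page limit are placeholders.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class WikiCrawler
{
    const int MaxPages = 100;                       // hard limit instead of unbounded recursion
    static readonly HashSet<string> Visited = new HashSet<string>();

    static void Main()
    {
        using (var writer = new StreamWriter("corpus.txt"))
            Crawl("https://en.wikipedia.org/wiki/Markov_decision_process", writer);
    }

    static void Crawl(string url, StreamWriter writer)
    {
        if (Visited.Count >= MaxPages || !Visited.Add(url)) return;

        string html;
        using (var client = new WebClient())
            html = client.DownloadString(url);

        // Write each page out immediately so the whole corpus is never held in memory.
        writer.WriteLine(html);

        // Collect every "<a href=" pointing at another article and recurse.
        foreach (Match m in Regex.Matches(html, "<a href=\"(/wiki/[^\"#:]+)\""))
            Crawl("https://en.wikipedia.org" + m.Groups[1].Value, writer);
    }
}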

You can use the datasets from GroupLens Research's site.
Some samples: movies, books

Related

C# Renaming a lot of files using formatted strings based on calculations

I am new around here and learning C# programming.
I am working on modding the game Train Simulator, and one of the tasks I've run into is renaming files and their contents for the routes used in the game. All the files contain codes that tell the game what objects should be placed where, etc.
The first step that I am asking about here is renaming the files themselves. The files are arranged like a coordinate system, like this:
The files are named +000000+000000.bin and counting up from there.
As you can see, the file names use 6-digit numbers that start with a + or -, like an XY grid relative to a point. My goal here is to mathematically increase or decrease the numbers in the XY grid and make sure all the files are renamed according to the input provided by the user.
EDIT: The user would be moving point 0 somewhere else. If the file +000000+000000.bin should be renamed +000025-000067.bin, then the file +000000+000001.bin should be renamed +000025-000066.bin, -000003-000006.bin should become +000022-000070.bin, and so on, by adding 25 to the first half and subtracting 67 from the second half of the filename.
If there is someone out there who has a good suggestion about how to do this, I would greatly appreciate it. English isn't my first language and C# is pretty new to me.
It is generally recommended to make an honest effort to find a solution before posting a question. If you have not already, you might want to read "how to ask a good question".
Some pointers to get you started:
List all the files. See the GetFiles method.
Split the filename into the x and y coordinates. See Substring.
Parse the x and y coordinates. See Parse.
Apply the offset to the x and y coordinates; I hope you will manage this part.
Format the new filename. See String.Format, format strings, and how to include the sign.
Finally, rename the file.
This should give you enough information to start on a solution. If you run into specific issues, it might be a good idea to post a more specific question about whatever you have an issue with.
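Put together, a minimal sketch of those steps might look like the following; the folder path and the example offsets are made up, and error handling is left out:

// Minimal sketch of the steps above; path and offsets are placeholders.
using System;
using System.IO;

class TileRenamer
{
    static void Main()
    {
        int offsetX = 25, offsetY = -67;                      // offsets entered by the user

        foreach (string path in Directory.GetFiles(@"C:\Route\Tiles", "*.bin"))
        {
            string name = Path.GetFileNameWithoutExtension(path);   // e.g. "+000000+000001"

            // The first 7 characters are the signed X coordinate, the next 7 are Y.
            int x = int.Parse(name.Substring(0, 7));
            int y = int.Parse(name.Substring(7, 7));

            // "+000000;-000000" forces a sign and pads to six digits.
            string newName = string.Format("{0:+000000;-000000}{1:+000000;-000000}.bin",
                                           x + offsetX, y + offsetY);

            File.Move(path, Path.Combine(Path.GetDirectoryName(path), newName));
        }
    }
}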

Multilingual winforms application

I want my C# (WinForms) application to be multilingual. My idea is:
I will have my translations in some text file(s); each "sentence" or phrase will have its own unique ID (an integer)
at start-up of the app I will iterate through all controls on all the forms in my app (I suppose this should be done in each form's 'Load' event handler) and test each control for its type
i.e. if it is a button or menu item, I will read its default 'Text' property, locate this phrase in one text file, read its unique ID, and through this ID locate the translated phrase in the (other) text file
then I will overwrite the control's 'Text' property with the translated phrase
This enables me to have a separate text file with phrases for each and every language (easy to maintain an individual translation in the future: only one .txt file)
I would like to hear from you professionals whether there is some better / easier / faster / more 'pro' way to accomplish this.
What format of translation text file should I use (plain text, XML, INI...)? It should be human readable. I don't know whether finding a phrase in XML would be faster in C# than going line by line through a plain text file and searching for the given phrase/string.
EDIT - I want users (the community) to be able to translate my app into their native languages without my interaction (which means Microsoft's resources are out of the game)
Thank you very much in advance.
CLOSED - My solution:
Looks like I'm staying with my original concept: every phrase will be on a separate line of a plain text file in Unicode encoding (with an ID at the beginning of the line). I was thinking about dropping the IDs and using only the line numbers, but that would require an advanced text editor (Notepad shows no line numbers), and if somebody accidentally hit the shortcut for "Delete line" and didn't notice, the whole app would go crazy :)
//sample of my translation text file for one language
0001:Text of my first button
0002:Text of my first label
0003:MessageBox title text
...etc etc
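A minimal sketch of loading such a file and applying it in a form's Load handler might look like this; the file name is made up, and it uses each control's Tag property to hold the phrase ID, which is a small simplification of the look-up-by-default-Text idea described in the question:

// Sketch of loading an "ID:phrase" file and applying it to a form's controls.
// The file name is a placeholder, and the control's Tag property is used to
// hold the phrase ID (a simplification of matching on the default Text).
using System;
using System.Collections.Generic;
using System.IO;
using System.Windows.Forms;

static class Translator
{
    static readonly Dictionary<string, string> Phrases = new Dictionary<string, string>();

    public static void LoadLanguage(string path)              // e.g. "lang.de-DE.txt"
    {
        foreach (string line in File.ReadAllLines(path))
        {
            int colon = line.IndexOf(':');
            if (colon > 0)
                Phrases[line.Substring(0, colon)] = line.Substring(colon + 1);
        }
    }

    // Walk the controls recursively and replace the Text of any control
    // whose Tag holds a known phrase ID.
    public static void Apply(Control root)
    {
        foreach (Control c in root.Controls)
        {
            string id = c.Tag as string;
            string phrase;
            if (id != null && Phrases.TryGetValue(id, out phrase))
                c.Text = phrase;
            Apply(c);
        }
    }
}

You would call Translator.LoadLanguage once at start-up and Translator.Apply(this) in each form's Load handler.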
Why not use Microsoft's resource file method? You won't need to write any complex custom code this way.
It sounds like you are somewhat invested in the "one text file" idea, or else you would probably lean towards the standard way and use Microsoft's resource files. Handling for resource files is built in, and the controls are already keyed to support it. But, as you are probably aware, each translation goes into its own resource file, so you are left juggling multiple files to distribute with your app.
With a custom, roll-your-own solution, you can probably trim it down to one Unicode file. But you will have to loop through the controls to set the text, and then look up the text for each one. As you add control types, you will have to add support for them in your code. Also, your text file will grow in large chunks as you add languages, so you will have to account for that as well.
I still lean towards using the resource files, but your phrasing suggests you already don't like that solution, so I don't think I have changed your mind.
Edit:
Since you want the solution separated from the app to avoid having to recompile, you could distribute SQL-CE database files for each language type. You can store the text values in NVARCHAR fields.
That will make your querying easier, but raises the self-editing requirements. You would have to provide a mechanism for users to add their own translation files, as well as edit screens.
Edit 2:
Driving towards a solution. :)
You can use a simple delimited text file, encoded in Unicode, with a convention based naming system. For example:
en-US.txt
FormName,ControlName,Text
"frmMain","btnSubmit","Save"
"frmMain","lblDescription","Description"
Then you can use the CurrentUICulture to determine which text file to load for localization, falling back to en-US if no file is found. This lets the users create (and also change!) their own localization files using common text editors and without any steep learning curve.
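A minimal sketch of that convention-based loading, assuming the simple layout above and handling quoting only by trimming the surrounding double quotes:

// Sketch of the convention-based loading described above; CSV quoting/escaping
// is handled only by trimming the surrounding double quotes.
using System;
using System.Globalization;
using System.IO;
using System.Windows.Forms;

static class Localizer
{
    public static void Localize(Form form)
    {
        string file = CultureInfo.CurrentUICulture.Name + ".txt";    // e.g. "en-US.txt"
        if (!File.Exists(file)) file = "en-US.txt";                  // fall back to English

        foreach (string line in File.ReadAllLines(file))
        {
            string[] parts = line.Split(',');                        // FormName,ControlName,Text
            if (parts.Length != 3 || parts[0].Trim('"') != form.Name) continue;

            // Find the named control anywhere on the form and set its text.
            Control[] found = form.Controls.Find(parts[1].Trim('"'), true);
            if (found.Length > 0) found[0].Text = parts[2].Trim('"');
        }
    }
}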
If you want the users to edit the translations through your application while keeping things simple and quick, a resource file is best. If you don't like that, the second best option is an XML file.
Still, to answer your question on how to do it best with a plain text file, it is pretty straightforward: you just make sure that your unique identifiers (probably ints) are in order (validate this before using the file). Then, to search quickly, you use the technique of halving.
You look for number X, so you go to the file's middle line. If that line's ID is greater than X, you go to the line at ¼ of the file, and so on.
You keep cutting the range in two until you land on the right line. This is binary search, the fastest general way to look something up in sorted data.
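A small sketch of that lookup, assuming the lines are already sorted by their numeric ID prefix (as in the "0001:..." sample above) and that the file path is a placeholder:

// Binary search over the sorted "ID:phrase" lines.
using System;
using System.IO;

static class PhraseFile
{
    public static string Find(string path, int id)
    {
        string[] lines = File.ReadAllLines(path);    // assumed sorted by ID
        int lo = 0, hi = lines.Length - 1;

        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            int colon = lines[mid].IndexOf(':');
            int midId = int.Parse(lines[mid].Substring(0, colon));

            if (midId == id) return lines[mid].Substring(colon + 1);
            if (midId < id) lo = mid + 1; else hi = mid - 1;
        }
        return null;                                 // ID not found
    }
}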
NOTE: Beware of the things that are external to the application but need translation: External file items, information contained in a database, etc.

How to detect if a JS is packed already

Hey guys, I am writing a Windows application in C# that minifies CSS files and packs JS files as a batch job. One hurdle for the application: what if the user selects a JavaScript file that has already been packed? Packing it again will end up increasing the file size, defeating my purpose entirely!
Is opening the file and looking for the string eval(function(p,a,c,k,e,d) enough? My guess is no, as there are other JS packing methods out there. Help me out!
One might suggest that you compare the size of the pre and post packed JS and return/use the smaller of the two.
UPDATE based on question in comment by GPX on Sep 30 at 1:02
The following is a very simple way to tell. There may be different, or more accurate, ways of determining this, but this should get you going in the right direction:
var unpackedJs = File.ReadAllText(...);                 // read the original JS
var unpackedSize = unpackedJs.Length;
var packedJs = ...                                      // your packing routine
// keep whichever version is smaller
File.WriteAllText(pathToFile, unpackedSize < packedJs.Length ? unpackedJs : packedJs);
I would check the file size and the lines of code (e.g. average line length). These two pieces of information should be enough to tell whether the code is already sufficiently compact.
Try this demo.
I direct you to a post that suggests packing is bad.
http://ejohn.org/blog/library-loading-speed/
Rather, use minification. The Google Closure Compiler can do this via a REST web service. Only use a .min.js extension for minified (not packed) files.
Gzip will do a better job, and it is decompressed by the browser. It's best to switch on gzip compression on the server, which will compress a minified file down even further.
Of course this raises the question 'How can I tell if my JavaScript is already minified?'
When you create/save a minified file, use the standard file name convention of "Filename.min.js". Then when they select the file, you can check for that as a reliable indicator.
I do not think it is wise to go overboard on the dummy-proofing. If a user (who is a developer, at that), is dumb enough to double-pack a file, they should experience problems. I know you should give them the benefit of the doubt, but in this case it does not seem worth the overhead.
If you're using a safe minification routine, your output should be the same as the input. I would not recommend the routine you mention. MS's Ajax Minifier is a good tool and even provides DLLs to use in your project. This would make your concern a non-issue.
I would suggest adding a '.min' prefix to the extension of the packed file, something like 'script.min.js'. Then just check the file name.
Other than that, I would suggest checking how long the lines are and how many spaces are used. Minified/packed JS typically has almost no spaces (apart from those inside strings) and very long lines.
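One possible way to code that heuristic; the thresholds are rough guesses, not established values:

// Heuristic check along the lines suggested above: very long lines and very
// little whitespace usually mean the file is already minified/packed.
using System;
using System.IO;
using System.Linq;

static class JsInspector
{
    public static bool LooksMinified(string path)
    {
        string text = File.ReadAllText(path);
        if (text.Length == 0) return false;

        string[] lines = text.Split('\n');
        double avgLineLength = lines.Average(l => (double)l.Length);
        double whitespaceRatio = text.Count(char.IsWhiteSpace) / (double)text.Length;

        // Thresholds are rough guesses and may need tuning.
        return avgLineLength > 200 || whitespaceRatio < 0.05;
    }
}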

Is there a Sourcesafe API to get total lines of code in source control?

I want to be able to get the projects I have in Sourcesafe and their total lines of code (perhaps also with total number of classes, etc). Is there an SDK for Sourcesafe (I use 2005 edition) which will allow me to do this?
Or is there a document in Sourcesafe which lists all the projects in SS? Using this, I could work towards getting the line count.
Thanks
There is no specific line-counting API. There is an API to access the files, but it's way too slow.
It would probably be better if you set up a shadow folder on the root project (this is done via the admin tool). A simple app could then open all the source files recursively from the shadow folder and do some line counting.
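A rough sketch of that second step, with a made-up shadow-folder path and file mask:

// Count lines in every source file under the shadow folder.
using System;
using System.IO;

class LineCounter
{
    static void Main()
    {
        long total = 0;
        foreach (string file in Directory.GetFiles(@"\\server\ShadowFolder", "*.cs",
                                                   SearchOption.AllDirectories))
        {
            total += File.ReadAllLines(file).Length;
        }
        Console.WriteLine("Total lines: " + total);
    }
}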
I realize this is not exactly what you're asking for, but you might be able to adapt the following to suit your needs:
http://richnewman.wordpress.com/2007/07/01/c-and-vbnet-line-count-utility/
I've used this before, and it works very well. It differentiates between comments and auto-generated code as well.
You will need to get each file and count the number of lines yourself.
I don't need an API to count the number of lines in a class. This is easy to do and I know several ways.
Rather, it would be good to get a collection of the files stored in SS, so I can run the line count on each and every file.
However, I guess I could just label my root parent dir with a tag like projectnameISSOURCESAFECHECKEDIN, and then for every such folder (and only the parent folder) drill in and count the lines in the classes. Not a perfect solution, but effective, and with no dependency on any API.
Anthony,
Your solution is also credible. :)

Searching directories for tons of files?

I'm using MSVE, and I have my own tiles I'm displaying in layers on top. Problem is, there's a ton of them, and they're on a network server. In certain directories, there are something on the order of 30,000+ files. Initially I called Directory.GetFiles, but once I started testing in a pseudo-real environment, it timed out.
What's the best way to programmatically list, and iterate through, this many files?
Edit: My coworker suggested using the MS indexing service. Has anyone tried this approach, and (how) has it worked?
I've worked on a SAN system in the past with telephony audio recordings which had issues with the number of files in a single folder - that system became unusable somewhere near 5,000 files (on Windows 2000 Advanced Server, with an application in C# .NET 1.1) - the only sensible solution that we came up with was to change the folder structure so that there was a more reasonable number of files per folder. Interestingly, Explorer would also time out!
The convention we came up with was a structure that broke the files up by year, month and day - but that will depend upon your system and whether you can control the directory structure...
Definitely split them up. That said, stay as far away from the Indexing Service as you can.
None, really. .NET relies on underlying Windows API calls that really, really hate that number of files themselves.
As Ronnie says: split them up.
You could use DOS?
DIR /s/b > Files.txt
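If you go that route from C#, a sketch of shelling out to DIR and streaming its output (rather than waiting for one huge array from Directory.GetFiles) could look like this; the UNC path is a placeholder:

// Stream the output of "dir /s /b" one path at a time.
using System;
using System.Diagnostics;

class DirLister
{
    static void Main()
    {
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /b \"\\\\server\\tiles\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process p = Process.Start(psi))
        {
            string line;
            while ((line = p.StandardOutput.ReadLine()) != null)
            {
                // Each line is one full file path; handle it as it arrives.
                Console.WriteLine(line);
            }
            p.WaitForExit();
        }
    }
}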
You could also look at either indexing the files yourself, or getting a third-party app like Google Desktop or Copernic to do it and then interfacing with their index. I know Copernic has an API that you can use to search for any file in its index, and it also supports mapping network drives.
