Count distinct strings in C# code

I need to estimate the localization effort for a legacy project. I'm looking for a tool that I could point at a directory, and it would:
Parse all *.cs files in the directory structure
Extract all C# string literals from the code
Count total number of occurrences of the strings
Do you know any tool that could do that? Writing it would be simple, but if some time can be saved, then why not save it?

Use ILDASM to disassemble your .DLL / .EXE.
I use the option to dump everything, and you get an .il file with a "User Strings" section:
User Strings
-------------------------------------------------------
70000001 : (14) L"Starting up..."
7000001f : (12) L"progressBar1"
70000039 : (21) L"$this.BackgroundImage"
70000065 : (10) L"$this.Icon"
7000007b : ( 6) L"Splash"
Now, if you want to know how many times a certain string is used, search for an "ldstr" instruction like this:
IL_003c: /* 72 | (70)000001 */ ldstr "Starting up..." /* 70000001 */
I think this will be a lot easier to parse than C#.
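As a rough sketch of that counting step: the class below scans the text of an IL dump for ldstr instructions and tallies each literal. The class name LdstrCounter and the regular expression are my own assumptions, not part of ILDASM. It only matches the quoted form shown above, so a literal emitted in any other form (for example as a byte array) would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LdstrCounter
{
    // Matches: ldstr "...", allowing backslash-escaped characters inside the quotes.
    private static readonly Regex Ldstr =
        new Regex(@"ldstr\s+""((?:[^""\\]|\\.)*)""", RegexOptions.Compiled);

    // Returns a map from each string literal to the number of ldstr instructions that load it.
    public static Dictionary<string, int> Count(string ilText)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        foreach (Match m in Ldstr.Matches(ilText))
        {
            string literal = m.Groups[1].Value;
            counts.TryGetValue(literal, out int n); // n stays 0 when the literal is new
            counts[literal] = n + 1;
        }
        return counts;
    }
}
```

You would feed it something like File.ReadAllText on the dumped .il file. The number of keys is the distinct-string count and the sum of the values is the total number of occurrences.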

Doing a quick search, I found the following tool that may or may not be useful to you.
http://www.devincook.com/goldparser/
I also found another SO user who was trying to do something similar.
Regex to parse C# source code to find all strings

Well, if you have hardcoded strings, you first need to know what your i18n effort is (un-hardcoding them could be quite painful). Another issue: you need to count translatable words, not distinct strings, because word count is the input for translation providers. And even though a string might seem duplicated, it could be translated differently depending on the context, so you don't need to care about "distinct"; you just have to count all the words. That's how localization works, in my experience.

In most development work, you should keep your strings outside your program source code. In your case, could you spare the effort to extract the strings into a resource file?
If so, then you can make use of the default localization solution in .NET, i.e.
resource.resx,
resource.fr.resx,
resources.es.resx
store strings for different locales.
Updated:
The actual implementation depends on your project architecture/technology. Resource files aren't necessarily the best way to do this, but they are the easiest and the recommended way in .NET.
Like in this article
A few more tutorials

Related

What's a performance friendly way to pre-store 412.000 strings so they're searchable in Unity?

I have a CSV file with 412.000 strings in that I would like to pre-store locally so that I can deploy to Android and iOS. The game must then be able to look through these strings to check if there's a match based on user input.
The only viable solution that I can see would be SQLite. I haven't come across a very good SQLite solution for Unity yet.
Is there a built-in solution in Unity that I am overlooking?
The solution has to work locally. No HTTP calls.

400,000 strings is absolutely trivial.
Just put them in a dictionary (list, whatever is relevant and that you prefer).
It's a total non-issue.
It's likely you would just load them from a text file, easy as pie.
public TextAsset theTextFile;
(Just drag to the link in the Inspector, like any texture or similar.)
you can then very easily read that file as, say, JSON. (Just use JsonUtility. You can find numerous examples of this in SO and elsewhere.) For example,
Blah bb = JsonUtility.FromJson<Blah>(theTextFile.text);
yourDict = bb.fieldname.ToDictionary(i => i.tag, i => i);
Note that you mention "memory" and so on. It's irrelevant here: the data you are talking about is a fraction of the size of any small image, so it's a non-issue and you don't have to think about it. The hardware/software system will handle it.
P.S. ...
If you literally want to use CSV, it's easy. I suggest you ask a new question giving the details of your file and so on, so you can get an exact answer.
Note that for a plain "is this string in the list?" check you'd just use a HashSet rather than a Dictionary. It's even easier.
It's just something like:
var wordList = theTextFile.text.Split('\n');
You can google many examples!
https://stackoverflow.com/a/9791488/294884
http://answers.unity.com/answers/397537/view.html
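To make the HashSet idea concrete, here is a minimal sketch in plain C#. It assumes one entry per line and case-insensitive matching; both are guesses about the data. WordLookup is a hypothetical helper name, and in Unity the input string would be theTextFile.text.

```csharp
using System;
using System.Collections.Generic;

public static class WordLookup
{
    // Builds a case-insensitive set from newline-separated text (e.g. TextAsset.text).
    public static HashSet<string> Build(string text)
    {
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var line in text.Split('\n'))
        {
            var word = line.Trim(); // also drops the '\r' left over from Windows line endings
            if (word.Length > 0) set.Add(word);
        }
        return set;
    }
}
```

Build the set once at startup (var words = WordLookup.Build(theTextFile.text);) and then test user input with words.Contains(input.Trim()).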

How to find minimum replacement strings or regex to convert string to another string

OK, the title may not be accurate, but it is the best I could come up with.
My question is this:
Example 1
see , saw
I can convert see to saw like this:
replace ee with aw
string srA = "see";
string srB = "saw";
srA = srA.Replace("ee", "aw"); // srA is now equal to srB
Or, let's say
show , shown
add n to the original string
What I want is to generate such procedures for any pair of compared strings, with the minimum length of code.
How could I do this? Can I automatically generate regexes to apply and convert?
c# 6

Check DiffPlex and see if it is what you need. If you want to create a custom algorithm instead of using a third-party library, just go through the code; it's open source.
You might also want to check this work for optimizations, but it might get complicated.
Then there's also Diff.NET.
Also this blog post is part of a series in implementing a diff tool.
If you're simply interested in learning more about the subject, your googling efforts should be directed to the Levenshtein algorithm.
I can only assume what your end goal is, and the time you're willing to invest in this, but I believe the first library should be enough for most needs.
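For reference, the Levenshtein distance mentioned above can be sketched as follows. This is the textbook dynamic-programming version, not code taken from any of the libraries listed, and it returns only the number of single-character edits rather than the edit script itself.

```csharp
using System;

public static class EditDistance
{
    // Levenshtein distance computed with two rolling rows of the DP table.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j; // "" -> b[0..j) takes j inserts

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i; // a[0..i) -> "" takes i deletes
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                    prev[j - 1] + cost);                    // substitute or keep
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }
}
```

For the question's examples, Levenshtein("see", "saw") is 2 (two substitutions) and Levenshtein("show", "shown") is 1 (one insertion). Recovering the actual replace/add steps requires keeping the full table and walking it backwards, which is what the diff libraries do for you.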

Compress a short but repeating string

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:
http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
I want to shorten that string:
http://site.com/app?things=somekindofencoding
The string isn't terribly long, varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition so compression should be possible.
I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.
Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.

If you can express the data in BNF, you could construct a parser for it. Instead of sending the data, you could send the AST, where each node is identified by one character (or several if you have a lot of different nodes).
In your example we could have
files : file files
|
file : path id
path : itemsthing
| filesitem
| stuffthingsitem
You could then represent a list of files as path[id1,id2,...,idn], using 0, 1, 2 for the paths, with the input being:
/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7
You'd then end up with ?things=2[123,456,789]1[1,46,7]
where /stuff/things/item is represented by 2 and /files/item is represented by 1, and each number within [...] is an id. So 2[123] would expand to /stuff/things/item123.
EDIT: The approach does not have to be static. If you have to discover the repeated items dynamically, you can use the same approach and pass the map between identifier and token. In that case the above example would be
?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/,1=/files/item
which, if the grammar is this simple, would of course do better as
?things=/stuff/things/[123,456,789]/files/item[1,46,7]
Compressing the repeated part to less than the unique value is possible with such a short string, but it will most likely have to rely on constraining the possible values, or risk actually increasing the size when "compressing".
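A minimal sketch of the dynamic prefix[id,id,...] variant. It assumes every path ends in a run of digits with a non-empty prefix before them, and that paths never contain '[', ']' or ','. PrefixPacker and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class PrefixPacker
{
    // Splits "/stuff/things/item123" into prefix "/stuff/things/item" and id "123".
    private static readonly Regex Item = new Regex(@"^(.*?)(\d+)$");
    // Matches one "prefix[id,id,...]" group in the packed form.
    private static readonly Regex Group = new Regex(@"([^\[\]]+)\[([\d,]+)\]");

    public static string Pack(string list)
    {
        var sb = new StringBuilder();
        string current = null;
        foreach (var path in list.Split(','))
        {
            Match m = Item.Match(path);
            if (!m.Success) throw new FormatException("No trailing id in: " + path);
            string prefix = m.Groups[1].Value, id = m.Groups[2].Value;
            if (prefix != current)
            {
                // A new prefix closes the previous group and opens the next one.
                if (current != null) sb.Append(']');
                sb.Append(prefix).Append('[').Append(id);
                current = prefix;
            }
            else sb.Append(',').Append(id);
        }
        if (current != null) sb.Append(']');
        return sb.ToString();
    }

    public static string Unpack(string packed)
    {
        var paths = new List<string>();
        foreach (Match g in Group.Matches(packed))
            foreach (var id in g.Groups[2].Value.Split(','))
                paths.Add(g.Groups[1].Value + id);
        return string.Join(",", paths);
    }
}
```

Only consecutive paths with the same prefix are grouped, so the original order survives the round trip. A list where no two neighbours share a prefix comes out slightly longer than it went in.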

You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters, and it does look for and take advantage of repeated strings. I haven't tried it, but you could also see if smaz works for your data.
I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.
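In .NET, System.IO.Compression.DeflateStream writes raw deflate data without zlib or gzip framing, so one way to try this is sketched below. The URL-safe Base64 step is my own addition to make the bytes usable in a query string, and UrlPacker is a hypothetical name. Base64 adds roughly a third to the byte count, so measure on real URLs before assuming the result is shorter.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class UrlPacker
{
    // Raw-deflates the UTF-8 bytes, then encodes them as URL-safe Base64 without padding.
    public static string Pack(string value)
    {
        byte[] raw = Encoding.UTF8.GetBytes(value);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                deflate.Write(raw, 0, raw.Length);
            } // disposing the DeflateStream flushes the final block
            return Convert.ToBase64String(output.ToArray())
                .TrimEnd('=').Replace('+', '-').Replace('/', '_');
        }
    }

    // Reverses Pack: restores standard Base64, then inflates back to the original string.
    public static string Unpack(string packed)
    {
        string b64 = packed.Replace('-', '+').Replace('_', '/');
        b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '=');
        using (var input = new MemoryStream(Convert.FromBase64String(b64)))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var result = new MemoryStream())
        {
            deflate.CopyTo(result);
            return Encoding.UTF8.GetString(result.ToArray());
        }
    }
}
```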

C# string translation

Does C# offer a way to translate strings on the fly, or something similar?
I'm now working on some legacy code, which has some parts like this:
section.AddParagraph(String.Format("Premise: {0}", currentReport.Tenant.Code));
section.AddParagraph(String.Format("Description: {0}", currentReport.Tenant.Name));
section.AddParagraph();
section.AddParagraph(String.Format("Issued: #{0:D5}", currentReport.Id));
section.AddParagraph(String.Format("Date: {0}", currentReport.Timestamp.ToString(
"dd MMM yyyy", CultureInfo.InvariantCulture)));
section.AddParagraph(String.Format("Time: {0:HH:mm}", currentReport.Timestamp));
So, I want to translate these strings on the fly based on some substitution table (as Qt does, for example).
Is this possible (perhaps using something C# already has, or some post-processing, maybe with PostSharp)?
Does some generic internationalization approach for applications built with C# (from scratch) exist?

Does some generic internationalization approach for applications built with C# (from scratch) exist?
Yes, using resource files. And here's another article on MSDN.

In the C# project I currently work on, we wrote a helper function that works like this:
section.AddParagraph(I18n.Translate("Premise: {0}", currentReport.Tenant.Code));
section.AddParagraph(I18n.Translate("That's all"));
At build time, a script searches for all I18n.Translate invocations, as well as all UI controls, and populates a table with all the English phrases. This table gets translated.
At runtime, the English text is looked up in a dictionary and replaced with the translated text.
Something similar happens to our WinForms dialog resources: they are constructed in English and then translated using the same dictionary.
The biggest strength of this scheme is also its biggest weakness: if you use the same string in two places, it gets translated the same way. This shortens the file you send to the translator, which helps to reduce cost. If you ever need to force a different translation of the same English word, you need to work around that. In the 4 or so years we have had the system, we have never needed to. There are other benefits too: you read the English UI text inline with the source (so it isn't hiding behind an identifier you need to name), and if you delete code, its text is automatically removed from the translated resources as well.
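The runtime half of such a helper might look roughly like this. It is a guess at the shape, not the poster's actual code: the Load method, the fallback to English, and the use of string.Format are all assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class I18n
{
    // English phrase -> translated phrase; assumed to be filled once at startup.
    private static Dictionary<string, string> table =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public static void Load(IDictionary<string, string> translations)
    {
        table = new Dictionary<string, string>(translations, StringComparer.Ordinal);
    }

    // Looks up the English format string, falls back to the English text
    // when no translation exists, then applies the format arguments.
    public static string Translate(string english, params object[] args)
    {
        string format = table.TryGetValue(english, out string translated) ? translated : english;
        return args.Length == 0 ? format : string.Format(format, args);
    }
}
```

Because the whole format string ("Premise: {0}") is the lookup key, the translator sees the placeholder and can move it within the sentence.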

Should we store format strings in resources?

For the project that I'm currently on, I have to deliver specially formatted strings to a 3rd party service for processing. And so I'm building up the strings like so:
string someString = string.Format("{0}{1}{2}: Some message. Some percentage: {3}%", token1, token2, token3, number);
Rather than hardcode the string, I was thinking of moving it into the project resources:
string someString = string.Format(Properties.Resources.SomeString, token1, token2, token3, number);
The second option is, in my opinion, not as readable as the first one: the person reading the code would have to pull up the string resources to work out what the final result should look like.
How do I get around this? Is the hardcoded format string a necessary evil in this case?

I do think this is a necessary evil, one I've used frequently. Something smelly that I do, is:
// "{0}{1}{2}: Some message. Some percentage: {3}%"
string someString = string.Format(Properties.Resources.SomeString
,token1, token2, token3, number);
..at least until the code is stable enough that I might be embarrassed having that seen by others.

There are several reasons that you would want to do this, but the only great reason is if you are going to localize your application into another language.
If you are using resource strings there are a couple of things to keep in mind.
Include format strings whenever possible in the set of resource strings you want localized. This will allow the translator to reorder the position of the formatted items to make them fit better in the context of the translated text.
Avoid having strings in your format tokens that are in your language. It is better to use these for numbers. For instance, the message:
"The value you specified must be between {0} and {1}"
is great if {0} and {1} are numbers like 5 and 10. If you are formatting in strings like "five" and "ten" this is going to make localization difficult.
You can get around the readability problem you are talking about by simply naming your resources well.
string someString = string.Format(Properties.Resources.IntegerRangeError, minValue, maxValue );
Evaluate whether you are generating user-visible strings at the right abstraction level in your code. In general I tend to group all the user-visible strings in the code as close to the user interface as possible. If some low-level file I/O code needs to report errors, it should do so with exceptions, which you handle in your application and map to consistent error messages. This also consolidates all of the strings that require localization instead of having them peppered throughout your code.
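The point about letting translators reorder format items can be illustrated with a small sketch. The two constants stand in for entries in Properties.Resources, and the second template is an invented example of a reordered string, not a real translation.

```csharp
using System;

public static class RangeMessages
{
    // Stand-ins for resource entries; both strings are hypothetical.
    public const string English = "The value you specified must be between {0} and {1}";
    public const string Reordered = "Upper limit {1}, lower limit {0}: choose a value in that range";

    // The calling code passes the arguments in one fixed order;
    // the template alone decides where each one appears.
    public static string Build(string template, int min, int max)
    {
        return string.Format(template, min, max);
    }
}
```

RangeMessages.Build(RangeMessages.Reordered, 5, 10) puts 10 before 5 in the output without any change to the calling code. That is only possible when the whole format string lives in the resource file.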

One thing you can do to speed up moving hardcoded strings into a resource file is to use CodeRush Xpress, which you can download for free here: http://www.devexpress.com/Products/Visual_Studio_Add-in/CodeRushX/
Once you write your string you can access the CodeRush menu and extract to a resource file in a single step. Very nice.
Resharper has similar functionality.

I don't see why including the format string in the program is a bad thing. Unlike traditional undocumented magic numbers, it is quite obvious what it does at first glance. Of course, if you are using the format string in multiple places it should definitely be stored in an appropriate read-only variable to avoid redundancy.
I agree that keeping it in the resources is unnecessary indirection here. A possible exception would be if your program needs to be localized, and you are localizing through resource files.

Yes, you can.
Now let's see how:
String.Format(Resource_en.PhoneNumberForEmployeeAlreadyExist,letterForm.EmployeeName[i])
This gives me a dynamic message every time.
By the way, I'm using ResXManager.
