Is sorting in LINQ by Ascii code? - c#

In my LINQ to Entities query I have a .orderby f.Description.Trim() command
The reason for .Trim() is that some of the data coming from DB have a bunch of white spaces at the beginning of them so I wanted to trim those so they won't affect sorting.
Now it sorts correctly but I see something like this in the result:
[Queries - Blah]
Action
Adhere
Azalia
Then I looked up ASCII code of "[" and it is 91 and "A" is 65 so how come that one showed up first? Maybe there are some other things in the code causing this and sort is fine?

OrderBy uses the default comparer for strings, which is culture-sensitive rather than an ordinal (ASCII or, more precisely, Unicode code point) comparison. The result depends on the current culture you are using.
And, if you think about it... if you were sorting entries for an appendix or index, symbols come before letters (at least in English).
If you want to sort by "raw ascii value", use
...OrderBy(s => s, StringComparer.Ordinal)
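A minimal sketch of the difference, assuming the strings are already in memory (LINQ to Objects). The class and method names are made up for illustration. As far as I know, the OrderBy overload that takes a comparer is not translated by LINQ to Entities, so there you would have to materialize the rows first (for example with AsEnumerable()) before sorting this way.

```csharp
using System;
using System.Linq;

public static class SortDemo
{
    // Hypothetical helper: trims, then sorts by UTF-16 code unit value,
    // so "[" (91) lands after "A" (65).
    public static string[] SortOrdinal(string[] items)
    {
        return items.Select(s => s.Trim())
                    .OrderBy(s => s, StringComparer.Ordinal)
                    .ToArray();
    }

    // Hypothetical helper: culture-sensitive ordering, which is what OrderBy
    // falls back to for strings in LINQ to Objects when no comparer is passed.
    public static string[] SortByCurrentCulture(string[] items)
    {
        return items.Select(s => s.Trim())
                    .OrderBy(s => s, StringComparer.CurrentCulture)
                    .ToArray();
    }
}
```

With the data from the question, SortOrdinal puts "[Queries - Blah]" last, while the culture-sensitive version typically puts it first because punctuation sorts before letters.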

If the actual expression can be compiled to a store expression, then the ordering will be done as implemented by your store.
So: the result will depend on the collation of the database, table and column.

Related

Linq to Entity comparing strings ignores white spaces

When using LINQ to Entities, string comparisons ignore trailing white space.
In my table, I have an nchar(10) column, so any saved value shorter than 10 characters is padded with trailing spaces. Below I am comparing the ncharText column with the string "Four". Even though ncharText actually holds "Four" followed by six spaces, the comparison matches and the "result" variable contains 1 record.
TestEntities1 entity = new TestEntities1();
var result = entity.Table_1.Where(e => e.ncharText == "Four");
Is there an explanation for this and a way to work around it, or am I going to have to call ToList on my query before any comparisons, like so?
var newList = result.ToList().Where(e => e.ncharText == "Four");
This code now correctly returns 0 records as it takes into account white spaces. However, calling to list before a comparison can result in loading a large collection into memory which won't end up being used.
This answer explains why.
SQL Server follows the ANSI/ISO SQL-92 specification (Section 8.2, <Comparison Predicate>,
General rules #3) on how to compare strings with spaces. The ANSI
standard requires padding for the character strings used in
comparisons so that their lengths match before comparing them. The
padding directly affects the semantics of WHERE and HAVING clause
predicates and other Transact-SQL string comparisons. For example,
Transact-SQL considers the strings 'abc' and 'abc ' to be equivalent
for most comparison operations.
The only exception to this rule is the LIKE predicate. When the right
side of a LIKE predicate expression features a value with a trailing
space, SQL Server does not pad the two values to the same length
before the comparison occurs. Because the purpose of the LIKE
predicate, by definition, is to facilitate pattern searches rather
than simple string equality tests, this does not violate the section
of the ANSI SQL-92 specification mentioned earlier.
Internally LINQ is just making SQL queries against your database.
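To make the padding rule concrete, here is a small in-memory illustration. It is only a sketch of the rule quoted above, not how SQL Server implements it: the class and method names are made up, and it ignores collation (case and accent sensitivity).

```csharp
using System;

public static class PaddedCompare
{
    // Hypothetical helper mimicking the SQL-92 rule: pad the shorter operand
    // with trailing spaces to the length of the longer one, then compare.
    public static bool SqlStyleEquals(string left, string right)
    {
        int width = Math.Max(left.Length, right.Length);
        return string.Equals(left.PadRight(width), right.PadRight(width),
                             StringComparison.Ordinal);
    }
}
```

Under this rule "Four" followed by six spaces equals "Four", which is the behaviour seen when the comparison runs in the database. A plain C# == on the same two strings is false, which is why the ToList() version returns 0 records. Leading spaces still count in both cases.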

Identify problematic characters in a string

I want to be able to identify problematic characters in a string saved in my sql server using LINQ to Entities.
Problematic characters are characters which had problem in the encoding process.
This is an example of a problematic string : "testing�stringáאç".
In the above example only the � character is considered as problematic.
So for example the following string isn't considered problematic:"testingstringáאç".
How can I check this Varchar and identify that there are problematic chars in it?
Notice that my preferred solution is to identify it via a LINQ to entities query , but other solutions are also welcome - for example: some store procedure maybe?
I tried to play with Regex and with "LIKE" statement but with no success...
Check out the Encoding class.
It has a DecoderFallback property and an EncoderFallback property that let you detect or substitute bad characters found during decoding and encoding.
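A minimal sketch of both checks, assuming the data has been loaded into .NET rather than being tested inside the LINQ to Entities query itself. The class and method names are made up. The first check looks for U+FFFD, the replacement character that a lenient decoder substitutes for bad input. The second uses a UTF8Encoding constructed to throw on invalid bytes, which relies on the exception fallback.

```csharp
using System;
using System.Text;

public static class EncodingChecks
{
    // True if the text already contains U+FFFD, i.e. something was lost
    // when the string was decoded earlier.
    public static bool ContainsReplacementChar(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }

    // True if the raw bytes decode as UTF-8 without errors.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes) makes the decoder throw
        // instead of silently substituting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```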
.NET and NVARCHAR both use Unicode, so there is nothing inherently "problematic" (at least not for BMP characters).
So you first have to define what "problematic" is meant to mean:
Characters that are not mapped in the target code page:
Simply convert between encodings and check whether data is lost:
CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(MAX), @originalNVarchar)) = @originalNVarchar
Note that you can choose a specific SQL Server collation with the COLLATE clause rather than using the default database collation.
Characters that cannot be displayed due to the fonts used:
This cannot be easily done in .NET.
You can do something like this:
DECLARE @StringWithProblem NVARCHAR(20) = N'This is '+NCHAR(8)+N'roblematic';
DECLARE @ProblemChars NVARCHAR(4000) = N'%['+NCHAR(0)+NCHAR(1)+NCHAR(8)+']%'; --list all problematic characters here, wrapped in %[]%
SELECT PATINDEX(@ProblemChars, @StringWithProblem), @StringWithProblem;
That gives you the index of the first problematic character or 0 if none is found.

How to convert words to links?

I have an XML file with two properties: word and link.
How can I replace the words in a text with links using the XML information?
Ex.:
XML
<word>dog</word>
<link>http://www.dog.com</link>
Text: The dog is nice.
Result: The <a href="http://www.dog.com">dog</a> is nice.
Results OK.
The problems:
1- If the text has the word dogs the result is incorrect, because of the "s".
2- I've tested doing a split by space on the text to fix it, but if the word is a compound like new year the result is incorrect again.
Does anyone have any suggestions to do it and fix these problems (plural and compound words)?
Thanks for the help.
You can use Lucene.Net's contrib package Snowball for stemming (words->word , came->come , having->have etc.). But you will still have troubles with compound words
If you roll your own solution, I have had good success with the .NET pluralization capabilities:
http://msdn.microsoft.com/en-us/library/system.data.entity.design.pluralizationservices.pluralizationservice.aspx
Essentially, you can pass a word in its plural form and receive a singular version and vice versa.
This could be fairly intensive depending on how often the content changes, i.e. this wouldn't be a good choice for searching thousands of words in real time.
Assuming that you can pre-process/cache the results or that the source file is small, you could:
Run Once
Identify all candidate words from the source file.
Parse/split phrases and pass them through the pluralization libraries to determine their plural counterparts.
Generate (and precompile) simple regular expressions to locate the words that you do want to match. For example, if you want to match "dog" but not "dogs" you could create a regex like \bdog\b, which could then be executed against the text. (A pattern such as dog[^s] would miss "dog" at the very end of the text and would still match "dogma".)
Run Whenever a Search/Replace is Needed
Run your list of source expressions against the text in question. I would suggest ordering the expressions from shortest to longest (otherwise a short expression may replace a word that was just parsed by a longer expression).
Again, this would be processor intensive to run in real-time (most solutions will be). As always, if you are parsing HTML, you should use an HTML parser, not a regular expression. In this case, you might use a proper parser to locate all text nodes and then perform the search/replace on them.
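One way to sidestep the ordering problem is to build a single alternation and replace in one pass, so text that has already been turned into a link is never scanned again. The following is only a sketch under assumptions: WordLinker and Linkify are made-up names, the word/link pairs are assumed to be loaded into a dictionary already, matching is case-sensitive, plurals are not handled, and the input is assumed to be plain text rather than HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordLinker
{
    public static string Linkify(string text, IDictionary<string, string> links)
    {
        // Longest terms first, so "new year" is tried before "new"
        // when both could match at the same position.
        string alternation = string.Join("|",
            links.Keys.OrderByDescending(k => k.Length)
                      .Select(k => Regex.Escape(k))
                      .ToArray());

        // \b keeps "dog" from matching inside "dogs" or "dogma".
        string pattern = @"\b(?:" + alternation + @")\b";

        return Regex.Replace(text, pattern, m =>
        {
            string url;
            return links.TryGetValue(m.Value, out url)
                ? "<a href=\"" + url + "\">" + m.Value + "</a>"
                : m.Value;
        });
    }
}
```

For a fixed word list, the Regex could be constructed once with RegexOptions.Compiled and reused, in line with the "run once" step described above.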
An alternative solution would be to put the text and keyword list into a database and use SQL Server Full Text Indexing which tends to be pretty smart about these things and supports intelligent match predicates. You could even combine this with a CLR stored procedure to handle things that .NET excels at (like string parsing).
Regardless of the approach, this will not be an exact science.
You're likely going to need a dictionary. Create a text file/XML file that contains both the singular and plural forms of the words you want. At runtime, load them into a Dictionary<String, String>. Then look up the value of <word/> in the dictionary and extract its singular value.

Regex match a CSV file

I am trying to create a regex to match a CSV file of records in the form of:
optional value, , ,, again some value; this is already, next record;
Now there is an upper limit of commas (10) separating the attributes of each record and an unlimited number of ; separating records. Values might or might not be present. I am inexperienced with regex and my efforts have been rather futile so far. Please help. If necessary, I will include more details.
EDIT
I want to verify that the file is in the required form and get the number of records in it.
Do you really need to use regular expressions for this? Might be a little bit overkill. I'd just perform one String.Split() to get the records, then another String.Split() on each record to get the values. Also rather easy to get the number of elements etc. then.
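A minimal sketch of the Split-based approach, covering the two things asked for in the edit (validate the form, count the records). The names CsvCheck and CountRecords are made up, and the rules are my reading of the question: every record ends with ";", a record may contain at most a given number of commas, and there is no quoting or escaping.

```csharp
using System;
using System.Linq;

public static class CsvCheck
{
    // Returns the number of records, or -1 when the text is not in the expected form.
    public static int CountRecords(string csv, int maxCommasPerRecord)
    {
        string trimmed = csv.Trim();

        // Every record must be terminated by ';', so the text has to end with one.
        if (trimmed.Length == 0 || !trimmed.EndsWith(";", StringComparison.Ordinal))
            return -1;

        // Drop the final terminator so Split does not produce a trailing empty entry.
        string[] records = trimmed.Substring(0, trimmed.Length - 1).Split(';');

        foreach (string record in records)
        {
            if (record.Count(c => c == ',') > maxCommasPerRecord)
                return -1;
        }
        return records.Length;
    }
}
```

Empty values and empty records are accepted here because the question says values may be absent.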
If you really want to use Regexps, I'd use two steps again:
/(.*?);/ to get the datasets;
/(.*?)[,;]/ to get the values.
Could probably be done with one regexp as well but I'd consider this overkill (as you'd have to find the sub matches etc. identify their parent record, etc.).
Escaped characters would be another thing but rather similar to do: e.g. /(.*?[^\\]);/
try this
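// True when every record has the same number of comma-separated fields.
// RemoveEmptyEntries drops the empty entry produced by the final ';'.
bool isvalid = csv.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
                  .Select(record => record.Split(',').Length)
                  .Distinct()
                  .Count() == 1;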
Reminds me of the famous article from Coding Horror: Regular Expressions: Now You Have Two Problems.
FileHelpers saved my day when dealing with CSV or other text format.

Efficient string matching algorithm

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.
Here are my requirements:
Given a domain name, e.g. www.example.com, determine if it "matches" one in a list of entries.
Entries may be absolute matches, e.g. www.example.com.
Entries may include wildcards, e.g. *.example.com.
Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.
Language/environment: C# (.Net Framework 3.5)
I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.
I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.
My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?
If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.
Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.
The only differences between this and a regular spell-checking tree are:
The matching needs to be done in reverse
You have to take into account the wildcards
To address point #2, you would simply check for the "*" character at the end of a test.
A quick example:
Entries:
*.fark.com
www.cnn.com
Tree:
m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                \
                 -> n -> n -> c -> . -> w -> w -> w
Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.
Checking www.cern.com would fail on the second "n" of n,n,c,...
Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".
I would use Regex, just make sure the expression is compiled once (instead of being built again and again).
You don't need a regexp. Just reverse all the strings,
get rid of the '*', and keep a flag to indicate that a partial match
up to this point passes.
Somehow, a trie or suffix trie looks most appropriate.
If the list of domains is known at compile time, you may look at
tokenizing at '.' and using multiple gperf generated machines.
Links:
google for trie
http://marknelson.us/1996/08/01/suffix-trees/
I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.
Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.
To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.
This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.
Additionally, I would have to bet that this approach would be faster than RegEx.
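A minimal sketch of such a Dictionary-based tree, keyed by domain label and walked from the TLD inwards. DomainMatcher and its members are made-up names, and the wildcard semantics follow the question: *.example.com matches example.com itself and anything below it.

```csharp
using System;
using System.Collections.Generic;

public sealed class DomainMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<string, Node> Children =
            new Dictionary<string, Node>(StringComparer.OrdinalIgnoreCase);
        public bool IsExact;     // a literal entry ends at this node
        public bool IsWildcard;  // a "*." entry covers this node and everything below it
    }

    private readonly Node root = new Node();

    public void Add(string entry)
    {
        bool wildcard = entry.StartsWith("*.", StringComparison.Ordinal);
        string[] labels = (wildcard ? entry.Substring(2) : entry).Split('.');

        // Insert labels from right to left: "com", then "example", ...
        Node node = root;
        for (int i = labels.Length - 1; i >= 0; i--)
        {
            Node child;
            if (!node.Children.TryGetValue(labels[i], out child))
            {
                child = new Node();
                node.Children.Add(labels[i], child);
            }
            node = child;
        }

        if (wildcard) node.IsWildcard = true;
        else node.IsExact = true;
    }

    public bool Matches(string domain)
    {
        string[] labels = domain.Split('.');
        Node node = root;
        for (int i = labels.Length - 1; i >= 0; i--)
        {
            if (!node.Children.TryGetValue(labels[i], out node)) return false;
            if (node.IsWildcard) return true; // the rest of the name is covered
        }
        return node.IsExact;
    }
}
```

Each lookup costs one dictionary probe per label of the candidate name, regardless of how many entries are stored. Build the tree once and reuse it across lookups.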
You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.
Here's an article on recursive descent parsers.
Assuming the rules are as you said: literal or start with a *.
Java:
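public static boolean matches(String candidate, List<String> rules) {
    for (String rule : rules) {
        if (rule.startsWith("*.")) {
            // "*.example.com" covers example.com itself and any subdomain of it.
            // Requiring the "." keeps "badexample.com" from matching.
            String base = rule.substring(2);
            if (candidate.equals(base) || candidate.endsWith("." + base)) {
                return true;
            }
        } else if (candidate.equals(rule)) {
            // Literal rules must match the whole name.
            return true;
        }
    }
    return false;
}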
This scales linearly with the number of rules you have.
EDIT:
Just to be clear here.
When I say "sort the rules", I really mean create a tree out of the rule characters.
Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, and see if it has a y branch, and then a z child).
For the "wildcards" I'd use the same concept, but populate it "backwards", and walk it with the back of the match candidate.
If you have a LOT (LOT LOT) of rules I would sort the rules.
For non wildcard matches, you iterate for each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.)
If it IS a wildcard match, you do the exact same thing, but you work against a list of "backwards rules", and simply match from the end of the string against the end of the rule.
I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.
I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.
Have a look at RegExLib
Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:
Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, iterated ad nauseam.
Given your requirements, I think you're on track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix matching ('*.example.com' becomes: "is 'moc.elpmaxe' the beginning of my input string?"), which certainly doesn't require something as heavy-handed as regexps.
What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.
I guess I am tempted to answer your question with another one: what are you doing that makes you believe your bottleneck is some string matching above and beyond a simple string compare? Surely something else is listed higher up in your performance profiling?
I would use the obvious string compare tests first, which will be right 90% of the time, and if they fail then fall back to a regex.
If it was just matching strings, then you should look at trie datastructures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.
That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.
If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...
Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. At the cost of a little pre-processing of the pattern, KMP searches in time linear in the length of the text, and Boyer-Moore can often do better than that in practice because it skips characters. Dropping the leading asterisk is of course crucial, as others have noted.
One source of information for these is:
KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html
BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
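For reference, a compact sketch of KMP. The names Kmp, IndexOf and BuildFailureTable are mine, not from the linked pages. It finds the first occurrence of a pattern in a text. For this question the pattern would be the entry with its leading "*" dropped, though an end-anchored comparison is still needed to decide whether a domain actually matches.

```csharp
using System;

public static class Kmp
{
    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;

        int[] fail = BuildFailureTable(pattern);
        int matched = 0; // number of pattern characters matched so far

        for (int i = 0; i < text.Length; i++)
        {
            // On a mismatch, fall back to the longest shorter prefix that still fits.
            while (matched > 0 && text[i] != pattern[matched])
                matched = fail[matched - 1];

            if (text[i] == pattern[matched]) matched++;
            if (matched == pattern.Length) return i - matched + 1;
        }
        return -1;
    }

    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    private static int[] BuildFailureTable(string pattern)
    {
        var fail = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }
        return fail;
    }
}
```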
