I have a question about which data structure is best for a particular situation.
We have a string, "AAAAAAAAAAA", and we want to know whether a database column contains it.
For example, the database table below has two columns:
1. ID 2. Name
1 A
2 B
3 C
.....
49581 AAAAAAAAAAA
If it matches, return true; if not, return false.
I know I can use a List<string>, but I don't think that's the best way to search.
I want to know which data structure is best for searching in this case.
A HashSet<string> would be faster to search than a List<string> if you only need to know whether the string exists.
HashSet<T> Class
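As a minimal sketch of that idea (written in Python, where the built-in `set` plays the role of `HashSet<string>`; the sample names are made up for illustration):

```python
# Hypothetical values standing in for the Name column.
names = ["A", "B", "C", "AAAAAAAAAAA"]

# Build the hash set once; each later lookup is O(1) on average,
# whereas scanning a list is O(n).
name_set = set(names)

def contains_name(candidate: str) -> bool:
    """Return True if the candidate string is among the column values."""
    return candidate in name_set

print(contains_name("AAAAAAAAAAA"))  # True
print(contains_name("ZZZ"))          # False
```

In C#, the equivalent call would be `HashSet<string>.Contains`.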
Or, if you feel adventurous, creating a "ternary search tree" or a "trie" may be an option:
http://www.drdobbs.com/database/ternary-search-trees/184410528
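For a feel of what the trie option involves, here is a minimal sketch in Python. It is a plain dictionary-based trie, not the ternary search tree from the article, and the class and method names are made up for illustration:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next node
        self.is_word = False  # True if a stored string ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie()
for name in ["A", "B", "C", "AAAAAAAAAAA"]:
    trie.insert(name)
```

A lookup costs time proportional to the length of the string rather than the number of stored strings. For a plain exists-or-not check, a hash set is usually simpler.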
Similar to another answer, but note that if you have a hash table then for each hashed string in the column you can store the row number(s) that have that string in the hash table position for the string. So hashing is not just limited to determining whether the string exists in your column or not.
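A rough sketch of that variation in Python, using a dictionary from each string to the row IDs that contain it (the rows below are invented sample data):

```python
from collections import defaultdict

# Hypothetical (ID, Name) rows standing in for the table.
rows = [(1, "A"), (2, "B"), (3, "C"), (4, "A"), (49581, "AAAAAAAAAAA")]

# Map each name to the list of row IDs where it occurs.
ids_by_name = defaultdict(list)
for row_id, name in rows:
    ids_by_name[name].append(row_id)

def find_rows(name: str) -> list:
    """Return the IDs of rows whose Name equals `name` (empty list if none)."""
    return ids_by_name.get(name, [])

print(find_rows("A"))  # [1, 4]
```

In C#, a `Dictionary<string, List<int>>` would fill the same role.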
I have a customers table with 2 columns for the name (firstname, lastname) and it contains around 100k records.
I have a scenario where I have to import new customers but their names come as a single column. Most names are simple (first and last), but some names are double names (with a space or hyphen), double surnames (with a space or hyphen) or even both.
Does it make sense to use an ML.NET classification algorithm to split the full name, based on a model trained on the 100k records?
I think machine learning is unnecessary for a problem like this. You should try a rule-based method instead.
Assuming the data comes in a single column:
Example 1: After splitting the text on spaces, is the word count equal to 2? If so, the first word is the first name and the second word is the surname.
Example 2: Does the text contain a hyphen? If so, you need a rule that decides whether the hyphenated part belongs to the first name or the surname.
1-) Create training, validation and test sets from your data.
2-) Write code for the rules you extract from the training set. (Here you need to make clever deductions by examining the data.)
3-) Use the validation data to determine the best rules.
Finally, evaluate your work by running the rules you found best on the test set.
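The first rule above could be sketched like this in Python. This is only a starting point under simple assumptions: the function name and sample names are invented, and anything that is not exactly two tokens is returned as `None` so it can go to further rules or manual review:

```python
from typing import Optional, Tuple

def split_full_name(full_name: str) -> Optional[Tuple[str, str]]:
    """Split a full name into (first, last) when the rule is unambiguous.

    Returns None when the name needs another rule or manual review.
    """
    parts = full_name.split()  # splits on whitespace; hyphens stay inside a token
    if len(parts) == 2:
        return parts[0], parts[1]
    return None

print(split_full_name("Anna Smith"))              # ('Anna', 'Smith')
print(split_full_name("Anna-Maria Smith-Jones"))  # ('Anna-Maria', 'Smith-Jones')
print(split_full_name("Anna Maria Smith"))        # None
```

Hyphenated double names fall out of the two-token rule for free. Names with an extra space are the cases that need the rules you derive from the training data.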
Check the code below; RandomMan.MyRandomString(64) generates a random string of 64 characters.
I want to check whether this random string is unique in the database using an Entity Framework query like the one below. If the string is not unique, the do loop continues until it finds a unique random string. Am I doing this correctly, or is there a better way?
string randstr;
do
{
    // Generate a new candidate until no existing row uses it.
    randstr = RandomMan.MyRandomString(64);
} while (DataCtx.StorageFiles.Any(x => x.AwsUniqueFileName == randstr));
I cannot tell whether you are doing it correctly, but if the row already exists in the database, I would suggest appending the row's identity field (id) to the generated string. That guarantees the result is unique in the database, provided MyRandomString only produces letters (or at least never ends in a digit).
Let's say the generated string is abc and the id of the row you are updating is 53; then your final unique string is abc53.
The standard approach for this is to just generate a GUID:
Console.WriteLine(Guid.NewGuid());
It's designed to be unique and highly unlikely to generate two identical GUIDs even on many instances at the same time so you don't need to worry much about atomicity of this operation.
The probability of a collision is so low that you can skip handling it altogether, but to be safe you can put a unique key on this column and treat a violation as an exception. There is certainly no need for a loop.
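A minimal sketch of that combination, in Python with an in-memory SQLite database standing in for the real one. The table and column names are invented, and `uuid.uuid4()` is used here as the counterpart of `Guid.NewGuid()`:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storage_files ("
    " id INTEGER PRIMARY KEY,"
    " aws_unique_file_name TEXT NOT NULL UNIQUE)"  # the unique key does the checking
)

def insert_file(name: str) -> bool:
    """Insert a row; return False if the unique constraint rejects the name."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO storage_files (aws_unique_file_name) VALUES (?)",
                (name,),
            )
        return True
    except sqlite3.IntegrityError:
        return False

new_name = uuid.uuid4().hex         # 32 hex characters from a random UUID
first = insert_file(new_name)       # succeeds
duplicate = insert_file(new_name)   # rejected by the unique constraint
```

The database enforces uniqueness atomically, so there is no gap between "check" and "insert" as there is with the query-then-save loop.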
I am trying to figure out the best way to match items in a DataGridView to items in an Access database. (Think of Quicken's "match transaction" feature.)
I import an Excel sheet into a DataGridView; from there, the program checks the Access DB for a match. If a match is found, it reports "match" in a column; if not, it reports "unmatched".
I have tried counting the rows returned by an SQL query (if the count is 1, it's a match), but for some reason that sometimes gives wrong results.
So I am looking for the best way to do this.
Thanks. Please let me know if you need any additional info.
There isn't a simple answer to this, and it depends on what your data looks like, and what you consider a "match" to be. As a very basic answer, this is one way to attack the problem. How far you take it is up to you...
Create an algorithm that takes all fields for a row and generates a "key" for it. For example if there are two fields [First], [Last] then perhaps the key would be "Bubba|Gump"
Apply that algorithm to both sets of data (the datagrid records and the access db records).
Compare the two sets of keys to determine what's identical/missing/added.
It's not foolproof but with some additional sophistication it'll take you surprisingly far.
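The three steps above can be sketched as follows in Python. The sample rows are invented, and the trimming and lower-casing inside `make_key` is an added assumption about what should count as a match, not part of the answer:

```python
def make_key(row: dict, fields: list) -> str:
    """Build a comparison key by joining normalized field values with '|'."""
    return "|".join(str(row[f]).strip().lower() for f in fields)

fields = ["First", "Last"]

# Hypothetical rows from the imported sheet and from the database.
grid_rows = [{"First": "Bubba", "Last": "Gump"}, {"First": "Jenny", "Last": "Curran"}]
db_rows = [{"First": "bubba ", "Last": "Gump"}, {"First": "Dan", "Last": "Taylor"}]

grid_keys = {make_key(r, fields) for r in grid_rows}
db_keys = {make_key(r, fields) for r in db_rows}

matched = grid_keys & db_keys    # present in both
unmatched = grid_keys - db_keys  # in the grid but not in the database

print(matched)    # {'bubba|gump'}
print(unmatched)  # {'jenny|curran'}
```

If a field value can itself contain the `|` separator, two different rows could produce the same key, so pick a separator that cannot occur in the data.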
I am wondering which method is the best way to store a list of integers in a sql column.
For example: "1,2,3,4,6,7"
EDIT: These values represent other IDs in SQL tables. The row would look like
[1] [2]
id, listOfOtherIDs
The choices I have researched so far are:
A varchar of separated values that are "explode-able", e.g. by commas or tabs
An XML containing all the values individually
Using individual rows for each value.
Which method is the best method to use?
Thanks,
Ian
A single element of a record can only refer to one value; it's a basic database design principle.
You will have to change the database's design: use a single row for each value.
You might want to read up on normalization.
As is shown here in the description of the first normal form:
First normal form states that at every row and column intersection in the table there, exists a single value, and never a list of values. For example, you cannot have a field named Price in which you place more than one Price. If you think of each intersection of rows and columns as a cell, each cell can hold only one value.
While Jeroen's answer is valid for "multi-valued" attributes, there are genuine situations where multiple comma-separated values may actually represent one large value. Things like path data (on a map), an integer sequence, a list of prime factors and many more could well be stored in a comma-separated varchar. I think it would be better if you explained exactly what you are storing and how you need to retrieve and use that value.
EDIT:
Looking at your edit, if by IDs you mean the PKs of another table, then this sounds like a genuine M-N relation between this table and the one whose IDs you're storing. This should really be stored in a separate gerund (a junction table), which is a table that has the PK of each of these tables as FKs, thus linking the related rows of both tables. So Jeroen's answer suits your situation very well.
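As a rough sketch of such a linking table, here is a Python example using an in-memory SQLite database. The table and column names (`item`, `other`, `item_other`) are placeholders, since the question does not name the real tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item  (id INTEGER PRIMARY KEY);
    CREATE TABLE other (id INTEGER PRIMARY KEY);
    -- Linking table: one row per (item, other) pair instead of a list column.
    CREATE TABLE item_other (
        item_id  INTEGER NOT NULL REFERENCES item(id),
        other_id INTEGER NOT NULL REFERENCES other(id),
        PRIMARY KEY (item_id, other_id)
    );
""")

other_ids = (1, 2, 3, 4, 6, 7)  # the list from the question
conn.execute("INSERT INTO item (id) VALUES (1)")
conn.executemany("INSERT INTO other (id) VALUES (?)", [(i,) for i in other_ids])
conn.executemany(
    "INSERT INTO item_other (item_id, other_id) VALUES (1, ?)",
    [(i,) for i in other_ids],
)

def other_ids_for(item_id: int) -> list:
    """Return the linked IDs for one item, in ascending order."""
    cur = conn.execute(
        "SELECT other_id FROM item_other WHERE item_id = ? ORDER BY other_id",
        (item_id,),
    )
    return [row[0] for row in cur]

print(other_ids_for(1))  # [1, 2, 3, 4, 6, 7]
```

With this layout the database can index, join on and enforce each individual ID, which a comma-separated varchar does not allow.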
Is it possible to get the information why/how given row returned by FTS query was matched (or which substring caused row to match)?
For example, consider the simplest table with id and text columns, with an FTS index on the latter.
SELECT * FROM Example
WHERE CONTAINS(text, 'FORMSOF(INFLECTIONAL, jump)');
This example query could return, say, the row {1, 'Jumping Jack'}.
Now, is it possible to somehow find out that this row was matched because of the word 'Jumping'? It doesn't have to be exact information; knowing which substring caused the row to match would be enough.
Why I'm asking: I have a C# app that builds these queries based on user input (keywords to search for), and I need basic information about why or how a row was matched, to use further in the C# code.
If it's not possible, any alternatives?
EDIT in regards of Mike Burton's and LesterDove's replies:
The above example was trivial for obvious reasons, and your solutions are fine with that in mind. However, FTS queries might return results where a regex or simple string matching (e.g. LIKE) won't cut it. Consider:
Search for bind returns bound (past form).
Search for extraordinary returns amazing (synonym).
Both valid matches.
I've been looking for solutions to this problem and found this: NHunspell. However, I already have FTS and valid results from SQL Server; duplicating a similar mechanism (building extra indexes, storing additional word/thesaurus files, etc.) doesn't look good.
Lester's answer, however, gave me the idea that I could split the original string into a temporary table and run the original FTS query on the split result. That might work in my case (where the DB is fairly small and the queries are not very complex), but in the general case this approach might be out of the question.
1/ Use a SPLIT function (many variations can be Googled) on your original string, which will dump the individual substrings into a temp table of some sort, with one row per substring.
2/ EDIT: You need to use CROSS APPLY to join to a table valued function:
SELECT * FROM Example E CROSS APPLY Split(E.text, ' ') AS S
WHERE CONTAINS(E.text, 'FORMSOF(INFLECTIONAL, jump)') AND S.String LIKE '%jump%';
*NOTE: You need to forage for your own user-defined Split function. I used this one and applied the first commenter's edit to allow for the space character as a delimiter.
So, E is your Example table. You're still FT searching on the text field for the word 'jump'. And now you're "joining" to a table comprised of the individual substring values of your text field. Finally, you're matching that against the word 'jump' by using LIKE or Instr.
One simple post-processing method would be to generate an equivalent regular expression for each term in the WHERE clause and use it to discover, after the fact, how the found data matches the specified pattern.
You can get SQL to tell you how it interpreted your query, including how it transformed your input.
SELECT occurrence, special_term, display_term, expansion_type, source_term
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, bind)', 1033, 0, 0)
returns
occurrence  special_term  display_term  expansion_type  source_term
1           Exact Match   binds         2               bind
1           Exact Match   binding       2               bind
1           Exact Match   bound         2               bind
1           Exact Match   bind          0               bind
This isn't precisely what you asked for, but it's a start. You could search your results for anything in the display_term column and probably figure out why it matched.
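That last step could be sketched like this in Python (shown instead of C# only to keep the example short). The `display_terms` set is copied by hand from the output above, where a real application would read it from the query result, and the function name is made up:

```python
import re

# display_term values taken from the sys.dm_fts_parser output above.
display_terms = {"binds", "binding", "bound", "bind"}

def matched_words(text: str, terms: set) -> list:
    """Return the words in `text` that equal one of the expanded terms."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w in terms]

print(matched_words("The book was bound in leather", display_terms))  # ['bound']
```

This compares whole words case-insensitively, so it only approximates the word breaking that full-text search does.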