I am implementing full-text search over a single entity, Document, which has a name and content. The content can be quite big (20+ pages of text), and I am wondering how to do it.
Currently I am looking at Redis with RediSearch, but I am not sure it can handle search across big chunks of text. This is a multitenant application where each customer has more than 1,000 documents, each quite large.
TL;DR: What should I use to search big chunks of text content?
This space is a bit unclear to me, sorry for the confusion. I will update the question when I have more clarity.
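For context, here is a rough sketch of what I'm considering on the RediSearch side, via StackExchange.Redis's generic Execute() call. The index and key names are just illustrations, and it assumes the RediSearch module is loaded:

    using StackExchange.Redis;

    var redis = ConnectionMultiplexer.Connect("localhost:6379");
    var db = redis.GetDatabase();

    // One index per tenant keeps each customer's documents separate.
    db.Execute("FT.CREATE", "idx:tenant42", "ON", "HASH", "PREFIX", "1", "doc:42:",
               "SCHEMA", "name", "TEXT", "content", "TEXT");

    // Store a document; RediSearch indexes hashes matching the prefix automatically.
    db.HashSet("doc:42:1", new[] {
        new HashEntry("name", "Q3 financial report"),
        new HashEntry("content", "the full 20+ pages of text go here")
    });

    // Full-text query, limited to the first 10 hits.
    var result = db.Execute("FT.SEARCH", "idx:tenant42", "quarterly revenue", "LIMIT", "0", "10");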
I can't tell you what the right answer is, but I can give you some ideas about how to decide.
Normally, if I had documents/content in a DB, I'd be inclined to search there - assuming that the search functionality I could implement was (a) functionally effective enough, (b) didn't require code that was super ugly, and (c) wasn't going to kill the database. There's usually a lot of messing around when implementing the search features and filters you want to provide to the user - UI components, logic components, and then translating all of that into what the database & query language can actually do.
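To make that concrete, here is a minimal sketch of the in-database route with MSSQL. It assumes you have already created a full-text index on the table; the table and column names are illustrative, not anything from your schema:

    using System;
    using System.Data.SqlClient;

    // Sketch only: assumes a full-text index already exists on Documents(Name, Content).
    static void Search(string connectionString, int tenantId, string term)
    {
        const string sql = @"SELECT TOP 20 Id, Name
                             FROM Documents
                             WHERE TenantId = @tenantId
                               AND CONTAINS((Name, Content), @term)";
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@tenantId", tenantId);
            cmd.Parameters.AddWithValue("@term", term);   // e.g. "\"quarterly\" AND \"revenue\""
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine($"{reader.GetInt32(0)}: {reader.GetString(1)}");
        }
    }

Whether this is "functionally effective enough" for you comes down to the trade-offs below.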
So, based on what you've said, the key trade-offs are probably:
Functionality / functional fit (creating the features you need, to work in a way that's useful).
Ease of development & maintenance.
Performance - purely on the basis that gathering search results across "documents" is not necessarily the fastest thing you can do with an IT system.
Have you tried doing a simple whiteboard "options analysis" exercise? If not try this:
Get a small number of interested and smart people around a whiteboard. You can do this exercise alone, but bouncing ideas around with others is almost always better.
Agree what the high level options are. In your case you could start with two: one based on MSSQL, the other based on Redis.
Draw up a big table - each option has its own column (starting at column 2).
In column 1, list all the important things that will drive your decision, e.g. functional fit, ease of development & maintenance, performance, cost, etc.
For each driver in column 1, do a score for each option.
How you do it is up to you: you could use a 1-5 point system (optionally you could use planning poker type approach to avoid anchoring) or you could write down a few key notes.
Be ready to note down any questions that come up, important assumptions, etc so they don't get lost.
Sometimes as you work through the exercise the answer becomes obvious. If it's really close you can rely on scores - but that's not ideal. It's more likely that of all the drivers listed some will be more important than others, so don't ignore the significance of those.
We use a C#/.NET solution where someone can call a phone number and speak a person's first and then last name. The name is then entered in a guest registry on our website. We use an XML dictionary file with 5,000 first names and 89,000 last names that we got from the US Census. We are using the Microsoft.Speech.Recognition library (maybe that's the problem).
Our problem is that even with relatively easy names like Joshua McDaniels we are getting about a 30% failure rate. The performance (speed-wise) is fine; it just doesn't recognize a good portion of the names.
Now, I understand that ultimately the quality of the spoken name will dictate, sorry for the pun, how well the system performs, but we would like to get close to 99% under "laboratory" conditions - perfect enunciation, no accent - and then call it good. But even after several trials with the same person speaking the same name, on the same phone, in the same environment, we are getting a 25% failure rate.
My question is: does anyone have an idea of a better way to approach this? We thought of maybe using an API, so that the matches would be more relevant and current.
The current state of the technology is that it is very hard to recognize names, especially from a large list of them. You can recognize names from a phone book (500 entries) with good quality, but for thousands of them it is very hard. Speech recognition engines are certainly not designed for that, in particular offline ones like System.Speech.
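For reference, this is roughly how such a name grammar is typically built (a sketch using the System.Speech flavour of the API; Microsoft.Speech mirrors it closely, and your real lists would come from the XML dictionary). It is exactly the size of those Choices lists that hurts offline engines:

    using System;
    using System.Speech.Recognition;

    // Sketch: a Choices-based name grammar. With ~5,000 first and ~89,000 last
    // names the search space is enormous, which is why accuracy collapses.
    var firstNames = new Choices(new[] { "Joshua", "Maria", "David" });  // illustrative subset
    var lastNames  = new Choices(new[] { "McDaniels", "Smith", "Lee" }); // illustrative subset

    var builder = new GrammarBuilder();
    builder.Append(firstNames);
    builder.Append(lastNames);

    using (var engine = new SpeechRecognitionEngine())
    {
        engine.LoadGrammar(new Grammar(builder));
        engine.SetInputToDefaultAudioDevice();
        var result = engine.Recognize();
        Console.WriteLine(result?.Text ?? "no match");
    }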
You might get far better results with online systems like https://www.projectoxford.ai, which use advanced DNN acoustic models and bigger vocabularies.
Whole companies were built around the capability to recognize large name lists; for example, Novauris used patented technology for that. You might consider building something like that on an open-source engine, but it would be a large undertaking anyway.
I'm working with a product development firm that has multiple simultaneous releases of the same product.
We have around 4 environments, each with its own copy of the SQL database and its own TFS branch.
Now the problem is that we spend a lot of time merging code, resolving conflicts, and merging across the various branches to make sure we don't mess up deployment.
We are using a Redgate tool (new to us) for managing the SQL DB side, but we still feel like we are not in good shape.
Can you please suggest the best architecture/solution or set of tools we could implement?
If you are concerned about the number of merging-related activities going on, then you need to reduce the number of merging activities. This is not going to be an easy thing to change, as the culture and expectations within your organisation are currently tuned to produce this result.
You need to move towards a single-line, or single-branch, model. If you are using Git you can still use many short-lived activity branches for hotfixes or releases, as laid down in GitFlow, but the source line where you add all new code (DEV, MASTER, TRUNK, Main, whatever) should be a single line. As soon as you have feature or version branches, you are in the world of merging.
There are a number of engineering activities and practices that you can use to support much of what you are physically doing now in the new model:
Feature toggles - This is your primary engineering solution for merging (see the sketch after this list). If you are working on a single code line, and coders always check in working code, then feature toggles allow you to ship features that are half done without letting folks see them. You hide them. Now the first objection you'll raise is "but we do database work and you can't do that there" - and you would be wrong. Many organisations practice feature toggles and include database work. You need a solid and consistent practice of 'additive only' so that you don't break existing work, and you do the work to make sure that a new feature and an old one can coexist. There is effort in that, but not as much as merging (in my experience), and it is not as error- or bug-prone. One key thing to remember is to think of them as feature toggles, not code toggles. If you add a new feature, hide it till it's ready. If you are incrementally improving an existing feature, just ship the new functionality. Achieving this WILL be hard and will require the courage to implement major cultural changes at your organisation, from coders and testers all the way up to sales and management.
Definition of Done - Which leads to the question of how we maintain quality in this new world of feature toggles. Think about this: if you have 3 feature teams all working on different functionality, and one team decides to reduce their quality because what they have is buggy but good enough, what is the impact? In a branching model you are protected from this until the end, when you make all sorts of compromises to get everyone's mediocre (or just plain crap) code working together. Now we need that quality on every check-in and every release. So what do you need? You need a shared and agreed Definition of Done that represents the quality bar that must be met to ship. Without it you will have chaos. The cultural issue here is that you need everyone - every coder and every tester - on board with the sacrosanct nature of the DoD. No, you can't compromise just this once, as it will have a knock-on effect.
Reduce cycle time - Which leads to our ship cycle. You need to 'ship' more regularly. Or, more specifically, you need to create potentially shippable increments of working software on a regular basis. This supports the above in a number of ways, but first and foremost it reduces the amount of work that is under way, which reduces complexity and helps teams focus. With what are in effect shorter batch sizes, we get much more regular adherence to the Definition of Done and have those touch points of "working software with no further work required to ship it". A side advantage is that you increase the business's ability to change, as it can change direction at the end of each cycle, sure in the knowledge that unfinished features are not going to introduce complexity. You also gain the ability to inspect and adapt more frequently. Most companies, on gathering the evidence, find that more than 60% of their software is used little, if ever. Let's use the reduced cycle time to get users in front of the software and focus on building only the 40% they care about. (Whoa - did we just get a 60% efficiency gain there?)
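A minimal feature-toggle sketch, in C# since that is a common stack for TFS shops. The toggle names and the hard-coded map are purely illustrative; in practice the flags would come from configuration or a database:

    using System;
    using System.Collections.Generic;

    public static class FeatureToggles
    {
        // Illustrative only: real toggles would be read from config, not hard-coded.
        private static readonly Dictionary<string, bool> Toggles = new Dictionary<string, bool>
        {
            { "NewInvoiceWorkflow", false },  // half-done: checked in and shipped, but hidden
            { "CustomerSearchV2",   true  }   // finished: visible to users
        };

        public static bool IsEnabled(string feature) =>
            Toggles.TryGetValue(feature, out var enabled) && enabled;
    }

    // Call site: old and new code paths coexist on the single code line.
    // if (FeatureToggles.IsEnabled("NewInvoiceWorkflow"))
    //     ShowNewInvoiceWorkflow();
    // else
    //     ShowLegacyInvoiceWorkflow();

The point is that flipping one flag, not a merge, is what releases the feature.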
There are a number of other supporting practices that it would make a lot of sense for you to adopt to get there; I would recommend reading the Scrum Guide (http://www.scrumguides.org/) and thinking about how you might start moving towards the goals above.
I'm writing a bot that will analyse posts and reply with vaguely related strings from a database. I'm not aiming for coherence, just for a vague similarity that could pass as someone ignorant of the topic (but knowledgeable enough to try to reply). What are some methods that would help me choose the right reply?
One thing I've come up with is to create a vocabulary list, check which elements of the list appear in the post, and pick a reply from the database based on those results. This crude method has been successful about 10% of the time (based on 100 replies to random posts). I could expand the list with more words, but the method has its limits. Any better ones?
(P.S. The database is sizeable - about 500,000 replies.)
First of all, I think the best you can hope for will be about a 50% answer rate, unless you're prepared to write a lot of code.
If you're willing to get your hands dirty with some statistics, check out term frequency-inverse document frequency (tf-idf). Basically, you use the frequency of uncommon words to determine which keywords are critical to a document, and then use those tf-idf scores to pull out other replies sharing the same keywords.
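A bare-bones sketch of the scoring step (tokenization and the reply-ranking loop are left out, and the names are illustrative):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Minimal tf-idf sketch; assumes posts/replies are already tokenized into word arrays.
    static Dictionary<string, double> TfIdf(string[] doc, List<string[]> corpus)
    {
        var scores = new Dictionary<string, double>();
        foreach (var group in doc.GroupBy(w => w))
        {
            double tf = (double)group.Count() / doc.Length;          // term frequency in this doc
            int df = corpus.Count(d => d.Contains(group.Key));       // how many docs contain it
            double idf = Math.Log((double)corpus.Count / (1 + df));  // rare words score higher
            scores[group.Key] = tf * idf;
        }
        return scores;
    }

You'd score the incoming post, keep its top-weighted keywords, and rank the 500,000 stored replies by how many of those keywords they share.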
You can combine this with whitelisting and blacklisting techniques to ignore common words and prioritize certain keywords, then keep tuning those lists as you watch the algorithm work.
There are also simpler string metrics you can use to test basic similarity. Take a look at this list of string metrics.
You might want to look into vector-space mapping and resemblance. The "vaguely related" problem could most likely be handled by statistical resemblance analysis.
Check out this novel use of resemblance:
http://www.cromwell-intl.com/security/attack-study/
There is a PHP function called similar_text(). Note that its return value is the number of matching characters; to get a percentage you pass a third by-reference argument (e.g. similar_text($str1, $str2, $percent);). It works fairly well, but I couldn't find anything similar in C#. If you could get hold of the source for the PHP function, you might try to translate it. I think there may be a Java version as well.
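If you want something comparable in C#, here is a rough sketch based on Levenshtein distance. It is a different algorithm from similar_text(), so the percentages won't match PHP's exactly:

    using System;

    // Percent similarity via Levenshtein edit distance (sketch, not a similar_text() port).
    static double PercentSimilar(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,        // deletion
                    d[i, j - 1] + 1),       // insertion
                    d[i - 1, j - 1] + cost); // substitution
            }
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 100.0 : 100.0 * (maxLen - d[a.Length, b.Length]) / maxLen;
    }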
I'm working on a project for normalizing URLs (i.e. different URLs that map to the same web page should be identified so the redundancy can be reduced, as a search engine does).
So I'd like a dataset containing different URLs in order to test my method. Please provide links to normalization dataset(s).
I'm implementing this project in C# and I'd like your suggestions. Thanks in advance.
Since you asked "I'd like your suggestions", leaving your question very open - and thus open to whatever kind of suggestion you might get - I will go ahead and give you mine, though I admit I am not 100% sure what problem you wish to tackle. Are you asking for a program/code-specific suggestion? A strategy for how to set up such a project? Or do you wish to collect inspiration and ideas to improve your existing workflow? If it's the third, I would suggest looking at two scenarios, inspired by a lecture one of my Artificial Intelligence teachers once gave. Let's look for a moment at how ant colonies organise themselves:
Top-down approach (a fantasy): imagine a queen in an ant colony prescribing, for each and every ant, its route to the sub-colonies, thereby normalising the multiple routes that various ants take to the same place. You would group the ants together, let each group use just one route to its goal, and remove possible duplicate routes. This is one way to make their routes more efficient. In reality, ants work differently:
Bottom-up approach (the reality): a single ant has little meaning, but when a whole ant colony is studied, an organisation reveals itself. This is because the ants follow the scent traces of other ants, thereby following each other and ultimately finding their way to the nest. This way the cleverness does not need to come from above, or from a central database; a tiny bit of intelligence built into each ant makes the same path reusable. In this way, you might want to think about building your normalisation technique into each hyperlink that needs to be normalized.
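Concretely, a per-URL canonicalisation step might look like the sketch below. Which rules are "safe" depends on the sites you crawl, so treat every rule here as an assumption:

    using System;
    using System.Linq;

    // Sketch: one possible canonical form built with System.Uri.
    static string Normalize(string url)
    {
        var uri = new Uri(url);
        var scheme = uri.Scheme.ToLowerInvariant();
        var host = uri.Host.ToLowerInvariant();
        // Drop default ports (80 for http, 443 for https).
        var port = uri.IsDefaultPort ? "" : ":" + uri.Port;
        // Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
        var query = string.Join("&", uri.Query.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries)
            .OrderBy(p => p, StringComparer.Ordinal));
        var path = uri.AbsolutePath.TrimEnd('/');
        if (path.Length == 0) path = "/";
        // Fragments (#...) are dropped entirely; they never reach the server.
        return scheme + "://" + host + port + path + (query.Length > 0 ? "?" + query : "");
    }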
I hope this gives you the suggestions you wished for. If your question was not strategy-based but related to a specific code problem, ask a question with program code in it - that is often much easier to solve than finding the best strategy. Good luck! My 2 cents.
If it is possible to auto-format code before and after a source-control commit, checkout, diff, etc., does a company really need a standard code style?
It feels like the coding-style debates that have raged since programming began - "put the bracket on the following line", "properly indent your parentheses" - are no longer essential.
I realize that in languages where whitespace matters the diff will have to take it into account, but in languages where style is a personal preference, is there really a need to worry about it anymore?
Auto-format can really only address whitespace.
It won't address developers giving variables bizarre, nonsensical names.
It won't address some developers having functions return null on an error vs throwing an exception.
I'm sure others can think of more examples.
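For instance (a contrived sketch): a formatter will happily accept both of these, but it can't make the team agree on which error-handling convention to use:

    using System.Collections.Generic;

    public class Customer { }

    public class CustomerRepository
    {
        private readonly Dictionary<int, Customer> customers = new Dictionary<int, Customer>();

        // Style A: signal "not found" with null.
        public Customer FindCustomer(int id) =>
            customers.TryGetValue(id, out var c) ? c : null;

        // Style B: signal "not found" with an exception.
        public Customer GetCustomer(int id)
        {
            if (!customers.TryGetValue(id, out var c))
                throw new KeyNotFoundException($"No customer with id {id}");
            return c;
        }
    }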
This is what we do at my work:
We all use Eclipse. We don't have a policy requiring Eclipse, but somehow none of us is an IDEA/IntelliJ person. We also think our code should be written with legacy in mind: it has to be readable in a certain way even years later (#1), no matter who wrote it and whether that person is even still at the company.
Eclipse has a couple of handy features: automatic format on save and a dedicated Formatter tool. As you can see from the linked screenshot, it can be configured with XML, so we keep a set of pre-made formatter XML files available to everyone in the company. When a new person joins, we walk them through the whole process and configure their Eclipse for them (yes, it's a slightly evil thing to do) so that it actually uses the formatting XML files we provide. We do not enforce automatic format on save - we don't want to be completely intrusive, we just want to push all our developers in the right direction. For increased compatibility, we mostly use the rules defined in the JCC.
Next comes the important part: the actual builds. We firmly embrace automatic builds, and for that we use the Hudson Continuous Integration server. There are two important parts to our configuration beyond this:
We use CVS loginfo to trigger builds whenever something is committed.
We utilize several plugins available for Hudson, including the Continuous Integration Game in conjunction with the most important one, Checkstyle.
The Checkstyle plugin is the magician in our code-style enforcement pipeline:
After code is committed to CVS, a Hudson build is triggered.
After the build completes successfully (all unit tests pass, etc.), Checkstyle inspects the actual source files.
Checkstyle rates the code based on the rules we have defined for it.
The Continuous Integration Game sees the Checkstyle result and awards or takes away points for the person who owns the relevant part of the code.
A leaderboard shows the total points for every committer in the system.
Basically this means that when anyone commits ugly code into our CVS, our build server automatically reduces that person's points.
This means that eventually any one of us can be ranked on the leaderboard based on general code quality, covering both appearance and OO principles such as the Law of Demeter, cyclomatic complexity, etc. Naturally this isn't a completely serious statistic, but it's a good indication you're doing something wrong when a build you trigger in our CI costs you points instead of earning them - most of our commits are worth between 1 and 5 points.
And is it working? Sort of. I don't think any of us at my work writes ugly or unmaintainable code, and personally I love to chase all kinds of scores, so it definitely motivates me to write code that looks nice and follows all the OO paradigms I know of.
And do we as a company really need it? I think we do; as you should see from reading this entire answer, it can be considered a good practice for the improvements it brings.
#1: on a related note, I refactored legacy code from 2002 today which used those standards; it didn't look "bad" at all even in its original form, and certainly not worse in its new form.
No, not really.
That is, if you can actually get it to work consistently and not have it flag code as changed merely because it was laid out in a different style.
However, this addresses only a small part of coding standards. It won't cover multiple return statements, the use (or not) of ternary operators, etc.
It is always nice if the coding style that the shop uses is the same one that is also followed by the development tools.
Otherwise, if there is a large body of code that already follows a shop standard which is NOT the same as that of the tools, you have two choices:
Modify all of the code to follow the tool standard, or
Maintain the existing shop standard.
Many shops do the latter. Regardless, there does need to be some kind of standard, and it does need to be followed.
Some development tools allow you to tweak their standard. In some cases you may be able to bring the tools in alignment with the shop standard.
It probably doesn't matter that much anymore if you can ensure that everybody on the team sees the source code "correctly" formatted, whatever they think that is. However, I've not seen a system that can do that. You can do parts of it (say, reformat before and after check-in/checkout), but these days you also have to consider web interfaces into version control, external code-review systems that interact directly with the version control system, and so on.
The main purpose of a standard code style is (IMHO) to ensure that you can read other team members' code easily, without having to reverse-engineer it, because all the code is written according to the same guiding principles. Indenting and parenthesis placement seem to be a major hang-up here, but they are only a very small - and, in my opinion, somewhat overblown and not very important - part of the need to make code consistent.
Unfortunately I'm not aware of any tools that can automatically apply consistent coding principles to source code...
Yes, coding styles are needed if there is a desire to have a homogeneous code base. Such a code base can help prevent individual ownership of parts of the code, which causes problems when people leave the team. If you can't imagine wildly different styles causing problems in understanding, just look at all the different ways English text is organized in various written communications - tweets, e-mail, text messages, IM, message-board posts, etc. - with their changes in fonts, capitalization, decorations, and so on.