Fastest way to search for terms in a text file? [closed] - c#

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have a list of terms (words), say around 500,000, loaded into some data structure such as a Dictionary or perhaps a Trie.
In my program I want to open each text document and search for occurrences of these terms. When I find one, I want to stop, transform the string in the text file (replacing it with the transformed string), and continue searching. Once I am done with the file, I write the modified file to disk.
My questions are as follows
What would be the best data structure to use for this purpose: a tree-type structure or a .NET Dictionary?
How do I search the text? Do I break it up into words and compare each chunk against the list I have, or use some other method such as RegEx or .NET methods like Contains()?
I'm just looking for some advice on where to start, because I think speed will be really important when I'm dealing with very large and numerous text files.
EDIT: Yes, the transformation is the same for each string. It is based on an algorithm, so each transformed string will look different (for example, using a cipher on the word to make it unreadable). Anyway, I'm just looking for someone to point me in the right direction; I'm not familiar with many of the algorithms and data structures out there.

From a class I took once, I remember we covered a couple of different algorithms. Here are the ones that I remembered to be pretty effective with large text files...
Boyer-Moore:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Knuth-Morris-Pratt:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
These will only help with the lookup; you then have to do the manipulation yourself.

A hash table (Dictionary) is going to give you faster lookups than a tree structure. A well-built hash table can find a matching word entry with two or three probes, while a tree structure may require up to an order of magnitude more comparisons.
As for splitting up the words, it would seem to be simple enough to collect all alphabetical characters (and possibly digit characters) up to the next whitespace or punctuation character for each word. You will probably want to convert each word into all-lowercase before looking it up in the dictionary.
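The word-splitting and lookup described above could be sketched roughly as follows. This is only an illustration: the names TermReplacer and ReplaceTerms are made up for this example, a HashSet<string> stands in for the Dictionary (it is also hash-based, but stores keys only), and the transform is passed in as a delegate because the question does not say what it is. It assumes the terms were stored in lowercase.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TermReplacer
{
    // Scans the text once, collecting runs of letters/digits as words.
    // A word whose lowercase form is in `terms` is replaced by transform(word);
    // all other characters are copied through unchanged.
    public static string ReplaceTerms(string text, HashSet<string> terms, Func<string, string> transform)
    {
        var output = new StringBuilder(text.Length);
        var word = new StringBuilder();

        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                word.Append(c);          // still inside a word
                continue;
            }
            Flush(word, output, terms, transform);
            output.Append(c);            // whitespace or punctuation
        }
        Flush(word, output, terms, transform); // last word, if the text ends with one
        return output.ToString();
    }

    // Writes the pending word (transformed if it is a known term) and resets the buffer.
    private static void Flush(StringBuilder word, StringBuilder output,
                              HashSet<string> terms, Func<string, string> transform)
    {
        if (word.Length == 0) return;
        string w = word.ToString();
        output.Append(terms.Contains(w.ToLowerInvariant()) ? transform(w) : w);
        word.Clear();
    }
}
```

Because each word costs one hash lookup, the running time grows with the length of the text rather than with the number of terms. For files too large to hold in memory, the same loop could be fed from a StreamReader and written out through a StreamWriter.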

Related

cannot convert from 'char[]' to 'string[]' [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I was trying to make a save dialog from a rich text document but I keep getting an error on the text.
Error:
ArgumentNullException
Argument 2: cannot convert from 'char[]' to 'string[]'
I'm new to C#, so I'm not sure how to fix this.
System.IO.File.WriteAllLines(ofd.FileName.ToString(), richTextBox1.Text.ToArray());
Use a different method:
System.IO.File.WriteAllText(ofd.FileName, richTextBox1.Text);
I'd recommend against using Lines for this; it's a waste of resources to split the text into N strings only to write them all back to a combined file.
Why didn't your first attempt work? WriteAllLines requires an array of strings. If you call .ToArray() on a string you get an array of characters; strings and characters are very different things.
ToArray() is a LINQ method that works because a string can be treated as an enumerable sequence of char. You would rarely do that; if you did want a char array from a string, you would probably use the dedicated string.ToCharArray() method. I mostly use it when splitting strings on multiple characters: someString.Split(".,?!'-/:;()".ToCharArray()); as it's more readable than listing each char separately.
You're more likely to use ToArray() later on for things like filtering one array using LINQ and turning the result into an array again:
Person[] smiths = people.Where(person => person.LastName == "Smith").ToArray();
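To make the difference concrete, here is a small sketch of the calls mentioned above. The class and method names (CharArrayDemo, ViaLinq, ViaString, SplitOnAny) are invented for this illustration only.

```csharp
using System;
using System.Linq;

public static class CharArrayDemo
{
    // LINQ's ToArray() treats the string as IEnumerable<char>, so the result is char[].
    public static char[] ViaLinq(string text) => text.ToArray();

    // The dedicated string method yields the same characters.
    public static char[] ViaString(string text) => text.ToCharArray();

    // Splits on any of the separator characters, as in the Split example above.
    public static string[] SplitOnAny(string text, string separators) =>
        text.Split(separators.ToCharArray());
}
```

Neither char[] result can be passed to File.WriteAllLines, which expects string[]; that mismatch is what the compiler error in the question is reporting.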
Other points:
OpenFileDialog's FileName is already a string; you don't need to call ToString() on it.
Please get into the habit of renaming your controls after you add them to a form (top of the properties grid, the (Name) line, takes two seconds). It's a lot easier for us on the internet (and you in 3 weeks' time) to be able to read usernameTextBox and go "oh, that's the username textbox" than it is to be wondering "is it textbox56 or textbox27 that is the username? I'll just go into the form designer and check.."

Generate skipable list based on chars [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am very sure that there is a technical term for this problem, but unfortunately I do not know it.
I have an alphabetical charset, and the requirement is to create all combinations of its chars up to a maximum length.
The idea is (sample):
Generate a collection of A, AA, AAA, AAAA, AAAAA, AAAAAA
Next: A, AB, ABA, ABAA, ABAAA
Next A, AB, ABB, ABBA, ABBAA
The reason:
We have to query an API that delivers search results.
But if I don't get search hits from the API on AAA, I don't need to search for AAAA anymore, because it can't get search hits either. I can then move on to AAB.
The question:
My problem is that I'm not sure how the code has to be built to achieve this goal. I lack the structural approach.
I've already tried nested loops, but unfortunately I don't get the result.
I also used Combination Libraries, but they focus on other problems.
Many thanks for hints!
What you're looking for is a particular data structure called a Tree, but probably more specifically in your case, a Trie.
Trie data structures are commonly used in things like autocomplete. For example, in a trie containing tea, ted and ten, if someone typed "te", I can traverse the trie and see which options come after that prefix (tea, ted, ten).
It looks like this would also fit your use case from what I can tell.
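The pruning idea from the question maps onto a depth-first walk of such a tree: extend a prefix only if the prefix itself produced hits. The sketch below is an assumption about how that could look, not a tested client for any real API. The names PrefixSearch and FindPrefixesWithHits are hypothetical, and hasHits is a stand-in for the actual API call.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSearch
{
    // Returns every string (up to maxLength chars from alphabet) for which hasHits is true.
    // hasHits is a placeholder for the real API query; it is called once per candidate tried.
    public static List<string> FindPrefixesWithHits(string alphabet, int maxLength, Func<string, bool> hasHits)
    {
        var found = new List<string>();
        Explore("", alphabet, maxLength, hasHits, found);
        return found;
    }

    private static void Explore(string prefix, string alphabet, int maxLength,
                                Func<string, bool> hasHits, List<string> found)
    {
        if (prefix.Length == maxLength) return;   // maximum length reached

        foreach (char c in alphabet)
        {
            string candidate = prefix + c;
            if (!hasHits(candidate)) continue;    // no hits: skip everything below this prefix
            found.Add(candidate);
            Explore(candidate, alphabet, maxLength, hasHits, found); // try longer strings
        }
    }
}
```

With this shape, a miss on AAA means AAAA is never generated at all, and the loop simply moves on to AAB.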

C# Replace multiple spaces in a string with newline [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
This question has been asked before in regard to other languages, but I couldn't find anything on using regex or any other algorithm to solve this in C#.
For example:
Photosynthesis maintains atmospheric oxygen levels and supplies all of
the organic compounds and most of the energy necessary for life on
Earth. Most cases, oxygen is also released as a waste product. (((((THIS SERIES OF SPACES HERE THAT SUGGEST THE END OF A PARAGRAPH))))
Although photosynthesis is performed differently by different
species, the process always begins when energy from light is absorbed
by proteins called reaction centers that contain green chlorophyll
pigments.
should be formatted as:
Photosynthesis maintains atmospheric oxygen levels and supplies all of
the organic compounds and most of the energy necessary for life on
Earth.
Although photosynthesis is performed differently by different species,
the process always begins when energy from light is absorbed by
proteins called reaction centers that contain green chlorophyll
pigments.
How do I get this done?
var SpacedText = "Some sample text.        This should be a new paragraph.";
var NewlineText = Regex.Replace(SpacedText, @"\s{2,}", Environment.NewLine);
Change the 2 in the regex for however many spaces you want it to break on.
Environment.NewLine can be replaced with whatever newline delimiter you need (<br /> for html, or any listed here).
My best guess is to match the end-of-sentence period, any trailing whitespace after it, and the line break that follows, and to replace the whole match with a period followed by a carriage return/line feed.
In this case the regex would be
\.\s*[\r\n]+
http://regex101.com/r/cU2tF9/1
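Applied from C#, the pattern above might be used like this. It is a minimal sketch: the names ParagraphBreaks and TrimSentenceEnds are made up, and the replacement is hard-coded to "\r\n" so the result does not depend on the platform.

```csharp
using System.Text.RegularExpressions;

public static class ParagraphBreaks
{
    // Replaces "period + optional whitespace + line break(s)" with "period + CRLF".
    // A period followed by ordinary text on the same line is left alone.
    public static string TrimSentenceEnds(string text)
    {
        return Regex.Replace(text, @"\.\s*[\r\n]+", ".\r\n");
    }
}
```

Note that this only helps when the run of spaces is followed by a real line break in the input; if the spaces sit in the middle of a line, the \s{2,} approach is the one that applies.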

Adding two different digit Numbers in c# ( without using BigInteger) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
I have a task to do in C#. I need to add two numbers.
The first number contains around 100 digits, like "12822429847264872649624264924626466826446692............"
and the second number also has around 100 digits, possibly more or fewer.
Using these numbers I need to perform operations such as add/sub/multiply/div.
I have done this using BigInteger in C#.
But do I need to do this using arrays or strings?
Since they are both 100 digits, just start with the last digit and, in a for loop, add each pair of digits. If the sum is 10 or more, keep only its last digit and remember to carry one to the next digit.
This is how children learn to add. You just need to follow the same steps, but the answer should go in an array of 101 characters, since the final carry can add one more digit.
UPDATE:
Since you have shown some code now, it helps.
First, don't duplicate the code based on if str1 or str2 is larger, but make a function with that logic and pass in the larger one as the first parameter.
Determine the largest size and make certain the smaller value is also the same size, to make math easier.
The smaller one should have leading zeroes (padding), again to help keep the code simple.
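Put together, the padding and carry steps could look like the sketch below. It handles non-negative numbers given as plain digit strings only; the names DigitStrings and Add are invented for this example, and input validation is left out.

```csharp
using System;

public static class DigitStrings
{
    // Adds two non-negative decimal numbers given as strings of digits.
    public static string Add(string a, string b)
    {
        // Pad the shorter number with leading zeroes so both have the same length.
        int width = Math.Max(a.Length, b.Length);
        a = a.PadLeft(width, '0');
        b = b.PadLeft(width, '0');

        var result = new char[width + 1];   // one extra slot for a final carry
        int carry = 0;

        // Work from the last digit to the first, as in pen-and-paper addition.
        for (int i = width - 1; i >= 0; i--)
        {
            int sum = (a[i] - '0') + (b[i] - '0') + carry;
            result[i + 1] = (char)('0' + sum % 10); // digit to keep
            carry = sum / 10;                       // 1 when sum is 10 or more
        }
        result[0] = (char)('0' + carry);

        // Drop leading zeroes, but keep a single "0" for a zero result.
        string text = new string(result).TrimStart('0');
        return text.Length == 0 ? "0" : text;
    }
}
```

Subtraction follows the same pattern with a borrow instead of a carry; multiplication and division need more work and are not covered here.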
You can also start by looking at the source code for structures such as BigInteger. It would give you more insight into aspects such as computational efficiency and storage, particularly for multiplication and division.

Best way to read a FASTA file in c# [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I have a FASTA file containing several protein sequences. The format is like
----------------------
>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*
>protein3
MKTLLLLAVIMIFGLLQAHGNLVNFHRMIKLTTGKEAALSYGF
CHCGVGGRGSPKDATDRCCVTHDCCYKRLEKRGCGTKFLSYKF
SNSGSRITCAKQDSCRSQLCECDKAAATCFARNKTTY
-----------------------------------
Is there a good way to read in this file and store the sequences separately?
Thanks
One way to do this is:
Create a vector where each element holds a name and a sequence.
Go through the file line by line.
If the line starts with >, add an element to the end of the vector and save line.substring(1) to the element as the protein name. Initialize the element's sequence to "".
If line.length == 0, the line is blank, so do nothing.
Otherwise the line doesn't start with >, so it is part of the sequence: append it to the current vector element with element.sequence += line. This way each line between >protein2 and >protein3 is concatenated and saved to the sequence of protein2.
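The steps above can be sketched in C# along these lines. This is an illustration under a few assumptions: FastaRecord and FastaParser are names invented here, a List<FastaRecord> plays the role of the vector, and a StringBuilder replaces repeated string concatenation for the sequence.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class FastaRecord
{
    public string Name { get; }
    public StringBuilder Sequence { get; } = new StringBuilder();
    public FastaRecord(string name) { Name = name; }
}

public static class FastaParser
{
    // Reads FASTA-style text and returns one record per ">" header line.
    public static List<FastaRecord> Parse(TextReader reader)
    {
        var records = new List<FastaRecord>();
        string line;
        while ((line = reader.ReadLine()) != null)   // null means end of input
        {
            line = line.Trim();
            if (line.Length == 0) continue;          // blank line: do nothing

            if (line.StartsWith(">", StringComparison.Ordinal))
                records.Add(new FastaRecord(line.Substring(1))); // new protein, empty sequence
            else if (records.Count > 0)
                records[records.Count - 1].Sequence.Append(line); // extend current sequence
            // Lines before the first header are ignored.
        }
        return records;
    }
}
```

To read from disk, wrap a StreamReader in a using block and pass it to FastaParser.Parse; taking a TextReader also lets the parser be exercised with a StringReader and sample data.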
I think a little more detail about the exact file structure would be helpful. Just looking at what you have (and a quick peek at the samples on Wikipedia) suggests that the name of the protein is prefixed with a > and followed by at least one line break, so that would be a good place to start.
You could split the file on newline, and look for a > character to determine the name.
From there it is a little less clear because I'm not sure if the sequence data is all in one line (no linebreaks) or if it could have linebreaks. If there are none, then you should be able to just store that sequence information, and move on to the next protein name. Something like this:
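// Requires: using System.IO;
using (var reader = new StreamReader(@"C:\myfile.fasta"))
{
    string line;
    // ReadLine returns null at the end of the file.
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue;                // skip blank lines instead of stopping at the first one
        if (line.StartsWith(">"))
            StoreProteinName(line);  // header line
        else
            StoreSequence(line);     // sequence data
    }
}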
If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of the major variations in the format.
Can you use a language other than C#? There are excellent libraries for dealing with FASTA files and other biological sequences in Perl, Python, Ruby, Java, and R (off the top of my head). They're usually branded Bio* (e.g. BioPerl, BioJava, etc.).
If you're interested in C or C++, check out the answers to this question over at Biostar:
http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics
Do yourself a favor, and don't reinvent the wheel if you don't have to.
