Closed. This question is opinion-based. It is not currently accepting answers.
I have a FASTA file containing several protein sequences. The format is like
----------------------
>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*
>protein3
MKTLLLLAVIMIFGLLQAHGNLVNFHRMIKLTTGKEAALSYGF
CHCGVGGRGSPKDATDRCCVTHDCCYKRLEKRGCGTKFLSYKF
SNSGSRITCAKQDSCRSQLCECDKAAATCFARNKTTY
-----------------------------------
Is there a good way to read in this file and store the sequences separately?
Thanks
One way to do this:
Create a vector in which each element holds a name and a sequence.
Go through the file line by line.
If the line starts with >, add an element to the end of the vector and save line.substring(1) to the element as the protein name. Initialize the element's sequence to "".
If line.length == 0, the line is blank, so do nothing.
Otherwise the line does not start with >, so it is part of the sequence: append it to the current element (element.sequence += line). This way every line between >protein2 and >protein3 is concatenated and saved as the sequence of protein2.
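The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not a full FASTA implementation: the class name FastaReader and the method Read are made up for illustration, and it assumes the whole header line after > is the protein name and that any text before the first header can be ignored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class FastaReader
{
    // Returns (name, sequence) pairs in the order they appear in the input.
    public static List<KeyValuePair<string, string>> Read(TextReader reader)
    {
        var records = new List<KeyValuePair<string, string>>();
        string name = null;
        var sequence = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0)
                continue;                      // blank line: do nothing
            if (line[0] == '>')
            {
                // A new header closes the previous record, if there was one.
                if (name != null)
                    records.Add(new KeyValuePair<string, string>(name, sequence.ToString()));
                name = line.Substring(1);
                sequence.Clear();
            }
            else if (name != null)
            {
                sequence.Append(line);         // sequence lines are concatenated
            }
        }
        if (name != null)
            records.Add(new KeyValuePair<string, string>(name, sequence.ToString()));
        return records;
    }
}
```

To read a file, pass a StreamReader opened on it, for example FastaReader.Read(new StreamReader(path)) inside a using block, where path is wherever your FASTA file lives.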
I think a little more detail about the exact file structure would be helpful. Just looking at what you have (and a quick peek at the samples on Wikipedia) suggests that the name of the protein is prefixed with a > and followed by at least one line break, so that would be a good place to start.
You could split the file on newline, and look for a > character to determine the name.
From there it is a little less clear because I'm not sure if the sequence data is all in one line (no linebreaks) or if it could have linebreaks. If there are none, then you should be able to just store that sequence information, and move on to the next protein name. Something like this:
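using System.IO;

// Verbatim string (@) so the backslash is not treated as an escape sequence.
using (var reader = new StreamReader(@"C:\myfile.fasta"))
{
    string line;
    // ReadLine returns null at end of file; blank lines are skipped rather than ending the loop.
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue;
        if (line.StartsWith(">"))
            StoreProteinName(line);   // header line
        else
            StoreSequence(line);      // sequence line
    }
}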
If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of the major variations in the format.
Can you use a language other than C#? There are excellent libraries for dealing with FASTA files and other biological sequences in Perl, Python, Ruby, Java, and R (off the top of my head). They're usually branded Bio* (e.g. BioPerl, BioJava, etc.).
If you're interested in C or C++, check out the answers to this question over at Biostar:
http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics
Do yourself a favor, and don't reinvent the wheel if you don't have to.
Closed. This question needs to be more focused. It is not currently accepting answers.
Let's say I have a few points: -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5.
I'm at point 0, and I need to create a line that goes through all of the points 1, 2, 3, 4, 5, -1, -2, etc.
The line starts at 0 and ends at whichever point gives the shortest total length.
The answer for this example would be 0->1->2->3->4->5->-1->-2->-3->-4->-5, or the mirror image that goes first to -1, through the negatives, and then through the positives, with the same result (5 + 6 + 4 = 15 length).
If, for example, we went 0->1->-1->2->-2... it would end up as the longest line that goes straight from point to point (1+2+3+4+5+6+7+8+9+10 = 10*11/2 = 55 length).
The question is how to write this in code?
The points might also be 2- or 3-dimensional, where the start would be (0,0,0) or whatever. The line can still go through all of these points, but which order gives the shortest line?
How do I express in code what we can see by eye?
I think this is basically the Travelling Salesman problem. You've got N destinations, and each pair of destinations has a concrete length between them, and you're trying to find out the shortest travel time to visit all destinations.
You've got two different directions to pursue this, that I can see. First, is to read up on the Travelling Salesman problem and the various algorithms that have been proposed for it (it's a very famous algorithm problem) and then try to implement one in C# - though just to warn you, you should be very proficient in math, because it's not an easy problem. Or, alternatively, you can look for someone else's existing implementation for it and just use it without understanding the theoretical underpinnings.
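As a starting point, here is a minimal C# sketch of the nearest-neighbour heuristic, one of the simplest approaches to this kind of problem. It is a heuristic, not an exact solver: it always walks to the closest unvisited point, which happens to give the optimal line for the 1-D example in the question but can be noticeably worse than optimal in general. The class and method names are made up for illustration, and points are assumed to be double[] arrays of equal dimension.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class NearestNeighbourPath
{
    // Euclidean distance between two points of the same dimension.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    // Starts at 'start' and repeatedly moves to the closest unvisited point.
    public static List<double[]> Build(double[] start, IEnumerable<double[]> points)
    {
        var remaining = points.ToList();
        var path = new List<double[]> { start };
        var current = start;
        while (remaining.Count > 0)
        {
            int best = 0;
            for (int i = 1; i < remaining.Count; i++)
                if (Distance(current, remaining[i]) < Distance(current, remaining[best]))
                    best = i;
            current = remaining[best];
            remaining.RemoveAt(best);
            path.Add(current);
        }
        return path;
    }

    // Total length of the line that visits the path's points in order.
    public static double Length(IList<double[]> path)
    {
        double total = 0;
        for (int i = 1; i < path.Count; i++)
            total += Distance(path[i - 1], path[i]);
        return total;
    }
}
```

For a handful of points you could instead try every ordering and keep the shortest, but the number of orderings grows factorially, which is why heuristics like this one are common.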
Closed. This question needs details or clarity. It is not currently accepting answers.
Hello experts, I have to generate a series of folders in a specified location from TextBox input. I have two textboxes that specify the limits of the series (say 30 folders). The problem I am facing is that the folder names I provide are alphanumeric (say 121cs3h101).
How do I set the limits when I provide alphanumeric values?
(For example: I provide textbox1=12cs3h101 and textbox2=12cs3h131, and I need the series between those limits to be generated.) I am working with Visual Studio 2013 in a C# Windows Forms application. Thanks in advance.
OK, I will try to give you a lead.
To parse a string or find specific characters, you can use Regex.Match or a simpler method, String.Split. In both cases you have to be aware of how your string is structured and how it can vary. The limits of variation are very important.
If, as you say, the beginning is always "12cs3h", you can split the string at the character 'h':
string[] sa = s.Split('h');
Or you can even use the index of 'h' (since the length seems to be fixed) and take the rest of the string to get the numbers.
int index = s.IndexOf('h');
The rest is up to you, ... convert, enumerate and so on.
EDIT: There is a nice method that does the enumeration job for you: Enumerable.Range. Good luck.
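Putting IndexOf and Enumerable.Range together could look roughly like this. It is only a sketch under assumptions: both names share the same prefix ending at the first 'h', everything after it is a number, and the second number is not smaller than the first. The class FolderSeries and its method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class FolderSeries
{
    // Builds every name from 'first' to 'last', e.g. 12cs3h101 .. 12cs3h131.
    public static List<string> Names(string first, string last)
    {
        int cut = first.IndexOf('h') + 1;           // the numeric part starts after 'h'
        string prefix = first.Substring(0, cut);
        if (!last.StartsWith(prefix, StringComparison.Ordinal))
            throw new ArgumentException("Both names must share the same prefix.");

        int start = int.Parse(first.Substring(cut));
        int end = int.Parse(last.Substring(cut));
        int width = first.Length - cut;             // keep leading zeros, if any

        return Enumerable.Range(start, end - start + 1)
                         .Select(n => prefix + n.ToString("D" + width))
                         .ToList();
    }
}
```

Each generated name can then be passed to Directory.CreateDirectory(Path.Combine(root, name)), where root is the target location you chose.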
Closed. This question needs to be more focused. It is not currently accepting answers.
How can I get the numbers that are encoded as ASCII art made of sticks?
The numbers are in a txt file, and it contains this:
I must convert this txt file into
3 2 1 4 5
1 4 5
I read the text file like this:
// Requires: using System.IO; using System.Text;
var sb = new StringBuilder();   // was used below without being declared
using (StreamReader sr = new StreamReader("SourceFile.txt"))
{
    string line;
    // Read lines from the file until the end of the file is reached.
    while ((line = sr.ReadLine()) != null)
    {
        sb.AppendLine(line);
    }
}
string allines = sb.ToString();
Now, as @Zotta's answer suggests, I have to save the lines in two different strings (the first 4 lines and the second 4 lines); then it will be easier.
Your numbers are 4 lines tall each => split the input into blocks of 4 lines each.
Your numbers are separated by columns of whitespace => search for columns containing only whitespace and split there.
After you have separated all the numbers, use a lookup table.
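The three steps above could be sketched like this in C#. The actual stick glyphs are not shown in the question, so the lookup table is left to the caller: it is assumed to map each glyph (its rows joined with "\n") to a digit character. StickDigits, SplitGlyphs and Decode are hypothetical names, and the sketch assumes a positive block height and that no glyph contains an all-blank column of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class StickDigits
{
    // Cuts one block of lines into glyphs at columns that are blank in every line.
    public static List<string> SplitGlyphs(string[] block)
    {
        int width = block.Max(l => l.Length);
        var rows = block.Select(l => l.PadRight(width)).ToArray();
        var glyphs = new List<string>();
        int start = -1;                              // -1 means "not inside a glyph"
        for (int col = 0; col <= width; col++)
        {
            bool blank = col == width || rows.All(r => char.IsWhiteSpace(r[col]));
            if (!blank && start < 0)
            {
                start = col;                         // a glyph begins here
            }
            else if (blank && start >= 0)
            {
                glyphs.Add(string.Join("\n", rows.Select(r => r.Substring(start, col - start))));
                start = -1;                          // the glyph ended
            }
        }
        return glyphs;
    }

    // Decodes each block of 'height' lines into a space-separated string of digits.
    public static List<string> Decode(string[] lines, int height, IDictionary<string, char> table)
    {
        var result = new List<string>();
        for (int i = 0; i + height <= lines.Length; i += height)
        {
            var block = lines.Skip(i).Take(height).ToArray();
            var digits = SplitGlyphs(block)
                .Select(g => table.TryGetValue(g, out char c) ? c : '?');  // '?' = unknown glyph
            result.Add(string.Join(" ", digits));
        }
        return result;
    }
}
```

For the file in the question you would call Decode with height 4 and a table built once from the glyphs of the digits you expect.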
I don't know why this question is down-voted so much; I think it's an interesting question. I'll answer with a general approach, rather than hardcoding the possible results by finding the characters, that would also work with a different "ASCII font".
If you're looking for a library, you could search Google for captcha decoding. There is a comprehensive article here if you want to do it yourself for ASCII specifically:
http://www.boyter.org/decoding-captchas/
Also, since most libraries probably only support images, you may need to convert your ASCII art text file into a bitmap by rendering it yourself.
Closed. This question needs details or clarity. It is not currently accepting answers.
This question has been asked before in regard to other languages, but I couldn't find anything on using regex or any other algorithm to solve this in C#.
For example:
Photosynthesis maintains atmospheric oxygen levels and supplies all of
the organic compounds and most of the energy necessary for life on
Earth. Most cases, oxygen is also released as a waste product. (((((THIS SERIES OF SPACES HERE THAT SUGGEST THE END OF A PARAGRAPH))))
Although photosynthesis is performed differently by different
species, the process always begins when energy from light is absorbed
by proteins called reaction centers that contain green chlorophyll
pigments.
should be formatted as:
Photosynthesis maintains atmospheric oxygen levels and supplies all of
the organic compounds and most of the energy necessary for life on
Earth.
Although photosynthesis is performed differently by different species,
the process always begins when energy from light is absorbed by
proteins called reaction centers that contain green chlorophyll
pigments.
How do I get this done?
var SpacedText = "Some sample text.     This should be a new paragraph.";
var NewlineText = Regex.Replace(SpacedText, @"\s{2,}", Environment.NewLine);
Change the 2 in the regex to however many spaces you want it to break on. Note that \s also matches line breaks, so use a literal space ( {2,}) if only runs of spaces should count.
Environment.NewLine can be replaced with whatever newline delimiter you need (<br /> for html, or any listed here).
The best guess that I can think of is to match the end-of-sentence period and any trailing whitespace when they are followed by the end of the line, and replace the match with a period and a carriage return/line feed.
In this case the regex would be
\.\s*[\r\n]+
http://regex101.com/r/cU2tF9/1
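In C#, applying that pattern might look like the sketch below. ParagraphBreaks and Normalize are hypothetical names, and the newline string is passed in so the caller decides what to insert. Be aware that the pattern also matches a period that merely happens to fall at the end of a wrapped line, so it cannot tell a real paragraph end from an ordinary line end.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class ParagraphBreaks
{
    // Replaces a period, any trailing whitespace and the line break(s) after it
    // with a period followed by the given newline string.
    public static string Normalize(string text, string newline)
    {
        return Regex.Replace(text, @"\.\s*[\r\n]+", "." + newline);
    }
}
```

A typical call would be ParagraphBreaks.Normalize(text, Environment.NewLine).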
Closed. This question needs to be more focused. It is not currently accepting answers.
I have a list of terms (words), say around 500,000, which are loaded into some data structure, perhaps a Dictionary or a Trie.
In my program I want to open each text document and search for occurrences of these terms. When I find one, I want to stop, transform the string in the text file (replacing it with the transformed string), and continue searching. Once I am done with the file, I write the modified file to disk.
My questions are as follows
What would be the best data structure to use for this purpose: a tree-type structure or a .NET Dictionary?
How do I search the text? Do I break it up into words and compare each chunk against the list I have, or use some other method like RegEx, or .NET methods like Contains()?
I'm just looking for some advice on where to start, because I think speed will be really important when I'm dealing with very large and numerous text files.
EDIT: Yes, the transformation is the same for each string. It is based on an algorithm, so each transformed string will look different (for example, using a cipher on the word to make it unreadable). Anyway, I'm just looking for someone to point me in the right direction; I'm not familiar with many of the algorithms and data structures out there.
From a class I took once, I remember we covered a couple of different algorithms. Here are the ones that I remembered to be pretty effective with large text files...
Boyer-Moore:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Knuth-Morris-Pratt:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
These will only help with the lookup; you can then do the manipulation yourself.
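For reference, a compact C# sketch of Knuth-Morris-Pratt is shown below. It finds the first occurrence of a single pattern, so on its own it would have to be run once per term. The class name KmpSearch is made up for illustration.

```csharp
using System;

// Hypothetical helper, named for illustration only.
public static class KmpSearch
{
    // Returns the index of the first occurrence of 'pattern' in 'text', or -1.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;

        // fail[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it.
        var fail = new int[pattern.Length];
        for (int i = 1, k = 0; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }

        // Scan the text once; on a mismatch fall back via the table
        // instead of re-reading text characters.
        for (int i = 0, k = 0; i < text.Length; i++)
        {
            while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
            if (text[i] == pattern[k]) k++;
            if (k == pattern.Length) return i - k + 1;
        }
        return -1;
    }
}
```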
A hash table (Dictionary) is going to give you faster lookups than a tree structure. A well-built hash table can find a matching word entry with two or three probes, while a tree structure may require up to an order of magnitude more comparisons.
As for splitting up the words, it would seem to be simple enough to collect all alphabetical characters (and possibly digit characters) up to the next whitespace or punctuation character for each word. You will probably want to convert each word into all-lowercase before looking it up in the dictionary.
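A minimal C# sketch of that approach follows: it walks the text once, collects runs of letters and digits as words, looks each word up in lowercase in a hash set, and writes either the transformed or the original word to the output. TermReplacer and ReplaceTerms are hypothetical names, the term set is assumed to hold lowercase entries, and the transform is whatever function you supply.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class TermReplacer
{
    // Replaces every word found in 'terms' with transform(word); all other text is copied as-is.
    public static string ReplaceTerms(string text, ISet<string> terms, Func<string, string> transform)
    {
        var output = new StringBuilder(text.Length);
        var word = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                word.Append(c);                      // still inside a word
                continue;
            }
            Flush(word, output, terms, transform);   // whitespace or punctuation ends the word
            output.Append(c);
        }
        Flush(word, output, terms, transform);       // last word, if the text ends with one
        return output.ToString();
    }

    private static void Flush(StringBuilder word, StringBuilder output,
                              ISet<string> terms, Func<string, string> transform)
    {
        if (word.Length == 0) return;
        string w = word.ToString();
        // Lookup is done on the lowercase form; the hash set makes it O(1) on average.
        output.Append(terms.Contains(w.ToLowerInvariant()) ? transform(w) : w);
        word.Clear();
    }
}
```

For very large files the same loop can be fed from a StreamReader and written to a StreamWriter instead of building the whole result in memory.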