Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I've been researching an efficient solution to this. I've looked into diffing engines (Google's diff-match-patch, Python's diff) and some longest common chain algorithms.
I was hoping to get your suggestions on how to solve this issue. Is there any algorithm or library in particular you would recommend?
I don't know what "longest common [[chain? substring?]]" has to do with "percent difference", especially after seeing in a comment that you expect a very small % difference between two strings that differ by one character in the middle (so their longest common substring is about one half of the strings' length).
Ignoring the "longest common" strangeness, and defining "percent difference" as the edit distance between the strings divided by the max length (times 100 of course;-), what about:
def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    # distance_matrix[i][j] is the distance between first[:i] and second[:j].
    distance_matrix = [[0] * second_length for _ in range(first_length)]
    for i in range(first_length):
        distance_matrix[i][0] = i
    for j in range(second_length):
        distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

def percent_diff(first, second):
    # Edit distance as a percentage of the longer string's length.
    return 100 * levenshtein_distance(first, second) / float(max(len(first), len(second)))

a = "the quick brown fox"
b = "the quick vrown fox"
print('%.2f' % percent_diff(a, b))
The Levenshtein function is from Stavros' blog. The result in this case would be 5.26 (percent difference).
In addition to difflib and other common subsequence libraries, if it's natural language text, you might look into stemming, which normalizes words to their root form. You can find several implementations in the Natural Language Toolkit ( http://www.nltk.org/ ) library. You can also compare blobs of natural language text more semantically by using N-Grams ( http://en.wikipedia.org/wiki/N-gram ).
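As a rough illustration of the N-gram idea, here is a minimal sketch that compares the sets of character trigrams of two strings and reports 100 * (1 - Jaccard similarity). The function names, the choice of character (rather than word) n-grams, the default n=3, and the use of Jaccard similarity are all assumptions made for the example, not something the answer above specifies.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams in text."""
    if len(text) < n:
        # Too short for a full n-gram: treat the whole string as one gram.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_percent_diff(first, second, n=3):
    """Percent difference as 100 * (1 - Jaccard similarity) of n-gram sets."""
    grams_a = char_ngrams(first, n)
    grams_b = char_ngrams(second, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 100.0 * (1 - len(grams_a & grams_b) / len(union))
```

Because this works on sets, it ignores how often and where each n-gram occurs, so it will generally give a different number from the edit-distance percentage for the same pair of strings.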
Longest common chain? Perhaps this will help then: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Another area of interest might be the Levenshtein distance described here.
I'm currently using the Bellman-Ford algorithm to find shortest paths in a graph with negative edge weights. Is there any faster algorithm that would outperform Bellman-Ford for finding shortest paths with negative weights?
A simple improvement is to only check for "active" nodes instead of iterating on all of them as the naive implementation does.
The reason is that if a node didn't change value in the last iteration, it cannot lead to improvements on any of its neighbors, so there is no need to redo the computation (it will still produce no improvements).
Pseudocode (Python, actually):
A = set([seed])
steps = 0
while len(A) > 0 and steps < number_of_nodes:
    steps += 1
    NA = set()
    for node in A:
        for nh in neighbours(node):
            x = solution[node] + weight(node, nh)
            if x < solution[nh]:
                # We found an improvement...
                solution[nh] = x
                pred[nh] = node
                NA.add(nh)
    A = NA
A is the "active" node set, containing the nodes that were improved on the last step, and NA is the "next-active" node set that will need to be checked for improvements on the next iteration.
Initially the solution is set to +Infinity for all nodes except the seed where the solution is 0. Initially only the seed is in the "active" set.
Note that in case of negative-sum loops reachable from the seed the problem has no "minimum path" because you can get the total as low as you want by simply looping; this is the reason for the limit on the "steps" value.
If A is not empty when coming out of the loop, then there is no solution to the minimum cost problem (there is a negative-sum loop and you can lower the cost by simply looping).
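The loop above can be turned into a self-contained function along these lines. This is a sketch under stated assumptions: the graph is passed as a dict mapping each node to a list of (neighbour, weight) pairs, and the function name and the choice to return None for a negative-sum loop are made up for the example.

```python
import math


def shortest_paths(edges, seed):
    """Active-set Bellman-Ford.

    edges maps each node to a list of (neighbour, weight) pairs (assumed layout).
    Returns (solution, pred), or None if a negative-sum loop is reachable.
    """
    # Collect every node that appears as a source, a target, or the seed.
    nodes = set(edges)
    nodes.add(seed)
    for targets in edges.values():
        nodes.update(nh for nh, _ in targets)

    solution = {node: math.inf for node in nodes}
    solution[seed] = 0
    pred = {}

    active = {seed}
    steps = 0
    while active and steps < len(nodes):
        steps += 1
        next_active = set()
        for node in active:
            for nh, weight in edges.get(node, []):
                x = solution[node] + weight
                if x < solution[nh]:
                    # Improvement found: nh must be re-examined next round.
                    solution[nh] = x
                    pred[nh] = node
                    next_active.add(nh)
        active = next_active

    if active:
        return None  # still improving after len(nodes) rounds: negative loop
    return solution, pred
```

For example, `shortest_paths({"a": [("b", 4), ("c", 2)], "c": [("b", -1)], "b": [("d", 3)]}, "a")` reaches b through c at cost 1 instead of directly at cost 4.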
I'm trying to figure out a good way to write a little "algorithm" that would be able to find a range of values between two numbers:
Let's suppose the maximum number is 1500 and the minimum number is 1;
By applying some sort of mathematical formula, the method would be able to determine that the best range between these two numbers is, let's say, 100;
So range would be:
100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500
Other example:
Maximum is 10, minimum 1;
Best range would be (let's say):
2,4,6,8,10
Are there any libraries in C# which offer this kind of solution, or is there some neat mathematical formula used to determine this?
P.S. Guys there can be a remainder in the number as well...
I'm guessing I can divide the maximum number into let's say 7 fixed groups, and then just add up the divided number until I get the max value , no ?
Okay guys, I've figured out an idea. Let's suppose the maximum number is a floating point number, 1326.44..., while the minimum is 132.5
I'm going to say that the maximum number of ranges can be 7... So what I can do is divide 1326.44 by 7 and I'll get 189.49
So the first amount in range is:
var ranges = new[] { 132.5, 189.5 ... /*Now I just need to dynamically somehow add the rest of the range elements?*/ };
This is actually super easy. You have a min range value and a max range value, and you want a particular number of steps in your range. Therefore, you simply need to calculate a step value, and then add it repeatedly to the minimum value until you reach the maximum value. For example:
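var min = 132.5;
var max = 1326.44;
var count = 7;
var step = (max - min) / count;
var items = new List<double>();
// Loop on an integer index so accumulated rounding error cannot skip the last value.
// This produces count + 1 values, from min to max inclusive.
for (var i = 0; i <= count; i++)
{
    items.Add(min + i * step);
}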
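The fixed-count approach above gives evenly spaced but "ugly" values. If the goal is round steps like the 100 and 2 in the question's examples, one common alternative is to round the raw step up to 1, 2, 5 or 10 times a power of ten. The sketch below is written in Python rather than C#, and the function names and the `target_count` parameter are assumptions made for illustration; it is a different technique from the one in the answer, not a restatement of it.

```python
import math


def nice_step(minimum, maximum, target_count=10):
    """Round the raw step up to 1, 2, 5 or 10 times a power of ten."""
    raw = (maximum - minimum) / target_count
    if raw <= 0:
        raise ValueError("maximum must be greater than minimum")
    magnitude = 10 ** math.floor(math.log10(raw))
    for factor in (1, 2, 5, 10):
        if raw <= factor * magnitude:
            return factor * magnitude
    return 10 * magnitude  # guard against floating-point rounding in log10


def range_values(minimum, maximum, target_count=10):
    """Multiples of the nice step from the first one >= minimum up to maximum."""
    step = nice_step(minimum, maximum, target_count)
    first = math.ceil(minimum / step)
    last = math.floor(maximum / step)
    # Multiply by an integer index instead of accumulating the step.
    return [k * step for k in range(first, last + 1)]
```

With `range_values(1, 1500, 15)` this yields 100, 200, ..., 1500, and `range_values(1, 10, 5)` yields 2, 4, 6, 8, 10, matching the two examples in the question. The number of values returned is only approximately `target_count`, because the step is rounded.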
I need to scramble the characters in a word, and the scrambling must not be random. In other words, every time I scramble the same word, the result must be equal to the previous scramble result for that word. A very simple example is XOR, but XOR is very easy to decode and I need something stronger. Could you recommend a library for this purpose that works identically in C# and JavaScript?
Thank you for any advice!:)
You can use random with fixed seed if you really want to scramble characters:
string input = "hello";
char[] chars = input.ToCharArray();
Random r = new Random(2011); // fixed seed, so the same input always produces the same sequence of numbers
for (int i = 0; i < chars.Length; i++)
{
    // Swap position i with a pseudo-randomly chosen position.
    int randomIndex = r.Next(0, chars.Length);
    char temp = chars[randomIndex];
    chars[randomIndex] = chars[i];
    chars[i] = temp;
}
return new string(chars);
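One caveat with the snippet above is that `System.Random` has no counterpart in JavaScript (`Math.random` cannot be seeded), so the two sides would not agree. A way around that is to write the generator yourself so it can be ported line for line. The sketch below is in Python purely for illustration: the function names are made up, the linear congruential generator constants are the widely published "Numerical Recipes" ones, and the whole thing is a reversible scramble for anyone who knows the seed, not encryption.

```python
def lcg(seed):
    """Yield 32-bit values from a linear congruential generator."""
    state = seed & 0xFFFFFFFF
    while True:
        # The product stays below 2**53, so a JavaScript port can use plain
        # numbers followed by a modulo 2**32 instead of the bit mask.
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        yield state


def scramble(word, seed=2011):
    """Fisher-Yates shuffle of word, driven by the seeded generator above."""
    chars = list(word)
    rng = lcg(seed)
    for i in range(len(chars) - 1, 0, -1):
        j = (next(rng) >> 16) % (i + 1)  # use the high bits; low LCG bits are weak
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```

Calling `scramble("hello")` twice returns the same string both times, because the generator is restarted from the same seed on every call. The permutation depends only on the seed and the word's length, so all words of the same length are shuffled the same way.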
Although I don't really agree with what you're trying to do, here's a link to an MD5 JavaScript library (assuming what you're after is some kind of one-way hashing; MD5 is a hash function, so its output cannot be decoded back into the word). As for the C# part, it's a built-in feature.
You can use any of the built in .NET classes to generate random numbers and use these to scramble your string. For all the subsequent scramble attempts you can use the result from the first scramble operation. This assumes that the result from the first call is stored somewhere.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
I have a FASTA file containing several protein sequences. The format is like
----------------------
>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*
>protein3
MKTLLLLAVIMIFGLLQAHGNLVNFHRMIKLTTGKEAALSYGF
CHCGVGGRGSPKDATDRCCVTHDCCYKRLEKRGCGTKFLSYKF
SNSGSRITCAKQDSCRSQLCECDKAAATCFARNKTTY
-----------------------------------
Is there a good way to read in this file and store the sequences separately?
Thanks
One way to do this is:
1. Create a vector where each element holds a name and a sequence.
2. Go through the file line by line.
3. If the line starts with >, add an element to the end of the vector and save line.substring(1) to the element as the protein name. Initialize the element's sequence to "".
4. If line.length == 0, the line is blank, so do nothing.
5. Otherwise the line doesn't start with >, so it is part of the sequence: do element.sequence += line on the current vector element. This way each line between >protein2 and >protein3 is concatenated and saved to the sequence of protein2.
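The steps above can be sketched as follows. The example is in Python rather than C# to keep it short; the function name and the list-of-pairs layout are assumptions for illustration, and any lines that appear before the first > header are silently ignored, which is a choice made here rather than something the steps specify.

```python
def parse_fasta(lines):
    """Return a list of [name, sequence] entries from an iterable of lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                        # blank line: do nothing
        if line.startswith(">"):
            records.append([line[1:], ""])  # new entry with an empty sequence
        elif records:
            records[-1][1] += line          # continuation of the current sequence
    return records
```

Since an open file is an iterable of lines, it can be passed in directly, for example `with open("proteins.fasta") as handle: records = parse_fasta(handle)` (the file name is hypothetical).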
I think maybe a little more detail about the exact file structure could be helpful. Just looking at what you have (and a quick peek at the samples on wikipedia) suggest that the name of the protein is prepended with a >, followed by at least one line break, so that would be a good place to start.
You could split the file on newline, and look for a > character to determine the name.
From there it is a little less clear because I'm not sure if the sequence data is all in one line (no linebreaks) or if it could have linebreaks. If there are none, then you should be able to just store that sequence information, and move on to the next protein name. Something like this:
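using (var reader = new StreamReader(@"C:\myfile.fasta"))
{
    string line;
    // ReadLine returns null at the end of the file.
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Length == 0)
            continue; // skip blank lines instead of stopping at the first one
        if (line.StartsWith(">"))
            StoreProteinName(line);
        else
            StoreSequence(line);
    }
}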
If it were me, I would probably use TDD and some sample data to build out a simple parser, and then keep plugging in samples until I felt I had covered all of major variances in the format.
Can you use a language other than C#? There are excellent libraries for dealing with FASTA files and other biological sequences in Perl, Python, Ruby, Java, and R (off the top of my head). They're usually branded Bio* (e.g. BioPerl, BioJava, etc.)
If you're interested in C or C++, check out the answers to this question over at Biostar:
http://biostar.stackexchange.com/questions/1516/c-c-libraries-for-bioinformatics
Do yourself a favor, and don't reinvent the wheel if you don't have to.
How can I draw candle charts in C#? Does anybody have any examples with a nice interface?
I've used the MSChart and found it to be pretty good. It supports candlestick charts. I've used ZedGraph as well but found a few graphical anomalies showed up on my charts but they were otherwise good as well.
I use this for stock data, but it's in VB
With Chart1.ChartAreas("myarea")
    .AxisY.Maximum = (Math.Ceiling((HighValue * 100)) / 100)
    .AxisY.Minimum = (Math.Floor((LowValue * 100)) / 100)
    .AxisY.LabelStyle.Format = "{0.00}"
End With
Dim s1 As New Series
With s1
    .ChartArea = "myarea"
    .ChartType = SeriesChartType.Candlestick
    .XValueType = ChartValueType.String
    .YValueType = ChartValueType.Single
    .YValuesPerPoint = 4
    .CustomProperties = "PriceDownColor=Red, PriceUpColor=Green"
End With
For i = Globals.GraphColumns - 1 To 0 Step -1
    OutData = Data_Array.Item(i)
    ' Note: the Chart control's candlestick type reads its Y values as high, low, open, close;
    ' check that order against the arguments below for your own data.
    s1.Points.AddXY(OutData.thedate, OutData.high, OutData.low, OutData.close, OutData.open)
Next
Chart1.Series.Add(s1)
Me.Controls.Add(Chart1)
ZedGraph is a very easy-to-use LGPLed charting library that can handle candlestick charts.
If you need to save an image to disk, it can do that. If you need to display an interactive graph that supports zooming/panning, it can do that as well with the excellent ZedGraphControl control.
I'm using the .netCharting library for this and it's pretty good. It supports all sorts of charts - candle included. One thing to watch out for is that with the current version (5.3) you have to reverse the high and low price - a pretty ugly and obvious bug. It's a commercial product, but reasonably priced, so could be worth it, depending on your project.
Maybe ChartDirector can be a good solution
http://www.advsofteng.com/doc/cdcomdoc/candlestick.htm
Try xamChart Control Trial version from Infragistics.
Here is another sample at CodeProject