Hey guys, I have a text file that I have divided into 4 parts. I want to search each part for the words that appear in it and score each word by the number of parts it appears in.
Example:
welcome to the national basketball finals,the basketball teams here today have come a long way. without much delay lets play basketball.
I would want to return national = 1, as it appears in only one part, etc.
I am working on determining text context using word position.
I am working with C# and am not very good at text processing.
Basically:
if a word appears in all 4 sections it scores 4
if a word appears in 3 sections it scores 3
if a word appears in 2 sections it scores 2
if a word appears in 1 section it scores 1
Thanks in advance.
So far I have this:
var s = "welcome to the national basketball finals,the basketball teams here today have come a long way. without much delay lets play basketball. ";
var numberOfParts = 4;
var eachPartLength = s.Length / numberOfParts;
var parts = new List<string>();
var words = Regex.Split(s, @"\W").Where(w => w.Length > 0); // this splits all words, removes empty strings
var wordsIndex = 0;
for (int i = 0; i < numberOfParts; i++)
{
var sb = new StringBuilder();
while (sb.Length < eachPartLength && wordsIndex < words.Count())
{
sb.AppendFormat("{0} ", words.ElementAt(wordsIndex));
wordsIndex++;
}
// here you have the part
Response.Write(string.Format("[{0}]", sb));
parts.Add(sb.ToString());
}
var allwords = parts.SelectMany(p => p.Split(' ').Distinct());
var wordsInAllParts = allwords.Where(w => parts.All(p => p.Contains(w))).Distinct();
This question is very difficult to interpret. I don't fully understand your goal and it is my suspicion that you might not either.
In the absence of a clear requirement, there is no way to give a specific answer, so I will give a generic one:
Try writing a test that clearly specifies the exact behavior you want. You've got the beginnings of one with your sample string and the result you want, but what you are looking for is still ambiguous.
Make a test that, when it passes, demonstrates that one of the required behaviors is there. If that doesn't help you get a solution to the problem, come back and edit this question or make a new one that includes the test.
At the very least, you will be able to harvest better answers from this site.
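For what it's worth, here is one possible reading of the requirement as a minimal sketch: cut the word list into consecutive parts of roughly equal size and score each word by the number of parts that contain it. The class name PartScorer, the method ScoreWords, the lower-casing, and the choice to split by word count (rather than by character length, as the question's code does) are all assumptions, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PartScorer
{
    // Hypothetical helper: splits the text into words, cuts the word list into
    // consecutive parts, and counts for each distinct word how many parts contain it.
    public static Dictionary<string, int> ScoreWords(string text, int numberOfParts)
    {
        var words = Regex.Split(text.ToLowerInvariant(), @"\W+")
                         .Where(w => w.Length > 0)
                         .ToList();

        // Ceiling division so that every word lands in one of the parts.
        int partSize = Math.Max(1, (words.Count + numberOfParts - 1) / numberOfParts);

        var scores = new Dictionary<string, int>();
        for (int start = 0; start < words.Count; start += partSize)
        {
            // Distinct() makes a word count once per part, however often it repeats there.
            foreach (var word in words.Skip(start).Take(partSize).Distinct())
            {
                scores.TryGetValue(word, out int current);
                scores[word] = current + 1;
            }
        }
        return scores;
    }
}
```

With the sample sentence and 4 parts, this reading scores "national" as 1 and "basketball" as 3. Note that for very short texts the ceiling division can produce fewer parts than requested.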
I have about 10 sentences inside my new List(file.txt). Example:
Django Unchained is the first time in 16 years that Leonardo DiCaprio didn’t get the top billing.
Django Unchained is a 2012 American revisionist Western film written and directed by Quentin Tarantino.
etc.
When I enter text in the search box, e.g. "Django Unchained Tarantino", I want to display the most relevant result at the top of the listBox and the rest of the sentences below it. Any ideas?
The word searching algorithm has several facets to it:
what search criteria are supposed to represent (usually a set or a sequence of words, but it might be more elaborate with exclusions etc.)
what search items are supposed to represent (likewise, usually a set or a sequence of words)
the relevance metric, applying the search criteria to a specific search item
Since the asker said in comments that it's the result with the highest number of matching words, I'm going to present here such an approach. Of course, the more usable search we want, the more sophisticated algorithm we need.
First of all, we need to decide what makes a "word". For simplicity, I'm going to assume it's a continuous sequence of alphanumeric characters, and both search criteria and results are given as alphanumeric sequences separated with non-alphanumeric blocks. Thus, the following split method is used:
public static IEnumerable<string> SplitSearchWords(string str)
{
int charIndex = 0;
int wordStart = 0;
while (charIndex < str.Length)
{
wordStart = charIndex;
if (char.IsLetterOrDigit(str[charIndex]))
{
while (charIndex < str.Length && char.IsLetterOrDigit(str[charIndex])) charIndex++;
yield return str.Substring(wordStart, charIndex-wordStart);
}
else
{
while (charIndex < str.Length && !char.IsLetterOrDigit(str[charIndex])) charIndex++;
}
}
}
Then, we need a method to calculate the search relevance (I'm assuming that many occurrences of search word don't make the search item more relevant, thus I use Intersect method):
public static int CalculateSearchRelevance(string searchItem, IEnumerable<string> searchWords)
{
var searchItemWords = SplitSearchWords(searchItem);
return searchWords.Intersect(searchItemWords, StringComparer.OrdinalIgnoreCase).Count();
}
And finally, to put it all together:
var query = "Django Unchained Tarantino";
var items = new List<string>()
{
"Django Unchained is the first time in 16 years that Leonardo DiCaprio didn’t get the top billing.",
"pineapples",
"Django Unchained is a 2012 American revisionist Western film written and directed by Quentin Tarantino.",
};
var searchWords = SplitSearchWords(query).Distinct(StringComparer.OrdinalIgnoreCase).ToList();
var sortedItems = items.OrderByDescending(s => CalculateSearchRelevance(s, searchWords)).ToList();
Or, if you want to remove all items that are complete misses, you set the sortedItems to the following:
var sortedItems = items
.Select(item => new KeyValuePair<int, string>(CalculateSearchRelevance(item, searchWords), item))
.Where(relevanceItem => relevanceItem.Key > 0)
.OrderByDescending(relevanceItem => relevanceItem.Key)
.Select(relevanceItem => relevanceItem.Value)
.ToList();
Depending on one's needs, there might be some adjustments made to determine how query is turned into search criteria or how relevance metric is calculated, but the general idea should remain roughly the same.
Very new to programming/selenium automation in C# and have hit a bit of a stumbling block that I am struggling at resolving, I have looked around on here / google but nothing has come up matching what I am looking for, I could be wording my question slightly wrong so forgive me if that is the case.
What I need to achieve,
When logging into a website after entering a username/password we are prompted to enter a pin code, specifically a randomly generated combination – example “Please enter numbers 1, 2 & 3 of your PIN” - where (in the example) 1, 2, 3 can be anything from 1 to 6 (always chronological order), the message itself, “,” and “&” positions do not change – only the numbers.
** Info from the 'line' containing the message (and numbers) **
Inner HTML
Please enter numbers 1, 3 & 5 of your PIN
Outer HTML
Please enter numbers 1, 3 & 5 of your PIN
CSS Selector
h2.login-desktop-only
xPath
/html/body/section[2]/div/div/div/form/div/div[2]/div[1]/div/h2
For this situation I am using a ‘UAT site’ so I have control over the PIN; let’s say it is always 123456, so 1 = 1, 2 = 2, 3 = 3 and so on. I have no way to determine which numbers will be asked for each time the test is run.
How can I ‘scrape’ the text from ‘Please enter numbers XXXXX’ and parse (I think that is the correct word) the data to separate the ‘scraped’ numbers and then in turn use that data to match the pre-declared ‘1 = 1’ etc etc to then end up selecting the correct number on the keypad?
I imagine this is going to need a list of IF statements but again I still do not know how to scrape / store the requested numbers. Ideally would like to keep this using c# (however if any Java examples exist I can work with that as a colleague is using java selenium - both of us are very new to this)
Any advice would be greatly appreciated.
(EDIT TO Add code from comment)
Many thanks for getting back to me, I tried that code and it has located the index position of the integers contained within that ‘string’. Currently it ‘prints’ out the index position, but how can I get that to give the value rather than print it?
I suppose if I could assign it to a variable I could then split the three numbers down to a unique variable that has IF statements to cover the IF 1 – then 1 IF 2 – then 2 and so on. If that makes sense?
public class Some_Class {
public static void main(String[] args) {
WebDriver driver = new SafariDriver();
driver.get("SomeWebsite");
driver.findElement(By.id("username")).sendKeys("XXXXXXX");
driver.findElement(By.id("password")).sendKeys("XXXXXXX ");
driver.findElement(By.id("login-button")).click();
/* --- This was my original plan to set the xpath as a string and then replace all with numbers only. This did not work as I thought.
{
WebElement str = driver.findElement(By.xpath("/html/body/section[2]/div/div/div/form/div/div[2]/div[1]/div/h2"));
String numberOnly = str.replaceAll("[^0-9]", "");
}
*/
WebElement option = driver.findElement(By.xpath("/html/body/section[2]/div/div/div/form/div/div[2]/div[1]/div/h2"));
String word=option.getText();
String check[]=word.split("");
for(int i=0; i<check.length ; i++)
{
if( Pattern.matches("\\d", check[i]))
{
System.out.println("found integer at i = "+ i);
}
}
}
}
Input: you need to scrape a string containing integers and letters and return only the integers.
Here's an example I have done using Selenium with Java.
Please change the URL and the WebElement to the ones you want to scrape.
driver.get("https://en.wikipedia.org/wiki/Main_Page");
WebElement option = driver.findElement(By.xpath("//*[@id=\"articlecount\"]"));
String word=option.getText();
//here you get - 5,589,206 articles in English
String check[]=word.split("");
for(int i=0; i<check.length ; i++)
{
if( Pattern.matches("\\d", check[i]))
{
System.out.println("found integer at i = "+ i);
}
}
Basically that would print you the index at which you have integers. Use them
In the case of a challenge string that says
string pinRequest = "Please enter pin digits 1,4 & 8 of your pin.";
var pinNums = pinRequest.Where(char.IsDigit).Select(c => c - '0').ToList(); // convert each digit char to its int value
pinNums will be a list of integers with 3 elements, equal to
{1,4,8}
Your pin challenge is then solved via:
string part1 = fullWord.ToArray()[pinNums[0] - 1].ToString();
string part2 = fullWord.ToArray()[pinNums[1] - 1].ToString();
string part3 = fullWord.ToArray()[pinNums[2] - 1].ToString();
where fullWord could be something like 123123 or Password123
Note that this reads each digit separately, so it will not handle pin challenges with positions of 10 or above: a position such as 10 would be read as 1 and 0.
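If multi-digit positions matter, a regular expression that matches whole numbers avoids the problem. The following is a minimal sketch, not a definitive implementation; the class name PinChallenge and the method Solve are made up for illustration, and it assumes the prompt contains only the requested positions as numbers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PinChallenge
{
    // Hypothetical helper: returns the PIN characters requested by a prompt such as
    // "Please enter numbers 1, 3 & 5 of your PIN".
    public static string[] Solve(string prompt, string pin)
    {
        return Regex.Matches(prompt, @"\d+")               // whole numbers, so "10" stays "10"
                    .Cast<Match>()
                    .Select(m => int.Parse(m.Value))       // 1-based position in the PIN
                    .Select(position => pin[position - 1].ToString())
                    .ToArray();
    }
}
```

In a Selenium C# test the prompt would presumably come from something like driver.FindElement(By.CssSelector("h2.login-desktop-only")).Text, and each returned value can then be used to pick the matching key on the keypad, with no chain of IF statements needed.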
In short - I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.
Background
I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather", using C#. I have created a working solution, a recursive function running in exponential time, which is simply not efficient enough for my purposes (processing well over 100 joined words). Here are the questions I have read so far, which I believe may be helpful, but I cannot translate their responses from Python to C#.
How can I split multiple joined words?
Need help understanding this Python Viterbi algorithm
How to extract literal words from a consecutive string efficiently?
My Current Recursive Solution
This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.
My current solution works out all possible combinations of words, finds the most probable output and displays. I am currently defining the most probable output as the one which uses the longest individual words - I would prefer to use a different method. Here is my current solution, using a recursive algorithm.
static public string find_words(string instring)
{
if (words.Contains(instring)) //where words is my dictionary of words
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
string bestSolution = "";
string solution = "";
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
//if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
This algorithm relies on having no spacing or other symbols in the entry text (not really a problem here, I'm not fussed about splitting up punctuation). Random additional letters added within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".
My updated exponential solution:
static List<string> words = File.ReadLines("E:\\words.txt").ToList();
static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
private void btnAutoSpacing_Click(object sender, EventArgs e)
{
string text = txtText.Text;
text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
if (text.Length > 150)
{
//possibly split the text up into more manageable chunks?
//considering using textSplit() for this.
}
else
{
txtText.Text = find_words(text);
}
}
static IEnumerable<string> textSplit(string str, int chunkSize)
{
return Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize));
}
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
static public string find_words(string instring)
{
string bestSolution = "";
string solution = "";
if (compiledWords[instring[0]].Contains(instring))
{
return instring;
}
if (solutions.ContainsKey(instring.ToString()))
{
return solutions[instring];
}
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
How I would like to use the Viterbi Algorithm
I would like to create an algorithm which works out the most probable solution to a conjoined string, where the probability is calculated according to the position of the word in a text file that I provide the algorithm with. Let's say the file starts with the most common word in the English language first, and on the next line the second most common, and so on until the least common word in my dictionary. It looks roughly like this
the
be
and
...
attorney
Here is a link to a small example of such a text file I would like to use.
Here is a much larger text file which I would like to use
The logic behind this file positioning is as follows...
It is reasonable to assume that they follow Zipf's law, that is the
word with rank n in the list of words has probability roughly 1/(n log
N) where N is the number of words in the dictionary.
Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution to the problem from Python into C#, but after many hours spent attempting this I haven't been able to produce a working solution.
I also remain open to the idea that perhaps relative frequencies with the Viterbi algorithm isn't the best way to split words, any other suggestions for creating a solution using C#?
Written text is highly contextual and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption -- but there is still hope, the Viterbi algorithm is a case of branch-and-bound optimization aka "pruned dynamic programming" (something I showed in my thesis) and therefore even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now... assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works on accumulating additive costs.
For independent events, joint probability is the product of the individual probabilities, making negative log-probability a good choice for the cost.
So your single-step cost would be -log(P) or log(1/P) which is log(index * log(N)) which is log(index) + log(log(N)) and the latter term is a constant.
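To make the additive-cost idea concrete, here is a rough C# sketch of a dynamic-programming split using the cost log(index * log(N)) described above. It is not a line-by-line port of the Python answer. The names WordSplitter and Split, and the fixed penalty for an unknown single character, are assumptions added for illustration; it also assumes the ranked list is non-empty and has at least 3 words so that every cost is positive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSplitter
{
    // Splits `text` into the word sequence with the lowest total cost.
    // `rankedWords` must be ordered from most to least frequent.
    public static string Split(string text, IList<string> rankedWords)
    {
        // Zipf-style cost: -log(P) with P ~ 1 / (rank * log N).
        double logN = Math.Log(rankedWords.Count);
        var wordCost = new Dictionary<string, double>();
        for (int i = 0; i < rankedWords.Count; i++)
        {
            if (!wordCost.ContainsKey(rankedWords[i]))
                wordCost[rankedWords[i]] = Math.Log((i + 1) * logN);
        }
        int maxWordLength = rankedWords.Max(w => w.Length);

        // Assumption (not in the original): one unknown character is allowed at a
        // high fixed cost, so input with stray letters still gets a split.
        const double unknownCharCost = 50.0;

        // cost[i] = lowest cost of splitting the first i characters;
        // lastLength[i] = length of the final word in that best split.
        var cost = new double[text.Length + 1];
        var lastLength = new int[text.Length + 1];
        for (int i = 1; i <= text.Length; i++)
        {
            cost[i] = double.PositiveInfinity;
            for (int k = 1; k <= Math.Min(i, maxWordLength); k++)
            {
                double c;
                if (!wordCost.TryGetValue(text.Substring(i - k, k), out c))
                {
                    if (k != 1) continue;   // unknown multi-character chunks are not allowed
                    c = unknownCharCost;
                }
                if (cost[i - k] + c < cost[i])
                {
                    cost[i] = cost[i - k] + c;
                    lastLength[i] = k;
                }
            }
        }

        // Walk back from the end to recover the chosen words.
        var result = new List<string>();
        for (int i = text.Length; i > 0; i -= lastLength[i])
        {
            result.Add(text.Substring(i - lastLength[i], lastLength[i]));
        }
        result.Reverse();
        return string.Join(" ", result);
    }
}
```

The inner loop is bounded by the longest dictionary word, so the running time grows linearly with the length of the input. Unknown letters still come out one per token ("d y k j s"); merging adjacent unknown characters into one token would be a separate post-processing step.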
Can't help you with the Viterbi algorithm, but I'll give my two cents concerning your current approach. From your code it's not exactly clear what words is. This can be a real bottleneck if you don't choose a good data structure. As a gut feeling I'd initially go with a Dictionary<char, HashSet<string>> where the key is the first letter of each word:
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
And I'd also consider serializing it to disk to avoid building it up every time.
Not sure how much improvement you can get this way (I don't have full information about your current implementation), but benchmark it and see if you get any improvement.
NOTE: I'm assuming all words are cased consistently.
I want to display specific lines from a very large text file.
Part of the file is shown below the code.
What I want to do is display specific strings from the text file.
For example, take the Footlocker page. On the Footlocker shop page I want to retrieve the last 5 updates in the text file beginning with "Footlocker", showing only Footlocker's most recent posts. I have tried many approaches, including Array.Sort, but I am not sure how to do this. Thanks for your help.
Footlocker's page
//declaring string
string footlockerPosts =
sr.ReadToEnd();
//initialising string
string[] footlockerArray = footlockerPosts.Split('\n');
string[] sort = footlockerArray;
var target = "F";
var results = Array.FindAll(sort, f => f.Equals(target));
for (int i = footlockerArray.Length - 1; i > footlockerArray.Length - 7; i--)
{
footlockerArray.Reverse();
footlockerExistingBlogTextBox.Text += footlockerArray[i];
}
sr.Close();
return;
}
This is a small snippet of my file.
File
Footlocker,Rick,What a fabulous shop.
Footlocker,Ioela,Fantastic and incredible service.
Footlocker,Fisi,Can't wait to go back to shop!
Footlocker,Allui,Lovin' the new design and layout!
Footlocker,Rich,Can't wait for next season clothing range.
Hypebeast,Johnny,I didn’t get proper service from the shop assistant.
Hypebeast,Dalas,Awesome range of goods, great service.
Hypebeast,King,Cool music great staff.
Hypebeast,Nelson,Overated shop.
Hypebeast,Rick,Lovely place lovely people.
Hypebeast,Rick,What a fabulous shop.
Hypebeast,Ioela,Fantastic and incredible service.
Hypebeast,Fisi,Can't wait to go back to shop!
Hypebeast,Allui,Lovin' the new design and layout!
Hypebeast,Rich,Can't wait for next season clothing range.
Lonestar,Johnny,I didn’t get proper service from the waiter.
Lonestar,Dalas,Awesome range of food, great service.
Lonestar,King,Cool music great staff.
Lonestar,Nelson,Overated restaurant.
Lonestar,Rick,Lovely place lovely people.
Try the following and let me know if it gets you the results you want
var results = sort.Where(r => r.StartsWith(target)).Reverse();
foreach (string result in results)
{
footlockerExistingBlogTextBox.Text += result;
}
Hope that helps!
Also, just a note: there is no reason to assign another variable equal to the original array. You can remove the line string[] sort = footlockerArray; and use the original footlockerArray instead.
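To limit the output to the five most recent posts for one shop, the same idea can be extended with Take. This is only a sketch under the assumption that newer posts are appended at the end of the file; the names ShopPosts and LatestPosts are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ShopPosts
{
    // Hypothetical helper: returns up to `count` of the most recent lines for `shop`,
    // newest first, assuming newer posts sit further down the file.
    public static List<string> LatestPosts(IEnumerable<string> lines, string shop, int count)
    {
        return lines
            .Select(line => line.Trim())   // drops a trailing '\r' left over from Split('\n')
            .Where(line => line.StartsWith(shop + ",", StringComparison.OrdinalIgnoreCase))
            .Reverse()
            .Take(count)
            .ToList();
    }
}
```

It could then be used along the lines of footlockerExistingBlogTextBox.Text = string.Join(Environment.NewLine, ShopPosts.LatestPosts(footlockerArray, "Footlocker", 5)); so that each post ends up on its own line.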
I am fairly new to C# programming and I am stuck on my little ASP.NET project.
My website currently examines Twitter statuses for URLs and then adds those URLs to an array, all via a regular expression pattern matching procedure. Clearly more than one person will post an update containing a specific URL, so I do not want to list duplicates, and I want to count the number of times a particular URL is mentioned in, say, 100 tweets.
Now I have a List<String> which I can sort so that all duplicate URLs are next to each other. I was under the impression that I could compare list[i] with list[i+1] and if they match, for a counter to be added to (count++), and if they don't match, then for the URL and the count value to be added to a new array, assuming that this is the end of the duplicates.
This would remove duplicates and give me a count of the number of occurrences for each URL. At the moment, what I have is not working, and I do not know why (like I say, I am not very experienced with it all).
With the code below, assume that a JSON feed has been searched for using a keyword into srchResponse.results. The results with URLs in them get added to sList, a string List type, which contains only the URLs, not the message as a whole.
I want to put one of each URL (no duplicates), a count integer (to string) for the number of occurrences of a URL, and the username, message, and user image URL all into my jagged array called 'urls[100][]'. I have made the array 100 rows long to make sure everything can fit but generally, this is too big. Each 'row' will have 5 elements in them.
The debugger gets stuck on the line: if (sList[i] == sList[i + 1]) which is the crux of my idea, so clearly the logic is not working. Any suggestions or anything will be seriously appreciated!
Here is sample code:
var sList = new ArrayList();
string[][] urls = new string[100][];
int ctr = 0;
int j = 1;
foreach (Result res in srchResponse.results)
{
string content = res.text;
string pattern = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
MatchCollection matches = Regex.Matches(content, pattern);
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
sList.Add(groups[0].Value.ToString());
}
}
sList.Sort();
foreach (Result res in srchResponse.results)
{
for (int i = 0; i < 100; i++)
{
if (sList[i] == sList[i + 1])
{
j++;
}
else
{
urls[ctr][0] = sList[i].ToString();
urls[ctr][1] = j.ToString();
urls[ctr][2] = res.text;
urls[ctr][3] = res.from_user;
urls[ctr][4] = res.profile_image_url;
ctr++;
j = 1;
}
}
}
The code then goes on to add each result into a StringBuilder method with the HTML.
The description of your algorithm seems fine. I don't know what's wrong with the implementation; I haven't read it that carefully. (The fact that you are using an ArrayList is an immediate red flag; why aren't you using a more strongly typed generic collection?)
However, I have a suggestion. This is exactly the sort of problem that LINQ was intended to solve. Instead of writing all that error-prone code yourself, just describe the transformation you're interested in, and let the compiler work it out for you.
Suppose you have a list of strings and you wish to determine the number of occurrences of each:
var notes = new []{ "Do", "Fa", "La", "So", "Mi", "Do", "Re" };
var counts = from note in notes
group note by note into g
select new { Note = g.Key, Count = g.Count() };
foreach(var count in counts)
Console.WriteLine("Note {0} occurs {1} times.", count.Note, count.Count);
Which I hope you agree is much easier to read than all that array logic you wrote. And of course, now you have your sequence of unique items; you have a sequence of counts, and each count contains a unique Note.
I'd recommend using a more sophisticated data structure than an array. A Set will guarantee that you have no duplicates.
Older versions of the C# collections don't include a Set (there are 3rd party implementations available, like this one), but .NET 3.5 and later provide HashSet<T> in System.Collections.Generic.
Your loop fails because when i == 99, (i + 1) == 100 which is outside the bounds of your array.
But as others have pointed out, .NET 3.5 has ways of doing what you want more elegantly.
If you don't need to know how many duplicates a specific entry has you could do the following:
LINQ Extension Methods
.Count()
.Distinct()
.Count()
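As a rough illustration of how those extension methods fit together, the sketch below compares the total count with the distinct count, and also shows a per-URL count using GroupBy for the case where the number of occurrences is needed after all. The class and method names (UrlStats, CountDuplicates, CountPerUrl) are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlStats
{
    // Number of entries that repeat an earlier entry: total count minus distinct count.
    public static int CountDuplicates(IEnumerable<string> urls)
    {
        var list = urls.ToList();
        return list.Count - list.Distinct().Count();
    }

    // Occurrences of each distinct URL, keyed by the URL itself.
    public static Dictionary<string, int> CountPerUrl(IEnumerable<string> urls)
    {
        return urls.GroupBy(url => url)
                   .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Neither method needs the list to be sorted first, and there is no index arithmetic that could run past the end of the collection.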