Synchronously read file from wwwroot in Blazor Webassembly - c#

So I've been able to successfully read files asynchronously like so.
I'm making a simple word game (hangman), and I'm reading the words from some text files in wwwroot.
On the new game screen, I have the user pick the length of the word to guess and the set of words they want to use (e.g. regular words, halloween themed words, christmas themed words etc.)
The problem is when the user enters a number for the length of the word, I'm validating that there is a word of that length, and if there isn't then I set the number to the closest word length to what the user entered (e.g. if they entered 999 into the word length field and they were using the regular words list, it would be set to 18 because that's the longest word in the regular word list (telecommunications)). Basically, I need to read the file right when the user changes the word list, (at least with how I'm currently doing things) so that I can correctly validate if/when the user changes the word length field.
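To make the validation rule above concrete, here is a minimal sketch of the "snap to the closest available length" step in plain C#. The class and method names (`WordLengthValidator`, `ClosestLength`) are hypothetical, and it assumes the word list is already loaded into memory; it does not address how the file gets read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordLengthValidator
{
    // Returns the word length present in the list that is closest to the
    // requested length. Ties go to the shorter length (an arbitrary choice
    // for this sketch). Throws InvalidOperationException on an empty list.
    public static int ClosestLength(IEnumerable<string> words, int requested)
    {
        return words
            .Select(w => w.Length)
            .Distinct()
            .OrderBy(len => Math.Abs(len - requested))
            .ThenBy(len => len)
            .First();
    }
}
```

With a list containing "telecommunications" as its longest word, a requested length of 999 would come back as 18.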
I've tried doing something like:
public static string ReadFile(string localUrl) {
    return Http.GetStringAsync(localUrl).Result;
}
Unfortunately Blazor doesn't like this and it throws a runtime exception when I try to do this.
I'm currently binding my input fields to properties (like <input @bind="MyProperty">) because it seemed like an easy way to validate them, but properties can't be asynchronous. I had the idea that I could use @onchange in my word lists combobox to call an asynchronous method that updates the word list, but this feels a lot more messy and I feel like I'm probably going about this the wrong way.
Should I stick with my solution of using @onchange to call an asynchronous method to update my list of words, or should I be doing things differently? Maybe load all my word lists at the beginning (it'd only be a few kb)? Or is there a better way to synchronously load files?


.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can capture text along the string (.+ style) and follow it with a lookahead that checks for what has been captured up to that point, that is, for an immediate repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';

my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);

my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/;  # at least two words, and then some

for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with only one word required, a legitimate "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
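Since the question targets .NET regex, here is a minimal C# sketch of the same idea. `TryFindRepeated` uses the two-word pattern from the Perl example unchanged. `Collapse` is a hypothetical variant, not part of the answer above: it drops the two-word requirement in favor of a minimum length and removes the second copy. The class and method names are made up for illustration, and the same caveats about flagging legitimate data apply.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepetitionFinder
{
    // Pattern from the Perl example: at least two words plus at least one
    // more character, immediately followed by a copy of itself.
    private static readonly Regex TwoWordRepeat =
        new Regex(@"(\w+\W+\w+.+)(?=\1)");

    // Reports the repeated text (capture group 1) if the line contains one.
    public static bool TryFindRepeated(string line, out string repeated)
    {
        Match m = TwoWordRepeat.Match(line);
        repeated = m.Success ? m.Groups[1].Value : "";
        return m.Success;
    }

    // Assumed variant: any run of at least minLength characters that is
    // immediately repeated gets collapsed to a single copy.
    public static string Collapse(string line, int minLength)
    {
        string pattern = "(.{" + minLength + @",})\1";
        return Regex.Replace(line, pattern, "$1");
    }
}
```

Note that the two-word pattern does not match the `C:\testC:\test` line from the question, because `C:\test` has nothing after its second word. The minimum-length variant does catch it: `Collapse` with a minimum length of 4 turns `{FolderLoc = "C:\testC:\test"}` into `{FolderLoc = "C:\test"}`.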

Word vsto get text of document with hidden characters

I'm developing a text analysis vsto add-in for Word.
Therefore I get the text of the active document like this:
After that I analyze it. The analysis returns a list of positions that Word should comment (like character 3 - 6 and character 10 - 13).
The problem is that the comment on 3 to 6 seems to add a (hidden) character to the document, because every comment Word places after the first one lands one character too early.
Is there a way to fix that, or to get the text including the hidden characters?
I found TextRetrievalMode but I can not get it working with that.
Basically, the answer is "No, you can't do it the way you propose."
Yes, Word does add "hidden characters" to the text flow that cannot be picked up using the object model. Trying to work with character index values is not going to work reliably. The reliable method is Word's built-in Find/Replace with wildcards. If RegEx is absolutely necessary, then some kind of Find/Replace within a character-index range (say, starting 5 characters before and ending 5 characters after the indices computed using RegEx) might be a way to double-check the result and pick up the correct Range.
Possibly, depending on what kind of analysis this is, it might be better to work with the closed file, leveraging the Office Open XML. That will not have the problem of "hidden characters" that Word uses for structural information. On the other hand, there's a lot of formatting information that breaks up text runs that needs to be contended with...

Outputting Programmatically to MSword; sensing end of line

I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need each word to stay on one line instead of wrapping, which is the default behavior. I'm finding this difficult because when I have the Arabic letters of the word isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know it has been wrapped, and can remove the last character or reformat accordingly
I don't know how this behaves today, but I ran into something odd when I wrote code against the MSWord API: you can't actually find that out directly. In MSWord, text in a document is always in paragraphs.
Text you insert into the document does not simply land on a page; the page will contain at least one paragraph holding the text you wrote.
Unfortunately I can't check this again, because I don't have a license for MS Word these days.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume that a non-breaking space is what you need.

counting 1 word, 2 word and 3 word phrases occurrences in a webpage in C#/.NET

I'm going to write a program that takes a URL and counts the occurrences of EVERY single 1-word, 2-word, and 3-word phrases in the webpage (and possibly x-word phrases).
Here's the best algorithm I could come up with:
1) strip html tags
2) make everything lowercase
3) split the text on space and put the words into an array
4) iterate over each word, and for each position i put word[i], the 2-word phrase starting at word[i], and the 3-word phrase starting at word[i] into a hashtable.
Every time you have a collision you increase the count for that word or 2- or 3-word phrase.
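Steps 2 to 4 above can be sketched in C# roughly as follows. This assumes the HTML tags have already been stripped (step 1 is not shown), uses a `Dictionary<string, int>` as the hashtable, and splits on any whitespace rather than only on spaces. The names `PhraseCounter` and `Count` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseCounter
{
    // Counts every phrase of 1..maxWords consecutive words in the text.
    public static Dictionary<string, int> Count(string text, int maxWords = 3)
    {
        // A null separator array makes Split break on any whitespace.
        string[] words = text.ToLowerInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var counts = new Dictionary<string, int>();

        for (int i = 0; i < words.Length; i++)
        {
            string phrase = "";
            // Grow the phrase one word at a time: 1-word, 2-word, 3-word...
            for (int n = 0; n < maxWords && i + n < words.Length; n++)
            {
                phrase = n == 0 ? words[i] : phrase + " " + words[i + n];
                counts.TryGetValue(phrase, out int current); // 0 if unseen
                counts[phrase] = current + 1;
            }
        }
        return counts;
    }
}
```

For the input "the cat and the cat" this yields a count of 2 for "the", "cat" and "the cat", and 1 for each of the other phrases. Runtime is linear in the number of words for a fixed maxWords.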
My questions are:
1) Can anyone provide any more efficient solutions in terms of space and runtime?
2) Are there any easy ways to do #1 in C#?
I can probably use a dom parser and parse out all the inner text maybe.
Depending on your case, you might be oversimplifying the problem, and/or you may end up putting a lot of effort into implementing functionality that already exists in libraries. So this is not a direct answer but a suggestion on what path to take in tackling this problem.
The process you want to implement is called information retrieval. It is a very broad and complex field, but luckily there is a lot of research in this area. Part of it is extracting word n-grams (an n-gram is a sequence of consecutive letters or words).
Let me show you some additional problems you should think of ahead:
is the capitalization of letters in a word important?
is the period the only sign that you want to use to mark the end of a sentence?
do you want to exclude stop words? Stop words are words you don't want to include in a phrase, like 'a', 'the', 'I', 'my' and so on.
do you want to stem words? That is, convert words from their original form to their root form, like plural to singular: basketballs -> basketball
And for extracting pure text from HTML:
extract only text shown on page?
extract hints also? (like those shown when hovering mouse over picture)
Any other non-visible text (meta tag and so on)
There are libraries that perform searching and extracting information from raw material. "Raw material" means that you have to process a document (html, doc, pdf, image, ...) and turn it into text in order for a search engine to index it (extract phrases, for instance). Once a document is indexed it can be searched. One such library for .NET is Lucene.NET. It supports different stemmers, analyzers, and filters.
I am not sure, but I believe there are libraries for extracting text from html as well.
Basically, your approach may work in simpler scenarios where a fairly high error level is acceptable. I recently became interested in information retrieval and found it really complex and interesting. You may benefit from researching this topic, depending on your goals. There is a lot of info here on stackoverflow as well as on the rest of the Internet.
And if you decide to go this way, there is much more info on Lucene (the original Java version; Lucene.NET is its port to .NET) than on Lucene.NET. So if you don't find an answer for Lucene.NET immediately, search the Lucene discussions.
To answer your question #2.
HtmlDocument doc = WebBrowser1.Document;
// InnerText on the body element returns the page text without markup
string text = doc.Body.InnerText;
If you want to make it more efficient, use a suffix trie (you may have to write your own).
A suffix trie basically makes searching through strings depend on the length of the string being searched for instead of the length of the array. It's the sort of thing used in search engines.
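To illustrate the idea, here is a minimal and deliberately naive suffix trie sketch in C#. The class name `SuffixTrie` and its `Contains` method are hypothetical; it stores one node per character and so uses quadratic memory, which is fine for a demonstration but not for indexing whole web pages.

```csharp
using System.Collections.Generic;

public sealed class SuffixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children =
            new Dictionary<char, Node>();
    }

    private readonly Node root = new Node();

    // Inserts every suffix of text. This naive build takes O(n^2) time
    // and space in the length of the text.
    public SuffixTrie(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            Node node = root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out Node next))
                {
                    next = new Node();
                    node.Children[text[i]] = next;
                }
                node = next;
            }
        }
    }

    // Walks one node per character of pattern, so the lookup cost depends
    // on pattern.Length rather than on the length of the indexed text.
    public bool Contains(string pattern)
    {
        Node node = root;
        foreach (char c in pattern)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return true;
    }
}
```

Usage would look like `new SuffixTrie("the cat sat").Contains("cat s")`. Counting occurrences instead of testing membership would need a counter per node, which is left out here.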

Is it possible to loop through a textbox's contents? If not, what's the best strategy to read line-by-line?

I am designing a crawler which will get certain content from a webpage (using either string manipulation or regex).
I'm able to get the contents of the webpage as a response stream (using the whole httpwebrequest thing), and then for testing/dev purposes, I write the stream content to a multi-line textbox in my ASP.NET webpage.
Is it possible for me to loop through the content of the textbox (or save the textbox text as a string variable) and then say "if textbox1.Text contains a certain string, increment a count"? The problem with the textbox is that the string loses formatting, so it's in one long line with no line breaks. Can that be changed?
I'd like to do this rather than write the content to a file because writing to a file means I would have to handle all sorts of external issues. Of course, if this is the only way, then so be it. If I do have to write to a file, then what's the best strategy to loop through each and every line (I'm a little overwhelmed and thus confused as there's many logical and language methods to use), looking for a condition? So if I want to look for the string "Hello", in the following text:
My name is xyz
I am xyz years of age
Hello blah blah blah
When I reach hello I want to increment an integer variable.
In my opinion you can split the content of the text in words instead of lines:
public int CountOccurences(string searchString)
{
    int i = 0;
    var words = txtBox.Text.Split(' ');
    foreach (var s in words)
    {
        if (s.Contains(searchString))
            i++;   // count each word that contains the search string
    }
    return i;
}
No need to preserve linebreaks, if I understand your purpose correctly.
Also note that this will not work for multiple word searches.
I do it this way in a project; there may be a better way to do it, but this works :)
string template = txtTemplate.Text;
// Split on whole line endings so that "\r\n" does not produce empty entries
string[] lines = template.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
That is a nice creative way.
However, I am returning a complex HTML document (for testing purposes, I am using Microsoft's homepage so I get all the HTML). Do I not have to specify where I want to break the line?
Given your method, if each line is in a collection (which is a thought I had), then I can loop through each member of the collection and look for the condition I want.
If textbox contents were returned with line breaks representing where word-wrapping occurs, that result would be dependent on style (e.g. font size, width of the textbox, etc.) rather than on what the user actually entered. Depending on what you actually want to do, this is almost certainly NOT what you want.
If the user physically presses the 'carriage return / enter' key, the relevant character(s) will be included in the string.
Why do you need to have a textbox at all? Your real goal is to increment a counter based on the text that the crawler finds. You can accomplish this just by examining the stream itself:
Stream response = webRequest.GetResponse().GetResponseStream();
StreamReader reader = new StreamReader(response);
string line;
// ReadLine returns the next line on each call and null at the end of the stream
while ((line = reader.ReadLine()) != null)
{
    if (line.Contains("hello"))
    {
        // increment your counter
    }
}
Extending this if line contains more than one instance of the string in question is left as an exercise to the reader :).
You can still write the contents to a text box if you want to examine them manually, but attempting to iterate over the lines of the text box is simply obscuring the problem.
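One possible take on that exercise, as a minimal sketch: the method below counts every occurrence of a term, including several on the same line. It takes a `TextReader` so it can be fed either the `StreamReader` from the response or a `StringReader` for testing. The names `OccurrenceCounter` and `Count` are hypothetical, and the case-insensitive comparison is an assumption; overlapping matches are not counted.

```csharp
using System;
using System.IO;

public static class OccurrenceCounter
{
    // Reads the input line by line and counts every occurrence of term.
    public static int Count(TextReader reader, string term)
    {
        if (string.IsNullOrEmpty(term))
            throw new ArgumentException("term must not be empty", nameof(term));

        int total = 0;
        string line;
        // Each ReadLine call consumes one line and moves the reader forward;
        // it returns null once the input is exhausted, which ends the loop.
        while ((line = reader.ReadLine()) != null)
        {
            int index = 0;
            while ((index = line.IndexOf(term, index,
                        StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                total++;
                index += term.Length; // continue searching after this match
            }
        }
        return total;
    }
}
```

Called as `OccurrenceCounter.Count(new StreamReader(response), "hello")`, it would replace the loop above. To keep the matching lines for later processing, the inner loop could add `line` to a list instead of only incrementing the counter.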
The textbox was to show the contents of the html page. This is for my use so if I am running the webpage without any breakpoints, I can see if the stream is visually being returned. Also, it's a client requirement so they can see what is happening at every step. Not really worth the extra lines of code but it's trivial really, and the last of my concerns.
The code in the while loop I don't understand. Where is the instruction to go to the next line? This is my weakness with the readline method, as I seldom see the logic that forces the next line to be read.
I do need to store the line as a string variable whenever a certain string is found, as I will need to do some operations on it (get a certain part of the string), so I've always been looking at ReadLine.

