Getting Certain Text From URL

Getting Certain Text From URL - c#

I've been trying to do a couple things to this URL: "https://www.fiverr.com/categories/writing-translation/SEO-keyword-optimization-services"
First I need to parse out: writing-translation (Subject to change depending on the category). Then taking the '-' out of it so you would end up with: writing translation.
I've been trying myself with Regex, I am God AWFUL with it though, believe me I have trying though. If someone could give me an answer, and explain the Regex to me that they use, it would be awesome. Thank you so much.
i.e - my awful attempt (Just for the sake of it)
string MainCategory_link = firefoxDriver.FindElementByXPath("//a[#class='gig- sub-cat js-gtm-event-auto']").GetAttribute("href");
var Reg = new Regex("\".*?\"");
var matches = Reg.Matches(MainCategory_link);
foreach (var item in matches)
{
MessageBox.Show(item.ToString());
}
Updated code with segments attempt
string MainCategory_link = firefoxDriver.FindElementByXPath("//a[#class='gig-sub-cat js-gtm-event-auto']").GetAttribute("href");
var uri = new Uri(MainCategory_link);
foreach (var segment in uri.Segments)
{
MessageBox.Show(segment[1].ToString());
}

There is a Uri class that allows you to access different parts of the Uri via segments.
var uri = new Uri("https://www.fiverr.com/categories/writing-translation/SEO-keyword-optimization-services");
foreach(var segment in uri.Segments)
{
MessageBox.Show(segment);
}
/* Output
categories
writing-translation
SEO-keyword-optimization-services
*/
Therefore, to retrieve writing-translation you'd do:
var uri = new Uri("https://www.fiverr.com/categories/writing-translation/SEO-keyword-optimization-services");
MessageBox.Show(uri[1]);
And of course, you should perform bounds checks anytime you're accessing something via index to make sure it exists and not get an OutOfBoundsException.
Never ever use Regex unless you are absolutely positive a better option doesn't already exist. Regex should always be a last resort. In fact, it's probably better if you don't know Regex at all, because you'll just keep trying to use it at all the wrong times.

Related

C# How to parse through an inconsistently formatted text file, ignoring unneeded information

A little background. I am new to using C# in a professional setting. My experience is mainly in SQL. I have a file that I need to parse through to pull out certain pieces of information. I can figure out how to parse through each line, but have gotten stuck on searching for specific pieces of information. I am not interested in someone finishing this code for me. Instead, I am interested in pointers on where I can go from here.
Here is an example of the code I have written.
class Program
{
private static Dictionary<string, List<string>> _arrayLists = new Dictionary<string, List<string>>();
static void Main(string[] args)
{
string filePath = "c:\\test.txt";
StreamReader reader = new StreamReader(filePath);
string line;
while (null !=(line = reader.ReadLine()))
{
if (line.ToLower().Contains("disconnected"))
{
// needs to continue on search for Disconnected or Subscribed
}
else
{
if (line.ToLower().Contains("subscribed"))
{
// program needs to continue reading file
// looking for and assigning values to
// dvd, cls, jhd, dxv, hft
// records start at Subscribed and end at ;
}
}
}
}
}
A little bit of explanation of the file. I basically need to pull data existing between the word Subscribed and the first ; i come to. Specifically I need to take the values such as dvd = 234 and assign them to their same variables in the code. Not every record will have the same variables.
Here is an example of the text file that I need to parse through.
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 453,
jhd = 567,
more annoying info
more annoying info
dxv = 456,
hft = 876;
more annoying info
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 455,
more annoying info
more annoying info
dxv = 456,
hft = 876,
jjd = 768;
more annoying info
test information
annoying information
Disconnected more annoying info
more annoying info
more annoying info
Edit
My apologies on the vague question. I have to learn how to ask better questions.
My thought process was to make sure the program associated all the details between subscribed and the ; as one record. I think the part that I am confused on is in reading the lines. In my head I see the loop reading the line Subscribed, and then going into a method and reading the next line and assigning the value, and so on until it hits the ;. Once that was done I am trying to figure out how to tell the program to exit that method, but to continue reading from the line right after the semi-colon. Perhaps I am over thinking this.
I will take the advice I have been give and see what I can come up with to solve this. Thank you.

From you question as it is now it is not clear what specific problem you are struggling with. I'd suggest you edit your question providing specific challenges you'd like to overcome. currently you problem statement is "have gotten stuck on searching for specific pieces of information". This is as unspecific as it can get.
Having said that I'll try to help you.
First, you will never get into an if like that:
line.ToLower().Contains("Disconnected")
Here you convert all the characters to lower case, and then you are trying to find a substring with capital "D" in it. The expression above will (almost) always evaluate to false.
Secondly, in order for your application to do what you want to do it needs to track the current parsing state. I'm going to ignore the "Disconnected" bit now, as you have not shown what significance it has.
I'll be assuming that you are trying to find everything between Subscribed and first semicolon in the file. I'll also make a couple of other assumption regarding to what can constitute a string, which I won't list here. These can be wrong, but this is my best guess given the information you've provided.
You program will start in a state "looking for subscription". You already set up the read loop, which is good. In this loop you read lines of the file, and you find one that contains word Subscription.
Once you found such line your parser need to move to "parsing subscription" state. In this state, when you read lines you look for lines like jjd = 768, perhaps with a semicolon in the end. You can check if the line match a pattern by using Regular Expressions.
Regular Expressions also can divide match to capturing groups, so that you can extract the name (jjd) and the value (768) separately. Presences or absence of the semicolon could be another RegEx group.
Note that RegEx is not the only way to handle this, but this is the first that comes to mind.
You then keeping matching the lines to your regex and extracting names and values until you come across the semicolon, at which point you switch back to "looking for subscription" state.
You use the current state, to decide how to process the next read line.
You continue until the end of the file.
Generally you want to read up on parsing.
Hope this helps.

As with all code solutions to problems there are many possible ways to achieve what you are looking for. Some will work better then others. Below is one way that could help point you in the right direction.
You can check if the string starts with a keyword or value such as "dvd" (see MSDN String.StartsWith).
If it does then you can split the string into an array of parts (see MSDN String.Split).
You can then get the values of each part from the string array using the index of the value you want.
Do what you need to with the value retrieved.
Continue checking each line for your key business rules (ie. The semicolon that will end the section). Maybe you could check the last character of the string. (see String.EndsWith)

When processing text files containing semi-structured data, state variables can simplify the algorithm. In the code below, a boolean state variable isInRecord is used to track when a line is in a record.
using System;
using System.Collections.Generic;
using System.IO;
namespace ConsoleApplication19
{
public class Program
{
private readonly static String _testData = #"
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 453,
jhd = 567,
more annoying info
more annoying info
dxv = 456,
hft = 876;
more annoying info
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 455,
more annoying info
more annoying info
dxv = 456,
hft = 876,
jjd = 768;
more annoying info
test information
annoying information
Disconnected more annoying info
more annoying info
more annoying info";
public static void Main(String[] args)
{
/* Create a temporary file containing the test data. */
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData), Path.GetRandomFileName());
File.WriteAllText(testFile, _testData);
try
{
var p = new Program();
var records = p.GetRecords(testFile);
foreach (var kvp in records)
{
Console.WriteLine("Record #" + kvp.Key);
foreach (var entry in kvp.Value)
{
Console.WriteLine(" " + entry);
}
}
}
finally
{
File.Delete(testFile);
}
}
private Dictionary<String, List<String>> GetRecords(String path)
{
var results = new Dictionary<String, List<String>>();
var recordNumber = 0;
var isInRecord = false;
using (var reader = new StreamReader(path))
{
String line;
while ((line = reader.ReadLine()) != null)
{
line = line.Trim();
if (line.StartsWith("Disconnected"))
{
// needs to continue on search for Disconnected or Subscribed
isInRecord = false;
}
else if (line.StartsWith("Subscribed"))
{
// program needs to continue reading file
// looking for and assigning values to
// dvd, cls, jhd, dxv, hft
// records start at Subscribed and end at ;
isInRecord = true;
recordNumber++;
}
else if (isInRecord)
{
// Check if the line has a general format of "something = something".
var parts = line.Split("=".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
if (parts.Length != 2)
continue;
// Update the relevant dictionary key, or add a new key.
List<String> entries;
if (results.TryGetValue(recordNumber.ToString(), out entries))
entries.Add(line);
else
results.Add(recordNumber.ToString(), new List<String>() { line });
// Determine if the isInRecord state variable should be toggled.
var lastCharacter = line[line.Length - 1];
if (lastCharacter == ';')
isInRecord = false;
}
}
}
return results;
}
}
}

How can I match and return multiple instances of a string, where single apostrophes could be contained at any index?

Please note, the 'C#' tag was included intentionally, because I could accept C# syntax for my answer here, as I have the option of doing this both client-side and server-side. Read the 'Things You May Want To Know' section below. Also, the 'regex' tag was included because there is a strong possibility that the use of regular expressions is the best approach to this problem.
I have the following highlight Plug-In found here:
http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html
And here is the code in that plug-in:
/*
highlight v4
Highlights arbitrary terms.
<http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html>
MIT license.
Johann Burkard
<http://johannburkard.de>
<mailto:jb#eaio.com>
*/
jQuery.fn.highlight = function(pat) {
function innerHighlight(node, pat) {
var skip = 0;
if (node.nodeType == 3) {
var pos = node.data.toUpperCase().indexOf(pat);
if (pos >= 0) {
var spannode = document.createElement('span');
spannode.className = 'highlight';
var middlebit = node.splitText(pos);
var endbit = middlebit.splitText(pat.length);
var middleclone = middlebit.cloneNode(true);
spannode.appendChild(middleclone);
middlebit.parentNode.replaceChild(spannode, middlebit);
skip = 1;
}
}
else if (node.nodeType == 1 && node.childNodes && !/(script|style)/i.test(node.tagName)) {
for (var i = 0; i < node.childNodes.length; ++i) {
i += innerHighlight(node.childNodes[i], pat);
}
}
return skip;
}
return this.length && pat && pat.length ? this.each(function() {
innerHighlight(this, pat.toUpperCase());
}) : this;
};
jQuery.fn.removeHighlight = function() {
return this.find("span.highlight").each(function() {
this.parentNode.firstChild.nodeName;
with (this.parentNode) {
replaceChild(this.firstChild, this);
normalize();
}
}).end();
};
This plug-in works pretty easily.
If I wanted to highlight all instances of the word "Farm" within the following element...(cont.)
<div id="#myDiv">Farmers farm at Farmer's Market</div>
...(cont.) all I would need to do is use:
$("#myDiv").highlight("farm");
And then it would highlight the first four characters in "Farmers" and "Farmer's", as well as the entire word "farm" within the div#myDiv
No problem there, but I would like it to use this:
$("#myDiv").highlight("Farmers");
And have it highlight both "Farmers" AND "Farmer's". The problem is, of course, that I don't know the value of the search term (The term "Farmers" in this example) at runtime. So I would need to detect all possibilities of no more than one apostrophe at each index of the string. For instance, if I called $("#myDiv").highlight("Farmers"); like in my code example above, I would also need to highlight each instance of the original string, plus:
'Farmers
F'armers
Fa'rmers
Far'mers
Farm'ers
Farme'rs
Farmer's
Farmers'
Instances where two or more apostrophes are found sid-by-side, like "Fa''rmers" should, of course, not be highlighted.
I suppose it would be nice if I could include (to be highlighted) words like "Fa'rmer's", but I won't push my luck, and I would be doing well just to get matches like those found in my bulleted list above, where only one apostrophe appears in the string, at all.
I thought about regex, but I don't know the syntax that well, not to mention that I don't think I could do anything with a true/false return value.
Is there anyway to accomplish what I need here?
Things You May Want To Know:
The highlight plug-in takes care of all the case insensitive requirements I need, so no need to worry about that, at all.
Syntax provided in JavaScript, jQuery, or even C# is acceptable, considering the hidden input fields I use the values from, client-side, are populated, server-side, with my C# code.
The C# code that populates the hidden input fields uses Razor (i.e., I am in a C#.Net Web-Pages w/ WebMatrix environment. This code is very simple, however, and looks like this:
for (var n = 0; n < searchTermsArray.Length; n++)
{
<input class="highlightTerm" type="hidden" value="#searchTermsArray[n]" />
}

I'm copying this answer from your earlier question.
I think after reading the comments on the other answers, I've figured out what it is you're going for. You don't need a single regex that can do this for any possible input, you already have input, and you need to build a regex that matches it and its variations. What you need to do is this. To be clear, since you misinterpreted in your question, the following syntax is actually in JavaScript.
var re = new RegExp("'?" + "farmers".split("").join("'?") + "'?", "i")
What this does is take your input string, "farmers" and split it into a list of the individual characters.
"farmers".split("") == [ 'f', 'a', 'r', 'm', 'e', 'r', 's' ]
It then stitches the characters back together again with "'?" between them. In a regular expression, this means that the ' character will be optional. I add the same particle to the beginning and end of the expression to match at the beginning and end of the string as well.
This will create a regex that matches in the way you're describing, provided it's OK that it also matches the original string.
In this case, the above line builds this regex:
/'?f'?a'?r'?m'?e'?r'?s'?/
EDIT
After looking at this a bit, and the function you're using, I think your best bet will be to modify the highlight function to use a regex instead of a straight string replacement. I don't think it'll even be that hard to deal with. Here's a completely untested stab at it.
function innerHighlight(node, pat) {
var skip = 0;
if (node.nodeType == 3) {
var matchResult = pat.exec(node.data); // exec the regex instead of toUpperCase-ing the string
var pos = matchResult !== null ? matchResult.index : -1; // index is the location of where the matching text is found
if (pos >= 0) {
var spannode = document.createElement('span');
spannode.className = 'highlight';
var middlebit = node.splitText(pos);
var endbit = middlebit.splitText(matchResult[0].length); // matchResult[0] is the last matching characters.
var middleclone = middlebit.cloneNode(true);
spannode.appendChild(middleclone);
middlebit.parentNode.replaceChild(spannode, middlebit);
skip = 1;
}
}
else if (node.nodeType == 1 && node.childNodes && !/(script|style)/i.test(node.tagName)) {
for (var i = 0; i < node.childNodes.length; ++i) {
i += innerHighlight(node.childNodes[i], pat);
}
}
return skip;
}
What I'm attempting to do here is keep the existing logic, but use the Regex that I built to do the finding and splitting of the string. Note that I'm not doing the toUpper call anymore, but that I've made the regex case insensitive instead. As noted, I didn't test this at all, but it seems like it should be pretty close to a working solution. Enough to get you started anyway.
Note that this won't get you your hidden fields. I'm not sure what you need those for, but this will (if it's right) take care of highlighting the string.

Check if Characters in ArrayList C# exist - C# (2.0)

I was wondering if there is a way in an ArrayList that I can search to see if the record contains a certain characters, If so then grab the whole entire sentence and put in into a string. For Example:
list[0] = "C:\Test3\One_Title_Here.pdf";
list[1] = "D:\Two_Here.pdf";
list[2] = "C:\Test\Hmmm_Joke.pdf";
list[3] = "C:\Test2\Testing.pdf";
Looking for: "Hmmm_Joke.pdf"
Want to get: "C:\Test\Hmmm_Joke.pdf" and put it in the Remove()
protected void RemoveOther(ArrayList list, string Field)
{
string removeStr;
-- Put code in here to search for part of a string which is Field --
-- Grab that string here and put it into a new variable --
list.Contains();
list.Remove(removeStr);
}
Hope this makes sense. Thanks.

Loop through each string in the array list and if the string does not contain the search term then add it to new list, like this:
string searchString = "Hmmm_Joke.pdf";
ArrayList newList = new ArrayList();
foreach(string item in list)
{
if(!item.ToLower().Contains(searchString.ToLower()))
{
newList.Add(item);
}
}
Now you can work with the new list that has excluded any matches of the search string value.
Note: Made string be lowercase for comparison to avoid casing issues.

In order to remove a value from your ArrayList you'll need to loop through the values and check each one to see if it contains the desired value. Keep track of that index, or indexes if there are many.
Then after you have found all of the values you wish to remove, you can call ArrayList.RemoveAt to remove the values you want. If you are removing multiple values, start with the largest index and then process the smaller indexes, otherwise, the indexes will be off if you remove the smallest first.

This will do the job without raising an InvalidOperationException:
string searchString = "Hmmm_Joke.pdf";
foreach (string item in list.ToArray())
{
if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
{
list.Remove(item);
}
}
I also made it case insensitive.
Good luck with your task.

I would rather use LINQ to solve this. Since IEnumerables are immutable, we should first get what we want removed and then, remove it.
var toDelete = Array.FindAll(list.ToArray(), s =>
s.ToString().IndexOf("Hmmm_Joke.pdf", StringComparison.OrdinalIgnoreCase) >= 0
).ToList();
toDelete.ForEach(item => list.Remove(item));
Of course, use a variable where is hardcoded.
I would also recommend read this question: Case insensitive 'Contains(string)'
It discuss the proper way to work with characters, since convert to Upper case/Lower case since it costs a lot of performance and may result in unexpected behaviours when dealing with file names like: 文書.pdf

Read in a file using a regular expression?

This is tangentially related to an earlier question of mine.
Essentially, the solution in that question worked great, but now I need to adapt it to work in a much larger analysis application. Simply using StreamReader.ReadToEnd() is not acceptable, since some of the files I will be reading in are very, very large. If there's been a mistake and someone forgot to clean up, they can theoretically be gigabytes big. Obviously I can't just read to the end of that.
Unfortunately, the normal read lines is also not acceptable, because some of the rows of data I am reading in contain stack traces - they obviously use /r/n in their formatting. Ideally, I would like to tell the program to read forward until it hits a match for a regex, which it then returns. Is there any functionality to do this in .net? If not, can I get some suggestions for how I'd go about writing it?
Edit: To make it a bit easier to follow my question, here's a paste of some of the important parts of the adapted code:
foreach (var fileString in logpath.Select(log => new StreamReader(log)).Select(fileStream => fileStream.ReadToEnd()))
{
const string junkPattern = #"\[(?<junk>[0-9]*)\] \((?<userid>.{0,32})\)";
const string severityPattern = #"INFO|ERROR|FATAL";
const string datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var records = Regex.Split(fileString, datePattern, RegexOptions.Multiline);
foreach (var record in records.Where(x => string.IsNullOrEmpty(x) == false))
......
The problem lies in the Foreach. .Select(fileStream => fileStream.ReadToEnd()) is gonna blow up memory badly, I just know it.

First off all, you should move your const definition to class declaration - the compiler will do that for you, but this should be done by yourself, just for better code readability.
As #Blam mentioned, you should use StringBuilder and StreamReader.ReadLine in pair, something like this:
foreach(var filePath in logpath)
{
var sbRecord = new StringBuilder();
using(var reader = new StreamReader(filePath))
{
do
{
var line = reader.ReadLine();
// check start of the new record lines
if (Regex.Match(line, datePattern) && sbRecord.Length > 0)
{
// your method for log record
HandleRecord(sbRecord.ToString());
sbRecord.Clear();
sbRecord.AppendLine(line);
}
// if no lines were added or datePattern didn't hit
// append info about current record
else
{
sbRecord.AppendLine(line);
}
} while (!reader.EndOfStream)
}
}
If I didn't understand something about your problem, please clarify this in comment.
Also, you can use ThreadPool for schedule the tasks for your lines, just for speed of your application.

Piglatin using Arrays

Last night I was messing around with Piglatin using Arrays and found out I could not reverse the process. How would I shift the phrase and take out the Char's "a" and "y" at the end of the word and return the original word in the phrase.
For instance if I entered "piggy" it would come out as "iggypay" shifting the word piggy so "p" is at the end of the word and "ay" is appended.
Here is the example code so you can try it as well.
public string ay;
public string PigLatin(string phrase)
{
string[] pLatin;
ArrayList pLatinPhrase = new ArrayList();
int wordLength;
pLatin = phrase.Split();
foreach (string pl in pLatin)
{
wordLength = pl.Length;
pLatinPhrase.Add(pl.Substring(1, wordLength - 1) + pl.Substring(0, 1) + "ay");
}
foreach (string p in pLatinPhrase)
{
ay += p;
}
return ay;
}
You will notice that is example is not programmed to find vowels and append them to the end along with "ay". Just simply a basic way of doing it.
If you where wondering how to reverse the above try this example of uPiglatinify
public string way;
public string uPigLatinify(string word)
{
string[] latin;
int wordLength;
// Using arrraylist to store split words.
ArrayList Phrase = new ArrayList();
// Split string phrase into words.
latin = word.Split(' ');
foreach (string i in latin)
{
wordLength = i.Length;
if (wordLength > 0)
{
// Grab 3rd letter from the end of word and append to front
// of word chopping off "ay" as it was not included in the indexing.
Phrase.Add(i.Substring(wordLength - 3, 1) + i.Substring(0, wordLength - 3) + " ");
}
}
foreach (string _word in Phrase)
{
// Add words to string and return.
way += _word;
}
return way;
}

Please don’t take this the wrong way, but although you can probably get people here to give you the C# code to implement the algorithm you want, I suspect this is not enough if you want to learn how it works. To learn the basics of programming, there are some good tutorials to delve into (whether websites or books). In particular, if you aspire to be a programmer, you will need to learn not just how to write code. In your example:
You should first write a specification of what your PigLatin function is supposed to do. Think about all the corner-cases: What if the first letter is a vowel? What if there are several consonants at the beginning? What if there are only consonants? What if the input starts with a number, a parenthesis, or a space? What if the input string is empty? Write down exactly what should happen in all of these cases — even if it’s “throw an exception”.
Only then can you implement the algorithm according to the specification (i.e. write the actual C# code). While doing this, you may find that the specification is incomplete, in which case you need to go back and correct it.
Once your code is finished, you need to test it. Run it on several testcases, especially the corner-cases you came up with above: For example, try PigLatin("air"), PigLatin("x"), PigLatin("1"), PigLatin(""), etc. In each case, make yourself aware first what behaviour you expect, and then see if the behaviour matches your expectation. If it doesn’t, you need to go back and fix the code.
Once you have implemented the forward PigLatin algorithm and it works (read: passes all your testcases), then you will already have the skills needed to write the reverse function youself. I guarantee you that you will feel achieved and excited then! Whereas, if you just copy the code from this website, you are setting yourself up for feeling dumb because you will think other people can do it and you can’t.
Of course, we are nonetheless happy to help you with specific technical questions, for example “What is the difference between ArrayList and List<string>?” or “What does the scope of a local variable mean?” (but search first — these may have already been asked before) — but you probably shouldn’t ask to have the code fully written and finished for you.

The work to split the phrase into words and recombine the words after transforming them is the same as in the original case. The difficulty is in un-pig-latin-ifying an individual word. With some error checking, I imagine you could do this:
string UnPigLatinify(string word)
{
if ((word == null) || !Regex.IsMatch(word, #"^\w+ay$", RegexOptions.IgnoreCase))
return word;
return word[word.Length - 3] + word.Substring(0, word.Length - 3);
}
The regular expression just checks to make sure the word is at least 3 letters long, composed of characters, and ends with "ay".
The actual transform takes the third to last letter (the original first letter) and appends the rest of the word minus the "ay" and the original letter.
Is this what you meant?

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.