Html page parsing using Html Agility Pack

I'm trying to parse an IMDb page with Regex (I know HAP is better), but my regex is wrong, so maybe you can advise me on how to use HAP correctly.
This is the part of the page I'm trying to parse. I need to take two numbers from it:
"5 out of 5 people" (I need those two fives as two separate numbers).
<small>5 out of 5 people found the following review useful:</small>
<br>
<a href="/user/ur1174211/">
<h2>Interesting, Particularly in Comparison With "La Sortie des usines Lumière"</h2>
<b>Author:</b>
Snow Leopard
<small>from Ohio</small>
<br>
<small>10 March 2005</small>
And this is my C# code:
Regex reg1 = new Regex("([0-9]+(out of)+[0-9])");
for (int i = 0; i < number; i++)
{
Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
Match m = reg1.Match(header[i].InnerHtml);
if (!m.Success)
{
return;
}
else
{
string str1 = m.Value.Split(' ')[0];
string str2 = m.Value.Split(' ')[3];
if (!Int32.TryParse(str1, out index1))
{
return;
}
if (!Int32.TryParse(str2, out index2))
{
return;
}
Console.WriteLine("index1 = {0}", index1);
Console.WriteLine("index2 = {0}", index2);
}
}
Big thanks to everybody who read this.

Try this. This way you will capture whole numbers, not just single digits.
Regex reg1 = new Regex(@"(\d* (out of) \d*)");
for (int i = 0; i < number; i++)
{
Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
Match m = reg1.Match(header[i].InnerHtml);
if (!m.Success)
{
return;
}
else
{
Regex reg2 = new Regex(@"\d+");
m = reg2.Match(m.Value);
string str1 = m.Value;
string str2 = m.NextMatch().Value;
if (!Int32.TryParse(str1, out index1))
{
return;
}
if (!Int32.TryParse(str2, out index2))
{
return;
}
Console.WriteLine("index1 = {0}", index1);
Console.WriteLine("index2 = {0}", index2);
}
}
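The same extraction can also be done in one pass with capture groups, so no second regex or Split call is needed. The following is only a sketch: the class name ReviewCounts and the method TryParse are made up for illustration, and it assumes the text always has the form "N out of M" as in the sample markup. The input string would typically come from the InnerText of the small node.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReviewCounts
{
    // Returns true and fills both counts when the text contains "N out of M".
    public static bool TryParse(string text, out int useful, out int total)
    {
        useful = 0;
        total = 0;
        // Group 1 captures the first number, group 2 the second one.
        Match m = Regex.Match(text, @"(\d+)\s+out\s+of\s+(\d+)");
        return m.Success
            && int.TryParse(m.Groups[1].Value, out useful)
            && int.TryParse(m.Groups[2].Value, out total);
    }

    static void Main()
    {
        string text = "5 out of 5 people found the following review useful:";
        if (TryParse(text, out int useful, out int total))
        {
            Console.WriteLine("useful = {0}, total = {1}", useful, total);
        }
    }
}
```

Because the digits are captured with \d+, multi-digit counts such as "12 out of 345" are handled the same way.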

If you have the InnerHtml of the small tag, the digits can also be extracted like this:
var title = "5 out of 5 people found the following review useful:";
var titleNumbers = title.ToCharArray().Where(x => Char.IsNumber(x));
EDIT
As @PulseLab suggests, here is an alternative method:
var sd = title.Split(' ').Where((data) =>
{
var datum = 0;
int.TryParse(data, out datum);
return datum > 0;
}).ToArray();

Best way to search a plain text string in an HTML string in c#?

This is the html string :
string htmlString = "<body lang=\"EN-US\" link=\"blue\" vlink=\"#954F72\"><div class=\"WordSection1\"><p class=\"MsoNormal\">Hi, </p><p class=\"MsoNormal\"><o:p> </o:p></p><p class=\"MsoNormal\"><o:p> </o:p></p><p class=\"MsoNormal\">My name is Gaurav Illness.</p><p class=\"MsoNormal\"><span style=\"color:purple !important\">Today <b>MY relation</b>ship breakdown <span style=\"color:red\">happened?<o:p></o:p></span> </span></p><p class=\"MsoNormal\"><span style=\"color:red\"><o:p> </o:p></span></p><p class=\"MsoNormal\"><span style=\"color:red\">I am Gr</span><span style=\"font-size:15.0pt;color:red;background:yellow;mso-highlight:yellow\">iESh and I</span><span style=\"font-size:15.0pt;color:red\"><o:p></o:p></span></p><p class=\"MsoNormal\"><span style=\"font-size:15.0pt;color:#B4C7E7;mso-style-textfill-fill-color:#B4C7E7;mso-style-textfill-fill-alpha:100.0%\">Am drugger.<o:p></o:p></span></p><p class=\"MsoNormal\"><o:p> </o:p></p><p class=\"MsoNormal\" style=\"line-height:16.5pt\"><span style=\"font-size:10.0pt;font-family:"Arial",sans-serif;color:#1F497D\">Thanks<span style=\"text-transform:uppercase\">,<o:p></o:p>"
I am extracting plain text from it using this function:
private static string extractTextFromHtml(string html)
{
// Remove new lines since they are not visible in HTML
html = html.Replace("\n", " ");
// Remove tab spaces
html = html.Replace("\t", " ");
// Remove multiple white spaces from HTML
html = Regex.Replace(html, "\\s+", " ");
// Remove HEAD tag
html = Regex.Replace(html, "<head.*?</head>", ""
, RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Remove any JavaScript
html = Regex.Replace(html, "<script.*?</script>", ""
, RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Replace special characters like &, <, >, " etc.
StringBuilder sbHTML = new StringBuilder(html);
// Note: There are many more special characters, these are just
// most common. You can add new characters in this arrays if needed
string[] OldWords = { "&nbsp;", "&amp;", "&quot;", "&lt;", "&gt;", "&reg;", "&copy;", "&bull;", "&trade;" };
string[] NewWords = { " ", "&", "\"", "<", ">", "®", "©", "•", "™" };
for (int i = 0; i < OldWords.Length; i++)
{
sbHTML.Replace(OldWords[i], NewWords[i]);
}
// Check if there are line breaks (<br>) or paragraph (<p>)
sbHTML.Replace("<br>", "\n<br>");
sbHTML.Replace("<br ", "\n<br ");
sbHTML.Replace("<p ", "\n<p ");
// Finally, remove all HTML tags and return plain text
return System.Text.RegularExpressions.Regex.Replace(
sbHTML.ToString(), "<[^>]*>", "");
}
This function returns :
"Hi,
My name is Gaurav Illness.
Today MY relationship breakdown happened?
I am GriESh and I
Am drugger.
Thanks,"
Now I send this text to an API that detects whether or not there is an emotion in these sentences. The API responds with all the sentences that are emotional. For example, the API says "Today MY relationship breakdown happened?" is emotional. Now I want to mark this sentence in purple in the HTML, for which I have to add a span around the sentence. To do this, I have to find the start and end index of this sentence in the HTML code.
How can I find the start and end index of this sentence in the HTML code?
I have code that gives me the indexes, but I don't think it is the best way to do it. Can anyone suggest a better way?
This is my code example:
public static void findTextInHtml(string htmlCode)
{
string textToBeFind = "I am GriESh and IAm drugger.";
int i = 0;
int j = 0;
int startIndex = 0;
int endIndex = 0;
bool isHtml = false;
bool isbeingMatched = false;
while (i < htmlCode.Length)
{
if (htmlCode[i] == '<')
{
isHtml = true;
i++;
continue;
}
if (htmlCode[i] == '>')
{
isHtml = false;
i++;
continue;
}
if (isHtml)
{
i++;
continue;
}
if (textToBeFind[j] == htmlCode[i])
{
if (!isbeingMatched)
{
startIndex = i;
}
isbeingMatched = true;
j++;
if (j == textToBeFind.Length)
{
endIndex = i;
break;
}
}
else
{
isbeingMatched = false;
j = 0;
}
i++;
}
AddStartSpan(startIndex, htmlCode);
AddEndSpan(endIndex, htmlCode);
}

Install the NuGet package HtmlAgilityPack.
Then it's easy to parse like this:
string htmlString = "<p class=\"MsoNormal\"><span style=\"color:red\"><o:p> </o:p></span></p><p class=\"MsoNormal\"><span style=\"color:red\">I am Gr</span><span style=\"font-size:15.0pt;color:red;background:yellow;mso-highlight:yellow\">iESh and I</span><span style=\"font-size:15.0pt;color:red\"><o:p></o:p></span></p><p class=\"MsoNormal\"><span style=\"font-size:15.0pt;color:#B4C7E7;mso-style-textfill-fill-color:#B4C7E7;mso-style-textfill-fill-alpha:100.0%\">Am drugger.<o:p></o:p></span></p>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var inner = doc.DocumentNode.InnerText.TrimStartString("&nbsp;");
// inner = I am GriESh and IAm drugger.
To remove the &nbsp; at the start of the InnerText, use this extension method:
public static string TrimStartString(this string input, string prefixToRemove,
StringComparison comparisonType = StringComparison.OrdinalIgnoreCase)
{
if (input != null && prefixToRemove != null
&& input.StartsWith(prefixToRemove, comparisonType))
{
return input.Substring(prefixToRemove.Length);
}
else return input;
}
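Getting the plain text is only half of the question; the other half is mapping a match in that text back to positions in the HTML. One way to sketch this without any library is to record, for every visible character, the index it came from in the HTML, then search the plain text and translate the hit back. The names HtmlTextLocator and Find below are hypothetical. The sketch assumes well-formed tags and does not decode entities such as &nbsp;, so it illustrates the index mapping rather than a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class HtmlTextLocator
{
    // Finds needle in the visible text of html and returns the HTML indexes of
    // its first and last characters, or null when it is not found.
    public static (int Start, int End)? Find(string html, string needle)
    {
        var text = new StringBuilder();
        var htmlIndex = new List<int>();   // htmlIndex[k] = position in html of text[k]
        bool insideTag = false;

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c == '<') { insideTag = true; }
            else if (c == '>') { insideTag = false; }
            else if (!insideTag)
            {
                text.Append(c);
                htmlIndex.Add(i);
            }
        }

        int pos = text.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (pos < 0 || needle.Length == 0) return null;
        return (htmlIndex[pos], htmlIndex[pos + needle.Length - 1]);
    }

    static void Main()
    {
        string html = "<p><span>I am Gr</span><span>iESh and I</span></p><p>Am drugger.</p>";
        var hit = Find(html, "I am GriESh and IAm drugger.");
        if (hit.HasValue)
        {
            Console.WriteLine("start = {0}, end = {1}", hit.Value.Start, hit.Value.End);
        }
    }
}
```

Since the search itself is delegated to IndexOf, partial matches that fail midway are retried correctly, which a hand-written character loop has to take care of explicitly.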

best way of splitting numbers from text and keeping text [duplicate]

I have a text file. One of the columns contains a field which contains text along with numbers.
I'm trying to figure out the best way to split the numbers and text.
Below is an example of the typical values in the field.
.2700 Aqr sh./Tgt sh.
USD 2.4700/Tgt sh.
Currently I'm using the Split function (code below), but I feel there is probably a smarter way of doing this.
My assumption is that there will only ever be one number in the text (I'm 99% sure this is the case). However, I have only seen a few examples, so it's possible my code below will not work.
I have read a little about regex, but I'm not sure I tested it properly, as it didn't quite produce the output I wanted. For example:
string input = "USD 2.4700/Tgt sh.";
string[] numbers = Regex.Split(input, @"\D+");
foreach (string value in numbers)
{
if (!string.IsNullOrEmpty(value))
{
int i = int.Parse(value);
Console.WriteLine("Number: {0}", i);
}
}
But the output is,
2
47
Whereas I was expecting 2.47 and I also don't want to lose the text. My desired result is
myText = "USD Tgt sh."
myNum = 2.47
For the other example
myText = "Aqr sh./Tgt sh."
myNum = 0.27
My Code
string[] sData = sTerms.Split(' ');
double num;
bool isNum = double.TryParse(sData[0], out num);
if(isNum)
{
ma.StockTermsNum = num;
StringBuilder sb = new StringBuilder();
for (int i = 1; i < sData.Length; i++)
sb = sb.Append(sData[i] + " ");
ma.StockTerms = sb.ToString();
}
else
{
string[] sNSplit = sData[1].Split('/');
ma.StockTermsNum = Convert.ToDouble(sNSplit[0]);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < sData.Length; i++)
{
if (i == 1)
sb = sb.Append(sNSplit[i] + " ");
else
sb = sb.Append(sData[i] + " ");
}
ma.StockTerms = sb.ToString();
}

I suggest splitting by a group, (...), in order to preserve the delimiter:
string source = @".2700 Aqr sh./Tgt sh.";
//string source = "USD 2.4700/Tgt sh.";
// please, notice "(...)" in the pattern - group
string[] parts = Regex.Split(source, @"([0-9]*\.?[0-9]+)");
// combining all texts
string myText = string.Concat(parts.Where((v, i) => i % 2 == 0));
// combining all numbers
string myNumber = string.Concat(parts.Where((v, i) => i % 2 != 0));
Tests:
string[] tests = new string[] {
@".2700 Aqr sh./Tgt sh.",
@"USD 2.4700/Tgt sh.",
};
var result = tests
.Select(test => new {
text = test,
parts = Regex.Split(test, @"([0-9]*\.?[0-9]+)"),
})
.Select(item => new {
text = item.text,
myText = string.Concat(item.parts.Where((v, i) => i % 2 == 0)),
myNumber = string.Concat(item.parts.Where((v, i) => i % 2 != 0)),
})
.Select(item => $"{item.text,-25} : {item.myNumber,-15} : {item.myText}");
Console.WriteLine(string.Join(Environment.NewLine, result));
Outcome:
.2700 Aqr sh./Tgt sh.     : .2700           :  Aqr sh./Tgt sh.
USD 2.4700/Tgt sh.        : 2.4700          : USD /Tgt sh.

It could be something like this regex:
string input = "USD 2.4700/Tgt sh.";
var numbers = Regex.Matches(input, @"[\d]+\.?[\d]*");
foreach (Match res in numbers)
{
if (!string.IsNullOrEmpty(res.Value))
{
decimal i = decimal.Parse(res.Value);
Console.WriteLine("Number: {0}", i);
}
}

I would suggest using System.Text.RegularExpressions.Regex. Here is an example of how you can achieve it:
static void Main(string[] args)
{
string a1 = ".2700 Aqr sh./Tgt sh.";
string a2 = "USD 2.4700/Tgt sh.";
var firstStringNums = GetNumbersFromString(ref a1);
Console.Write("My Text: {0}",a1);
Console.Write("myNums: ");
foreach(double a in firstStringNums)
{
Console.Write(a +"\t");
}
var secondStringNums = GetNumbersFromString(ref a2);
Console.Write("My Text: {0}", a2);
Console.Write("myNums: ");
foreach (double a in secondStringNums)
{
Console.Write(a + "\t");
}
}
public static List<double> GetNumbersFromString(ref string input)
{
List<double> result = new List<double>();
Regex r = new Regex("[0-9.,]+");
var numsFromString = r.Matches(input);
foreach(Match a in numsFromString)
{
if(double.TryParse(a.Value,out double val))
{
result.Add(val);
input =input.Replace(a.Value, "");
}
}
return result;
}
The pattern is just an example and of course will not cover every case you can imagine.
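Putting the pieces from the answers together, a minimal sketch for the single-number case could look like the following. It assumes, as the question does, that each field holds exactly one number written with a dot as the decimal separator; TermsSplitter and TrySplit are illustrative names, not part of any library. Parsing with the invariant culture keeps the result independent of the machine's regional settings. Note that the slash is kept in the remaining text, which differs slightly from the desired "USD Tgt sh." in the question.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class TermsSplitter
{
    // Splits the first number out of the input and returns the remaining text.
    // Returns false when the input contains no parsable number.
    public static bool TrySplit(string input, out decimal number, out string text)
    {
        number = 0m;
        text = input;
        // Optional leading digits, optional dot, then at least one digit.
        Match m = Regex.Match(input, @"[0-9]*\.?[0-9]+");
        if (!m.Success) return false;
        if (!decimal.TryParse(m.Value, NumberStyles.AllowDecimalPoint,
                              CultureInfo.InvariantCulture, out number))
        {
            return false;
        }
        // Cut the matched number out of the original string.
        text = input.Remove(m.Index, m.Length).Trim();
        return true;
    }

    static void Main()
    {
        foreach (string s in new[] { ".2700 Aqr sh./Tgt sh.", "USD 2.4700/Tgt sh." })
        {
            if (TrySplit(s, out decimal number, out string text))
            {
                Console.WriteLine("myText = \"{0}\", myNum = {1}", text, number);
            }
        }
    }
}
```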

Find phone number from given string c#

I have a résumé and I want to find the user's contact number (mobile or phone number)
in it. I need any idea, solution or assistance for achieving this goal.
What I have tried so far:
var numString = "";
string strData = ",38,,,,,,,,,,,,,,,,,,,,,,,,,,,,382350,,,,,0,,,,8141884584,,,,,,,,";
char[] separator = new char[] { ',' };
string[] strSplitArr = strData.Split(separator);
for (int q = 0; q < strSplitArr.Length; q++)
{
if (strSplitArr[q] != "")
{
int no = 0;
no = strSplitArr[q].Length;
if (no >= 10 && no <= 12)
{
numString += strSplitArr[q].ToString() + ", ";
}
}
}

I would suggest that you use a regular expression.
Here is some sample code to find US phone numbers:
string text = MyInputMethod();
const string MatchPhonePattern =
@"\(?\d{3}\)?-? *\d{3}-? *-?\d{4}";
Regex rx = new Regex(MatchPhonePattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
int noOfMatches = matches.Count;
//Do something with the matches
foreach (Match match in matches)
{
//Do something with the matches
string tempPhoneNumber = match.Value;
}
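For input like the comma-separated string in the question, where the goal is simply "a run of 10 to 12 digits", the length check from the question can be written as a pattern too. This is a sketch under that assumption only: it does not validate real phone number formats, and PhoneCandidates and Find are hypothetical names. The lookarounds make sure a longer digit run is not cut into a 10 to 12 digit piece.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PhoneCandidates
{
    // Returns every standalone run of 10 to 12 digits found in the text.
    public static List<string> Find(string text)
    {
        var found = new List<string>();
        // (?<!\d) and (?!\d) require that no digit touches the run on either side.
        foreach (Match m in Regex.Matches(text, @"(?<!\d)\d{10,12}(?!\d)"))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        string strData = ",38,,,,382350,,,,,0,,,,8141884584,,,,";
        Console.WriteLine(string.Join(", ", Find(strData)));
    }
}
```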

Capturing a number within curly brackets inside a string c#

I have posted this again as my previous post was admittedly rather ambiguous, sorry!
I have a string and I want to capture the number inside it and then add one to it.
For example, I have an email subject header saying "Re: Hello (1)".
I want to capture that 1 and raise it to 2, then 3, then 4, etc. The difficulty I am having is accounting for the growing number: once it becomes, say, 10 or 100, the extra digit breaks my current regex.
Any help would be appreciated, as always!
int replyno;
string Subject = "Re: Hey :) (1)";
if (Subject.Contains("Re:"))
{
try
{
replyno = int.Parse(Regex.Match(Subject, @"\(\d+\)").Value);
replyno++;
Subject = Subject.Remove(Subject.Length - 3);
TextBoxSubject.Text = Subject + "("+replyno+")";
}
catch
{
TextBoxSubject.Text = Subject + " (1)";
}
}
else
{
TextBoxSubject.Text = "Re: " + Subject;
}
Currently this code fails at the int.Parse call.

Try substituting this code:
var m = Regex.Match(Subject, @"\((\d+)\)");
replyno = int.Parse(m.Groups[1].Value);
The changes are:
capture just the digits in the regex
parse just the captured digits
I'd also recommend that you check m.Success instead of just catching the resulting exception.
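Capturing and rewriting can also be combined in a single Regex.Replace call with a match evaluator, which avoids the fixed-length Remove that breaks once the number has more than one digit. The sketch below assumes the counter always sits at the very end of the subject; ReplyCounter and Increment are names invented for the example. Very large counters would overflow int.Parse, which is ignored here.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplyCounter
{
    // Matches a parenthesized number at the end of the string, e.g. "(12)".
    static readonly Regex TrailingCount = new Regex(@"\((\d+)\)\s*$");

    // Increments the trailing "(n)" or appends " (1)" when there is none.
    public static string Increment(string subject)
    {
        if (!TrailingCount.IsMatch(subject))
        {
            return subject + " (1)";
        }
        return TrailingCount.Replace(
            subject,
            m => "(" + (int.Parse(m.Groups[1].Value) + 1) + ")");
    }

    static void Main()
    {
        string subject = "Re: Hey :) (1)";
        for (int i = 0; i < 3; i++)
        {
            subject = Increment(subject);
            Console.WriteLine(subject);
        }
    }
}
```

Anchoring the pattern to the end of the string keeps the ":)" in the sample subject, and any other digits earlier in the text, untouched.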

The problem is with the way you remove and replace the reply no.
Change your code this way
int replyno;
string Subject = "Re: Hey :) (1)";
if (Subject.Contains("Re:"))
{
try
{
replyno = int.Parse(Regex.Match(Subject, @"(\d+)").Value);
replyno++;
Subject = Regex.Replace(Subject, @"(\d+)", replyno.ToString());
TextBoxSubject.Text = Subject ;
}
catch
{
TextBoxSubject.Text = Subject + " (1)";
}
}
else
{
TextBoxSubject.Text = "Re: " + Subject;
}

I don't normally deal with regex, so here's how I'd do it.
string subject = "Hello (1)";
string newSubject = string.Empty;
for (int j = 0; j < subject.Length; j++)
if (char.IsNumber(subject[j]))
newSubject += subject[j];
int number = 0;
int.TryParse(newSubject, out number);
subject = subject.Replace(number.ToString(), (++number).ToString());

You don't necessarily need regex for this, but you can adjust yours to \((?<number>\d+)\)$ to fix the problem.
For a regex solution, you can access the match using a group:
for (int i = 0; i < 10; i++)
{
int currentLevel = 0;
var regex = new System.Text.RegularExpressions.Regex(@"\((?<number>\d+)\)$");
var m = regex.Match(inputText);
string strLeft = inputText + " (", strRight = ")";
if (m.Success)
{
var levelText = m.Groups["number"];
if (int.TryParse(levelText.Value, out currentLevel))
{
var numCap = levelText.Captures[0];
strLeft = inputText.Substring(0, numCap.Index);
strRight = inputText.Substring(numCap.Index + numCap.Length);
}
}
inputText = strLeft + (++currentLevel).ToString() + strRight;
output.AppendLine(inputText);
}
Instead, consider just using IndexOf and Substring:
// Example
var inputText = "Subject Line";
for (int i = 0; i < 10; i++)
{
int currentLevel = 0;
int trimStart = inputText.Length;
// find the current level from the string
{
int parenStart = 0;
if (inputText.EndsWith(")")
&& (parenStart = inputText.LastIndexOf('(')) > 0)
{
int numStrLen = inputText.Length - parenStart - 2;
if (numStrLen > 0)
{
var numberText = inputText.Substring(parenStart + 1, numStrLen);
if (int.TryParse(numberText, out currentLevel))
{
// we found a number, remove it
trimStart = parenStart;
}
}
}
}
// add new number
{
// remove existing
inputText = inputText.Substring(0, trimStart);
// increment and add new
inputText = string.Format("{0} ({1})", inputText, ++currentLevel);
}
Console.WriteLine(inputText);
}
Produces
Subject Line (1)
Subject Line (2)
Subject Line (3)
Subject Line (4)
Subject Line (5)
Subject Line (6)
Subject Line (7)
Subject Line (8)
Subject Line (9)
Subject Line (10)

loop through string to find substring

I have this string:
text = "book//title//page/section/para";
I want to go through it to find all // and / and their index.
I tried doing this with:
if (text.Contains("//"))
{
Console.WriteLine(" // index: {0} ", text.IndexOf("//"));
}
if (text.Contains("/"))
{
Console.WriteLine("/ index: {0} :", text.IndexOf("/"));
}
I was also thinking about using:
foreach (char c in text)
but it will not work since // is not a single char.
How can I achieve what I want?
I also tried this, but it did not display any result:
string input = "book//title//page/section/para";
string pattern = @"\/\//";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + input.IndexOf(match.Value));
}
Thank you in advance.

Simple:
var text = "book//title//page/section/para";
foreach (Match m in Regex.Matches(text, "//?"))
Console.WriteLine(string.Format("Found {0} at index {1}.", m.Value, m.Index));
Output:
Found // at index 4.
Found // at index 11.
Found / at index 17.
Found / at index 25.
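If you would rather avoid regex, the same scan can be written with IndexOf in a loop. This is just a sketch, and SlashFinder and Find are made-up names: it looks for the next slash, checks whether a second slash follows, and then skips past whatever it found so a double slash is not reported again as a single one.

```csharp
using System;
using System.Collections.Generic;

static class SlashFinder
{
    // Returns (index, token) pairs, where token is "//" or "/", from left to right.
    public static List<(int Index, string Token)> Find(string text)
    {
        var hits = new List<(int Index, string Token)>();
        int i = 0;
        while ((i = text.IndexOf('/', i)) >= 0)
        {
            bool isDouble = i + 1 < text.Length && text[i + 1] == '/';
            hits.Add((i, isDouble ? "//" : "/"));
            // Skip the whole token so "//" is counted once.
            i += isDouble ? 2 : 1;
        }
        return hits;
    }

    static void Main()
    {
        foreach (var hit in Find("book//title//page/section/para"))
        {
            Console.WriteLine("Found {0} at index {1}.", hit.Token, hit.Index);
        }
    }
}
```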

Would it be possible using Split?
So:
string[] words = text.Split('/');
And then go through the words? You would have blanks, due to the //, but that might be possible?

If what you want is a list "book","title","page","section","para"
you can use split.
string text = "book//title//page/section/para";
string[] delimiters = { "//", "/" };
string[] result = text.Split(delimiters,StringSplitOptions.RemoveEmptyEntries);
System.Diagnostics.Debug.WriteLine(string.Join(", ", result));
Assert.IsTrue(result[0] == "book");
Assert.IsTrue(result[1] == "title");
Assert.IsTrue(result[2] == "page");
Assert.IsTrue(result[3] == "section");
Assert.IsTrue(result[4] == "para");

Something like:
bool lastCharASlash = false;
foreach(char c in text)
{
if(c == '/')
{
if(lastCharASlash)
{
// my code...
}
lastCharASlash = true;
}
else lastCharASlash = false;
}
You can also do text.Split(new[] { "//" }, StringSplitOptions.None)

You could replace // and / with your own placeholder words and then find the index of each placeholder:
string s = "book//title//page/section/para";
s = s.Replace("//", "DOUBLE");
s = s.Replace("/", "SINGLE");
IList<int> doubleIndex = new List<int>();
while (s.Contains("DOUBLE"))
{
int index = s.IndexOf("DOUBLE");
s = s.Remove(index, 6);
s = s.Insert(index, "//");
doubleIndex.Add(index);
}
IList<int> singleIndex = new List<int>();
while (s.Contains("SINGLE"))
{
int index = s.IndexOf("SINGLE");
s = s.Remove(index, 6);
s = s.Insert(index, "/");
singleIndex.Add(index);
}
Remember to first replace double, otherwise you'll get SINGLESINGLE for // instead of DOUBLE. Hope this helps.
