HTML Content Parsing

I have two pieces of code for counting the number of characters inside a template. The first one is:
string html = this.GetHTMLContent(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    sb.AppendLine(node.InnerText);
}
string final = sb.ToString();
int length = final.Length;
And the second one is:
var length = doc.DocumentNode.SelectNodes("//text()")
    .Where(x => x.NodeType == HtmlNodeType.Text)
    .Select(x => x.InnerText.Length)
    .Sum();
When I run them, the two snippets return different results.

I finally identified the problem: inside the loop I used the AppendLine() method instead of Append(), so a line break was appended on every iteration, and those extra whitespace characters were counted as well.
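The difference can be reproduced without any HTML at all. The following is a minimal sketch, assuming only StringBuilder from the base class library; the class and method names (AppendDemo, LengthWithAppend, LengthWithAppendLine) are made up for illustration.

```csharp
using System;
using System.Text;

public static class AppendDemo
{
    // Concatenates the parts as-is, so the length is the sum of the part lengths.
    public static int LengthWithAppend(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (var part in parts)
        {
            sb.Append(part);
        }
        return sb.Length;
    }

    // AppendLine adds Environment.NewLine after every part,
    // which inflates the total by parts.Length * Environment.NewLine.Length.
    public static int LengthWithAppendLine(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (var part in parts)
        {
            sb.AppendLine(part);
        }
        return sb.Length;
    }

    public static void Main()
    {
        var parts = new[] { "Hello", " ", "world" };
        Console.WriteLine(LengthWithAppend(parts));
        Console.WriteLine(LengthWithAppendLine(parts));
    }
}
```

The two numbers differ by the number of parts multiplied by the length of the platform's line terminator, which is the discrepancy described above.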

Related

HTMLAgilityPack error: "Multiple node elements can't be created."

I'm attempting to use the HTMLAgilityPack to retrieve and edit the inner text of some HTML. The inner text of each node I retrieve needs to be checked for matching strings, and those matching strings highlighted, like so:
var HtmlDoc = new HtmlDocument();
HtmlDoc.LoadHtml(item.Content);
var nodes = HtmlDoc.DocumentNode.SelectNodes("//div[@class='guide_subtitle_cell']/p");
foreach (HtmlNode htmlNode in nodes)
{
    htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(Methods.HighlightWords(htmlNode.InnerText, searchstring)), htmlNode);
}
This is the code for the HighlightWords method I use:
public static string HighlightWords(string input, string searchstring)
{
    if (input == null || searchstring == null)
    {
        return input;
    }
    var lowerstring = searchstring.ToLower();
    var words = lowerstring.Split(' ').ToList();
    for (var i = 0; i < words.Count; i++)
    {
        Match m = Regex.Match(input, words[i], RegexOptions.IgnoreCase);
        if (m.Success)
        {
            string ReplaceWord = string.Format("<span class='search_highlight'>{0}</span>", m.Value);
            input = Regex.Replace(input, words[i], ReplaceWord, RegexOptions.IgnoreCase);
        }
    }
    return input;
}
Can anyone suggest how to get this working or indicate what I'm doing wrong?

The problem is that HtmlTextNode.CreateNode can only create one node. When you add a <span> inside, that's another node, and CreateNode throws the exception you see.
Make sure that you only do the search and replace on the lowest leaf nodes (nodes with no children). Then rebuild that node as follows:
1. Create a new empty node to replace the old one.
2. Search for the text in .InnerText.
3. Use HtmlTextNode.CreateNode to add the plain text that comes before the text you want to highlight.
4. Add your new <span> with the highlighted text using HtmlNode.CreateNode.
5. Search for the next occurrence (going back to step 2) until no more occurrences are found.
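The search part of these steps can be sketched without HtmlAgilityPack. The example below is only an illustration: HighlightSegments and Split are hypothetical names, and the code only cuts the text into plain and matching pieces. Each plain piece would then become a text node and each matching piece a <span> node, as described in the steps above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HighlightSegments
{
    // Hypothetical helper (not part of HtmlAgilityPack): cuts the input into
    // consecutive pieces and flags the ones that match the search word.
    public static List<(string Text, bool IsMatch)> Split(string input, string word)
    {
        var segments = new List<(string Text, bool IsMatch)>();
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(word))
        {
            if (!string.IsNullOrEmpty(input)) segments.Add((input, false));
            return segments;
        }

        int position = 0;
        // Regex.Escape keeps characters such as '.' or '(' in the word literal.
        foreach (Match m in Regex.Matches(input, Regex.Escape(word), RegexOptions.IgnoreCase))
        {
            if (m.Index > position)
            {
                // Plain text before this occurrence
                segments.Add((input.Substring(position, m.Index - position), false));
            }
            segments.Add((m.Value, true)); // the occurrence itself
            position = m.Index + m.Length;
        }
        if (position < input.Length)
        {
            // Plain text after the last occurrence
            segments.Add((input.Substring(position), false));
        }
        return segments;
    }

    public static void Main()
    {
        foreach (var (text, isMatch) in Split("Data and more data here", "data"))
        {
            Console.WriteLine((isMatch ? "[span] " : "[text] ") + text);
        }
    }
}
```

Building one node per segment avoids asking CreateNode to produce several top-level nodes from a single string.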

Your function HighlightWords must be returning multiple top-level HTML nodes. For example:
<p>foo</p>
<span>bar</span>
The HtmlAgilityPack only allows one top-level node to be returned. You can hardcode the return value for HighlightWords to test.
Also, this post has run across the same problem.

Linq query for building a dictionary from a reg file

I'm building a simple dictionary from a reg file (export from Windows Regedit). The .reg file contains a key in square brackets, followed by zero or more lines of text, followed by a blank line. This code will create the dictionary that I need:
var a = File.ReadLines("test.reg");
var dict = new Dictionary<String, List<String>>();
foreach (var key in a) {
    if (key.StartsWith("[HKEY")) {
        var iter = a.GetEnumerator();
        var value = new List<String>();
        do {
            iter.MoveNext();
            value.Add(iter.Current);
        } while (String.IsNullOrWhiteSpace(iter.Current) == false);
        dict.Add(key, value);
    }
}
I feel like there is a cleaner (prettier?) way to do this in a single Linq statement (using a group by), but it's unclear to me how to implement the iteration of the value items into a list. I suspect I could do the same GetEnumerator in a let statement but it seems like there should be a way to implement this without resorting to an explicit iterator.
Sample data:
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.msu]
@="Microsoft.System.Update.1"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS]
@="WMP11.AssocFile.M2TS"
"Content Type"="video/vnd.dlna.mpeg-tts"
"PerceivedType"="video"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\OpenWithProgIds]
"WMP11.AssocFile.M2TS"=hex(0):
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx]
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx\{BB2E617C-0920-11D1-9A0B-00C04FC2D6C1}]
@="{9DBD2C50-62AD-11D0-B806-00C04FD706EC}"
Update
I'm sorry, I need to be more specific. The files I am looking at are around 300 MB, so I took the approach I did to keep the memory footprint down. I'd prefer an approach that doesn't require pulling the entire file into memory.

You can always use Regex:
var dict = new Dictionary<String, List<String>>();
var a = File.ReadAllText(@"test.reg");
var results = Regex.Matches(a, "(\\[[^\\]]+\\])([^\\[]+)\r\n\r\n", RegexOptions.Singleline);
foreach (Match item in results)
{
    dict.Add(
        item.Groups[1].Value,
        item.Groups[2].Value.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries).ToList()
    );
}
I whipped this out real quick. You might be able to improve the regex pattern.

Instead of using GetEnumerator you can take advantage of the TakeWhile and Skip methods to break your list into smaller lists (each sublist represents one key and its values):
var registryLines = File.ReadLines("test.reg");
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
while (registryLines.Any())
{
    // Take the key and values into a single list
    var keyValues = registryLines.TakeWhile(x => !String.IsNullOrWhiteSpace(x)).ToList();
    // Adds a new entry to the dictionary using the first value as key and the rest of the list as value
    if (keyValues != null && keyValues.Count > 0)
        resultKeys.Add(keyValues[0], keyValues.Skip(1).ToList());
    // Jumps to the next registry (+1 to skip the blank line)
    registryLines = registryLines.Skip(keyValues.Count + 1);
}
EDIT based on your update
Update I'm sorry I need to be more specific. The files am looking at
around ~300MB so I took the approach I did to keep the memory
footprint down. I'd prefer an approach that doesn't require pulling
the entire file into memory.
Well, if you can't read the whole file into memory, it makes no sense to me asking for a LINQ solution. Here is a sample of how you can do it reading line by line (still no need for GetEnumerator)
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
using (StreamReader reader = File.OpenText("test.reg"))
{
    List<string> keyAndValues = new List<string>();
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        // Adds key and values to a list until it finds a blank line
        if (!string.IsNullOrWhiteSpace(line))
            keyAndValues.Add(line);
        else
        {
            // Adds a new entry to the dictionary using the first value as key and the rest of the list as value
            if (keyAndValues != null && keyAndValues.Count > 0)
                resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());
            // Starts a new Key collection
            keyAndValues = new List<string>();
        }
    }
    // Adds the last entry when the file does not end with a blank line
    if (keyAndValues.Count > 0)
        resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());
}
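If a LINQ-style call site is still wanted while reading line by line, one option is to wrap the grouping in an iterator method. This is a sketch under assumptions, not the approach from any of the answers: RegSections and Read are hypothetical names, and a new section is assumed to start at every line beginning with "[HKEY", so it does not depend on blank separator lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RegSections
{
    // Hypothetical helper: streams (key, values) pairs from a sequence of lines.
    // Only the lines of the current section are held in memory at any time.
    public static IEnumerable<KeyValuePair<string, List<string>>> Read(IEnumerable<string> lines)
    {
        string key = null;
        List<string> values = null;
        foreach (var line in lines)
        {
            if (line.StartsWith("[HKEY", StringComparison.Ordinal))
            {
                // A new key starts: emit the previous section, if any
                if (key != null)
                    yield return new KeyValuePair<string, List<string>>(key, values);
                key = line;
                values = new List<string>();
            }
            else if (key != null && !string.IsNullOrWhiteSpace(line))
            {
                values.Add(line);
            }
        }
        // Emit the final section
        if (key != null)
            yield return new KeyValuePair<string, List<string>>(key, values);
    }

    public static void Main()
    {
        // In-memory sample; for a real file, File.ReadLines(path) could be passed instead.
        var sample = new[] { "[HKEY_A]", "@=\"x\"", "", "[HKEY_B]", "\"n\"=\"v\"", "\"m\"=\"w\"" };
        var dict = Read(sample).ToDictionary(p => p.Key, p => p.Value);
        foreach (var pair in dict)
            Console.WriteLine($"{pair.Key}: {pair.Value.Count} value(s)");
    }
}
```

The resulting dictionary still holds all keys and values, so the memory saving is limited to not keeping the raw file text around as well.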

I think you can use code like this, if you can afford the memory:
var lines = File.ReadAllText(fileName);
var result =
    Regex.Matches(lines, @"\[(?<key>HKEY[^]]+)\]\s+(?<value>[^[]+)")
        .OfType<Match>()
        .ToDictionary(k => k.Groups["key"].Value, v => v.Groups["value"].ToString().Trim('\n', '\r', ' '));
This takes 24.173 seconds for a file with more than 4 million lines (about 550 MB), using 1.2 GB of memory.
Edit:
A better way is to use File.ReadLines, as it enumerates the file lazily (File.ReadAllLines loads every line into an array first):
var lines = File.ReadLines(fileName);
var keyRegex = new Regex(@"\[(?<key>HKEY[^]]+)\]");
var currentKey = string.Empty;
var currentValue = string.Empty;
var result = new Dictionary<string, string>();
foreach (var line in lines)
{
    var match = keyRegex.Match(line);
    if (match.Success)
    {
        if (!string.IsNullOrEmpty(currentKey))
        {
            result.Add(currentKey, currentValue);
            currentValue = string.Empty;
        }
        currentKey = match.Groups["key"].ToString();
    }
    else
    {
        currentValue += line;
    }
}
// Store the last key, which the loop itself never adds
if (!string.IsNullOrEmpty(currentKey))
{
    result.Add(currentKey, currentValue);
}
This will take 17093 milliseconds for a file with 795180 lines.

Finding all similar lines in a text file

I have a text file that contains comma-separated values. It looks like this:
3,23500,R,5998,20.38,06/12/2013 01:44:17
2,23500,P,5983,20.234,06/12/2013 01:44:17
3,23501,R,5998,20.38,06/12/2013 01:44:18
2,23501,P,5983,20.235,06/12/2013 01:44:18
3,23502,R,6000,20.4,06/12/2013 01:44:19
2,23502,P,5983,20.236,06/12/2013 01:44:19
3,23503,R,5999,20.39,06/12/2013 01:44:20
2,23503,P,5983,20.236,06/12/2013 01:44:20
My task is to extract the lines that start with the same number into separate files. E.g. in the case above, some lines start with 2 and some with 3; there can be more cases, such as 4 and so on.
What would be the best and fastest approach to do this? The files I am working with are quite big, sometimes on the order of gigabytes.
I split each line, stored the first value (the number I am looking for) in a list, and then removed the duplicate values from the list. It works, but it is very slow!
This is my own code:
private void buttonBeginProcess_Click(object sender, EventArgs e)
{
    var file = File.ReadAllLines(_fileName);
    var nodeId = new List<int>();
    foreach (var line in file)
    {
        nodeId.Add(int.Parse(line.Split(',')[0]));
    }
    //Unique numbers
    nodeId = nodeId.Distinct().ToList();
}

var lines = File.ReadLines(myFilePath);
var lineGroups = lines
    .Where(line => line.Contains(","))
    .Select(line => new { key = line.Split(',')[0], line })
    .GroupBy(x => x.key);
foreach (var lineGroup in lineGroups)
{
    var key = lineGroup.Key;
    var keySpecificLines = lineGroup.Select(x => x.line);
    //save keySpecificLines to file
}

You could try using StreamReader / StreamWriter to process each file one line at a time:
var writers = new Dictionary<string, StreamWriter>();
using (StreamReader sr = new StreamReader(pathToFile))
{
    while (sr.Peek() >= 0)
    {
        var line = sr.ReadLine();
        var key = line.Split(new[] { ',' }, 2)[0];
        // Open one output file per distinct key, the first time the key is seen
        if (!writers.ContainsKey(key))
        {
            writers[key] = new StreamWriter(GetPathToOutput(key));
        }
        writers[key].WriteLine(line);
    }
}
foreach (StreamWriter sw in writers.Values)
{
    sw.Dispose();
}
With this method, you ensure that your code never has to consume the entire input file, so it shouldn't matter how large your input files are. Of course the downside is it would have to keep an arbitrary number of files open throughout the process.
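A self-contained variant of the same idea might look like the sketch below. The names LineSplitter and SplitByFirstField are hypothetical, and the output naming scheme (one "<key>.csv" file per key in a target directory) is an assumption; it also assumes the first field is a plain number, so it is safe to use in a file name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineSplitter
{
    // Hypothetical helper: streams the input and writes each line to
    // "<outputDirectory>/<first field>.csv". Assumes the first field
    // contains only characters that are valid in a file name.
    public static void SplitByFirstField(string inputPath, string outputDirectory)
    {
        var writers = new Dictionary<string, StreamWriter>();
        try
        {
            // File.ReadLines yields one line at a time instead of loading the file
            foreach (var line in File.ReadLines(inputPath))
            {
                if (line.Length == 0) continue; // skip blank lines

                var key = line.Split(new[] { ',' }, 2)[0];
                if (!writers.TryGetValue(key, out var writer))
                {
                    writer = new StreamWriter(Path.Combine(outputDirectory, key + ".csv"));
                    writers[key] = writer;
                }
                writer.WriteLine(line);
            }
        }
        finally
        {
            // Close every output file even if reading fails part-way through
            foreach (var writer in writers.Values)
            {
                writer.Dispose();
            }
        }
    }
}
```

The try/finally block is a design choice: it closes the open writers on failure too, which matters because the number of open files grows with the number of distinct keys.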

Insert attribute into HTML elements with regex C#

string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
From this code, I get a bunch of h2 and h3-elements. In these, I'd like to insert an ID-attribute, with the value equal to (the content in the header, minus special chars and ToLower()). I also need this value as a separate string, as I need to store it for later use.
Input: <h3>Some sort of header!</h3>
Output: <h3 id="#some-sort-of-header">Some sort of header!</h3>
Plus, I need the values "#some-sort-of-header" and "Some sort of header!" stored in a dictionary or list or whatever else.
This is what I have so far:
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
Dictionary<string,string> returnValue = new Dictionary<string, string>();
foreach (Match match in matches)
{
    string idValue = StripTextValue(match.Groups[4].Value);
    returnValue.Add(idValue, match.Groups[4].Value);
}
MainBody = Regex.Replace(mainBody, htmlHeaderPattern, "this is where i must replace all the headers with one with an ID-attribute?");
Any regex-wizards out there to help me?

It is often recommended not to use regex for parsing HTML, so you could use e.g. Html Agility Pack for this:
var html = @"<h2>Some sort of header!</h2>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var headers = document.DocumentNode.SelectNodes("//h2|//h3");
if (headers != null)
{
    foreach (HtmlNode header in headers)
    {
        var innerText = header.InnerText;
        var idValue = StripTextValue(innerText);
        if (header.Attributes["id"] != null)
        {
            header.Attributes["id"].Value = idValue;
        }
        else
        {
            header.Attributes.Add("id", idValue);
        }
    }
}
This code finds all the <h2> and <h3> elements in the document passed in, gets the inner text of each, and sets (or adds) an id attribute on it.
With this example you should get something like:
<h2 id='#some-sort-of-header'>Some sort of header!</h2>
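The StripTextValue method is never shown in the question, so the following is only a guess at its behaviour, reverse-engineered from the single example ("Some sort of header!" becoming "#some-sort-of-header"). The class name HeaderIds and the exact rules (lowercase, runs of non-alphanumeric characters replaced by a hyphen, leading "#") are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Hypothetical stand-in for the asker's StripTextValue:
    // lowercases the text, collapses every run of characters outside a-z and 0-9
    // into a single hyphen, trims hyphens at both ends, and prefixes '#'.
    public static string StripTextValue(string text)
    {
        var lower = text.Trim().ToLowerInvariant();
        var slug = Regex.Replace(lower, "[^a-z0-9]+", "-").Trim('-');
        return "#" + slug;
    }

    public static void Main()
    {
        Console.WriteLine(StripTextValue("Some sort of header!")); // prints "#some-sort-of-header"
    }
}
```

Storing the pair in a dictionary, as the question asks, would then be a matter of adding idValue and innerText inside the loop that sets the attribute.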

How to test whether a node contains particular string or character as its text value?

How can I test whether a node contains a particular string or character, using C# code?
Example:
<abc>
    <foo>data testing</foo>
    <foo>test data</foo>
    <bar>data value</bar>
</abc>
Now I need to test whether a particular node's value contains the string "testing".
The output would be "foo[1]".

You can also load that into an XPath document and then use a query:
var xPathDocument = new XPathDocument("myfile.xml");
var query = XPathExpression.Compile(@"/abc/foo[contains(text(),""testing"")]");
var navigator = xPathDocument.CreateNavigator();
var iterator = navigator.Select(query);
while (iterator.MoveNext())
{
    Console.WriteLine(iterator.Current.Name);
    Console.WriteLine(iterator.Current.Value);
}

This will determine whether any elements (not just foo) contain the desired value and will print each element's name and its entire value. You didn't specify what the exact result should be, but this should get you started. If loading from a file, use XElement.Load(filename).
var xml = XElement.Parse(@"<abc>
    <foo>data testing</foo>
    <foo>test data</foo>
    <bar>data value</bar>
</abc>");
// or to load from a file use this
// var xml = XElement.Load("sample.xml");
var query = xml.Elements().Where(e => e.Value.Contains("testing"));
if (query.Any())
{
    foreach (var item in query)
    {
        Console.WriteLine("{0}: {1}", item.Name, item.Value);
    }
}
else
{
    Console.WriteLine("Value not found!");
}

You can use LINQ to XML:
string someXml = @"<abc>
    <foo>data testing</foo>
    <foo>test data</foo>
</abc>";
XDocument doc = XDocument.Parse(someXml);
bool containTesting = doc
    .Descendants("abc")
    .Descendants("foo")
    .Any(i => i.Value.Contains("testing"));
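To get output in the "foo[1]" form the question asks for, the position of the matching element among its same-named siblings has to be computed as well. A minimal sketch with LINQ to XML, where NodeFinder and FindPaths are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NodeFinder
{
    // Hypothetical helper: returns "name[position]" for every direct child of
    // the root whose value contains the search text. Positions are 1-based and
    // count only preceding siblings with the same name, as in XPath.
    public static List<string> FindPaths(XElement root, string searchText)
    {
        return root.Elements()
            .Where(e => e.Value.Contains(searchText))
            .Select(e => e.Name.LocalName + "[" + (e.ElementsBeforeSelf(e.Name).Count() + 1) + "]")
            .ToList();
    }

    public static void Main()
    {
        var root = XElement.Parse(
            "<abc><foo>data testing</foo><foo>test data</foo><bar>data value</bar></abc>");
        foreach (var path in FindPaths(root, "testing"))
        {
            Console.WriteLine(path); // prints "foo[1]"
        }
    }
}
```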
