My application converts several types of documents into HTML files. Then, it exposes generated files to users or search engine robots.
My problem is that some documents contain more than 100 pages and the generated HTML file is huge.
I am looking for a way to split HTML files into several pages.
One possible solution is to split them by size or number of characters, but that is difficult because the split has to take the HTML markup into account.
For example, consider the following HTML file:
<p>
-- So long paragraph with more than 100 lines
</p>
<table>
<tr>
<td> </td>
</tr>
...... more than 10 rows
</table>
The split mechanism should create several files for the paragraph and one file for the table, like the following:
PAGE1.HTML
<p>
-- contains 20 lines of original text
</p>
PAGE2.HTML
<p>
-- contains 20 lines of original text
</p>
PAGE3.HTML
<p>
-- contains 20 lines of original text
</p>
...
PAGE6.HTML
<table>
<tr>
<td> </td>
</tr>
...... more than 10 rows
</table>
Please advise me if you know a better solution or tools for achieving this.
You have to disentangle the content from the HTML. If you opt for an intermediate format that you control, you can generate HTML files with an appropriate amount of content.
Trying to cut the document after the HTML is generated is a worse and less efficient option. You can try to navigate the HTML document using (e.g.) HtmlAgilityPack, but without intimate knowledge of which elements you generate and in what structure, it's hard to pinpoint how to perform the split. Again, it will be much harder than splitting the content before it becomes HTML.
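As a rough illustration of splitting before the content becomes HTML, here is a minimal sketch. The Block, Paragraph, Table and Paginator types are hypothetical stand-ins for whatever intermediate format you control; they are not part of any library. It assumes .NET 6 or later (for Enumerable.Chunk and records).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

// Hypothetical intermediate format: a document is a sequence of blocks.
abstract record Block;
record Paragraph(IReadOnlyList<string> Lines) : Block;
record Table(IReadOnlyList<IReadOnlyList<string>> Rows) : Block;

static class Paginator
{
    // Paragraphs are cut every linesPerPage lines; a table is never split
    // and always gets a page of its own.
    public static List<string> ToHtmlPages(IEnumerable<Block> blocks, int linesPerPage)
    {
        var pages = new List<string>();
        foreach (var block in blocks)
        {
            switch (block)
            {
                case Paragraph p:
                    foreach (var chunk in p.Lines.Chunk(linesPerPage))
                    {
                        var lines = chunk.Select(line => WebUtility.HtmlEncode(line));
                        pages.Add("<p>\n" + string.Join("\n", lines) + "\n</p>");
                    }
                    break;
                case Table t:
                    var rows = t.Rows.Select(row =>
                        "<tr>" + string.Concat(row.Select(
                            cell => "<td>" + WebUtility.HtmlEncode(cell) + "</td>")) + "</tr>");
                    pages.Add("<table>\n" + string.Join("\n", rows) + "\n</table>");
                    break;
            }
        }
        return pages;
    }
}

static class Program
{
    static void Main()
    {
        var blocks = new Block[]
        {
            new Paragraph(Enumerable.Range(1, 100).Select(i => $"line {i}").ToList()),
            new Table(new[] { new[] { "a", "b" } }),
        };
        var pages = Paginator.ToHtmlPages(blocks, linesPerPage: 20);
        Console.WriteLine(pages.Count); // 5 paragraph pages + 1 table page = 6
    }
}
```

Each string in the result would then be written to its own file (PAGE1.HTML, PAGE2.HTML, and so on). The page size is expressed in lines only because the question does so; any other measure you control in the intermediate format would work the same way.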
I have a string containing a list of HTML tags, some of which are not formatted the way I need.
For example, I have a string such as the one below:
<p>
<strong>Scale:</strong>
</p>
<p>
<ul style="list-style-type:disc" class="pl-2">
<li>2 to 4 nodes</li>
</ul>
</p>
<p>
<strong>Single Node Data:</strong>
</p>
<p>
<ul style="list-style-type:disc" class="pl-2">
<li>CPU: 6-26 cores (Intel)</li>
<li>RAM: 128GB to 2TB</li>
<li>Raw storage: 240GB to 16TB</li>
<li>Storage type: SSD + HDD</li>
<li>Network speed: Up to 25Gb</li>
</ul>
</p><img src="xxxxx"/>
I need to replace self-closing tags that end with /> with an explicit closing tag, so that <img src="xxxxx"/> becomes <img src="xxxxx"></img>.
How would I achieve this using C#?
For what you are asking, you can go with either of the following options:
Option 1
You can use a third-party library that parses your HTML into tags (it actually treats it as XML) and separates each tag (and its content) into a string array or list.
Then loop over the list and check whether each closing tag is well formed; if not, replace it with the proper one.
Here is the library
Option 2
You can create your own HTML parser, which would give you more control over the parser's logic. I found this example of a C# HTML parser on CodeProject that you can check out.
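If the input is as regular as the sample in the question, a much narrower alternative to either option is a single regular-expression replace. This is only a sketch, not a general HTML parser: it assumes attribute values never contain < or >, and the SelfClosingTags class name is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class SelfClosingTags
{
    // Group 1 = tag name, group 2 = attributes (may be empty).
    // Matches <name attrs/> and <name attrs />, but not <name> or </name>.
    private static readonly Regex SelfClosing =
        new Regex(@"<([A-Za-z][A-Za-z0-9]*)([^<>]*?)\s*/>", RegexOptions.Compiled);

    // Rewrites <name attrs/> as <name attrs></name>.
    public static string Expand(string html) =>
        SelfClosing.Replace(html, "<$1$2></$1>");

    static void Main()
    {
        Console.WriteLine(Expand("</p><img src=\"xxxxx\"/>"));
        // prints: </p><img src="xxxxx"></img>
    }
}
```

For anything less predictable than this sample (comments, scripts, attribute values containing angle brackets), a real parser as in Option 1 is the safer route.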
I would like to parse my HTML page in as generic a way as possible. I don't want to rebuild the parser every time the page changes, so I would like to parse it by the content of the tags.
I know that the HTML Agility Pack provides tools to read and search by the type of tag (td, strong, li, etc.), but I would like to iterate over all the tags and find information by the content of the tag rather than by its type, because the type can change.
Example:
The page:
<table>
<tr valign="top">
<td valign="top">Sex:<br />
</td><td valign="top">Male<br />
</td></tr>
<tr valign="top">
<td valign="top">Current City:<br />
</td><td valign="top">New York<br /></td>
I know that the value will be "Sex:" and the next tag will contain
the gender.
I know that the value will be "Current City:" and then the next
tag will be the city.
I know I can iterate by tag, but if the tags change my parser will no longer work.
Can I iterate by values and not by the type of tags?
You could load all the nodes inside <table> into an HtmlNodeCollection, then iterate through that collection:
foreach (HtmlNode node in ListofNodes)
Within that loop, you could check the InnerHtml of each node for your specific strings. I assume the table has the same fields each time. Alternatively, add an id or CSS class to the elements and look for that specific id or class.
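One way to make the lookup independent of tag types is to ignore the tags altogether and walk the text nodes in document order. The sketch below assumes the HtmlAgilityPack NuGet package is referenced and that each value is the first non-empty text after its label; LabelLookup and Read are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

static class LabelLookup
{
    // For each label, returns the first non-empty text that follows it in
    // document order, regardless of which tags wrap the label or the value.
    public static Dictionary<string, string> Read(string html, params string[] labels)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // All non-blank text fragments of the page, in document order.
        List<string> texts = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0)
            .ToList();

        var result = new Dictionary<string, string>();
        foreach (string label in labels)
        {
            int i = texts.IndexOf(label);
            if (i >= 0 && i + 1 < texts.Count)
                result[label] = texts[i + 1];
        }
        return result;
    }
}
```

Calling LabelLookup.Read(html, "Sex:", "Current City:") on the page above should map the two labels to "Male" and "New York". If the td elements later become div or span elements, the lookup does not need to change, as long as the label text and the order stay the same.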
EDITED:
I have written some correct HTML and passed this as a string into an email,
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>/n<html>
<head>
</head>
<body>
<table>
<tr>
<th>X</th>
<th>Y</th>
</tr>
<tr>
<td>Overall</td>
<td>207,890</td>
</tr>
<tr>
<td>a</td>
<td>100,568</td>
</tr>
<tr>
<td>b</td>
<td>107,322</td>
</tr>
</table>
</body>
</html>
I rewrote the HTML to be extremely simple, using only a table, but it's still not showing.
Generally, email clients don't seem to like decently formatted HTML. Just from conversations I've had with HTML developers:
Use inline styles even if that means repeating yourself. No style sheets even in head
No fancy floating of the divs
Put everything in tables for formatting
Generally pretend like it's 1999
Your problem is probably not only Outlook 2007 but most other email clients as well.
Make sure that your HTML is very simple and does not use many external resources; inline CSS is probably necessary. This article is a nice summary: http://css-tricks.com/using-css-in-html-emails-the-real-story/
I'm using ASP.NET and C#, and I'm able to save the source code of an HTML page to a text file using WebRequest and WebResponse. Now I want to get only some elements or HTML tags instead of the whole source code. Can anyone help me with this? If possible, can we also save the elements and values in a MySQL database? Please suggest any useful reference links.
Have a look at the HTML Agility Pack.
You would need to match the HTML with regular expressions then save the results to a desired location.
See: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Here is an entire tutorial on that; the link goes to the topic you are asking about, and there should be some examples too:
http://www.tizag.com/htmlT/htmldiv.php
<div id="menu" align="right" >
HOME |
CONTACT |
ABOUT |
LINKS
</div>
<div id="content" align="left" >
<h5>Content Articles</h5>
<p>This paragraph would be your content
paragraph with all of your readable material.</p>
<h5 >Content Article Number Two</h5>
<p>Here's another content article right here.</p>
</div>
Here are some lines of the document:
<div class="rowleft">
<h3>Technical Fouls</h3>
<table class="num-left">
<tr class="datahl2b">
<td> </td>
<td>Players</td>
</tr>
<tr>
<td>DAL</td>
<td>
None</td>
</tr>
<tr>
<td>MIA</td>
<td>
Mike Miller</td>
<td>
Mike Miller, Jr.</td>
</tr>
</table>
</div>
I'm interested in extracting the None and Mike Miller and Mike Miller, Jr. from this. I tried using various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not a properly formatted XML document.
One thing I've been thinking about is stripping the document of newlines, splitting it at something like <tr>, seeing which lines contain data (probably using StartsWith()), and extracting it with a regex. That would be efficient enough for my program (it doesn't really matter that it takes half a second when downloading the document takes five seconds), but I'm interested in alternative solutions.
HTML generally isn't properly formatted XML. I suggest you use something like the HTML Agility Pack.
Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.
If your document is not well-formed XML, I would recommend using the HTML Agility Pack.
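For the fragment in the question, an HTML Agility Pack version could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced, that the table directly follows the "Technical Fouls" heading, that the first row is the header, and that the first cell of every other row is the team code; TechnicalFouls and ReadPlayers are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

static class TechnicalFouls
{
    // Returns the text of every player cell in the table under the
    // "Technical Fouls" heading.
    public static List<string> ReadPlayers(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Skip the header row (first tr) and the team column (first td).
        var cells = doc.DocumentNode.SelectNodes(
            "//h3[normalize-space(.)='Technical Fouls']/following-sibling::table[1]" +
            "//tr[position() > 1]/td[position() > 1]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (cells == null)
            return new List<string>();

        return cells
            .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
            .ToList();
    }
}
```

On the sample rows this should yield "None", "Mike Miller" and "Mike Miller, Jr." without any manual newline stripping or per-line regexes.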