C# Web scraper copying text

I have a web scraper written in C# for extracting data. I want to copy text from the web browser control and paste it into a Word file programmatically. When I try to extract the rich text box content using its ID and InnerText, the text contains encoded characters like %2c.
I need to get the text with all its formatting, but I can't find any way to do it. I have tried Encoding, HttpUtility.UrlDecode, SendKeys and elem.InvokeMember() without success.
How can I programmatically copy and paste text from the web browser control while preserving formatting?
Here is the sample data to extract:
Description
The Advance Concepts Engineering team designs and develops new vehicles which will meet future regulatory requirements and customer competitive requirements. A qualified candidate will be responsible for the total vehicle packaging. The candidate will identify and resolve adaptation and packaging issues as the vehicle moves toward production. They will lead cross functional team meetings working with Systems & Components, Advance Manufacturing, Service, etc. to ensure that the solutions are optimized for all stages of the vehicle's life.
HtmlElement elem = wb.Document.GetElementById("ctl00_contplhDynamic_txtDescrContentHiddenTextarea");
if (elem == null) return;

// Attempt 1: click the element, then simulate Ctrl+A / Ctrl+C
elem.InvokeMember("Click");
//elem.InvokeMember("Select All");
//elem.InvokeMember("Copy");
SendKeys.SendWait("^a");
SendKeys.SendWait("^c");

// Attempt 2: focus the element and invoke the context-menu commands
Clipboard.Clear();
elem.Focus();
elem.InvokeMember("Right Click");
elem.InvokeMember("Select All");
elem.InvokeMember("Copy");

// Attempt 3: read InnerText directly and push it through the clipboard
Clipboard.SetText(elem.InnerText);
string clipbrdText = Clipboard.GetText();
string data = elem.InnerText;
richTextBox1.Text = data;

// Attempt 4: decode / re-encode the extracted string
string temp = System.Web.HttpUtility.UrlDecode(data);
Encoding iso = Encoding.GetEncoding("windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(data);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);

The text with "%2c" etc. has been encoded. If you are getting the content of a web page, you are decoding HTML, not a URL. You can use HttpUtility.HtmlDecode, or if you are using .NET 4.0 or above you can also use WebUtility.HtmlDecode, which is available in the System.Net namespace.
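For instance, a minimal sketch of that decoding step (the input string here is made up for illustration):
string raw = "Systems &amp; Components, Advance Manufacturing";
string decoded = System.Net.WebUtility.HtmlDecode(raw); // "Systems & Components, Advance Manufacturing"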
You should note that Word does not use HTML for its formatting, so you won't be able to paste HTML tags and expect it to recognise them; e.g. <strong>Description</strong> will not result in bold text if you type that into Word.
EDIT:
It looks like you are mixing two different ways to copy the text in the code you pasted - both SendKeys.SendWait("^c"); and elem.InvokeMember("Copy");. I presume both of these methods work?
I think the problem you are having lies in the way you are getting the text. I see you're using Clipboard.GetText() to get the text. Try specifying that it is formatted text using Clipboard.GetText(TextDataFormat.Rtf) or Clipboard.GetText(TextDataFormat.Html). This should hopefully copy the string preserving the formatting.
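For instance, a rough sketch of reading the formatted clipboard content (assuming the copy step above really did put formatted data on the clipboard):
if (Clipboard.ContainsText(TextDataFormat.Rtf))
{
    // RTF can be assigned straight to a RichTextBox and keeps bold/italic runs
    richTextBox1.Rtf = Clipboard.GetText(TextDataFormat.Rtf);
}
else if (Clipboard.ContainsText(TextDataFormat.Html))
{
    string html = Clipboard.GetText(TextDataFormat.Html); // CF_HTML fragment, tags included
}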

Related

Save text with emoji to file becomes '?'

What I have to do
I have to create a text file (.txt, .doc, ...) with the exact text passed (so with emojis) by a .NET WebApi (and attach it to an email).
Situation:
I have a project with a .NET WebApi. One of my routes consists of creating a text file and attaching it to an email, with some text passed by a device that may contain emojis.
I can't figure out how to save emojis correctly. If I copy-paste an emoji into a Word or Notepad file it works, but if I save it through my code it doesn't. I suppose it is due to the encoding, but I have tried Unicode, UTF-32, UTF-8, ASCII, ...
I tried many solutions found here on SO, but none of them worked for me.
For example, this emoji (copy-pasted from the .NET debugger) --> 🎶 is converted into a quotation mark or ¶ó, depending on the encoding used.
How can I save emojis as text into a file so that they can be read by the receivers?
This is what I've done:
//smsText is a string containing emojis
byte[] bytes = Encoding.Unicode.GetBytes(smsText);
Attachment attachment = new Attachment(new MemoryStream(bytes), tokenKey + ".doc");
attachment.ContentType = new ContentType("application/ms-word");
List <Attachment> attachments = new List<Attachment>();
attachments.Add(attachment);
//send email with attachments
Note that smsText, in the debugger, contains the 🎶 correctly displayed.
The email correctly reaches the receiver, with the .doc attachment, but the attachment doesn't contain the emojis.
Your smsText contains a plaintext string. You can't just write that string into a stream or file that you then call a Word file*.
Word files are binary files with a specific format. You need to use a library that can write this format, or use Interop to interoperate with an existing Word installation.
See for example Free library to MS Word.
And if you're fine with plaintext files, just write the text's bytes to a stream and propagate the appropriate encoding (in this case Unicode, being UTF-16 on .NET).
*: yes, you can, just like Excel tries its best to format an HTML table as an Excel document, but you shouldn't.
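A minimal sketch of the plaintext route (assumptions: smsText and tokenKey are the variables from the question; the byte-order mark is written so text editors detect the UTF-16 encoding):
byte[] preamble = Encoding.Unicode.GetPreamble();   // UTF-16 LE byte-order mark
byte[] body = Encoding.Unicode.GetBytes(smsText);
var stream = new MemoryStream();
stream.Write(preamble, 0, preamble.Length);
stream.Write(body, 0, body.Length);
stream.Position = 0;
Attachment attachment = new Attachment(stream, tokenKey + ".txt", "text/plain"); // plain text, not a fake .doc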

Retrieve the Url from an Html Img Tag

Background Info
I am currently working on a C# web API that will return selected img URLs as base64. I already have the functionality that performs the base64 conversion; however, I am getting a large amount of text which also includes img URLs, which I will need to crop out of the string and give to my function to convert the image to base64. I read up on a library ("HtmlAgilityPack") that should make this task easy, but when I use it I get "HtmlDocument.cs" not found. However, I am not submitting a document, but sending it a string which is HTML. I read the docs and it is supposed to work with a string as well, but it is not working for me. This is the code using HtmlAgilityPack.
NON WORKING CODE
foreach (var item in returnList)
{
    if (item.Content.Contains("~~/picture~~"))
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(item.Content);   // fails: Load expects a file path, not an HTML string
    }
}
Error Message From HtmlAgilityPack (screenshot not included)
Question
I am receiving a string which is HTML from SharePoint. This HTML string may be tokenized with heading tokens and/or picture tokens. I am trying to isolate and retrieve the URL from the img src attribute. I understand that regex may be impractical, but I would consider working with a regex expression if one is available to retrieve the URL from the img src.
Sample String
Bullet~~Increased Cash Flow</li><li>~~/Document Text Bullet~~Tax Efficient Organizational Structures</li><li>~~/Document Text Bullet~~Tax Strategies that Closely Align with Business Strategies</li><li>~~/Document Text Bullet~~Complete Knowledge of State and Local Tax Obligations</li></ul><p>~~/Document Heading 2~~is the firm of choice</p><p>~~/Document Text~~When it comes to accounting and advisory services is the unique firm of choice. As a trusted advisor to our clients, we bring an integrated client service approach with dedicated industry experience. Dixon Hughes Goodman respects the value of every client relationship and provides clients throughout the U.S. with an unwavering commitment to hands-on, personal attention from our partners and senior-level professionals.</p><p>~~/Document Text~~of choice for clients in search of a trusted advisor to deal with their state and local tax needs. Through our leading best practices and experience, our SALT professionals offer quality and ease to the client engagement. We are proud to provide highly comprehensive services.</p>
<p>~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/map-al.jpg" alt="map al" style="width:611px;height:262px;" /> 
<br></p><p><br></p><p>
~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/Firm_Telescope_Illustration.jpg" alt="Firm_Telescope_Illustration.jpg" style="margin:5px;width:155px;height:155px;" /> </p><p></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div>
Important
I am working with an HTML string, not a file.
The issue you are having is that C# is looking for a file and, since it is not finding it, it tells you. This is not an error that will break your app; it is just telling you that the file was not found, and the library will then read the string given. The documentation can be found here: https://htmlagilitypack.codeplex.com/SourceControl/latest#Trunk/HtmlAgilityPackDocumentation.shfbproj. The code below is a cookie-cutter model that anyone can use.
Important
C# is looking for a file which cannot be found, because a string is what is supplied. That is the message you are getting; however, per the documentation linked above, your code will still work and the message will not affect it.
Example Code
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml("YourContent"); // LoadHtml takes the HTML string itself (use Load for a file path)
HtmlNode img = htmlDocument.DocumentNode.SelectSingleNode("//img"); // first <img> element in the document
HtmlAttribute att = img.Attributes["src"];
Uri imgUrl = new System.Uri("Url" + att.Value); // build your url from the relative src value
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].+?>", RegexOptions.IgnoreCase).Groups[1].Value; // regex alternative
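Since the sample string contains more than one image, a hedged sketch that collects every img src (original_text is carried over from the code above; the base address is illustrative):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(original_text);
var nodes = doc.DocumentNode.SelectNodes("//img");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // src values in the sample are site-relative, so combine them with the site root
        Uri imgUrl = new Uri(new Uri("https://your-sharepoint-site"), node.GetAttributeValue("src", ""));
    }
}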

C# decoding "â„¢" to "TM"

On a web page there is the following string:
"Qualcomm Snapdragon™ S4"
When I get this string in my .NET code, it is converted to "Qualcomm Snapdragonâ„¢ S4".
The character "™" changes to "â„¢".
How can I decode "â„¢" back to "™"?
Update
Following is the code for the downloaded string using the web proxy
(wc is the web proxy)
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8");
string html = Server.HtmlEncode(wc.DownloadString(url));
You should read the webpage in its proper encoding in the first place. In this case it seems you are reading with Encoding.Default (i.e. probably CP1252) and the page is really in UTF-8. This should be apparent either by reading the Content-Type header of the response or by looking for a <meta http-equiv="Content-Type" content='text/html; charset=utf-8'> in the content.
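A minimal sketch of reading the page with the right encoding up front (assuming wc is a System.Net.WebClient, as the DownloadString call suggests, and that the page really is UTF-8):
wc.Encoding = Encoding.UTF8;          // decode the response bytes as UTF-8
string html = wc.DownloadString(url); // "™" now arrives intact instead of as "â„¢"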
If you still need to do this after the fact, then use
var bytes = Encoding.Default.GetBytes(myString);
var correctString = Encoding.UTF8.GetString(bytes);
In any case you would need to know the exact encodings that were used on the page and for reading the malformed string in the first place. Furthermore I'd generally advise explicitly against using Encoding.Default because its value isn't fixed. It's just the legacy encoding on a Windows system for use in non-Unicode applications and also gets used as the default non-Unicode text file encoding. It should have no place whatsoever in handling external resources.

HTML String Encoding

I have a requirement to export PPT from C# without using interop DLLs. I am able to do that, but when I append an HTML string, e.g. "<b>Krishna</b><br/><strong>Ram</strong>", to any slide, it shows the literal text rather than the rendered formatting. Can anyone help me?
It appears that PowerPoint does not currently support rendering HTML directly in a slide. You must either export your slide show as HTML, or use the built-in formatting as shown in the answer to the following question: Apply Font Formatting to PowerPoint Text Programatically.
Set tr = ActiveWindow.Selection.SlideRange.Shapes(1).TextFrame.TextRange
With tr
    .Text = "Hi There Buddy!"
    .Words(1).Font.Bold = msoTrue
End With
For an idea of the settings in C# and Office 2010 specifically see Font Members.
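As a rough, hedged C# counterpart of the VBA above (it assumes the PowerPoint interop assembly and an existing slide object, which the question wants to avoid, so it is shown only to illustrate the font members the linked documentation describes):
var textRange = slide.Shapes[1].TextFrame.TextRange;   // slide: Microsoft.Office.Interop.PowerPoint.Slide
textRange.Text = "Hi There Buddy!";
textRange.Words(1).Font.Bold = Microsoft.Office.Core.MsoTriState.msoTrue;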
You should be able to test my assertion yourself, by HTML encoding your text using the HttpServerUtility.HtmlEncode Method:
String TestString = "This is a <Test String>.";
String EncodedString = Server.HtmlEncode(TestString);

how to convert pdf file to text file using c#.net

Currently I have been using the following code, and I am using some DLL files from PDFBox:
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText(doc);
richTextBox1.Text = text;
Using this code I am able to get a text file, but not in the correct format. Please give me some ideas.
Extracting the text from a pdf file is anything but trivial.
To quote from the iTextSharp tutorial:
"The pdf format is just a canvas where
text and graphics are placed without
any structure information. As such
there aren't any 'iText-objects' in a
PDF file. In each page there will
probably be a number of 'Strings', but
you can't reconstruct a phrase or a
paragraph using these strings. There
are probably a number of lines drawn,
but you can't retrieve a Table-object
based on these lines. In short:
parsing the content of a PDF-file is
NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler http://poppler.freedesktop.org/ which is used by the pdf viewers of GNOME and KDE. It has a function called pdftotext() but I have no experience with it. It may be your best free option.
There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
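If you go the Poppler route, one hedged sketch is to shell out to its pdftotext command-line tool (this assumes pdftotext is installed and on the PATH; the file names are illustrative):
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "pdftotext",
    Arguments = "\"c:\\aa.pdf\" \"c:\\aa.txt\"",   // input PDF, output text file
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var process = Process.Start(psi))
{
    process.WaitForExit();
}
string extracted = System.IO.File.ReadAllText("c:\\aa.txt");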
