I'm trying to grab text from an HTML page using HtmlAgilityPack. Here is my code:
var headers = doc.DocumentNode.SelectNodes("//h2");
if (headers != null)
{
foreach (HtmlNode item in headers)
{
textBox1.AppendText(item.InnerText);
Console.WriteLine(item.InnerText);
}
}
It shows different results on the console and in the textbox.
Result on console:
Avril
Lavigne
result on textBox:
AvrilLavigne
I want it like:
Avril Lavigne
I can't figure out what the character between the two words is.
The original text in the HTML is: Avril Lavigne. There is already a space between Avril and Lavigne, but it does not appear in the textbox.
Console.WriteLine writes your input and then appends Environment.NewLine after it.
You can use
var headers = doc.DocumentNode.SelectNodes("//h2");
if (headers != null)
{
textBox1.AppendText(string.Join(" ", headers.Select(item => item.InnerText)));
}
This joins the InnerText of each item, adding a space in between. (Select requires using System.Linq.)
Try adding all of the text elements to a list and then pass that list to string.Join:
var arr = new List<string>();
foreach(var item in items){
arr.Add(item.InnerText);
}
textBox1.AppendText(string.Join(" ", arr));
The TextBox control's AppendText method just appends whatever string you pass to it to the end of the control's Text property. The Console class's WriteLine method appends the string you pass to it to the console, and then it appends the end of line characters, that is, a carriage return & line feed.
If you want your text in the TextBox and in the console to be separated with spaces, you'll have to construct it yourself:
bool isFirst = true;
foreach (HtmlNode item in headers)
{
string textToAppend = (isFirst ? string.Empty : " " ) + item.InnerText;
isFirst = false;
textBox1.AppendText(textToAppend);
Console.Write(textToAppend);
}
In this case, Console.Write just outputs the string you pass to it without adding any end of line characters.
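As a minimal sketch of the same idea without the isFirst flag: collect the texts first, then join them once. HeaderText and JoinWithSpaces are hypothetical names, not part of HtmlAgilityPack, and the sketch assumes you pass in the InnerText of each node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderText
{
    // Hypothetical helper: trims each text, drops empty ones,
    // and joins what is left with a single space.
    public static string JoinWithSpaces(IEnumerable<string> texts)
    {
        return string.Join(" ",
            texts.Select(t => t.Trim())
                 .Where(t => t.Length > 0));
    }
}
```

With the code from the question, this could be called as textBox1.AppendText(HeaderText.JoinWithSpaces(headers.Select(h => h.InnerText))); and the same string can be passed to Console.WriteLine so both outputs match.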
I'm new to C# and still learning the language. I'm trying to write an app that reads text, and I need only specific lines from it. The text looks like this:
[HEADING]
Some value
[HEADING]
Some other value
[HEADING]
Some other text
and continuation of this text in new line
[HEADING]
Last text
I'm trying to write a method that reads the text and splits it into a string[] like this:
string[0] = Some value
string[1] = Some other value
string[2] = Some other text and continuation of this text in new line
string[3] = Last text
So I want to read from each [HEADING] line up to the next empty line. I thought I should use ReadAllLines and check line by line, starting at [HEADING] and ending at an empty line. I tried this code:
string s = "mystring";
int start = s.IndexOf("[HEADING]");
int end = s.IndexOf("\n", start);
string result = s.Substring(start, end - start);
but it takes one substring from the whole text, rather than looping from the first [HEADING] to the next empty line, then from the second, and so on.
Maybe someone can help me with this?
You could try to split the string by "[HEADING]" to get the strings between these lines. Then you could join each string into a single line and trim the whitespace around the strings:
string content = @"[HEADING]
Some value
[HEADING]
Some other value
[HEADING]
Some other text
and continuation of this text in new line
[HEADING]
Last text";
var segments = content.Split(new[] { "[HEADING]"}, StringSplitOptions.RemoveEmptyEntries) // Split into multiple strings
.Select(p=>p.Replace("\r\n"," ").Replace("\r"," ").Replace("\n"," ").Trim()) // Join each single string into single line
.ToArray();
Result:
segments[0] = "Some value"
segments[1] = "Some other value"
segments[2] = "Some other text and continuation of this text in new line"
segments[3] = "Last text"
Here's a solution which avoids the substring/index checking, which could potentially be fraught with errors.
There are answers such as this one that use LINQ, but for a newcomer to the language, basic looping is an OK place to start. Also, this is not necessarily the best solution for efficiency or whatever.
This foreach loop will handle your case, and some of the "dirty" cases.
var segments = new List<string>();
bool headingChanged = false;
foreach (var line in File.ReadAllLines("somefilename.txt"))
{
// skip blank lines
if (string.IsNullOrWhiteSpace(line)) continue;
// detect a heading
if (line.Contains("[HEADING]"))
{
headingChanged = true;
continue;
}
if (headingChanged)
{
segments.Add(line);
// this keeps us working on the same segment if there
// are more lines to be added to the segment
headingChanged = false;
}
else
{
// segments is a List<string>, so use Count rather than Length
segments[segments.Count - 1] += " ";
segments[segments.Count - 1] += line;
// you could replace the above two lines with string interpolation...
// segments[segments.Count - 1] = $"{segments[segments.Count - 1]} {line}";
}
}
In the above loop, ReadAllLines obviates the need to check for \r and \n. Contains will detect [HEADING] no matter where it appears in the line.
You don't need substring, you can just compare the value s == "[HEADING]".
Here's an easy to understand example:
var lines = System.IO.File.ReadAllLines(myFilePath);
var resultLines = new List<String>();
var collectedText = new List<String>();
foreach (var line in lines)
{
if (line == "[HEADING]")
{
collectedText = new List<String>();
}
else if (line != "")
{
collectedText.Add(line);
}
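else //if (line == "")
{
// a blank line ends the current segment
if (collectedText.Count > 0)
{
resultLines.Add(String.Join(" ", collectedText));
collectedText = new List<String>();
}
}
}
// flush the last segment if the file does not end with a blank line
if (collectedText.Count > 0)
{
resultLines.Add(String.Join(" ", collectedText));
}
return resultLines.ToArray();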
The loop does this:
we go line by line
"start collecting" (create a list) when we encounter a "[HEADING]" line
"collect" (add to the list) each line that is not empty
"finish collecting" (join and add to the results list) when a line is empty
Greets.
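The steps above can be sketched as one self-contained method. This is a variation under my own assumptions, not the exact code from any answer: it also closes a segment when the next [HEADING] arrives or the input ends, so it works whether or not blank lines separate the blocks. HeadingParser, Parse and Flush are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class HeadingParser
{
    public const string Marker = "[HEADING]";

    // Returns one joined string per [HEADING] block.
    public static string[] Parse(IEnumerable<string> lines)
    {
        var segments = new List<string>();
        List<string> current = null; // null means "not inside a block"

        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line == Marker)
            {
                Flush(current, segments);      // close the previous block, if any
                current = new List<string>();  // start collecting
            }
            else if (line.Length == 0)
            {
                Flush(current, segments);      // a blank line ends the block
                current = null;
            }
            else if (current != null)
            {
                current.Add(line);             // collect a content line
            }
        }

        Flush(current, segments);              // input may end without a blank line
        return segments.ToArray();
    }

    private static void Flush(List<string> current, List<string> segments)
    {
        if (current != null && current.Count > 0)
            segments.Add(string.Join(" ", current));
    }
}
```

For a file, this could be called as HeadingParser.Parse(File.ReadAllLines(path)). Lines that appear before the first heading or after a blank line are ignored, which is a design choice you may want to change.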
I'm calling a Window with .ShowDialog() and returning some lines from a textbox.
The lines are returned to a List<>, but each character in the textbox is being assigned to its own index within the List<>.
I essentially want to add each entire line from the textbox to its own index in the List<>.
EXAMPLE:
I enter the below in the textbox that was called from the ShowDialog();
123456
87564
125
How do I add each line from the textbox to its own index in the list?
This is what I have now. (There is no code on the textbox window that I enter these values into.) (I realize I spelled it as imput...) When I debug and review the pos List<>, each character has its own index.
private void GetPOs()
{
MultiLineImput getPOList = new MultiLineImput();
getPOList.ShowDialog();
foreach (char po in getPOList.listOfPOs.Text)
{
pos.Add(po.ToString());
}
if (pos.Count > 0)
{
string a = String.Join("", pos);
MessageBox.Show(a, "POs to Process");
}
else
{
if (!getPOList.wasCanceled.Equals(1))
{
MessageBox.Show("No values were passed", "Warning");
}
}
}
You're iterating over the characters of Text property, so each character is converted to string and added to list separately.
I'm not sure what you mean by adding the entire "line". In your example there's only one line, so you can rewrite this loop
foreach (char po in getPOList.listOfPOs.Text)
{
pos.Add(po.ToString());
}
to simply
pos.Add(getPOList.listOfPOs.Text);
If you meant to split this line into the entries "123456", "87564", "125", you can do it the following way:
foreach (string po in getPOList.listOfPOs.Text.Split(' '))
{
pos.Add(po);
}
If your textbox does support multiline input, you can split by Environment.NewLine, like this:
foreach (string po in getPOList.listOfPOs.Text.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
{
pos.Add(po);
}
If you iterate over a string, the iterator will pull one character at a time. It has no idea what a line break is.
I suggest you break the string up by line breaks, then iterate over the result, like so:
MultiLineInput getPOList = new MultiLineInput();
getPOList.ShowDialog();
var wholeText = getPOList.listOfPOs.Text;
var lines = wholeText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
foreach (string po in lines)
{
pos.Add(po);
}
//Etc.....
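A minimal sketch of the splitting step on its own, assuming the textbox content is already available as a string. LineSplitter and SplitLines are hypothetical names. Unlike the snippets above, this version also drops blank lines and trims each entry, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineSplitter
{
    // Splits on Windows (\r\n), Unix (\n) and old Mac (\r) line endings,
    // trims each line and skips lines that are empty after trimming.
    public static List<string> SplitLines(string text)
    {
        return text
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToList();
    }
}
```

In GetPOs this could replace the foreach loop with pos.AddRange(LineSplitter.SplitLines(getPOList.listOfPOs.Text)); and the message box would then want String.Join(Environment.NewLine, pos) rather than an empty separator.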
:58A:/C/81000098099CL
CBNINGLA
:72:/CODTYPTR/012
/CLEARING/0003
/SGI/DBLNNGLA
I am trying to read the SWIFT message above, specifically line :58A: and line :72:, and I am having a small issue. My code only reads the first line of :58A:, giving C/81000098099CL, but I want it to keep reading the following lines until it reaches :72:. In short, the output for :58A: should be C/81000098099CL CBNINGLA.
The same goes for line :72:, because the messages arrive formatted this way. This is my code:
if (line.StartsWith(":58A:"))
{
string[] narr = line.Split('/');
inflow202.BENEFICIARY_INSTITUTION = narr[2];
}
if (line.StartsWith(":72:"))
{
inflow202.RECEIVER_INFORMATION = line.Substring(5);
}
You can replace all new lines not followed by : with spaces (or empty string).
string output = Regex.Replace(text, @"\r?\n(?!:)", " ");
string[] lines = output.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string line in lines)
{
if (line.StartsWith(":58A:"))
{
}
else if (line.StartsWith(":72:"))
{
}
}
If the message always comes formatted in this form and : never occurs in the text except in these line starters, consider splitting the whole text into an array by : first. Position 0 will be empty, every odd position will hold a tag number, and every even position will hold the content up to the next :. This solution works provided that you can read the whole input into a single string first. That is, given a string message, you can do something like:
var splitted = message.Split(':');
for (int i = 1; i + 1 < splitted.Length; i += 2) {
if (splitted[i] == "58A") {
//do what you need to do, the text you need is stored in splitted[i+1]
}
...
}
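Putting the regex idea into a self-contained sketch: fold every continuation line into the line before it, then read each ":tag:value" line into a dictionary. SwiftFields and Parse are hypothetical names, and the tag pattern (letters and digits between two colons) is my assumption based only on the sample message, not on the SWIFT specification.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SwiftFields
{
    // Returns tag -> value, e.g. "58A" -> "/C/81000098099CL CBNINGLA".
    // If a tag occurs more than once, the last occurrence wins.
    public static Dictionary<string, string> Parse(string message)
    {
        // A line break NOT followed by ':' belongs to the previous field,
        // so replace it with a space.
        string folded = Regex.Replace(message, @"\r?\n(?!:)", " ");

        var fields = new Dictionary<string, string>();
        var lines = folded.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            Match m = Regex.Match(line, @"^:([0-9A-Za-z]+):(.*)$");
            if (m.Success)
                fields[m.Groups[1].Value] = m.Groups[2].Value.Trim();
        }
        return fields;
    }
}
```

The values keep their leading slash, so something like fields["58A"].TrimStart('/') would be needed to get exactly the output described in the question.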
I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).
Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?
The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
dt.Rows.Add(text);
}
}
EDIT: Here is what the text ("1. Introduction") looks like in the Word document:
This is what it looks like before being put into my datatable:
And this is what it looks like when put into the datatable:
So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).
EDIT: Here is the code I'm trying to use. I created a new method to convert the string:
private string ConvertToText(string rtf)
{
using (RichTextBox rtb = new RichTextBox())
{
rtb.Rtf = rtf;
return rtb.Text;
}
}
When I run the program, it bombs with the following error:
The variable rtf, at this point, looks like this:
RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var charsToTrim = new[] { '\r', '\a', ' ' };
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
text = text.TrimEnd(charsToTrim);
dt.Rows.Add(text);
}
}
I don't know exactly what formatting you're trying to remove, but you could try something like:
text = new string(text.Where(c => !Char.IsControl(c)).ToArray());
That should strip the non-printing characters out.
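One thing to watch when stripping control characters outright is that a \r between two paragraphs disappears and the words on either side run together. A small sketch that swaps each control character for a space and then collapses whitespace, assuming that is acceptable for your data; TextCleaner and ToPlainLine are hypothetical names.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Replaces every control character (\r, \a, \n, \t, ...) with a space,
    // then collapses runs of whitespace and trims both ends.
    public static string ToPlainLine(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
            sb.Append(char.IsControl(c) ? ' ' : c);

        return Regex.Replace(sb.ToString(), @"\s+", " ").Trim();
    }
}
```

In the table loop from the question this would be used as dt.Rows.Add(TextCleaner.ToPlainLine(text));.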
An alternative is to add a RichTextBox to your form (you can keep it hidden if you don't want to show it) and, once you have read all your data, assign it to the RichTextBox. Like this:
//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;
Why don't you give this a try:
using System;
using System.Text.RegularExpressions;
public class Example
{
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
try {
return Regex.Replace(strIn, @"[^\w\.@-]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException) {
return String.Empty;
}
}
}
Here's a link for it as well.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx
A totally different approach would be to look at the Open XML SDK.
This example should get you started.
I am making a small web analysis tool and need to somehow extract all the text blocks on a given URL that contain more than X words.
The method I currently use is this:
public string getAllText(string _html)
{
string _allText = "";
try
{
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(_html);
var root = document.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
_allText = sb.ToString();
}
catch (Exception)
{
}
_allText = System.Web.HttpUtility.HtmlDecode(_allText);
return _allText;
}
The problem here is that I get all the text returned, even if it's menu text, footer text with 3 words, etc.
I want to analyse the actual content on a page, so my idea is to somehow only parse the text that could be content (i.e. text blocks with more than X words).
Any ideas how this could be achieved?
Well, a first approach can be a simple word count analysis on each node.InnerText value using the string.Split function:
string[] words;
words = text.Split((string[]) null, StringSplitOptions.RemoveEmptyEntries);
and append only text where words.Length is larger than 3.
Also see this question answer for some more tricks in raw text gathering.
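A minimal sketch of that word-count filter, assuming the text blocks have already been extracted into strings (for example one per node.InnerText). ContentFilter, CountWords and BlocksWithMoreThan are hypothetical names, and "word" here just means a run of non-whitespace characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContentFilter
{
    // Passing a null separator array makes Split use whitespace as the delimiter.
    public static int CountWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Keeps only the blocks that contain more than minWords words.
    public static IEnumerable<string> BlocksWithMoreThan(IEnumerable<string> blocks, int minWords)
    {
        return blocks.Where(b => CountWords(b) > minWords);
    }
}
```

Inside getAllText, the check could go right before sb.AppendLine, for example if (ContentFilter.CountWords(text) > 3) sb.AppendLine(text.Trim());. A plain word count will still let long menus through, so treat the threshold as a starting point.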