how to split the string between two strings in c#?

I have a string variable that contains HTML data. I want to split that HTML string into multiple strings and then merge those strings into a single one.
This is html string:
<p><span style="text-decoration: underline; color: #ff0000;"><strong>para1</strong></span></p>
<p style="text-align: center;"><strong><span style="color: #008000;">para2</span> स्द्स्द्सद्स्द para2 again<br /></strong></p>
<p style="text-align: left;"><strong><span style="color: #0000ff;">para3</span><br /></strong></p>
And this is my expected output:
<p><span style="text-decoration: underline; color: #ff0000;"><strong>para1</strong></span><strong><span style="color: #008000;">para2</span>para2 again<br /></strong><strong><span style="color: #0000ff;">para3</span><br /></strong></p>
My split logic is as follows:
1. Split the HTML string into tokens on the </p> tag.
2. Take the first token and store it in a separate string variable (firstPara).
3. Take each remaining token, remove any tag starting with <p as well as the closing </p>, and store each value in a separate variable.
4. Take firstPara, remove its </p> tag, and append every token obtained in step 3.
5. The variable firstPara now holds the whole value.
6. Finally, append </p> to the end of firstPara.
This is my problem. Could you please help me solve it?

Here is a regex example of how to do it:
String pattern = @"(?<=<p.*>).*(?=</p>)";
var matches = Regex.Matches(text, pattern);
StringBuilder result = new StringBuilder();
result.Append("<p>");
foreach (Match match in matches)
{
result.Append(match.Value);
}
result.Append("</p>");
And this is how you should do it with Html Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodes = doc.DocumentNode.SelectNodes("//p");
StringBuilder result = new StringBuilder();
result.Append("<p>");
foreach (HtmlNode node in nodes)
{
result.Append(node.InnerHtml);
}
result.Append("</p>");

If you would like to split a string by another string, you may use string.Split(string[] separator, StringSplitOptions options) where separator is a string array which contains at least one string that will be used to split the string
Example
//Initialize a string of name HTML as our HTML code
string HTML = "<p><span style=\"text-decoration: underline; color: #ff0000;\"><strong>para1</strong></span></p> <p style=\"text-align: center;\"><strong><span style=\"color: #008000;\">para2</span> स्द्स्द्सद्स्द para2 again<br /></strong></p> <p style=\"text-align: left;\"><strong><span style=\"color: #0000ff;\">para3</span><br /></strong></p>";
//Initialize a string array of name strSplit to split HTML with </p>
string[] strSplit = HTML.Split(new string[] { "</p>" }, StringSplitOptions.None);
//Initialize a string of name expectedOutput
string expectedOutput = "";
string stringToAppend = "";
//Initialize i as an int. Continue if i is less than strSplit.Length. Increment i by 1 each time you continue
for (int i = 0; i < strSplit.Length; i++)
{
if (i >= 1) //Continue if the index is greater or equal to 1; from the second item to the last item
{
stringToAppend = strSplit[i].Replace("<p", "<"); //Replace <p by <
}
else //Otherwise
{
stringToAppend = strSplit[i]; //Don't change anything in the string
}
//Append strSplit[i] to expectedOutput
expectedOutput += stringToAppend;
}
//Append </p> at the end of the string
expectedOutput += "</p>";
//Write the output to the Console
Console.WriteLine(expectedOutput);
Console.Read();
Output
<p><span style="text-decoration: underline; color: #ff0000;"><strong>para1</stro
ng></span> < style="text-align: center;"><strong><span style="color: #008000;">p
ara2</span> ?????????????? para2 again<br /></strong> < style="text-align: left;
"><strong><span style="color: #0000ff;">para3</span><br /></strong></p>
NOTICE: The console's output encoding could not represent the Unicode characters स्द्स्द्सद्स्द, so they were printed as ??????????????.
Thanks,
I hope you find this helpful :)
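Note that Replace("<p", "<") in the answer above leaves fragments such as < style="text-align: center;"> behind, as the printed output shows. The following is a minimal sketch of one way to drop the whole opening tag instead. The class and method names (ParagraphMerger, MergeParagraphs) are hypothetical, and the approach assumes simple, non-nested <p> blocks like the ones in the question; for arbitrary HTML a real parser such as Html Agility Pack is the safer choice.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ParagraphMerger
{
    // Hypothetical helper: merges the contents of all <p> blocks into a single <p>.
    // Assumes flat, non-nested paragraphs as in the question.
    public static string MergeParagraphs(string html)
    {
        var result = new StringBuilder();
        string[] parts = html.Split(new[] { "</p>" }, StringSplitOptions.RemoveEmptyEntries);
        bool first = true;
        foreach (string rawPart in parts)
        {
            string part = rawPart.Trim();
            if (part.Length == 0) continue; // skip whitespace between paragraphs
            if (!first)
            {
                // Remove a leading "<p>" or "<p attr=...>" tag entirely
                part = Regex.Replace(part, @"^<p(\s[^>]*)?>", "");
            }
            result.Append(part);
            first = false;
        }
        result.Append("</p>"); // close the single merged paragraph
        return result.ToString();
    }
}
```

The first token keeps its opening <p ...> tag, so the merged paragraph inherits the attributes of the first paragraph only.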

Related

C# String Remove characters between two characters

For example, I have a string like:
"//RemoveFromhere
<div>
<p>my name is blawal i want to remove this div </p>
</div>
//RemoveTohere"
I want to use //RemoveFromhere as the starting point and //RemoveTohere as the ending point,
and remove all the characters in between.

var sin = "BEFORE//RemoveFromhere" +
          "<div>" +
          "<p>my name is blawal i want to remove this div </p>" +
          "</div>" +
          "//RemoveTohereAFTER";
const string fromId = "//RemoveFromhere";
const string toId = "//RemoveTohere";
var fromIndex = sin.IndexOf(fromId);
var toIndex = sin.IndexOf(toId);
// Only remove anything when both markers exist and appear in the right order
if (fromIndex > -1 && toIndex > fromIndex)
{
    // Keep the markers and remove only what lies between them
    var from = fromIndex + fromId.Length;
    Console.WriteLine(sin.Remove(from, toIndex - from));
    // OR remove the from/to markers as well
    Console.WriteLine(sin.Remove(fromIndex, toIndex + toId.Length - fromIndex));
}
This gives results BEFORE//RemoveFromhere//RemoveTohereAFTER and BEFOREAFTER
See also a more general (better) option using regular expressions from Cetin Basoz added after this answer was accepted.

void Main()
{
    string pattern = @"\n{0,1}//RemoveFromhere(.|\n)*?//RemoveTohere\n{0,1}";
    var result = Regex.Replace(sample, pattern, "");
    Console.WriteLine(result);
}
static string sample = @"Here
//RemoveFromhere
<div>
<p>my name is blawal i want to remove this div </p>
</div>
//RemoveTohere
Keep this.
//RemoveFromhere
<div>
<p>my name is blawal i want to remove this div </p>
</div>
//RemoveTohere
Keep this too.
";

Acting on the indentation of a c# multiline string

I want to write some Html from c# (html is an example, this might be other languages..)
For example:
        string div = @"<div class=""className"">
                          <span>Mon text</span>
                       </div>";
will produce:
<div class="className">
                          <span>Mon text</span>
                       </div>
that's not very cool from the HTML point of view...
The only way to get correctly indented HTML is to indent the C# code like this:
        string div = @"<div class=""className"">
  <span>Mon text</span>
</div>";
We get the correctly indented HTML:
<div class="className">
  <span>Mon text</span>
</div>
But indenting the C# like this really broke the readability of the code...
Is there a way to act on the indentation in the C# language ?
If not, does someone have a tip better than :
string div = "<div class=\"className\">" + Environment.NewLine +
" <span>Mon text</span>" + Environment.NewLine +
"</div>";
and better than
var sbDiv = new StringBuilder();
sbDiv.AppendLine("<div class=\"className\">");
sbDiv.AppendLine(" <span>Mon text</span>");
sbDiv.AppendLine("</div>");
What I use as a solution:
Many thanks to @Yotam for his answer.
I wrote a little extension method to make the alignment "dynamic":
/// <summary>
/// Align a multiline string from the indentation of its first line
/// </summary>
/// <remarks>The </remarks>
/// <param name="source">The string to align</param>
/// <returns></returns>
public static string AlignFromFirstLine(this string source)
{
if (String.IsNullOrEmpty(source)) {
return source;
}
if (!source.StartsWith(Environment.NewLine)) {
throw new FormatException("String must start with a NewLine character.");
}
int indentationSize = source.Skip(Environment.NewLine.Length)
.TakeWhile(Char.IsWhiteSpace)
.Count();
string indentationStr = new string(' ', indentationSize);
return source.TrimStart().Replace($"\n{indentationStr}", "\n");
}
Then I can use it like this:
private string GetHtml(string className)
{
    return $@"
        <div class=""{className}"">
          <span>Texte</span>
        </div>".AlignFromFirstLine();
}
That returns the correct HTML:
<div class="myClassName">
  <span>Texte</span>
</div>
One limitation is that it will only work with space indentation...
Any improvement will be welcome !

You could wrap the string to the next line to get the desired indentation:
        string div =
@"
<div class=""className"">
  <span>Mon text</span>
</div>"
        .TrimStart(); // to remove the additional new-line at the beginning
Another nice solution (disadvantage: depends on the indentation level!)
        string div = @"
        <div class=""className"">
          <span>Mon text</span>
        </div>".TrimStart().Replace("\n        ", "\n");
It just removes the indentation from the string. Make sure the number of spaces in the first argument of Replace matches the number of spaces in your indentation.

I like this solution more, but how about:
string div = "<div class='className'>\n"
+ " <span>Mon text</span>\n"
+ "</div>";
This gets rid of some clutter:
Replace " inside strings with ' so that you don't need to escape the quote. (Single quotes in HTML appear to be legal.)
You can then also use regular "" string literals instead of #"".
Use \n instead of Environment.NewLine.
Note that the string concatenation is performed during compilation, by the compiler. (See also this and this blog post on the subject by Eric Lippert, who previously worked on the C# compiler.) There is no runtime performance penalty.

Inspired by trimIndent() in Kotlin.
This code:
var x = @"
       anything
         you
       want
       ".TrimIndent();
will produce a string:

anything
  you
want

or "\nanything\n  you\nwant\n"
Implementation:
public static string TrimIndent(this string s)
{
string[] lines = s.Split('\n');
IEnumerable<int> firstNonWhitespaceIndices = lines
.Skip(1)
.Where(it => it.Trim().Length > 0)
.Select(IndexOfFirstNonWhitespace);
int firstNonWhitespaceIndex;
if (firstNonWhitespaceIndices.Any()) firstNonWhitespaceIndex = firstNonWhitespaceIndices.Min();
else firstNonWhitespaceIndex = -1;
if (firstNonWhitespaceIndex == -1) return s;
IEnumerable<string> unindentedLines = lines.Select(it => UnindentLine(it, firstNonWhitespaceIndex));
return String.Join("\n", unindentedLines);
}
private static string UnindentLine(string line, int firstNonWhitespaceIndex)
{
if (firstNonWhitespaceIndex < line.Length)
{
if (line.Substring(0, firstNonWhitespaceIndex).Trim().Length != 0)
{
return line;
}
return line.Substring(firstNonWhitespaceIndex, line.Length - firstNonWhitespaceIndex);
}
return line.Trim().Length == 0 ? "" : line;
}
private static int IndexOfFirstNonWhitespace(string s)
{
char[] chars = s.ToCharArray();
for (int i = 0; i < chars.Length; i++)
{
if (chars[i] != ' ' && chars[i] != '\t') return i;
}
return -1;
}

If it is one long string, you can always keep it in a text file and read it into your variable, e.g.
string text = File.ReadAllText(@"c:\file.txt", Encoding.UTF8);
This way you can format it any way you want using a text editor, and it won't negatively affect the look of your code.
If you're changing parts of the string on the fly, StringBuilder is your best option. Or, if you did decide to read the string from a text file, you could include {0} placeholders in it and then use string.Format(text, "text1", "text2", etc.) to fill in the required parts.
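If you can target C# 11 or later, raw string literals address the original question directly: the compiler removes the whitespace that precedes the closing """ from every line of the literal, so the C# code can stay indented while the resulting string is not. A minimal sketch, assuming a C# 11+ compiler; the class and method names (HtmlSnippets, GetDiv) are hypothetical:

```csharp
public static class HtmlSnippets
{
    // Hypothetical example: an interpolated raw string literal (C# 11+).
    // The indentation of the closing """ (12 spaces here) is stripped from each line,
    // and double quotes inside the literal need no escaping.
    public static string GetDiv(string className)
    {
        return $"""
            <div class="{className}">
              <span>Mon text</span>
            </div>
            """;
    }
}
```

The line breaks right after the opening """ and right before the closing """ are not part of the string, so no TrimStart() call is needed.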

Extract heading text from HTML text

I have a textarea that uses the tinyMCE text editor to make it a rich text editor. I want to extract the text of all headings (H1, H2, etc.) without style and formatting.
Suppose that txtEditor.InnerText gives me value like below:
<p><span style="font-family: comic sans ms,sans-serif; color: #993366; font-size: large; background-color: #33cccc;">This is before heading one</span></p>
<h1><span style="font-family: comic sans ms,sans-serif; color: #993366;">Hello This is Headone</span></h1>
<p>this is before heading2</p>
<h2>This is heading2</h2>
I want to get a list of the heading tags' text only. Any suggestion or guidance will be appreciated.

Use HtmlAgilityPack, and then it's easy :
var doc = new HtmlDocument();
doc.LoadHtml(txtEditor.InnerText);
var h1Elements = doc.DocumentNode.Descendants("h1").Select(nd => nd.InnerText);
string h1Text = string.Join(" ", h1Elements);

referencing Regular Expression to Read Tags in HTML
I believe that this is close to what you are looking for:
String h1Regex = "<h[1-5][^>]*?>(?<TagText>.*?)</h[1-5]>";
MatchCollection mc = Regex.Matches(html, h1Regex);
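The pattern above stops at h5 and allows a closing tag that differs from the opening one, and the captured TagText still contains inner markup such as <span>. The following is a rough sketch of a variant that covers h1 to h6, requires matching open and close tags via a backreference, and strips nested tags from the captured text. The names HeadingExtractor and GetHeadingTexts are hypothetical, and like any regex-based approach it assumes reasonably well-formed input such as the editor output in the question.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeadingExtractor
{
    // Hypothetical helper: returns the plain text of every h1..h6 element.
    public static List<string> GetHeadingTexts(string html)
    {
        var headings = new List<string>();
        // \1 forces the closing tag to match the opening heading tag
        var pattern = new Regex(@"<(h[1-6])\b[^>]*>(?<TagText>.*?)</\1\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in pattern.Matches(html))
        {
            string inner = m.Groups["TagText"].Value;
            // Remove nested tags such as <span ...> and decode entities like &amp;
            string text = Regex.Replace(inner, "<[^>]+>", "");
            headings.Add(WebUtility.HtmlDecode(text).Trim());
        }
        return headings;
    }
}
```

For the sample in the question this should yield the two strings "Hello This is Headone" and "This is heading2".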

Removing a Selective Section from a String

I need help figuring out how to remove only the very last "</span>" tag from a string. Here is an example of what one of the strings might look like, but sometimes there are a few:
<DIV style="TEXT-ALIGN: center"><span style="text-decoration:underline;"> some text </span> </span></DIV>

var originalString = @"<DIV style='TEXT-ALIGN: center'><span style='text-decoration:underline;'> some text </span> </span></DIV>";
const string tag = "</span>";
var lastIndex = originalString.LastIndexOf(tag);
// LastIndexOf returns -1 when the tag is absent; leave the string unchanged in that case
var newString = lastIndex < 0 ? originalString : originalString.Remove(lastIndex, tag.Length);

Use this regex, which matches only a </span> that is not followed by another one: </span>(?!.*</span>)

Parsing big string (HTML code)

I'm looking to parse some information on my application.
Let's say we have somewhere in that string:
<tr class="tablelist_bg1">
<td>Beja</td>
<td class="text_center">---</td>
<td class="text_center">19.1</td>
<td class="text_center">10.8</td>
<td class="text_center">NW</td>
<td class="text_center">50.9</td>
<td class="text_center">0</td>
<td class="text_center">1016.6</td>
<td class="text_center">---</td>
<td class="text_center">---</td>
</tr>
All rest that's above or below this doesn't matter. Remember this is all inside a string.
I want to get the values inside the td tags: ---, 19.1, 10.8, etc.
Worth knowing that there are many entries like this on the page.
Probably also a good idea to link the page here.
As you probably guessed I have absolutely no idea how to do this... none of the functions I know I can perform over the string (split etc.) help.
Thanks in advance

Just use String.IndexOf(string, int) to find a "<td", again to find the next ">", and again to find "</td>". Then use String.Substring to pull out a value. Put this in a loop.
public static List<string> ParseTds(string input)
{
List<string> results = new List<string>();
int index = 0;
while (true)
{
string next = ParseTd(input, ref index);
if (next == null)
return results;
results.Add(next);
}
}
private static string ParseTd(string input, ref int index)
{
int tdIndex = input.IndexOf("<td", index);
if (tdIndex == -1)
return null;
int gtIndex = input.IndexOf(">", tdIndex);
if (gtIndex == -1)
return null;
int endIndex = input.IndexOf("</td>", gtIndex);
if (endIndex == -1)
return null;
index = endIndex;
return input.Substring(gtIndex + 1, endIndex - gtIndex - 1);
}

Assuming your string is valid XHTML, you can use an XML parser to get the content you want. There's a simple example here that shows how to use XmlTextReader to parse XML content. The example reads from a file, but you can change it to read from a string:
new XmlTextReader(new StringReader(someString));
You specifically want to keep track of td element nodes, and the text node that follows them will contain the values you want.
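As a rough sketch of the same idea, the snippet below uses LINQ to XML (XElement) rather than XmlTextReader, since it needs less bookkeeping for this task. It assumes the input is a single well-formed XML fragment such as one <tr> row; XElement.Parse throws an XmlException for anything that is not valid XML, including HTML-only entities like &nbsp;. The names TableRowParser and GetCellValues are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableRowParser
{
    // Hypothetical helper: returns the text of every <td> inside a well-formed fragment.
    public static List<string> GetCellValues(string rowXml)
    {
        // Throws System.Xml.XmlException if rowXml is not well-formed XML
        XElement row = XElement.Parse(rowXml);
        return row.Descendants("td")
                  .Select(td => td.Value.Trim())
                  .ToList();
    }
}
```

To process a whole page you would first have to isolate each row, or use an HTML-tolerant parser, because real-world pages are rarely valid XHTML.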

Use a loop to load each non-empty line from the file into a string.
Process the string character by character.
Check for the characters indicating the beginning of a td tag.
Use a substring function, or just build a new string character by character, to get all the content until the </td> tag begins.
