For example, I have a string like:
"//RemoveFromhere
<div>
<p>my name is blawal i want to remove this div </p>
</div>
//RemoveTohere"
I want to use //RemoveFromhere as the starting point and //RemoveTohere as the ending point, and remove every character in between.
var sin = "BEFORE//RemoveFromhere"+
"<div>"+
"<p>my name is blawal i want to remove this div </p>"+
"</div>"+
"//RemoveTohereAFTER";
const string fromId = "//RemoveFromhere";
const string toId = "//RemoveTohere";
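var from = sin.IndexOf(fromId);
var to = sin.IndexOf(toId);
// Only proceed when both markers exist and the end marker comes after the start marker
if (from > -1 && to >= from + fromId.Length)
{
    // Variant 1: keep the markers and remove only the text between them
    var contentStart = from + fromId.Length;
    Console.WriteLine(sin.Remove(contentStart, to - contentStart));
    // Variant 2: remove the from/to markers as well
    Console.WriteLine(sin.Remove(from, to + toId.Length - from));
}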
This gives the results BEFORE//RemoveFromhere//RemoveTohereAFTER and BEFOREAFTER.
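The same idea can be wrapped in a small helper. This is only a sketch: the class name MarkerRemover, the method name RemoveBetween and the includeMarkers flag are made up for illustration and are not part of the original answer. It removes the first marked block only and returns the input unchanged when either marker is missing.

```csharp
using System;

public static class MarkerRemover
{
    // Hypothetical helper: removes the first block delimited by fromId/toId.
    // includeMarkers == true also removes the markers themselves.
    public static string RemoveBetween(string source, string fromId, string toId, bool includeMarkers)
    {
        int fromIndex = source.IndexOf(fromId, StringComparison.Ordinal);
        if (fromIndex < 0) return source;

        // Look for the end marker only after the start marker.
        int contentStart = fromIndex + fromId.Length;
        int toIndex = source.IndexOf(toId, contentStart, StringComparison.Ordinal);
        if (toIndex < 0) return source;

        int start = includeMarkers ? fromIndex : contentStart;
        int end = includeMarkers ? toIndex + toId.Length : toIndex;
        return source.Remove(start, end - start);
    }
}
```

Calling MarkerRemover.RemoveBetween(sin, fromId, toId, true) on the string above should give BEFOREAFTER.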
See also a more general (better) option using regular expressions from Cetin Basoz added after this answer was accepted.
void Main()
{
    string pattern = @"\n{0,1}//RemoveFromhere(.|\n)*?//RemoveTohere\n{0,1}";
    var result = Regex.Replace(sample, pattern, "");
    Console.WriteLine(result);
}
static string sample = @"Here
//RemoveFromhere
<div>
<p>my name is blawal i want to remove this div </p>
</div>
//RemoveTohere
Keep this.
//RemoveFromhere
<div>
<p>my name is blawal i want to remove this div </p>
</div>
//RemoveTohere
Keep this too.
";
I am reading content from a text file that contains the following:
<ID> test data </Id> <Sub_Tab> test data </sub_tab> <form> form data </form>
My requirement is to strip the leading and trailing spaces from the content inside the ID and Sub_Tab tags, while leaving the content inside the form tag untouched. My output should be:
<ID>test data</Id> <Sub_Tab>test data</sub_tab> <form> form data </form>
I tried many patterns, but none of them worked:
Regex regex = new Regex(@"/>[ \t]+</");
string newContent = regex.Replace(fileContent, "><");
This kind of feels like overkill. Maybe because it's an overkill?
Anyway, you might be able to do this easily using regex. But at this time, I'm not familiar with regex.
So, this is my solution to your problem. Here it comes.
string input = "<ID> test data </Id> <Sub_Tab> test data </sub_tab> <form> form data </form>";
string find = "ƸƷ";
// ƸƷ is a placeholder. It can be replaced with any unique string, but for this
// approach to work that string must not occur in the input string,
// or it will break the replace step. This could be done without a
// placeholder, but it would require more code, so I'm going with this.
string str = input;
IList<string> strList = new List<string>();
// Remove all content inside the form tags
while (true) {
if ((str.Contains("<form>")) && (str.Contains("</form>"))) {
int start = str.IndexOf("<form>");
int end = str.IndexOf("</form>");
string result = str.Substring(start, end - start + 7); // 7 = "</form>".Length
str = str.Replace(result, find);
strList.Add(result);
} else {
break;
}
}
// Manipulate the data
str = str.Replace(" <", "<").Replace("> ", ">");
// Add the contents inside the form tags
foreach(string val in strList) {
int place = str.IndexOf(find);
str = str.Remove(place, find.Length).Insert(place, val);
}
Console.WriteLine("Input String: " + input);
Console.WriteLine("Output String: " + str);
Example 01
<ID> test data </Id> <Sub_Tab> test data </sub_tab> <form> form data </form>
<ID>test data</Id><Sub_Tab>test data</sub_tab><form> form data </form>
Example 02
<ID> test data </Id> <Sub_Tab> test data </sub_tab> <form> form data <div> data </div> </form> <br>
<ID>test data</Id><Sub_Tab>test data</sub_tab><form> form data <div> data </div> </form><br>
Example 03
<ID> test data </Id> <form> <span> date </span> </form> <Sub_Tab> test data </sub_tab> <form> form data </form>
<ID>test data</Id><form> <span> date </span> </form><Sub_Tab>test data</sub_tab><form> form data </form>
Online Demo: https://rextester.com/FZU31740
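For comparison, here is a rough sketch of the regex route mentioned at the top of this answer. It is not the answerer's code: the class and method names are invented, and it assumes the tags are written exactly as in the question (no attributes, no nesting). It trims only the content of ID and Sub_Tab and leaves the spaces between tags alone, which is what the question's expected output shows.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagTrimmer
{
    // Hypothetical helper: trims whitespace just inside <ID>...</ID> and
    // <Sub_Tab>...</Sub_Tab> (case-insensitive). Everything else, including
    // <form>...</form>, is left as it is.
    public static string TrimInsideTags(string input)
    {
        // Group 1: opening tag, group 2: content without surrounding
        // whitespace, group 3: closing tag.
        const string pattern = @"(<(?:ID|Sub_Tab)>)\s*(.*?)\s*(</(?:ID|Sub_Tab)>)";
        return Regex.Replace(input, pattern, "$1$2$3",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

One known gap of this sketch: an ID or Sub_Tab tag nested inside a form tag would be trimmed too.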
I want to write some HTML from C# (HTML is just an example; it could be another language).
For example:
        string div = @"<div class=""className"">
                          <span>Mon text</span>
                       </div>";
will produce:
<div class="className">
                          <span>Mon text</span>
                       </div>
which is not very nice from the HTML point of view...
The only way to get correctly indented HTML is to indent the C# code like this:
        string div = @"<div class=""className"">
  <span>Mon text</span>
</div>";
We get the correctly indented HTML:
<div class="className">
  <span>Mon text</span>
</div>
But indenting the C# like this really hurts the readability of the code...
Is there a way to control the indentation in the C# language?
If not, does someone have a better tip than:
string div = "<div class=\"className\">" + Environment.NewLine +
" <span>Mon text</span>" + Environment.NewLine +
"</div>";
and better than
var sbDiv = new StringBuilder();
sbDiv.AppendLine("<div class=\"className\">");
sbDiv.AppendLine(" <span>Mon text</span>");
sbDiv.AppendLine("</div>");
What I use as a solution:
Many thanks to @Yotam for his answer.
I wrote a small extension method to make the alignment "dynamic":
/// <summary>
/// Align a multiline string from the indentation of its first line
/// </summary>
/// <remarks>The </remarks>
/// <param name="source">The string to align</param>
/// <returns></returns>
public static string AlignFromFirstLine(this string source)
{
if (String.IsNullOrEmpty(source)) {
return source;
}
if (!source.StartsWith(Environment.NewLine)) {
throw new FormatException("String must start with a NewLine character.");
}
int indentationSize = source.Skip(Environment.NewLine.Length)
.TakeWhile(Char.IsWhiteSpace)
.Count();
string indentationStr = new string(' ', indentationSize);
return source.TrimStart().Replace($"\n{indentationStr}", "\n");
}
Then I can use it like this:
private string GetHtml(string className)
{
    return $@"
            <div class=""{className}"">
              <span>Texte</span>
            </div>".AlignFromFirstLine();
}
That returns the correct HTML:
<div class="myClassName">
  <span>Texte</span>
</div>
One limitation is that it only works with space indentation...
Any improvement is welcome!
You could wrap the string to the next line to get the desired indentation:
string div =
@"
<div class=""className"">
<span>Mon text</span>
</div>"
.TrimStart(); // to remove the additional new-line at the beginning
Another nice solution (disadvantage: depends on the indentation level!)
string div = @"
    <div class=""className"">
      <span>Mon text</span>
    </div>".TrimStart().Replace("\n    ", "\n");
It just removes the indentation from the string. Make sure the number of spaces in the first argument of Replace matches the number of spaces in your indentation.
I like this solution more, but how about:
string div = "<div class='className'>\n"
+ " <span>Mon text</span>\n"
+ "</div>";
This gets rid of some clutter:
Replace " inside strings with ' so that you don't need to escape the quote. (Single quotes in HTML appear to be legal.)
You can then also use regular "" string literals instead of #"".
Use \n instead of Environment.NewLine.
Note that the string concatenation is performed during compilation, by the compiler. (See also this and this blog post on the subject by Eric Lippert, who previously worked on the C# compiler.) There is no runtime performance penalty.
Inspired by trimIndent() in Kotlin.
This code:
var x = @"
    anything
     you
    want
    ".TrimIndent();
will produce a string:

anything
 you
want

or "\nanything\n you\nwant\n"
Implementation:
public static string TrimIndent(this string s)
{
string[] lines = s.Split('\n');
IEnumerable<int> firstNonWhitespaceIndices = lines
.Skip(1)
.Where(it => it.Trim().Length > 0)
.Select(IndexOfFirstNonWhitespace);
int firstNonWhitespaceIndex;
if (firstNonWhitespaceIndices.Any()) firstNonWhitespaceIndex = firstNonWhitespaceIndices.Min();
else firstNonWhitespaceIndex = -1;
if (firstNonWhitespaceIndex == -1) return s;
IEnumerable<string> unindentedLines = lines.Select(it => UnindentLine(it, firstNonWhitespaceIndex));
return String.Join("\n", unindentedLines);
}
private static string UnindentLine(string line, int firstNonWhitespaceIndex)
{
if (firstNonWhitespaceIndex < line.Length)
{
if (line.Substring(0, firstNonWhitespaceIndex).Trim().Length != 0)
{
return line;
}
return line.Substring(firstNonWhitespaceIndex, line.Length - firstNonWhitespaceIndex);
}
return line.Trim().Length == 0 ? "" : line;
}
private static int IndexOfFirstNonWhitespace(string s)
{
char[] chars = s.ToCharArray();
for (int i = 0; i < chars.Length; i++)
{
if (chars[i] != ' ' && chars[i] != '\t') return i;
}
return -1;
}
If it is one long string, you can always keep it in a text file and read it into your variable, e.g.
string text = File.ReadAllText(@"c:\file.txt", Encoding.UTF8);
This way you can format it any way you want using a text editor, and it won't negatively affect the look of your code.
If you're changing parts of the string on the fly, then StringBuilder is your best option. Or, if you did decide to read the string in from a text file, you could include {0} placeholders in your string and then use string.Format(text, "text1", "text2", etc.) to fill in the required parts.
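A small sketch of the placeholder idea, under the assumption that the template text has already been loaded. The HtmlTemplate class and the inline constant are stand-ins invented here so the example runs without a file on disk; in practice the text would come from File.ReadAllText as above.

```csharp
using System;

public static class HtmlTemplate
{
    // Stand-in for the contents of the text file (assumption for this sketch).
    // {0} and {1} are the parts that change at runtime.
    public const string Div = "<div class=\"{0}\">\n  <span>{1}</span>\n</div>";

    public static string Render(string className, string text)
    {
        return string.Format(Div, className, text);
    }
}
```

Keep in mind that string.Format treats every brace as part of a placeholder, so literal { and } in the template (inline CSS or JavaScript, for instance) have to be written as {{ and }}.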
I am trying to parse a website's HTML and then get text between two strings.
I wrote a small function to get text between two strings.
public string getBetween(string strSource, string strStart, string strEnd)
{
    int startIndex = strSource.IndexOf(strStart);
    if (startIndex < 0)
    {
        return string.Empty;
    }
    int start = startIndex + strStart.Length;
    // Look for the end marker only after the start marker; it may be missing there
    int end = strSource.IndexOf(strEnd, start);
    if (end < 0)
    {
        return string.Empty;
    }
    return strSource.Substring(start, end - start);
}
I have the HTML stored in a string called 'html'. Here is a part of the HTML that I am trying to parse:
<div class="info">
<div class="content">
<div class="address">
<h3>Andrew V. Kenny</h3>
<div class="adr">
67 Romines Mill Road<br/>Dallas, TX 75204 </div>
</div>
<p>Curious what <strong>Andrew</strong> means? Click here to find out!</p>
So, I use my function like this.
string m2 = getBetween(html, "<div class=\"address\">", "<p>Curious what");
string fullName = getBetween(m2, "<h3>", "</h3>");
string fullAddress = getBetween(m2, "<div class=\"adr\">", "<br/>");
string city = getBetween(m2, "<br/>", "</div>");
The output of the full name works fine, but the others have additional spaces in them for some reason. I tried various ways to avoid them (such as completely copying the spaces from the source and adding them in my function) but it didn't work.
I get an output like this:
fullName = "Andrew V. Kenny"
fullAddress = " 67 Romines Mill Road"
city = "Dallas, TX 75204 "
There are spaces in the city and address which I don't know how to avoid.
Trim the strings and the unnecessary spaces will be gone:
fullName = fullName.Trim();
fullAddress = fullAddress.Trim();
city = city.Trim();
I have written a function that replaces the value of href with some value + the original href value. For example:
<a href="/somepage.htm" id="test">
is replaced with
<a href="http://www.stackoverflow.com/somepage.htm" id="test">
Cases where no replacement is needed:
<a href="http://www.stackoverflow.com/somepage.htm" id="test">
<a href="#" id="test">
<a href="javascript:alert('test');" id="test">
<a href="" id="test">
I have written the following method. It works for all of these cases except the empty href value:
public static string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
if (String.IsNullOrEmpty(text))
{
return text;
}
String value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
return value.Replace(absoluteUrl + "/", absoluteUrl);
}
I wrote (?!http|javascript|#) to ignore http, javascript and #, so it works for those cases. But consider the following part:
(?!http|javascript|#)(.*?)
If I replace the * with +:
(?!http|javascript|#)(.+?)
it still does not work for the empty case.
Changing * to + does not work, because you got it completely wrong:
* means "zero or more"
+ means "one or more"
So with + you are forcing some content to be there, rather than allowing the content to be missing.
Another thing you got wrong is the placement. The * at that place refers to the . (dot) before it. Together, they mean "zero or more characters". So this part already does not require any content. Therefore, since your regex currently does not work with empty content, something else seems to be requiring it.
Looking at the preceding expressions:
(?!http|javascript|#)(.*?)
The ?! is a zero-width negative lookahead. Zero-width. Negative. That means that it will not require any content either.
So, I got your code, pasted it into the online compiler, then I fed it with your example <a href="" id="test">:
using System.IO;
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
string text = "<a href=\"\" id=\"test\">";
string pattern = "src|href";
string absoluteUrl = "YADA";
string value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Console.WriteLine(value);
}
}
and guess what, it works:
Compiling the source code....
$mcs main.cs -out:demo.exe 2>&1
Executing the program....
$mono demo.exe
<a href="YADA" id="test">
So, either you are not telling the truth, or you changed the code when posting it here, or you've got something else completely messed up in your code, sorry.
EDIT:
So, it turned out that the href="" was meant to be ignored.
Then the simplest thing you can do is to add another negative lookahead that blocks the href="" case explicitly. However, note that the placement of that group will be different. The current group is inside the quotes of the href, so it cannot "peek" at what the whole quoted href value looks like. The new group must be before the quotes.
"<(.*?)(" + pattern + ")=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>"
Note that just before the first quote of the href, I've added a (?!\"\") that ensures that "there will be no such case that a quote follows a quote".
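To check the revised pattern against the cases from the question, something like the following can be used. It is only a sketch: the HrefRewriter class and MakeAbsolute method are names made up here, the pattern is fixed to src|href, and the trailing value.Replace call from the original method is left out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Hypothetical wrapper around the revised regex from this answer.
    // (?!"") in front of the opening quote skips empty href/src values.
    public static string MakeAbsolute(string tag, string absoluteUrl)
    {
        const string pattern =
            "<(.*?)(src|href)=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>";
        return Regex.Replace(tag, pattern,
            "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase);
    }
}
```

With absoluteUrl set to http://www.stackoverflow.com, the tag with href="/somepage.htm" should be rewritten, while the #, javascript:, absolute and empty href tags should come back unchanged.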
I know that you are asking for RegEx.
But here is an alternative, because I think using Uri.IsWellFormedUriString is worth it.
This way you can also reuse the helper functions:
public string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
if (isHrefRelativeURIPath(text)){
text = absoluteUrl + "/" + System.Text.RegularExpressions.Regex.Replace(text, @"^\/+", ""); // strip leading slashes from the relative path
}
return text;
}
public bool isHrefRelativeURIPath(string value) {
if (isLink(value) ||
value.StartsWith("#") ||
value.StartsWith("javascript"))
{
return false;
}
// Others Custom exclusions
return true;
}
public bool isLink(string value) {
if (String.IsNullOrEmpty(value))
return false;
return Uri.IsWellFormedUriString("http://" + value, UriKind.Absolute);
}
I need help figuring out how to remove only the very last "</span>" tag from a string. Here is an example of what one of the strings might look like, but sometimes there are a few:
<DIV style="TEXT-ALIGN: center"><span style="text-decoration:underline;"> some text </span> </span></DIV>
var originalString = @"<DIV style='TEXT-ALIGN: center'><span style='text-decoration:underline;'> some text </span> </span></DIV>";
var lastIndex = originalString.LastIndexOf("</span>");
// LastIndexOf returns -1 when the tag is missing, so guard before cutting
var newString = lastIndex < 0 ? originalString : originalString.Remove(lastIndex, "</span>".Length);
Use this regex, which matches a </span> only when no other </span> follows it: </span>(?![\s\S]*</span>)
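If this comes up in more than one place, the LastIndexOf approach can be put into a helper. A minimal sketch, with invented names (LastOccurrence, RemoveLast), that returns the input unchanged when the value is not found:

```csharp
using System;

public static class LastOccurrence
{
    // Hypothetical helper: removes only the last occurrence of value.
    public static string RemoveLast(string source, string value)
    {
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(value))
        {
            return source;
        }
        int index = source.LastIndexOf(value, StringComparison.Ordinal);
        // -1 means the value does not occur at all.
        return index < 0 ? source : source.Remove(index, value.Length);
    }
}
```

For the string in the question, LastOccurrence.RemoveLast(originalString, "</span>") drops the stray closing tag just before </DIV> and keeps the first one.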