How to extract email from html link

How to extract email from html link - c#

Hi I have a csv file which I need to format (columns) email, they are in the csv as follows
john#domain.com"
dave.h#domain22.co.uk"
etc...
So i want to remove " and just use john#domain.com
I have the following
foreach (var clientI in clientImportList)
{
newClient = new DomainObjects.Client();
//Remove unwanted email text??
newClient.Email = clientI.Email
}

I would suggest to use HtmlAgilityPack and not parse it yourself:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[#href"])
{
string href = link["href"].Value;
// use "mailto:john#domain.com" here..
}

You can test regular expressions here:
https://regex101.com/
Using your example, this seems to work:
mailto:(.*?)\\">
The library needed for regex is:
using System.Text.RegularExpressions;

I usually write myself little utility classes and extensions to handle things like this. Since this probably won't be the last time you have to do something like this you could do this:
Create an Extension of the string class:
public static class StringExtensions
{
public static string ExtractMiddle(this string text, string front, string back)
{
text = text.Substring(text.IndexOf(front) + 1);
return text.Remove(text.IndexOf(back));
}
}
And then do this(Could use better naming, but you get the point):
string emailAddress = text.ExtractMiddle(">", "<");

If you want to do it the index way, something like:
const string start = "<a href=\\mailto:";
const string end = "\\\">";
string asd1 = "john#domain.com\"";
int index1 = asd1.IndexOf(start);
int startPosition = index1 + start.Length;
int endPosition = asd1.IndexOf(end);
string email = asd1.Substring(startPosition, endPosition - startPosition);

Related

How to extract an url from a String in C#

I have this string :
"<figure><img
src='http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg'
href='JavaScript:void(0);' onclick='return takeImg(this)'
tabindex='1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>"
How can I retrieve this link :
http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg
All string are the same type so somehow I need to get substring between src= and href. But I don't know how to do that. Thanks.

If you parse HTML don't not use string methods but a real HTML parser like HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is your string
var linksAndImages = doc.DocumentNode.SelectNodes("//a/#href | //img/#src");
var allSrcList = linksAndImages
.Select(node => node.GetAttributeValue("src", "[src not found]"))
.ToList();

You can use regex:
var src = Regex.Match("the string", "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;

In general, you should use an HTML/XML parser when parsing a value from HTML code, but with a limited string like this, Regex would be fine.
string url = Regex.Match(htmlString, #"src='(.*?)'").Groups[1].Value;

If your string is always in same format, you can easily do this like so :
string input = "<figure><img src='http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg' href='JavaScript:void(0);' onclick='return takeImg(this)' tabindex='1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
// link is between ' signs starting from the first ' sign so you can do :
input = input.Substring(input.IndexOf("'")).Substring(input.IndexOf("'"));
// now your string looks like : "http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg"
return input;

string str = "<figure><imgsrc = 'http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg'href = 'JavaScript:void(0);' onclick = 'return takeImg(this)'tabindex = '1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
int pFrom = str.IndexOf("src = '") + "src = '".Length;
int pTo = str.LastIndexOf("'href");
string url = str.Substring(pFrom, pTo - pFrom);
Source :
Get string between two strings in a string

Q is your string in this case, i look for the index of the attribute you want (src = ') then I remove the first few characters (7 including spaces) and after that you look for when the text ends by looking for '.
With removing the first few characters you could use .IndexOf to look for how many to delete so its not hard coded.
string q =
"<figure><img src = 'http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg' href = 'JavaScript:void(0);' onclick = 'return takeImg(this)'" +
"tabindex = '1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
string z = q.Substring(q.IndexOf("src = '"));
z = z.Substring(7);
z = z.Substring(0, z.IndexOf("'"));
MessageBox.Show(z);
This is certainly not the most elegant way (look at the other answers for that :)).

Extract multiple values from string using C#

I'am creating my own forum. I've got problem with quoting messages. I know how to add quoting message into text box, but i cannot figure out how to extract values from string after post. In text box i've got something like this:
[quote IdPost=8] Some quoting text [/quote]
[quote IdPost=15] Second quoting text [/quote]
Could You tell what is the easiest way to extract all "IdPost" numbers from string after posting form ?.

by using a regex
#"\[quote IdPost=(\d+)\]"
something like
Regex reg = new Regex(#"\[quote IdPost=(\d+)\]");
foreach (Match match in reg.Matches(text))
{
...
}

var originalstring = "[quote IdPost=8] Some quoting text [/quote]";
//"[quote IdPost=" and "8] Some quoting text [/quote]"
var splits = originalstring.Split('=');
if(splits.Count() == 2)
{
//"8" and "] Some quoting text [/quote]"
var splits2 = splits[1].Split(']');
int id;
if(int.TryParse(splits2[0], out id))
{
return id;
}
}

I do not know exactly what is your string, but here is a regex-free solution with Substring :
using System;
public class Program
{
public static void Main()
{
string source = "[quote IdPost=8] Some quoting text [/quote]";
Console.WriteLine(ExtractNum(source, "=", "]"));
Console.WriteLine(ExtractNum2(source, "[quote IdPost="));
}
public static string ExtractNum(string source, string start, string end)
{
int index = source.IndexOf(start) + start.Length;
return source.Substring(index, source.IndexOf(end) - index);
}
// just another solution for fun
public static string ExtractNum2(string source, string junk)
{
source = source.Substring(junk.Length, source.Length - junk.Length); // erase start
return source.Remove(source.IndexOf(']')); // erase end
}
}
Demo on DotNetFiddle

Merge two EmailMessage in C#

What is the correct way to merge two EmailMessage. I tried following:
mergedMessage.Body.Text = message1.Body.Text + message2.Body.Text
But it is creating two html tags in the merged message, which is not correct.
Should I parse the message1.Body.Text and message2.Body.Text and fetch the content of html and copy to the mergedMessage?

You can do this
using System.Text.RegularExpressions;
const string HTML_TAG_PATTERN = "<.*?>";
static string StripHTML (this string inputString)
{
return Regex.Replace
(inputString, HTML_TAG_PATTERN, string.Empty);
}
mergedMessage.Body.Text = message1.Body.Text.StripHTML() + message2.Body.Text.StripHTML()

How to use replace only the first occurence of www

In my code behind in C# I have the following code. How do I change the replace so that only
the first occurance of www is replaced?
For example if the User enters www.testwww.com then I should be saving it as testwww.com.
Currently as per the below code it saves as www.com (guess due to substr code).
Please help. Thanks in advance.
private string FilterUrl(string url)
{
string lowerCaseUrl = url.ToLower();
lowerCaseUrl = lowerCaseUrl.Replace("http://", string.Empty).Replace("https://", string.Empty).Replace("ftp://", string.Empty);
lowerCaseUrl = lowerCaseUrl.Replace("www.", string.Empty);
string lCaseUrl = url.Substring(url.Length - lowerCaseUrl.Length, lowerCaseUrl.Length);
return lCaseUrl;
}

As Ally suggested. You are much better off using System.Uri. This also replaces the leading www as you wish.
private string FilterUrl(string url)
{
Uri uri = new UriBuilder(url).Uri; // defaults to http:// if missing
return Regex.Replace(uri.Host, "^www.", "") + uri.PathAndQuery;
}
Edit: The trailing slash is because of the PathAndQuery property. If there was no path you are left with the slash only. Just add another regex replace or string replace. Here's the regex way.
return Regex.Replace(uri.Host, "^www.", "") + Regex.Replace(uri.PathAndQuery, "/$", "");

I would suggest using indexOf(string) to find the first occurrence.
Edit: okay someone beat me to it ;)

You could use IndexOf like Felipe suggested OR do it the low tech way..
lowerCaseUrl = lowerCaseUrl.Replace("http://", string.Empty).Replace("https://", string.Empty).Replace("ftp://", string.Empty).Replace("http://www.", string.Empty).Replace("https://www.", string.Empty)
Would be interested to know what you're trying to achieve.

Came up with a cool static method, also works for replacing the first x occurrences:
public static string ReplaceOnce(this string s, string replace, string with)
{
return s.ReplaceCount(replace, with);
}
public static string ReplaceCount(this string s, string replace, string with, int howManytimes = 1)
{
if (howManytimes < 0) throw InvalidOperationException("can not replace a string less than zero times");
int count = 0;
while (s.Contains(replace) && count < howManytimes)
{
int position = s.IndexOf(replace);
s = s.Remove(position, replace.Length);
s = s.Insert(position, with);
count++;
}
return s;
}
The ReplaceOnce isn't necessary, just a simplifier. Call it like this:
string url = "http://www.stackoverflow.com/questions/www/www";
var urlR1 - url.ReplaceOnce("www", "xxx");
// urlR1 = "http://xxx.stackoverflow.com/questions/www/www";
var urlR2 - url.ReplaceCount("www", "xxx", 2);
// urlR2 = "http://xxx.stackoverflow.com/questions/xxx/www";
NOTE: this is case-sensitive as it is written

The Replace method will change all content of the string. You have to locate the piece you want to remove using IndexOf method, and remove using Remove method of string. Try something like this:
//include the namespace
using System.Globalization;
private string FilterUrl(string url)
{
// ccreate a Comparer object.
CompareInfo myCompare = CultureInfo.InvariantCulture.CompareInfo;
// find the 'www.' on the url parameter ignoring the case.
int position = myCompare.IndexOf(url, "www.", CompareOptions.IgnoreCase);
// check if exists 'www.' on the string.
if (position > -1)
{
if (position > 0)
url = url.Remove(position - 1, 5);
else
url = url.Remove(position, 5);
}
//if you want to remove http://, https://, ftp://.. keep this line
url = url.Replace("http://", string.Empty).Replace("https://", string.Empty).Replace("ftp://", string.Empty);
return url;
}
Edits
There was a part in your code that is removing a piece of string. If you just want to remove the 'www.' and 'http://', 'https://', 'ftp://', take a look the this code.
This code also ignore the case when it compares the url parameter and what you have been findind, on case, 'www.'.

How to select only the characters after a repeated symbol in a long string

If i am using C# and i have a string coming in from a database like this:
\RBsDC\1031\2011\12\40\1031-215338-5DRH44PUEM2J51GRL7KNCIPV3N-META-ENG-22876500BBDE449FA54E7CF517B2863E.XML
And i only want this part of the string:
1031-215338-5DRH44PUEM2J51GRL7KNCIPV3N-META-ENG-22876500BBDE449FA54E7CF517B2863E.XML
How can i get this string if there is more than one "\" symbol?

You can use the LastIndexOf() method of the String class:
string s = #"\RBsDC\1031\2011\12\40\1031-215338-5DRH44PUEM2J51GRL7KNCIPV3N-META-ENG-22876500BBDE449FA.xml";
Console.Out.WriteLine(s.Substring(s.LastIndexOf('\\') + 1));
Hope, this helps.

Use String.Split to split string by parts and then get the last part.
Using LINQ Enumerable.Last() :
text.Split('\\').Last();
or
// todo: add null-empty checks, etcs
var parts = text.Split('\\');
strign lastPart = parts[parts.Length - 1];

You can use a combination of String.LastIndexOf("\") and String.Substring(lastIndex+1). You could also use (only in the sample you provided) Path.GetFileName(theString).

string[] x= line.Split('\');
string goal =x[x.Length-1];
but linq will be easier

You can use regex or split the string by "\" symbol and take the last element of array
using System.Linq;
public class Class1
{
public Class1()
{
string s =
#"\RBsDC\1031\2011\12\40\1031-215338-5DRH44PUEM2J51GRL7KNCIPV3N-META-ENG-22876500BBDE449FA54E7CF517B2863E.XML";
var array = s.Split('\\');
string value = array.Last();
}
}

newstring = string.Substring(string.LastIndexOf(#"\")+1);

It seems like original string is like filePath.
This could be one easy solution.
string file = #"\RBsDC\1031\2011\12\40\1031-215338-5DRH44PUEM2J51GRL7KNCIPV3N-META-ENG-22876500BBDE449FA.xml";
string name = System.IO.Path.GetFileName(file);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to extract email from html link - c#

I would suggest to use HtmlAgilityPack and not parse it yourself: HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(html); foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[#href"]) { string href = link["href"].Value; // use "mailto:john#domain.com" here.. }

You can test regular expressions here: https://regex101.com/ Using your example, this seems to work: mailto:(.*?)\\"> The library needed for regex is: using System.Text.RegularExpressions;

Related

How to extract an url from a String in C#

Extract multiple values from string using C#

Merge two EmailMessage in C#

How to use replace only the first occurence of www

How to select only the characters after a repeated symbol in a long string

Categories

Resources