Reading Regular Expression in ASP.NET C#

I'm able to read and download a list of .jpg files on a page using this regular expression:
MatchCollection match = Regex.Matches(htmlText, @"http://.*?\b.jpg\b", RegexOptions.RightToLeft);
Output example: http://somefiles.jpg from this line
<img src="http://somefiles.jpg"/> in html
Question: how can I read files in this kind of format?
I just want to extract files with .exe on the page. So in the example above ^ I just want to get the datavoila-setup.exe file. Sorry, I'm a bit of a noob and confused about how to do it T_T. Thanks in advance to anyone who can help me. :)
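As a sketch of the literal question, the same regex idea retargeted at .exe could look like the following; the HTML fragment here is made up for illustration, and as the answers below note, an HTML parser is the more robust route.

```csharp
using System;
using System.Text.RegularExpressions;

class ExeLinkSketch
{
    static void Main()
    {
        // Hypothetical HTML fragment for illustration only.
        string htmlText = "<a href=\"download/datavoila-setup.exe\">Download</a>";
        // Same idea as the .jpg pattern, retargeted at .exe:
        // capture whatever sits between href=" and .exe"
        foreach (Match m in Regex.Matches(htmlText, @"href=""([^""]*?\.exe)"""))
            Console.WriteLine(m.Groups[1].Value); // download/datavoila-setup.exe
    }
}
```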
This is my updated code, but I'm getting an error on the HtmlDocument doc = new HtmlDocument(); part ("No Source Available"), and I'm getting a null value for list :(
protected void Button2_Click(object sender, EventArgs e)
{
    // Get the url given by the user
    string urls;
    urls = txtSiteAddress.Text;
    StringBuilder result = new StringBuilder();
    // Give request to the url given
    HttpWebRequest requesters = (HttpWebRequest)HttpWebRequest.Create(urls);
    requesters.UserAgent = "";
    // Check for the web response
    WebResponse response = requesters.GetResponse();
    Stream streams = response.GetResponseStream();
    // Reads the url as html code
    StreamReader readers = new StreamReader(streams);
    string htmlTexts = readers.ReadToEnd();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(streams);
    var list = doc.DocumentNode.SelectNodes("//a[@href]")
        .Select(p => p.Attributes["href"].Value)
        .Where(x => x.EndsWith("exe"))
        .ToList();
    doc.Save("list");
}
This is Flipbed's answer. It works, but I'm not getting a clean catch :( I think something needs editing in the way the HTML is split into text.
protected void Button2_Click(object sender, EventArgs e)
{
    // Get the url given by the user
    string urls;
    urls = txtSiteAddress.Text;
    StringBuilder result = new StringBuilder();
    // Give request to the url given
    HttpWebRequest requesters = (HttpWebRequest)HttpWebRequest.Create(urls);
    requesters.UserAgent = "";
    // Check for the web response
    WebResponse response = requesters.GetResponse();
    Stream streams = response.GetResponseStream();
    // Reads the url as html code
    StreamReader readers = new StreamReader(streams);
    string htmlTexts = readers.ReadToEnd();
    WebClient webclient = new WebClient();
    string checkurl = webclient.DownloadString(urls);
    List<string> list = new List<string>();
    // Splits the html on double quotes into texts
    string[] parts = htmlTexts.Split(new string[] { "\"" },
        StringSplitOptions.RemoveEmptyEntries);
    // Compares the split text with a valid file extension
    foreach (string part in parts)
    {
        if (part.EndsWith(".exe"))
        {
            list.Add(part);
            // Download the data into a byte array
            byte[] fileData = webclient.DownloadData(this.txtSiteAddress.Text + '/' + part);
            // Create the FileStream that the byte array will be written to
            // (note: "part", not "list" -- concatenating the List<string> itself
            // would put its type name into the path, not the file name)
            FileStream file =
                File.Create(this.txtDownloadPath.Text + "\\" + Path.GetFileName(part));
            // Write the full byte array to the file
            file.Write(fileData, 0, fileData.Length);
            // Download complete message
            lblMessage.Text = "Download Complete!";
            // Clear the text fields' content
            txtSiteAddress.Text = "";
            txtDownloadPath.Text = "";
            // Close the file so other processes can access it
            file.Close();
            break;
        }
    }
}

This is not an answer but too long for a comment. (I'll delete it later)
To resolve the "it works" / "it doesn't work" back-and-forth, here is complete code for those who may want to check:
string html = @"";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Anirudh's solution
var itemList = doc.DocumentNode.SelectNodes("//a//@href") // get all hrefs
    .Select(p => p.InnerText)
    .Where(x => x.EndsWith("exe"))
    .ToList();
// returns empty list
// correct one
var itemList2 = doc.DocumentNode.SelectNodes("//a[@href]")
    .Select(p => p.Attributes["href"].Value)
    .Where(x => x.EndsWith("exe"))
    .ToList();
// returns download/datavoila-setup.exe

Regex is not a good choice for parsing HTML files.
HTML is not strict, nor is its format regular.
Use HtmlAgilityPack.
You can use this code to retrieve all exe's using HtmlAgilityPack
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");
var itemList = doc.DocumentNode.SelectNodes("//a[@href]") // get all hrefs
    .Select(p => p.Attributes["href"].Value)
    .Where(x => x.EndsWith("exe"))
    .ToList();
itemList now contains all the exe's.
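One caveat worth noting (my addition, not part of the original answer): EndsWith("exe") with no second argument is case-sensitive, so an upper-case link like SETUP.EXE would slip through. A small sketch of the safer comparison, with invented hrefs:

```csharp
using System;
using System.Linq;

class CaseInsensitiveFilter
{
    static void Main()
    {
        // Hypothetical hrefs for illustration.
        var hrefs = new[] { "a/setup.exe", "b/SETUP.EXE", "c/readme.txt" };
        // Ordinal, case-insensitive suffix check catches both casings.
        var exes = hrefs
            .Where(x => x.EndsWith(".exe", StringComparison.OrdinalIgnoreCase))
            .ToList();
        Console.WriteLine(exes.Count); // 2
    }
}
```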

I would use FizzlerEx; it adds jQuery-like syntax to HtmlAgilityPack. Use the ends-with selector to test the href attribute:
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;
foreach (var item in page.QuerySelectorAll("a[href$='exe']"))
{
    var file = item.Attributes["href"].Value;
}
And an explanation of why it is bad to parse HTML with RegEx: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Instead of using regular expressions you could just use normal code.
List<string> files = new List<string>();
string[] parts = htmlText.Split(new string[] { "\"" },
    StringSplitOptions.RemoveEmptyEntries);
foreach (string part in parts)
{
    if (part.EndsWith(".exe"))
        files.Add(part);
}
In this case you would have all the found files in the files list.
EDIT:
You could do:
List<string> files = new List<string>();
string[] hrefs = htmlText.Split(new string[] { "href=\"" },
    StringSplitOptions.RemoveEmptyEntries);
foreach (string href in hrefs)
{
    string[] possibleFile = href.Split(new string[] { "\"" },
        StringSplitOptions.RemoveEmptyEntries);
    // Length is a property, not a method
    if (possibleFile.Length > 0 && possibleFile[0].EndsWith(".exe"))
        files.Add(possibleFile[0]);
}
This would also check that the exe file is within a href.

Related

How to get input from textbox in C#?

First things first, I am very new to C# and programming all around. I have a standalone program that reads XML files in a certain location and converts them to plain-text files.
I also have a Windows Forms app that has a file directory and displays the chosen file in a textbox.
{
    Stream myStream;
    OpenFileDialog openFileDialog1 = new OpenFileDialog();
    if (openFileDialog1.ShowDialog() == System.Windows.Forms.DialogResult.OK)
    {
        if ((myStream = openFileDialog1.OpenFile()) != null)
        {
            string strfilename = openFileDialog1.FileName;
            string filetext = File.ReadAllText(strfilename);
            textBox1.Text = filetext;
        }
    }
}
Below is a snippet of my conversion program.
string[] files = Directory.GetFiles("C:\\articles");
foreach (string file in files)
{
    List<string> translatedLines = new List<string>();
    string[] lines = File.ReadAllLines(file);
    foreach (string line in lines)
    {
        if (line.Contains("\"check\""))
        {
            string pattern = "<[^>]+>";
            string replacement = " ";
            Regex rgx = new Regex(pattern);
            string result = rgx.Replace(line, replacement);
            translatedLines.Add(result);
        }
    }
}
How would I modify my program to take the input from the textbox and then perform its conversion?
(And Yes I'm aware I have to combine the two programs.)
Use the XDocument class to parse the XML-formatted string into an XML document so that you can get the values of each node of the XML:
XDocument xDoc = XDocument.Parse(filetext);
Now read the content:
var textValue = xDoc.Descendants("Response").First().Attribute("Status").Value;
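To make that concrete, here is a minimal self-contained sketch; the Response/Status shape is taken from the snippet above, but the sample XML content is invented:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class XDocumentSketch
{
    static void Main()
    {
        // Invented XML matching the assumed <Response Status="..."> shape.
        string filetext = "<Response Status=\"OK\"><Item>42</Item></Response>";
        XDocument xDoc = XDocument.Parse(filetext);
        // The root element here is <Response>, so Descendants("Response") finds it.
        string status = xDoc.Descendants("Response").First().Attribute("Status").Value;
        Console.WriteLine(status); // OK
    }
}
```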

How to get all the files from a directory where the name does not contain 0?

Recently I built a small converter that converts txt data to xml in a certain structure. I choose a folder and the program loops through all the files in that folder and writes them, all together and in XML format, into one xml document.
In the folder I have files named like:
Data.0001.txt
Data.0002.txt
Data.0003.txt
Data.0004.txt
Data.txt
and so on.
I want only the files that do NOT contain zeros in their names, because the ones with zeros are just backup copies of the others, and I have over 6000 files so I can't filter them manually.
Here is my code so far
static void Main(string[] args)
{
    FolderBrowserDialog SelectFolder = new FolderBrowserDialog();
    String path = @"C:\newpages";
    XmlDocument doc = new XmlDocument();
    XmlElement root = doc.CreateElement("Pages");
    if (SelectFolder.ShowDialog() == DialogResult.OK)
    {
        var txt = string.Empty;
        string[] Files = Directory.GetFiles((SelectFolder.SelectedPath));
        int i = 1;
        foreach (string path1 in Files)
        {
            String filename = Path.GetFileNameWithoutExtension((path1));
            using (StreamReader sr = new StreamReader(path1))
            {
                txt = sr.ReadToEnd();
                XmlElement id = doc.CreateElement("Page.id");
                id.SetAttribute("Page.Nr", i.ToString());
                id.SetAttribute("Pagetitle", filename);
                XmlElement name = doc.CreateElement("PageContent");
                XmlCDataSection cdata = doc.CreateCDataSection(txt);
                name.AppendChild(cdata);
                id.AppendChild(name); // page id appendchild
                root.AppendChild(id); // root's appendchild
                doc.AppendChild(root); // main root
            }
            i++;
        }
    }
    Console.WriteLine("finished");
    Console.ReadKey();
    doc.Save(Path.ChangeExtension(path, ".xml"));
}
}
Any help would be really nice guys
GetFiles returns the names of the files in a specified directory. Its return type is string[], so you can easily apply a Where to filter the file names as follows:
var files = Directory.GetFiles("PathToYourDirec").Where(name => !name.Contains("0"));
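One caveat: GetFiles returns full paths, so Contains("0") would also reject a file whose folder path happens to contain a zero. Filtering on the file name alone avoids that; a sketch (the folder path is the one from the question):

```csharp
using System;
using System.IO;
using System.Linq;

class FileNameFilterSketch
{
    static void Main()
    {
        string folder = @"C:\newpages"; // path from the question
        // Check only the file name, not the whole path, for a "0".
        var files = Directory.GetFiles(folder)
            .Where(p => !Path.GetFileName(p).Contains("0"))
            .ToList();
        files.ForEach(Console.WriteLine);
    }
}
```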
On the string filename you could make sure it doesn't contain "0"
if(!filename.Contains("0"))
{
}
On the Files variable you can use a regex to filter the file names, keeping only those that contain no digits:
var reg = new Regex(@"^([^0-9]*)$");
var files = Directory.GetFiles("path-to-folder")
    .Where(path => reg.IsMatch(path))
    .ToList();
The whole code can be simplified considerably while solving this issue. You don't need a StreamReader just to read the whole file, and you may as well get the file name early and filter on it there, instead of going into the foreach and filtering:
static void Main(string[] args)
{
    FolderBrowserDialog SelectFolder = new FolderBrowserDialog();
    String path = @"C:\newpages";
    XmlDocument doc = new XmlDocument();
    XmlElement root = doc.CreateElement("Pages");
    if (SelectFolder.ShowDialog() == DialogResult.OK)
    {
        // Don't declare txt here; you're overwriting it and only using it in a nested loop, so declare it where you use it
        // var txt = string.Empty;
        //string[] Files = Directory.GetFiles((SelectFolder.SelectedPath));
        // Change to getting FileInfos
        var Files = new DirectoryInfo(SelectFolder.SelectedPath).GetFiles()
            // Only keep those that don't contain a zero in the file name
            .Where(f => !f.Name.Contains("0"));
        int i = 1;
        foreach (var file in Files)
        {
            //String filename = Path.GetFileNameWithoutExtension((path1));
            // Don't need a StreamReader nor a using block; just read the whole file at once with File.ReadAllText
            //using (StreamReader sr = new StreamReader(path1))
            //{
            //txt = sr.ReadToEnd();
            var txt = File.ReadAllText(file.FullName);
            XmlElement id = doc.CreateElement("Page.id");
            id.SetAttribute("Page.Nr", i.ToString());
            id.SetAttribute("Pagetitle", file.FullName);
            XmlElement name = doc.CreateElement("PageContent");
            XmlCDataSection cdata = doc.CreateCDataSection(txt);
            name.AppendChild(cdata);
            id.AppendChild(name); // page id appendchild
            root.AppendChild(id); // root's appendchild
            doc.AppendChild(root); // main root
            //}
            i++;
        }
    }
    Console.WriteLine("finished");
    Console.ReadKey();
    doc.Save(Path.ChangeExtension(path, ".xml"));
}
I would also advise not working with the XML API you're using but with the more recent and simpler LINQ to XML one, as that would simplify creating your elements too. See below a very simplified version of the whole code as I'd have written it with LINQ and XElements:
static void Main(string[] args)
{
    FolderBrowserDialog SelectFolder = new FolderBrowserDialog();
    String path = @"C:\newpages";
    var root = new XElement("Pages");
    if (SelectFolder.ShowDialog() == DialogResult.OK)
    {
        var FilesXML = new DirectoryInfo(SelectFolder.SelectedPath).GetFiles()
            .Where(f => !f.Name.Contains("0"))
            // Note that the index is 0-based; if you want to start with 1, just replace index by index+1 in Page.Nr
            .Select((file, index) =>
                new XElement("Page.id",
                    new XAttribute("Page.Nr", index),
                    new XAttribute("Pagetitle", file.FullName),
                    new XElement("PageContent",
                        new XCData(File.ReadAllText(file.FullName))
                    )));
        // Here we already have all your XML ready; we just need to add it to the root
        root.Add(FilesXML);
    }
    Console.WriteLine("finished");
    Console.ReadKey();
    root.Save(Path.ChangeExtension(path, ".xml"));
}
You can try this, but I would suggest changing the logic for naming the backup files. It should not depend on a "0" character in the name; instead, fixed text like "backup" should appear in the file name.
static void Main(string[] args)
{
    FolderBrowserDialog SelectFolder = new FolderBrowserDialog();
    String path = @"C:\newpages";
    XmlDocument doc = new XmlDocument();
    XmlElement root = doc.CreateElement("Pages");
    if (SelectFolder.ShowDialog() == DialogResult.OK)
    {
        var txt = string.Empty;
        string[] Files = Directory.GetFiles((SelectFolder.SelectedPath));
        int i = 1;
        foreach (string path1 in Files)
        {
            String filename = Path.GetFileNameWithoutExtension((path1));
            if (!filename.Contains(".0"))
            {
                using (StreamReader sr = new StreamReader(path1))
                {
                    txt = sr.ReadToEnd();
                    XmlElement id = doc.CreateElement("Page.id");
                    id.SetAttribute("Page.Nr", i.ToString());
                    id.SetAttribute("Pagetitle", filename);
                    XmlElement name = doc.CreateElement("PageContent");
                    XmlCDataSection cdata = doc.CreateCDataSection(txt);
                    name.AppendChild(cdata);
                    id.AppendChild(name); // page id appendchild
                    root.AppendChild(id); // root's appendchild
                    doc.AppendChild(root); // main root
                }
            }
            i++;
        }
    }
    Console.WriteLine("finished");
    Console.ReadKey();
    doc.Save(Path.ChangeExtension(path, ".xml"));
}

C# Nested Loops

I'm having trouble making some loops work.
I'm using the Agility Pack. I have a TXT file with several links (1 per line), and for each link in that txt I want to navigate to the page and then extract what is at an XPath and write it to a memo.
The problem I'm having is that the code only carries out the procedure for the last line of the txt. Where am I wrong?
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
    {
        memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
        break;
    }
}
try to change
memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
to
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
You're overwriting memoEdit1.Text every time. Try
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
instead - note the += instead of =, which adds the new text every time.
Incidentally, constantly appending strings together isn't really the best way. Something like this might be better:
var Webget = new HtmlWeb();
var builder = new StringBuilder();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
    {
        builder.AppendFormat("{0}\r\n", node.ChildNodes[0].InnerHtml);
        break;
    }
}
memoEdit1.Text = builder.ToString();
Or, using LINQ:
var Webget = new HtmlWeb();
memoEdit1.Text = string.Join(
    "\r\n",
    File.ReadAllLines("c:\\test.txt")
        .Select(line => Webget.Load(line).DocumentNode.SelectNodes("//*[@id='title-article']").First().ChildNodes[0].InnerHtml));
If you are only selecting one node in the inner loop, then use SelectSingleNode instead. Also, the better practice when concatenating strings in a loop is to use a StringBuilder:
StringBuilder builder = new StringBuilder();
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    builder.AppendLine(doc.DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
}
memoEdit1.Text = builder.ToString();
Using LINQ it will look like this:
var Webget = new HtmlWeb();
var result = File.ReadLines("c:\\test.txt")
    .Select(line => Webget.Load(line).DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
memoEdit1.Text = string.Join(Environment.NewLine, result);
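A caveat on these variants (my note, not the answerer's): SelectSingleNode returns null when the XPath matches nothing, so a single page without the element would throw and abort the whole run. A guarded sketch, assuming the HtmlAgilityPack package is referenced:

```csharp
using System.Text;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class GuardedScrapeSketch
{
    static string CollectTitles(string listFile)
    {
        var webget = new HtmlWeb();
        var builder = new StringBuilder();
        foreach (string line in System.IO.File.ReadLines(listFile))
        {
            var doc = webget.Load(line);
            // SelectSingleNode returns null when nothing matches; skip such pages.
            var node = doc.DocumentNode.SelectSingleNode("//*[@id='title-article']");
            if (node != null)
                builder.AppendLine(node.InnerHtml);
        }
        return builder.ToString();
    }
}
```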

How can I remove HTML tags from a string with a regex?

I am fetching data from MySQL, but the issue is that HTML tags, i.e.
<p>LARGE</p><p>Lamb;<br>;li;ul;
are also being fetched with my data. I just need "LARGE" and "Lamb" from the above line. How can I separate/remove the HTML tags from the string?
I am going to assume that the HTML is intact, perhaps something like the following:
<ul><li><p>LARGE</p><p>Lamb<br></li></ul>
In which case, I would use HtmlAgilityPack to get the content without having to resort to regex.
var html = "<ul><li><p>LARGE</p><p>Lamb</p><br></li></ul> ";
var hap = new HtmlDocument();
hap.LoadHtml(html);
string text = HtmlEntity.DeEntitize(hap.DocumentNode.InnerText);
// text is now "LARGELamb "
string[] lines = hap.DocumentNode.SelectNodes("//text()")
    .Select(h => HtmlEntity.DeEntitize(h.InnerText)).ToArray();
// lines is { "LARGE", "Lamb", " " }
If we assume that you are going to fix your HTML elements:
static void Main(string[] args)
{
    string html = WebUtility.HtmlDecode("<p>LARGE</p><p>Lamb</p>");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    List<HtmlNode> spanNodes = doc.DocumentNode.Descendants().Where(x => x.Name == "p").ToList();
    foreach (HtmlNode node in spanNodes)
    {
        Console.WriteLine(node.InnerHtml);
    }
}
You need to use the HTML Agility Pack. You can add the reference like this:
Install-Package HtmlAgilityPack
Try this:
// erase html tags from a string
public static string StripHtml(string target)
{
    // Regular expression for html tags
    Regex StripHTMLExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);
    return StripHTMLExpression.Replace(target, string.Empty);
}
Call it like this:
string htmlString = "<div><span>hello world!</span></div>";
string strippedString = StripHtml(htmlString);
Assuming that:
the original string is always going to be in that specific format,
and that
you cannot add the HTMLAgilityPack,
here is a quick and dirty way of getting what you want:
static void Main(string[] args)
{
    // Split the original string on the 'separator' string.
    string originalString = "<p>LARGE</p><p>Lamb;<br>;li;ul; ";
    string[] sSeparator = new string[] { "</p><p>" };
    string[] splitString = originalString.Split(sSeparator, StringSplitOptions.None);
    // Prepare to filter the 'prefix' and 'postscript' strings
    string prefix = "<p>";
    string postfix = ";<br>;li;ul; ";
    int prefixLength = prefix.Length;
    int postfixLength = postfix.Length;
    // Iterate over the split string and clean up
    string s = string.Empty;
    for (int i = 0; i < splitString.Length; i++)
    {
        s = splitString[i];
        if (s.Contains(prefix))
        {
            s = s.Remove(s.IndexOf(prefix), prefixLength);
        }
        if (s.Contains(postfix))
        {
            s = s.Remove(s.IndexOf(postfix), postfixLength);
        }
        splitString[i] = s;
        Console.WriteLine(splitString[i]);
    }
    Console.ReadLine();
}
// Convert < > etc. to HTML
String sResult = HttpUtility.HtmlDecode(sData);
// Remove HTML tags delimited by <>
String result = Regex.Replace(sResult, @"<[^>]*>", String.Empty);

WebRequest multiple pages and load into StreamReader

I want to go to multiple pages using ASP.NET 4.0, copy all the HTML, and finally paste it in a text box. From there I would like to run my parsing function. What is the best way to handle this?
protected void goButton_Click(object sender, EventArgs e)
{
    if (datacenterCombo.Text == "BL2")
    {
        fwURL = "http://website1.com/index.html";
        l2URL = "http://website2.com/index.html";
        lbURL = "http://website3.com/index.html";
        l3URL = "http://website4.com/index.html";
        coreURL = "http://website5.com/index.html";
        WebRequest objRequest = HttpWebRequest.Create(fwURL);
        // renamed so the request and the reader below don't share one name
        WebRequest layer2Request = HttpWebRequest.Create(l2URL);
        objRequest.Credentials = CredentialCache.DefaultCredentials;
        using (StreamReader layer2 = new StreamReader(layer2Request.GetResponse().GetResponseStream()))
        using (StreamReader objReader = new StreamReader(objRequest.GetResponse().GetResponseStream()))
        {
            originalBox.Text = objReader.ReadToEnd();
        }
        objRequest = HttpWebRequest.Create(l2URL);
        // Read all lines of the file
        String[] crString = { "<BR> " };
        String[] aLines = originalBox.Text.Split(crString, StringSplitOptions.RemoveEmptyEntries);
        String noHtml = String.Empty;
        for (int x = 0; x < aLines.Length; x++)
        {
            if (aLines[x].Contains(ipaddressBox.Text))
            {
                noHtml += (RemoveHTML(aLines[x]) + "\r\n");
            }
        }
        // Print results to the textbox
        resultsBox.Text = String.Join(Environment.NewLine, noHtml);
    }
}

public static string RemoveHTML(string text)
{
    text = text.Replace("&nbsp;", " ").Replace("<br>", "\n");
    var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
    return oRegEx.Replace(text, string.Empty);
}
Instead of doing all this manually, you should probably use HtmlAgilityPack; then you could do something like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://google.com");
var targetNodes = doc.DocumentNode
    .Descendants()
    .Where(x => x.ChildNodes.Count == 0
        && x.InnerText.Contains(someIpAddress));
foreach (var node in targetNodes)
{
    // do something
}
If HtmlAgilityPack is not an option for you, simplify at least the download portion of your code and use a WebClient:
using (WebClient wc = new WebClient())
{
    string html = wc.DownloadString("http://google.com");
}
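Since the question asks about multiple pages, a minimal sketch of looping one WebClient over several URLs (the URLs below are placeholders, not the question's real sites):

```csharp
using System;
using System.Net;

class MultiPageDownloadSketch
{
    static void Main()
    {
        // Placeholder URLs; substitute the five index pages from the question.
        string[] urls = { "http://example.com/a.html", "http://example.com/b.html" };
        using (var wc = new WebClient())
        {
            foreach (string url in urls)
            {
                string html = wc.DownloadString(url);
                Console.WriteLine("{0}: {1} chars", url, html.Length);
            }
        }
    }
}
```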
