How can I escape "? - C#

if (richTextBox1.Lines[i].StartsWith(@"<a href=""") ||
    richTextBox1.Lines[i].EndsWith(@""""))
The StartsWith should match <a href="
The EndsWith should match one single "
But the way it is now I'm getting no results.
Input for example:
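<a href="/setprefs?suggon=2&prev=https://www.test.com/search?q%3D%2Band%2B%26espv%3D2%26biw%3D960%26bih%3D489%26source%3Dlnms%26tbm%3Disch%26sa%3DX%26ei%3DYrxxVb-hJqac7gba0YOgDQ%26ved%3D0CAYQ_AUoAQ&sig=0_seDQVVTDQQx1hvN3BRktZNFc9Ew%3D">Screen-reader users, click here to turn off ggg Instant.</a>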
I need to get this part:
/setprefs?suggon=2&prev=https://www.test.com/search?q%3D%2Band%2B%26espv%3D2%26biw%3D960%26bih%3D489%26source%3Dlnms%26tbm%3Disch%26sa%3DX%26ei%3DYrxxVb-hJqac7gba0YOgDQ%26ved%3D0CAYQ_AUoAQ&sig=0_seDQVVTDQQx1hvN3BRktZNFc9Ew%3D
The part between the double quotes.
I also tried to use HtmlAgilityPack:
HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("https://www.test.com");
foreach (HtmlAgilityPack.HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string hrefValue = link.GetAttributeValue("href", string.Empty);
    if (!newHtmls.Contains(hrefValue) && hrefValue.Contains("images"))
        newHtmls.Add(hrefValue);
}
But this gave me only 1 link.
When I browse to the page, open view-source, and search for the word image or images, I get over 350 results.
I tried also this solution:
var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
But it didn't give me the results I needed.
I forgot to mention that I copied the page's view-source content into the richTextBox1 window, and I'm reading the text line by line from richTextBox1, so maybe that's why I'm not getting the results I need?
for (int i = 0; i < richTextBox1.Lines.Length; i++)
{
    if (richTextBox1.Lines[i].StartsWith("<a href=\"") &&
        richTextBox1.Lines[i].EndsWith("\""))
    {
        listBox1.Items.Add(richTextBox1.Lines[i]);
    }
}
Maybe the view-source content as shown in the browser (Chrome) is not the same as in richTextBox1. And maybe I should not read it line by line from richTextBox1, but read the whole text first?

Based on your input, EndsWith isn't going to help (as your input actually ends with </a>). Your next-best option would be to store the location (position) of href=", then look for the next occurrence of a " beginning at your stored location, e.g.
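// href shortened here; the full value is the one shown in the question
var input = @"<a href=""/setprefs?suggon=2"">Screen-reader users, click here to turn off ggg Instant.</a>";
var needle = @"href=""";
var start = input.IndexOf(needle);
if (start != -1)
{
    start += needle.Length;
    var end = input.IndexOf(@"""", start);
    if (end != -1)
    {
        // final result:
        var href = input.Substring(start, end - start);
        Console.WriteLine(href);
    }
}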
Better than that would be to use an actual HTML parser (might I recommend HtmlAgilityPack?).
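Since the page apparently contains hundreds of links, the IndexOf idea above can be repeated in a loop over the whole text instead of line by line. The following is only a rough sketch: HrefScanner and ExtractHrefs are made-up names, and it assumes every href value is wrapped in double quotes.

```csharp
using System;
using System.Collections.Generic;

public static class HrefScanner
{
    // Hypothetical helper: returns every double-quoted href value found in the text.
    public static List<string> ExtractHrefs(string text)
    {
        const string needle = "href=\"";
        var results = new List<string>();
        int position = 0;

        while (true)
        {
            int start = text.IndexOf(needle, position, StringComparison.OrdinalIgnoreCase);
            if (start == -1) break;          // no more href attributes

            start += needle.Length;          // first character of the value
            int end = text.IndexOf('"', start);
            if (end == -1) break;            // unterminated attribute; stop scanning

            results.Add(text.Substring(start, end - start));
            position = end + 1;              // continue after the closing quote
        }
        return results;
    }
}
```

With the question's setup you would probably call it as HrefScanner.ExtractHrefs(richTextBox1.Text), so a link that wraps across lines in the text box is still found.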

Related

How to replace text within a string based on their indices

I have a string of text coming from a database. I also have a list of links from a database, each with a start index and length corresponding to my string. I want to wrap the links within the text in anchor tags:
<a href=...
I.e.
var stringText = "Hello look at http://www.google.com and this hello.co.uk";
This would have in the database
Link:http://www.google.com
Index:14
Length:21
Link:hello.co.uk
Index:45
Length:11
I eventually want
var stringText = "Hello look at <a href=\"http://www.google.com\">http://www.google.com</a> and this <a href=\"hello.co.uk\">hello.co.uk</a>";
There may be many links in the string, so I need a way of looping through these links and replacing based on the index and length. I could just loop through and replace based on the link text (string.Replace), but that causes issues if the same link appears twice:
var stringText = "www.google.com www.google.com www.google.com";
www.google.com would become a link, and the second pass would turn the link within the link into... a link.
I can obviously find the first index, but once I change the string at that point, the indices are no longer valid.
Is there an easy way to do this or am I missing something?
You simply need to remove the subject from source using String.Remove, then use String.Insert to insert your replacement string.
As @hogan suggested in the comments, you need to sort the replacement list and do the replacements in reverse order (from last to first) to make it work.
If you need to perform many replacements in single string I recommend StringBuilder for performance reasons.
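To make the Remove/Insert idea concrete, here is a minimal sketch that processes the links from last to first on a StringBuilder. LinkInserter, LinkSpan and Linkify are made-up names, and the sketch assumes the spans don't overlap and that the link text needs no HTML encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LinkInserter
{
    // Hypothetical record of one link: where it starts in the text and how long it is.
    public sealed class LinkSpan
    {
        public string Link { get; set; }
        public int Index { get; set; }
        public int Length { get; set; }
    }

    public static string Linkify(string text, IEnumerable<LinkSpan> spans)
    {
        var builder = new StringBuilder(text);

        // Work from the last span to the first so earlier indices stay valid.
        foreach (var span in spans.OrderByDescending(s => s.Index))
        {
            string visible = text.Substring(span.Index, span.Length);
            string anchor = "<a href=\"" + span.Link + "\">" + visible + "</a>";

            builder.Remove(span.Index, span.Length);
            builder.Insert(span.Index, anchor);
        }
        return builder.ToString();
    }
}
```

Because every edit happens at or after the position of all spans still to be processed, the stored indices from the database never need adjusting.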
I would use regular expressions.
Take a look at this: Regular expression to find URLs within a string
It might help.
Here's a solution without Remove or Insert, or regexes. Just addition.
string stringText = "Hello look at http://www.google.com and this hello.co.uk!";
var replacements = new [] {
    new { Link = "http://www.google.com", Index = 14, Length = 21 },
    new { Link = "hello.co.uk", Index = 45, Length = 11 } };
string result = "";
for (int i = 0; i <= replacements.Length; i++)
{
    // Copy the plain text between the previous link (or the start) and the next link (or the end).
    int previousIndex = i == 0 ? 0 : replacements[i - 1].Index + replacements[i - 1].Length;
    int nextIndex = i < replacements.Length ? replacements[i].Index : stringText.Length;
    result += stringText.Substring(previousIndex, nextIndex - previousIndex);
    if (i < replacements.Length)
    {
        result += String.Format("<a href=\"{0}\">{1}</a>", replacements[i].Link,
            stringText.Substring(replacements[i].Index, replacements[i].Length));
    }
}

Stripping out malformed HTML from a string

Sometimes from a 3rd party API I get malformed HTML elements returned:
olor:red">Text</span>
when I expect:
<span style="color:red">Text</span>
For my context, the text content of the HTML is more important so it does not matter if I lose surrounding tags/formatting.
What would be the best way to strip out the malformed tags such that the first example would read
Text
and the second would not change?
I recommend taking a look at the HtmlAgilityPack, which is also a very handy tool for HTML sanitization.
Here's an example approach using that library:
static void Main()
{
    var inputs = new[] {
        @"olor:red"">Text</span>",
        @"<span style=""color:red"">Text</span>",
        @"Text</span>",
        @"<span style=""color:red"">Text",
        @"<span style=""color:red"">Text"
    };
    var doc = new HtmlDocument();
    inputs.ToList().ForEach(i => {
        if (!i.StartsWith("<"))
        {
            if (i.IndexOf(">") != i.Length-1)
                i = "<" + i;
            else
                i = i.Substring(0, i.IndexOf("<"));
            doc.LoadHtml(i);
            Console.WriteLine(doc.DocumentNode.InnerText);
        }
        else
        {
            doc.LoadHtml(i);
            Console.WriteLine(doc.DocumentNode.OuterHtml);
        }
    });
}
Outputs:
Text
<span style="color:red">Text</span>
Text
<span style="color:red">Text</span>
<span style="color:red">Text</span>
If you just need the content of the tags, and no information of what type of tag etc, you could use Regular Expressions:
var r = new Regex(">([^>]+)<");
var text = "olor:red\">Text</span>";
var m = r.Match(text);
m.Groups[1].Value then holds the inner text of the first match (Text here); use r.Matches(text) to find the inner text of every tag.
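As a small sketch of collecting every inner text rather than just the first, the same pattern can be run through Regex.Matches. InnerTextFinder and FindInnerTexts are made-up names for illustration, and the pattern is still a crude heuristic, not an HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextFinder
{
    // Hypothetical helper: returns the text found between each '>' and the next '<'.
    public static List<string> FindInnerTexts(string html)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(html, ">([^>]+)<"))
        {
            results.Add(m.Groups[1].Value);   // group 1 is the text between the delimiters
        }
        return results;
    }
}
```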
Very crudely, you could strip out all 'tags' by stripping everything before a > and keeping everything before a <.
I'm assuming you also need to consider the situation where the text you receive is without tags: e.g. Text.
In pseudo-code:
returnText = ""
loop:
    gtI = text.IndexOf(">")
    ltI = text.IndexOf("<")
    if -1==gtI and -1==ltI:
        returnText += text
        we're done
    if gtI==-1:
        returnText += text up to position ltI
        return returnText
    if ltI==-1:
        returnText += text after gtI
        return returnText
    if ltI < gtI:
        returnText += textBefore ltI
        text = text after ltI
        loop
    // gtI < ltI:
    text = text after gtI
    loop
It's crude and can be done much better (and faster) with a custom coded parser, but essentially the logic would be the same.
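For reference, the pseudo-code above translates fairly directly into C#. This is just a sketch of that same crude logic; CrudeTagStripper and StripTags are made-up names. Note that it strips the tags from well-formed input too, so both examples from the question come out as Text.

```csharp
using System.Text;

public static class CrudeTagStripper
{
    public static string StripTags(string text)
    {
        var result = new StringBuilder();

        while (true)
        {
            int gt = text.IndexOf('>');
            int lt = text.IndexOf('<');

            if (gt == -1 && lt == -1)
            {
                result.Append(text);                    // no tag characters left
                return result.ToString();
            }
            if (gt == -1)
            {
                result.Append(text.Substring(0, lt));   // keep text before a dangling '<'
                return result.ToString();
            }
            if (lt == -1)
            {
                result.Append(text.Substring(gt + 1));  // keep text after a dangling '>'
                return result.ToString();
            }
            if (lt < gt)
            {
                result.Append(text.Substring(0, lt));   // keep text before the tag opens
                text = text.Substring(lt + 1);
            }
            else
            {
                text = text.Substring(gt + 1);          // drop everything up to and including '>'
            }
        }
    }
}
```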
You should really be asking why the API returns only part of what you require: I can't see why it should be returning ext</span> either, which really messes you up.

Searching for each line of one text file in another text file, faster

Is there a faster way to search each line of one text file for occurrence in another text file, than by going line by line in both files?
I have two text files - one has ~2500 lines (let's call it TxtA), the other has ~86000 lines(TxtB). I want to search TxtB for each line in TxtA, and return the line in TxtB for each match found.
I currently have this setup as: for each line in TxtA, search TxtB line by line for a match. However this is taking a really long time to process. It seems like it would take 1-3 hours to find all the matches.
Here is my code...
private static void getGUIDAndType()
{
try
{
Console.WriteLine("Begin.");
System.Threading.Thread.Sleep(4000);
String dbFilePath = @"C:\WindowsApps\CRM\crm_interface\data\";
StreamReader dbsr = new StreamReader(dbFilePath + "newdbcontents.txt");
List<string> dblines = new List<string>();
String newDataPath = @"C:\WindowsApps\CRM\crm_interface\data\";
StreamReader nsr = new StreamReader(newDataPath + "HolidayList1.txt");
List<string> new1 = new List<string>();
string dbline;
string newline;
List<string> results = new List<string>();
while ((newline = nsr.ReadLine()) != null)
{
//Reset
dbsr.BaseStream.Position = 0;
dbsr.DiscardBufferedData();
while ((dbline = dbsr.ReadLine()) != null)
{
newline = newline.Trim();
if (dbline.IndexOf(newline) != -1)
{//if found... get all info for now
Console.WriteLine("FOUND: " + newline);
System.Threading.Thread.Sleep(1000);
new1.Add(newline);
break;
}
else
{//the first line of db does not contain this line...
//go to next dbline.
Console.WriteLine("Lines do not match - continuing");
continue;
}
}
Console.WriteLine("Going to next new Line");
System.Threading.Thread.Sleep(1000);
//continue;
}
nsr.Close();
Console.WriteLine("Writing to dbc3.txt");
System.IO.File.WriteAllLines(@"C:\WindowsApps\CRM\crm_interface\data\dbc3.txt", results.ToArray());
Console.WriteLine("Finished. Press ENTER to continue.");
Console.WriteLine("End.");
Console.ReadLine();
}
catch (Exception ex)
{
Console.WriteLine("Error: " + ex);
Console.ReadLine();
}
}
Please let me know if there is a faster way. Preferably something that would take 5-10 minutes... I've heard of indexing but didn't find much on this for txt files. I've tested regex and it's no faster than indexof. Contains won't work because the lines will never be exactly the same.
Thanks.
There might be a faster way, but this LINQ approach should be faster than 3 hours and is a sight better to read and maintain:
var f1Lines = File.ReadAllLines(f1Path);
var f2LineInf1 = File.ReadLines(f2Path)
.Where( line => f1Lines.Contains(line))
.Select(line => line).ToList();
Edit: tested and required less than 1 second for 400000 lines in file2 and 17000 lines in file1. I can use File.ReadLines for the big file, which does not load everything into memory at once. For the smaller file I need to use File.ReadAllLines, since Contains needs the complete list of lines of file 1.
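If exact line matches really are what you need, putting the smaller file's lines into a HashSet avoids scanning the whole array for every line of the big file. A minimal sketch under that assumption (LineMatcher and ExactMatches are made-up names; it does not help with the substring matching the question's code does via IndexOf):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineMatcher
{
    // Returns the lines of the big file that also occur, exactly, in the small file.
    public static List<string> ExactMatches(IEnumerable<string> smallFileLines,
                                            IEnumerable<string> bigFileLines)
    {
        // Hash lookups replace the linear scan that array Contains performs.
        var wanted = new HashSet<string>(smallFileLines, StringComparer.Ordinal);
        return bigFileLines.Where(line => wanted.Contains(line)).ToList();
    }
}
```

It could be called as LineMatcher.ExactMatches(File.ReadAllLines(f1Path), File.ReadLines(f2Path)), which still streams the big file.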
If you want to log the result in a third file:
File.WriteAllLines(logPath, f2LineInf1);
EDIT: Note that I'm assuming it's reasonable to read at least one file into memory. You may want to swap the queries below around to avoid loading the "big" file into memory, but even 86,000 lines at (say) 1K per line is going to be less than 2G of memory - which is relatively little to do something significant.
You're reading the "inner" file each time. There's no need for that. Load both files into memory and go from there. Heck, for exact matches you can do the whole thing in LINQ easily:
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
            join line2 in File.ReadLines(dbFilePath + "newdbcontents.txt")
            on line1 equals line2
            select line1;
var commonLines = query.ToList();
But for non-joins it's still simple; just read one file completely first (explicitly) and then stream the other:
// Eagerly read the "inner" file
var lines2 = File.ReadAllLines(dbFilePath + "newdbcontents.txt");
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
            from line2 in lines2
            where line2.Contains(line1)
            select line1;
var commonLines = query.ToList();
There's nothing clever here - it's just a really simple way of writing code to read all the lines in one file, then iterate over the lines in the other file and for each line check against all the lines in the first file. But even without anything clever, I strongly suspect it would perform well enough for you. Concentrate on simplicity, eliminate unnecessary IO, and see whether that's good enough before trying to do anything fancier.
Note that in your original code, you should be using using statements for your StreamReader variables, to ensure they get disposed properly. Using the above code makes it simple to not even need that though...
Quick and dirty because I've got to go... If you can do it in memory, try working with this snippet:
//string[] searchIn = File.ReadAllLines("File1.txt");
//string[] searchFor = File.ReadAllLines("File2.txt");
string[] searchIn = new string[] {"A","AB","ABC","ABCD", null, "", " "};
string[] searchFor = new string[] {"A","BC","BCD", null, "", " "};
foreach(string item in searchFor)
{
    // Entries of searchIn that equal item, contain it, or are contained in it.
    string[] matchingItems = Array.FindAll(searchIn, x => (x == item) || (!string.IsNullOrEmpty(x) && !string.IsNullOrEmpty(item) ? (x.Contains(item) || item.Contains(x)) : false));
}

Formatting HTML in C#

I have a variable in C# holding a string like this:
string myText="my text which contains <div>i am text inside div</div>";
Now I want to replace every "\n" (newline character) with "<br>" in this variable's data, except for text inside the div.
How do I do this?
Others have suggested using libraries such as HtmlAgilityPack. It is indeed a nice tool, but if you don't need HTML parsing functionality beyond what you have requested, a simple parser should suffice:
string ReplaceNewLinesWithBrIfNotInsideDiv(string input) {
    int divNestingLevel = 0;
    StringBuilder output = new StringBuilder();
    StringComparison comp = StringComparison.InvariantCultureIgnoreCase;
    for (int i = 0; i < input.Length; i++) {
        if (input[i] == '<') {
            if (i < (input.Length - 3) && input.Substring(i, 4).Equals("<div", comp)){
                divNestingLevel++;
            } else if (divNestingLevel != 0 && i < (input.Length - 5) && input.Substring(i, 6).Equals("</div>", comp)) {
                divNestingLevel--;
            }
        }
        if (input[i] == '\n' && divNestingLevel == 0) {
            output.Append("<br/>");
        } else {
            output.Append(input[i]);
        }
    }
    return output.ToString();
}
This should handle nested divs as well.
For something like this you will need to parse the HTML in order to distinguish the parts that you do want to make the replacement in from the ones you don't.
I suggest looking at the HTML agility pack - it can parse HTML fragments as well as malformed HTML. You can then query the resulting parse tree using XPath notation and do your replacement on the selected nodes.
That would require some fairly complicated RegEx, out of my league.
But you could try splitting the string:
string[] parts = myText.Split(new[] { "<div>", "</div>" }, StringSplitOptions.None);
for (int i = 0; i < parts.Length; i += 2) // only the even parts
    parts[i] = parts[i].Replace("\n", "<br>");
And then use a StringBuilder to re-assemble the parts.
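Putting the split and the re-assembly together might look roughly like this. It is only a sketch: DivAwareFormatter and ReplaceNewLinesOutsideDivs are made-up names, and it assumes the divs are written exactly as <div> and </div>, are balanced, and are not nested.

```csharp
using System;
using System.Text;

public static class DivAwareFormatter
{
    public static string ReplaceNewLinesOutsideDivs(string text)
    {
        // Even-numbered parts lie outside a div, odd-numbered parts inside one.
        string[] parts = text.Split(new[] { "<div>", "</div>" }, StringSplitOptions.None);
        var output = new StringBuilder();

        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 0)
            {
                output.Append(parts[i].Replace("\n", "<br>"));
            }
            else
            {
                // Split removed the tags, so put them back around the untouched content.
                output.Append("<div>").Append(parts[i]).Append("</div>");
            }
        }
        return output.ToString();
    }
}
```

For nested divs or divs with attributes, the character-by-character parser or a real HTML parser is the safer route.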
I would split the string on div, then look at the tokens. If a token doesn't start with "div", replace \n with <br> in it. If it does start with "div", find the closing div and split on that, then take the second token and do what you just did. Of course, you are going to have to keep appending the tokens to a master string. I'll code up a sample here in a few minutes...
Use the string.Replace() method like this:
myText = myText.Replace("\n", "<br>");
You could consider using the Environment.NewLine property to find the newline characters. Are you sure they are not \n\r or \r\n etc.?
You may have to pull the text inside the div out first if you don't want to parse that. Use a regex to find it and remove it, then do the Replace() as above, then put the strings back together.

C# writing out text files matching listbox and contents of another text file

I have a file created from a directory listing. For each item a user selects from a ListBox, the application reads the directory and writes out a file that matches all the contents. Once that is done, it goes through each item in the ListBox and copies out the items that match the ListBox selection. Example:
Selecting 0001 matches:
0001456.txt
0001548.pdf.
The code I am using isn't handling 0s very well and is giving bad results.
var listItems = listBox1.Items.OfType<string>().ToArray();
var writers = new StreamWriter[listItems.Length];
for (int i = 0; i < listItems.Length; i++)
{
writers[i] = File.CreateText(
Path.Combine(destinationfolder, listItems[i] + "ANN.TXT"));
}
var reader = new StreamReader(File.OpenRead(masterdin + "\\" + "MasterANN.txt"));
string line;
while ((line = reader.ReadLine()) != null)
{
for (int i = 0; i < listItems.Length; i++)
{
if (line.StartsWith(listItems[i].Substring(0, listItems[i].Length - 1)))
writers[i].WriteLine(line);
}
}
Advice for correcting this?
Another Sample:
I have 00001 in my listbox: it returns these values:
00008771~63.txt
00002005~3.txt
00009992~1.txt
00001697~1.txt
00000001~1.txt
00009306~2.txt
00000577~1.txt
00001641~1.txt
00001647~1.txt
00001675~1.txt
00001670~1.txt
It should only return:
00001641~1.txt
00001647~1.txt
00001675~1.txt
00001670~1.txt
00001697~1.txt
Or if someone could just suggest a better method for taking each line in my listbox searching for line + "*" and whatever matches writes out a textfile...
This is all based pretty much on the one example you gave, but I believe the problem is that when you are performing your matching, you are getting the substring of your list item value and chopping off the last character.
In your sample you are attempting to match files starting with "00001", but when you do the match you are getting substring starting at zero and value.length-1 characters, which in this case would be "0000". For example:
string s = "00001";
Console.WriteLine(s.Substring(0,s.Length-1));
results in
0000
So I think if you just changed this line:
if (line.StartsWith(listItems[i].Substring(0, listItems[i].Length - 1)))
writers[i].WriteLine(line);
to this
if (line.StartsWith(listItems[i]))
writers[i].WriteLine(line);
you would be in good shape.
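To see the difference on the sample data from the question, here is a tiny sketch (PrefixFilter and LinesStartingWith are made-up names). With the full prefix "00001" only the five expected lines match, while the truncated prefix "0000" matches every line in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixFilter
{
    // Returns the lines that begin with the given prefix, in their original order.
    public static List<string> LinesStartingWith(IEnumerable<string> lines, string prefix)
    {
        return lines.Where(line => line.StartsWith(prefix, StringComparison.Ordinal)).ToList();
    }
}
```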
Sorry if I misunderstood your question, but let's start with this:
string line = String.Empty;
string selectedValue = "00001";
List<string> matched = new List<string>();
StreamReader reader = new StreamReader(Path.Combine(masterdin, "MasterANN.txt"));
while((line = reader.ReadLine()) != null)
{
if(line.StartsWith(selectedValue))
{
matched.Add(line);
}
}
This will match all lines from your MasterANN.txt file that begin with "00001" and add them to a collection (later we'll work on writing this to a file, if required).
Does this clarify things?
