Regex that finds hyperlinks while excluding plain text - c#

I'm looking to scrape rapidshare.com links from websites. I have the following regular expressions to find links:
<a href=\"(http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4}))\"
http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4})
How can I write a regex that excludes text embedded in a tag and captures only the text between the tags, as in >here<?
I also have to bear in mind that not all links are inside href attributes. Some are just displayed in plain text.
Basically, is there a way to exclude patterns in a regex?
Thanks.

To capture the inner text of an anchor tag, while ignoring all attribute text of the tag, you'd use the pattern:
<a href="http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>
The [^>]* part matches everything else in your tag up until the end of the start tag.
The (.*?) performs a non-greedy capture of the inner text.
If you want to capture anchor tag links and non-anchor tag links, then those are really two separate problems. There's probably a regex for it, but it would be terribly complicated. You're better off simply looking for non-anchor-tag links separately with the simple regex:
[^'"]http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})
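As a rough illustration of the anchor-tag pattern above, here is a minimal C# sketch. The class name, method name, and sample HTML are hypothetical. It also makes two changes that are assumptions rather than part of the answer: the literal dots are escaped, and (.+) is narrowed to [^/"]+ so a match cannot run past the closing quote of the href.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorTextDemo
{
    // Groups: 1 = file id, 2 = file name, 3 = extension, 4 = inner text of the anchor.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s+href=""http://rapidshare\.com/files/(\d+)/([^/""]+)\.(\w{3,4})""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the inner text of every rapidshare anchor found in the HTML.
    public static List<string> InnerTexts(string html)
    {
        var result = new List<string>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            result.Add(m.Groups[4].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "<a href=\"http://rapidshare.com/files/123/movie.avi\" class=\"x\">Download here</a>";
        foreach (string text in InnerTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Attribute text such as class="x" is consumed by [^>]* and never reaches group 4, which is the point of the pattern.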

How about this? The last part matches anything except ', ", >, or a space:
http://rapidshare.com/files/(\d+)/([^'"> ]+)

How about something like:
/http:\/\/rapidshare.com\/files\/\d+\/[^<&"\s]+\.\w{3,4}/
I removed the capturing groups because I suspect you only added them to make sure the different parts matched; you can add them back if you really want those pieces parsed out.
You can expand the [^<&"\s] class. As written it excludes only whitespace, the < character that could start a tag, the & that starts an HTML entity, and the " that would end the href= value, but you could exclude any character that is not valid in a URI. This should cover links in inline text as well as those embedded in an href.
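A minimal C# sketch of this single-pattern approach, assuming you want each link reported once even when it appears both in an href and as the visible text. The class and method names are hypothetical, the dot in the host name is escaped, and ' is added to the excluded characters as an assumption for single-quoted attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RapidshareLinkDemo
{
    // Matches the URL wherever it appears: inside href="..." or in plain text.
    private static readonly Regex LinkPattern = new Regex(
        @"http://rapidshare\.com/files/\d+/[^<&""'\s]+\.\w{3,4}",
        RegexOptions.IgnoreCase);

    // Returns distinct links in order of first appearance.
    public static List<string> FindLinks(string html)
    {
        var seen = new HashSet<string>();
        var links = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            if (seen.Add(m.Value))
            {
                links.Add(m.Value);
            }
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample input: one link in an anchor, one in plain text.
        string html = "<a href=\"http://rapidshare.com/files/1/a.zip\">http://rapidshare.com/files/1/a.zip</a> "
                    + "see http://rapidshare.com/files/2/b.rar now";
        foreach (string link in FindLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```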

Related

Check if HTML contains anything visible

I want to check in C# whether an HTML document is "visibly empty". Checking InnerHtml is not enough, because the HTML could contain an image or an empty table.
Is there an efficient way to check whether an HTML document contains anything that renders as something other than whitespace?
The complicated way I'm thinking of is removing every html/body/p/br tag, all whitespace, and all nbsp entities, then checking whether anything is left.

Regex to remove Wiki Markup?

I'm trying to remove all Wiki markup from some text. I want to process the text but all the markup is messing it up. Do you know how to remove all the markup?
If regex isn't viable, I am using C#.
Example text: http://regexr.com/39fnb
Edit: I've come up with the following regex: ([[)|(]])|(Category:.)|({{.}})|(=+.+=)|([.*?])
It works at parsing some stuff, but not everything. For example, it can't parse the lines starting with | that have code. I tried adding something that could do that, but it didn't work.

Equivalent HTML code for Escape characters accepts in TextBlock

I need HTML character codes for escape characters that a TextBlock will accept and act on.
For example:
&#10; for \n
My requirement: I have an XML file with a field named Memo, and it needs to hold text like that in the image below.
For CCJS, I need a tab to center it, and likewise the rest of the text needs to be aligned.
XML tag:
memo="\tCCJS
\t==========
If the "CCJS" field is customized on the General Occurrence screen, then the same custamization should be made to the "CCJS Status" field on the conclusion block."
The above is just an example. I have more text like this, so I need a set of HTML codes that lets all of it be accepted in the XML and in the TextBlock.
I have gone through the list here, but I still haven't found a code for Tab. A full list of these codes would be helpful.
Thanks.
I've always just used the Line Feed character (it works in XAML, is HTML encoded, and is already included in your example):
Dec. = &#10;
Hex. = &#xA;
As an example:
<TextBlock Text="Line One&#10;Line Two"/>
Hope this helps.
PS: for your tabs, just get your spacing right and use Preserve Whitespace / xml:space="preserve".
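For the XML side, a small C# sketch may help show that numeric character references survive parsing. It assumes the memo is stored as an attribute, as in the question, and uses &#9; (the character reference for tab) alongside &#10;. The element name item and the class name are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class MemoDemo
{
    // Parses the XML and returns the decoded value of the memo attribute.
    public static string ReadMemo(string xml)
    {
        return (string)XElement.Parse(xml).Attribute("memo");
    }

    public static void Main()
    {
        // &#9; is a tab and &#10; is a line feed; the element name is hypothetical.
        string xml = "<item memo=\"&#9;CCJS&#10;&#9;==========\" />";
        string memo = ReadMemo(xml);

        // The parser turns the references into real control characters.
        Console.WriteLine(memo.Contains("\t"));
        Console.WriteLine(memo.Contains("\n"));
    }
}
```

Literal tabs and newlines typed directly into an attribute value are normalized to spaces by an XML parser, which is why the character references are the safer choice there.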

Best way to format randomly copied texts from other websites?

Problem:
My site allows users to copy and paste content from other files and documents, such as MS Word, and from websites (e.g. CNN.com) into the Rich TextEditor we provide. This Rich TextEditor supports (and we too have to support) pasting content with embedded styles, which brings in random styles, tags, and inline styles from the content's origin.
E.g. if you paste from an MS Word document, it brings H1, H2, P, UL/OL/LI, STRONG, I, EM, TABLE, etc. with their own styles. The same happens when you copy and paste from other web pages.
How To Format?
I am looking for the best way to handle the formatting of this kind of user-generated content. First, I need to keep the copied tags intact. Let's say an H1 was brought in by the user from MS Word: I have to keep it, yet style it on my own using the given corporate branding.
Another problem is that when you copy and paste from an external origin, some tags are not properly closed, which breaks my layout. How do we handle this?
For styles, I'm applying
.article * {
allKnownCSSProperties: myValues!important;
}
Any method would work. JavaScript or C# is preferred.
To strip out unwanted styles, a simple regex would suffice. In JavaScript:
/( style=['"][^'"]*['"])/g
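Since C# was also mentioned as acceptable, here is a rough C# equivalent of the same idea. The class and method names are hypothetical. It differs slightly from the JavaScript pattern in that it matches double-quoted and single-quoted values separately, so a value such as font-family:'Arial' inside double quotes is removed whole.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleStripper
{
    // Matches a style attribute with either a "double-quoted" or 'single-quoted' value,
    // including the whitespace in front of it.
    private static readonly Regex StyleAttribute = new Regex(
        @"\s+style\s*=\s*(""[^""]*""|'[^']*')",
        RegexOptions.IgnoreCase);

    // Removes every inline style attribute and leaves the tags themselves intact.
    public static string Strip(string html)
    {
        return StyleAttribute.Replace(html, "");
    }

    public static void Main()
    {
        // Hypothetical pasted markup.
        string pasted = "<h1 style=\"color:red; font-family:'Arial'\">Title</h1><p class=\"a\" style='margin:0'>Text</p>";
        Console.WriteLine(Strip(pasted));
    }
}
```

Like any regex over HTML, this would also remove text that merely looks like a style attribute outside of a tag, so treat it as a sketch rather than a sanitizer.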
I'd try to solve the problem of unclosed tags like this:
Parse the whole message and collect every opening tag that doesn't end with />, removing a tag from the collection when you find the same tag starting with </. Exclude tags that don't need a closing tag, then generate closing tags for everything still in the collection and place them at the end of your Rich TextEditor content. It may not work in some cases, or may look clumsy, but it's the first thing that comes to mind and it may help solve the problem.
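The tag-collecting idea could be sketched in C# roughly as follows. This is only an illustration of the approach described above, not a real HTML parser: the class name, the tag regex, and the short list of tags that need no closing tag are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TagCloser
{
    // Tags that never take a closing tag (a partial, assumed list).
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for a self-closing tag.
    private static readonly Regex Tag = new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>");

    public static string CloseOpenTags(string html)
    {
        var open = new List<string>();
        foreach (Match m in Tag.Matches(html))
        {
            string name = m.Groups[2].Value.ToLowerInvariant();
            bool isClosing = m.Groups[1].Value == "/";
            bool isSelfClosing = m.Groups[3].Value == "/";
            if (isSelfClosing || VoidTags.Contains(name)) continue;

            if (isClosing)
            {
                // Drop the most recent matching opening tag, if there is one.
                int index = open.LastIndexOf(name);
                if (index >= 0) open.RemoveAt(index);
            }
            else
            {
                open.Add(name);
            }
        }

        // Append closing tags for whatever is still open, innermost first.
        var result = new StringBuilder(html);
        for (int i = open.Count - 1; i >= 0; i--)
        {
            result.Append("</").Append(open[i]).Append('>');
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(CloseOpenTags("<ul><li>one</li><li>two"));
    }
}
```

As the answer warns, appending everything at the end can produce clumsy nesting when a tag was left open in the middle of the content.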

How to get the substring of a string that contains HTML tags using C# .net?

I have a string that contains HTML tags, and I need to shorten it but display it with the same formatting as in the original string.
It doesn't have to be exactly X characters long, but it should be short enough to be displayed inside a panel with a certain width and height.
Is there any way I can achieve this using C#?
What about using CSS, i.e. displaying the panel with a fixed height regardless of its content?
Thanks.
Example: I have the following panel containing a label that contains text with HTML tags:
I need to remove the scroll bar without making the panel longer, keeping this height and this width.
If you have the following HTML code:
<div class="div1"> Some Really Bold String </div>
You can use CSS to hide the scroll bars:
.div1 { overflow:hidden; height:200px; width:200px;}
The height and width values are just examples.
overflow:hidden does not let the content of the div expand outside the div.
You will find more information on overflow here.
You could use a regex to find the contents of the specific tag, then use .Substring to shorten the result.
An example could be:
<h1>head</h1>
<p>contents</p>
Regex could be:
<p\b[^>]*>(.*?)</p>
Result would be:
<p>contents</p>
Now just exclude the start and end tags, as they have a fixed length.
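In C#, the capture group already gives you the text between the tags, so the fixed-length trimming can be skipped. A minimal sketch, with hypothetical class and method names and an assumed maxLength parameter:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphShortener
{
    // Group 1 holds whatever sits between <p ...> and </p>.
    private static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*>(.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the first paragraph's contents cut to at most maxLength characters,
    // or null when the HTML has no <p> element.
    public static string FirstParagraph(string html, int maxLength)
    {
        Match m = Paragraph.Match(html);
        if (!m.Success) return null;

        string inner = m.Groups[1].Value;
        return inner.Length <= maxLength ? inner : inner.Substring(0, maxLength);
    }

    public static void Main()
    {
        string html = "<h1>head</h1>\n<p>contents</p>";
        Console.WriteLine(FirstParagraph(html, 4));
    }
}
```

If the paragraph itself contains nested tags, a plain Substring can cut through the middle of one, so this only suits simple content.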
I found more interesting reading about changing the content between HTML tags. Take a read here (regex ftw!):
http://www.thatsquality.com/articles/how-to-match-and-replace-content-between-two-html-tags-using-regular-expressions
Another solution that might not drive you as crazy if you want to solve it in C#:
HTML Agility Pack
Take a look at the examples part of the site. Great little tool!
