I am currently creating an RSS feed linked to a custom-built news column. The news column uses a series of query strings to direct the user to a specific post or posts. However, the problem I am facing is that the RSS feed is replacing some of these query strings with random numbers. For instance:
http://www.correlatesearch.com/news.aspx?cat=BusinessManagementControls&nw=
&nw= is being replaced with
&#38;nw=
Can anyone direct me to a way around this?
Many thanks!
My guess is that you're looking at the raw RSS - which is XML. Within XML, & has to be escaped as &amp;. This is far from "random numbers".
I suspect you'll find that &nw= is actually being escaped to &amp;nw= - in which case it's not actually changing your content at all. It's representing the text of your URL in an XML-appropriate way. When the XML is read by a client, it will (or should) understand it appropriately.
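The round trip described above can be checked with a minimal sketch using LINQ to XML. The class name, method names, and the example.com URL are hypothetical and only serve the illustration; the point is that the ampersand is escaped when the URL is written as XML and restored when the XML is parsed.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper, not part of the original feed code.
public static class FeedLinkEscapingSketch
{
    // Writes the URL as the text of a <link> element.
    // The XML writer escapes '&' as "&amp;" on the way out.
    public static string ToXml(string url)
    {
        return new XElement("link", url).ToString();
    }

    // Parses the element back. The parser unescapes "&amp;" to '&',
    // so the consumer sees the original URL again.
    public static string FromXml(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

Calling FeedLinkEscapingSketch.ToXml on a URL that contains &nw= yields markup containing &amp;nw=, and FromXml on that markup returns the URL unchanged, which is why a well-behaved feed reader should still follow the right link.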
Is the feed an XML document? Then the replacement should take place. It is called escaping character entities.
And I don't see any "random numbers" that you referred to...
Related
I'm using an xml reader to parse some xml and I'm wondering if I can have it read in a character entity reference as straight text rather than converting it to the actual character. So if I called ReadInnerXml() on the node:
<param name="id">don&apos;t convert this</param>
I would get "don&apos;t convert this" as opposed to what I'm currently getting, which is "don't convert this". This is necessary as any characters or character entity references should be handed back the way they came due to them being legacy content.
Any help appreciated!
No, I don't know of any XML parser that has this feature. The job of an XML parser is to parse the input, and that's what it will do.
If you can't fix the consumer of this process to handle XML properly, your best bet might be to preprocess the text by replacing & by (say) § so it doesn't mean anything special to the XML parser.
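A minimal sketch of that preprocessing idea follows, assuming the placeholder character § never occurs in the legacy content (the code checks for this and refuses to continue otherwise). The class and method names are hypothetical, and the &apos; reference in the usage note is only an illustrative input.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper illustrating the placeholder approach.
public static class EntityPreservingSketch
{
    // U+00A7 (section sign); assumed not to appear in the real content.
    private const string Placeholder = "\u00A7";

    // Returns the element's text with entity references left unexpanded.
    public static string ReadValueVerbatim(string rawXml)
    {
        if (rawXml.Contains(Placeholder))
            throw new ArgumentException("Input already contains the placeholder character.");

        // Hide every '&' so the parser sees no entity references at all.
        string masked = rawXml.Replace("&", Placeholder);

        // Parse as usual; the masked references pass through as plain text.
        string value = XElement.Parse(masked).Value;

        // Put the ampersands back so the references look as they did originally.
        return value.Replace(Placeholder, "&");
    }
}
```

Passing an element whose text contains don&apos;t returns the text with &apos; still in place. Note that the masking applies to the whole document, so attribute values are affected in the same way.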
I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression to extract the content you want from a string, such as your HTML string.
Or you can use a DOM parser such as Html Agility Pack.
Hope this helps!
You could use something like this -
// Requires: using System.Linq; using System.Text.RegularExpressions;
var text = "12 hello 45 yes 890 bye 999";
var matches = Regex.Matches(text, @"\d+").Cast<Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
If the page is well-formed XML, you could use LINQ to XML: load the page into an XDocument, use XPath or another way of traversing to the element(s) you want, and load what you need into your array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution: any subtle change that breaks the well-formedness of the XML will break your code. If that's the case, you're probably better off using regular expressions. Either way, the page could be changed under you and your code would suddenly stop working.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.
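For the well-formed case, the LINQ to XML route might look like the sketch below. The class name, method name, and the element name passed in are hypothetical; a real XHTML page that declares a default namespace would need namespace-qualified names, which this sketch does not handle.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; only works when the page is well-formed XML.
public static class TitleExtractionSketch
{
    // Collects the text of every element with the given name, in document order.
    // Entity references such as &amp; are decoded by the parser, so special
    // characters arrive intact in the returned strings.
    public static List<string> ExtractTitles(string wellFormedPage, string elementName)
    {
        XDocument doc = XDocument.Parse(wellFormedPage);
        return doc.Descendants(elementName)
                  .Select(e => e.Value)
                  .ToList();
    }
}
```

XDocument.Parse throws an XmlException when the markup is not well-formed, which is exactly the brittleness mentioned above.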
I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US' (unit separator). It is contained within fully formed tags.
The data is extracted from an Oracle DB using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
The US separator sits between the C and the h in the bold part; when pasted here it shows up as a space. How can I ignore that character in an XML string?
If you look at section 2.2 of the XML 1.0 recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents: below 0x20, only tab (0x09), line feed (0x0A), and carriage return (0x0D) are permitted. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
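For the smaller problem, the preprocessing step mentioned above could be sketched as follows. It relies on XmlConvert.IsXmlChar, which is available from .NET Framework 4.0 onwards; the class and method names are hypothetical.

```csharp
using System.Linq;
using System.Xml;

// Hypothetical helper for cleaning a string before handing it to an XML parser.
public static class XmlSanitizerSketch
{
    // Drops every char that is not a legal XML 1.0 character, such as 0x1F.
    // Caveat: this checks UTF-16 code units one at a time, so characters
    // outside the Basic Multilingual Plane (surrogate pairs) are dropped too.
    public static string RemoveInvalidXmlChars(string text)
    {
        return new string(text.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }
}
```

The cleaned string can then be wrapped in a StringReader and passed to DataSet.ReadXml. This only hides the symptom; the question of why the character is in the data remains.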
Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);
I'm currently working on an ASP.NET website where I want to retrieve data from an RSS feed. I can easily retrieve the data I want and get it to show in, for example, a Repeater control.
My problem is, that the blog (Wordpress) that I'm getting the RSS from uses \n for linebreaks which I obviously can't use in HTML. I need to replace these \n with a <br /> tag.
What I've done so far is:
SyndicationFeed myFeed = SyndicationFeed.Load(XmlReader.Create("urltofeed/"));
IEnumerable<SyndicationItem> items = myFeed.Items;
foreach (SyndicationItem item in items)
{
    Feed f = new Feed();
    f.Content = f.ConvertLineBreaks(item.Summary.Text);
    f.Title = item.Title.Text;
    feedList.Add(f);
}
rptEvents.DataSource = feedList;
rptEvents.DataBind();
Then having a Feed class with two properties: Title and Content and a helper-method to replace \n with <br />
However, I'm not sure if this is a good/pretty approach to get data from an RSS feed?
Thanks in advance,
Bo
If you are averse to all the XML parsing in your code, you can also run the RSS XML schema through xsd and generate a topic and feed class in your code.
These classes should serialize/deserialize to XML. This may be overkill, but it's worked great for me when integrating with a standard XML API for a third party.
Does it have anything to do with the type of rss feed you're consuming?
http://codex.wordpress.org/WordPress_Feeds
If the content doesn't come back from the RSS feed formatted as you wish, you have little other choice. That's definitely not a particularly bad way.
This is a sane approach in my opinion.
Remember that RSS is a data format and not HTML. Replacing \n with <br /> in order to get the HTML you want is just something which has to be done every now and then. (Unless you want to use <pre>)
I find people stuffing way too much html in xml structures, not considering they will be used for other things than web pages. UI and data should be separated.
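The replacement itself is small. Here is a minimal sketch of such a helper, written as a standalone static method rather than the instance method on the question's Feed class; the class name is hypothetical, and handling of \r\n is an added assumption in case the feed uses Windows line endings.

```csharp
// Hypothetical standalone version of the line-break helper.
public static class LineBreakSketch
{
    // Turns line breaks in plain feed text into HTML <br /> tags.
    public static string ConvertLineBreaks(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Normalise Windows line endings first so each break yields one tag.
        return text.Replace("\r\n", "\n").Replace("\n", "<br />");
    }
}
```

Keeping the conversion in the presentation layer, as here, leaves the stored feed data free of HTML, in line with the point about separating UI and data.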
I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed from the BBC website on load and then looks for keywords which either increment or decrease the percentage chance of DOOM.
It's a crazy little project, but maybe one day the classes will come in handy for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
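// ".*?" lazily matches everything up to the nearest closing tag; a negated character class such as [^</div>] would only exclude single characters.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + ".*?" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);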
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Would match paired HTML tags together with their content (use case-insensitive matching, since the pattern only lists A-Z).
Use this to remove them from your data.