I have the following string:
OK:<IDP RESULT="0" MESSAGE="some message" ID="oaisjd98asdh339wnf" MSGTYPE="Done"/>
I use this method to parse it and get the result:
public string MethodName(string capt)
{
var receivedData = capt.Split(' ').ToArray();
string _receivedReultValue = "";
foreach (string s in receivedData)
{
if (s.Contains('='))
{
string[] res = s.Split('=').ToArray();
if (res[0].ToUpper() == "RESULT")
{
string resValue = res[1];
resValue = resValue.Replace("\\", " ");
_receivedReultValue = resValue.Replace("\"", " ");
}
}
}
return _receivedReultValue.Trim();
}
Is there a better way to parse a string like this to extract the data?
What you have isn't all that bad. But because it's XML, you could do this:
class Program
{
static void Main(string[] args)
{
var capt = "OK:<IDP RESULT=\"0\" MESSAGE=\"some message\" ID=\"oaisjd98asdh339wnf\" MSGTYPE=\"Done\"/>";
var stream = new MemoryStream(Encoding.Default.GetBytes(capt.Substring(capt.IndexOf("<"))));
var kvpList = XDocument.Load(XmlReader.Create(stream))
.Elements().First()
.Attributes()
.Select(a => new
{
Attr = a.Name.LocalName,
Val = a.Value
});
}
}
That would give you an IEnumerable of that anonymous type.
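If only a few attributes are needed by name, the same idea can be sketched with XElement.Parse and a dictionary. The names IdpParser and ParseAttributes are hypothetical, and the sketch assumes the XML payload always begins at the first '<' in the string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class IdpParser
{
    // Hypothetical helper: returns the attribute name/value pairs of the root element.
    // Assumes everything before the first '<' is a status prefix such as "OK:".
    public static Dictionary<string, string> ParseAttributes(string capt)
    {
        int start = capt.IndexOf('<');
        if (start < 0)
            throw new FormatException("No XML element found in input.");

        XElement element = XElement.Parse(capt.Substring(start));
        return element.Attributes()
                      .ToDictionary(a => a.Name.LocalName, a => a.Value);
    }
}
```

With the string from the question, IdpParser.ParseAttributes(capt)["RESULT"] gives "0".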
You can use XDocument. Assuming that you remove the "OK:" at the beginning, you can do it like this:
static void Main(string[] args)
{
var str = "<IDP RESULT=\"0\" MESSAGE=\"some message\" ID=\"oaisjd98asdh339wnf\" MSGTYPE=\"Done\"/>";
var doc = XDocument.Parse(str);
var element = doc.Element("IDP");
Console.WriteLine("RESULT: {0}", element.Attribute("RESULT").Value);
Console.WriteLine("MESSAGE: {0}", element.Attribute("MESSAGE").Value);
Console.WriteLine("ID: {0}", element.Attribute("ID").Value);
Console.WriteLine("MSGTYPE: {0}", element.Attribute("MSGTYPE").Value);
Console.ReadKey();
}
EDIT: I tested the code above on .NET 4.5. For 3.5 I had to change it a bit:
static void Main(string[] args)
{
const string str = "<IDP RESULT=\"0\" MESSAGE=\"some message\" ID=\"oaisjd98asdh339wnf\" MSGTYPE=\"Done\"/>";
var ms = new MemoryStream(Encoding.ASCII.GetBytes(str));
var rdr = new XmlTextReader(ms);
var doc = XDocument.Load(rdr);
var element = doc.Element("IDP");
Console.WriteLine("RESULT: {0}", element.Attribute("RESULT").Value);
Console.WriteLine("MESSAGE: {0}", element.Attribute("MESSAGE").Value);
Console.WriteLine("ID: {0}", element.Attribute("ID").Value);
Console.WriteLine("MSGTYPE: {0}", element.Attribute("MSGTYPE").Value);
Console.ReadKey();
}
Sure. It looks like XML, so you can use normal XML methods for this.
If you remove the "OK:" prefix and add an XML declaration:
<?xml version="1.0" ?>
<IDP RESULT="0" MESSAGE="some message" ID="oaisjd98asdh339wnf" MSGTYPE="Done"/>
then it can be parsed by any XML parser. Try xmllint to check it.
You can use Regex to obtain all key/value pairs:
string str = @"OK:<IDP RESULT=""0"" MESSAGE=""some message"" ID=""oaisjd98asdh339wnf"" MSGTYPE=""Done""/>";
var matches = Regex.Matches(str, @"(?<Key>\w+)=""(?<Value>[^""]+)""");
Then you can access the RESULT attribute:
string result = null;
var match = matches.OfType<Match>()
.FirstOrDefault(m => m.Groups["Key"].Value == "RESULT");
if (match != null)
{
result = match.Groups["Value"].Value;
}
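One caveat with a pattern of this shape: [^""]+ requires at least one character, so an attribute with an empty value (for example MESSAGE="") would be skipped. A minimal sketch that uses * instead and collects the pairs into a dictionary follows. AttributeRegex and Extract are hypothetical names, and the sketch does not decode XML entities such as &quot; inside values.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeRegex
{
    // key="value" pairs; the value may be empty because of '*'.
    private static readonly Regex Pair =
        new Regex(@"(?<Key>\w+)=""(?<Value>[^""]*)""");

    // Hypothetical helper: returns every key/value pair found in the input.
    // If a key occurs more than once, the last occurrence wins.
    public static Dictionary<string, string> Extract(string input)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(input))
        {
            result[m.Groups["Key"].Value] = m.Groups["Value"].Value;
        }
        return result;
    }
}
```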
Try this; it is very simple:
string xml = @"OK:<IDP RESULT=""0"" MESSAGE=""some message"" ID=""oaisjd98asdh339wnf"" MSGTYPE=""Done""/>";
XElement xElement = XElement.Parse(new string(xml.Skip(3).ToArray()));
//for example message
var message = xElement.Attribute("MESSAGE").Value;
I'm trying to program an API for discord and I need to retrieve two pieces of information out of the HTML code of the web page https://myanimelist.net/character/214 (and other similar pages with URLs of the form https://myanimelist.net/character/N for integers N), specifically the URL of the Character Picture (in this case https://cdn.myanimelist.net/images/characters/14/54554.jpg) and the name of the character (in this case Youji Kudou). Afterwards I need to save those two pieces of information to JSON.
I am using HtmlAgilityPack for this, yet I can't quite figure it out. The following is my first attempt:
public static void Main()
{
var html = "https://myanimelist.net/character/214";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
foreach (var node in htmlNodes.Descendants("tr/td/div/a/img"))
{
Console.WriteLine(node.InnerHtml);
}
}
Unfortunately, this produces no output. If I followed the path correctly (which is probably the first mistake) it should be "tr/td/div/a/img". I get no errors, it runs, yet I get no output.
My second attempt is:
public static void Main()
{
var html = "https://myanimelist.net/character/214";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
var script = htmlDoc.DocumentNode.Descendants()
.Where(n => n.Name == "tr/td/a/img")
.First().InnerText;
// Return the data of spect and stringify it into a proper JSON object
var engine = new Jurassic.ScriptEngine();
var result = engine.Evaluate("(function() { " + script + " return src; })()");
var json = JSONObject.Stringify(engine, result);
Console.WriteLine(json);
Console.ReadKey();
}
But this also doesn't work.
How can I extract the required information?
EDIT:
I've made some progress and found a way to get the image link; it was rather simple. Now I'm stuck on finding the name of the character. Every character page has the same structure (only the number at the end of the URL changes), so I want to process many pages in a for loop. Here's what I tried:
for (int i = 1; i <= 1000; i++)
{
HtmlWeb web = new HtmlWeb();
var html = "https://myanimelist.net/character/" + i;
var htmlDoc = web.Load(html);
foreach (var item in htmlDoc.DocumentNode.SelectNodes("//*[@]"))
{
string n;
n = item.GetAttributeValue("src", "");
foreach (var item2 in htmlDoc.DocumentNode.SelectNodes("//*[@src and @alt='" + n + "']"))
{
Console.WriteLine(item2.GetAttributeValue("src", ""));
}
}
}
In the first foreach I try to find the name, which is always at the same position on the page (e.g. http://prntscr.com/o1uo3c and http://prntscr.com/o1uo91, and more specifically http://prntscr.com/o1xzbk), but I haven't worked out how yet, since the HTML structure doesn't give me anything obvious to select on. The second foreach searches for the image URL, which works now; n should hold the name, so that the lookup works for each different character.
I was able to extract the character name and image from https://myanimelist.net/character/214 using the following method:
public static CharacterData ExtractCharacterNameAndImage(string url)
{
//Use the following if you are OK with hardcoding the structure of <div> elements.
//var tableXpath = "/html/body/div[1]/div[3]/div[3]/div[2]/table";
//Use the following if you are OK with hardcoding the fact that the relevant table comes first.
var tableXpath = "/html/body//table";
var nameXpath = "tr/td[2]/div[4]";
var imageXpath = "tr/td[1]/div[1]/a/img";
var htmlDoc = new HtmlWeb().Load(url);
var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
Where CharacterData is defined as follows:
public class CharacterData
{
public string Name { get; set; }
public string ImageUrl { get; set; }
public string Url { get; set; }
}
Afterwards, the character data can be serialized to JSON using any of the tools from "How to write a JSON file in C#?", e.g. Json.NET:
var url = "https://myanimelist.net/character/214";
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
Which outputs
{
"Name": "Youji Kudou",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
If you would prefer the Name to include the Japanese in parentheses, replace GetDirectInnerText() with just InnerText, which results in:
{
"Name": "Youji Kudou (工藤耀爾)",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
Alternatively, if you prefer you could pull the character name from the document title:
var title = string.Concat(htmlDoc.DocumentNode.SelectNodes("/html/head/title").Select(n => n.InnerText.Trim()));
var index = title.IndexOf("- MyAnimeList.net");
if (index >= 0)
title = title.Substring(0, index).Trim();
How did I determine the correct XPath strings?
Firstly, using Firefox 66, I opened the debugger and loaded https://myanimelist.net/character/214 in the window with the debugging tools visible.
Next, following the instructions from How to find xpath of an element in firefox inspector, I selected the Youji Kudou (工藤耀爾) node and copied its XPath, which turned out to be:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
I then tried to select this node using SelectNodes()... and got a null result. But why? To determine this I created a debugging routine that would break the path into successively longer portions and determine where the failure occurs:
static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
Console.WriteLine("\nInput path: " + xpath);
var splitPath = xpath.Split('/');
for (int i = 2; i <= splitPath.Length; i++)
{
if (splitPath[i-1] == "")
continue;
var thisPath = string.Join("/", splitPath, 0, i);
Console.Write("Testing \"{0}\": ", thisPath);
var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
}
}
This output the following:
Input path: /html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
Testing "/html": result count = 1
Testing "/html/body": result count = 1
Testing "/html/body/div[1]": result count = 1
Testing "/html/body/div[1]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]": result count = null
As you can see, something goes wrong selecting the <tbody> path element. Manual inspection of the InnerHtml returned by selecting /html/body/div[1]/div[3]/div[3]/div[2]/table revealed that, for some reason, the server is not including the <tbody> tag when returning HTML to the HtmlWeb object -- possibly due to some difference in request header(s) provided by Firefox vs HtmlWeb. Once I omitted the tbody path element I was able to query for the character name successfully using:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]
A similar process provided the following working path for the image:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img
Since the two queries are finding contents in the same <table>, in my final code I selected the table only once in a separate step, and removed some of the hardcoding as to the specific nesting of <div> elements.
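The tbody fix above can be automated for paths copied from a browser inspector. One common reason for the mismatch is that browsers add an implicit <tbody> to tables when building the DOM, while the markup as delivered may not contain one. The following is a minimal sketch; XPathFixer and RemoveTbodySteps are hypothetical names, and it assumes the fetched HTML never contains an explicit <tbody>, which should be checked per site.

```csharp
using System;
using System.Linq;

public static class XPathFixer
{
    // Hypothetical helper: drops every "tbody" or "tbody[n]" step from a
    // slash-separated XPath, leaving all other steps untouched.
    public static string RemoveTbodySteps(string xpath)
    {
        var steps = xpath.Split('/')
            .Where(step => !(step == "tbody" ||
                             step.StartsWith("tbody[", StringComparison.Ordinal)));
        return string.Join("/", steps);
    }
}
```

Applied to the path copied from Firefox, this yields /html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4], the same path that was found to work by hand.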
Demo fiddle here.
All right, to finish this up: I've rounded out the code, gratefully assisted by dbc, and integrated nearly all of it into the project. In case someone has the same question later, here it is. For a given range of character numbers, it outputs the character names, links and images, writes them to a JSON file, and could be adapted for other websites.
using System;
using System.Linq;
using Newtonsoft.Json;
using HtmlAgilityPack;
using System.IO;
namespace SearchingHTML
{
public class CharacterData
{
public string Name { get; set; }
public string ImageUrl { get; set; }
public string Url { get; set; }
}
public class Program
{
public static CharacterData ExtractCharacterNameAndImage(string url)
{
var tableXpath = "/html/body//table";
var nameXpath = "tr/td[2]/div[4]";
var imageXpath = "tr/td[1]/div[1]/a/img";
var htmlDoc = new HtmlWeb().Load(url);
var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
public static void Main()
{
int max = 10000;
string fileName = @"C:\Users\path of your file.json";
Console.WriteLine("Environment version: " + Environment.Version);
Console.WriteLine("Json.NET version: " + typeof(JsonSerializer).Assembly.FullName);
Console.WriteLine("HtmlAgilityPack version: " + typeof(HtmlDocument).Assembly.FullName);
Console.WriteLine();
for (int i = 6; i <= max; i++)
{
try
{
var url = "https://myanimelist.net/character/" + i;
var htmlDoc = new HtmlWeb().Load(url);
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
using (TextWriter tsw = new StreamWriter(fileName, true))
{
tsw.WriteLine(json);
}
} catch (Exception ex) { Console.WriteLine("Skipping " + i + ": " + ex.Message); }
}
}
}
}
/*******************************************************************************************************************************
****************************************************IF TESTING IS REQUIRED*****************************************************
*******************************************************************************************************************************
*
* static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
Console.WriteLine("\nInput path: " + xpath);
var splitPath = xpath.Split('/');
for (int i = 2; i <= splitPath.Length; i++)
{
if (splitPath[i - 1] == "")
continue;
var thisPath = string.Join("/", splitPath, 0, i);
Console.Write("Testing \"{0}\": ", thisPath);
var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
}
}
*******************************************************************************************************************************
*********************************************FOR TESTING ENTER THIS INTO MAIN CLASS********************************************
*******************************************************************************************************************************
*
* var url2 = "https://myanimelist.net/character/256";
var data2 = ExtractCharacterNameAndImage(url2);
var json2 = JsonConvert.SerializeObject(data2, Formatting.Indented);
Console.WriteLine(json2);
var nameXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
var imageXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div[1]/a/img";
TestSelect(htmlDoc, nameXpathFromFirefox);
TestSelect(htmlDoc, imageXpathFromFirefox);
var nameXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]";
var imageXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img";
TestSelect(htmlDoc, nameXpathFromFirefoxFixed);
TestSelect(htmlDoc, imageXpathFromFirefoxFixed);
*******************************************************************************************************************************
*******************************************************************************************************************************
*******************************************************************************************************************************
*/
I am implementing a utility method to convert queryString to JsonString.
My code is as follows:
public static string GetJsonStringFromQueryString(string queryString)
{
var nvs = HttpUtility.ParseQueryString(queryString);
var dict = nvs.AllKeys.ToDictionary(k => k, k => nvs[k]);
return JsonConvert.SerializeObject(dict, new KeyValuePairConverter());
}
When I test with the following code:
var postString = "product[description]=GreatStuff" +
"&product[extra_info]=Extra";
string json = JsonHelper<Product>.GetJsonStringFromQueryString(postString);
I get
{
"product[description]":"GreatStuff",
"product[extra_info]":"Extra",
...
}
What I would like to get is
{
"product":{
"description": "GreatStuff",
"extra_info" : "Extra",
...
}
}
How can I achieve this without using System.Web.Script Assembly? (I am on Xamarin and have no access to that library)
You need to strip the product[...] wrapper from each key, keeping only the inner property name, and then nest the result under a product property to get what you want.
That is, you should pre-process the parsed query string like this before serializing it:
string queryString = "product[description]=GreatStuff" +
"&product[extra_info]=Extra";
var queryStringCollection = HttpUtility.ParseQueryString(queryString);
var cleanQueryStringDictionary = queryStringCollection.AllKeys
.ToDictionary
(
key => key.Replace("product[", string.Empty).Replace("]", string.Empty),
key => queryStringCollection[key]
);
var holder = new { product = cleanQueryStringDictionary };
string jsonText = JsonConvert.SerializeObject(holder);
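The snippet above hardcodes the product prefix. A more general variant can be sketched by grouping every parent[child]=value pair under its parent name. QueryStringNester and Nest are hypothetical names. The sketch handles only one level of nesting, skips keys without brackets, and splits the query string by hand rather than calling HttpUtility.

```csharp
using System;
using System.Collections.Generic;

public static class QueryStringNester
{
    // Hypothetical helper: groups "parent[child]=value" pairs by parent.
    public static Dictionary<string, Dictionary<string, string>> Nest(string queryString)
    {
        var result = new Dictionary<string, Dictionary<string, string>>();
        var pairs = queryString.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var pair in pairs)
        {
            var parts = pair.Split(new[] { '=' }, 2);
            string key = Decode(parts[0]);
            string value = parts.Length > 1 ? Decode(parts[1]) : "";

            int open = key.IndexOf('[');
            if (open <= 0 || !key.EndsWith("]", StringComparison.Ordinal))
                continue; // This sketch skips keys without a parent[child] shape.

            string parent = key.Substring(0, open);
            string child = key.Substring(open + 1, key.Length - open - 2);

            if (!result.TryGetValue(parent, out var children))
            {
                children = new Dictionary<string, string>();
                result[parent] = children;
            }
            children[child] = value;
        }
        return result;
    }

    // Form encoding uses '+' for spaces; percent escapes are decoded afterwards.
    private static string Decode(string s)
    {
        return Uri.UnescapeDataString(s.Replace('+', ' '));
    }
}
```

Passing the returned dictionary to JsonConvert.SerializeObject should then produce the nested object, since Json.NET writes dictionaries as JSON objects.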
How can I remove the HTML tags listed below using HtmlAgilityPack?
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(Description);
//markups to be removed
var markups = new List<string> { "br","ol","ul","li" };
thanks
You can use this method:
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
}
return cleaned;
}
//markups to be removed
var markups = new List<string> { "br", "ol", "ul", "li" };
var xpath = String.Join(" | ", markups.Select(x => "//" + x));
var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
if (nodes != null)
{
foreach (var node in nodes)
{
node.Remove();
}
}
I have the following code:
static void Main(string[] args)
{
XmlDocument xml = new XmlDocument();
xml.Load(@"C:\MR.xml");
XmlNodeList stations = xml.SelectNodes("//FileDump/Message/Attachment");
var Message_ID = xml.SelectSingleNode("//FileDump/Message/MsgID").InnerXml;
Console.WriteLine("Message ID is :{0}", Message_ID);
foreach (XmlNode station in stations)
{
var File_Name = station.SelectSingleNode("FileName").InnerXml;
var File_ID = station.SelectSingleNode("FileID").InnerXml;
}
}
FileID and FileName do not always exist in some files. How can I avoid NullReferenceExceptions in this case?
If that check has to happen in a lot of places, I would try something like this to keep the code simple and clear:
public static class Helpers
{
public static string GetInnerXml(this XmlNode node, string innerNodeName)
{
string innerXml = "";
XmlNode innerNode = node.SelectSingleNode(innerNodeName);
if (innerNode != null)
{
innerXml = innerNode.InnerXml;
}
return innerXml;
}
}
and use it like this
static void Main(string[] args)
{
XmlDocument xml = new XmlDocument();
xml.Load(@"C:\MR.xml");
XmlNodeList stations = xml.SelectNodes("//FileDump/Message/Attachment");
var Message_ID = xml.GetInnerXml("//FileDump/Message/MsgID");
Console.WriteLine("Message ID is :{0}", Message_ID);
foreach (XmlNode station in stations)
{
var File_Name = station.GetInnerXml("FileName");
var File_ID = station.GetInnerXml("FileID");
}
}
You could do something like:
string File_Name = "";
string File_ID = "";
if (station.SelectSingleNode("FileName") != null)
File_Name = station.SelectSingleNode("FileName").InnerXml;
if (station.SelectSingleNode("FileID") != null)
File_ID = station.SelectSingleNode("FileID").InnerXml;
Then continue processing only if the variables are not the empty string ("").
static void Main(string[] args)
{
XmlDocument xml = new XmlDocument();
xml.Load(@"C:\MR.xml");
XmlNodeList stations = xml.SelectNodes("//FileDump/Message/Attachment");
var Message_ID = xml.SelectSingleNode("//FileDump/Message/MsgID").InnerXml;
Console.WriteLine("Message ID is :{0}", Message_ID);
foreach (XmlNode station in stations)
{
var fileNameNode = station.SelectSingleNode("FileName");
var fileIdNode = station.SelectSingleNode("FileID");
var File_Name = fileNameNode == null ? (string)null : fileNameNode.InnerXml;
var File_ID = fileIdNode == null ? (string)null : fileIdNode.InnerXml;
}
}
I usually use extension methods for handling unexpected nulls.
public static string GetValueIfNotNull(this XmlAttribute xmlAttribute)
{
if (xmlAttribute == null)
{
return null;
}
return xmlAttribute.Value;
}
Then I can do myElement.Attributes["someAttribute"].GetValueIfNotNull(); where myElement is an XmlElement (the indexer returns null when the attribute is missing).
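On C# 6 or later, the same null handling can be written more compactly with the null-conditional and null-coalescing operators. This is a minimal sketch; XmlNullSafe and InnerXmlOrDefault are hypothetical names, and the empty-string fallback is only one possible choice.

```csharp
using System.Xml;

public static class XmlNullSafe
{
    // Hypothetical helper: returns the InnerXml of the first node matching xpath,
    // or the fallback when no such node exists.
    public static string InnerXmlOrDefault(XmlNode node, string xpath, string fallback = "")
    {
        return node.SelectSingleNode(xpath)?.InnerXml ?? fallback;
    }
}
```

Inside the foreach loop from the question, this could be called as XmlNullSafe.InnerXmlOrDefault(station, "FileName").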
I'm trying to make a Windows RT program and I can't seem to figure out how to get the value of the root element. The XML document only contains:
<double>0.7423</double>
How would I go about getting the value "0.7423" using C# in a Windows Store app? Every time I try something it returns a null value.
This is what I've tried so far:
var getRate = from query in xmlDoc.Descendants("double")
select new
{
Rate = query.Value
};
foreach (var query in getRate)
{
rate = Convert.ToDouble(query.Rate);
}
I also tried this:
var rate= xmlDoc.Root.Element("double").Value;
var rate= xmlDoc.Element("double").Value;
rate = (double)XElement.Load(xmlstream);
But rate always returns a null value.
Try this
string xml = "<double>0.7423</double>";
var document = XDocument.Parse(xml);
var doubleValue = document.Descendants("double").FirstOrDefault().Value;
You can access the root element of the document via the Root property:
double d = (double)XDocument.Load(path_to_xml).Root;
But in this case you don't even need to create a document. You can load the element directly:
double d = (double)XElement.Load(path_to_xml);
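When the XML is already in a string rather than a file, the same explicit conversion can be applied to XElement.Parse. This is a small sketch; RateReader and ReadRate are hypothetical names. As far as I know the conversion goes through XmlConvert, so it expects a '.' decimal separator regardless of the current culture.

```csharp
using System;
using System.Xml.Linq;

public static class RateReader
{
    // Hypothetical helper: parses a single-element document such as
    // "<double>0.7423</double>" and converts the element's text to a double.
    public static double ReadRate(string xml)
    {
        return (double)XElement.Parse(xml);
    }
}
```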
Not tested, but I think this is a good way:
XmlDocument doc = new XmlDocument();
doc.Load("path");
string value = doc.DocumentElement.InnerText;
Quite obvious:
internal class Program
{
private static void Main(string[] args)
{
var xml = "<double>0.7423</double>";
Debug.WriteLine("Method1: {0}", Method1(xml));
Debug.WriteLine("Method2: {0}", Method2(xml));
Debug.WriteLine("Method3: {0}", Method3(xml));
}
private static double Method1(string xml)
{
var xdoc = XDocument.Parse(xml);
var doubleStr = xdoc.Root.Value;
var doubleValue = double.Parse(doubleStr, CultureInfo.InvariantCulture);
return doubleValue;
}
private static double Method2(string xml)
{
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
return double.Parse(xmlDoc.FirstChild.InnerText, CultureInfo.InvariantCulture);
}
private static double Method3(string xml)
{
var doubleStr = xml.Substring(
xml.IndexOf(">") + 1,
xml.IndexOf("</") - xml.IndexOf(">") - 1
);
return double.Parse(doubleStr, CultureInfo.InvariantCulture);
}
}
Took me a while but it was simpler than I thought. Here's how I did it:
var xelement = XElement.Parse(outputtext);
rate = (double)xelement;
Thank you everyone for your help/suggestions!