Parsing XML for Converting Time Text Markup to WebVTT

I am working on a web application that can take in a subtitle file in either Timed Text Markup Language (TTML) or WebVTT format. If the file is TTML, I want to translate it to WebVTT. This is mostly not an issue; the one problem I'm having is that if the TTML has HTML as part of the text content, the HTML tags get dropped.
For example:
<p begin="00:00:08.18" dur="00:00:03.86">(Music<br />playing)</p>
results in:
(Musicplaying)
The code I use is:
private const string TIME_FORMAT = "hh\\:mm\\:ss\\.fff";

XmlDocument xmldoc = new XmlDocument();
xmldoc.Load(fileLocation);
XDocument xdoc = xmldoc.ToXDocument();
var ns = (from x in xdoc.Root.DescendantsAndSelf()
          select x.Name.Namespace).First();
List<TTMLElement> elements =
    (
        from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
        select new TTMLElement
        {
            text = item.Value,
            startTime = TimeSpan.Parse(item.Attribute("begin").Value),
            duration = TimeSpan.Parse(item.Attribute("dur").Value),
        }
    ).ToList<TTMLElement>();
StringBuilder sb = new StringBuilder();
sb.AppendLine("WEBVTT");
sb.AppendLine();
for (int i = 0; i < elements.Count; i++)
{
    sb.AppendLine(i.ToString());
    sb.AppendLine(elements[i].startTime.ToString(TIME_FORMAT) + " --> " + elements[i].startTime.Add(elements[i].duration).ToString(TIME_FORMAT));
    sb.AppendLine(elements[i].text);
    sb.AppendLine();
}
Any thoughts on what I'm missing, whether there is a better way of doing this, or whether a solution for converting Timed Text to WebVTT already exists would be appreciated. Thanks.

I finally came back to this project and I also found a solution to my problem.
First in this section:
from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
select new TTMLElement
{
text = item,
startTime = TimeSpan.Parse(item.Attribute("begin").Value),
endTime = item.Attribute("dur") != null ?
TimeSpan.Parse(item.Attribute("begin").Value).Add(TimeSpan.Parse(item.Attribute("dur").Value)) :
TimeSpan.Parse(item.Attribute("end").Value)
}
item is of type XElement so an XmlReader object can be created from it resulting in this function:
private static string ReadInnerXML(XElement parent)
{
var reader = parent.CreateReader();
reader.MoveToContent();
var innerText = reader.ReadInnerXml();
return innerText;
}
For my purposes of removing the html inside the node I modified the function to look like this:
private static string ReadInnerXML(XElement parent)
{
var reader = parent.CreateReader();
reader.MoveToContent();
var innerText = reader.ReadInnerXml();
innerText = Regex.Replace(innerText, "<.+?>", " ");
return innerText;
}
Finally, this results in the query above looking like this:
from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
select new TTMLElement
{
text = ReadInnerXML(item),
startTime = TimeSpan.Parse(item.Attribute("begin").Value),
endTime = item.Attribute("dur") != null ?
TimeSpan.Parse(item.Attribute("begin").Value).Add(TimeSpan.Parse(item.Attribute("dur").Value)) :
TimeSpan.Parse(item.Attribute("end").Value)
}
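Note that the regex above turns every tag, including <br />, into a space. If you would rather keep <br /> as a line break inside the cue, one alternative is to walk the child nodes instead of reading the inner XML. This is only a sketch, assuming the goal is "text plus line breaks"; the class and method names (TtmlText, ToCueText) are made up for illustration and are not part of my original code:

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class TtmlText
{
    // Hypothetical helper: flattens a TTML <p> element to cue text.
    // Text nodes are copied as-is, <br/> becomes a line break, and
    // every other tag (e.g. <span>) is dropped while its text is kept.
    public static string ToCueText(XElement p)
    {
        var sb = new StringBuilder();
        foreach (XNode node in p.DescendantNodes())
        {
            if (node is XText t)
                sb.Append(t.Value);
            else if (node is XElement e && e.Name.LocalName == "br")
                sb.Append('\n');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var p = XElement.Parse(
            "<p begin=\"00:00:08.18\" dur=\"00:00:03.86\">(Music<br />playing)</p>");
        Console.WriteLine(ToCueText(p));
    }
}
```

Comparing on Name.LocalName means the check works whether or not the br element carries the TTML namespace.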

Microsoft has a tool which generates both formats:
HTML5 Video Caption Maker
This demo allows you to create simple video caption files. Start by loading a video in a format your browser can play. Then alternately play and pause the video, entering a caption for each segment.
If you have a saved WebVTT or TTML caption file for your video, you may load it, edit the text of existing segments, or append new segments.
If you want to do it programmatically, answers to other questions may help.

Related

how to get value from xml by Linq

I am reading a huge XML file (5 GB) using the following code. I was able to get the first element, Testid, but failed to get another element, TestMin, which is under a different namespace.
This is the XML I have:
The value I get back is null.
What is wrong here?
EDIT
GMiley's answer gives an error like: The ':' character, hexadecimal value 0x3A, cannot be included in a name

The element es:qRxLevMin is a child element of xn:attributes, but it looks like you are trying to select it as a child of xn:vsDataContainer; it is actually a grandchild of that element. You could try changing the following:
var dataqrxlevmin = from atts in pin.ElementsAfterSelf(xn + "VsDataContainer")
select new
{
qrxlevmin = (string)atts.Element(es + "qRxLevMin"),
};
To this:
var dataqrxlevmin = from atts in pin.Elements(string.Format("{0}VsDataContainer/{1}attributes", xn, es))
select new
{
qrxlevmin = (string)atts.Element(es + "qRxLevMin"),
};
Note: I changed your string concatenation to use string.Format for readability. Either is technically fine to use, but string.Format is a better approach.
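One caveat: XContainer.Elements takes a single XName, not a path, so a two-step selection is normally written as chained Elements calls. The following is a minimal sketch of that pattern, not the asker's actual code: the namespace URIs, the sample XML and the names QRxLevMinDemo and ReadValues are placeholders, and it assumes the attributes element is in the xn namespace as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class QRxLevMinDemo
{
    // Placeholder namespace URIs; substitute the ones declared in the real file.
    static readonly XNamespace xn = "urn:example:xn";
    static readonly XNamespace es = "urn:example:es";

    // Hypothetical helper: child -> grandchild -> value, one XName per step.
    public static List<string> ReadValues(XElement pin)
    {
        return pin.Elements(xn + "VsDataContainer")
                  .Elements(xn + "attributes")
                  .Select(a => (string)a.Element(es + "qRxLevMin"))
                  .ToList();
    }

    public static void Main()
    {
        var pin = XElement.Parse(
            "<xn:root xmlns:xn='urn:example:xn' xmlns:es='urn:example:es'>" +
            "<xn:VsDataContainer><xn:attributes>" +
            "<es:qRxLevMin>-115</es:qRxLevMin>" +
            "</xn:attributes></xn:VsDataContainer>" +
            "</xn:root>");
        foreach (var value in ReadValues(pin))
            Console.WriteLine(value);
    }
}
```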

What about this approach?
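XDocument doc = XDocument.Load(path);
// The second argument of XName.Get is the namespace URI, not the prefix;
// replace "un" and "es" with the URIs those prefixes are bound to in the file.
XName utranCellName = XName.Get("UtranCell", "un");
XName qRxLevMinName = XName.Get("qRxLevMin", "es");
var cells = doc.Descendants(utranCellName);
foreach (var cell in cells)
{
    // The explicit cast yields the element's text, or null if no element was found
    string qRxLevMin = (string)cell.Descendants(qRxLevMinName).FirstOrDefault();
    // Do something with the value
}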

Try this code, which is very similar to yours but simpler.
using (XmlReader xr = XmlReader.Create(path))
{
    xr.MoveToContent();
    XNamespace un = xr.LookupNamespace("un");
    XNamespace xn = xr.LookupNamespace("xn");
    XNamespace es = xr.LookupNamespace("es");
    while (!xr.EOF)
    {
        if (xr.LocalName != "UtranCell")
        {
            xr.ReadToFollowing("UtranCell", un.NamespaceName);
        }
        if (!xr.EOF)
        {
            XElement utranCell = (XElement)XElement.ReadFrom(xr);
        }
    }
}

Actually, the namespace was the culprit. What I did was first load the small section I get from the ReadFrom method into an XDocument, then remove all the namespaces, and then take the value. Simple :)

How to replace the InnerText of a Comment

I've tried the following:
comment.InnerText=comment.InnerText.Replace(comment.InnerText,new_text);
Which doesn't work because we can only read the InnerText property. How do I effectively change the InnerText value so I can save the modifications to WordProcessing.CommentsPart.Comments and MainDocumentPart.Document ?
EDIT: DocumentFormat.OpenXml.Wordprocessing.Comment is comment's class.
EDIT 2: The method:
public void updateCommentInnerTextNewWorkItem(List<Tuple<Int32, String, String>> list){
    //DOCX.CDOC.Comments -> WordProcessingCommentsPart.Comments
    //DOCX._CIT -> Dictionary<int,string>
    foreach (var comm in DOCX.CDOC.Comments)
    {
        foreach (var item in list)
        {
            foreach (var item_cit in DOCX._CIT)
            {
                if (((Comment)comm).InnerText.Contains("<tag>") && item.Item3.Contains(item_cit.Value))
                {
                    comm.InnerXml = comm.InnerXml.Replace(comm.InnerText, item.Item1 + "");
                    //comm.InnerText.Replace(comm.InnerText,item.Item1+"");
                    //DOCX.CDOC.Comments.Save();
                    //DOCX.DOC.MainDocumentPart.Document.Save();
                }
                if (((Comment)comm).InnerText.Contains("<tag class") && item.Item3.Contains(item_cit.Value))
                {
                    //comm.InnerText.Replace(comm.InnerText, item.Item1 + "");
                    comm.InnerXml = comm.InnerXml.Replace(comm.InnerText, item.Item1 + "");
                    //DOCX.CDOC.Comments.Save();
                    //DOCX.DOC.MainDocumentPart.Document.Save();
                }
            }
        }
    }
    DOCX.CDOC.Comments.Save();
    DOCX.DOC.MainDocumentPart.Document.Save();
}

It's read-only because it returns the XML content with all XML tags removed. So setting it would strip it of all XML tags.
If the text you want to replace does not span tags you could just replace the text in the XML:
comment.InnerXml=comment.InnerXml.Replace(comment.InnerText,new_text);

It is not that easy (but still not complex). A comment has structure, just like the document's body: it can contain Paragraphs, Runs, etc. InnerText just returns the text values of all runs of all paragraphs in the comment, which is why you cannot simply set this value.
So first you have to remove all of the comment's paragraphs:
comment.RemoveAllChildren<Paragraph>();
Next step is to add new paragraph with run that contains text you need:
Paragraph paragraph = new Paragraph();
Run run = new Run();
Text text = new Text();
text.Text = "My comment";
run.Append(text);
paragraph.Append(run);
comment.Append(paragraph);
Finally, do not forget to save the changes:
doc.MainDocumentPart.WordprocessingCommentsPart.Comments.Save();

This is a little complex. I once had the same problem.
You will need the XmlElement class. For example, suppose there is a variable named xmlDoc that is an instance of XmlDocument.
Use the SelectSingleNode method to get a reference to the XmlNode you want to edit, then cast the XmlNode to XmlElement like this (suppose the XmlNode is named 'node'):
XmlElement XmlEle = (XmlElement)node;
Or, more simply, you can do it in one step:
XmlElement XmlEle = (XmlElement)xmlDoc.SelectSingleNode("dict/integer");
Now you can use the variable XmlEle to replace the InnerText, because it is just a reference to the node in the document.
Like this:
XmlEle.InnerText = TopNumber.ToString();

Don't use InnerXml; set the Text of each Text element instead:
foreach (Paragraph paragraph in document.MainDocumentPart.Document.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
    // Only touch paragraphs that contain a complete comment range
    bool ss = paragraph.InnerXml.Contains("commentRangeStart");
    bool ee = paragraph.InnerXml.Contains("commentRangeEnd");
    if (ss && ee)
    {
        foreach (Run run in paragraph.Elements<Run>())
        {
            foreach (Text text in run.Elements<Text>())
            {
                text.Text = "your word ";
            }
        }
    }
}

Parsing HTML page with HtmlAgilityPack using LINQ

How can I parse the HTML of a web page using LINQ and add values to a string? I am using the HtmlAgilityPack in a Metro application and would like to bring back 3 values and add them to a string.
Here is the URL: http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N
I would like to get the values for the following labels (see below):
"Balance:",
"Transactions in",
"Received"
WebResponse x = await req.GetResponseAsync();
HttpWebResponse res = (HttpWebResponse)x;
if (res != null)
{
if (res.StatusCode == HttpStatusCode.OK)
{
Stream stream = res.GetResponseStream();
using (StreamReader reader = new StreamReader(stream))
{
html = reader.ReadToEnd();
}
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
string appName = htmlDocument.DocumentNode.Descendants // not sure what t
string a = "Name: " + WebUtility.HtmlDecode(appName);
}
}

Please try the following. You might also consider pulling the table apart as it is a little better formed than the free-text in the 'p' tag.
Cheers, Aaron.
// download the site content and create a new html document
// NOTE: make this asynchronous etc when considering IO performance
var url = "http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N";
var data = new WebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);
// extract the transactions 'h3' title, the node we want is directly before it
var transTitle =
    (from h3 in doc.DocumentNode.Descendants("h3")
     where h3.InnerText.ToLower() == "transactions"
     select h3).FirstOrDefault();
// tokenise the summary, one line per 'br' element, split each line by the ':' symbol
var summary = transTitle.PreviousSibling.PreviousSibling;
var tokens =
    (from row in summary.InnerHtml.Replace("<br>", "|").Split('|')
     where !string.IsNullOrEmpty(row.Trim())
     let line = row.Trim().Split(':')
     where line.Length == 2
     select new { name = line[0].Trim(), value = line[1].Trim() });
// using LINQPad to debug, the Dump command writes the current variable to the output
tokens.Dump();
'Dump()' is a LINQPad command that writes the variable to the output pane. The following is a sample of the output from the Dump command:
Balance: 5 LTC
Transactions in: 2
Received: 5 LTC
Transactions out: 0
Sent: 0 LTC

The document you have to parse is not well formed for parsing: many elements are missing a class or at least an id attribute. What you want, though, is the content of the second p tag.
You can try this:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var pNodes = htmlDocument.DocumentNode.SelectNodes("//p")[1].InnerHtml
    .Split(new string[] { "<br />" }, StringSplitOptions.None)
    .Take(3)
    .ToArray(); // materialize the lines so they can be indexed below
string vl = "Balance:" + pNodes[0].Split(':')[1] + "Transactions in" + pNodes[1].Split(':')[1] + "Received" + pNodes[2].Split(':')[1];

Grab all text from html with Html Agility Pack

Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz - I want to get each text, not all at a time.

XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}

var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.

I was in the need of a solution that extracts all text but discards the content of script and style tags. I could not find it anywhere, but I came up with the following which suits my own needs:
StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
    Console.WriteLine(node.InnerText);
}

var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The specified example for html content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz

public string html2text(string html) {
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(@"<html><body>" + html + "</body></html>");
    return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).

https://github.com/jamietre/CsQuery
Have you tried CsQuery? Though it is not being actively maintained, it's still my favorite for converting HTML to text. Here's a one-liner showing how simple it is to get the text from HTML.
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
I understand that the OP asked for HtmlAgilityPack only, but CsQuery is a lesser-known alternative and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!

I combined and fixed some of the other answers to work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
    // keep only leaf text nodes that are not inside script or style elements
    if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
    {
        string text = node.InnerText?.Trim();
        if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
            sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
    }
}

How can I get the href attribute value out of an <?xml-stylesheet> node?

We are getting an XML document from a vendor that we need to perform an XSL transform on using their stylesheet so that we can convert the resulting HTML to a PDF. The actual stylesheet is referenced in an href attribute of the ?xml-stylesheet definition in the XML document. Is there any way that I can get that URL out using C#? I don't trust the vendor not to change the URL and obviously don't want to hardcode it.
The start of the XML file with the full ?xml-stylesheet element looks like this:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="http://www.fakeurl.com/StyleSheet.xsl"?>

As a processing instruction can have any contents it formally does not have any attributes. But if you know there are "pseudo" attributes, like in the case of an xml-stylesheet processing instruction, then you can of course use the value of the processing instruction to construct the markup of a single element and parse that with the XML parser:
XmlDocument doc = new XmlDocument();
doc.Load(@"file.xml");
XmlNode pi = doc.SelectSingleNode("processing-instruction('xml-stylesheet')");
if (pi != null)
{
    XmlElement piEl = (XmlElement)doc.ReadNode(XmlReader.Create(new StringReader("<pi " + pi.Value + "/>")));
    string href = piEl.GetAttribute("href");
    Console.WriteLine(href);
}
else
{
    Console.WriteLine("No pi found.");
}

Linq to xml code:
XDocument xDoc = ...;
var cssUrlQuery = from node in xDoc.Nodes()
where node.NodeType == XmlNodeType.ProcessingInstruction
select Regex.Match(((XProcessingInstruction)node).Data, "href=\"(?<url>.*?)\"").Groups["url"].Value;
or linq to objects
var cssUrls = (from XmlNode childNode in doc.ChildNodes
where childNode.NodeType == XmlNodeType.ProcessingInstruction && childNode.Name == "xml-stylesheet"
select (XmlProcessingInstruction) childNode
into procNode select Regex.Match(procNode.Data, "href=\"(?<url>.*?)\"").Groups["url"].Value).ToList();
xDoc.XPathSelectElement() will not work here, since it returns an XElement and a processing instruction (XProcessingInstruction) is not an element.

You can also use XPath. Given an XmlDocument loaded with your source:
XmlProcessingInstruction instruction = doc.SelectSingleNode("//processing-instruction(\"xml-stylesheet\")") as XmlProcessingInstruction;
if (instruction != null) {
Console.WriteLine(instruction.InnerText);
}
Then just parse InnerText with Regex.

To find the value using a proper XML parser you could write something like this:
using(var xr = XmlReader.Create(input))
{
while(xr.Read())
{
if(xr.NodeType == XmlNodeType.ProcessingInstruction && xr.Name == "xml-stylesheet")
{
string s = xr.Value;
int i = s.IndexOf("href=\"") + 6;
s = s.Substring(i, s.IndexOf('\"', i) - i);
Console.WriteLine(s);
break;
}
}
}

private string _GetTemplateUrl(XDocument formXmlData)
{
var infopathInstruction = (XProcessingInstruction)formXmlData.Nodes().First(node => node.NodeType == XmlNodeType.ProcessingInstruction && ((XProcessingInstruction)node).Target == "mso-infoPathSolution");
var instructionValueAsDoc = XDocument.Parse("<n " + infopathInstruction.Data + " />");
return instructionValueAsDoc.Root.Attribute("href").Value;
}
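The same idea (wrapping the instruction's data in a dummy element so the pseudo-attributes can be read as real attributes) can be applied to the xml-stylesheet instruction from the question. A minimal sketch, assuming the instruction's data is well-formed attribute syntax; the names StylesheetHref and GetHref are illustrative only:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class StylesheetHref
{
    // Hypothetical helper: returns the href pseudo-attribute of the first
    // xml-stylesheet processing instruction, or null if there is none.
    public static string GetHref(XDocument doc)
    {
        var pi = doc.Nodes()
                    .OfType<XProcessingInstruction>()
                    .FirstOrDefault(p => p.Target == "xml-stylesheet");
        if (pi == null)
            return null;

        // Parse the data as the attribute list of a throwaway element
        var pseudo = XElement.Parse("<pi " + pi.Data + " />");
        return (string)pseudo.Attribute("href");
    }

    public static void Main()
    {
        var doc = XDocument.Parse(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
            "<?xml-stylesheet type=\"text/xsl\" href=\"http://www.fakeurl.com/StyleSheet.xsl\"?>" +
            "<root />");
        Console.WriteLine(GetHref(doc));
    }
}
```

Returning null when the instruction is missing leaves it to the caller to decide whether that is an error.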

XmlProcessingInstruction stylesheet = doc.SelectSingleNode("processing-instruction('xml-stylesheet')") as XmlProcessingInstruction;
