Get raw text from markdown'ed text - c#

In my DB, I have text stored as Markdown. The same way SO does when showing excerpts of questions, I would like to get the first N characters of the text with all formatting removed. Of course, the MD -> HTML step must be avoided, and the work must be done on the Markdown text itself. Performance is a requirement. Thanks.

We store both representations of the text in the database:
Raw Markdown suitable for editing
HTML-ized version suitable for output
and when we display it, we use the HTML-ized output version and simply apply our standard HTML stripping algorithms.

Forgive me if I'm misunderstanding (or simply under-understanding) what you need to do here, but it occurs to me that if there are more reads (page views) than inserts (additions of new markdown'ed records) in this database, then from a performance standpoint you may make the biggest gain by saving a version of the text with all markup stripped in a separate field in the database. That way your front end doesn't have to repeatedly parse what it reads from the database before displaying it in the browser; the text would be parsed only once, when a new record is added.
Whether or not this actually makes sense from a performance standpoint depends on a variety of variables specific to your situation... how big the text entries are, how often records are inserted versus read, etc.

The way that I would handle this is by defining a formatter interface for the class containing/representing your marked down text. You'd then have concrete implementations that support HTML formatting and plain text formatting. All you would need to do is inject the correct implementation and call the formatter.
Your plain text formatter could simply iterate through the characters in the string, copying characters until it hits some markdown. It would then skip the markdown and start outputting again when it hits the text.
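using System;

public interface IFormatter
{
    // Converts the raw Markdown source into the target representation.
    string Format(string markdown);
}

public class HtmlFormatter : IFormatter
{
    public string Format(string markdown)
    {
        // ...string translated to HTML...
        throw new NotImplementedException();
    }
}

public class PlainTextFormatter : IFormatter
{
    public string Format(string markdown)
    {
        // ...go through and remove all markdown and return rest...
        throw new NotImplementedException();
    }
}

public class Post
{
    public string Markdown { get; set; }
    public IFormatter Formatter { get; set; }

    public Post(string markdown, IFormatter formatter = null)
    {
        this.Markdown = markdown;
        // Fall back to HTML output when no formatter is injected.
        this.Formatter = formatter ?? new HtmlFormatter();
    }

    public string Format()
    {
        return this.Formatter.Format(this.Markdown);
    }
}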
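The character-by-character idea described above might look roughly like the following. This is only a minimal sketch: MarkdownExcerpt and Take are hypothetical names, and the code recognizes just a handful of markers (emphasis, backticks, headings, block quotes, inline links). It is not a real Markdown parser, so it will also eat literal characters such as the underscore in snake_case.

```csharp
using System.Text;

public static class MarkdownExcerpt
{
    // Hypothetical helper: copies at most maxLength plain characters,
    // skipping a small, assumed subset of Markdown syntax.
    public static string Take(string markdown, int maxLength)
    {
        var result = new StringBuilder();
        int i = 0;
        while (i < markdown.Length && result.Length < maxLength)
        {
            char c = markdown[i];
            if (c == ']' && i + 1 < markdown.Length && markdown[i + 1] == '(')
            {
                // Inline link: keep the link text, skip the "(url)" part.
                int close = markdown.IndexOf(')', i);
                i = close < 0 ? markdown.Length : close + 1;
            }
            else if ("*_`#>[]".IndexOf(c) >= 0)
            {
                i++; // emphasis, code, heading, quote, and bracket markers
            }
            else if (char.IsWhiteSpace(c) &&
                     (result.Length == 0 || result[result.Length - 1] == ' '))
            {
                i++; // drop leading whitespace and collapse runs of it
            }
            else
            {
                result.Append(char.IsWhiteSpace(c) ? ' ' : c);
                i++;
            }
        }
        return result.ToString();
    }
}
```

Because the loop stops as soon as N characters have been collected, the cost depends on the excerpt length rather than on the length of the whole post.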

Here is the path I'm taking: I will modify the markdown code so that, with a switch, I can either produce html or simple text. Once the excerpt has been generated, I will surely store it in the DB.
I won't tag any answer as the solution since there are many ways to do it. Everyone gets my vote ;)

Related

Traverse and HtmlEncode strings in Json.net C#

I struggle with safely encoding html-like text in json. The text should be written into a <textarea>, transferred by ajax to the server (.net45 mvc) and stored in a database in a json-string.
When transferring to the server, I get the famous "A potentially dangerous Request.Form value was detected" 500 server error. To avoid this message, I use the [AllowHtml] attribute on the model that is transferred. By doing so I open up an XSS vulnerability, in case anyone pastes in { "key1": "<script>alert(\"danger!\")</script>" }. As such, I would like to use something like
tableData.Json = AntiXssEncoder.HtmlEncode(json, true);
The problem is that I cannot do this on the full JSON string, as it would render something like
{ &quot;key1&quot;: ...}
which of course is not what I want. It should be more like
{ "key1": "&lt;script&gt;alert(&quot;danger!&quot;)&lt;/script&gt;" }
With this result the user can write whatever code they want, but I can prevent it from being rendered as HTML and just display it as ordinary text. Does anyone know how to traverse JSON with C# (Newtonsoft Json.NET) so that the strings can be encoded with AntiXssEncoder.HtmlEncode(... , ....);? Or am I on the wrong track here?
Edit:
The data is non-uniform, so deserialization into uniform objects is not an option.
The data will probably be opened to the public, so storing the data encoded would ease my soul.
If you already have the data as a JSON string, you could parse it into proper objects with something like Json.NET using JsonConvert.DeserializeObject() (or anything else, there are actually quite a few options to choose from). Once it's plain objects, you can go through them and apply any encoding you want, then serialize them again into a JSON string. You can also have a look at this question and its answers.
Another approach that you may take is just leave it alone until actually inserting stuff into the page DOM. You can store unencoded data in the database, you can even send it to the client without HTML encoding as JSON data (of course it needs to be encoded for JSON, but any serializer does that). You need to be careful not to generate it this way directly into the page source though, but as long as it's an AJAX response with text/json content type, it's fine. Then on the client, when you decide to insert it into the actual textarea, you need to make sure you insert it as text, and not html. Technically this could mean using jQuery's .text() instead of .html(), or your template engine's or client-side data binding solution's relevant method (text: instead of html: in Knockout, #: instead of #= in say Kendo UI, etc.)
The advantage of this latter approach is that when sending the data, the server (something like an API) does not need to know or care about where or how a client will use the data; it's just data. The client may need different encoding for an HTML or a JavaScript context, and the server cannot necessarily choose the right one.
If you know it's just that text area though where this data is needed, you can of course take the first (your original) approach, encode it on the server, that's equally good (some may argue that's even better in that scenario).
The problem with answering this question is that details count a lot. In theory, there are a myriad of ways you could do it right, but sometimes a good solution differs from a vulnerable one in one single character.
So this is the solution I went for. I added the [AllowHtml] attribute in the ViewModel, so that I could send raw html from the textarea (through ajax).
With this attribute I avoid the System.Web.HttpRequestValidationException that MVC gives to protect against XSS dangers.
Then I traverse the json-string by parsing it as a JToken and encode the strings:
using System.Collections.Generic;
using System.Web.Security.AntiXss;
using Newtonsoft.Json.Linq;

public class JsonUtils
{
    // HTML-encodes every string value in the JSON document.
    // Property names and non-string values are left untouched.
    public static string HtmlEncodeJTokenStrings(string jsonString)
    {
        var reconstruct = JToken.Parse(jsonString);

        // Iterative depth-first traversal of the token tree.
        var stack = new Stack<JToken>();
        stack.Push(reconstruct);
        while (stack.Count > 0)
        {
            var item = stack.Pop();
            if (item.Type == JTokenType.String)
            {
                var valueItem = item as JValue;
                if (valueItem == null)
                    continue;
                var value = valueItem.Value<string>();
                valueItem.Value = AntiXssEncoder.HtmlEncode(value, true);
            }
            foreach (var child in item.Children())
            {
                stack.Push(child);
            }
        }
        return reconstruct.ToString();
    }
}
The resulting json-string will still be valid and I store it in DB. Now, when printing it in a View, I can use the strings directly from json in JS.
When opening it again in another <textarea> for editing, I have to decode the html entities. For that I "stole" some js-code (decodeHtmlEntities) from string.js; of course adding the licence and credit note.
Hope this helps anyone.

Should I serialize my situation as a list or single objects?

I have a short question for more experienced developers.
I am using C# and I am in a bit of a dilemma.
I have a list of objects which I would like to serialize into XML format.
For this I am using the method shown below, which I have found helpful.
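// Requires: using System.IO; using System.Xml.Serialization;
public static void WriteToXmlFile<T>(string filePath, T objectToWrite, bool append = false) where T : new()
{
    var serializer = new XmlSerializer(typeof(T));

    // The using block closes the writer even if Serialize throws.
    // Note: appending a second document to the same file leaves it with
    // more than one root element, which is not well-formed XML.
    using (TextWriter writer = new StreamWriter(filePath, append))
    {
        serializer.Serialize(writer, objectToWrite);
    }
}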
Then I have a list of objects (normally a few hundred instances, but possibly up to a few thousand), each containing more or less the following members:
2-3 integers
1-2 DateTimes
2 bools
List<string> containing 1-5 short (1 word) strings
string containing 10-30 characters
string containing 30-1000 characters
and maybe a few more short strings.
I am wondering whether I should serialize the whole list into one XML file or whether it is better to serialize each object to a separate file. My main concerns are stability (by my count it should not reach any size limit for an XML file, but I am not sure) and performance (I am a bit afraid that separate files would multiply the time required). Maybe there are more aspects to consider.
I would be thankful for the opinion of some experts.
Which way should I follow ?
Regards !
That depends on how you are going to use the serialized results. If you never use the serialized objects separately and always need the whole list, go for one file. Otherwise, go for one file as well, but if testing shows that it causes performance issues, it would make sense to review your architecture and use XmlReader or another type of storage such as a database.
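As a rough illustration of the one-file option, the whole list can be passed to XmlSerializer in a single call. The Record class and RecordStore helper below are hypothetical stand-ins for the objects described in the question, not code from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical record type, loosely modelled on the fields listed in the question.
public class Record
{
    public int Id { get; set; }
    public DateTime Created { get; set; }
    public bool Active { get; set; }
    public List<string> Tags { get; set; }
    public string Description { get; set; }
}

public static class RecordStore
{
    // Writes the whole list as one XML document with a single root element.
    public static void Save(string path, List<Record> records)
    {
        var serializer = new XmlSerializer(typeof(List<Record>));
        using (var writer = new StreamWriter(path))
        {
            serializer.Serialize(writer, records);
        }
    }

    // Reads the whole list back in one call.
    public static List<Record> Load(string path)
    {
        var serializer = new XmlSerializer(typeof(List<Record>));
        using (var reader = new StreamReader(path))
        {
            return (List<Record>)serializer.Deserialize(reader);
        }
    }
}
```

With a few thousand small records, the resulting file should stay in the low megabytes, so measuring Save and Load on realistic data is probably the quickest way to see whether anything more elaborate is needed.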

c# : Loading typed data from file without casting

Is there a way to avoid casting to a non-string type when reading data from a text file containing exclusively integer values separated by integer markers ('0000' for example) ?
(Real-Life example : genetic DNA sequences, each DNA marker being a digit sequence.)
EDIT :
Sample data : 581684531650000651651561156843000021484865321200001987984948978465156115684300002148486532120000198798400009489786515611568430000214848653212000019879849480006516515611684531650000651651561156843000021 etc...
Unless I use a binary writer and read bytes rather than text (because that is how the data was written in the first place), I think this is a funky idea, so "NO" would be the straight answer.
I just wanted a definitive confirmation of that here, to be sure.
I welcome any intermediate solution for writing/reading this kind of data efficiently without having to code a custom reader GUI to display it intelligibly outside my app (in some generic reader/viewer).
The short answer is no, because a text file is a string of characters.
The long answer is sort of yes; if you put your data into a format like XML, a deserializer can implicitly cast the data back to the correct type (without you having to do it manually) based on your schema.
If you have control over the format, consider using a binary format for your file and use e.g. BinaryReader.ReadInt32.
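A minimal sketch of that binary approach is shown below. MarkerFile is a hypothetical helper name, and the code uses ReadInt64 (the 64-bit counterpart of ReadInt32) on the assumption that markers like those in the sample data would not fit in 32 bits. It also assumes a seekable stream, since it checks Position against Length.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MarkerFile
{
    // Writes each marker as a fixed-width 8-byte integer; no separators needed.
    public static void Write(Stream stream, IEnumerable<long> markers)
    {
        using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
        {
            foreach (long marker in markers)
            {
                writer.Write(marker);
            }
        }
    }

    // Reads the values back as typed integers, with no text parsing involved.
    public static List<long> Read(Stream stream)
    {
        var markers = new List<long>();
        using (var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true))
        {
            while (stream.Position < stream.Length)
            {
                markers.Add(reader.ReadInt64());
            }
        }
        return markers;
    }
}
```

The trade-off is the one raised in the question: the file is no longer readable in a generic text viewer.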
Rather than just casting, you really should use the .TryParse(...) method(s) of the types you are trying to read. This is a much more type-safe solution.
And to answer your question: other than using a binary file, there is not (to my knowledge) a way to do this without casting (or using the TryParse methods).
The only way to control the whole read process is to read bytes. Otherwise you read strings.
Edit: I didn't talk about automatic serialization via XML because of the details you gave on the file format.
If the data is text and you need to access it as an integer, a conversion will be required. The only question is which code does the conversion.
Depending upon the file format, you could look for classes or libraries that already handle them. Otherwise, keep your code well organized so you don't have to pay attention to the conversion too much.
Some options:
// Could throw exceptions (FormatException, OverflowException)
var a = Convert.ToInt32(text);
var b = Int32.Parse(text);
// Won't throw an exception, just check the result
int c = 0;
if (Int32.TryParse(text, out c)) { ... }

Linq To Text Files

I have a Text File (Sorry, I'm not allowed to work on XML files :(), and it includes customer records. Each text file looks like:
Account_ID: 98734BLAH9873
User Name: something_85
First Name: ILove
Last Name: XML
Age: 209
etc... And I need to be able to use LINQ to get the data from these text files and just store them in memory.
I have seen many examples of Linq to SQL and Linq to BLAH, but nothing for Linq to Text. Can someone please help me out a bit?
Thank you
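You can use code like this:
// Requires: using System.IO; using System.Linq;
var pairs = File.ReadAllLines("filename.txt")
    // Split on the first ':' only, so values that contain ':' stay intact.
    .Select(line => line.Split(new[] { ':' }, 2))
    .ToDictionary(cells => cells[0].Trim(), cells => cells[1].Trim());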
Or use the .NET 4.0 File.ReadLines() method to return an IEnumerable, which is useful for processing big text files.
The concept of a text file data source is extremely broad (consider that XML is stored in text files). For that reason, I think it is unlikely that such a beast exists.
It should be simple enough to read the text file into a collection of Account objects and then use LINQ-to-Objects.
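One way this could look is sketched below. The Account class, its property names, and the AccountParser helper are assumptions made for illustration; only the field labels ("Account_ID", "User Name", "Age") come from the sample in the question.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type built from the fields shown in the question.
public class Account
{
    public string AccountId { get; set; }
    public string UserName { get; set; }
    public int Age { get; set; }
}

public static class AccountParser
{
    // Turns the "Key: Value" lines of one customer file into an Account.
    public static Account Parse(IEnumerable<string> lines)
    {
        var fields = lines
            .Select(line => line.Split(new[] { ':' }, 2))
            .Where(parts => parts.Length == 2)   // ignore lines without a ':'
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());

        return new Account
        {
            AccountId = fields["Account_ID"],
            UserName = fields["User Name"],
            Age = int.Parse(fields["Age"])
        };
    }
}

// Possible usage, assuming one customer per file in a hypothetical folder:
//   var oldest = Directory.EnumerateFiles(folder, "*.txt")
//       .Select(path => AccountParser.Parse(File.ReadLines(path)))
//       .Where(account => account.Age > 100)
//       .ToList();
```

Once the accounts are plain objects, every LINQ-to-Objects operator (Where, OrderBy, GroupBy, and so on) applies without any text-specific provider.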
FileHelpers is a really great open source solution to this:
http://filehelpers.sourceforge.net/
You just declare a class with attributes, and FileHelpers reads the flat file for you:
[FixedLengthRecord]
public class PriceRecord
{
    [FieldFixedLength(6)]
    public int ProductId;

    [FieldFixedLength(8)]
    [FieldConverter(typeof(MoneyConverter))]
    public decimal PriceList;

    [FieldFixedLength(8)]
    [FieldConverter(typeof(MoneyConverter))]
    public decimal PriceOnePay;
}
Once FileHelpers gives you back an array of rows, you can use Linq to Objects to query the data.
We've had great success with it. I actually think Kaerber's solution is a nice, simple one; maybe stave off migrating to FileHelpers until you really need the extra power.

Can you represent CSV data in Google's Protocol Buffer format?

I've recently found out about protocol buffers and was wondering if they could be applied to my specific problem.
Basically I have some CSV data that I need to convert to a more compact format for storage, as some of the files are several gigabytes in size.
Each field in the CSV has a header, and there are only two types, strings and decimals (because sometimes there are a lot of significant digits and I need to handle all numbers the same way). But each file will have different column names for each field.
As well as capturing the original CSV data I need to be able to add extra information to the file before saving. And I was hoping to make this future proof by handling different file versions.
So, is it possible to use protocol buffers to capture a random number of randomly named columns of data, like a CSV file?
Well, it's certainly representable. Something like:
message CsvFile {
  repeated CsvHeader header = 1;
  repeated CsvRow row = 2;
}

message CsvHeader {
  required string name = 1;
  required ColumnType type = 2;
}

enum ColumnType {
  DECIMAL = 1;
  STRING = 2;
}

message CsvRow {
  repeated CsvValue value = 1;
}

// Note that the column is implicit based on position within row
message CsvValue {
  optional string string_value = 1;
  optional Decimal decimal_value = 2;
}

message Decimal {
  // However you want to represent it (there are various options here)
}
I'm not sure how much benefit it will provide, mind you... You can certainly add more information (add to the CsvFile message) and future proofing is in the "normal PB way" - only add optional fields, etc.
Well, protobuf-net (my version) is based on regular .NET types, so no (since it won't cope with different schemas all the time). But Jon's version might allow dynamic types. Personally, I'd just use CSV and run it through GZipStream - I expect that will be fine for the purpose.
Edit: actually, I forgot: protobuf-net does support extensible objects, but you need to be a bit careful... it would depend on the full context, I expect.
Plus Jon's approach of nested data would probably work too.
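For the "just use CSV and run it through GZipStream" suggestion above, a minimal sketch could look like this. CsvCompression and its method names are hypothetical; GZipStream, CompressionMode, and Stream.CopyTo are standard .NET APIs.

```csharp
using System.IO;
using System.IO.Compression;

public static class CsvCompression
{
    // Streams the CSV file through gzip without loading it all into memory.
    public static void Compress(string csvPath, string gzPath)
    {
        using (var input = File.OpenRead(csvPath))
        using (var output = File.Create(gzPath))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        {
            input.CopyTo(gzip);
        }
    }

    // Restores the original CSV from the compressed file.
    public static void Decompress(string gzPath, string csvPath)
    {
        using (var input = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = File.Create(csvPath))
        {
            gzip.CopyTo(output);
        }
    }
}
```

Because both directions are streamed, this should cope with multi-gigabyte files; how much space it saves depends on how repetitive the CSV content is.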
