I am building auto-correct for encoding names supplied as string input, and I want to build a regex for the encoding pattern.
For example:
var encoding = "utd-8";
Correct c = new Correct(encoding);
var result = c.Correct();
And the output is utf-8.
I have most of the work done (using some open-source code from some great people who wrote beautiful stuff). Can someone help, please?
UPDATE
What I need in the end is the regex pattern for a valid encoding name.
The user inputs an encoding name such as iso-8859-1, and the code checks whether it is valid.
You shouldn't decide on which technology to use before you have figured out how to solve the problem; are Regular Expressions really necessary?
If I understand your question correctly, you want to check whether the input string looks a lot like one of the supported encodings. Before writing a single line of code, you'll have to figure out:
Which encodings are you supporting? Are you supporting aliases (UTF-16 is the same as Unicode)?
How much is the input string allowed to be different from the chosen encoding (utd-8, utd-9, utd9, td9, 9)?
Given the input string "utf-36", would the output be UTF-16 or UTF-32?
Perhaps you can take a look at one of the string distance algorithms (for example, http://en.wikipedia.org/wiki/Levenshtein_distance) for inspiration on the subject. There are a ton of links in the "see also" section there.
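A minimal sketch of the string-distance idea, written in Python purely for illustration (the approach ports directly to C#). The SUPPORTED list, the suggest_encoding helper and the max_distance threshold are all assumptions made for this example, not part of any library:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical list of supported names; a real one would also cover aliases.
SUPPORTED = ["utf-8", "utf-16", "utf-32", "iso-8859-1"]


def suggest_encoding(name: str, max_distance: int = 2) -> list[str]:
    """Return the closest supported names, or [] if nothing is close enough."""
    name = name.strip().lower()
    scored = sorted((levenshtein(name, candidate), candidate) for candidate in SUPPORTED)
    best = scored[0][0]
    if best > max_distance:
        return []
    # Ties are returned together so the caller can decide how to resolve them.
    return [candidate for distance, candidate in scored if distance == best]


print(suggest_encoding("utd-8"))
print(suggest_encoding("utf-36"))
```

Returning every name at the minimum distance makes the "utf-36" ambiguity explicit: both utf-16 and utf-32 are one edit away, so the code cannot pick for you.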
Related
I'm scraping a social platform using Selenium, and a lot of users use special characters like HEᑕƘᏔ®✞ℍ, fire emojis and so on. These characters turn into question marks, like "HE?????????".
I've tried to use the decode and encode utilities but I've had absolutely no luck.
See here:
WebUtility.HtmlDecode(string);
WebUtility.HtmlEncode(string);
I get the feeling I'm barking up the wrong tree here, but have no idea where to start, as special character answers normally talk about Unicode, and I'm pretty sure this isn't relevant in this case.
EDIT:
This is how I'm fetching the content using selenium
title = driver.FindElement(By.XPath("//*[@id=\"header-section\"]/div[2]/div/div/div/div/div[1]/div/h1")).Text;
What you are looking at is HTML decoding and encoding, which replaces characters to make them HTML-safe; for example, £ becomes &pound;.
You want to look at text encoding instead, as this controls which characters are available, with different character sets giving you different characters. If a character is not available in the character set you are using, it shows as a question mark or a black block.
You can use Encoding.Convert() see this discussion for more info.
It is likely you will want to convert your input to UTF-8 text encoding to see the full character set.
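As a rough illustration of where the question marks come from, here is a small Python sketch (the .NET Encoding classes behave analogously). The sample string is an assumption standing in for the scraped title:

```python
# Two characters outside ASCII, written as escapes so the source file's own
# encoding does not matter.
title = "HE\u1455\u0198"

# A character set that lacks these characters substitutes '?' for each one.
as_ascii = title.encode("ascii", errors="replace")

# UTF-8 can represent every Unicode character, so the round trip is lossless.
as_utf8 = title.encode("utf-8")
round_tripped = as_utf8.decode("utf-8")

print(as_ascii)
print(round_tripped == title)
```

The loss happens at the point where text is converted to a narrower character set, so that conversion (console, file writer, database column) is the place to look.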
Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.
I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This has worked pretty smoothly until I reached the corner case of companies whose names end in s. I end up with issues where it sometimes creates an s's.
Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.
This is improper English.
I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.
Regex apostropheReplace = new Regex("s\\'s");
docText = apostropheReplace.Replace(docText, "s\'");
This does not work. It seems that Word is using a different character for an apostrophe (') than the standard one that is created when I use the key on my keyboard in Visual Studio. If I write a find and replace using my keyboard it will not work, but if I copy and paste the apostrophe from Word it does.
Regex apostrophyReplace = new Regex("s\\’s");
docText = apostrophyReplace.Replace(docText, "s\'");
Notice the different character in the Regex for the second one. I'm confused as to why this is, and also want to know if there is a proper way of doing this. I tried "'" but that does not work. I just want to know if using the copied character from Word is the proper way of doing this, and whether there is a way to do it so that both characters work, so I don't have an issue with docs that may be created with a different program.
The reason this happens is because they are different characters.
Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
I ran into the very same issue before and I used this as the regular expression: [\u2018\u2019\u201A\u201b\u2032']
So essentially modify your code to:
// The character class must not be preceded by a backslash, otherwise "[" is matched literally.
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201b\u2032']s");
docText = apostropheReplace.Replace(docText, "s\'");
I found these were the five most common types of single quotes and apostrophes used.
And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
Answering the question:
Is there a way to do it so that both characters work?
If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:
Regex apostropheReplace = new Regex("s['’]s");
docText = apostropheReplace.Replace(docText, "s\'");
This has the added benefit of being understandable to other developers that you are attempting to cover both apostrophe cases. This benefit gets at the other part of your question:
If using the copied character from Word is the proper way of doing this?
That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).
If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other character encodings that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:
Regex apostropheReplace =
new Regex("s['\u2018\u2019\u201A\u201b]s");
docText = apostropheReplace.Replace(docText, "s\'");
But you can certainly add other character encodings as needed. A more complete list of character encodings can be found here. To add one to the above Regex, simply change the "U+" to "\u" and append the result to the list inside the square brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from
Regex("s['\u2018\u2019\u201A\u201b]s")
to
Regex("s['\u2018\u2019\u201A\u201b\u2032]s")
Ultimately, you would be the judge of what character encodings are the most "proper" for inclusion in your Regex based on your use cases.
I keep hearing that W3C recommends to use ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, I tried using ";" instead of "&" (example: .com?str1=val1;str2=val2). When reading Request.QueryString["str1"] I get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?
As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &amp;, giving <a href="...?q1=v1&amp;q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT.
In order to use semicolons as the separator, I don't know whether .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what I did. I wrote my own methods, which wasn't too hard, but it took a lot of testing and debugging time, some of which was Microsoft's fault for not conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters, including the Multilingual Plane (thus for Chinese and Japanese characters, etc.).
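To give an idea of what "writing our own methods" amounts to, here is a minimal sketch in Python (used only for illustration; it is not the .NET code described above). The function name parse_semicolon_query is hypothetical, and the sketch assumes keys and values were URL-encoded with "+" for spaces:

```python
from urllib.parse import unquote_plus


def parse_semicolon_query(raw: str) -> dict[str, list[str]]:
    """Split a raw query string on ';' and URL-decode each key and value."""
    result: dict[str, list[str]] = {}
    for pair in raw.lstrip("?").split(";"):
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, _, value = pair.partition("=")
        # Decode only after splitting, so an encoded ';' (%3B) inside a value
        # is never mistaken for a separator.
        result.setdefault(unquote_plus(key), []).append(unquote_plus(value))
    return result


print(parse_semicolon_query("?str1=val1;str2=val2"))
```

Values are collected in lists because the same key may legitimately appear more than once in a query string.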
Before adding my own findings, I also want to confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rawling's answer and the comments on it: it is incorrect in HTML not to escape ampersands, but it usually works, only because parsers are so tolerant. With that, I also explain why such improper encoding can lead to bugs (which most developers probably fall victim to).
One cannot depend on this leniency toward improperly encoded ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say, for instance, a QueryString passes a random ASCII string (or user input) and it is not properly encoded. Then 'amp;' which follows '&' gets decoded, and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, I mean it gets 'eaten' or goes missing.) A practical scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow), but because it is not posted correctly, nasty bugs develop.
The real advantage of the ';' separator is simplicity: proper encoding of ampersand-separated QueryStrings takes two steps for URL strings in an HTML page (and in XML too). First the keys and values should be URL encoded and concatenated, and then the whole QueryString or URL should be HTML encoded (or, for XML, encoded with a very similar encoding). Also don't forget that HTML encoding and URL encoding are different processes, and it's important that they are different. A developer needs to be careful to distinguish the two. And since they are similar, it's not uncommon to see novice programmers mix them up.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, '?a=me+%26+you&b=you+%26+me' is a proper querystring, BUT it should also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two-step process of first URL encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder, given that I had to sit down and seriously think this process through and test my conclusions thoroughly. Imagine when the name/value pair is 'year=año', or the far more complex case when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same key/value pairs a and b, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as the semicolon separator! Here's the same info represented using ';' as the separator: '?a=me+%26+you;b=you+%26+me'. Notice that the only difference is that there's no '&' in the string. But using the ';' separator means that no second step of HTML encoding the URL or QueryString is needed. Now imagine if I were writing HTML, wanted it to be correct, and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and, for many developers, quite a lot of confusion too).
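The two-step process can be sketched as follows, in Python for illustration only. The build_query helper is hypothetical; quote_plus and html.escape stand in for whatever URL-encoding and HTML-encoding routines your platform provides:

```python
from html import escape
from urllib.parse import quote_plus

pairs = [("a", "me & you"), ("b", "you & me")]


def build_query(pairs, separator):
    """Step 1: URL-encode every key and value, then join with the separator."""
    return "?" + separator.join(
        f"{quote_plus(key)}={quote_plus(value)}" for key, value in pairs
    )


amp_query = build_query(pairs, "&")
semi_query = build_query(pairs, ";")

# Step 2: HTML-encode the finished query before writing it into an attribute.
amp_in_html = escape(amp_query)
semi_in_html = escape(semi_query)

print(amp_query)     # ?a=me+%26+you&b=you+%26+me
print(amp_in_html)   # ?a=me+%26+you&amp;b=you+%26+me
print(semi_in_html)  # ?a=me+%26+you;b=you+%26+me
```

With ';' the second step leaves the string unchanged, which is why skipping it by accident does no harm in that case.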
Novice developers would simply not HTML encode the QueryString or URL, which is CORRECT when ';' is the separator. But it leaves room for bugs when an ampersand is improperly encoded. So '?someText=blah&blah' would need proper encoding.
Also, in .NET we can write XML documentation for our methods. Well, just today, I wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, I had to manually type all those &amp; character entities. XML documentation is picky, so one must correctly encode ampersands. But the leniency in HTML adds to the ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a character which should be HTML encoded as the separator; thus '&' is the culprit. And the semicolon relieves all that complication.
One last consideration: given how much more complicated the '&' separator makes this process, it's no wonder to me that the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect percent-encoding of surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods that also handle the full range of Unicode characters should beware of this.
I am currently looking to detect whether a URL is encoded or not. Here are some specific examples:
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13
http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL290aGVyX2ZpbGVzL2VzcG5zdGFyL25hdl9iZy1vZmYucG5n&b=13
Can you please give me a Regular Expression for this?
Is there a self learning regular expression generator out there which can filter a perfect Regex as the number of inputs are increased?
If you are interested in the base64-encoded URLs, you can do it.
A little theory. If L, R are regular languages and T is a regular transducer, then LR (concatenation), L & R (intersection), L | R (union), TR(L) (image), TR^-1(L) (kernel) are all regular languages. Every regular language has a regular expression that generates it, and every regexp generates a regular language. URLs can be described by regular language (except if you need a subset of those that is not), almost every escaping scheme (and base64) is a regular transducer. Therefore, in theory, it's possible.
In practice, it gets rather messy.
A regex for valid base64 strings is ([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(==|[A-Za-z0-9+/]=))
If it is embedded in a query parameter of an url, it will probably be urlencoded. Let's assume only the = will be urlencoded (because other characters can too, but don't need to).
This gets us to something like [?&][^?&#=;]+=([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
Another possibility is to consider only those base64-encoded URLs that have some property. In your case, they all begin with "://", which is fortunate, because that translates to exactly 4 characters, "Oi8v". Otherwise, it would be more complex.
This gets [?&][^?&#=;]+=Oi8v([A-Za-z0-9+/]{4})*(|[A-Za-z0-9+/]{2}(%3D%3D|[A-Za-z0-9+/]%3D))
As you can see, it gets messier and messier. Therefore, I'd rather recommend that you:
break the URL in its parts (eg. protocol, host, query string)
get the parameters from the query string, and urldecode them
try base64 decode on the values of the parameters
apply your criterion for "good encoded URLs"
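The four steps above can be sketched as follows, in Python for illustration. The function name base64_url_params and its prefix parameter are hypothetical, and "decodes to text starting with ://" is only one possible criterion for a "good" encoded URL:

```python
import base64
from urllib.parse import unquote, urlsplit


def base64_url_params(url: str, prefix: str = "://") -> dict[str, str]:
    """Return query parameters whose values are base64 text starting with prefix."""
    found = {}
    query = urlsplit(url).query                # step 1: break the URL into parts
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        value = unquote(value)                 # step 2: urldecode (keeps '+' intact)
        try:
            # step 3: strict base64 decode, then interpret the bytes as text
            decoded = base64.b64decode(value, validate=True).decode("utf-8")
        except ValueError:                     # bad base64 or bytes that are not UTF-8
            continue
        if decoded.startswith(prefix):         # step 4: apply the chosen criterion
            found[key] = decoded
    return found


url = "http://www.linxology.com/browse.php?u=Oi8vZXNwbnN0YXIuY29tL21lZGlhLXBsYXllci8%3D&b=13"
print(base64_url_params(url))
```

unquote is used rather than unquote_plus because "+" is a legal base64 character and must not be turned into a space.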
Well, depending on what is in that encoded text, you might not even need a regular expression. If there are multiple querystring parameters in that one "u" key, perhaps you could just check the length of the text on each querystring value, and if it is over (say) 50, you can assume it's probably encoded. I doubt any unencoded single parameters would be as long as these, since those would have to be string data, and therefore they would probably need to be encoded!
This question may be harder than you realize. For example:
I could say that if a query string includes a question mark character then what follows it is encoded.
Now, it may be simple encoding like "?year=2009" or complicated like in your examples.
Or
The site URLs could use URL rewriting (like this site does). Look at the URL of this question. The "615958" is encoded and... no question marks were used!
In fact, you could say that the entire URL is encoded!
Perhaps you need to better define what you mean by "encoded".
You can't reliably parse URL using regex. (Is this an SO mantra yet?)
Here are some specific examples:
It's not clear what ‘encoded’ means — can you give some counter-examples of URLs you consider “not encoded”?
Are you talking about the Base64 encoding in the ‘u’ parameter? Whilst it is possible to say whether a string is a valid Base64 string, it's not possible to detect Base64 and distinguish it from anything else; for example the word “sausages” also happens to be valid Base64 (it decodes to '\xb1\xab\xacj\x07\xac').
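A quick way to see this for yourself, sketched in Python (the helper name is_valid_base64 is made up for this example):

```python
import base64


def is_valid_base64(text: str) -> bool:
    """True if text is syntactically valid base64; says nothing about intent."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False


decoded = base64.b64decode("sausages", validate=True)
print(decoded)
print(is_valid_base64("sausages"), is_valid_base64("hello world"))
```

Validity only tells you a string could have been produced by a Base64 encoder, so some additional criterion on the decoded bytes is still needed.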
I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, and if so, what expression would do this? Is there a library out there somewhere that already does this in C#?
On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure to only capture legitimate wiki links texts, without closing square brackets or newline characters.
([^\] ]\S*) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am not sure whether there are C# libraries that already implement this kind of detection.
However, to make that kind of detection really foolproof with regexp, you should use the pushdown automaton present in the C# regexp engine, as illustrated here.
I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with <a href="\2">\1</a>\3
match \[\[(.+?)\]\](\S+) and replace with <a href="\1">\1</a>\2
Or something like that, anyway.
Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% there, but here is the last bit for anyone looking for code to get straight on with trying:
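string html = "Some text with a wiki style [[page2.html|link]]";
// (?!\])(\S*) is used in place of the trailing group ([^\] ]\S*) so that a link at the very end of the string, or one followed by whitespace, still matches.
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\](?!\])(\S*)", @"<a href=""$1"">$2</a>$3");
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\]\](?!\])(\S*)", @"<a href=""$1"">$1</a>$2");
Apart from the trailing group noted in the comment, the only change is in the replacement: I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them.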