Regular Expression finding quotations

I am trying to check whether a string is a quotation, using a regex in C#.
For example:
string x = "The flora and fauna of Britain \"has been transported to almost every corner of the globe since colonial times\" (Plants and Animals of Britain, 1942: 8).";
string y = "Morris et al (2000: 47) state \"that the debate of these particular issues should be left to representative committees.\"";
x and y are both quotations, so the regex (or an alternative solution) should return true for them.
I came up with this, but there is a small problem:
string pattern = @"([‘'""]([\w\W]+?)[)])|(([\w\W]+?)[(]([\w\W]+?)[’'""])";
Are there any alternatives? Thanks in advance.
The project is an anti-plagiarism web application. The application found that these strings (quotations) were copied from the web. Now assume the user does not want these quotations included in the search results; the question is how to do that.
The search results are stored in a database; I am using EF and LINQ as follows:
var webSearches = _db.WebSearches.Where(x => x.SubmissionId == submissionId).GroupBy(x => x.PlagiarisedText).Select(x => x.FirstOrDefault()).OrderBy(x => x.Id);
I want to filter the results (PlagiarisedText) so that quotations are excluded.
Thanks for the replies, I appreciate it.

Use \\\".
Use Regex.IsMatch() to find if it contains or not.
Console.WriteLine(Regex.IsMatch(x, "\\\""));// true if it contains ", otherwise false
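A single quotation mark is a fairly weak signal, so it may be worth requiring a matching pair with some text in between. The following is only a minimal sketch under stated assumptions: the names QuotationCheck and ContainsQuotedSpan are hypothetical, only straight and curly double quotes are considered (single quotes are left out because they collide with apostrophes), and the citation in parentheses is ignored.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotationCheck
{
    // Opening quote (straight or curly), at least one non-quote character,
    // then a closing quote (straight or curly).
    private static readonly Regex QuotedSpan =
        new Regex("[\"\u201C][^\"\u201C\u201D]+[\"\u201D]");

    // Hypothetical helper: true when the text contains a quoted span.
    public static bool ContainsQuotedSpan(string text)
    {
        return !string.IsNullOrEmpty(text) && QuotedSpan.IsMatch(text);
    }
}
```

For the EF query in the question, a check like this would probably have to run in memory after the rows are loaded (for example after ToList()), because whether Regex.IsMatch can be translated to SQL depends on the EF version and provider.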

If regex is not a requirement, you can use String functions:
int first = str.IndexOf('"');
int last = str.LastIndexOf('"');
// IndexOf returns -1 when there is no quote; last > first means two distinct quotes
if (first >= 0 && last > first)
{
// true
}

If it should be true only when the first and last characters are both " characters, you can simply use the following regex:
".*"

Related

Extract substring between startsequence and endsequence in C# using LINQ

I have an XML instance that contains processing instructions. I want a specific one (the schematron declaration):
<?xml-model href="../../a/b/c.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>
There may or may not be more than these very processing instructions present, so I can't rely on its position in the DOM; it is guaranteed, on the other hand, that there will be only one (or none) such Schematron file reference. Thus, I get it like so:
XProcessingInstruction p = d.Nodes().OfType<XProcessingInstruction>()
.Where(x => x.Target.Equals("xml-model") &&
x.Data.Contains("schematypens=\"http://purl.oclc.org/dsdl/schematron\""))
.FirstOrDefault();
In the example given, the content of p.Data is the string
href="../../a/b/c.sch" schematypens="http://purl.oclc.org/dsdl/schematron"
I need to extract the path specified via @href (i.e. in this example I would want the string ../../a/b/c.sch) without double quotes. In other words: I need the substring after href=" and before the next ". I'm trying to achieve my goal with LINQ:
var a = p.Data.Split(' ').Where(s => s.StartsWith("href=\""))
.Select(s => s.Substring("href=\"".Length))
.Select(s => s.TakeWhile(c => c != '"'));
I would have thought this gave me an IEnumerable<char> which I could then convert to a string in one of the ways described here, but that's not the case: according to LINQPad, I seem to be getting an IEnumerable<IEnumerable<char>>, which I can't manage to turn into a string.
How could this be done correctly using LINQ? Maybe I'd better be using Regex within LINQ?
Edit: After typing this down, I came up with a working solution, but it seems very inelegant:
string a = new string
(
p.Data.Substring(p.Data.IndexOf("href=\"") + "href=\"".Length)
.TakeWhile(c => c != '"').ToArray()
);
What would be a better way?

Try this:
var input = @"<?xml-model href=""../../a/b/c.sch"" schematypens=""http://purl.oclc.org/dsdl/schematron""?>";
var match = Regex.Match(input, @"href=""(.*?)""");
var url = match.Groups[1].Value;
That gives me ../../a/b/c.sch in url.
Please don't use Regex for general XML parsing, but for this situation it's fine.
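For completeness, the LINQ attempt from the question can also be made to work: the inner TakeWhile yields an IEnumerable<char> per token, so each one has to be turned into a string before a single result is picked. This is a minimal sketch, assuming the path contains no spaces (the data is split on spaces); the names PseudoAttribute and GetHref are hypothetical.

```csharp
using System;
using System.Linq;

public static class PseudoAttribute
{
    // Returns the value of href="..." from processing-instruction data,
    // or null when no href pseudo-attribute is present.
    public static string GetHref(string data)
    {
        const string prefix = "href=\"";
        return data.Split(' ')
                   .Where(s => s.StartsWith(prefix, StringComparison.Ordinal))
                   // Convert each IEnumerable<char> into a string right away.
                   .Select(s => new string(s.Substring(prefix.Length)
                                            .TakeWhile(c => c != '"')
                                            .ToArray()))
                   .FirstOrDefault();
    }
}
```

Calling PseudoAttribute.GetHref(p.Data) on the example from the question would then give the path without the surrounding quotes.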

Regex matching dynamic words within an html string

I have an html string to work with as follows:
string html = new MvcHtmlString(item.html.ToString()).ToHtmlString();
There are two different types of text I need to match although very similar. I need the initial ^^ removed and the closing |^^ removed. Then if there are multiple clients I need the ^ separating clients changed to a comma(,).
^^Client One- This text is pretty meaningless for this task, but it will exist in the real document.|^^
^^Client One^Client Two^Client Three- This text is pretty meaningless for this task, but it will exist in the real document.|^^
I need to be able to match each client and make it bold.
Client One- This text is pretty meaningless for this task, but it will exist in the real document.
Client One, Client Two, Client Three- This text is pretty meaningless for this task, but it will exist in the real document.
A helpful Stack Overflow user provided the following, but I could not get it to work or find any matches when I tested it in an online regex tester.
const string pattern = @"\^\^(?<clients>[^-]+)(?<text>-.*)\|\^\^";
var result = Regex.Replace(html, pattern,
m =>
{
var clientlist = m.Groups["clients"].Value;
var newClients = string.Join(",", clientlist.Split('^').Select(s => string.Format("<strong>{0}</strong>", s)));
return newClients + m.Groups["text"];
});
I am very new to regex so any help is appreciated.

I'm new to C# so forgive me if I make rookie mistakes :)
const string pattern = @"\^\^([^-]+)(-[^|]+)\|\^\^";
var temp = Regex.Replace(html, pattern, "<strong>$1</strong>$2");
var result = Regex.Replace(temp, @"\^", "</strong>, <strong>");
I'm using $1 even though MSDN is vague about using that syntax to reference subgroups.
Edit: if it's possible that the text after - contains a ^ you can do this:
var result = Regex.Replace(temp, @"\^(?=.*-)", "</strong>, <strong>");

Code an elegant way to strip strings

I am using C#, and in one place I get a list of people's names with their email addresses in the format
name(email)\n
I came up with this substring approach off the top of my head. I am looking for a more elegant, faster (in terms of access time and the operations it performs), easy-to-remember line of code to do this.
string pattern = "jackal(jackal@gmail.com)";
string email = pattern.Substring(pattern.IndexOf("("), pattern.LastIndexOf(")") - pattern.IndexOf("("));
//extra
string email = pattern.Split('(',')')[1];
I think the above accesses each character sequentially until it finds the index of the delimiter. That works fine now because the name is short, but it might struggle with a long name (I hope people don't have one).

A dirty hack would be to let Microsoft do it for you.
try
{
new MailAddress(input);
//valid
}
catch (Exception ex)
{
// invalid
}
I would hope they do a better job than a custom regex.
Maintaining a custom regex that takes care of everything might involve some effort.
Refer to: MailAddress
Your format is actually very close to some of the supported formats.
Text within () is treated as a comment, but if you replace ( with < and ) with >, you get a supported format.

The second parameter in Substring() is the length of the string to take, not the ending index.
Your code should read:
string pattern = "jackal(jackal@gmail.com)";
int start = pattern.IndexOf("(") + 1;
int end = pattern.LastIndexOf(")");
string email = pattern.Substring(start, end - start);
Alternatively, have a look at Regular Expression to find a string included between two characters while EXCLUDING the delimiters
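A regex along the lines of that linked question could be sketched as follows. This is an illustration only, not a definitive implementation: the names EmailExtractor and ExtractEmail are hypothetical, and the pattern assumes the email itself contains no ")" character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailExtractor
{
    // Captures everything between the first "(" and the next ")",
    // excluding the parentheses themselves.
    private static readonly Regex InParens = new Regex(@"\(([^)]*)\)");

    // Returns the text inside the parentheses, or null when there is none.
    public static string ExtractEmail(string entry)
    {
        Match m = InParens.Match(entry);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Like IndexOf, the regex engine still scans the string character by character, so this is mainly a readability choice rather than a speed-up.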

C# - Searching strings

I can't seem to find a good solution to this issue. I've got an array of strings that are fed in from a report that I receive about lost or stolen equipment. I've been using the string.IndexOf function through the rest of the form and it works quite well. The issue is with the field that says whether the device was lost or stolen.
Example:
"Lost or Stolen? Lost"
"Lost or Stolen? Stolen"
I need to be able to read this, but when I do string.IndexOf(@"Lost") it will always report "Lost" because that word is in the question.
Unfortunately I'm not able to change the form itself in any way, and due to the way it's submitted I can't just write code that knocks the first 15 or so characters off the string, because that may be too few in some cases.
I would really like something in C# that would allow me to continue to search a string after the first result is found so that the logic would look like:
string my_string = "Lost or Stolen? Stolen";
searchFor(@"Stolen" in my_string)
{
Found Stolen;
Does it have "or " in front of it? yes;
ignore and keep searching;
Found Stolen again;
return "Equipment stolen";
}

Couple of options here. You could look for the last index of a space and take the rest of the string:
string input = "Lost or Stolen? Stolen";
int lastSpaceIndex = input.LastIndexOf(' ');
string result = input.Substring(lastSpaceIndex + 1);
Console.WriteLine(result);
Or you could split it and take the last word:
string input = "Lost or Stolen? Lost";
string result = input.Split(' ').Last();
Console.WriteLine(result);
Regex is also an option, but overkill given the simpler solutions above. A nice shortcut that fits this scenario is to use the RegexOptions.RightToLeft option to get the first match starting from the right:
string result = Regex.Match(input, @"\w+", RegexOptions.RightToLeft).Value;

If I understand your requirement, you're looking for an instance of Lost or Stolen after a ?:
var q = myString.IndexOf("?");
var lost = q >= 0 && myString.IndexOf("Lost", q) > 0;
var stolen = q >= 0 && myString.IndexOf("Stolen", q) > 0;
// or
var lost = myString.LastIndexOf("Lost") > myString.IndexOf("?");
var stolen = myString.LastIndexOf("Stolen") > myString.IndexOf("?");
// don't forget
var neither = !lost && !stolen;

You can look for the string 'Lost' and if it occurs twice, then you can confirm it is 'Lost'.

In this case you could search a substring instead, since the question "Lost or Stolen?" always comes first.
So you skip past the question and then match your keyword against the remaining string.
Something like:
int questionIndex = inputValue.IndexOf("?");
// Skip the "?" itself and trim the space that follows it
string toMatch = inputValue.Substring(questionIndex + 1).Trim();
if (toMatch == "Lost")
{
// equipment was lost
}

If it works for your use case, it might be easier to use .EndsWith().
bool lost = my_string.EndsWith("Lost");

Function to Make Pascal Case? (C#)

I need a function that will take a string and "pascal case" it. The only indicator that a new word starts is an underscore. Here are some example strings that need to be cleaned up:
price_old => Should be PriceOld
rank_old => Should be RankOld
I started working on a function that makes the first character upper case:
public string FirstCharacterUpper(string value)
{
if (value == null || value.Length == 0)
return string.Empty;
if (value.Length == 1)
return value.ToUpper();
var firstChar = value.Substring(0, 1).ToUpper();
return firstChar + value.Substring(1, value.Length - 1);
}
The thing the above function doesn't do is remove the underscore and "ToUpper" the character to the right of the underscore.
Also, any ideas about how to pascal case a string that doesn't have any indicators (like the underscore). For example:
companysource
financialtrend
accountingchangetype
The major challenge here is determining where one word ends and another starts. I guess I would need some sort of lookup dictionary to determine where new words start? Are there libraries out there that do this sort of thing already?
Thanks,
Paul

You can use the TextInfo.ToTitleCase method then remove the '_' characters.
So, using the extension methods I've got:
http://theburningmonk.com/2010/08/dotnet-tips-string-totitlecase-extension-methods
you can do something like this:
var s = "price_old";
s.ToTitleCase().Replace("_", string.Empty);

Well the first thing is easy:
string.Join("", "price_old".Split(new [] { '_' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Substring(0, 1).ToUpper() + s.Substring(1)).ToArray());
returns PriceOld
The second thing is much more difficult. Since companysource could be CompanySource or maybe CompanysOurce, it can be automated, but the result is error-prone. You will need an English dictionary, and you will have to do some guessing (well, a lot of it) about which combination of words is correct.
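The dictionary idea can be sketched as a small word-break routine. This is only an illustration under stated assumptions: WordSplitter, its Words set and ToPascalCase are hypothetical names, the tiny word list stands in for a real English dictionary, and where several segmentations exist the code simply keeps the first one it finds rather than guessing which is most plausible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSplitter
{
    // Hypothetical stand-in for a real English word list.
    private static readonly HashSet<string> Words = new HashSet<string>
    {
        "company", "source", "financial", "trend",
        "accounting", "change", "type"
    };

    // Splits lower-case input into dictionary words and Pascal-cases them.
    // Returns null when no full segmentation exists.
    public static string ToPascalCase(string input)
    {
        // best[i] holds a segmentation of the first i characters, or null if none was found.
        var best = new List<string>[input.Length + 1];
        best[0] = new List<string>();

        for (int end = 1; end <= input.Length; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (best[start] == null) continue;
                string word = input.Substring(start, end - start);
                if (!Words.Contains(word)) continue;
                // Keep the first segmentation found, i.e. the one whose last word is longest.
                best[end] = new List<string>(best[start]) { word };
                break;
            }
        }

        if (best[input.Length] == null) return null;
        return string.Concat(best[input.Length]
            .Select(w => char.ToUpperInvariant(w[0]) + w.Substring(1)));
    }
}
```

With a realistic dictionary the ambiguity mentioned above shows up quickly, so some ranking of candidate splits (for example by word frequency) would likely be needed on top of this.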

Try this:
public static string GetPascalCase(string name)
{
return Regex.Replace(name, @"^\w|_\w",
(match) => match.Value.Replace("_", "").ToUpper());
}
Console.WriteLine(GetPascalCase("price_old")); // => Should be PriceOld
Console.WriteLine(GetPascalCase("rank_old" )); // => Should be RankOld

With underscores:
s = Regex.Replace(s, @"(?:^|_)([a-z])",
m => m.Groups[1].Value.ToUpper());
Without underscores:
You're on your own there. But go ahead and search; I'd be surprised if nobody has done this before.

For your second problem of splitting concatenated words, you could utilize our best friends Google & Co. If your concatenated input is made up of common English words, the search engines have a good hit rate at suggesting the separate words as an alternative search query.
If you enter your sample input, Google and Bing suggest the following:
original | Google | Bing
=====================================================================
companysource | company source | company source
financialtrend | financial trend | financial trend
accountingchangetype | accounting changetype | accounting change type
See this example.
Writing a small screen scraper for that should be fairly easy.

For those who need a non-regex solution:
public static string RemoveAllSpaceAndConcertToPascalCase(string status)
{
var textInfo = new System.Globalization.CultureInfo("en-US").TextInfo;
var titleCaseStr = textInfo.ToTitleCase(status);
string result = titleCaseStr.Replace("_","").Replace(" ", "");
return result;
}
