Lucene RegexQuery doesn't seem to apply Regex? - c#

Code:
private static void AddTextToIndex(string filename, string pdfBody, IndexWriter writer)
{
Document doc = new Document();
doc.Add(new Field("fileName", filename.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("pdfBody", pdfBody.ToString(), Field.Store.NO, Field.Index.NOT_ANALYZED));
writer.AddDocument(doc);
}
protected void txtBoxSearchPDF_Click(object sender, EventArgs e)
{
//some code
string searchQuery = txtBoxSearchString.Text;
Term t = new Term("fileName",searchQuery+"/i");
RegexQuery regQuer = new RegexQuery(t);
TopDocs resultDocs = indexSearch.Search(regQuer, indexReader.MaxDoc);
var hits = resultDocs.ScoreDocs;
foreach (var hit in hits)
{
var documentFromSearcher = indexSearch.Doc(hit.Doc);
string getResult = documentFromSearcher.Get("fileName");
string formattedResult = getResult.Replace(" ", "%20");
sb.AppendLine(@"" + getResult + "");
sb.AppendLine("<br>");
}
}
Basically all I'm trying to do is use Regex so that I can match things exactly but I want the search to be case insensitive. But adding the /i option doesn't actually make it a regular expression, all it seems to do is make the search term literally whatever was entered in the text box concatenated with the /i.
Any ideas?

Case sensitivity depends mostly on the Analyzer you use.
A RegexQuery is a MultiTermQuery, which means it gets rewritten to something similar to a BooleanQuery with a SHOULD occurrence on every term that matches the regex.
At search time, the terms in your index are enumerated and matched against your regex. The matching terms are added as clauses to the BooleanQuery.
Your regex does not go through the analyzer, so you have to adjust it manually to match your terms. Your fileName field is indexed as NOT_ANALYZED, so its terms keep their original case.
Also, the regex syntax does not support many features; see the docs.
Actually, I simplified the explanation. What really happens is more complicated because many optimizations take place (not all the terms are enumerated, the regex is compiled to a finite state automaton, the query does not necessarily get rewritten to a BooleanQuery, etc.). But what happens behind the scenes has the same outcome as what I've explained here.
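To make the enumeration step concrete, here is a minimal sketch in plain C# that only imitates what the query does; it does not use Lucene at all. The IndexedTerms array and the MatchTerms helper are hypothetical stand-ins for the term dictionary and the rewrite step. It shows that a pattern ending in "/i" is matched literally, and that lowercasing both the indexed value and the pattern is one possible way to get case-insensitive matching.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TermMatchSketch
{
    // Hypothetical stand-in for the terms of a NOT_ANALYZED "fileName" field.
    public static readonly string[] IndexedTerms =
        { "Report.pdf", "report-2012.pdf", "Invoice.PDF" };

    // Imitates the rewrite step: keep every term that the regex matches.
    public static List<string> MatchTerms(IEnumerable<string> terms, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.CultureInvariant);
        return terms.Where(t => regex.IsMatch(t)).ToList();
    }

    public static void Main()
    {
        // "/i" is just two more characters of the pattern, so no term matches.
        Console.WriteLine(MatchTerms(IndexedTerms, "report.*/i").Count);

        // Lowercase the terms (as you would at index time) and the pattern
        // (at query time). Assumes the pattern has no case-sensitive escapes
        // such as \W or \D, which lowercasing would change.
        var lowered = IndexedTerms.Select(t => t.ToLowerInvariant());
        foreach (var term in MatchTerms(lowered, "Report.*".ToLowerInvariant()))
            Console.WriteLine(term);
    }
}
```

In a real index the lowercasing would have to happen when the document is added, which means re-indexing existing documents.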

Related

Regex matching dynamic words within an html string

I have an html string to work with as follows:
string html = new MvcHtmlString(item.html.ToString()).ToHtmlString();
There are two different types of text I need to match although very similar. I need the initial ^^ removed and the closing |^^ removed. Then if there are multiple clients I need the ^ separating clients changed to a comma(,).
^^Client One- This text is pretty meaningless for this task, but it will exist in the real document.|^^
^^Client One^Client Two^Client Three- This text is pretty meaningless for this task, but it will exist in the real document.|^^
I need to be able to match each client and make it bold.
Client One- This text is pretty meaningless for this task, but it will exist in the real document.
Client One, Client Two, Client Three- This text is pretty meaningless for this task, but it will exist in the real document.
A kind Stack Overflow user provided the following, but I could not get it to work or find any matches when I tested it in an online regex tester.
const string pattern = @"\^\^(?<clients>[^-]+)(?<text>-.*)\|\^\^";
var result = Regex.Replace(html, pattern,
m =>
{
var clientlist = m.Groups["clients"].Value;
var newClients = string.Join(",", clientlist.Split('^').Select(s => string.Format("<strong>{0}</strong>", s)));
return newClients + m.Groups["text"];
});
I am very new to regex so any help is appreciated.
I'm new to C# so forgive me if I make rookie mistakes :)
const string pattern = @"\^\^([^-]+)(-[^|]+)\|\^\^";
var temp = Regex.Replace(html, pattern, "<strong>$1</strong>$2");
var result = Regex.Replace(temp, @"\^", "</strong>, <strong>");
I'm using $1 even though MSDN is vague about using that syntax to reference subgroups.
Edit: if it's possible that the text after - contains a ^ you can do this:
var result = Regex.Replace(temp, @"\^(?=.*-)", "</strong>, <strong>");
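For reference, here is the two-step replace from this answer wrapped in a small self-contained program. The class name ClientListSketch and the shortened sample input are made up for illustration; the patterns are the ones given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClientListSketch
{
    public static string Format(string html)
    {
        const string pattern = @"\^\^([^-]+)(-[^|]+)\|\^\^";
        // Step 1: drop the ^^ ... |^^ wrapper and bold the whole client list.
        var temp = Regex.Replace(html, pattern, "<strong>$1</strong>$2");
        // Step 2: turn each remaining ^ separator into a tag boundary plus a comma.
        return Regex.Replace(temp, @"\^", "</strong>, <strong>");
    }

    public static void Main()
    {
        Console.WriteLine(Format("^^Client One^Client Two- Some text.|^^"));
        // prints: <strong>Client One</strong>, <strong>Client Two</strong>- Some text.
    }
}
```

Note that [^-]+ stops at the first hyphen, so a client name that itself contains a hyphen would be cut short by this pattern.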

c# searching large text file

I am trying to optimize the search for a string in a large text file (300-600mb). Using my current method, it is taking too long.
Currently I have been using IndexOf to search for the string, but the time it takes is way too long (20s) to build an index for each line with the string.
How can I optimize searching speed? I've tried Contains() but that is slow as well. Any suggestions? I was thinking regex match but I don't see that having a significant speed boost. Maybe my search logic is flawed
example
while ((line = myStream.ReadLine()) != null)
{
if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
{
LineIndex.Add(CurrentPosition);
LinesCounted += 1;
}
}
The brute force algorithm you're using performs in O(nm) time, where n is the length of the string being searched and m the length of the substring/pattern you're trying to find. You need to use a string search algorithm:
Boyer-Moore is "the standard", I think:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
But there are lots more out there:
http://www-igm.univ-mlv.fr/~lecroq/string/
including Morris-Pratt:
http://www.stoimen.com/blog/2012/04/09/computer-algorithms-morris-pratt-string-searching/
and Knuth-Morris-Pratt:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
However, a carefully crafted regular expression might be sufficient, depending on what you are trying to find. See Jeffrey Friedl's tome, Mastering Regular Expressions, for help on building efficient regular expressions (e.g., no backtracking).
You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java])
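As a rough illustration of the family of algorithms mentioned above, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only a bad-character shift table. The class name HorspoolSearch is made up, and the comparison is case-sensitive (ordinal), unlike the OrdinalIgnoreCase call in the question, so treat it as a starting point rather than a drop-in replacement.

```csharp
using System;
using System.Collections.Generic;

public static class HorspoolSearch
{
    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int IndexOf(string text, string pattern)
    {
        int m = pattern.Length;
        if (m == 0) return 0;

        // For every pattern character except the last, record how far the
        // window may shift when that character is aligned with the window's end.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= text.Length - m)
        {
            // Compare the window right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Shift based on the text character under the window's last position;
            // characters not in the table allow a full-length shift.
            int step;
            pos += shift.TryGetValue(text[pos + m - 1], out step) ? step : m;
        }
        return -1;
    }
}
```

Usage would look like HorspoolSearch.IndexOf(line, CompareString) >= 0 inside the existing ReadLine loop. Whether it beats the built-in IndexOf should be measured on your own data.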
Unfortunately, I don't think there's a whole lot you can do in straight C#.
I have found the Boyer-Moore algorithm to be extremely fast for this task. But I found there was no way to make even that as fast as IndexOf. My assumption is that this is because IndexOf is implemented in hand-optimized assembler while my code ran in C#.
You can see my code and performance test results in the article Fast Text Search with Boyer-Moore.
Have you seen these questions (and answers)?
Processing large text file in C#
Is there a way to read large text file in parts?
Matching a string in a Large text file?
Doing it the way you are now seems to be the way to go if all you want to do is read the text file. Other ideas:
If it is possible to pre-sort the data, such as when it gets inserted into the text file, that could help.
You could insert the data into a database and query it as needed.
You could use a hash table
You can use Regex.Match(String). Regex matching may be faster, depending on the pattern.
static void Main()
{
string text = "One car red car blue car";
string pat = @"(\w+)\s+(car)";
// Instantiate the regular expression object.
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
int matchCount = 0;
while (m.Success)
{
Console.WriteLine("Match"+ (++matchCount));
for (int i = 1; i <= 2; i++)
{
Group g = m.Groups[i];
Console.WriteLine("Group"+i+"='" + g + "'");
CaptureCollection cc = g.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine("Capture"+j+"='" + c + "', Position="+c.Index);
}
}
m = m.NextMatch();
}
}

What is the best way to match a set of regular expressions in HashSet to a string in ASP.NET using C#?

I was wondering if I'm doing the following ASP.NET C# regexp match in the most efficient way?
I have a set of regular expressions in a HashSet that I need to match to an input string, so I do:
HashSet<string> hashMatchTo = new HashSet<string>();
hashMatchTo.Add(@"regexp 1");
hashMatchTo.Add(@"regexp 2");
hashMatchTo.Add(@"regexp 3");
hashMatchTo.Add(@"regexp 4");
hashMatchTo.Add(@"regexp 5");
//and so on
string strInputString = "Some string";
bool bMatched = false;
foreach (string strRegExp in hashMatchTo)
{
Regex rx = new Regex(strRegExp, RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
if (rx.IsMatch(strInputString))
{
bMatched = true;
break;
}
}
Two things jump out at me. The first is that you can populate a collection at the same time you create it, like so:
HashSet<string> hashMatchTo = new HashSet<string>()
{
@"^regexp 1$",
@"^regexp 2$",
@"^regexp 3$",
@"^[\w\s]+$",
@"^regexp 5$"
//and so on
};
The second is that you should use the static version of IsMatch(), like so:
string strInputString = "Some string";
bool bMatched = false;
foreach (string strRegExp in hashMatchTo)
{
if (Regex.IsMatch(strInputString, strRegExp,
RegexOptions.CultureInvariant | RegexOptions.IgnoreCase))
{
bMatched = true;
break;
}
}
Console.WriteLine(bMatched);
The reason for doing this is that the static Regex methods automatically cache the Regex objects they create. But be aware that the cache size is only 15 by default; if you think you'll be using more patterns than that, you'll need to increase the value of the static Regex.CacheSize property.
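If you would rather not depend on the cache size at all, another option is to construct the Regex objects once and keep them. The following is only a sketch under that assumption; the PatternSet class and its two sample patterns are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternSet
{
    private const RegexOptions Options =
        RegexOptions.CultureInvariant | RegexOptions.IgnoreCase;

    // Built once when the class is first used, then reused for every input.
    private static readonly Regex[] Compiled = new[]
    {
        @"^regexp 1$",
        @"^[\w\s]+$"
    }.Select(p => new Regex(p, Options)).ToArray();

    // True as soon as any pattern matches; Any() stops at the first hit.
    public static bool MatchesAny(string input)
    {
        return Compiled.Any(rx => rx.IsMatch(input));
    }
}
```

Calling PatternSet.MatchesAny("Some string") returns true here because the second pattern accepts word characters and whitespace.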
If your goal is a simple "does match any? true/false" then concatenate all of your regex into one big regex and just run that.
string strRegExp = string.Join("|", listOfRegex.ToArray());
bool bMatched = Regex.IsMatch(strInputString, strRegExp, RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
Console.WriteLine(bMatched);
No "foreach" looping
Better readability
No need to mess with the static Regex caching
While processing, it will short-circuit much like the loop version does with "break", but fewer method calls will be made, which should improve performance.
I don't see anything wrong. I would favor readability over efficiency as long as it is fast enough and meets the business requirements.
It depends on what is in your set; I don't know how many patterns is really "many". But you might consider choosing search criteria on a case-by-case basis: make your program know what to search for and where, instead of running through the entire hash set to check for possible matches.
I once used a simple regular expression to extract information from 2000 supplied URLs for display in a list view, and it severely degraded the performance of the whole program.

Can't get CJKAnalyzer/Tokenizer to recognise japanese text

i'm working with Lucene.NET and it's great. then worked on how to get it to search asian languages. as such, i moved from the StandardAnalyzer to the CJKAnalyzer.
this works fine for korean (although StandardAnalyzer worked ok for korean!), and chinese (which did not), but i still cannot get the program to recognise japanese text.
just as a very small example, i write a tiny database (using the CJKAnalyzer) with a few words in it, then try and read from the database:
public void Write(string text, AnalyzerType type)
{
Document document = new Document();
document.Add(new Field(
"text",
text,
Field.Store.YES,
Field.Index.ANALYZED));
IndexWriter correct = this.chineseWriter;
correct.AddDocument(document);
}
that's for the writing. and for the reading:
public Document[] ReadMultipleFields(string text, int maxResults, AnalyzerType type)
{
Analyzer analyzer = this.chineseAnalyzer;
QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
var query = parser.Parse(text);
// Get the fields.
TopFieldCollector collector = TopFieldCollector.create(
new Sort(),
maxResults,
false,
true,
true,
false);
// Then use the searcher.
this.searcher.Search(
query,
null,
collector);
// Holds the results
List<Document> documents = new List<Document>();
// Get the top documents.
foreach (var scoreDoc in collector.TopDocs().scoreDocs)
{
var doc = this.searcher.Doc(scoreDoc.doc);
documents.Add(doc);
}
// Send the list of docs back.
return documents.ToArray();
}
whereby chineseWriter is just an IndexWriter with the CJKAnalyzer passed in, and chineseAnalyzer is just the CJKAnalyzer.
any advice on why japanese isn't working? the input i send seems fair:
プーケット
is what i will store, but cannot read it. :(
EDIT: I was wrong... Chinese doesn't really work either: if the search term is longer than 2 characters, it stops working. Same as Japanese.
EDIT PART 2: I've now seen that the problem is with the prefix search. If I search for the first 2 characters and use an asterisk, then it works. As soon as I go over 2, it stops working. I guess this is because of the way the word is tokenized? If I search for the full term, then it does find it. Is there any way to use prefix search in Lucene.NET for CJK? プ* will work, but プーケ* will find nothing.
I use StandardTokenizer. At least for Japanese and Korean text, it is able to tokenize words that contain 3 or 4 characters. My only worry is Chinese: it does tokenize Chinese text, but one character at a time.
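On the tokenization guess in the question: as far as I know, CJKAnalyzer indexes runs of CJK characters as overlapping two-character terms (bigrams), and prefix queries are not run through the analyzer. The following plain C# sketch imitates that behaviour without using Lucene; the Bigrams and PrefixMatches helpers are hypothetical and only meant to show why a three-character prefix finds nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BigramSketch
{
    // Splits a run of CJK characters into overlapping two-character terms.
    public static List<string> Bigrams(string text)
    {
        var terms = new List<string>();
        if (text.Length == 1) terms.Add(text);
        for (int i = 0; i + 2 <= text.Length; i++)
            terms.Add(text.Substring(i, 2));
        return terms;
    }

    // A prefix query succeeds only if some indexed term starts with the prefix.
    public static bool PrefixMatches(IEnumerable<string> terms, string prefix)
    {
        return terms.Any(t => t.StartsWith(prefix, StringComparison.Ordinal));
    }

    public static void Main()
    {
        var terms = Bigrams("プーケット");   // プー, ーケ, ケッ, ット
        Console.WriteLine(PrefixMatches(terms, "プ"));     // True
        Console.WriteLine(PrefixMatches(terms, "プーケ")); // False: no term is longer than 2 characters
    }
}
```

Under that assumption, a full-term search still works because the query parser analyzes the whole term into the same bigrams that were indexed.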

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
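Put together, the steps from this answer could look roughly like the sketch below. The TagStripper class name is made up; the two patterns are the ones quoted above, and WebUtility.HtmlDecode (System.Net, .NET 4 and later) handles the optional entity step.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        // 1. Remove anything that looks like a tag, including an unclosed one at the end.
        string noTags = Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);

        // 2. Collapse runs of whitespace into a single space and trim.
        string normalized = Regex.Replace(noTags, @"[\s\r\n]+", " ").Trim();

        // 3. Optionally turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(normalized);
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<ul><li>Hello</li></ul>")); // prints: Hello
    }
}
```

Keep in mind that after step 3 the text can contain < and > again (decoded from &lt; and &gt;), so it needs to be encoded again before being written into a page.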
Go download HTMLAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}
Regex.Replace(htmlText, "<.*?>", string.Empty);
protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function
I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages include, besides better performance, the ability to replace named and numbered HTML entities (such as &amp; and &#203;), to replace comment blocks, and more.
Please read the related article on CodeProject.
Thank you.
For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. It can fail even on well-formatted HTML though, so always add a catch with a regex as a backup. Note this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(@"<[^>]*(>|$)");
Regex compressSpaces = new Regex(@"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}
string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);
I've looked at the regex-based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp;. It returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc.).
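As a small illustration of that last point, the built-in WebUtility.HtmlEncode (System.Net) does this kind of encoding. The wrapper name ToSafeHtml is hypothetical and only there to make the example self-contained.

```csharp
using System;
using System.Net;

public static class OutputEncodingSketch
{
    // Encodes characters such as < > & so the text is displayed, not interpreted.
    public static string ToSafeHtml(string userText)
    {
        return WebUtility.HtmlEncode(userText);
    }

    public static void Main()
    {
        Console.WriteLine(ToSafeHtml("<b>hi</b>")); // prints: &lt;b&gt;hi&lt;/b&gt;
    }
}
```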
For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}
using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);
You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.
For the second parameter, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:
public string StripTags(HtmlNode documentNode, IList keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
Simply use string.StripHTML(); (this is not a built-in .NET method, so it assumes an extension method defined elsewhere).
