Can't get CJKAnalyzer/Tokenizer to recognise japanese text

I'm working with Lucene.NET and it's great. I then looked at how to get it to search Asian languages, so I moved from the StandardAnalyzer to the CJKAnalyzer.
This works fine for Korean (although StandardAnalyzer worked OK for Korean too) and for Chinese (which StandardAnalyzer did not handle), but I still cannot get the program to recognise Japanese text.
As a very small example, I write a tiny index (using the CJKAnalyzer) with a few words in it, then try to read from it:
public void Write(string text, AnalyzerType type)
{
    Document document = new Document();
    document.Add(new Field(
        "text",
        text,
        Field.Store.YES,
        Field.Index.ANALYZED));
    IndexWriter correct = this.chineseWriter;
    correct.AddDocument(document);
}
That's the writing side. And for the reading:
public Document[] ReadMultipleFields(string text, int maxResults, AnalyzerType type)
{
    Analyzer analyzer = this.chineseAnalyzer;
    QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
    var query = parser.Parse(text);

    // Get the fields.
    TopFieldCollector collector = TopFieldCollector.create(
        new Sort(),
        maxResults,
        false,
        true,
        true,
        false);

    // Then use the searcher.
    this.searcher.Search(
        query,
        null,
        collector);

    // Holds the results
    List<Document> documents = new List<Document>();

    // Get the top documents.
    foreach (var scoreDoc in collector.TopDocs().scoreDocs)
    {
        var doc = this.searcher.Doc(scoreDoc.doc);
        documents.Add(doc);
    }

    // Send the list of docs back.
    return documents.ToArray();
}
Here chineseWriter is just an IndexWriter with the CJKAnalyzer passed in, and chineseAnalyzer is just the CJKAnalyzer.
Any advice on why Japanese isn't working? The input I send seems fair:
プーケット
is what I store, but I cannot read it back. :(
EDIT: I was wrong... Chinese doesn't really work either: if the search term is longer than 2 characters, it stops working. Same as Japanese.
EDIT PART 2: I've now seen that the problem is with prefix search. If I search for the first 2 characters followed by an asterisk, it works. As soon as I go over 2 characters, it stops working. I guess this is because of the way the word is tokenized? If I search for the full term, it does find it. Is there any way to use prefix search in Lucene.NET for CJK? プ* will work, but プーケ* will find nothing.

I use StandardTokenizer. At least for Japanese and Korean text, it is able to tokenize words that contain 3 or 4 characters. My only worry is Chinese: it does tokenize Chinese text, but only 1 character at a time.
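
To illustrate the tokenization guess from the question: a bigram-based CJK tokenizer indexes overlapping two-character terms, so no indexed term is longer than 2 characters, and a prefix query can only match terms that actually exist in the index. The sketch below is plain C# with no Lucene calls. Bigrams and AnyTermStartsWith are hypothetical helpers written for this illustration, and treating every character as CJK is a simplifying assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BigramSketch
{
    // Hypothetical helper: emits overlapping two-character terms,
    // a simplification of what a bigram-based CJK tokenizer indexes.
    public static List<string> Bigrams(string text)
    {
        var terms = new List<string>();
        if (text.Length == 1)
        {
            terms.Add(text); // a lone character is kept as its own term
        }
        for (int i = 0; i + 1 < text.Length; i++)
        {
            terms.Add(text.Substring(i, 2));
        }
        return terms;
    }

    // A prefix query matches only if some indexed term starts with the prefix.
    public static bool AnyTermStartsWith(IEnumerable<string> terms, string prefix)
    {
        return terms.Any(t => t.StartsWith(prefix, StringComparison.Ordinal));
    }

    public static void Main()
    {
        var terms = Bigrams("プーケット");
        Console.WriteLine(string.Join(" | ", terms));          // プー | ーケ | ケッ | ット
        Console.WriteLine(AnyTermStartsWith(terms, "プ"));     // True
        Console.WriteLine(AnyTermStartsWith(terms, "プーケ")); // False: no term has 3 characters
    }
}
```

This is consistent with the behaviour described in the question: a prefix of up to 2 characters can match a stored bigram, while a longer prefix cannot match any term.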

Related

Gembox document removes table of contents

I'm using Gembox document to replace some text in a docx document and it works great. However, I have a table of contents field that disappears after saving the document.
I tried doing the following but the field still disappears leaving only the placeholder text:
var toc = (TableOfEntries)document.GetChildElements(true, ElementType.TableOfEntries).First();
toc.Update();
document.GetPaginator(new PaginatorOptions() { UpdateFields = true });

UPDATE (2021-01-15):
Please try again with the latest version from the BugFixes page or from NuGet.
With the latest version, the update works on a machine whose culture uses the ';' character as the list separator.
Or you can specify that culture like this:
var toc = (TableOfEntries)document.GetChildElements(true, ElementType.TableOfEntries).First();
CultureInfo.CurrentCulture = new CultureInfo("fr");
toc.Update();
document.GetPaginator(new PaginatorOptions() { UpdateFields = true });
Also, the issue with the missing tab stop should be resolved now as well.
ORIGINAL:
When I tried to update your TOC from MS Word, I got the following:
No table of contents entries found.
After investigating the field's code of your TOC element, I figured out what the problem is.
This is the instruction text that you have:
{ TOC \h \z \t "TitreChapitre;1;SousTitreChapitre;2" }
These semicolon character separators (;) are culture-dependent. In other words, updating this TOC element will work on a machine that has a French region and settings, but it won't work when you have an English region and settings.
I'm currently on vacation, so I can't do anything about this. When I come back I will fix this problem for you.
For now, can you use the following as a workaround (I also noticed an issue with missing TabStop, this workaround will cover that as well):
var toc = (TableOfEntries)document.GetChildElements(true, ElementType.TableOfEntries).First();
var section = toc.Parent as Section;
var tocWidth = section.PageSetup.PageWidth - section.PageSetup.PageMargins.Left - section.PageSetup.PageMargins.Right;
var toc1Style = document.Styles["toc 1"] as ParagraphStyle;
var toc1TabStop = new TabStop(tocWidth - toc1Style.ParagraphFormat.RightIndentation, TabStopAlignment.Right, TabStopLeader.Dot);
toc1Style.ParagraphFormat.Tabs.Add(toc1TabStop);
var toc2Style = document.Styles["toc 2"] as ParagraphStyle;
var toc2TabStop = new TabStop(tocWidth - toc2Style.ParagraphFormat.RightIndentation, TabStopAlignment.Right, TabStopLeader.Dot);
toc2Style.ParagraphFormat.Tabs.Add(toc2TabStop);
toc.InstructionText = toc.InstructionText.Replace(';', ',');
toc.Update();
document.GetPaginator(new PaginatorOptions() { UpdateFields = true });
I hope this works for you.

Parse Line and Break it into Variables

I have a text file that contains only the FULL version number of an application. I need to extract it and then parse it into separate variables.
For example, let's say version.cs contains 19.1.354.6
The code I'm using does not seem to be working:
char[] delimiter = { '.' };
string currentVersion = System.IO.File.ReadAllText(@"C:\Applicaion\version.cs");
string[] partsVersion;
partsVersion = currentVersion.Split(delimiter);
string majorVersion = partsVersion[0];
string minorVersion = partsVersion[1];
string buildVersion = partsVersion[2];
string revisVersion = partsVersion[3];

Your problem is most likely with the file, which probably contains other text besides the version. That said, why don't you use the Version class, which is made for exactly this kind of task?
var version = new Version("19.1.354.6");
var major = version.Major; // etc..
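
A minimal sketch of how the Version class could be combined with reading the file, assuming the file holds nothing but the version string plus possibly a trailing newline. ParseVersionText is a hypothetical helper name. It uses Version.TryParse so that unexpected file content yields null instead of an exception.

```csharp
using System;

public static class VersionSketch
{
    // Hypothetical helper: trims surrounding whitespace and newlines,
    // then returns the parsed Version, or null if the text is not a version.
    public static Version ParseVersionText(string fileText)
    {
        Version version;
        return Version.TryParse(fileText.Trim(), out version) ? version : null;
    }

    public static void Main()
    {
        // Stands in for the text read from the file with File.ReadAllText.
        var version = ParseVersionText("19.1.354.6\r\n");
        if (version == null)
        {
            Console.WriteLine("The text is not a plain version number.");
            return;
        }
        Console.WriteLine(version.Major);    // 19
        Console.WriteLine(version.Minor);    // 1
        Console.WriteLine(version.Build);    // 354
        Console.WriteLine(version.Revision); // 6
    }
}
```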

What you have works fine with the correct input, so I would suggest making sure there is nothing else in the file you're reading.
In the future, please provide error information, since we can't usually tell exactly what you expect to happen, only what we know should happen.
In light of that, I would also suggest looking into using Regex for parsing in the future. In my opinion, it provides a much more flexible solution for your needs. Here's an example of regex to use:
// Each group captures one or more digits, so multi-digit revisions match too.
var regex = new Regex(@"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)");
var match = regex.Match("19.1.354.6");
if (match.Success)
{
    Console.WriteLine("Match[1]: " + match.Groups[1].Value);
    Console.WriteLine("Match[2]: " + match.Groups[2].Value);
    Console.WriteLine("Match[3]: " + match.Groups[3].Value);
    Console.WriteLine("Match[4]: " + match.Groups[4].Value);
}
else
{
    Console.WriteLine("No match found");
}
which outputs the following:
// Match[1]: 19
// Match[2]: 1
// Match[3]: 354
// Match[4]: 6

Is there any way to "substitute" numbers in string C#?

I have html code, which I need to parse on the fly. I need to find exact divs there, which all have id of "content-text-" and then 6 numbers (like "content-text-123456"), which I don't know beforehand. Is there any way to "substitute" the numbers at the end of the string I'm searching for (like "content-text-######")? Searching for "content-text-" does not work.
I'm doing this project on Windows Phone 8.1 with C# if it matters.
EDIT:
WPPageResponse response = JsonConvert.DeserializeObject<WPPageResponse>(json);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response.content);
foreach (var node in doc.DocumentNode.Descendants("div").Where(div => div.GetAttributeValue("id", "") == "content-text-######"))
{
    // Gather data what it returns
}
Here is some code if it helps. It works if I know the numbers and search with them, but the thing is that I can't know all the numbers there.

You can use Regex for this.
string data = "MyTest = 5564327";
string output = Regex.Replace(data, @"\d", "#");
Console.WriteLine(output);
Console.Read();
Output is:
MyTest = #######
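
For the id lookup in the question, a regex can also be used to test each id rather than to rewrite it. The following is a minimal sketch, assuming the ids are always "content-text-" followed by exactly six digits. IsContentId is a hypothetical helper, and the sample ids are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IdPatternSketch
{
    // "content-text-" followed by exactly six digits and nothing else.
    private static readonly Regex ContentId = new Regex(@"\Acontent-text-[0-9]{6}\z");

    // Hypothetical helper: true only for ids of the expected shape.
    public static bool IsContentId(string id)
    {
        return ContentId.IsMatch(id);
    }

    public static void Main()
    {
        var ids = new[] { "content-text-123456", "content-text-12", "sidebar", "content-text-654321" };
        foreach (var id in ids.Where(IsContentId))
        {
            Console.WriteLine(id); // prints the first and the last id only
        }
    }
}
```

In the loop from the question, the predicate would then become div => IsContentId(div.GetAttributeValue("id", "")) instead of the equality comparison.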

Lucene RegexQuery doesn't seem to apply Regex?

Code:
private static void AddTextToIndex(string filename, string pdfBody, IndexWriter writer)
{
    Document doc = new Document();
    doc.Add(new Field("fileName", filename.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("pdfBody", pdfBody.ToString(), Field.Store.NO, Field.Index.NOT_ANALYZED));
    writer.AddDocument(doc);
}

protected void txtBoxSearchPDF_Click(object sender, EventArgs e)
{
    //some code
    string searchQuery = txtBoxSearchString.Text;
    Term t = new Term("fileName", searchQuery + "/i");
    RegexQuery regQuer = new RegexQuery(t);
    TopDocs resultDocs = indexSearch.Search(regQuer, indexReader.MaxDoc);
    var hits = resultDocs.ScoreDocs;
    foreach (var hit in hits)
    {
        var documentFromSearcher = indexSearch.Doc(hit.Doc);
        string getResult = documentFromSearcher.Get("fileName");
        string formattedResult = getResult.Replace(" ", "%20");
        sb.AppendLine(@"" + getResult + "");
        sb.AppendLine("<br>");
    }
}
Basically all I'm trying to do is use Regex so that I can match things exactly but I want the search to be case insensitive. But adding the /i option doesn't actually make it a regular expression, all it seems to do is make the search term literally whatever was entered in the text box concatenated with the /i.
Any ideas?

Case sensitivity depends mostly on the Analyzer you use.
A RegexQuery is a MultiTermQuery, which means it gets rewritten to something similar to a BooleanQuery with a SHOULD occurrence on all the terms that match the regex.
At search time, the terms in your index are enumerated and matched against your regex. The matching terms are added as clauses to the BooleanQuery.
Your regex does not go through the analyzer, so you have to adjust it manually to match your terms as they are stored in the index.
Also, the regex syntax does not support many features. See the docs.
Actually, I simplified the explanation. What really happens is more complicated because many optimizations take place (not all the terms are enumerated, the regex is compiled to a finite state automaton, the query does not necessarily get rewritten to a BooleanQuery, etc). But what happens behind the scenes has the same outcome as what I've explained here.
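
The simplified model above can be sketched in plain C#. This is not Lucene code: the term list stands in for the indexed terms of a NOT_ANALYZED field, MatchingTerms is a hypothetical helper, and the .NET Regex class stands in for whatever regex engine the RegexQuery actually uses. Whether that engine accepts the inline (?i) option is an assumption to check against the docs. Lowercasing the terms at index time and lowercasing the query does not depend on the engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TermMatchSketch
{
    // Hypothetical helper: enumerates the terms and keeps those the regex matches,
    // mirroring the "enumerate terms, add matches as clauses" description.
    public static List<string> MatchingTerms(IEnumerable<string> indexedTerms, string pattern)
    {
        var regex = new Regex(pattern);
        return indexedTerms.Where(t => regex.IsMatch(t)).ToList();
    }

    public static void Main()
    {
        // Made-up file names, stored exactly as written (no analyzer involved).
        var terms = new[] { "Report.pdf", "report-2.pdf", "notes.pdf" };

        // "/i" is not an option here; it is matched as two literal characters.
        Console.WriteLine(MatchingTerms(terms, "report/i").Count);   // 0

        // .NET inline option for case-insensitive matching.
        Console.WriteLine(MatchingTerms(terms, "(?i)report").Count); // 2

        // Engine-independent alternative: lowercase the terms when indexing
        // and lowercase the search text before building the query.
        var lowered = terms.Select(t => t.ToLowerInvariant());
        Console.WriteLine(MatchingTerms(lowered, "report").Count);   // 2
    }
}
```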

Regex matching dynamic words within an html string

I have an html string to work with as follows:
string html = new MvcHtmlString(item.html.ToString()).ToHtmlString();
There are two different, though very similar, types of text I need to match. I need the opening ^^ and the closing |^^ removed. Then, if there are multiple clients, I need the ^ separating the clients changed to a comma (,).
^^Client One- This text is pretty meaningless for this task, but it will exist in the real document.|^^
^^Client One^Client Two^Client Three- This text is pretty meaningless for this task, but it will exist in the real document.|^^
I need to be able to match each client and make it bold.
Client One- This text is pretty meaningless for this task, but it will exist in the real document.
Client One, Client Two, Client Three- This text is pretty meaningless for this task, but it will exist in the real document.
A nice Stack Overflow user provided the following, but I could not get it to work or find any matches when I tested it on an online regex tester.
const string pattern = @"\^\^(?<clients>[^-]+)(?<text>-.*)\|\^\^";
var result = Regex.Replace(html, pattern,
    m =>
    {
        var clientlist = m.Groups["clients"].Value;
        var newClients = string.Join(",", clientlist.Split('^').Select(s => string.Format("<strong>{0}</strong>", s)));
        return newClients + m.Groups["text"];
    });
I am very new to regex so any help is appreciated.

I'm new to C# so forgive me if I make rookie mistakes :)
const string pattern = @"\^\^([^-]+)(-[^|]+)\|\^\^";
var temp = Regex.Replace(html, pattern, "<strong>$1</strong>$2");
var result = Regex.Replace(temp, @"\^", "</strong>, <strong>");
I'm using $1 even though MSDN is vague about using that syntax to reference subgroups.
Edit: if it's possible that the text after - contains a ^ you can do this:
var result = Regex.Replace(temp, @"\^(?=.*-)", "</strong>, <strong>");
