Lucene docID reliability - c#

Hi
If only insert operations occur on a Lucene index (no deletes/updates), is it true that the docID does not change and can be relied upon?
If so, I want to use it to load the FieldCache incrementally and reduce the overhead of loading all documents. What is the best approach for that?

I'm not quite sure what you're planning to do with the field cache, but my understanding of document IDs is that they can change during an insert, depending on pending deletes, merge policies, etc.
i.e. a document ID should not be relied upon past a commit boundary on a reopened index reader.
Hope this helps,

The document id is static within a segment. IndexReader.Open (usually) opens a DirectoryReader, which combines several SegmentReaders. You'll need to pass the "bottom" reader to the FieldCache for the population to work correctly.
Here's an example (from "FieldCache with frequently updating index") which ensures that only the newly opened segment is read by the FieldCache, instead of the topmost reader (which will be considered changed at every commit).
var directory = FSDirectory.Open(new DirectoryInfo("index"));
var reader = IndexReader.Open(directory, readOnly: true);
var documentId = 1337;

// Grab all subreaders.
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, reader);

// While the document id falls beyond the subreader's document count, subtract
// that count and move to the next subreader; the remainder is the segment-local id.
var subReaderId = documentId;
var subReader = subReaders.First(sub =>
{
    if (sub.MaxDoc() <= subReaderId)
    {
        subReaderId -= sub.MaxDoc();
        return false;
    }

    return true;
});

var values = FieldCache_Fields.DEFAULT.GetInts(subReader, "newsdate");
var value = values[subReaderId];

Related

Basic Read CSV File Questions

Thanks in advance, C# newb here having a few issues.
I have this CSV file provided daily; it's large and has no header. I only need certain items out of this file.
Here is the code I have so far.
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    HasHeaderRecord = false,
};

using (var reader = new StreamReader(iFile.FileName))
using (var csv = new CsvReader(reader, config))
{
    var records = new List<BQFile>();
    csv.Read();
    csv.ReadHeader();
    while (csv.Read())
    {
        var record = new BQFile()
        {
            SNumber = csv.GetField<string>("SNumber"),
            FOBPoint = csv.GetField<string>("FOBPoint")
        };
    }
}
What I am not understanding, since this CSV file has 150+ fields, is how to grab the correct data. For example, SNumber is column 46 and FOBPoint is column 123. I am finding the CsvHelper documentation a little limited.
Any help is appreciated.
What I am not understanding, since this CSV file has 150+ fields, is how to grab the correct data
By index, because there is no header
In your BQFile class, decorate the properties with an attribute of [Index(NNN)], where NNN is the column number (0-based). The IndexAttribute is found in the CsvHelper.Configuration.Attributes namespace - I mention this because Entity Framework also has an Index attribute; be sure you use the correct one.
public class BQFile
{
    [Index(46)]
    public string SNumber { get; set; }

    ...
}
Then do:
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    HasHeaderRecord = false,
};

using (var reader = new StreamReader(iFile.FileName))
using (var csv = new CsvReader(reader, config))
{
    var records = csv.GetRecords<BQFile>();
    ...
records is an enumeration on top of the file stream (via CsvHelper, which reads records as it goes and creates instances of BQFile). You can only enumerate it once, and after you're done enumerating, the file stream will be at the end; if you wanted to re-read the file you'd have to Seek the stream or create a new reader. Also, the file is only read (in chunks, progressively) as you enumerate. If you return records somewhere, so that you drop out of the using block and thus dispose the reader, you'll get an error when you try to start reading from records (because it's disposed).
To work with records, you either foreach it, processing the objects you get as you go:
foreach (BQFile bqf in records)
{
    // Do stuff with each BQFile here.
}
Or, if you want to load it all into memory, you can ToList() it so you end up with a bunch of BQFile instances in a List, and then you can e.g. access them randomly, read them over and over, etc.:
var bqfs = records.ToList();
PS: I don't know, when you said "it's column 46", whether that's counting from 1 or 0. You might have to adjust your 46.
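If you'd rather skip the attributes, CsvHelper can also read individual fields by zero-based position via GetField<T>(int index). A minimal sketch using the column numbers from the question (subject to the same 0-vs-1 adjustment mentioned above):

using (var reader = new StreamReader(iFile.FileName))
using (var csv = new CsvReader(reader, config))
{
    var records = new List<BQFile>();
    while (csv.Read())
    {
        records.Add(new BQFile
        {
            // Read each column by its zero-based position instead of by header name.
            SNumber = csv.GetField<string>(46),
            FOBPoint = csv.GetField<string>(123)
        });
    }
}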

How can I achieve DynamoDB pagination the same as SQL/MySQL (total item count and the ability to jump to any page) in C#

How do I implement pagination using DynamoDB in C# the same way as in SQL/MySQL?
I have checked the documentation for ScanRequest and QueryRequest, and that part works perfectly fine.
But I need pagination the same as we all do in SQL/MySQL: in the initial call I need the total item count and page-wise records, and I want to be able to jump to any other page easily.
Can anyone suggest a good solution or an alternative?
Thank you in advance.
Dictionary<string, AttributeValue> lastKeyEvaluated = null;
do
{
    var request = new ScanRequest
    {
        TableName = "Test",
        Limit = 5,
        ExclusiveStartKey = lastKeyEvaluated
    };

    var response = await Client.ScanAsync(request);
    lastKeyEvaluated = response.LastEvaluatedKey;
}
while (lastKeyEvaluated.Count != 0);
Dictionary<string, Condition> conditions = new Dictionary<string, Condition>();

// The PK attribute should equal the string "Company#04".
Condition pkCondition = new Condition();
pkCondition.ComparisonOperator = ComparisonOperator.EQ;
pkCondition.AttributeValueList.Add(new AttributeValue { S = "Company#04" });
conditions["PK"] = pkCondition;

var request = new QueryRequest();
request.Limit = 2;
request.ExclusiveStartKey = null;
request.TableName = "Test";
request.KeyConditions = conditions;

var result = await Client.QueryAsync(request);
DynamoDB supports an entirely different type of pagination. The concept of page number, offset, etc. is another paradigm from the SQL database world that has no parallel in DynamoDB. However, pagination is supported, just not the way many expect.
You may want to consider changing how your client application paginates to accommodate. If you have a small amount of information to paginate (e.g. under 1MB of data), you might consider passing it to the client and letting the client implement the pagination (e.g. page 4 of 15).
If you have too much data to paginate client-side, you may want to consider updating your client application to accommodate how DynamoDB paginates (e.g. using LastEvaluatedKey and ExclusiveStartKey).
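For reference, a minimal sketch of that token-style pagination, assuming the "Test" table and "PK" partition key from the snippets above; the helper name and page size are illustrative, and the LastEvaluatedKey from one page is handed back as the ExclusiveStartKey of the next:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class DynamoPaging
{
    // Fetches one "page" of a partition. Pass the LastEvaluatedKey returned by the
    // previous call as pageToken to get the next page (null for the first page).
    public static async Task<QueryResponse> GetPageAsync(
        IAmazonDynamoDB client, string pkValue, int pageSize,
        Dictionary<string, AttributeValue> pageToken = null)
    {
        var request = new QueryRequest
        {
            TableName = "Test",
            KeyConditionExpression = "PK = :pk",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":pk"] = new AttributeValue { S = pkValue }
            },
            Limit = pageSize,
            ExclusiveStartKey = pageToken
        };

        return await client.QueryAsync(request);
    }
}

// Usage: walk the pages forward, carrying the token between calls. There is no
// way to jump straight to page N; you would have to cache tokens per page or
// paginate client-side as described above.
// Dictionary<string, AttributeValue> token = null;
// do
// {
//     var page = await DynamoPaging.GetPageAsync(client, "Company#04", 25, token);
//     // ... render page.Items ...
//     token = page.LastEvaluatedKey != null && page.LastEvaluatedKey.Count > 0
//         ? page.LastEvaluatedKey
//         : null;
// } while (token != null);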

AMO get partitions where data are processed but not indexes

I am writing a script that returns all unprocessed partitions within a measure group using the following command:
objMeasureGroup.Partitions.Cast<Partition>().Where(x => x.State != AnalysisState.Processed)
After doing some experiments, it looks like this property indicates whether the data is processed and says nothing about the indexes.
After searching for hours, I didn't find any method to list the partitions where the data is processed but the indexes are not.
Any suggestions?
Environment:
SQL Server 2014
SSAS multidimensional cube
Scripts are written within an SSIS package / Script Task
First, ProcessIndexes is an incremental operation. So if you run it twice the second time will be pretty quick because there is nothing to do. So I would recommend just running it on the cube and not worrying about whether it was previously run. However if you do need to analyze the current state then read on.
The best way (only way I know of) to distinguish whether ProcessIndexes has been run on a partition is to study the DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT DMVs as seen below.
The DISCOVER_PARTITION_STAT DMV returns one row per aggregation with the rowcount. The first row of that DMV has a blank aggregation name and represents the rowcount of the lowest level data processed in that partition.
The DISCOVER_PARTITION_DIMENSION_STAT DMV can tell you about whether indexes are processed and which range of values by each dimension attribute are in this partition (by internal IDs, so not super easy to interpret). We assume at least one dimension attribute is set to be optimized so it will be indexed.
You will also need to add a reference to Microsoft.AnalysisServices.AdomdClient to simplify running these DMVs:
string sDatabaseName = "YourDatabaseName";
string sCubeName = "YourCubeName";
string sMeasureGroupName = "YourMeasureGroupName";

Microsoft.AnalysisServices.Server s = new Microsoft.AnalysisServices.Server();
s.Connect("Data Source=localhost");
Microsoft.AnalysisServices.Database db = s.Databases.GetByName(sDatabaseName);
Microsoft.AnalysisServices.Cube c = db.Cubes.GetByName(sCubeName);
Microsoft.AnalysisServices.MeasureGroup mg = c.MeasureGroups.GetByName(sMeasureGroupName);

Microsoft.AnalysisServices.AdomdClient.AdomdConnection conn = new Microsoft.AnalysisServices.AdomdClient.AdomdConnection(s.ConnectionString);
conn.Open();

foreach (Microsoft.AnalysisServices.Partition p in mg.Partitions)
{
    Console.Write(p.Name + " - " + p.State + " - ");

    var restrictions = new Microsoft.AnalysisServices.AdomdClient.AdomdRestrictionCollection();
    restrictions.Add("DATABASE_NAME", db.Name);
    restrictions.Add("CUBE_NAME", c.Name);
    restrictions.Add("MEASURE_GROUP_NAME", mg.Name);
    restrictions.Add("PARTITION_NAME", p.Name);

    var dsAggs = conn.GetSchemaDataSet("DISCOVER_PARTITION_STAT", restrictions);
    var dsIndexes = conn.GetSchemaDataSet("DISCOVER_PARTITION_DIMENSION_STAT", restrictions);

    if (dsAggs.Tables[0].Rows.Count == 0)
        Console.WriteLine("ProcessData not run yet");
    else if (dsAggs.Tables[0].Rows.Count > 1)
        Console.WriteLine("aggs processed");
    else if (p.AggregationDesign == null || p.AggregationDesign.Aggregations.Count == 0)
    {
        bool bIndexesBuilt = false;
        foreach (System.Data.DataRow row in dsIndexes.Tables[0].Rows)
        {
            if (Convert.ToBoolean(row["ATTRIBUTE_INDEXED"]))
            {
                bIndexesBuilt = true;
                break;
            }
        }

        if (bIndexesBuilt)
            Console.WriteLine("indexes have been processed. no aggs defined");
        else
            Console.WriteLine("no aggs defined. need to run ProcessIndexes on this partition to build indexes");
    }
    else
        Console.WriteLine("need to run ProcessIndexes on this partition to process aggs and indexes");
}
I am posting this answer as additional information to @GregGalloway's excellent answer.
After searching for a while, the only way I found to know whether partitions are processed is using DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT.
I found an article posted by Daren Gossbel describing the whole process:
SSAS: Are my Aggregations processed?
In the article above, the author provided two methods:
using XMLA
One way you can find it out is with an XMLA Discover call to the DISCOVER_PARTITION_STAT rowset, but that returns the results in a big lump of XML which is not as easy to read as a tabular result set.
example
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_PARTITION_STAT</RequestType>
  <Restrictions>
    <RestrictionList>
      <DATABASE_NAME>Adventure Works DW</DATABASE_NAME>
      <CUBE_NAME>Adventure Works</CUBE_NAME>
      <MEASURE_GROUP_NAME>Internet Sales</MEASURE_GROUP_NAME>
      <PARTITION_NAME>Internet_Sales_2003</PARTITION_NAME>
    </RestrictionList>
  </Restrictions>
  <Properties>
    <PropertyList>
    </PropertyList>
  </Properties>
</Discover>
using DMV queries
If you have SSAS 2008, you can use the new DMV feature to query this same rowset and return a tabular result.
example
SELECT *
FROM SystemRestrictSchema($system.discover_partition_stat
,DATABASE_NAME = 'Adventure Works DW 2008'
,CUBE_NAME = 'Adventure Works'
,MEASURE_GROUP_NAME = 'Internet Sales'
,PARTITION_NAME = 'Internet_Sales_2003')
Similar posts:
How to find out using AMO if aggregation exists on partition?
Detect aggregation processing state with AMO?

Checking duplication using LINQ in collection

I have a function which inserts a record into the database. I want to make sure that there are no duplicate entries in the database. The function first checks if there is a query string parameter: if there is, it acts in edit mode, otherwise in insert mode. There is a function which can return the currently added records in the database. I need to check for duplication based on two columns before inserting into the database.
myService = new myService();
myFlow mf = new myFlow();

if (!string.IsNullOrEmpty(Request["myflowid"]))
{
    mf = myService.Getmyflow(Convert.ToInt32(Request["myflowid"]));
}

int workcount = 0;
int.TryParse(txtWorkCount.Text, out workcount);

mf.Name = txtName.Text.Trim();
mf.Description = txtDescription.Text.Trim();
mf.FunctionCode = txtFunctioneCode.Text.Trim();
mf.FunctionType = txtFunctioneType.Text.Trim();
mf.WorkCount = workcount;

if (mf.WorkFlowId == 0)
{
    mf.SortOrder = 0;
    mf.Active = true;
    mf.RecordDateTime = DateTime.Now;
    message = "Saved Successfully";
}
else
{
    _editMode = true;
    message = "Update Successfully";
}

int myflowId = mfService.AddEditmyflow(mf);
I want to check for duplication based on FunctionType and FunctionCode. Another function, mfService.Getmyflows(), can return the currently added records in the database.
How can I check for duplication using LINQ?
First of all, what database do you use? Many databases support upsert behavior (update or insert depending on whether the data was found or not), for example MERGE in MS SQL, MERGE in Oracle, INSERT ... ON DUPLICATE KEY UPDATE in MySQL, and so on. This could be the preferred solution; an upsert is usually an atomic operation.
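For illustration, a minimal sketch of the MySQL upsert variant, assuming a myflow table with a UNIQUE KEY on (FunctionCode, FunctionType) and the MySql.Data client; the table and column names here are placeholders, not from the original post:

using MySql.Data.MySqlClient;

// Hypothetical helper: insert a myflow row, or update it if one with the same
// (FunctionCode, FunctionType) already exists; relies on a UNIQUE KEY on those columns.
public static void UpsertMyflow(string connectionString, myFlow mf)
{
    const string sql = @"
        INSERT INTO myflow (FunctionCode, FunctionType, Name, Description)
        VALUES (@code, @type, @name, @description)
        ON DUPLICATE KEY UPDATE Name = @name, Description = @description;";

    using (var conn = new MySqlConnection(connectionString))
    using (var cmd = new MySqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@code", mf.FunctionCode);
        cmd.Parameters.AddWithValue("@type", mf.FunctionType);
        cmd.Parameters.AddWithValue("@name", mf.Name);
        cmd.Parameters.AddWithValue("@description", mf.Description);

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}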
In your particular case, do you use transactions? Are you sure no one will insert data after you have checked for duplicates but before you have inserted your record? Example:
#1 thread               #2 thread
look for duplicates
...                     look for duplicates
no duplicates found     ...
                        no duplicates found
insert data_1
                        insert data_1
This will end up with the duplicates you are trying to avoid.
According to your code, you are populating data from the GUI and adding only one item.
If you have access to the myService code, you could add a method to query an item by your two columns, instead of querying all items via mfService.Getmyflows() and looking through the collection in your code. It would be more performant (especially if you have indexes on those columns) and more memory efficient.
And finally, checking whether a single element exists in a collection can easily be done:
var alreadyExist = mfService.Getmyflows()
    .Any(x => x.Column1 == value1 && x.Column2 == value2);
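As a rough sketch of the targeted lookup suggested above, assuming myService wraps something LINQ-queryable such as an Entity Framework context; MyDbContext, Myflows and the method name are assumptions about your data layer, not part of the original post:

using System.Linq;

// Hypothetical method added to myService: ask the database directly whether a
// row with these two values exists, instead of loading every row into memory.
public bool MyflowExists(string functionCode, string functionType)
{
    using (var db = new MyDbContext())
    {
        return db.Myflows.Any(x => x.FunctionCode == functionCode
                                && x.FunctionType == functionType);
    }
}

// Usage before insert:
// if (!myService.MyflowExists(mf.FunctionCode, mf.FunctionType))
// {
//     int myflowId = mfService.AddEditmyflow(mf);
// }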

Need help in using MySqlTransaction

Hi all, I would like to use MySqlTransaction in my requirement. I have a doubt regarding it: as per my requirement, I will have to delete different values from the database.
The process I am following is this. Assume that I have 2 EmpIDs, where each EmpID can hold multiple values. I will store the corresponding values for that particular EmpID in a Dictionary and then save them to a list corresponding to the EmpID.
Assume that I have list elements as follows:
For EmpID 1 I will have 1, 2. I will check this list for the maximum value from the database; if it exists, I would like to delete this EmpID from the database.
For EmpID 2 I will have 1, 2. But in my database the maximum value is 3, so this one fails. I would then like to roll back the previously deleted item.
Is it possible to do this with a transaction? If so, can anyone help me in solving this?
Sample pseudocode:
if (findMax(lst, iEmpID))
{
    obj.delete("storeprocname"); // This will occur when my list has the maximum value.
}
else
{
    // Here I would like to roll back the previous delete, referring to the delete method in the class file.
}
My sample code
if (findMaxPayPeriodID(lstPayPeriodID, iEmpIDs)) // Assume the max pay period exists the first time; how to roll back when it fails the second time?
{
    if (findSequence(lstPayPeriodID)) // Assume this is also true the first time.
    {
        for (int ilstPayperiodID = 0; ilstPayperiodID < lstPayPeriodID1.Count; ilstPayperiodID++)
        {
            oAdmin.Payperiodnumber = (int)lstPayPeriodID1[ilstPayperiodID];

            for (int ilistPayYear = iPayYearcnt; ilistPayYear < lstPayYear1.Count; ilistPayYear++)
            {
                oAdmin.PayYear = (int)lstPayYear1[ilistPayYear];
                iPayYearcnt++;
                break;
            }

            for (int ilistDateTime = idtcnt; ilistDateTime < lstDateTime1.Count; ilistDateTime++)
            {
                idtcnt++;
                oAdmin.PaymentDate = lstDateTime1[ilistDateTime];
                break;
            }
        }

        if (oAdmin.deletePayRoll(oSqlTran))
        {
            oMsg.Message = "Deleted Successfully";
            oMsg.AlertMessageBox(out m_locallblMessage);
            Page.Controls.Add(m_locallblMessage);

            oAdmin.FedTaxID = ddlFedTaxID.SelectedValue;
            oAdmin.PayFrequency = ddlPaymentType.SelectedValue.ToString();
            mlocal_strStoredProcName = "uspSearchPayRoll";
            oAdmin.getPayRollDetails(out mlocal_ds, mlocal_strStoredProcName);

            //grdPayroll.Visible = true;
            grdPayroll.DataSource = mlocal_ds;
            grdPayroll.DataBind();

            if (mlocal_ds != null)
            {
                btnDelete.Visible = true;
            }
            else
                btnDelete.Visible = false;
        }

        lstPayPeriodID.Clear();
        lstDateTime.Clear();
        lstPayYear.Clear();
        iPayIDcnt = 0;
        iPayYearcnt = 0;
        idtcnt = 0;
    }
    else
    {
        // Rollback should be done here.
    }
}
You don't provide enough information - especially since it seems that you will use a stored procedure for the delete operation, all bets are off...
The only option I can think of is to make sure that you first find the maximum EmpID not from one list but from all lists... then just check that against the DB and act accordingly...
This way the DB will only be hit twice (once for the check and once for the delete/stored procedure), which is definitely better in terms of scaling etc.
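Since the question is specifically about MySqlTransaction, here is a minimal sketch of the usual begin/commit/rollback pattern, assuming the deletes are issued per EmpID through one connection and one transaction; if the max-value check for any EmpID fails, throwing (or calling Rollback directly) undoes the earlier deletes. The stored procedure name comes from the question's pseudocode; the connection string, method and parameter names are placeholders:

using System.Collections.Generic;
using System.Data;
using MySql.Data.MySqlClient;

// Deletes a batch of EmpIDs inside a single transaction: either every delete
// commits together, or a failure rolls all of them back.
public static void DeleteEmployees(string connectionString, IEnumerable<int> empIdsToDelete)
{
    using (var conn = new MySqlConnection(connectionString))
    {
        conn.Open();

        using (MySqlTransaction tran = conn.BeginTransaction())
        {
            try
            {
                foreach (var empId in empIdsToDelete)
                {
                    // Enlist each command in the transaction by passing it explicitly.
                    using (var cmd = new MySqlCommand("storeprocname", conn, tran))
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        cmd.Parameters.AddWithValue("@EmpID", empId);
                        cmd.ExecuteNonQuery();
                    }
                }

                // Commit only after every EmpID has been deleted successfully.
                tran.Commit();
            }
            catch
            {
                // Any exception undoes all deletes issued so far in this transaction.
                tran.Rollback();
                throw;
            }
        }
    }
}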
