I have this code for inserting data into a database. I have created a List of Tuple<string, double, string> and add elements to the List inside a nested while loop. Here is the code:
System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Users\Malik\Desktop\research_fields.txt");
Program p = new Program();
var dd = new List<Tuple<string, double, string>>();
//string document = "The trie data structure has many properties which make it especially attractive for representing large files of data. These properties include fast retrieval time, quick unsuccessful search determination, and finding the longest match to a given identifier. The main drawback is the space requirement. In this paper the concept of trie compaction is formalized. An exact algorithm for optimal trie compaction and three algorithms for approximate trie compaction are given, and an analysis of the three algorithms is done. The analysis indicate that for actual tries, reductions of around 70 percent in the space required by the uncompacted trie can be expected. The quality of the compaction is shown to be insensitive to the number of nodes, while a more relevant parameter is the alphabet size of the key.";
//string[] document = get_Abstract();
string line;
try
{
SqlConnection con = new SqlConnection("Data Source=KHIZER;Initial Catalog=subset_aminer;Integrated Security=True");
con.Open();
SqlCommand query = con.CreateCommand();
query.CommandText = "select p_abstract from sub_aminer_paper where pid between 1 and 500 and DATALENGTH(p_abstract) != 0";
SqlDataReader reader = query.ExecuteReader();
string summary = null;
while (reader.Read())
{
summary = reader["p_abstract"].ToString();
while ((line = file.ReadLine()) != null)
{
dd.Add(Tuple.Create(line, p.calculate_CS(line, summary), summary));
}
var top_value = dd.OrderByDescending(x => x.Item2).FirstOrDefault();
if (top_value != null)
{
// look up record using top_value.Item3, and then store top_value.Item1
var abstrct = top_value.Item3.ToString();
var r_field = top_value.Item1.ToString();
write_To_Database(abstrct, r_field);
}
}
reader.Close();
}
catch (Exception e)
{
Console.WriteLine("Exception: " + e.Message);
}
finally
{
Console.WriteLine("Executing finally block.");
}
I have debugged it in Visual Studio 2013 using C#, and I can see that the statement inside the inner while loop, i.e. dd.Add(Tuple.Create(line, p.calculate_CS(line, summary), summary));, executes only once, while it should execute 22 times because reader.Read() returns 22 documents.
I have checked it by using only the single string document shown as a //comment in the code, and it works fine that way, but not when reading the documents from the database.
I am not getting why this is so. Any suggestions will be highly appreciated.
To get inside the inner while loop, your (line = file.ReadLine()) != null condition must be true. If you only get there once, I suspect you have only one line in your file; in that case, no matter how many records the reader returns, the code inside the inner while will execute only once.
Overall, however, your while-loop code doesn't make much sense to me. You are going to read all the text from the file in the first iteration of the outer while (reader.Read()) loop, and after that the inner while loop will be skipped forever, because the file has already been read to the end. If your intention is to read all lines exactly once, move the file-reading while loop before the reader loop.
To further improve your code, look up the documentation pages for ReadLines and AddRange.
And to find the maximum value in the collection, instead of
var top_value = dd.OrderByDescending(x => x.Item2).FirstOrDefault();
use Max:
var top_value = dd.Max(x => x.Item2);
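Note that Max returns only the highest Item2 value itself; if you still need the whole tuple (for Item1 and Item3), a single-pass Aggregate avoids the sort. A minimal sketch, assuming dd is the non-empty List<Tuple<string, double, string>> built above:
// Keep the tuple whose Item2 (the score) is largest, scanning the list once.
// Aggregate throws on an empty list, so guard with dd.Count > 0 if that can happen.
var top_tuple = dd.Aggregate((best, next) => next.Item2 > best.Item2 ? next : best);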
Update:
var lines = System.IO.File.ReadLines(@"C:\Users\Malik\Desktop\research_fields.txt");
while (reader.Read())
{
summary = reader["p_abstract"].ToString();
dd.AddRange(lines
.Select( line =>
Tuple.Create(line, p.calculate_CS(line, summary), summary)
)
);
// rest of your stuff
}
I have a flat file with an unfortunately dynamic column structure. There is a value that is in a hierarchy of values, and each tier in the hierarchy gets its own column. For example, my flat file might resemble this:
StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
1234|7890|abcd|efgh|ijkl|mnop|Pending
...
The same feed the next day may resemble this:
StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
1234|7890|abcd|efgh|ijkl|Complete
...
The thing is, I don't care much about all the tiers; I only care about the id of the last (bottom) tier, and all the other row data that is not a part of the tier columns. I need to normalize the feed to something resembling this to inject into a relational database:
StatisticID|FileId|ObjectId|Status
1234|7890|ijkl|Complete
...
What would be an efficient, easy-to-read mechanism for determining the last tier object id, and organizing the data as described? Every attempt I've made feels kludgy to me.
Some things I've done:
I have tried to examine the column names for regular expression patterns, identify the columns that are tiered, order them by name descending, and select the first record... but I lose the ordinal column number this way, so that didn't look good.
I have placed the columns I want into an IDictionary<string, int> object to reference, but again reliably collecting the ordinal of the dynamic columns is an issue, and it seems this would be rather non-performant.
I ran into a similar problem a few years ago. I used a Dictionary to map the columns; it was not pretty, but it worked.
First, build a Dictionary:
private Dictionary<int, int> GetColumnDictionary(string headerLine)
{
Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
List<string> columnNames = headerLine.Split('|').ToList();
string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
for (int index = 0; index < columnNames.Count; index++)
{
if (columnNames[index] == "StatisticID")
{
columnDictionary.Add(0, index);
}
if (columnNames[index] == "FileId")
{
columnDictionary.Add(1, index);
}
if (columnNames[index] == maxTierObjectColumnName)
{
columnDictionary.Add(2, index);
}
if (columnNames[index] == "Status")
{
columnDictionary.Add(3, index);
}
}
return columnDictionary;
}
private string GetMaxTierObjectColumnName(List<string> columnNames)
{
// Edit this function if the tier number can be greater than 9; a numeric-ordering sketch follows after this method
var maxTierObjectColumnName = columnNames.Where(c => c.Contains("Tier") && c.Contains("Object")).OrderBy(c => c).Last();
return maxTierObjectColumnName;
}
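If more than nine tiers can ever appear, the plain OrderBy(c => c) above sorts lexicographically, so "Tier10ObjectId" would land before "Tier2ObjectId". A hedged sketch of a numeric-ordering variant, assuming the column names always follow the Tier{n}ObjectId pattern:
private string GetMaxTierObjectColumnName(List<string> columnNames)
{
    // Pull the tier number out of each Tier{n}ObjectId name and order by it numerically.
    return columnNames
        .Where(c => Regex.IsMatch(c, @"^Tier\d+ObjectId$"))
        .OrderBy(c => int.Parse(Regex.Match(c, @"\d+").Value))
        .Last();
}
(This assumes a using System.Text.RegularExpressions; directive.)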
And after that it's simply a matter of running through the file:
private List<DataObject> ParseFile(string fileName)
{
    List<DataObject> dataObjects = new List<DataObject>();
    using (StreamReader streamReader = new StreamReader(fileName))
    {
        string headerLine = streamReader.ReadLine();
        Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var lineValues = line.Split('|');
            dataObjects.Add(
                new DataObject()
                {
                    StatisticId = lineValues[columnDictionary[0]],
                    FileId = lineValues[columnDictionary[1]],
                    ObjectId = lineValues[columnDictionary[2]],
                    Status = lineValues[columnDictionary[3]]
                }
            );
        }
    }
    return dataObjects;
}
I hope this helps (even a little bit).
Personally I would not try to reformat your file. I think the easiest approach would be to parse each row from the front and the back. For example:
itemArray = getMyItems();
statisticId = itemArray[0];
fileId = itemArray[1];
//and so on for the rest of your pre-tier columns
//Then get the second-to-last column, which will be the last tier
lastTierId = itemArray[itemArray.Length - 2]; //the last column is Status
Since you know the last tier will always be second from the end, you can just start at the end and work your way forwards. This seems much easier than trying to reformat the data file.
If you really want to create a new file, you could use this approach to get the data you want to write out.
I don't know C# syntax, but something along these lines:
split line in parts with | as separator
get parts [0], [1], [length - 2] and [length - 1]
pass the parts to the database handling code
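In C#, that outline might look roughly like this (a sketch only; it assumes line holds one data row and that the column order matches the samples above, with Status last and the bottom tier immediately before it):
string[] parts = line.Split('|');
string statisticId = parts[0];
string fileId = parts[1];
string objectId = parts[parts.Length - 2]; // bottom tier sits just before Status
string status = parts[parts.Length - 1];
// pass statisticId, fileId, objectId and status to the database handling code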
I am transferring documents held in a SqlServer database to a RavenDB database. The process is simple but, in line with RavenDB's 'safety first' principle, only the first 128 documents are actually stored.
I understand why this is the case, and I also know this limit can be adjusted for the document store as a whole. However, given that storing more than 128 docs in a given operation is not too unusual, I would like to know what the best practice is.
Here is my code to show an example of how I get around the limit. Is there a more elegant way?
public static void CopyFromSqlServerToRaven(IDocumentSession db)
{
var counter = 0;
using (var connection = new SqlConnection(SqlConnectionString))
{
var command = new SqlCommand(SqlGetContentItems, connection);
command.Connection.Open();
using (var reader = command.ExecuteReader())
{
while (reader.Read())
{
counter++;
db.Store(new ContentItem
{
Id = reader["Id"].ToString(),
Title = reader["Name"].ToString(),
Description = reader["ShortDescription"].ToString()
});
if (counter < 128) continue;
db.SaveChanges();
counter = 0;
}
}
db.SaveChanges();
}
}
You can store as many documents as you want in one operation. There is no need to save changes in chunks.
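As a sketch of what that means for the method in the question (same names and types as above), the counter bookkeeping simply goes away and a single SaveChanges at the end persists everything stored in the session:
public static void CopyFromSqlServerToRaven(IDocumentSession db)
{
    using (var connection = new SqlConnection(SqlConnectionString))
    {
        var command = new SqlCommand(SqlGetContentItems, connection);
        command.Connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                db.Store(new ContentItem
                {
                    Id = reader["Id"].ToString(),
                    Title = reader["Name"].ToString(),
                    Description = reader["ShortDescription"].ToString()
                });
            }
        }
        // One SaveChanges sends everything in a single batch.
        db.SaveChanges();
    }
}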
Hi,
If only insert operations occur on a Lucene index (no deletes/updates), is it true that the docID does not change? And is that reliable?
If it is true, I want to use it to load the FieldCache incrementally, to lower the overhead of loading all documents. What is the best solution for that?
I'm not quite sure what you're planning to do with the field cache, but my understanding of document ids is that they can change during an insert, depending on pending deletes, merge policies, etc.
That is, a document ID should not be relied on past a commit boundary on a reopened index reader.
Hope this helps,
The document id is static within a segment. IndexReader.Open (usually) opens a DirectoryReader which combines several SegmentReader. You'll need to pass the "bottom" reader to the FieldCache for the population to work correctly.
Here's an example from "FieldCache with frequently updating index" which ensures that only the newly read segment is read by the FieldCache, instead of the topmost reader (which will be considered changed at every commit).
var directory = FSDirectory.Open(new DirectoryInfo("index"));
var reader = IndexReader.Open(directory, readOnly: true);
var documentId = 1337;
// Grab all subreaders.
var subReaders = new List<IndexReader>();
ReaderUtil.GatherSubReaders(subReaders, reader);
// Loop through all subreaders. While subReaderId is higher than the
// maximum document id in the subreader, go to next.
var subReaderId = documentId;
var subReader = subReaders.First(sub => {
if (sub.MaxDoc() < subReaderId) {
subReaderId -= sub.MaxDoc();
return false;
}
return true;
});
var values = FieldCache_Fields.DEFAULT.GetInts(subReader, "newsdate");
var value = values[subReaderId];
I have logs stored in a txt file in the following format.
======8/4/2010 10:20:45 AM=========================================
Processing Donation
======8/4/2010 10:21:42 AM=========================================
Sending information to server
======8/4/2010 10:21:43 AM=========================================
I need to parse these lines into a list where the information between the "====" lines is counted as one record, to display on a web page using paging in ASP.NET MVC.
Example: The first record entry would be
======8/4/2010 10:20:45 AM=================================================
Processing Donation
I have had no luck so far. How can I do this?
Whilst reading in the file, you could do a check to see if the line ends with "====":
var sBuilder = new StringBuilder();
bool lineEnd = false;
var items = new List<string>();
string currentLine = String.Empty;
using (var file = new StreamReader("log.txt"))
{
while( (currentLine = file.ReadLine()) != null)
{
if(currentLine.EndsWith("===="))
{
items.Add(sBuilder.ToString());
sBuilder.Clear();
}
else
sBuilder.Append(currentLine);
}
}
It's a bit verbose, but it might give you some ideas.
So... Ignore the verbose code in my other answer. Instead use this two line wonder:
string texty = "=====........"; //File data
var matches = Regex.Matches(texty, @"={6}(?<Date>.+)={41}\s*(?<Message>.+)");
var results = matches.Cast<Match>().Select(m => new { Date = m.Groups["Date"].Value, Message = m.Groups["Message"].Value });
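A quick usage sketch for consuming the results sequence (assuming texty holds the full log contents, as above):
foreach (var r in results)
{
    // Each match exposes the captured date and the message that follows it.
    Console.WriteLine("{0}: {1}", r.Date, r.Message);
}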
I always forget about regular expressions.
I have a basic C# console application that reads a text file (CSV format) line by line and puts the data into a Hashtable. The first CSV item in the line is the key (id num) and the rest of the line is the value. However, I've discovered that my import file has a few duplicate keys that it shouldn't have. When I try to import the file, the application errors out because you can't have duplicate keys in a Hashtable. I want my program to be able to handle this error, though. When I run into a duplicate key, I would like to put that key into an ArrayList and continue importing the rest of the data into the Hashtable. How can I do this in C#?
Here is my code:
private static Hashtable importFile(Hashtable myHashtable, String myFileName)
{
StreamReader sr = new StreamReader(myFileName);
CSVReader csvReader = new CSVReader();
ArrayList tempArray = new ArrayList();
int count = 0;
while (!sr.EndOfStream)
{
String temp = sr.ReadLine();
if (temp.StartsWith(" "))
{
ServMissing.Add(temp);
}
else
{
tempArray = csvReader.CSVParser(temp);
Boolean first = true;
String key = "";
String value = "";
foreach (String x in tempArray)
{
if (first)
{
key = x;
first = false;
}
else
{
value += x + ",";
}
}
myHashtable.Add(key, value);
}
count++;
}
Console.WriteLine("Import Count: " + count);
return myHashtable;
}
if (myHashtable.ContainsKey(key))
duplicates.Add(key);
else
myHashtable.Add(key, value);
A better solution is to call ContainsKey to check if the key exists before adding it to the hash table. Throwing an exception on this kind of error is a performance hit and doesn't improve the program flow.
ContainsKey has a constant O(1) overhead for every item, while catching an Exception incurs a performance hit on JUST the duplicate items.
In most situations I'd say check for the key, but in this case it's better to catch the exception.
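A small sketch of that catch-based variant, reusing the names from the code in the question (with duplicates standing in for the list that collects the duplicate keys):
try
{
    myHashtable.Add(key, value);
}
catch (ArgumentException)
{
    // Hashtable.Add throws ArgumentException when the key is already present.
    duplicates.Add(key);
}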
Here is a solution which avoids multiple hits in the secondary list with a small overhead to all insertions:
Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();
//Insert item
if (!dict.ContainsKey(key))
    dict[key] = new List<string>();
dict[key].Add(value);
You can wrap the dictionary in a type that hides this, or put it in a method, or even an extension method on Dictionary.
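For example, a small extension method along those lines (a sketch; the name AddMulti is purely illustrative):
public static class DictionaryExtensions
{
    // Adds value under key, creating the list the first time the key is seen.
    public static void AddMulti<TKey, TValue>(this Dictionary<TKey, List<TValue>> dict, TKey key, TValue value)
    {
        List<TValue> list;
        if (!dict.TryGetValue(key, out list))
        {
            list = new List<TValue>();
            dict[key] = list;
        }
        list.Add(value);
    }
}
Call sites then reduce to dict.AddMulti(key, value);.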
If you have more than 4 (for example) CSV values, it might be worth building the value variable with a StringBuilder as well, since repeated string concatenation is slow.
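For instance, the value-building loop from the question could be sketched like this, appending to a StringBuilder instead of concatenating strings:
var valueBuilder = new StringBuilder();
bool first = true;
String key = "";
foreach (String x in tempArray)
{
    if (first)
    {
        key = x;
        first = false;
    }
    else
    {
        // Append each remaining CSV field followed by a comma, as before.
        valueBuilder.Append(x).Append(',');
    }
}
String value = valueBuilder.ToString();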
Hmm, 1.7 Million lines? I hesitate to offer this for that kind of load.
Here's one way to do this using LINQ.
CSVReader csvReader = new CSVReader();
List<string> source = new List<string>();
using(StreamReader sr = new StreamReader(myFileName))
{
while (!sr.EndOfStream)
{
source.Add(sr.ReadLine());
}
}
List<string> ServMissing =
    source
    .Where(s => s.StartsWith(" "))
    .ToList();
//--------------------------------------------------
List<IGrouping<string, string>> groupedSource =
(
from s in source
where !s.StartsWith(" ")
let parsed = csvReader.CSVParser(s).Cast<string>().ToList() // Cast needed because CSVParser returns an ArrayList
where parsed.Any()
let first = parsed.First()
let rest = String.Join( "," , parsed.Skip(1).ToArray())
select new {first, rest}
)
.GroupBy(x => x.first, x => x.rest) //GroupBy(keySelector, elementSelector)
.ToList();
//--------------------------------------------------
List<string> myExtras = new List<string>();
foreach(IGrouping<string, string> g in groupedSource)
{
myHashTable.Add(g.Key, g.First());
if (g.Skip(1).Any())
{
myExtras.Add(g.Key);
}
}
Thank you all.
I ended up using the ContainsKey() method. It takes maybe 30 secs longer, which is fine for my purposes. I'm loading about 1.7 million lines and the program takes about 7 mins total to load up two files, compare them, and write out a few files. It only takes about 2 secs to do the compare and write out the files.