Produce a crossword-like dataset - c#

I have to write a function that takes a list of strings (words of varying length) and an int giving the size of the data set (for example, a value of 4 means a table with four columns and four rows). From these I must produce a crossword-like block (the block being the dataset) that holds as many of the words in the list as possible. As in a crossword, words can cross each other where their letters match at the right places, and the words must be mixed up and readable in every direction.
I can't find any code to help me with this. So far I have the basic structure of the dataset, shown below. Any help will be appreciated, thanks.
public WordsDs WordMixer(List<string> wordList, int size)
{
    if ((wordList == null) || (size < 2))
    {
        return null;
    }
    //shuffle the words in the list so that they are in a random order
    Random random = new Random();
    var sortedList = wordList.OrderBy(i => random.Next()).ToList();
    //create a dataset for the words
    DataSet ds = new DataSet();
    DataTable dt = new DataTable();
    //add columns and rows according to the size parameter
    for (int i = 0; i < size; i++)
    {
        dt.Columns.Add(i.ToString(), typeof(string));
    }
    for (int i = 0; i < size; i++)
    {
        dt.Rows.Add(i);
    }
    for (int i = 0; i < wordList.Count; i++)
    {
    }//for (int i = 0; i < wordList.Count; i++)
}

You could just use a two-dimensional array to hold the characters. I guess the tricky part is working out, from the word list, where a letter is shared between two words. I guess you could start with the least frequently used letter and work from there!
Interesting Article
http://blogs.teamb.com/craigstuntz/2010/01/11/38518/
Stack Overflow Question may help (although in c++ - might be of use)
Best data structure for crossword puzzle search
Other links to code generators.
http://www.pscode.com/vb/scripts/ShowCode.asp?txtCodeId=6082&lngWId=10
http://dotnetslackers.com/articles/net/Creating-a-programming-crossword-puzzle.aspx
One in c
http://pdos.csail.mit.edu/cgi-bin/theme-cword
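To make the two-dimensional array idea concrete, here is a minimal sketch. It is not taken from any of the linked articles, and the names WordGrid, TryPlace and Fill are made up for illustration. It assumes a square char[,] grid in which '\0' marks an empty cell, and it simply tries random positions and directions, so it will not necessarily fit the maximum possible number of words.

```csharp
using System;
using System.Collections.Generic;

public static class WordGrid
{
    // The eight directions a word can run in, as (row step, column step).
    static readonly int[][] Directions =
    {
        new[] { 0, 1 }, new[] { 0, -1 }, new[] { 1, 0 }, new[] { -1, 0 },
        new[] { 1, 1 }, new[] { 1, -1 }, new[] { -1, 1 }, new[] { -1, -1 }
    };

    // Writes the word and returns true if every cell is inside the grid and is
    // either empty ('\0') or already holds the same letter (a crossing).
    public static bool TryPlace(char[,] grid, string word, int row, int col, int dRow, int dCol)
    {
        int size = grid.GetLength(0); // assumption: the grid is square
        for (int i = 0; i < word.Length; i++)
        {
            int r = row + i * dRow, c = col + i * dCol;
            if (r < 0 || r >= size || c < 0 || c >= size) return false;
            if (grid[r, c] != '\0' && grid[r, c] != word[i]) return false;
        }
        for (int i = 0; i < word.Length; i++)
            grid[row + i * dRow, col + i * dCol] = word[i];
        return true;
    }

    // Tries each word at random positions and directions; returns the words that fit.
    public static List<string> Fill(char[,] grid, IEnumerable<string> words, Random random, int attemptsPerWord = 200)
    {
        int size = grid.GetLength(0);
        var placed = new List<string>();
        foreach (string word in words)
        {
            for (int attempt = 0; attempt < attemptsPerWord; attempt++)
            {
                int[] d = Directions[random.Next(Directions.Length)];
                if (TryPlace(grid, word, random.Next(size), random.Next(size), d[0], d[1]))
                {
                    placed.Add(word);
                    break;
                }
            }
        }
        return placed;
    }
}
```

Calling WordGrid.Fill(new char[size, size], sortedList, random) would give back the words that were placed. The filled grid can then be copied cell by cell into the DataTable, and any cells still holding '\0' can be filled with random letters.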

Related

How easily create an array from .csv and search in it using C#? [duplicate]

This question already has answers here:
Reading CSV files using C#
(12 answers)
Reading data from a CSV to an array of arrays (C#)
(2 answers)
Closed 5 years ago.
There is a 1.csv file
name1;5547894;bnt652147
name2;5546126;bnt956231
name3;5549871;nhy754497
How can I read this file and add the separated values to a 2D array in a fast and elegant way, maybe in one line?
And then, how can I easily and quickly search for some string in that array?
Using an Array of arrays or a List of arrays is much easier, but a 2D array can be done.
For a List of arrays:
var listInput = File.ReadAllLines("1.csv").Select(line => line.Split(';')).ToList();
To find any rows containing a string:
var find = "5549871";
var ContainingRows = listInput.Where(r => r.Any(s => s.Contains(find))).ToList();
To find a row containing an exact match:
var EqualRows = listInput.Where(r => r.Any(s => s == find)).ToList();
If you know there is just one match, you can replace ToList() with First().
If you know more about your search, you could create an index (Dictionary) instead to speed up retrieval.
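As a rough sketch of the index idea, the following builds a Dictionary from every field value to the rows that contain it. The class and method names (CsvIndex, Build) are made up for illustration. It only helps with exact matches, so substring searches like the Contains example above would still need a scan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvIndex
{
    // Maps every distinct field value to the rows that contain it.
    public static Dictionary<string, List<string[]>> Build(IEnumerable<string[]> rows)
    {
        var index = new Dictionary<string, List<string[]>>();
        foreach (string[] row in rows)
        {
            // Distinct() avoids adding the same row twice when a value repeats within it.
            foreach (string field in row.Distinct())
            {
                if (!index.TryGetValue(field, out List<string[]> matches))
                {
                    matches = new List<string[]>();
                    index[field] = matches;
                }
                matches.Add(row);
            }
        }
        return index;
    }
}
```

With var index = CsvIndex.Build(listInput); a lookup becomes index.TryGetValue("5549871", out var rows), which avoids scanning every row on each search. Building the index costs one pass over the data, so it only pays off if you search more than once.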
Unfortunately, there aren't any corresponding 2D array creation features, you must know the size to create it.
var array2d = new string[listInput.Count, 3];
for (int row = 0; row < listInput.Count; ++row) {
    for (int col = 0; col < 3; ++col)
        array2d[row, col] = listInput[row][col];
}
Searching it is also easy, but it isn't going to be fast unless you create some type of index.
var findrow = -1;
// GetLength returns the number of elements in a dimension; stop the outer loop once a match is found
for (int row = 0; row < array2d.GetLength(0) && findrow < 0; ++row) {
    for (int col = 0; col < array2d.GetLength(1); ++col) {
        if (array2d[row, col].Contains(find)) {
            findrow = row;
            break;
        }
    }
}

Is simhash function that reliable?

I have been struggling with the simhash algorithm for a while. I implemented it in my crawler according to my understanding of it. However, when I ran some tests, it did not seem very reliable to me.
I calculated fingerprints for 200,000 different pieces of text and saw that some different content had the same fingerprint. So there is a big possibility of collisions.
My implementation code is below.
My question is: if my implementation is right, this algorithm produces a lot of collisions, so how come Google uses it? Otherwise, what is the problem with my implementation?
public long CalculateSimHash(string input)
{
    var vector = GenerateVector(input);
    //5- Generate Fingerprint
    long fingerprint = 0;
    for (var i = 0; i < HashSize; i++)
    {
        if (vector[i] > 0)
        {
            var zz = Convert.ToInt64(1 << i);
            fingerprint += Math.Abs(zz);
        }
    }
    return fingerprint;
}
private int[] GenerateVector(string input)
{
    //1- Tokenize input
    ITokeniser tokeniser = new OverlappingStringTokeniser(2, 1);
    var tokenizedValues = tokeniser.Tokenise(input);
    //2- Hash values
    var hashedValues = HashTokens(tokenizedValues);
    //3- Prepare vector
    var vector = new int[HashSize];
    for (var i = 0; i < HashSize; i++)
    {
        vector[i] = 0;
    }
    //4- Fill vector according to bitset of hash
    foreach (var value in hashedValues)
    {
        for (var j = 0; j < HashSize; j++)
        {
            if (IsBitSet(value, j))
            {
                vector[j] += 1;
            }
            else
            {
                vector[j] -= 1;
            }
        }
    }
    return vector;
}
I can see a couple of issues. First, you're only getting a 32-bit hash, not a 64-bit one, because you're using the wrong types: 1 << i is an int shift, and it is only converted to a 64-bit value afterwards. See https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/left-shift-operator
It's also best not to use a signed integer type here, to avoid confusion. So:
// Generate Fingerprint
ulong fingerprint = 0;
for (int i = 0; i < HashSize; i++)
{
    if (vector[i] > 0)
    {
        fingerprint += 1UL << i;
    }
}
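To see the difference between the two shifts in isolation, here is a small sketch (ShiftDemo and its method names are just for illustration). For an int left operand, C# uses only the low 5 bits of the shift count, so 1 << 32 is the same as 1 << 0. With a ulong left operand the low 6 bits are used, so all 64 bit positions are reachable.

```csharp
using System;

public static class ShiftDemo
{
    // int shift, as in the question: the count is masked to its low 5 bits,
    // so i = 32 behaves like i = 0 and i = 31 produces a negative int.
    public static long IntShift(int i) => Convert.ToInt64(1 << i);

    // ulong shift, as in the corrected loop: the count is masked to its low
    // 6 bits, so bit positions 0..63 each get a distinct value.
    public static ulong ULongShift(int i) => 1UL << i;
}
```

In the original loop this means that vector positions 32 to 63 land on the same values as positions 0 to 31, so the result does not behave like a 64-bit fingerprint with one bit per vector position.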
The second issue: I don't know how your OverlappingStringTokeniser works, so I'm only guessing here, but if your shingles (overlapping n-grams) are only 2 characters long, then a lot of these shingles will be found in a lot of documents. Chances are that two documents will share a lot of these features even if the purpose and meaning of the documents are quite different.
Because words are the smallest simple unit of meaning when dealing with text, I normally count my tokens in terms of words, not characters. Certainly 2 characters is far too small for an effective feature. I like to generate shingles from, say, 5 words, ignoring punctuation and whitespace.
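A minimal sketch of word-level shingling, as a possible replacement for the 2-character tokeniser. The Shingler class is made up for illustration and is not the asker's OverlappingStringTokeniser. Lowercasing the text is my own assumption, not something simhash requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Shingler
{
    // Returns every run of wordsPerShingle consecutive words, joined by single spaces.
    public static List<string> WordShingles(string text, int wordsPerShingle = 5)
    {
        // Replace anything that is not a letter or digit with a space, which
        // drops punctuation and collapses whitespace once empty entries are removed.
        string cleaned = new string(text
            .Select(ch => char.IsLetterOrDigit(ch) ? char.ToLowerInvariant(ch) : ' ')
            .ToArray());
        string[] words = cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var shingles = new List<string>();
        for (int i = 0; i + wordsPerShingle <= words.Length; i++)
            shingles.Add(string.Join(" ", words, i, wordsPerShingle));
        return shingles;
    }
}
```

Each shingle would then be hashed in step 2 in place of the 2-character tokens. Note that a text with fewer than wordsPerShingle words yields no shingles at all with this sketch, so very short documents need separate handling.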

Replicate record in a DataTable based on value of a column using C#

I have a record in a dataTable as shown below.
1 Test 7Dec2014 15:40 one,two,three
Since the last column has 3 comma-separated values, the resulting DataTable should look like the one below, with replicated records.
1 Test 7Dec2014 15:40 one
2 Test 7Dec2014 15:40 two
3 Test 7Dec2014 15:40 three
Please help me with an optimized way to achieve the above result.
The optimized way I found for the above problem is shown below. If anybody has a better solution, please let me know.
string[] strValues;
for (int i = 0; i < dtTable.Rows.Count; i++)
{
    strValues = dtTable.Rows[i]["Column_Name"].ToString().Split(',');
    if (strValues.Length > 1)
    {
        dtTable.Rows[i]["Column_Name"] = strValues[0];
        for (int j = 1; j < strValues.Length; j++)
        {
            var TargetRow = dtTable.NewRow();
            var OriginalRow = dtTable.Rows[i];
            TargetRow.ItemArray = OriginalRow.ItemArray.Clone() as object[];
            TargetRow["Column_Name"] = strValues[j];
            dtTable.Rows.Add(TargetRow);
        }
    }
}
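One side effect of the loop above is that the new rows are appended to the end of the table, so the copies of a record are not next to each other. If that matters, a possible alternative (a sketch only, with RowSplitter and SplitColumn as made-up names) is to build a second table with the same schema and add the rows in order. It assumes the table has no unique constraint that the duplicated rows would violate, and it does not renumber the first column.

```csharp
using System;
using System.Data;

public static class RowSplitter
{
    // Returns a new table in which each source row is repeated once per
    // separated value found in columnName, keeping the copies adjacent.
    public static DataTable SplitColumn(DataTable source, string columnName, char separator)
    {
        DataTable result = source.Clone(); // same columns, no rows
        int ordinal = source.Columns[columnName].Ordinal;
        foreach (DataRow row in source.Rows)
        {
            foreach (string part in row[columnName].ToString().Split(separator))
            {
                object[] values = row.ItemArray; // the getter returns a fresh copy of the row's values
                values[ordinal] = part.Trim();
                result.Rows.Add(values);
            }
        }
        return result;
    }
}
```

Usage would be var expanded = RowSplitter.SplitColumn(dtTable, "Column_Name", ','); which leaves dtTable itself unchanged.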

Coded UI - Filling DataTable from a UI Table (HtmlTable)

In Coded UI, I am facing a problem with filling my DataTable from an HtmlTable on the UI.
It takes a lot of time to fill the DataTable when there are thousands of rows in the UI table. This is what I am doing:
DataTable TestDataTable = new DataTable();
for (int i = 0; i < Table.RowCount; i++)
{
    HtmlRow hr = (HtmlRow)Table.Rows[i];
    for (int k = 0; k < hr.CellCount; k++)
    {
        TestDataTable.Rows[i][k] = hr.Cells[k].FriendlyName;
    }
}
It works fine, but as I said, it takes a lot of time. Is there any way I could fill the DataTable faster?
Thanks,
Aashish GUpta
It may help to move TestDataTable.Rows[i] out of the inner loop, as it may be doing a full table (i,k) index evaluation every time.
HtmlRow hr = (HtmlRow)Table.Rows[i];
DataRow dest = TestDataTable.Rows[i];
for (int k = 0; k < hr.CellCount; k++)
{
    dest[k] = hr.Cells[k].FriendlyName;
}
I have altered the data type of dest in the code above to DataRow, based on the asker's comment. I had originally assumed HtmlRow, as the member types within the DataTable were not specified.

How to specify format for individual cells with Excel.Range.set_Value()

When I write a whole table into an Excel worksheet, I know to work with a whole Range at once instead of writing to individual cells. However, is there a way to specify the format as I'm populating the array I'm going to export to Excel?
Here's what I do now:
object MissingValue = System.Reflection.Missing.Value;
Excel.Application excel = new Excel.Application();
int rows = 5;
int cols = 5;
int someVal;
Excel.Worksheet sheet = (Excel.Worksheet)excel.Workbooks.Add(MissingValue).Sheets[1];
// Cells is an indexed property, so it takes square brackets in C#
Excel.Range range = sheet.get_Range("A1", sheet.Cells[rows, cols]);
object[,] rangeData = new object[rows, cols];
for (int r = 0; r < rows; r++)
{
    for (int c = 0; c < cols; c++)
    {
        someVal = r + c;
        rangeData[r, c] = someVal.ToString();
    }
}
range.set_Value(MissingValue, rangeData);
Now suppose that I want some of those numbers to be formatted as percentages. I know I can go back on a cell-by-cell basis and change the formatting, but that seems to defeat the whole purpose of using a single Range.set_Value() call. Can I make my rangeData[,] structure include formatting information, so that when I call set_Value(), the cells are formatted in the way I want them?
To clarify, I know I can set the format for the entire Excel.Range object. What I want is to have a different format specified for each cell, specified in the inner loop.
So here's the best "solution" I've found so far. It isn't the nirvana I was looking for, but it's much, much faster than setting the format for each cell individually.
// 0-based indexes
static string RcToA1(int row, int col)
{
    string letters = "";
    int n = col + 1; // 1-based column number
    // Column letters are a bijective base-26 number: A..Z, AA..AZ, BA..
    while (n > 0)
    {
        letters = (char)('A' + (n - 1) % 26) + letters;
        n = (n - 1) / 26;
    }
    return letters + (row + 1).ToString();
}
static Random rand = new Random(DateTime.Now.Millisecond);
static string RandomExcelFormat()
{
    switch ((int)Math.Round(rand.NextDouble(), 0))
    {
        case 0: return "0.00%";
        default: return "0.00";
    }
}
struct ExcelFormatSpecifier
{
    public object NumberFormat;
    public string RangeAddress;
}
static void DoWork()
{
    // rows, cols, someVal, range and MissingValue are set up as in the question
    List<ExcelFormatSpecifier> NumberFormatList = new List<ExcelFormatSpecifier>(0);
    object[,] rangeData = new object[rows, cols];
    for (int r = 0; r < rows; r++)
    {
        for (int c = 0; c < cols; c++)
        {
            someVal = r + c;
            rangeData[r, c] = someVal.ToString();
            NumberFormatList.Add(new ExcelFormatSpecifier
            {
                NumberFormat = RandomExcelFormat(),
                RangeAddress = RcToA1(r, c)
            });
        }
    }
    range.set_Value(MissingValue, rangeData);
    int max_format = 50;
    foreach (string formatSpecifier in NumberFormatList.Select(p => p.NumberFormat).Distinct())
    {
        // Cast to string so that == compares values rather than object references
        List<string> addresses = NumberFormatList.Where(p => (string)p.NumberFormat == formatSpecifier).Select(p => p.RangeAddress).ToList();
        while (addresses.Count > 0)
        {
            string addressSpecifier = string.Join(",", addresses.Take(max_format).ToArray());
            range.get_Range(addressSpecifier, MissingValue).NumberFormat = formatSpecifier;
            addresses = addresses.Skip(max_format).ToList();
        }
    }
}
Basically what is happening is that I keep a list of the format information for each cell in NumberFormatList (each element also holds the A1-style address of the range it applies to). The original idea was that for each distinct format in the worksheet, I should be able to construct an Excel.Range of just those cells and apply the format to that range in a single call. This would reduce the number of accesses to NumberFormat from (potentially) thousands down to just a few (however many different formats you have).
I ran into an issue, however, because you apparently can't construct a range from an arbitrarily long list of cells. After some testing, I found that the limit is somewhere between 50 and 100 cells that can be used to define an arbitrary range (as in range.get_Range("A1,B1,C1,A2,AA5,.....")). So once I've gotten the list of all cells to apply a format to, I have one final while() loop that applies the format to 50 of those cells at a time.
This isn't ideal, but it still reduces the number of accesses to NumberFormat by a factor of up to 50, which is significant. Constructing my spreadsheet without any format info (only using range.set_Value()) takes about 3 seconds. When I apply the formats 50 cells at a time, that is lengthened to about 10 seconds. When I apply the format info individually to each cell, the spreadsheet takes over 2 minutes to finish being constructed!
You can apply formatting to the range and then populate it with values. You cannot specify formatting in your object[,] array.
You can apply the formatting to each individual cell within the inner loop via
for (int r = 0; r < rows; r++)
{
    for (int c = 0; c < cols; c++)
    {
        // Cells is 1-based and is indexed with square brackets in C#
        Excel.Range r2 = (Excel.Range)sheet.Cells[r + 1, c + 1];
        r2.xxxx = "";
    }
}
Once you have r2, you can change the cell format any way you want.
