Extracting text with greater font weight - c#

I have a number of documents with predictable placement of certain text which I'm trying to extract. For the most part, it works very well, but I'm having difficulties with a certain fraction of documents which have slightly thicker text.
Thin text: [image]
Thick text: [image]
I know it's hard to tell the difference at this resolution, but if you look at MO DAY YEAR TIME (2400) portion, you can tell that the second one is thicker.
The thin text gives me exactly what is expected:
09/28/2015
0820
However, the thick version gives me a triple of every character with white space in between each duplicated character:
1 1 11 1 1/ / /1 1 19 9 9/ / /2 2 20 0 01 1 15 5 5
1 1 17 7 70 0 02 2 2
I'm using the following code to extract text from documents:
public static Document GetDocumentInfo(string fileName)
{
    // Using 11 in x 8.5 in dimensions at 72 dpi.
    var boundingBoxes = new[]
    {
        new RectangleJ(446, 727, 85, 14),
        new RectangleJ(396, 702, 43, 14),
        new RectangleJ(306, 680, 58, 7),
        new RectangleJ(378, 680, 58, 7),
        new RectangleJ(446, 680, 45, 7),
        new RectangleJ(130, 727, 29, 10),
        new RectangleJ(130, 702, 29, 10)
    };
    var data = GetPdfData(fileName, 1, boundingBoxes);
    // I would populate the new document with extracted data
    // here, but it's not important for the example.
    var doc = new Document();
    return doc;
}
public static string[] GetPdfData(string fileName, int pageNum, RectangleJ[] boundingBoxes)
{
    // Omitted safety checks, as they're not important for the example.
    var data = new string[boundingBoxes.Length];
    using (var reader = new PdfReader(fileName))
    {
        if (reader.NumberOfPages < 1)
        {
            return null;
        }
        RenderFilter filter;
        ITextExtractionStrategy strategy;
        for (var i = 0; i < boundingBoxes.Length; ++i)
        {
            filter = new RegionTextRenderFilter(boundingBoxes[i]);
            strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
        }
        return data;
    }
}
Obviously, if nothing else works, I can get rid of duplicate characters after reading them in, as there is a very apparent pattern, but I'd rather find a proper way than a hack. I tried looking around for the past few hours, but couldn't find anyone encountering a similar issue.
EDIT:
I finally came across this SO question:
Text Extraction Duplicate Bold Text
...and in the comments it's indicated that some lower-quality PDF producers duplicate text to simulate boldness, so that may be what's happening here. However, there is a mention of omitting duplicate text at the same location, which I don't know how to achieve, since this portion of my code...
data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
...reads in the duplicated text completely in any of the specified locations.
EDIT:
I have now come across documents that duplicate content up to four times to simulate thickness. That's a very strange way of doing things, but I'm sure the designers of that method had their reasons.
EDIT:
I produced a solution (see my answer). It processes the data after it has already been extracted and removes any repetitions. Ideally this would be done during the extraction process, but that can get pretty complicated, and this seemed like a very clean and easy way of accomplishing the same thing.

As @mkl has suggested, one way of tackling this issue is to override LocationTextExtractionStrategy; however, things get pretty complicated, since it requires comparing the locations of each character found within specific boundaries. I tried doing some research in order to accomplish that, but due to poor documentation it was getting a bit out of hand.
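For reference, this is roughly the direction I was exploring before giving up on it; a minimal sketch, assuming iTextSharp 5 and that the fake-bold copies land at (almost) the same baseline. The key format and the whole-point rounding tolerance are my own guesses, not anything from the iTextSharp docs:
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Rough sketch: suppress chunks whose text shows up again at (nearly)
// the same baseline position.
public class DeduplicatingStrategy : LocationTextExtractionStrategy
{
    private readonly HashSet<string> seen = new HashSet<string>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        // Round the coordinates so the tiny offsets used to fake boldness
        // collapse onto the same key.
        string key = string.Format("{0}|{1:F0}|{2:F0}",
            renderInfo.GetText(), start[Vector.I1], start[Vector.I2]);
        if (seen.Add(key))
        {
            base.RenderText(renderInfo);
        }
    }
}
It would slot in where the plain LocationTextExtractionStrategy is constructed in my extraction code above.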
So instead, I created a post-processing method, loosely based on what @TheMuffinMan suggested, to clean up any repetitions. I decided not to deal with pixels, but rather with character-count anomalies at known static locations. In my case, I know that the second data piece extracted can never be longer than three characters, so it's a good comparison point for me. If you know the document layout, you can use anything on it that you know will always be of fixed length.
After I extract the data with the method listed in my original post, I check whether the second data piece is longer than three characters. If it is, I divide the given length by three, as that's the most characters it can have; since every character is repeated the same number of times, this quotient gives me the repetition count:
var data = GetPdfData(fileName, 1, boundingBoxes);
if (data[1].Length > 3)
{
    var count = data[1].Length / 3;
    for (var i = 0; i < data.Length; ++i)
    {
        data[i] = RemoveRepetitions(data[i], count);
    }
}
As you can see, I then loop over the data and pass each piece into the RemoveRepetitions() method:
public static string RemoveRepetitions(string original, int count)
{
    if (original.Length % count != 0)
    {
        return null;
    }
    var temp = new char[original.Length / count];
    for (int i = 0; i < original.Length; i += count)
    {
        temp[i / count] = original[i];
    }
    return new string(temp);
}
This method takes the string and the number of expected repetitions, which we calculated earlier. One thing to note is that I don't have to worry about the white space inserted during the duplication process (as the example in the original post shows), because count represents the total number of characters emitted where only one should have been.

Related

Packing item from set of available packs

Suppose there is an Item that a customer is ordering - in this case it turns out they are ordering 176 (totalNeeded) of this Item.
The database has 5 records associated with this item that this item can be stored in:
{5 pack, 8 pack, 10 pack, 25 pack, 50 pack}.
A rough way of packing this would be:
Sort the array from biggest to smallest.
While (totalPacked < totalNeeded) // 176
{
    1. Maintain an <int, int> dictionary which contains keys of pack ids
       and values of how many are needed.
    2. Add the largest pack which is not larger than the amount remaining
       to pack; increment totalPacked by the pack size.
    3. If any remainder is left over after the above, add the smallest pack
       to reduce waste.
       e.g., 4 needed, smallest size is 5, so add one 5-pack; one extra item packed.
}
Based on the above logic, the outcome would be:
You need: 3 x 50 packs, 1 x 25 pack, 1 x 5 pack
Total Items: 180
Excess = 4 items; 180 - 176
The above is not too difficult to code, and I have it working locally. However, it is not truly the best way to pack this item. Note: "best" means smallest amount of excess.
Thus: we have an 8-pack available, and we need 176. 176 / 8 = 22, so send the customer 22 x 8-packs and they will get exactly what they need. This is even simpler than the pseudo-code I wrote: check whether the total needed is evenly divisible by any of the pack sizes in the array (a sketch follows); if so, at the very least we know we can fall back on 22 x 8-packs being exact.
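That pre-check is a one-liner; a minimal sketch, using the same packSizes/totalNeeded names as the code further down:
// Requires: using System.Linq;
// If any pack size divides the order evenly, that pack alone is exact.
// (FirstOrDefault returns 0 for int when no pack qualifies.)
int exact = packSizes.FirstOrDefault(p => totalNeeded % p == 0);
if (exact != 0)
{
    Console.WriteLine("Send {0} x {1}-packs", totalNeeded / exact, exact);
}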
In the case that the number is not divisible by an array value, I am attempting to determine the possible ways the array values can be combined to reach at least the number we need (176), and then score the different combinations by the total number of packs needed.
If anyone has some reading that can be done on this topic, or advice of any kind to get me started it would be greatly appreciated.
Thank you
This is a variant of the Subset Sum Problem (optimization version).
While the problem is NP-Complete, there is a pretty efficient pseudo-polynomial time Dynamic Programming solution to it, by following the recursive formulas:
D(x, i) = false                          if x < 0
D(0, i) = true
D(x, 0) = false                          if x != 0
D(x, i) = D(x, i-1) OR D(x - arr[i], i)
The Dynamic Programming Solution will build up a table, where an element D[x][i]==true iff you can use the first i kinds of packs to establish sum x.
Needless to say that D[x][n] == true iff there is a solution with all available packs that sums to x. (where n is the total number of packs you have).
To get the "closest higher number", you just need to create a table of size W+pack[0]-1 (pack[0] being the smallest available pack, W being the sum you are looking for), and choose the value that yields true which is closest to W.
If you wish to give different values to the different pack types, this becomes the Knapsack Problem, which is very similar, but uses values instead of a simple true/false.
Getting the actual "items" (packs) chosen is done afterwards by walking back through the table and retracing your steps. This thread and this thread elaborate on how to achieve it in more detail.
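To make the table concrete, here is a minimal C# sketch of the bottom-up version, assuming each pack type may be used any number of times (which matches this problem); the names are illustrative:
// Requires: using System.Linq;
// Builds reachable[x] == true iff some combination of packs sums to exactly x,
// then returns the smallest reachable sum >= target within target + minPack - 1.
public static int ClosestSumAtLeast(int[] packs, int target)
{
    int minPack = packs.Min();
    int limit = target + minPack - 1;      // the W + pack[0] - 1 window from above
    var reachable = new bool[limit + 1];
    reachable[0] = true;                   // D(0, i) = true
    foreach (int p in packs)
    {
        for (int x = p; x <= limit; x++)   // forward scan: each pack usable many times
        {
            if (reachable[x - p])
            {
                reachable[x] = true;
            }
        }
    }
    for (int x = target; x <= limit; x++)
    {
        if (reachable[x])
        {
            return x; // smallest reachable sum >= target
        }
    }
    return -1; // nothing in the window is reachable
}
For packs { 5, 8, 10, 25, 50 } and a target of 176 this returns 176, confirming an exact packing exists.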
If this example problem is truly representative of the actual problem you are solving, it is small enough to try every combination with brute force using recursion. For example, I found exactly 6,681 unique packings that are locally maximized, with a total of 205 that have exactly 176 total items. The (unique) solution with the minimum number of packs uses 6 packs: { 2-8, 1-10, 3-50 }. Total runtime for the algorithm was 8 ms.
public static List<int[]> GeneratePackings(int[] packSizes, int totalNeeded)
{
    var packings = GeneratePackingsInternal(packSizes, 0, new int[packSizes.Length], totalNeeded);
    return packings;
}

private static List<int[]> GeneratePackingsInternal(int[] packSizes, int packSizeIndex, int[] packCounts, int totalNeeded)
{
    if (packSizeIndex >= packSizes.Length) return new List<int[]>();
    var currentPackSize = packSizes[packSizeIndex];
    var currentPacks = new List<int[]>();
    if (packSizeIndex + 1 == packSizes.Length)
    {
        var lastOptimal = totalNeeded / currentPackSize;
        packCounts[packSizeIndex] = lastOptimal;
        return new List<int[]> { packCounts };
    }
    for (var i = 0; i * currentPackSize <= totalNeeded; i++)
    {
        packCounts[packSizeIndex] = i;
        currentPacks.AddRange(GeneratePackingsInternal(packSizes, packSizeIndex + 1, (int[])packCounts.Clone(), totalNeeded - i * currentPackSize));
    }
    return currentPacks;
}
The algorithm is pretty straightforward:
Loop through every possible count of 5-packs.
For each, loop through every possible count of 8-packs from the amount remaining after the 5-packs.
And so on up to 50-packs; for the 50-pack count, directly divide the remainder.
Collect all combinations together recursively (so it dynamically handles any set of pack sizes).
Finally, once all the combinations are found, it is pretty easy to find all packs with least waste and least number of packages:
var packSizes = new int[] { 5, 8, 10, 25, 50 };
var totalNeeded = 176;
var result = GeneratePackings(packSizes, totalNeeded);
Console.WriteLine(result.Count());

var maximal = result.Where(r => r.Zip(packSizes, (a, b) => a * b).Sum() == totalNeeded).ToList();
var min = maximal.Min(m => m.Sum());
var minPacks = maximal.Where(m => m.Sum() == min).ToList();
foreach (var m in minPacks)
{
    Console.WriteLine("{ " + string.Join(", ", m) + " }");
}
Here is a working example: https://ideone.com/zkCUYZ
This partial solution is specifically for your pack sizes of 5, 8, 10, 25, 50, and only for order sizes at least 40. There are a few gaps at smaller sizes that you'll have to fill another way (specifically at values like 6, 7, 22, 27, etc.).
Clearly, the only way to get any total that isn't a multiple of 5 is to use the 8-packs.
Determine the number of 8-packs needed with modular arithmetic. Since 8 % 5 == 3, the number of 8-packs needed to cover remainders 0, 1, 2, 3, 4 is 0, 2, 4, 1, 3 respectively. Something like:
public static int GetNumberOf8Packs(int orderCount)
{
    int remainder = (orderCount % 5);
    return ((remainder % 3) * 5 + remainder) / 3;
}
In your example of 176: 176 % 5 == 1, which means you'll need 2 8-packs.
Subtract the value of the 8-packs to get the number of multiples of 5 you need to fill. At this point you still need to deliver 176 - 16 == 160.
Fill all the 50-packs you can by integer dividing. Keep track of the leftovers.
Now just fit the 5, 10, 25 packs as needed. Obviously use the larger values first.
All together your code might look like this:
public static Order MakeOrder(int orderSize)
{
    if (orderSize < 40)
    {
        throw new NotImplementedException("You'll have to write this part, since the modular arithmetic for 8-packs starts working at 40.");
    }

    var order = new Order();
    order.num8 = GetNumberOf8Packs(orderSize);
    int multipleOf5 = orderSize - (order.num8 * 8);

    order.num50 = multipleOf5 / 50;
    int remainderFrom50 = multipleOf5 % 50;

    while (remainderFrom50 > 0)
    {
        if (remainderFrom50 >= 25)
        {
            order.num25++;
            remainderFrom50 -= 25;
        }
        else if (remainderFrom50 >= 10)
        {
            order.num10++;
            remainderFrom50 -= 10;
        }
        else if (remainderFrom50 >= 5)
        {
            order.num5++;
            remainderFrom50 -= 5;
        }
    }
    return order;
}
A DotNetFiddle

Add string to array

I am trying to add a string to an array. I have done a lot of research and came up with two options, but neither works; I get a pop-up whose details make it sound like my array access is out of bounds. Both methods are inside my addSpam function. Any ideas on how to fix either method?
namespace HW8_DR
{
    class Tester : Spam_Scanner
    {
        private string[] spam = {"$$$", "Affordable", "Bargain", "Beneficiary", "Best price", "Big bucks",
            "Cash", "Cash bonus", "Cashcashcash", "Cents on the dollar", "Cheap", "Check",
            "Claims", "Collect", "Compare rates", "Cost", "Credit", "Credit bureaus",
            "Discount", "Earn", "Easy terms", "F r e e", "Fast cash", "For just $XXX",
            "Hidden assets", "hidden charges", "Income", "Incredible deal", "Insurance",
            "Investment", "Loans", "Lowest price", "Million dollars", "Money", "Money back",
            "Mortgage", "Mortgage rates", "No cost", "No fees", "One hundred percent free",
            "Only $", "Pennies a day", "Price", "Profits", "Pure profit", "Quote", "Refinance",
            "Save $", "Save big money", "Save up to", "Serious cash", "Subject to credit",
            "They keep your money – no refund!", "Unsecured credit", "Unsecured debt",
            "US dollars", "Why pay more?"};

        public static double countSpam = 0;
        public static double wordCount = 0;
        public static string posSpam = "";

        public void tester(string email)
        {
            for (int i = 0; i < spam.Length - 1; i++)
                if (email.Contains(spam[i]))
                {
                    countSpam++;
                    posSpam = string.Concat(posSpam, spam[i], "\r\n\r\n");
                }
            wordCount = email.Split(' ').Length;
        }

        public void addSpam(string spamFlag)
        {
            // attempt 1 to add string to spam array
            Array.Resize(ref spam, spam.Length + 1);
            spam[spam.Length] = spamFlag;

            // attempt 2 to add string to spam array
            string[] temp = new string[spam.Length + 1];
            Array.Copy(spam, temp, spam.Length);
            temp.SetValue(spamFlag, spam.Length);
            Array.Copy(temp, spam, temp.Length);
        }
    }
}
Simple solution: don't use an array! List<T> is much better suited for this.
using System.Collections.Generic;
...
private List<string> spam = new List<string> { "$$$", "Affordable", "Bargain", "Beneficiary", ... };
...
public void addSpam(string spamFlag)
{
    spam.Add(spamFlag);
}
DLeh's answer is best - this is what a List<T> is for, so that's your solution.
But the reason things are failing for you is that you're attempting to access an index that is one higher than the max index of the array. The highest index is always one less than the length, because arrays are zero-based.
int[] arr = new[] { 1, 2, 3 };
Console.WriteLine(arr.Length); // 3
Console.WriteLine(arr[0]); // 1
Console.WriteLine(arr[1]); // 2
Console.WriteLine(arr[2]); // 3
Console.WriteLine(arr[3]); // Exception
To access the last item in an array, you either need to use:
var lastItem = arr[arr.Length - 1];
// or
var lastItem = arr[arr.GetUpperBound(0)];
Array.Resize(ref spam, spam.Length + 1);
spam[spam.Length] = spamFlag;
Here you're trying to write to index 58 (spam.Length after the re-size) of a 58-element zero-indexed array; that is, it goes from 0 to 57.
You should use:
Array.Resize(ref spam, spam.Length + 1);
spam[spam.Length - 1] = spamFlag;
That said, you should really use List<string> instead. Among other things it does the resizing of the internal array it uses in batches rather than on every Add(), which makes things much more efficient, as well as being easier.
If you really need an array for some reason, then use List<string> for most of the work, and then call ToArray() at the end.
While DLeh raises legitimate points about how it's better to use a List that grows dynamically, and Joe's answer provides a great explanation, I want to add a few more things.
Firstly, to answer your question: to fix attempt one, you probably wanted spam[spam.Length - 1] = spamFlag instead of spam[spam.Length] = spamFlag, because indexes start at 0 and the last index within bounds is thus length - 1 (as Joe pointed out).
Your second attempt will not work either, as an exception is thrown when the destination array is too short: the final Array.Copy tries to copy temp.Length elements back into spam, which is one element shorter. See this dotnetPerl link for an explanation; basically, when using Array.Copy you have to ensure the types match and the lengths line up.
To elaborate on Array.Resize(): the name is actually a misnomer, in that C# doesn't resize the array. Rather, it creates a new array and copies the contents over. It always does this unless an exception is thrown, so from a performance point of view it is discouraged. (You could keep a variable tracking how full your array is and always grow it by doubling; this saves you from creating a new array on every add.)
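As an illustration, a minimal sketch of that bookkeeping (this is essentially what List<string> does internally; the field names are mine):
private string[] spam = new string[8]; // capacity, not the number of items
private int count = 0;                 // how many slots are actually in use

public void AddSpam(string spamFlag)
{
    if (count == spam.Length)
    {
        // Grow by doubling: one copy per doubling instead of one per add.
        Array.Resize(ref spam, spam.Length * 2);
    }
    spam[count++] = spamFlag;
}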
The same trick is used in hash tables: if a certain bucket is full, we grow it and rehash everything back in (a hash table being a really fast lookup table/data structure).
Read this DotNetPerl tutorial on Array Resize
However, a lot of the time a List is better, though of course there may be a reason why you don't want to or can't use one. I know of a few classes that explicitly tell students not to use built-in data structures, so that they learn how arrays work.

find word and score based on positions

Hey guys, I have a text file which I have divided into 4 parts. I want to search each part for the words that appear in it and score each word.
Example:
welcome to the national basketball finals,the basketball teams here today have come a long way. without much delay lets play basketball.
I will want to return national = 1, as it appears in only one part, etc.
I am working on determining text context using word position.
I am working with C# and am not very good at text processing.
Basically:
if a word appears in all 4 sections it scores 4
if a word appears in 3 sections it scores 3
if a word appears in 2 sections it scores 2
if a word appears in 1 section it scores 1
Thanks in advance.
So far I have this:
var s = "welcome to the national basketball finals,the basketball teams here today have come a long way. without much delay lets play basketball. ";
var numberOfParts = 4;
var eachPartLength = s.Length / numberOfParts;
var parts = new List<string>();
var words = Regex.Split(s, @"\W").Where(w => w.Length > 0); // split into words, drop empty strings
var wordsIndex = 0;
for (int i = 0; i < numberOfParts; i++)
{
    var sb = new StringBuilder();
    while (sb.Length < eachPartLength && wordsIndex < words.Count())
    {
        sb.AppendFormat("{0} ", words.ElementAt(wordsIndex));
        wordsIndex++;
    }
    // here you have the part
    Response.Write(string.Format("[{0}]", sb));
    parts.Add(sb.ToString());
}
var allwords = parts.SelectMany(p => p.Split(' ').Distinct());
var wordsInAllParts = allwords.Where(w => parts.All(p => p.Contains(w))).Distinct();
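For what it's worth, a minimal sketch of the scoring step, building on the parts list above and assuming the score is simply the number of parts containing the word (the lower-casing is my assumption):
// Each word is counted once per part it appears in; the group size is the score.
var scores = parts
    .SelectMany(p => p.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                      .Select(w => w.ToLowerInvariant())
                      .Distinct())
    .GroupBy(w => w)
    .ToDictionary(g => g.Key, g => g.Count());

Response.Write(scores["national"]); // 1 - appears in only one part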
This question is very difficult to interpret. I don't fully understand your goal and it is my suspicion that you might not either.
In the absence of a clear requirement, there is no way to give a specific answer, so I will give a generic one:
Try writing a test that clearly specifies the exact behavior you want. You've got the beginnings of one with your sample string and the result you want, but it's still ambiguous what you are looking for.
Make a test that, when it passes, demonstrates that one of the required behaviors is there. If that doesn't help you get a solution to the problem, come back and edit this question or make a new one that includes the test.
At the very least, you will be able to harvest better answers from this site.

Comparing 2 huge lists using C# multiple times (with a twist)

Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.
I have several excel tables with information (all numeric), basically it is "dialed phone numbers" in one column and number of minutes to each of those numbers on another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:
First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.
I have a list of 12,000+ prefix codes for carrier 1; these codes vary in length, and I need to check every one of them:
Prefix Code 1: 1234567 <-- this code is 7 digits long.
I need to take the first 7 digits of the dialed number and compare them to the prefix code; if a match is found, I add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length; sometimes they are shorter or longer.
Most of this should be a piece of cake, and I should be able to do it, but I'm getting kind of scared by the massive amount of data. Sometimes the dialed number lists consist of up to 30,000 numbers, the "carrier prefix code" lists are around 13,000 rows long, and I usually check 3 carriers; that means I have to do a lot of "matches".
Does anyone have an idea of how to do this efficiently using C#? Or any other language, to be honest. I need to do this quite often, so designing a tool to do it would make much more sense. I need a good perspective from someone who does have that "Computer Scientist" background.
Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.
Thanks for your help.
Update:
Thank you all for your time answering my question. I guess in my ignorance I over-exaggerated the word "efficient". I don't perform this task every few seconds; it's something I have to do once per day, and I hate doing it with Excel and VLOOKUPs, etc.
I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.
UPDATE
You can do a simple trick: group the prefixes by their first digits into a dictionary and match each number only against the correct subset. I tested it with the following two LINQ statements, assuming every prefix has at least three digits.
const Int32 minimumPrefixLength = 3;
var groupedPrefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);
var numberPrefixes = numbers
    .Select(n => groupedPrefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();
So how fast is this? 15,000 prefixes and 50,000 numbers took less than 250 milliseconds. Fast enough for two lines of code?
Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.
MPL   Runtime
-----------------
 1    10,198 ms
 2     1,179 ms
 3       205 ms
 4       130 ms
 5       107 ms
Just to give a rough idea - I did just one run and had a lot of other stuff going on.
Original answer
I wouldn't worry much about performance - an average desktop PC can quite easily deal with database tables of 100 million rows. Maybe it takes five minutes, but I assume you don't want to perform the task every other second.
I just made a test. I generated a list of 15,000 unique prefixes with 5 to 10 digits. From these prefixes I generated 50,000 numbers, each consisting of a prefix plus an additional 5 to 10 digits.
List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);
Then I used the following LINQ to Object query to find the prefix of each number.
var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();
Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute of processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be really nice if the program could do the task in a second or two, but that will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took me only seconds.
Test application
Note that generating many prefixes is really slow and might take a minute or two.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            // Set number of prefixes and calls to not more than 50 to get results
            // printed to the console.
            Console.Write("Generating prefixes");
            List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
            Console.WriteLine();
            Console.Write("Generating calls");
            List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
            Console.WriteLine();
            Console.WriteLine("Processing started.");
            Stopwatch stopwatch = new Stopwatch();
            const Int32 minimumPrefixLength = 5;
            stopwatch.Start();
            var groupedPrefixes = prefixes
                .GroupBy(p => p.Substring(0, minimumPrefixLength))
                .ToDictionary(g => g.Key, g => g);
            var result = calls
                .GroupBy(c => groupedPrefixes[c.Number.Substring(0, minimumPrefixLength)]
                    .First(c.Number.StartsWith))
                .Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
                .ToList();
            stopwatch.Stop();
            Console.WriteLine("Processing finished.");
            Console.WriteLine(stopwatch.Elapsed);
            if ((prefixes.Count <= 50) && (calls.Count <= 50))
            {
                Console.WriteLine("Prefixes");
                foreach (String prefix in prefixes.OrderBy(p => p))
                {
                    Console.WriteLine(String.Format("   prefix={0}", prefix));
                }
                Console.WriteLine("Calls");
                foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
                {
                    Console.WriteLine(String.Format("   number={0} duration={1}", call.Number, call.Duration));
                }
                Console.WriteLine("Result");
                foreach (Call call in result.OrderBy(c => c.Number))
                {
                    Console.WriteLine(String.Format("   prefix={0} accumulated duration={1}", call.Number, call.Duration));
                }
            }
            Console.ReadLine();
        }

        private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<String> prefixes = new List<String>(count);
            StringBuilder stringBuilder = new StringBuilder(maximumLength);
            while (prefixes.Count < count)
            {
                stringBuilder.Length = 0;
                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }
                String prefix = stringBuilder.ToString();
                if (prefixes.Count % 1000 == 0)
                {
                    Console.Write(".");
                }
                if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
                {
                    prefixes.Add(stringBuilder.ToString());
                }
            }
            return prefixes;
        }

        private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<Call> calls = new List<Call>(count);
            StringBuilder stringBuilder = new StringBuilder();
            while (calls.Count < count)
            {
                stringBuilder.Length = 0;
                stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);
                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }
                if (calls.Count % 1000 == 0)
                {
                    Console.Write(".");
                }
                calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
            }
            return calls;
        }

        private class Call
        {
            public Call(String number, Decimal duration)
            {
                this.Number = number;
                this.Duration = duration;
            }

            public String Number { get; private set; }
            public Decimal Duration { get; private set; }
        }
    }
}
It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.
Then create a dictionary from carrier to an int or long (the total).
Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.
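A minimal sketch of that trie, assuming digit-only numbers and one carrier name per terminating node; longest-prefix-wins is handled by remembering the deepest match:
// Each node has one child slot per digit; a non-null Carrier marks the
// end of a registered prefix.
class TrieNode
{
    public TrieNode[] Children = new TrieNode[10];
    public string Carrier; // non-null => a prefix ends here
}

class PrefixTrie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string prefix, string carrier)
    {
        TrieNode node = root;
        foreach (char c in prefix)
        {
            int d = c - '0';
            node = node.Children[d] ?? (node.Children[d] = new TrieNode());
        }
        node.Carrier = carrier;
    }

    public string FindCarrier(string number)
    {
        TrieNode node = root;
        string match = null;
        foreach (char c in number)
        {
            node = node.Children[c - '0'];
            if (node == null) break;
            if (node.Carrier != null) match = node.Carrier; // deepest match wins
        }
        return match;
    }
}
Loading the 12,000+ prefixes is a one-time cost; each lookup is then bounded by the number's length, not the prefix count.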
The easiest data structure that would do this fairly efficiently would be a list of sets. Make a Set for each carrier to contain all the prefixes.
Now, to associate a call with a carrier:
foreach (Carrier carrier in carriers)
{
    bool found = false;
    for (int length = 1; length <= 7; length++)
    {
        int prefix = ExtractDigits(callNumber, length);
        if (carrier.Prefixes.Contains(prefix))
        {
            carrier.Calls.Add(callNumber);
            found = true;
            break;
        }
    }
    if (found)
        break;
}
If you have 10 carriers, there will be 70 lookups in the set per call. But a lookup in a set isn't too slow (much faster than a linear search). So this should give you quite a big speed up over a brute force linear search.
You can go a step further and group the prefixes for each carrier according to the length. That way, if a carrier has only prefixes of length 7 and 4, you'd know to only bother to extract and look up those lengths, each time looking in the set of prefixes of that length.
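A rough sketch of that refinement for a single carrier's prefixes (assumes using System.Linq and System.Collections.Generic; string prefixes rather than ints, for simplicity):
// Group one carrier's prefixes by length so only lengths that actually
// occur are probed, longest first.
static Dictionary<int, HashSet<string>> GroupByLength(IEnumerable<string> prefixes)
{
    return prefixes.GroupBy(p => p.Length)
                   .ToDictionary(g => g.Key, g => new HashSet<string>(g));
}

static string FindPrefix(Dictionary<int, HashSet<string>> byLength, string number)
{
    foreach (int len in byLength.Keys.OrderByDescending(l => l))
    {
        if (number.Length >= len && byLength[len].Contains(number.Substring(0, len)))
        {
            return number.Substring(0, len);
        }
    }
    return null; // this carrier doesn't serve the number
}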
How about dumping your data into a couple of database tables and then query them using SQL? Easy!
CREATE TABLE dbo.dialled_numbers ( number VARCHAR(100), minutes INT )
CREATE TABLE dbo.prefixes ( prefix VARCHAR(100) )
-- now populate the tables, create indexes etc
-- and then just run your query...
SELECT p.prefix,
SUM(n.minutes) AS total_minutes
FROM dbo.dialled_numbers AS n
INNER JOIN dbo.prefixes AS p
ON n.number LIKE p.prefix + '%'
GROUP BY p.prefix
(This was written for SQL Server, but should be very simple to translate for any other DBMS.)
Maybe it would be simpler (not necessarily more efficient) to do it in a database instead of C#.
You could insert the rows on the database and on insert determine the carrier and include it in the record (maybe in an insert trigger).
Then your report would be a sum query on the table.
I would probably just put the entries in a List, sort it, then use a binary search to look for matches. Tailor the binary search match criteria to return the first item that matches then iterate along the list until you find one that doesn't match. A binary search takes only around 15 comparisons to search a list of 30,000 items.
You may want to use a Hashtable (or a Dictionary) in C#.
This way you have key-value pairs: the keys could be the phone number prefixes and the values the total minutes. If a match is found in the key set, you modify the total minutes; otherwise, you add a new key.
You would then just need to modify your searching algorithm to look not at the entire number, but only at its first 7 digits.
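A minimal sketch of that tally, with the fixed-length simplification the answer suggests (7-digit prefixes; the carrierPrefixes list and the call record's Number/Minutes fields are assumed):
// minutesByPrefix accumulates talk time per known 7-digit carrier prefix.
var prefixes = new HashSet<string>(carrierPrefixes);
var minutesByPrefix = new Dictionary<string, int>();

foreach (var call in calls)
{
    string key = call.Number.Substring(0, 7);
    if (prefixes.Contains(key))
    {
        int total;
        minutesByPrefix.TryGetValue(key, out total);
        minutesByPrefix[key] = total + call.Minutes;
    }
}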

Does any one know of a faster method to do String.Split()?

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:
values = line.Split(delimiter);
where line is a string that holds the values separated by the delimiter.
Measuring the performance of my ReadNextRow method, I noticed that it spends 66% of its time in String.Split, so I was wondering if someone knows of a faster method to do this.
Thanks!
The BCL implementation of string.Split is actually quite fast; I've done some testing here trying to outperform it, and it's not easy.
But there's one thing you can do and that's to implement this as a generator:
public static IEnumerable<string> GetSplit( this string s, char c )
{
    int l = s.Length;
    int i = 0, j = s.IndexOf( c, 0, l );
    if ( j == -1 ) // No such substring
    {
        yield return s; // Return original and break
        yield break;
    }
    while ( j != -1 )
    {
        if ( j - i > 0 ) // Non empty?
        {
            yield return s.Substring( i, j - i ); // Return non-empty match
        }
        i = j + 1;
        j = s.IndexOf( c, i, l - i );
    }
    if ( i < l ) // Has remainder?
    {
        yield return s.Substring( i, l - i ); // Return remaining trail
    }
}
The above method is not necessarily faster than string.Split for small strings but it returns results as it finds them, this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking; to be faster you need to optimize those away and implement your own helper methods. You can beat the string.Split performance, but it's gonna take clever int-hacking. You can read my post about that here.
It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas inside quoted fields, e.g.:
1,"Something, with a comma",2,3
The other thing I'll point out without knowing how you profiled is be careful about profiling this kind of low level detail. The granularity of the Windows/PC timer might come into play and you may have a significant overhead in just looping so use some sort of control value.
That being said, split() (in Java, at least) is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.
So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.
The algorithm for that is relatively simple (a sketch follows the list):
Stop at every comma;
When you hit quotes continue until you hit the next set of quotes;
Handle escaped quotes (i.e. \") and arguably escaped commas (\,).
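Here is a minimal sketch of such a hand-rolled splitter for a single line, handling quoted fields with the common doubled-quote ("") escape; backslash escapes are left out for brevity:
// Requires: using System.Collections.Generic; using System.Text;
public static List<string> SplitCsvLine(string line)
{
    var fields = new List<string>();
    var field = new StringBuilder();
    bool inQuotes = false;
    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
            {
                field.Append('"'); // doubled quote => literal quote
                i++;
            }
            else if (c == '"')
            {
                inQuotes = false;  // closing quote
            }
            else
            {
                field.Append(c);
            }
        }
        else if (c == '"')
        {
            inQuotes = true;       // opening quote
        }
        else if (c == ',')
        {
            fields.Add(field.ToString());
            field.Length = 0;      // reuse the buffer instead of reallocating
        }
        else
        {
            field.Append(c);
        }
    }
    fields.Add(field.ToString());
    return fields;
}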
Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.
So if you really want performance, it's time to hand parse.
Or, better yet, use someone else's optimized solution like this fast CSV reader.
By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.
Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.
string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();
Trade result = new Trade();

int index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

result.Quantity = float.Parse(span); // last field has no trailing ';'
return result;
Note that a ReadOnlySpan.Split() will soon be part of the framework. See
https://github.com/dotnet/runtime/pull/295
Depending on use, you can speed this up by using Pattern.split instead of String.split (note: this is Java's API). If you have this code in a loop - which I assume you probably do, since it sounds like you are parsing lines from a file - String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.
Hope this helps
I found this implementation, which is 30% faster, on Dejan Pelzel's blog. I quote from there:
The Solution
With this in mind, I set out to create a string splitter that would use an internal buffer, similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.
public int Split(string value, char separator)
{
    int resultIndex = 0;
    int startIndex = 0;

    // Find the mid-parts
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == separator)
        {
            this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
            resultIndex++;
            startIndex = i + 1;
        }
    }

    // Find the last part
    this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
    resultIndex++;

    return resultIndex;
}
How To Use
The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for single-time use. In that case it would be better to just use the built-in String.Split.
var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
    Console.WriteLine("It works!");
}
The Split method returns the number of items found, so you can easily iterate through the results like this:
var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
    Console.WriteLine(splitter.Results[i]);
}
This approach has advantages and disadvantages.
You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.
You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.
One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.
If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.
Dave
Some very thorough analysis on String.Split() vs Regex and other methods.
We are talking ms savings over very large strings, though.
The main problem(?) with String.Split is that it's general, in that it caters for many needs.
If you know more about your data than Split does, writing your own can be an improvement.
For instance, if:
You don't care about empty strings, so you don't need to handle those any special way
You don't need to trim strings, so you don't need to do anything with or around those
You don't need to check for quoted commas or quotes
You don't need to handle quotes at all
If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.
The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.
However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.
CSV parsing is actually fiendishly complex to get right; I used classes based on wrapping the ODBC Text driver the one and only time I had to do this.
The ODBC solution recommended above looks at first glance to be basically the same approach.
I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.
As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:
"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG
(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
This CSV reading framework will deal with all of that, and is also very efficient:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
This is my solution:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
    Dim kwds(1) As String
    Dim k = 0
    Dim tmp As String = ""
    For l = 1 To inputString.Length ' Mid is 1-based, so run to Length or the last character is dropped
        tmp = Mid(inputString, l, 1)
        If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
        kwds(k) &= tmp
    Next
    Return kwds
End Function
Here is a version with benchmarking:
Public Shared Function FastSplit(inputString As String, separator As String) As String()
    Dim sw As New Stopwatch
    sw.Start()
    Dim kwds(1) As String
    Dim k = 0
    Dim tmp As String = ""
    For l = 1 To inputString.Length ' Mid is 1-based, so run to Length or the last character is dropped
        tmp = Mid(inputString, l, 1)
        If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
        kwds(k) &= tmp
    Next
    sw.Stop()
    Dim fsTime As Long = sw.ElapsedTicks
    sw.Restart() ' reset before timing the built-in Split on its own
    Dim strings() As String = inputString.Split(separator)
    sw.Stop()
    Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)
    Return kwds
End Function
Here are some results on relatively small strings but with varying sizes, up to 8kb blocks. (times are in ticks)
FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10
Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.
// Note: this mutates the input string in place through the fixed pointer,
// which breaks the CLR's assumption that strings are immutable (and can
// corrupt interned literals). It also relies on new String(char*) stopping
// at the first null character. Use with extreme care.
public static unsafe List<string> SplitString(char separator, string input)
{
    List<string> result = new List<string>();
    int i = 0;
    fixed (char* buffer = input)
    {
        for (int j = 0; j < input.Length; j++)
        {
            if (buffer[j] == separator)
            {
                buffer[i] = (char)0;            // terminate the current token
                result.Add(new String(buffer)); // reads up to the null
                i = 0;
            }
            else
            {
                buffer[i] = buffer[j];          // compact the token to the front
                i++;
            }
        }
        buffer[i] = (char)0;
        result.Add(new String(buffer));
    }
    return result;
}
You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
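A minimal sketch of what such a StringShim might look like; the type and its members are illustrative, not an existing API:
// A lightweight view over a section of an existing string: no character
// data is copied until ToString() is called.
public struct StringShim
{
    private readonly string source;
    public readonly int Start;
    public readonly int Length;

    public StringShim(string source, int start, int length)
    {
        this.source = source;
        this.Start = start;
        this.Length = length;
    }

    public char this[int i]
    {
        get { return source[Start + i]; }
    }

    public override string ToString()
    {
        return source.Substring(Start, Length); // materialize only on demand
    }
}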
String.split is rather slow; if you want some faster methods, here you go. :)
However, CSV is much better parsed by a rule-based parser.
This guy has made a rule-based tokenizer for Java (requires some copy and pasting, unfortunately):
http://www.csdgn.org/code/rule-tokenizer
private static final String[] fSplit(String src, char delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0;
    while ((index = src.indexOf(delim, lindex)) != -1) {
        output.add(src.substring(lindex, index));
        lindex = index + 1;
    }
    output.add(src.substring(lindex));
    return output.toArray(new String[output.size()]);
}

private static final String[] fSplit(String src, String delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0;
    while ((index = src.indexOf(delim, lindex)) != -1) {
        output.add(src.substring(lindex, index));
        lindex = index + delim.length();
    }
    output.add(src.substring(lindex));
    return output.toArray(new String[output.size()]);
}
