Extract text with iText with embedded fonts - c#

I am trying to use iTextSharp (v5.5.12.1) to extract the text from the following PDF:
https://structure.mil.ru/files/morf/military/files/ENGV_1929.pdf
Unfortunately, it seems like they are using a number of embedded custom fonts, which are defeating me.
For now, I have a working solution using OCR, but the OCR can be imprecise, reading some characters wrongly and also adding additional spaces between characters. It would be ideal if I could extract the text directly.
public static string ExtractTextFromPdf(Stream pdfStream, bool addNewLineBetweenPages = false)
{
using (PdfReader reader = new PdfReader(pdfStream))
{
string text = "";
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text += PdfTextExtractor.GetTextFromPage(reader, i);
if (addNewLineBetweenPages && i != reader.NumberOfPages)
{
text += Environment.NewLine;
}
}
return text;
}
}

The issue here is that the glyphs in the embedded font programs have non-standard glyph names (G00, G01, ...) and are identified only by glyph name. Thus, one has to establish a mapping from these glyph names to Unicode characters. One can do so, for example, by inspecting the font programs in the PDF (e.g. with FontForge), visually recognizing each glyph and noting its name.
(There are some gaps for glyphs of the font in question which are not used in the document at hand. Some missing glyphs you can guess, some not.)
Then you have to inject these mappings into iText. As the mappings are hidden (private static members of the GlyphList class), you either have to patch iText itself or use reflection:
void InitializeGlyphs()
{
FieldInfo names2unicodeField = typeof(GlyphList).GetField("names2unicode", BindingFlags.Instance | BindingFlags.NonPublic | BindingFlags.Static);
Dictionary<string, int[]> names2unicode = (Dictionary<string, int[]>) names2unicodeField.GetValue(null);
names2unicode["G03"] = new int[] { ' ' };
names2unicode["G0A"] = new int[] { '\'' };
names2unicode["G0B"] = new int[] { '(' };
names2unicode["G0C"] = new int[] { ')' };
names2unicode["G0F"] = new int[] { ',' };
names2unicode["G10"] = new int[] { '-' };
names2unicode["G11"] = new int[] { '.' };
names2unicode["G12"] = new int[] { '/' };
names2unicode["G13"] = new int[] { '0' };
names2unicode["G14"] = new int[] { '1' };
names2unicode["G15"] = new int[] { '2' };
names2unicode["G16"] = new int[] { '3' };
names2unicode["G17"] = new int[] { '4' };
names2unicode["G18"] = new int[] { '5' };
names2unicode["G19"] = new int[] { '6' };
names2unicode["G1A"] = new int[] { '7' };
names2unicode["G1B"] = new int[] { '8' };
names2unicode["G1C"] = new int[] { '9' };
names2unicode["G1D"] = new int[] { ':' };
names2unicode["G23"] = new int[] { '@' };
names2unicode["G24"] = new int[] { 'A' };
names2unicode["G25"] = new int[] { 'B' };
names2unicode["G26"] = new int[] { 'C' };
names2unicode["G27"] = new int[] { 'D' };
names2unicode["G28"] = new int[] { 'E' };
names2unicode["G29"] = new int[] { 'F' };
names2unicode["G2A"] = new int[] { 'G' };
names2unicode["G2B"] = new int[] { 'H' };
names2unicode["G2C"] = new int[] { 'I' };
names2unicode["G2D"] = new int[] { 'J' };
names2unicode["G2E"] = new int[] { 'K' };
names2unicode["G2F"] = new int[] { 'L' };
names2unicode["G30"] = new int[] { 'M' };
names2unicode["G31"] = new int[] { 'N' };
names2unicode["G32"] = new int[] { 'O' };
names2unicode["G33"] = new int[] { 'P' };
names2unicode["G34"] = new int[] { 'Q' };
names2unicode["G35"] = new int[] { 'R' };
names2unicode["G36"] = new int[] { 'S' };
names2unicode["G37"] = new int[] { 'T' };
names2unicode["G38"] = new int[] { 'U' };
names2unicode["G39"] = new int[] { 'V' };
names2unicode["G3A"] = new int[] { 'W' };
names2unicode["G3B"] = new int[] { 'X' };
names2unicode["G3C"] = new int[] { 'Y' };
names2unicode["G3D"] = new int[] { 'Z' };
names2unicode["G42"] = new int[] { '_' };
names2unicode["G44"] = new int[] { 'a' };
names2unicode["G45"] = new int[] { 'b' };
names2unicode["G46"] = new int[] { 'c' };
names2unicode["G46._"] = new int[] { 'c' };
names2unicode["G47"] = new int[] { 'd' };
names2unicode["G48"] = new int[] { 'e' };
names2unicode["G49"] = new int[] { 'f' };
names2unicode["G4A"] = new int[] { 'g' };
names2unicode["G4B"] = new int[] { 'h' };
names2unicode["G4C"] = new int[] { 'i' };
names2unicode["G4D"] = new int[] { 'j' };
names2unicode["G4E"] = new int[] { 'k' };
names2unicode["G4F"] = new int[] { 'l' };
names2unicode["G50"] = new int[] { 'm' };
names2unicode["G51"] = new int[] { 'n' };
names2unicode["G52"] = new int[] { 'o' };
names2unicode["G53"] = new int[] { 'p' };
names2unicode["G54"] = new int[] { 'q' };
names2unicode["G55"] = new int[] { 'r' };
names2unicode["G56"] = new int[] { 's' };
names2unicode["G57"] = new int[] { 't' };
names2unicode["G58"] = new int[] { 'u' };
names2unicode["G59"] = new int[] { 'v' };
names2unicode["G5A"] = new int[] { 'w' };
names2unicode["G5B"] = new int[] { 'x' };
names2unicode["G5C"] = new int[] { 'y' };
names2unicode["G5D"] = new int[] { 'z' };
names2unicode["G62"] = new int[] { 'Ш' };
names2unicode["G63"] = new int[] { 'Р' };
names2unicode["G6A"] = new int[] { 'И' };
names2unicode["G6B"] = new int[] { 'А' };
names2unicode["G6C"] = new int[] { 'М' };
names2unicode["G6D"] = new int[] { 'в' };
names2unicode["G6E"] = new int[] { 'Ф' };
names2unicode["G70"] = new int[] { 'Е' };
names2unicode["G72"] = new int[] { 'Б' };
names2unicode["G73"] = new int[] { 'Н' };
names2unicode["G76"] = new int[] { 'С' };
names2unicode["G7A"] = new int[] { 'К' };
names2unicode["G7B"] = new int[] { 'В' };
names2unicode["G7C"] = new int[] { 'О' };
names2unicode["G7D"] = new int[] { 'к' };
names2unicode["G7E"] = new int[] { 'З' };
names2unicode["G80"] = new int[] { 'Г' };
names2unicode["G81"] = new int[] { 'П' };
names2unicode["G82"] = new int[] { 'у' };
names2unicode["G85"] = new int[] { '»' };
names2unicode["G88"] = new int[] { 'т' };
names2unicode["G8D"] = new int[] { '’' };
names2unicode["G90"] = new int[] { 'У' };
names2unicode["G91"] = new int[] { 'Т' };
names2unicode["GA1"] = new int[] { 'Ц' };
names2unicode["GA2"] = new int[] { '№' };
names2unicode["GAA"] = new int[] { 'э' };
names2unicode["GAB"] = new int[] { 'я' };
names2unicode["GAC"] = new int[] { 'і' };
names2unicode["GAD"] = new int[] { 'б' };
names2unicode["GAE"] = new int[] { 'й' };
names2unicode["GAF"] = new int[] { 'р' };
names2unicode["GB0"] = new int[] { 'с' };
names2unicode["GB2"] = new int[] { 'х' };
names2unicode["GB5"] = new int[] { '“' };
names2unicode["GB9"] = new int[] { 'п' };
names2unicode["GBA"] = new int[] { 'о' };
names2unicode["GBD"] = new int[] { '«' };
names2unicode["GC1"] = new int[] { 'ф' };
names2unicode["GC8"] = new int[] { 'а' };
names2unicode["GCB"] = new int[] { 'е' };
names2unicode["GCE"] = new int[] { 'ж' };
names2unicode["GCF"] = new int[] { 'з' };
names2unicode["GD2"] = new int[] { 'и' };
names2unicode["GD3"] = new int[] { 'н' };
names2unicode["GDC"] = new int[] { '–' };
names2unicode["GE3"] = new int[] { 'л' };
}
After executing that method you can extract the text using your method:
InitializeGlyphs();
using (FileStream pdfStream = new FileStream(@"ENGV_1929.pdf", FileMode.Open))
{
string result = ExtractTextFromPdf(pdfStream, true);
File.WriteAllText(@"ENGV_1929.txt", result);
Console.WriteLine("\n\nENGV_1929.pdf\n");
Console.WriteLine(result);
}
The result:
From Notices to Mariners
Edition No 29/2019
(English version)
Notiсes to Mariners from Seсtion II «Сharts Сorreсtion», based on the original sourсe information, and
NAVAREA XIII, XX and XXI navigational warnings are reprinted hereunder in English. Original Notiсes to
Mariners from Seсtion I «Misсellaneous Navigational Information» and from Seсtion III «Nautiсal
Publiсations Сorreсtion» may be only briefly annotated and/or a referenсe may be made to Notiсes from
other Seсtions. Information from Seсtion IV «Сatalogues of Сharts and Nautiсal Publiсations Сorreсtion»
сonсerning the issue of сharts and publiсations is presented with details.
Digital analogue of English version of the extracts from original Russian Notices to Mariners is available
by: http://structure.mil.ru/structure/forces/hydrographic/info/notices.htm
СНАRTS СОRRЕСTIОN
Вarents Sea
3493 Сharts 18012, 17052, 15005, 15004
Amend 1. Light to light Fl G 4s 1M at
front leading lightbeacon 69111’32.2“N 33129’48.0“E
2. Light to light Fl G 4s 1M at
rear leading lightbeacon 69111’34.85“N 33129’44.25“E
Cancel coastal warning
MURMANSK 71/19
...
Beware: quite often a similar-looking Cyrillic character is used instead of a Latin character. Apparently the document was created manually by someone who didn't consider typographical correctness very important.
So, if you want to search in the text, you should first normalize the text and your search terms (e.g. use the same character for the Latin 'c' and the Cyrillic 'с').
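The normalization mentioned above could be sketched roughly as follows. This is only an illustration: the class name `HomoglyphNormalizer` and the look-alike table are assumptions, the table is deliberately incomplete, and mapping Cyrillic to Latin is only sensible for searching (it would damage genuine Russian text).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HomoglyphNormalizer
{
    // Hypothetical, incomplete table: Cyrillic letters (given as \u escapes to
    // avoid ambiguity) mapped to the Latin letters they resemble.
    private static readonly Dictionary<char, char> CyrillicToLatin = new Dictionary<char, char>
    {
        { '\u0410', 'A' }, { '\u0412', 'B' }, { '\u0415', 'E' }, { '\u041A', 'K' },
        { '\u041C', 'M' }, { '\u041D', 'H' }, { '\u041E', 'O' }, { '\u0420', 'P' },
        { '\u0421', 'C' }, { '\u0422', 'T' }, { '\u0425', 'X' }, { '\u0430', 'a' },
        { '\u0435', 'e' }, { '\u043E', 'o' }, { '\u0440', 'p' }, { '\u0441', 'c' },
        { '\u0443', 'y' }, { '\u0445', 'x' }, { '\u0456', 'i' }
    };

    // Replaces every known Cyrillic look-alike by its Latin counterpart and
    // leaves all other characters untouched.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            sb.Append(CyrillicToLatin.TryGetValue(ch, out char latin) ? latin : ch);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string extracted = "Noti\u0441es to Mariners"; // contains a Cyrillic 'с'
        string term = "Notices";                        // purely Latin
        // Normalize both sides before comparing.
        Console.WriteLine(Normalize(extracted).Contains(Normalize(term))); // True
    }
}
```

Applying `Normalize` to both the extracted text and the search term means it does not matter which of the two look-alike characters the document happens to use.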

Related

Is there a way to merge two array of objects with same keys, summing up the other property's value?

I'm trying to merge two arrays and sum the values having the same keys. Is it possible to do so?
public struct BassoValues
{
public int BassoId { get; set; }
public decimal Amount { get; set; }
public BassoValues(int bassoId, decimal amount)
{
BassoId = bassoId;
Amount = amount;
}
}
var arrayOne = new BassoValues[4];
arrayOne[0] = new BassoValues() { BassoId = 1, Amount = 1};
arrayOne[1] = new BassoValues() { BassoId = 2, Amount = 10};
arrayOne[2] = new BassoValues() { BassoId = 3, Amount = 20};
arrayOne[3] = new BassoValues() { BassoId = 4, Amount = 30};
var arrayTwo = new BassoValues[4];
arrayTwo[0] = new BassoValues() { BassoId = 1, Amount = 1};
arrayTwo[1] = new BassoValues() { BassoId = 2, Amount = 10};
arrayTwo[2] = new BassoValues() { BassoId = 3, Amount = 20};
arrayTwo[3] = new BassoValues() { BassoId = 4, Amount = 30};
I want to achieve the following result.
var arrayFinal = new BassoValues[4];
arrayFinal[0] = new BassoValues() { BassoId = 1, Amount = 2};
arrayFinal[1] = new BassoValues() { BassoId = 2, Amount = 20};
arrayFinal[2] = new BassoValues() { BassoId = 3, Amount = 40};
arrayFinal[3] = new BassoValues() { BassoId = 4, Amount = 60};
This is how I am trying to achieve the result:
for (int i = 0; i < arrayOne.Length; i++)
{
for (int j = 0; j < arrayTwo.Length; j++)
{
if (arrayOne[0].BassoId == arrayTwo[0].BassoId)
{
var bassoId = arrayOne[0].BassoId;
var sum = arrayOne[0].Amount + arrayTwo[0].Amount;
arrayFinal[0] = new BassoValues() { bassoId, sum};
}
}
}
This works even when some ids are not contained in both arrays, and also when ids repeat within one array:
var result = arrayOne.Concat(arrayTwo).GroupBy(x => x.BassoId)
.Select(x => new BassoValues(x.Key, x.Sum(y => y.Amount)))
.ToArray();
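To see how the Concat/GroupBy approach behaves with uneven input, here is a small self-contained sketch. The sample data (an id missing from one array, an id repeated within one array) and the `MergeDemo` wrapper are made up for illustration; the struct is the one from the question.

```csharp
using System;
using System.Linq;

public struct BassoValues
{
    public int BassoId { get; set; }
    public decimal Amount { get; set; }
    public BassoValues(int bassoId, decimal amount)
    {
        BassoId = bassoId;
        Amount = amount;
    }
}

public static class MergeDemo
{
    // Concatenates both arrays, groups the entries by id and sums each group.
    public static BassoValues[] Merge(BassoValues[] first, BassoValues[] second)
    {
        return first.Concat(second)
                    .GroupBy(x => x.BassoId)
                    .Select(g => new BassoValues(g.Key, g.Sum(y => y.Amount)))
                    .ToArray();
    }

    public static void Main()
    {
        // Id 2 occurs twice in the first array; id 1 and id 3 occur in only one array.
        var arrayOne = new[] { new BassoValues(1, 1m), new BassoValues(2, 10m), new BassoValues(2, 5m) };
        var arrayTwo = new[] { new BassoValues(2, 10m), new BassoValues(3, 20m) };

        foreach (var item in Merge(arrayOne, arrayTwo))
        {
            Console.WriteLine(item.BassoId + ": " + item.Amount);
        }
        // Three entries result: id 1 with 1, id 2 with 25, id 3 with 20.
    }
}
```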
One solution would be to join the arrays using the Id:
var sumarray = (from a1 in arrayOne
join a2 in arrayTwo on a1.BassoId equals a2.BassoId
select new BassoValues {BassoId = a1.BassoId, Amount = a1.Amount + a2.Amount}).ToArray();
EDIT: If each array can contain multiple entries with the same ID and you want to sum them up, the LINQ join solution no longer suffices. You could group by the id and calculate the sum per id in a loop (note that ids occurring only in arrayTwo are not included in the result):
List<BassoValues> Result = new List<BassoValues>();
foreach (var element in arrayOne.GroupBy(x => x.BassoId))
{
BassoValues temp = new BassoValues {BassoId = element.Key};
temp.Amount = arrayTwo.Where(x => x.BassoId == temp.BassoId).Sum(x => x.Amount) + element.Sum(x => x.Amount);
Result.Add(temp);
}
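You say your arrays' sizes are fixed to 4. If, in addition, both arrays hold the same ids in the same order, you can add them up index by index:
var arrayFinal = new BassoValues[4]; // create final array
// loop over both arrays in parallel
for (int i = 0; i < 4; i++)
{
    // Amount is a decimal, so the sum must be a decimal as well
    decimal amount = arrayOne[i].Amount + arrayTwo[i].Amount;
    arrayFinal[i] = new BassoValues() { BassoId = arrayOne[i].BassoId, Amount = amount };
}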

recursive appropriate route selection in list

I want to select routes between known points, as shown below, but I cannot get the appropriate chain of routes. For example, from point 2 to point 8 my attempt selects route1, route2 and route3, although only route1 and route3 are needed. Likewise, from point 1 to point 11 my attempt returns all routes. How can I eliminate the unnecessary routes? Is this possible with this approach, or should I change my point of view?
static void Main(string[] args)
{
var route1 = new List<int> { 1, 2, 3, 4 };
var route2 = new List<int> { 1, 5, 6, 7 };
var route3 = new List<int> { 1, 8, 9, 10 };
var route4 = new List<int> { 10, 11, 12, 13 };
List<List<int>> routeList = new List<List<int>>();
routeList.Add(route1);
routeList.Add(route2);
routeList.Add(route3);
routeList.Add(route4);
int start = 3;
int end = 9;
var visitedRoutes = new List<List<int>>();
foreach(var route in routeList.FindAll(r => r.Contains(start)))
{
visitedRoutes.Add(route);
routeList.Remove(route);
FindPath(visitedRoutes, routeList, start, end);
if (visitedRoutes.Last().Contains(end))
{
break;
}
}
Console.WriteLine("done");
}
static void FindPath(List<List<int>> visitedRoutes, List<List<int>> remainingRoutes, int start, int end)
{
if (visitedRoutes.Last().Contains(end))
{
return;
}
for (int i = 0; i < remainingRoutes.Count; i++ )
{
var route = remainingRoutes[i];
foreach (var point in route)
{
if (visitedRoutes.Last().Contains(point))
{
visitedRoutes.Add(route);
var newRemainingRoutes = new List<List<int>>(remainingRoutes);
newRemainingRoutes.Remove(route);
FindPath(visitedRoutes, newRemainingRoutes, start, end);
if (visitedRoutes.Last().Contains(end))
{
return;
}
else
{
visitedRoutes.Remove(route);
}
}
}
}
}
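One possible change of viewpoint, offered only as a sketch and not as the asker's method: treat each route as a node, connect two routes when they share a point, and run a breadth-first search from the routes containing the start point. This assumes "appropriate" means the shortest chain of routes; the names `RouteFinder` and `FindRouteChain` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RouteFinder
{
    // Returns the zero-based indices of a shortest chain of routes that links
    // start to end, or an empty list when no such chain exists.
    public static List<int> FindRouteChain(List<List<int>> routes, int start, int end)
    {
        // Maps a reached route to the route it was reached from (-1 for a starting route).
        var previous = new Dictionary<int, int>();
        var queue = new Queue<int>();

        for (int i = 0; i < routes.Count; i++)
        {
            if (routes[i].Contains(start))
            {
                previous[i] = -1;
                queue.Enqueue(i);
            }
        }

        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            if (routes[current].Contains(end))
            {
                // Walk back to a starting route, then reverse to get start-to-end order.
                var chain = new List<int>();
                for (int r = current; r != -1; r = previous[r])
                {
                    chain.Add(r);
                }
                chain.Reverse();
                return chain;
            }

            for (int next = 0; next < routes.Count; next++)
            {
                // A route is a neighbour when it is unvisited and shares a point with the current one.
                if (!previous.ContainsKey(next) && routes[next].Intersect(routes[current]).Any())
                {
                    previous[next] = current;
                    queue.Enqueue(next);
                }
            }
        }
        return new List<int>();
    }

    public static void Main()
    {
        var routes = new List<List<int>>
        {
            new List<int> { 1, 2, 3, 4 },     // index 0 = route1
            new List<int> { 1, 5, 6, 7 },     // index 1 = route2
            new List<int> { 1, 8, 9, 10 },    // index 2 = route3
            new List<int> { 10, 11, 12, 13 }  // index 3 = route4
        };
        Console.WriteLine(string.Join(", ", FindRouteChain(routes, 2, 8)));  // 0, 2
        Console.WriteLine(string.Join(", ", FindRouteChain(routes, 1, 11))); // 2, 3
    }
}
```

Because breadth-first search visits routes in order of increasing chain length, the first route found that contains the end point closes a chain with the fewest routes, so detours such as route2 are never part of the result.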

how to add array elements in one array according to condition in other array in C#?

I have two arrays: one is a string array and the other is an int array.
The string array has the elements "11","11","11","11","12","12" and the int array has 1,2,3,4,5,6 respectively.
As the result I want two arrays: a string array containing "11","12"
and an int array containing 10,11.
If the string array has duplicate elements, the values at the corresponding indexes of the other array must be added up. For example, "11" is at the 1st, 2nd, 3rd and 4th index, so its value must be the sum of those elements in the other array. Can it be done?
I have written some code, but I am unable to get it working.
static void Main(string[] args)
{
//var newchartValues = ["","","","","","",""];
//var newdates = dates.Split(',');
//string[] newchartarray = newchartValues;
//string[] newdatearray = newdates;
int[] newchartValues = new int[] { 1, 2, 3, 4, 5, 6 };
string[] newdates = new string[] { "11", "11","11","12","12","12" };
int[] intarray = new int[newchartValues.Length];
List<int> resultsumarray = new List<int>();
for (int i = 0; i < newchartValues.Length - 1; i++)
{
intarray[i] = Convert.ToInt32(newchartValues[i]);
}
for (int i = 0; i < newdates.Length; i++)
{
for (int j = 0; j < intarray.Length; j++)
{
if (newdates[i] == newdates[i + 1])
{
intarray[j] += intarray[j + 1];
resultsumarray.Add(intarray[j]);
}
}
resultsumarray.ToArray();
}
}
I don't quite get what you need, but I think I fixed your code, result will contain 10 and 11 in this example:
int[] newchartValues = new int[] { 1, 2, 3, 4, 5, 6 };
string[] newdates = new string[] { "11", "11", "11", "11", "12", "12" };
List<int> result = new List<int>();
if (newdates.Length == 0)
return;
string last = newdates[0];
int cursum = newchartValues[0];
for (var i = 1; i <= newdates.Length; i++)
{
if (i == newdates.Length || newdates[i] != last)
{
result.Add(cursum);
if (i == newdates.Length)
break;
last = newdates[i];
cursum = 0;
}
cursum += newchartValues[i];
}
Here is an approach that should do what you want:
List<int> resultsumarray = newdates
.Select((str, index) => new{ str, index })
.GroupBy(x => x.str)
.Select(xg => xg.Sum(x => newchartValues[x.index]))
.ToList();
The result is a List<int> with two numbers: 6, 15
Something like this?
int[] newchartValues = new int[] { 1, 2, 3, 4, 5, 6 };
int[] newdates = new int[] { 11, 11,11,12,12,12 };
var pairs = Enumerable.Zip(newdates, newchartValues, (x, y) => new { x, y })
.GroupBy(z => z.x)
.Select(g => new { k = g.Key, s = g.Sum(z => z.y) })
.ToList();
var distinctDates = pairs.Select(p => p.k).ToArray();
var sums = pairs.Select(p => p.s).ToArray();
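For comparison, the same grouping can be written without LINQ as a single pass with a dictionary. This is a minimal sketch under the assumption that both arrays have the same length; `GroupSum` and `SumByKey` are hypothetical names, and the data is the one from the question's description.

```csharp
using System;
using System.Collections.Generic;

public static class GroupSum
{
    // Sums values that share the same key and returns the distinct keys
    // (in order of first appearance) together with their sums.
    public static void SumByKey(string[] keys, int[] values,
                                out string[] distinctKeys, out int[] sums)
    {
        if (keys.Length != values.Length)
        {
            throw new ArgumentException("Both arrays must have the same length.");
        }

        var order = new List<string>();             // remembers first-appearance order
        var totals = new Dictionary<string, int>(); // running sum per key

        for (int i = 0; i < keys.Length; i++)
        {
            if (!totals.ContainsKey(keys[i]))
            {
                totals[keys[i]] = 0;
                order.Add(keys[i]);
            }
            totals[keys[i]] += values[i];
        }

        distinctKeys = order.ToArray();
        sums = new int[order.Count];
        for (int i = 0; i < order.Count; i++)
        {
            sums[i] = totals[order[i]];
        }
    }

    public static void Main()
    {
        SumByKey(new[] { "11", "11", "11", "11", "12", "12" },
                 new[] { 1, 2, 3, 4, 5, 6 },
                 out string[] dates, out int[] sums);
        Console.WriteLine(string.Join(",", dates)); // 11,12
        Console.WriteLine(string.Join(",", sums));  // 10,11
    }
}
```

Unlike the loop that compares neighbouring elements, this version does not require equal keys to be adjacent.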

Problem transforming dictionary with Linq

I've got a dictionary laid out like so:
Dictionary<string, List<Series>> example = new Dictionary<string, List<Series>>();
example.Add("Meter1",new List<Series>(){ new Series{ name="Usage", data = new double[] {1,2,3}},
new Series{ name = "Demand", data= new double[]{4,5,6}}});
example.Add("Meter2", new List<Series>(){ new Series{ name="Usage", data = new double[] {1,2,3}},
new Series{ name = "Demand", data= new double[]{4,5,6}}});
What I need is:
Dictionary<string, List<Series>> exampleResult = new Dictionary<string, List<Series>>();
exampleResult.Add("Usage", new List<Series>(){ new Series{ name="Meter1", data = new double[] {1,2,3}},
new Series{ name = "Meter2", data= new double[]{1,2,3}}});
exampleResult.Add("Demand", new List<Series>(){ new Series{ name="Meter1", data = new double[] {4,5,6}},
new Series{ name = "Meter2", data= new double[]{4,5,6}}});
That is, the dictionary projected "sideways", with the name of each Series as the key in the new dictionary, with the key of the old dictionary used as the name of the series.
Here's the series class...
public class Series
{
public string name { get; set; }
public double[] data { get; set; }
}
Sorry if I am not expressing this problem clearly, please ask any questions you'd like, and thanks in advance for any help...
EDITED TO ADD EXAMPLE
Create a grouping and then select out the new keys and values to create a dictionary. Like this:
// source data
var d = new Dictionary<string, Series[]>
{
{
"key1", new[]
{
new Series
{
name = "Usage",
data = new double[] {1, 2, 3}
},
new Series
{
name = "Demand",
data = new double[] {4, 5, 6}
}
}
},
{
"key2", new[]
{
new Series
{
name = "Usage",
data = new double[] {1, 2, 3}
},
new Series
{
name = "Demand",
data = new double[] {4, 5, 6}
}
}
}
};
// transform
var y = (
from outer in d
from s in outer.Value
let n = new
{
Key = s.name,
Series = new Series
{
name = outer.Key,
data = s.data
}
}
group n by n.Key
into g
select g
).ToDictionary(g1 => g1.Key,
g2 => g2.Select(g3 => g3.Series).ToArray());
/* results:
var y = new Dictionary<string, Series[]>
{
{
"Usage",
new[]
{
new Series
{
name = "key1",
data = new double[] { 1, 2, 3 }
},
new Series
{
name = "key2",
data = new double[] { 1, 2, 3 }
}
}
},
{
"Demand",
new[]
{
new Series
{
name = "key1",
data = new double[] {4, 5, 6},
},
new Series
{
name = "key2",
data = new double[] {4, 5, 6}
}
}
}
};
*/
Try this:
var result = example
    .SelectMany(x => x.Value.Select(y => y.name))
    .Distinct()
    .ToDictionary(
        x => x,
        x => example
            .SelectMany(y => y.Value
                .Where(z => z.name == x)
                .Select(z => new Series { name = y.Key, data = z.data }))
            .ToList());
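If the LINQ versions are hard to follow, the same "sideways" projection can be spelled out with two nested loops. The following is a sketch only; the `Pivot` class and `BySeriesName` method are names chosen for this example, while `Series` is the class from the question.

```csharp
using System;
using System.Collections.Generic;

public class Series
{
    public string name { get; set; }
    public double[] data { get; set; }
}

public static class Pivot
{
    // Builds a new dictionary keyed by series name; each entry lists one Series
    // per original key, named after that original key.
    public static Dictionary<string, List<Series>> BySeriesName(Dictionary<string, List<Series>> source)
    {
        var result = new Dictionary<string, List<Series>>();
        foreach (var entry in source)
        {
            foreach (var series in entry.Value)
            {
                if (!result.TryGetValue(series.name, out List<Series> bucket))
                {
                    bucket = new List<Series>();
                    result[series.name] = bucket;
                }
                // The old dictionary key becomes the name of the new series.
                bucket.Add(new Series { name = entry.Key, data = series.data });
            }
        }
        return result;
    }

    public static void Main()
    {
        var example = new Dictionary<string, List<Series>>
        {
            { "Meter1", new List<Series> {
                new Series { name = "Usage", data = new double[] { 1, 2, 3 } },
                new Series { name = "Demand", data = new double[] { 4, 5, 6 } } } },
            { "Meter2", new List<Series> {
                new Series { name = "Usage", data = new double[] { 1, 2, 3 } },
                new Series { name = "Demand", data = new double[] { 4, 5, 6 } } } }
        };

        foreach (var pair in BySeriesName(example))
        {
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value.ConvertAll(s => s.name)));
        }
    }
}
```

The data arrays are shared with the source dictionary rather than copied, which matches what the LINQ answers above do.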

C# - Creating array where the array value has multiple objects and each one has a value too

I have recently started doing something in C#, and I would like to know how to do something like this:
Array[0] =
Array['Value'] = 2344;
Array['LocationX'] = 0;
Array['LocationY'] = 0;
Array[1] =
Array['Value'] = 2312;
Array['LocationX'] = 2;
Array['LocationY'] = 1;
Array[2] =
Array['Value'] = 2334;
Array['LocationX'] = 4;
Array['LocationY'] = 3;
The data itself is not important; the thing is that I know how to do this in PHP, but not in C#. I've tried some ways with no luck.
In PHP I could just do something like this:
$Array[0]->Value = 2344;
$Array[0]->LocationX = 0;
$Array[0]->LocationY = 0;
And those values would be added to the array.
In C# I've tried this and it doesn't work that way.
Could someone enlighten me on how to do this in C#?
Thanks.
Well, you could have an array of instances of a class that you write like so:
public class DataForArray
{
public int Value { get; set; }
public int LocationX { get; set; }
public int LocationY { get; set; }
}
Then something like this:
DataForArray[] array = new DataForArray[10];
array[0] = new DataForArray();
array[0].Value = 2344;
etc...
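Either write a class or struct to hold Value, LocationX and LocationY.
struct Foo
{
    public Foo(int value, int x, int y)
    {
        Value = value;
        LocationX = x;
        LocationY = y;
    }
    // No explicit Foo() is needed: a struct always has a parameterless constructor.
    // The fields are public so that object initializers can set them.
    public int Value;
    public int LocationX;
    public int LocationY;
}
Foo[] f = new []
{
    new Foo(1, 2, 3),
    new Foo(2, 3, 4)
};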
or alternatively initialize the array this way:
Foo[] f = new []
{
new Foo() { Value = 1, LocationX = 2, LocationY = 3 },
new Foo() { Value = 4, LocationX = 5, LocationY = 6 },
};
Or use an Array of Dictionary<string, int>.
Dictionary<string, int>[] array = new []
{
new Dictionary<string, int>() {{ "Value", 1 }, {"LocationX", 2}, {"LocationY", 3 }},
new Dictionary<string, int>() {{ "Value", 4 }, {"LocationX", 5}, {"LocationY", 6 }}
};
This is only recommended if it needs to be dynamic (that is, you want different entries in each element of the array, or your keys are strings that are not known at compile time). Otherwise it is just harder to maintain.
In C#, you can try something like this:
// initialize array
var list = new[]
{
new {Value = 2344, LocationX = 0, LocationY = 0},
new {Value = 2312, LocationX = 2, LocationY = 4},
new {Value = 2323, LocationX = 3, LocationY = 1}
}.ToList();
// iterate over array
foreach (var node in list)
{
var theValue = node.Value;
var thePosition = new Point(node.LocationX, node.LocationY);
}
// iterate over array with filtering ( value > 2300 )
foreach (var node in list.Where(el => el.Value > 2300))
{
var theValue = node.Value;
var thePosition = new Point(node.LocationX, node.LocationY);
}
// add again
list.Add(new { Value = 2399, LocationX = 9, LocationY = 9 });
Here is a link that details the use of Multidimensional arrays
http://msdn.microsoft.com/en-us/library/aa288453(VS.71).aspx
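One way to read the multidimensional-array suggestion, sketched here under assumptions: use LocationX and LocationY as the two indices of an `int[,]` and store Value in the cell. The grid size of 5 by 4 and the `GridDemo` class are chosen only to fit the sample data from the question.

```csharp
using System;

public static class GridDemo
{
    // Builds a grid whose first index is LocationX and second index is LocationY.
    public static int[,] BuildGrid()
    {
        var grid = new int[5, 4]; // X runs 0..4, Y runs 0..3; cells default to 0
        grid[0, 0] = 2344;
        grid[2, 1] = 2312;
        grid[4, 3] = 2334;
        return grid;
    }

    public static void Main()
    {
        int[,] grid = BuildGrid();
        Console.WriteLine(grid[2, 1]);        // 2312
        Console.WriteLine(grid.GetLength(0)); // 5
    }
}
```

This only fits when every location holds at most one value and the coordinate range is small and known in advance; otherwise an array of objects as in the other answers is the more natural choice.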
You can use an anonymous type in C# like this:
var arr = new[] {
new{Value = 1, LocationX = 2, LocationY = 3},
new{Value = 1, LocationX = 2, LocationY = 3},
new{Value = 1, LocationX = 2, LocationY = 3},
new{Value = 1, LocationX = 2, LocationY = 3},
new{Value = 1, LocationX = 2, LocationY = 3} };
The only problem is that the properties of an anonymous type are read-only, so you can't do something like this:
arr[1].Value = 2; // compile-time error
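A possible workaround for the read-only properties, shown as a sketch rather than a recommendation: replace the whole array element with a new anonymous object that copies the unchanged properties. The `AnonymousUpdate` class is invented for this example. If elements are modified often, a small class with settable properties is probably the simpler choice.

```csharp
using System;

public static class AnonymousUpdate
{
    // Returns the Value of element 1 after "changing" it by replacement.
    public static int ReplaceValue()
    {
        var arr = new[]
        {
            new { Value = 1, LocationX = 2, LocationY = 3 },
            new { Value = 4, LocationX = 5, LocationY = 6 }
        };

        // Same property names, types and order, so this is the same anonymous
        // type and can be assigned to the array slot.
        arr[1] = new { Value = 2, LocationX = arr[1].LocationX, LocationY = arr[1].LocationY };
        return arr[1].Value;
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceValue()); // 2
    }
}
```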
