changing data to xml - c#

I have been tasked with finishing a demo for a client as our main developer is currently unavailable. All the programming experience I have is that I briefly crossed swords with C# 5 years ago but haven't used it since.
I need help with turning exported data into XML if that makes sense. Currently we have a build which takes a PDF form extracts the required data which is shown to work via command prompt. I need to be able to turn this data into XML, so it can be queried to a database. The idea of the program is to take required data from a PDF convert it to XML and query it to a database where the data is stored. We are using the C# language in conjunction with the iTextsharp library. I would post the code but I'm not allowed to.
So I am asking can anyone help me out? Maybe point me towards an example of how this is done and or explain as simply as possible how I would go about doing this? I wouldn't usually ask for help from others but because the fact I haven't coded in years, it has left me feeling intimidated.

This may prove usefull to you... Taken from a way to parse PDF data using iTextSharp
using iTextSharp.text.pdf;
using iTextSharp.text;
private void openPDF()
{
string str = "";
string newFile = "c:\\New Document.pdf";
Document doc = new Document();
PdfReader reader = new PdfReader("c:\\New Document.pdf");
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] bt = reader.GetPageContent(i);
str += ExtractTextFromPDFBytes(bt);
}
}
private string ExtractTextFromPDFBytes(byte[] input)
{
if (input == null || input.Length == 0) return "";
try
{
string resultString = "";
// Flag showing if we are we currently inside a text object
bool inTextObject = false;
// Flag showing if the next character is literal
// e.g. '\\' to get a '\' character or '\(' to get '('
bool nextLiteral = false;
// () Bracket nesting level. Text appears inside ()
int bracketDepth = 0;
// Keep previous chars to get extract numbers etc.:
char[] previousCharacters = new char[_numberOfCharsToKeep];
for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
for (int i = 0; i < input.Length; i++)
{
char c = (char)input[i];
if (inTextObject)
{
// Position the text
if (bracketDepth == 0)
{
if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
{
resultString += "\n\r";
}
else
{
if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters)) {
resultString += "\n";
}
else
{
if (CheckToken(new string[] { "Tj" }, previousCharacters))
{
resultString += " ";
}
}
}
}
// End of a text object, also go to a new line.
if (bracketDepth == 0 && CheckToken(new string[] { "ET" }, previousCharacters))
{
inTextObject = false;
resultString += " ";
}
else
{
// Start outputting text
if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
{
bracketDepth = 1;
}
else
{
// Stop outputting text
if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
{
bracketDepth = 0;
}
else
{
// Just a normal text character:
if (bracketDepth == 1)
{
// Only print out next character no matter what.
// Do not interpret.
if (c == '\\' && !nextLiteral)
{
nextLiteral = true;
}
else
{
if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255)))
{
resultString += c.ToString();
}
nextLiteral = false;
}
}
}
}
}
}
// Store the recent characters for
// when we have to go back for a checking
for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
{
previousCharacters[j] = previousCharacters[j + 1];
}
previousCharacters[_numberOfCharsToKeep - 1] = c;
// Start of a text object
if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
{
inTextObject = true;
}
}
return resultString;
}
catch
{
return string.Empty;
}
}
private bool CheckToken(string[] tokens, char[] recent)
{
foreach (string token in tokens)
{
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
(recent[_numberOfCharsToKeep - 2] == token[1]) &&
((recent[_numberOfCharsToKeep - 1] == ' ') ||
(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
((recent[_numberOfCharsToKeep - 4] == ' ') ||
(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
(recent[_numberOfCharsToKeep - 4] == 0x0a)))
{
return true;
}
}
return false;
}
Once the data is parsed it's then a matter of converting each line to a class (as recommended by Andrei V in your post's comment), serializing it to XML and then storing either the XML file itself to the database or the XML data to the database.

Related

How to write a word from a string. c#

I've started recently to learn c# and I have a problem.
The problem gives me a sentence and a number and my program has to return that number's word.
Here is what I've made:
using System;
string inputData = Console.ReadLine();
string text = inputData;
inputData=Console.ReadLine();
int x = Convert.ToInt32(inputData);
string currentWord = String.Empty;
int wordCount = 1;
for (int i = 0; i < text.Length; ++i)
{
if (text[i] == ' ')
{
wordCount++; currentWord = String.Empty;
while (text[i] == ' ') i++;
}
if (text[i] != ' ') { currentWord += text[i]; }
if (wordCount == x) Console.WriteLine(currentWord);
}
Console.Read();
For the sentence " I have two pens" and the number 2, the program returns
h
ha
hav
have.
What do I do wrong?
I would change it a bit. The problem why it's is giving multiple values, is because you don't terminate the iteration. (break the forloop). I've added some comment to it:
class Program
{
static void Main(string[] args)
{
// some test data (instead of readline)
string inputData = " I have two pens";
string text = inputData;
inputData = "2";
int x = Convert.ToInt32(inputData);
string currentWord = String.Empty;
int wordCount = 1;
for (int i = 0; i < text.Length; i++)
{
// when the character is a space and the currentWord has something or it was the last character of the text
// a new word has been found.
if ((text[i] == ' ' && !string.IsNullOrEmpty(currentWord)) || i==text.Length-1))
{
// when a new word has been found, just check it.
if (wordCount == x)
{
Console.WriteLine(currentWord);
// this is where your solution fails.
// when the word has been found, break the iteration.
break;
}
wordCount++;
currentWord = String.Empty;
}
else if (text[i] != ' ')
{
currentWord += text[i];
}
}
Console.Read();
}
}
You could use System.Linq for this, which is much easier.
// Split the string on space and create an array.
// Skip some elements and select the first.
var s = text.Split(' ').Skip(x).FirstOrDefault();
Console.WriteLine(s);

Function, which deletes all code comments

I need to do a function which deletes all comments from the text(code). My code is almost finished, but it doesn't work if comment starts in the first line of the file. It says index out of bounds, I tried changing for loops to start from 1 and then if to(text[i] == '/' && text[i - 1] == '/') but it doesn't work.
Any suggestion how can I fix that or improve my code because it looks weird.
public void RemoveComments(string text)
{
for (int i = 0; i < text.Length; i++)
{
if (text[i] == '/' && text[i + 1] == '/')
{
text = text.Remove(i, 2);
for (int j = i; j < text.Length; j++)
{
if (text[j] != '\n')
{
text = text.Remove(j, 1);
j--;
}
else if (text[j] == '\n')
{
text = text.Remove(j, 1);
j--;
while (text[j] == ' ')
{
text = text.Remove(j, 1);
j--;
}
i = j;
break;
}
}
}
else if (text[i] == '/' && text[i + 1] == '*')
{
text = text.Remove(i, 2);
for (int j = i; j < text.Length; j++)
{
if (text[j] != '*' && text[j + 1] != '/')
{
text = text.Remove(j, 1);
j--;
}
else if (text[j] == '*' && text[j + 1] == '/')
{
text = text.Remove(j, 2);
j = j - 2;
while (text[j] == ' ')
{
text = text.Remove(j, 1);
j--;
if (text[j] == '\n')
{
text = text.Remove(j, 1);
j--;
}
}
i = j;
break;
}
}
}
}
Console.WriteLine(text);
}
EDIT: Now I did many experiments and I found that the problem is with(in // loop) I needed this loop this to fix some small aligment problems:
while (text[j] == ' ')
{
text = text.Remove(j, 1);
j--;
}
Test.txt file.
//int a;
int c; //int d;
Console.Write/*Line*/("Hhehehe");
if(1>0)
/*ConsoleWriteLine("Yes")*/
//Nooo
Looks like you have C# code files. Thus you can use the power of Roslyn. Simply parse code file into syntax tree and then visit that tree with visitor which skips comments:
var code = File.ReadAllText("Code.cs");
SyntaxTree tree = CSharpSyntaxTree.ParseText(code);
var root = (CompilationUnitSyntax)tree.GetRoot();
var codeWithoutComments = new CommentsRemover().Visit(root).ToString();
Console.WriteLine(codeWithoutComments);
Visitor:
class CommentsRemover : CSharpSyntaxRewriter
{
public override SyntaxTrivia VisitTrivia(SyntaxTrivia trivia)
{
switch(trivia.Kind())
{
case SyntaxKind.SingleLineCommentTrivia:
case SyntaxKind.MultiLineCommentTrivia:
return default; // new SyntaxTrivia() // if C# <= 7.0
default:
return trivia;
}
}
}
Sample code file:
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp
{
/* Sample
Multiline Comment */
class Program
{
static void Main(string[] args)
{
// Comment
Console.Write/*Line*/("Hello, World!"); // Print greeting
/*ConsoleWriteLine("Yes")*/
}
}
}
Output:
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp
{
class Program
{
static void Main(string[] args)
{
Console.Write("Hello, World!");
}
}
}
Notes: As you can see, after removing comments from the lines which had nothing except comment, you get empty lines. You can create one more visitor to remove empty lines. Also consider to remove XML comments as well.
You have a loop based on text.Length
for (int i = 0; i < text.Length; i++)
But inside of the loop you are shorten the text. At a certain point it is smaller as the origin text.Length and you running out of index I guiess

Check next loop iteration data unless at end of loop c#

I'm making a program which is creating an ASCII image. Based on the asterix input it produces different things. To start I'm making a basic outline however I have an issue where I cannot add something when checking last for loop iteration.
Method code:
private List<string> DrawOutline(List<string> inputLines)
{
List<string> output = new List<string>();
int door = r.Next(0, inputLines.Last().Length);
for (int li = 0; li < inputLines.Count; li++)
{
char[] curLine = inputLines[li].ToCharArray();
string outputLine1 = string.Empty;
string outputLine2 = string.Empty;
for (int i = 0; i < curLine.Length -1; i++)
{
Console.WriteLine(curLine[i]);
if (curLine[i] == '*')
{
outputLine1 += "+---";
outputLine2 += "| ";
}
else
{
outputLine1 += " ";
outputLine2 += " ";
}
if(li < curLine.Length - 1)
{
if (curLine[i] == '*' && curLine[i + 1] != '*')
{
outputLine1 += "+";
outputLine2 += "|";
}
}
}
output.Add(outputLine1);
output.Add(outputLine2);
}
return output;
}
When I run this, it works fine however will not add '+' and '|' to the last line of outputLines. This is because the line :
if(li < curLine.Length -1)
However without the -1 it will throw an exception because I am using [i+1] to decide something. Is there a way to check only if it won't throw an exception?
you can check if the end of the array has been reached by using the OR ( || ) statement. If the first statement of the OR statement returns true, the second is not checked. This is called short-circuiting. No error should be thrown in this case.
private List<string> DrawOutline(List<string> inputLines)
{
List<string> output = new List<string>();
int door = r.Next(0, inputLines.Last().Length);
for (int li = 0; li < inputLines.Count; li++)
{
char[] curLine = inputLines[li].ToCharArray();
string outputLine1 = string.Empty;
string outputLine2 = string.Empty;
for (int i = 0; i < curLine.Length -1; i++)
{
Console.WriteLine(curLine[i]);
if (curLine[i] == '*')
{
outputLine1 += "+---";
outputLine2 += "| ";
}
else
{
outputLine1 += " ";
outputLine2 += " ";
}
if (curLine[i] == '*' && (curLine.Length == i+1 || curLine[i + 1] != '*'))
{
outputLine1 += "+";
outputLine2 += "|";
}
}
output.Add(outputLine1);
output.Add(outputLine2);
}
return output;
}
Change if (li < curLine.Length - 1) to if (i < curLine.Length - 1)

find common substrings in 2 string in c#

I have strings like:
1) Cookie:ystat_tw_ss376223=9_16940400_234398;
2) Cookie:zynga_toolbar_fb_uid=1018132522
3) GET /2009/visuels/Metaboli_120x600_UK.gif HTTP/1.1
4) GET /2010/07/15/ipad-3hk-smv-price-hk/ HTTP/1.1
1 ad 2 have common substtring{cookie:}
3 and 4 have common substtring{GET /20, HTTP/1.1}
I want to find all common substrings that have the length more than three characters(contain space character) between 2 strings.(like 1 and 2)
i want to code in c#. i have a program but it has some problems.
Could anyone help me?
public static string[] MyMCS2(string a, string b)
{
string[] st = new string[100];
// List<string> st = new List<string>();
List<char> f = new List<char>();
int ctr = 0;
char[] str1 = a.ToCharArray();
char[] str2 = b.ToCharArray();
int m = 0;
int n = 0;
while (m < str1.Length)
{
for (n = 0; n < str2.Length; n++)
{
if (m < str1.Length)
{
if (str1[m] == str2[n])
{
if ((m > 1) && (n > 1) &&(str1[m - 1] == str2[n - 1]) && (str1[m - 2] == str2[n - 2]))
{
//f[m]= str1[m];
f.Add(str1[m]);
char[] ff = f.ToArray();
string aaa = new string(ff);
if (aaa.Length >= 3)
{
st[ctr] = aaa + "()";
//st.Add(aaa);
ctr++;
}
kk = m;
m++;
}
else if ((n == 0) ||(n == 1))
{
f.Add(str1[m]);
kk = m;
m++;
}
else
f.Clear();
}
//else if ((str1[m] == str2[n]) && (m == str1.Length - 1) && (n == str2.Length - 1))
//{
// f.Add(str1[m]);
// char[] ff = f.ToArray();
// string aaa = new string(ff);
// if (aaa.Length >= 3)
// {
// st[ctr] = aaa;
// ctr++;
// }
// // m++;
//}
else if ((str1[m] != str2[n]) && (n == (str2.Length - 1)))
{
m++;
}
else if ((m > 1) && (n > 1) && (str1[m] != str2[n]) && (str1[m - 1] == str2[n - 1]) && (str1[m - 2] == str2[n - 2]) && (str1[m - 3] == str2[n - 3]))
{
//
char[] ff = f.ToArray();
string aaa = new string(ff);
if (aaa.Length >= 3)
{
st[ctr] = aaa + "()" ;
//st.Add(aaa);
ctr++;
f.Clear();
}
//f.Clear();
//for (int h = 0; h < ff.Length; h++)
//{
// f[h] = '\0';
//}
}
else if (str1[m] != str2[n])
continue;
}
}
}
//int gb = st.Length;
return st;
}
This is an exact matching problem not a substring. You can solve it with aho-corasick algorithm. Use the first string and compute a finite state machine. Then process the search string. You can extend the aho-corasick algorithm to use a wildcard and search also for substrings. You can try this animated example: http://blog.ivank.net/aho-corasick-algorithm-in-as3.html

Parse a persian pdf file to txt and its images

I used this code for convert a English pdf and it work perfectly, but when i use it for Persian file, its output has no Persian character !! how i can parse a Unicode pdf to a text file and a folder contains image files?
using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;
namespace Spider.Utils
{
/// <summary>
/// Parses a PDF file and extracts the text from it.
/// </summary>
public class PDFParser
{
/// BT = Beginning of a text object operator
/// ET = End of a text object operator
/// Td move to the start of next line
/// 5 Ts = superscript
/// -5 Ts = subscript
#region Fields
#region _numberOfCharsToKeep
/// <summary>
/// The number of characters to keep, when extracting text.
/// </summary>
private static int _numberOfCharsToKeep = 15;
#endregion
#endregion
#region ExtractText
/// <summary>
/// Extracts a text from a PDF file.
/// </summary>
/// <param name="inFileName">the full path to the pdf file.</param>
/// <param name="outFileName">the output file name.</param>
/// <returns>the extracted text</returns>
public bool ExtractText(string inFileName, string outFileName)
{
StreamWriter outFile = null;
try
{
// Create a reader for the given PDF file
PdfReader reader = new PdfReader(inFileName);
//outFile = File.CreateText(outFileName);
outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);
Console.Write("Processing: ");
int totalLen = 68;
float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
int totalWritten = 0;
float curUnit = 0;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");
// Write the progress.
if (charUnit >= 1.0f)
{
for (int i = 0; i < (int)charUnit; i++)
{
Console.Write("#");
totalWritten++;
}
}
else
{
curUnit += charUnit;
if (curUnit >= 1.0f)
{
for (int i = 0; i < (int)curUnit; i++)
{
Console.Write("#");
totalWritten++;
}
curUnit = 0;
}
}
}
if (totalWritten < totalLen)
{
for (int i = 0; i < (totalLen - totalWritten); i++)
{
Console.Write("#");
}
}
return true;
}
catch
{
return false;
}
finally
{
if (outFile != null) outFile.Close();
}
}
#endregion
#region ExtractTextFromPDFBytes
/// <summary>
/// This method processes an uncompressed Adobe (text) object
/// and extracts text.
/// </summary>
/// <param name="input">uncompressed</param>
/// <returns></returns>
public string ExtractTextFromPDFBytes(byte[] input)
{
if (input == null || input.Length == 0) return "";
try
{
string resultString = "";
// Flag showing if we are we currently inside a text object
bool inTextObject = false;
// Flag showing if the next character is literal
// e.g. '\\' to get a '\' character or '\(' to get '('
bool nextLiteral = false;
// () Bracket nesting level. Text appears inside ()
int bracketDepth = 0;
// Keep previous chars to get extract numbers etc.:
char[] previousCharacters = new char[_numberOfCharsToKeep];
for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
for (int i = 0; i < input.Length; i++)
{
char c = (char)input[i];
if (input[i] == 213)
c = "'".ToCharArray()[0];
if (inTextObject)
{
// Position the text
if (bracketDepth == 0)
{
if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
{
resultString += "\n\r";
}
else
{
if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
{
resultString += "\n";
}
else
{
if (CheckToken(new string[] { "Tj" }, previousCharacters))
{
resultString += " ";
}
}
}
}
// End of a text object, also go to a new line.
if (bracketDepth == 0 &&
CheckToken(new string[] { "ET" }, previousCharacters))
{
inTextObject = false;
resultString += " ";
}
else
{
// Start outputting text
if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
{
bracketDepth = 1;
}
else
{
// Stop outputting text
if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
{
bracketDepth = 0;
}
else
{
// Just a normal text character:
if (bracketDepth == 1)
{
// Only print out next character no matter what.
// Do not interpret.
if (c == '\\' && !nextLiteral)
{
resultString += c.ToString();
nextLiteral = true;
}
else
{
if (((c >= ' ') && (c <= '~')) ||
((c >= 128) && (c < 255)))
{
resultString += c.ToString();
}
nextLiteral = false;
}
}
}
}
}
}
// Store the recent characters for
// when we have to go back for a checking
for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
{
previousCharacters[j] = previousCharacters[j + 1];
}
previousCharacters[_numberOfCharsToKeep - 1] = c;
// Start of a text object
if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
{
inTextObject = true;
}
}
return CleanupContent(resultString);
}
catch
{
return "";
}
}
private string CleanupContent(string text)
{
string[] patterns = { #"\\\(", #"\\\)", #"\\226", #"\\222", #"\\223", #"\\224", #"\\340", #"\\342", #"\\344", #"\\300", #"\\302", #"\\304", #"\\351", #"\\350", #"\\352", #"\\353", #"\\311", #"\\310", #"\\312", #"\\313", #"\\362", #"\\364", #"\\366", #"\\322", #"\\324", #"\\326", #"\\354", #"\\356", #"\\357", #"\\314", #"\\316", #"\\317", #"\\347", #"\\307", #"\\371", #"\\373", #"\\374", #"\\331", #"\\333", #"\\334", #"\\256", #"\\231", #"\\253", #"\\273", #"\\251", #"\\221"};
string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };
for (int i = 0; i < patterns.Length; i++)
{
string regExPattern = patterns[i];
Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
text = regex.Replace(text, replace[i]);
}
return text;
}
#endregion
#region CheckToken
/// <summary>
/// Check if a certain 2 character token just came along (e.g. BT)
/// </summary>
/// <param name="tokens">the searched token</param>
/// <param name="recent">the recent character array</param>
/// <returns></returns>
private bool CheckToken(string[] tokens, char[] recent)
{
foreach (string token in tokens)
{
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
(recent[_numberOfCharsToKeep - 2] == token[1]) &&
((recent[_numberOfCharsToKeep - 1] == ' ') ||
(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
((recent[_numberOfCharsToKeep - 4] == ' ') ||
(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
(recent[_numberOfCharsToKeep - 4] == 0x0a))
)
{
return true;
}
}
return false;
}
#endregion
}
}
Is there a specific reason you're not using the relatively new parsing classes? I don't know the Persian language, but the first Persian PDF I found on Google works, and it's much less code:
PdfReader reader = new PdfReader(pdfPath);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++) {
ITextExtractionStrategy strategy = parser.ProcessContent(
i, new SimpleTextExtractionStrategy()
);
sb.Append(strategy.GetResultantText());
}
A number of bugs have been fixed recently, so I'm using the latest iTextSharp SVN build. Also your question title includes parsing images, but your code doesn't, so the example above is not extracting any images.

Categories

Resources