I want to extract text from a PDF into a textbox in ASP.NET, and I have tried the code from the project here
I have successfully extracted the text from my PDF, but the result is exported to a .txt file first, and it has no line breaks and no whitespace between words.
If this is the example of the PDF text
Hello World
This is the word ----------------------------------------------- This is word too
End of Hello World
The result will be like this
HelloWorld Thisistheword Thisiswordtoo EndofHelloWorld
What should I do to get a space between words and a line break at the end of every line?
Also in this http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET I saw the following code:
int totalLen = 68;
float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
int totalWritten = 0;
float curUnit = 0;
What's the use of it?
Edit:
After searching some more, I found the solution in the comment here
I just needed to update my itextsharp.dll to a newer version (I use version 5.4.4.0) and add the function the comment describes; now the result is what I wanted:
There seems to be some sort of Trim() function happening in the PDFParser.
In addition to that, in the ExtractTextFromPDFBytes method, the newline tokens it checks for are incorrect; it should not be 'TD', 'Td':
Check for iTextSharp.text.Chunk.NEWLINE
I am new to this site, and I don't know if I am providing enough info - I'll do my best =)
If you use Notepad++, then you will know what I am talking about -- When a user loads a .exe into Notepad++, the NUL / \x0 character is replaced by NULL, which has a black background, and white text. I tried pasting it into Visual Studio, hoping to obtain the same output, but it just pasted some spaces...
Does anyone know if this is a certain key-combination, or something? I would like to put the NULL character in replacement of \x0, just like Notepad++ =)
Notepad++ is a rich text editor, unlike regular Notepad. It can display custom graphics, as most modern text editors can. When Notepad++ encounters the ASCII code of a null character while reading a file, instead of displaying nothing it adds the string "NULL" to the UI, with the background colour set to black and the text colour set to white, which is what you are seeing. You can show any custom style in your own rich text editor too.
NOTE: This is by no means an efficient solution. I traverse each line twice just to take advantage of methods that already exist; it can be done manually in a single pass. The code is only meant as a hint about how you can do it. I wrote it carefully but haven't run it because I don't have the tools at the moment. I apologise for any mistakes; let me know and I'll update it.
Step 1: Read the text file line by line (a line ends at '\n') and replace every null character in the line with the string "NUL" using String.Replace(). Then append the modified text to your RichTextBox.
Step 2: Traverse the line again using String.IndexOf() to find the start index of each "NUL" marker. Using these indexes, select the text in the RichTextBox and style the selection with RichTextBox.SelectionColor and RichTextBox.SelectionBackColor.
richTextBoxCursor basically just represents the start index of each line in RichTextBox
// Requires: using System; using System.Drawing; using System.IO; using System.Text;
using (StreamReader sr = new StreamReader(@"c:\test.txt", Encoding.UTF8))
{
    while (!sr.EndOfStream)
    {
        // Start index of the line that is about to be appended to the RichTextBox.
        int richTextBoxCursor = richTextBox.TextLength;
        string line = sr.ReadLine();
        line = line.Replace("\0", "NUL");
        // ReadLine() strips the line terminator, so add it back.
        richTextBox.AppendText(line + "\n");
        int i = 0;
        while (true)
        {
            i = line.IndexOf("NUL", i, StringComparison.Ordinal);
            if (i == -1) break;
            // Select(start, length): i is the start index of a "NUL" marker within
            // this line, and 3 is its length in characters.
            richTextBox.Select(richTextBoxCursor + i, 3);
            richTextBox.SelectionColor = Color.White;
            richTextBox.SelectionBackColor = Color.Black;
            i++;
        }
    }
}
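The single-pass variant mentioned in the note could look roughly like the sketch below. NulMarker and Mark are hypothetical names introduced here for illustration; the idea is to build the output text and record where each marker starts at the same time, so no second search with IndexOf is needed.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class NulMarker
{
    // Replaces every '\0' in the line with the marker "NUL" and records the
    // start index of each marker in the returned text, in one pass.
    public static string Mark(string line, List<int> markerStarts)
    {
        const string marker = "NUL";
        var output = new StringBuilder(line.Length);
        foreach (char c in line)
        {
            if (c == '\0')
            {
                markerStarts.Add(output.Length); // index within the output text
                output.Append(marker);
            }
            else
            {
                output.Append(c);
            }
        }
        return output.ToString();
    }
}
```

Each recorded index, added to richTextBoxCursor, can then be passed to richTextBox.Select(start, 3). A side effect of this approach is that a literal "NUL" already present in the file is not highlighted, because only real null characters are recorded.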
Notepad++ may use custom or special fonts to show these particular characters. This behavior may not be appropriate for all text editors, so they don't show them.
If you want to write a text editor that visualizes these characters, you will probably need to implement this behavior programmatically. Looking at the Notepad++ source can be helpful if you want.
Text editor
As far as I know, in order to make Visual Studio display non-printable characters you need to install an extension from the marketplace at https://marketplace.visualstudio.com.
One such extension, which I have neither tried nor recommend (I just did a quick search and this is the first result), is
Invisible Character Visualizer.
Having said that, copy-pasting binaries is a risky business.
You may try Edit > Advanced > View White Space first.
Binary editor
To really see what's going on you could use the VS' binary editor: File->Open->(Open with... option)->Binary Editor -> OK
To answer your question.
It's a symbolic representation of 00H double byte.
You're copying and pasting the values. Notepad++ is showing you symbols that replace the representation of those values (because you configured it to do so in that IDE).
I am working on a C# application that calculates some values. They are placed into a DataGridView. When I select values from a column of the DataGridView and I paste them to a column of a table from a Word document, I want the values to be center aligned. Even if I set all cells of the DataGridView to be center aligned, when I copy and paste the text into the Word table, they show up left aligned.
The code used to copy the table into clipboard is
Clipboard.SetDataObject(this.dataGridView1.GetClipboardContent());
How can I prepare the text so that when I paste it, it will show centered? If I copy and paste text from another column of Word, it maintains the original alignment. This indicates there are some special characters surrounding the cell values. I don't know how to view those characters (maybe I can add them to each value).
I'm going to cheat and start with an incomplete answer.
This RTF text seems like it should work for a single value:
{\rtf1\ansi\qc
This is some centered text.}
That's a "hard newline" after \qc or I think just a \n. If I put that RTF into a file then open in WordPad it shows up centered.
I've been playing with System.Windows.Forms.Clipboard.
Clipboard.SetData(DataFormats.Rtf, "{\\rtf1\\qc\n{\\b foo}}");
If I run the above even in a console application, I can next ctrl-V paste into MS Word and the bold works, but unfortunately the centering doesn't.
In any case, I then looked at pasting into a MS Word table and clearly it's not just a matter of text with newlines, some delimiter or other is required to show cell boundaries. So not only does the RTF I have not work, there's likely at least one more step / wrapper beyond the RTF to get a "column" not just a block of text.
Feel free to not vote, I just thought perhaps something here might be helpful to avoid both of us doing the same thing twice.
EDIT: DataFormats.Html may also work and seems it could even be the format normally used by your grid control. (though it also supports CSV)
However there's an extra clipboard header for HTML I haven't figured out yet described here: How to set HTML to clipboard in C#?
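For what it's worth, here is a minimal sketch of how that header is usually built, based on my understanding of the documented "HTML Clipboard Format" (CF_HTML). HtmlClipboard and Build are hypothetical names. The header lists byte offsets (counted in UTF-8 bytes from the start of the data) for the whole HTML and for the fragment, written as fixed-width numbers so that filling them in does not change the header's length.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper: wraps an HTML fragment in a CF_HTML header.
public static class HtmlClipboard
{
    const string StartMarker = "<!--StartFragment-->";
    const string EndMarker = "<!--EndFragment-->";

    // Eight-digit placeholders keep the header length constant.
    const string HeaderFormat =
        "Version:0.9\r\n" +
        "StartHTML:{0:D8}\r\n" +
        "EndHTML:{1:D8}\r\n" +
        "StartFragment:{2:D8}\r\n" +
        "EndFragment:{3:D8}\r\n";

    public static string Build(string fragment)
    {
        string prefix = "<html><body>" + StartMarker;
        string suffix = EndMarker + "</body></html>";
        CultureInfo inv = CultureInfo.InvariantCulture;

        // The header is pure ASCII, so its character count equals its byte count.
        int startHtml = string.Format(inv, HeaderFormat, 0, 0, 0, 0).Length;
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(suffix);

        return string.Format(inv, HeaderFormat, startHtml, endHtml, startFragment, endFragment)
            + prefix + fragment + suffix;
    }
}
```

The result would then be handed to Clipboard.SetData(DataFormats.Html, ...). I have not verified whether Word honours a centered style pasted this way, and non-ASCII text may need extra care with the encoding when the string is placed on the clipboard.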
I will post my solution here so that other people can use (and improve) it. Place a populated DataGridView and a button on a form; the button calls the function CopyColumn(). I had some trouble formatting the code (Stack Overflow renders some of the long strings as text rather than code), so maybe somebody can help with including them in the code.
void CopyColumn()
{
if (this.dataGridView1.GetCellCount(DataGridViewElementStates.Selected) > 0)
{
try
{
Clipboard.SetDataObject(this.dataGridView1.GetClipboardContent());
string sText = Clipboard.GetText();
string sColumn = FormatColumn(sText);
Clipboard.SetData(DataFormats.Rtf, sColumn); // this will set the proper format of the Uncertainty column in clipboard memory
}
catch (System.Runtime.InteropServices.ExternalException)
{
MessageBox.Show("The Clipboard could not be accessed. Please try again.");
}
}
}
string FormatColumn(string sValues)
{
int nlines = NumLines(sValues);
string[] values = Values(sValues);
string sStart = @"{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang3081{\fonttbl{\f0\froman\fprq2\fcharset0 Times New Roman;}}
{\*\generator Riched20 10.0.14393}\viewkind4\uc1";
string sEnd = "}";
string sRowStart = @"\trowd\trgaph85\trleft5\trbrdrl\brdrs\brdrw10 \trbrdrt\brdrs\brdrw10 \trbrdrr\brdrs\brdrw10 \trbrdrb\brdrs\brdrw10 \trpaddl85\trpaddr85\trpaddfl3\trpaddfr3
\clvertalc\clbrdrl\brdrw10\brdrs\clbrdrt\brdrw10\brdrs\clbrdrr\brdrw10\brdrs\clbrdrb\brdrw10\brdrs \cellx1706
\pard\intbl\widctlpar\qc ";
string sRowEnd = @"\cell\row";
string sFormattedColumn = sStart;
string sRow = string.Empty;
for(int i = 0; i < nlines; i++) {
sRow = sRowStart + values[i] + sRowEnd;
sFormattedColumn += sRow;
}
sFormattedColumn += sEnd;
return sFormattedColumn;
}
int NumLines(string sValue)
{
string[] values = sValue.Split('\r');
return values.Length;
}
string[] Values(string sValue)
{
string[] values = sValue.Split('\r');
for(int i = 0; i < values.Length; i++) {
values[i] = values[i].Replace("\n", "");
}
return values;
}
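One thing the code above does not handle is cell text that contains characters with a special meaning in RTF. A backslash or a brace inside a value would corrupt the generated table. The sketch below shows one way to escape values before they are concatenated into a row; RtfText and Escape are hypothetical names, and the \uN? form for non-ASCII characters follows my reading of the RTF specification.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper for escaping plain text before embedding it in RTF.
public static class RtfText
{
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                // Control characters of the RTF syntax are prefixed with a backslash.
                sb.Append('\\').Append(c);
            }
            else if (c > 127)
            {
                // Non-ASCII: \uN? where N is a signed 16-bit decimal and '?' is the fallback.
                short code = unchecked((short)c);
                sb.Append("\\u").Append(code.ToString(CultureInfo.InvariantCulture)).Append('?');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

In FormatColumn this would be used as sRowStart + RtfText.Escape(values[i]) + sRowEnd.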
I fear this is not related to your VBA, but a bug in MS Word (tested for 2010 and 2016).
It's enough to do the following (not sure it's the same bug you're suffering from):
1. Open a new Word doc.
2. Insert a table (let's say 4 columns, 4 rows).
3. Mark all cells and format them as centered.
4. Type some text into a cell, select it and copy it.
Now there are two scenarios:
A: Paste the copied text into a single free cell -> The pasted text is centered, as expected.
B: Mark several adjacent free cells and paste the text -> The pasted text is left-aligned instead of centered!!
If I use LibreOffice (which I recommend to you, too), all this works fine (no surprise).
No, I did not pay for MS Office, but I have to use it at work :/
I am trying to read a specific word from a text file. I know it's easy and I have done it, but I need to read it from a sentence, i.e. if the file contains
WC|110916|F-12003||ZET5.4|27019570 then I need to pick "27019570", this specific word. I did it with Substring(26,8), splitting at fixed character positions, and it works, but not every line has the same size/length, so splitting at fixed positions is not a proper solution.
In short, I need to know how to check for this character (|) and its position in every sentence in the text file.
Thanks in Advance :)
You can split each line by the '|' character. It returns an array, and then you can select the desired index.
var textFromFile = "WC|110916|F-12003||ZET5.4|27019570";
var goalText = textFromFile.Split('|')[5];
If you're using .NET 3.5 or higher, it's easy using LINQ with File.ReadAllLines (this needs using directives for System.IO, System.Linq and System.Collections.Generic):
string fullFilePath = @"C:\ed\cc\filename.txt";
List<string> items = File.ReadAllLines(fullFilePath).Select(line => line.Split('|').Last()).ToList();
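If the position of the '|' character itself is what matters, as the question asks, a sketch along these lines might help. PipeFields, LastField and DelimiterPositions are hypothetical names; the code relies only on String.LastIndexOf and String.IndexOf.

```csharp
using System.Collections.Generic;

// Hypothetical helper illustrating position-based access to '|' delimiters.
public static class PipeFields
{
    // Returns the text after the last '|', or the whole line if there is none
    // (LastIndexOf returns -1 in that case, so Substring starts at 0).
    public static string LastField(string line)
    {
        int pos = line.LastIndexOf('|');
        return line.Substring(pos + 1);
    }

    // Returns the zero-based position of every '|' in the line.
    public static List<int> DelimiterPositions(string line)
    {
        var positions = new List<int>();
        int pos = line.IndexOf('|');
        while (pos != -1)
        {
            positions.Add(pos);
            pos = line.IndexOf('|', pos + 1);
        }
        return positions;
    }
}
```

For the sample line, LastField returns "27019570" regardless of how long the earlier fields are, which avoids the fixed Substring(26,8) offsets.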
I am writing a Word plugin to read all text in a document and saving it to a text file.
The text file generated will be used by another application of mine, and so I need to mark the end of every page's text by a '\f' character. My current logic merely saves the file though word as a plain text file, by using
object format = WdSaveFormat.WdFormatText;
...
Application.ActiveDocument.SaveAs( ..., ref format, ... );
The best method I found to insert a break was using ActiveDocument.Selection.InsertBreak().
Is there some way to determine the positions of page breaks in the original Word document so that I know where to insert the '\f' character?
This is one of the hard ways to do it:
Use ComputeStatistics() to get the number of lines.
Use GoTo to go to the last line in the document and insert a hard EOF.
ex:
Selection.GoTo(wdGoToLine,wdGoToAbsolute,4)
The only thing that I can think of right now is for you to save it as HTML, which will give you a tag for every paragraph. Then you can take the beginning of each paragraph's text and use that to find the starting position of each paragraph in the original document.
Also, you can do a Selection.Find and search for "^p" which is a paragraph mark.
I have a program that generates a plain text file. The structure (layout) is always the same. Example:
Text File:
LinkLabel
"Hello, this text will appear in a LinkLabel once it has been
added to the form. This text may not always cover more than one line. But will always be surrounded by quotation marks."
240, 780
So, to explain what is going on in that file:
Control
Text
Location
And when a button on the Form is clicked, and the user opens one of these files from the OpenFileDialog dialog, I need to be able to read each line. Starting from the top, I want to check what control it is, then starting on the second line I need to get all the text inside the quotation marks (regardless of whether it is one line of text or more), and on the next line (after the closing quotation mark) I need to extract the location (240, 780)... I have thought of a few ways of going about this, but when I write them down and put them into practice, they don't make much sense and I end up finding ways in which they won't work.
Has anybody ever done this before? Would anybody be able to provide any help, suggestions or advice on how I'd go about doing this?
I have looked up CSV files but that seems too complicated for something that seems so simple.
Thanks
jase
You could use a regular expression to get the lines from the text:
MatchCollection lines = Regex.Matches(File.ReadAllText(fileName), @"(.+?)\r\n""([^""]+)""\r\n(\d+), (\d+)\r\n");
foreach (Match match in lines) {
string control = match.Groups[1].Value;
string text = match.Groups[2].Value;
int x = Int32.Parse(match.Groups[3].Value);
int y = Int32.Parse(match.Groups[4].Value);
Console.WriteLine("{0}, \"{1}\", {2}, {3}", control, text, x, y);
}
I'll try and write down the algorithm, the way I solve these problems (in comments):
// while not at end of file
// read control
// read line of text
// while last char in line is not "
// read line of text
// read location
Try and write code that does what each comment says and you should be able to figure it out.
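If you get stuck, the comments above could translate into something like the following sketch. ControlRecord and ControlFileParser are hypothetical names. It assumes the text field is the only place quotation marks end a line, and it keeps the line breaks inside the quoted text as '\n'.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical container for one parsed record.
public sealed class ControlRecord
{
    public string Control;
    public string Text;
    public int X;
    public int Y;
}

public static class ControlFileParser
{
    // Reads one record; returns null when the reader is already at end of file.
    public static ControlRecord ReadRecord(TextReader reader)
    {
        // read control
        string control = reader.ReadLine();
        if (control == null) return null;

        // read lines of text until one ends with a quotation mark
        var text = new StringBuilder();
        bool closed = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (text.Length > 0) text.Append('\n');
            text.Append(line);
            // Length > 1 keeps a lone opening quote from counting as the closing one.
            if (text.Length > 1 && line.EndsWith("\"", StringComparison.Ordinal))
            {
                closed = true;
                break;
            }
        }
        if (!closed) throw new FormatException("Closing quotation mark not found.");

        // read location, e.g. "240, 780"
        string location = reader.ReadLine();
        if (location == null) throw new FormatException("Location line missing.");
        string[] parts = location.Split(',');
        if (parts.Length != 2) throw new FormatException("Location must be 'x, y'.");

        return new ControlRecord
        {
            Control = control.Trim(),
            Text = text.ToString(1, text.Length - 2), // drop the surrounding quotes
            X = int.Parse(parts[0].Trim()),
            Y = int.Parse(parts[1].Trim())
        };
    }
}
```

Calling ReadRecord in a loop until it returns null covers the outer "while not at end of file" comment.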
HTH.
You are trying to implement a parser and the best strategy for that is to divide the problem into smaller pieces. And you need a TextReader class that enables you to read lines.
You should separate your ReadControl method into three methods: ReadControlType, ReadText, ReadLocation. Each method is responsible for reading only the item it should read and leave the TextReader in a position where the next method can pick up. Something like this.
public Control ReadControl(TextReader reader)
{
string controlType = ReadControlType(reader);
string text = ReadText(reader);
Point location = ReadLocation(reader);
... return the control ...
}
Of course, ReadText is the most interesting one, since it spans multiple lines. In fact it's a loop that calls TextReader.ReadLine until the line ends with a quotation mark:
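private string ReadText(TextReader reader)
{
    string line = reader.ReadLine();
    string text = line.Substring(1); // Strip first quotation mark.
    while (!text.EndsWith("\"")) {
        line = reader.ReadLine();
        if (line == null) throw new FormatException("Closing quotation mark not found.");
        text += Environment.NewLine + line; // ReadLine() drops the line break, so restore it.
    }
    return text.Substring(0, text.Length - 1); // Strip last quotation mark.
}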
This kind of stuff gets irritating: it's conceptually simple, but you can end up with gnarly code. You've got a comparatively simple case: one record per file. It gets much harder if you have lots of records and you want to deal nicely with badly formed records (consider writing a parser for a language such as C#).
For large scale problems one might use a grammar driven parser such as this: link text
Much of your complexity comes from the lack of regularity in the file. The first field is terminated by a newline, the second is delimited by quotes, the third is terminated by a comma ...
My first recommendation would be to adjust the format of the file so that it's really easy to parse. You write the file, so you're in control. For example, just don't have new lines in the text, and put each item on its own line. Then you can just read four lines, job done.
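A minimal sketch of that simplified format, assuming the four lines are control, text, x and y, and that line breaks in the text are replaced by spaces when writing. FlatControlFile, Write and ReadFields are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical reader/writer for a "one item per line" variant of the file.
public static class FlatControlFile
{
    // Writes one record as exactly four lines: control, text, x, y.
    public static void Write(TextWriter writer, string control, string text, int x, int y)
    {
        // Collapse line breaks so the text always fits on a single line.
        string singleLine = text.Replace("\r\n", " ").Replace('\n', ' ').Replace('\r', ' ');
        writer.WriteLine(control);
        writer.WriteLine(singleLine);
        writer.WriteLine(x.ToString(CultureInfo.InvariantCulture));
        writer.WriteLine(y.ToString(CultureInfo.InvariantCulture));
    }

    // Reads the four lines back; throws if the file ends early.
    public static string[] ReadFields(TextReader reader)
    {
        var fields = new string[4];
        for (int i = 0; i < fields.Length; i++)
        {
            fields[i] = reader.ReadLine();
            if (fields[i] == null)
                throw new FormatException("Expected 4 lines, found " + i + ".");
        }
        return fields;
    }
}
```

The trade-off is that the original line breaks inside the text are lost; if they matter, an escape sequence would be needed instead of a plain space.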