I am working on a document management project and I want to extract text from PDFs. How can I achieve this? I am using iTextSharp, and extraction works on the local system.
This is the function I am using for this purpose. The path is an FTP server path:
public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }
        return text.ToString();
    }
}
It throws an exception:
'ftp:\\###\index\500199.pdf not found as file or resource.'
[### is my ftp server]
PdfReader has a bunch of constructor overloads, but most of them rely on RandomAccessSourceFactory to convert whatever is passed in into a stream. When you pass in a string, iText first checks whether it is a file on disk; if not, it checks whether the string can be converted to a Uri as a file:/, http://, or https:// link. This is your first point of failure: none of these checks handle the ftp protocol, so you ultimately end up at a local resource loader, which doesn't work for you.
You could try converting your string to an explicit Uri, but that actually won't work either:
//This won't work
new PdfReader(new Uri(path))
The reason this won't work is that iText tells .NET to use CredentialCache.DefaultCredentials when loading remote resources, but that concept doesn't exist in the FTP world.
Long story short, when using FTP you'll want to download the files on your own. Depending on their size, you'll want to either download them to disk or download them into a byte array. Below is a sample of the latter:
byte[] bytes;
if (path.StartsWith(@"ftp://"))
{
    var wc = WebRequest.Create(path);
    // If your server requires a login, set wc.Credentials before requesting.
    using (var response = wc.GetResponse())
    {
        using (var responseStream = response.GetResponseStream())
        {
            bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
        }
    }
}
You can then pass either the local file or the byte array to the PdfReader constructor.
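Putting the two together, a minimal sketch; this assumes anonymous FTP access, and the URL is a placeholder for your own server path:

// Minimal sketch: download over FTP, then extract text from the byte array.
// The URL is a placeholder; anonymous access is assumed.
string path = "ftp://example.com/index/500199.pdf";

byte[] bytes;
using (var response = WebRequest.Create(path).GetResponse())
using (var responseStream = response.GetResponseStream())
{
    bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
}

// PdfReader accepts a byte array directly, so no temporary file is needed.
string text;
using (var reader = new PdfReader(bytes))
{
    var sb = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        sb.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }
    text = sb.ToString();
}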
I need to encode a zip file in base64 format.
I followed this approach:
string text = File.ReadAllText("../../../SampleDat.dat");
// 'stringbyte' was not defined in the original snippet; assuming UTF-8 text bytes:
byte[] stringbyte = Encoding.UTF8.GetBytes(text);
byte[] compress0 = Compress(stringbyte);
string short_com0 = base64_encode(compress0);
public static byte[] Compress(byte[] data)
{
    using (var compressedStream = new MemoryStream())
    using (var zipStream = new GZipStream(compressedStream, CompressionMode.Compress))
    {
        zipStream.Write(data, 0, data.Length);
        zipStream.Close();
        return compressedStream.ToArray();
    }
}
public string base64_encode(byte[] data)
{
    if (data == null)
        throw new ArgumentNullException("data");
    return Convert.ToBase64String(data);
}
After using this, I got this encoded string:
H4sIAAAAAAAEAJVQTU/CQBS8m/gfejHRgxQpoJJ4qGXBKlBsq6KXph8P2NjdrbuLleT9eBe/QvSgHt7hTWYmMzMmsdt3Yxe9lBe0SDVcisytqpLmqaaCkxctU5/PBQ5GZNabkjAxFwWThPhxQgYDNJd4bkyGQXifeEGfYKoUKMWA60nKYP+n5mwCTKyksjxJNUiaHmxpolzIf4tuZPk3iWcaLoRce6IAJPP5iHLwC5wC3ZSU7K30JwmjVcaoUgYynOGN38fI+OUQrZUGZrDtN6g5SAzhaUUV3dhMViwzyNey7//uzpiEQ/L74N/D46agaYZuwSinyvA0fQbLNQGVTrm2Di3CtVxbI3iGEjttXGpdqZ5t13XdyD9szLxVIxfMXlIJCkrItS2hElIrm/ICXuzH6V7rfL4oTx+CIMtY/+7aiaNZq7ZFnLfDinavZsFtBvfNpZ9HZIH4MyriUctpd7rHJ6dNvPDGDX88HaFz3MGO02w6r7wgTAN2AgAA
When I created the zip manually, then read that file in the code and encoded it:
// file zipped manually
string filePath1 = "../../../git_only/oraclehcm1/dbscripts/SampleDat.zip";
byte[] physicalfile1 = File.ReadAllBytes(filePath1);
string long_com1 = base64_encode(physicalfile1);
The response I get is
UEsDBBQAAAAIAECDYlK8IEwDbAEAAHYCAAANAAAAU2FtcGxlRGF0LmRhdJVQTU/CQBS8m/gfejHRgxQpqJB4qGXBKlBsq6KXph8P2NjdrbuLleT9eBc/4tdBPbzDvMxMZmZMYrfvxi56KS9okWo4F5lbVSXNU00FJ09apj6fCxyMyKw3JWFiLgomCfHjhAwGaC7x3JgMg/A28YI+wVQpUIoB15OUwe5PzckEmFhJZXmSapA03fukiXIh/y26kuXfJJ5puBBy7YkCkMznI8rBL3AKdFNSspfS7ySMVhmjSpmX4Qyv/D5Gxi+HaK00ML/4AoOag8QQHlZU0Y3NZMUykB/LvuLtrTEJh+T3wb+Hx01B0wzdglFOleFp+giWawIqnXJt7VuEa7m2RvAIJXbauNS6Uj3bruu6kb/ZmHmrRi6YvaQSFJSQa1tCJaRWNuUFPNn3053W6XxRdu+CIMtY/+bSiaNZq7ZFnLfDih5ezILrDG6bSz+PyALxZ1TEg5bT7hweHXebeOaNG/54OkLnqIMdp9l0ngFQSwECHwAUAAAACABAg2JSvCBMA2wBAAB2AgAADQAkAAAAAAAAACAAAAAAAAAAU2FtcGxlRGF0LmRhdAoAIAAAAAAAAQAYAEMpLaJSD9cBq6mosXsP1wFNJS5xSw7XAVBLBQYAAAAAAQABAF8AAACXAQAAAAA=
This is the actual response. I also noticed that the two zips are different sizes, and the files in the zip I created programmatically have no extensions.
Please help me create the second encoding through the program. The .NET version I am using is 4.5, and I cannot use Zip.createDirectory() due to project dependencies.
Any help is appreciated. Thanks in advance!
The first one is a gzip file; the second one is a zip file. If you want to make a zip file, try the ZipFile class as opposed to the GZipStream class.
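For illustration, here is a rough sketch using the related ZipArchive class (in .NET 4.5, reference the System.IO.Compression assembly) to build a real zip in memory and base64-encode it; the path and entry name are taken from the question:

// Sketch: create a zip (not gzip) in memory and base64-encode it.
byte[] fileBytes = File.ReadAllBytes("../../../SampleDat.dat");

byte[] zipBytes;
using (var ms = new MemoryStream())
{
    // leaveOpen: true so the MemoryStream is still usable after the archive is disposed
    using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
    {
        ZipArchiveEntry entry = archive.CreateEntry("SampleDat.dat");
        using (Stream entryStream = entry.Open())
        {
            entryStream.Write(fileBytes, 0, fileBytes.Length);
        }
    } // disposing the archive writes the zip central directory
    zipBytes = ms.ToArray();
}

string base64Zip = Convert.ToBase64String(zipBytes);

A gzip stream, by contrast, doesn't carry an entry name at all, which is why the files in your programmatic version had no extensions; with ZipArchive, the name passed to CreateEntry supplies it. Even so, don't expect the result to be byte-identical to the manually created zip, for the metadata reasons given in the next answer.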
I wouldn't expect two different zip algorithms/libraries to yield the same output. For one, in the programmatic version the file metadata (name, modification date, attributes) are not set, while the manually created version will include all that information for unzipping purposes.
Plus, libraries update at a different cadence than standalone tools, and you might not have the fixes synchronized enough to reliably match the outputs.
I would like to take the contents of a file and rename the file while in memory, so I can send it with a different file name using an API.
The Goals:
Not alter the original file (file on disk) in any way.
Not create additional files (like a copy of the file with a new name). I'm trying to keep IO access as low as possible and do everything in memory.
Change the Name of a file object (in memory) to a different name.
Upload the file object to a WebAPI on another machine.
Have "FileA.txt" on source MachineA and have "FileB.txt" on destination MachineB.
I don't think it would matter, but I have no plans to write the file back to the system (MachineA) with the new name; it will only be used to send the file object (in memory) to MachineB via a Web API.
I found a solution that uses reflection to accomplish this...
FileStream fs = new FileStream(@"C:\myfile.txt", FileMode.Open);
var myField = fs.GetType()
                .GetField("_fileName", BindingFlags.Instance | BindingFlags.NonPublic);
myField.SetValue(fs, "my_new_filename.txt");
However, It's been a few years since that solution was given. Is there a better way to do this in 2021?
One other way would be to define the file name when you save it on MachineB.
You could pass this file name as a payload through the Web API and use it when writing the file:
// buffer as byte[] and fileName as string would come from the request
using (FileStream fs = new FileStream(fileName, FileMode.Create))
{
    fs.Write(buffer, 0, buffer.Length);
}
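On the sending side, a sketch of what passing the new name could look like with HttpClient and multipart form data; the endpoint URL and the "file" field name are made up for illustration:

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical client-side sketch: the name sent to the API is chosen freely,
// so the file on disk is never renamed or copied.
static async Task UploadAsAsync(string localPath, string nameForServer)
{
    using (var client = new HttpClient())
    using (var content = new MultipartFormDataContent())
    {
        var fileContent = new ByteArrayContent(File.ReadAllBytes(localPath));
        content.Add(fileContent, "file", nameForServer);
        var response = await client.PostAsync("https://machineb.example/api/upload", content);
        response.EnsureSuccessStatusCode();
    }
}

// e.g. await UploadAsAsync(@"C:\files\FileA.txt", "FileB.txt");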
The best way I could come up with was using my old method from years ago. The following shows how I used it. I only do this to mask the original filename from the third-party WebAPI I'm sending it to.
// filePath: c:\test\my_secret_filename.txt
private byte[] GetBytesWithNewFileName(string filePath)
{
    byte[] file = null;
    using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        // Change the name of the file in memory (does not affect the original file)
        var fileNameField = fs.GetType().GetField(
            "_fileName",
            BindingFlags.Instance | BindingFlags.NonPublic
        );
        // If I leave out the next line, the file name field will have the full filePath
        // string as its value in the resulting byte array. This will replace that with
        // only the file name I wish to pass along: "my_masked_filename.txt".
        fileNameField.SetValue(fs, "my_masked_filename.txt");

        // Get the size of the file and make sure it's compatible with
        // the BinaryReader object to be used
        int fileSize;
        try { fileSize = Convert.ToInt32(fs.Length); }
        catch (OverflowException)
        { throw new Exception("The file is too big to convert using a binary reader."); }

        // Read the file into a byte array
        using (var br = new BinaryReader(fs)) { file = br.ReadBytes(fileSize); }
    }
    return file;
}
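Usage is then just a call like the following; the upload itself is whatever your Web API client expects:

// The file on disk keeps its real name; only the in-memory copy is masked.
byte[] payload = GetBytesWithNewFileName(@"c:\test\my_secret_filename.txt");
// ... hand 'payload' to the third-party Web API client here ...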
I have a function I use for aggregating streams from a zip archive.
private void ExtractMiscellaneousFiles()
{
    foreach (var miscellaneousFileName in _fileData.MiscellaneousFileNames)
    {
        var fileEntry = _archive.GetEntry(miscellaneousFileName);
        if (fileEntry == null)
        {
            throw new ZipArchiveMissingFileException("Couldn't find " + miscellaneousFileName);
        }
        var stream = fileEntry.Open();
        OtherFileStreams.Add(miscellaneousFileName, (DeflateStream)stream);
    }
}
This works well in most cases. However, if I have a zip within a zip, I get an exception on casting the stream to a DeflateStream:
System.InvalidCastException: Unable to cast object of type 'System.IO.Compression.SubReadStream' to type 'System.IO.Compression.DeflateStream'.
I am unable to find Microsoft documentation for SubReadStream. I would like my zip within a zip as a DeflateStream. Is this possible? If so, how?
UPDATE
Still no success. I attempted @Sunshine's suggestion of copying the stream, using the following code:
private void ExtractMiscellaneousFiles()
{
    _logger.Log("Extracting misc files...");
    foreach (var miscellaneousFileName in _fileData.MiscellaneousFileNames)
    {
        _logger.Log($"Opening misc file stream for {miscellaneousFileName}");
        var fileEntry = _archive.GetEntry(miscellaneousFileName);
        if (fileEntry == null)
        {
            throw new ZipArchiveMissingFileException("Couldn't find " + miscellaneousFileName);
        }
        var openStream = fileEntry.Open();
        var deflateStream = openStream;
        if (!(deflateStream is DeflateStream))
        {
            var memoryStream = new MemoryStream();
            deflateStream.CopyTo(memoryStream);
            memoryStream.Position = 0;
            deflateStream = new DeflateStream(memoryStream, CompressionLevel.NoCompression, true);
        }
        OtherFileStreams.Add(miscellaneousFileName, (DeflateStream)deflateStream);
    }
}
But I get a
System.NotSupportedException: Stream does not support reading.
I inspected deflateStream.CanRead and it is true.
I've discovered this happens not just on zips, but on files that are in the zip but are not compressed (because too small, for example). Surely there's a way to deal with this; surely someone has encountered this before. I'm opening a bounty on this question.
Here's the .NET source for SubReadStream, thanks to @Quantic.
The return type of ZipArchiveEntry.Open() is Stream, an abstract type; in practice it can be a DeflateStream (you'd be happy), a SubReadStream (boo), or a WrappedStream (boo). Woe be you if they decide to improve the class some day and use a ZopfliStream (boo). The workaround is not good either: you are trying to deflate data that is not compressed (boo).
Too many boos.
The only good solution is to change the type of your OtherFileStreams member. We can't see it, but given the two-argument Add it smells like a Dictionary<string, DeflateStream>. It needs to be a Dictionary<string, Stream>.
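A minimal sketch of that change, reusing the types from the question and assuming the member is a dictionary keyed by file name:

// Declare the member against the abstract Stream type...
private Dictionary<string, Stream> OtherFileStreams = new Dictionary<string, Stream>();

private void ExtractMiscellaneousFiles()
{
    foreach (var miscellaneousFileName in _fileData.MiscellaneousFileNames)
    {
        var fileEntry = _archive.GetEntry(miscellaneousFileName);
        if (fileEntry == null)
        {
            throw new ZipArchiveMissingFileException("Couldn't find " + miscellaneousFileName);
        }
        // ...and store whatever concrete stream Open() returns, no cast needed.
        OtherFileStreams.Add(miscellaneousFileName, fileEntry.Open());
    }
}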
So it looks like when storing a zip file inside another zip, it doesn't deflate the zip but rather just inlines the content of the zip alongside the rest of the files, with some information that these entries are part of a sub zip file. Which makes sense, because applying compression to something that is already compressed is a waste of time.
Such a zip file is marked as CompressionMethodValues.Stored in the archive, which causes .NET to just return the original stream it read instead of wrapping it in a DeflateStream.
Source here: https://github.com/dotnet/corefx/blob/master/src/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs#L670
You could pass the stream into a ZipArchive if it's not a DeflateStream (if you are interested in the files inside):
var stream = entry.Open();
if (!(stream is DeflateStream))
{
    var subArchive = new ZipArchive(stream);
}
Or you can copy the stream to a FileStream (if you want to save it to disk):
var stream = entry.Open();
if (!(stream is DeflateStream))
{
    var fs = File.Create(Path.GetTempFileName());
    stream.CopyTo(fs);
    fs.Close();
}
Or copy to any stream you are interested in using.
Note: This is also how .NET 4.6 behaves
I'm testing how to upload to AWS using the SDK with a sample .txt file from a web app. The file uploads to the bucket, but the file downloaded from the bucket is just an empty Notepad document without the text from the original upload. I'm new to working with streams, so I'm not sure what could be wrong here. Does anyone see why the data wouldn't be sent in the transfer request? Thanks in advance!
using (var client = new AmazonS3Client(Amazon.RegionEndpoint.USWest1))
{
    // Save file to bucket
    using (FileStream txtFileStream = (FileStream)UploadedHttpFileBase.InputStream)
    {
        try
        {
            TransferUtility fileTransferUtility = new TransferUtility();
            fileTransferUtility.Upload(txtFileStream, bucketLocation,
                UploadedHttpFileBase.FileName);
        }
        catch (Exception e)
        {
            e.Message.ToString();
        }
    }
}
EDIT:
Both TransferUtility and PutObjectRequest/PutObjectResponse/AmazonS3Client.PutObject saved a blank text file. Then, after having some trouble instantiating a new FileStream, a MemoryStream used after resetting the starting position to zero still saved a blank text file. Any ideas?
New Code:
using (var client = new AmazonS3Client(Amazon.RegionEndpoint.USWest1))
{
    Stream saveableStream = new MemoryStream();
    using (Stream source = (Stream)UploadedHttpFileBase.InputStream)
    {
        source.Position = 0;
        source.CopyTo(saveableStream);
    }

    // Save file to bucket
    try
    {
        PutObjectRequest request = new PutObjectRequest
        {
            BucketName = bucketLocation,
            Key = UploadedHttpFileBase.FileName,
            InputStream = saveableStream
        };
        PutObjectResponse response = client.PutObject(request);
    }
    catch (Exception e)
    {
        e.Message.ToString();
    }
}
Most probably TransferUtility doesn't work well with temporary upload files. Try copying your input stream somewhere first (e.g. into another, not-so-temporary file, or even a MemoryStream if you're sure it won't give you an OutOfMemoryException at some point). Another option is to get rid of TransferUtility and use the low-level AmazonS3Client.PutObject, which gives you finer control over the stream's lifetime (and don't forget that you'll need to implement some retrying, as the S3 API is prone to returning random temporary errors).
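A rough sketch of that low-level approach, adapted from the question's code; note the Position reset on the copy before the put:

// Copy the upload into a MemoryStream we fully control, then use the
// low-level PutObject call (bucket name and key as in the question).
using (var client = new AmazonS3Client(Amazon.RegionEndpoint.USWest1))
using (var buffer = new MemoryStream())
{
    UploadedHttpFileBase.InputStream.CopyTo(buffer);
    buffer.Position = 0; // rewind, or S3 receives zero bytes

    var request = new PutObjectRequest
    {
        BucketName = bucketLocation,
        Key = UploadedHttpFileBase.FileName,
        InputStream = buffer
    };
    PutObjectResponse response = client.PutObject(request);
}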
The answer had something to do with nesting, which is still a little beyond my understanding, and not because the code posted here was inherently wrong. This code came after an initial StreamReader that checked the first line of the text file to determine whether or not to save the file. After moving the upload code out of the while loop doing the ReadLines, the upload worked. Everything works as it's supposed to now that the validation is reorganized so there's no need for the nested Stream or MemoryStream.
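For anyone who lands on the same problem, a sketch of what that reorganization might look like; IsValidHeader stands in for the real first-line check:

// Buffer the upload once, validate the first line, rewind, and only
// then hand the stream to S3.
var buffer = new MemoryStream();
UploadedHttpFileBase.InputStream.CopyTo(buffer);
buffer.Position = 0;

string firstLine;
// leaveOpen: true so disposing the reader does not close the buffer
using (var reader = new StreamReader(buffer, Encoding.UTF8, true, 1024, leaveOpen: true))
{
    firstLine = reader.ReadLine();
}

if (IsValidHeader(firstLine)) // placeholder for the real validation
{
    buffer.Position = 0; // rewind again; the reader advanced the stream
    fileTransferUtility.Upload(buffer, bucketLocation, UploadedHttpFileBase.FileName);
}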
I'm writing an interface to a web service where we need to upload configuration files. The documentation only provides a sample in C#.net which I am not familiar with. I'm trying to implement this in PHP.
Can someone familiar with both languages point me in the right direction? I can figure out all the basics, but I'm trying to find suitable PHP replacements for the FileStream, ReadBytes, and UploadDataFile calls. I believe the RecService object contains the URL for the web service. Thanks for your help!
private void UploadFiles() {
    clientAlias = "<yourClientAlias>";
    string filePath = "<pathToYourDataFiles>";
    string[] fileList = { "Config.txt", "ProductDetails.txt", "BrandNames.txt", "CategoryNames.txt", "ProductsSoldOut.txt", "Sales.txt" };
    RecommendClient RecService = new RecommendClient();
    for (int i = 0; i < fileList.Length; i++) {
        bool lastFile = (i == fileList.Length - 1); // start generator after last file
        try {
            string fileName = filePath + fileList[i];
            if (!File.Exists(fileName))
                continue; // file not found
            // set up a file stream and binary reader for the selected file
            // and convert it to a byte array
            FileStream fStream = new FileStream(fileName, FileMode.Open, FileAccess.Read);
            BinaryReader br = new BinaryReader(fStream);
            byte[] data = br.ReadBytes((int)fStream.Length);
            br.Close();
            // pass the byte array to the web service
            string result = RecService.UploadDataFile(clientAlias, fileList[i], data, lastFile);
            fStream.Close();
            fStream.Dispose();
        } catch (Exception ex) {
            // log an error message
        }
    }
}
For reading files, both on the local system and remotely over HTTP, you can use file_get_contents.
For POSTing to a web service, you should probably use cURL. This article looks like a pretty good explanation of how to go about it.