Extract content from DBPedia's big dump file in .NET

I want to extract the labels, abstracts, categories and relevant dates for each article from a DBpedia dump file.
I'm using dotNetRDF and I want to save the extracted data to an MS SQL database (I don't want to use a triple store like Virtuoso).
Because of the size of the dump file, I can't load it into memory.
Is there any way to extract the statements? The only approach I can think of is to split the dump file into smaller chunks. Is that the only solution?

Actually, everything in dotNetRDF is designed to support streaming parsing. The most common use case happens to be loading data into our in-memory structures, but even that uses the streaming parser subsystem under the hood.
See the Advanced Parsing section of the Reading RDF documentation, which introduces the Handlers API. This API gives users complete control over what happens to the data as the parser produces it, so you can write a custom handler that receives each triple as it is parsed and puts it into your database.
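
A custom handler along those lines might look like the sketch below. It assumes the dotNetRDF 2.x Handlers API (`BaseRdfHandler`, `HandleTripleInternal`, `AcceptsAll`); newer major versions may require additional overrides. The class name `LabelHandler`, the file name `labels_en.nt` and the `Console.WriteLine` stand-in for the database insert are all made up for illustration.

```csharp
using System;
using VDS.RDF;
using VDS.RDF.Parsing;
using VDS.RDF.Parsing.Handlers;

// Hypothetical handler: receives each triple as the parser emits it,
// so the dump file is never held in memory as a whole.
public class LabelHandler : BaseRdfHandler
{
    private const string RdfsLabel = "http://www.w3.org/2000/01/rdf-schema#label";

    protected override bool HandleTripleInternal(Triple t)
    {
        if (t.Predicate is IUriNode p && p.Uri.AbsoluteUri == RdfsLabel
            && t.Object is ILiteralNode label)
        {
            // Assumption: replace this line with a (batched) INSERT into SQL Server.
            Console.WriteLine(t.Subject + " -> " + label.Value);
        }
        return true; // true = keep parsing
    }

    // This handler never asks the parser to stop early.
    public override bool AcceptsAll => true;
}

public static class Program
{
    public static void Main()
    {
        // "labels_en.nt" is a placeholder for one of the N-Triples dump files.
        new NTriplesParser().Load(new LabelHandler(), "labels_en.nt");
    }
}
```

Batching the inserts (for example through `SqlBulkCopy`) rather than issuing one statement per triple is probably worthwhile for a dump of this size.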

Related

Generating report or spreadsheet from simple XML listing of data

I have XML objects that hold form data from a web application. These are as simple as can be for an xml format, a demonstrative example would be:
<data>
<patientFirstName>Bob</patientFirstName>
<patientLastName>Bobson</patientLastName>
</data>
and unfortunately due to humans beyond my control, roughly 150 other potential fields for the current "document".
The primary goal is to transfer this data from XML to a more "human-friendly" format such as a spreadsheet or PDF.
I have tried using a DataSet as the source for a DataGrid and producing an HTML table. This works in a browser, and technically loads in Excel, but Excel gives a file type error because the file is not in Excel format; it is literally an HTML table.
Is there any way to use the .NET Framework, or other MS objects, whether managed or (preferably) unmanaged, to take XML or a DataSet and turn it into a file format that typical office applications can use?
Is there a way to use SSRS without SQL queries or files? My big issue is that this is a small information system that does many things dynamically. I have seen examples of creating Excel spreadsheets using a file on disk, but it is 2019: I will not allow my code to create or read files. The concept of a file is quite obsolete and a major source of security dilemmas.
In a perfect world I would want to create a decent looking report form template and simply bind certain nodes to their respective node in the XML. This would seem like something that many people would want to do, and as if it would be easy to implement, but search as much as I may and I find 99 terrible ways to pretend to make an Excel file and not 1 decent one. SSRS would seem like a lovely option if I could request to use a premade template and pump XML into it or a DataSet from a middleware server - though my issue is that it seems to be dumbed-down to the point it is designed to do ALL the work in some not-very-flexible manner.
If all else fails, might there be some MS functionality for pagination that I can use to create my own mini-report generator? I feel I may be forced to..
Please understand I am not looking to hand-parse markup in C# - in unmanaged C++ I would consider it, but C# becomes very slow when you start doing things "by hand" as opposed to using objects and their methods.
.net 4.5+

C# My own file format

I'm looking to make my own file format.
The format should contain pictures, PDFs and other files.
I also need to know how to write a packer/unpacker for this format, so that I can pack files into it, unpack files from it, and read pictures from it into, for example, picture boxes on my WinForm.
I've searched but didn't really find what I'm looking for.
I hope someone can help me. Thank you.

Zip is an excellent choice, because you can encrypt the file and, in some cases (text and other uncompressed content), reduce the file size. But if you want to create your own file format, you can easily decide the rules for storage and ordering inside the file and then serialize the information into it, for example by object serialization or by writing the binary data to the file object by object.
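
As a rough sketch of the Zip route, the code below packs a few named byte arrays into one archive and reads a single entry back. It only uses `System.IO.Compression.ZipArchive` from the framework; the class and method names (`ZipContainer`, `Pack`, `Unpack`) are made up for this example, and encryption is not covered because that API does not provide it.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ZipContainer
{
    // Packs named byte payloads into a single ZIP archive held in memory.
    public static byte[] Pack(params (string Name, byte[] Data)[] files)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer is still readable after the archive is disposed.
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
            {
                foreach (var file in files)
                {
                    ZipArchiveEntry entry = zip.CreateEntry(file.Name);
                    using (Stream s = entry.Open())
                        s.Write(file.Data, 0, file.Data.Length);
                }
            }
            return buffer.ToArray();
        }
    }

    // Reads one entry back out of the archive by name.
    public static byte[] Unpack(byte[] archive, string name)
    {
        using (var zip = new ZipArchive(new MemoryStream(archive), ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = zip.GetEntry(name);
            if (entry == null) throw new FileNotFoundException("No such entry: " + name);
            using (Stream s = entry.Open())
            using (var result = new MemoryStream())
            {
                s.CopyTo(result);
                return result.ToArray();
            }
        }
    }
}
```

In a WinForms app the unpacked bytes of a picture could then be shown with something like `pictureBox.Image = Image.FromStream(new MemoryStream(bytes))`.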

If you really want to write your own file format, then I would suggest one of two things. One, you could do it entirely in binary, at which point you would want a 'chunk' format. A chunk format basically puts a header in front of each subsection. The header contains the size of the header itself as well as the size of the payload. Create a serialization class for your header, then add the bytes of your payload to the file stream. It is actually pretty easy to do.
The second (and easier) way to do this would be to create an XML format. Create a master class for your format, then add all of the data as collections of subclasses under it. Once you have that, use any of the .NET XML serialization classes to serialize it out to disk.
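
A minimal sketch of the chunk idea, under a simplifying assumption: the header here is a fixed 8 bytes (a 4-character ASCII tag plus a 4-byte payload length), so the header size is implicit rather than stored. `ChunkFile`, `WriteChunk` and `ReadChunks` are illustrative names, and the reader assumes a seekable stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkFile
{
    // Writes one chunk: 4-byte ASCII tag, 4-byte payload length, then the payload.
    public static void WriteChunk(BinaryWriter writer, string tag, byte[] payload)
    {
        byte[] tagBytes = Encoding.ASCII.GetBytes(tag);
        if (tagBytes.Length != 4)
            throw new ArgumentException("Tag must be exactly 4 ASCII characters.", nameof(tag));
        writer.Write(tagBytes);
        writer.Write(payload.Length); // Int32, little-endian
        writer.Write(payload);
    }

    // Reads chunks one after another until the end of a seekable stream.
    public static IEnumerable<KeyValuePair<string, byte[]>> ReadChunks(Stream input)
    {
        using (var reader = new BinaryReader(input, Encoding.ASCII, leaveOpen: true))
        {
            while (input.Position < input.Length)
            {
                string tag = Encoding.ASCII.GetString(reader.ReadBytes(4));
                int length = reader.ReadInt32();
                yield return new KeyValuePair<string, byte[]>(tag, reader.ReadBytes(length));
            }
        }
    }
}
```

Because every chunk announces its own length, a reader that does not recognise a tag can skip that payload and carry on with the next chunk.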

You can also use SQLite for this. It gives you the power of a DBMS without needing a server, and it is a popular solution for this kind of problem.
System.Data.SQLite is an ADO.NET adapter for SQLite.

What is the best / fastest way to export large set of data from C# to excel

I have code that uses the OpenXML library to export data.
I have 20,000 rows and 22 columns and it takes ages (about 10 minutes).
Is there a faster way to export data from C# to Excel? I am doing this from an ASP.NET MVC app and many people's browsers are timing out.

Assuming 20,000 rows and 22 columns with about 100 bytes per cell, that makes about 44 MB (20,000 × 22 × 100 bytes, roughly 42 MiB) of data alone. Add the XML tags and the formatting, and I'd say you end up zipping about 100 MB of data (an .xlsx file is nothing but several zipped XML files).
Of course this takes a while, and so does fetching the data.
I recommend you use EPPlus (Excel Package Plus) instead of the Office OpenXML development kit.
http://epplus.codeplex.com/
There's probably a bug/performance-issue in the write-in-a-hurry-and-hope-that-it-doesnt-blow-up-too-soon Microsoft code.

CSV. It is a plain text file, but it can be opened by any version of Excel.
No doubt it is an easier way to export data to Excel. A lot of websites provide data export as CSV.
All you need to do is add a comma (,) to separate the values and a line break to separate the records (a value that itself contains a comma, a double quote or a line break has to be wrapped in double quotes). Building the CSV file takes no extra resources, so it is quite fast.
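
A small sketch of building one CSV record, including the quoting that fields containing commas, quotes or line breaks need. The `Csv` class and its method names are made up for this example; the quoting rule is the usual one of wrapping the field in double quotes and doubling any embedded quote.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Csv
{
    // Quotes a field only when it contains a comma, a double quote or a line break.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the escaped fields of one record with commas (no trailing line break).
    public static string Line(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(Escape));
    }
}
```

For example, `Csv.Line(new[] { "Bob", "Bobson, Jr." })` gives `Bob,"Bobson, Jr."`.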

I wound up using an open source solution called ClosedXML that worked great.

Depending on what version of Excel you are targeting, you could expose the data as an OData service, which Excel 2010 can consume natively and which will handle the downloading and formatting for you.

I am assuming that this data is something that needs to be completely sent to the client and has already been pre-filtered in some fashion, but still needs to be sent back to the person who made the request.
In this case, you want to perform this particular operation 'asynchronously'. I'm not sure if this would fit your workflow, but say that a person requests this large XML formatted document. I would: a) queue another worker thread to kick off the generation of the document while returning a 'token' (perhaps a GUID) to the requester; b) return a link to a page where the requester can click on the link (passing the token), allowing the page to look up the results.
If the thread has completed processing the document, it places it into a special folder with a unique name and adds the token to a database table with its document location. If the person requests that page, the token exists in the database and the document exists on the file system, they are allowed to click and download it through HTTP. If it does not exist, they are either told it does not exist or to wait for the results. (This message can be based on the time the request was received.)
If the person downloads the document successfully (and you can do this through script), you can remove the entry for the document with that token from the database and delete the file from the file system.
I hope I read this question correctly.
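
The token workflow above could be sketched roughly as follows. This is a simplified, in-memory stand-in: a `ConcurrentDictionary` replaces the database table and the special folder, and `ExportJobs`, `Start` and `TryTake` are hypothetical names. Background work started with `Task.Run` inside a web application can be lost when the process recycles, so treat this as an illustration of the flow rather than a production design.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class ExportJobs
{
    // token -> running or finished generation task (stands in for the database table)
    private readonly ConcurrentDictionary<Guid, Task<byte[]>> _jobs =
        new ConcurrentDictionary<Guid, Task<byte[]>>();

    // Kicks off generation on a worker thread and immediately returns a token.
    public Guid Start(Func<byte[]> generate)
    {
        Guid token = Guid.NewGuid();
        _jobs[token] = Task.Run(generate);
        return token;
    }

    // Returns true and the document once it is ready; the entry is removed on success,
    // mirroring the "delete after download" step.
    public bool TryTake(Guid token, out byte[] document)
    {
        document = null;
        if (_jobs.TryGetValue(token, out var job) && job.Status == TaskStatus.RanToCompletion)
        {
            _jobs.TryRemove(token, out _);
            document = job.Result;
            return true;
        }
        return false; // unknown token, still running, or failed
    }
}
```

The download page would call `TryTake` with the token from the link and either stream the document or tell the user to come back later.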

I have found that I can speed up exporting data from a database into an Excel spreadsheet by limiting the number of export operations. I found that by accumulating 100 lines of data before writing, the creation speed increased by a factor of at least 5-10x.

The most common mistake when exporting data is in the workflow:
Build Model
Build XML DOM
Save XML DOM to file
This workflow adds overhead because building the XML DOM takes time, the XML DOM is kept in memory together with the model, and only then is the whole lot written to a file.
A better way to handle this is to convert your model entry by entry directly to the target format and write it directly to a (buffered) file.
A format with low overhead that's fast to write and is readable by Excel is CSV (ok, it's legacy, it's awkward...).
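
A sketch of the entry-by-entry approach, assuming the model can be exposed as a lazily evaluated `IEnumerable<string[]>` and that the values need no CSV quoting. `StreamingExport` and its methods are illustrative names, and the 64 KB buffer size is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StreamingExport
{
    // Writes each row as soon as it is produced; nothing but the current row is kept in memory.
    public static void WriteRows(TextWriter output, IEnumerable<string[]> rows)
    {
        foreach (string[] row in rows)
            output.WriteLine(string.Join(",", row)); // assumes fields need no quoting
    }

    // Same thing, targeting a file through a buffered StreamWriter.
    public static void ExportToFile(string path, IEnumerable<string[]> rows)
    {
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(false), 64 * 1024))
            WriteRows(writer, rows);
    }
}
```

If `rows` comes from an iterator (`yield return`) over a data reader, the export never holds more than one record at a time. In a web application the `TextWriter` could just as well wrap the response stream.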

Options for header in raw byte file

I have a large raw data file (up to 1GB) which contains raw samples from a USB data logger.
I need to store extra information relating to the file (sample rate, description, trigger point, last seek position, etc.) and was looking into adding this as some sort of header.
The header should ideally be human readable and flexible, so I've so far ruled out some sort of binary serialization into a header.
I also want to avoid two separate files as they could end up separated when copied or backed up. I remembered somebody telling me that newer *.*x Microsoft Office documents are actually a number of files in a zip. Is there a simple way to achieve this? Could I still keep the quick seek times to the raw file?
Update
I started using the binary serializer and found it to be a pain. I ended up using the xml serializer as I'm more comfortable using it.
I reserve some space at the start of the file for the XML. Simple.
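
One way the reserved-space idea could look, as a sketch only: the XML is padded with spaces to a fixed block, so the raw samples always start at the same offset and seeking stays a simple multiplication. The 4096-byte `HeaderSize` and the names `LoggerFile`, `WriteHeader`, `ReadHeader` and `SampleOffset` are assumptions for this example, not details from the post.

```csharp
using System;
using System.IO;
using System.Text;

public static class LoggerFile
{
    public const int HeaderSize = 4096; // assumed size of the reserved header block

    // Overwrites the reserved block with the XML text, padded with spaces.
    public static void WriteHeader(Stream file, string xml)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(xml);
        if (bytes.Length > HeaderSize)
            throw new ArgumentException("Header does not fit in the reserved block.", nameof(xml));
        byte[] block = new byte[HeaderSize];
        for (int i = 0; i < block.Length; i++) block[i] = (byte)' ';
        Array.Copy(bytes, block, bytes.Length);
        file.Seek(0, SeekOrigin.Begin);
        file.Write(block, 0, block.Length);
    }

    // Reads the reserved block back and strips the padding.
    public static string ReadHeader(Stream file)
    {
        byte[] block = new byte[HeaderSize];
        file.Seek(0, SeekOrigin.Begin);
        int read = 0;
        while (read < HeaderSize)
        {
            int n = file.Read(block, read, HeaderSize - read);
            if (n == 0) break; // file shorter than the reserved block
            read += n;
        }
        return Encoding.UTF8.GetString(block, 0, read).TrimEnd(' ');
    }

    // Byte position of a sample: the raw data simply starts after the header.
    public static long SampleOffset(long sampleIndex, int bytesPerSample)
    {
        return HeaderSize + sampleIndex * bytesPerSample;
    }
}
```

Since the header has a fixed size, fields such as the last seek position can be rewritten in place without touching the sample data.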

When you say you want to make the header human readable, this suggests opening the file in a text editor. Do you really want to do this, considering the file size and (I'm assuming) that the remainder of the file is non-human-readable binary data? If so, just write the text header data to the start of the binary file. It will be visible when the file is opened but, of course, the remainder of the file will look like garbage.
You could create an uncompressed ZIP archive, which may allow you to seek directly to the binary data. See this for information on creating a ZIP archive: http://weblogs.asp.net/jgalloway/archive/2007/10/25/creating-zip-archives-in-net-without-an-external-library-like-sharpziplib.aspx

XML Database in C#.net

I am developing a WPF client program for some websites. It uses an XML database. I am new to XML. Would someone please explain how to create, append to (most important), edit, read and encrypt an XML file? It is a big question, I know, but it is urgent; I have to complete the work ASAP. I have searched the internet but am not finding correct information.

You should seriously consider using a DataSet within your application and loading your data from an XML file via DataSet.ReadXml. When you're done with your updates, write your changes using DataSet.WriteXml.
But you should also seriously consider not using XML as a database.
Here's an article on CodeProject that discusses using XML as a database:
Xml Database Demo
I know you tagged this question C# but unfortunately the demo app is written in VB.NET.
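
In C#, the DataSet round trip could be sketched like this. The table name `site`, the column `name` and the helper `XmlStore.Append` are invented for the example; note that "appending" here still means reading the whole file and writing it back.

```csharp
using System;
using System.Data;
using System.IO;

public static class XmlStore
{
    // Loads the file (if any), adds one row and writes everything back.
    public static void Append(string path, string siteName)
    {
        var data = new DataSet("sites");
        if (File.Exists(path))
            data.ReadXml(path); // the inline schema written below is read back as well

        DataTable table = data.Tables["site"]; // null when the table does not exist yet
        if (table == null)
        {
            table = data.Tables.Add("site");
            table.Columns.Add("name", typeof(string));
        }

        table.Rows.Add(siteName);
        data.WriteXml(path, XmlWriteMode.WriteSchema);
    }
}
```

Editing works the same way: load, change the rows in memory, then call `WriteXml` again.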

(in response to your comment on Gerri's answer)
XML is inherently not appendable. A valid XML document requires a single document element. In order to "append" you would need to be able to back up over the closing tag of the document element and overwrite it. The only option is to read in the entire document and write it back out again. Also, you may want to use XmlDocument or XDocument instead of XmlWriter, which is a horribly painful API when you don't need very fine-grained control over the behavior.
The fact is, XML makes a really terrible database format. There are other lightweight database solutions out there that don't require a database server.

Assuming your database is small enough that you can easily load it into memory:
Create classes that model your database.
Add DataContract attributes to them to indicate how you want them serialized.
Use DataContractSerializer to serialize your database to XML and then save it to disk.
Each time you update the database:
Create new file as .tmp
Delete any old file called .old
Rename .xml to .old
Rename new file from .tmp to .xml
When you go to load the file, if .xml is corrupt or missing, try .tmp
This will help you survive the inevitable corruption that will occur during writing when something goes wrong.
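
The file rotation described above could be sketched as follows. `SafeFile`, `Save` and `Load` are illustrative names, the content is passed in as an already serialized string (the `DataContractSerializer` step is left out), and `Load` only handles a missing .xml file; detecting a corrupt one would mean catching the deserialization failure and then trying .tmp.

```csharp
using System;
using System.IO;

public static class SafeFile
{
    // basePath is the file name without extension, e.g. "C:\data\db".
    public static void Save(string basePath, string content)
    {
        string xml = basePath + ".xml", tmp = basePath + ".tmp", old = basePath + ".old";

        File.WriteAllText(tmp, content);            // 1. write the new version completely
        if (File.Exists(old)) File.Delete(old);     // 2. delete the previous backup
        if (File.Exists(xml)) File.Move(xml, old);  // 3. the current file becomes the backup
        File.Move(tmp, xml);                        // 4. the new version becomes current
    }

    // Falls back to .tmp when .xml is missing (a crash between steps 3 and 4).
    public static string Load(string basePath)
    {
        foreach (string extension in new[] { ".xml", ".tmp" })
        {
            string path = basePath + extension;
            if (File.Exists(path)) return File.ReadAllText(path);
        }
        throw new FileNotFoundException("No database file found for " + basePath);
    }
}
```

At every point in `Save` at least one complete copy of the data exists on disk, which is what makes the sequence survivable.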

Because of the history of each database company coming up with a "standard" interface that no other company follows, XML has become the de facto way to transfer data between databases.
If this is the intended use, then it is fine, as it only has to write in this format some of the time. There is a lot to worry about when writing XML using .NET, as there are many ways to forget to finish the writing and leave tags open (always use using/flush/close). Warning: the more processing cores, the more often .NET screws up. Use Thread.BeginCriticalRegion()/Thread.EndCriticalRegion() if you have more than four real cores. Also, as suggested, it is best to save the earlier version as a .bak file or similar.
Of course if the XML standards could have a declaration of "document set" then we could append a document each time and life would be a lot easier.

Load your XML as an XmlDocument.
