I'm trying to pull some info off a Korean website, make those Korean characters usable, and then put the info in a text file or the like. My idea was to create a kind of reference table (that would be done quickly, as there aren't too many sets of data needing that treatment, roughly 200).
My questions are:
first: is that actually a solid idea, or is there a better or easier solution?
second: what format would I want to use for such a table/sheet? CSV, XML?
So far I'm getting the info via HtmlAgilityPack's XML/HTML parsing, which works quite well. Any help is appreciated; if you need any of my code, let me know and I'll edit it in.
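For illustration, extraction of that sort with HtmlAgilityPack might look roughly like this (the URL, XPath, and output path are placeholders, not the asker's actual code); writing with an explicit UTF-8 encoding keeps the Korean characters intact:

using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class Extractor
{
    static void Main()
    {
        // Load the page (placeholder URL) and grab the node holding the Korean text.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.kr/page");
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='item-name']"); // hypothetical XPath

        // Decode any HTML entities and append to a UTF-8 text file so the characters survive.
        string koreanText = HtmlEntity.DeEntitize(node.InnerText).Trim();
        File.AppendAllText("output.txt", koreanText + Environment.NewLine, Encoding.UTF8);
    }
}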
.csv reads slightly faster than .txt, but with only around 200 entries the difference is insignificant.
Have you considered using a resource file? They are meant for exactly what you are doing, and the appropriate translations will be loaded based on the default culture set on the user's PC.
http://msdn.microsoft.com/en-us/library/y99d1cd3(v=vs.110).aspx
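As a rough sketch of that approach (the resource name and key below are made up): add a Strings.resx with your default values and a Strings.ko-KR.resx with the Korean ones, and the runtime picks the right set from the current UI culture.

using System;
using System.Globalization;
using System.Resources;
using System.Threading;

class Demo
{
    static void Main()
    {
        // Force Korean for the demo; normally this comes from the user's PC settings.
        Thread.CurrentThread.CurrentUICulture = new CultureInfo("ko-KR");

        // "MyApp.Strings" is the default namespace plus the .resx base name (placeholder).
        var rm = new ResourceManager("MyApp.Strings", typeof(Demo).Assembly);

        // Looks in Strings.ko-KR.resx first, then falls back to Strings.resx.
        Console.WriteLine(rm.GetString("ItemName"));
    }
}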
I am currently developing a word-completion application in C# and, after getting the UI up and running, keyboard hooks set, and other things of that nature, I came to the realization that I need a word list. The only issue is, I can't seem to find one with the appropriate information. I also don't want to spend an entire week formatting and gathering a word list by hand.
The information I want is something like "TheWord, The definition, verb/etc."
So it hit me: why not download a basic word list with nothing but words (already did this; there are about 109,523 words), write a program that iterates through every word, connects to the internet, retrieves the data (definition etc.) from some arbitrary site, and creates XML data from that information? It could be 100% automated, and I would only have to wait maybe an hour, depending on my internet connection speed.
This, however, brought me to a few questions.
How should I connect to a site to look up these words? << This is my actual question.
How would I read this information from the website?
Would I piss off my ISP or the website for that matter?
Is this a really bad idea? Lol.
How do you guys think I should go about this?
EDIT
Someone noticed that Dictionary.com uses the word as a suffix in the URL. This will make it easy to iterate through the word file. I also see that the webpage is stored in XHTML (or maybe just HTML). Here is the source for the word "Cat": http://pastebin.com/hjZj6AC1
For what you marked as your actual question: you just need to download the data from the website and find what you need.
A great tool for this is CsQuery, which allows you to use jQuery selectors.
You could do something like this:
var dom = CQ.CreateFromUrl("http://www.jquery.com");
string definition = dom.Select(".definitionDiv").Text();
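Putting the pieces together, a rough sketch of the whole loop might look like this (the base URL and the .definitionDiv selector are guesses, not the site's actual markup, so check the page source and the site's terms first):

using System;
using System.IO;
using System.Xml.Linq;
using CsQuery;

class WordScraper
{
    static void Main()
    {
        var root = new XElement("words");

        foreach (string word in File.ReadLines("wordlist.txt"))
        {
            // Dictionary.com-style URL with the word as a suffix (placeholder base URL).
            CQ dom = CQ.CreateFromUrl("http://dictionary.reference.com/browse/" + word);

            // Hypothetical selector for the element holding the definition.
            string definition = dom.Select(".definitionDiv").Text().Trim();

            root.Add(new XElement("word",
                new XAttribute("text", word),
                new XElement("definition", definition)));
        }

        root.Save("words.xml");
    }
}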
When I try to save:
7 test úüãáâàçéêíõóô áéíñóúü¿¡
.. to my database, it saves as:
7 test ������������� ���������
see row 24.
However, as you can see in the DB, I'm able to store characters like this; see row 23, which I updated manually.
I read:
ASP.NET-Saving Special Characters to Database
.. however, I cannot HTML-encode everything. Are there any other options here?
You need to work out where things are going wrong. For example, you might have all these steps to consider:
Is your schema set up correctly?
Are you receiving data correctly from the client?
Are you processing the data correctly?
Are you saving the data correctly?
Are you fetching the data correctly?
Are you serving the data correctly?
It's vital that you work out a way of diagnosing each step independently. Log the values at every point, in a way which you can prove can handle non-ASCII characters. Then you can see exactly where your data is being corrupted.
If it's not something within ASP.NET itself, I'd then suggest you create a short but complete console application which does nothing but try to save the data. You can then concentrate on getting that working with none of the friction you get through writing a web app... and you can also post that program here, so we can help you.
I am assuming that is either a VARCHAR or TEXT column? Try using NVARCHAR or NTEXT.
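Tying the two answers together, a minimal console sketch of that round trip might look like this (SQL Server assumed; the table, column, identity Id, and connection string are placeholders). An NVARCHAR column plus a parameter typed as NVarChar keeps the Unicode characters intact end to end:

using System;
using System.Data;
using System.Data.SqlClient;

class UnicodeSaveTest
{
    static void Main()
    {
        const string text = "test úüãáâàçéêíõóô áéíñóúü¿¡";

        using (var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
        {
            conn.Open();

            // Insert the value and read it straight back in one batch.
            using (var cmd = new SqlCommand(
                "INSERT INTO Comments (Body) VALUES (@body); " +
                "SELECT Body FROM Comments WHERE Id = SCOPE_IDENTITY();", conn))
            {
                cmd.Parameters.Add("@body", SqlDbType.NVarChar, 4000).Value = text;
                string roundTripped = (string)cmd.ExecuteScalar();

                // If this prints "corrupted", the problem is before or after the database, not in it.
                Console.WriteLine(roundTripped == text ? "OK" : "corrupted");
            }
        }
    }
}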
I am trying to build a translation assistant which can read in a compiled C# application (.exe) and display the forms from the EXE individually, along with a table next to each form containing an English column that shows the English words currently on display and another column for the value, which a translator can enter. Once the translations are complete, the translator can export them as a .resx file to add to a project, and also as an Excel file for record-keeping purposes.
I am new to C# and hence am not sure if my structure is correct. I have designed it so that a DLL is inserted into the .exe during compilation, and then, using this DLL, the translation application can extract the strings. This works for most strings, but it gets stuck where several strings can appear in the same textbox at different times [e.g. disconnected, connected, etc.]. I have tried searching everywhere, but I am not able to find information on how to pull out all strings from an application and identify which form they belong to.
The other issue I am facing is actually displaying the translated strings. The application I am building would benefit greatly if it could show an example of how the translated strings would look, as translations in some languages can be excessively long. But I have found that I am only able to read in aspects of the compiled application and create an instance of it; I am not able to translate it.
I am reading in the EXE using Reflection, and have understood from what I read online that I need to use Reflection.Emit to modify the form. But I am finding that every string identified from the form is extracted as an instance, so changing the string only changes that instance of the string and not the instance of the form itself; hence I cannot show a correct display.
I have been trying for three weeks to solve these last two questions. Thanks in advance for helping me solve this.
I don't think you can find a general solution to your problem with the texts that may appear in the textbox. Here is why:
If the texts are in the resource file, you could read them, but you still don't know where they are used. You would need to do a complex analysis of the source code to know where the text is displayed. Just imagine this little scenario:
textBox.Text = GetCorrectText(connection.State);
GetCorrectText could look like this:
string GetCorrectText(ConnectionState state)
{
    return string.Format(Resources.ConnectionState, state);
}
Resources.ConnectionState might be "The connection is in the state {0}".
It's a simple example, but you would need to know or extract a lot of things:
The Text property of the TextBox class is the string that is shown to the user.
The method GetCorrectText returns the text, so you need to parse it.
The method string.Format returns the text. Now you would either need to hard-code the fact that string.Format uses its first parameter as the template for the text that is displayed, or you would have to parse string.Format itself to learn that.
The example shows something else: You wouldn't be able to translate the whole string that is being displayed, because part of it is the name of the enum value.
What I want to show you is that you need to make trade-offs.
I'm experimenting a bit with textual comparison/basic plagiarism detection, and want to try this on a website-to-website basis. However, I'm a bit stuck in finding a proper way to process the text.
How would you process and compare the content of two websites for plagiarism?
I'm thinking something like this pseudo-code:
// extract text
foreach website in websites
    crawl website - store structure so pages are only scanned once
    extract text blocks from all pages - store them in a list
// compare
foreach text in website1.textlist
    compare with all text in website2.textlist
I realize that this solution could very quickly accumulate a lot of data, so it might only be possible to make it work with very small websites.
I haven't decided on the actual text comparison algorithm yet, but right now I'm more interested in getting the actual process algorithm working first.
I'm thinking it would be a good idea to extract all text as individual text pieces (from paragraphs, tables, headers and so on), as text can move around on pages.
I'm implementing this in C# (maybe ASP.NET).
I'm very interested in any input or advice you might have, so please shoot! :)
My approach to this problem would be to google for specific, fairly unique blocks of text whose copyright you are trying to protect.
Having said that, if you want to build your own solution, here are some comments:
Respect robots.txt. If they have marked the site as do-not-crawl, chances are they are not trying to profit from your content anyway.
You will need to refresh the site structure you have stored from time-to-time as websites change.
You will need to properly separate text from HTML tags and JavaScript.
You will essentially need to do a full text search in the entire text of the page (with tags/Script removed) for the text you wish to protect. There are good, published algorithms for this.
You're probably going to be more interested in fragment detection. For example, lots of pages will have the word "home" on them and you don't care, but it's fairly unlikely that very many pages will have exactly the same words across the entire page. So you probably want to compare and report on pages that have exact matches of length 4, 5, 6, 7, 8, etc. words, with counts for each length. Assign a score, weight them, and if you exceed your "magic number", report the suspected xeroxers.
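A bare-bones sketch of that fragment idea, just to make it concrete (plain C# with made-up sample strings): split each page's text into word n-grams and count exact overlaps per length.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class FragmentDetector
{
    // Every run of 'length' consecutive words, lower-cased, as a single string.
    static IEnumerable<string> Shingles(string text, int length)
    {
        string[] words = Regex.Split(text.ToLowerInvariant(), @"\W+")
                              .Where(w => w.Length > 0)
                              .ToArray();
        for (int i = 0; i + length <= words.Length; i++)
            yield return string.Join(" ", words, i, length);
    }

    static void Main()
    {
        string pageA = "the quick brown fox jumps over the lazy dog";
        string pageB = "a quick brown fox jumps over a sleeping dog";

        for (int n = 4; n <= 8; n++)
        {
            var fromA = new HashSet<string>(Shingles(pageA, n));
            int matches = Shingles(pageB, n).Count(fromA.Contains);
            Console.WriteLine("{0}-word fragments in common: {1}", n, matches);
        }
    }
}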
For C#, you can use the WebBrowser control to get a page and fairly easily get its text. Sorry, no code sample handy to copy/paste, but MSDN usually has pretty good samples.
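For illustration, the usual WebBrowser pattern looks roughly like this (a sketch only, assuming a Windows Forms reference and an STA thread; InnerText gives the rendered text with tags and scripts stripped):

using System;
using System.Windows.Forms;

static class PageTextGrabber
{
    [STAThread]
    static void Main()
    {
        using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
        {
            browser.Navigate("http://example.com");

            // Pump Windows messages until the document has finished loading.
            while (browser.ReadyState != WebBrowserReadyState.Complete)
                Application.DoEvents();

            Console.WriteLine(browser.Document.Body.InnerText);
        }
    }
}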
In my web application I am developing comment functionality where users can comment. The problem I am facing is that I want to allow simple HTML tags in the comment box: tags like <b>, <strong>, <i>, <em>, <u>, etc., that are normally allowed in a commenting box. I also want that when the user presses Enter, it is automatically converted into breaks (<br /> tags) and stored in the database, so that when I display the comments on the web page they look the way the user entered them.
Can you please tell me how to make sure the user entered only the allowed set of HTML tags, how to convert line breaks into <br /> tags, and how to store the result in the database?
Or, if anyone has a better idea or suggestion for implementing this kind of functionality, I'm open to it. I am using ASP.NET 2.0 (C#).
I noticed that StackOverflow.com does the same thing on profile editing: when we edit our profile, below the "About Me" field the line "basic HTML allowed" is written. I want almost the same functionality.
I don't have a C# specific answer for you, but you can go about it a few different ways. One is to let the user input whatever they want, then you run a filter over it to strip out the "bad" html. There are numerous open source filters that do this for PHP, Python, etc. In general, it's a pretty difficult problem, and it's best to let some well developed 3rd party code do this rather than write it yourself.
Another way to handle it is to allow the user to enter comments in some kind of simpler markup language like BBCode, Textile, or Markdown (stackoverflow is using Markdown), perhaps in conjunction with a nice Javascript editor. You then run the user's text through a processor for one of these markup languages to get the HTML. You can usually obtain implementations of these processors for whatever language you are using. These processors usually strip out the "bad" HTML.
It's rather "simple" to do that in PHP and Python due to the large number of built-in functions. I am still learning C#, lol, and haven't yet come across the equivalent function; chances are that it exists and all you need to do is search for it. I mean a function that can take the user input, search for the allowed tags (which are in an array, of course), replace the <> with something else like [], and then use a function to escape the other HTML tags. In PHP we use htmlentities().
Something like
$txt = $_POST['comment'];
// Swap the allowed tags for bracket placeholders so they survive the escaping step
$txt = str_replace(array("<b>", "</b>"), array("[b]", "[/b]"), $txt);
// Escape everything else
$securetxt = htmlentities($txt);
// Turn the placeholders back into real tags
$finaltxt = str_replace(array("[b]", "[/b]"), array("<b>", "</b>"), $securetxt);
// Now save to DB
I'm not sure, but I think you have to escape HTML characters when inserting into the database and, when retrieving, echo them unescaped so the browser can render them as HTML.
I don't know ASP.NET, but in PHP there's an easy function, strip_tags, that lets you add exceptions (in your case b, em, etc.). If there's nothing like that in C#, you can write a regular expression that strips out all tags except the allowed ones, but chances are such an expression already exists, so it should be easy to find.
Replacing \n (or something similar) with <br /> shouldn't be a problem either with a simple search and replace.
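A minimal C# sketch of the general idea (an assumption, not a vetted sanitizer): HTML-encode everything first, then re-enable a fixed set of bare tags and convert newlines. Because only the exact bare tags are decoded, anything with attributes stays escaped. HttpUtility lives in System.Web.

using System.Web;

static class CommentFormatter
{
    static readonly string[] AllowedTags = { "b", "strong", "i", "em", "u" };

    public static string ToSafeHtml(string input)
    {
        // Escape everything the user typed.
        string encoded = HttpUtility.HtmlEncode(input);

        // Turn the encoded forms of the allowed bare tags back into real tags.
        foreach (string tag in AllowedTags)
        {
            encoded = encoded.Replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                             .Replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }

        // Convert the user's line breaks into <br /> so the comment keeps its shape.
        return encoded.Replace("\r\n", "<br />").Replace("\n", "<br />");
    }
}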
This is a dangerous road to go down. You might think you can write some awesome regexes, or find someone who can help you with it, but sanitizing SOME markup and leaving the rest alone is just crazy talk.
I highly recommend you look into BBCode or another token system. Even something untokenized, such as what SO uses, is probably a much better solution.