Parsing JSON response using text summarization API, encoding error in response


I use the MeaningCloud service at https://www.meaningcloud.com/products/automatic-summarization
for text summarization. I am using .NET 5.
For example, I want to shorten this news article: https://e.vnexpress.net/news/business/economy/vn-index-rises-for-third-straight-session-4141865.html
string content = "..."; // long content of the news post.
var client = new RestClient("https://api.meaningcloud.com/summarization-1.0");
client.Timeout = -1;
var request = new RestRequest(Method.POST);
request.AddParameter("key", "25870359b682ec3c93f9becd850eb459"); // fake token because this content is public.
request.AddParameter("sentences", 4);
request.AddParameter("txt", JsonEncodedText.Encode(content));
IRestResponse response = client.Execute(request);
System.Threading.Thread.Sleep(3000);
var res = JObject.Parse(response.Content);
// Need to convert \r\n and \r\n\r\n to a space.
string short_content = res["summary"].ToString();
// SysUtil.StringEncodingConvert(short_content, "ISO-8859-1", "UTF-8");
string result = short_content.Replace(" [...] ", " ");
Input
The benchmark VN-Index saw steady growth throughout the day, gradually gaining a total of 10.23 points by the end of the session. The Ho Chi Minh Stock Exchange (HoSE), on which the index is based, saw 300 stocks gain and 78 lose. Total trading volume improved 48 percent over the previous session, reaching VND6.2 trillion ($269 million). The VN30-Index, a basket of HoSE’s 30 largest capped stocks, rose 1.63 percent, with 27 gaining and 2 losing. Its top gainers were SAB of Vietnam’s largest brewer Sabeco, up 4.8 percent, followed by VJC of budget airline Vietjet, up 2.8 percent, and MWG of electronics retailer Mobile World, up 2.2 percent. Of Vietnam’s biggest state-owned lenders by assets, BID of BIDV climbed 0.85 percent, VCB of Vietcombank 0.8 percent, and CTG of VietinBank 0.6 percent. HDB of HDBank and TCB of Techcombank led gains of private banks at 0.85 percent and 0.6 percent respectively. Other gainers included PNJ of Phu Nhuan Jewelry with 1.4 percent, HPG of steel producer Hoa Phat, 1.1 percent, and MSN of conglomerate Masan, 1 percent. The only two VN30 tickers that ended in the red were VIC of conglomerate Vingroup, down 1 percent, and PLX of fuel distributor Petrolimex, down 0.05 percent. The HNX-Index for stocks on the Hanoi Stock Exchange, home to mid and small caps, rose 1.35 percent, and the UPCoM-Index for stocks on the Unlisted Public Companies Market added 0.3 percent. Foreign investors turned net buyers to the tune of VND15.7 billion ($681,600), with buying pressure focused mainly on HPG and VHM of real estate giant Vinhomes.
Output after text summarization (4 sentences):
The benchmark VN-Index saw steady growth throughout the day, gradually gaining a total of 10.23 points by the end of the session. The VN30-Index, a basket of HoSE\u2019s 30 largest capped stocks, rose 63 percent, with 27 gaining and 2 losing. Of Vietnam\u2019s biggest state-owned lenders by assets, BID of BIDV climbed 0.85 percent, VCB of Vietcombank 0.8 percent, and CTG of VietinBank 0.6 percent. The HNX-Index for stocks on the Hanoi Stock Exchange, home to mid and small caps, rose 1.35 percent, and the UPCoM-Index for stocks on the Unlisted Public Companies Market added 0.3 percent.
I also tried this utility class:
using System;
namespace myproj.Controllers
{
public class SysUtil
{
public static String StringEncodingConvert(String strText, String strSrcEncoding, String strDestEncoding)
{
System.Text.Encoding srcEnc = System.Text.Encoding.GetEncoding(strSrcEncoding);
System.Text.Encoding destEnc = System.Text.Encoding.GetEncoding(strDestEncoding);
byte[] bData = srcEnc.GetBytes(strText);
byte[] bResult = System.Text.Encoding.Convert(srcEnc, destEnc, bData);
return destEnc.GetString(bResult);
}
}
}
but without success.
Even a direct replacement does not work:
string result2 = result.Replace("\u2019s", "'s");
The remaining problem:
\u2019s --> I need 's. How can I achieve this?

\u2019 is the Unicode escape for the right single quotation mark (a "smart quote"). Just replace it:
result2 = result.Replace('\u2019', '\'');
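If the character replacement has no effect, one possibility (an assumption, not confirmed in the question) is that the summary contains the literal six-character text \u2019 rather than the character itself, for example because JsonEncodedText.Encode escaped the input before it was sent. A minimal sketch that covers both cases; the class and method names are hypothetical:

```csharp
using System;

public static class QuoteFixer
{
    // Replaces the right single quotation mark (U+2019) with an ASCII apostrophe,
    // whether it arrives as the real character or as the literal text \u2019.
    public static string NormalizeApostrophes(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        // "\\u2019" is a backslash followed by "u2019": the unparsed escape sequence.
        string withoutLiteralEscapes = text.Replace("\\u2019", "'");

        // '\u2019' is the actual Unicode character.
        return withoutLiteralEscapes.Replace('\u2019', '\'');
    }
}
```

Calling QuoteFixer.NormalizeApostrophes(short_content) would then yield "HoSE's" for either form of the input.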

Related

How to get counted words in files in BODY field?

The following code counts the words in all ".sgm" files in a directory.
But I need to count only the words between the BODY tags in those ".sgm" files.
How can I do that?
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Serialization;
namespace Project2
{
class Program
{
static void Main(string[] args)
{
string[] parcesPlaces = new string[] { "west-germany", "usa", "france", "uk", "canada", "japan" };
DirectoryInfo filePaths = new DirectoryInfo(@"D:\project_IAD");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
List<TotalBody> allNeedBody = new List<TotalBody>();
foreach (FileInfo file in Files)
{
string fileContent = File.ReadAllText(file.FullName);
string fileContentCleared = ReplaceHexadecimalSymbols(fileContent);
string myRootedXml = "<root>" + fileContentCleared + "</root>";
root result = (root)XmlDeserializeFromString(myRootedXml, typeof(root));
Console.WriteLine(" Number of needed words: {0}", result.REUTERS.ToList().Count);
foreach (rootREUTERS rootREUTERS in result.REUTERS)
{
if (rootREUTERS.PLACES.Length != 1)
{
continue;
}
else if (!parcesPlaces.Contains(rootREUTERS.PLACES[0]))
{
continue;
}
else
{
if (rootREUTERS.TEXT.BODY != null)
{
allNeedBody.Add(new TotalBody(rootREUTERS.PLACES[0], rootREUTERS.TEXT.BODY));
}
else
{
continue;
}
}
}
}
Console.WriteLine("Total count words: ");
Console.WriteLine(allNeedBody.Count);
Console.ReadKey();
}
private static object XmlDeserializeFromString(string v, Type type)
{
object result = null;
using (TextReader reader = new StringReader(v))
{
result = new XmlSerializer(type).Deserialize(reader);
}
return result;
}
private static string ReplaceHexadecimalSymbols(string txt)
{
string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
return Regex.Replace(txt, r, "", RegexOptions.Compiled);
}
}
}
Example of text in file "reut2-000.sgm":
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES> <UNKNOWN> C T
f0704reute u f BC-BAHIA-COCOA-REVIEW 02-26
0105</UNKNOWN> <TEXT> <TITLE>BAHIA COCOA REVIEW</TITLE> <DATELINE>
SALVADOR, Feb 26 - </DATELINE><BODY>**Showers continued throughout the
week in the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao, although
normal humidity levels have not been restored, Comissaria Smith said
in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against
5.81 at the same stage last year. Again it seems that cocoa delivered earlier on consignment was included in the arrivals figures.
Comissaria Smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an
end. With total Bahia crop estimates around 6.4 mln bags and sales
standing at almost 6.2 mln there are a few hundred thousand bags still
in the hands of farmers, middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing dificulties in obtaining
+Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350 cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and only limited sales were booked for March shipment at 1,750 to 1,780 dlrs
per tonne to ports to be named.
New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under
New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.
Routine sales of butter were made. March/April sold at 4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas, Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for
Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible currency areas.
Liquor sales were limited with March/April selling at 2,325 and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New York July,
Aug/Sept at 2,400 dlrs and at 1.25 times New York Sept and Oct/Dec at
1.25 times New York Dec, Comissaria Smith said.
Total Bahia sales are currently estimated at 6.13 mln bags against the 1986/87 crop and 1.06 mln bags against the 1987/88 crop.
Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which
ends midday on February 27.** Reuter </BODY></TEXT> </REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5545" NEWID="2"> <DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS> <PLACES><D>usa</D></PLACES> <PEOPLE></PEOPLE>
<ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES>
<UNKNOWN> F Y f0708reute d f
BC-STANDARD-OIL-<SRD>-TO 02-26 0082</UNKNOWN>
I need to count the words only in the BODY fields (marked in bold in the example), ignoring stray characters and the like.
Example file for testing purposes.

From your question, it looks like you wrap the content in XML and deserialize it just to count words. That is fine if you need to collect the data, but if the goal is only to count the words between the BODY tags, it is much faster to parse the text directly and count on the fly.
My strategy is to take the substring that starts after <BODY> and ends before </BODY>, then count the words by splitting it.
Here is the solution:
DirectoryInfo filePaths = new DirectoryInfo(@"D:\Stackoverflow\SgmCount\docs");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
int wordCount = 0;
foreach (FileInfo file in Files)
{
string content = File.ReadAllText(file.FullName);
// Keep everything after the first opening tag (only the first BODY of each file is counted).
content = content.Substring(content.IndexOf("<BODY>", StringComparison.Ordinal) + "<BODY>".Length);
// Cut at the first closing tag.
content = content.Substring(0, content.IndexOf("</BODY>", StringComparison.Ordinal));
char[] delimiters = { ' ', '\r', '\n' };
// Accumulate across files instead of overwriting the previous count.
wordCount += content.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
}
Console.WriteLine($"Total count words: {wordCount} words");
This gives an output:
Total count words: 488 words
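The snippet above looks at only the first <BODY> in each file, while the sample file contains several REUTERS entries. A possible extension that walks through every BODY section might look like the sketch below. The class and method names are hypothetical, and "word" here simply means a whitespace-separated token, so punctuation attached to a word is counted with it.

```csharp
using System;
using System.IO;

public static class SgmWordCounter
{
    private static readonly char[] Delimiters = { ' ', '\t', '\r', '\n' };

    // Counts whitespace-separated tokens inside every <BODY>...</BODY> pair.
    public static int CountBodyWords(string content)
    {
        const string open = "<BODY>";
        const string close = "</BODY>";
        int total = 0;
        int start = content.IndexOf(open, StringComparison.Ordinal);
        while (start >= 0)
        {
            start += open.Length; // move past the opening tag
            int end = content.IndexOf(close, start, StringComparison.Ordinal);
            if (end < 0) break; // unterminated BODY: stop scanning
            total += content.Substring(start, end - start)
                .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
            start = content.IndexOf(open, end + close.Length, StringComparison.Ordinal);
        }
        return total;
    }

    // Sums the BODY word counts of all *.sgm files in a directory.
    public static int CountDirectory(string directory)
    {
        int total = 0;
        foreach (string path in Directory.GetFiles(directory, "*.sgm"))
        {
            total += CountBodyWords(File.ReadAllText(path));
        }
        return total;
    }
}
```

SgmWordCounter.CountDirectory takes the folder path and returns the combined count, which will generally be larger than the first-BODY-only figure.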

Cassandra C# Driver - SELECT Statement Execution Slow

I'm running a 3-node Cassandra 3.0.0 cluster running on AWS EC2 i3.large instances and I've been playing around with using the C# driver for Cassandra. Executing the following query (which is very simple) takes approximately 300 ms (to scan the single partition and return the top 100 rows).
var rs = session.Execute("SELECT col1, col6, col7 FROM breadcrumbs WHERE col1='samplepk' LIMIT 100;");
My data model is:
Column 1 = a 13-character string
Column 2 = a 23-character string
Column 3 = a date/time timestamp
Column 4 = a 4-digit integer
Column 5 = a 3 digit integer
Column 6 = a latitude value
Column 7 = a longitude value
Column 8 = a 15-digit double
Column 9 = a 15-digit double
I defined my primary key as Col1, col2.
My C# driver code is as follows:
Cluster cluster = Cluster.Builder().AddContactPoint(~~~~~IP Here~~~~).Build();
ISession session = cluster.Connect(~~~keyspacename~~~);
long ticks = DateTime.Now.Ticks;
var rs = session.Execute("SELECT col2, col6, col7 FROM breadcrumbs WHERE partitionkey=~targetkey~ LIMIT 100;");
Console.WriteLine((DateTime.Now.Ticks - ticks)/Math.Pow(10,4)+" ms");
Console.ReadKey();
Is that abnormally slow or are my expectations too high? If it is slow, does anyone have any ideas about what's causing it?
If I forgot to provide any pertinent details, please leave a comment :) .
Thanks in advance.

Regex for validation of many types of number

I'm new to regex and would like to know how to detect a number with a regex in C# and always display it in the format #,###
Ex : 2 000,000 into 2,000
Ex : 15 000.000 into 15,000
Ex : 6.700 into 6,700
Ex : .3.3.3 into 0,300
These are some of the examples I am using for validation.

As the comments suggest, the question is not very clear.
To get your examples working, you can use e.g.
(?:(?<int>\d+)[ .,]?|[.,])
(?<frac>\d+)?
(?:[ .,]\d+)*
to match the "integer part" and the "fractional part", separated by ., , or a space (weird, but that is what I read out of your examples: since 15 000.000 => 15,000 and 6.700 => 6,700, I assume a comma separator everywhere).
I'm pretty sure I did not get it right! At least not entirely. The examples you provide look like numbers with different thousands separators, but there seems to be no system.
However, this is what you match with the regular expression above:
int | sep | frac | anything else
----+-----+------+--------------
2   |     | 000  | ,000
15  |     | 000  | .000
6   | .   | 700  |
    | .   | 3    | .3.3
In addition, it matches numbers without fractional part.
In Detail
(?:(?<int>\d+)[ .,]?|[.,])
Match one or more digits and store them in a group named int. Optionally match a space, . or , thereafter.
OR
Match . or ,.
(?<frac>\d+)?
Optionally match the fraction part (one or more digits).
(?:[ .,]\d+)*
Match a space, . or , followed by one or more digits (repeat this zero or more times).
This last part prevents the trailing pieces of e.g. .3.3.3 from matching in subsequent calls.
Next
Then you can use a MatchEvaluator-Function (here in form of a delegate) to replace the values.
var rx = new Regex(@"
(?:(?<int>\d+)[ .,]?|[.,])
(?<frac>\d+)?
(?:[ .,]\d+)*
",
RegexOptions.IgnorePatternWhitespace
);
var deDE = new System.Globalization.CultureInfo("de-DE");
text = rx.Replace(text, delegate(Match match) {
int integral;
int fraction;
int fraclen = match.Groups["frac"].Length;
int.TryParse(match.Groups["int"].Value, out integral);
int.TryParse(match.Groups["frac"].Value, out fraction);
var val = integral + fraction / Math.Pow(10, fraclen);
return String.Format(deDE, "{0:0.000}", val);
});
The function is called for every match. Inside, I read out the groups, convert them into integers and then create the matched value with integral + fraction / Math.Pow(10, fraclen) (integral part + fraction part divided by 10^len where len is the string-length of the fraction part, thus "70" becomes 0.7 by calculating 70/10^2 == 70/100 == 0.7).
At the end, I return String.Format with CultureInfo de-DE. This is done because in Germany , is used as the decimal separator. There are others too, and there are many other ways to output such a number.
This is just an example.
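To try the approach on the four inputs from the question, the pieces can be wrapped into one self-contained helper. This is only a sketch: the class and method names are hypothetical, and it assumes the de-DE culture data is available on the machine.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberNormalizer
{
    // Same pattern as above; whitespace in the pattern is ignored,
    // except inside character classes, where the space stays literal.
    private static readonly Regex Pattern = new Regex(@"
        (?:(?<int>\d+)[ .,]?|[.,])
        (?<frac>\d+)?
        (?:[ .,]\d+)*
        ", RegexOptions.IgnorePatternWhitespace);

    private static readonly CultureInfo German = CultureInfo.GetCultureInfo("de-DE");

    public static string Normalize(string text)
    {
        return Pattern.Replace(text, match =>
        {
            // TryParse leaves the value at 0 when a group did not participate.
            int.TryParse(match.Groups["int"].Value, out int integral);
            int.TryParse(match.Groups["frac"].Value, out int fraction);
            int fracLen = match.Groups["frac"].Length;

            // integral part + fraction / 10^(number of fraction digits)
            double value = integral + fraction / Math.Pow(10, fracLen);
            return string.Format(German, "{0:0.000}", value);
        });
    }
}
```

With this helper, NumberNormalizer.Normalize("6.700") gives "6,700" and NumberNormalizer.Normalize(".3.3.3") gives "0,300".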

one decimal for string format

I have the numbers below. I want to show one digit after the decimal point. How do I format them?
2.85
2
1.99
I was using "{0:0.0}", but the data is shown like this:
2.9 //It should be 2.8
2.0 //It should be 2
2.0 //It should be 1.9

Try using "{0:0.#}" as the format string. However, that will only fix the .0. To fix the rounding to always round down, you might want to use:
string s = (Math.Floor(value * 10) / 10).ToString("0.#");

Decimal[] decimals = { new Decimal(2.85), new Decimal(2), new Decimal(1.99) };
foreach (var x in decimals)
{
Console.WriteLine(string.Format("{0:0.#}", Decimal.Truncate(x * 10) / 10));
}
// output
2.8
2
1.9
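The two answers differ in one detail: Math.Floor rounds toward negative infinity, while Decimal.Truncate rounds toward zero. For the positive values in the question they agree, but for negative input they do not. A small sketch to compare them (the class and method names are hypothetical, and the invariant culture is used so the decimal point is always "."):

```csharp
using System;
using System.Globalization;

public static class OneDecimal
{
    // Drops everything after the first decimal digit (rounds toward zero).
    public static string Truncated(decimal value) =>
        (decimal.Truncate(value * 10) / 10).ToString("0.#", CultureInfo.InvariantCulture);

    // Rounds down toward negative infinity at the first decimal digit.
    public static string Floored(decimal value) =>
        (Math.Floor(value * 10) / 10).ToString("0.#", CultureInfo.InvariantCulture);
}
```

For example, OneDecimal.Truncated(-2.85m) returns "-2.8" whereas OneDecimal.Floored(-2.85m) returns "-2.9"; both return "2.8" for 2.85m.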

Reading numbers from string in C#

What I want?
I want to display weather information on my page.
I want to display the result in the browser specific culture.
What am I doing?
I use MSN RSS for this purpose.
MSN returns the report in XML format. I parse the XML and display results.
What problem am I facing?
When displaying the report, I have to parse an XML node, <data>, whose value differs from culture to culture.
For example:
en-US: "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%"
de-DE: "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%"
I want to read only low, high and chance of precipitation values. i.e., I want to read 46, 67 and 20%.
Can somebody please give me a solution for this?
A regex or some other method would be fine with me :-)
Thanks in advance!

You should consider always fetching the RSS using the same culture. That way, you'll have an easier task parsing the content. If you'll only be using the numbers, it shouldn't stop you from emitting culture-specific content to the end user.
So if you go for the en-US version, you could do it like this:
Regex re = new Regex(@"Lo: (\d+)°F. Hi: (\d+)°F. Chance of precipitation: (\d+)%");
var match = re.Match(forecast);
if (match.Success)
{
var groups = match.Groups;
lo = int.Parse(groups[1].Captures[0].Value);
hi = int.Parse(groups[2].Captures[0].Value);
prec = int.Parse(groups[3].Captures[0].Value);
}

If you only want the numbers, you can use a regular expression, for example the following:
(\d+).*?(\d+).*?(\d+%)
A quick test in PowerShell shows that it does work at least for your input data:
PS Home:\> function test ($re) {
>> $a -match $re; $Matches
>> $b -match $re; $Matches
>> }
>>
PS Home:\> $a = "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%"
PS Home:\> $b = "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%"
PS Home:\> test "(\d+).*?(\d+).*?(\d+%)"
True
Name Value
---- -----
3 20%
2 67
1 46
0 46°F. Hi: 67°F. Chance of precipitation: 20%
True
3 20%
2 67
1 46
0 46°F. Höchst: 67°F. Niederschlag %: 20%
However, it won't work anymore if any locale might use numbers in the description strings.
You can add other constraints, like requiring a colon before every match:
: (\d+).*?: (\d+).*?: (\d+%)
This should deal with spurious numbers elsewhere in the string. But the best way overall would be to get your data from a source that provides it for machine reading, not for human consumption.

The following should extract the two numbers and chance of precipitation, as well as the units that are used (for culturally dependent units).
(?<lo>\d+°.).*?(?<hi>\d+°.).*?(?<precipitation>\d+)
If you don't want units extracted, then you can use
(?<lo>\d+)°.*?(?<hi>\d+)°.*?(?<precipitation>\d+)
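In C#, the unit-free pattern could be used roughly as follows. This is a sketch with hypothetical class and method names; it returns false instead of throwing when the text does not match.

```csharp
using System.Text.RegularExpressions;

public static class ForecastParser
{
    // Named groups: low temperature, high temperature, chance of precipitation.
    private static readonly Regex Pattern =
        new Regex(@"(?<lo>\d+)°.*?(?<hi>\d+)°.*?(?<precipitation>\d+)");

    public static bool TryParse(string forecast, out int lo, out int hi, out int precipitation)
    {
        lo = 0;
        hi = 0;
        precipitation = 0;

        Match match = Pattern.Match(forecast);
        if (!match.Success)
        {
            return false;
        }

        lo = int.Parse(match.Groups["lo"].Value);
        hi = int.Parse(match.Groups["hi"].Value);
        precipitation = int.Parse(match.Groups["precipitation"].Value);
        return true;
    }
}
```

For both sample strings from the question this yields 46, 67 and 20.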

Use a regex (but I don't know the pattern ;) ).
You can also loop over the sentence and check whether each character is a digit. Each time you encounter one, append it to a string. When you find something other than a digit, parse the string to an int, and voilà. Do this three times.
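The loop described above could be sketched like this. The class and method names are hypothetical, and the sketch collects every run of ASCII digits rather than stopping after three, so the caller picks the first three values.

```csharp
using System.Collections.Generic;
using System.Text;

public static class DigitScanner
{
    // Returns every run of consecutive ASCII digits in the text as an int.
    public static List<int> ExtractNumbers(string text)
    {
        var numbers = new List<int>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (c >= '0' && c <= '9')
            {
                current.Append(c); // still inside a number
            }
            else if (current.Length > 0)
            {
                // A non-digit ends the current number.
                numbers.Add(int.Parse(current.ToString()));
                current.Clear();
            }
        }

        // Handle a number that runs to the end of the text.
        if (current.Length > 0)
        {
            numbers.Add(int.Parse(current.ToString()));
        }

        return numbers;
    }
}
```

Like the bare-number regex, this breaks down if a locale puts other numbers in the description text.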

It's quite odd that you are not getting XML with the values in separate nodes, which would make more sense to me (then you could pick which values to use for different locales).
But if you want to extract the data from the given strings, try this or something similar if you are not a fan of regex:
string dataUS = "Lo: 46°F. Hi: 67°F. Chance of precipitation: 20%";
string dataDE = "Niedrig: 46°F. Höchst: 67°F. Niederschlag %: 20%";
string[] stringValues = dataUS.Split(new string[] {": "}, 4, StringSplitOptions.None);
List<int> values = new List<int>();
for (int i = 1; i < 4; i++)
{
StringBuilder sb = new StringBuilder();
foreach (char c in stringValues[i].Trim())
{
if (Char.IsDigit(c))
{
sb.Append(c);
}
else
{
values.Add(Convert.ToInt32(sb.ToString()));
break;
}
}
}
(I'm splitting on ": " instead of on digits.)

I suggest using a regex to get the values you want one by one, according to the UI culture language.
That is, you can have one regex for the low temperature, "(Lo|Niedrig): (\d+)", one for the high temperature,
"(Hi|Höchst): (\d+)", another for the chance of precipitation, and so on.
In all of the above examples you get the number from the second capture group of the match.
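A sketch of this per-label approach in C# follows. The class and method names are hypothetical, and the precipitation labels are taken from the two sample strings in the question; other cultures would need their own alternatives added.

```csharp
using System.Text.RegularExpressions;

public static class LabelledForecast
{
    private static readonly Regex Lo = new Regex(@"(Lo|Niedrig): (\d+)");
    private static readonly Regex Hi = new Regex(@"(Hi|Höchst): (\d+)");
    private static readonly Regex Precipitation =
        new Regex(@"(Chance of precipitation|Niederschlag %): (\d+)");

    // Returns the number in the second capture group, or null if the label is absent.
    private static int? Read(Regex regex, string forecast)
    {
        Match match = regex.Match(forecast);
        return match.Success ? int.Parse(match.Groups[2].Value) : (int?)null;
    }

    public static int? ReadLo(string forecast) => Read(Lo, forecast);

    public static int? ReadHi(string forecast) => Read(Hi, forecast);

    public static int? ReadPrecipitation(string forecast) => Read(Precipitation, forecast);
}
```

Returning null for a missing label lets the caller decide how to handle a culture whose labels are not covered yet.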
