Twitter message length counting - c#

This is from the Twitter docs: https://developer.twitter.com/en/docs/basics/counting-characters.html
"Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text ... Twitter also counts the number of codepoints in the text rather than UTF-8 bytes."
This works for Western languages. But when I apply FormC normalization to the following, for example:
(I posted an example in Korean, but Stack Overflow considers it spam and doesn't let me post it)
I get a value of 160. On Twitter's web client, this is the longest message allowed: adding even one character goes over the limit.
Applying FormD to the same text gives a value over 300.
Since Twitter's limit is either 140 or 280, I really don't understand how Twitter determines that message's character count.
So how can I figure out the actual length of a tweet in a non-Western language?
The code to normalize, in C#:
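private static int GetCodepointLength(string inp)
{
    // Note: LengthInTextElements counts text elements (grapheme clusters),
    // which is not always the same as the number of code points.
    var info = new StringInfo(inp.Normalize(NormalizationForm.FormC));
    return info.LengthInTextElements;
}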
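The quoted documentation talks about code points, while `StringInfo.LengthInTextElements` counts text elements, so the two numbers can differ. The following is a minimal sketch that counts both on the NFC form; the class and method names are made up for illustration, and it covers only the code point part of the quoted text, not any further rules Twitter may apply.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TweetLengthSketch
{
    // Counts Unicode code points in the NFC form of the input.
    public static int CountCodePoints(string input)
    {
        string nfc = input.Normalize(NormalizationForm.FormC);
        int count = 0;
        for (int i = 0; i < nfc.Length; i++)
        {
            // A valid surrogate pair is two UTF-16 chars but a single code point.
            if (char.IsHighSurrogate(nfc[i]) && i + 1 < nfc.Length && char.IsLowSurrogate(nfc[i + 1]))
                i++;
            count++;
        }
        return count;
    }

    // Counts text elements, which is what the question's code measures.
    public static int CountTextElements(string input)
    {
        string nfc = input.Normalize(NormalizationForm.FormC);
        return new StringInfo(nfc).LengthInTextElements;
    }
}
```

A base letter followed by a combining mark that has no precomposed form stays as two code points after NFC but is one text element, so the two counts disagree for such input.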

Related

GS1 barcode parsing - It seems that there is no separating character

I have a program for parsing GS1 barcodes (with a Zebra scanner), which worked just fine, or at least I thought it did...
Until I came across a box with two GS1 barcodes: one "linear" and one data matrix (UDI). For the linear one it worked just fine, and I successfully got the GTIN and serial out. But the data matrix is different. For some reason its content is a bit longer than the linear code: it has a production date and something else at the end.
This is the Linear code: (01)00380652555852(17)260221(21)25146965079(30)1
This is data matrix: (01)00380652555852(17)260221(21)2514696507911210222240SA60AT225
I have problems parsing out the serial number, 25146965079.
A serial number in GS1 has a length of 1-20 characters. This one has 11, but how can I make parsing stop after the last 9? How can I know that the serial ends there?
I tried transforming each character to UDI, but it seems that there is no special separating character or anything, so I honestly don't know what to do. Does anyone have any idea?
This is the code, if anyone wants to try anything: https://prnt.sc/1x2sw8l
Those codes/products came right from the manufacturer, so there shouldn't be anything wrong with the code, I guess...
If you verify the barcode with a scanner that is designed to interpret a GS1 structure, you will see that the generated barcode is in fact incorrect.
You are missing a GS (group separator, ASCII 29) after the serial number. This character MUST terminate a variable-length field unless that field is the last one. This is specified in the GS1 General Specifications, section 7.8.5.2.
Without this separator you can't know where the serial ends - or, a machine interpreting the code can't know.
Tell the manufacturer that they need to study the GS1 specs.
Edit: the "correct" version would be:
(01)00380652555852(17)260221(21)25146965079<GS>(11)210222(240)SA60AT225
The parentheses and group separator <GS> are not included literally in the code.
Since you have two variable-length identifiers, (21) and (240), you need a GS no matter what you do. The only alternative would be to pad the serial number to its full length; then you could do without the separator.
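To illustrate why the separator matters, here is a minimal sketch of a parser for the raw element string (without parentheses). It is only an illustration: the class name is made up, the AI table lists just the five identifiers from this thread, and a real parser needs the full table from the GS1 General Specifications.

```csharp
using System;
using System.Collections.Generic;

public static class Gs1Sketch
{
    public const char GroupSeparator = (char)29;

    // Data lengths for the AIs in this thread; 0 marks a variable-length field.
    // This table is deliberately incomplete.
    private static readonly Dictionary<string, int> KnownAis = new Dictionary<string, int>
    {
        { "01", 14 }, { "11", 6 }, { "17", 6 }, { "21", 0 }, { "240", 0 }
    };

    public static List<KeyValuePair<string, string>> Parse(string data)
    {
        var result = new List<KeyValuePair<string, string>>();
        int pos = 0;
        while (pos < data.Length)
        {
            if (data[pos] == GroupSeparator) { pos++; continue; }

            // Try a 2-digit AI first, then a 3-digit one.
            string ai = null;
            foreach (int len in new[] { 2, 3 })
            {
                if (pos + len <= data.Length && KnownAis.ContainsKey(data.Substring(pos, len)))
                {
                    ai = data.Substring(pos, len);
                    break;
                }
            }
            if (ai == null)
                throw new FormatException("Unknown AI at position " + pos);
            pos += ai.Length;

            int end;
            int fixedLength = KnownAis[ai];
            if (fixedLength > 0)
            {
                end = pos + fixedLength;
                if (end > data.Length)
                    throw new FormatException("Truncated data for AI " + ai);
            }
            else
            {
                // Variable length: read up to the next GS, or to the end of the data.
                end = data.IndexOf(GroupSeparator, pos);
                if (end < 0) end = data.Length;
            }
            result.Add(new KeyValuePair<string, string>(ai, data.Substring(pos, end - pos)));
            pos = end;
        }
        return result;
    }
}
```

With the GS present after the serial, this yields five fields. Without it, everything after `21` ends up in the serial number, which is exactly the problem described in the question.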
According to the GS1 documentation (page 156 onwards), all the fields are correct:
(01)00380652555852 --> GTIN
(17)260221 --> Expiration date
(21)25146965079 --> Serial Number
(11)210222 --> Production Date
(240)SA60AT225 --> Additional Product Identification
I tried scanning the image, but the result was the same as yours.
So the problem is that the separators are not there. Without the separator, there is no way to know where the serial number ends.
I am sorry, my English is not good.
The reason for this problem is that group separators are non-printable characters. For example, if you focus a text box and press the Caps Lock or Shift key, nothing appears in the text box; the same is true for GS.
To solve this problem, declare:
Public l As Integer
And put the following code in the KeyUp event:
If TextBox1.TextLength = l Then
    My.Computer.Keyboard.SendKeys("{ENTER}")
    l = TextBox1.TextLength
End If
This code will give a space after each letter (because each letter is combined with the Caps Lock key) and five spaces at a group separator.
Store the raw input in the KeyPress event and check each character as it arrives:
if (e.KeyChar != 13)
{
    int asci = Convert.ToInt32(e.KeyChar);
    if (asci > 31 && asci < 128) // printable ASCII characters only
        rawbcode += Convert.ToChar((int)(e.KeyChar & 0xffff));
    else
    {
        if (asci == 29)
        {
            rawbcode += "<GS>"; // GS1 separator
        }
    }
}

USSD command translation

I need help decoding this received response.
at
OK
+CUSD: 0,"ar#?$ #9#d? ?# ???(d??)##1pD?"?T?Hc#
?& ?#D??? ?#??5 41 IA ?R",17
OK
+CUSD: 0,"ar?hb? ?' 10?# ? ?hb#?J##?#?? #f#??#?#S#d$#",17
I have tried this with another network provider, where the dcs value was 72,
but I don't understand this value of 17.
How do I decode it?
Results after setting UCS2:
AT+CSCS="UCS2"
OK
at+cusd=1,"002a003100350030002a0032002a00330032003300390031002a00360039003100370037002a00310023",15
+CUSD: 0,"00610072003f00680062003f0020003f00270020002000310030003f00400020003f0020003f006800620040003f004a00400040003f0040003f003f0020004000660040003f003f0040003f004000530040006400240040",17
AT+CSMP?
+CSMP: 17,167,0,0
OK
By the way, when I set AT+CSCS="UTF-8" it reports an error, even though
that value is reported back by the command AT+CSCS=?
The format of the response is according to 27.007:
+CUSD=[<n>[,<str>[,<dcs>]]]
Thus the third parameter is <dcs>. Its format is deferred to another specification:
<dcs>: 3GPP TS 23.038 [25] Cell Broadcast Data Coding Scheme in integer format
(default 0)
In chapter "5 CBS Data Coding Scheme" of 23.038 it states: "These codings may also be used for USSD."
For 17, binary 0001 0001:
bit 7..4 Coding Group Bits = 0001
bit 3..0 = 0001 --> UCS2; message preceded by language indication
And it notes that
An MS not supporting UCS2 coding will present the two character language identifier followed by improperly interpreted user data.
which is exactly the case in your output (e.g. ar meaning arabic followed by garbage).
For 72, binary 0100 1000:
bit 7..4 Coding Group Bits = 01xx
bit 5 = 0 --> uncompressed,
bit 4 = 0 --> no class meaning
bit 3 & 2 = 1 & 0 --> UCS2 (16bit)
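The two bit breakdowns above can be sketched as a small helper. This is only an illustration covering the two coding groups discussed here (0001 and 01xx); the class and method names are made up, and the remaining groups of 3GPP TS 23.038 are not handled.

```csharp
using System;

public static class CbsDcsSketch
{
    // Describes a CBS data coding scheme value for the 0001 and 01xx coding groups only.
    public static string Describe(int dcs)
    {
        int group = (dcs >> 4) & 0x0F;

        if (group == 0x1)
        {
            // Coding group 0001: the low nibble selects the alphabet.
            switch (dcs & 0x0F)
            {
                case 0: return "GSM 7-bit; message preceded by language indication";
                case 1: return "UCS2; message preceded by language indication";
                default: return "reserved";
            }
        }

        if ((group & 0xC) == 0x4)
        {
            // Coding group 01xx: bit 5 = compression, bit 4 = class meaning,
            // bits 3..2 = character set.
            string compression = (dcs & 0x20) != 0 ? "compressed" : "uncompressed";
            string messageClass = (dcs & 0x10) != 0 ? "class in bits 1..0" : "no class meaning";
            string charset;
            switch ((dcs >> 2) & 0x3)
            {
                case 0: charset = "GSM 7-bit"; break;
                case 1: charset = "8-bit data"; break;
                case 2: charset = "UCS2 (16-bit)"; break;
                default: charset = "reserved"; break;
            }
            return charset + "; " + compression + "; " + messageClass;
        }

        return "coding group not covered by this sketch";
    }
}
```

For the values in this thread, `Describe(17)` reports UCS2 with a language indication and `Describe(72)` reports uncompressed UCS2 with no class meaning.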
The "not supporting" part above might just be that you are using a limited character set encoding (PCCP437). In any case, unless your modem does not support UTF-8, you really should use that and not PCCP437. Or you might use UCS2. If your modem lacks both of those character sets, you can try HEX (this is a guess on my part based on what I saw when researching this answer; maybe you need to set the <dcs> parameter in AT+CSMP for this to work).
Notice that after selecting UCS2 every string must be encoded that way, including switching to another character set, see this answer for an example.
Use the following functions to decode "UCS2" response data:
public static String HexStr2UnicodeStr(String strHex)
{
    byte[] ba = Hex2ByteArray(strHex);
    return HexBytes2UnicodeStr(ba);
}

public static String HexBytes2UnicodeStr(byte[] ba)
{
    var strMessage = Encoding.BigEndianUnicode.GetString(ba, 0, ba.Length);
    return strMessage;
}
for example:
String str1 = SmsEngine.HexStr2UnicodeStr("002a003100350030002a0032002a00330032003300390031002a00360039003100370037002a00310023");
// str1 = "*150*2*32391*69177*1#"
Please also check UnicodeStr2HexStr()
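The functions above call `Hex2ByteArray`, which is not shown. A minimal sketch of such a helper, together with a reverse conversion, could look like the following; the class name and the `Decode`/`Encode` names are assumptions and not part of the answer's `SmsEngine` class.

```csharp
using System;
using System.Text;

public static class Ucs2HexSketch
{
    // Converts a hex string such as "0061" into bytes { 0x00, 0x61 }.
    public static byte[] Hex2ByteArray(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex string must have an even number of digits.");

        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return bytes;
    }

    // Decodes a UCS2 hex string (big-endian 16-bit units) into text.
    public static string Decode(string hex)
    {
        return Encoding.BigEndianUnicode.GetString(Hex2ByteArray(hex));
    }

    // Encodes text as a UCS2 hex string, e.g. for the <str> argument of AT+CUSD.
    public static string Encode(string text)
    {
        var sb = new StringBuilder();
        foreach (byte b in Encoding.BigEndianUnicode.GetBytes(text))
            sb.Append(b.ToString("X2"));
        return sb.ToString();
    }
}
```

Decoding the first eight hex digits of the response in the question, `00610072`, gives `ar`, the language identifier mentioned above.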

String Variable Character Limit

My String variable only stores 4096 characters. I need to store more; how can I achieve that?
Below is what I am trying to do:
ServiceController[] myServices = ServiceController.GetServices();
String ServiceList = "";
foreach (ServiceController service in myServices)
{
    ServiceList += service.DisplayName + "|||";
}
return ServiceList;
When the variable is returned, it only holds 4096 characters and the rest are trimmed off.
P.S. I need them in one variable as I am making a URL out of them and passing to my webservice.
"I need them in one variable as I am making a URL out of them and passing to my webservice."
No, don't do that!
A 4096 character URL is a very bad idea and is not guaranteed to work.
Extremely long URLs are usually a mistake. URLs over 2,000 characters will not work in the most popular web browser. Don't use them if you intend your site to work for the majority of Internet users.
(source)
Make a shortened URL that contains an id. Store the rest of the information in a database with the short id as the key.
Related
What is the maximum length of a URL in different browsers?
Maximum URL length is 2,083 characters in Internet Explorer
.NET string length limit is 2 billion characters.
Browsers do have a limit on how long of a URL they will accept, and the length limit is different across browser implementations. IE's limit is typically the shortest, at around 2k last time I checked in the IE6 era. Firefox and Chrome are considerably higher than that, but there is still a limit.
Your problem is elsewhere - but it is not possible to reliably use a URL that uses more than 2000 characters, you will need another approach entirely - see this SO answer: What is the maximum length of a URL?
(Also for building large strings use a StringBuilder instead)
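As a rough sketch of the StringBuilder suggestion, the loop from the question could be written as below. The class and method names are made up, and the method takes plain strings so the example does not depend on `ServiceController`.

```csharp
using System.Collections.Generic;
using System.Text;

public static class ServiceListSketch
{
    // Builds the "|||"-terminated list from the question without repeated
    // string concatenation.
    public static string Build(IEnumerable<string> displayNames)
    {
        var sb = new StringBuilder();
        foreach (string name in displayNames)
            sb.Append(name).Append("|||");
        return sb.ToString();
    }
}
```

In the question's code, the display names would come from `service.DisplayName` for each entry returned by `ServiceController.GetServices()`. This only changes how the string is built; it does not make a very long URL any safer to use.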
Strings in C# have a limit of about 2 GB, so there is no problem with your string variable.

H.225 User Information Packet Parsing

I'm writing some code using PacketDotNet and SharpPCap to parse H.225 packets for a VoIP phone system. I've been using Wireshark to look at the structure, but I'm stuck. I've been using this as a reference.
Most of the H.225 packets I see are user information type with an empty message body and the actual information apparently shows up as a list of NonStandardControls in Wireshark. I thought I'd just extract out these controls and parse them later, but I don't really know where they start.
In almost all cases, the items start at the 10th byte of the H.225 data. Each item appears to begin with the length which is recorded as 2 bytes. However, I am getting a packet that has items starting at the 11th byte.
The only difference I see in this packet is something in the message body supposedly called open type length which has a value of 1, whereas the rest all appear to be 0. Would the items start at 10 + open type length? Is there some document that explains what this open type length is for?
Thanks.
H.225 doesn't use a fixed-length encoding; it uses ASN.1 PER encoding (not BER).
You probably won't find a C# library. OPAL is adding a C API if you are able to use that.

Reading an mbox file in C#

One of our staff members has lost his mailbox but luckily has a dump of his email in mbox format. I need to somehow get all the messages inside the mbox file and squirt them into our tech support database (as it's a custom tool, there are no import tools available).
I've found SharpMimeTools, which breaks down a message but does not allow you to iterate through a bunch of messages in an mbox file.
Does anyone know of a decent open parser, so that I don't have to learn the RFC and write one myself?
I'm working on a MIME & mbox parser in C# called MimeKit.
It's based on earlier MIME & mbox parsers I've written (such as GMime) which were insanely fast (could parse every message in a 1.2GB mbox file in about 1 second).
I haven't tested MimeKit for performance yet, but I am using many of the same techniques in C# that I used in C. I suspect it'll be slower than my C implementation, but since the bottleneck is I/O and MimeKit is written to do optimal (4k) reads like GMime is, they should be pretty close.
Your current approach (StreamReader.ReadLine(), combining the text, then passing it off to SharpMimeTools) is slow for the following reasons:
StreamReader.ReadLine() is not a very optimal way of reading data from a file. While I'm sure StreamReader() does internal buffering, it needs to do the following steps:
A) Convert the block of bytes read from the file into unicode (this requires iterating over the bytes in the byte[] read from disk to convert the bytes read from the stream into a unicode char[]).
B) Then it needs to iterate over its internal char[], copying each char into a StringBuilder until it finds a '\n'.
So right there, with just reading lines, you have at least 2 passes over your mbox input stream. Not to mention all of the memory allocations going on...
Then you combine all of the lines you've read into a single mega-string. This requires another pass over your input (copying every char from each string read from ReadLine() into a StringBuilder, presumably?).
We are now up to 3 iterations over the input text and no parsing has even happened yet.
Now you hand off your mega-string to SharpMimeTools which uses a SharpMimeMessageStream which... (/facepalm) is a ReadLine()-based parser that sits on top of another StreamReader that does charset conversion. That makes 5 iterations before anything at all is even parsed. SharpMimeMessageStream also has a way to "undo" a ReadLine() if it discovers it has read too far. So it is reasonable to assume that it is scanning over some of those lines at least twice. Not to mention all of the string allocations going on... ugh.
For each header, once SharpMimeTools has its line buffer, it splits into field & value. That's another pass. We are up to 6 passes so far.
SharpMimeTools then uses string.Split() (which is a pretty good indication that this mime parser is not standards compliant) to tokenize address headers by splitting on ',' and parameterized headers (such as Content-Type and Content-Disposition) by splitting on ';'. That's another pass. (We are now up to 7 passes.)
Once it splits those it runs a regex match on each string returned from the string.Split() and then more regex passes per rfc2047 encoded-word token before finally making another pass over the encoded-word charset and payload components. We're talking at least 9 or 10 passes over much of the input by this point.
I give up going any farther with my examination because it's already more than 2x as many passes as GMime and MimeKit need and I know my parsers could be optimized to make at least 1 less pass than they do.
Also, as a side-note, any MIME parser that parses strings instead of byte[] (or sbyte[]) is never going to be very good. The problem with email is that so many mail clients/scripts/etc in the wild will send undeclared 8bit text in headers and message bodies. How can a unicode string parser possibly handle that? Hint: it can't.
using (var stream = File.OpenRead ("Inbox.mbox")) {
    var parser = new MimeParser (stream, MimeFormat.Mbox);

    while (!parser.IsEndOfStream) {
        var message = parser.ParseMessage ();

        // At this point, you can do whatever you want with the message.
        // As an example, you could save it to a separate file based on
        // the message subject:
        message.WriteTo (message.Subject + ".eml");

        // You also have the ability to get access to the mbox marker:
        var marker = parser.MboxMarker;

        // You can also get the exact byte offset in the stream where the
        // mbox marker was found:
        var offset = parser.MboxMarkerOffset;
    }
}
2013-09-18 Update: I've gotten MimeKit to the point where it is now usable for parsing mbox files and have successfully managed to work out the kinks, but it's not nearly as fast as my C library. This was tested on an iMac so I/O performance is not as good as it would be on my old Linux machine (which is where GMime is able to parse similar sized mbox files in ~1s):
[fejj@localhost MimeKit]$ mono ./mbox-parser.exe larger.mbox
Parsed 14896 messages in 6.16 seconds.
[fejj@localhost MimeKit]$ ./gmime-mbox-parser larger.mbox
Parsed 14896 messages in 3.78 seconds.
[fejj@localhost MimeKit]$ ls -l larger.mbox
-rw-r--r-- 1 fejj staff 1032555628 Sep 18 12:43 larger.mbox
As you can see, GMime is still quite a bit faster, but I have some ideas on how to improve the performance of MimeKit's parser. It turns out that C#'s fixed statements are quite expensive, so I need to rework my usage of them. For example, a simple optimization I did yesterday shaved about 2-3s from the overall time (if I remember correctly).
Optimization Update: Just improved performance by another 20% by replacing:
while (*inptr != (byte) '\n')
    inptr++;
with:
do {
    mask = *dword++ ^ 0x0A0A0A0A;
    mask = ((mask - 0x01010101) & (~mask & 0x80808080));
} while (mask == 0);

inptr = (byte*) (dword - 1);
while (*inptr != (byte) '\n')
    inptr++;
Optimization Update: I was able to finally make MimeKit as fast as GMime by switching away from my use of Enum.HasFlag() and using direct bit masking instead.
MimeKit can now parse the same mbox stream in 3.78s.
For comparison, SharpMimeTools takes more than 20 minutes (to test this, I had to split the emails apart into separate files because SharpMimeTools can't parse mbox files).
Another Update: I've gotten it down to 3.00s flat via various other tweaks throughout the code.
I don't know of any parser, but mbox is really a very simple format. A new email begins on a line starting with "From " (From+Space), and an empty line is appended to the end of each mail. Should there be any occurrence of "From " at the beginning of a line in the email itself, it is quoted (by prepending a '>').
Also see Wikipedia's entry on the topic.
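The format described above can be split with a few lines of code. The following is a minimal sketch only: the class and method names are made up, it works on text lines rather than bytes (which, as another answer here points out, is not ideal for real-world mail), and it ignores the differences between mbox variants in how "From " lines are quoted.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MboxSketch
{
    // Yields the raw text of each message in an mbox stream.
    public static IEnumerable<string> SplitMessages(TextReader reader)
    {
        StringBuilder current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith("From ", StringComparison.Ordinal))
            {
                // A "From " line starts a new message and is not part of it.
                if (current != null)
                    yield return current.ToString();
                current = new StringBuilder();
                continue;
            }

            if (current == null)
                continue; // skip anything before the first separator line

            // Undo the quoting of "From " lines inside a message body.
            if (line.StartsWith(">From ", StringComparison.Ordinal))
                line = line.Substring(1);

            current.Append(line).Append('\n');
        }

        if (current != null)
            yield return current.ToString();
    }
}
```

Each returned string can then be handed to a MIME parser. Note that the trailing empty line of each message is kept as part of the returned text.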
If you can stretch to using Python, there is one in the standard library. I'm unable to find any for .NET sadly.
To read .mbox files, you can use a third-party library Aspose.Email.
This library is a complete set of Email Processing APIs to build cross-platform applications having the ability to create, manipulate, convert, and transmit emails without using Microsoft Outlook.
Please, take a look at the example I have provided below.
using (FileStream stream = new FileStream("ExampleMbox.mbox", FileMode.Open, FileAccess.Read))
{
    using (MboxrdStorageReader reader = new MboxrdStorageReader(stream, false))
    {
        // Start reading messages
        MailMessage message = reader.ReadNextMessage();

        // Read all messages in a loop
        while (message != null)
        {
            // Manipulate message - show contents
            Console.WriteLine("Subject: " + message.Subject);

            // Save this message in EML or MSG format
            message.Save(message.Subject + ".eml", SaveOptions.DefaultEml);
            message.Save(message.Subject + ".msg", SaveOptions.DefaultMsgUnicode);

            // Get the next message
            message = reader.ReadNextMessage();
        }
    }
}
It is easy to use. I hope this approach will satisfy you and other searchers.
I am working as a Developer Evangelist at Aspose.
