using regex to parse a file from a ftpwebrequest

using regex to parse a file from a ftpwebrequest - c#

I am reading data from a vendor via an ftpwebrequest in c#. Thus the data will be in a string format line by line e.g
0123030014030300123003120312030203013003104234923942348
I need to parse this data into the appropriate fields so I can then insert them into a sql talble. I know the position each field starts at so I want to use regex to parse each field. This is the function I use to get the data now I just need to parse the data. I am haveing trouble finding a clear solution on how to do so. Any suggestions would be greatly appreciated :)
static void GetData()
{
WebClient request = new WebClient();
string url = "ftp://ftp.WebSite.com/file";
request.Credentials = new NetworkCredential("userid", "password");
try
{
byte[] newFileData = request.DownloadData(url);
string fileString = System.Text.Encoding.UTF8.GetString(newFileData);
Console.WriteLine(fileString);
}
catch (WebException e)
{
}
}

You don't need to use a regular expression if you're just splitting at specified lengths. In C# you can just use String.Substring, for example:
byte[] newFileData = request.DownloadData(url);
string fileString = System.Text.Encoding.UTF8.GetString(newFileData);
string fieldOne = fileString.Substring(0, n);

Here is an example of regex. We specify what to capture by the totals of \d{x} such as \d{4} gives us 4 digits. Then we place it into a named capture group (?<NameHere>\d{4}) then we access it by its name (NameHere in previous text example) for the processing.
See example with removal of three fields Id, group # and Target numbers:
string data = "0123030014030300123003120312030203013003104234923942348";
string pattern = #"^(?<ID>\d{4})(?<Group>\d{3})(?<Target>\d{8})";
Match mt = Regex.Match(data, pattern);
// Writes
// ID: 0123 of Group 030 on Target: 01403030
Console.WriteLine("ID: {0} of Group {1} on Target: {2}", mt.Groups["ID"].Value, mt.Groups["Group"].Value, mt.Groups["Target"].Value);

Related

UriBuilder().Query will wrongly encode non-ASCII characters

I am working on an asp.net mvc 4 web application. and i am using .net 4.5. now i have the following WebClient() class:
using (var client = new WebClient())
{
var query = HttpUtility.ParseQueryString(string.Empty);
query["model"] = Model;
//code goes here for other parameters....
string apiurl = System.Web.Configuration.WebConfigurationManager.AppSettings["ApiURL"];
var url = new UriBuilder(apiurl);
url.Query = query.ToString();
string xml = client.DownloadString(url.ToString());
XmlDocument doc = new XmlDocument();
//code goes here ....
}
now i have noted a problem when one of the parameters contain non-ASCII charterers such as £, ¬, etc....
now the final query will have any non-ASCII characters (such as £) encoded wrongly (as %u00a3). i read about this problem and seems i can replace :-
url.Query = query.ToString();
with
url.Query = ri.EscapeUriString(HttpUtility.UrlDecode(query.ToString()));
now using the later approach will encode £ as %C2%A3 which is the correct encoded value.
but the problem i am facing with url.Query = Uri.EscapeUriString(HttpUtility.UrlDecode(query.ToString())); in that case one of the parameters contains & then the url will have the following format &operation=AddAsset&assetName=&.... so it will assume that I am passing empty assetName parameter not value =&??
EDIT
Let me summarize my problem again. I want to be able to pass the following 3 things inside my URL to a third part API :
Standard characters such as A,B ,a ,b ,1, 2, 3 ...
Non-ASCII characters such as £,¬ .
and also special characters that are used in url encoding such as & , + .
now i tried the following 2 approaches :
Approach A:
using (var client = new WebClient())
{
var query = HttpUtility.ParseQueryString(string.Empty);
query["model"] = Model;
//code goes here for other parameters....
string apiurl = System.Web.Configuration.WebConfigurationManager.AppSettings["ApiURL"];
var url = new UriBuilder(apiurl);
url.Query = query.ToString();
string xml = client.DownloadString(url.ToString());
XmlDocument doc = new XmlDocument();
//code goes here ....
}
In this approach i can pass values such as & ,+ since they are going to be url encoded ,,but if i want to pass non-ASCII characters they will be encoded using ISO-8859-1 ... so if i have £ value , my above code will encoded as %u00a3 and it will be saved inside the 3rd party API as %u00a3 instead of £.
Approach B :
I use :
url.Query = Uri.EscapeUriString(HttpUtility.UrlDecode(query.ToString()));
instead of
url.Query = query.ToString();
now I can pass non-ASCII characters such as £ since they will be encoded correctly using UTF8 instead of ISO-8859-1. but i can not pass values such as & because my url will be read wrongly by the 3rd party API.. for example if I want to pass assetName=& my url will look as follow:
&operation=Add&assetName=&
so the third part API will assume I am passing empty assetName, while I am trying to pass its value as &...
so not sure how I can pass both non-ASCII characters + characters such as &, + ????

You could use System.Net.Http.FormUrlEncodedContent instead.
This works with a Dictionary for the Name/Value pairing and the Dictionary, unlike the NameValueCollection, does not "incorrectly" map characters such as £ to an unhelpful escaping (%u00a3, in your case).
Instead, the FormUrlEncodedContent can take a dictionary in its constructor. When you read the string out of it, it will have properly urlencoded the dictionary values.
It will correctly and uniformly handle both of the cases you were having trouble with:
£ (which exceeds the character value range of urlencoding and needs to be encoded into a hexadecimal value in order to transport)
& (which, as you say, has meaning in the url as a parameter separator, so that values cannot contain it--so that it has to be encoded as well).
Here's a code example, that shows that the various kinds of example items you mentioned (represented by item1, item2 and item3) now end up correctly urlencoded:
String item1 = "£";
String item2 = "&";
String item3 = "xyz";
Dictionary<string,string> queryDictionary = new Dictionary<string, string>()
{
{"item1", item1},
{"item2", item2},
{"item3", item3}
};
var queryString = new System.Net.Http.FormUrlEncodedContent(queryDictionary)
.ReadAsStringAsync().Result;
queryString will contain item1=%C2%A3&item2=%26&item3=xyz.

Maybe you could try to use an Extension method on the NameValueCollection class. Something like this:
using System.Collections.Specialized;
using System.Text;
using System.Web;
namespace Testing
{
public static class NameValueCollectionExtension
{
public static string ToUtf8UrlEncodedQuery(this NameValueCollection nv)
{
StringBuilder sb = new StringBuilder();
bool firstIteration = true;
foreach (var key in nv.AllKeys)
{
if (!firstIteration)
sb.Append("&");
sb.Append(HttpUtility.UrlEncode(key, Encoding.UTF8))
.Append("=")
.Append(HttpUtility.UrlEncode(nv[key], Encoding.UTF8));
firstIteration = false;
}
return sb.ToString();
}
}
}
Then, in your code you can do this:
url.Query = query.ToUtf8UrlEncodedQuery();
Remember to add a using directive for the namespace where you put the NameValueCollectionExtension class.

The problem here isn't UriBuilder.Query, it's UriBuilder.ToString(). Read the documentation here: https://msdn.microsoft.com/en-us/library/system.uribuilder.tostring(v=vs.110).aspx. The property is defined as returning the "display string" of the builder, not a validly encoded string. Uri.ToString() has a similar problem, in that it doesn't perform proper encoding.
Use the following instead: url.Uri.AbsoluteUri, that will always be a properly encoded string. You shouldn't have to do any encoding on the way into the builder (that's part of it's purpose, after all, to properly encode things).

You need to use:
System.Web.HttpUtility.UrlEncode(key)
Change your code to this:
using (var client = new WebClient())
{
var query = HttpUtility.ParseQueryString(string.Empty);
query["model"] = Model;
//code goes here for other parameters....
string apiurl = System.Web.Configuration.WebConfigurationManager.AppSettings["ApiURL"];
var url = new UriBuilder(apiurl);
url.Query = HttpUtility.UrlEncode(query.ToString());
string xml = client.DownloadString(url.ToString());
XmlDocument doc = new XmlDocument();
//code goes here ....
}

If nothing helps, then just manually convert those problematic chars inside values of parameters
& to %26
+ to %2B
? to %3F

Removing A specific Tag in VCF file using C#

I have a very large number of records in VCF file, the issue is When Phone Exports, It adds File with Tag name ."PHOTO", when I am importing this VCF file into other phone, It appends that information into name fields, causing too long and large Contact Name,
One thing I was doing was with manual find and delete, but in a file having 25000+ lines its so difficult,
I am thinking to move with REGX and remove, But there also issues, for some records, PHOTO Tag is the Last Record then END:VCARD at some points ITS, TEL;..
Here is the sample I have taken, Just one Clue, then Rest I will try at my own.
public string RemoveBlockComments(string InputString)
{
string strRegex = #"<regx>";
RegexOptions myRegexOptions = RegexOptions.Multiline;
Regex myRegex = new Regex(strRegex, myRegexOptions);
return myRegex.Replace(strTargetString, "");
}
Here is Test data:
PHOTO;TYPE=PNG;ENCODING=B:/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMoHBwYIDAoMDAsK
CwsNDhIQDQ4KL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSEx
BhJB
TEL;TYPE=CELL:123456789
I want to remove All from PHOTO to BhJb(before Tel)
Possible Formats.
PHOTO....
Tel;
PHOTO....
EMAIL;
PHOTO....
END:VCARD

As per our chat discussion, it seems that you need to do 2 things:
Get the text after the TEL, EMAIL, END (or anything else you may have on the list of valid values there)
Get the part before that starts with PHOTO.
You can get the 2 values easily with your updated method:
private static String ReFormat(String str, out String removed)
{
Regex rgx = new Regex(#"(?msi)(?<removed>PHOTO\b.*?)(?=\b(?:TEL|EMAIL|END)\b)");
removed = rgx.Match(str).Groups["removed"].Value;
return str.Replace(removed, string.Empty);
}
And in the caller:
string removed = string.Empty;
string rest = ReFormat("PHOTO;TYPE=PNG;ENCODING=B:/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMoHBwYIDAoMDAsK\r\nCwsNDhIQDQ4KL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSEx\r\nBhJB\r\nTEL;TYPE=CELL:123456789", out removed);
Or I think you can use a method that returns a list of key-value pairs:
private static List<KeyValuePair<String, String>> ReFormat(String str)
{
return Regex.Matches(str, #"(?msi)(?<photo>PHOTO\b.*?)(?<tel>\b(?:TEL|EMAIL|END)\b.*?(?=\bPHOTO|$))").Cast<Match>().Select(p=> new KeyValuePair<String, String>(p.Groups["photo"].Value, p.Groups["tel"].Value)).ToList();
}
And then
var my_result = ReFormat("PHOTO;TYPE=PNG;ENCODING=B:/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMoHBwYIDAoMDAsK\r\nCwsNDhIQDQ4KL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSEx\r\nBhJB\r\nTEL;TYPE=CELL:123456789");
var mres1 = ReFormat("PHOTO;..jkh ohfhlahflkhasf fhasof\r\nTel:886886\r\ndataPHOTO:.ljlljhkdsghdsgd\n\rEmail:abc#abc.com..");

How to convert a utf8 like string to a real utf8?

I'm working on a client that should communicate with an MMO game server.
The client is using unity3d.
I get the data from the server with JSON format and I try to get the data in UTF8 encoding:
string responseString = new System.IO.StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8).ReadToEnd()
JSONObject JOBJ = new JSONObject(responseString);
and what is inside the response string looks like:
"\u0645\u0639\u062f\u0646 \u062a\u06cc\u062a\u0627\u0646\u06cc\u0648\u0645"
Then I try to get the required utf8 string data out of the JSON:
string xy = JOBJ["name"].ToString();
byte[] utf = System.Text.Encoding.UTF8.GetBytes(xy);
string s2= System.Text.Encoding.UTF8.GetString(utf);
The Problem is when I Log the string:
Debug.Log("Jproperty :" + s2);
All I get is the \u secuences like this:
"\u0645\u0639\u062f\u0646 \u062a\u06cc\u062a\u0627\u0646\u06cc\u0648\u0645"
While if I put the same result in the xy in the first place I'll get the fine result.
Also I should mention that while I think that the s2.length should be 11 it is 66.
Any one can tell me what's wrong with my code?

Strings that contain unicode escape sequences are perfectly valid. Your data might be getting escaped before it is sent to the server.
Try Regex.Unescape:
var nameEscaped = JOBJ["name"].ToString();
// nameEscaped =
// \u0645\u0639\u062f\u0646 \u062a\u06cc\u062a\u0627\u0646\u06cc\u0648\u0645
var name = Regex.Unescape(nameEscaped);
// name =
// معدن تیتانیوم

How to get data off of a character

I am working on a project in Unity which uses Assembly C#. I try to get special character such as é, but in the console it just displays a blank character: "". For instance translating "How are you?" Should return "Cómo Estás?", but it returns "Cmo Ests". I put the return string "Cmo Ests" in a character array and realized that it is a non-null blank character. I am using Encoding.UTF8, and when I do:
char ch = '\u00e9';
print (ch);
It will print "é". I have tried getting the bytes off of a given string using:
byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);
While translating "How are you?", it will return a byte string, but for the special characters such as é, I get the series of bytes 239, 191, 189, which is a replacement character.
What type of information do I need to retrieve from the characters in order to accurately determining what character it is? Do I need to do something with the information that Google gives me, or is it something else? I am need a general case that I can place in my program and will work for any input string. If anyone can help, it would be greatly appreciated.
Here is the code that is referenced:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using UnityEngine;
using System.Collections;
using System.Net;
using HtmlAgilityPack;
public class Dictionary{
string[] formatParams;
HtmlDocument doc;
string returnString;
char[] letters;
public char[] charString;
public Dictionary(){
formatParams = new string[2];
doc = new HtmlDocument();
returnString = "";
}
public string Translate(String input, String languagePair, Encoding encoding)
{
formatParams[0]= input;
formatParams[1]= languagePair;
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);
string result = String.Empty;
using (WebClient webClient = new WebClient())
{
webClient.Encoding = encoding;
result = webClient.DownloadString(url);
}
doc.LoadHtml(result);
input = alter (input);
string temp = doc.DocumentNode.SelectSingleNode("//span[#title='"+input+"']").InnerText;
charString = temp.ToCharArray();
return temp;
}
// Use this for initialization
void Start () {
}
string alter(string inputString){
returnString = "";
letters = inputString.ToCharArray();
for(int i=0; i<inputString.Length;i++){
if(letters[i]=='\''){
returnString = returnString + "'";
}else{
returnString = returnString + letters[i];
}
}
return returnString;
}
}

Maybe you should use another API/URL. This function below uses a different url that returns JSON data and seems to work better:
public static string Translate(string input, string fromLanguage, string toLanguage)
{
using (WebClient webClient = new WebClient())
{
string url = string.Format("http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}", Uri.EscapeUriString(input), fromLanguage, toLanguage);
string result = webClient.DownloadString(url);
// I used JavaScriptSerializer but another JSON parser would work
JavaScriptSerializer serializer = new JavaScriptSerializer();
Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
return (string)sentences["trans"];
}
}
If I run this in a Console App:
Console.WriteLine(Translate("How are you?", "en", "es"));
It will display
¿Cómo estás?

I don't know much about the GoogleTranslate API, but my first thought is that you've got a Unicode Normalization problem.
Have a look at System.String.Normalize() and it's friends.
Unicode is very complicated, so I'll over simplify! Many symbols can be represented in different ways in Unicode, that is: 'é' could be represented as 'é' (one character), or as an 'e' + 'accent character' (two characters), or, depending what comes back from the API, something else altogether.
The Normalize function will convert your string to one with the same Textual meaning, but potentially a different binary value which may fix your output problem.

You actually pretty much have it. Just insert the coded letter with a \u and it works.
string mystr = "C\u00f3mo Est\u00e1s?";

There are several issues with your approach. First of all the UTF8 encoding is a multibyte encoding. This means that if you use any non-ASCII character (having char code > 127), you will get a series of special characters that indicate to the system that this is an Unicode char. So actually your sequence 239, 191, 189 indicates a single character which is not an ASCII character. If you use UTF16, then you get fixed-size encodings (2-byte encodings) which actually map a character to an unsigned short (0-65535).
The char type in c# is a two-byte type, so it is actually an unsigned short. This contrasts with other languages, such as C/C++ where the char type is a 1-byte type.
So in your case, unless you really need to be using byte[] arrays, you should use char[] arrays. Or if you want to encode the characters so that they can be used in HTML, then you can just iterate through the characters and check if the character code is > 128, then you can replace it with the &hex; character code.

I had the same problem working one of my project [Language Resource Localization Translation]
I was doing the same thing and was using.. System.Text.Encoding.UTF8.GetBytes() and because of utf8 encoding was receiving special characters like your
e.g 239, 191, 189 in result string.
please take a look of my solution... hope this helps
Don't Use encoding at all Google translation will return correct like á as it self in the string. do some string manipulation and read the string as it is...
Generic Solution [works for every language translation which google support]
try
{
//Don't use UtF Encoding
// use default webclient encoding
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + txtNewResourceValue.Text.Trim() + "◄", "en|" + item.Text.Substring(0, 2));
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate in UTF8 coding..
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start +7 , length - 8);
objDic2.Text = reverseString(result);
//hard code substring; finding the correct translation within the string.
dictList.Add(objDic2);
}
catch (Exception ex)
{
lblMessages.InnerHtml = "<strong>Google translate exception occured no resource saved..." + ex.Message + "</strong>";
error = true;
}
public static string reverseString(string s)
{
char[] arr = s.ToCharArray();
Array.Reverse(arr);
return new string(arr);
}
as you can see from the code no encoding has been performed and i am sending 2 special key charachters as "►" + txtNewResourceValue.Text.Trim() + "◄"to determine the start and end of the return translation from google.
Also i have checked hough my language utility tool I am getting "Cómo Estás?" when sending
How are you to google translation... :)
Best regards
[Shaz]
---------------------------Edited-------------------------
public string Translate(String input, String languagePair)
{
try
{
//Don't use UtF Encoding
// use default webclient encoding
//input [string to translate]
//Languagepair [eg|es]
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + input.Trim() + "◄", languagePair);
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start + 7, length - 8);
//return transalted string
return reverseString(result);
}
catch (Exception ex)
{
return "Google translate exception occured no resource saved..." + ex.Message";
}
}

Replace multiple words in string

I have multiple words I want to replace with values, whats the best way to do this?
Example:
This is what I have done but it feels and looks so wrong
string s ="Dear <Name>, your booking is confirmed for the <EventDate>";
string s1 = s.Replace("<Name>", client.FullName);
string s2 =s1.Replace("<EventDate>", event.EventDate.ToString());
txtMessage.Text = s2;
There has to be a better way?
thanks

You could use String.Format.
string.Format("Dear {0}, your booking is confirmed for the {1}",
client.FullName, event.EventDate.ToString());

If you're planning on having a dynamic number of replacements, which could change at any time, and you want to make it a bit cleaner, you could always do something like this:
// Define name/value pairs to be replaced.
var replacements = new Dictionary<string,string>();
replacements.Add("<Name>", client.FullName);
replacements.Add("<EventDate>", event.EventDate.ToString());
// Replace
string s = "Dear <Name>, your booking is confirmed for the <EventDate>";
foreach (var replacement in replacements)
{
s = s.Replace(replacement.Key, replacement.Value);
}

To build on George's answer, you could parse the message into tokens then build the message from the tokens.
If the template string was much larger and there are more tokens, this would be a tad more efficient as you are not rebuilding the entire message for each token replacement. Also, the generation of the tokens could be moved out into a Singleton so it is only done once.
// Define name/value pairs to be replaced.
var replacements = new Dictionary<string, string>();
replacements.Add("<Name>", client.FullName);
replacements.Add("<EventDate>", event.EventDate.ToString());
string s = "Dear <Name>, your booking is confirmed for the <EventDate>";
// Parse the message into an array of tokens
Regex regex = new Regex("(<[^>]+>)");
string[] tokens = regex.Split(s);
// Re-build the new message from the tokens
var sb = new StringBuilder();
foreach (string token in tokens)
sb.Append(replacements.ContainsKey(token) ? replacements[token] : token);
s = sb.ToString();

You can chain the Replace operations together:
s = s.Replace(...).Replace(...);
Note that you don't need to create other strings to do this.
Using String.Format is the appropriate way, but only if you can change the original string to suit the brace formatting.

When you do multiple replaces it's much more efficient to use StringBuilder instead of string. Otherwise replace function is making a copy of the string every time you run it, wasting time and memory.

Try this code:
string MyString ="This is the First Post to Stack overflow";
MyString = MyString.Replace("the", "My").Replace("to", "to the");
Result: MyString ="This is My First Post to the Stack overflow";

Use String.Format:
const string message = "Dear {0}, Please call {1} to get your {2} from {3}";
string name = "Bob";
string callName = "Alice";
string thingy = "Book";
string thingyKeeper = "Library";
string customMessage = string.Format(message, name, callName, thingy, thingyKeeper);

Improving on what #Evan said...
string s ="Dear <Name>, your booking is confirmed for the <EventDate>";
string s1 = client.FullName;
string s2 = event.EventDate.ToString();
txtMessage.Text = s.Replace("<Name>", s1).Replace("EventDate", s2);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

using regex to parse a file from a ftpwebrequest - c#

Related

UriBuilder().Query will wrongly encode non-ASCII characters

Removing A specific Tag in VCF file using C#

How to convert a utf8 like string to a real utf8?

How to get data off of a character

Replace multiple words in string

Categories

Resources