Parse Line and Break it into Variables

I have a text file that contains only the full version number of an application. I need to read it and parse it into separate variables.
For example, let's say version.cs contains 19.1.354.6.
The code I'm using does not seem to be working:
char[] delimiter = { '.' };
string currentVersion = System.IO.File.ReadAllText(@"C:\Applicaion\version.cs");
string[] partsVersion;
partsVersion = currentVersion.Split(delimiter);
string majorVersion = partsVersion[0];
string minorVersion = partsVersion[1];
string buildVersion = partsVersion[2];
string revisVersion = partsVersion[3];

Although your problem is most likely with the file (it probably contains text other than a version), why don't you use the Version class, which is designed for exactly this kind of task?
var version = new Version("19.1.354.6");
var major = version.Major; // etc..
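As a minimal sketch of combining the Version class with file input: the helper below trims the text before parsing, since a file often ends with a newline. The class name VersionText and the method ParseOrNull are hypothetical names chosen for illustration; Version.TryParse is the standard .NET call.

```csharp
using System;

public static class VersionText
{
    // Parses text such as "19.1.354.6\r\n".
    // Returns null when the text is not a valid version string.
    public static Version ParseOrNull(string fileContents)
    {
        // Trim() drops surrounding whitespace, including a trailing newline.
        return Version.TryParse(fileContents.Trim(), out Version parsed) ? parsed : null;
    }
}
```

It could be called with the result of System.IO.File.ReadAllText; a null result would suggest the file holds something other than a bare version number.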

What you have works fine with the correct input, so I would suggest making sure there is nothing else in the file you're reading.
In the future, please provide error information, since we can't usually tell exactly what you expect to happen, only what we know should happen.
In light of that, I would also suggest looking into using Regex for parsing in the future. In my opinion, it provides a much more flexible solution for your needs. Here's an example of regex to use:
var regex = new Regex(@"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)");
var match = regex.Match("19.1.354.6");
if (match.Success)
{
Console.WriteLine("Match[1]: "+match.Groups[1].Value);
Console.WriteLine("Match[2]: "+match.Groups[2].Value);
Console.WriteLine("Match[3]: "+match.Groups[3].Value);
Console.WriteLine("Match[4]: "+match.Groups[4].Value);
}
else
{
Console.WriteLine("No match found");
}
which outputs the following:
// Match[1]: 19
// Match[2]: 1
// Match[3]: 354
// Match[4]: 6

Related

Finding multiple semi predictable patterns in a string

Alright, so I'm writing an application that needs to be able to extract a VAT-Number from an invoice (https://en.wikipedia.org/wiki/VAT_identification_number)
The biggest challenge here is that, as is apparent from the Wikipedia article I linked, each country uses its own format for these VAT numbers (the Netherlands uses a 14-character number, while Germany uses an 11-character number).
In order to extract these numbers, I throw every line from the invoice into an array of strings, and for each string I test if it has a length that is equal to one of the VAT formats, and if that checks out, I check if said string also contains a country code ("NL", "DE", etc).
string[] ProcessedFile = Reader.ProcessFile(Input);
foreach(string S in ProcessedFile)
{
RtBEditor.AppendText(S + "\n");
}
foreach(string X in ProcessedFile)
{
string S = X.Replace(" ", string.Empty);
if (S.Length == 7)
{
if (S.Contains("GBGD"))
{
MessageBox.Show("Land = Groot Britanie (Regering)");
}
}
/*
repeat for all other lengths and country codes.
*/
}
This code has two problems.
First, if a string happens to have the same length as one of the VAT formats and has a country code embedded in it, the code will incorrectly conclude that it has found the VAT number.
Second, in some cases the VAT number is included as "VAT-number: [VAT-number]". The text that precedes the actual number is then counted in the length, so the program cannot detect the VAT number.
I assume the best fix is to somehow isolate the VAT number from the surrounding string altogether, but I have yet to find a way to do this.
Does anyone by any chance know any potential solution?
Many thanks in advance!
EDIT:
Added a dummy invoice to clarify what kind of data is contained within the invoices.

As someone in the comments pointed out, the best way to fix this is by using Regex. After some experimenting, I arrived at the following solution:
public Regex FilterNormaal = new Regex(@"[A-Z]{2}(\d)+B?\d*");
private void BtnUitlezen_Click(object sender, EventArgs e)
{
RtBEditor.Clear();
/*
Temp dummy vatcodes for initial testing.
*/
Form1.Dummy1.VAT = "NL855291886B01";
Form1.Dummy2.VAT = "DE483270846";
Form1.Dummy3.VAT = "SE482167803501";
OCR Reader = new OCR();
/*
Grab and process image
*/
if(openFileDialog1.ShowDialog() == DialogResult.OK)
{
try
{
Input = new Bitmap(openFileDialog1.FileName);
}
catch
{
MessageBox.Show("Please open an image file.");
}
}
string[] ProcessedFile = Reader.ProcessFile(Input);
foreach(string S in ProcessedFile)
{
string X = S.Replace(" ", string.Empty);
RtBEditor.AppendText(X + "\n");
}
foreach (Match M in FilterNormaal.Matches(RtBEditor.Text))
{
MessageBox.Show(M.Value);
}
}
At first, I attempted to iterate through my array of strings to find a match, but for reasons unknown, this did not yield any results. When applying the regex to the entire textbox, it did output the results I needed.
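For comparison, here is a minimal sketch of applying the same pattern line by line instead of to the whole textbox. VatFinder and FindAll are hypothetical names, and the pattern is the loose one from the answer above, so it will also accept strings that are not valid VAT numbers (for example "AB1").

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VatFinder
{
    // Two capital letters, one or more digits, then an optional 'B' and more digits.
    private static readonly Regex VatPattern = new Regex(@"[A-Z]{2}\d+B?\d*");

    // Collects every substring in the given lines that matches the pattern.
    public static List<string> FindAll(IEnumerable<string> lines)
    {
        var found = new List<string>();
        foreach (string line in lines)
        {
            // Remove spaces first, as the original code does.
            string compact = line.Replace(" ", string.Empty);
            foreach (Match m in VatPattern.Matches(compact))
            {
                found.Add(m.Value);
            }
        }
        return found;
    }
}
```

Because the regex searches inside each line, a prefix such as "VAT-number:" no longer affects detection.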

How do I find a variable set of 5 numbers qualified by surrounding underscores?

I am pulling file names into a variable (@[User::FileName]) and attempting to extract the work order number (always 5 digits with underscores on both sides) from that string. For example, a file name would look like "ABC_2017_DEF_9_12_GHI_35132_S5160.csv", and I want the result to be "35132". I have found examples of how to do it, such as SUBSTRING(FileName,1,FINDSTRING(FileName,"_",1) - 1), but the underscore will not always be in the same location.
Is it possible to do this in the expression builder?
Answer:
public void Main()
{
string strFilename = Dts.Variables["User::FileName"].Value.ToString();
var RegexObj = new Regex(@"_([\d]{5})_");
var match = RegexObj.Match(strFilename);
if (match.Success)
{
Dts.Variables["User::WorkOrder"].Value = match.Groups[1].Value;
}
Dts.TaskResult = (int)ScriptResults.Success;
}

First of all, the example you provided, ABC_2017_DEF_9_12_GHI_35132_S5160.csv, contains 4 numbers located between underscores:
2017, 9, 12, 35132
I don't know whether a file name may contain more than one 5-digit number, so in my answer I will assume that the number you want is the last occurrence of a 5-digit number.
Solution
You can use the following regular expression:
(?<=_)[0-9]{5}(?=_)
Or, as @MartinSmith suggested (in a comment), you can use the following regex:
_([\d]{5})_
Implementing the regex in SSIS
First add another variable (e.g. @[User::FileNumber])
Add a Script Task and choose the @[User::Filename] variable as ReadOnlyVariable, and @[User::FileNumber] as ReadWriteVariable
Inside the Script Task use the following code:
using System.Text.RegularExpressions;
public void Main()
{
    string strFilename = Dts.Variables["User::Filename"].Value.ToString();
    // Lookarounds keep the underscores out of the match (.NET does not support \K)
    var objRegEx = new Regex(@"(?<=_)[0-9]{5}(?=_)");
    var mc = objRegEx.Matches(strFilename);
    if (mc.Count > 0)
    {
        // The last match contains the value needed
        Dts.Variables["User::FileNumber"].Value = mc[mc.Count - 1].Value;
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}

Do the other pieces mean something?
Anyway, you can use a Script Task and the Split function.
Pass in @fileName as read-only and @WO as read-write:
string fn = Dts.Variables["fileName"].Value.ToString();
string[] parts = fn.Split('_');
// Assuming it's always the 7th part (index 6)
// You could extract the other parts as well.
Dts.Variables["WO"].Value = parts[6];
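If the work order is not always the 7th part, a variant of the split approach can look for the part that is exactly five digits instead of relying on its position. This is only a sketch: WorkOrder and Extract are hypothetical names, and it assumes the last such part is the one wanted.

```csharp
using System;
using System.Linq;

public static class WorkOrder
{
    // Returns the last underscore-delimited part made of exactly five digits,
    // or null when there is none.
    public static string Extract(string fileName)
    {
        string[] parts = fileName.Split('_');
        // Skip the first and last parts: they do not have underscores on both sides.
        return parts
            .Skip(1)
            .Take(Math.Max(0, parts.Length - 2))
            .LastOrDefault(p => p.Length == 5 && p.All(c => c >= '0' && c <= '9'));
    }
}
```

Inside a Script Task, the return value could be assigned to the read-write variable after a null check.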

I would do this with a Script Transformation (or Script Task if this is not in a DataFlow) and use a Regex.

Is there any way to "substitute" numbers in string C#?

I have html code, which I need to parse on the fly. I need to find exact divs there, which all have id of "content-text-" and then 6 numbers (like "content-text-123456"), which I don't know beforehand. Is there any way to "substitute" the numbers at the end of the string I'm searching for (like "content-text-######")? Searching for "content-text-" does not work.
I'm doing this project on Windows Phone 8.1 with C# if it matters.
EDIT:
WPPageResponse response = JsonConvert.DeserializeObject<WPPageResponse>(json);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response.content);
foreach (var node in doc.DocumentNode.Descendants("div").Where(div => div.GetAttributeValue("id", "") == "content-text-######"))
{
// Gather data what it returns
}
Here is some code if it helps. It works if I know the numbers and search with them, but the thing is that I can't know all the numbers there.

You can use Regex for this.
string data = "MyTest = 5564327";
string output = Regex.Replace(data, @"\d", "#");
Console.WriteLine(output);
Console.Read();
Output is:
MyTest = #######
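For the original goal of finding ids like "content-text-123456", a pattern match on the id is another option. The sketch below assumes the suffix is always exactly six digits; ContentIds and IsContentTextId are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ContentIds
{
    // "content-text-" followed by exactly six digits and nothing else.
    private static readonly Regex IdPattern = new Regex(@"^content-text-[0-9]{6}$");

    public static bool IsContentTextId(string id)
    {
        return IdPattern.IsMatch(id);
    }
}
```

In the question's query, the comparison against "content-text-######" could then become ContentIds.IsContentTextId(div.GetAttributeValue("id", "")). If the number of digits does not matter, a plain StartsWith("content-text-") check on the id may be enough.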

How to retrieve the locale(country) code from URL?

I have a URL, which is like http://example.com/UK/Deal.aspx?id=322
My target is to remove the locale(country) part, to make it like http://example.com/Deal.aspx?id=322
Since the URL may have other similar formats like: https://ssl.example.com/JP/Deal.aspx?id=735, using "substring" function is not a good idea.
What I can think about is to use the following method for separating them, and map them back later.
HttpContext.Current.Request.Url.Scheme
HttpContext.Current.Request.Url.Host
HttpContext.Current.Request.Url.AbsolutePath
HttpContext.Current.Request.Url.Query
And, suppose HttpContext.Current.Request.Url.AbsolutePath will be:
/UK/Deal.aspx?id=322
I am not sure how to deal with this, since my boss asked me not to use regular expressions (he thinks they will impact performance).
Apart from regular expressions, is there any other way to remove UK from it?
p.s.: the UK part may be JP, DE, or other country code.
By the way, for USA, there is no country code, and the url will be http://example.com/Deal.aspx?id=322
Please also take this situation into consideration.
Thank you.

Assuming the URL contains a two-letter ISO country code, you can use the UriBuilder class to remove that segment from the path of the Uri.
E.g.
var sourceUri = new Uri("http://example.com/UK/Deal.aspx?id=322");
if (IsLocaleEnabled(sourceUri))
{
    var builder = new UriBuilder(sourceUri);
    // Remove the first path segment, e.g. "UK/"
    builder.Path = builder.Path.Replace(sourceUri.Segments[1], string.Empty);
    // Construct the Uri with the new path
    Uri newUri = builder.Uri;
}
Update:
// Cache the instance for performance benefits.
static readonly Regex regex = new Regex(@"^[a-zA-Z]{2}/$", RegexOptions.Compiled);
/// <summary>
/// Checks whether the first URL segment after the root
/// is a 2-letter ISO code
/// </summary>
private bool IsLocaleEnabled(Uri sourceUri)
{
    // Compiled regexes generally match faster than interpreted ones.
    // The length check avoids an exception when the URL has no path segment after the root.
    return sourceUri.Segments.Length > 1 && regex.IsMatch(sourceUri.Segments[1]);
}
For performance you should cache the regex (that is, keep it in a static readonly field). There's no need to parse a pre-defined regex on every request.
Result - http://example.com/Deal.aspx?id=322

It all depends on whether the country code always has the same position. If it doesn't, more details on the possible formats are required. Maybe you could check whether the first segment has two characters, to be sure it really is a country code (not sure if this is reliable, though). Or you could start with the file name, if it's always in the format /[optionalCountryCode]/deal.aspx?...
How about these two approaches (on string level):
public string RemoveCountryCode()
{
Uri originalUri = new Uri("http://example.com/UK/Deal.aspx?id=322");
string hostAndPort = originalUri.GetLeftPart(UriPartial.Authority);
// v1: if country code is always there, always has same position and always
// has format 'XX' this is definitely the easiest and fastest
string trimmedPathAndQuery = originalUri.PathAndQuery.Substring("/XX/".Length);
// v2: if country code is always there, always has same position but might
// not have a fixed format (e.g. XXX)
trimmedPathAndQuery = string.Join("/", originalUri.PathAndQuery.Split('/').Skip(2));
// in both cases you need to join it with the authority again
return string.Format("{0}/{1}", hostAndPort, trimmedPathAndQuery);
}

If the AbsolutePath will always have the format /XX/...pagename.aspx?id=### where XX is the two letter country code, then you can just strip off the first 3 characters.
Example that removes the first 3 characters:
var targetURL = HttpContext.Current.Request.Url.AbsolutePath.Substring(3);
If the country code could be different lengths, then you could find the index of the second / character and start the substring from there.
var sourceURL = HttpContext.Current.Request.Url.AbsolutePath;
var firstOccurrence = sourceURL.IndexOf('/');
// Start searching after the first slash, otherwise the same slash is found again
var secondOccurrence = sourceURL.IndexOf('/', firstOccurrence + 1);
var targetURL = sourceURL.Substring(secondOccurrence);

The easy way would be to treat as string, split it by the "/" separator, remove the fourth element, and then join them back with the "/" separator again:
string myURL = "https://ssl.example.com/JP/Deal.aspx?id=735";
List<string> myURLsplit = myURL.Split('/').ToList();
// RemoveAt returns void, so it cannot be chained onto ToList()
myURLsplit.RemoveAt(3);
myURL = string.Join("/", myURLsplit);
RESULT: https://ssl.example.com/Deal.aspx?id=735
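Since USA URLs carry no country code, a regex-free variant can check whether the first path segment looks like a country code before removing it. This is a sketch under the assumption that any first segment of exactly two letters is a country code (so a real two-letter folder such as /js/ would also be stripped); LocaleUrl and StripCountryCode are hypothetical names.

```csharp
using System;
using System.Linq;

public static class LocaleUrl
{
    // Returns the URL without a leading two-letter path segment, if one is present.
    public static string StripCountryCode(string url)
    {
        var uri = new Uri(url);
        // For "/UK/Deal.aspx" the segments are "/", "UK/", "Deal.aspx".
        string[] segments = uri.Segments;
        if (segments.Length < 3)
        {
            return url; // nothing between the root and the page
        }
        string first = segments[1].TrimEnd('/');
        bool looksLikeCountry = first.Length == 2 && first.All(char.IsLetter);
        if (!looksLikeCountry)
        {
            return url;
        }
        // Rebuild the path from the remaining segments; the query string is kept by UriBuilder.
        var builder = new UriBuilder(uri)
        {
            Path = "/" + string.Concat(segments.Skip(2))
        };
        return builder.Uri.ToString();
    }
}
```

The result is read from builder.Uri rather than builder.ToString() so that the default port is not written into the URL.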

Code an elegant way to strip strings

I am using C#, and in one place I get a list of all people's names with their email IDs in the format
name(email)\n
I just came up with this substring code off the top of my head. I am looking for a more elegant, fast (in terms of access time and the operations it performs), easy-to-remember line of code to do this.
string pattern = "jackal(jackal@gmail.com)";
string email = pattern.Substring(pattern.IndexOf("("), pattern.LastIndexOf(")") - pattern.IndexOf("("));
//extra
string email = pattern.Split('(',')')[1];
I think the above does sequential access to each character until it finds the index of the character. It works OK now since the name is short, but it would struggle with a long name (hope people don't have one).

A dirty hack would be to let Microsoft do it for you.
try
{
new MailAddress(input);
//valid
}
catch (Exception ex)
{
// invalid
}
I would hope they do a better job than a custom regex.
Maintaining a custom regex that takes care of everything might involve some effort.
See: MailAddress
Your format is actually very close to some supported formats.
Text within () is treated as a comment, but if you replace ( with < and ) with >, you get a supported format.
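The replacement idea can be sketched as follows, assuming each entry has the shape name(email) and that neither part contains further parentheses. NameEmail and Parse are hypothetical names; MailAddress and its Address and DisplayName properties come from System.Net.Mail.

```csharp
using System;
using System.Net.Mail;

public static class NameEmail
{
    // Converts "name(email)" to "name <email>" and lets MailAddress parse it.
    // Returns null when the result is not a format MailAddress accepts.
    public static MailAddress Parse(string entry)
    {
        if (string.IsNullOrWhiteSpace(entry))
        {
            return null;
        }
        string candidate = entry.Trim().Replace("(", " <").Replace(")", ">");
        try
        {
            return new MailAddress(candidate);
        }
        catch (FormatException)
        {
            return null; // not a valid address in this form
        }
    }
}
```

For a valid entry, the Address property of the result should hold the email and DisplayName the name, which also validates the address as a side effect.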

The second parameter in Substring() is the length of the string to take, not the ending index.
Your code should read:
string pattern = "jackal(jackal@gmail.com)";
int start = pattern.IndexOf("(") + 1;
int end = pattern.LastIndexOf(")");
string email = pattern.Substring(start, end - start);
Alternatively, have a look at Regular Expression to find a string included between two characters while EXCLUDING the delimiters
