Split field/value pairs from a malformed XML string - C#

I have a string in XML format (it is not well-formed XML!) and I would like to get each field and its value.
<MYXML
address="rua sao carlos, 128" telefone= "1000-222" service="xxxxxx" source="xxxxxxx" username="aaaaaaa" password="122222" nome="asasas" sobrenome="aass" email="sao.aaaaa#aaaaa.com.br" pais="SS" telefone="4002-" />
I would like to get each parameter and value separately by splitting the string.
I tried this:
xml.ToString().Replace(" =" , "=").Replace("= " , "=").Replace(" = " , "=").Split(new char[]{' '});
But it does not work properly because, for example, the attribute 'address' is split into several items:
{string[29]}
[0]: "<signature"
[1]: "aaa=\"xxxx\""
[2]: "sss=\"xxxx\""
[3]: "ssss=\"xxx\""
[4]: "username=\"xxx\""
[5]: "password=\"xxxx\""
[6]: "nome=\"xxxx\""
[7]: "sobrenome=\"xxx\""
[8]: "email=\"xxx.xxx#xxx.com.br\""
[9]: "pais=\"BR\""
[10]: "endereco=\"Rua"
[11]: "Sao"
[12]: "Carlos,"
[13]: "128\""
[14]: "cidade=\"Sao"
[15]: "Paulo\""
The problem is:
[10]: "endereco=\"Rua"
[11]: "Sao"
[12]: "Carlos,"
What I would like instead is:
[10]: "endereco=\"Rua Sao Carlos, 128\""

A regular expression will work for this, since you are working with badly formed XML.
Regex regex = new Regex("\\s\\w+=\"(\\w|\\s|,|=|#|-|\\.)+\"");
MatchCollection matches = regex.Matches(searchText);
foreach (Match match in matches)
{
    // match.Value holds one name="value" pair (with a leading space); your code here
}
Tested with your example string and matches were as expected.
Hope this Helps!
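As a possible variation on the regex approach, named groups can capture the attribute name and the value separately. This is only a sketch: the names LooseAttributeParser and Parse and the pattern itself are assumptions, not part of the answer above, and the pattern assumes that values never contain a double quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the original answer.
public static class LooseAttributeParser
{
    // A name, optional spaces around '=', then a double-quoted value.
    private static readonly Regex AttributePattern =
        new Regex(@"(?<name>\w+)\s*=\s*""(?<value>[^""]*)""");

    // Returns the pairs in document order. A list is used instead of a
    // dictionary because the sample input repeats the telefone attribute.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match match in AttributePattern.Matches(text))
        {
            pairs.Add(new KeyValuePair<string, string>(
                match.Groups["name"].Value,
                match.Groups["value"].Value));
        }
        return pairs;
    }
}
```

Because the pattern allows whitespace around the equals sign, the Replace calls from the question should not be needed before matching.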

I would suggest using XPath or LINQ to XML to parse this XML. Splitting on spaces is not a reliable approach, and that is why you end up with the error: "Rua Sao Carlos, 128" contains several words separated by single spaces, so when you split on a single space the address is split as well.

Try the overload of Split that takes a string array (String.Split(string[], StringSplitOptions)). It allows you to use a string as the splitter token, namely '" ' (that is, a quote followed by a space). This splits the input into name/value pairs. Then take the resulting array and split each element again on = (equals) to get the names and values you need, and do as you will with them. Hope this gets you headed in the right direction.
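The two-step split described here could be sketched roughly as follows. The helper name AttributeSplitter.SplitAttributes is hypothetical, and the sketch assumes a single self-closing tag whose values never contain a quote followed by a space.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the two-step split.
public static class AttributeSplitter
{
    public static List<KeyValuePair<string, string>> SplitAttributes(string tag)
    {
        // Drop the surrounding "<" and "/>", then skip the element name.
        string body = tag.Trim().TrimStart('<').TrimEnd('>', '/').Trim();
        int firstGap = body.IndexOfAny(new[] { ' ', '\t', '\r', '\n' });
        body = firstGap < 0 ? "" : body.Substring(firstGap + 1).TrimStart();

        var pairs = new List<KeyValuePair<string, string>>();

        // Step 1: split on quote + space, which only occurs between attributes.
        string[] parts = body.Split(new[] { "\" " }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            // Step 2: split each piece on the first '=' only.
            int equalsIndex = part.IndexOf('=');
            if (equalsIndex < 0) continue;
            string name = part.Substring(0, equalsIndex).Trim();
            string value = part.Substring(equalsIndex + 1).Trim().Trim('"');
            pairs.Add(new KeyValuePair<string, string>(name, value));
        }
        return pairs;
    }
}
```

Splitting on the first equals sign only keeps values that themselves contain '=' intact.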

As already noted, you have badly formed XML. If you were to fix it, by either renaming or removing one of the telefone attributes, you could break down your XML as shown below.
This is the correct way to handle the XML. If, however, you have no control over getting proper XML and must work with junk, I'd suggest the regex answer by @AFrieze.
var xmlString = @"<MYXML address=""rua sao carlos, 128"" service=""xxxxxx"" source=""xxxxxxx"" username=""aaaaaaa"" password=""122222"" nome=""asasas"" sobrenome=""aass"" email=""sao.aaaaa#aaaaa.com.br"" pais=""SS"" telefone=""4002-"" />";
var xml = XDocument.Parse(xmlString);
// Collect every attribute of the MYXML element.
var values = xml.Descendants("MYXML").SelectMany(x => x.Attributes()).ToArray();
foreach (var value in values)
{
    Console.WriteLine(value);
}
Console.Read();
This returns:
address="rua sao carlos, 128"
service="xxxxxx"
source="xxxxxxx"
username="aaaaaaa"
password="122222"
nome="asasas"
sobrenome="aass"
email="sao.aaaaa#aaaaa.com.br"
pais="SS"
telefone="4002-"

How can I match the given pattern using Regex in C#?

I have the following input:
-key1:"val1" -key2: "val2" -key3:(val3) -key4: "(val4)" -key5: val5 -key6: "val-6" -key-7: val7 -key-eight: "val 8"
With only the following assumption about the pattern:
Keys always start with a - followed by a value delimited by :
How can I match and extract each key and its corresponding value?
I have so far come up with the following regex:
-(?<key>\S*):\s?(?<val>\S*)
But it's currently not matching the complete value for the last argument as it contains a space but I cannot figure out how to match it.
The expected output should be:
key1 "val1"
key2 "val2"
key3 (val3)
key4 "(val4)"
key5 val5
key6 "val-6"
key-7 val7
key-eight val 8
Any help is much appreciated.
Guessing that you want to only allow whitespace characters that are not at the beginning or end, change your regex to:
-(?<key>\S*):\s?(?<val>\S+(\s*[^-\s])*)
This assumes that a - character preceded by whitespace always means a new key is beginning; it cannot be part of any value.
For this example:
-key: value -key2: value with whitespace -key3: value-with-hyphens -key4: v
The matches are:
-key: value, -key2: value with whitespace, -key3: value-with-hyphens, -key4: v.
It also works perfectly well on your provided example.
A low-tech (non-regex) solution, just as an alternative. Trim the leftovers and call ToDictionary if you need to.
var results = input.Split(new[] { " -" }, StringSplitOptions.RemoveEmptyEntries)
    .Select(x => x.Trim('-').Split(':'));
Output
key1 -> "val1"
key2 -> "val2"
key3 -> (val3)
key4 -> "(val4)"
key5 -> val5
key6 -> "val-6"
key-7 -> val7
key8 -> "val 8"
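A minimal sketch of the "trim and ToDictionary" step hinted at above. The names KeyValueSplitter and Parse are hypothetical, and the sketch assumes that no value contains a space followed by a hyphen and that keys are unique (ToDictionary throws on duplicate keys).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class KeyValueSplitter
{
    public static Dictionary<string, string> Parse(string input)
    {
        return input.Trim()
            // Each " -" starts a new key; hyphens inside keys or values survive.
            .Split(new[] { " -" }, StringSplitOptions.RemoveEmptyEntries)
            // Split on the first ':' only, so values may contain colons.
            .Select(part => part.TrimStart('-').Split(new[] { ':' }, 2))
            .Where(pair => pair.Length == 2)
            .ToDictionary(pair => pair[0].Trim(), pair => pair[1].Trim());
    }
}
```

Quotes and brackets are kept as part of the value, which matches the expected output in the question.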
Try this regex using Replace function:
(?:^|(?!\S)\s*)-|\s*:\s*
and replace with "\n". You should get key values in separate lines.
I presume you're wanting to keep the brackets and quotation marks as that's what you're doing in the example you gave? If so then the following should work:
-(?<key>\S+):+\s?(?<val>\S+\s?\d+\)?\"?)
This does presume that every value ends with a number, though.
EDIT:
Given that the val doesn't always end with a number, but I'm guessing it always starts with val, this is what I have:
-(?<key>\S+):+\s?(?<val>\"?\(?(val)+\s?\S+)
Seems to be working properly...
This should do the trick
-(?<key>\S*):\s*(?<value>(?(?=")((")(?:(?=(\\?))\2.)*?\1))(\S*))
a sample link can be found here.
Basically it uses an if/then/else construct to detect whether the value contains a ", in the form (?(?=")(true regex)(false regex)). The false regex is yours, \S*, while the true regex tries to match the opening and closing quote: (")(?:(?=(\\?))\2.)*?\1).

C# How to split text, but without removing delimiter?

I want to split text by mathematical symbols [(, ), -, +, /, *, ^].
For example, "(3*21)+4/2" should produce the array {"(","3","*","21",")","+","4","/","2"}
I tried to do that with Regex.Split, but the brackets are problematic.
You can walk through the source string, appending to the current array cell while the current character is a digit, and moving on to the next array cell when it is not ((, *, -, etc.).
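That character-by-character walk might look like the sketch below. The names Tokenizer and Tokenize are hypothetical, and the sketch assumes that every non-digit, non-whitespace character is a one-character token of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper illustrating the manual scan.
public static class Tokenizer
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var number = new StringBuilder();

        foreach (char c in input)
        {
            if (char.IsDigit(c))
            {
                // Keep extending the current number.
                number.Append(c);
                continue;
            }

            // A non-digit ends the current number, if there is one.
            if (number.Length > 0)
            {
                tokens.Add(number.ToString());
                number.Clear();
            }

            // Every other non-blank character becomes its own token.
            if (!char.IsWhiteSpace(c))
            {
                tokens.Add(c.ToString());
            }
        }

        // Flush a number that runs to the end of the input.
        if (number.Length > 0)
        {
            tokens.Add(number.ToString());
        }
        return tokens;
    }
}
```

For the input "(3*21)+4/2" this yields the nine tokens listed in the question, with 21 kept as a single token.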
Not sure what problem you encountered with Regex.Split, but it seems quite simple. All you have to do is escape the characters that have a special meaning in regex, like so:
string input = "(3*21+[3-5])+4/2";
string pattern = @"(\()|(\))|(\d+)|(\*)|(\+)|(-)|(/)|(\[)|(\])";
var result = Regex.Matches(input, pattern);
var result2 = Regex.Split(input, pattern);
Edit: updated the pattern; '-' and '/' don't have to be escaped.
Afterwards you have two options. The first is Split, which gives you a string array, but with an empty string between every pair of matches. That's why I think you should go for Matches; transforming the result into a string array is simple:
string[] stringResult = (from Match match in result select match.Value).ToArray();
stringResult
{string[15]}
[0]: "("
[1]: "3"
[2]: "*"
[3]: "21"
[4]: "+"
[5]: "["
[6]: "3"
[7]: "-"
[8]: "5"
[9]: "]"
[10]: ")"
[11]: "+"
[12]: "4"
[13]: "/"
[14]: "2"
I really think something like this will come in handy.
First, read the input with Console.ReadLine, or if you already have a string, store it.
string input = Console.ReadLine();
Then create an array with the same length as the string:
string[] arr = new string[input.Length];
// Make sure your input doesn't contain spaces
Then store each character of the string in the corresponding array element, like this:
arr[0] = input[0].ToString();
Do this for every character, or simply use a for loop:
for (int i = 0; i < input.Length; i++)
{
    arr[i] = input[i].ToString();
}
That's it. Note that this puts every character in its own cell, so a multi-digit number such as 21 ends up as "2" and "1".

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
    input,
    @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
    m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
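The multi-line lambda mentioned above could be sketched like this. The names AccountMasker and Mask are hypothetical. Keeping one leading whitespace character is an addition of this sketch, not of the answer: the optional \s? can pull the space before the first matched word into the match, so the sketch puts it back in front of the mask.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the pattern from the answer above.
public static class AccountMasker
{
    public static string Mask(string input)
    {
        return Regex.Replace(
            input,
            @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
            m =>
            {
                // Preserve a leading space that the match may have swallowed.
                string leading = char.IsWhiteSpace(m.Value[0])
                    ? m.Value.Substring(0, 1)
                    : "";

                // Remove all whitespace before taking the last four characters.
                string compact = Regex.Replace(m.Value, @"\s", "");
                return leading + "xxxx" +
                    compact.Substring(Math.Max(0, compact.Length - 4));
            });
    }
}
```

With this change a value such as 111 2233 33 is masked as xxxx3333 rather than xxxx3 33.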
I don't think that regex is the best way to solve this problem, and that's why I am posting this answer. For situations this complex, building the corresponding regex is too difficult and, what is worse, its clarity and adaptability are much lower than those of a longer code-based approach.
The code below delivers the exact functionality you are after; it is clear enough and can easily be extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
    if (word.Any(char.IsDigit))
    {
        // The word contains a digit: append it to the current run.
        previousNum = true;
        tempOutput = tempOutput + word;
    }
    else
    {
        if (previousNum)
        {
            // The run has ended: mask it, keeping the last four characters.
            if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
            output = output + " " + tempOutput;
            previousNum = false;
            tempOutput = ""; // reset so the next run starts empty
        }
        output = output + " " + word;
    }
}
if (previousNum)
{
    // Flush a run that reaches the end of the input.
    if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
    output = output + " " + tempOutput;
}
output = output.Trim(); // drop the leading space added by the concatenation
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. As for the C# code, you'll need to check each match for a space at the beginning or end of the matched sequence (e.g., in the last example the spaces before and after are part of the match).
Here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
    match => "xxxx" /*new String('x', match.Value.Length - 4)*/ +
        match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

Extract part from a string

Could someone please explain to me how to write regular expressions to extract the "duration" and "time" from the given strings?
Duration: 00:21:38.97, start: 0.000000, bitrate: 2705 kb/s
From the first string I want to extract duration "00:21:38.97" part.
size= 1547kB time=00:01:38.95 bitrate= 128.1kbits/s
From the second string I want to extract time "00:01:38.95" part.
I've tried:
Regex.Match(theString, @"\:\s([^)]*)\,\s").Groups[1].Value;
Here is a possible solution:
class Program
{
    static void Main(string[] args)
    {
        Regex regex = new Regex(@"(((?<Hour>[0-9]{1,2})[.:](?=[0-9]{2}))?(?<Minute>[0-9]{1,2})[.:])(?<Second>[0-9]{2})[.:](?<Milisecond>[0-9]{2})");
        var string1 = "Duration: 00:21:38.97, start: 0.000000, bitrate: 2705 kb/s";
        var string2 = "size= 1547kB time=00:01:38.95 bitrate= 128.1kbits/s ";
        foreach (var match in regex.Match(string1).Captures)
        {
            Console.WriteLine(match.ToString());
        }
        foreach (var match in regex.Match(string2).Captures)
        {
            Console.WriteLine(match.ToString());
        }
        Console.ReadKey();
    }
}
Output:
00:21:38.97
00:01:38.95
When you need to write a regex, you need to think about what describes the text you're trying to match.
For your first example, two possible descriptions come to mind:
"Match a series of four two-digit numbers, separated by colons".
That would be @"\d{2}:\d{2}:\d{2}:\d{2}" or @"(?:\d{2}:){3}\d{2}".
Match any text following after "Duration: " until (but not including) the next comma. That would be @"(?<=Duration: )[^,]*".
Similarly, for your second example, you could write
"Match a series of four two-digit numbers, separated by colons (except for the last one which is a dot)": @"\d{2}:\d{2}:\d{2}\.\d{2}".
Match any text following after "time=" until (but not including) the next whitespace. That would be @"(?<=time=)\S*".
Whether any of these actually does what you need it to do depends on the actual data you're encountering. For example, the first regex would find a match in 1234:56:78:9012 (it would match 34:56:78:90 here, which probably is not what you'd want). The second regex would fail on a string like Duration: 00:21:38.97; start: 0.000000; bitrate: 2705 kb/s because the separator has changed.
So you need to know exactly what you're looking for; writing a regex is pretty straightforward, then.
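As an illustration, the two lookbehind patterns above could be wrapped like this. The names TimeExtractor, ExtractDuration and ExtractTime are hypothetical, and the sketch assumes the labels "Duration: " and "time=" appear exactly as in the sample strings.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the two lookbehind patterns.
public static class TimeExtractor
{
    // Everything after "Duration: " up to, but not including, the next comma.
    public static string ExtractDuration(string line)
    {
        return Regex.Match(line, @"(?<=Duration: )[^,]*").Value;
    }

    // Everything after "time=" up to, but not including, the next whitespace.
    public static string ExtractTime(string line)
    {
        return Regex.Match(line, @"(?<=time=)\S*").Value;
    }
}
```

Both methods return an empty string when the label is missing, because Match.Value is empty for a failed match.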

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, from the first line I might need 1, AA, 2401, and so on. Now, I'm not asking for someone to come up with a regular expression for me, because for the most part I can handle that myself. My issue has more to do with being able to store the data somewhere and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
    assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
    lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
    System.Console.Write("{0,24}", line);
    MatchCollection mc = sPattern.Matches(line);
    if (sPattern.IsMatch(line))
    {
        System.Console.WriteLine(" (match for '{0}' found)", sPattern);
    }
    else
    {
        System.Console.WriteLine();
    }
    System.Console.WriteLine(mc[0].Groups[0].Captures);
    System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern ends with ".*", which matches any number of characters, i.e. the rest of the line. Remove it and the pattern will only match the "1". You may find an online regex tester useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
    Console.WriteLine("No match");
}
The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline when you are making the match, so that ^ anchors at the start of every line rather than only at the start of the whole text.
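A sketch of how the named group and the Multiline option might be combined to collect the line number of every line in one pass. The names FlightLines and LineNumbers are hypothetical, and the pattern uses \d{1,2} instead of \d{1} on the assumption (from the question) that line numbers can have two digits.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answers.
public static class FlightLines
{
    public static List<string> LineNumbers(string text)
    {
        // With Multiline, ^ matches at the start of each line of the text.
        var re = new Regex(@"^\s*(?<First>\d{1,2})", RegexOptions.Multiline);

        var numbers = new List<string>();
        foreach (Match m in re.Matches(text))
        {
            // Read the captured digits through the named group.
            numbers.Add(m.Groups["First"].Value);
        }
        return numbers;
    }
}
```

Storing the values in a list addresses the part of the question about keeping the data somewhere for later access.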
