I have this text file that contains approximately 22 000 lines, with each line looking like this:
12A4 (Text)
So the format is a 4-character hexadecimal value followed by text in parentheses. Sometimes the text contains more than one value, separated by a comma:
A34d (Text, Optional)
Is there any efficient way to search for the Hex and then return the first text in the parentheses? Would it be much more effective if I stored this data in SQLite?
Example using substring and split.
string value = "A34d (Text, Optional)";
string hex = value.Substring(0, 4);
string text = value.Split('(')[1];
if (text.Contains(','))
text = text.Substring(0, text.IndexOf(','));
else
text = text.Substring(0, text.Length-1);
For searching use a Dictionary.
That's probably less than 2 MB of data.
I think you can:
Read the whole file
Split each line into a key (the hex number) and a value (the remainder); Chris Persichetti's answer is excellent for that
Store each line in a dictionary (using the number as an int, not as a string)
var d = new Dictionary<int, string>();
d[int.Parse(key, System.Globalization.NumberStyles.HexNumber)] = value;
Keep that dictionary in memory and then perform a very quick lookup by the id
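The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming every usable line starts with a valid 4-digit hex key; BuildLookup and the sample lines are made-up names and data for illustration, and in a real program the lines would come from the file instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps the hex key (parsed as an int) to the rest of the line.
static Dictionary<int, string> BuildLookup(IEnumerable<string> lines)
{
    var table = new Dictionary<int, string>();
    foreach (var line in lines)
    {
        if (line.Length < 5) continue;                        // skip blank or too-short lines
        int key = Convert.ToInt32(line.Substring(0, 4), 16);  // base 16; throws on non-hex input
        table[key] = line.Substring(4).Trim();                // e.g. "(Text, Optional)"
    }
    return table;
}

// Sample data standing in for the real file.
var lookup = BuildLookup(new[] { "12A4 (Text)", "A34d (Text, Optional)" });
Console.WriteLine(lookup[0xA34D]); // (Text, Optional)
```

Because the key is an int, "A34d" and "a34D" end up as the same entry, so the lookup is not case-sensitive.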
There are elegant answers posted already, but since you requested regex, try this:
var regex = @"^(?<hexData>.{4})\s(?<textData>.*)$";
var matches = Regex.Matches
(textInput, regex, RegexOptions.IgnorePatternWhitespace
| RegexOptions.Multiline);
Then you parse through the matches object to get whatever you want.
If you want to search for the Hex value more than once, you definitely want to store this in a lookup table of some sort.
This could be as simple as a Dictionary<string, string> that you populate with the contents of your file on startup:
read each line (StreamReader.ReadLine)
hexString = substring of first 4 characters in line
store the rest of the string
To find the first part, create a function that retrieves "A" from "(A, B, C, ...)"
If you can rule out commas "," in "A", you are in luck: Remove the parentheses, split on "," and return first substring.
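A rough sketch of that idea, assuming the first item never contains a comma. FirstItem is a hypothetical helper name, and the two sample lines stand in for the real file contents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns "A" from "(A, B, C, ...)".
// Assumes the first item itself never contains a comma.
static string FirstItem(string parenthesized)
{
    string inner = parenthesized.Trim().TrimStart('(').TrimEnd(')');
    return inner.Split(',')[0].Trim();
}

// Case-insensitive keys, so "a34d" finds the entry stored as "A34d".
var table = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
foreach (var line in new[] { "12A4 (Text)", "A34d (Text, Optional)" }) // sample data
{
    table[line.Substring(0, 4)] = line.Substring(4); // store the rest of the string
}

Console.WriteLine(FirstItem(table["a34d"])); // Text
```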
var lines = ...;
var item = (from line in lines
where line.StartsWith("a34d", StringComparison.OrdinalIgnoreCase)
select line).FirstOrDefault();
//if item == null, it is not found
var firstText = item.Split('(',',',')')[1];
It works, and if you want to strip leading and trailing whitespace from firstText, add a .Trim() at the end.
For splitting a text into several lines, see my two answers to "How can I convert a string with newlines in it to separate lines?"
Use a StreamReader to ReadLine; you can then check whether the first characters equal what you are searching for, and if they do, you can do
string yourresult = thereadline.Split
(new string[]{" (", ",", ")"},
StringSplitOptions.RemoveEmptyEntries)[1];
I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "abbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?
Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It appends each character to the accumulating string, first inserting a space whenever the character differs from the last one read. At the end, I split the result using the normal String.Split. Note that this assumes the input itself contains no spaces.
Result:
["aa", "bb", "aaa", "bb", "cc"]
I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();
There are several things going on here that are producing the output you're seeing:
The regex combines a positive lookbehind and a negative lookahead to find each position where the preceding character does not match the following one.
It creates a capture group for every match, and the captured text is fed into the Split result. The capture group is required by the negative lookahead, specifically the \1 backreference, which basically means "the value of the first capture group in the pattern", so it cannot be omitted.
Regex.Split, given one or more capture groups in the pattern that identifies the splitting delimiters, will include the captured text in the result for every individual split.
Number 3 is why your string array looks odd: Split splits after the last a in the run, and the text before that point becomes split[0]. This is followed by the captured text at split[1], and so on.
There is no way to override this behaviour on calling Split.
Either compensation as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
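To see the capture-group behaviour of Regex.Split in isolation, here is a small sketch with a made-up input that is unrelated to the pattern in the question:

```csharp
using System;
using System.Text.RegularExpressions;

// Without a capture group, the delimiter is dropped from the result.
string[] plain = Regex.Split("a-b-c", "-");
// With a capture group, the captured text is returned between the pieces.
string[] captured = Regex.Split("a-b-c", "(-)");

Console.WriteLine(string.Join(" ", plain));    // a b c
Console.WriteLine(string.Join(" ", captured)); // a - b - c
```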
To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();
You could do this easily with LINQ, but I don't think its runtime will be as good as the regex's.
It's a whole lot easier to read, though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
returns the values ["aaa", "bb", "cccc", "d", "eee"]
However, this won't work if you have a string like "aabbaa"; you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"].
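If repeated groups such as "aabbaa" matter, one option is to walk the string and cut it wherever the character changes, instead of using GroupBy. This is a sketch rather than a tuned implementation, and SplitIntoRuns is a made-up helper name:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits a string into runs of identical consecutive characters.
static List<string> SplitIntoRuns(string text)
{
    var runs = new List<string>();
    int start = 0;
    for (int i = 1; i <= text.Length; i++)
    {
        // A run ends at the end of the string or where the character changes.
        if (i == text.Length || text[i] != text[start])
        {
            runs.Add(text.Substring(start, i - start));
            start = i;
        }
    }
    return runs;
}

Console.WriteLine(string.Join(",", SplitIntoRuns("aabbaa"))); // aa,bb,aa
```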
I have an email ID like the one below:
string email = "test.mail@test.com";
string[] myText = email.Split('.');
I am not sure how to take the first two characters, followed by the first two characters after the period (dot).
myText = tema //(desired output)
Use LINQ ;)
string myText = string.Join("", email.Remove(email.IndexOf('@')).Split('.')
.Select(r => new String(r.Take(2).ToArray())));
First, remove the text from @ onward (including the @)
Then split on .
From each element of the returned array, take the first two characters and convert them to a char array
Pass the array of characters to the String constructor to create a string
Use String.Join to combine the resulting strings.
Another LINQ solution:
string first = new string(email.Take(2).ToArray());
string second = new string(email.SkipWhile(c => c != '.').Skip(1).Take(2).ToArray());
string res = first + second;
string.Join(string.Empty, email.Substring(0, email.IndexOf("@")).Split('.').Select(x => x.Substring(0, 2)));
Lots of creative answers here, but the most important point is that Split() is the wrong tool for this job. It's much easier to use Replace():
myText = Regex.Replace(email, @"^(\w{2})[^.]*\.(\w{2})[^.]*@.+$", "$1$2");
Note that I'm making a lot of simplifying assumptions here. Most importantly, I'm assuming the original string contains the email address and nothing else (you're not searching for it), that the string is well formed (you're not trying to validate it), and that both of the substrings you're interested in start with at least two word characters.
In C#, I have a string that comes from a file in this format:
Type="Data"><Path.Style><Style
or maybe
Type="Program"><Rectangle.Style><Style
and so on. Now I want to extract only the Data or Program part of the Type element. For that, I used the following code:
string output;
var pair = inputKeyValue.Split('=');
if (pair[0] == "Type")
{
output = pair[1].Trim('"');
}
But it gives me this result:
output=Data><Path.Style><Style
What I want is:
output=Data
How can I do that?
This code example takes an input string, splits by double quotes, and takes only the first 2 items, then joins them together to create your final string.
string input = "Type=\"Data\"><Path.Style><Style";
var parts = input
.Split('"')
.Take(2);
string output = string.Join("", parts); //note: .net 4 or higher
This will make output have the value:
Type=Data
If you only want output to be "Data", then do
var parts = input
.Split('"')
.Skip(1)
.Take(1);
or
var output = input
.Split('"')[1];
What you can do is use a very simple regular expression to parse out the bits that you want. In your case, you want something that looks like this, and then grab the two groups that interest you:
(Type)="(\w+)"
This returns, in groups 1 and 2, the value Type and the word characters contained between the double quotes.
Instead of doing many splits, why don't you just use Regex:
output = Regex.Match(pair[1], "\"(\\w*)\"").Groups[1].Value;
Maybe I missed something, but what about this:
var str = "Type=\"Program\"><Rectangle.Style><Style";
var splitted = str.Split('"');
var type = splitted[1]; // i.e. Data or Program
But you will need some error handling as well.
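One possible shape for that error handling is sketched below. TryGetQuotedValue is a hypothetical helper, and it assumes that the value of interest is the first quoted piece of the string.

```csharp
using System;

// Hypothetical helper: returns false instead of throwing when the input
// has no quoted value.
static bool TryGetQuotedValue(string input, out string value)
{
    value = string.Empty;
    if (string.IsNullOrEmpty(input)) return false;

    string[] pieces = input.Split('"');
    // A quoted value needs an opening and a closing quote, so at least three pieces.
    if (pieces.Length < 3) return false;

    value = pieces[1];
    return true;
}

if (TryGetQuotedValue("Type=\"Program\"><Rectangle.Style><Style", out string type))
    Console.WriteLine(type); // Program
else
    Console.WriteLine("No quoted value found");
```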
How about a regex?
var regex = new Regex("(?<=^Type=\").*?(?=\")");
var output = regex.Match(input).Value;
Explanation of the regex
(?<=^Type=\") This is a prefix match (a lookbehind). It's not included in the result, but it only matches
if the string starts with Type="
.*? Non-greedy match. Matches as few characters as possible until
(?=\") This is a suffix match (a lookahead). It's not included in the result, but it only matches if the next character is "
Given your specified format:
Type="Program"><Rectangle.Style><Style
It seems logical to me to include the quote mark (") when splitting the string; then you just have to detect the closing quote mark and extract the contents. You can use LINQ to do this:
string code = "Type=\"Program\"><Rectangle.Style><Style";
string[] parts = code.Split(new string[] { "=\"" }, StringSplitOptions.None);
string[] wantedParts = parts.Where(p => p.Contains("\"")).
Select(p => p.Substring(0, p.IndexOf("\""))).ToArray();
I have a program to compare text files. It takes in 2 files and spits out 1. The input files have lines of data similar to this:
tv_rocscores_DeDeP005M3TSub.csv FMR: 0.0009 FNMR: 0.023809524 SCORE: -4 Conformity: True
tv_..............P006............................................................
tv_..............P007............................................................
etc etc.
For my initial purposes, I was splitting the lines based on spaces to get the respective values. However, for the first field, tv_rocscores_DeDeP005M3TSub.csv, I only need P005 and not the rest. I cannot rely on the position either, because the position of P005 in the phrase is not the same for every file.
Any advice on how I can split this so that I can identify my first field with only P005?
Your question is a bit unclear. If you're looking for a pattern, say "P + three digits", e.g. "P005", you can use regular expressions:
String str = @"tv_rocscores_DeDeP005M3TSub.csv FMR: 0.0009 FNMR: 0.023809524 SCORE: -4 Conformity: True";
String[] parts = str.Split(' ');
parts[0] = Regex.Match(parts[0], @"P\d\d\d").Value; // <- "P005"
To extract the desired part I would try something like this:
var parts = str.Split(' ');
var number = Regex.Match(parts[0], @".*?(?<num>P\d+).*?").Groups["num"].Value;
Or, if you know it's only three digits, you could change the regular expression to .*?(?<num>P\d{3}).*?
Hope that solves your problem :)
How about just checking if the first field contains P005?
bool hasP005 = field1.Contains("P005");
Your question isn't clear. Can't you just replace the first field with your string?
string[] parts = str.Split(' ');
parts[0] = "P005";
Are you trying to find the field that contains that string? If so, you can use some LINQ:
var field = s.Split(' ').Where(x => x.Contains("P005")).ToList()[0];
Say I have a string such as
abc123def456
What's the best way to split the string into an array such as
["abc", "123", "def", "456"]
string input = "abc123def456";
Regex re = new Regex(@"\D+|\d+");
string[] result = re.Matches(input).OfType<Match>()
.Select(m => m.Value).ToArray();
string[] result = Regex.Split("abc123def456", "([0-9]+)");
The above will use any sequence of numbers as the delimiter, though wrapping it in () says that we still would like to keep our delimiter in our returned array.
Note: In the example snippet we will get an empty element as the last entry of our array.
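A small sketch of that trailing empty element, and one way to drop it using LINQ:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] raw = Regex.Split("abc123def456", "([0-9]+)");
// raw ends with an empty string because the input ends with a delimiter.
string[] cleaned = raw.Where(part => part.Length > 0).ToArray();

Console.WriteLine(raw.Length);                // 5
Console.WriteLine(string.Join(",", cleaned)); // abc,123,def,456
```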
The boundary you look for can be described as "A position where a digit follows a non-digit, or where a non-digit follows a digit."
So:
string[] result = Regex.Split("abc123def456", @"(?<=\D)(?=\d)|(?<=\d)(?=\D)");
Use [0-9] and [^0-9], respectively, if \d and \D are not specific enough.
Add spaces around the digit runs, then split on the spaces. So here is the solution:
Regex.Replace("abc123def456", @"(\d+)", " $1 ").Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
I hope it works.
You could convert the string to a char array and then loop through the characters. As long as the characters are of the same type (letter or number) keep adding them to a string. When the next character no longer is of the same type (or you've reached the end of the string), add the temporary string to the array and reset the temporary string to null.
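The loop described above might look roughly like this. It is a sketch that treats every character as either a digit or a non-digit (an assumption; the description says letter or number), and SplitByCharType is a made-up helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits text wherever it switches between digits and non-digits.
static List<string> SplitByCharType(string text)
{
    var parts = new List<string>();
    var current = new StringBuilder();
    foreach (char c in text)
    {
        // Flush the buffer when the character type changes.
        if (current.Length > 0 && char.IsDigit(c) != char.IsDigit(current[current.Length - 1]))
        {
            parts.Add(current.ToString());
            current.Clear();
        }
        current.Append(c);
    }
    if (current.Length > 0) parts.Add(current.ToString()); // add the last run
    return parts;
}

Console.WriteLine(string.Join(",", SplitByCharType("abc123def456"))); // abc,123,def,456
```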