Group delimited text by sections using known keys

Group delimited text by sections using known keys - c#

I've got string that looks like this. Every line is \r delimited, breaks placed here for visual purposes.
BEGIN_SECTIONS_INFORMATION
NUMSECTIONS=6
SECTION_GROUPNAME[1]=GROUP_1
SECTION_NAME[1]=foo
BEGIN_SECTION[1]
//blah...
END_SECTION[1]
SECTION_GROUPNAME[2]=GROUP_2
SECTION_NAME[2]=bazzz
BEGIN_SECTION[2]
//blah...
END_SECTION[2]
END_SECTIONS_INFORMATION
I need to split this string by SECTION_GROUPNAME into an IEnumerable<T> like this:
Index 0:
SECTION_GROUPNAME[1]=GROUP_1
SECTION_NAME[1]=foo
BEGIN_SECTION[1]
//blah...
END_SECTION[1]
Index 1:
SECTION_GROUPNAME[2]=GROUP_2
SECTION_NAME[2]=bazzz
BEGIN_SECTION[2]
//blah...
END_SECTION[2]
Rules:
Every section starts with SECTION_GROUPNAME[n].
Every section has a SECTION_NAME[n] and has an BEGIN_ and END_.
Section names are unique.
I have tried:
var sections = from line in sectionGroups
where line.StartsWith("SECTION_GROUPNAME")
group line by "SECTION_GROUPNAME";
Also tried
var sections = sectionGroups.Split(new string[] { "SECTION_GROUPNAME" }, StringSplitOptions.None);
From this post, OP created enums/list for groups. I can't do that as I don't know how many groups/sections will be in the string.

Assuming you want an IEnumerable<T> of strings that contain all the content you described without a key structure.
The basic idea is to remove the stuff that you don't want (the beginning and end bits), and then split by the start of your desired string. In the end, the split text doesn't make it into the array of results, so you have to manually add it back.
The following worked for me:
static string s = #"BEGIN_SECTIONS_INFORMATIONNUMSECTIONS=6SECTION_GROUPNAME[1]=GROUP_1SECTION_NAME[1] = foo\rBEGIN_SECTION[1]\rEND_SECTION[1]\rSECTION_GROUPNAME[2]=GROUP_2SECTION_NAME[2] = bazzzBEGIN_SECTION[2]END_SECTION[2]END_SECTIONS_INFORMATION";
static void Main(string[] args)
{
var withoutEnd = s.Split(new[] {"END_SECTIONS_INFORMATION"}, StringSplitOptions.RemoveEmptyEntries);
var SplitItems = withoutEnd[0].Split(new[] { "SECTION_GROUPNAME"}, StringSplitOptions.None).ToList();
SplitItems.RemoveAt(0); //the first part is just the introduction
var result = SplitItems.Select(x => "SECTION_GROUPNAME" + x);
}

Related

Need to refer to second to the last element of array of partial filenames

I need to find distinct values of partial filenames in an array of filenames. I'd like to do it in one line.
So, I have something like that as a filenames:
string[] filenames = {"aaa_ab12345.txt", "bbb_ab12345.txt", "aaa_ac12345.txt", "bbb_ac12345"}
and I need to find distinct values for ab12345 part of it.
So I currently have something like that:
string[] filenames_partial_distinct = Array.ConvertAll(
filenames,
file => System.IO.Path.GetFileNameWithoutExtension(file)
.Split({"_","."}, StringSplitOptions.RemoveEmptyEntries)[1]
)
.Distinct()
.ToArray();
Now, I'm getting filenames that are of form of aaa_bbb_ab12345.txt. So, instead of referring to the second part of the filename, I need to refer to the second to the last.
So, how do I refer to an arbitrary element based on length of array in one line, if it's a result of Split method? Something along lines of:
Array.ConvertAll(filenames, file=>file.Split(separator)[this.Length-2]).Distinct().ToArray();
In other words, if a string method results in an array of strings, how do I immediately select element based on the length of array:
String.Split()[third from end, fifth from end, etc.];

If you use GetFileNameWithoutExtension there will be no extension and therefore splitting by '_' will do it. Then you can take the last part with .Last().
string[] filenames_partial_distinct = Array.ConvertAll(
filenames,
file => Path.GetFileNameWithoutExtension(file).Split('_').Last()
)
.Distinct()
.ToArray();
With the input
string[] filenames = { "aaa_ab12345.txt", "bbb_ab12345.txt",
"aaa_ac12345.txt", "bbb_ac12345", "aaa_bbb_ab12345.txt" };
You get the result
{ "ab12345", "ac12345" }
The StringSplitOptions.RemoveEmptyEntries is only required if there are filenames ending with _ (before the extension).

Seems you're looking for something like this:
string[] arr = filenames.Select(n => n.Substring(n.IndexOf("_") + 1, 7)).Distinct().ToArray();

I usually defer problems like this to regex. They are very powerful. This approach also gives you the opportunity to detect unexpected cases and handle them appropriately.
Here is a crude example, assuming I understood your requirements:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string MyMatcher(string filename)
{
// this pattern may need work depending on what you need - it says
// extract that pattern between the "()" which is 2 characters and
// 4 digits, exactly; and can be found in `Groups[1]`.
Regex r = new Regex(#".*_(\w{2}\d{4}).*", RegexOptions.IgnoreCase);
Match m = r.Match(filename);
return m.Success
? m.Groups[1].ToString()
: null; // what should happen here?
}
string[] filenames =
{
"aaa_ab12345.txt",
"bbb_ab12345.txt",
"aaa_ac12345.txt",
"bbb_ac12345",
"aaa_bbb_ab12345.txt",
"ae12345.txt" // MyMatcher() return null for this - what should you do if this happens?
};
var results = filenames
.Select(MyMatcher)
.Distinct();
foreach (var result in results)
{
Console.WriteLine(result);
}
}
}
Gives:
ab1234
ac1234
This can be refined further, such as pre-compiled regex patterns, encapsulation in a class, etc.

Use everything before a specific character

So, I've been learning C# and I need to remove everything after the
":" character.
I've used a StreamReader to read the text file, but then I can't use the Split function, then I tried it by using an int function to import it, but then again I can't use the Split function?
What I want this to do is import a text file that's written like;
name:lastname
name2:lastname2
And so that it only shows name and name2.
I've been searching this for a couple of days but I can't seem to figure it out!
I don't know what I'm doing wrong and how to import the text file without using StreamReader or anything else.
Edit:
I'm trying to post something to a website that goes like;
example.com/q=(name without ":")
Edit 2:
StreamReader list = new StreamReader(#"list.txt");
string reader = list.ReadToEnd();
string[] split = reader.Split(":".ToCharArray());
Console.WriteLine(split);
gives output as;
System.String[]

You've got a few issues here. First, use File.ReadLines() instead of a StreamReader, its much simpler and easier:
IEnumerable<string> lines = File.ReadLines("path/to/file");
Next, your lines variable needs to be iterated so you can get to each line of the collection:
foreach (string line in lines)
{
//TODO: write split logic here
}
Then you have to split each line on the ':' character:
string[] split = line.Split(":");
Your split variable is an array of string (i.e string[]) which means you have to access a specific index of the array if you want to see its value. This is your second issue, if you pass split to Console.WriteLine() under the hood it just calls .ToString() on the object you have passed it, and with a string[] it won't automatically give you all the values, you have to write that yourself.
So if your line variable was: "name:Steve", the split variable would have two indexes and look like this:
//split[0] = "name"
//split[1] = "Steve"
I made a fiddle here that demonstrates.

I your file size small and your name:lastname in one line use:
var lines = File.ReadAllLines("filaPath");
foreach (var line in lines)
{
var array = line.Split(':');
if (array.Length > 0)
{
var name = array[0];
}
}
if name:lastname isn't in new line tell me how it's seprated

How can i analise millions of strings that merge into each other?

I have millions of strings, around 8GB worth of HEX; each string is 3.2kb in length.
Each of these strings contains multiple parts of data I need to extract.
This is an example of one such string:
GPGGA,104644.091,,,,,0,0,,,M,,M,,*43$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ$GPGGA,104645.091,,,,,0,0,,,M,,M,,*42$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ$GPGGA,104646.091,,,,,0,0,,,M,,M,,*41$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test2ÿÿ3ÿÿ4ÿÿ5ÿÿ6ÿÿ7ÿÿ8ÿÿ9ÿÿ:ÿÿ;ÿÿ<ÿÿ=ÿÿ>ÿÿ?ÿÿ#ÿÿAÿÿBÿÿCÿÿDÿÿEÿÿFÿÿGÿÿHÿÿIÿÿJÿÿ$GPGGA,104647.091,,,,,0,0,,,M,,M,,*40$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header TestKÿÿLÿÿMÿÿNÿÿOÿÿPÿÿQÿÿRÿÿSÿÿTÿÿUÿÿVÿÿWÿÿXÿÿYÿÿZÿÿ[ÿÿ\ÿÿ]ÿÿ^ÿÿ_ÿÿ`ÿÿaÿÿbÿÿcÿÿ$GPGGA,104648.091,,,,,0,0,,,M,,M,,*4F$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Testdÿÿeÿÿfÿÿgÿÿhÿÿiÿÿjÿÿkÿÿlÿÿmÿÿnÿÿoÿÿpÿÿqÿÿrÿÿsÿÿtÿÿuÿÿvÿÿwÿÿxÿÿyÿÿzÿÿ{ÿÿ|ÿÿ$GPGGA,104649.091,,,,,0,0,,,M,,M,,*4E$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test}ÿÿ~ÿÿ.ÿÿ€ÿÿ.ÿÿ‚ÿÿƒÿÿ„ÿÿ…ÿÿ†ÿÿ‡ÿÿˆÿÿ‰ÿÿŠÿÿ‹ÿÿŒÿÿ.ÿÿŽÿÿ.ÿÿ.ÿÿ‘ÿÿ’ÿÿ“ÿÿ”ÿÿ•ÿÿ$GPGGA,104650.091,,,,,0,0,,,M,,M,,*46$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Head
as you can see it is pretty much this repeated:
GPGGA,104644.091,,,,,0,0,,,M,,M,,*43$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ$GPGGA,104645.091,,,,,0,0,,,M,,M,,*42$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ
I want to separate this string into two lists like this:
_GPSList
$GPGGA,104644.091,,,,,0,0,,,M,,M,,*43
$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*
$GPVTG,0.00,T,,M,0.00,N,0.00,K,N
_WavList
32HeaderTest.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ
32HeaderTest.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ
Issue 1:
This repetition isn't containing within a single string, it overflows into the next string. so if some data crosses the end and start of two strings how to I deal with that?
Issue 2: How do I analyse the string and extract only the parts I need?

The solution I'm providing is not a complete answer but more like an idea which might help you get what you want.
Everything else which I present is an assumption on my behalf.
//Assuming your data is stored in a file "yourdatafile"
//Splitting all the text on "$" assuming this will separate GPSData
string[] splittedstring = File.ReadAllText("yourdatafile").Split('$');
//I found an extra string lingering in the sample you provided
//because I splitted on "$", so you gotta take that into account
var GPSList = new List<string>();
var WAVList = new List<string>();
foreach (var str in splittedstring)
{
//So if the string contains "Header" we would want to separate it from GPS data
if (str.Contains("Header"))
{
string temp = str.Remove(str.IndexOf("Header"));
int indexOfAsterisk = temp.LastIndexOf("*");
string stringBeforeAsterisk = str.Substring(0, indexOfAsterisk + 1);
string stringAfterAsterisk = str.Replace(stringBeforeAsterisk, "");
WAVList.Add(stringAfterAsterisk);
GPSList.Add("$" + stringBeforeAsterisk);
}
else
GPSList.Add("$" + str);
}
This provides the exact output as you need, only exception is with that extra string. Also some non-standard characters might look like black blocks.

grouping adjacent similar substrings

I am writing a program in which I want to group the adjacent substrings, e.g ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options I want to find the resultant string with the minimum length.
Here is code what i have written so far but not doing job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}

You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a python solution should do the trick, too.
So you have an input string ABCABCBC. And you want to transform this into an advanced variant of run length encoding (let's called it advanced RLE).
My idea consists of a general first idea onto which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
This input can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix (e.g. for the prefix ABC and the suffix ABCBC, the prefix is only one time at the beginning of the suffix, making a total of 2 ABC following each other. So, we can compact this prefix / suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix one time, there still is a suffix for most of the string combinations. E.g. in the output (2ABC, BC), there still is the string BC as suffix. So, need to find the shortest representation for this remaining suffix. Wooo, we still have a function for this called shortest_repr, so let's just call this onto the remaining suffix.
This image displays how this recursion works (I only expanded one of the node after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr to the string ABABB (I selected a shorter sample for the image). Then, we split this string at all possible split positions and get a list of prefix / suffix pairs in the second row. On each of the elements of this list we first call the prefix/suffix optimization (shorten_prefix) and retrieve a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffix, we call our recursion function shortest_repr.
I did not display the upward-direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
When the result of the call to shortest_repr was received inside our loop, we just select the shortest string inside the loop and return this.
This is some quickly hacked code that does the trick:
def shorten_beginning(beginning, ending):
count = 1
while ending.startswith(beginning):
count += 1
ending = ending[len(beginning):]
return str(count) + beginning, ending
def find_shortest_repr(string):
possible_variants = []
if not string:
return ''
for i in range(1, len(string) + 1):
beginning = string[:i]
ending = string[i:]
shortened, new_ending = shorten_beginning(beginning, ending)
shortest_ending = find_shortest_repr(new_ending)
possible_variants.append(shortened + shortest_ending)
return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the recursive levenshtein distance calculation. It calculates the same suffices multiple times. So, it would be a nice exercise to try to implement this with dynamic programming.

If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(#"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional if you use it more than once
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

C#: Getting Substring between two different Delimiters

I have problems splitting this Line. I want to get each String between "#VAR;" and "#ENDVAR;". So at the End, there should be a output of:
Variable=Speed;Value=Fast;
Variable=Fabricator;Value=Freescale;Op==;
Later I will separate each Substring, using ";" as a delimiter but that I guess wont be that hard. This is how a line looks like:
#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;
I tried some split-options, but most of the time I just get an empty string. I also tried a Regex. But either the Regex was wrong or it wasnt suitable to my String. Probably its wrong, at school we learnt Regex different then its used in C#, so I was confused while implementing.
Regex.Match(t, #"/#VAR([a-z=a-z]*)/#ENDVAR")
Edit:
One small question: I am iterating over many lines like the one in the question. I use NoIdeas code on the line to get it in shape. The next step would be to print it as a Text-File. To print an Array I would have to loop over it. But in every iteration, when I get a new line, I overwrite the Array with the current splitted string. I put the Rest of my code in the question, would be nice if someone could help me.
string[] w ;
foreach (EA.Element theObjects in myPackageObject.Elements)
{
theObjects.Type = "Object";
foreach (EA.Element theElements in PackageHW.Elements)
{
if (theObjects.ClassfierID == theElements.ElementID)
{
t = theObjects.RunState;
w = t.Replace("#ENDVAR;", "#VAR;").Replace("#VAR;", ";").Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in w)
{
tw2.WriteLine(s);
}
}
}
}
The piece with the foreach-loop is wrong pretty sure. I need something to print each splitted t. Thanks in advance.

you can do it without regex using
str.Replace("#ENDVAR;", "#VAR;")
.Split(new string[] { "#VAR;" }, StringSplitOptions.RemoveEmptyEntries);
and if you want to save time you can do:
str.Replace("#ENDVAR;", "#VAR;")
.Replace("#VAR;", ";")
.Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);

You can use a look ahead assertion here.
#VAR;(.*?)(?=#ENDVAR)
If your string never consists of whitespace between #VAR; and #ENDVAR; you could use the below line, this will not match empty instances of your lines.
#VAR;([^\s]+)(?=#ENDVAR)
See this demo

Answer using raw string manipulation.
IEnumerable<string> StuffFoundInside(string biggerString)
{
var closeDelimeterIndex = 0;
do
{
int openDelimeterIndex = biggerString.IndexOf("#VAR;", startingIndex);
if (openDelimeterIndex != -1)
{
closeDelimeterIndex = biggerString.IndexOf("#ENDVAR;", openDelimeterIndex);
if (closeDelimiterIndex != -1)
{
yield return biggerString.Substring(openDelimeterIndex, closeDelimeterIndex - openDelimiterIndex);
}
}
} while (closeDelimeterIndex != -1);
}
Making a list and adding each item to the list then returning the list might be faster, depending on how the code using this code would work. This allows it to terminate early, but has the coroutine overhead.

Use this regex:
(?i)#VAR;(.+?)#ENDVAR;
Group 1 in each match will be your line content.

(If you don't like regexs)
Code:
var s = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
var tokens = s.Split(new String [] {"#ENDVAR;#VAR;"}, StringSplitOptions.None);
foreach (var t in tokens)
{
var st = t.Replace("#VAR;", "").Replace("#ENDVAR;", "");
Console.WriteLine(st);
}
Output:
Variable=Speed;Value=Fast;Op==;
Variable=Fabricator;Value=Freescale;Op==;

Regex.Split works well but yields empty entries that have to be removed as shown here:
string[] result = Regex.Split(input, #"#\w+;")
.Where(s => s != "")
.ToArray();

I tried some split-options, but most of the time I just get an empty string.
In this case the requirements seem to be simpler than you're stating. Simply splitting and using linq will do your whole operation in one statement:
string test = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
List<List<string>> strings = (from s in test.Split(new string[]{"#VAR;",";#ENDVAR;"},StringSplitOptions.RemoveEmptyEntries)
let s1 = s.Split(new char[]{';'},StringSplitOptions.RemoveEmptyEntries).ToList<string>()
select (s1)).ToList<List<string>>();
the outpout is:
?strings[0]
Count = 3
[0]: "Variable=Speed"
[1]: "Value=Fast"
[2]: "Op=="
?strings[1]
Count = 3
[0]: "Variable=Fabricator"
[1]: "Value=Freescale"
[2]: "Op=="
To write the data to a file something like this will work:
foreach (List<string> s in strings)
{
System.IO.File.AppendAllLines("textfile1.txt", s);
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Group delimited text by sections using known keys - c#

Related

Need to refer to second to the last element of array of partial filenames

Use everything before a specific character

How can i analise millions of strings that merge into each other?

grouping adjacent similar substrings

C#: Getting Substring between two different Delimiters

Categories

Resources