I need to find distinct values of partial filenames in an array of filenames. I'd like to do it in one line.
So, I have something like that as a filenames:
string[] filenames = {"aaa_ab12345.txt", "bbb_ab12345.txt", "aaa_ac12345.txt", "bbb_ac12345"}
and I need to find distinct values for ab12345 part of it.
So I currently have something like that:
string[] filenames_partial_distinct = Array.ConvertAll(
filenames,
file => System.IO.Path.GetFileNameWithoutExtension(file)
.Split({"_","."}, StringSplitOptions.RemoveEmptyEntries)[1]
)
.Distinct()
.ToArray();
Now, I'm getting filenames that are of form of aaa_bbb_ab12345.txt. So, instead of referring to the second part of the filename, I need to refer to the second to the last.
So, how do I refer to an arbitrary element based on length of array in one line, if it's a result of Split method? Something along lines of:
Array.ConvertAll(filenames, file=>file.Split(separator)[this.Length-2]).Distinct().ToArray();
In other words, if a string method results in an array of strings, how do I immediately select element based on the length of array:
String.Split()[third from end, fifth from end, etc.];
If you use GetFileNameWithoutExtension there will be no extension and therefore splitting by '_' will do it. Then you can take the last part with .Last().
string[] filenames_partial_distinct = Array.ConvertAll(
filenames,
file => Path.GetFileNameWithoutExtension(file).Split('_').Last()
)
.Distinct()
.ToArray();
With the input
string[] filenames = { "aaa_ab12345.txt", "bbb_ab12345.txt",
"aaa_ac12345.txt", "bbb_ac12345", "aaa_bbb_ab12345.txt" };
You get the result
{ "ab12345", "ac12345" }
The StringSplitOptions.RemoveEmptyEntries is only required if there are filenames ending with _ (before the extension).
Seems you're looking for something like this:
string[] arr = filenames.Select(n => n.Substring(n.IndexOf("_") + 1, 7)).Distinct().ToArray();
I usually defer problems like this to regex. They are very powerful. This approach also gives you the opportunity to detect unexpected cases and handle them appropriately.
Here is a crude example, assuming I understood your requirements:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string MyMatcher(string filename)
{
// this pattern may need work depending on what you need - it says
// extract that pattern between the "()" which is 2 characters and
// 4 digits, exactly; and can be found in `Groups[1]`.
Regex r = new Regex(#".*_(\w{2}\d{4}).*", RegexOptions.IgnoreCase);
Match m = r.Match(filename);
return m.Success
? m.Groups[1].ToString()
: null; // what should happen here?
}
string[] filenames =
{
"aaa_ab12345.txt",
"bbb_ab12345.txt",
"aaa_ac12345.txt",
"bbb_ac12345",
"aaa_bbb_ab12345.txt",
"ae12345.txt" // MyMatcher() return null for this - what should you do if this happens?
};
var results = filenames
.Select(MyMatcher)
.Distinct();
foreach (var result in results)
{
Console.WriteLine(result);
}
}
}
Gives:
ab1234
ac1234
This can be refined further, such as pre-compiled regex patterns, encapsulation in a class, etc.
I have millions of strings, around 8GB worth of HEX; each string is 3.2kb in length.
Each of these strings contains multiple parts of data I need to extract.
This is an example of one such string:
GPGGA,104644.091,,,,,0,0,,,M,,M,,*43$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ$GPGGA,104645.091,,,,,0,0,,,M,,M,,*42$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ$GPGGA,104646.091,,,,,0,0,,,M,,M,,*41$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test2ÿÿ3ÿÿ4ÿÿ5ÿÿ6ÿÿ7ÿÿ8ÿÿ9ÿÿ:ÿÿ;ÿÿ<ÿÿ=ÿÿ>ÿÿ?ÿÿ#ÿÿAÿÿBÿÿCÿÿDÿÿEÿÿFÿÿGÿÿHÿÿIÿÿJÿÿ$GPGGA,104647.091,,,,,0,0,,,M,,M,,*40$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header TestKÿÿLÿÿMÿÿNÿÿOÿÿPÿÿQÿÿRÿÿSÿÿTÿÿUÿÿVÿÿWÿÿXÿÿYÿÿZÿÿ[ÿÿ\ÿÿ]ÿÿ^ÿÿ_ÿÿ`ÿÿaÿÿbÿÿcÿÿ$GPGGA,104648.091,,,,,0,0,,,M,,M,,*4F$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Testdÿÿeÿÿfÿÿgÿÿhÿÿiÿÿjÿÿkÿÿlÿÿmÿÿnÿÿoÿÿpÿÿqÿÿrÿÿsÿÿtÿÿuÿÿvÿÿwÿÿxÿÿyÿÿzÿÿ{ÿÿ|ÿÿ$GPGGA,104649.091,,,,,0,0,,,M,,M,,*4E$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test}ÿÿ~ÿÿ.ÿÿ€ÿÿ.ÿÿ‚ÿÿƒÿÿ„ÿÿ…ÿÿ†ÿÿ‡ÿÿˆÿÿ‰ÿÿŠÿÿ‹ÿÿŒÿÿ.ÿÿŽÿÿ.ÿÿ.ÿÿ‘ÿÿ’ÿÿ“ÿÿ”ÿÿ•ÿÿ$GPGGA,104650.091,,,,,0,0,,,M,,M,,*46$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Head
as you can see it is pretty much this repeated:
GPGGA,104644.091,,,,,0,0,,,M,,M,,*43$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ$GPGGA,104645.091,,,,,0,0,,,M,,M,,*42$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ
I want to separate this string into two lists like this:
_GPSList
$GPGGA,104644.091,,,,,0,0,,,M,,M,,*43
$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*
$GPVTG,0.00,T,,M,0.00,N,0.00,K,N
_WavList
32HeaderTest.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ
32HeaderTest.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ
Issue 1:
This repetition isn't containing within a single string, it overflows into the next string. so if some data crosses the end and start of two strings how to I deal with that?
Issue 2: How do I analyse the string and extract only the parts I need?
The solution I'm providing is not a complete answer but more like an idea which might help you get what you want.
Everything else which I present is an assumption on my behalf.
//Assuming your data is stored in a file "yourdatafile"
//Splitting all the text on "$" assuming this will separate GPSData
string[] splittedstring = File.ReadAllText("yourdatafile").Split('$');
//I found an extra string lingering in the sample you provided
//because I splitted on "$", so you gotta take that into account
var GPSList = new List<string>();
var WAVList = new List<string>();
foreach (var str in splittedstring)
{
//So if the string contains "Header" we would want to separate it from GPS data
if (str.Contains("Header"))
{
string temp = str.Remove(str.IndexOf("Header"));
int indexOfAsterisk = temp.LastIndexOf("*");
string stringBeforeAsterisk = str.Substring(0, indexOfAsterisk + 1);
string stringAfterAsterisk = str.Replace(stringBeforeAsterisk, "");
WAVList.Add(stringAfterAsterisk);
GPSList.Add("$" + stringBeforeAsterisk);
}
else
GPSList.Add("$" + str);
}
This provides the exact output as you need, only exception is with that extra string. Also some non-standard characters might look like black blocks.
I am writing a program in which I want to group the adjacent substrings, e.g ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options I want to find the resultant string with the minimum length.
Here is code what i have written so far but not doing job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}
You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a python solution should do the trick, too.
So you have an input string ABCABCBC. And you want to transform this into an advanced variant of run length encoding (let's called it advanced RLE).
My idea consists of a general first idea onto which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
This input can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix (e.g. for the prefix ABC and the suffix ABCBC, the prefix is only one time at the beginning of the suffix, making a total of 2 ABC following each other. So, we can compact this prefix / suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix one time, there still is a suffix for most of the string combinations. E.g. in the output (2ABC, BC), there still is the string BC as suffix. So, need to find the shortest representation for this remaining suffix. Wooo, we still have a function for this called shortest_repr, so let's just call this onto the remaining suffix.
This image displays how this recursion works (I only expanded one of the node after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr to the string ABABB (I selected a shorter sample for the image). Then, we split this string at all possible split positions and get a list of prefix / suffix pairs in the second row. On each of the elements of this list we first call the prefix/suffix optimization (shorten_prefix) and retrieve a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffix, we call our recursion function shortest_repr.
I did not display the upward-direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
When the result of the call to shortest_repr was received inside our loop, we just select the shortest string inside the loop and return this.
This is some quickly hacked code that does the trick:
def shorten_beginning(beginning, ending):
count = 1
while ending.startswith(beginning):
count += 1
ending = ending[len(beginning):]
return str(count) + beginning, ending
def find_shortest_repr(string):
possible_variants = []
if not string:
return ''
for i in range(1, len(string) + 1):
beginning = string[:i]
ending = string[i:]
shortened, new_ending = shorten_beginning(beginning, ending)
shortest_ending = find_shortest_repr(new_ending)
possible_variants.append(shortened + shortest_ending)
return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the recursive levenshtein distance calculation. It calculates the same suffices multiple times. So, it would be a nice exercise to try to implement this with dynamic programming.
If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(#"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional if you use it more than once
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)
I have problems splitting this Line. I want to get each String between "#VAR;" and "#ENDVAR;". So at the End, there should be a output of:
Variable=Speed;Value=Fast;
Variable=Fabricator;Value=Freescale;Op==;
Later I will separate each Substring, using ";" as a delimiter but that I guess wont be that hard. This is how a line looks like:
#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;
I tried some split-options, but most of the time I just get an empty string. I also tried a Regex. But either the Regex was wrong or it wasnt suitable to my String. Probably its wrong, at school we learnt Regex different then its used in C#, so I was confused while implementing.
Regex.Match(t, #"/#VAR([a-z=a-z]*)/#ENDVAR")
Edit:
One small question: I am iterating over many lines like the one in the question. I use NoIdeas code on the line to get it in shape. The next step would be to print it as a Text-File. To print an Array I would have to loop over it. But in every iteration, when I get a new line, I overwrite the Array with the current splitted string. I put the Rest of my code in the question, would be nice if someone could help me.
string[] w ;
foreach (EA.Element theObjects in myPackageObject.Elements)
{
theObjects.Type = "Object";
foreach (EA.Element theElements in PackageHW.Elements)
{
if (theObjects.ClassfierID == theElements.ElementID)
{
t = theObjects.RunState;
w = t.Replace("#ENDVAR;", "#VAR;").Replace("#VAR;", ";").Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in w)
{
tw2.WriteLine(s);
}
}
}
}
The piece with the foreach-loop is wrong pretty sure. I need something to print each splitted t. Thanks in advance.
you can do it without regex using
str.Replace("#ENDVAR;", "#VAR;")
.Split(new string[] { "#VAR;" }, StringSplitOptions.RemoveEmptyEntries);
and if you want to save time you can do:
str.Replace("#ENDVAR;", "#VAR;")
.Replace("#VAR;", ";")
.Split(new string[] { ";" }, StringSplitOptions.RemoveEmptyEntries);
You can use a look ahead assertion here.
#VAR;(.*?)(?=#ENDVAR)
If your string never consists of whitespace between #VAR; and #ENDVAR; you could use the below line, this will not match empty instances of your lines.
#VAR;([^\s]+)(?=#ENDVAR)
See this demo
Answer using raw string manipulation.
IEnumerable<string> StuffFoundInside(string biggerString)
{
var closeDelimeterIndex = 0;
do
{
int openDelimeterIndex = biggerString.IndexOf("#VAR;", startingIndex);
if (openDelimeterIndex != -1)
{
closeDelimeterIndex = biggerString.IndexOf("#ENDVAR;", openDelimeterIndex);
if (closeDelimiterIndex != -1)
{
yield return biggerString.Substring(openDelimeterIndex, closeDelimeterIndex - openDelimiterIndex);
}
}
} while (closeDelimeterIndex != -1);
}
Making a list and adding each item to the list then returning the list might be faster, depending on how the code using this code would work. This allows it to terminate early, but has the coroutine overhead.
Use this regex:
(?i)#VAR;(.+?)#ENDVAR;
Group 1 in each match will be your line content.
(If you don't like regexs)
Code:
var s = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
var tokens = s.Split(new String [] {"#ENDVAR;#VAR;"}, StringSplitOptions.None);
foreach (var t in tokens)
{
var st = t.Replace("#VAR;", "").Replace("#ENDVAR;", "");
Console.WriteLine(st);
}
Output:
Variable=Speed;Value=Fast;Op==;
Variable=Fabricator;Value=Freescale;Op==;
Regex.Split works well but yields empty entries that have to be removed as shown here:
string[] result = Regex.Split(input, #"#\w+;")
.Where(s => s != "")
.ToArray();
I tried some split-options, but most of the time I just get an empty string.
In this case the requirements seem to be simpler than you're stating. Simply splitting and using linq will do your whole operation in one statement:
string test = "#VAR;Variable=Speed;Value=Fast;Op==;#ENDVAR;#VAR;Variable=Fabricator;Value=Freescale;Op==;#ENDVAR;";
List<List<string>> strings = (from s in test.Split(new string[]{"#VAR;",";#ENDVAR;"},StringSplitOptions.RemoveEmptyEntries)
let s1 = s.Split(new char[]{';'},StringSplitOptions.RemoveEmptyEntries).ToList<string>()
select (s1)).ToList<List<string>>();
the outpout is:
?strings[0]
Count = 3
[0]: "Variable=Speed"
[1]: "Value=Fast"
[2]: "Op=="
?strings[1]
Count = 3
[0]: "Variable=Fabricator"
[1]: "Value=Freescale"
[2]: "Op=="
To write the data to a file something like this will work:
foreach (List<string> s in strings)
{
System.IO.File.AppendAllLines("textfile1.txt", s);
}