Extracting unique and non-unique strings to separate output files

I am having trouble extracting two things from a test file: the lines that are not duplicated, and the lines that are. The input file contains both duplicate and non-duplicate lines.
I have created a logging function and I can extract all distinct lines from it to a separate file, but that output mixes lines that had duplicates with lines that did not. I need to separate the two.
This is what I have so far:
static void Dupes(string path1, string path2)
{
    string log = "log.txt";
    var sr = new StreamReader(File.OpenRead(path1));
    var sw = new StreamWriter(File.OpenWrite(path2));
    var lines = new HashSet<int>();
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        int hc = line.GetHashCode();
        if (lines.Contains(hc))
            continue;
        lines.Add(hc);
        sw.WriteLine(line);
    }
    sw.Close();
}
Ideally this would be in two functions, so they can be called to perform different actions on the output contents.

Use LINQ to group the lines, then check each group's count:
var lines = File.ReadAllLines(path1);
var distincts = lines.GroupBy(l => l)
.Where(l => l.Count() == 1)
.Select(l => l.Key)
.ToList();
var dupes = lines.Except(distincts).ToList();
It's worth noting that Except doesn't return duplicates - something I just learned. So no need to call Distinct afterwards.
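Since the question asks for two separate functions, the grouping idea above could be wrapped up roughly like this. It is only a sketch: the method names and the temporary file names are made up for illustration, and it assumes the input and output paths are different files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineSplitter
{
    // Writes the lines that occur exactly once in inputPath to outputPath.
    public static void WriteUniqueLines(string inputPath, string outputPath)
    {
        var unique = File.ReadLines(inputPath)
                         .GroupBy(line => line)
                         .Where(g => g.Count() == 1)
                         .Select(g => g.Key);
        File.WriteAllLines(outputPath, unique);
    }

    // Writes one copy of every line that occurs more than once.
    public static void WriteDuplicatedLines(string inputPath, string outputPath)
    {
        var duplicated = File.ReadLines(inputPath)
                             .GroupBy(line => line)
                             .Where(g => g.Count() > 1)
                             .Select(g => g.Key);
        File.WriteAllLines(outputPath, duplicated);
    }

    public static void Main()
    {
        // Placeholder file names in the temp directory, used only for this demo.
        string dir = Path.GetTempPath();
        string input = Path.Combine(dir, "lines_input.txt");
        string uniqueOut = Path.Combine(dir, "lines_unique.txt");
        string dupesOut = Path.Combine(dir, "lines_dupes.txt");

        File.WriteAllLines(input, new[] { "a", "b", "a", "c", "b", "d" });
        WriteUniqueLines(input, uniqueOut);
        WriteDuplicatedLines(input, dupesOut);

        Console.WriteLine(string.Join(",", File.ReadAllLines(uniqueOut))); // c,d
        Console.WriteLine(string.Join(",", File.ReadAllLines(dupesOut)));  // a,b
    }
}
```

GroupBy yields groups in the order their keys first appear, so both output files keep the original relative order of the lines.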

You can do it as follows:
var lines = File.ReadAllLines(path1);
var countLines = lines.Select(d => new
{
Line = d,
Count = lines.Count(f => f == d),
});
var UniqueLines = countLines.Where(d => d.Count == 1).Select(d => d.Line);
var NotUniqueLines = countLines.Where(d => d.Count > 1).Select(d => d.Line);
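One thing to be aware of with the snippet above: lines.Count(...) rescans the whole array for every line, and NotUniqueLines contains a duplicated line once per occurrence. If that matters for a large file, a single counting pass with a dictionary is one possible alternative. This is a sketch only; CountLines is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineCounter
{
    // Counts how often each line occurs in one pass over the input.
    public static Dictionary<string, int> CountLines(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            int seen;
            counts.TryGetValue(line, out seen); // seen stays 0 for a new line
            counts[line] = seen + 1;
        }
        return counts;
    }

    public static void Main()
    {
        var lines = new[] { "a", "b", "a", "c", "b", "a" };
        Dictionary<string, int> counts = CountLines(lines);

        // Sorted only to make the printed order deterministic.
        var uniqueLines = counts.Where(p => p.Value == 1)
                                .Select(p => p.Key)
                                .OrderBy(k => k, StringComparer.Ordinal);
        var notUniqueLines = counts.Where(p => p.Value > 1)
                                   .Select(p => p.Key)
                                   .OrderBy(k => k, StringComparer.Ordinal);

        Console.WriteLine(string.Join(",", uniqueLines));    // c
        Console.WriteLine(string.Join(",", notUniqueLines)); // a,b
    }
}
```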

Related

How to find maximum number of repeated string in a string in a list of string in c#

If we have a list of strings, how can we use LINQ to find the strings that contain the highest number of occurrences of a given symbol?
List <string> mylist=new List <string>();
mylist.Add("%1");
mylist.Add("%136%250%3"); //s0
mylist.Add("%1%5%20%1%10%50%8%3"); // s1
mylist.Add("%4%255%20%1%14%50%8%4"); // s2
string symbol="%";
List <string> List_has_MAX_num_of_symbol= mylist.OrderByDescending(s => s.Length ==max_num_of(symbol)).ToList();
//the result should be a list of s1 + s2 since they have **8** repeated '%'
I tried
var longest = mylist.Where(s => s.Length == mylist.Max(m => m.Length)) ;
but this gives me only one string, not both.

Here's a very simple solution, but not exactly efficient. Every element has the Count operation performed twice...
List<string> mylist = new List<string>();
mylist.Add("%1");
mylist.Add("%136%250%3"); //s0
mylist.Add("%1%5%20%1%10%50%8%3"); // s1
mylist.Add("%4%255%20%1%14%50%8%4"); // s2
char symbol = '%';
var maxRepeat = mylist.Max(item => item.Count(c => c == symbol));
var longest = mylist.Where(item => item.Count(c => c == symbol) == maxRepeat);
It will return 2 strings:
"%1%5%20%1%10%50%8%3"
"%4%255%20%1%14%50%8%4"

Here is an implementation that depends upon SortedDictionary<,> to get what you're after.
var mylist = new List<string> {"%1", "%136%250%3", "%1%5%20%1%10%50%8%3", "%4%255%20%1%14%50%8%4"};
var mappedValues = new SortedDictionary<int, IList<string>>();
mylist.ForEach(str =>
{
var count = str.Count(c => c == '%');
if (mappedValues.ContainsKey(count))
{
mappedValues[count].Add(str);
}
else
{
mappedValues[count] = new List<string> { str };
}
});
// print the strings with the highest count to check the result
foreach (var str in mappedValues.Last().Value)
{
    Console.WriteLine(str);
}
Here's one using LINQ that gets the result you're after.
var result = (from str in mylist
              group str by str.Count(c => c == '%')
              into g
              orderby g.Key // sort the groups by count so the last one holds the maximum
              select new
              {
                  Count = g.Key,
                  List = g.ToList()
              }).LastOrDefault();

OK, here's my answer:
char symbol = '%';
var recs = mylist.Select(s => new { Str = s, Count = s.Count(c => c == symbol) });
var maxCount = recs.Max(x => x.Count);
var longest = recs.Where(x => x.Count == maxCount).Select(x => x.Str).ToList();
It is complicated because it has three lines (the char symbol = '%'; line excluded), but it counts each string only once. EZI's answer has only two lines, but it is complicated because it counts each string twice. If you really want a one-liner, here it is:
var longest = mylist.Where(x => x.Count(c => c == symbol) == mylist.Max(y => y.Count(c => c == symbol))).ToList();
but it counts each string many times. You can choose whatever complexity you want.

We can't assume that the % is always going to be the most repeated character in your list. First, we have to determine what character appears the most in an individual string for each string.
Once we have the character and its maximum occurrence count, we can apply LINQ to the List<string> and grab the strings in which that character occurs exactly that many times.
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
public static void Main()
{
List <string> mylist=new List <string>();
mylist.Add("%1");
mylist.Add("%136%250%3");
mylist.Add("%1%5%20%1%10%50%8%3");
mylist.Add("%4%255%20%1%14%50%8%4");
// Determine what character appears most in a single string in the list
char maxCharacter = ' ';
int maxCount = 0;
foreach (string item in mylist)
{
// Get the max occurrence of each character
int max = item.Max(m => item.Count(c => c == m));
if (max > maxCount)
{
maxCount = max;
// Store the character whose occurrence equals the max
maxCharacter = item.Select(c => c).Where(c => item.Count(i => i == c) == max).First();
}
}
// Print the strings containing the max character
mylist.Where(item => item.Count(c => c == maxCharacter) == maxCount)
.ToList().ForEach(Console.WriteLine);
}
}
Results:
%1%5%20%1%10%50%8%3
%4%255%20%1%14%50%8%4

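var newList = mylist.MaxBy(x => x.Count(y => y == '%')).ToList();
This should work if you have a MaxBy extension that returns every element tied for the maximum (some third-party libraries, such as MoreLINQ, provide one). Note that the MaxBy built into .NET 6 and later returns a single element, so it would give you only one of the two strings. Please correct the syntax if it is wrong anywhere, and update here too if it works for you.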
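If no such extension is available, a small one is easy to write by hand. The sketch below is not taken from any library: AllMaxBy is a hypothetical name, and it is restricted to int keys to keep it short. It evaluates the key once per element and keeps every tie.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MaxExtensions
{
    // Hypothetical helper: returns every element whose key equals the maximum key.
    public static List<T> AllMaxBy<T>(this IEnumerable<T> source, Func<T, int> key)
    {
        var best = new List<T>();
        int bestKey = int.MinValue;
        foreach (T item in source)
        {
            int k = key(item); // evaluated once per element
            if (k > bestKey)
            {
                best.Clear();  // a new maximum invalidates the earlier ties
                bestKey = k;
            }
            if (k == bestKey)
            {
                best.Add(item);
            }
        }
        return best;
    }

    public static void Main()
    {
        var mylist = new List<string>
        {
            "%1", "%136%250%3", "%1%5%20%1%10%50%8%3", "%4%255%20%1%14%50%8%4"
        };
        // Prints the two strings that contain eight '%' characters.
        foreach (string s in mylist.AllMaxBy(s => s.Count(c => c == '%')))
        {
            Console.WriteLine(s);
        }
    }
}
```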

I am getting memory exception "System.IO.out of exception" error

The code works fine for small directories, but it throws this error when the files in the directory are large.
My code :
IEnumerable<string> textLines =
    Directory.GetFiles(@"C:\Users\karansha\Desktop\watson_query\", "*.*")
        .Select(filePath => File.ReadAllLines(filePath))
        .SelectMany(line => line)
        .Where(line => !line.Contains("appGUID: null"))
        .ToList();
List<string> users = new List<string>();
textLines.ToList().ForEach(textLine =>
{
    Regex regex = new Regex(@"User:\s*(?<username>[^\s]+)");
    MatchCollection matches = regex.Matches(textLine);
    foreach (Match match in matches)
    {
        var user = match.Groups["username"].Value;
        if (!users.Contains(user))
            users.Add(user);
    }
});
int numberOfUsers = users.Count(name => name.Length <= 10);
Console.WriteLine("Unique_Users_Express=" + numberOfUsers);

I would use Directory.EnumerateFiles and File.ReadLines since they are less memory hungry. They stream their results lazily, much like a StreamReader, whereas Directory.GetFiles and File.ReadAllLines load everything into memory first:
var matchingLines = Directory.EnumerateFiles(@"C:\Users\karansha\Desktop\watson_query\", "*.*")
    .SelectMany(fn => File.ReadLines(fn))
    // keep only the lines that do NOT contain the marker, as in the question
    .Where(l => l.IndexOf("appGUID: null", StringComparison.InvariantCultureIgnoreCase) < 0);
foreach (var line in matchingLines)
{
    Regex regex = new Regex(@"User:\s*(?<username>[^\s]+)");
    // etc pp ...
}
You also don't need to create the List<string> for all the lines again. Just enumerate the query with foreach (textLines.ToList creates a third collection, which is also redundant).

Try the following code. It uses ReadLines, which reads a file line by line instead of loading the entire file into memory. It also uses a HashSet to store the unique results of matching the regular expression.
Regex regex = new Regex(@"User:\s*(?<username>[^\s]+)");
IEnumerable<string> textLines =
    Directory.GetFiles(@"C:\Users\karansha\Desktop\watson_query\", "*.*")
        .Select(filePath => File.ReadLines(filePath))
        .SelectMany(line => line)
        .Where(line => !line.Contains("appGUID: null"));
HashSet<string> users = new HashSet<string>(
textLines.SelectMany(line => regex.Matches(line).Cast<Match>())
.Select(match => match.Groups["username"].Value)
);
int numberOfUsers = users.Count(name => name.Length <= 10);
Console.WriteLine("Unique_Users_Express=" + numberOfUsers);
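Putting the two suggestions together (lazy file enumeration plus a HashSet), a self-contained version might look like the sketch below. The class and method names are invented for illustration, and the demo writes a tiny made-up log file into a temporary directory instead of reading the path from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class UserCounter
{
    private static readonly Regex UserPattern = new Regex(@"User:\s*(?<username>\S+)");

    // Streams over the lines once and collects the distinct user names,
    // skipping every line that contains "appGUID: null".
    public static HashSet<string> ExtractUsers(IEnumerable<string> lines)
    {
        var users = new HashSet<string>();
        foreach (string line in lines)
        {
            if (line.Contains("appGUID: null"))
                continue;
            foreach (Match match in UserPattern.Matches(line))
                users.Add(match.Groups["username"].Value);
        }
        return users;
    }

    public static void Main()
    {
        // Fresh temporary directory with made-up sample data.
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        File.WriteAllLines(Path.Combine(dir, "sample.log"), new[]
        {
            "User: alice appGUID: 1",
            "User: bob appGUID: null",
            "User: alice appGUID: 2",
            "User: bartholomew-longname appGUID: 3"
        });

        // Both calls are lazy: only one line is held in memory at a time.
        IEnumerable<string> lines = Directory.EnumerateFiles(dir, "*.*")
                                             .SelectMany(path => File.ReadLines(path));
        HashSet<string> users = ExtractUsers(lines);

        Console.WriteLine("Unique_Users_Express=" + users.Count(name => name.Length <= 10)); // 1
        Directory.Delete(dir, true);
    }
}
```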

Sorting a string based on prefixes

If you are given an array with random prefixes, like this:
DOG_BOB
CAT_ROB
DOG_DANNY
MOUSE_MICKEY
DOG_STEVE
HORSE_NEIGH
CAT_RUDE
HORSE_BOO
MOUSE_STUPID
How would I go about sorting this so that I end up with 4 different arrays/lists of strings?
So the end result would give me 4 string ARRAYS or lists with
DOG_BOB,DOG_DANNY,DOG_STEVE <-- Array 1
HORSE_NEIGH, HORSE_BOO <-- Array 2
MOUSE_MICKEY, MOUSE_STUPID <-- Array 3
CAT_RUDE, CAT_ROB <-- Array 4
sorry about the names i just made them up lol
var fieldNames = typeof(animals).GetFields()
.Select(field => field.Name)
.ToList();
List<string> cats = new List<string>();
List<string> dogs = new List<string>();
List<string> mice= new List<string>();
List<string> horse = new List<string>();
foreach (var n in fieldNames)
{
    var fieldValues = typeof(animals).GetField(n).GetValue(n);
    // Here's what I'm trying to do, with if statements
    if (n.ToString().ToLower().Contains("horse"))
    {
    }
}
So I need them to be split into STRING ARRAYS/STRING LISTS and NOT just strings.

string[] strings = new string[] {
"DOG_BOB",
"CAT_ROB",
"DOG_DANNY",
"MOUSE_MICKEY",
"DOG_STEVE",
"HORSE_NEIGH",
"CAT_RUDE",
"HORSE_BOO",
"MOUSE_STUPID"};
string[] results = strings.GroupBy(s => s.Split('_')[0])
.Select(g => String.Join(",",g))
.ToArray();
Or maybe something like this
List<List<string>> res = strings.ToLookup(s => s.Split('_')[0], s => s)
.Select(g => g.ToList())
.ToList();

var groups = fieldNames.GroupBy(n => n.Split('_')[0]);
Usage:
foreach (var group in groups)
{
    // group.Key is the prefix (DOG, HORSE, CAT, etc.)
    foreach (var name in group)
    {
        // every name that shares this prefix
    }
}

foreach (String s in strings)
{
    if (s.StartsWith("CAT_"))
        cats.Add(s);
    else if (s.StartsWith("HORSE_"))
        horses.Add(s);
    // ...
}
Or:
foreach (String s in strings)
{
    String[] split = s.Split(new Char[] { '_' });
    if (split[0].Equals("CAT"))
        cats.Add(s);
    else if (split[0].Equals("HORSE"))
        horses.Add(s);
    // ...
}
But I would prefer the first one.

Algorithmically, I'd do the following:
1. Parse out all unique prefixes by using the "_" as your delimiter.
2. Loop through your list of prefixes.
2a. Retrieve any values that have your prefix (loop/find/regex/depends on structure)
2b. Place retrieved values in a List.
2c. Sort list.
3. Output your results, or do what you need with your collections.
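The steps above could be sketched like this. It is just one way to follow the outline, under the assumption that every item contains an underscore; SplitByPrefix is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixSorter
{
    public static Dictionary<string, List<string>> SplitByPrefix(IEnumerable<string> items)
    {
        List<string> all = items.ToList();

        // Step 1: parse out the unique prefixes.
        List<string> prefixes = all.Select(s => s.Split('_')[0]).Distinct().ToList();

        var result = new Dictionary<string, List<string>>();
        // Step 2: loop through the prefixes.
        foreach (string prefix in prefixes)
        {
            // 2a + 2b: collect the values that start with "<prefix>_".
            List<string> group = all
                .Where(s => s.StartsWith(prefix + "_", StringComparison.Ordinal))
                .ToList();
            // 2c: sort the list.
            group.Sort(StringComparer.Ordinal);
            result[prefix] = group;
        }
        return result;
    }

    public static void Main()
    {
        var input = new[]
        {
            "DOG_BOB", "CAT_ROB", "DOG_DANNY", "MOUSE_MICKEY", "DOG_STEVE",
            "HORSE_NEIGH", "CAT_RUDE", "HORSE_BOO", "MOUSE_STUPID"
        };
        // Step 3: output the results, one line per prefix.
        foreach (KeyValuePair<string, List<string>> pair in SplitByPrefix(input))
        {
            Console.WriteLine(pair.Key + ": " + string.Join(",", pair.Value));
        }
    }
}
```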

You can sort the list up front and then group by prefix:
string[] input = new string[] {"DOG_BOB","CAT_ROB","DOG_DANNY","MOUSE_MICKEY","DOG_STEVE","HORSE_NEIGH","CAT_RUDE","HORSE_BOO","MOUSE_STUPID"};
string[] sortedInput = input.OrderBy(x => x).ToArray();
var distinctSortedPrefixes = sortedInput.Select(item => item.Split('_')[0]).Distinct().ToArray();
Dictionary<string, string[]> orderedByPrefix = new Dictionary<string, string[]>();
for (int prefixIndex = 0; prefixIndex < distinctSortedPrefixes.Length; prefixIndex++)
{
    string prefix = distinctSortedPrefixes[prefixIndex];
    // read from the sorted array so each group is sorted too;
    // the "_" keeps DOG from also matching a longer prefix such as DOGGY
    var group = sortedInput.Where(item => item.StartsWith(prefix + "_")).ToArray();
    orderedByPrefix.Add(prefix, group);
}

With LINQ, using something like
names.GroupBy(s => s.Substring(0, s.IndexOf("_"))) // group by prefix
.Select(g => string.Join(",", g)) // join each group with commas
.ToList(); // take the results
On frameworks older than .NET 4, string.Join has no overload that takes an IEnumerable<string>, so call g.ToArray() on each group there.

This LINQ expression does what you want.
var result = data.GroupBy(s => s.Split('_')[0])
    .Select(group => String.Join(", ", group))
    .ToList();
For a list of lists of strings use this expression.
var result = data.GroupBy(s => s.Split('_')[0])
    .Select(group => group.ToList())
    .ToList();

How to find the number of each elements in the row and store the mean of each row in another array using C#?

I am using the code below to read data from a text file row by row. I would like to store each row in an array, and I must be able to find the number of rows/arrays and the number of elements in each of them.
I would also like to do some manipulations on some or all rows and return their values.
I get the number of rows, but is there a way to loop, something like:
for (i = 1 to number of rows)
do
mean[i] <- row[i]
done
return mean
var data = System.IO.File.ReadAllText("Data.txt");
var arrays = new List<float[]>();
var lines = data.Split(new[] {'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
var lineArray = new List<float>();
foreach (var s in line.Split(new[] {','}, StringSplitOptions.RemoveEmptyEntries))
{
lineArray.Add(Convert.ToSingle(s));
}
arrays.Add(lineArray.ToArray());
}
var numberOfRows = lines.Count();
var numberOfValues = arrays.Sum(s => s.Length);

var arrays = new List<float[]>();
//....your filling the arrays
var averages = arrays.Select(floats => floats.Average()).ToArray(); //float[]
var counts = arrays.Select(floats => floats.Count()).ToArray(); //int[]
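To mirror the loop from the question more literally, the same thing can be written with an index so that mean[i] lines up with row i. A rough sketch, with RowMeans as a made-up helper name and two hard-coded rows standing in for the parsed file:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class RowStats
{
    // Returns the mean of each row; an empty row yields NaN instead of throwing.
    public static float[] RowMeans(IList<float[]> rows)
    {
        var mean = new float[rows.Count];
        for (int i = 0; i < rows.Count; i++)
        {
            mean[i] = rows[i].Length == 0 ? float.NaN : rows[i].Average();
        }
        return mean;
    }

    public static void Main()
    {
        // Stand-in for the arrays filled from Data.txt.
        var arrays = new List<float[]>
        {
            new[] { 1f, 2f, 3f },
            new[] { 4f, 5f }
        };

        float[] mean = RowMeans(arrays);
        for (int i = 0; i < mean.Length; i++)
        {
            Console.WriteLine("row {0}: {1} values, mean {2}", i, arrays[i].Length,
                mean[i].ToString(CultureInfo.InvariantCulture));
        }
    }
}
```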

Not sure I understood the question. Do you mean something like
foreach (string line in File.ReadAllLines("fileName.txt"))
{
    ...
}

Is it ok for you to use Linq? You might need to add using System.Linq; at the top.
float floatTester = 0;
List<float[]> result = File.ReadLines(@"Data.txt")
.Where(l => !string.IsNullOrWhiteSpace(l))
.Select(l => new {Line = l, Fields = l.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries) })
.Select(x => x.Fields
.Where(f => Single.TryParse(f, out floatTester))
.Select(f => floatTester).ToArray())
.ToList();
// now get your totals
int numberOfLinesWithData = result.Count;
int numberOfAllFloats = result.Sum(fa => fa.Length);
Explanation:
File.ReadLines reads the lines of a file (not all at once, but streaming)
Where returns only the elements for which the given predicate is true (e.g. the line must contain more than whitespace)
new { creates an anonymous type with the given properties (e.g. the fields separated by commas)
Then I try to parse each field to float
Everything that can be parsed is added to a float[] with ToArray()
Altogether these arrays are added to a List<float[]> with ToList()

Found an efficient way to do this. Thanks for your input everybody!
private void ReadFile()
{
    var lines = File.ReadLines("Data.csv");
    var numbers = new List<List<double>>();
    var separators = new[] { ',', ' ' };
    /*System.Threading.Tasks.*/
    Parallel.ForEach(lines, line =>
    {
        var list = new List<double>();
        foreach (var s in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double i;
            if (double.TryParse(s, out i))
            {
                list.Add(i);
            }
        }
        // Note: rows are added in completion order, which may differ from the file order.
        lock (numbers)
        {
            numbers.Add(list);
        }
    });
    var rowTotal = new double[numbers.Count];
    var rowMean = new double[numbers.Count];
    var totalInRow = new int[numbers.Count];
    for (var row = 0; row < numbers.Count; row++)
    {
        var values = numbers[row].ToArray();
        rowTotal[row] = values.Sum();
        rowMean[row] = rowTotal[row] / values.Length;
        totalInRow[row] += values.Length;
    }
}
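One caveat with Parallel.ForEach and a shared list: the rows land in the list in whatever order the worker threads finish, so rowMean[i] is not guaranteed to belong to line i of the file. If the row order matters, PLINQ with AsOrdered is one way to keep the parallel parsing while preserving order. The sketch below uses invented names and in-memory sample lines; it also parses with the invariant culture, which is an assumption about the file format.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class OrderedRowReader
{
    private static readonly char[] Separators = { ',', ' ' };

    // Parses one line into its numeric fields, skipping anything unparsable.
    private static List<double> ParseLine(string line)
    {
        var list = new List<double>();
        foreach (string s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double value;
            if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
            {
                list.Add(value);
            }
        }
        return list;
    }

    // Parses the lines in parallel; AsOrdered keeps the rows in input order.
    public static List<List<double>> ParseRows(IEnumerable<string> lines)
    {
        return lines.AsParallel()
                    .AsOrdered()
                    .Select(line => ParseLine(line))
                    .ToList();
    }

    public static void Main()
    {
        var rows = ParseRows(new[] { "1,2", "3 4 5", "x,6" });
        for (int row = 0; row < rows.Count; row++)
        {
            double mean = rows[row].Count == 0 ? double.NaN : rows[row].Average();
            Console.WriteLine("row {0}: {1} values, mean {2}", row, rows[row].Count,
                mean.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```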

C# Combining lines

Hey everybody, this is what I have going on. I have two text files; let's call them A.txt and B.txt.
A.txt is a config file that contains a bunch of folder names, with only 1 listing per folder.
B.txt is a directory listing that contains folder names and sizes, but B contains several listings per folder, not just 1 entry.
What I need is: if B contains a name from A, take all lines in B that contain that name and write them out as A|B|B|B, etc.
So example:
A.txt:
Apple
Orange
Pear
XBSj
HEROE
B.txt:
Apple|3123123
Apple|3434
Orange|99999999
Orange|1234544
Pear|11
Pear|12
XBSJ|43949
XBSJ|43933
Result.txt:
Apple|3123123|3434
Orange|99999999|1234544
Pear|11|12
XBSJ|43949|43933
This is what I had but it's not really doing what I needed.
string[] combineconfig = File.ReadAllLines(@"C:\a.txt");
foreach (string ccline in combineconfig)
{
    string[] readlines = File.ReadAllLines(@"C:\b.txt");
    if (readlines.Contains(ccline))
    {
        foreach (string rdlines in readlines)
        {
            string[] pslines = rdlines.Split('|');
            File.AppendAllText(@"C:\result.txt", ccline + '|' + pslines[0]);
        }
    }
}
I now realize it's not going to pass the first "if", because Contains compares against entire lines and so can't find the folder name. But I still believe my output file would not contain what I need.

Assuming you're using .NET 3.5 (so can use LINQ), try this:
string[] configLines = File.ReadAllLines("a.txt");
var dataLines = from line in File.ReadAllLines("b.txt")
let split = line.Split('|')
select new { Key = split[0], Value = split[1] };
var lookup = dataLines.ToLookup(x => x.Key, x => x.Value);
using (TextWriter writer = File.CreateText("result.txt"))
{
foreach (string key in configLines)
{
string[] values = lookup[key].ToArray();
if (values.Length > 0)
{
writer.WriteLine("{0}|{1}", key, string.Join("|", values));
}
}
}
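Note that the sample A.txt spells one folder "XBSj" while B.txt has "XBSJ", and ToLookup compares keys case-sensitively by default, so that entry would be skipped. If the match should ignore case, one option is to pass a comparer to ToLookup. The sketch below works on in-memory lines instead of files, Combine is a made-up name, and the key is written out as spelled in A.txt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineCombiner
{
    public static List<string> Combine(IEnumerable<string> configLines,
                                       IEnumerable<string> dataLines)
    {
        // Split each data line into "name|size"; lines without a '|' are skipped.
        var lookup = dataLines
            .Select(line => line.Split(new[] { '|' }, 2))
            .Where(parts => parts.Length == 2)
            .ToLookup(parts => parts[0], parts => parts[1],
                      StringComparer.OrdinalIgnoreCase);

        var result = new List<string>();
        foreach (string key in configLines)
        {
            string[] values = lookup[key].ToArray(); // empty when the key is missing
            if (values.Length > 0)
            {
                result.Add(key + "|" + string.Join("|", values));
            }
        }
        return result;
    }

    public static void Main()
    {
        var a = new[] { "Apple", "Orange", "Pear", "XBSj", "HEROE" };
        var b = new[]
        {
            "Apple|3123123", "Apple|3434", "Orange|99999999", "Orange|1234544",
            "Pear|11", "Pear|12", "XBSJ|43949", "XBSJ|43933"
        };
        foreach (string line in Combine(a, b))
        {
            Console.WriteLine(line);
        }
        // Apple|3123123|3434
        // Orange|99999999|1234544
        // Pear|11|12
        // XBSj|43949|43933
    }
}
```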

var a = new HashSet<string>(File.ReadAllLines(@"a.txt")
    .SelectMany(line => line.Split(' ')),
    StringComparer.CurrentCultureIgnoreCase);
var c = File.ReadAllLines(@"b.txt")
    .Select(line => line.Split('|'))
    .GroupBy(item => item[0], item => item[1])
    .Where(group => a.Contains(group.Key))
    .Select(group => group.Key + "|" + string.Join("|", group.ToArray()))
    .ToArray();
File.WriteAllLines("result.txt", c);
Output:
Apple|3123123|3434
Orange|99999999|1234544
Pear|11|12
XBSJ|43949|43933

A short one :
var a = File.ReadAllLines("A.txt");
var b = File.ReadAllLines("B.txt");
var query =
from bline in b
let parts = bline.Split('|')
group parts[1] by parts[0] into bg
join aline in a on bg.Key equals aline
select aline + "|" + string.Join("|", bg.ToArray());
File.WriteAllLines("result.txt", query.ToArray());

This should work:
using System;
using System.Linq;
using System.IO;
using System.Globalization;
namespace SO2593168
{
class Program
{
static void Main(string[] args)
{
var a = File.ReadAllLines("A.txt");
var b =
(from line in File.ReadAllLines("B.txt")
let parts = line.Split('|')
select new { key = parts[0], value = parts[1] });
var comparer = StringComparer.Create(CultureInfo.InvariantCulture, true);
var result =
from key in a
from keyvalue in b
where comparer.Compare(keyvalue.key, key) == 0
group keyvalue.value by keyvalue.key into g
select new { g.Key, values = String.Join("|", g.ToArray()) };
foreach (var entry in result)
Console.Out.WriteLine(entry.Key + "|" + entry.values);
}
}
}
This produces:
Apple|3123123|3434
Orange|99999999|1234544
Pear|11|12
XBSJ|43949|43933
