Way(s) to extract selected node values from this XML Markup - c#

Given the (specimen - real markup may be considerably more complicated) markup and constraints listed below, could anyone propose a solution (C#) more effective/efficient than walking the whole tree to retrieve { "##value1##", "##value2##", "##value3##" }, i.e. a list of tokens that are going to be replaced when the markup is actually used.
Note: I have no control over the markup, structure of the markup or format/naming of the tokens that are being replaced.
<markup>
<element1 attributea="blah">##value1##</element1>
<element2>##value2##</element2>
<element3>
<element3point1>##value1##</element3point1>
<element3point2>##value3##</element3point2>
<element3point3>apple</element3point3>
</element3>
<element4>pear</element4>
</markup>

How about:
var keys = new HashSet<string>();
Regex.Replace(input, "##[^#]+##", match => {
keys.Add(match.Value);
return ""; // doesn't matter
});
foreach (string key in keys) {
Console.WriteLine(key);
}
This:
doesn't bother parsing the xml (just string manipulation)
only includes the unique values (no need to return a MatchCollection with the duplicates we don't want)
However, it may build a larger string, so maybe just Matches:
var matches = Regex.Matches(input, "##[^#]+##");
var result = matches.Cast<Match>().Select(m => m.Value).Distinct();
foreach (string s in result) {
Console.WriteLine(s);
}

I wrote a quick prog with your sample, this should do the trick.
class Program
{
//I just copied your stuff to Test.xml
static void Main(string[] args)
{
XDocument doc = XDocument.Load("Test.xml");
var verbs=new Dictionary<string,string>();
//Add the values to replace here
verbs.Add("##value3##", "mango");
verbs.Add("##value1##", "potato");
ReplaceStuff(verbs, doc.Root.Elements());
doc.Save("Test2.xml");
}
//A simple recursive replace method
static void ReplaceStuff(Dictionary<string,string> verbs,IEnumerable<XElement> elements)
{
foreach (var e in elements)
{
if (e.Elements().Count() > 0)
ReplaceStuff(verbs, e.Elements() );
else
{
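// Look up with the same trimmed key that was tested, to avoid a KeyNotFoundException
string key = e.Value.Trim();
if (verbs.ContainsKey(key))
e.Value = verbs[key];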
}
}
}
}

Related

Is It possible to find out what are the common part in String List

I was working on finding out the Common string part in the String list. If we take a sample data set
private readonly List<string> Xpath = new List<string>()
{
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(1)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(2)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(3)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(4)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(5)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(6)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(7)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(8)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(9)>H2:nth-of-type(1)"
};
From this list (the data is a list of XPath-like selectors), I want to determine programmatically the leading part of the path that all entries share.
Expected output:
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV
To get this, I split each item on > and created a list of segments for each entry in the original data set.
Then I used the following to find the items common to all of the lists:
private IEnumerable<T> GetCommonItems<T>(IEnumerable<T>[] lists)
{
HashSet<T> hs = new HashSet<T>(lists.First());
for (int i = 1; i < lists.Length; i++)
{
hs.IntersectWith(lists[i]);
}
return hs;
}
This finds the common values, from which I can build a path again. The problem is that a HashSet keeps each distinct value only once: if, for example, DIV appears in several places in every original entry, this method still picks up only one DIV.
From that I get something like this:
BODY>MAIN:nth-of-type(1)>DIV>SECTION
But I need this:
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV
Disclaimer: This is not the most performant solution but it works :)
Start by splitting the first path on the > character, then do the same with all the paths:
char separator = '>';
IEnumerable<string> firstPathChunks = Xpath[0].Split(separator);
var chunks = Xpath.Select(path => path.Split(separator).ToList()).ToArray();
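Then:
Iterate through firstPathChunks.
For each path chunk, iterate through the chunks lists.
If the first element of a list does not match, stop; otherwise remove that first element.
If the first element was removed from every list, append the matching chunk and the separator to sb.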
void Process(StringBuilder sb)
{
foreach (var pathChunk in firstPathChunks)
{
foreach (var chunk in chunks)
{
if (chunk.Count == 0 || chunk[0] != pathChunk) // a shorter path may already be exhausted
{
return;
}
chunk.RemoveAt(0);
}
sb.Append(pathChunk);
sb.Append(separator);
}
}
Sample usage
var sb = new StringBuilder();
Process(sb);
Console.WriteLine(sb.ToString());
Output
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>
Splitting the string on the separator > is a good idea. Instead of then creating a set of unique items, you should create a list of all items contained in the string, in order, which would result in
{
"BODY",
"MAIN:nth-of-type(1)",
"DIV",
"SECTION",
"DIV",
...
}
for the first entry of your XPath list.
This way you create a List<List<string>> containing every element of each entry of your XPath list. You can then compare the first elements of all the inner lists. If they are equal, save that element's value to your output and proceed with the second elements, and so on, until you reach a position where the elements are not equal across all inner lists.
Edit:
After splitting your list entries on the > separator, this could look something like this:
// ParseElementsFromXPath is assumed to split each path on '>' into a List<string>
List<List<string>> xpathElementLists = ParseElementsFromXPath(Xpath);
List<string> resultElements = new List<string>();
for (int i = 0; i < xpathElementLists[0].Count; i++)
{
bool isEqual = true;
string compareElement = xpathElementLists[0][i];
foreach (List<string> elements in xpathElementLists)
{
// Compare the i-th element of each list (not the list itself) and stop at the shortest list
if (i >= elements.Count || !String.Equals(compareElement, elements[i]))
{
isEqual = false;
break;
}
}
if (!isEqual)
{
break;
}
resultElements.Add(compareElement);
}
string result = String.Join(">", resultElements);
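The position-by-position comparison described above can also be written as one self-contained helper. This is only a sketch under assumptions: the class name PathPrefix and the method name CommonPrefix are hypothetical, and the paths are assumed to be plain strings separated by a single character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PathPrefix
{
    // Returns the leading segments shared by every path, joined by the separator.
    // Hypothetical helper; names are illustrative only.
    public static string CommonPrefix(IEnumerable<string> paths, char separator = '>')
    {
        // Keep every segment in order, duplicates included.
        List<string[]> split = paths.Select(p => p.Split(separator)).ToList();
        if (split.Count == 0)
        {
            return string.Empty;
        }

        // The prefix can never be longer than the shortest path.
        int shortest = split.Min(s => s.Length);
        int matched = 0;
        while (matched < shortest && split.All(s => s[matched] == split[0][matched]))
        {
            matched++;
        }

        return string.Join(separator.ToString(), split[0].Take(matched));
    }
}
```

Unlike the HashSet intersection, this keeps repeated segments such as DIV because it compares by position and never collapses duplicates. It also does not modify the input lists and leaves no trailing separator.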

check if string contains dictionary keys and replace the matching substring with Values from dictionary

I am parsing a template file which will contain certain keys that I need to map values to. Take a line from the file for example:
Field InspectionStationID 3 {"PVA TePla #WSM#", "sw#data.tool_context.TOOL_SOFTWARE_VERSION#", "#data.context.TOOL_ENTITY#"}
I need to replace the string within the # symbols with values from a dictionary.
So there can be multiple keys from the dictionary. However, not all strings inside the # are in the dictionary so for those, I will have to replace them with empty string.
I can't seem to find a way to do this. And yes, I have looked at this solution:
check if string contains dictionary Key -> remove key and add value
For now what I have is this (where I read from the template file line by line and then write to a different file):
string line = string.Empty;
var dict = new Dictionary<string, string>() {
{ "data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
{"data.context.TOOL_ENTITY", "WSM102" }
};
StringBuilder inputText = new StringBuilder();
StreamWriter writeKlarf = new StreamWriter(klarfOutputNameActual);
using (StreamReader sr = new StreamReader(WSMTemplatePath))
{
while((line = sr.ReadLine()) != null)
{
//Console.WriteLine(line);
if (line.Contains("#"))
{
}
else
{
writeKlarf.WriteLine(line);
}
}
}
writeKlarf.Close();
The idea is that, for each line, I replace each #string# (including the # delimiters) with the matching value from the dictionary if that string is a key in the dictionary. How can I do this?
Sample Output Given the line above:
Field InspectionStationID 3 {"PVA TePla", "sw0.2.002", "WSM102"}
Here, because #WSM# is not in the dictionary, it is replaced with an empty string.
One more thing: this logic only applies to the first quarter of the file. The rest of the file will have other data that needs to be entered via different logic, so I am not sure it makes sense to read the whole file into memory just for the header section.
Here's a quick example that I wrote for you, hopefully this is what you're asking for.
This will let you have a <string, string> Dictionary, check for the Key inside of a delimiter, and if the text inside of the delimiter matches the Dictionary key, it will replace the text. It won't edit any of the inputted strings that don't have any matches.
If you want to delete an unmatched token instead of leaving it alone, replace it with String.Empty once no dictionary key has matched it. Note that this example only looks at the first #...# token on the line.
var dict = new Dictionary<string, string>() {
{ "test", "cool test" }
};
string line = "#test# is now replaced.";
foreach (var kvp in dict)
{
string split = line.Split('#')[1];
if (split == kvp.Key)
{
line = line.Replace($"#{split}#", kvp.Value);
}
Console.WriteLine(line);
}
Console.ReadLine();
If you had a list of tuples holding the find and replace strings, you could read the file, apply each replacement, and then write the result to a new file:
var frs = new List<(string F, string R)>(){
("#data.tool_context.TOOL_SOFTWARE_VERSION#", "sw0.2.002"),
("#otherfield#", "replacement here")
};
var i = File.ReadAllText("path");
frs.ForEach(fr => i = i.Replace(fr.F,fr.R));
File.WriteAllText("path2", i);
The choice to use a list vs a dictionary is fairly arbitrary; List has a ForEach method, but it could just as easily be a foreach loop over a dictionary. I included the # delimiters in the find string because I got the impression the output is not supposed to contain them.
This version leaves alone any template parameters that aren't available.
You can try matching #...# keys with the help of regular expressions:
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
...
static string MyReplace(string value, IDictionary<string, string> subs) => Regex
.Replace(value, "#[^#]*#", match => subs.TryGetValue(
match.Value.Substring(1, match.Value.Length - 2), out var item) ? item : "");
Then you can apply it to the file: read the file's lines, process them with the help of LINQ, and write them to another file.
var dict = new Dictionary<string, string>() {
{"data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
{"data.context.TOOL_ENTITY", "WSM102" },
};
File.WriteAllLines(klarfOutputNameActual, File
.ReadLines(WSMTemplatePath)
.Select(line => MyReplace(line, dict)));
Edit: If you want to switch off MyReplace from some line onward:
bool doReplace = true;
File.WriteAllLines(klarfOutputNameActual, File
.ReadLines(WSMTemplatePath)
.Select(line => {
//TODO: given the line, check whether we want to keep replacing
if (!doReplace || SomeCondition(line)) {
doReplace = false;
return line;
}
return MyReplace(line, dict);
}));
Here SomeCondition(line) returns true as soon as the header ends and #..# tokens should no longer be replaced.
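For the concern about not reading the whole file into memory just for the header, the same idea can be sketched as a line-by-line copy between a reader and a writer. The names HeaderTemplate, ReplaceTokens, Process and isEndOfHeader are hypothetical, and how the end of the header is detected is an assumption left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class HeaderTemplate
{
    // Matches #key# and captures the text between the delimiters.
    private static readonly Regex Token = new Regex("#([^#]*)#");

    // Replaces each #key# with its dictionary value, or with "" when the key is unknown.
    public static string ReplaceTokens(string line, IDictionary<string, string> subs) =>
        Token.Replace(line, m => subs.TryGetValue(m.Groups[1].Value, out var value) ? value : "");

    // Copies reader to writer one line at a time. Tokens are substituted only
    // until isEndOfHeader (an assumed, caller-supplied test) first returns true.
    public static void Process(TextReader reader, TextWriter writer,
        IDictionary<string, string> subs, Func<string, bool> isEndOfHeader)
    {
        bool inHeader = true;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (inHeader && isEndOfHeader(line))
            {
                inHeader = false;
            }
            writer.WriteLine(inHeader ? ReplaceTokens(line, subs) : line);
        }
    }
}
```

With files, the reader and writer could be a StreamReader and a StreamWriter, so only one line is held in memory at a time. Text adjacent to a token is kept as-is, so a template fragment such as sw#key# keeps its literal sw prefix in front of the substituted value.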

Linq query for building a dictionary from a reg file

I'm building a simple dictionary from a reg file (export from Windows Regedit). The .reg file contains a key in square brackets, followed by zero or more lines of text, followed by a blank line. This code will create the dictionary that I need:
var a = File.ReadLines("test.reg");
var dict = new Dictionary<String, List<String>>();
foreach (var key in a) {
if (key.StartsWith("[HKEY")) {
var iter = a.GetEnumerator();
var value = new List<String>();
do {
iter.MoveNext();
value.Add(iter.Current);
} while (String.IsNullOrWhiteSpace(iter.Current) == false);
dict.Add(key, value);
}
}
I feel like there is a cleaner (prettier?) way to do this in a single Linq statement (using a group by), but it's unclear to me how to implement the iteration of the value items into a list. I suspect I could do the same GetEnumerator in a let statement but it seems like there should be a way to implement this without resorting to an explicit iterator.
Sample data:
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.msu]
@="Microsoft.System.Update.1"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS]
@="WMP11.AssocFile.M2TS"
"Content Type"="video/vnd.dlna.mpeg-tts"
"PerceivedType"="video"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\OpenWithProgIds]
"WMP11.AssocFile.M2TS"=hex(0):
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx]
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx\{BB2E617C-0920-11D1-9A0B-00C04FC2D6C1}]
@="{9DBD2C50-62AD-11D0-B806-00C04FD706EC}"
Update
I'm sorry, I need to be more specific. The files I'm looking at are around 300 MB, so I took the approach I did to keep the memory footprint down. I'd prefer an approach that doesn't require pulling the entire file into memory.
You can always use Regex:
var dict = new Dictionary<String, List<String>>();
var a = File.ReadAllText(@"test.reg");
var results = Regex.Matches(a, "(\\[[^\\]]+\\])([^\\[]+)\r\n\r\n", RegexOptions.Singleline);
foreach (Match item in results)
{
dict.Add(
item.Groups[1].Value,
item.Groups[2].Value.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries).ToList()
);
}
I whipped this out real quick. You might be able to improve the regex pattern.
Instead of using GetEnumerator, you can take advantage of the TakeWhile and Skip methods to break your list into smaller lists (each sublist represents one key and its values):
var registryLines = File.ReadLines("test.reg");
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
while (registryLines.Any()) // Any() stops at the first line instead of counting them all
{
// Take the key and values into a single list
var keyValues = registryLines.TakeWhile(x => !String.IsNullOrWhiteSpace(x)).ToList();
// Adds a new entry to the dictionary using the first value as key and the rest of the list as value
if (keyValues != null && keyValues.Count > 0)
resultKeys.Add(keyValues[0], keyValues.Skip(1).ToList());
// Jumps to the next registry (+1 to skip the blank line)
registryLines = registryLines.Skip(keyValues.Count + 1);
}
EDIT based on your update
Update I'm sorry I need to be more specific. The files am looking at
around ~300MB so I took the approach I did to keep the memory
footprint down. I'd prefer an approach that doesn't require pulling
the entire file into memory.
Well, if you can't read the whole file into memory, it makes no sense to me asking for a LINQ solution. Here is a sample of how you can do it reading line by line (still no need for GetEnumerator)
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
using (StreamReader reader = File.OpenText("test.reg"))
{
List<string> keyAndValues = new List<string>();
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
// Adds key and values to a list until it finds a blank line
if (!string.IsNullOrWhiteSpace(line))
keyAndValues.Add(line);
else
{
// Adds a new entry to the dictionary using the first value as key and the rest of the list as value
if (keyAndValues != null && keyAndValues.Count > 0)
resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());
// Starts a new Key collection
keyAndValues = new List<string>();
}
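}
// Flushes the last entry when the file does not end with a blank line
if (keyAndValues.Count > 0)
resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());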
}
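If a LINQ-style call site is still wanted without an explicit enumerator at the point of use, one option is to put the line-by-line grouping into an iterator method and feed it a lazily read sequence. This is a sketch under assumptions: RegSections and Read are hypothetical names, a key line is assumed to be any line starting with "[", and blank lines are simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RegSections
{
    // Yields one (key, values) pair per "[...]" line; the values are the
    // non-blank lines that follow it, up to the next key line.
    public static IEnumerable<KeyValuePair<string, List<string>>> Read(IEnumerable<string> lines)
    {
        string key = null;
        var values = new List<string>();
        foreach (string line in lines)
        {
            if (line.StartsWith("[", StringComparison.Ordinal))
            {
                // A new key begins: emit the previous section first.
                if (key != null)
                {
                    yield return new KeyValuePair<string, List<string>>(key, values);
                }
                key = line;
                values = new List<string>();
            }
            else if (key != null && !string.IsNullOrWhiteSpace(line))
            {
                values.Add(line);
            }
        }
        // Emit the final section, which has no following key line.
        if (key != null)
        {
            yield return new KeyValuePair<string, List<string>>(key, values);
        }
    }
}
```

A possible call is RegSections.Read(File.ReadLines("test.reg")).ToDictionary(p => p.Key, p => p.Value). Only one section is buffered at a time while reading, although the resulting dictionary still holds every key and value.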
I think you can use code like this, if memory use is not a concern:
var lines = File.ReadAllText(fileName);
var result =
Regex.Matches(lines, @"\[(?<key>HKEY[^]]+)\]\s+(?<value>[^[]+)")
.OfType<Match>()
.ToDictionary(k => k.Groups["key"].Value, v => v.Groups["value"].Value.Trim('\n', '\r', ' '));
This took 24.173 seconds for a file with more than 4 million lines (about 550 MB), using 1.2 GB of memory.
Edit :
A better way is to use File.ReadLines, as it is lazy (File.ReadAllLines loads the whole file into an array first):
var lines = File.ReadLines(fileName);
var keyRegex = new Regex(@"\[(?<key>HKEY[^]]+)\]");
var currentKey = string.Empty;
var currentValue = string.Empty;
var result = new Dictionary<string, string>();
foreach (var line in lines)
{
var match = keyRegex.Match(line);
if (match.Length > 0)
{
if (!string.IsNullOrEmpty(currentKey))
{
result.Add(currentKey, currentValue);
currentValue = string.Empty;
}
currentKey = match.Groups["key"].ToString();
}
else
{
currentValue += line;
}
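}
// Add the final key, which has no following key line to trigger the Add above
if (!string.IsNullOrEmpty(currentKey))
{
result.Add(currentKey, currentValue);
}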
This took 17093 milliseconds for a file with 795180 lines.

extract variables and values from http post / string c#

I am trying to read in POST data to an ASPX (c#) page. I have got the post data now inside a string. I am now wondering if this is the best way to use it. Using the code here (http://stackoverflow.com/questions/10386534/using-request-getbufferlessinputstream-correctly-for-post-data-c-sharp) I have the following string
<callback variable1="foo1" variable2="foo2" variable3="foo3" />
As this is now in a string, I am splitting based on a space.
string[] pairs = theResponse.Split(' ');
Dictionary<string, string> results = new Dictionary<string, string>();
foreach (string pair in pairs)
{
string[] paramvalue = pair.Split('=');
results.Add(paramvalue[0], paramvalue[1]);
Debug.WriteLine(paramvalue[0].ToString());
}
The trouble comes when a value has a space in it. For example, variable3="foo 3" upsets the code.
Is there something better I should be doing to parse the incoming HTTP POST variables within the string?
You might want to treat it as XML directly:
// just use 'theResponse' here instead
var xml = "<callback variable1=\"foo1\" variable2=\"foo2\" variable3=\"foo3\" />";
// once inside an XElement you can get all the values
var ele = XElement.Parse(xml);
// an example of getting the attributes out
var values = ele.Attributes().Select(att => new { Name = att.Name, Value = att.Value });
// or print them
foreach (var attr in ele.Attributes())
{
Console.WriteLine("{0} - {1}", attr.Name, attr.Value);
}
Of course you can change that last line to whatever you want, the above is a rough example.
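To get the same name/value dictionary the question was building by hand, the attributes can be projected straight into a Dictionary. This is a minimal sketch; CallbackParser and ParseAttributes are hypothetical names, and it assumes the posted string is a single well-formed XML element.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CallbackParser
{
    // Parses a single XML element and returns its attributes as name/value pairs.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static Dictionary<string, string> ParseAttributes(string xml) =>
        XElement.Parse(xml)
            .Attributes()
            .ToDictionary(a => a.Name.LocalName, a => a.Value);
}
```

Because the XML parser handles the quoting, a value such as variable3="foo 3" comes back as foo 3 without any special handling of the space.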

C# dedupe List based on split

I'm having a hard time deduping a list based on a specific delimiter.
For example I have 4 strings like below:
apple|pear|fruit|basket
orange|mango|fruit|turtle
purple|red|black|green
hero|thor|ironman|hulk
In this example I want my list to only have unique values in column 3, so it would result in a List that looks like this:
apple|pear|fruit|basket
purple|red|black|green
hero|thor|ironman|hulk
In the above example I would have gotten rid of line 2 because line 1 had the same result in column 3. Any help would be awesome, deduping is tough in C#.
How I'm testing this:
static void Main(string[] args)
{
BeginListSet = new List<string>();
startHashSet();
}
public static List<string> BeginListSet { get; set; }
public static void startHashSet()
{
string[] BeginFileLine = File.ReadAllLines(@"C:\testit.txt");
foreach (string begLine in BeginFileLine)
{
BeginListSet.Add(begLine);
}
}
public static IEnumerable<string> Dedupe(IEnumerable<string> list, char seperator, int keyIndex)
{
var hashset = new HashSet<string>();
foreach (string item in list)
{
var array = item.Split(seperator);
if (hashset.Add(array[keyIndex]))
yield return item;
}
}
Something like this should work for you
static IEnumerable<string> Dedupe(this IEnumerable<string> input, char seperator, int keyIndex)
{
var hashset = new HashSet<string>();
foreach (string item in input)
{
var array = item.Split(seperator);
if (hashset.Add(array[keyIndex]))
yield return item;
}
}
...
var list = new string[]
{
"apple|pear|fruit|basket",
"orange|mango|fruit|turtle",
"purple|red|black|green",
"hero|thor|ironman|hulk"
};
foreach (string item in list.Dedupe('|', 2))
Console.WriteLine(item);
Edit: In the linked question Distinct() with Lambda, Jon Skeet presents the idea in a much better fashion, in the form of a DistinctBy custom method. While similar, his is far more reusable than the idea presented here.
Using his method, you could write
var deduped = list.DistinctBy(item => item.Split('|')[2]);
And you could later reuse the same method to "dedupe" another list of objects of a different type by a key of possibly yet another type.
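For reference, a key-based distinct along those lines might look like the sketch below. It is not a copy of the linked implementation; the names DedupeExtensions and DistinctByKey are hypothetical, the latter chosen so it does not clash with the Enumerable.DistinctBy method that ships with .NET 6 and later.

```csharp
using System;
using System.Collections.Generic;

public static class DedupeExtensions
{
    // Yields the first item seen for each key, preserving the input order.
    public static IEnumerable<T> DistinctByKey<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        var seen = new HashSet<TKey>();
        foreach (T item in source)
        {
            // HashSet.Add returns false when the key was already present.
            if (seen.Add(keySelector(item)))
            {
                yield return item;
            }
        }
    }
}
```

With this in scope, list.DistinctByKey(item => item.Split('|')[2]) keeps the apple, purple and hero lines from the sample and drops the orange line, whose third column repeats fruit.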
Try this:
var list = new string[]
{
"apple|pear|fruit|basket",
"orange|mango|fruit|turtle",
"purple|red|black|green",
"hero|thor|ironman|hulk "
};
var dedup = new List<string>();
var filtered = new List<string>();
foreach (var s in list)
{
var filter = s.Split('|')[2];
if (dedup.Contains(filter)) continue;
filtered.Add(s);
dedup.Add(filter);
}
// Console.WriteLine(filtered);
Can you use a HashSet instead? That will eliminate dupes automatically for you as they are added.
Maybe you can sort the |-delimited words in alphabetical order and store them in a grid (columns). Then, when you try to insert, just check whether there is already a column containing a word that starts with the same character.
If LINQ is an option, you can do something like this:
// assume strings is a collection of strings
List<string> list = strings.Select(a => a.Split('|')) // split each line by '|'
.GroupBy(a => a[2]) // group by third column
.Select(a => a.First()) // select first line from each group
.Select(a => string.Join("|", a))
.ToList(); // convert to list of strings
Edit (per Jeff Mercado's comment), this can be simplified further:
List<string> list =
strings.GroupBy(a => a.Split('|')[2]) // group by third column
.Select(a => a.First()) // select first line from each group
.ToList(); // convert to list of strings
