C#: split a large HTML document into H1 titles and the text up to the next H1

I have a big document with the following rigid structure:
<h1>Title 1</h1>
Article text
<h1>Title 2</h1>
Article text
<h1>Title 3</h1>
Article text
My aim is to create a list of lists, each holding a title and the article text that follows it, up to the next title.
I tried:
var parts = Regex.Split(html2, @"(<h1>)").Where(l => l != string.Empty).ToArray().Select(a => Regex.Split(a, @"(</h1>)")).ToArray();
But the result is not as expected. Any ideas how to split out the separate articles and their titles? Thanks!

Parsing HTML with Regex is a bad idea, as described by this classic answer: https://stackoverflow.com/a/1732454/173322
If you had a specific tag to extract then it could work but since you have multiple tags and freeform article text in-between, I suggest you use a parsing engine like AngleSharp or HtmlAgilityPack. It'll be faster and more reliable.
If you must stick with manual text parsing, I would simply loop through each line, check if it starts with an <h1> tag, classify the lines as Titles or Article Text, then loop through again to strip out the tags from the Titles and pair with the Article text.
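The line-by-line approach could look roughly like the sketch below. It assumes every title sits on its own line, wrapped in exactly <h1>...</h1>, as in the question's sample; the html sample string and the sections name are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample input: one title or one run of article text per line.
string html = "<h1>Title 1</h1>\nArticle text\n<h1>Title 2</h1>\nArticle text, line 1\nArticle text, line 2\n<h1>Title 3</h1>\nArticle text";

// Each inner list holds two items: [0] = title, [1] = article text.
var sections = new List<List<string>>();

foreach (string rawLine in html.Split('\n'))
{
    string line = rawLine.Trim();
    if (line.StartsWith("<h1>", StringComparison.Ordinal) &&
        line.EndsWith("</h1>", StringComparison.Ordinal))
    {
        // Strip "<h1>" (4 chars) and "</h1>" (5 chars) and start a new section.
        string title = line.Substring(4, line.Length - 9);
        sections.Add(new List<string> { title, "" });
    }
    else if (sections.Count > 0 && line.Length > 0)
    {
        // Append the line to the article text of the most recent title.
        List<string> current = sections[sections.Count - 1];
        current[1] = current[1].Length == 0 ? line : current[1] + "\n" + line;
    }
}

foreach (List<string> section in sections)
{
    Console.WriteLine("Title: " + section[0]);
    Console.WriteLine("Article: " + section[1]);
}
```

Text that appears before the first title is ignored here; whether that is the right choice depends on your document.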

As mentioned in the comment, you should use an HTML parser. If you still want to try it with plain string code, you could split the string, determine whether each piece is a title or an article, and then add the result to a list.
However, for this task you have to:
Split the string by a multi-character delimiter, that is, by the "</h1>" and "<h1>" strings. Credits: answer to: string.split - by multiple character delimiter.
Build a list of lists (if I understand your question correctly); you could use the code shown in this answer.
NOTE: This code assumes the string (i.e. your document's content) has equal amounts of titles and articles.
Here's the code I've made - hosted on dotnetfiddle.com as well:
// Variables:
string sample = "<h1>Title 1</h1>" + "Article text" + "<h1>Title 2</h1>" + "Article text" + "<h1>Title 3</h1>" + "Article text";
// string.split - by multiple character delimiter
// Credit: https://stackoverflow.com/a/1254596/12511801
string[] arr = sample.Split(new string[]{"</h1>"}, StringSplitOptions.None);
// I store the "title" and "article" in separated lists - their content will be unified later:
List<string> titles = new List<string>();
List<string> articles = new List<string>();
// Loop the splitted text by "</h1>":
foreach (string s in arr)
{
    if (s.StartsWith("<h1>"))
    {
        titles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[1]);
    }
    else
    {
        if (s.Contains("<h1>"))
        {
            // Position 0 is the article and the 1 position is the title:
            articles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[0]);
            titles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[1]);
        }
        else
        {
            // Leading text - it's an article by default.
            articles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[0]);
        }
    }
}
// ------------
// Create a list of lists.
// Credit: https://stackoverflow.com/a/12628275/12511801
List<List<string>> myList = new List<List<string>>();
for (int i = 0; i < titles.Count; i++)
{
    myList.Add(new List<string>{"Title: " + titles[i], "Article: " + articles[i]});
}
// Print the results:
foreach (List<string> subList in myList)
{
    foreach (string item in subList)
    {
        Console.WriteLine(item);
    }
}
Result:
Title: Title 1
Article: Article text
Title: Title 2
Article: Article text
Title: Title 3
Article: Article text

Related

C# text file to string array and how to remove specific strings?

I need to read a text file (10 MB) and convert it to .csv. See the portion of code below:
string DirPathForm = System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetEntryAssembly().Location);
string[] lines = File.ReadAllLines(DirPathForm + @"\file.txt");
Some portions of the text file follow a pattern, so I used the following:
string[] lines1 = lines.Select(x => x.Replace("abc[", "ab,")).ToArray();
Array.Clear(lines, 0, lines.Length);
lines = lines1.Select(x => x.Replace("] CDE ", ",")).ToArray();
Other portions have no fixed pattern that Replace can target directly. The question is how to remove the characters, numbers and whitespace in these portions. Please see below:
string[] lines = {
"a] 773 b",
"e] 1597 t",
"z] 0 c"
};
to get the result below:
string[] result = {
"a,b",
"e,t",
"z,c"
};
Note: the removed items need to be replaced by ",".
First of all, avoid File.ReadAllLines for large files: it loads the entire file into memory at once. Read the lines one at a time in a loop instead, for example with File.ReadLines or a StreamReader.
Secondly, you can definitely use a regex to turn the first form into the second one.
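A minimal sketch of the regex replacement, using the three sample lines from the question. The pattern is my own guess at the rule (a "]" followed by optional whitespace, a number, and optional whitespace); adjust it if the real data contains other shapes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample lines from the question; for a real file, File.ReadLines(path)
// could be used here instead to stream the lines lazily.
string[] lines = { "a] 773 b", "e] 1597 t", "z] 0 c" };

// Assumed rule: "]" + optional whitespace + digits + optional whitespace becomes ",".
var separator = new Regex(@"\]\s*\d+\s*");

string[] result = lines.Select(line => separator.Replace(line, ",")).ToArray();

foreach (string converted in result)
{
    Console.WriteLine(converted);
}
```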

Reading each non-English character from a file

Let's say a file contains non-English text. We can read its contents with the FileIO.ReadLinesAsync method, so each line holds a sequence of characters. How can I extract each letter (of a non-English alphabet) from such a string? My question is shown in the C# code below.
List<string> finalAlphabets = new List<string>();
IList<string> alphabetLines = await FileIO.ReadLinesAsync(_languageFile, UnicodeEncoding.Utf8);
if (alphabetLines.Count != 0)
{
    foreach (string alphabetLine in alphabetLines)
    {
        // Let's say alphabetLine is "కాకికు"; I want to extract each letter from it and add it to the finalAlphabets list.
        finalAlphabets.Add("కా"); // How do I extract this letter from the alphabetLine variable? The Length of alphabetLine is 6, but in Telugu it is actually a 3-letter word.
    }
}
There is a set of text information classes (TextInfo, StringInfo), and in particular you are likely looking for TextElementEnumerator, which lets you find "text element" boundaries.
Simplified sample from the MSDN article:
var myTEE = System.Globalization.StringInfo.GetTextElementEnumerator("కాకికు");
while (myTEE.MoveNext())
{
    Console.WriteLine("[{0}]:\t{1}\t{2}",
        myTEE.ElementIndex, myTEE.Current, myTEE.GetTextElement());
}
This produces the following output:
[0]: కా కా
[2]: కి కి
[4]: కు కు
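Applied to the question's loop, the same enumerator can fill the finalAlphabets list directly. This is only a sketch: the input is hard-coded here instead of being read with FileIO.ReadLinesAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hard-coded stand-in for one line read from the file.
string alphabetLine = "కాకికు";
var finalAlphabets = new List<string>();

// Walk the string one text element (base character plus combining marks) at a time.
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(alphabetLine);
while (enumerator.MoveNext())
{
    finalAlphabets.Add(enumerator.GetTextElement());
}

// Number of text elements, as opposed to alphabetLine.Length (UTF-16 code units).
Console.WriteLine(finalAlphabets.Count);
```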

How to split a text into paragraphs at a particular string

I have a long text file. I read it and store the content in a string.
Now I want to split this text. The image below shows what I want.
In the image, "This is common text" marks a string that appears in every paragraph.
The green squares show the parts I want in a string array.
But how do I do that? I tried a regular expression for this, but it isn't working.
Please help.
Try using Regex.Split() with this pattern:
(.*This is common text.*)
Well, preferring Regex over plain string functions usually adds some performance overhead.
You could use something like this instead (untested, but it should give you the idea):
string[] lines = System.IO.File.ReadAllLines("FilePath");
List<string> lst = new List<string>();      // all paragraph lines flushed so far
List<string> lstgroup = new List<string>(); // lines of the current paragraph
int i = 0;
foreach (string line in lines)
{
    if (line.ToLower().Contains("this is common text"))
    {
        if (i > 0)
        {
            // A paragraph has ended: flush it.
            lst.AddRange(lstgroup);
            // Print elements here
            lstgroup.Clear();
        }
        else { i++; }
        continue;
    }
    else
    {
        lstgroup.Add(line);
    }
}
i = 0;
// Print elements here too (lstgroup still holds the last paragraph)
I am not sure what you want to split on but you could use
string[] stringArray = Regex.Split(yourString, regex);
If you want a more concrete example, you will have to (as others mentioned) give us more information about what the text looks like, rather than just "common text".
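Without the image, one reading is that "This is common text" acts as a separator line and the text between two occurrences is one paragraph. Under that assumption (and with a made-up sample string), a plain string split may already be enough:

```csharp
using System;
using System.Linq;

// Hypothetical sample; the real text would come from the file.
string text = "This is common text\nfirst paragraph, line 1\nfirst paragraph, line 2\n" +
              "This is common text\nsecond paragraph\n";

// Split on the common string, then trim each piece and drop empty ones.
string[] paragraphs = text
    .Split(new[] { "This is common text" }, StringSplitOptions.RemoveEmptyEntries)
    .Select(p => p.Trim())
    .Where(p => p.Length > 0)
    .ToArray();

foreach (string paragraph in paragraphs)
{
    Console.WriteLine("[" + paragraph + "]");
}
```

Note that this match is case-sensitive and drops the common text itself; if the separator should stay in the output, the pieces would have to be prefixed with it again.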

Split a string into a string array without losing text order

I have been stuck on a problem for 7 days, so I decided to ask for help. Here is my problem:
I read data from a DataGridView (only 2 cells per row) and append it all to a StringBuilder; the data is an article and its price, like an invoice (bill). I then convert the StringBuilder to a single string and split it line by line. That part of my code works, but not the way I want: the articles line up one below another, but the prices do not form a single vertical column; some sit further left and others further right, something like this:
Bread 10$
Egg 4$
Milk 5$
My code:
string[] lines;
StringBuilder sbd = new StringBuilder();
foreach (DataGridViewRow rowe in dataGridView2.Rows)
{
    sbd.Append(rowe.Cells[0].Value).Append(rowe.Cells[10].Value);
    sbd.Append("\n");
}
sbd.Remove(sbd.Length - 1, 1);
string userOutput = sbd.ToString();
lines = userOutput.Split(new string[] { "\r", "\n" },
    StringSplitOptions.RemoveEmptyEntries);
You can use the Trim method in order to remove existing leading and trailing spaces. With PadRight you can automatically add the right number of spaces in order to get a specified total length.
Also use a List<string> that grows automatically instead of using an array that you get from splitting what you just put together before:
List<string> lines = new List<string>();
foreach (DataGridViewRow row in dataGridView2.Rows) {
    lines.Add(row.Cells[0].Value.ToString().Trim().PadRight(25) +
              row.Cells[10].Value.ToString().Trim());
}
But keep in mind that this way of formatting works only if you display the string in a monospaced font (like Courier New or Consolas). Proportional fonts like Arial will yield jagged columns.
Alternatively, you can create an array with the right size by reading the number of lines from the Count property:
string[] lines = new string[dataGridView2.Rows.Count];
for (int i = 0; i < lines.Length; i++) {
    DataGridViewRow row = dataGridView2.Rows[i];
    lines[i] = row.Cells[0].Value.ToString().Trim().PadRight(25) +
               row.Cells[10].Value.ToString().Trim();
}
You can also use the PadLeft method in order to right-align the amounts:
row.Cells[10].Value.ToString().Trim().PadLeft(10)
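To see the effect of the padding without a DataGridView, here is a small stand-alone sketch. The rows list is a made-up stand-in for the grid cells, and the widths 25 and 10 are simply the ones used above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical rows standing in for Cells[0] and Cells[10]; note the stray spaces.
var rows = new List<(string Article, string Price)>
{
    ("Bread ", "10$"),
    (" Egg", "4$"),
    ("Milk", " 5$"),
};

var lines = new List<string>();
foreach (var row in rows)
{
    // Left-align the article in 25 columns, right-align the price in 10 columns.
    lines.Add(row.Article.Trim().PadRight(25) + row.Price.Trim().PadLeft(10));
}

foreach (string formatted in lines)
{
    Console.WriteLine(formatted);
}
```

Every resulting line is 35 characters wide, so the prices end in the same column when shown in a monospaced font.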
Have you tried the String.Split method?
string myString = "Bread ;10$;";
string articleName = myString.Split(';')[0];
string price = myString.Split(';')[1];

C#: Read data from txt file

I have an .EDF (text) file. The file's contents are as follows:
ConfigFile.Sample, Software v0.32, CP Version 0.32
[123_Float][2]
[127_Number][0]
[039_Code][70]
I want to read these items and parse them like this:
123_Float - 2
127_Number - 0
039_Code - 70
How can I do this using C#?
Well, you might start with the File.ReadAllLines() method. Then, iterate through the lines in that file, checking to see if they match a pattern. If they do, extract the necessary text and do whatever you want with it.
Here's an example that assumes you want lines in the format [(field 1)][(field 2)]:
// Or wherever your file is located
string path = @"C:\MyFile.edf";
// Pattern to check each line
Regex pattern = new Regex(@"\[([^\]]*?)\]");
// Read in lines
string[] lines = File.ReadAllLines(path);
// Iterate through lines
foreach (string line in lines)
{
    // Check if line matches your format here
    var matches = pattern.Matches(line);
    if (matches.Count == 2)
    {
        string value1 = matches[0].Groups[1].Value;
        string value2 = matches[1].Groups[1].Value;
        Console.WriteLine(string.Format("{0} - {1}", value1, value2));
    }
}
This outputs them to the console window, but you could obviously do whatever you want with value1 and value2 (write them to another file, store them in a data structure, etc).
Also, please note that regular expressions are not my strong point -- there's probably a much more elegant way to check if a line matches your pattern :)
If you want more info, check out MSDN's article on reading data from a text file as a starting point.
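One possibly tidier way to check the line format is a single anchored pattern that must match the whole line and captures both fields at once. The sketch below uses the sample lines from the question in place of the file; the pattern is an assumption about the format, namely exactly two bracketed fields and nothing else on the line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample content from the question, standing in for File.ReadAllLines(path).
string[] lines =
{
    "ConfigFile.Sample, Software v0.32, CP Version 0.32",
    "[123_Float][2]",
    "[127_Number][0]",
    "[039_Code][70]",
};

// Whole line must be "[field1][field2]"; each group captures one field.
var pattern = new Regex(@"^\[([^\]]*)\]\[([^\]]*)\]$");

var parsed = new List<string>();
foreach (string line in lines)
{
    Match match = pattern.Match(line);
    if (match.Success)
    {
        parsed.Add(match.Groups[1].Value + " - " + match.Groups[2].Value);
    }
}

foreach (string entry in parsed)
{
    Console.WriteLine(entry);
}
```

The header line does not match the pattern, so it is skipped without any special handling.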
Let us assume your file really is as simple as you describe it. Then you could drop the first line and parse the data lines like this:
foreach (string line in File.ReadAllLines(@"C:\MyFile.edf").Skip(1))
{
    // The string[] overload is used so this also compiles on .NET Framework.
    var parts = line.Split(new[] { "][" }, StringSplitOptions.None);
    var value1 = parts[0].Replace("[", "");
    var value2 = parts[1].Replace("]", "");
    Console.WriteLine(string.Format("{0} - {1}", value1, value2));
}
Another variation:
var lines = File.ReadAllLines(file)
    .Skip(1)
    .Select(x => x.Split(new[] { '[', ']' },
        StringSplitOptions.RemoveEmptyEntries));
foreach (var pair in lines)
{
    Console.WriteLine(pair.First() + " - " + pair.Last());
}
