Reading each non-English character from a file - c#

Let's say a file contains non-English text. We can read its contents with the FileIO.ReadLinesAsync method, and each line then contains a set of characters. How can I extract each letter (of a non-English alphabet) from such a string? My question is illustrated in the C# code below.
List<string> finalAlphabets = new List<string>();
IList<string> alphabetLines = await FileIO.ReadLinesAsync(_languageFile, UnicodeEncoding.Utf8);
if (alphabetLines.Count != 0)
{
    foreach (string alphabetLine in alphabetLines)
    {
        // Let's say alphabetLine is "కాకికు". I want to extract each letter from it
        // and add it to the finalAlphabets list.
        // How do I extract this letter from alphabetLine? alphabetLine.Length is 6,
        // but in Telugu this is actually a 3-letter word.
        finalAlphabets.Add("కా");
    }
}

There is a set of text information classes (TextInfo, StringInfo); in particular, you are likely looking for TextElementEnumerator, which lets you find "text element" boundaries.
A simplified sample from the MSDN article:
var myTEE = System.Globalization.StringInfo.GetTextElementEnumerator("కాకికు");
while (myTEE.MoveNext())
{
    Console.WriteLine("[{0}]:\t{1}\t{2}",
        myTEE.ElementIndex, myTEE.Current, myTEE.GetTextElement());
}
This produces the following output:
[0]: కా కా
[2]: కి కి
[4]: కు కు
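To tie this back to the question, here is a minimal sketch that collects the text elements into a list instead of printing them. The class name TextElements and the method name Split are made up for this illustration; only StringInfo.GetTextElementEnumerator comes from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: splits a string into text elements
    // (a base character together with its combining marks).
    public static List<string> Split(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        List<string> finalAlphabets = Split("కాకికు");
        // Count is the number of text elements, not the number of UTF-16 code units.
        Console.WriteLine(finalAlphabets.Count);
    }
}
```

Inside the question's loop, this would amount to something like finalAlphabets.AddRange(TextElements.Split(alphabetLine)).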

Related

C# split big html to H1 and text up to next H1

I have a big document with the following rigid structure:
<h1>Title 1</h1>
Article text
<h1>Title 2</h1>
Article text
<h1>Title 3</h1>
Article text
My aim is to create a list of lists, each holding a title and the article text that follows it, up to the next title.
I tried:
var parts = Regex.Split(html2, @"(<h1>)").Where(l => l != string.Empty).ToArray().Select(a => Regex.Split(a, @"(</h1>)")).ToArray();
But the result is not as expected. Any ideas on how to split out the separate articles and titles? Thanks!
Parsing HTML with Regex is a bad idea, as described by this classic answer: https://stackoverflow.com/a/1732454/173322
If you had a specific tag to extract then it could work but since you have multiple tags and freeform article text in-between, I suggest you use a parsing engine like AngleSharp or HtmlAgilityPack. It'll be faster and more reliable.
If you must stick with manual text parsing, I would simply loop through each line, check if it starts with an <h1> tag, classify the lines as Titles or Article Text, then loop through again to strip out the tags from the Titles and pair with the Article text.
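The line-by-line approach described above could be sketched roughly as follows. It assumes, as in the question's sample, that every title sits on its own line in the form <h1>...</h1>, and it ignores any text that appears before the first title. The names H1LineSplitter and Split are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class H1LineSplitter
{
    // Returns one inner list per section: [0] is the title, [1] is the article text.
    // Assumption: each <h1>...</h1> title occupies a single line of its own.
    public static List<List<string>> Split(string html)
    {
        var sections = new List<List<string>>();
        string title = null;
        var body = new List<string>();

        foreach (string rawLine in html.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None))
        {
            string line = rawLine.Trim();
            if (line.StartsWith("<h1>", StringComparison.Ordinal))
            {
                // A new title closes the previous section, if there was one.
                if (title != null)
                {
                    sections.Add(new List<string> { title, string.Join("\n", body) });
                }
                title = line.Replace("<h1>", "").Replace("</h1>", "").Trim();
                body.Clear();
            }
            else if (title != null)
            {
                body.Add(line);
            }
        }

        // Flush the last section.
        if (title != null)
        {
            sections.Add(new List<string> { title, string.Join("\n", body) });
        }
        return sections;
    }

    public static void Main()
    {
        string html = "<h1>Title 1</h1>\nArticle text\n<h1>Title 2</h1>\nArticle text";
        foreach (List<string> section in Split(html))
        {
            Console.WriteLine(section[0] + " -> " + section[1]);
        }
    }
}
```

This only holds up while the document keeps its rigid structure; for anything less regular, a real HTML parser remains the safer choice.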
As mentioned in the comment, you should use an HTML parser. If you still want to try it in code, you could split the string, determine whether each piece is a title or an article, and then add the result to a list.
However, for this task you have to:
Split the string on a multi-character delimiter, that is, on the "</h1>" and "<h1>" strings. Credit: answer to "string.split - by multiple character delimiter".
Build a list of lists (if I understand your question correctly); you could use the code shown in this answer.
NOTE: This code assumes the string (i.e. your document's content) has equal amounts of titles and articles.
Here's the code I've made - hosted on dotnetfiddle.com as well:
// Variables:
string sample = "<h1>Title 1</h1>" + "Article text" + "<h1>Title 2</h1>" + "Article text" + "<h1>Title 3</h1>" + "Article text";
// string.split - by multiple character delimiter
// Credit: https://stackoverflow.com/a/1254596/12511801
string[] arr = sample.Split(new string[]{"</h1>"}, StringSplitOptions.None);
// The titles and articles are stored in separate lists; they are combined later:
List<string> titles = new List<string>();
List<string> articles = new List<string>();
// Loop over the pieces produced by splitting on "</h1>":
foreach (string s in arr)
{
    if (s.StartsWith("<h1>"))
    {
        titles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[1]);
    }
    else
    {
        if (s.Contains("<h1>"))
        {
            // Position 0 is the article and position 1 is the next title:
            articles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[0]);
            titles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[1]);
        }
        else
        {
            // Text with no <h1> in it - it's an article by default.
            articles.Add(s.Split(new string[]{"<h1>"}, StringSplitOptions.None)[0]);
        }
    }
}
// ------------
// Create a list of lists.
// Credit: https://stackoverflow.com/a/12628275/12511801
List<List<string>> myList = new List<List<string>>();
for (int i = 0; i < titles.Count; i++)
{
    myList.Add(new List<string>{"Title: " + titles[i], "Article: " + articles[i]});
}
// Print the results:
foreach (List<string> subList in myList)
{
    foreach (string item in subList)
    {
        Console.WriteLine(item);
    }
}
Result:
Title: Title 1
Article: Article text
Title: Title 2
Article: Article text
Title: Title 3
Article: Article text

C# text file to string array and how to remove specific strings?

I need to read a text file (10 MB) and convert it to .csv. See the portion of the code below:
string DirPathForm = System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetEntryAssembly().Location);
string[] lines = File.ReadAllLines(DirPathForm + @"\file.txt");
Some portions of the text file follow a pattern, so I used Replace as below:
string[] lines1 = lines.Select(x => x.Replace("abc[", "ab,")).ToArray();
Array.Clear(lines, 0, lines.Length);
lines = lines1.Select(x => x.Replace("] CDE ", ",")).ToArray();
Some portions do not have a pattern that Replace can handle directly. The question is how to remove the characters, numbers and whitespace in this portion. Please see below:
string[] lines = {
"a] 773 b",
"e] 1597 t",
"z] 0 c"
};
to get the result below:
string[] result = {
"a,b",
"e,t",
"z,c"
};
Note: the removed items need to be replaced by ",".
First of all, you should not use ReadAllLines on a large file: it loads all the data into RAM at once. Instead, read the lines one by one in a loop (for example with File.ReadLines or a StreamReader).
Secondly, you can definitely use a regex to transform the data from the first form into the second one.
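As a rough illustration of the regex suggestion, the sketch below replaces "]", the number after it, and the surrounding whitespace with a single comma. The pattern is an assumption based only on the three sample lines in the question, and the names CsvLineCleaner and Clean are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvLineCleaner
{
    // "]" followed by optional whitespace, one or more digits, optional whitespace.
    // Assumption: this covers the unpatterned portion shown in the question.
    private static readonly Regex BracketNumber = new Regex(@"\]\s*\d+\s*");

    public static string Clean(string line)
    {
        return BracketNumber.Replace(line, ",");
    }

    public static void Main()
    {
        string[] lines = { "a] 773 b", "e] 1597 t", "z] 0 c" };
        foreach (string cleaned in lines.Select(Clean))
        {
            Console.WriteLine(cleaned); // a,b then e,t then z,c
        }
    }
}
```

To avoid holding the whole file in memory, the same Clean method could be applied lazily, for instance File.WriteAllLines(outputPath, File.ReadLines(inputPath).Select(CsvLineCleaner.Clean)), where inputPath and outputPath are placeholders.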

reading a CSV issue

I am trying to read a CSV file.
The following is a sample line:
"0734306547 ","9780734306548 ","Jane Eyre Pink PP ","Bronte Charlotte ","FRONT LIST",20/03/2013 0:00:00,0,"PAPERBACK","Y","Pen"
Here is the code I am using to read the CSV:
public void readCSV()
{
    StreamReader reader = new StreamReader(File.OpenRead(@"C:\abc\21-08-2013\PNZdatafeed.csv"), Encoding.ASCII);
    List<string> ISBN = new List<String>();
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        if (!String.IsNullOrWhiteSpace(line))
        {
            string[] values = line.Split(',');
            if (values[9] == "Pen")
            {
                ISBN.Add(values[1]);
            }
        }
    }
    MessageBox.Show(ISBN.Count().ToString());
}
I am not able to compare the value with if (values[9] == "Pen"), because when I debug the code it says the value of values[9] is "\"Pen\"".
How do I get rid of the special characters?
The problem here is that you're splitting the line every time you find , and leaving the data like that. For example, if this is the line you're reading in:
"A","B","C"
and you split it at commas, you'll get "A", "B", and "C" as your data. According to your description, you don't want quotes around the data.
To throw away quotes around a string:
Check if the leftmost character is ".
If so, check if the rightmost character is ".
If so, remove the leftmost and rightmost characters.
In pseudocode:
if (data.left(1) == "\"" && data.right(1) == "\"") {
data = data.trimleft(1).trimright(1)
}
At this point you might have a few questions (I'm not sure how much experience you have). If any of these apply to you, feel free to ask them, and I'll explain further.
What does "\"" mean?
How do I extract the leftmost/rightmost character of a string?
How do I extract the middle of a string?
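For reference, the three steps above might translate into C# along these lines. This is only a sketch: the names CsvField and Unquote are invented here, and the sample line in Main is an abbreviated version of the one in the question.

```csharp
using System;

public static class CsvField
{
    // Removes one pair of surrounding double quotes, if both are present.
    // Anything else (unquoted fields, a lone quote) is returned unchanged.
    public static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2);
        }
        return field;
    }

    public static void Main()
    {
        // Abbreviated sample line; "\"" is how a double quote is written inside a C# string.
        string line = "\"0734306547 \",\"Jane Eyre Pink PP \",\"PAPERBACK\",\"Y\",\"Pen\"";
        string[] values = line.Split(',');
        Console.WriteLine(Unquote(values[4]) == "Pen"); // True
    }
}
```

Note that this does nothing about commas inside quoted fields; a plain Split(',') would still break those apart.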

how to split a text in to paragraph with a particular string

I have a long text file. I read the text file and store its content in a string.
Now I want to split this text. Below is an image that shows what I want.
In the image, "This is common text" means that this string is common to every paragraph.
The green squares show the parts that I want in a string array.
But how do I do that? I have tried a regular expression for this, but it isn't working.
Please help.
Try Regex.Split() with this pattern:
(.*This is common text.*)
Well, preferring Regex over the plain string functions tends to add performance overhead.
It would be better to use something like this (untested, but it will give you an idea):
string[] lines = System.IO.File.ReadAllLines("FilePath");
List<string> lst = new List<string>();
List<string> lstgroup = new List<string>();
int i = 0;
foreach (string line in lines)
{
    // A line containing the common text marks a paragraph boundary.
    if (line.ToLower().Contains("this is common text"))
    {
        if (i > 0)
        {
            lst.AddRange(lstgroup.ToArray());
            // Print elements here
            lstgroup.Clear();
        }
        else { i++; }
        continue;
    }
    else
    {
        lstgroup.Add(line);
    }
}
i = 0;
// Print elements here too
I am not sure what you want to split on, but you could use
string[] stringArray = Regex.Split(yourString, regex);
If you want a more concrete example you will have to (as others mentioned) give us more information regarding what the text looks like, rather than just "common text".
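As one possible reading of the question, the sketch below splits on the literal marker "This is common text" and drops the marker itself. Whether the marker should be kept is not clear from the question, so treat that as an assumption. The names ParagraphSplitter and Split are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphSplitter
{
    // Splits text on a literal marker and returns the non-empty pieces, trimmed.
    // Regex.Escape makes sure the marker is matched literally, not as a pattern.
    public static string[] Split(string text, string marker)
    {
        return Regex.Split(text, Regex.Escape(marker))
                    .Select(part => part.Trim())
                    .Where(part => part.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "This is common text\nfirst paragraph\nThis is common text\nsecond paragraph";
        foreach (string paragraph in Split(text, "This is common text"))
        {
            Console.WriteLine("[" + paragraph + "]");
        }
    }
}
```

For a fixed literal marker like this one, text.Split(new[] { marker }, StringSplitOptions.RemoveEmptyEntries) would do the same job without a regex.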

C#: Read data from txt file

I have an .EDF (text) file. The file's contents are as follows:
ConfigFile.Sample, Software v0.32, CP Version 0.32
[123_Float][2]
[127_Number][0]
[039_Code][70]
I want to read these items and parse them like this:
123_Float - 2
127_Number - 0
039_Code - 70
How can I do this using C#?
Well, you might start with the File.ReadAllLines() method. Then, iterate through the lines in that file, checking to see if they match a pattern. If they do, extract the necessary text and do whatever you want with it.
Here's an example that assumes you want lines in the format [(field 1)][(field 2)]:
// Or wherever your file is located
string path = @"C:\MyFile.edf";
// Pattern to check each line
Regex pattern = new Regex(@"\[([^\]]*?)\]");
// Read in lines
string[] lines = File.ReadAllLines(path);
// Iterate through lines
foreach (string line in lines)
{
    // Check if line matches your format here
    var matches = pattern.Matches(line);
    if (matches.Count == 2)
    {
        string value1 = matches[0].Groups[1].Value;
        string value2 = matches[1].Groups[1].Value;
        Console.WriteLine(string.Format("{0} - {1}", value1, value2));
    }
}
This outputs them to the console window, but you could obviously do whatever you want with value1 and value2 (write them to another file, store them in a data structure, etc).
Also, please note that regular expressions are not my strong point -- there's probably a much more elegant way to check if a line matches your pattern :)
If you want more info, check out MSDN's article on reading data from a text file as a starting point.
Let us assume your file really is as simple as you describe it. Then you could drop the first line and parse the data lines like this:
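foreach (string line in File.ReadAllLines(@"C:\MyFile.edf").Skip(1))
{
    // Split on the "][" between the two fields (this overload also works on older frameworks)
    var parts = line.Split(new[] { "][" }, StringSplitOptions.None);
    var value1 = parts[0].Replace("[", "");
    var value2 = parts[1].Replace("]", "");
    Console.WriteLine(string.Format("{0} - {1}", value1, value2));
}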
Another variation.
var lines = File.ReadAllLines(file)
    .Skip(1)
    .Select(x => x.Split(new[] { '[', ']' },
        StringSplitOptions.RemoveEmptyEntries));
foreach (var pair in lines)
{
    Console.WriteLine(pair.First() + " - " + pair.Last());
}
