I have millions of strings, around 8GB worth of HEX; each string is 3.2kb in length.
Each of these strings contains multiple parts of data I need to extract.
This is an example of one such string:
GPGGA,104644.091,,,,,0,0,,,M,,M,,*43$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ$GPGGA,104645.091,,,,,0,0,,,M,,M,,*42$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ$GPGGA,104646.091,,,,,0,0,,,M,,M,,*41$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test2ÿÿ3ÿÿ4ÿÿ5ÿÿ6ÿÿ7ÿÿ8ÿÿ9ÿÿ:ÿÿ;ÿÿ<ÿÿ=ÿÿ>ÿÿ?ÿÿ#ÿÿAÿÿBÿÿCÿÿDÿÿEÿÿFÿÿGÿÿHÿÿIÿÿJÿÿ$GPGGA,104647.091,,,,,0,0,,,M,,M,,*40$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header TestKÿÿLÿÿMÿÿNÿÿOÿÿPÿÿQÿÿRÿÿSÿÿTÿÿUÿÿVÿÿWÿÿXÿÿYÿÿZÿÿ[ÿÿ\ÿÿ]ÿÿ^ÿÿ_ÿÿ`ÿÿaÿÿbÿÿcÿÿ$GPGGA,104648.091,,,,,0,0,,,M,,M,,*4F$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Testdÿÿeÿÿfÿÿgÿÿhÿÿiÿÿjÿÿkÿÿlÿÿmÿÿnÿÿoÿÿpÿÿqÿÿrÿÿsÿÿtÿÿuÿÿvÿÿwÿÿxÿÿyÿÿzÿÿ{ÿÿ|ÿÿ$GPGGA,104649.091,,,,,0,0,,,M,,M,,*4E$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test}ÿÿ~ÿÿ.ÿÿ€ÿÿ.ÿÿ‚ÿÿƒÿÿ„ÿÿ…ÿÿ†ÿÿ‡ÿÿˆÿÿ‰ÿÿŠÿÿ‹ÿÿŒÿÿ.ÿÿŽÿÿ.ÿÿ.ÿÿ‘ÿÿ’ÿÿ“ÿÿ”ÿÿ•ÿÿ$GPGGA,104650.091,,,,,0,0,,,M,,M,,*46$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Head
As you can see, it is pretty much this pattern repeated:
GPGGA,104644.091,,,,,0,0,,,M,,M,,*43$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ$GPGGA,104645.091,,,,,0,0,,,M,,M,,*42$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*32Header Test.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ
I want to separate this string into two lists like this:
_GPSList
$GPGGA,104644.091,,,,,0,0,,,M,,M,,*43
$GPVTG,0.00,T,,M,0.00,N,0.00,K,N*
$GPVTG,0.00,T,,M,0.00,N,0.00,K,N
_WavList
32HeaderTest.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ
32HeaderTest.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ.ÿÿ ÿÿ!ÿÿ"ÿÿ#ÿÿ$ÿÿ%ÿÿ&ÿÿ'ÿÿ(ÿÿ)ÿÿ*ÿÿ+ÿÿ,ÿÿ-ÿÿ.ÿÿ/ÿÿ0ÿÿ1ÿÿ
Issue 1:
This repetition isn't contained within a single string; it overflows into the next string. If some data crosses the end of one string and the start of the next, how do I deal with that?
Issue 2: How do I analyse the string and extract only the parts I need?
The solution I'm providing is not a complete answer but more like an idea which might help you get what you want.
Everything else which I present is an assumption on my behalf.
// Needs: using System.Collections.Generic; using System.IO;
// Assuming your data is stored in a file "yourdatafile".
// Split all the text on "$", assuming this separates the GPS data.
string[] splittedstring = File.ReadAllText("yourdatafile").Split('$');
// I found an extra string lingering in the sample you provided
// because I split on "$", so you have to take that into account.
var GPSList = new List<string>();
var WAVList = new List<string>();
foreach (var str in splittedstring)
{
    // If the string contains "Header", separate that part from the GPS data.
    if (str.Contains("Header"))
    {
        string temp = str.Remove(str.IndexOf("Header"));
        int indexOfAsterisk = temp.LastIndexOf("*");
        string stringBeforeAsterisk = str.Substring(0, indexOfAsterisk + 1);
        // Substring rather than Replace, so only the leading GPS part is cut off
        // and a missing "*" does not make Replace throw on an empty search string.
        string stringAfterAsterisk = str.Substring(indexOfAsterisk + 1);
        WAVList.Add(stringAfterAsterisk);
        GPSList.Add("$" + stringBeforeAsterisk);
    }
    else
        GPSList.Add("$" + str);
}
This produces the output you asked for; the only exception is that extra string. Also, some non-standard characters might show up as black blocks.
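For Issue 1 (data that straddles two strings), one option is to keep the unfinished tail of each string and prepend it to the next one. The sketch below is only an illustration under some assumptions: every record starts with the marker "$GPGGA" (a bare "$" also shows up inside the binary part of the sample, so it is not used as the delimiter), and the names RecordStitcher, Feed and Remainder are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: feed it the strings in order and it hands back only
// records that are known to be complete, carrying the tail over to the next call.
public sealed class RecordStitcher
{
    // Assumption: each record begins with this marker.
    private const string Marker = "$GPGGA";
    private readonly StringBuilder _pending = new StringBuilder();

    // Whatever has been received but not yet emitted (call after the last string).
    public string Remainder
    {
        get { return _pending.ToString(); }
    }

    public List<string> Feed(string chunk)
    {
        _pending.Append(chunk);
        string text = _pending.ToString();
        var complete = new List<string>();

        int start = text.IndexOf(Marker, StringComparison.Ordinal);
        if (start < 0)
        {
            // No marker seen yet: keep everything and wait for more data.
            return complete;
        }

        // Anything before the first marker is a headless fragment and is dropped.
        while (true)
        {
            int next = text.IndexOf(Marker, start + Marker.Length, StringComparison.Ordinal);
            if (next < 0)
            {
                break;
            }
            complete.Add(text.Substring(start, next - start));
            start = next;
        }

        // The last record may continue in the next string, so keep it pending.
        _pending.Length = 0;
        _pending.Append(text, start, text.Length - start);
        return complete;
    }
}
```

Each returned record can then be split into its GPS and WAV parts as shown above. After the last string, Remainder holds the final record, which may or may not be complete.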
Since I have not been able to find a resolution through searching, I believe I may have a unique problem. Essentially, I am creating a gene finding/creation application in C# .NET for my wife and am using RichTextBoxes so that she can highlight, color, export, etc. the information she needs. I have written several custom methods for it because, as I am sure we all know, RichTextBoxes from Microsoft leave much to the imagination.
Anyway, here is my issue: I need to be able to search for a term across hard returns. The users have strings in 60 letter intervals and they need to search for items that may cross that hard return barrier. For instance let's say I have 2 lines (I will make them short for simplicity):
AAATTTCCCGGG
TTTCCCGGGAAA
If the user runs a search for GGGTTT, I need to be able to pull the result even though there is a line break/hard return in there. For the life of me I cannot think of a good way to do this and still select the result in the RichTextBox. I can always find the result but getting a proper index for the RichTextBox is what eludes me.
If needed, I am not against richTextBox.SaveFile() and LoadFile() and parsing the RTF text as a string manually. It doesn't have to be pretty in this case; it just has to work.
I appreciate any help/guidance you may give.
Here is a relevant snippet:
//textbox 2 search area (examination area)
private void button5_Click(object sender, EventArgs e)
{
    textBox3.Text = textBox3.Text.ToUpper();
    if (textBox3.Text.Length > 0)
    {
        List<string> lines = richTextBox2.Lines.ToList();
        string allText = "";
        foreach (string line in lines)
            allText = allText + line.Replace("\r", "").Replace("\n", "");
        if (findMultiLineRTB2(allText, textBox3.Text) != -1)
        {
            richTextBox2.Select(lastMatchForRTB2, textBox3.Text.Length);
            richTextBox2.SelectionColor = System.Drawing.Color.White;
            richTextBox2.SelectionBackColor = System.Drawing.Color.Blue;
        }//end if
        else
            MessageBox.Show("Reached the end of the sequence", "Finished Searching");
    }//end if
}//end method

private int findMultiLineRTB2(string rtbText, string searchString)
{
    lastMatchForRTB2 = rtbText.IndexOf(searchString, lastMatchForRTB2 + 1);
    return lastMatchForRTB2;
}
So I make an assumption: you want to search for a term across all lines, where each line is 60 characters long. The desired result is the index of that term.
You just have to build a string that has no line breaks, for example with string.Join:
string allText = string.Join("", richTextBox.Lines);
int indexOf = allText.IndexOf("GGGTTT"); // 9 in your sample
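That index refers to the joined string, not to the text inside the RichTextBox, so it still has to be translated before calling Select. A rough sketch of that translation follows; it assumes the control counts each line break as exactly one character, and the names LineBreakSearch, TryFind and ToBoxIndex are invented for the example.

```csharp
using System;

// Hypothetical helper: finds a term in the concatenated lines and reports
// where the match sits in text whose lines are separated by one '\n' each.
public static class LineBreakSearch
{
    public static bool TryFind(string[] lines, string term, int fromFlatIndex,
                               out int start, out int length)
    {
        start = -1;
        length = 0;
        string flat = string.Concat(lines);
        if (string.IsNullOrEmpty(term) || fromFlatIndex < 0 || fromFlatIndex > flat.Length)
            return false;

        int flatIndex = flat.IndexOf(term, fromFlatIndex, StringComparison.Ordinal);
        if (flatIndex < 0)
            return false;

        // Map the first and last matched characters separately, so the length
        // grows by one for every line break the match crosses.
        start = ToBoxIndex(lines, flatIndex);
        int last = ToBoxIndex(lines, flatIndex + term.Length - 1);
        length = last - start + 1;
        return true;
    }

    // Adds one character for each complete line that precedes the position.
    private static int ToBoxIndex(string[] lines, int flatIndex)
    {
        int consumed = 0;
        for (int i = 0; i < lines.Length; i++)
        {
            if (flatIndex < consumed + lines[i].Length)
                return flatIndex + i;
            consumed += lines[i].Length;
        }
        throw new ArgumentOutOfRangeException("flatIndex");
    }
}
```

With the two sample lines and the term GGGTTT this yields start 9 and length 7 (six letters plus the line break), which could then be passed to richTextBox.Select(start, length). To find the next match, call it again with the previous flat index plus one.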
Hi, I have the following code, which creates two worksheets in an Excel spreadsheet based on the values in a datagrid. I am able to get it working for two datagrids, but I need to do it for 14. This is what I have so far:
var grid1Output = RadGridView1.ToExcelML();
var grid2Output = RadGridView2.ToExcelML().Replace("Worksheet1", "Worksheet2");
var workBook = grid1Output.Replace("</Worksheet>", "</Worksheet>" +
grid2Output.Substring(grid2Output.IndexOf("<Worksheet"),
grid2Output.IndexOf("</Worksheet")- grid2Output.IndexOf("<Worksheet")) + " </Worksheet>");
The above works fine, but I need to do it for 14 grid outputs in total. My problem is that I am having trouble replacing strings in the right place. How do I do this?
I would probably do it with LINQ to XML methods rather than string manipulation, but the choice is yours.
Either way, it shouldn't be that hard to write a method that takes a grid output (I am assuming it's a string), extracts the contents, and returns them. The calling routine then assembles the 14 XML strings and wraps them in a single Worksheet tag.
Here's a stab at it. Bear in mind that I'm not familiar with the RadGridView and the output of ToExcelML, so you probably won't be able to use this code without some modification. I'm making some assumptions that may not be valid.
First, I would create a method that takes an XML string as input. I am assuming that this string is entirely wrapped in a <WorksheetN> tag, where N is the index of the sheet.
string ExtractWorksheetContents(string excelML, int index)
{
    // You might also be able to do this with a regex, depending on how the contents are structured.
    // Since I don't know enough about the content, I will do this with string manipulation, as
    // you did, rather than loading the XML and making assumptions.
    string tagName = string.Format("Worksheet{0}", index);
    string closeTag = "</" + tagName + ">";
    int worksheetStart = excelML.IndexOf("<" + tagName);
    int closeStart = excelML.IndexOf(closeTag);
    // Neither the opening nor the closing tag may be missing
    if (worksheetStart == -1 || closeStart == -1)
        throw new ArgumentException("No <" + tagName + "> element found.", "excelML");
    int worksheetEnd = closeStart + closeTag.Length;
    return excelML.Substring(worksheetStart, worksheetEnd - worksheetStart);
}
Then I would assemble the results. Again, I'm making assumptions about how the XML is structured.
StringBuilder sb = new StringBuilder();
sb.Append("<Worksheet>");
RadGridView[] gridViews = new RadGridView[] { RadGridView1, RadGridView2 .... RadGridView14 };
for(int i=0;i<14; i++)
{
var rgv = gridViews[i];
sb.Append(ExtractWorksheetContents(rgv.ToExcelML(),i+1));
}
sb.Append("</Worksheet>");
var workBook = sb.ToString();
Hope this helps somewhat.
I'm trying to parse a text file that has a heading and the body. In the heading of this file, there are line number references to sections of the body. For example:
SECTION_A 256
SECTION_B 344
SECTION_C 556
This means that SECTION_A starts at line 256.
What would be the best way to parse this heading into a dictionary and then, when necessary, read the sections?
Typical scenarios would be:
Parse the header and read only section SECTION_B
Parse the header and read the first paragraph of each section.
The data file is quite large and I definitely don't want to load all of it to the memory and then operate on it.
I'd appreciate your suggestions. My environment is VS 2008 and C# 3.5 SP1.
You can do this quite easily.
There are three parts to the problem.
1) How to find where a line in the file starts. The only way to do this is to read the lines from the file, keeping a list that records the start position of each line in the file, e.g.
List<long> lineMap = new List<long>();
lineMap.Add(0); // Line 0 starts at location 0 in the data file (just a dummy entry)
lineMap.Add(0); // Line 1 starts at location 0 in the data file
using (StreamReader sr = new StreamReader("DataFile.txt"))
{
    String line;
    // Caveat: StreamReader reads ahead in blocks, so BaseStream.Position can run
    // ahead of the line just returned. Verify these offsets, or track byte
    // positions yourself.
    while ((line = sr.ReadLine()) != null)
        lineMap.Add(sr.BaseStream.Position);
}
2) Read and parse your index file into a dictionary.
Dictionary<string, int> index = new Dictionary<string, int>();
using (StreamReader sr = new StreamReader("IndexFile.txt"))
{
    String line;
    while ((line = sr.ReadLine()) != null)
    {
        string[] parts = line.Split(' '); // Break the line into the name & line number
        index.Add(parts[0], Convert.ToInt32(parts[1]));
    }
}
3) Then, to find a line in your file, use:
int lineNumber = index["SECTION_B"]; // Convert section name into the line number
long offsetInDataFile = lineMap[lineNumber]; // Convert line number into file offset
Then open a new FileStream on DataFile.txt, Seek(offsetInDataFile, SeekOrigin.Begin) to move to the start of the line, and use a StreamReader (as above) to read line(s) from it.
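Because a StreamReader buffers its input, one way to get dependable offsets is to scan the raw bytes for line feeds. The following is a sketch rather than a drop-in solution: it assumes an encoding in which the byte 0x0A only ever means "new line" (ASCII or UTF-8, not UTF-16), and the names LineOffsetIndex, Build and ReadLineAt are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper that maps 1-based line numbers to byte offsets.
public static class LineOffsetIndex
{
    // offsets[n] is the byte offset at which line n starts; offsets[0] is a dummy.
    public static List<long> Build(string path)
    {
        var offsets = new List<long> { 0L, 0L };
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[64 * 1024];
            long position = 0; // offset of buffer[0] within the file
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    // The next line starts right after every '\n' byte.
                    if (buffer[i] == (byte)'\n')
                        offsets.Add(position + i + 1);
                }
                position += read;
            }
        }
        return offsets;
    }

    // Seeks to the offset and reads a single line from there.
    public static string ReadLineAt(string path, long offset)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
                return reader.ReadLine();
        }
    }
}
```

Only one buffer and the list of offsets are held in memory, so the file itself is never loaded as a whole. If the file ends with a line break, the last entry points at the end of the file and ReadLineAt returns null for it.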
Well, obviously you can store the name + line number into a dictionary, but that's not going to do you any good.
Well, sure, it will allow you to know which line to start reading from, but the problem is, where in the file is that line? The only way to know is to start from the beginning and start counting.
The best way would be to write a wrapper that decodes the text contents (if you have encoding issues) and can give you a line number to byte position type of mapping, then you could take that line number, 256, and look in a dictionary to know that line 256 starts at position 10000 in the file, and start reading from there.
Is this a one-off processing situation? If not, have you considered stuffing the entire file into a local database, like a SQLite database? That would allow you to have a direct mapping between line number and its contents. Of course, that file would be even bigger than your original file, and you'd need to copy data from the text file to the database, so there's some overhead either way.
Just read the file one line at a time and ignore the data until you get to the ones you need. You won't have any memory issues, but performance probably won't be great. You can do this easily in a background thread though.
Read the file until the end of the header, assuming you know where that is. Split the strings you've stored on whitespace, like so:
Dictionary<string, int> sectionIndex = new Dictionary<string, int>();
List<string> headers = new List<string>(); // fill these with readline
foreach (string header in headers)
{
    var s = header.Split(new[] { ' ' });
    sectionIndex.Add(s[0], Int32.Parse(s[1]));
}
Find the dictionary entry you want, keep a count of the number of lines read in the file, and loop until you hit that line number, then read until you reach the next section's starting line. I don't know if you can guarantee the order of keys in the Dictionary, so you'd probably need the current and next section's names.
Be sure to do some error checking to make sure the section you're reading to isn't before the section you're reading from, and any other error cases you can think of.
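The counting loop described above might look roughly like this. It is a sketch under the assumption that line numbers in the heading are 1-based and count from the very first line of the file; LineRangeReader and ReadLines are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: streams through the reader and keeps only the lines
// from firstLine up to (but not including) endLine.
public static class LineRangeReader
{
    public static List<string> ReadLines(TextReader reader, int firstLine, int endLine)
    {
        if (endLine < firstLine)
            throw new ArgumentException("The section ends before it starts.", "endLine");

        var result = new List<string>();
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber >= endLine)
                break; // reached the start of the next section
            if (lineNumber >= firstLine)
                result.Add(line);
        }
        return result;
    }
}
```

For SECTION_B in the example heading this would be called with firstLine 344 and endLine 556. For the last section, int.MaxValue can serve as endLine so that reading continues to the end of the file.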
You could read line by line until all the heading information is captured and stop (assuming all section pointers are in the heading). You would have the section and line numbers for use in retrieving the data at a later time.
string dataRow;
try
{
    using (TextReader tr = new StreamReader("filename.txt"))
    {
        // ReadLine returns null at the end of the file, so check for that first
        while ((dataRow = tr.ReadLine()) != null)
        {
            // Heading lines start with "SECTION_"; the first other line ends the heading
            if (!dataRow.StartsWith("SECTION_"))
                break;
            //Parse line for section code and line number and log values
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
I wrote a C# program to read an Excel .xls/.xlsx file and output to CSV and Unicode text. I wrote a separate program to remove blank records. This is accomplished by reading each line with StreamReader.ReadLine(), and then going character by character through the string and not writing the line to output if it contains all commas (for the CSV) or all tabs (for the Unicode text).
The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these new lines (since it goes cell by cell) and write them as \x0A, and normal lines just use StreamWriter.WriteLine().
The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.
I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.
I would recommend that you change your architecture to work more like a parser in a compiler.
You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.
In your case the tokens would be:
Column data
Comma
End of Line
You would treat '\n' ('\x0a') by itself as an embedded newline, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.
This has the advantages of:
Doing only one pass over the data
Only storing at most one line's worth of data
Reusing as much memory as possible (for the string builder and the list)
It's easy to change should your requirements change
Here's a sample of what the Lexer would look like:
Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.
enum TokenType
{
    ColumnData,
    Comma,
    LineTerminator
}

class Token
{
    public TokenType Type { get; private set; }
    public string Data { get; private set; }

    public Token(TokenType type)
    {
        Type = type;
    }

    public Token(TokenType type, string data)
    {
        Type = type;
        Data = data;
    }
}

private IEnumerable<Token> GetTokens(TextReader s)
{
    var builder = new StringBuilder();
    while (s.Peek() >= 0)
    {
        var c = (char)s.Read();
        switch (c)
        {
            case ',':
            {
                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.Comma);
                break;
            }
            case '\r':
            {
                var next = s.Peek();
                if (next == '\n')
                {
                    s.Read();
                }
                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.LineTerminator);
                break;
            }
            default:
                // A lone '\n' lands here, so it stays part of the column data
                builder.Append(c);
                break;
        }
    }
    s.Read();
    if (builder.Length > 0)
    {
        yield return new Token(TokenType.ColumnData, ExtractText(builder));
    }
}

private string ExtractText(StringBuilder b)
{
    var ret = b.ToString();
    b.Remove(0, b.Length);
    return ret;
}
Your "parser" code would then look like this:
public void ConvertXLS(TextReader s)
{
    var columnData = new List<string>();
    bool lastWasColumnData = false;
    bool seenAnyData = false;
    foreach (var token in GetTokens(s))
    {
        switch (token.Type)
        {
            case TokenType.ColumnData:
            {
                seenAnyData = true;
                if (lastWasColumnData)
                {
                    //TODO: do some error reporting
                }
                else
                {
                    lastWasColumnData = true;
                    columnData.Add(token.Data);
                }
                break;
            }
            case TokenType.Comma:
            {
                if (!lastWasColumnData)
                {
                    // Two commas in a row mean an empty column
                    columnData.Add(null);
                }
                lastWasColumnData = false;
                break;
            }
            case TokenType.LineTerminator:
            {
                // Lines without any column data are blank records and are skipped
                if (seenAnyData)
                {
                    OutputLine(columnData);
                }
                seenAnyData = false;
                lastWasColumnData = false;
                columnData.Clear();
                break;
            }
        }
    }
    if (seenAnyData)
    {
        OutputLine(columnData);
    }
}
You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.
I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.
It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.
x.Replace("\r\n", "\u0000") // Or some other unused character
.Replace("\n", "\\x0A") // Or whatever escaping you need
.Replace("\u0000", "\r\n") // Replace the real line breaks
I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
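If working character by character turns out to be the way to go, it does not have to break the blank-line logic: read one whole record at a time, treating only "\r\n" as the end of a record and leaving a lone "\n" inside it. The sketch below assumes exactly that convention; RecordReader, ReadRecord and IsBlank are names made up for the example.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical replacement for ReadLine() that only ends a record at CRLF.
public static class RecordReader
{
    // Returns the next record without its CRLF, or null at the end of the input.
    // A bare '\n' (an embedded newline) stays inside the returned string.
    public static string ReadRecord(TextReader reader)
    {
        var sb = new StringBuilder();
        int c;
        while ((c = reader.Read()) >= 0)
        {
            if (c == '\r' && reader.Peek() == '\n')
            {
                reader.Read(); // consume the '\n' of the CRLF pair
                return sb.ToString();
            }
            sb.Append((char)c);
        }
        // End of input: hand back a trailing record that had no terminator.
        return sb.Length > 0 ? sb.ToString() : null;
    }

    // A record is blank when it contains nothing but separators.
    public static bool IsBlank(string record)
    {
        foreach (char ch in record)
        {
            if (ch != ',' && ch != '\t')
                return false;
        }
        return true;
    }
}
```

Records that pass the IsBlank check can be written back with WriteLine(), which restores the CRLF at the end while the embedded "\n" inside the record is left as it was.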
Essentially, a hard-return in Excel (shift+enter or alt+enter, I can't remember) puts a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a newline (which I believe is \r\n).
The CSV is fine and comes out exactly how Excel would save it. The problem is that when I read it into the blank record remover, I'm using ReadLine(), which treats an embedded newline inside a record as if it were a CRLF.
Here's an example of the file after I convert to CSV...
Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
As you can see, the first record has an embedded new-line after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan' and when I write that out, WriteLine() ends that line with a CRLF. I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.
I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically, what I want to do is get the line terminator to determine whether it's \x0A or a CRLF, and then, if it's \x0A, use Write() and insert that terminator.
I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most given.
If you are able to determine the column count, which should be easy since the first line usually holds the column titles, you can check each line's column count against the expected one. If it doesn't match, you simply concatenate the current line with the previous unmatched lines. For example:
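string sep = "\",\"";
int columnCount = 0;
// Declared here so the fragment stands on its own; sr is an open StreamReader
int lineCount = 0;
string currentLine;
string lastLine = null;
string[] lineData;
while ((currentLine = sr.ReadLine()) != null)
{
    if (lineCount == 0)
    {
        // The first line holds the column titles and defines the expected count
        lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
        columnCount = lineData.Length;
        ++lineCount;
        continue;
    }
    // Prepend any partial record carried over from earlier lines
    string thisLine = lastLine + currentLine;
    lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
    if (lineData.Length < columnCount)
    {
        // Too few columns: the record continues on the next line
        lastLine += currentLine;
        continue;
    }
    else
    {
        lastLine = null;
    }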
......
Thank you so much! With your code and some others, I came up with the following solution. I have added a link at the bottom to some code I wrote that uses some of the logic from this page. I figured I'd give honor where honor was due. Thanks!
Below is an explanation of what I needed:
Try this. I wrote it because I have some very large '|' delimited files that have \r\n inside some of the columns, and I needed to use \r\n as the end-of-line delimiter. I was trying to import some files using SSIS packages, but because of some corrupted data in the files I was unable to. The file was over 5 GB, so it was too large to open and fix manually. I found the answer by looking through lots of forums to understand how streams work, and ended up with a solution that reads each character in a file and spits out the line based on the definitions I added to it. This is for use in a command-line application, complete with help :). I hope this helps some other people out. I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.
https://stackoverflow.com/a/12640862/1582188