I'm trying to parse a table in plain text format. The program is written in Visual Studio using C#. I need to parse through the table and insert the data into the database.
Below is a sample table I will be reading in:
ID Name Value1 Value2 Value3 Value4 //header
1 nameA 3.0 0.2 2 6.2
2 nameB
3 nameC 2.9 3.0 7.3
4 nameD 1.5 3.0 1.8 1.1
5 nameE
6 nameF 1.2 2.4 3.3 2.5
7 nameG 3.0 3.2 2.1 4.5
8 nameH 88 12.4 28.9
In the example, I will need to capture data for id 1, 3, 4, 6, 7, and 8.
I thought of two ways to approach this, but neither of them works 100%.
Method 1:
By reading in the header, I can get the start index for each column. I will then use Substring to collect the data for each row.
ISSUE: past a certain row (and I have no way of knowing which one), the columns shift, and Substring no longer collects the correct data.
This method will only collect correct data for 1, 3, and 4.
Method 2:
Using Regex to collect all the matches. I'm hoping this can collect ID, Name, Value1, Value2, Value3, Value4, in this order.
My pattern is (\d*?)\s\s\s+(.*?)\s\s\s+(\d*\.*\d*)\s\s\s+(\d*\.*\d*)\s\s\s+(\d*\.*\d*)\s\s\s+(\d*\.*\d*)
ISSUE: the collected data is shifted left for some rows. For example, on ID 3, Value2 should be blank, but the regex reads Value2 = 3.0, Value3 = 7.3, and Value4 = blank. The same thing happens for ID 8.
Question:
How can I read in the whole table and parse them correctly?
(1) I do not know starting from which row the values will be shifted and
(2) I do not know how many cells it will be shifted by and if they are consistent.
Additional Information
The table is in a PDF file. I converted the PDF to a text file so I can read in the data. The shifting happens when a table spans multiple pages, but it is not consistent.
EDIT
Below are some actual data:
68 BENZYL ALCOHOL 6.0 0.4 1 7.4
91 EVERNIA PRUNASTRI (OAK MOSS) 34 3 3 10
22 test 2323 23 12
OK, here you go! Use this regex pattern:
NOTE: you have to match this to any single line, not to the whole document! If you want to do it for your whole document then you have to add the 'multiline' modifier ('m'). You can do this by adding (?m) at the beginning of the regex pattern!
EDIT:
You provided some lines of your real data. Here's my updated regex pattern:
^(?<id>\d+)(?:\s{2,25})(?<name>.+?)(?:\s{2,45})(?<val1>\d+(?:\.\d+)?)?(?:\s{2,33})(?<val2>\d+(?:\.\d+)?)?(?:\s{2,14})(?<val3>\d+(?:\.\d+)?)?(?:\s{2,19})(?<val4>\d+(?:\.\d+)?)?$
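As a rough illustration of how the named groups in the pattern above can be read back in C#, here is a minimal sketch. The class and method names (`TableLineParser`, `Parse`) are hypothetical, and the spacing in any test line is an assumption, since the real column widths are not visible in the post.

```csharp
using System.Text.RegularExpressions;

static class TableLineParser
{
    // The pattern from the answer, split across lines only for readability.
    static readonly Regex RowPattern = new Regex(
        @"^(?<id>\d+)(?:\s{2,25})(?<name>.+?)(?:\s{2,45})(?<val1>\d+(?:\.\d+)?)?" +
        @"(?:\s{2,33})(?<val2>\d+(?:\.\d+)?)?(?:\s{2,14})(?<val3>\d+(?:\.\d+)?)?" +
        @"(?:\s{2,19})(?<val4>\d+(?:\.\d+)?)?$");

    // Returns { id, name, val1, val2, val3, val4 } for one line,
    // or null when the line does not match (e.g. a row with only ID and name).
    public static string[] Parse(string line)
    {
        Match m = RowPattern.Match(line);
        if (!m.Success)
            return null;

        return new[]
        {
            m.Groups["id"].Value,
            m.Groups["name"].Value,
            m.Groups["val1"].Value, // empty string when the optional group did not match
            m.Groups["val2"].Value,
            m.Groups["val3"].Value,
            m.Groups["val4"].Value
        };
    }
}
```

Call `Parse` once per line (for example on each element of `File.ReadAllLines(...)`). For rows with blank cells, which group ends up empty depends on how many spaces the gap really contains, so it is worth checking the result against a few known rows from the actual file.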
How about treating this file like a fixed-length file, where you can define each column by an index and length. Once you have defined your fixed length columns, you can just get the value for the column with Substring, then Trim to clean it up.
You can wrap all this up in a LINQ statement to project to an anonymous type and filter for the IDs you want.
Something like this:
using System;
using System.IO;
using System.Linq;

static void Main(string[] args)
{
    // IDs of the rows we want to keep
    int[] select = new int[] { 1, 3, 4, 6, 7, 8 };
    string[] lines = File.ReadAllLines("TextFile1.txt");
    // Skip the header line, then cut each row into fixed-width columns
    var q = lines.Skip(1).Select(l => new {
        Id = Int32.Parse(GetValue(l, 0, 6)),
        Name = GetValue(l, 6, 11),
        Value1 = GetValue(l, 17, 11),
        Value2 = GetValue(l, 28, 13),
        Value3 = GetValue(l, 41, 14),
        Value4 = GetValue(l, 55, 13),
    }).Where(o => select.Contains(o.Id));
    var r = q.ToArray();
}

static string GetValue(string line, int index, int length)
{
    string value = null;
    int lineLength = line.Length;
    // Take as much of the line as we can up to column length
    if (lineLength > index)
        value = line.Substring(index, Math.Min(length, lineLength - index)).Trim();
    // Return null if we just have whitespace
    return String.IsNullOrWhiteSpace(value) ? null : value;
}
I have a spreadsheet with 12 columns and between 1 and 50 rows - I do not know how many rows there will be.
If I copy the data out of the spreadsheet I can create an array with any number of Rows and all of the data is separated into an array no problem.
I have a further 6 pieces of data from other sources within the program ‘a’ to ‘f’ - taken from various TextBoxes and DatePickers.
I need to take data from cells: 1, 2, 3, 4, 5, 6 from the array - this becomes 13, 14, 15, 16, 17, 18 from the second row and so on (as there are 6 cells containing data I do not need in each row).
I need to order the data:
a, b, 3, 2, c, d, 1, 4, 5, 6, e, f
a, b, 15, 14, c, d, 13, 16, 17, 18, e, f
And so on.
This new string of data needs to be copied into a different spreadsheet that I cannot change.
I would like to be able to add more than 1 row at a time.
With a lot of help from stack overflow, which improved the code I was using to add 1 row at a time, I created this code:
string phrase = Value.Text;
string[] words = Value.Text.Split(new char[] { '\t', '\r' });
List<string> values = new List<string>();
values.Add(a.Text);
values.Add(b.Text);
values.Add(words[3]);
values.Add(words[2]);
values.Add(c.Text);
values.Add(d.Text);
values.Add(words[1]);
values.Add(words[4]);
values.Add(words[5]);
values.Add(words[6]);
values.Add(e.Text);
values.Add(f.Text);
string outPut = String.Join("\t", values);
this.OutPutValue.Text = outPut;
This works for a single row.
I can add a string:
String newLine = "\r";
So I have this code:
values.Add(e.Text);
values.Add(f.Text);
values.Add(newLine);
values.Add(a.Text);
values.Add(b.Text);
values.Add(words[15]);
values.Add(words[14]);
values.Add(c.Text);
And so on...
If I try this, in the receiving spreadsheet the second Row starts on Column B instead of Column A because of the extra Tab from:
String.Join("\t", values);
Is there a way to introduce a line break so that the next line starts on Column A, rather than Column B?
Someone offered StringBuilder when I raised this before, but I failed to give enough information and I do not think that would work in this scenario (or at least I could not get my head around it).
Thanks for any help.
Here's how you can produce the two lines of data:
List<string> values = new List<string>();
StringBuilder sb = new System.Text.StringBuilder();
string[] words = Value.Text.Split(new char[] { '\t', '\r' });
values.Add(a.Text);
values.Add(b.Text);
values.Add(words[3]);
values.Add(words[2]);
values.Add(c.Text);
values.Add(d.Text);
values.Add(words[1]);
values.Add(words[4]);
values.Add(words[5]);
values.Add(words[6]);
values.Add(e.Text);
values.Add(f.Text);
sb.AppendLine(String.Join("\t", values));
values.Clear();
values.Add(a.Text);
values.Add(b.Text);
values.Add(words[15]);
values.Add(words[14]);
values.Add(c.Text);
values.Add(d.Text);
values.Add(words[13]);
values.Add(words[16]);
values.Add(words[17]);
values.Add(words[18]);
values.Add(e.Text);
values.Add(f.Text);
sb.AppendLine(String.Join("\t", values));
OutPut.AppendText(sb.ToString());
But does "Value.Text" above represent all of the rows, or just a single row?
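If Value.Text does hold all of the copied rows, the two hard-coded blocks above can be folded into a loop. The sketch below assumes what the question's indices suggest: each spreadsheet row contributes 12 entries to `words`, so row n starts at offset 12 * n. The names `RowBuilder` and `Build` are hypothetical, and the six extra values are passed in as plain strings instead of being read from the TextBoxes.

```csharp
using System.Collections.Generic;
using System.Text;

static class RowBuilder
{
    // Assumption: 12 split entries per source row (cells 1..6 are used, the rest skipped).
    const int Stride = 12;

    public static string Build(string[] words, string a, string b, string c,
                               string d, string e, string f)
    {
        var sb = new StringBuilder();

        // Stop as soon as a full set of cells (offset + 1 .. offset + 6) is no longer available.
        for (int offset = 0; offset + 6 < words.Length; offset += Stride)
        {
            var values = new List<string>
            {
                a, b,
                words[offset + 3].Trim(), // Trim drops a stray "\n" left over from "\r\n"
                words[offset + 2].Trim(),
                c, d,
                words[offset + 1].Trim(),
                words[offset + 4].Trim(),
                words[offset + 5].Trim(),
                words[offset + 6].Trim(),
                e, f
            };

            // One output line per source row; the line break is appended after the join,
            // so no extra tab ends up in front of the next row.
            sb.AppendLine(string.Join("\t", values));
        }

        return sb.ToString();
    }
}
```

Usage would be along the lines of `OutPut.AppendText(RowBuilder.Build(words, a.Text, b.Text, c.Text, d.Text, e.Text, f.Text));`, with `words` produced by the same Split call as above.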
So I have a string which I split in half. Now I need to compare both parts of the string and output has to be all the elements that are the same in both of them.
I noticed some people using Intersect, but I don't know why it doesn't work for me, I get really weird output if I use it.
So here is my code:
string first= "1 3 6 8 4 11 34 23 3 1 7 22 24 8";
int firstLength = first.Length;
int half = firstLength / 2;
string S1 = first.Substring(0, half);
string S2= first.Substring(half, half);
var areInCommon = S1.Intersect(S2);
Console.WriteLine("Numbers that these 2 strings have in common are: ");
foreach (int i in areInCommon)
Console.WriteLine(i);
So in this case output would be: 1, 3 and 8.
Any help would be appreciated.
You were close. What you really want is arrays of the numbers, not arrays of chars. You can get that with the Split function.
string first= "1 3 6 8 4 11 34 23 3 1 7 22 24 8";
int firstLength = first.Length;
int half = firstLength / 2;
string S1 = first.Substring(0, half);
string S2= first.Substring(half, half);
var areInCommon = S1.Split(" ".ToArray()).Intersect(S2.Split(" ".ToArray()));
Console.WriteLine("Numbers that these 2 strings have in common are: ");
foreach (var i in areInCommon)
Console.WriteLine(i);
A note about using ToArray():
I use ToArray() out of habit and the reason is that if you want to pass in parameters you can't do it without this construct. For example if the data looked like this:
string first= "1, 3, 6, 8, 4, 11, 34, 23, 3, 1, 7, 22, 24, 8";
then we would need to use
.Split(" ,".ToArray(), StringSplitOptions.RemoveEmptyEntries);
Since this happens a lot, I use .ToArray() out of habit. You can also use an explicit array (e.g. new char[] { ' ', ',' }). I find that more cumbersome, but it is probably slightly faster.
Simply split both strings into arrays and then compare them using the Contains() function.
string implements IEnumerable<char>, thus, you're intersecting sequences of characters instead of strings.
You should use String.Split:
IEnumerable<string> S1 = first.Substring(0, half).Split(' ');
IEnumerable<string> S2= first.Substring(half, half).Split(' ');
And then your intersection will output the desired result.
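Also, you can convert each string representation of a number into an integer (i.e. int). Pass StringSplitOptions.RemoveEmptyEntries here, because the first half ends with a space and int.Parse throws on the resulting empty string:
IEnumerable<int> S1 = first.Substring(0, half).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Select(s => int.Parse(s));
IEnumerable<int> S2 = first.Substring(half, half).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Select(s => int.Parse(s));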
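Halving the string by character count only works when the midpoint happens to fall between two numbers. A variation that avoids this, sketched below, is to split into numbers first and then halve the resulting array. The names `CommonNumbers` and `Find` are hypothetical, and the sketch assumes the input contains only space-separated integers.

```csharp
using System;
using System.Linq;

static class CommonNumbers
{
    public static int[] Find(string input)
    {
        // Tokenize first, so a multi-digit number can never be cut in two.
        int[] numbers = input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => int.Parse(s))
            .ToArray();

        int half = numbers.Length / 2;

        // Intersect returns the distinct values present in both halves,
        // in the order they appear in the first half.
        return numbers.Take(half).Intersect(numbers.Skip(half)).ToArray();
    }
}
```

For the string in the question this yields 1, 3 and 8, matching the expected output.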
You are converting all your characters into integers. The character '1' is not represented by the integer 1. Change your foreach to:
foreach (char i in areInCommon)
I am converting some legacy VB6 code to C# and this just has me a little baffled. The VB6 code wrote certain data sequentially to a file. Each record is always 110 bytes. I can read this file just fine in the converted code, but I'm having trouble when I write the file from the converted code.
Here is a stripped down sample I wrote real quick in LINQPad:
void Main()
{
    int[,] data = new[,]
    {
        {
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
        },
        {
            20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39
        }
    };
    using ( MemoryStream stream = new MemoryStream() )
    {
        using ( BinaryWriter writer = new BinaryWriter( stream, Encoding.ASCII, true ) )
        {
            for( var i = 0; i < 2; i++ )
            {
                byte[] name = Encoding.ASCII.GetBytes( "Blah" + i.ToString().PadRight( 30, ' ' ) );
                writer.Write( name );
                for( var x = 0; x < 20; x++ )
                {
                    writer.Write( data[i,x] );
                }
            }
        }
        using ( BinaryReader reader = new BinaryReader( stream ) )
        {
            // Note the extra +4 is because of the problem below.
            reader.BaseStream.Seek( 30 + ( 20 * 4 ) + 4, SeekOrigin.Begin );
            string name = new string( reader.ReadChars(30) );
            Console.WriteLine( name );
            // This is the problem..This extra 4 bytes should not be here.
            //reader.ReadInt32();
            for( var x = 0; x < 20; x++ )
            {
                Console.WriteLine( reader.ReadInt32() );
            }
        }
    }
}
As you can see, I have a 30 character string written first. The string is NEVER longer than 30 characters and is padded with spaces if it is shorter. After that, twenty 32-bit integers are written. It is always 20 integers. So I know each character in a string is one byte. I know a 32 bit integer is four bytes. So in my reader sample, I should be able to seek 110 bytes ( 30 + (4 * 20) ), read 30 chars, and then read 20 ints and that's my data. However, for some reason, there is an extra 4 bytes being written after the string.
Am I just missing something completely obvious (as is normally the case for myself)? Strings aren't null terminated in .Net and this is four bytes anyway, not just an extra byte? So where is this extra 4 bytes coming from? I'm not directly calling Write(string) so it can't be a prefixed length, which it's obviously not since it's after my string. If you uncomment the ReadInt32(), it produces the desired result.
The extra 4 bytes are from the extra 4 characters you're writing. Change the string you're encoding as ASCII to this:
("Blah" + i.ToString()).PadRight(30, ' ')
That is, pad the string after you've concatenated the prefix and the integer.
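To double-check the record size after this change, a small helper like the one below can be used. This is only a sketch: `RecordWriter` and `WriteRecord` are hypothetical names, and it relies on the question's statement that the name is never longer than 30 characters.

```csharp
using System.IO;
using System.Text;

static class RecordWriter
{
    // Writes one record: a 30-byte space-padded ASCII name followed by the integers.
    public static byte[] WriteRecord(string name, int[] values)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream, Encoding.ASCII))
        {
            // Pad the complete name, so the field is exactly 30 bytes.
            writer.Write(Encoding.ASCII.GetBytes(name.PadRight(30, ' ')));

            foreach (int value in values)
                writer.Write(value); // 4 bytes each

            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

With a name such as "Blah0" and 20 integers, the returned array is 30 + 20 * 4 = 110 bytes long, which is the record size the VB6 code produced.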
Your extra four bytes are whitespace, because you aren't subtracting the length of 'Blah'. You don't know where you are in your stream. So basically, you think you're writing only 30 chars, but you really wrote 34 chars.
I know you didn't ask this - but you're writing garbage data to a file that doesn't need to be there.
Instead of padding your string with whitespace, you should just include a header or pointer that indicates the length of the next field in your file.
For example, say you have a 120 byte file. The first 4 bytes of the file indicate that the length of the following string is 96 bytes. So you read 4 bytes, get the length and then read 96 bytes. The next 4 bytes say that you have a string that's 16 bytes long, so you read the next 16 bytes and get your next string. This is pretty much how every well defined protocol works.
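A length-prefixed field of the kind described here could look roughly like the sketch below. `LengthPrefixed`, `WriteField` and `ReadField` are hypothetical names, and the 4-byte prefix and ASCII encoding are assumptions for illustration. Note that a file written this way would no longer match the fixed 110-byte layout of the existing VB6 files.

```csharp
using System.IO;
using System.Text;

static class LengthPrefixed
{
    // Writes a 4-byte length header followed by the ASCII bytes of the text.
    public static void WriteField(BinaryWriter writer, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        writer.Write(bytes.Length);
        writer.Write(bytes);
    }

    // Reads the 4-byte length header, then exactly that many bytes.
    public static string ReadField(BinaryReader reader)
    {
        int length = reader.ReadInt32();
        return Encoding.ASCII.GetString(reader.ReadBytes(length));
    }
}
```

Fields written with `WriteField` can be read back in the same order with `ReadField`, without any padding in between.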
I'm trying to read column values from this file starting at the arrow position:
Here's my error:
I'm guessing it's because the length values are wrong.
Say I have a column with the value "Dog   ", i.e. the word Dog followed by a few spaces.
Do I have to set the length parameter to 3 (for Dog), or can I set it to 6 to accommodate the spaces after Dog? I ask because each column length is fixed. As you can see, some words are shorter than others, and to be consistent I just want to set the length to the maximum column length (e.g. 28 is the length of the 3rd column of my file, but not all 28 positions are used every time; the word client is only 6 characters long).
Robert Levy's answer is correct for the issue you're seeing - you've attempted to pull a substring from a string with a starting position that is greater than the length of the string.
You're parsing a fixed-length field file, where each field has a certain amount of characters, whether or not it uses all of them, and the pos and len arrays are intended to define those field lengths for use with Substring. As long as the line you're reading matches the expected field starts and lengths, you will be ok. As soon as you come to a line that doesn't match (for example, what appears to be the totals line - 0TotalRecords: 3,390,315) the field length definitions you've been using won't work, as the format has changed (and the line length may not even be the same).
There are a couple of things I would change to make this work. First, I would change your pos and len arrays so that they take the entirety of the field, not part of it. You can use Trim() to get rid of any leading or trailing blanks. As defined, your first field will only take the last number of the Seq# (pos 4, len 1), and your second field will only take the first 5 characters of the field, even though it appears to have space for ~12 characters.
Take a look at this (it's hard to be exact working from the picture, but for purposes of demonstration it will work):
          1         2         3         4
01234567890123456789012345678901234567890
Seq#  Field       Description
    3 BELNR       ACCOUNTING DOCUMENT NBR
The numbers are the position of each character in the line. I would define the pos array to be the start of the field (0 for the first field, and then the position of the first letter of the field heading for each field after that), so you would have:
Seq# = 0
Field = 6
Description = 18
The len array would hold the length of the field, which I would define as the amount of characters up to the beginning of the next field, like this:
Seq# = 6
Field = 12
Description = 28 (using what you have, as it is hard to tell)
This would make your array initialization the following:
int[] pos = new int[3] { 0, 6, 18 };
int[] len = new int[3] { 6, 12, 28 };
If you wanted the fourth field, it would start at position 46 (pos 18 + len 28 = 46).
The second thing is I would check in the loop to see if the Total Records line is there, and skip that line (most likely it's the last line):
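foreach (string line in textBox1.Lines)
{
    if (!line.Contains("Total Records"))
    {
        // Pull each field out of the line by its start position and length
        for (int j = 0; j < pos.Length; j++)
        {
            val[j] = line.Substring(pos[j], len[j]).Trim();
        }
    }
}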
Another way to do this would be to modify the original query and add a TakeWhile clause to it to only take lines until you hit the Total Records one:
string[] lines = File.ReadAllLines(ofd.FileName).Skip(8)
.TakeWhile(l => !l.Contains("Total Records")).ToArray();
The above would skip the first 8 lines and take all the remaining lines up to, but not including, the first line to contain "Total Records" in the string.
Then you could do something like this:
string[] lines = File.ReadAllLines(ofd.FileName).Skip(8)
.TakeWhile(l => !l.Contains("Total Records")).ToArray();
textBox1.Lines = lines;
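string[] val = new string[3];
int[] pos = new int[3] { 0, 6, 18 };
int[] len = new int[3] { 6, 12, 28 };
foreach (string line in textBox1.Lines)
{
    // Pull each field out of the line by its start position and length
    for (int j = 0; j < pos.Length; j++)
    {
        val[j] = line.Substring(pos[j], len[j]).Trim();
    }
}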
Now you don't have to check for the "Total Records" line.
Of course, if there are other lines in your file, or there are records after the "Total Records" line (which I rather doubt) you'll have to handle those cases as well.
In short, the code for pulling out the substrings will only work for lines that match that particular format (or more specifically, have fields that match those positions/lengths). Anything outside of that will either give you incorrect values or throw an error (if the start position is greater than the length of the string).
That exception is complaining about the first parameter, which suggests that your file contains a row that is shorter than 18 characters.
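One way to keep Substring from throwing on short rows is to clamp each field to whatever the line actually contains. The helper below is a minimal sketch of that idea; `FixedWidth` and `ParseFields` are hypothetical names, and the pos/len arrays are expected to be defined as in the answer above.

```csharp
using System;

static class FixedWidth
{
    // Returns one trimmed value per field; a field is null when the line
    // ends before that field starts.
    public static string[] ParseFields(string line, int[] pos, int[] len)
    {
        var fields = new string[pos.Length];

        for (int j = 0; j < pos.Length; j++)
        {
            if (line.Length <= pos[j])
                continue; // line too short for this field, leave it null

            // Never ask Substring for more characters than the line has left.
            int available = Math.Min(len[j], line.Length - pos[j]);
            fields[j] = line.Substring(pos[j], available).Trim();
        }

        return fields;
    }
}
```

A short or malformed row then produces null entries that can be checked or skipped, instead of an ArgumentOutOfRangeException.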
The following VB line, where _DSversionInfo is a DataSet, returns no rows:
_DSversionInfo.Tables("VersionInfo").Select("FileID=88")
but inspection shows that the table contains rows with FileIDs of 92, 93, 94, 90, 88, 89, 215, 216. The table columns are all of type string.
Further investigation showed that using the ID of 88, 215 and 216 will only return rows if the number is quoted.
ie _DSversionInfo.Tables("VersionInfo").Select("FileID='88'")
All other rows work regardless of whether the number is quoted or not.
Anyone got an explanation of why this would happen for some numbers but not others? I understand that the numbers should be quoted just not why some work and others don't?
I discovered this in some VB.NET code but (despite my initial finger pointing) don't think it is VB.NET specific.
According to the MSDN documentation on building expressions, strings should always be quoted. Failing to do so produces some bizarro unpredictable behavior... You should quote your number strings to get predictable and proper behavior like the documentation says.
I've encountered what you're describing in the past, and kinda tried to figure it out. Here, pop open your favorite .NET editor and try the following:
Create a DataTable, and into a string column 'Stuff' of that DataTable, insert rows in the following order: "6", "74", "710", and Select with the filter expression "Stuff = 710". You will get 1 row back. Now, change the first row to any number greater than 7, and suddenly you get 0 rows back.
As long as the numbers are ordered in proper descending order using string ordering logic (i.e., 7 comes after 599) the unquoted query appears to work.
My guess is that this is a limitation of how DataSet filter expressions are parsed, and it wasn't meant to work this way...
The Code:
// Unquoted filter string bizarreness.
var table = new DataTable();
table.Columns.Add(new DataColumn("NumbersAsString", typeof(String)));
var row1 = table.NewRow(); row1["NumbersAsString"] = "9"; table.Rows.Add(row1); // Change to '66
var row2 = table.NewRow(); row2["NumbersAsString"] = "74"; table.Rows.Add(row2);
var row4 = table.NewRow(); row4["NumbersAsString"] = "90"; table.Rows.Add(row4);
var row3 = table.NewRow(); row3["NumbersAsString"] = "710"; table.Rows.Add(row3);
var results = table.Select("NumbersAsString = 710"); // Returns 0 rows.
var results2 = table.Select("NumbersAsString = 74"); // Throws exception "Min (1) must be less than or equal to max (-1) in a Range object." at System.Data.Select.GetBinaryFilteredRecords()
Conclusion: Based on the exception text in that last case, there appears to be some weird casting going on inside filter expressions that is not guaranteed to be safe. Explicitly putting single quotes around the value for which you're querying avoids this problem by letting .NET know that this is a string literal.
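For comparison, the quoted form of the same lookup can be sketched as follows. `QuotedFilterDemo` and `CountMatches` are hypothetical names; the column name is taken from the code above, and embedded single quotes are doubled so that the value stays a single string literal in the filter expression.

```csharp
using System.Data;

static class QuotedFilterDemo
{
    // Builds a one-column string table and counts the rows equal to 'wanted',
    // always passing the value to Select as a quoted string literal.
    public static int CountMatches(string[] storedValues, string wanted)
    {
        var table = new DataTable();
        table.Columns.Add("NumbersAsString", typeof(string));

        foreach (string value in storedValues)
            table.Rows.Add(value);

        string filter = "NumbersAsString = '" + wanted.Replace("'", "''") + "'";
        return table.Select(filter).Length;
    }
}
```

With the rows "9", "74", "90", "710" from the example above, the quoted lookups for both 710 and 74 should each find exactly one row.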
DataTable builds an index on the columns to make Select() queries fast. That index is sorted by value, then it uses a binary search to select the range of records that matches the query expression.
So the records will be sorted like this: 215, 216, 88, 89, 90, 92, 93, 94. A binary search that treats them as integers (as per our filter expression) cannot locate certain records, because binary search only works on properly sorted collections.
The data is indexed as strings, but the binary search compares it as numbers. See the explanation below.
string[] strArr = new string[] { "115", "118", "66", "77", "80", "81", "82" };
int[] intArr = new int[] { 215, 216, 88, 89, 90, 92, 93, 94 };
int i88 = Array.BinarySearch(intArr, 88); // returns a negative value: 88 is not found
int i89 = Array.BinarySearch(intArr, 89); // returns a non-negative index: 89 is found
This should be a bug in the framework.
This error usually comes from an invalid DataTable column type in the column you are searching.
I got this error when I was using colConsultDate instead of Convert(colConsultDate, 'System.DateTime'),
because colConsultDate was a DataTable column of type string, which I had to convert to System.DateTime. Therefore your search query should look like this:
string query = "Convert(colConsultDate, 'System.DateTime') >= #" + sdateDevFrom.ToString("MM/dd/yy") + "# AND Convert(colConsultDate, 'System.DateTime') <= #" + sdateDevTo.ToString("MM/dd/yy") + "#";
DataRow[] dr = yourDataTable.Select(query);
if (dr.Length > 0)
{
nextDataTabel = dr.CopyToDataTable();
}
@Val Akkapeddi, I just want to add something to your answer.
Doing something like the following helps especially when you have to use comparison operators. If you put quotes around 74, it will be treated as a string; you can see this for yourself by actually writing the code and trying the comparison operators.
(Decimal is just for reference; you can use your desired data type instead.)
var results2 = table.Select("Convert(NumbersAsString , 'System.Decimal') = 74.0");