I'm pretty sure it has been asked before, but I could not find anything good.
I'm trying to parse a log but I'm having trouble with it.
At first it looked pretty easy because the log is built like this:
thing,thing,thing,thing
so I split the string on the ,
However, a , can also appear inside a value itself, and this is where I no longer knew what to do.
How would I successfully parse this kind of log?
Edit:
Here is a log example:
1326139200953,info,,0,"str value which may contain, ",,,0
1326139201109,info,,0,"str value which may contain, ",,,0
1326139201265,info,,0,"str value which may contain, ",,,0
1326139201999,start,,0,,,,0
1326139368296,new,F:\Dir\Dir\file.txt,1536,,0,,0
If your log file doesn't have field encapsulators, the fields have variable width, and the separator/delimiter can also appear in a field, then it's likely you can't program something that will work in all cases.
Can you supply an example of your log file data? It may be possible to match the parts you need with a regex.
Unfortunately I think your question is not answerable in its current state, please provide more info.
Edit: Thanks for updating the question, you do have field encapsulators (double quotes). This will make it easier!
I think there are many ways to do this. Personally, I would carry on splitting on commas, but then loop over the resulting array, checking whether the first character of any value is a double quote. If it is, you need to join it to the array item after it. If the last character of the joined array item isn't a double quote, you need to continue joining until you've closed your opening double quote.
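The split-then-rejoin idea described above can be sketched in Python as follows. This is only an illustration under some assumptions: the helper name split_quoted is made up here, a quoted field is assumed to start and end with a double quote, and escaped quotes ("") inside a field are not handled.

```python
def split_quoted(line, sep=",", quote='"'):
    """Split on sep, then glue back the pieces that belong to one quoted field."""
    fields = []
    pending = None  # pieces of a quoted field whose closing quote is not seen yet
    for piece in line.split(sep):
        if pending is not None:
            # We are inside an open quoted field: keep joining.
            pending.append(piece)
            if piece.endswith(quote):
                fields.append(sep.join(pending)[1:-1])  # drop the outer quotes
                pending = None
        elif piece.startswith(quote):
            if len(piece) > 1 and piece.endswith(quote):
                fields.append(piece[1:-1])  # quoted field without a separator inside
            else:
                pending = [piece]  # opening quote only: start joining
        else:
            fields.append(piece)
    if pending is not None:
        # Unterminated quote: keep the raw text rather than guess.
        fields.append(sep.join(pending))
    return fields
```

Called on the first sample line from the question, this returns eight fields, with the fifth one being the quoted text including its comma.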
There's certainly a better way so you may wish to wait for another solution.
Edit 2: Give this a go and let me know how you get on:
string myRegex = @"(?<=^(?:[^""]*""[^""]*"")*[^""]*),";
string[] outputArray = Regex.Split(myStr, myRegex);
I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can capture text along the string (.+ style) and follow it with a lookahead that checks for what has been captured up to that point, which would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with only one word, "that that" gets flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
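For comparison, here is a rough Python sketch of the same pattern, extended to actually delete the duplicate. The function names find_repeat and drop_repeat are invented for this illustration, and the same caveats about over-matching apply.

```python
import re

# Same idea as the Perl pattern: capture at least two words, then require
# the capture to be followed immediately by a copy of itself.
REPEAT = re.compile(r"(\w+\W+\w+.+)(?=\1)")

def find_repeat(line):
    """Return the repeated text, or None if the line has no such repetition."""
    match = REPEAT.search(line)
    return match.group(1) if match else None

def drop_repeat(line):
    """Remove one copy of each immediately repeated stretch of text."""
    # The lookahead does not consume the second copy, so deleting the
    # match leaves exactly one copy behind.
    return REPEAT.sub("", line)
```

For the sample line from the question, drop_repeat(r'{FolderLoc = "C:\testC:\test"}') gives {FolderLoc = "C:\test"}.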
I use Html-Agility-Pack to extract information from some websites. In the process I get data in the form of string and I use that data in my program.
Sometimes the data I get repeats a detail within a single string, as in this movie name: "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)".
Above is the one example from many. In order to correct the issue I tried to use string.Distinct() method but it would remove all the duplicate characters in the string as in above example it would return "Dog Eats (2012)". Now it solved my initial problem by removing the 2nd (2012) but created a new one by changing the actual title.
I thought my problem could be solved with Regex but I have no idea as to how I can use it here. As far as I know if I use Regex it would tell me that there are duplicate items in the string according to the defined Regex code.
But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)".
Now the actual title is "Meme 2013" with year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.
The duplicate year always comes at the end of the string. So what Regex should I use to determine that the string actually has two years in it, like (2012) (2012)?
If I can correctly identify the string contains duplicate maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is any better way to do it please let me know.
Thanks.
The right regex is "( \(\d{4}\))\1+".
string pattern = @"( \(\d{4}\))\1+";
string result = new Regex(pattern).Replace(s, "$1");
Example here : https://repl.it/Evcy/2
Explanation:
Capture one " (dddd)" block, and remove all following identical ones.
( \(\d{4}\)) does the capture, \1+ finds any non empty sequence of that captured block
Finally, replace the initial block and its copies by the initial block alone.
This regex will allow for any pattern of whitespace, even none, as in (2013)(2013)
`@"(\(\d{4}\))(?:\s*\1)+"`
I have a demo of it here
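As a quick way to try the whitespace-tolerant pattern outside .NET, here is a small Python sketch. The helper name dedupe_year is made up for this example. The pattern itself is the one given above.

```python
import re

# One "(dddd)" block followed by one or more copies of itself,
# separated by any amount of whitespace (including none).
DUP_YEAR = re.compile(r"(\(\d{4}\))(?:\s*\1)+")

def dedupe_year(title):
    """Collapse a repeated "(year)" block into a single one."""
    return DUP_YEAR.sub(r"\1", title)
```

Titles whose years differ, such as "A (2012) (2013)", are left alone because the backreference only matches an identical block.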
Working on a program that takes a CSV file and splits on each ",". The issue I have is there are thousand separators in some of the numbers. In the CSV file, the numbers render correctly. When viewed as a text document, they are shown like below:
Dog,Cat,100,100,Fish
In a CSV file, there are four cells, with the values "Dog", "Cat", "100,100", "Fish". When I split on the "," into an array of strings, it contains 5 elements, when what I want is 4. Does anyone know a way to work around this?
Thanks
There are two common mistakes made when reading CSV data: using a split() function and using regular expressions. Both approaches are wrong, in that they are prone to corner cases such as yours and slower than they could be.
Instead, use a dedicated parser such as Microsoft.VisualBasic.FileIO.TextFieldParser, CodeProject's FastCSV or Linq2csv, or my own implementation here on Stack Overflow.
Typically, CSV files would wrap these elements in quotes, causing your line to be displayed as:
Dog,Cat,"100,100",Fish
This would parse correctly (if using a reasonable method, ie: the TextFieldParser class or a 3rd party library), and avoid this issue.
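To illustrate what a dedicated parser does with the quoted form, here is a short sketch using Python's standard csv module (standing in for TextFieldParser, which is .NET only). The wrapper name parse_csv_line is just for this example.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line with the standard library reader."""
    # csv.reader accepts any iterable of lines; a one-item list is enough here.
    return next(csv.reader([line]))

quoted = parse_csv_line('Dog,Cat,"100,100",Fish')
unquoted = parse_csv_line('Dog,Cat,100,100,Fish')
```

Here quoted has four fields, with "100,100" kept together, while unquoted has five, because without the quotes even a proper parser cannot tell the two commas apart.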
I would consider your file as an error case - and would try to correct the issue on the generation side.
That being said, if that is not possible, you will need to have more information about the data structure in the file to correct this. For example, in this case, you know you should have 4 elements - if you find five, you may need to merge back together the 3rd and 4th, since those two represent the only number within the line.
This is not possible in a general case, however - for example, take the following:
100,100,100
If that is 2 numbers, should it be 100100, 100, or should it be 100, 100100? There is no way to determine this without more information.
You might want to have a look at the free open-source project FileHelpers. If you MUST use your own code, here is a primer on the CSV "standard" format.
Well, you could always split on ("\",\"") and then trim the first and last element.
But I would look into regular expressions that match elements within "".
Don't just split on the , split on ", ".
Better still, use a CSV library from Google or CodePlex etc.
Reading a CSV file in .NET?
You may be able to use Regex.Replace to get rid of specifically the third comma as per below before parsing?
Replaces up to a specified number of occurrences of a pattern specified in the Regex constructor with a replacement string, starting at a specified character position in the input string. A MatchEvaluator delegate is called at each match to evaluate the replacement.
[C#] public string Replace(string, MatchEvaluator, int, int);
I ran into a similar issue with fields containing line feeds. I'm not convinced this is elegant, but... I basically chopped the input into lines, then if a line didn't start with a text delimiter, I appended it to the line above.
You could try something like this: step through each field; if the field has an end text delimiter, move to the next; if not, grab the next field, append it, rinse and repeat until you do have an end delimiter (this allows for 1,000,000,000 etc.).
(I'm caffeine deprived and hungry. I did write some code but it was so ugly, I didn't even post it.)
Do you know that it will always contain exactly four columns? If so, this quick-and-dirty LINQ code would work:
string[] elements = line.Split(',');
string element1 = elements.ElementAt(0);
string element2 = elements.ElementAt(1);
// Exclude the first two elements and the last element.
var element3parts = elements.Skip(2).Take(elements.Count() - 3);
int element3 = Convert.ToInt32(string.Join("",element3parts));
string element4 = elements.Last();
Not elegant, but it works.
I am trying to create a regex to match a CSV file of records in the form of:
optional value, , ,, again some value; this is already, next record;
There is an upper limit on the number of commas (10) separating the attributes of each record, and an unlimited number of ; separating the records. Values might or might not be present. I am inexperienced with regex and my efforts have been futile so far. Please help. If necessary, I will include more details.
EDIT
I want to verify that the file is in the required form and get the number of records in it.
Do you really need to use regular expressions for this? Might be a little bit overkill. I'd just perform one String.Split() to get the records, then another String.Split() on each record to get the values. Also rather easy to get the number of elements etc. then.
If you really want to use Regexps, I'd use two steps again:
/(.*?);/ to get the datasets;
/(.*?)[,;]/ to get the values.
Could probably be done with one regexp as well but I'd consider this overkill (as you'd have to find the sub matches etc. identify their parent record, etc.).
Escaped characters would be another thing but rather similar to do: e.g. /(.*?[^\\]);/
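The two-step split can also serve the validation and counting the question asks for. Below is a minimal Python sketch under some assumptions that the question does not confirm: every record, including the last, ends with ;, and values never contain , or ; themselves. The names count_records and MAX_COMMAS are invented for the example.

```python
MAX_COMMAS = 10  # upper limit of commas per record, as stated in the question

def count_records(text):
    """Return the number of records, or raise ValueError if the text is malformed."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("last record is not terminated by ';'")
    # Drop the final ';' so that splitting does not yield an empty trailing record.
    records = text[:-1].split(";")
    for number, record in enumerate(records, start=1):
        if record.count(",") > MAX_COMMAS:
            raise ValueError(f"record {number} has more than {MAX_COMMAS} commas")
    return len(records)
```

With the sample from the question, count_records("optional value, , ,, again some value; this is already, next record;") returns 2.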
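Try this:
// True when every record has the same number of fields.
// RemoveEmptyEntries drops the empty entry after the final ';'.
bool isvalid = csv.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(c => c.Split(',').Count())
    .Distinct()
    .Count() == 1;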
Reminds me of the famous article from Coding Horror: Regular Expressions: Now You Have Two Problems.
FileHelpers saved my day when dealing with CSV or other text format.
I'm thinking of something like:
foreach (var word in paragraph.Split(' ')) {
    if (badWordArray.Contains(word)) {
        // do something about it
    }
}
but I'm sure there's a better way.
Thanks in advance!
UPDATE
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto flagging system of sorts.
While your way works, it may be a bit time consuming. There is a wonderful response here for a previous SO question. Though the question talks about PHP instead of C#, I think it can be easily ported.
Edit to add sample code:
public string FilterWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.Replace(inputWords, "<3");
}
That should work for you, more or less.
Edit to answer OP clarification:
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.
Much as the replacement portion above, you can see if something matches like so:
public bool HasBadWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.IsMatch(inputWords);
}
It will return true if the string you passed to it contains any words in the list.
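Since the goal is to be notified for review, it may help to report which words matched and not only whether any did. Here is a hedged Python sketch of that variation. The word boundaries (\b) and the case-insensitive flag are additions that are not in the code above, and flagged_words is a made-up name.

```python
import re

# Placeholder list taken from the sample code above.
BAD_WORDS = ["puppies", "kittens", "dolphins", "crabs"]

# \b keeps "crabs" from matching inside a longer word such as "crabstick".
WORD_FILTER = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BAD_WORDS)) + r")\b",
    re.IGNORECASE,
)

def flagged_words(text):
    """Return every listed word found in text, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in WORD_FILTER.finditer(text)]
```

An empty result means there is nothing to review; a non-empty one can go straight into the notification.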
At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).
One tip is to pre-process the user input before testing against your list, in case someone is trying to sneak something by you. So by way of preprocessing, we
uppercase everything in the input
remove most non-alphanumerics (that is, just splice out any spaces, punctuation, etc.)
and then, assuming someone is trying to pass off digits for letters, do something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative)
And then get some friends to try to break it. It's fun.
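The preprocessing steps above might look roughly like this in Python. The mappings for 0, 9 and 5 come from the list above. The ones for 1, 3 and 4 are guesses added for illustration, and normalize is a hypothetical name.

```python
# Digit-to-letter lookalikes. 0, 9 and 5 are from the description above;
# 1, 3 and 4 are assumptions added for this sketch.
LOOKALIKES = str.maketrans({"0": "O", "9": "G", "5": "S", "1": "I", "3": "E", "4": "A"})

def normalize(text):
    """Uppercase, strip non-alphanumerics, then map lookalike digits to letters."""
    upper = text.upper()
    kept = "".join(ch for ch in upper if ch.isalnum())
    return kept.translate(LOOKALIKES)
```

The normalized text is only meant for matching against an uppercase word list. It mangles genuine numbers, so keep the original input for display.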
You could consider using a hash-based collection such as HashSet<T> or Dictionary<TKey, TValue> instead of the array. Lookups then go through .Contains() on the set (or .ContainsKey() on the dictionary), which is far more efficient than scanning an array with .Contains(). This is especially true if you have a large list of profanities (not sure how many there are! :)
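The same point in a Python sketch, where a frozenset plays the role of the hash-based collection. The word list and the name find_bad_words are placeholders. Words are split on whitespace only, so attached punctuation would need stripping first.

```python
# Hash-based set: membership tests do not scan the whole list.
BAD_WORDS = frozenset({"puppies", "kittens", "dolphins", "crabs"})

def find_bad_words(paragraph):
    """Return the words of paragraph that appear in BAD_WORDS, lowercased."""
    return [word for word in paragraph.lower().split() if word in BAD_WORDS]
```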