I have a really long string. I would like to add a linefeed every 80 characters. Is there a regular expression replacement pattern I can use to insert "\r\n" every 80 characters? I am using C# if that matters.
I would like to avoid using a loop.
I don't need to worry about being in the middle of a word. I just want to insert a linefeed exactly every 80 characters.
I don't know the exact C# names, but it should be something like
str = Regex.Replace(str, "(.{80})", "$1\r\n");
The idea is to grab 80 characters and save it in a group, then put it back in (I think "$1" is the right syntax) along with the "\r\n".
(Edit: The original regex had a + in it, which you definitely don't want. That would completely eliminate everything except the last line and any leftover pieces--a decidedly suboptimal result.)
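As a rough illustration of both points, here is a minimal Python sketch (not C#; the function names are made up for this example). It shows the capture-and-reinsert idea and what goes wrong when a + follows the group. The same pattern should carry over to .NET's Regex.Replace.

```python
import re

def insert_breaks(text, width=80):
    # Capture each full run of `width` characters and re-emit it followed by CRLF.
    # '.' does not match '\n' by default, so an existing newline restarts the count.
    return re.sub(r"(.{%d})" % width, lambda m: m.group(1) + "\r\n", text)

def insert_breaks_with_plus(text, width=80):
    # The mistaken variant: with '+', the group keeps only its last repetition,
    # so every earlier chunk in the matched run is dropped.
    return re.sub(r"(.{%d})+" % width, lambda m: m.group(1) + "\r\n", text)

sample = "a" * 80 + "b" * 80 + "c" * 40
print(repr(insert_breaks(sample)))
print(repr(insert_breaks_with_plus(sample)))
```

With the + variant, the 80 a's disappear and only the b's, the line break, and the leftover c's survive.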
Note that this way, you will most likely split inside words, so it might look pretty ugly.
You should be looking more into word wrapping if this is indeed supposed to be readable text. A little googling turned up a couple of functions; or if this is a text box, you can just turn on the WordWrap property.
Also, check out the .Net page at regular-expressions.info. It's by far the best reference site for regexes that I know of. (Jan Goyvaerts is on SO, but nobody told me to say that.)
I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
We can collect text along the string (.+ style), followed by a lookahead that checks for what has been captured up to that point, i.e. a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands, and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word, "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
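To actually remove an immediate repetition rather than just report it, one option is to consume the repeat with a plain backreference and keep only the first copy. The following is a minimal Python sketch of that idea under the same two-word assumption; the name drop_repeats is made up here, and all the false-positive caveats above still apply.

```python
import re

# Same two-word requirement as the Perl pattern, but the repetition is consumed
# (\1) instead of only asserted with a lookahead, so the replace can drop it.
REPEAT = re.compile(r"(\w+\W+\w+.+)\1")

def drop_repeats(line):
    # Keep the first copy of the repeated text and discard the second.
    return REPEAT.sub(lambda m: m.group(1), line)

lines = [
    "It just wasn't able just wasn't able no matter how hard it tried.",
    "This has no repetitions.",
    r'{FolderLoc = "C:\testC:\test"}',
]
for line in lines:
    print(drop_repeats(line))
```

Since FNR accepts .NET regular expressions, the equivalent there would presumably be the same pattern with $1 as the replacement, but that is untested here.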
Usually I can work around problems and get everything working by myself, but this one is kind of tricky; even the MSDN references and examples confuse more than they help.
I have been testing some code and got stuck combining a capturing group (for the replacement) with a non-capturing group (to stop matching where I want).
A simplified version of the code I want to change is:
stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
valor = "value:8080";
I know that if I use
pattern = @"value:(\d+)";
I can change every value number to 8080 when I do
Regex.Replace(stats, pattern, valor);
but I need it to stop changing these once it finds the 'something' string.
I managed to replace everything with 'valor' up to the point where it finds 'something', using
pattern = @"^(?:(?!something).)*";
Is there a way to change only the 'value:(\d+)' numbers to 'valor', combined with the ?:(?!something) part to stop matching, in the same pattern?
I've seen lots of examples, but none of them covered something like this, so I don't know if it's possible to combine both conditions at the same time.
You can make use of a look-behind solution that makes sure there is no something before the value:
(?<!\bsomething\b.*)value:\d+
Note that something is matched as a whole word due to \b word boundaries.
The result of the replace operation: label:100,value:8080,label:110,value:8080,something,label:200,value:8888
Note that (?:(?!something).) is very inefficient and should be used only when no other means work. In .NET, there is a powerful variable-width look-behind, which is the right tool for this task.
Also note that if you are not using capture group backreferences, you do not need those capturing groups in your pattern (I removed the parentheses from around \d+).
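For engines without variable-width look-behind, a different technique gives the same result on single-line input: find the first whole-word "something", then run the replacement only on the text before it. The Python sketch below shows that swapped-in approach, not the look-behind itself; the function name and the marker parameter are made up for illustration.

```python
import re

def replace_before_marker(stats, valor, marker="something"):
    # Find the first whole-word occurrence of the marker, if any.
    stop = re.search(r"\b%s\b" % re.escape(marker), stats)
    cut = stop.start() if stop else len(stats)
    head, tail = stats[:cut], stats[cut:]
    # Only the part before the marker is subject to replacement.
    return re.sub(r"value:\d+", lambda m: valor, head) + tail

stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888"
print(replace_before_marker(stats, "value:8080"))
```

If the marker never occurs, every value is replaced, which matches what the look-behind pattern would do.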
I need to introduce some text macros, for example:
"Some text here, some text here #from_file[a.txt,2,N] and here and here"
The #from_file[a.txt,2,N] macro should get 2 random lines from a.txt and join them with a newline character. Another, #from_file[a.txt,5,S], should take 5 random lines and join them with a space.
Of course I need some other macros too: #random[0-9] for a random number, #random[A-B,5] for a random string of 5 characters.
Macros could also be in another format, e.g. {from_file:a.txt,2,N}
My first idea was to use regular expressions, but maybe there is another solution for my problem?
It sounds like you want to create some sort of "general purpose" text-macro system, and while I'm sure this can be done with regexps, it basically boils down to what you want it to be capable of, and how extensive and flexible it needs to be.
You basically need to define your grammar and constraints. Can the file-name contain the macro-block terminator-character '}' ? If so, does it need to be escaped? Should escaping be supported? Are spaces within a macro-block allowed?
Basically find out how you want things to work, preferably as constrained as possible, as this means you can implement a simpler solution, and there might not be any need for a full blown parser and similar ilk.
Maybe a regex-based solution will be sufficient (although most certainly not very good). But before you can tell that, you need to spec better ;)
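To make the "constrain it first" advice concrete, here is a minimal Python sketch of a regex-based expander for the most restricted grammar imaginable: #name[arg,arg,...] with no nesting, no escaping, and no ']' or ',' inside arguments. Everything here is an assumption for illustration: the expand function, the read_lines callback that supplies a file's lines, and the choice to leave unknown macros untouched.

```python
import random
import re

# Assumed grammar: #name[arg1,arg2,...] with no nesting and no escaping.
MACRO = re.compile(r"#(\w+)\[([^\]]*)\]")

def expand(text, read_lines, rng=random):
    """Expand macros in text; read_lines(name) must return a list of lines."""
    def from_file(name, count, sep):
        joiner = {"N": "\n", "S": " "}[sep]
        return joiner.join(rng.sample(read_lines(name), int(count)))

    def rand(spec, length="1"):
        lo, hi = spec.split("-")
        return "".join(chr(rng.randint(ord(lo), ord(hi))) for _ in range(int(length)))

    handlers = {"from_file": from_file, "random": rand}

    def replace(match):
        name, args = match.group(1), match.group(2).split(",")
        handler = handlers.get(name)
        # Unknown macros are left untouched rather than raising.
        return handler(*args) if handler else match.group(0)

    return MACRO.sub(replace, text)
```

As soon as file names may contain ']' or ',', or macros may nest, this pattern stops being enough, which is exactly the point about specifying the grammar first.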
First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterate every character doing indexOf for the <?xml) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build xml using xml documents or something similar) but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought is to use a regex to find the declarations. I was planning on: <\?xml.*?>, then using Regex.Replace(input, pattern, string.Empty) to remove them.
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
EDIT
I need to take care of newlines.
Would: <\?xml[^>]*?> do the trick?
EDIT2
Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, JUST the declaration stripping took:
Nearly a second as it was
.01 of a second with the regex
untimable using a loop and IndexOf()
So I went for IndexOf(), as it's a very simple loop.
You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and it will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML tags. If you do, it will remove everything between the first '<?xml' and the first '?>', even if another '<?xml' tag opens in between.
Also, I don't know how regexes are implemented in .NET, but I seriously doubt if they're faster than using indexOf.
strXML = strXML.Remove(0, sXMLContent.IndexOf(@"?>", 0) + 2);
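The two approaches discussed in this thread can be sketched side by side. This is a Python illustration, not the poster's C# code, and the function names are made up. The regex uses a dot-matches-newline flag to cover the newline concern from the first edit; in .NET, RegexOptions.Singleline plays that role.

```python
import re

# DOTALL lets '.' cross newlines, so multi-line declarations are matched too.
DECL = re.compile(r"<\?xml.*?\?>", re.DOTALL)

def strip_with_regex(text):
    return DECL.sub("", text)

def strip_with_find(text):
    # One pass over the string using paired searches for '<?xml' and '?>'.
    out = []
    pos = 0
    while True:
        start = text.find("<?xml", pos)
        if start == -1:
            break
        end = text.find("?>", start)
        if end == -1:
            break  # unterminated declaration: keep the rest as is
        out.append(text[pos:start])
        pos = end + 2
    out.append(text[pos:])
    return "".join(out)

sample = '<?xml version="1.0"?><a><?xml\nversion="1.0"?><b/></a>'
print(strip_with_regex(sample))
print(strip_with_find(sample))
```

Both functions collect the pieces between declarations instead of rescanning from the start after each removal, which is what keeps the loop linear.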
If someone enters a very long title/sentence, the text will stretch across the web page.
Is there a way to break the text so it continues on to the next line?
Using overflow hidden will hide the text.
I think I should be using the wbr tag.
Should I use the insert(); method for this?
i.e.
string myText = "111111111111111111111111111111111111111111111111111111111111111111111111";
myText = myText.Insert(80, "<wbr/>");
Not sure how cross browser the wbr tag is also!
Strictly speaking, you should use the zero width space (&#8203;) for this rather than <wbr>. However, Internet Explorer 6 and earlier are known not to support this (they show an ugly box). So <wbr> is probably the safest choice. Except... Internet Explorer 8 in standards mode is known not to support <wbr>, so you've got yourself a wonderful conundrum here.
You can read more at quirksmode.org.
Do note HBoss' comment in that it's hard to predict where to break, unless you're using a fixed width font like Courier. You should probably heed his advice and break more often than just every 80 characters. (And don't get me started on combining characters.)
As far as ASP.NET is concerned, you can indeed use the Insert method for this, but beware when you need to insert more than one: you'll need to do some bookkeeping (and a StringBuilder would also be advised).
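One way to sidestep that bookkeeping is to cut the text into fixed-size chunks and join them with the marker, so no insertion shifts the position of a later one. Here is a minimal Python sketch of the idea (the function name and defaults are made up); in C#, the same loop would append chunks to a StringBuilder.

```python
def insert_wbr(text, every=80, marker="<wbr/>"):
    # Slice into fixed-size chunks and join them; this avoids tracking how
    # earlier insertions shift the positions of later ones.
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    return marker.join(chunks)

print(insert_wbr("1" * 25, every=10))
```

Note that this counts raw characters, so it would also cut through existing HTML tags or entities if the text contains any.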
You could use a regex to find words surrounded by whitespace/special chars, and surround it with a div/span that has different overflow properties.
If you do use <wbr>, be sure to surround the word with <nobr>.
I'm not sure you can solve this by splitting the content with breaks since you aren't guaranteed that the break will fit uniformly across browsers. You have variations of font-size, widths, etc.
Normally when I see content that extends too far it simply overlaps over the rest of the page or the designer sets the overflow so that the content can be scrolled. There could potentially be some CSS tricks you could use, but I'm not aware of any.
As an alternative approach, instead of simply inserting a line break every x number of characters, you might just insert a space after certain characters, for example, punctuation. This will make sure that the content wraps at some point or another.
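The punctuation idea could look roughly like the Python sketch below. The function name and the set of punctuation characters are assumptions chosen for illustration, and a real version would need exceptions for things like URLs and decimal numbers.

```python
import re

def loosen_punctuation(text):
    # Add a space after punctuation only when a non-space character follows,
    # giving the browser a place to wrap.
    return re.sub(r"([,.;:!?])(?=\S)", r"\1 ", text)

print(loosen_punctuation("one,two;three.four"))
```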