OpenXML replace text in all document

I have the piece of code below. I'd like to replace the text "Text1" with "NewText", and that works. But when I place "Text1" inside a table, the replacement no longer happens for the "Text1" inside the table.
I'd like to make this replacement throughout the whole document.
using (WordprocessingDocument doc = WordprocessingDocument.Open(String.Format("c:\\temp\\filename.docx"), true))
{
var body = doc.MainDocumentPart.Document.Body;
foreach (var para in body.Elements<Paragraph>())
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("##Text1##"))
text.Text = text.Text.Replace("##Text1##", "NewText");
}
}
}
}

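Your code does not work because body.Elements<Paragraph>() returns only the paragraphs that are direct children of the body. A table element (w:tbl) is not contained in a paragraph element (w:p); it is a sibling of those paragraphs, and the text in its cells lives in paragraphs nested inside the table. See the following MSDN article for more information.
The Text class (serialized as w:t) usually represents literal text within a Run element in a
Word document. So you could simply search for all w:t elements (Text class), at any depth, and replace your
tag wherever a text element (w:t) contains it: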
using (WordprocessingDocument doc = WordprocessingDocument.Open("yourdoc.docx", true))
{
var body = doc.MainDocumentPart.Document.Body;
foreach (var text in body.Descendants<Text>())
{
if (text.Text.Contains("##Text1##"))
{
text.Text = text.Text.Replace("##Text1##", "NewText");
}
}
}

Borrowing from other answers in various places, the code below addresses four main obstacles:
Remove any high-level Unicode characters from the replacement string that Word cannot read (for example, from bad user input).
Find the search text even when it spans multiple runs or text elements within a paragraph (Word often breaks a single sentence into several text runs).
Allow line breaks in the replacement text, so that multi-line text can be inserted into the document.
Accept any node as the starting point of the search, so that the search can be restricted to one part of the document (such as the body, the header, the footer, a specific table, table row, or table cell).
I am sure advanced scenarios such as bookmarks and complex nesting will need further modification, but this works for the basic Word documents I have run into so far. It is much more helpful to me than disregarding runs altogether, or running a Regex over the entire file with no ability to target a specific TableCell or document part.
Example Usage:
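var body = document.MainDocumentPart.Document.Body;
// ReplaceText below works on one paragraph at a time, so visit every paragraph under the chosen node.
foreach (var paragraph in body.Descendants<Paragraph>().ToList())
    WordTools.ReplaceText(paragraph, find, OpenXmlTools.RemoveInvalidXMLChars(replaceWith));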
The code:
using System;
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.Text.RegularExpressions; // needed by OpenXmlTools at the end of this listing
namespace My.Web.Api.OpenXml
{
public static class WordTools
{
/// <summary>
/// Find/replace within the specified paragraph.
/// </summary>
/// <param name="paragraph"></param>
/// <param name="find"></param>
/// <param name="replaceWith"></param>
public static void ReplaceText(Paragraph paragraph, string find, string replaceWith)
{
var texts = paragraph.Descendants<Text>();
for (int t = 0; t < texts.Count(); t++)
{ // figure out which Text element within the paragraph contains the starting point of the search string
Text txt = texts.ElementAt(t);
for (int c = 0; c < txt.Text.Length; c++)
{
var match = IsMatch(texts, t, c, find);
if (match != null)
{ // now replace the text
string[] lines = replaceWith.Replace(Environment.NewLine, "\r").Split('\n', '\r'); // handle any lone n/r returns, plus newline.
int skip = lines[lines.Length - 1].Length - 1; // will jump to end of the replacement text, it has been processed.
if (c > 0)
lines[0] = txt.Text.Substring(0, c) + lines[0]; // has a prefix
if (match.EndCharIndex + 1 < texts.ElementAt(match.EndElementIndex).Text.Length)
lines[lines.Length - 1] = lines[lines.Length - 1] + texts.ElementAt(match.EndElementIndex).Text.Substring(match.EndCharIndex + 1);
txt.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve); // in case your value starts/ends with whitespace
txt.Text = lines[0];
// remove any extra texts.
for (int i = t + 1; i <= match.EndElementIndex; i++)
{
texts.ElementAt(i).Text = string.Empty; // clear the text
}
// if 'with' contained line breaks we need to add breaks back...
if (lines.Count() > 1)
{
OpenXmlElement currEl = txt;
Break br;
// append more lines
var run = txt.Parent as Run;
for (int i = 1; i < lines.Count(); i++)
{
br = new Break();
run.InsertAfter<Break>(br, currEl);
currEl = br;
txt = new Text(lines[i]);
run.InsertAfter<Text>(txt, currEl);
t++; // skip to this next text element
currEl = txt;
}
c = skip; // new line
}
else
{ // continue to process same line
c += skip;
}
}
}
}
}
/// <summary>
/// Determine if the texts (starting at element t, char c) exactly contain the find text
/// </summary>
/// <param name="texts"></param>
/// <param name="t"></param>
/// <param name="c"></param>
/// <param name="find"></param>
/// <returns>null or the result info</returns>
static Match IsMatch(IEnumerable<Text> texts, int t, int c, string find)
{
int ix = 0;
for (int i = t; i < texts.Count(); i++)
{
for (int j = c; j < texts.ElementAt(i).Text.Length; j++)
{
if (find[ix] != texts.ElementAt(i).Text[j])
{
return null; // element mismatch
}
ix++; // match; go to next character
if (ix == find.Length)
return new Match() { EndElementIndex = i, EndCharIndex = j }; // full match with no issues
}
c = 0; // reset char index for next text element
}
return null; // ran out of text, not a string match
}
/// <summary>
/// Defines a match result
/// </summary>
class Match
{
/// <summary>
/// Last matching element index containing part of the search text
/// </summary>
public int EndElementIndex { get; set; }
/// <summary>
/// Last matching char index of the search text in last matching element
/// </summary>
public int EndCharIndex { get; set; }
}
} // class
} // namespace
public static class OpenXmlTools
{
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML which give exception on save
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
}
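The idea behind the second obstacle (a match that spans several text elements) can be shown without the Open XML SDK. The following is only a simplified sketch, not the code above: it treats a paragraph as a plain list of string fragments, replaces just the first occurrence, and ignores line breaks. The names FragmentReplaceSketch and ReplaceFirst are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class FragmentReplaceSketch
{
    // Replaces the first occurrence of 'find', even when it spans several fragments.
    // The replacement is written into the fragment where the match starts; the matched
    // characters in the following fragments are removed. The fragment count is unchanged.
    public static bool ReplaceFirst(IList<string> fragments, string find, string replaceWith)
    {
        if (string.IsNullOrEmpty(find)) return false;

        string joined = string.Concat(fragments);
        int start = joined.IndexOf(find, StringComparison.Ordinal);
        if (start < 0) return false;
        int end = start + find.Length; // exclusive end of the match in 'joined'

        int offset = 0;
        for (int i = 0; i < fragments.Count; i++)
        {
            string f = fragments[i];
            int fStart = offset;
            int fEnd = offset + f.Length;
            offset = fEnd;

            if (fEnd <= start || fStart >= end) continue; // fragment does not overlap the match

            int from = Math.Max(start, fStart) - fStart; // first matched char in this fragment
            int to = Math.Min(end, fEnd) - fStart;       // one past the last matched char
            string insert = fStart <= start ? replaceWith : string.Empty;
            fragments[i] = f.Substring(0, from) + insert + f.Substring(to);
        }
        return true;
    }
}
```

For the fragments "Hello ##Te", "xt1", "## world" and the search text "##Text1##", the first fragment receives the replacement and the other two keep only their unmatched characters.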

Maybe this solution is easier
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
    string docText = null;
    // 1. Read the whole main document part (the raw XML) into a string
    using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        docText = sr.ReadToEnd();
    // 2. Use a regular expression to replace all matches (find is treated as a regex pattern)
    Regex regexText = new Regex(find);
    docText = regexText.Replace(docText, replace);
    // 3. Write the changed string back into the document part
    using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        sw.Write(docText);
}
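One caveat with this approach: find is interpreted as a regular expression, and the replacement is written straight into the raw XML. A minimal sketch, assuming the search text is meant literally, might escape both sides first. The helper name ReplaceLiteralInXml is hypothetical. It does not escape the search text for XML, and it still misses text that Word has split across several runs.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;

public static class LiteralReplaceSketch
{
    public static string ReplaceLiteralInXml(string xml, string find, string replaceWith)
    {
        // Treat the search text literally, so characters like $ ( ) . are not regex syntax.
        string pattern = Regex.Escape(find);

        // Escape & < > and quotes so the replacement cannot break the XML.
        string safe = SecurityElement.Escape(replaceWith) ?? string.Empty;

        // A MatchEvaluator returns the text as-is; "$1" and similar are not expanded.
        return Regex.Replace(xml, pattern, m => safe);
    }
}
```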

Removing HTML from messages safely

I need to output all of the plaintext within messages that may include valid and/or invalid HTML and possibly text that is superficially similar to HTML (i.e. non-HTML text within <...> such as: < why would someone do this?? >).
It is more important that I preserve all non-HTML content than it is to strip out all HTML, but ideally I would like to get rid of as much of the HTML as possible for readability.
I am currently using HTML Agility Pack, but I am having issues where non-HTML within < and > is also removed, for example:
my function:
text = HttpUtility.HtmlDecode(text);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(text);
text = doc.DocumentNode.InnerText;
simple example input*:
this text has <b>weird < things</b> going on >
actual output (unacceptable, lost the word "things"):
this text has weird going on >
desired output:
this text has weird < things going on >
Is there a way to remove only legitimate HTML tags within HTML Agility Pack without stripping out other content that may include < and/or >? Or do I need to manually create a white-list of tags to remove like in this question? That is my fallback solution but I'm hoping there is a more complete solution built in to HTML Agility Pack (or another tool) that I just haven't been able to find.
*(real input often has a ton of unneeded HTML in it, I can give a longer example if that would be useful)

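You could use this pattern to replace the HTML tags:
</?[a-zA-Z][a-zA-Z0-9 "=_-]*>
Explanation:
< matches the opening bracket
/? optionally matches a slash (the tag may be a closing tag)
[a-zA-Z] matches a letter as the first character of the tag name
[a-zA-Z0-9 "=_-]* matches any number of letters, digits, spaces, or the characters "=_-
> matches the closing bracket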
Final Code:
using System;
using System.Text.RegularExpressions;
namespace Regular
{
class Program
{
static void Main(string[] args)
{
string yourText = "this text has <b>weird < things</b> going on >";
string newText = Regex.Replace(yourText, "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>", "");
Console.WriteLine(newText);
}
}
}
Outputs:
this text has weird < things going on >
@corey-ogburn's comment is not correct, as <[space]abc> would be replaced.
Since you only want to strip the tags from the string, I don't see a reason to check whether each tag has a matching start or end tag, but that could also be done with a regex.
Using a regex to parse HTML is not always a good choice, but I think it is fine for simple text like this.
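The white-list fallback mentioned in the question can be combined with the regex idea. The following is a rough sketch, not a complete HTML parser: the list of tag names is an arbitrary assumption, and TagWhitelistSketch and Strip are made-up names. Only tags whose name is in the list are removed, so text such as < why would someone do this?? > is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagWhitelistSketch
{
    // Tag names to remove. This list is an illustrative assumption; extend it as needed.
    private static readonly string[] KnownTags =
        { "a", "b", "br", "div", "em", "i", "p", "span", "strong", "u" };

    // "<" or "</", then a known tag name that must be followed by whitespace, "/" or ">",
    // then anything that is not a bracket, up to the closing ">".
    private static readonly Regex TagPattern = new Regex(
        @"</?(?:" + string.Join("|", KnownTags) + @")(?=[\s/>])[^<>]*>",
        RegexOptions.IgnoreCase);

    public static string Strip(string text)
    {
        return TagPattern.Replace(text, string.Empty);
    }
}
```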

I wrote this a really long time ago to do something similar. You might use it as a starting point:
You'll need:
using System;
using System.Collections.Generic;
And the code:
/// <summary>
/// Instances of this class strip HTML/XML tags from a string
/// </summary>
public class HTMLStripper
{
public HTMLStripper() { }
public HTMLStripper(string source)
{
m_source = source;
stripTags();
}
private const char m_beginToken = '<';
private const char m_endToken = '>';
private const char m_whiteSpace = ' ';
private enum tokenType
{
nonToken = 0,
beginToken = 1,
endToken = 2,
escapeToken = 3,
whiteSpace = 4
}
private string m_source = string.Empty;
private string m_stripped = string.Empty;
private string m_tagName = string.Empty;
private string m_tag = string.Empty;
private Int32 m_startpos = -1;
private Int32 m_endpos = -1;
private Int32 m_currentpos = -1;
private IList<string> m_skipTags = new List<string>();
private bool m_tagFound = false;
private bool m_tagsStripped = false;
/// <summary>
/// Gets or sets the source string.
/// </summary>
/// <value>
/// The source string.
/// </value>
public string source { get { return m_source; } set { clear(); m_source = value; stripTags(); } }
/// <summary>
/// Gets the string stripped of HTML tags.
/// </summary>
/// <value>
/// The string.
/// </value>
public string stripped { get { return m_stripped; } set { } }
/// <summary>
/// Gets or sets a value indicating whether [HTML tags were stripped].
/// </summary>
/// <value>
/// <c>true</c> if [HTML tags were stripped]; otherwise, <c>false</c>.
/// </value>
public bool tagsStripped { get { return m_tagsStripped; } set { } }
/// <summary>
/// Adds the name of an HTML tag to skip stripping (leave in the text).
/// </summary>
/// <param name="value">The value.</param>
public void addSkipTag(string value)
{
if (value.Length > 0)
{
// Trim start and end tokens from skipTags if present and add to list
CharEnumerator tmpScanner = value.GetEnumerator();
string tmpString = string.Empty;
while (tmpScanner.MoveNext())
{
if (tmpScanner.Current != m_beginToken && tmpScanner.Current != m_endToken) { tmpString += tmpScanner.Current; }
}
if (tmpString.Length > 0) { m_skipTags.Add(tmpString); }
}
}
/// <summary>
/// Clears this instance.
/// </summary>
public void clear()
{
m_source = string.Empty;
m_tag = string.Empty;
m_startpos = -1;
m_endpos = -1;
m_currentpos = -1;
m_tagsStripped = false;
}
/// <summary>
/// Clears all.
/// </summary>
public void clearAll()
{
this.clear();
m_skipTags.Clear();
}
/// <summary>
/// Strips the HTML tags.
/// </summary>
private void stripTags()
{
// Preserve source and make a copy for stripping
m_stripped = m_source;
// Find first tag
getNext();
// If there are any tags (if next tag is string.Empty we are at EOS)...
if (m_tagName != string.Empty)
{
do
{
// If the tag we found is not to be skipped...
if (!m_skipTags.Contains(m_tagName))
{
// Remove tag from string
m_stripped = m_stripped.Remove(m_startpos, m_endpos - m_startpos + 1);
m_tagsStripped = true;
}
// Get next tag, rinse and repeat (if next tag is string.Empty we are at EOS)
getNext();
} while (m_tagName != string.Empty);
}
}
/// <summary>
/// Steps the pointer to the next HTML tag.
/// </summary>
private void getNext()
{
m_tagFound = false;
m_tag = string.Empty;
m_tagName = string.Empty;
bool beginTokenFound = false;
CharEnumerator scanner = m_stripped.GetEnumerator();
// If we're not at the beginning of the string, move the enumerator to the appropriate location in the string
if (m_currentpos != -1)
{
Int32 index = 0;
do
{
scanner.MoveNext();
index += 1;
} while (index < m_currentpos + 1);
}
while (!m_tagFound && m_currentpos + 1 < m_stripped.Length)
{
// Find next begin token
while (scanner.MoveNext())
{
m_currentpos += 1;
if (evaluateChar(scanner.Current) == tokenType.beginToken)
{
m_startpos = m_currentpos;
beginTokenFound = true;
break;
}
}
// If a begin token is found, find next end token
if (beginTokenFound)
{
while (scanner.MoveNext())
{
m_currentpos += 1;
// If we find another begin token before finding an end token we are not in a tag
if (evaluateChar(scanner.Current) == tokenType.beginToken)
{
m_tagFound = false;
beginTokenFound = true;
break;
}
// If the char immediately following a begin token is a white space we are not in a tag
if (m_currentpos - m_startpos == 1 && evaluateChar(scanner.Current) == tokenType.whiteSpace)
{
m_tagFound = false;
beginTokenFound = true;
break;
}
// End token found
if (evaluateChar(scanner.Current) == tokenType.endToken)
{
m_endpos = m_currentpos;
m_tagFound = true;
break;
}
}
}
if (m_tagFound)
{
// Found a tag, get the info for this tag
m_tag = m_stripped.Substring(m_startpos, (m_endpos + 1) - m_startpos);
m_tagName = m_stripped.Substring(m_startpos + 1, m_endpos - m_startpos - 1);
// If this tag is to be skipped, we do not want to reset the position within the string
// Also, if we are at the end of the string (EOS) we do not want to reset the position
if (!m_skipTags.Contains(m_tagName) && m_currentpos != stripped.Length)
{
m_currentpos = -1;
}
}
}
}
/// <summary>
/// Evaluates the next character.
/// </summary>
/// <param name="value">The value.</param>
/// <returns>tokenType</returns>
private tokenType evaluateChar(char value)
{
tokenType returnValue = new tokenType();
switch (value)
{
case m_beginToken:
returnValue = tokenType.beginToken;
break;
case m_endToken:
returnValue = tokenType.endToken;
break;
case m_whiteSpace:
returnValue = tokenType.whiteSpace;
break;
default:
returnValue = tokenType.nonToken;
break;
}
return returnValue;
}
}

How to replace text with index value in C#

I have a text file in which the string "map" is repeated more than 800 times. I would like to replace the occurrences with map0, map1, map2, ... map800.
I tried it this way, but it didn't work for me:
void Main() {
    string text = File.ReadAllText(@"T:\File1.txt");
    for (int i = 0; i < 2000; i++)
    {
        text = text.Replace("map", "map"+i);
    }
    File.WriteAllText(@"T:\File1.txt", text);
}
How can I achieve this?

This should work fine:
void Main() {
    string text = File.ReadAllText(@"T:\File1.txt");
    int num = 0;
    // The delegate runs once per match, so each "map" gets the next number.
    text = Regex.Replace(text, "map", delegate(Match m) {
        return "map" + num++;
    });
    File.WriteAllText(@"T:\File1.txt", text);
}

/// <summary>
/// Appends a running number to each occurrence of the key within the original string.
/// </summary>
/// <param name="original">The original string.</param>
/// <param name="key">The key we are searching for.</param>
/// <param name="offset">The number to start with. The default value is 0.</param>
/// <param name="increment">The increment of the number. The default value is 1.</param>
/// <returns>A new string in which each key is followed by a number that starts at "offset" and grows by "increment".</returns>
/// <example>
/// Assuming that we have an original string of "mapmapmapmap" and the key "map" we
/// would get "map0map1map2map3" as result.
/// </example>
public static string AddNumberToKeyInString(string original, string key, int offset = 0, int increment = 1)
{
    if (string.IsNullOrEmpty(key)) return original;
    int counter = offset;
    int position = 0;
    int index;
    // While the "key" we are searching for still occurs at or after "position"
    while ((index = original.IndexOf(key, position, StringComparison.Ordinal)) != -1)
    {
        // Insert the counter directly after this occurrence of the "key"
        // (the original inserted at position + key.Length, which ignored text before the key)
        string number = counter.ToString();
        original = original.Insert(index + key.Length, number);
        // Continue searching after the inserted number
        position = index + key.Length + number.Length;
        counter += increment;
    }
    return original;
}

It's because each call to Replace replaces every occurrence of map, including the ones that already received a number. So after 10 iterations the resulting string will be "map9876543210 map9876543210 map9876543210" if the original string was "map map map". You need to find each individual occurrence of map and replace only that one. Try using the IndexOf method.

Something along these lines should give you an idea of how to do it:
static void Main(string[] args)
{
    const string key = "map";
    string text = File.ReadAllText(@"C:\temp\map.txt");
    int mapIndex = text.IndexOf(key);
    int hitCount = 0;
    while (mapIndex >= 0)
    {
        // Rebuild the string with the counter inserted right after this occurrence
        string replacement = key + hitCount++;
        text = text.Substring(0, mapIndex) + replacement + text.Substring(mapIndex + key.Length);
        // Resume the search after the text that was just inserted
        mapIndex = text.IndexOf(key, mapIndex + replacement.Length);
    }
    File.WriteAllText(@"C:\temp\map1.txt", text);
}
Because strings are immutable, this is not the ideal way to deal with large files (1MB+): the entire string is created and discarded again for each instance of "map" in the file.
For an example file:
map hat dog
dog map cat
lost cat map
mapmapmaphat
map
You get the results:
map0 hat dog
dog map1 cat
lost cat map2
map3map4map5hat
map6
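For large inputs, the repeated string copies can be avoided by building the result in a single pass. The following is a minimal sketch of that variant using StringBuilder; the names MapNumbering and NumberOccurrences are made up for illustration.

```csharp
using System;
using System.Text;

public static class MapNumbering
{
    // Appends 0, 1, 2, ... to successive occurrences of 'key', copying the text only once.
    public static string NumberOccurrences(string text, string key)
    {
        if (string.IsNullOrEmpty(key)) return text;

        var sb = new StringBuilder(text.Length + 16);
        int pos = 0;
        int count = 0;
        while (true)
        {
            int hit = text.IndexOf(key, pos, StringComparison.Ordinal);
            if (hit < 0) break;
            // Copy everything up to and including this occurrence, then the number.
            sb.Append(text, pos, hit - pos + key.Length);
            sb.Append(count++);
            pos = hit + key.Length;
        }
        // Copy whatever follows the last occurrence.
        sb.Append(text, pos, text.Length - pos);
        return sb.ToString();
    }
}
```

Usage would be along the lines of File.WriteAllText(path, MapNumbering.NumberOccurrences(File.ReadAllText(path), "map")).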

out of bounds error c#

I'm trying to read the contents of a CSV file into different variables in order to send them to a web service. It has been working fine, but today I suddenly got an exception:
index was outside the bounds of the array
What did I do wrong?
String sourceDir = @"\\198.0.0.4\e$\Globus\LIVE\bnk.run\URA.BP\WEBOUT\";
// Process the list of files found in the directory.
string[] fileEntries = Directory.GetFiles(sourceDir);
foreach (string fileName2 in fileEntries)
{
// read values
StreamReader st = new StreamReader(fileName2);
while (st.Peek() >= 0)
{
String report1 = st.ReadLine();
String[] columns = report1.Split(','); //split columns
String prnout = columns[0];
String tinout = columns[1];
String amtout = columns[2];
String valdate = columns[3];
String paydate = columns[4];
String status = columns[5];
String branch = columns[6];
String reference = columns[7];
}
}

It's hard to guess without seeing the .csv file, but my first guess would be that a row doesn't have 8 columns.
It would be easier if you could show the original .csv file and tell us where the exception is thrown.
edit: If you think the data is all right, I'd suggest debugging in Visual Studio to see what the Split call returns. That might help.
edit2: And since you're doing that processing in a loop, make sure each row has at least 8 columns.

My money is on bad data file. If that is the only thing in the equation that has changed (aka you haven't made any code changes) then that's pretty much your only option.
If your data file isn't too long post it here and we can tell you for sure.
You can add something like below to check for invalid column lengths:
while (st.Peek() >= 0)
{
String report1 = st.ReadLine();
String[] columns = report1.Split(','); //split columns
if(columns.Length < 8)
{
//Log something useful, throw an exception, whatever.
//You have the option to quietly note that there was a problem and
//continue processing the rest of the file if you want.
continue;
}
//working with columns below
}
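To find out which rows are the problem, the same length check can collect line numbers instead of skipping silently. This is a small sketch under the same assumption as above (plain comma-separated rows, no quoted fields); CsvRowCheck and FindShortRows are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class CsvRowCheck
{
    // Returns the 1-based line numbers of rows that have fewer than 'expected' columns.
    public static List<int> FindShortRows(IEnumerable<string> lines, int expected)
    {
        var bad = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            // An empty line splits into a single empty column, so it is reported too.
            if (line.Split(',').Length < expected) bad.Add(lineNumber);
        }
        return bad;
    }
}
```

For a file on disk, File.ReadLines(fileName2) can be passed as the lines argument.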

Just for sanity's sake, I combined all the various notes written here. This code is a bit cleaner and has some validation in it.
Try this:
string dir = @"\\198.0.0.4\e$\Globus\LIVE\bnk.run\URA.BP\WEBOUT\";
foreach (string fileName2 in Directory.GetFiles(dir)) {
    // The using block closes the file even if an exception is thrown
    using (StreamReader sr = new StreamReader(fileName2)) {
        while (!sr.EndOfStream) {
            string line = sr.ReadLine();
            if (!String.IsNullOrEmpty(line)) {
                string[] columns = line.Split(',');
                if (columns.Length == 8) {
                    string prnout = columns[0];
                    string tinout = columns[1];
                    string amtout = columns[2];
                    string valdate = columns[3];
                    string paydate = columns[4];
                    string status = columns[5];
                    string branch = columns[6];
                    string reference = columns[7];
                }
            }
        }
    }
}
EDIT: As some other users have commented, the CSV format also accepts text qualifiers, which usually means the double quote symbol ("). For example, a text qualified line may look like this:
user,"Hello!",123.23,"$123,123.12",and so on,
Writing CSV parsing code is a little more complicated when you have a fully formatted file like this. Over the years I've been parsing improperly formatted CSV files, I've worked up a standard code script that passes virtually all unit tests, but it's a pain to explain.
/// <summary>
/// Parse one line of text into its fields. Returns false when a text qualifier is opened but never closed.
/// </summary>
/// <param name="s"></param>
public static bool TryParseLine(string s, char delimiter, char text_qualifier, out string[] array)
{
bool success = true;
List<string> list = new List<string>();
StringBuilder work = new StringBuilder();
for (int i = 0; i < s.Length; i++) {
char c = s[i];
// If we are starting a new field, is this field text qualified?
if ((c == text_qualifier) && (work.Length == 0)) {
int p2;
while (true) {
p2 = s.IndexOf(text_qualifier, i + 1);
// for some reason, this text qualifier is broken
if (p2 < 0) {
work.Append(s.Substring(i + 1));
i = s.Length;
success = false;
break;
}
// Append this qualified string
work.Append(s.Substring(i + 1, p2 - i - 1));
i = p2;
// If this is a double quote, keep going!
if (((p2 + 1) < s.Length) && (s[p2 + 1] == text_qualifier)) {
work.Append(text_qualifier);
i++;
// otherwise, this is a single qualifier, we're done
} else {
break;
}
}
// Does this start a new field?
} else if (c == delimiter) {
list.Add(work.ToString());
work.Length = 0;
// Test for special case: when the user has written a casual comma, space, and text qualifier, skip the space
// Checks if the second parameter of the if statement will pass through successfully
// e.g. "bob", "mary", "bill"
if (i + 2 <= s.Length - 1) {
if (s[i + 1].Equals(' ') && s[i + 2].Equals(text_qualifier)) {
i++;
}
}
} else {
work.Append(c);
}
}
list.Add(work.ToString());
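// If we got only one field and the delimiter was not a tab, the line may be tab delimited; try that before giving up.
// (The original called a ParseLine helper with DEFAULT_TAB_DELIMITER and DEFAULT_QUALIFIER, none of which are shown here;
// this version assumes a tab character and reuses the caller's qualifier.)
if (list.Count == 1 && delimiter != '\t') {
string[] tab_delimited_array;
TryParseLine(s, '\t', text_qualifier, out tab_delimited_array);
if (tab_delimited_array.Length > list.Count) {
array = tab_delimited_array;
return success;
}
}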
// Return the array we parsed
array = list.ToArray();
return success;
}
You should note that, even as complicated as this algorithm is, it still is unable to parse CSV files where there are embedded newlines within a text qualified value, for example, this:
123,"Hi, I am a CSV File!
I am saying hello to you!
But I also have embedded newlines in my text.",2012-07-23
To solve those, I have a multiline parser that uses TryParseLine() and keeps appending further lines of text until the accumulated text parses correctly:
/// <summary>
/// Parse a line whose values may include newline symbols or CR/LF
/// </summary>
/// <param name="sr"></param>
/// <returns></returns>
public static string[] ParseMultiLine(StreamReader sr, char delimiter, char text_qualifier)
{
StringBuilder sb = new StringBuilder();
string[] array = null;
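while (!sr.EndOfStream) {
// Read in a line
sb.Append(sr.ReadLine());
// Does it parse?
string s = sb.ToString();
if (TryParseLine(s, delimiter, text_qualifier, out array)) {
return array;
}
// An unterminated qualifier means the value continues on the next line;
// put back the line break that ReadLine removed
sb.Append('\n');
}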
// Fails to parse - return the best array we were able to get
return array;
}

Since you don't know how many columns will be in the CSV file, you might need to test the length:
if (columns.Length == 8) {
String prnout = columns[0];
String tinout = columns[1];
...
}
I bet you just got an empty line (an extra EOL at the end); it's as simple as that.

multiline formatting for verbatim strings in c# (prefix with @)

I love using the @"strings" in c#, especially when I have a lot of multi-line text. The only annoyance is that my code formatting goes to doodie when doing this, because the second and later lines are pushed fully to the left instead of using the indentation of my beautifully formatted code. I know this is by design, but is there some option or hack that allows these lines to be indented without adding the actual tabs/spaces to the output?
adding example:
            var MyString = @" this is
a multi-line string
in c#.";
My variable declaration is indented to the "correct" depth, but the second and further lines in the string get pushed to the left margin- so the code is kinda ugly. You could add tabs to the start of line 2 and 3, but the string itself would then contain those tabs... make sense?

How about a string extension? Update: I reread your question and I hope there is a better answer. This is something that bugs me too and having to solve it as below is frustrating but on the plus side it does work.
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
public static class StringExtensions
{
public static string StripLeadingWhitespace(this string s)
{
Regex r = new Regex(@"^\s+", RegexOptions.Multiline);
return r.Replace(s, string.Empty);
}
}
}
And an example console program:
using System;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string x = @"This is a test
                of the emergency
                broadcasting system.";
            Console.WriteLine(x);
            Console.WriteLine();
            Console.WriteLine("---");
            Console.WriteLine();
            Console.WriteLine(x.StripLeadingWhitespace());
            Console.ReadKey();
        }
    }
}
And the output:
This is a test
                of the emergency
                broadcasting system.
---
This is a test
of the emergency
broadcasting system.
And a cleaner way to use it if you decide to go this route:
string x = @"This is a test
    of the emergency
    broadcasting system.".StripLeadingWhitespace();
// consider renaming extension to say TrimIndent() or similar if used this way

Cymen has given the right solution. I use a similar approach as derived from Scala's stripMargin() method. Here's what my extension method looks like:
public static string StripMargin(this string s)
{
    return Regex.Replace(s, @"[ \t]+\|", string.Empty);
}
Usage:
var mystring = @"
        |SELECT
        |    *
        |FROM
        |    SomeTable
        |WHERE
        |    SomeColumn IS NOT NULL"
    .StripMargin();
Result:
SELECT
    *
FROM
    SomeTable
WHERE
    SomeColumn IS NOT NULL

I can't think of an answer that would completely satisfy your question, however you could write a function that strips leading spaces from lines of text contained in a string and call it on each creation of such a string.
var myString = TrimLeadingSpacesOfLines(@" this is
    a multi-line string
    in c#.");
Yes it is a hack, but you specified your acceptance of a hack in your question.
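The answer does not show the function itself, so here is one possible sketch of it. It is an assumption about what TrimLeadingSpacesOfLines would do: it removes leading spaces and tabs from every line, including the first, and normalizes the line endings to Environment.NewLine. The class name MultilineStrings is made up.

```csharp
using System;
using System.Linq;

public static class MultilineStrings
{
    // Removes leading spaces and tabs from every line of the string.
    public static string TrimLeadingSpacesOfLines(string s)
    {
        // Split on any of the common line endings ("\r\n" first, so it is not split twice).
        string[] lines = s.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
        return string.Join(Environment.NewLine, lines.Select(l => l.TrimStart(' ', '\t')));
    }
}
```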

Here is a longish solution which tries to mimic textwrap.dedent as much as possible. The first line is left as-is and expected not to be indented. (You can generate the unit tests based on the doctests using doctest-csharp.)
/// <summary>
/// Imitates Python's
/// <a href="https://docs.python.org/3/library/textwrap.html#textwrap.dedent">
/// <c>textwrap.dedent</c></a>.
/// </summary>
/// <param name="text">Text to be dedented</param>
/// <returns>array of dedented lines</returns>
/// <code doctest="true">
/// Assert.That(Dedent(""), Is.EquivalentTo(new[] {""}));
/// Assert.That(Dedent("test me"), Is.EquivalentTo(new[] {"test me"}));
/// Assert.That(Dedent("test\nme"), Is.EquivalentTo(new[] {"test", "me"}));
/// Assert.That(Dedent("test\n  me"), Is.EquivalentTo(new[] {"test", "  me"}));
/// Assert.That(Dedent("test\n  me\n    again"), Is.EquivalentTo(new[] {"test", "me", "  again"}));
/// Assert.That(Dedent("  test\n  me\n    again"), Is.EquivalentTo(new[] {"  test", "me", "  again"}));
/// </code>
private static string[] Dedent(string text)
{
var lines = text.Split(
new[] {"\r\n", "\r", "\n"},
StringSplitOptions.None);
// Search for the first non-empty line starting from the second line.
// The first line is not expected to be indented.
var firstNonemptyLine = -1;
for (var i = 1; i < lines.Length; i++)
{
if (lines[i].Length == 0) continue;
firstNonemptyLine = i;
break;
}
if (firstNonemptyLine < 0) return lines;
// Search for the second non-empty line.
// If there is no second non-empty line, we can return immediately as we
// can not pin the indent.
var secondNonemptyLine = -1;
for (var i = firstNonemptyLine + 1; i < lines.Length; i++)
{
if (lines[i].Length == 0) continue;
secondNonemptyLine = i;
break;
}
if (secondNonemptyLine < 0) return lines;
// Match the common prefix with at least two non-empty lines
var firstNonemptyLineLength = lines[firstNonemptyLine].Length;
var prefixLength = 0;
for (int column = 0; column < firstNonemptyLineLength; column++)
{
char c = lines[firstNonemptyLine][column];
if (c != ' ' && c != '\t') break;
bool matched = true;
for (int lineIdx = firstNonemptyLine + 1; lineIdx < lines.Length;
lineIdx++)
{
if (lines[lineIdx].Length == 0) continue;
if (lines[lineIdx].Length < column + 1)
{
matched = false;
break;
}
if (lines[lineIdx][column] != c)
{
matched = false;
break;
}
}
if (!matched) break;
prefixLength++;
}
if (prefixLength == 0) return lines;
for (var i = 1; i < lines.Length; i++)
{
if (lines[i].Length > 0) lines[i] = lines[i].Substring(prefixLength);
}
return lines;
}

How to find out next word in a sentence in c#?

I have a string:
"bat and ball not pen or boat not phone"
I want to pick out the words adjacent to "not",
for example: "not pen", "not phone".
So far I have been unable to do it. I tried to pick out the words using an index and Substring, but that did not work:
tempTerm = tempTerm.Trim().Substring(0, tempTerm.Length - (orterm.Length + 1)).ToString();

How about using some Regex
Something like
string s = "bat and ball not pen or boat not phone";
Regex reg = new Regex("not\\s\\w+");
MatchCollection matches = reg.Matches(s);
foreach (Match match in matches)
{
string sub = match.Value;
}
See Learn Regular Expression (Regex) syntax with C# and .NET for some more details

You can split the sentence, and then just loop through looking for "not":
string sentence = "bat and ball not pen or boat not phone";
string[] words = sentence.Split(new char[] {' '});
List<string> wordsBesideNot = new List<string>();
for (int i = 0; i < words.Length - 1; i++)
{
if (words[i].Equals("not"))
wordsBesideNot.Add(words[i + 1]);
}
// At this point, wordsBesideNot is { "pen", "phone" }

String[] parts = myStr.Split(' ');
for (int i = 0; i < parts.Length; i++)
if (parts[i] == "not" && i + 1 < parts.Length)
someList.Add(parts[i + 1]);
This should get you all the words adjacent to not, you could compare with case insensitive if need be.

You can use this regex: not\s\w+\b. It will match desired phrases:
not pen
not phone
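If only the following word is wanted, without the "not", a capture group can pick it out. This is a short sketch along the lines of the pattern above; NextWordSketch and WordsAfter are made-up names, and the leading \b assumes the search word starts with a letter, digit, or underscore.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NextWordSketch
{
    // Returns the word that follows each whole-word occurrence of 'word'.
    public static List<string> WordsAfter(string sentence, string word)
    {
        var result = new List<string>();
        // \b keeps "knot" from matching; group 1 captures the following word.
        string pattern = @"\b" + Regex.Escape(word) + @"\s+(\w+)";
        foreach (Match m in Regex.Matches(sentence, pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```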

I'd say start by splitting your string into an array - it will make this kind of thing a whole lot easier.

In C# I would do something like this:
// Original string
string s = "bat and ball not pen or boat not phone";
// Separator
string seperate = "not ";
// Length of the separator
int length = seperate.Length;
// sCopy so you don't touch the original string
string sCopy = s.ToString();
// List to store the words, you could use an array if
// you count the 'not's.
List<string> stringList = new List<string>();
// While the separator ("not ") exists in the string
while (sCopy.IndexOf(seperate) != -1)
{
// Index of the next separator
int index = sCopy.IndexOf(seperate);
// Remove anything before the separator and the
// separator itself.
sCopy = sCopy.Substring(index + length);
// In case of multiple spaces remove them.
sCopy = sCopy.TrimStart(' ');
// If there are more spaces or more words to come
// then specify the length
if (sCopy.IndexOf(' ') != -1)
{
// Cut the word out of sCopy
string sub = sCopy.Substring(0, sCopy.IndexOf(' '));
// Add the word to the list
stringList.Add(sub);
}
// Otherwise just get the rest of the string
else
{
// Cut the word out of sCopy
string sub = sCopy.Substring(0);
// Add the word to the list
stringList.Add(sub);
}
}
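The list then contains pen and phone. This will fail on punctuation such as full stops or other odd characters, and because it searches for the substring "not ", it will also match words that merely end in "not", such as "knot". If you don't know how the string is going to be constructed, you might need something more complex.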
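As one possible step towards "something more complex", the sketch below splits on common punctuation as well as whitespace, so a trailing full stop or comma does not end up glued to the word. The class name PunctuationTolerant and the particular separator list are assumptions chosen for illustration, not a complete treatment of real text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named here only for illustration.
public static class PunctuationTolerant
{
    // Assumed separator set: whitespace plus a few common punctuation marks.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    // Returns the word following each whole-word "not".
    public static List<string> WordsAfterNot(string input)
    {
        var found = new List<string>();
        var words = input.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length - 1; i++)
        {
            // Whole-word comparison, so "knot" is not mistaken for "not".
            if (words[i] == "not")
                found.Add(words[i + 1]);
        }
        return found;
    }
}
```

Note that this treats punctuation purely as a separator, so a "not" at the end of one sentence would be paired with the first word of the next; whether that is acceptable depends on the input.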

public class StringHelper
{
/// <summary>
/// Gets the surrounding words of a given word in a given text.
/// </summary>
/// <param name="text">The text in which to search for the given word.</param>
/// <param name="word">The word to search for in the given text.</param>
/// <param name="prev">The number of preceding words to include in the result.</param>
/// <param name="next">The number of following words to include in the result.</param>
/// <param name="all">Whether to return all instances of the search word rather than only the first.</param>
/// <returns>A list of phrases from the text, each consisting of the search word and its surrounding words.</returns>
public static List<string> GetSurroundingWords(string text, string word, int prev, int next, bool all = false)
{
var phrases = new List<string>();
var words = text.Split();
var indices = new List<int>();
var index = -1;
while ((index = Array.IndexOf(words, word, index + 1)) != -1)
{
indices.Add(index);
if (!all && indices.Count == 1)
break;
}
foreach (var ind in indices)
{
// Clamp the window per match, so that a match near one edge
// does not shrink the window for the other matches.
var prevActual = Math.Min(prev, ind);
// Only words.Length - 1 - ind words follow the match.
var nextActual = Math.Min(next, words.Length - 1 - ind);
var picked = new List<string>();
// Walk backwards from the furthest preceding word to keep the original order.
for (var i = prevActual; i >= 1; i--)
picked.Add(words[ind - i]);
picked.Add(word);
for (var i = 1; i <= nextActual; i++)
picked.Add(words[ind + i]);
phrases.Add(string.Join(" ", picked));
}
return phrases;
}
}
[TestClass]
public class StringHelperTests
{
private const string Text = "Date and Time in C# are handled by DateTime class in C# that provides properties and methods to format dates in different datetime formats.";
[TestMethod]
public void GetSurroundingWords()
{
// Arrange
var word = "class";
var expected = new [] { "DateTime class in C#" };
// Act
var actual = StringHelper.GetSurroundingWords(Text, word, 1, 2);
// Assert
Assert.AreEqual(expected.Length, actual.Count);
Assert.AreEqual(expected[0], actual[0]);
}
[TestMethod]
public void GetSurroundingWords_NoMatch()
{
// Arrange
var word = "classify";
var expected = new List<string>();
// Act
var actual = StringHelper.GetSurroundingWords(Text, word, 1, 2);
// Assert
Assert.AreEqual(expected.Count, actual.Count);
}
[TestMethod]
public void GetSurroundingWords_MoreSurroundingWordsThanAvailable()
{
// Arrange
var word = "class";
var expected = "Date and Time in C# are handled by DateTime class in C#";
// Act
var actual = StringHelper.GetSurroundingWords(Text, word, 50, 2);
// Assert
Assert.AreEqual(expected.Length, actual[0].Length);
Assert.AreEqual(expected, actual[0]);
}
[TestMethod]
public void GetSurroundingWords_ZeroSurroundingWords()
{
// Arrange
var word = "class";
var expected = "class";
// Act
var actual = StringHelper.GetSurroundingWords(Text, word, 0, 0);
// Assert
Assert.AreEqual(expected.Length, actual[0].Length);
Assert.AreEqual(expected, actual[0]);
}
[TestMethod]
public void GetSurroundingWords_AllInstancesOfSearchWord()
{
// Arrange
var word = "and";
var expected = new[] { "Date and Time", "properties and methods" };
// Act
var actual = StringHelper.GetSurroundingWords(Text, word, 1, 1, true);
// Assert
Assert.AreEqual(expected.Length, actual.Count);
Assert.AreEqual(expected[0], actual[0]);
Assert.AreEqual(expected[1], actual[1]);
}
}
