Union of million line urls in 2 files

Files A and B each contain about a million URLs.
1, Go through the URLs in file A one by one.
2, Extract the host, e.g. subdomain.com from http://subdomain.com/path/file.
3, If subdomain.com exists in file B, save it to file C.
What is the quickest way to produce file C with C#?
Thanks.
When I use ReadLine instead, it makes little difference.
// stat
DateTime start = DateTime.Now;
int totalcount = 0;
int n1;
if (!int.TryParse(num1.Text, out n1))
n1 = 0;
// memory
dZLinklist = new Dictionary<string, string>();
// read file
string fileName = openFileDialog1.FileName; // get file name
textBox1.Text = fileName;
StreamReader sr = new StreamReader(textBox1.Text);
string fullfile = File.ReadAllText(@textBox1.Text);
string[] sArray = fullfile.Split( '\n');
//IEnumerable<string> sArray = tool.GetSplit(fullfile, '\n');
//string sLine = "";
//while (sLine != null)
foreach ( string sLine in sArray)
{
totalcount++;
//sLine = sr.ReadLine();
if (sLine != null)
{
//string reg = "http[s]*://.*?/";
//Regex R = new Regex(reg, RegexOptions.Compiled);
//Match m = R.Match(sLine);
//if(m.Success)
int length = sLine.IndexOf(' ', n1); // default http://
if(length > 0)
{
//string urls = sLine.Substring(0, length);
dZLinklist[sLine.Substring(0,length)] = sLine;
}
}
}
TimeSpan time = DateTime.Now - start;
int count = dZLinklist.Count;
double sec = Math.Round(time.TotalSeconds,2);
label1.Text = "(" + totalcount + ")" + count.ToString() + " / " + sec + " = " + (Math.Round(count / sec,2)).ToString();
sr.Close();

I would use Microsoft LogParser (MS LogParser) to process files this large. Do you have to implement it exactly the way you described?
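If a plain C# solution is acceptable, the three steps from the question can be sketched roughly as below. This is only an illustration under assumptions: file B is taken to hold full URLs (so its hosts are extracted the same way), "save it" is read as saving the matching line from file A, and the names UrlHostFilter, GetHost and WriteMatches are made up for this example. The idea is to load the hosts of file B into a HashSet once, then stream file A so each lookup is a constant-time set check.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class UrlHostFilter
{
    // Returns the host part ("subdomain.com") of an absolute URL,
    // or null when the line cannot be parsed as one.
    public static string GetHost(string line)
    {
        Uri uri;
        return Uri.TryCreate(line.Trim(), UriKind.Absolute, out uri) ? uri.Host : null;
    }

    // Writes to pathC every line of pathA whose host also occurs in pathB.
    // Returns the number of lines written.
    public static int WriteMatches(string pathA, string pathB, string pathC)
    {
        // Host names are case-insensitive, so compare them that way.
        var hostsInB = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in File.ReadLines(pathB))
        {
            string host = GetHost(line);
            if (host != null) hostsInB.Add(host);
        }

        int written = 0;
        using (var output = new StreamWriter(pathC))
        {
            // File.ReadLines streams the file instead of loading it whole.
            foreach (string line in File.ReadLines(pathA))
            {
                string host = GetHost(line);
                if (host != null && hostsInB.Contains(host))
                {
                    output.WriteLine(line);
                    written++;
                }
            }
        }
        return written;
    }
}
```

A call such as UrlHostFilter.WriteMatches("A.txt", "B.txt", "C.txt") would then produce file C; the file names here are placeholders.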

Related

SSIS DTS Script Task has encountered an exception in user code

I'm trying to execute an SSIS package with a scripting component in Visual Basic 2010. I get the following error when I execute the package:
public void Main()
{
// TODO: Custom Code starts here
/*
* Description: Reads the input CMI Stats files and converts into a more readable format
* This Code for Better CMI Parser is converted as per SC's original code by S.A. on 3/6/2014
* Here is the description from original procedure
* CustType = DOMESTIC/INTERNATIONAL/ETC
* CategoryType = SBU/MAN
* Category = Actual value (AI/CC/etc)
* DataType = INCOMING or SHIP (or something else later?)
*
* 3/23/2010
* Uncommented the CAD file load....
*/
string[,] filesToProcess = new string[2, 2] { {(String)Dts.Variables["csvFileNameUSD"].Value,"USD" }, {(String)Dts.Variables["csvFileNameCAD"].Value,"CAD" } };
string headline = "CustType,CategoryType,CategoryValue,DataType,Stock QTY,Stock Value,Floor QTY,Floor Value,Order Count,Currency";
string outPutFile = Dts.Variables["outputFile"].Value.ToString();
//Declare Output files to write to
FileStream sw = new System.IO.FileStream(outPutFile, System.IO.FileMode.Create);
StreamWriter w = new StreamWriter(sw);
w.WriteLine(headline);
//Loop Through the files one by one and write to output Files
// Dimension 0 is the file index; dimension 1 is the (path, currency) pair
for (int x = 0; x < filesToProcess.GetLength(0); x++)
{
if (System.IO.File.Exists(filesToProcess[x, 0]))
{
string categoryType = "";
string custType = "";
string dataType = "";
string categoryValue = "";
//Read the input file in memory and close after done
StreamReader sr = new StreamReader(filesToProcess[x, 0]);
string fileText = sr.ReadToEnd();
string[] lines = fileText.Split(Convert.ToString(System.Environment.NewLine).ToCharArray());
sr.Close();
//Read String line by line and write the lines with params from sub headers
foreach (string line in lines)
{
if (line.Split(',').Length > 3)
{
string lineWrite = "";
lineWrite = line;
string[] cols = line.Split(',');
if (HeaderLine(cols[1]))
{
string[] llist = cols[0].Split();
categoryType = llist[llist.Length - 1];
custType = llist[0];
dataType = llist[1];
if (dataType == "COMPANY")
{
custType = llist[0] + " " + llist[1];
dataType = llist[2];
}
}
if (cols[0].Contains("GRAND"))
{
categoryValue = "Total";
}
else
{
string[] col0 = cols[0].Split(' ');
categoryValue = col0[col0.Length - 1];
}
int z = 0;
string[] vals = new string[cols.Length];
for (int i = 1; i < cols.Length - 1; i++)
{
vals[z] = cols[i].Replace(',', ' ');
z++;
}
//line = ",".join([CustType, CategoryType, CategoryValue, DataType, vals[0], vals[1], vals[2], vals[3], vals[6], currency])
lineWrite = clean(custType) + "," + clean(categoryType) + "," + clean(categoryValue) + ","
+ clean(dataType) + "," + clean(vals[0]) + "," + clean(vals[1]) + "," + clean(vals[2])
+ "," + clean(vals[3]) + "," + clean(vals[6]) + "," + filesToProcess[x, 1];
if (!HeaderLine(line))
{
w.WriteLine(lineWrite);
w.Flush();
}
}
}
}
}
w.Close();
sw.Close();
//Custom Code ends here
Dts.TaskResult = (int)ScriptResults.Success;
}
public bool HeaderLine(String line)
{
return line.Contains("Stock Qty");
}
public string clean(string str)
{
if (str != null)
return Regex.Replace(str,@"[""]","");
//return str.Replace('"', ' ');
else
return "";
}
#region ScriptResults declaration
/// <summary>
/// This enum provides a convenient shorthand within the scope of this class for setting the
/// result of the script.
///
/// This code was generated automatically.
/// </summary>
enum ScriptResults
{
Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
};
#endregion
}
}
Can anyone suggest what could have possibly gone wrong or maybe how to debug this code in order to understand the errors?
Thanks!

Here's how you debug scripts in SSIS:
1) With the code open, set a breakpoint.
2) Close the code editor.
3) Run the package.
4) When the script starts running, a code window opens and you can step through the code line by line.

Error, invalid regular expression in C# console application

private static void PizzaHutPizzaScrapper()
{
try
{
MatchCollection mclName;
MatchCollection mclPrice;
WebClient webClient = new WebClient();
string strUrl = "http://www.pizzahut.com.pk/pizzas.html";
byte[] reqHTML;
reqHTML = webClient.DownloadData(strUrl);
UTF8Encoding objUTF8 = new UTF8Encoding();
string pageContent = objUTF8.GetString(reqHTML);
Regex r = new Regex("(<p class=\"MenuDescHead\">[A-Za-z\\s*]+[0-9]*)");
// Regex r1 = new Regex("(<p class=\"MenuDescPrice\">[A-Za-z.\\s?]+[0-9]*[A-Za-z\\s?]*[0-9]*[A-Za-z.\\s?]*)");
Regex r1 = new Regex("(<p class=\"MenuDescPrice\">[0-9]*)");
mclName = r.Matches(pageContent);
mclPrice = r1.Matches(pageContent);
StringBuilder strBuilder = new StringBuilder();
string name = "";
string price = "";
List<string> menuPriceList = new List<string>();
foreach (Match ml in mclPrice)
{
price = ml.Value.Remove(0, ml.Value.IndexOf(">") + 1).Trim();
if (price != "")
{
menuPriceList.Add(ml.Value);
}
}
int j = 0;
for (int i = 0; i < mclName.Count; i++)
{
name = mclName[i].Value.Remove(0, mclName[i].Value.IndexOf(">") + 1);
if (i == 0 || i == 4)
{
price = menuPriceList[j].Remove(0, menuPriceList[j].IndexOf(">") + 1);
strBuilder.Append(name.Trim() + ", " + price.Trim() + " , PizzaHut\r\n");
j++;
}
price = menuPriceList[j].Remove(0, menuPriceList[j].IndexOf(">") + 1);
strBuilder.Append(name.Trim() + ", " + price.Trim() + " ,PizzaHut\r\n");
j++;
}
I want to select only the numeric values from the HTML, so I'm using [0-9]* as the regular expression, but it isn't working: the result contains letters as well. What is the correct regular expression?

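Here you go. What you're looking for are grouping constructs: put parentheses around just the part you want and read it from Match.Groups instead of Match.Value.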
MatchCollection mclName;
MatchCollection mclPrice;
WebClient webClient = new WebClient();
string strUrl = "http://www.pizzahut.com.pk/pizzas.html";
byte[] reqHTML;
reqHTML = webClient.DownloadData(strUrl);
UTF8Encoding objUTF8 = new UTF8Encoding();
string pageContent = objUTF8.GetString(reqHTML);
Regex nameRegex = new Regex("<p class=\"MenuDescHead\">([A-Za-z\\s]+[0-9]*)");
Regex priceRegex = new Regex("<p class=\"MenuDescPrice\">[^0-9]*([0-9]*)");
mclName = nameRegex.Matches(pageContent);
mclPrice = priceRegex.Matches(pageContent);
StringBuilder strBuilder = new StringBuilder();
List<string> menuPriceList = new List<string>();
foreach (Match ml in mclPrice)
{
string price = ml.Groups[1].ToString();
if (price != "" && price != "0")
{
menuPriceList.Add(price);
}
}
int j = 0;
for (int i = 0; i < mclName.Count; i++)
{
string price;
string name = mclName[i].Groups[1].ToString();
if (i == 0 || i == 4)
{
price = menuPriceList[j];
strBuilder.Append(name.Trim() + ", " + price.Trim() + " , PizzaHut\r\n");
j++;
}
price = menuPriceList[j];
strBuilder.Append(name.Trim() + ", " + price.Trim() + " ,PizzaHut\r\n");
j++;
}
Console.WriteLine(strBuilder.ToString());
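To see the capture group in isolation, here is a small sketch that runs against a string instead of the live page. The PriceExtractor class and the sample markup in the usage note are made up for illustration; only the MenuDescPrice class name comes from the question. The pattern skips any non-digit prefix without crossing into the next tag, and group 1 holds just the digits.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PriceExtractor
{
    // [^0-9<]* skips a non-digit prefix such as "Rs. " but stops at the next tag;
    // ([0-9]+) is capture group 1 and contains only the digits.
    private static readonly Regex PriceRegex =
        new Regex("<p class=\"MenuDescPrice\">[^0-9<]*([0-9]+)");

    public static List<string> ExtractPrices(string html)
    {
        var prices = new List<string>();
        foreach (Match m in PriceRegex.Matches(html))
        {
            // Groups[0] is the whole match (tag included); Groups[1] is the number.
            prices.Add(m.Groups[1].Value);
        }
        return prices;
    }
}
```

For example, given the hypothetical input <p class="MenuDescPrice">Rs. 450</p><p class="MenuDescPrice">Small</p>, ExtractPrices should return a single entry, "450", because the second paragraph contains no digits before its closing tag.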

Alternative to ReadLine?

I'm trying to read some files with ReadLine, but my files contain line breaks inside the data that I need to keep (not all of them). I don't know how to get them into the same array, or into any other array, along with these separators, because ReadLine splits the input at every line break.
I can't replace the line breaks, because I need to check them after processing, so I need both the line breaks and the content that follows them. How can I do that?
Here's my code:
public class ReadFile
{
string extension;
string filename;
System.IO.StreamReader sr;
public ReadFile(string arquivo, System.IO.StreamReader sr)
{
string ext = Path.GetExtension(arquivo);
sr = new StreamReader(arquivo, System.Text.Encoding.Default);
this.sr = sr;
this.extension = ext;
this.filename = Path.GetFileNameWithoutExtension(arquivo);
if (ext.Equals(".EXP", StringComparison.OrdinalIgnoreCase))
{
ReadEXP(arquivo);
}
else MessageBox.Show("Unsupported file extension: "+ext);
}
public void ReadEXP(string arquivo)
{
string line = sr.ReadLine();
string[] words;
string[] Separators = new string[] { "<Segment>", "</Segment>", "<Source>", "</Source>", "<Target>", "</Target>" };
string ID = null;
string Source = null;
string Target = null;
DataBase db = new DataBase();
//db.CreateTable_EXP(filename);
db.CreateTable_EXP();
while ((line = sr.ReadLine()) != null)
{
try
{
if (line.Contains("<Segment>"))
{
ID = "";
words = line.Split(Separators, StringSplitOptions.None);
ID = words[0];
for (int i = 1; i < words.Length; i++ )
ID += words[i];
MessageBox.Show("Segment[" + words.Length + "]: " + ID);
}
if (line.Contains("<Source>"))
{
Source = "";
words = line.Split(Separators, StringSplitOptions.None);
Source = words[0];
for (int i = 1; i < words.Length; i++)
Source += words[i];
MessageBox.Show("Source[" + words.Length + "]: " + Source);
}
if (line.Contains("<Target>"))
{
Target = "";
words = line.Split(Separators, StringSplitOptions.None);
Target = words[0];
for (int i = 1; i < words.Length; i++)
Target += words[i];
MessageBox.Show("Target[" + words.Length + "]: " + Target);
db.PopulateTable_EXP(ID, Source, Target);
MessageBox.Show("ID: " + ID + "\nSource: " + Source + "\nTarget: " + Target);
}
}
catch (IndexOutOfRangeException e)
{
MessageBox.Show(e.Message.ToString());
MessageBox.Show("ID: " + ID + "\nSource: " + Source + "\nTarget: " + Target);
}
}
return;
}

If you are trying to read XML, try using the built-in libraries. Here is a simple example of loading a section of XML with <TopLevelTag> in it.
var xmlData = XDocument.Load(@"C:\folder\file.xml").Element("TopLevelTag");
if (xmlData == null) throw new Exception("Failed To Load XML");
Here is a tidy way to get content without it throwing an exception if missing from the XML.
var xmlBit = (string)xmlData.Element("SomeSubTag") ?? "";
If you really have to roll your own, then look at examples for CSV parsers,
where ReadBlock can be used to get the raw data including line breaks.
private char[] chunkBuffer = new char[4096];
var fileStream = new System.IO.StreamReader(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite));
var chunkLength = fileStream.ReadBlock(chunkBuffer, 0, chunkBuffer.Length);
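As a rough illustration of the XML route applied to the question's tags: the sketch below assumes the file is well-formed XML and that each Segment/Source/Target triple sits inside a wrapper element, here called Entry. That wrapper, the ExpReader class and the ReadEntries method are hypothetical; the real .EXP layout may differ. The point is that element text read through XDocument keeps the line breaks that ReadLine would have split on.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ExpReader
{
    // Returns one string[3] per <Entry>: { Segment, Source, Target }.
    // Line breaks inside an element stay part of the returned string.
    public static List<string[]> ReadEntries(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("Entry")   // "Entry" is an assumed wrapper element
            .Select(e => new[]
            {
                // The (string) cast yields null for a missing element, hence ?? ""
                (string)e.Element("Segment") ?? "",
                (string)e.Element("Source") ?? "",
                (string)e.Element("Target") ?? ""
            })
            .ToList();
    }
}
```

For a file on disk, XDocument.Load(path) can take the place of XDocument.Parse(xml).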

Reading a large CSV file and processing in C#. Any suggestions?

I have a large CSV file, around 25 GB. I need to parse each line, which has around 10 columns, do some processing, and finally save the parsed data to a new file.
I am using a dictionary as my data structure. To avoid running out of memory, I write to the file after every 500,000 records and clear the dictionary.
Can anyone tell me whether this is a good approach? If not, is there a better way? Right now it takes 30 minutes to process the 25 GB file.
Here is the code:
private static void ReadData(string filename, FEnum fileType)
{
var resultData = new ResultsData
{
DataColumns = new List<string>(),
DataRows = new List<Dictionary<string, Results>>()
};
resultData.DataColumns.Add("count");
resultData.DataColumns.Add("userid");
Console.WriteLine("Start Processing : " + DateTime.Now);
const long processLimit = 100000;
//ProcessLimit : 500000, TimeElapsed : 30 Mins;
//ProcessLimit : 100000, TimeElaspsed - Overflow
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
Dictionary<string, Results> parsedData = new Dictionary<string, Results>();
FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read);
using (StreamReader streamReader = new StreamReader(fileStream))
{
string charsRead = streamReader.ReadLine();
int count = 0;
long linesProcessed = 0;
while (!String.IsNullOrEmpty(charsRead))
{
string[] columns = charsRead.Split(',');
string eventsList = columns[0] + ";" + columns[1] + ";" + columns[2] + ";" + columns[3] + ";" +
columns[4] + ";" + columns[5] + ";" + columns[6] + ";" + columns[7];
if (parsedData.ContainsKey(columns[0]))
{
Results results = parsedData[columns[0]];
results.Count = results.Count + 1;
results.Conversion = results.Count;
results.EventList.Add(eventsList);
parsedData[columns[0]] = results;
}
else
{
Results results = new Results {
Count = 1, Hash_Person_Id = columns[0], Tag_Id = columns[1], Conversion = 1,
Campaign_Id = columns[2], Inventory_Placement = columns[3], Action_Id = columns[4],
Creative_Group_Id = columns[5], Creative_Id = columns[6], Record_Time = columns[7]
};
results.EventList = new List<string> {eventsList};
parsedData.Add(columns[0], results);
}
charsRead = streamReader.ReadLine();
linesProcessed++;
if (linesProcessed == processLimit)
{
linesProcessed = 0;
SaveParsedValues(filename, fileType, parsedData);
//Clear Dictionary
parsedData.Clear();
}
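}
// Save the final partial batch that never reached processLimit
if (parsedData.Count > 0)
{
SaveParsedValues(filename, fileType, parsedData);
}
}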
stopwatch.Stop();
Console.WriteLine(@"File : {0} Batch Limit : {1} Time elapsed : {2} ", filename + Environment.NewLine, processLimit + Environment.NewLine, stopwatch.Elapsed + Environment.NewLine);
}
Thank you

The Microsoft.VisualBasic.FileIO.TextFieldParser class looks like it could do the job. Try it, it may speed things up.
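For reference, a minimal sketch of how TextFieldParser is typically used follows. It only counts rows per value of the first column (the user id in the question) and is not a replacement for the full processing; the CsvCounter class and CountByFirstColumn method are made-up names. On .NET Framework the project needs a reference to the Microsoft.VisualBasic assembly. Whether this is actually faster than Split(',') would need measuring; its main benefit is correct handling of quoted fields.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvCounter
{
    // Counts rows per value of the first column, reading one record at a time
    // so that only the counts, not the rows, are held in memory.
    public static Dictionary<string, int> CountByFirstColumn(TextReader reader)
    {
        var counts = new Dictionary<string, int>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Lets a quoted field such as "a,b" contain the delimiter.
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                if (fields == null || fields.Length == 0) continue;

                int current;
                counts.TryGetValue(fields[0], out current); // current stays 0 if absent
                counts[fields[0]] = current + 1;
            }
        }
        return counts;
    }
}
```

For a real file, pass new StreamReader(filename) as the reader.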

C# stringbuilder + conversion

What I have going on is this:
1) Reading a directory of files
2) Writing the filenames + "|" out to a text file
3) This is where I'm stuck.
I have a bunch of files that need to be converted as follows:
Apple0154~3.Txt convertedTO -> Apple0156.txt
Apple0136~31.txt convertedTO -> Apple0166.txt
The prefix is always Apple.
The new number is the number + the subnumber after the ~, minus 1 (0154 + 3 - 1 = 0156).
The extension is always .txt.
I'm sure this is confusing. I'm using the code below, but I can't figure out how to get this resulting text file:
Apple0154~3.Txt|Apple0156.txt
Apple0136~31.txt|Apple0166.txt
{
string resultingfile = ***This is what i dont know***
string movedredfolder = (overlordfolder + "\\redactions\\");
DirectoryInfo movedredinfo = new DirectoryInfo(movedredfolder);
using (StreamWriter output = new StreamWriter(Path.Combine(movedredfolder, "Master.txt")))
{
foreach (FileInfo fi in movedredfolder)
{
output.WriteLine(Path.GetFileName(fi)+"|"+resultingfile);
}
}
}

Ok, I see what you are trying to do.
Try using Regular expressions to grab the 2 numbers out of the original file name. Something like:
Regex r = new Regex(@"Apple(\d+)~(\d+)\.txt", RegexOptions.IgnoreCase); // IgnoreCase so ".Txt" also matches
Match mat = r.Match(filename);
if( !mat.Success )
{
// Something bad happened...
return;
}
int one = int.Parse(mat.Groups[1].Value);
int two = int.Parse(mat.Groups[2].Value);
int num = one + (two-1);
string newFilename = "Apple"+num.ToString("0000")+".txt";

Inside the foreach loop:
string fileName = fi.Name; // fi is a FileInfo, so use its Name property
string[] parts = fileName.Split('~', '.');
int basenum = int.Parse(parts[0].Substring(6));
int offset = int.Parse(parts[1]);
string resultingfile = string.Format("Apple{0:0000}.txt", basenum+offset-1);

Something like:
string path = fi.Name;
int indexOfTilde = path.IndexOf('~');
int indexOfPoint = path.LastIndexOf('.');
// Number of characters between the '~' and the '.'
int length = indexOfPoint - indexOfTilde - 1;
string tmp = path.Substring(indexOfTilde + 1, length);
int numberToIncrease = Convert.ToInt32(tmp) - 1;
// Skip the 5-character "Apple" prefix and stop before the '~'
int baseNumber = Convert.ToInt32(path.Substring(5, indexOfTilde - 5));
string newPath = "Apple" + (baseNumber + numberToIncrease).ToString("0000") + ".txt";
You can then use FileInfo.MoveTo to move the file :)
Good luck!
Edit: too slow typing on my part...

Ok, this should work for one file:
String filename = "Apple0154~3.Txt";
Regex re = new Regex(@"Apple(?<num>\d+)~(?<add>\d+)");
Match match = re.Match(filename);
Int32 num = Int32.Parse(match.Groups["num"].Value);
Int32 add = Int32.Parse(match.Groups["add"].Value);
Int32 rez = num + (add - 1);
// Pad to four digits so the result is Apple0156.txt rather than Apple156.txt
MessageBox.Show("Apple" + rez.ToString("0000") + ".txt");

using (var output = new StreamWriter(Path.Combine(movedredfolder, "Master.txt")))
{
foreach (var filePath in Directory.GetFiles(directoryPath))
{
var fileName = Path.GetFileNameWithoutExtension(filePath);
var fileExtension = Path.GetExtension(filePath);
var index = fileName.IndexOf('~');
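// Skip the 5-character "Apple" prefix and stop before the '~'
var firstNumber = Int32.Parse(fileName.Substring(5, index - 5));
var secondNumber = Int32.Parse(fileName.Substring(index + 1)) - 1;
// One line per file: original name, "|", converted name
output.WriteLine(Path.GetFileName(filePath) + "|" +
"Apple" + (firstNumber + secondNumber).ToString("0000") +
fileExtension.ToLowerInvariant()
);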
}
}
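The answers above can be folded into one small helper that is easy to check against the two examples from the question. This is a sketch, not anyone's posted solution: the AppleRenamer class and its methods are invented names, and the four-digit zero padding is inferred from the sample file names.

```csharp
using System.Text.RegularExpressions;

public static class AppleRenamer
{
    // Group 1 is the base number, group 2 the sub-number after the '~'.
    // IgnoreCase lets ".Txt" and ".txt" both match.
    private static readonly Regex NamePattern =
        new Regex(@"^Apple(\d+)~(\d+)\.txt$", RegexOptions.IgnoreCase);

    // Returns the converted name, or null if fileName does not fit the pattern.
    public static string ConvertName(string fileName)
    {
        Match m = NamePattern.Match(fileName);
        if (!m.Success) return null;

        int baseNumber = int.Parse(m.Groups[1].Value);
        int subNumber = int.Parse(m.Groups[2].Value);

        // base + sub - 1, zero-padded to four digits (assumed from the examples)
        return "Apple" + (baseNumber + subNumber - 1).ToString("0000") + ".txt";
    }

    // Builds one "original|converted" line for Master.txt, or null for other files.
    public static string ToMasterLine(string fileName)
    {
        string converted = ConvertName(fileName);
        return converted == null ? null : fileName + "|" + converted;
    }
}
```

Inside the foreach loop, writing AppleRenamer.ToMasterLine(fi.Name) with output.WriteLine (and skipping null results) should give the lines shown in the question.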
