Scraping HTML list data from a dynamic server - C#

Hello guys!
Sorry for the dumb question, this is my last resort. I swear I tried countless other Stack Overflow questions, different frameworks, etc., but those didn't seem to help.
I have the following problem:
A website displays a list of data (there are a TON of div, li, span etc. tags in front of it; it's a big HTML document).
I'm writing a tool that fetches data from a specific list nested inside a ton of other div tags, downloads it, and outputs an Excel file.
The website I'm trying to access is dynamic: you open it, it loads for a little while, and then the list appears (probably some JS and such).
When I try to download the website via a WebRequest in C#, the HTML I get is almost empty, with a ton of white space, lots of non-HTML stuff, and some garbage data as well.
Now: I'm pretty used to C#, HtmlAgilityPack, and countless other libraries, but not so much to web-related stuff. I tried CefSharp, Chromium etc., but unfortunately couldn't get them to work properly.
I want to have HTML in my program to work with that looks exactly like the HTML you see when
you open the dev console in Chrome while visiting the website mentioned above.
The HTML parser works flawlessly there.
This is how I imagine the code could look, simplified.
Extreme C# pseudocode:
WebBrowserEngine web = new WebBrowserEngine()
web.LoadURLuntilFinished(url); // with all the JS executed and stuff
String html = web.getHTML();
web.close();
My goal would be for the string html in the pseudocode to look exactly like the one in the Chrome dev tools.
Maybe there is a solution posted somewhere else, but I swear I couldn't find it; I've been looking for days.
Any help is greatly appreciated.

@SpencerBench is spot on in saying
It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically.
To answer the question for your specific use case, we need to understand the behaviour of the page you want to scrape data from, or as I asked in the comments, how do you know the page is "finished"?
However, it's possible to give a fairly generic answer to the question which should act as a starting point for you.
This answer uses Selenium, a package which is commonly used for automating testing of web UIs, but as they say on their home page, that's not the only thing it can be used for.
Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should) also be automated as well.
The web site I'm scraping
So first we need a web site. I've created one using ASP.NET Core MVC with .NET Core 3.1, although the web site's technology stack isn't important; it's the behaviour of the page you want to scrape that matters. This site has 2 pages, unimaginatively called Page1 and Page2.
Page controllers
There's nothing special in these controllers:
namespace StackOverflow68925623Website.Controllers
{
using Microsoft.AspNetCore.Mvc;
public class Page1Controller : Controller
{
public IActionResult Index()
{
return View("Page1");
}
}
}
namespace StackOverflow68925623Website.Controllers
{
using Microsoft.AspNetCore.Mvc;
public class Page2Controller : Controller
{
public IActionResult Index()
{
return View("Page2");
}
}
}
API controller
There's also an API controller (i.e. it returns data rather than a view) which the views can call asynchronously to get some data to display. This one just creates an array of the requested number of random strings.
namespace StackOverflow68925623Website.Controllers
{
using Microsoft.AspNetCore.Mvc;
using System;
using System.Collections.Generic;
using System.Text;
[Route("api/[controller]")]
[ApiController]
public class DataController : ControllerBase
{
[HttpGet("Create")]
public IActionResult Create(int numberOfElements)
{
var response = new List<string>();
for (var i = 0; i < numberOfElements; i++)
{
response.Add(RandomString(10));
}
return Ok(response);
}
private string RandomString(int length)
{
var sb = new StringBuilder();
var random = new Random();
for (var i = 0; i < length; i++)
{
var characterCode = random.Next(65, 91); // A-Z (Next's upper bound is exclusive)
sb.Append((char)characterCode);
}
return sb.ToString();
}
}
}
Views
Page1's view looks like this:
@{
ViewData["Title"] = "Page 1";
}
<div class="text-center">
<div id="list" />
<script src="~/lib/jquery/dist/jquery.min.js"></script>
<script>
var apiUrl = 'https://localhost:44394/api/Data/Create';
$(document).ready(function () {
$('#list').append('<li id="loading">Loading...</li>');
$.ajax({
url: apiUrl + '?numberOfElements=20000',
datatype: 'json',
success: function (data) {
$('#loading').remove();
var insert = ''
for (var item of data) {
insert += '<li>' + item + '</li>';
}
insert = '<ul id="results">' + insert + '</ul>';
$('#list').html(insert);
},
error: function (xht, status) {
alert('Error: ' + status);
}
});
});
</script>
</div>
So when the page first loads, it just contains an empty div called list; however, the page load triggers the function passed to jQuery's $(document).ready function, which makes an asynchronous call to the API controller, requesting an array of 20,000 elements. While the call is in progress, "Loading..." is displayed on the screen, and when the call returns, this is replaced by an unordered list containing the received data. This is written in a way intended to be friendly to developers of automated UI tests or screen scrapers, because we can tell whether all the data has loaded by testing whether or not the page contains an element with the ID results.
Page2's view looks like this:
@{
ViewData["Title"] = "Page 2";
}
<div class="text-center">
<div id="list">
<ul id="results" />
</div>
<script src="~/lib/jquery/dist/jquery.min.js"></script>
<script>
var apiUrl = 'https://localhost:44394/api/Data/Create';
var requestCount = 0;
var maxRequests = 20;
$(document).ready(function () {
getData();
});
function getDataIfAtBottomOfPage() {
console.log("scroll - " + requestCount + " requests");
if (requestCount < maxRequests) {
console.log("scrollTop " + document.documentElement.scrollTop + " scrollHeight " + document.documentElement.scrollHeight);
if (document.documentElement.scrollTop > (document.documentElement.scrollHeight - window.innerHeight - 100)) {
getData();
}
}
}
function getData() {
window.onscroll = undefined;
requestCount++;
$('#results').append('<li id="loading">Loading...</li>');
$.ajax({
url: apiUrl + '?numberOfElements=50',
datatype: 'json',
success: function (data) {
var insert = ''
for (var item of data) {
insert += '<li>' + item + '</li>';
}
$('#loading').remove();
$('#results').append(insert);
if (requestCount < maxRequests) {
window.setTimeout(function () { window.onscroll = getDataIfAtBottomOfPage }, 1000);
} else {
$('#results').append('<li>That\'s all folks');
}
},
error: function (xht, status) {
alert('Error: ' + status);
}
});
}
</script>
</div>
This gives a nicer user experience because it requests data from the API controller in multiple smaller chunks, so the first chunk of data appears fairly quickly, and once the user has scrolled down to somewhere near the bottom of the page, the next chunk of data is requested, until 20 chunks have been requested and displayed, at which point the text "That's all folks" is added to the end of the unordered list. However this is more difficult to interact with programmatically because you need to scroll the page down to make the new data appear.
(Yes, this implementation is a bit buggy - if the user gets to the bottom of the page too quickly then requesting the next chunk of data doesn't happen until they scroll up a bit. But the question isn't about how to implement this behaviour in a web page, but about how to scrape the displayed data, so please forgive my bugs.)
The scraper
I've implemented the scraper as a xUnit unit test project, just because I'm not doing anything with the data I've scraped from the web site other than Asserting that it is of the correct length, and therefore proving that I haven't prematurely assumed that the web page I'm scraping from is "finished". You can put most of this code (other than the Asserts) into any type of project.
Having created your scraper project, you need to add the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver nuget packages.
Page Object Model
I'm using the Page Object Model pattern to provide a layer of abstraction between functional interaction with the page and the implementation detail of how to code that interaction. Each of the pages in the web site has a corresponding page model class for interacting with that page.
First, a base class with some code which is common to more than one page model class.
namespace StackOverflow68925623Scraper
{
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
public class PageModel
{
protected PageModel(IWebDriver driver)
{
this.Driver = driver;
}
protected IWebDriver Driver { get; }
public void ScrollToTop()
{
var js = (IJavaScriptExecutor)this.Driver;
js.ExecuteScript("window.scrollTo(0, 0)");
}
public void ScrollToBottom()
{
var js = (IJavaScriptExecutor)this.Driver;
js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight)");
}
protected IWebElement GetById(string id)
{
try
{
return this.Driver.FindElement(By.Id(id));
}
catch (NoSuchElementException)
{
return null;
}
}
protected IWebElement AwaitGetById(string id)
{
var wait = new WebDriverWait(Driver, TimeSpan.FromSeconds(10));
return wait.Until(e => e.FindElement(By.Id(id)));
}
}
}
This base class gives us 4 convenience methods:
Scroll to the top of the page
Scroll to the bottom of the page
Get the element with the supplied ID, or return null if it doesn't exist
Get the element with the supplied ID, or wait for up to 10 seconds for it to appear if it doesn't exist yet
And each page in the web site has its own model class, derived from that base class.
namespace StackOverflow68925623Scraper
{
using OpenQA.Selenium;
public class Page1Model : PageModel
{
public Page1Model(IWebDriver driver) : base(driver)
{
}
public IWebElement AwaitResults => this.AwaitGetById("results");
public void Navigate()
{
this.Driver.Navigate().GoToUrl("https://localhost:44394/Page1");
}
}
}
namespace StackOverflow68925623Scraper
{
using OpenQA.Selenium;
public class Page2Model : PageModel
{
public Page2Model(IWebDriver driver) : base(driver)
{
}
public IWebElement Results => this.GetById("results");
public void Navigate()
{
this.Driver.Navigate().GoToUrl("https://localhost:44394/Page2");
}
}
}
And the Scraper class:
namespace StackOverflow68925623Scraper
{
using OpenQA.Selenium.Chrome;
using System;
using System.Threading;
using Xunit;
public class Scraper
{
[Fact]
public void TestPage1()
{
// Arrange
var driver = new ChromeDriver();
var page = new Page1Model(driver);
page.Navigate();
try
{
// Act
var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
// Assert
Assert.Equal(20000, actualResults.Length);
}
finally
{
// Ensure the browser window closes even if things go pear-shaped
driver.Quit();
}
}
[Fact]
public void TestPage2()
{
// Arrange
var driver = new ChromeDriver();
var page = new Page2Model(driver);
page.Navigate();
try
{
// Act
while (!page.Results.Text.Contains("That's all folks"))
{
Thread.Sleep(1000);
page.ScrollToBottom();
page.ScrollToTop();
}
var actualResults = page.Results.Text.Split(Environment.NewLine);
// Assert - we expect 1001 because of the extra "that's all folks"
Assert.Equal(1001, actualResults.Length);
}
finally
{
// Ensure the browser window closes even if things go pear-shaped
driver.Quit();
}
}
}
}
So, what's happening here?
// Arrange
var driver = new ChromeDriver();
var page = new Page1Model(driver);
page.Navigate();
ChromeDriver is in the Selenium.WebDriver.ChromeDriver package and implements the IWebDriver interface from the Selenium.WebDriver package with the code to interact with the Chrome browser. Other packages are available containing implementations for all popular browsers. Instantiating the driver object opens a browser window, and calling its Navigate method directs the browser to the page we want to test/scrape.
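As an aside, if you don't want a visible browser window popping up while scraping, ChromeDriver can be configured via ChromeOptions. A minimal sketch, with the caveat that the exact headless argument has changed between Chrome releases, so check which one your installed Chrome expects:
using OpenQA.Selenium.Chrome;

var options = new ChromeOptions();
// Run Chrome without a visible window; newer Chrome versions may want "--headless=new" instead.
options.AddArgument("--headless");
var driver = new ChromeDriver(options);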
// Act
var actualResults = page.AwaitResults.Text.Split(Environment.NewLine);
Because on Page1, the results element doesn't exist until all the data has been displayed, and no user interaction is required in order for it to be displayed, we use the page model's AwaitResults property to just wait for that element to appear and return it once it has appeared.
AwaitResults returns an IWebElement instance representing the element, which in turn has various methods and properties we can use to interact with the element. In this case we use its Text property, which returns the element's contents as a string, without any markup. Because the data is displayed as an unordered list, each element in the list is delimited by a line break, so we can use String's Split method to convert it to a string array.
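Since the original question wants the fully rendered HTML as a string to feed into HtmlAgilityPack, it's also worth knowing that IWebDriver exposes a PageSource property, which returns the browser's current markup and, once the JavaScript has run, is close to what you see in Chrome's dev tools. A minimal sketch of that approach (driver here is the ChromeDriver instance created in the test; the HtmlAgilityPack part is an assumption about how you'd go on to use the string):
// Once AwaitResults has returned, i.e. the page is "finished", grab the rendered markup.
string html = driver.PageSource;

// Then parse it however you like, e.g. with HtmlAgilityPack as the original question intended.
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var resultsNode = htmlDoc.GetElementbyId("results");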
Page2 needs a different approach: we can't use the presence of the results element to determine whether the data has all been displayed, because that element is on the page right from the start. Instead, we need to check for the string "That's all folks", which is written right at the end of the last chunk of data. Also, the data isn't loaded all in one go, and we need to keep scrolling down in order to trigger the loading of the next chunk of data.
// Act
while (!page.Results.Text.Contains("That's all folks"))
{
Thread.Sleep(1000);
page.ScrollToBottom();
page.ScrollToTop();
}
var actualResults = page.Results.Text.Split(Environment.NewLine);
Because of the bug in the UI that I mentioned earlier, if we get to the bottom of the page too quickly, the fetch of the next chunk of data isn't triggered, and attempting to scroll down when already at the bottom of the page doesn't raise another scroll event. That's why I'm scrolling to the bottom of the page and then back to the top - that way I can guarantee that a scroll event is raised. You never know, the web site you're trying to scrape data from may itself be buggy.
Once the "That's all folks" text has appeared, we can go ahead and get the results element's Text property and convert it to a string array as before.
// Assert - we expect 1001 because of the extra "that's all folks"
Assert.Equal(1001, actualResults.Length);
This is the bit that won't be in your code. Because I'm scraping a web site which is under my control, I know exactly how much data it should be displaying so I can check that I've got all the data, and therefore that my scraping code is working correctly.
Further reading
Absolute beginner's introduction to Selenium: https://www.guru99.com/selenium-csharp-tutorial.html
(A curiosity in that article is the way that it starts by creating a console application project and later changes its output type to class library and manually adds the unit test packages, when the project could have been created using one of Visual Studio's unit test project templates. It gets to the right place in the end, albeit via a rather odd route.)
Selenium documentation: https://www.selenium.dev/documentation/
Happy scraping!

If you need to fully execute the web page, then a complete browser like CefSharp is your only option.
It could be that the page is using some combination of scroll state, element visibility, or element positions to trigger content loading. If that's the case, then you'll need to figure out what it is and trigger it programmatically. I know that CefSharp can simulate user actions like clicking, scrolling, etc.
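If you do go down the CefSharp route, a rough sketch using the CefSharp.OffScreen package might look something like the following. Treat it as a starting point rather than a drop-in solution: the API details vary between CefSharp versions, and as noted above, a page that loads more content on scroll still needs that behaviour triggered separately.
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

class Program
{
    static async Task Main()
    {
        Cef.Initialize(new CefSettings());
        using (var browser = new ChromiumWebBrowser("https://example.com/your-page"))
        {
            var loaded = new TaskCompletionSource<bool>();
            // IsLoading goes false once the initial load (including the initial JS) has finished.
            browser.LoadingStateChanged += (s, e) =>
            {
                if (!e.IsLoading) loaded.TrySetResult(true);
            };
            await loaded.Task;

            // Content fetched by later AJAX calls may still be missing at this point,
            // so you may need to wait for a specific element or poll the HTML.
            string html = await browser.GetSourceAsync();
            Console.WriteLine(html.Length);
        }
        Cef.Shutdown();
    }
}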

Related

C# Discord.NET Store variable for interaction (pagination)

I am trying to create a C#-based Discord bot with pagination, but I can't find any way to store the current page so I know which page to select next.
I tried
public class DiscordModule : InteractionModuleBase<SocketInteractionContext>
{
private int page = 0;
[SlashCommand("ping", "Recieve a pong")]
public async Task Ping()
{
var embed = new EmbedBuilder()
{
Title = "Component"
};
var comp = new ComponentBuilder();
comp.WithButton("Increment", "Increment");
await RespondAsync(embed: embed.Build(), components: comp.Build());
}
[ComponentInteraction("Increment")]
public async Task Increment()
{
page = page + 1;
await RespondAsync($"{page}");
}
}
It was just a proof of concept, but I couldn't get it to work, as it always returned the same value. I didn't really expect it to work, but I saw the following:
Discord.InteractivityAddon, which uses a current page variable.
Discord.Addons.Interactive, which uses a page variable.
I don't quite get how they achieve a variable tied to the instance of the message sent. I considered using a map from message Id to page number, but this could end up using a lot of memory.
I know I could just use the libraries linked but I want to implement it myself.
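A rough sketch of the "map message Id to page number" idea mentioned above. Interaction modules are typically created per interaction, which is why an instance field like page always resets; the cast to SocketMessageComponent and the dictionary are my own assumptions, not something taken from the libraries linked:
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Discord.Interactions;
using Discord.WebSocket;

public class DiscordModule : InteractionModuleBase<SocketInteractionContext>
{
    // State has to live somewhere longer-lived than the module instance,
    // e.g. a static dictionary keyed by the message the button is attached to.
    // Removing entries when pagination finishes keeps memory usage bounded.
    private static readonly ConcurrentDictionary<ulong, int> Pages = new ConcurrentDictionary<ulong, int>();

    [ComponentInteraction("Increment")]
    public async Task Increment()
    {
        // Assumption: component interactions expose the message they were triggered from.
        var component = (SocketMessageComponent)Context.Interaction;
        var page = Pages.AddOrUpdate(component.Message.Id, 1, (_, current) => current + 1);
        await RespondAsync($"{page}");
    }
}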

c# selenium form submission

I'm having issues with "if the submit form is available, then submit the data and check for the response". It seems to find the form and submit the data, but then doesn't process the response given. Example of the code:
if (driver.FindElements(By.Name("search")).Count > 0 && driver.FindElement(By.Name("search")).Displayed)
{
driver.FindElement(By.Name("search")).SendKeys(query + Keys.Enter);
if (driver.FindElements(By.XPath("//*[@id='not found']/h2")).Count > 0 && driver.FindElement(By.XPath("//*[@id='not found']/h2")).Displayed)
{
Console.WriteLine("search not found");
driver.Manage().Cookies.DeleteAllCookies();
driver.Navigate().GoToUrl("https://example.com");
}
}
What this should be doing is:
if
driver.FindElement(By.Name("search"))
finds the element, then
driver.FindElement(By.Name("search")).SendKeys(query)
and then check for the response provided and handle it using the commands within the if statement.
I would rewrite this a little to make it more readable and avoid hitting the page so many times. Every time you call driver.FindElement(), Selenium queries the page again. Query it once, do all your analysis using that first result, and then proceed.
IReadOnlyCollection<IWebElement> search = GetVisibleElements(By.Name("search"));
if (search.Any())
{
search.ElementAt(0).SendKeys(query + Keys.Enter);
if (GetVisibleElements(By.XPath("//*[@id='not found']/h2")).Any())
{
// search not found
Console.WriteLine("search not found");
Driver.Manage().Cookies.DeleteAllCookies();
Driver.Navigate().GoToUrl("https://example.com");
}
else
{
// search found
// do stuff here
}
}
Since you are checking more than once whether an element exists and is visible, I would wrap that code in a function to make it more reusable and your code easier to read.
public IReadOnlyCollection<IWebElement> GetVisibleElements(By locator)
{
return Driver.FindElements(locator).Where(e => e.Displayed).ToList();
}
This function locates the elements based on the locator provided, filters it down to only those elements that are displayed, and then returns the list. You can then see if there are any elements in the returned list in your script.
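One further thing to check, hedged because I can't see the page in question: after SendKeys(query + Keys.Enter) the page needs a moment to respond, so checking for the "not found" element immediately may run before it has been rendered. A small sketch of a wait-based helper using Selenium's WebDriverWait (from the Selenium.Support package; the 10-second timeout is arbitrary):
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
using System;
using System.Linq;

public IWebElement WaitForVisibleElement(By locator)
{
    // Polls until at least one displayed match exists; throws WebDriverTimeoutException after 10 seconds,
    // so wait on whichever element reliably distinguishes "found" from "not found".
    var wait = new WebDriverWait(Driver, TimeSpan.FromSeconds(10));
    return wait.Until(d => d.FindElements(locator).FirstOrDefault(e => e.Displayed));
}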

DisconnectedContext detected when using STA thread to modify SharePoint page

Background Info: I'm using an ItemCheckedIn receiver in SharePoint 2010, targeting .NET 3.5 Framework. The goal of the receiver is to:
Make sure the properties (columns) of the page match the data in a Content Editor WebPart on the page so that the page can be found in a search using Filter web parts. The pages are automatically generated, so barring any errors they are guaranteed to fit the expected format.
If there is a mismatch, check out the page, fix the properties, then check it back in.
I've kept the receiver from falling into an infinite check-in/check-out loop, although the fix is currently very clumsy and something I'm trying to improve. However, right now I can't work on it, because I'm getting a DisconnectedContext error whenever I hit the UpdatePage function:
public override void ItemCheckedIn(SPItemEventProperties properties)
{
// If the main page or machine information is being checked in, do nothing
if (properties.AfterUrl.Contains("home") || properties.AfterUrl.Contains("machines")) return;
// Otherwise make sure that the page properties reflect any changes that may have been made
using (SPSite site = new SPSite("http://san1web.net.jbtc.com/sites/depts/VPC/"))
using (SPWeb web = site.OpenWeb())
{
SPFile page = web.GetFile(properties.AfterUrl);
// Make sure the event receiver doesn't get called infinitely by checking version history
...
UpdatePage(page);
}
}
private static void UpdatePage(SPFile page)
{
bool checkOut = false;
var th = new Thread(() =>
{
using (WebBrowser wb = new WebBrowser())
using (SPLimitedWebPartManager manager = page.GetLimitedWebPartManager(PersonalizationScope.Shared))
{
// Get web part's contents into HtmlDocument
ContentEditorWebPart cewp = (ContentEditorWebPart)manager.WebParts[0];
HtmlDocument htmlDoc;
wb.Navigate("about:blank");
htmlDoc = wb.Document;
htmlDoc.OpenNew(true);
htmlDoc.Write(cewp.Content.InnerText);
foreach (var prop in props)
{
// Check that each property matches the information on the page
string element;
try
{
element = htmlDoc.GetElementById(prop).InnerText;
}
catch (NullReferenceException)
{
break;
}
if (!element.Equals(page.GetProperty(prop).ToString()))
{
if (!prop.Contains("Request"))
{
checkOut = true;
break;
}
else if (!element.Equals(page.GetProperty(prop).ToString().Split(' ')[0]))
{
checkOut = true;
break;
}
}
}
if (!checkOut) return;
// If there was a mismatch, check the page out and fix the properties
page.CheckOut();
foreach (var prop in props)
{
page.SetProperty(prop, htmlDoc.GetElementById(prop).InnerText);
page.Item[prop] = htmlDoc.GetElementById(prop).InnerText;
try
{
page.Update();
}
catch
{
page.SetProperty(prop, Convert.ToDateTime(htmlDoc.GetElementById(prop).InnerText).AddDays(1));
page.Item[prop] = Convert.ToDateTime(htmlDoc.GetElementById(prop).InnerText).AddDays(1);
page.Update();
}
}
page.CheckIn("");
}
});
th.SetApartmentState(ApartmentState.STA);
th.Start();
}
From what I understand, using a WebBrowser is the only way to fill an HtmlDocument in this version of .NET, so that's why I have to use this thread.
In addition, I've done some reading and it looks like the DisconnectedContext error has to do with threading and COM, which are subjects I know next to nothing about. What can I do to prevent/fix this error?
EDIT
As @Yevgeniy.Chernobrivets pointed out in the comments, I could insert an editable field bound to the page column and not worry about parsing any html, but because the current page layout uses an HTML table within a Content Editor WebPart, where this kind of field wouldn't work properly, I'd need to make a new page layout and rebuild my solution from the bottom up, which I would really rather avoid.
I'd also like to avoid downloading anything, as the company I work for normally doesn't allow the use of unapproved software.
You shouldn't do HTML parsing with the WebBrowser class, which is part of Windows Forms and isn't suited to server-side code or to pure HTML parsing. Try using an HTML parser like HtmlAgilityPack instead.
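A minimal sketch of that substitution, assuming the HtmlAgilityPack NuGet package. It replaces only the WebBrowser/HtmlDocument part of the code above and removes the need for the STA thread:
using HtmlAgilityPack;

// Parse the web part's content directly; no WebBrowser, message loop or STA thread required.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(cewp.Content.InnerText);

foreach (var prop in props)
{
    // GetElementbyId returns null if the element doesn't exist, so no exception handling is needed.
    var node = htmlDoc.GetElementbyId(prop);
    if (node == null) break;
    var element = node.InnerText;
    // ...compare element against page.GetProperty(prop) as in the original code...
}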

The difference between "IE9 debug tools" HTML output and webpage source HTML that I got via C#

This isn't the first time I'm here with a question like this.
I have a Volvo auto parts catalog that is implemented as a client application to a local database and works only in IE8/9. I need to find and get some positions displayed in IE.
Here's an example of IE output:
It's just a table and nothing more.
And here's what I see in IE9 debug tools:
IE shows me the full layout of the page, where I can see the target table and the rows with the data I need to get.
I wrote a simple class that should walk through all IE tabs and get HTML from the target page:
using System.Globalization;
using System.Text.RegularExpressions;
using SHDocVw;
namespace WebpageHtmlMiner
{
static class HtmlMiner
{
public static string GetWebpageHtml(string uriPattern)
{
var uriRegexPattern = uriPattern;
var regex = new Regex(uriRegexPattern);
var shellWindows = new ShellWindows();
InternetExplorer internetExplorer = null;
foreach (InternetExplorer ie in shellWindows)
{
Match match = regex.Match(ie.LocationURL);
if (!string.IsNullOrEmpty(match.Value))
{
internetExplorer = ie;
break;
}
}
if (internetExplorer == null)
{
return "Target page is not opened in IE";
}
var mshtmlDocument = (mshtml.IHTMLDocument2)internetExplorer.Document;
var webpageHtml = mshtmlDocument.body.parentElement.outerHTML.ToString(CultureInfo.InvariantCulture);
return webpageHtml; //profit
}
}
}
It seems to work, but instead of what I see in the IE debug tools, I get HTML with tons of JavaScript functions and no data in the target table.
Is there any way to get exactly what I see in IE debug tools?
Thanks.
You can get the original source (the one sent by the server) in the "Script" tab (this works on both my IE8 and my IE10).
If you do not use AJAX, I think you can right-click on the page and choose the View Source option too.

Persist data using JSON

I'm trying to use JSON to update records in a database without a postback, and I'm having trouble implementing it. This is my first time doing this, so I would appreciate being pointed in the right direction.
(Explanation, irrelevant to my question: I am displaying a list of items that are sortable using a jquery plugin. The text of the items can be edited too. When people click submit I want their records to be updated. Functionality will be very similar to this.).
This javascript function creates an array of the objects. I just don't know what to do with them afterwards. It is called by the button's onClick event.
function SaveLinks() {
var list = document.getElementById('sortable1');
var links = [];
for (var i = 0; i < list.childNodes.length; i++) {
var link = {};
link.id = list.childNodes[i].childNodes[0].innerText;
link.title = list.childNodes[i].childNodes[1].innerText;
link.description = list.childNodes[i].childNodes[2].innerText;
link.url = list.childNodes[i].childNodes[3].innerText;
links.push(link);
}
//This is where I don't know what to do with my array.
}
I am trying to get this to call an update method that will persist the information to the database. Here is my codebehind function that will be called from the javascript.
public void SaveList(object o )
{
//cast and process, I assume
}
Any help is appreciated!
I have recently done this. I'm using MVC though it shouldn't be too different.
It's not vital but I find it helpful to create the contracts in JS on the client side and in C# on the server side so you can be sure of your interface.
Here's a bit of sample Javascript (with the jQuery library):
var item = new Item();
item.id = 1;
item.name = 2;
$.post("Item/Save", $.toJSON(item), function(data, testStatus) {
/*User can be notified that the item was saved successfully*/
window.location.reload();
}, "text");
In the above case I am expecting text back from the server but this can be XML, HTML or more JSON.
The server code is something like this:
public ActionResult Save()
{
string json = Request.Form[0];
var serializer = new DataContractJsonSerializer(typeof(JsonItem));
var memoryStream = new MemoryStream(Encoding.Unicode.GetBytes(json));
JsonItem item = (JsonItem)serializer.ReadObject(memoryStream);
memoryStream.Close();
SaveItem(item);
return Content("success");
}
Hope this makes sense.
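For completeness, the JsonItem type passed to DataContractJsonSerializer above isn't shown; a minimal version might look like this (the property names are assumptions based on the JavaScript item):
using System.Runtime.Serialization;

[DataContract]
public class JsonItem
{
    [DataMember(Name = "id")]
    public int Id { get; set; }

    [DataMember(Name = "name")]
    public string Name { get; set; }
}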
You don't use CodeBehind for this, you use a new action.
Your action will take an argument which can be materialized from your posted data (which, in your case, is a JavaScript object, not JSON). So you'll need a type like:
public class Link
{
public int? Id { get; set; }
public string Title { get; set; }
public string Description { get; set; }
public string Url { get; set; }
}
Note the nullable int. If you have non-nullable types in your edit models, binding will fail if the user does not submit a value for that property. Using nullable types allows you to detect the null in your controller and give the user an informative message instead of just returning null for the whole model.
Now you add an action:
[AcceptVerbs(HttpVerbs.Post)]
public ActionResult DoStuff(IEnumerable<Link> saveList)
{
Repository.SaveLinks(saveList);
return Json(true);
}
Change your JS object to a form that MVC's DefaultModelBinder will understand:
var links = {};
for (var i = 0; i < list.childNodes.length; i++) {
// The DefaultModelBinder binds collections from keys of the form "[index].property"
links["[" + i + "].id"] = list.childNodes[i].childNodes[0].innerText;
links["[" + i + "].title"] = list.childNodes[i].childNodes[1].innerText;
links["[" + i + "].description"] = list.childNodes[i].childNodes[2].innerText;
links["[" + i + "].url"] = list.childNodes[i].childNodes[3].innerText;
}
Finally, call the action in your JS:
//This is where I don't know what to do with my array. Now you do!
// presumes jQuery -- this is much easier with jQuery
$.post("/path/to/DoStuff", links, function() {
// success!
},
'json');
Unfortunately, JavaScript does not have a built-in function for serializing a structure to JSON. So if you want to POST some JSON in an Ajax query, you'll either have to munge the string yourself or use a third-party serializer. (jQuery has a plugin or two that do it, for example.)
That said, you usually don't need to send JSON to the HTTP server to process it. You can simply use an Ajax POST request and encode the form the usual way (application/x-www-form-urlencoded).
You can't send structured data like nested arrays this way, but you might be able to get away with naming the fields in your links structure with a counter. (links.id_1, links.id_2, etc.)
If you do that, then with something like jQuery it's as simple as
jQuery.post('/foo/yourapp', links, function() { alert('posted stuff'); });
Then you would have to restructure the data on the server side.
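A rough sketch of that server-side restructuring, assuming the counter-named fields above and reusing the Link class and Repository from the previous answer:
[AcceptVerbs(HttpVerbs.Post)]
public ActionResult SaveLinks()
{
    // Hypothetical action: rebuild the list from counter-named form fields (id_1, title_1, ...).
    var links = new List<Link>();
    for (var i = 1; Request.Form["id_" + i] != null; i++)
    {
        links.Add(new Link
        {
            Id = int.Parse(Request.Form["id_" + i]),
            Title = Request.Form["title_" + i],
            Description = Request.Form["description_" + i],
            Url = Request.Form["url_" + i]
        });
    }
    Repository.SaveLinks(links);
    return Json(true);
}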
