In the previous blog post, I showed how to get an empty message list displayed in the folder pane. The next step is to actually implement the folder update. Since this involves several pieces, I will be breaking this step into multiple blog posts.
Getting a DOM for HTML
In terms of webscraping, I treat the first step as simply turning a URI into a DOM. The developer center actually has some good resources on this, if you have access to a document object. The issue, though, is getting a document object, since your code will likely be running from an XPCOM component [1]. What is needed then, is a utility method for loading the DOM. This is the code I've been using:
function asyncLoadDom(uri, callback) {
  let doc = Cc['@mozilla.org/appshell/window-mediator;1']
              .getService(Ci.nsIWindowMediator)
              .getMostRecentWindow("mail:3pane").document;
  let frame = doc.createElement("iframe");
  frame.setAttribute("type", "content");
  frame.setAttribute("collapsed", "true");
  doc.documentElement.appendChild(frame);
  let ds = frame.webNavigation;
  ds.allowPlugins = ds.allowJavascript = ds.allowImages = false;
  ds.allowSubframes = false;
  ds.allowMetaRedirects = true;
  frame.addEventListener("load", function (event) {
    if (event.originalTarget.location.href == "about:blank")
      return;
    callback(frame.contentDocument);
    doc.documentElement.removeChild(frame);
  }, true);
  frame.contentDocument.location.href = uri;
}
The first argument is the URI to load, as a string, and the second argument is the function to be called back with the DOM document as its sole argument. An added benefit to this method is that it also uses an asynchronous callback method, so you're not blocking the UI while you wait for the page to download. This code will likely not be called except by the protocol object, though, since we probably want to throttle the number of pages loaded up at once.
The protocol object
Earlier, I mentioned that one of the implemented objects wasn't actually mandatory. This object was the protocol object. An instance of this object is meant to wrap around an actual connection to the server; where you don't need to connect to a server, this object might not be worth implementing. In reality, it is still a useful thing to have if you have a non-trivial account type—any time a task is more complicated than "load this thing and use it," a protocol object can help with managing multiple subtasks.
For a wire protocol, the implementation of this object should be straightforward. It would essentially be a state machine with an idle state, entered once the connection is set up, in which the instance can accept tasks. A state machine could also be done for webscraping-based account types, but I am using a more queue-based approach due to how I have structured the web loads.
At a high level, server requests are chunked at two levels. On the higher level, the application makes calls to functions like updateFolder; these calls I have decided to term tasks. The lower-level requests are the individual requests you send to the server; for lack of any better terminology, I will refer to these as states [2]. In my implementation, I keep two queues, one for each of these.
Managing the queue for tasks is best done at the server. The overall logic is actually rather simple:
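const kMaxProtocols = 2;
wfServer.prototype = {
  /* Queued tasks to run on the next open protocol */
  // Note: arrays declared on the prototype are shared by every wfServer
  // instance; assign fresh arrays in the constructor if more than one
  // server of this type can exist.
  _queuedTasks: [],
  _protocols: [],
  runTask: function (task) {
    // Create a new protocol while we are still under the limit
    if (this._protocols.length < kMaxProtocols) {
      let protocol = new wfProtocol(this);
      protocol.loadTask(task);
      this._protocols.push(protocol);
      return;
    }
    // Otherwise reuse an idle protocol if there is one
    for (let i = 0; i < this._protocols.length; i++) {
      if (!this._protocols[i].isRunning) {
        this._protocols[i].loadTask(task);
        return;
      }
    }
    // Everything is busy: queue the task for later
    this._queuedTasks.push(task);
  },
  getNextTask: function () {
    if (this._queuedTasks.length > 0)
      return this._queuedTasks.shift();
    return null;
  },
};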
The runTask method is designed to be called with a task object; for the core mailnews protocols, this is primarily being called by the service [3]. For now, I've made the value for the maximum number of protocol objects unchangeable, but it is probably better to allow this value to be configurable via a per-server preference.
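As a rough sketch of what the per-server preference could look like: the helper below reads the limit through getIntValue, the integer accessor on nsIMsgIncomingServer, and falls back to the hard-coded default. The preference name "max_protocols" and the helper name maxProtocolsFor are made up for illustration, and I am assuming that an unset preference either comes back as 0 or throws, so both cases are treated as "not set".

```javascript
const kDefaultMaxProtocols = 2;

// `server` is assumed to expose nsIMsgIncomingServer's getIntValue();
// "max_protocols" is a hypothetical preference name, not an existing one.
function maxProtocolsFor(server) {
  let value = 0;
  try {
    value = server.getIntValue("max_protocols");
  } catch (e) {
    // Treat a missing or unreadable preference as "not set".
  }
  return value > 0 ? value : kDefaultMaxProtocols;
}

// Stand-ins for real servers, so the sketch runs outside Thunderbird.
const configured = {
  getIntValue: function (name) { return name == "max_protocols" ? 4 : 0; }
};
const unset = {
  getIntValue: function (name) { return 0; }
};
```

With something like this in place, runTask would compare this._protocols.length against maxProtocolsFor(this) instead of the constant, assuming wfServer exposes that accessor.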
The core implementation of the protocol running object for webscraping is not too difficult:
const kMaxLoads = 4;
function wfProtocol(server) {
  this._server = server;
  // Per-instance queue; an array on the prototype would be shared by
  // every protocol object.
  this._urls = [];
}
wfProtocol.prototype = {
  /// Queued URLs; first kMaxLoads are the currently running
  _urls: null,
  /// The current task
  _task: null,
  /// Whether a task is currently loaded (checked by wfServer.runTask)
  get isRunning() {
    return this._task != null;
  },
  /// Load the next URL; if all URLs are finished, finish the task
  onUrlLoaded: function (url) {
    if (this._urls.length > kMaxLoads)
      this._urls[kMaxLoads].runUrl();
    this._urls.shift();
    if (this._urls.length == 0)
      this.finishTask();
  },
  /**
   * Queue the next URL to load.
   * Any extra arguments will be passed to the callback method.
   * The callback is called with this protocol as the this object.
   */
  loadUrl: function (url, callback) {
    let closure = this;
    let task = new UrlRunner(url, this);
    // Slot 0 is reserved for the DOM document
    let argcalls = [null];
    for (let i = 2; i < arguments.length; i++)
      argcalls.push(arguments[i]);
    task.onUrlLoad = function (dom) {
      argcalls[0] = dom;
      callback.apply(closure, argcalls);
    };
    this._urls.push(task);
    if (this._urls.length <= kMaxLoads)
      task.runUrl();
  },
  /// Run the task
  loadTask: function (task) {
    this._task = task;
    this._task.runTask(this);
  },
  /// Handle a completed task
  finishTask: function () {
    let task = this._server.getNextTask();
    this._task.onTaskCompleted(this);
    // Mark the protocol idle again before picking up more work
    this._task = null;
    if (task)
      this.loadTask(task);
  }
};

/// An object that represents a URL to be run
function UrlRunner(url, protocol) {
  this._url = url;
  this._protocol = protocol;
}
UrlRunner.prototype = {
  runUrl: function () {
    let real = this;
    asyncLoadDom(this._url, function (dom) {
      real.onUrlLoad(dom);
      real._protocol.onUrlLoaded(real._url);
    });
  },
  onUrlLoad: function (dom) {}
};
The protocol is initialized by calling loadTask, which calls runTask on the task object. The task in turn makes some calls to loadUrl, each of which starts loading immediately as long as fewer than kMaxLoads URLs are in flight. When a URL finishes loading, via UrlRunner.runUrl, the callback function is called and then onUrlLoaded is called to remove a URL from the queue and start the next waiting one. When onUrlLoaded detects that no URLs are left (which is why the callback, which may queue further URLs, is called before it), it calls finishTask on the protocol, which notifies the task object through onTaskCompleted and picks up the next queued task from the server.
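To make the throttling easier to follow, here is a stripped-down sketch of the same "the first few entries of the array are the running ones" idea, detached from mailnews. LoadQueue, startLoad and onIdle are names I made up for this illustration, and the loader is a fake that lets the loads be finished by hand; it is not the code I actually run.

```javascript
// Generic version of the "first maxLoads entries are running" queue.
// `startLoad(url, done)` is a hypothetical loader standing in for asyncLoadDom.
function LoadQueue(maxLoads, startLoad, onIdle) {
  this._maxLoads = maxLoads;
  this._startLoad = startLoad;
  this._onIdle = onIdle;
  this._urls = [];  // per-instance queue
}
LoadQueue.prototype = {
  add: function (url) {
    this._urls.push(url);
    // Start right away only while we are under the limit
    if (this._urls.length <= this._maxLoads)
      this._start(url);
  },
  _start: function (url) {
    let self = this;
    this._startLoad(url, function () { self._onLoaded(); });
  },
  _onLoaded: function () {
    // Start the first URL that has not been started yet, if any
    if (this._urls.length > this._maxLoads)
      this._start(this._urls[this._maxLoads]);
    // Drop one entry; only the count of running loads matters here
    this._urls.shift();
    if (this._urls.length == 0)
      this._onIdle();
  }
};

// Dry run with a fake loader that records starts and defers completion.
let started = [], pending = [], idle = false;
let queue = new LoadQueue(2,
  function (url, done) { started.push(url); pending.push(done); },
  function () { idle = true; });
["a", "b", "c"].forEach(function (u) { queue.add(u); });
let startedBeforeAnyFinish = started.slice();  // ["a", "b"]
pending.shift()();                             // "a" finishes, "c" starts
let startedAfterOneFinish = started.slice();   // ["a", "b", "c"]
pending.shift()();
pending.shift()();                             // last load finishes, idle is true
```

Note that the array is only used as a counter plus a waiting list: the entry removed by shift is not necessarily the one that just finished, but the number of entries in front of the first unstarted URL always equals the number of loads in flight.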
The working of loadUrl bears special mention. The first argument is the URL (as a string) to be loaded. The second argument is the method on wfProtocol to be called when the URL is loaded. This implies that the actual code for implementing tasks is mostly contained on wfProtocol as opposed to the task objects. All subsequent arguments are passed in as arguments to the callback function; the first argument to this function is the DOM document.
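The argument forwarding is easiest to see in a small dry run. The sketch below uses a stand-in protocol whose loadUrl reads from a canned table instead of the network, but passes arguments along in the way described above. StubProtocol, updateFolder, onFolderPage and the example.invalid URL are all hypothetical names for this illustration, and the "DOM" is just an object with a title.

```javascript
// Hypothetical stand-in for wfProtocol: loadUrl "loads" from a canned table
// instead of the network, but forwards arguments the way the text describes.
function StubProtocol(pages) {
  this._pages = pages;
  this.messages = [];
}
StubProtocol.prototype = {
  loadUrl: function (url, callback) {
    // The callback receives the DOM first, then every extra argument,
    // with the protocol as its `this` object.
    let extra = Array.prototype.slice.call(arguments, 2);
    callback.apply(this, [this._pages[url]].concat(extra));
  },
  // Hypothetical task step: ask for the first page of a folder listing
  updateFolder: function (folderName) {
    this.loadUrl("http://example.invalid/forum?page=1",
                 this.onFolderPage, folderName, 1);
  },
  // Hypothetical callback living on the protocol object
  onFolderPage: function (dom, folderName, page) {
    this.messages.push(folderName + " p" + page + ": " + dom.title);
  }
};

let protocol = new StubProtocol({
  "http://example.invalid/forum?page=1": { title: "Inbox listing" }
});
protocol.updateFolder("Inbox");
// protocol.messages is now ["Inbox p1: Inbox listing"]
```

Because the callback is invoked with the protocol as this, onFolderPage can call this.loadUrl again to fetch the next page, carrying the folder name and page number along as extra arguments.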
Notes
[1] Well, there is an nsIDOMParser which can turn text into a DOM without needing a document object. Unfortunately, it only supports XML. There is a patch for making it parse HTML, but it has gotten no traction in recent months.
[2] Just to muddle it all up, the URL instances in most mailnews implementations are actually how the tasks are implemented, although I internally use a URL to represent a state (kind of). A potentially clarifying discussion can be found in mozilla.dev.apps.thunderbird.
[3] I am not totally happy with the current model of the protocol system in mailnews, particularly with the technique of crossing over to the service to make the calls to the protocol. In my implementation, I've made those functions static functions on the protocol object. Since this is somewhat different from the current implementations and I'm not sure I want to keep this, I've couched my statements of how things work.
1 comment:
Love your style by the way.
"the URL instances in most mailnews implementations are actually how the tasks are implemented"
If you and I could agree on terminology, it would be great. Unfortunately I don't like your "task" and "state" but I don't really like my equivalents either.
But just so you know, I am beginning to use "Request" as a generic version of "Do Something that might involve calling a server" (like your "state") which I don't think is that far off of the nsIRequest interface definition. But it has to be qualified to be used. So I have of course "XMLHttpRequest" but also "SoapRequest" which are single round-trips to the server. In my "Native" layer (Exchange-like objects) as I said in irc I use "Machine" as the analog for your "task" - but I don't really like it either. I have not yet implemented the "Colonial" layer methods for this (nsMsgIncomingServer subclasses) but I am still searching for the correct naming. "tasks", "running a url", "method"? In my naming scheme "ColonialRequest" probably makes the most sense, but I doubt if my "Colonial" naming described in http://mesquilla.com/2010/05/10/mailbox-state-machine-exchange-support/ will catch on.