« | Home | Recent Comments | Categories | »

XML and XSLT on the T2000

Posted on April 1st, 2006 at 11:11 by John Sinteur in category: Sun Coolthreads T2000 -- Write a comment

XML is not just the ‘buzzword du jour’, it’s also very, very useful, if you use it correctly. The applications on my webservers shove a lot of xml around, and to turn it into a webpage, we do a lot of xslt transformations on it.

Let’s make up a totally fictional example. Say, for example, a user is logged on, and we want to display a list of his telephone numbers that have caller-id enabled on it. We get an xml message with all his telephone numbers and for each telephone number all the functions on it from a backend system, then we’ll slug that xml message through xslt transformations to get it to something we can display.

So I needed a benchmark that would take a number of xml messages, a number of xslt transformation, and run that through a number of threads to see how well the system performs if you do a lot of that at the same time. And I couldn’t find anything useful on Google. Plenty of xml and xslt benchmarks, of course, but none of them would let me tune the number of threads, or do any of the other things I wanted to know.

So, I walked over to a collegue who knows more about optimizing xslt than anybody else I’ve ever met, and we set out to create our own benchmark. We may add it to sourceforge if all goes well, but first we’ll have to fix something we ran into when testing the T2000.

Take, for example, this xml message:


<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <table>
        <table-header>
            <table-row>
                <table-cell>H1</table-cell>
                <table-cell>H2</table-cell>
                <table-cell>H3</table-cell>
                <table-cell>H4</table-cell>
                <table-cell>H5</table-cell>
            </table-row>
        </table-header>
        <table-body>
            <table-row>
                <table-cell>11</table-cell>
                <table-cell>12</table-cell>
                <table-cell>13</table-cell>
                <table-cell>14</table-cell>
                <table-cell>15</table-cell>
            </table-row>
            <table-row>
                <table-cell>21</table-cell>
                <table-cell>22</table-cell>
                <table-cell>23</table-cell>
                <table-cell>24</table-cell>
                <table-cell>25</table-cell>
            </table-row>
            <table-row>
                <table-cell>31</table-cell>
                <table-cell>32</table-cell>
                <table-cell>33</table-cell>
                <table-cell>34</table-cell>
                <table-cell>35</table-cell>
            </table-row>
            <table-row>
                <table-cell>41</table-cell>
                <table-cell>42</table-cell>
                <table-cell>43</table-cell>
                <table-cell>44</table-cell>
                <table-cell>45</table-cell>
            </table-row>
            <table-row>
                <table-cell>51</table-cell>
                <table-cell>52</table-cell>
                <table-cell>53</table-cell>
                <table-cell>54</table-cell>
                <table-cell>55</table-cell>
            </table-row>
        </table-body>
    </table>
    <table>
        <table-header>
            <table-row>
                <table-cell>H1</table-cell>

etc, and have that go on for 3 and a half megabyte. Yes, it’s a big message, but some of the xml we shove around is indeed big.

Then, take this xslt transformation:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
            <prec is="{count(preceding::*)}"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

And do that in a lot of threads, and a lot of times, and see what happens. In Java, of course. The test application needs to build a DOM tree, a memory structure that represents the xml message, and do the work requested in the xslt transformation on that DOM tree. Just about all of it is work for the Xalan library. Here is a good article in wikipedia bout exactly this kind of work.

The above sample does nothing but “count the number of preceding nodes” for each node in the xml message. This particular transformation will use almot 6 Mb of memory, and if you run 50 threads doing this repeatedly for 60 seconds, you’ll find that a transformation takes about 3 seconds each, you’ll need about 250 Mb of memory in your Java virtual machine, and you’ll have the machine running flat out.

But when we tried this with some other transformations, for example with a {count(descendant::*)} instead of a {count(preceding::*)}, we found the total CPU utilization maxing out at about 20%. The machine was doing everything it could, and still only using 20% of available CPU capacity. This made us scratch our heads for a moment, until we realized that we were shoving so much memory around, that that became the bottleneck. The {count(preceding::*)} was the lucky one, since for that transformation it had to go back in memory, and that memory was likely already available in L2 cache.

Or, in a language more people will be likely to understand: a modern computer has more than one kind of memory available. For the processor in the T2000 for example, there’s “instruction” and “data” cache, memory that the chip can access at full speed, since it’s usually part of the chip itself. This is fairly expensive memory to put in a computer, so there’s very little of it – just 16 Kb of instruction cache per core, and 8 Kb of data cache per core. This memory is usually called “level 1 cache”. There’s also “Level 2 cache”, and in case of the T2000 all the cores share 3 Mb of memory together. This is quite a lot, since a Pentium has usually about 512 Kb or 1 Mb at most. This memory is also very fast, often at the speed of the CPU as well. Then there’s “main memory” – this is usually the memory you’ll see advertised – “This computer has 512 Mb of memory” or something like that. The Sun T2000 has 8 Gb of memory, and that’s quite a lot. If you buy a new fast PC, you’ll probably get 512 Mb or 1 Gb with it. The processor can not access this memory at full speed, however. Memory access to main memory is a lot slower than the CPU can operate. Usually if the processor needs something from main memory, it is copied into the level 2 cache upon accessing it, and that copying is done in chunks of varying size. The logic behind that is that if a program needs a piece of information from memory, the odds are that the next bit of memory it needs is right next to that. Copying memory into level 2 cache to make it avaiable to the CPU takes some time, so copying more (while the CPU is working on the first bits copied) makes sense. To have the CPU work at maximum capacity, it’s best if the system manages to keep the level 2 cache loaded with whatever memory the CPU needs or is likely to need in the next few moments. If the system keeps missing the cache, the overall system cannot work at 100% CPU power.

And that’s what we were seeing – we were moving around so much memory that the caches kept getting “misses” instead of “hits” with memory that was in cache. Now according to the specs on the Sun pages, the system can move 3.1 Gigabyteof data around per second, and although that’s quite respectable, our benchmark blew that right away.

This means, of course, that our benchmark isn’t testing Xalan or the CPU, it is testing amount of memory a system can move around, and that’s something you can simply read from the specs. So, we’re going to modify the benchmark a bit, so that the workload we present to the system more accurately represents something you encounter in the real world. After all, no server is simply counting nodes all the time.

It also shows that access to memory is becoming more of a bottleneck in modern days. If you buy a new PC, and you bother to read the specs, you’ll see something like “800 MHz FSB” which roughly means that access to memory can be done at a frequency of 800 MHz. Since the CPU you’ll stick in such a PC is probably 3 GHz, you’ll notice that the CPU has to slow down from 3 GHz to 800 Mhz every time it needs to go to main memory. Caches are very important, and the speed of the main memory is getting more and more important as well.

If Sun wants some advice on how to improve this machine: don’t just go for faster CPU speed, improve bandwidth to main memory as well…

But then again, that should not be a surprise to them – every computer has this problem.

I’ll post again on XML and XSLT when we’ve made the benchmark more realistic…

  1. you’ll notice that the CPU has to slow down from 3 GHz to 800 Mhz every time it needs to go to main memory. But I imagine that this is precise where the hyperthreading/coolthreads stuff kicks serious butt, since the cost of a context switch to another active thread is nearly zero. So I wonder if they set up the architecture so that they switch threads on a cache miss?

    Are you really pushing around megabytes of XML for a web based application? This is not an end user application then, I take it.

  2. Yes, I expect that this is indeed exactly the situation where a context switch in the CPU will occur – however, in my little test all threads were doing the same thing, and there was no thread to switch to with work. The “megabytes” is an extreme case, most xml is between a few and a few hundred Kb.

  3. I know this test isn’t about XSLT processors, but can you try this test with Saxon8 also? Typically Xalan’s performance as compared to anything else in the field, well, sucks…

  4. I’ll have a chat with Matthew (author of the tests) about it..

  5. Even if all threads are running the same code, they’re unlikely to be in lockstep, so they’re all on a different line of the code. If you have 4 processors running on one hyperthreaded CPU core, and the memory hit is a 4 cycle hit (800MHz compared to 3GHz), then they’d all have to have a cache miss within 4 cycles of each other to completely stall the CPU–otherwise there should always be a runnable thread. No?

  6. What kind of operations require hundreds of Kbs of XML in a customer facing application?

    I’ve always thought that XML seemed really verbose; is there clever compression that reduces the amount of storage overhead eaten up by all the tags?

  7. What kind of operations require hundreds of Kbs of XML in a customer facing application?

    Take a nokia phone. Write out all the variants you sell – with subscription, prepaid, etc. Write out all the accessories you sell with it. The size of your xml message will grow pretty fast. And clever compression? That’s only useful for transport, and if bandwidth is an issue during transport. For transformations you turn it into a DOM tree.

  8. Apparently they *are* having a cache miss within 4 cycles. But then again, you’d have to know how “deep” the bubble is (your guess of 4 cycles may or may not be correct, what the instruction stream looks like, etc, etc. And since this is Java, I can’t just look at the compiled code and guess..

previous post: Niagara vs ftp.heanet.ie Showdown

next post: Albino peacock