
Sun, allies broaden open-source chip push

Posted on December 13th, 2007 at 10:10 by John Sinteur in category: Sun Coolthreads T2000

Remember when OpenBSD had trouble getting the specs for the Sparc chips? Well, those days are over:

[Quote:]

Sun Microsystems’ open-source chip plan is bearing some early fruit, but the server and software company hopes to increase further involvement by sharing the designs of its forthcoming “Niagara 2” processor.

You can download the specs here, so if you have a spare chip factory in your basement…

The Niagara chips are really, really good when you hit them with a workload that requires massive threading…



The Value of Being Green

Posted on August 16th, 2006 at 8:46 by John Sinteur in category: Sun Coolthreads T2000

[Quote:]

If you ever get asked by a cynic, or your management “what’s the real value of being green?,” I can give you a very specific answer, at least for Sun. In the State of California, it’s worth $700 to $1000 per server. I did say per server. Every single bid we’re in across the state just got $700 to $1,000 per server more competitive.

With ASP’s under $5,000 for a Niagara machine, that’s not a little competitive push.

That’s real power.



Jonathan Schwartz’s Weblog

Posted on June 24th, 2006 at 21:53 by John Sinteur in category: Sun Coolthreads T2000

[Quote:]

I was on a plane flight with an executive from the hospitality industry not too long ago, who told me a very interesting story – about the impact of flat panel televisions on hotel room occupancy. According to this exec, flat panel TV’s drove down industry occupancy rates.

No, seriously.

Apparently the space savings and lower power consumption of a flat panel TV (think about it, they’re quite a bit smaller and draw far less energy) allowed hotels to skip having to put giant media cabinets in their rooms. And they could save on their total power (and air conditioning) envelope, as well. Which freed up space, power and budget for more rooms. Which led to a glut of new rooms, and the rest falls into place.

It reminded me of a conversation I had with a CIO at a large financial institution in midtown Manhattan a few months back. She’d just been promoted to be the CIO of her company, and in one of her first meetings with the CEO, brought him a picture of the roof of their building.

Care to guess why?



Comments:

  1. Sounds extremely familiar! Then again, my company runs 25 datacenters globally, and yes, I’ve seen the roofs of most.

CoolThread Servers – Testimonials

Posted on June 11th, 2006 at 14:14 by John Sinteur in category: Sun Coolthreads T2000

Check where the very first quote in the “try and buy customer blogs” section on Sun’s T2000 testimonial page comes from…



Comments:

  1. They should give you a T2000 for that!

Paul Murphy | ZDNet.com

Posted on April 26th, 2006 at 8:01 by John Sinteur in category: Sun Coolthreads T2000

Ever had your weblog postings reviewed by a ZDNet columnist?

I have.

And he got the money quote as well…



Comments:

  1. I should hope he did get the “money quote”. I also think that Sun got good value from all the work you put in to testing the machine. Great stuff!

  2. Jesus John, what did you write this time… If you need to hide, remember the place we used last time. You really pissed them off, didn’t you: Next on the hit list is […] a guy named John Sinteur.

Core Duo on the desktop – The Tech Report

Posted on April 18th, 2006 at 18:17 by John Sinteur in category: Sun Coolthreads T2000

If you were wondering why the Mac Mini did so well in my benchmarks against the Sun T2000, read this article on the Core Duo chip. It is basically saying the same thing I’ve been saying about the T2000 – multi-core chips are the future.



Load average / run queue

Posted on April 14th, 2006 at 17:20 by John Sinteur in category: Sun Coolthreads T2000

Any Solaris sysadmins in the audience? This picture probably scares you… but the machine was perfectly responsive. I’ll post the C program we used later on; it’s a multi-threaded skiplist implementation, written by a graduate student at the UvA for a few tests.

runqueue.jpg



SWaP – (Space, Watts and Performance) Metric

Posted on April 14th, 2006 at 11:53 by John Sinteur in category: Sun Coolthreads T2000

In earlier posts I said that two limiting factors in datacenters today are 1) the amount of space a computer takes and 2) the amount of power it draws. One 19″ rack typically has only 4000W of power available for the computers in the rack, and 48U of space (one U being 1.75 inches, or 44.45 mm). If you have a good idea how much computer performance you need to buy to serve your needs, you can calculate how much rack space you need – and if you run into the power limit or the space limit for one rack, you’ll have to rent another, increasing your costs quite a bit. So server makers have been trying to 1) get power usage down, 2) reduce the size of the server, and 3) increase the performance of the computer. This results in rack-dense solutions such as HP blades, or low-power solutions such as the T2000.

Sun has come up with a metric to compare servers, called SWaP: Space, Watts and Performance.

Basically, the higher your “SWaP” numbers, the less datacenter space and power you need to do a computing job.
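
For reference, Sun defines the metric as performance divided by the product of the space the machine occupies (in rack units) and the power it draws (in watts). A minimal sketch of the arithmetic, in Java since that’s what my benchmarks use – the numbers below are placeholders, not measurements from this post:

public class Swap {
    // SWaP = performance / (rack space in U * power draw in watts)
    static double swap(double performance, double spaceU, double watts) {
        return performance / (spaceU * watts);
    }

    public static void main(String[] args) {
        // Two hypothetical boxes with equal benchmark scores: the one with
        // the smaller space-times-power footprint gets the higher SWaP.
        System.out.printf("2U at 325 W: %.2f%n", swap(1000, 2, 325)); // 1.54
        System.out.printf("1U at 700 W: %.2f%n", swap(1000, 1, 700)); // 1.43
    }
}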

So, let’s first take a look at what I’ve got so far in performance numbers. I’ve got two “weird” systems in here: the Pentium M, a laptop, and a new Mac Mini. Those two systems are typically not found in a datacenter, except in some very “creative” installations. A laptop that can operate with a closed lid is not only very power-efficient, with its own battery backup in case of power failures, but also very thin, so you could place two of them in 1U of rack space. The mini is similar: very power-efficient, and you could stand four of them next to each other in 2U of rack space, and probably have plenty of space left over “behind” them for more minis or external hard disks. So, what the heck, I’m going to show them in the graphs as well.

I should receive some more benchmarks this weekend or next week: an iMac with a Core Duo, and a dual-core Opteron 280 system. That last one is the kind of chip usually found in the T2000’s competitors, so I’m really looking forward to those results.

Anyway, first the performance graph (higher is better):

performance.png

From this, you’d assume the T2000 and the Athlon computer are competitors. This is a high-end Athlon chip, and a completely equipped server from an A-brand would probably set you back about the same number of dollars – except most server suppliers pick the Opteron chip instead of the Athlon for such systems.

But when you look at the power consumption, it’s quite a bit higher for the Athlon. And the mini, with its 110 watt consumption, has a huge advantage not seen in this graph. So, let’s turn the graph into a SWaP graph:

SWaP.png

And suddenly it becomes clear why Sun is promoting this metric. They’re really good at it – and notice how well the Mac Mini does!

I’ll post a new version of these graphs when I get more results…



Comments:

  1. I’ll have that mactel mini, thank you. I bet you could make a noise versus performance graph that would put the mac mini on top over the whole range!

  2. trust me, you do NOT want a T2000 in your living room. Well, unless you don’t mind having to step outside to have a conversation…

More xml/xslt numbers

Posted on April 12th, 2006 at 7:48 by John Sinteur in category: Sun Coolthreads T2000

I had a colleague test my Java xml/xslt benchmark on a very interesting machine – one with a dual-core Athlon. I’d love to get my hands on a dual-core Opteron as well, but this Athlon is in the same class of CPUs people will consider when thinking about competitors to the T2000.

The chip: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+
It has 1 MB of level 2 cache; the machine had 2 GB of memory and was running the 64-bit version of Java 1.5, just like the Java version on the T2000. Unfortunately, the saxon test failed on this machine – the same “(Too many open files)” transformer exceptions I saw earlier.

I’ve copied the old numbers from the T2000 into the table, for comparison. The same results: if you’re running a low number of threads, and need the fastest results possible, go for the Athlon. If you’re running a large number of threads, and need massive amounts of throughput, the T2000 wins easily.

fastest transformation, in milliseconds. Lower is better

         T2000/saxon  T2000/xalan  Athlon/xalan
chess            155          698            72
game               5            8             0
nitf               8           13             1
recipes           14           31             2
table              5           14             1
topic             17           32             3
wai                7           53             4

total number of transformations done. Higher is better

         T2000/saxon  T2000/xalan  Athlon/xalan
chess          29310        10100         16166
game          968610       865870        393857
nitf          693270       580570        298835
recipes       363270       247490        160213
table         905800       558570        287640
topic         335350       230930        139842
wai           770380       110780        108329

average time to do one transformation, in milliseconds. Lower is better

         T2000/saxon  T2000/xalan  Athlon/xalan
chess            334          988           617
game               9           11            25
nitf              13           17            32
recipes           25           40            62
table             10           17            34
topic             27           43            71
wai               12           90            92

I’ll be trying out another bench today – one that will graph the performance with different numbers of threads on the different machines. After that, I should be able to get some SWaP graphs as well – that’s the comparison Sun most likes to make, so let’s find out if their marketing department picked the right way to look at the machine.
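
For the curious, that thread-sweep bench will look roughly like this – a sketch rather than the final code, with a placeholder unit of work standing in for a real transformation:

import java.util.concurrent.atomic.AtomicLong;

public class ThreadSweep {
    public static void main(String[] args) throws Exception {
        // Run the same workload at several thread counts, record throughput.
        for (int threads : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            final AtomicLong done = new AtomicLong();
            final long end = System.currentTimeMillis() + 10_000; // 10 s per step
            Thread[] pool = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                pool[i] = new Thread(() -> {
                    while (System.currentTimeMillis() < end) {
                        workUnit();              // stand-in for one transformation
                        done.incrementAndGet();
                    }
                });
                pool[i].start();
            }
            for (Thread t : pool) t.join();
            System.out.printf("%2d threads: %d units%n", threads, done.get());
        }
    }

    // Placeholder CPU-bound unit of work; swap in a real xslt transformation.
    static void workUnit() {
        long x = 0;
        for (int i = 0; i < 1_000_000; i++) x += i * 31;
        if (x == 42) System.out.println(); // keep x live so the JIT can't drop the loop
    }
}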



More xml/xslt numbers

Posted on April 11th, 2006 at 10:40 by John Sinteur in category: Sun Coolthreads T2000

A colleague was kind enough to run the test set on one of his Linux boxes.

The specs:

Debian GNU/Linux 2.6.15
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Intel(R) Pentium(R) M processor 2.00GHz (with 2MB level2 cache)
2 GB DDR-2 RAM

The results:

fastest transformation, in milliseconds. Lower is better

          saxon   xalan
chess        26     156
game          0       1
nitf          1       2
recipes       2       4
table         0       1
topic         2       4
wai           0       8

(“zero” milliseconds doesn’t mean “zero time”; it means less than half a millisecond)

total number of transformations done. Higher is better

           saxon    xalan
chess      27066    11433
game      193012    34978
nitf      118298    12136
recipes    71345     7181
table     136911     9918
topic      79239     7401
wai        90454    14324

average time to do one transformation, in milliseconds. Lower is better

           saxon    xalan
chess        423      n/a
game          53      n/a
nitf          93      n/a
recipes      143      n/a
table         84      n/a
topic        128      n/a
wai          126      n/a

I can’t give any xalan averages: the individual threads for each case reported numbers so wildly different from each other that a sensible “average” wasn’t really possible. Apparently xalan does something weird on this setup.

A nice, fast, modern laptop, and no surprises other than the xalan test: an individual transformation is a lot faster than on a T2000, but on total capacity the T2000 still wins. And again it’s clear that if you’re running a website with a lot of traffic, the T2000 is an excellent choice. Your visitors won’t get the fastest possible response time, but you’ll handle far more visitors for the same dollar investment than with the other machines I’ve tested so far. If you’ve only got a few visitors, or each visitor needs the fastest possible response time, don’t pick the T2000. The T2000 isn’t a Formula 1 car; it’s an 18-wheeler truck.



More T2000 xml/xslt numbers

Posted on April 11th, 2006 at 8:41 by John Sinteur in category: Sun Coolthreads T2000

I just finished a test run on a Sun Fire V440. The machine has 8 GB of memory and 4 CPUs. It’s an older machine; the CPUs are UltraSPARC IIIi, each at 1 GHz. The original purchase price, however, is quite a lot higher than the price a T2000 fetches. And the T2000 blows it away, no contest. The V440 outperforms my Mac quite significantly, of course, but it cannot begin to match the throughput the T2000 reaches…

fastest transformation, in milliseconds. Lower is better

         T2000/saxon  T2000/xalan  Mac/saxon  Mac/xalan  V440/saxon  V440/xalan
chess            155          698       2237       4783        1573        2350
game               5            8          4          5           4           5
nitf               8           13         10         11           8          10
recipes           14           31         18         19          19          21
table              5           14          7          9           8           9
topic             17           32         19         22          21          23
wai                7           53         42         44          45          46

total number of transformations done. Higher is better

         T2000/saxon  T2000/xalan  Mac/saxon  Mac/xalan  V440/saxon  V440/xalan
chess          29310        10100        756       1052        1000         987
game          968610       865870      31807      27629      133376      102090
nitf          693270       580570      19491      19260       63888       56685
recipes       363270       247490      13124      14296       32565       29648
table         905800       558570      23317      22697       69766       58713
topic         335350       230930      12093      13311       28720       27038
wai           770380       110780       8715       9261       16043       15374

average time to do one transformation, in milliseconds. Lower is better

         T2000/saxon  T2000/xalan  Mac/saxon  Mac/xalan  V440/saxon  V440/xalan
chess            334          988      13207       9557        9991       10120
game               9           11        313        361          74          97
nitf              13           17        510        520         154         174
recipes           25           40        760        697         303         335
table             10           17        430        445         142         169
topic             27           43        825        747         344         367
wai               12           90       1140       1070         620         649


T2000 with xml/xslt

Posted on April 10th, 2006 at 21:01 by John Sinteur in category: Sun Coolthreads T2000

Looking on the web for xml/xslt benchmarks, I ran into the Sarvega XSLTBench benchmark. The test contains a bunch of stylesheets that are pretty “real-life” – ones you might run into in a real situation. They range from medium to very complex in transformation complexity and from medium to huge in stylesheet size, and overall they are a good model of real-life cases.

There’s only one problem with the benchmark: it is single threaded.

So, I wrote some code to start a large number of threads, and had them each work over the same transformation repeatedly. Wrap that in a thread that runs it for about half an hour, and presto, results.
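
Something along these lines – a minimal sketch of such a harness using the standard JAXP API, not the actual benchmark code; the command-line arguments (thread count, stylesheet, input message) are illustrative:

import java.io.File;
import java.io.StringWriter;
import java.util.concurrent.atomic.AtomicLong;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltBench {
    public static void main(String[] args) throws Exception {
        final int threads = Integer.parseInt(args[0]);               // e.g. 32
        final long end = System.currentTimeMillis() + 30 * 60 * 1000L; // ~half an hour
        // A compiled Templates object is thread-safe; each thread then
        // creates its own Transformer from it.
        final Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new File(args[1]))); // the stylesheet
        final File input = new File(args[2]);                       // the xml message
        final AtomicLong count = new AtomicLong();

        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                try {
                    while (System.currentTimeMillis() < end) {
                        templates.newTransformer().transform(
                                new StreamSource(input),
                                new StreamResult(new StringWriter()));
                        count.incrementAndGet();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) t.join();
        System.out.println("transformations completed: " + count.get());
    }
}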

I tried this on the T2000, of course, and on two other computers. First, the Pentium 4 at 3.2 GHz I have been talking about. That, unfortunately, didn’t work at all: the machine is running FreeBSD, so to get Java on it I had to compile from source (with gcc), with a Linux compatibility layer, and that simply doesn’t perform. The results were utterly useless. If anybody has a fast Intel box that can be tested, preferably with Solaris 10 or otherwise with Linux, please let me know. Next, I tried it on my Mac, a dual-processor G4 with 1 GHz chips. An oldie, but it has served me well, and OS X has a decent Java on it. Tomorrow I can test with two Sun V440s, so that’ll be interesting as well.

First, the sources. If you already have the xalan and saxon libs, download this file, if you don’t, or you want to make sure you’re testing with the same version as I am, download this file, but before you do, make sure you understand the licenses that come with xalan and saxon at their respective web sites.

Next, some results. First, let’s look at the fastest transformation the systems were able to do. This was, amongst the thousands of transformations done during the test, the fastest single one; it tells us, roughly, the maximum speed the system could muster for one transformation (time is in milliseconds, lower is better/faster):

         T2000/saxon  T2000/xalan  Mac/saxon  Mac/xalan
chess            155          698       2237       4783
game               5            8          4          5
nitf               8           13         10         11
recipes           14           31         18         19
table              5           14          7          9
topic             17           32         19         22
wai                7           53         42         44

There are two things that stand out. First, on most transformations my Mac isn’t that bad: its 1 GHz G4 chip can keep up with a 1 GHz core from the T2000, except on the very complex chess transformation (which basically takes the moves from a chess game and draws a chess diagram for each move – about as complex as transformations get). Second, it’s clear saxon is faster. But before you toss your xalan libraries out the window: test, test, test. The chess transformation didn’t work with the latest saxon (version 8), and I’ve been seeing weird “out of file descriptors” error messages with it – it may be leaking file handles. I’ll probably file a bug report later on.

So, given the same amount of time to do transformations, which of the two systems managed to do the most work? The T2000, of course, since it can bring 32 CPUs to the table where my old Mac only has two. Here are the numbers (counted transformations, higher is better):

Number of transformations done:

         T2000/saxon  T2000/xalan  Mac/saxon  Mac/xalan
chess          29310        10100        756       1052
game          968610       865870      31807      27629
nitf          693270       580570      19491      19260
recipes       363270       247490      13124      14296
table         905800       558570      23317      22697
topic         335350       230930      12093      13311
wai           770380       110780       8715       9261

As you can see, the T2000 blew my Mac completely out of the water; the difference is much more than you’d expect by simply multiplying the Mac results by 16. It shows you how bogged down a system can become when it has to juggle many threads. The T2000 really, really shines with this kind of work. My guess is that a single-CPU Pentium will suffer the wrath of the T2000 on this test as well, but I’m counting on my readers here…

And here’s the average time it took the systems to do a transformation (time is in milliseconds, lower is better/faster):
         T2000/saxon  T2000/xalan  Mac/saxon  Mac/xalan
chess            334          988      13207       9557
game               9           11        313        361
nitf              13           17        510        520
recipes           25           40        760        697
table             10           17        430        445
topic             27           43        825        747
wai               12           90       1140       1070

Again, the difference is huge. The individual processors in a T2000 may not be the fastest on the market, but the machine keeps going long, long, long after other systems are bogged down with load. Note how close “fastest” and “average” are with the T2000, compared to what my Mac could do.

If all goes well, I’ll have some results from the V440 systems from Sun tomorrow. If you have some systems that can be added to these tables, please let me know!



Multiple-core CPU performance

Posted on April 5th, 2006 at 13:29 by John Sinteur in category: Apple, Sun Coolthreads T2000

It’s not just Sun working on getting the most out of multiple-core CPUs…



T2000 versus 2 CPU Sun V440

Posted on April 5th, 2006 at 9:21 by John Sinteur in category: Sun Coolthreads T2000

While Matthew is working on a better version of the xml/xslt benchmark, I tried to run the same tests that memory-stalled the T2000, this time on an older Sun V440 – one with only two CPUs, UltraSPARC IIIi, at 1 GHz each.

This older machine, with only 1 MB of level 2 cache, is not memory-stalled by the same tests. Yet it appears to be slower with these transformations. If you buy a new V440, it has 1.5 GHz CPUs, so it will be faster than the older machine I’m trying this on. List price is about the same for such a V440 and my T2000.

I wonder what the T2000 will do with the revised tests, when it’s no longer stalling…



XML and XSLT on the T2000

Posted on April 1st, 2006 at 11:11 by John Sinteur in category: Sun Coolthreads T2000

XML is not just the ‘buzzword du jour’, it’s also very, very useful, if you use it correctly. The applications on my webservers shove a lot of xml around, and to turn it into a webpage, we do a lot of xslt transformations on it.

Let’s make up a totally fictional example. Say a user is logged on, and we want to display a list of his telephone numbers that have caller-id enabled. We get an xml message from a backend system with all his telephone numbers and, for each number, all the functions enabled on it; then we push that xml message through xslt transformations to turn it into something we can display.

So I needed a benchmark that would take a number of xml messages and a number of xslt transformations, and run them through a number of threads to see how well the system performs if you do a lot of that at the same time. And I couldn’t find anything useful on Google. Plenty of xml and xslt benchmarks, of course, but none of them would let me tune the number of threads, or do any of the other things I wanted to know.

So, I walked over to a colleague who knows more about optimizing xslt than anybody else I’ve ever met, and we set out to create our own benchmark. We may put it on SourceForge if all goes well, but first we’ll have to fix something we ran into when testing the T2000.

Take, for example, this xml message:


<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <table>
        <table-header>
            <table-row>
                <table-cell>H1</table-cell>
                <table-cell>H2</table-cell>
                <table-cell>H3</table-cell>
                <table-cell>H4</table-cell>
                <table-cell>H5</table-cell>
            </table-row>
        </table-header>
        <table-body>
            <table-row>
                <table-cell>11</table-cell>
                <table-cell>12</table-cell>
                <table-cell>13</table-cell>
                <table-cell>14</table-cell>
                <table-cell>15</table-cell>
            </table-row>
            <table-row>
                <table-cell>21</table-cell>
                <table-cell>22</table-cell>
                <table-cell>23</table-cell>
                <table-cell>24</table-cell>
                <table-cell>25</table-cell>
            </table-row>
            <table-row>
                <table-cell>31</table-cell>
                <table-cell>32</table-cell>
                <table-cell>33</table-cell>
                <table-cell>34</table-cell>
                <table-cell>35</table-cell>
            </table-row>
            <table-row>
                <table-cell>41</table-cell>
                <table-cell>42</table-cell>
                <table-cell>43</table-cell>
                <table-cell>44</table-cell>
                <table-cell>45</table-cell>
            </table-row>
            <table-row>
                <table-cell>51</table-cell>
                <table-cell>52</table-cell>
                <table-cell>53</table-cell>
                <table-cell>54</table-cell>
                <table-cell>55</table-cell>
            </table-row>
        </table-body>
    </table>
    <table>
        <table-header>
            <table-row>
                <table-cell>H1</table-cell>

etc., and have that go on for three and a half megabytes. Yes, it’s a big message, but some of the xml we shove around is indeed big.

Then, take this xslt transformation:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*"/>
            <prec is="{count(preceding::*)}"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

And do that in a lot of threads, and a lot of times, and see what happens. In Java, of course. The test application needs to build a DOM tree – a memory structure that represents the xml message – and do the work requested in the xslt transformation on that DOM tree. Just about all of it is work for the Xalan library. There is a good article on Wikipedia about exactly this kind of work.
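
For the curious, one worker’s unit of work boils down to something like this minimal JAXP sketch – not our actual code; the file names are illustrative, and with Xalan on the classpath the default factory normally resolves to it:

import java.io.File;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

public class OneTransform {
    public static void main(String[] args) throws Exception {
        // Build the DOM tree for the xml message...
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new File("message.xml"));

        // ...then let the stylesheet do its counting work on that tree.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("count-preceding.xsl")));
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out.getBuffer().length() + " characters of output");
    }
}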

The sample stylesheet above does nothing but “count the number of preceding nodes” for each node in the xml message. This particular transformation will use almost 6 MB of memory, and if you run 50 threads doing this repeatedly for 60 seconds, you’ll find that each transformation takes about 3 seconds, you’ll need about 250 MB of memory in your Java virtual machine, and you’ll have the machine running flat out.

But when we tried this with some other transformations, for example with a {count(descendant::*)} instead of a {count(preceding::*)}, we found total CPU utilization maxing out at about 20%. The machine was doing everything it could, and still only using 20% of available CPU capacity. This made us scratch our heads for a moment, until we realized that we were shoving so much data around that memory had become the bottleneck. The {count(preceding::*)} case was the lucky one: that transformation looks back at memory it has just visited, and that memory was likely still available in the L2 cache.

Or, in language more people are likely to understand: a modern computer has more than one kind of memory. For the processor in the T2000, for example, there are “instruction” and “data” caches – memory the chip can access at full speed, since it’s usually part of the chip itself. This is fairly expensive memory to put in a computer, so there’s very little of it: just 16 KB of instruction cache per core, and 8 KB of data cache per core. This memory is usually called “level 1 cache”.

There’s also “level 2 cache”; in the case of the T2000, all the cores together share 3 MB of it. That’s quite a lot, since a Pentium usually has about 512 KB, or 1 MB at most. This memory is also very fast, often running at the speed of the CPU as well.

Then there’s “main memory” – usually the memory you see advertised: “this computer has 512 MB of memory” or something like that. The Sun T2000 has 8 GB of memory, and that’s quite a lot; if you buy a new, fast PC, you’ll probably get 512 MB or 1 GB with it. The processor cannot access this memory at full speed, however – access to main memory is a lot slower than the CPU can operate. Usually, when the processor needs something from main memory, it is copied into the level 2 cache on access, and that copying is done in chunks of varying size. The logic behind that is that if a program needs a piece of information from memory, the odds are that the next bit of memory it needs is right next to it. Copying memory into level 2 cache to make it available to the CPU takes some time, so copying more (while the CPU is working on the first bits copied) makes sense. To have the CPU work at maximum capacity, it’s best if the system manages to keep the level 2 cache loaded with whatever memory the CPU needs, or is likely to need, in the next few moments. If the system keeps missing the cache, it cannot run at 100% CPU power.
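
You can see the effect on any machine with a toy like the following sketch: it walks the same 64 MB of data twice, once sequentially (cache-friendly) and once with a 4 KB stride (cache-hostile), doing exactly the same number of additions both times. The strided walk typically takes several times longer, purely because of cache misses:

public class CacheWalk {
    public static void main(String[] args) {
        int[] data = new int[64 * 1024 * 1024 / 4]; // 64 MB, far bigger than any L2
        long sum = 0, t;

        t = System.nanoTime();
        for (int i = 0; i < data.length; i++) sum += data[i]; // sequential: cache-friendly
        System.out.printf("sequential: %d ms%n", (System.nanoTime() - t) / 1_000_000);

        t = System.nanoTime();
        int stride = 4096 / 4; // successive accesses land 4 KB apart
        for (int s = 0; s < stride; s++)
            for (int i = s; i < data.length; i += stride) sum += data[i]; // cache-hostile
        System.out.printf("strided:    %d ms%n", (System.nanoTime() - t) / 1_000_000);

        if (sum == 42) System.out.println(); // keep sum live so the JIT can't drop the loops
    }
}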

And that’s what we were seeing – we were moving around so much memory that the caches kept getting “misses” instead of “hits”. According to the specs on the Sun pages, the system can move 3.1 gigabytes of data around per second, and although that’s quite respectable, our benchmark blew right through it.

This means, of course, that our benchmark isn’t testing Xalan or the CPU; it is testing how much memory a system can move around per second, and that’s something you can simply read from the specs. So we’re going to modify the benchmark a bit, so that the workload we present to the system more accurately represents something you’d encounter in the real world. After all, no server is simply counting nodes all the time.

It also shows that access to memory is becoming more of a bottleneck these days. If you buy a new PC and you bother to read the specs, you’ll see something like “800 MHz FSB”, which roughly means that memory can be accessed at a frequency of 800 MHz. Since the CPU you’ll stick in such a PC probably runs at 3 GHz, the CPU effectively has to slow down from 3 GHz to 800 MHz every time it needs to go to main memory. Caches are very important, and the speed of main memory is getting more and more important as well.

If Sun wants some advice on how to improve this machine: don’t just go for faster CPU speed, improve bandwidth to main memory as well…

But then again, that should not be a surprise to them – every computer has this problem.

I’ll post again on XML and XSLT when we’ve made the benchmark more realistic…



Comments:

  1. you’ll notice that the CPU has to slow down from 3 GHz to 800 MHz every time it needs to go to main memory. But I imagine that this is precisely where the hyperthreading/coolthreads stuff kicks serious butt, since the cost of a context switch to another active thread is nearly zero. So I wonder if they set up the architecture so that they switch threads on a cache miss?

    Are you really pushing around megabytes of XML for a web based application? This is not an end user application then, I take it.

  2. Yes, I expect that this is indeed exactly the situation where a context switch in the CPU will occur – however, in my little test all threads were doing the same thing, and there was no thread with work to switch to. The “megabytes” case is an extreme one; most xml is between a few and a few hundred KB.

  3. I know this test isn’t about XSLT processors, but can you try this test with Saxon8 also? Typically Xalan’s performance as compared to anything else in the field, well, sucks…

  4. I’ll have a chat with Matthew (author of the tests) about it..

  5. Even if all threads are running the same code, they’re unlikely to be in lockstep, so they’re all on a different line of the code. If you have 4 processors running on one hyperthreaded CPU core, and the memory hit is a 4 cycle hit (800MHz compared to 3GHz), then they’d all have to have a cache miss within 4 cycles of each other to completely stall the CPU–otherwise there should always be a runnable thread. No?

  6. What kind of operations require hundreds of Kbs of XML in a customer facing application?

    I’ve always thought that XML seemed really verbose; is there clever compression that reduces the amount of storage overhead eaten up by all the tags?

  7. What kind of operations require hundreds of Kbs of XML in a customer facing application?

    Take a nokia phone. Write out all the variants you sell – with subscription, prepaid, etc. Write out all the accessories you sell with it. The size of your xml message will grow pretty fast. And clever compression? That’s only useful for transport, and if bandwidth is an issue during transport. For transformations you turn it into a DOM tree.

  8. Apparently they *are* having a cache miss within 4 cycles. But then again, you’d have to know how “deep” the bubble is (your guess of 4 cycles may or may not be correct), what the instruction stream looks like, etc. And since this is Java, I can’t just look at the compiled code and guess…

Niagara vs ftp.heanet.ie Showdown

Posted on March 31st, 2006 at 13:12 by John Sinteur in category: Sun Coolthreads T2000

I’m not going to publish any benchmarks on how apache does just by itself on the T2000. The reason is simple: I don’t have a gigabit switch where the machine is currently located, just a 100 Mbps switch, and the results Colm MacCárthaigh got show that the T2000 can saturate all four of its gigabit interfaces. Instead, I’ll concentrate on applications in the near future. Here are Colm’s results, summarized:

[Quote:]

How many requests the machine can handle in a second is probably the most valuable statistic when talking about webserver performance. It’s a direct measure of how many user requests you can handle. Fellow ASF committer, Dan Diephouse, has been producing some interesting stats for requests-per-second for webservices (and they are impressive), however we were more interested in how many plain-old static files the machine could really ship in a hurry. And without further ado, those numbers are;
rps.png

[..]

concurrent.png

As you can see, the T2000 was able to sustain about 83,000 concurrent downloads, and my limited dtrace skills tell me that thread-creation at that point seemed to be the main limiting factor, which is hardly surprising. For us, that number represents an upper limit on what the machine could handle when faced with a barrage of clients. Of course, no server should ever be allowed to get into that kind of insane territory, but it’s always good to know that there is plenty of headroom. More to the point, it means that availability at the lower levels of concurrency is much higher.

[..]

convlat.png

Overall, the T2000 performs very impressively. At very low numbers of concurrency, it actually has a higher latency than either of the Dell machines we tested, but these latencies are of the order of tens of milliseconds. In other words, the network latency makes a bigger difference in the overall scheme of things.



SWaP

Posted on March 31st, 2006 at 12:51 by John Sinteur in category: Sun Coolthreads T2000

As I said in an earlier post on the T2000, most datacenters have a limit on how much power one rack cabinet can draw. Here in the Netherlands, the max is mostly set at about 4000 watts, or about 16 amps. A few years ago I found out the hard way that (Intel-based) computers draw significantly more during power-up: the Amsterdam datacenter where I was hosting my own servers had a (very rare) power outage, and when power came back, the entire rack of computers tried to boot at the same time and tripped the 16 amp breaker. I had to come in (early on a Sunday, of course) and switch the machines back on one by one.

A quick test shows the Pentium 4 machine I’ve been testing with draws about 1.6 amps, and the T2000 about 1. I’ll probably do some performance tests with a Sun V440 later on as well, but I don’t know offhand how much power that machine draws – more, if I recall correctly.

Anyway, you do the math: fill a cabinet with Pentium 4s and see where you max out, then fill a cabinet with T2000s – the T2000 wins hands down on “performance per watt”.
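
The math, for the lazy – a sketch using the steady-state draws measured above, ignoring the boot surge:

public class RackMath {
    public static void main(String[] args) {
        double rackAmps = 16;                 // the ~4000 W / 16 A rack limit
        double p4Amps = 1.6, t2000Amps = 1.0; // measured draws from this post
        System.out.println("Pentium 4 boxes per rack: " + Math.round(rackAmps / p4Amps));    // 10
        System.out.println("T2000s per rack:          " + Math.round(rackAmps / t2000Amps)); // 16
    }
}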

So it won’t just save you a lot on floor space and your energy bill, it may just save you from an early Sunday trip as well…



MSN vs Google

Posted on March 31st, 2006 at 9:54 by John Sinteur in category: ¿ʞɔnɟ ǝɥʇ ʇɐɥʍ, Sun Coolthreads T2000

If you’re wondering why I prefer Google over MSN, check out the search results at MSN for my own name:

http://www.mymsnsearch.com/results.aspx?q=John+Sinteur&FORM=35TpnYVudYne

update: Okay, so maybe I should have waited a day before posting it and turned it into an April Fools’ Day joke – regular MSN users would have spotted the strange URL anyway – but Maarten pointed out something cool in the real MSN search results for my name. Here’s a screenshot:

technorati.jpg

The “odd” thing here is that the Technorati results and my postings on the CoolThreads T2000 machine rank higher than my work on the anti-spam plugin for WordPress, even though that has been on my website much, much longer. According to my statistics, I still get lots of visitors daily for the anti-spam plugin, and although the T2000 postings are getting lots of hits as well, the ranking of the search results is a surprise to me…

Oh, and if you want to create your own joke MSN search results, go here.



Comments:

  1. Heh heh heh heh. Nice try, mister!

Why test this machine?

Posted on March 30th, 2006 at 9:01 by John Sinteur in category: Sun Coolthreads T2000

To show you why the T2000 is an interesting machine for me to look at, I’m going to have to tell you a few things about large websites. I’m going to tell you what the concerns of the people maintaining such a site are like, and how the T2000 fits in.

To do that, I’m going to show you some graphs and numbers. Now don’t come talk to me afterwards and say things like “I notice KPN is doing X and Y” because I’m going to fudge the numbers. The trends and concerns I’m showing you are real, the numbers are not. The real numbers are confidential, of course.

Here’s what an average day looks like for a large Dutch web site:

day-graph.png

As you can see, the busy hours correspond to “working hours and after dinner”. No surprise there. If you’re doing a website with an international audience, you’ll probably get a graph that is 1) much flatter, since your audience is not bound to one timezone, and 2) a reflection of international internet usage, so you’ll see a peak during US peak hours and another during European peak hours, for example.

You’ll see the same pattern in your applications. Here’s the thread count for one of them:

thread-graph.png

Although those graphs mostly match, if you put them on top of each other some differences start to show. That’s because the things people do on your website during daytime hours are different from the things they do during the evening hours. During the daytime, lots of people are using their work computer to browse the net. Any personal business they have with you – bills, or settings and configuration changes on products they buy from you personally – will likely wait for the evening browsing session. So the actual work shifts around from one application to another over the day. The graph shows just one application – and if you notice that this small part alone has quite a number of threads, you’ll understand why I’ve been talking about threads so much.

Remember the new logo KPN introduced a little over a week ago? Well, guess where the peak in this graph comes from:

piek-graph-2.png

Although that peak is interesting, look at the trend. Our real trend numbers are different and go back further than that, of course, but this particular component of the website was introduced last summer. The point I’m trying to make is: some of your workload may double in six months, and some of it may double in six minutes. Either way, you want to be able to deal with it – by limiting a certain kind of traffic, and/or by allocating resources to things you feel are more important than the things that are getting hammered.

To summarize a few points:

– peak traffic and average traffic are different things. You need to accommodate peak traffic, but only up to a point. If you’re able to differentiate your workload, you can allocate resources to the parts that need them more, from a business perspective. You want to prevent views of a new commercial from influencing what people see when they log on to your website to check their bill. The bills are more important.

– you want to be able to move capacity around – if marketing launches a new campaign that will land you a bunch of new customers for a particular product, you want to be able to allocate extra resources to the part of the application that handles registrations for that new product, for example.

– you want to plan for growth. Lots of growth, and sometimes faster than the business expects.

If you’d buy one large computer as your web server, these things are probably going to be very difficult to do, unless you use virtualization to have that large computer pretend it is a lot of small ones.

What else is a factor? What costs money?

First of all, purchasing the computer. If you buy a large computer from, for example, Sun, you’re getting a machine that is good at a lot of things, some of which you’re not going to need. A webserver typically doesn’t need blazingly fast disk arrays – a webserver needs to read some files from disk and write log entries, but the files most requested by the web clients will typically live in memory buffers. A well-tuned website is limited only by the amount of CPU power it has; disks and network should not be the bottleneck, and if they are, you need to tune your site. So your database server will probably need those disks, but not your webserver. And if you buy that large server, you are paying for that disk-throughput capacity, because the kind of applications that machine is really built for do need it.

Second: floor space in your datacenter. Well, not just floor space, but increasingly important: power in your datacenter. Most datacenters have a limit on the amount of power they can deliver to one rack, based on the electrical wiring and the amount of air conditioning they have. In older datacenters you’re not going to be able to fill a rack with power-hungry Pentium 4 machines without running into those limits (sometimes even before you fill half a rack). The computer industry knows this, of course, and that’s why “performance per watt” is such a big thing. Sun calls it SWaP, the Space, Watts and Performance metric.

Third: personnel. In your typical datacenter (ours does much, much more for us) you will typically employ one technical guy and a dog (the technical guy removes broken hardware from the racks and replaces it with new hardware, and the dog is there to bite him if he tries anything else), but somebody has to maintain the servers: apply patches, stop and start applications, and so on. If you have a large number of little servers, you have to do something about maintenance, or you’d get a lot of administrators doing nothing but keeping up to date with patches. Sun has a lot of software to help out with this; HP has software to maintain their blade servers. It’s an area where a lot of development is being done, specifically to address the points I mentioned earlier in this posting: shift workloads around, allocate resources dynamically, react swiftly to changing circumstances. I would love to see the tools that Google developed for their server park.

If you combine everything I’ve shown you and said so far, it’s obviously easier to be flexible and dynamic if you have a large set of small resources and the right management tools for them, so you either buy a large machine and do virtualization into lots of small virtual servers, or you buy small machines, and you manage them in groups/clusters or whatever you want to call it.

And once you know how well a certain machine does the kind of work you want it to do, it becomes a simple spreadsheet calculation to find which set of machines gives you the most bang for the buck, where that buck includes purchase, power and space, and maintenance personnel. That also allows you to compare wildly different beasts – for example a few Sun Fire V1280s, a larger set of Sun Fire V440s, a big set of Sun Fire T2000 CoolThreads, or a really big set of HP blades – even though they are wildly different products.
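
That spreadsheet could be as simple as the following sketch – every figure in it is a made-up placeholder, purely to show the shape of the comparison:

public class BangForBuck {
    // Total cost over a planning horizon divided by delivered performance.
    // All numbers below are hypothetical placeholders, not real prices.
    static double costPerPerf(double perf, double purchase,
                              double powerSpacePerYear, double adminPerYear, int years) {
        return (purchase + years * (powerSpacePerYear + adminPerYear)) / perf;
    }

    public static void main(String[] args) {
        System.out.printf("option A: %.1f per unit of work%n",
                costPerPerf(100, 12000, 900, 400, 3));
        System.out.printf("option B: %.1f per unit of work%n",
                costPerPerf(60, 5000, 1300, 400, 3));
    }
}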



Hammering Apache

Posted on March 29th, 2006 at 8:33 by John Sinteur in category: Sun Coolthreads T2000

Promoted from the comments section:

Out of curiosity, what is the max % of CPU usage reported for a thread that’s hammering away at a compile? (In the screenshot the max is 2.8% but it’s sorted and I imagine some scrolled off the screen.)

For single-threaded processes I’ve seen it hover around 2.8 to 3%; for multi-threaded processes I’ve seen larger numbers, of course. I’m not too sure that exact number is very valuable in the larger scheme of things. I hammered apache on it yesterday, on a copy of my weblog (if you go to the machine’s URL, you’ll see this weblog with an older copy of the postings database), and experimented a bit with the number of threads requesting pages and the number of preforked apache processes. Sun recommends using the prefork model, but I’m also going to try the threaded MPM we’re using on the other Suns, since I didn’t really like the behaviour apache was showing – it rapidly grew to using 4.5 GB of memory, which might be a problem for us, since we’ll also be running a few Java virtual machines on it with between 2 and 3 GB of memory allocated to each of them. In the prefork configuration it served between 100 and 400 weblog pages per second without breaking a sweat, depending on how I put a load on it.

Anyway, coming back to your question: the /server-status page in apache said apache was using between 400 and 950% CPU, which is a different number than you’d get by adding up all the 2.8% and 3% processes. In the screenshot you saw mostly 2.8%, and not the “theoretical” 3.1% you’d get by dividing 100% by 32 CPUs; I think that’s because the cc processes were fairly short-lived, each compiling just one file. If you run a single-threaded task and let it live longer, you’ll see prstat report 3 and 3.1%.



How many cores? How many CPUs?

Posted on March 28th, 2006 at 13:27 by John Sinteur in category: Sun Coolthreads T2000

If you read the documentation on the Sun website, you’ll notice that the “medium” configuration I’m currently testing has “8 cores”, which, if you remember my previous post on the T2000, you might interpret as being capable of doing 8 different things at the same time. In the startup messages I’ve posted, you could see it presents itself to the operating system as 32 CPUs, which you might interpret as being capable of doing 32 different things at the same time.

So which is it? 8? 32?

With these new ways of doing processing and having multiple CPUs in the same chip, sharing some resources, it isn’t as black and white as it used to be back when each CPU in a computer was a separate chip. So, let’s see if we can find out what the “real” number is.

Maarten asked me why I didn’t do a parallel build of some software, and that’s exactly what I used to see if I could find an answer. PHP is an open-source computer language; this weblog uses it to generate the web page you’re reading. Compiling that software takes 3 minutes and 5 seconds on my 3.2 GHz Pentium 4. If I use exactly the same build method on the T2000, it takes about 27 and a half minutes. Sounds awful, but remember that this method would use only about 1/32 of the capacity of the Sun. Maarten’s question was why I wasn’t using the tools that would split the job into a lot of small pieces that could be done in parallel.

Part of the job can be done in parallel – PHP is a large collection of small source files written in the C language, and compiling each of those files can be done independently from all the others. The results must then be combined into a few libraries and applications, and those steps cannot be done in parallel. A rough guess is that about half the work of building PHP cannot be done in parallel, and must therefore be done the “slow” way. This shows that building software is not the best thing to buy a T2000 for, but it allows me to test the machine in one way: how many things can be done in parallel? 8? 32? A number in between?
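
That rough guess maps directly onto Amdahl’s law: if half the work is serial, no amount of parallelism will ever make the build more than twice as fast. A quick sketch, assuming that 50% estimate:

public class Amdahl {
    // speedup(p, n) = 1 / ((1 - p) + p / n), with p the parallel fraction
    // of the work and n the number of CPUs.
    static double speedup(double p, int n) {
        return 1.0 / ((1 - p) + p / n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 8, 16, 32})
            System.out.printf("n=%2d  speedup=%.2f%n", n, speedup(0.5, n));
        // n=32 yields only ~1.94x: exactly why a compile farm is a poor fit.
    }
}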

The GNU version of Make has a parameter that tells it how many jobs it attempts to do in parallel. As this Sun article says:

Your results will vary based on the particular compiler, options, and language being compiled, as well as whether the sources are local or remote. A common rule-of-thumb is to request the number of parallel jobs to be approximately 1.5 times the number of available CPUs on the machine.

I decided to do the reverse: build PHP repeatedly, with a different number of parallel jobs each time. Afterwards, look at what job count the build reached its best speed, divide that by 1.5, and take the result as “the number of CPUs” in the classic sense.

Here’s a screenshot of what a parallel build looks like:

parallelmake.jpg

As you can see, plenty of “cc1” processes doing work, and a respectable load average. This screenshot was made when using 40 jobs, and there were quite a number of processes in “runnable” state according to vmstat – which means there were processes waiting for an available processor, which indicates that the “classic” number of CPUs is somewhat less than 40 / 1.5.

So here’s the actual graph:

pbuild.jpg

On the horizontal axis you’ll find the “-j” parameter I gave to make: the number of parallel jobs to use. The vertical axis is the number of seconds the build took. As you can see, with 1 job – the non-parallel way to build – it took over 1600 seconds, and the machine was mostly idle during that time. Increase the number of parallel things the machine is allowed to do, and performance improves rapidly. Optimal build times appear at around 24 jobs; things don’t get much faster by allocating more jobs, so the machine is pretty much using all available resources at that point.

Divide that by 1.5, and you get 16 CPU’s.

So here’s the dilemma Sun must have faced: if they’d told the operating system that this hardware had 8 CPUs available, the operating system would have scheduled no more than 8 threads executing at the same time, and capacity would have been wasted. But how high should they go? It all depends on the working set that needs to be done and the kind of application that runs – and since not every application needs “just” CPU, but disk and network access as well, it’s difficult to get the number exactly right for all tasks at hand. So they probably did tests similar to this simple one I did, and concluded that the safe number was to tell the operating system that 32 CPUs are available. If the operating system actually has 32 threads that need work at the same time, some of them will be slowed down a bit, but no capacity is wasted. It’s better to overstate the available capacity a bit and work up a run queue than to lose capacity because the operating system thinks it has fewer CPUs than it does.
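
Whatever the “real” number is, the OS-visible count is what generic software sizes itself by. A trivial check, plus the 1.5x rule of thumb from the Sun article quoted above:

public class CpuCount {
    public static void main(String[] args) {
        int n = Runtime.getRuntime().availableProcessors(); // reports 32 on this T2000
        System.out.println("OS reports " + n + " CPUs");
        System.out.println("rule-of-thumb parallel make jobs: " + (int) (n * 1.5));
    }
}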

But don’t use the above timing numbers to compare it to the 3 minutes my Pentium 4 took – a large part of the build process cannot be done in parallel, and more realistic “benchmarks” will be done later… and you’ll probably find all kinds of benchmarks on the web anyway. I wanted to take a slightly different look at what drives the performance of this fairly unique machine…



Comments:

  1. Out of interest, is your Pentium a hyperthreaded one too?

  2. yes, it is. However, in FreeBSD 5.4 on this box, it still shows as just one:

    suske# sysctl -a | grep cpu
    kern.threads.virtual_cpu: 1
    kern.smp.maxcpus: 1
    kern.smp.cpus: 1

    dev.cpu.0.freq: 3194

  3. In FreeBSD 5.4 hyperthreading is supported by default, here’s a snippet from the boot messages:

    CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz (3192.01-MHz 686-class CPU)
    Origin = "GenuineIntel" Id = 0xf41 Stepping = 1
    Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
    Hyperthreading: 2 logical CPUs

  4. Out of curiosity, what is the max % of CPU usage reported for a thread that’s hammering away at a compile? (In the screenshot the max is 2.8% but it’s sorted and I imagine some scrolled off the screen.)

  5. You’re suggesting that they’re faking the number of processors (32) in the OS, but the chip actually has 8 cores that are designed to handle 4 simultaneous threads each, so there’s a real hardware basis for the number 32. Here’s a quote from a technical paper:

    Each core has a simple single-issue 6-stage pipeline where instructions from all 4 threads are interleaved per cycle with zero thread-switch cost, maximizing pipeline utilization. When any thread is blocked by a cache miss or branch penalty, the other threads issue instructions more frequently, effectively hiding the miss latency of the first thread.

    From: http://opensparc.sunsource.net/nonav/publications/D05_01Aut2.pdf, found among other papers at: http://opensparc.sunsource.net/nonav/pubs.html

  6. no, no, I’m not claiming they’re faking the number of processors; I’m saying the technology cannot be compared to 32 actual separate CPUs – as your quote clearly demonstrates, and as my “test” shows on a practical level. Interleaving threads is a good basis for presenting the OS a certain number of CPUs; you’re absolutely right that this is the source of the number 32 – I must have missed that in my research, that’s a good find. Whether the interleaving has a positive effect on available CPU capacity for actual work depends on the type of workload – there’s a reason Sun tells you what the machine is very good at.

  7. I was reacting to your: “So they probably did tests similar to this simple one I did, and concluded that the safe number was to tell the operating system that 32 CPUs are available.” They may well have done tests, but the bottom line is that they built support for 32 simultaneous threads, so it identifies as a 32-processor machine.

    I wonder what the impact on compiler smarts is going to be. Why bother to work hard to avoid stalling the pipeline when you know there are 3 other threads that’ll jump in and utilize the processor?

  8. Correct, I was probably wrong when I said that. And yes, compilers suddenly became even more interesting with this chip, that’s right – but since this chip shines at threading, and the easiest language to do threading in is Java, I expect the same thing to happen in virtual machine technology..

T2000 in non-technical terms

Posted on March 26th, 2006 at 11:45 by John Sinteur in category: Sun Coolthreads T2000

Most of the reviews I’ve found on the net for this machine are pretty technical in nature. Let’s see if I can write something you can use to convince your non-technical manager that this computer is interesting. Again, feel free to ask for clarification on any point..

Most people are only familiar with the computer that sits on their desk and is used to browse the Internet. It has one important chip in it, usually the one that gets them an “Intel Inside” sticker on the box, and the salesman has told the buyer that this “CPU” is a “3 GHz Pentium”, and that it is very fast. And, indeed, it is. This T2000 computer from Sun also has one important chip in it, the CPU, and it’s “only” 1 GHz. So if you didn’t know any better, you’d think this machine would be a third of the speed of the machine you’re browsing on, right? Well, no. Not at all.

That Intel chip generally has one “core” on it – you could compare it with an office where one person was working very fast. The Sun Coolthreads chip has more than one core on it – compare it with an office that has 32 people working in it. Although each of those 32 people is working slower, together they can perform massively more work than the office with one person in it.

This kind of multiple-core technology will reach the desktop soon enough – Apple already has a laptop available with a new Intel chip, the “Core Duo”, and if you guessed from its name that you could compare this with an office with 2 people working in it, you’d be right.

This trend to change chips from “one core” to “more cores” has been going on for a while now. Why did that happen, exactly?

Apologies to Intel, but I’m going to use them to illustrate some points; the things I say are more or less true for the entire industry – Intel just happened to be the most visible example of all this. A few years ago, there was a “MHz war” going on. Intel and their competitors (both AMD and the Power consortium) were marketing to their customers with “we are faster because our chip runs at a higher rate”. Although this used to be true, it also forced the engineers at those companies to look at just one thing for the next version of their chip: higher frequencies. One of the things the engineers realized is that if you chop your work into smaller pieces, you can get through more pieces per second, and thus reach a higher rate.

How does that work for a computer? Let’s say we have a program in memory that says “add one to whatever number is in that piece of memory”. Sounds like a simple thing, right? But you can split it into a lot of small steps: fetch the program instruction from memory to the chip, decode it to see what it wants to do (“add 1”), fetch the number that is currently in memory, add one to it, store it back into memory. The trick Intel and others invented was to have different parts of their chip do the different steps. In older chips, after the “fetch the program instruction from memory” bit was done, no new program instructions were fetched until the entire “add 1” operation was completed. These days, that’s no longer true: the part of the chip that fetches instructions from memory will go on and fetch the next one while the rest of the chip is still busy adding one to a number. Clearly this can be faster than waiting. However, suppose the instruction after “add 1” is “if the result is 100, do this, otherwise do that”. Which flow of instructions should be fetched next? Modern chips have logic for that, called “branch prediction”, and the chip will take a guess. Sometimes this guess is wrong, of course, and then the chip has to back up and redo a small bit of work. The overall speed gain from “guessing” is worth the occasional miss.

This idea is called “pipelining” and it was so perfected in the Pentium 4 that the chip is known to have a “long pipeline”. That means that the entire process of doing work on the chip was chopped into so many pieces it has to go through a fairly large number of stages to get done. The advantage is high clock rates and thus good marketing material, but a fairly large “cost” in speed if somewhere in the pipeline it is discovered the wrong “guess” was made. Every time that happens a large part of the pipeline is cleared and must be refilled, and that costs you speed and processing power.

The Pentium 4 competitors used different methods to get speed from their processor, and although they’ll claim a lower frequency (AMD is typically at 2 or 2.2 GHz) they’ll give you the same amount of actual work as a Pentium 4 at 3 GHz.

A while ago, Intel engineers found themselves running into a few technical problems getting the clock rate any higher, and since then it has become clear that to get speeds up again, something else had to be done. The new Intel “Core Duo” is the first big result. Intel went more or less back to the Pentium 3 design and evolved from there in a different direction: instead of making a longer pipeline, they doubled the chip. Instead of one program doing “add 1 to a number”, the chip can have two programs both doing “add 1 to a number” at the same time. This isn’t really a new idea; it had been done before – either by actually sticking two processors in the same computer, or by dividing up work inside the chip itself. For example, a calculation with two “floating point” numbers (such as 3.14159 times 2.718) would be done in a different part of the chip from calculations with “integer” numbers (such as 2 times 3), and those two parts could work on different calculations at the same time. This multiple-core thing is more or less the same idea, except now just about all the functionality of the chip is duplicated. There’s a whole bunch of extremely technical stuff I’m glossing over right now, such as sharing the memory that is on the chip (level 1 and level 2 cache), but if you want to read about that, there are other places on the Internet than this post.

So, back to the T2000 and the CoolThreads chip. The Intel Core Duo presents itself as two processors to the operating system. The CoolThreads chip in the T2000 I'm evaluating presents itself not as one, not as two, but as 32 processors. None of them is blazingly fast – each presents itself as a 1 GHz chip – but it sure makes up for that in quantity.
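
If you're curious what your own operating system sees, one line of Java will tell you (the numbers in the comment are what I expect based on the above, not measured output):

    public class CpuCount {
        public static void main(String[] args) {
            // The number of processors the OS presents to applications:
            // 2 on a Core Duo, and it should be 32 on the T2000.
            System.out.println(Runtime.getRuntime().availableProcessors());
        }
    }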

It also means there are a lot of things this chip is not good at. You probably would not want to run Word or Excel on it: you'd be doing only one thing at a time, so one of those 32 parts would be working for you while the other 31 sat idle.

Sun also clearly states on their web site that this chip is not good at floating point calculations. So, if you need a machine that is good with floating point and has multiple processors, you're probably still going to end up with an Enterprise 6900, which has 24 separate UltraSparc IV chips, each at 1.5 GHz. But that machine costs a cool million dollars, and my T2000 is listed for just a little bit more than 12,000 dollars. Clearly the T2000 is limited compared to the E6900, but there are a few things the T2000 excels at (and might even give the E6900 a run for its money – I'd love to test drive one of those for 60 days and find out).

The Sun web site calls it "the fastest web server", and not without reason. Let's look at what a busy webserver does: serve web pages to lots and lots of people. Some of those pages have to be generated on the fly (for example because they contain personalized information). Lots and lots of websites these days do this on a server with a Pentium 4 in it. The server hosting this weblog, for example, is a computer with a single chip in it: a Pentium 4. That may change in the coming week or so, as the software I'm installing on the T2000 should be able to handle my weblog nicely, and that's a great thing to try. I get between 3,000 and 4,000 visitors on my weblog per day, spread out over the day. That's not much, but imagine the other website I'm working on, where those same 3,000 to 4,000 visitors browsing the site at the same time would be considered a quiet moment. Mind you, when I say "at the same time" I mean they're browsing at the same time, each requesting something like ten to fifteen webpages during the ten minutes their visit lasts – back of the envelope, that's 4,000 visitors × 15 pages / 600 seconds ≈ 100 page requests per second, sustained.

Back to that same image of the office with one worker in it: if that one worker had to serve web pages to those 4,000 people, he would have to switch jobs so often that the overhead of switching would hurt performance. That same office with 32 slower workers, however, would serve those 4,000 people a lot better. The kind of work (composing web pages and handing them out over the net) and the nature of the work (lots and lots of jobs with no dependency on each other – each visitor gets their own pages, which have no relationship to the thousands of other pages being generated at the same time) make the T2000 a perfect match.
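
In code, that office with 32 workers and a pile of independent jobs looks roughly like the classic thread-pool pattern below – a sketch of the idea, not anybody's actual web server, and buildPage is a hypothetical stand-in for generating one visitor's page:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PagePool {
        public static void main(String[] args) {
            // One worker per processor the OS reports: 32 on the T2000.
            int workers = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int request = 0; request < 4000; request++) {
                final int id = request;
                // Each visitor's page is an independent job; no job waits
                // for, or depends on, any other.
                pool.execute(() -> buildPage(id));
            }
            pool.shutdown();
        }

        static void buildPage(int id) {
            // hypothetical stand-in: fetch data, compose HTML, send it out
        }
    }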

Now, since most of my work involves getting webservers to handle lots and lots of visitors, you can see why I'm test-driving it.

Whilst installing software I've already seen the first effects of the way this machine works. Building software is sequential work – the compiler will generally do only one thing at a time – and the machine does not "feel" fast when I'm doing that. But some software allows me to install two parts at the same time, so I open a second window and start a second build there. That's not how I'm used to doing things, since when you do that on a machine with a Pentium in it, you'll notice both builds go slower. The total amount of time it takes to build both pieces of software stays the same (or sometimes goes up, since you add task-switching overhead). With this T2000, that is very clearly not the case. Build three or four pieces of software at the same time, and you won't notice any slowdown in any of the builds. It helps that the machine has nice, fast little disks, of course, since the build results have to be stored on disk, but it's a nice indicator of things to come.
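
What I'm doing by hand in two windows boils down to something like this (a sketch with hypothetical source directories; a parallel make, as a commenter suggests below, would be the more elegant route to the same effect):

    import java.io.File;

    public class TwoBuilds {
        public static void main(String[] args) throws Exception {
            // Start two independent builds as separate processes, just like
            // opening two terminal windows. The directories are made up.
            Process a = new ProcessBuilder("make")
                    .directory(new File("/export/src/packageA")).inheritIO().start();
            Process b = new ProcessBuilder("make")
                    .directory(new File("/export/src/packageB")).inheritIO().start();
            // Wait for both; on the T2000 neither should slow the other down.
            System.out.println("exit codes: " + a.waitFor() + " and " + b.waitFor());
        }
    }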


Write a comment

Comments:

  1. Hey John, will Sun give me a free server box if I write about it on my weblog?

  2. PS: why don’t you use a parallel make for large builds instead of starting another build in another shell?

  3. I will do a parallel build later this week, I haven’t got my toolset complete yet.

  4. Ask them ;-)

    When you fill out the forms they want to know why you're evaluating the machine. For my work, there's a clear and obvious need to know if this machine is useful – and the try-60-days program is a godsend. Rumor has it that they are indeed saying "keep it" to some people who write about it, but I would have written this anyway: it's a good way to organize my thoughts, and I have to present my findings to others as well. That will be a more formal report, but my informal thoughts are highly regarded at KPN :-)

    Having said that – I’m sure somebody at Sun will be reading this weblog, and indeed this comment, in the near future, so let the record show that I would welcome such a message from Sun – it’s a very nice machine, but since my boss pays me more than this machine is worth, it will not influence the conclusions I am going to reach a few weeks from now.

Unpacking the T2000

Posted on March 23rd, 2006 at 20:26 by John Sinteur in category: Sun Coolthreads T2000

It’s pretty obvious the system was made to be bulk-installed in a 19″ rack in a datacenter. The 19″ rackmount slides are a dead giveaway, and the only thing looking remotely nice is the 2U high front of the machine. Other than the machine and the rails, there’s a packing list and a few bits of legal documentation, and that’s it. Nothing more. The outside of the box is self-documenting:

IMG_0881.jpg

and the inside of the box is very, very neat and tidy:

IMG_0882.jpg
(if you’re wondering about those cables, the disks are SAS – two very small sized 74Gb disks, and two empty bays)

Installation is easy – any Sun admin will recognize the steps:

1. connect a serial cable to the console port, and to a PC or terminal server
2. connect both power cords to power, connect any network cables you need
3. set your favourite LOM settings, (including the network address of the admin interface)
4. type “poweron”

At that moment the machine starts making far more noise as all the fans kick in. The Solaris 10 initial configuration starts up, also well known to any good Sun admin.

No need for other documentation or CDs – everything is preinstalled.

So, no surprises, really. And that’s how it should be. Tomorrow I’ll talk a bit about the target market of this machine, and why I’m personally very much interested in it. If any of you reading this has questions at any point, either technical or non-technical, please don’t hesitate to ask.

Meanwhile, I’ll leave you with the boot messages, after the link…

Read the rest of this entry »


Write a comment

Comments:

  1. 32, approaching 42 real quick.

Progress

Posted on March 23rd, 2006 at 14:17 by John Sinteur in category: Sun Coolthreads T2000

My first computer, ever, was an Apple ][+, way back in the early '80s. Every time I install a new computer small enough to fit on my desk, I do a back-of-the-envelope calculation of how much faster the new box is.

I think I broke the 10 million barrier today. It's comparing apples to Suns rather than oranges, of course, but this T2000 is somewhere between 10 and 50 million times as fast as my very first box.

I can’t think of another occupation where the tools have increased in power this much… even if my calculations are off by an order of magnitude…


Write a comment

A box arrived

Posted on March 23rd, 2006 at 14:12 by John Sinteur in category: Sun Coolthreads T2000

I’m going to evaluate a Sun Fire T2000 server, with the new CoolThreads stuff. Sun is running a 60 day evaluation deal, and I’m going to make it go through its paces with humongous amounts of Java stuff (for KPN) and for the stuff I do with my own company, mostly apache, mysql and php, with a bunch of mail thrown in for good measure.

The box is still sealed at this point. So why post already? Because the transport company Sun picked has broken the record for the most flexible and friendly delivery I've ever dealt with. I called the driver's boss afterwards to deliver my compliments... That doesn't happen often, and the guy who picked up the telephone at Bos/TKS was quite surprised. Most people don't bother to call unless they've got a complaint, but I feel service beyond the call of duty deserves recognition as well.

Anyway, this evaluation is definitely off to a good start. I'll post more of my experiences here – as well as the reasons I decided the T2000 is a worthy candidate for evaluation; as some of you know, I prefer to build my own servers...


Write a comment