Simulation speed

© Copyright 2018 Energy Systems Research Unit, Glasgow, Scotland
Those who use simulation tools are familiar with waiting for the kettle to boil. Unlike the impatient cook, simulation users can actually influence the resources required to carry out tasks -- if we understand the resource implications of how we design our simulation models, the workflows we use and the bottlenecks within our computing platforms. Improvements of an order of magnitude in specific tasks are possible.

The method used to highlight these opportunities is the creation of a test matrix which includes simulation models at various levels of complexity, various assessment goals, various simulation frequencies and performance-data-to-save directives, and various computational rigour directives, across a mix of legacy and current computing platforms and operating systems. And just to spice it up we look at the implications of the tool chains used to build simulation tools and run alternative simulation tools across the matrix.

What we found:

  • The computational demands of simulation tools and simulation projects are only roughly correlated with the rankings generated by standard benchmarking tools. The test matrix used in this study helps to focus attention on the modelling, hardware and working-practice issues that impact timings.
  • Running ESP-r in a virtual environment roughly halves its computational speed. Virtual disk speeds are also likely to be half of what you expect, and because there is generally less RAM you lose much of the efficiency of post-processing from the file buffer and have to fall back on the really slow disk access. There is a smaller computational and post-processing hit for EnergyPlus.
  • If you are setting up a virtual computer for simulation use, make sure there is 20-30GB of free space rather than 10GB because you will easily fill this with model and performance data files. If possible give the virtual computer more than 2GB of RAM and run it on an SSD rather than a rotational disk.
  • ESP-r runs are almost as efficient on Windows 10 as on Linux (everything else being equal) but post-processing suffers from slower disk access. EnergyPlus on Windows 10 is roughly in line with Linux.
  • Computer memory can speed up post-processing tasks by one or two orders of magnitude if it allows data recovery from the file buffer as opposed to disk access. Subtle changes in simulation task ordering can also reduce elapsed time by an order of magnitude. An investment in additional RAM is often much more cost effective than the latest-greatest CPU.
  • Running assessments in parallel can cut elapsed time (even on ARM single-board-computers) but risks being disk-bound for seasonal or annual assessments writing large performance data sets. Post-processing in parallel also risks being disk-bound to the point where post-processing can take more time than the original assessments.
  • The constrained models used for simulation training can run focused and seasonal assessments in less than a minute (usually in the order of seconds) on just about the full range of computer kit tested. This suggests that legacy computers and even ARM based single-board-computers (a dozen fit in a briefcase) can be used for training workshops.
  • The benchmarks suggest Conduction Finite Difference is not as horrific as many practitioners assume and for many projects there is no need to fall back on the legacy CTF solver in EnergyPlus.
  • If you have recompiled a simulation tool to introduce a new facility, be sure to switch to an optimized build for distribution once you have finished debugging, because the performance hit of a debug build is massive.
  • Early use of focused assessments, e.g. one-week or one-month reality checks, is critical for QA. They quickly produce data sets (even on legacy kit) that aid our understanding of how the building and/or systems are working, so we can deliver useful information before shifting to production runs on our most powerful computers.

    User choice implications

    What we choose to include in our virtual worlds of physics, and our design decisions within those virtual worlds, have consequences. Although simulation tools may guide (or coerce) us, decisions about the extent of our models and the level of detail they contain are ultimately down to the model design decisions of practitioners. Following the easy path through the tool does not necessarily create a model which is fit-for-purpose. Enough said. The benchmarking matrix includes two models. The first is a pair of cellular offices with a corridor. Its composition and complexity characterise the exemplar training models distributed with simulation suites such as ESP-r and EnergyPlus. The second model was taken from a project on retrofit options for traditional heavy masonry apartments.

    A simple training model

    Models used for training in simulation tools tend to be constrained in complexity. Often training models have a half dozen zones, perhaps 40-50 surfaces and straightforward environmental controls. Such model constraints reflect novices' abilities and, in general, assessment tasks associated with them make minimal demands on computers.

    [Figure: simple training model]
    The figure to the left is a training model distributed with ESP-r. It has two identical cellular offices with a corridor. Glazing frames are separate surfaces and some of the walls have been subdivided. This model was also exported as a V8.8 EnergyPlus IDF file so the assessment timings are well matched in terms of zones, surfaces, surface composition and casual gains, as well as startup days, assessment periods and timesteps.

    However, to accommodate the Conduction Finite Difference solver in EnergyPlus (see grumble later on), the conductivity of steel in the floor structure was reduced and a fibre concrete rainscreen was substituted for the standard metallic rainscreen in the facade. ESP-r uses a finite volume solver which is less sensitive to high conductivity materials (complaints start with layers of 0.4mm thickness).

    The contents of this model are reported in this ASCII contents file or as a PDF file.

    The periods of assessment, timesteps, warmup periods and solar shading directives for the EnergyPlus model were also matched. Both 4 TSPH and 20 TSPH and both the CTF and CFD (conduction finite difference) solvers are exercised in EnergyPlus, while ESP-r uses its finite volume approach. One difference is the use of a free floating control for the EnergyPlus runs.

    High-mass traditional apartment

    Models which represent real buildings often explicitly represent most, if not all, of the rooms in the building as well as specific aspects of the building facade. They may also abstractly represent adjacent spaces which act as bounding conditions, and they may include a degree of diversity in the occupancy and casual gains within the spaces so as to capture the response characteristics of the building.

    To assist our comparison with EnergyPlus, 4 and 20 TSPH assessments have been run. Twenty TSPH yields an annual results file of over 21GB (full zone and surface energy balances). Although the ESP-r res module can handle this, the pedantic nature of the post-processing requests tends to thrash the disk; indeed the quantity of data written also impacts the speed of the simulation. As will be discussed in a later section, it is possible to request that performance data be saved at a lesser frequency and to write a subset of data to the file. The former technique has been used for the summer and annual 20 TSPH runs: performance data was saved once per hour rather than at each timestep. User choices for recording performance data which impact the size of the files are explored across a range of user directives in a later section.
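
    As a rough check on the data volumes: an annual run at 20 TSPH writes 8760 x 20 = 175,200 records, so a ~21GB file implies roughly 125KB per timestep for the full zone and surface balances. Saving hourly cuts the record count by a factor of 20, which is consistent with the ~1GB annual files reported in the disk-bound section below.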

    As with the simple office model the traditional apartment was exported to EnergyPlus version 8.8. Given the mass of the structure the Conduction Finite Difference solver is the solver of choice (the CTF solver is also tested). In this model only one instance of steel needed its conductivity relaxed to avoid a severe error. Again the assessment was run with a free floating control.

    [Figure: high-mass traditional apartment]
    For this study we use an 1890s apartment in a sandstone building. Such models often represent each room explicitly as well as the bay windows in the facade and the different thicknesses of stone (up to 900mm thick). In this case there are 13 zones and 432 surfaces. Environmental controls involve radiators (mixed convection & radiation) with TRVs (mixed sensors). The figure shows a view (including an abstract bounding zone below the occupied space as well as the roof space above). The usual range of timesteps for this model is 8-10 TSPH (there is nothing in the model or the goals of the project which demands a shorter timestep).

    The contents of this model are reported in this ASCII contents file or as a PDF file.

    Directives on simulation frequency

    Our choice of assessment timestep is typically a compromise between the response characteristics of the building and its environmental controls, the what-do-we-need-to-record in order to confirm the design works, and the constraints of the solver being used. In the matrix we use both a 15 minute timestep and a 3 minute timestep. Both tools support a one minute timestep for the building solver, and ESP-r can simulate plant at a fraction of a second, although the latter is not explored in this study.

    Ideally the timestep reflects either the frequency at which we want to explore performance issues or an aspect of an environmental control or mass flow which might undergo rapid changes.

    Some solvers constrain the simulation timestep - in the tables below the CTF method in EnergyPlus does not like frequencies of less than 4 per hour (we are not sure about frequencies greater than 30 per hour). Most example IDF files use 4 or 6 timesteps per hour. The finite difference method does not allow frequencies of less than 20 per hour (20, 30 or 60 are possible).

    The extent and frequency of performance data captured during an assessment forms another matrix to explore. ESP-r has a concept of save levels, two of which are explored - a full set of performance data (i.e. full energy balances at zones and surfaces) and a subset of performance data. The tests include both save levels and, for longer periods, a data frequency of hourly as well as each timestep. Post-processing in ESP-r recovers the following performance data for each zone:

  • sensible heating, cooling, humidification over the assessment period
  • sensible heating, cooling, humidification statistics (Max Min Std deviation)
  • monthly gains and losses for surface convection, casual gains, air movement, heating, cooling, solar absorbed and solar entering
  • number of hours over 24C for dry bulb and resultant temperatures (plus statistics)

    The EPP assessment writes hourly data on:

  • Output:Variable,*,Site Outdoor Air Drybulb Temperature,hourly;
  • Output:Variable,*,Zone Mean Air Temperature,hourly;
  • Output:Variable,*,Zone Air Temperature,hourly;
  • Output:Variable,*,Surface Inside Face Temperature,hourly;
  • Output:Variable,*,Surface Outside Face Temperature,hourly;

    These are written as csv and ESO files but no further post-processing is performed. The directives would need to be considerably extended to be comparable with what ESP-r stores. There is also no direct equivalent of the ESP-r res module, so the method for comparing post-processing timings needs to be evolved. A minimal ReadVars input file is sketched below.
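
    For reference, ReadVars is steered by a small .rvi file which names the ESO input, the csv output and the variables to extract, terminated by a line holding 0. A minimal file along the lines of the directives above might look like this (treat the exact layout as indicative rather than definitive):

        eplusout.eso
        eplusout.csv
        Site Outdoor Air Drybulb Temperature
        Zone Mean Air Temperature
        0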

    To test the impact of writing data at higher frequencies (see the disk-bound section below) variant IDF files were created with timestep directives, as sketched below.
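
    A variant of this sort touches only two kinds of object: the Timestep object and the reporting frequency field of each Output:Variable. A sketch of the relevant IDF fragments (the rest of the model is unchanged):

        Timestep,
          20;    !- Number of Timesteps per Hour

        Output:Variable,*,Zone Mean Air Temperature,timestep;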

    Directives on simulation periods

    To highlight the costs of the uncritical generation of annual performance indicators, and the value of focused reality checks in gaining confidence in models, we include the following assessment periods in the matrix:

  • a single week creates an easy-to-digest set of performance data which can give indications of a number of performance criteria. It is a classic period for reality checks prior to seasonal assessments.
  • two months is often sufficient to test a design against a progression of boundary conditions. The period used in the tables below is January and February. This captures both extreme conditions and the impact of often-rapidly-changing weather patterns. The performance data set is not nearly as easy to digest.
  • summer is a classic seasonal assessment which can deliver information on overheating risks. It captures some transition periods as well. It also marks a point where the size of the data store influences workflows and post-processing tasks.
  • annual is beloved by accountants the world over. We really should not do annual assessments until the model has been calibrated, passed a number of reality checks and we have gained confidence in it.

    Pre-processing directives matrix

    Directives which impose additional resolution on models, for example surface-to-surface viewfactors and shading & insolation distributions, are computed prior to assessments and the stored results used during assessments.

    Simulation tools offer a range of user directives for radiation exchange within zones. The process of computing and using surface-to-surface view factors in ESP-r is straightforward and seems to be where dragons live for EnergyPlus. In ESP-r the method is based on creating a grid of points on all surfaces. The grid resolution defaults to 10x10 per surface but can be increased (e.g. to cope with thin frames around doors and windows). Shifting to a ~40x40 grid would impact computing resources but is rarely required.
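
    As an aside, the grid-of-points approach amounts to a midpoint double summation of the view factor integral. The minimal numpy sketch below (an illustration of the general technique, not ESP-r's actual code) shows why grid resolution is costly: the work per surface pair grows with the fourth power of the grid dimension, so a ~40x40 grid costs roughly 256 times a 10x10 grid.

        import numpy as np

        def patch_grid(origin, u, v, n=10):
            # midpoints of an n x n grid on the parallelogram origin + a*u + b*v
            a = (np.arange(n) + 0.5) / n
            A, B = np.meshgrid(a, a)
            return origin + np.outer(A.ravel(), u) + np.outer(B.ravel(), v)

        def view_factor(o1, u1, v1, o2, u2, v2, n=10):
            # estimate F(1->2) = (1/A1) sum sum cos(t1) cos(t2) / (pi r^2) dA1 dA2
            p1 = patch_grid(o1, u1, v1, n)
            p2 = patch_grid(o2, u2, v2, n)
            n1 = np.cross(u1, v1); A1 = np.linalg.norm(n1); n1 = n1 / A1
            n2 = np.cross(u2, v2); A2 = np.linalg.norm(n2); n2 = n2 / A2
            dA1, dA2 = A1 / n**2, A2 / n**2
            F = 0.0
            for p in p1:
                r = p2 - p                      # vectors from a surface 1 point to surface 2 points
                d = np.linalg.norm(r, axis=1)
                cos1 = (r @ n1) / d             # angle at surface 1
                cos2 = -(r @ n2) / d            # angle at surface 2
                vis = (cos1 > 0) & (cos2 > 0)   # back-facing pairs contribute nothing
                F += np.sum(cos1[vis] * cos2[vis] / (np.pi * d[vis] ** 2)) * dA2
            return F * dA1 / A1

        # two parallel unit squares 1m apart; note the swapped edge vectors so the
        # second surface's normal faces the first. The analytic value is ~0.1998.
        o1 = np.array([0.0, 0.0, 0.0]); u = np.array([1.0, 0.0, 0.0]); v = np.array([0.0, 1.0, 0.0])
        o2 = np.array([0.0, 0.0, 1.0])
        print(view_factor(o1, u, v, o2, v, u, n=20))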

    For ESP-r, timings were taken on a Dell 7010 running Linux and ESP-r V12.7. The task of pre-computing surface-to-surface view factors rarely takes more than a minute and is often a matter of seconds. Pre-calculating the hourly distribution of solar insolation within a complex zone (including to the explicit mass surfaces) requires even fewer resources.

    [Figure: room with explicit internal surfaces]
    Two levels of room complexity have been tested. The room on the left includes 314 vertices and 152 surfaces of varying complexity (from 4 to 36 edges) in the facade, as well as dozens of explicit internal mass surface pairs representing desks and filing cabinets.

    Also seen are the locations of MRT sensors (at occupant head locations for assessing radiant asymmetry).

  • surface-to-surface view factors (18x18 mesh on each surface): 57s
  • MRT-sensor-to-surface view factors: 2s
  • hourly insolation for one typical day in each month over a year: shading 5s, insolation 5s

    The pre-processing in the context of a simple zone (90 vertices, 40 surfaces), such as that found in the simple pair of offices, implies the following overhead:

  • surface-to-surface view factors (18x18 mesh on each surface): 4s
  • MRT-sensor-to-surface view factors: <1s
  • shading & insolation, hourly for one typical day in each month over a year, for all zones after a site change: 5s

    In addition to the time taken for the actual calculations, the specification of user directives for shading and insolation requires a few seconds. There are a few options to set for the accuracy of viewfactor calculations, as well as locating and defining MRT sensor locations.

    As zones evolve, shading is automatically updated. If the location on site is adjusted all shading is automatically recomputed, and because no user interaction is needed this may take only a few seconds per zone. Viewfactor calculations require the user to initiate them. In a 90 zone model this becomes a repetitive task that probably should be automated.

    Assessment directives matrix

    Simulation models include a range of directives which impact computing resources.

  • Shading - simulation tools offer a range of user directives as to whether shading and insolation are to be treated abstractly or computed, and if computed at what level of resolution. To explore all of these would vastly increase the size of the matrix. The models all include a request for shading to be computed (but some rooms exceed the shape limits imposed by EnergyPlus for computed insolation).
  • Mass flow iteration - ESP-r defaults to solving mass flows at the same timestep as the building solution (the so-called ping-pong approach). Users can switch to the so-called onion approach where mass flow iteration is increased. The flow solver is highly efficient and tends not to impose a noticeable burden. None of the models in the matrix include a flow network so this has not been tested.
  • Plant solution timesteps - system components can be solved at the same frequency as the zone solver or at a multiple of it. At one extreme the building could solve every minute and the system solve (and write performance data) at a half-second timestep. This has the potential to generate large performance data files as well as slowing down assessments. This has not been tested in the matrix.

    Hardware choices

    Each computing platform includes a mix of hardware which can impact how much time the kettle takes to boil. Each epoch of computer has quirks as well as opportunities for fine tuning. And there are the disruptor platforms which may find application in some simulation scenarios.

  • Processor type and speed - although most modern computers have multiple cores, few simulation tools make use of them. Processor clock speed is also often correlated with timings. The efficiency of multiple simultaneous assessments is, however, dependent on a number of factors which are summarised below:
  • Memory - memory determines how many tasks can run simultaneously as well as the complexity of models which can be accommodated. If there is a significant memory buffer, the OS can use the memory buffer rather than direct disk access (speeding up some tasks by several orders of magnitude).
  • Disk type - the type of disk impacts disk intensive tasks. Rotational disks can easily store large files (ESP-r can generate files approaching 40GB) but scanning such files can be much less efficient than with an SSD. Simulation tools such as ESP-r also create large numbers of small model files which are also better handled by an SSD. The EMMC drives embedded in tablets and some single board computers tend to be slower than SSDs. One of the bottlenecks of single board computers is their reliance on SDHC cards, which are orders of magnitude slower than EMMC storage. A rough way to measure disk throughput is sketched after this list.
  • Operating system - some simulation tools can take advantage of command interpreters (shells, scripting tools etc.) to automate tasks. Linux offers finer grained user permissions so that, for example, corporate databases cannot be easily corrupted by users. Tests indicate that disk I/O is generally faster on Linux than on Windows. Operating system overheads (processor and memory) can be constrained more readily in Linux than Windows.
  • The tool-chain - the compiler used has less impact than the degree of optimisation requested. ARM processors show greater gains from optimisation than Intel processors when using the GCC compiler, which is the tool chain used in this study.
  • Virtualisation - where simulation is deployed within virtual environments e.g. VirtualBox, VMWare, Parallels, there is a reduction in the speed of computation (often only a few percent if the virtual processor is the same as the host), there is often less memory available and disk access can be an order of magnitude slower.
  • Tool optimisation - those developing simulation tools have different needs from simulation users. Compilation in support of tool development is typically not optimized for fast processing and only the release version will be built with full optimisation. The tables show that a debug version has a significant impact on resources if used in a production work-flow. Commercial compilers may also result in a faster tool - so compiling from source might result in slightly slower performance.
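
    As a rough way to compare the mass storage options above, the short Python sketch below (not part of the study's test harness) measures sequential write and read rates for whatever disk holds the scratch file:

        import os, time

        def throughput(path, size_mb=512, block=1 << 20):
            # rough sequential write then read rate, in MB/s, for the disk holding path
            data = os.urandom(block)
            t0 = time.perf_counter()
            with open(path, "wb") as f:
                for _ in range(size_mb):
                    f.write(data)
                f.flush()
                os.fsync(f.fileno())   # push the data to the device, not just the OS buffer
            write_mbs = size_mb / (time.perf_counter() - t0)
            t0 = time.perf_counter()
            with open(path, "rb") as f:
                while f.read(block):
                    pass
            read_mbs = size_mb / (time.perf_counter() - t0)
            os.remove(path)
            return write_mbs, read_mbs

        # NB the read figure usually reflects the OS file buffer rather than the
        # physical disk because the file has just been written - exactly the
        # memory-versus-disk contrast discussed in the disk-bound section below.
        print("write %.0f MB/s, read %.0f MB/s" % throughput("scratch.tmp"))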

    Computer matrix

    The following table describes the matrix of computers included. The columns are:

    Computer matrix
    Computer | CPU | CPU speed | Memory | OS | Tool-chain | Disk type | Comment
    Dell 7010 | 4x Intel i5-3470 | 3.6 GHz | 8GB RAM | Ubuntu 16.04 | GCC 5.4 | rotational | 2013 epoch standard workstation
    Dell 7010 | 4x Intel i5-3470 | 3.6 GHz | 8GB RAM | Windows 10 | GCC 5.4 | rotational | 2013 epoch standard workstation
    Dell 7010 VirtualBox | 2x Intel i5-3470 | 3.6 GHz | 2.4GB RAM | Ubuntu 16.04 | GCC 4.8 | virtual disk | non-commercial virtualisation with constrained RAM, CPU & disk performance
    Dell 7010 VMware | 2x Intel i5-3470 | 3.6 GHz | 2.4GB RAM | Ubuntu 16.04 | GCC 4.8 | virtual disk | commercial virtualisation with constrained RAM, CPU & disk performance
    Dell 780 | 2x Intel Core2 Duo E8400 | 3.0 GHz | 4GB RAM | Ubuntu 16.04 | GCC 5.4 | rotational | 2010 epoch, typical of legacy kit
    MacBook Pro (2012) | 2x Intel i5 | 2.5 GHz | 8GB RAM | OSX 10.10.5 | GCC 5.3 | SSD | typical legacy OSX laptop
    MacBook Air (2013) | Intel i5 | 1.3 GHz | 4GB RAM | OSX 10.13.0 | GCC 6.3 | SSD | fast SSD and memory, faster than GHz would indicate
    Odroid C2 | AArch64 | 1.53 GHz | 1.8GB RAM | Ubuntu 16.04 | GCC 4.9 | SDHC | one of the more powerful ARM SBCs
    Pinebook64 | AArch64 | 1.34 GHz | 2GB RAM | Ubuntu Mate | GCC 5.4 | EMMC | ARM based laptop with EMMC disk
    Raspberry Pi3 | ARMv7l | 1.2 GHz | 1GB RAM | Debian 8 | GCC 4.9 | SDHC & rotational | popular ARM SBC hosted on a rotational disk; memory constrains the complexity of models that can be hosted

    Timings for a simple training model

    In this table assessment times are shown for ESP-r V12.7 compiled with no optimisation and with high optimisation using the GCC tool chain on all computing platforms. Where EPP V8.8 was locally compiled the same compiler and optimisation levels were used. For Windows 10 the EPP suite was a pre-compiled download.

    This particular ESP-r model is normally run at 4 TSPH, however the CFD solver in EnergyPlus requires at least 20 TSPH so ESP-r runs were made at both 4 and 20 timesteps per hour.

    In the table below times are in seconds. The syntax for ESP-r is (4 TSPH simulation/reporting) [20 TSPH simulation/reporting] and for EnergyPlus is (CTF conduction transfer function simulation & ReadVars) [CFD conduction finite difference simulation & ReadVars].

    Simple training model
    Platform | Suite | One week | Jan-Feb | Summer | Annual
    Dell 7010 Linux Ubuntu | ESP-r optimized | (<1 <1) [2 <1] | (2 1) [6 2] | (3 1) [11 4] | (7 3) [30 10]
    | ESP-r un-optimized | (1 <1) [4 <1] | (3 1) [12 2] | (5 2) [22 5] | (13 5) [59 13]
    | EPP downloaded | (1 <1) [4 <1] | (2 <1) [11 <1] | (3 <1) [19 <1] | (7 1) [52 1]
    | EPP local optimized build | (1 <1) [3 <1] | (2 <1) [11 <1] | (3 <1) [19 <1] | (7 1) [53 1]
    | EPP local un-optimized build | (2 <1) [13 <1] | (5 <1) [41 <1] | (8 <1) [72 <1] | (22 1) [199 1]
    Dell 7010 Windows 10 | ESP-r optimized | (2 1) [2 2] | (3 1) [8 9] | (4 3) [14 16] | (8 10) [34 49]
    | EPP downloaded | (1 <1) [4 <1] | (2 <1) [9 <1] | (4 1) [17 1] | (8 3) [47 3]
    Dell 7010 VirtualBox Linux | ESP-r optimized | (2 <1) [3 1] | (3 1) [10 2] | (5 2) [17 7] | (10 5) [46 20]
    | EPP optimized | (<1 <1) [4 <1] | (2 <1) [11 <1] | (3 <1) [20 <1] | (8 1) [55 2]
    Dell 7010 VMware Linux | ESP-r optimized | (2 1) [3 1] | (3 1) [11 4] | (5 2) [18 8] | (10 5) [45 28]
    | EPP optimized | (1 <1) [4 <1] | (2 <1) [11 <1] | (3 <1) [20 <1] | (8 1) [56 1]
    Dell 780 Linux | ESP-r optimized | (1 <1) [3 1] | (2 1) [8 3] | (3 2) [15 5] | (8 4) [40 13]
    | EPP optimized | (1 <1) [5 <1] | (3 <1) [16 <1] | (4 <1) [29 <1] | (11 2) [80 2]
    MacBook Pro | ESP-r optimized | (1 <1) [4 1] | (2 1) [9 7] | (5 3) [15 13] | (9 9) [38 28]
    | EPP optimized | (1 <1) [4 <1] | (2 <1) [12 <1] | (4 1) [25 1] | (9 2) [59 2]
    Odroid C2 | ESP-r optimized | (6 2) [16 5] | (13 7) [61 24] | (22 18) [117 47] | (66 32) [320 322]
    | EPP optimized | (5 <1) [16 <1] | (15 1) [97 1] | (29 3) [185 3] | (73 8) [539 8]
    Pinebook64 | ESP-r optimized | (7 4) [23 5] | (17 6) [74 15] | (41 11) [178 26] | (98 22) [378 72]
    | EPP optimized | (9 <1) [51 <1] | (24 2) [164 2] | (42 3) [300 3] | (113 10) [826 10]
    Raspberry Pi3 | ESP-r optimized | (6 4) [20 5] | (14 7) [63 20] | (26 12) [116 39] | (68 28) [320 137]
    | ESP-r un-optimized | (15 5) [61 8] | (40 14) [189 41] | (72 25) [346 95] | (197 68) [968 995]
    | EPP optimized | (11 <1) [45 <1] | (21 2) [83 2] | (36 3) [248 3] | (95 10) [701 10]
    | EPP un-optimized | (21 <1) [155 <1] | (55 2) [483 2] | (94 3) [849 3] | (246 9) [2390 10]
    Discussion

    The table indicates that reality-check runs are in the order of seconds and seasonal/annual assessment tasks of one minute or less across most of the matrix of computing platforms. Unless you are undertaking automated parameter excursions the interactive user experience is little changed across the matrix.

    At this scale there is no need to use the legacy EnergyPlus CTF solver (unless you are running on one of the single board computers or running longer assessments on legacy kit).

    Windows shows a disadvantage for disk intensive activities. The EnergyPlus downloaded executables on Windows have an advantage over locally compiled Linux versions.

    There appears to be little difference between the performance of non-commercial and commercial virtual computers for simple models. Indeed, at this scale of model there is little or no increase in assessment times in comparison with native tool performance (even with constrained memory).

    For summer and annual assessments the use of un-optimized simulation tools shows a significant increase in timings. Essentially one would avoid un-optimized builds for production work, especially with models with any level of complexity.

    Timings for the High-mass traditional apartment

    In the table below times are in seconds. The syntax for ESP-r is (4 TSPH simulation/reporting) [20 TSPH simulation/reporting] and for EnergyPlus is (CTF conduction transfer function simulation & ReadVars) [CFD conduction finite difference simulation & ReadVars]. A * indicates performance data generated at 20 TSPH but saved hourly.

    High-mass traditional apartment
    Platform | Suite | Week | Jan-Feb | Summer | Annual
    Dell 7010 Linux Ubuntu | ESP-r optimized | (11 2) [49 4] | (33 6) [156 33] | (59 12) [265 4*] | (166 33) [690 9*]
    | ESP-r un-optimized | (32 1) [147 2] | (94 4) [460 14] | (164 7) [793 3*] | (453 20) [2193 7*]
    | EPP pre-compiled * | (3 <1) [51 <1] | (9 2) [153 2] | (17 4) [308 4] | (47 12) [791 12]
    | EPP optimized * | (3 <1) [51 <1] | (9 2) [152 2] | (17 4) [315 4] | (46 12) [765 12]
    | EPP un-optimized * | (11 <1) [207 <1] | (28 2) [605 2] | (54 4) [1243 4] | (142 12) [3080 13]
    Dell 7010 VirtualBox Linux | ESP-r optimized | (23 2) [109 5] | (73 3) [348 423] | (130 15) [595 5*] | (350 560) [1653 12*]
    | EPP optimized * | (4 <1) [58 <1] | (10 2) [168 2] | (20 4) [336 4] | (53 12) [974 12]
    Dell 7010 VMware Linux | ESP-r optimized | (25 2) [171 5] | (72 10) [335 214] | (122 16) [317 5*] | (304 260) [1392 13*]
    | EPP optimized * | (4 <1) [55 <1] | (11 2) [170 2] | (20 4) [407 4] | (54 13) [837 13]
    Dell 7010 Windows 10 | ESP-r optimized | (12 4) [48 14] | (33 21) [157 108] | (58 43) [251 11*] | (158 126) [683 92*]
    | EPP downloaded * | (4 4) [49 1] | (11 4) [156 4] | (21 9) [297 9] | (58 26) [763 26]
    Dell 780 Linux | ESP-r optimized | (15 2) [74 7] | (49 11) [224 92] | (89 23) [366 6*] | (226 65) [1033 8*]
    | EPP optimized * | (6 1) [90 1] | (13 3) [279 3] | (26 6) [529 6] | (69 20) [1392 18]
    MacBook Pro | ESP-r optimized | (4 2) [60 9] | (14 3) [197 86] | (76 66) [323 15*] | (195 36) [870 20*]
    | EPP pre-compiled * | (4 <1) [72 <1] | (13 3) [204 3] | (25 7) [404 7] | (67 20) [1028 23]
    Odroid C2 | ESP-r optimized | (134 9) [630 33] | (394 16) [1976 2172] | (775 328) [3455 24*] | (2171 2534) [9380 59*]
    | EPP optimized * | (34 2) [571 2] | (106 12) [1654 12] | (212 25) [3401 25] | (555 76) [8600 15]
    Pinebook64 | ESP-r optimized | (193 6) [953 14] | (603 23) [3060 30] | (1119 44) [5121 15*] | (2899 128) [13852 36*]
    | EPP optimized * | (51 2) [825 2] | (160 15) [2378 15] | (308 32) [4887 32] | (822 35) [12316 35]
    Raspberry Pi3 | ESP-r optimized | (159 7) [754 20] | (457 34) [2269 222] | (831 91) [4185 20*] | (2294 304) [11484 67*]
    | EPP optimized * | (44 2) [724 2] | (131 15) [1862 15] | (253 32) [4155 32] | (638 34) [10351 34]
    Discussion

    As the complexity of models increases different patterns emerge in the interplay between user directives and hardware.

  • The numerical hits from virtual computing are noticeable. For example the simulation time roughly doubles for ESP-r. EnergyPlus is less impacted when run within a virtual environment.
  • The disk access and (likely) disk size constraints of virtual computing are also noticeable. This will be further explored in the next section.
  • Windows 10 disk access and post-processing are noticeably slower than Linux for both ESP-r and EnergyPlus.
  • Although ARM computers are capable of running more complex models their limits on disk space and disk speed argue against their use for seasonal and annual assessments.
  • The differences between optimized and un-optimized builds of simulation tools are clear. We do not report un-optimized runs for ARM because the impact was clear in the simple model in the earlier table.

    Disk-bound assessments

    Writing performance data and recovering it for inclusion in reports and graphs might be another elephant-in-the-room. The tests suggest a number of correlations with:

  • model complexity e.g. the number of zones and surfaces, controls etc.
  • simulation timesteps as they approach one minute or move into the scale of seconds
  • the number of assessment domains to be solved
  • the breadth and frequency of performance data recorded
  • the method used to generate reports and graphs
  • operating system efficiency for using memory buffers in preference to direct file I/O
  • the degree of random access required for report and graph generation
  • the form of mass storage

    As seen in the tables above, simulation tasks with models such as the simple pair-of-offices are rarely disk-bound. The slow SDHC disks of some single board computers are somewhat impacted, but with most legacy and current computers in the matrix users will hardly notice.

    The table below illustrates the time implications in the context of the traditional apartment:

    Matrix of disk-bound variants (work in progress)
    Platform | Suite | Variant | Summer | Annual | Comment
    Dell 7010 Linux Ubuntu (rotational drive) | ESP-r optimized | 20 TSPH -> 20 TSPH save everything | [288 145 elapsed, ~7GB] | [787 2007 elapsed, ~21GB] | data recovery dependent on disk I/O, typically 50MB/s
    | | 20 TSPH -> 20 TSPH save subset | [270 32 elapsed, 590MB] | [731 96 elapsed, 1.7GB] | subset save reduces post-processing burden
    | | 20 TSPH -> 1 TSPH save everything | [266 4 elapsed, ~350MB] | [727 10 elapsed, ~1GB] | data recovery minimal burden
    | | 20 TSPH -> 1 TSPH save subset | [264 2, 29MB] | [727 6, 88MB] | minimal gain from saving only a subset
    | EnergyPlus | 20 TSPH -> 20 TSPH | [411 83] | [1087 248, ~6GB] | noticeable post-processing resource
    | | 20 TSPH -> 1 TSPH | [312 4] | [793 12, ~310MB] | quick post-processing
    MacBook Pro (SSD) | ESP-r optimized | 20 TSPH -> 20 TSPH save everything | [387 1345] | [995 6036] | data files ~21GB
    | | 20 TSPH -> 20 TSPH save subset | [346 71] | [879 200] | data files ~6GB
    | | 20 TSPH -> 1 TSPH save everything | [323 15] | [870 20] | data files ~1GB
    | EnergyPlus | 20 TSPH -> 20 TSPH | [563 12] | [1536 401] | output folder ~6GB
    | | 20 TSPH -> 1 TSPH | [404 7] | [1028 22] | output folder ~2GB
    Odroid C2 | ESP-r optimized | 20 TSPH -> 20 TSPH save everything | [- -] | [- -] | not enough disk space
    | | 20 TSPH -> 20 TSPH save subset | [3392 249] | [9308 2866] | data files ~6GB
    | | 20 TSPH -> 1 TSPH save everything | [3455 23] | [9380 59] | data files ~1GB
    | EnergyPlus | 20 TSPH -> 20 TSPH | [- -] | [- -] | not enough disk space; a common limitation for ARM computers as well as virtual computers
    | | 20 TSPH -> 1 TSPH | [3414 25] | [8649 74] | output folder ~1GB

    As models approach the complexity of the traditional apartment, instances of disk-bound assessment begin to be noticeable; in some cases they begin to dominate. Where data recovery requests can be satisfied from the memory buffer, tasks are accomplished one or two orders of magnitude quicker. We observe that disk writing from the ESP-r simulation engine runs at roughly 10-50MB/s, while disk reads, especially from large data stores, might be half that speed. Remember, the writing of performance data is essentially a sequential task, while post-processing of multiple topics can involve random access into the performance data store, if not multiple passes over it to derive statistics and/or generate graphs.
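
    The buffer effect is easy to demonstrate: scan the same large file twice and compare the rates. The Python sketch below uses a hypothetical performance data store name; the first pass reads from disk, and if the file fits in free RAM the second pass is served from the page cache:

        import time

        def timed_read(path, block=1 << 20):
            # scan a file sequentially and return the achieved rate in MB/s
            t0 = time.perf_counter()
            n = 0
            with open(path, "rb") as f:
                chunk = f.read(block)
                while chunk:
                    n += len(chunk)
                    chunk = f.read(block)
            return n / 1e6 / (time.perf_counter() - t0)

        for label in ("cold", "warm"):
            # "apartment.res" is a placeholder for a real performance data file
            print(label, "%.0f MB/s" % timed_read("apartment.res"))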

    Although it does not show up in the tables, the response to user interactions for non-simulation tasks is impacted, especially during post-processing. Pointers would freeze and simple tasks like folder listings could be an order of magnitude slower. Multiple processors may speed automated tasks but the user experience is degraded.

    Users have a number of choices:

  • breaking a single assessment that generates tens of GB of data into a sequence of shorter assessments
  • altering the frequency of data recordings - the differences between hourly and per-timestep saving for ESP-r and EnergyPlus are shown in the table above
  • choosing fewer performance data items
  • implementing hardware changes - adding more RAM or upgrading from a rotational disk to an SSD

    Workflows

    The ordering of simulation tasks can have a substantial impact. Assessments which generate a large performance data set and which are immediately followed by data recovery actions tend to be more efficient than running a series of assessments followed by a sequence of data recovery tasks.

    On most computing platforms, even those with multiple processors, users will notice sluggish response for tasks carried out at the same time as heavy disk activity. To explore this consider the table below which looks at the time required for sequential and parallel simulation tasks.

    Task sequence matrix for ESP-r
    Computer | Period | Sequential sim | Sequential post-processing | Parallel sim | Parallel post-processing | Comments
    Dell 7010 Linux Ubuntu | May | 21 | 3 | 49 | 4 | with 4 cores, fast disk & 8GB memory there are almost no wait states
    | June | 21 | 3 | | |
    | July | 21 | 3 | | |
    | August | 22 | 3 | | |
    | sum of sequential | 85 | 12 | | | the warmup periods in the sequential runs are noticeable
    | May-August | 60 | 12 | | | post-processing via memory buffer
    Odroid C2 Linux Ubuntu | May | 264 | 32 | 443 | 165 | 4 cores but limited memory and disk; I/O wait states of >50% are observed during major writes
    | June | 262 | 29 | | |
    | July | 274 | 29 | | |
    | August | 339 | 29 | | |
    | sum of sequential | 1139 | 119 | | | sequential recovery is faster because it is largely via memory
    | May-August | 844 | 380 | | | post-processing via disk access

    On workstations with reasonable hardware provision it is possible to speed up simulation tasks by invoking assessments in parallel. Four cores never run four assessments in the same time as a single assessment. In the table above the monthly assessments each include a warm-up period, so the sum-of-sequential is greater than a single run covering the season. The parallel runs do save time - but if post-processing needs to span the whole season then the bookkeeping to aggregate predictions may be problematic.

    On computers which have multiple cores but are constrained by memory and/or disk I/O the story is mixed. The table shows that post-processing dominates the seasonal run. The parallel assessments do run in a fraction of the time but parallel post-processing is disk-bound.
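
    A minimal sketch of the two orderings, using hypothetical wrapper scripts in place of the actual ESP-r invocations (which depend on the local installation and model):

        import subprocess
        from concurrent.futures import ThreadPoolExecutor

        MONTHS = ["may", "jun", "jul", "aug"]

        def simulate(m):
            subprocess.run(["./simulate.sh", m], check=True)   # hypothetical wrapper around the simulator

        def recover(m):
            subprocess.run(["./recover.sh", m], check=True)    # hypothetical wrapper around res

        # efficient ordering: post-process each result set immediately after its
        # run, while the performance data is still in the OS file buffer
        for m in MONTHS:
            simulate(m)
            recover(m)

        # parallel alternative: all four runs at once. On memory or disk
        # constrained kit the recovery pass falls back to disk and can dominate.
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(simulate, MONTHS))
        for m in MONTHS:
            recover(m)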

    Is this pattern unique to ESP-r? The same scenario was run for EnergyPlus V8.8 on the same computing platform. The results are presented below for the Conduction Finite Difference solver at 20 TSPH:

    Task sequence matrix for EnergyPlus
    Computer | Period | Sequential sim | Sequential post-processing | Parallel sim | Parallel post-processing | Comments
    Dell 7010 Linux Ubuntu | May | 108 | 1 | 169 | 2 | with 4 cores, fast disk & 8GB memory the computer is fully engaged when running in parallel; using 4 cores is not 4x faster
    | June | 105 | 1 | | |
    | July | 116 | 1 | | |
    | August | 108 | 1 | | |
    | sum of sequential | 437 | 4 | | | writing minimal performance data there is little difference between sequential and parallel
    | May-August | 305 | 4 | | | a single summer run is faster than monthly runs because of the warm-up periods

    The patterns are, indeed, similar. A single summer assessment is somewhat more efficient than sequential monthly runs, and running the four monthly assessments in parallel takes roughly half the time of the single summer run. EnergyPlus is doing much less post-processing so no clear pattern emerges there.

    Observations

    More often than not we select our matrix of model attributes, assessment directives and mix of hardware based on habit rather than on benchmarks. This risks over- and under-specified kit for the goals of projects. For example, a high resolution monitor could be much more important to model creation than cutting edge processing capabilities. If we anticipate that disk intensive tasks will be a bottleneck then fitting additional memory and an SSD can often result in better turn-around than a faster processor.

  • The tests indicate that computers from earlier epochs are often suitable for model creation, editing and QA checks (with the caveat that optimized software is used and memory and disks are upgraded where possible).
  • Alternative computing platforms are also suitable for a range of simulation tasks and training workshops, with the caveat that optimized software is used. The hassle of participants not being able to run a tool, and licensing issues, would go away if presenters arrived with a suitcase of ARM computers pre-loaded with the software and models.
  • Offloading computing tasks to compute servers with fast disks makes sense in many situations although this has not been specifically benchmarked.
  • The research indicates that the timings for EnergyPlus's Conduction Finite Difference solver are less dire than is usually assumed in the community. Yes, it is a hassle that the conductivity of some metal layers needs to be relaxed to use the finite difference solver (such sensitivities only kick in for sub-millimeter layers in ESP-r). The 20, 30 or 60 timesteps per hour requirement is more restrictive than in the ESP-r solver. Surely the EnergyPlus development team could sort out the metal layer conductivity limits and find ways to support somewhat longer timesteps. Both of these would encourage default use of the finite difference solver. Experts should, of course, be able to choose CTF.

    ESP-r's res module is very efficient when the data store fits within the available memory buffer, and where the data store is less than ~2GB post-processing tasks are hardly an issue. The 21GB data store, however, highlights inefficiencies in the logic which controls random access to the file store. The process is disk-bound more by the speed of the read statements and multiple passes through the data store than by the raw speed of the disk drive.

    Meltdown and Spectre

    In January 2018 faults in a range of CPU architectures were uncovered, with speculation that the required patches might slow down some user tasks. The table below explores the implications for simulation tool timings for a subset of the test matrix. The 1890s apartment building has been used and the following notes explain the hardware and software context:

  • Linux on a Dell 7010 - the Ubuntu 16.04 LTS OS was updated to kernel version 4.4.0-96. ESP-r was recompiled with the same gcc build settings and compared with the last timings run prior to the upgrade. The existing version of EnergyPlus (gcc local build with full optimisation) was used to rerun the standard test sequence and compared with the last timings prior to the OS upgrade.
  • Windows 10 on a Dell 7010 - Windows 10 update was applied on 16 January until there were no further patches available. The existing native Windows version of ESP-r and the existing (standard downloaded) version of EnergyPlus were used to rerun the standard test sequence and compared with the last timings prior to the OS upgrade.
  • MacBook Air with macOS 10.13.2 - the computer is running the most recent macOS and thus had an OS patch available. The existing ESP-r (local optimized compile) was used to re-run the standard tests and compared with the last timings run prior to the upgrade. The existing (downloaded standard distribution) of EnergyPlus 8.8 for macOS was also used to re-run the standard tests and compared with the last test run prior to the OS upgrade.

    In the table below times are in seconds. The syntax for ESP-r is (4 TSPH simulation/reporting) [20 TSPH simulation/reporting] and for EnergyPlus is (CTF conduction transfer function simulation & ReadVars) [CFD conduction finite difference simulation & ReadVars]. A * indicates performance data generated at 20 TSPH but saved hourly.

    High-mass traditional apartment - pre & post Spectre/Meltdown patches
    Platform | Suite | Week | Jan-Feb | Summer | Annual
    Dell 7010 Linux Ubuntu | ESP-r pre-Spectre | (11 2) [49 4] | (33 6) [156 33] | (59 12) [265 4*] | (166 33) [690 9*]
    | ESP-r post-Spectre | (13 2) [59 4] | (38 6) [188 34] | (70 13) [305 4] | (- -) [- -]
    | EPP pre-Spectre | (3 <1) [51 <1*] | (9 2) [152 2*] | (17 4) [315 4*] | (46 12) [765 12*]
    | EPP post-Spectre | (3 1) [54 <1*] | (9 2) [159 2*] | (18 4) [315 4*] | (47 12) [773 12*]
    Dell 7010 Windows 10 | ESP-r pre-Spectre | (12 4) [48 14] | (33 21) [157 108] | (58 43) [251 11*] | (158 126) [683 92*]
    | ESP-r post-Spectre | (16 5) [62 18] | (44 29) [227 146] | (59 60) [380 17*] | (- -) [- -]
    | EPP pre-Spectre | (4 4) [49 1*] | (11 4) [156 4*] | (21 9) [297 9*] | (58 26) [763 26*]
    | EPP post-Spectre | (5 <1) [51 <1*] | (13 5) [148 5*] | (24 11) [298 10*] | (63 31) [799 31*]
    MacBook Air OSX 10.13.2 | ESP-r pre-Spectre | (13 2) [72 17] | (16 3) [230 1500] | (103 346) [368 462] | (240 1958) [1016 72*]
    | ESP-r post-Spectre | (12 4) [72 16] | (18 5) [265 1551] | (96 108) [379 30*] | (261 1991) [1044 65*]
    | EPP pre-Spectre | (5 <1) [64 <1*] | (15 3) [206 3*] | (31 7) [434 7*] | (74 20) [1032 19*]
    | EPP post-Spectre | (6 <1) [69 <1*] | (16 4) [206 4*] | (32 8) [400 7*] | (86 23) [1023 22*]

    Discussion

    The scale of performance change seems to vary depending on the mix of operating system and simulation tool. Some changes are small enough to be attributable to noise in the testing (i.e. other background OS tasks). ESP-r is impacted more than EnergyPlus for simulation phases as well as for post-processing.

  • The Linux upgrade impacts ESP-r timings for simulations (e.g. the summer season goes from 59 to 70 seconds) but has almost no impact on post-processing tasks. We have not yet tested whether a re-compiled ESP-r would perform differently.
  • The Linux upgrade seems to have minimal impact on EnergyPlus (e.g. the annual assessment goes from 46 to 47 seconds). Whether this would also be the case if EnergyPlus were re-compiled locally has not been tested.
  • The Windows 10 upgrade impacts ESP-r simulation tasks more than was seen in the Linux update, but not uniformly. For example, the summer season at 4 TSPH goes from 58 to 59 seconds but at 20 TSPH it jumps from 251 to 380 seconds. Disk I/O tasks (already slower than on Linux) are noticeably slower. Task manager showed disk saturation (CPU use dropped considerably) during the simulations that generated multi-GB files.
  • The Windows 10 upgrade has a slight impact on EnergyPlus (e.g. the summer season goes from 21 to 24 seconds). Only during the final stages of summer and annual assessments was disk I/O seen to saturate.
  • The macOS upgrade re-run shows ESP-r simulations are mixed or slightly slower and there is a hit for some post-processing tasks. It could be that a change in memory overhead or available buffer space caused some cases to switch from buffer use to direct file access during post-processing.
  • The macOS upgrade EnergyPlus timings are mixed but generally small enough to be within the range of noise from background processes. Post-processing times tended to be longer, suggesting some impact on disk I/O.

    Next steps

    There are gaps in the benchmarks:

  • We need to devise scripts that will exercise EnergyPlus parallel and sequential assessment tasks (a sketch of such a script is given after this list)
  • It would be good to include a test model in the order of 90 zones and 1000 surfaces
  • It would be good to add to the matrix virtual computer variants which have more memory.
  • It would be good to find a faster computer to include in the benchmarks. This is less straightforward than it would appear as simulation tools are not designed to take advantage of multiple processors and although the correlation with clock cycles is strong the definition of what constitutes the next epoch in hardware is open for debate.
  • It would be great to include more simulation tools in the comparison. Ensuring equivalence between models is a daunting task as many validation studies have concluded -- this would require considerable passion
  • This page is work-in-progress. I still occasionally notice typos in the tables.
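
    As a starting point for the first item, a sketch along these lines would time sequential against parallel EnergyPlus runs. The file names are hypothetical; the -w (weather), -d (output directory) and -r (run ReadVars) options are those of the EnergyPlus command line interface:

        import subprocess, time
        from concurrent.futures import ThreadPoolExecutor

        RUNS = [("apartment_%s.idf" % m, "out_%s" % m) for m in ("may", "jun", "jul", "aug")]

        def assess(run):
            idf, outdir = run
            subprocess.run(["energyplus", "-w", "weather.epw", "-d", outdir, "-r", idf],
                           check=True)

        t0 = time.perf_counter()
        for run in RUNS:                                    # one assessment at a time
            assess(run)
        print("sequential: %.0fs" % (time.perf_counter() - t0))

        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=4) as pool:     # one run per core
            list(pool.map(assess, RUNS))
        print("parallel:   %.0fs" % (time.perf_counter() - t0))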
