
Commit 03dcea8

added R examples for chapters 2-6
1 parent de024ac commit 03dcea8

10 files changed: +592 -49 lines changed

episodes/data/code.tar.gz (14.6 KB): Binary file not shown.

episodes/data/code.zip (8.49 KB): Binary file not shown.

episodes/message-passing.md

Lines changed: 16 additions & 3 deletions
@@ -201,7 +201,18 @@ if ( myrank == 0 ): # Only rank 0 will print results

### R

-Not implemented yet.

* [Rmpi documentation](https://cran.r-project.org/web/packages/Rmpi/Rmpi.pdf)

While there is an MPI interface for R called **Rmpi**, it was originally
developed for LAM MPI, which has not been actively developed in 20 years, and
the documentation still cites LAM commands. Although the package is getting
some updates, it is strongly recommended to avoid it. An **Rmpi**-based
version, **dot_product_doMPI.R** in the **code** directory, when run on a
modern Linux system with an up-to-date R 4.3.2 and a GNU build of OpenMPI
5.0.3, spawns processes but never returns from the **startMPIcluster()** call.
**Rmpi** can also be used to write explicit MPI code. If **Rmpi** can be made
to work, it would bring the ability to spread work across multiple compute
nodes as well as multiple cores within each node, but the performance is
unknown.

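For reference, here is a minimal sketch of what a **doMPI** version of the dot
product can look like. This is illustrative only and is not necessarily
identical to **dot_product_doMPI.R** in the **code** directory; it assumes a
working **Rmpi**/**doMPI** installation launched through **mpirun**, which is
exactly the setup that hung at **startMPIcluster()** in the test above.

```R
# Hypothetical doMPI sketch -- assumes a working Rmpi/doMPI install,
# which is exactly what failed in the test described above.
# Launch with something like:  mpirun -np 8 Rscript dot_product_doMPI.R

library( foreach )
library( doMPI )

cl <- startMPIcluster()   # This is the call that hung in our tests
registerDoMPI( cl )

n <- 100000
x <- as.double( 1:n )
y <- as.double( 3 * (1:n) )

# Each worker computes a partial sum over a strided slice of the vectors,
# mirroring the manual work splitting used in the multi-threaded examples

nWorkers <- clusterSize( cl )
partial_sums <- foreach( w = 1:nWorkers, .combine = c ) %dopar% {
   idx <- seq( w, n, by = nWorkers )
   sum( x[idx] * y[idx] )
}
dot_product <- sum( partial_sums )

print( sprintf( "dot_product = %.6e on %i workers", dot_product, nWorkers ) )

closeCluster( cl )
mpi.quit()
```

If this does run on your system, the lesson about manually splitting the
iterations applies even more strongly here, since each piece of work must
travel between processes over MPI.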
### C

@@ -445,7 +456,7 @@ the Python matrix multiply code and test the scaling.

### R

-Not implemented yet.

**Rmpi** is not recommended.

### C

@@ -559,9 +570,11 @@ Not implemented yet.

### Links for additional information

* [LLNL MPI tutorial](https://hpc-tutorials.llnl.gov/mpi/)
* [MPICH user guides](https://www.mpich.org/documentation/guides/)
* [OpenMPI function man pages](https://www.open-mpi.org/doc/)
* [mpi4py](https://mpi4py.readthedocs.io/)
* [CRAN Rmpi](https://cran.r-universe.dev/Rmpi)

episodes/multi-threaded.md

Lines changed: 263 additions & 5 deletions
@@ -160,7 +160,224 @@ print( 2.0*N*8.0/1.0e9, ' GB memory used' )

### R

-Not implemented yet.

Let's go through the multi-threaded version of the dot product code below to
illustrate the changes that had to be made to parallelize it. In R we need to
define a virtual cluster that will be used to spread the work from the **for**
loops across the CPU cores. This can be done in several ways in R, and the
choice should come down to what works best for the problem you are coding up.

We start by loading the **parallel** library in the first example code below
to pull in the **detectCores()**, **makeCluster()**, **clusterExport()**, and
**clusterApply()** functions. We next detect the number of cores accessible to
the job, then define the cluster with **makeCluster()**, which spawns
independent worker processes to handle parts of the parallel processing. For
the **parallel** library we need the body of the loop to be put into a
function, and any variables used inside this function must be exported using
**clusterExport()** commands. The **clusterApply()** command takes the cluster
object, the iteration range, and the function name, then spawns multiple
processes to execute the function over the iteration range, automatically
splitting the iterations across the cores in the virtual cluster on a single
compute node. At the end there is a **stopCluster()** statement that cleans up
the virtual cluster before the program ends.

This basic approach is simple and can be useful, but it may also be
inefficient, since the overhead for dividing the work between threads may be
much greater than the work done within each iteration, as is clearly the case
in our simple example where there is only a single multiplication for each
pass through the loop. In the second part of this code, the loop is instead
divided over the number of threads, with the function manually splitting the
loop iterations internally. This greatly limits the assignment-of-work
overhead, since only the initial assignment is needed, and you will see for
yourself that the resulting performance is enormously better.

```R
# Dot product in R using a loop and a vector summation
# USAGE: Rscript dot_product_multi_thread.R 100000 8 for 100,000 elements on 8 cores

library( parallel )

# Get the vector size and nThreads from the command line

args <- commandArgs(TRUE)
if( length( args ) == 2 ) {
   n <- as.integer( args[1] )
   nThreads <- as.integer( args[2] )
} else {
   n <- 100000
   nThreads <- detectCores()
}

cl <- makeCluster( nThreads )

# Allocate space for and initialize the arrays

x <- vector( "double", n )
y <- vector( "double", n )

for( i in 1:n )
{
   x[i] <- as.double(i)
   y[i] <- as.double(3*i)
}

# Export variables needed within the functions

clusterExport( cl, "x" )
clusterExport( cl, "y" )
clusterExport( cl, "n" )
clusterExport( cl, "nThreads" )

# Time a multi-threaded dot product even though it's inefficient

dot_product_function <- function( i ) {
   return( x[i] * y[i] )
}

dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing

t_start <- proc.time()[[3]]

dot_product_list <- clusterApply( cl, 1:n, dot_product_function )
dot_product <- sum( unlist(dot_product_list) )

t_end <- proc.time()[[3]]

print(sprintf("Threaded dot product by clusterApply took %6.3f seconds", t_end-t_start))
print(sprintf("dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )

# Now try dividing the iterations manually between workers

dot_product_workers <- function( myThread ) {
   mySum <- 0.0
   for( i in seq( myThread, n, by = nThreads ) )
   {
      mySum <- mySum + x[i] * y[i]
   }
   return( mySum )
}

dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing

t_start <- proc.time()[[3]]

dot_product_list <- clusterApply( cl, 1:nThreads, dot_product_workers )
dot_product <- sum( unlist(dot_product_list) )

t_end <- proc.time()[[3]]

print(sprintf("Threaded dot product with nThreads workers took %6.3f seconds", t_end-t_start))
print(sprintf("dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )

stopCluster( cl )
```

This second multi-threaded example below uses the **foreach** and
**doParallel** libraries. This code similarly defines and initiates a virtual
cluster. The **foreach** loop is similar to a **for** loop, but you can choose
between different back ends. A **%do%** back end runs the body serially, while
**%dopar%** splits the iterations across the cores of the virtual cluster, and
we will discuss later that there is a **doMPI** back end that can split the
work across cores on different compute nodes. While similar to the previous
example, the **foreach** approach is cleaner programming in that you don't
have to create a separate function for the body of the loop. You also don't
need to manually export variables, since the processes that are spawned
inherit the environment of the parent process. So we get more flexibility in
the back ends as well as a more convenient programming approach. You'll be
asked to measure the performance of each approach in the exercise below.

```R
# Dot product in R using a loop and a vector summation
# USAGE: Rscript dot_product_threaded_dopar.R 100000 8 for 100,000 elements on 8 threads

library( foreach )
library( iterators )
library( parallel )
library( doParallel )

# Get the vector size and nThreads from the command line

args <- commandArgs(TRUE)
if( length( args ) == 2 ) {
   n <- as.integer( args[1] )
   nThreads <- as.integer( args[2] )
} else {
   n <- 100000
   nThreads <- detectCores()
}

# Initialize the vectors and our virtual cluster

x <- vector( "double", n )
y <- vector( "double", n )

for( i in 1:n )
{
   x[i] <- as.double(i)
   y[i] <- as.double(3*i)
}

cl <- makeCluster( nThreads )
registerDoParallel( cl, nThreads )

# Time the multi-threaded dot product foreach loop.
# This returns a vector of size 'n' that will need to be summed,
# so it is very inefficient.

dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing

t_start <- proc.time()[[3]]

#dot_product_vector <- foreach( i = 1:n, .combine = c, mc.preschedule = TRUE ) %dopar% {
dot_product_vector <- foreach( i = 1:n, .combine = c ) %dopar% {
   x[i] * y[i]
}
dot_product <- sum( dot_product_vector )

t_end <- proc.time()[[3]]

print(sprintf("dopar dot product took %6.3f seconds", t_end-t_start))
print(sprintf("dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )

# Now let's try a more complex but more efficient method where
# we manually divide the work between the threads.

dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing

t_start <- proc.time()[[3]]

dot_product_vector <- foreach( myThread = 1:nThreads, .combine = c ) %dopar% {
   psum <- 0.0
   for( j in seq( myThread, n, nThreads ) ) {
      psum <- psum + x[j] * y[j]
   }
   psum
}
dot_product <- sum( dot_product_vector )

t_end <- proc.time()[[3]]

print(sprintf("dopar dot product with nThreads workers took %6.3f seconds", t_end-t_start))
print(sprintf("dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )

stopCluster( cl )
```

### C

@@ -390,7 +607,7 @@ This always leads to terrible scaling and should almost never be used.

### Python

Measure the execution time for the **dot_product_threaded.py** code
for 1, 4, 8, and 16 cores. If possible, use a job script
requesting 16 cores and do all runs in the same job.
You can look at the job scripts like **sb.ddot_py** in the **code**
@@ -401,7 +618,17 @@ run to see how close to ideal the performance is.

### R

-Not implemented yet.

Measure the performance of **dot_product_threaded.R** and
**dot_product_threaded_dopar.R** for a given number of threads, such as 8 if
you have that many cores available. You should be able to run both in a few
minutes using 100,000 elements.

If you have time, try running a scaling study using a job script similar to
**sb.ddot_R** in the **code** directory, as sketched below. This will allow us
to see how each code scales with the number of cores used. Then calculate the
speedup compared to the scalar (single-core) run to see how close to ideal the
performance is.

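The contents of **sb.ddot_R** are not reproduced here, so the sketch below is
only a guess at what such a scaling-study script can look like on a Slurm
system; the module name, resource requests, and time limit are assumptions you
will need to adapt to your own cluster.

```bash
#!/bin/bash
# Hypothetical Slurm scaling-study script; the real sb.ddot_R in the
# code directory may differ.  Adjust the module name and resource
# requests for your cluster.
#SBATCH --job-name=ddot_R
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:30:00

module load R   # Site-specific; adjust as needed

# Run both codes on 1, 4, 8, and 16 cores with 100,000 elements
for nc in 1 4 8 16
do
    echo "=== $nc cores ==="
    Rscript dot_product_threaded.R 100000 $nc
    Rscript dot_product_threaded_dopar.R 100000 $nc
done
```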
### C

@@ -455,7 +682,38 @@ few computations being done in each pass through the loop.

### R

-Not implemented yet.

For 8 cores on a modern Intel processor, I get 4.7 seconds for the first loop
and 33 ms for the second loop in **dot_product_threaded.R**. Manually dividing
the workload between our processes greatly reduces the overhead compared to
letting R handle it. The extra programming we did is an absolute necessity in
this case, since we don't have much work in the body of the loop. It may be
less necessary in more realistic applications, but this illustrates that the
difference in overhead is enormous.

For the **dot_product_threaded_dopar.R** code, I measure 18.6 seconds for the
first loop and 70 ms for the second, so again there is an enormous saving in
overhead by manually dividing the work among the threads to limit the overhead
from R scheduling the iterations across the workers.

If you have time, you can try increasing the workload greatly. Comment out the
first loop in each code, since that would take too long, then increase the
number of elements by a factor of 100 to 10,000,000. My scaling studies now
show **dot_product_threaded.R** getting a 3.65 times speedup on 4 cores, a 6.5
times speedup on 8 cores, and a 9.3 times speedup on 16 cores. These are now
very reasonable. However, for **dot_product_threaded_dopar.R** I still measure
over a second for a single core, and the time increases as I add more cores,
so the overhead for this method still dominates the computations inside the
loop. You can also measure the performance difference using the matrix
multiplication codes if you wish.

The conclusion from all this is that while a **foreach** loop is simple and
clean code, the **clusterApply()** approach, or a **foreach** over the number
of threads with the iterations split manually inside the function, provides
much greater performance. If each iteration is doing enough calculations, then
the overhead may not matter.

### C

@@ -496,6 +754,6 @@ Not implemented yet.

### Links for additional information

* [LLNL OpenMP tutorial](https://hpc-tutorials.llnl.gov/openmp/)
* [github pymp](https://github.com/classner/pymp)
