@@ -203,16 +203,82 @@ if ( myrank == 0 ): # Only rank 0 will print results
 
 * [Rmpi documentation](https://cran.r-project.org/web/packages/Rmpi/Rmpi.pdf)
 
-While there is an MPI interface for R called **Rmpi**, it was originally developed for
-LAM MPI, which has not been actively developed in 20 years, and the documentation
-still cites LAM commands.
-Although it is getting some updates, it is strongly recommended to avoid
-this package. An **Rmpi** version, **dot_product_doMPI.R** in the **code** directory,
-run on a modern Linux system with up-to-date R 4.3.2 and a GNU build of OpenMPI 5.0.3
-spawns processes but never returns from the **startMPIcluster()** call. **Rmpi** can
-also be used to write explicit MPI code. If **Rmpi** could be made to work it would
-bring the ability to spread work across multiple compute nodes as well as multiple
-cores within each node, but the performance is unknown.
+The **Rmpi** package was developed about 20 years ago but has been updated every few
+years to stay compatible with current versions of R and OpenMPI (although my tests
+failed with OpenMPI 5.0.3, so I had to fall back to the older OpenMPI 4.1.6).
+It is also the foundation for the **doMPI** back end, which can easily be slipped into a
+program that uses **foreach** loops with **%dopar%**, allowing the code to run
+on cores across multiple compute nodes (a sketch of that pattern follows the
+explicit MPI example below).
+**Rmpi** also provides wrapped MPI commands for programmers who wish to
+write explicit MPI programs in R.
+
+```R
+# Do the dot product between two vectors X and Y then print the result
+# USAGE: mpirun -np 4 Rscript dot_product_message_passing.R 100000
+# This will run 100,000 elements on 4 cores, possibly spread over multiple compute nodes
+# must install.packages("Rmpi") first
+
+library( Rmpi )                        # This does the MPI_Init() behind the scenes
+
+# Get the vector size from the command line
+
+args <- commandArgs( TRUE )
+if ( length( args ) == 1 ) {
+    n <- as.integer( args[1] )
+} else {
+    n <- 100000
+}
+
+# Get my rank and the number of ranks (MPI talks about ranks instead of threads)
+
+com    <- 0                            # MPI_COMM_WORLD or all ranks
+nRanks <- mpi.comm.size( com )         # The number of ranks (threads)
+myRank <- mpi.comm.rank( com )         # Which rank am I ( 0 .. nRanks-1 )
+
+if ( (n %% nRanks) != 0 ) {
+    print( "Please ensure the vector size is divisible by the number of ranks" )
+    quit()
+}
+myElements <- n / nRanks
+
+# Allocate space and initialize the reduced arrays for each rank
+
+x <- vector( "double", myElements )
+y <- vector( "double", myElements )
+
+j <- 0
+for ( i in seq( myRank + 1, n, nRanks ) )
+{
+    j <- j + 1
+    x[j] <- as.double( i )
+    y[j] <- as.double( 3 * i )
+}
+
+# Clear cache then barrier sync so all ranks are ready, then time
+
+dummy <- matrix( 1:125000000 )         # Clear the cache buffers before timing
+
+ret <- mpi.barrier( com )              # mpi.barrier() returns 1 if successful
+
+t_start <- proc.time()[[3]]
+
+p_sum <- 0.0
+for ( i in 1:myElements )
+{
+    p_sum <- p_sum + x[i] * y[i]
+}
+
+dot_product <- mpi.allreduce( p_sum, type = 2, op = "sum", comm = com )   # type 2 = double
+
+t_end <- proc.time()[[3]]
+
+if ( myRank == 0 ) {
+    print( sprintf( "Rmpi dot product with %d workers took %6.3f seconds", nRanks, t_end - t_start ) )
+    print( sprintf( "dot_product = %.6e on %i MPI ranks for vector size %i", dot_product, nRanks, n ) )
+}
+
+mpi.quit()
+```
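+
+For comparison, here is a minimal sketch of how the **doMPI** back end slips into a
+**foreach** loop with **%dopar%**. This is only an illustration of the pattern; the
+file name, chunking scheme, and worker count below are my own choices and not
+necessarily what the **dot_product_doMPI.R** code in the **code** directory does.
+
+```R
+# Minimal doMPI sketch of the same dot product (illustrative, not the repo version)
+# USAGE: mpirun -np 4 Rscript dot_product_doMPI_sketch.R
+# must install.packages("doMPI") first (pulls in Rmpi, foreach, and iterators)
+
+library( doMPI )                   # attaches Rmpi and foreach as well
+
+cl <- startMPIcluster()            # rank 0 becomes the master, the other ranks become workers
+registerDoMPI( cl )                # make %dopar% dispatch its iterations through MPI
+
+n       <- 100000
+nChunks <- getDoParWorkers()       # one strided chunk of the vectors per worker
+
+x <- as.double( 1:n )
+y <- as.double( 3 * (1:n) )
+
+# Each iteration computes a partial dot product; .combine = "+" adds them together
+dot_product <- foreach( w = 1:nChunks, .combine = "+" ) %dopar% {
+    idx <- seq( w, n, nChunks )    # the strided chunk handled by worker w
+    sum( x[idx] * y[idx] )
+}
+
+print( sprintf( "doMPI dot product = %.6e on %d workers", dot_product, nChunks ) )
+
+closeCluster( cl )                 # shut the workers down cleanly
+mpi.quit()
+```
+
+One design note: **foreach** exports the full **x** and **y** vectors to every worker,
+which is one reason a **doMPI** loop like this carries more communication overhead than
+the explicit message-passing version above, where each rank initializes only its own chunk.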
 
 ### C
 
@@ -456,7 +522,13 @@ the Python matrix multiply code and test the scaling.
 
 ### R
 
-**Rmpi** is not recommended.
+Measure the execution time of the **dot_product_message_passing.R** code
+for 1, 4, 8, and 16 cores on a single compute node to compare with the other
+parallelization methods available in R.
+If you are on an HPC system with multiple nodes, try running the same
+tests on 2 or 4 compute nodes for comparison.
+You can also try running the **dot_product_doMPI.R** code to see how
+it compares to using explicit **MPI** programming in R.
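+
+One way to compare the runs is to tabulate the speedup (T1/Tp) and parallel
+efficiency from the elapsed times the script prints. The helper below is my own
+sketch and is not part of the repository code.
+
+```R
+# Tabulate scaling results (illustrative helper, not part of the repo code)
+# Fill in the elapsed times you measured, in seconds, for each core count
+
+cores <- c( 1, 4, 8, 16 )
+times <- c( NA, NA, NA, NA )       # e.g. taken from the "took %6.3f seconds" output
+
+speedup    <- times[1] / times     # speedup relative to 1 core
+efficiency <- speedup / cores      # perfect scaling gives an efficiency of 1.0
+
+print( data.frame( cores, seconds = times, speedup, efficiency ) )
+```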
 
 ### C
 
@@ -509,7 +581,27 @@ the added global summation after the loop.
 
 ### R
 
-Not implemented yet.
+For the single-node tests I used 10,000,000-element vectors to
+get a good result, with enough work to expect reasonable scaling.
+Tests with smaller vectors illustrate the differences in
+overhead better, but they are less indicative of the performance of most
+real applications.
+
+For the **dot_product_message_passing.R** code I got 481 ms for 1 core,
+132 ms for 4 cores, 68 ms for 8 cores, and 49 ms for 16 cores (speedups of
+roughly 3.6x, 7.1x, and 9.8x), showing good performance and scaling. This is
+expected, since the only communication is the global summation at the end.
+The **dot_product_doMPI.R** code took 1.1 seconds, 0.85 seconds, 4.8 seconds,
+and 8.4 seconds respectively, showing much poorer performance that actually
+got worse as more cores were used. The overhead was simply too great, so while
+using the **doMPI** back end is much easier than using explicit MPI commands,
+the performance and scaling are much worse.
+
+Running on 4 nodes with 4 cores each, I got 53 ms for **dot_product_message_passing.R**
+compared to 49 ms on a single node, which is very good, but again the only
+communication is the global summation at the end.
+I have not yet managed to get the **dot_product_doMPI.R** code to run on
+multiple nodes.
 
 ### C
 