@@ -160,7 +160,224 @@ print( 2.0*N*8.0/1.0e9, ' GB memory used' )
160160
161161### R
162162
163- Not implemented yet.
163+ Let's walk through the multi-threaded version of the dot product code
164+ below to illustrate the changes needed to parallelize it.
165+ In R we need to define a virtual cluster that will be used to spread the work
166+ from the **for** loops across the CPU cores. This can be done in several ways
167+ in R, and the choice should come down to what works best for the problem you
168+ are coding up.
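As a brief aside, here is a minimal sketch (not one of the lesson's example codes) of two common ways to create the virtual cluster with the **parallel** package: the default PSOCK cluster, which starts fresh worker processes that need variables exported to them, and a FORK cluster (Linux/macOS only), whose workers inherit the parent workspace. The variable names are just for illustration.

```R
library( parallel )

nThreads <- detectCores()

# Option 1: a PSOCK cluster (works on all platforms).  Workers start with an
# empty workspace, so variables must be sent to them with clusterExport().
cl <- makeCluster( nThreads )

# Option 2: a FORK cluster (Linux/macOS only).  Workers are forked copies of
# the parent process and so inherit its variables automatically.
# cl <- makeCluster( nThreads, type = "FORK" )

stopCluster( cl )
```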
169+
170+ We start by loading the **parallel** library in the
171+ first example code below to pull in the detectCores(), makeCluster(),
172+ clusterExport(), and clusterApply() functions.
173+ We next detect the number of cores accessible to the job,
174+ then define the cluster with **makeCluster()**, which spawns independent
175+ worker processes to handle parts of the parallel processing.
176+ For the **parallel** library we need the body of the loop to be put into a
177+ function, and any variables used inside this function must
178+ be exported to the workers with **clusterExport()** calls.
179+ The **clusterApply()** command takes the cluster object, the iteration range,
180+ and the function name, then runs the function for each iteration,
181+ automatically splitting the iterations across the
182+ cores in the virtual cluster on a single compute node.
183+ At the end there is a **stopCluster()** statement
184+ that cleans up the virtual cluster before the program ends.
185+
186+ This basic approach is simple and can be useful, but it may also be inefficient since
187+ the overhead of dividing the work among the workers may be much greater
188+ than the work done within each iteration, as is clearly the case in
189+ our simple example where there is only a single multiplication for each
190+ pass through the loop.
191+ In the second part of this code, the loop is instead
192+ divided over the number of threads, with each worker function splitting
193+ the loop iterations internally. This greatly limits the work-assignment
194+ overhead since only the initial assignment is needed, and you will see
195+ for yourself that the resulting performance is enormously better.
196+
197+ ```R
198+ # Dot product in R using a loop and a vector summation
199+ # USAGE: Rscript dot_product_threaded.R 100000 8   for 100,000 elements on 8 cores
200+
201+ library( parallel )
202+
203+ # Get the vector size and nThreads from the command line
204+
205+ args <- commandArgs( TRUE )
206+ if ( length( args ) == 2 ) {
207+    n        <- as.integer( args[1] )
208+    nThreads <- as.integer( args[2] )
209+ } else {
210+    n        <- 100000
211+    nThreads <- detectCores()
212+ }
213+
214+ cl <- makeCluster( nThreads )
215+
216+
217+ # Allocate space for and initialize the arrays
218+
219+ x <- vector( "double", n )
220+ y <- vector( "double", n )
221+
222+ for ( i in 1:n )
223+ {
224+    x[i] <- as.double( i )
225+    y[i] <- as.double( 3 * i )
226+ }
227+
228+ # Export the variables needed within the worker functions
229+
230+ clusterExport( cl, "x" )
231+ clusterExport( cl, "y" )
232+ clusterExport( cl, "n" )
233+ clusterExport( cl, "nThreads" )
234+
235+ # Time a multi-threaded dot product even though it's inefficient
236+
237+ dot_product_function <- function( i ) {
238+
239+    return( x[i] * y[i] )
240+
241+ }
242+
243+ dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing
244+
245+ t_start <- proc.time()[[3]]
246+
247+ dot_product_list <- clusterApply( cl, 1:n, dot_product_function )
248+ dot_product <- sum( unlist( dot_product_list ) )
249+
250+ t_end <- proc.time()[[3]]
251+
252+ print( sprintf( "Threaded dot product by clusterApply took %6.3f seconds", t_end - t_start ) )
253+ print( sprintf( "dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )
254+
255+
256+ # Now try dividing the iterations manually between workers
257+
258+ dot_product_workers <- function( myThread ) {
259+
260+    mySum <- 0.0
261+    for ( i in seq( myThread, n, by = nThreads ) )
262+    {
263+       mySum <- mySum + x[i] * y[i]
264+    }
265+    return( mySum )
266+
267+ }
268+
269+ dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing
270+
271+ t_start <- proc.time()[[3]]
272+
273+ dot_product_list <- clusterApply( cl, 1:nThreads, dot_product_workers )
274+ dot_product <- sum( unlist( dot_product_list ) )
275+
276+ t_end <- proc.time()[[3]]
277+
278+ print( sprintf( "Threaded dot product with nThreads workers took %6.3f seconds", t_end - t_start ) )
279+ print( sprintf( "dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )
280+
281+ stopCluster( cl )
282+ ```
283+
284+ This second multi-threaded example below uses the **foreach** and **doParallel**
285+ libraries. This code similarly defines and starts a virtual
286+ cluster. The **foreach** loop is similar to a **for** loop but you can
287+ choose between different back ends. A **%do%** back end runs the body
288+ serially, while **%dopar%** splits the iterations across the cores
289+ of the virtual cluster, and we will discuss later that there is a
290+ **doMPI** back end that can split the work across cores on different
291+ compute nodes.
292+ While similar to the previous example, the **foreach** approach is cleaner
293+ programming in that you don't have to create a separate function for the
294+ body of the loop. You also don't need to manually export variables, since
295+ **foreach** automatically makes the variables referenced in the loop body
296+ available to the worker processes.
297+ So we get more flexibility in the back ends as well as a more convenient
298+ programming approach. You'll be asked to measure the performance of each
299+ approach in the exercise below.
300+
301+ ```R
302+ # Dot product in R using a loop and a vector summation
303+ # USAGE: Rscript dot_product_threaded_dopar.R 100000 8   for 100,000 elements on 8 threads
304+
305+ library( foreach )
306+ library( iterators )
307+ library( parallel )
308+ library( doParallel )
309+
310+ # Get the vector size and nThreads from the command line
311+
312+ args <- commandArgs( TRUE )
313+ if ( length( args ) == 2 ) {
314+    n        <- as.integer( args[1] )
315+    nThreads <- as.integer( args[2] )
316+ } else {
317+    n        <- 100000
318+    nThreads <- detectCores()
319+ }
320+
321+ # Initialize the vectors and our virtual cluster
322+
323+ x <- vector( "double", n )
324+ y <- vector( "double", n )
325+
326+ for ( i in 1:n )
327+ {
328+    x[i] <- as.double( i )
329+    y[i] <- as.double( 3 * i )
330+ }
331+
332+ cl <- makeCluster( nThreads )
333+ registerDoParallel( cl, nThreads )
334+
335+ # Time the multi-threaded dot product foreach loop
336+ # This returns a vector of size 'n' that will need to be summed
337+ # so it is very inefficient.
338+
339+ dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing
340+
341+ t_start <- proc.time()[[3]]
342+
343+ # dot_product_vector <- foreach( i = 1:n, .combine = c, mc.preschedule = TRUE ) %dopar% {
344+ dot_product_vector <- foreach( i = 1:n, .combine = c ) %dopar% {
345+
346+    x[i] * y[i]
347+
348+ }
349+ dot_product <- sum( dot_product_vector )
350+
351+ t_end <- proc.time()[[3]]
352+
353+ print( sprintf( "dopar dot product took %6.3f seconds", t_end - t_start ) )
354+ print( sprintf( "dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )
355+
356+ # Now let's try a more complex but more efficient method where
357+ # we manually divide the work between the threads.
358+
359+ dummy <- matrix( 1:125000000 )   # Clear the cache buffers before timing
360+
361+ t_start <- proc.time()[[3]]
362+
363+ dot_product_vector <- foreach( myThread = 1:nThreads, .combine = c ) %dopar% {
364+
365+    psum <- 0.0
366+    for ( j in seq( myThread, n, nThreads ) ) {
367+       psum <- psum + x[j] * y[j]
368+    }
369+    psum
370+
371+ }
372+ dot_product <- sum( dot_product_vector )
373+
374+ t_end <- proc.time()[[3]]
375+
376+ print( sprintf( "dopar dot product with nThreads workers took %6.3f seconds", t_end - t_start ) )
377+ print( sprintf( "dot_product = %.6e on %i threads for vector size %i", dot_product, nThreads, n ) )
378+
379+ stopCluster( cl )
380+ ```
164381
165382### C
166383
@@ -390,7 +607,7 @@ This always leads to terrible scaling and should almost never be used.
390607
391608### Python
392609
393- Measure the execution time for the dot_product_threaded.py code
610+ Measure the execution time for the **dot_product_threaded.py** code
394611for 1, 4, 8, and 16 cores. If possible, use a job script
395612requesting 16 cores and do all runs in the same job.
396613You can look at the job scripts like **sb.ddot_py** in the **code**
@@ -401,7 +618,17 @@ run to see how close to ideal the performance is.
401618
402619### R
403620
404- Not implemented yet.
621+ Measure the performance of **dot_product_threaded.R** and
622+ **dot_product_threaded_dopar.R** for a given number of threads,
623+ such as 8 if you have that many cores available. You should be able
624+ to run both in a few minutes using 100,000 elements.
625+
626+ If you have time, try running a scaling study using a job script
627+ similar to **sb.ddot_R** in the **code** directory.
628+ This will let you see how each code scales with the number of
629+ cores used.
630+ Then calculate the speedup compared to the scalar (single-core)
631+ run to see how close to ideal the performance is.
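As an illustration of that last step, here is a minimal sketch of the speedup and parallel-efficiency calculation in R. The timing values are placeholders to show the arithmetic, not measured results; substitute the elapsed times from your own runs.

```R
# Wall-clock times in seconds for 1, 4, 8, and 16 cores
# (placeholder values -- replace them with your own measurements)
cores <- c( 1, 4, 8, 16 )
times <- c( 2.40, 0.66, 0.37, 0.26 )

speedup    <- times[1] / times     # ideal speedup equals the core count
efficiency <- speedup / cores      # fraction of ideal; 1.0 is perfect scaling

print( data.frame( cores, times, speedup, efficiency ) )
```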
405632
406633### C
407634
@@ -455,7 +682,38 @@ few computations being done in each pass through the loop.
455682
456683### R
457684
458- Not implemented yet.
685+ For 8 cores on a modern Intel processor I get 4.7 seconds for the
686+ first loop and 33 ms for the second loop in **dot_product_threaded.R**.
687+ Manually dividing the workload between our processes greatly reduces the
688+ overhead compared to letting R handle it. The extra programming we did
689+ is an absolute necessity in this case since we don't have much work in
690+ the body of the loop. It may be less necessary in more realistic
691+ applications, but this illustrates that the difference in overhead is enormous.
692+
693+ For the **dot_product_threaded_dopar.R** code, I measure 18.6 seconds for the
694+ first loop and 70 ms for the second, so again there is an enormous saving
695+ from manually dividing the work among the threads rather than having R
696+ schedule each iteration across the workers individually.
697+
698+ If you have time you can try increasing the workload greatly.
699+ Comment out the first loop in each code since it would take too long,
700+ then increase the number of elements by a factor of 100 to 10,000,000.
701+ My scaling studies now show **dot_product_threaded.R** getting
702+ a 3.65 times speedup on 4 cores, a 6.5 times speedup on 8 cores,
703+ and a 9.3 times speedup on 16 cores. These are now very reasonable.
704+ However, for **dot_product_threaded_dopar.R** I still measure over
705+ a second for a single core, and the time increases as I add more cores,
706+ so the overhead for this method still dominates the computations
707+ inside the loop.
708+ You can also measure the performance difference using the matrix
709+ multiplication codes if you wish.
710+
711+ The conclusion from all this is that while a plain **foreach**
712+ over the iterations gives simple, clean code, the **clusterApply()** approach or
713+ a **foreach** over the number of threads, with each worker splitting the
714+ iterations internally, provides much greater performance. If each
715+ iteration does enough calculation, however, the overhead may not
716+ matter.
459717
460718### C
461719
@@ -496,6 +754,6 @@ Not implemented yet.
496754
497755### Links for additional information
498756
499- * [github pymp](https://github.com/classner/pymp)
500757* [LLNL OpenMP tutorial](https://hpc-tutorials.llnl.gov/openmp/)
758+ * [github pymp](https://github.com/classner/pymp)
501759