PageRank parallel solutions

Assignment 4 (70 Points)
Due: Friday by 11:59pm · Points: 70
Submitting: a file upload · Available Oct 4 at 12am - Dec 24 at 11:59pm
Due Friday October 11 @ 11:59PM

In this assignment, we will improve the parallel solutions of PageRank from Assignment 3. You will implement different task decomposition and mapping strategies and observe their effects. You can use the same serial versions of the pull-based and push-based PageRank from Assignment 3 to check the correctness of your results. You should use your parallel solutions "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp" as the starting point of this assignment. You do not need to download a separate tarball for this assignment, since everything you need was provided in Assignment 3.

You should have completed the Slurm Tutorial (https://canvas.sfu.ca/courses/84236/pages/slurm-tutorial), which walks you through how to use our servers for your code testing. If you still have trouble with Slurm, please post questions on the Canvas discussion board. It is strongly recommended to use Slurm for this assignment instead of the CSIL machines, since the graphs you need for checking your programs are on the /scratch/ drive on the cluster's compute nodes.

General Instructions

1. You are provided with the serial versions and some other tools in the same tarball from Assignment 3.

3. You will be asked to print the time spent by different threads on specific code regions. The time spent by any code region can be computed as follows:
timer t1; 
t1.start(); 
/* ---- Code region whose time is to be measured --- */ 
double time_taken = t1.stop(); 
4. If you need to time a sub-section inside a loop, you can do that as follows:

double time_taken = 0.0;
timer t1;
while (true) {
    /* ---- Code region whose time should not be measured --- */
    t1.start();
    /* ---- Code region whose time is to be measured --- */
    time_taken += t1.stop();
    /* ---- Code region whose time should not be measured --- */
}
std::cout << "Time spent on required code region : " << time_taken << "\n";
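The timer class itself ships in the core/ folder of the assignment package. If you want a mental model of what start() and stop() do, a minimal stand-in using std::chrono might look like the sketch below; only the start()/stop() interface is taken from the handout, the implementation here is an assumption:

```cpp
#include <chrono>

// Minimal stand-in for the provided timer class (the real one lives in core/;
// only the start()/stop() interface is taken from the assignment handout).
class timer {
    std::chrono::steady_clock::time_point t0;
public:
    void start() { t0 = std::chrono::steady_clock::now(); }
    // Returns seconds elapsed since the matching start().
    double stop() const {
        std::chrono::duration<double> d = std::chrono::steady_clock::now() - t0;
        return d.count();
    }
};
```

Use the core/ version in your submission; this sketch is only for understanding the measurement pattern.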
5. The programs operate on graph datasets. Sample input graphs are available at /scratch/input_graphs/ on the compute nodes (note that they are present on the compute nodes only, and hence you can access them via Slurm only).

$ ls /scratch/input_graphs/*.cs*
lj.csc            lj.csr            rmat.csc          rmat.csr
roadNet-CA.csc    roadNet-CA.csr    test_25M_50M.csc  test_25M_50M.csr
web-Google.csc    web-Google.csr

Please read Assignment 3's general instructions if you want to generate more test graphs.
1. Add a "strategy" Command Line Parameter

For your parallel programs "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp" from Assignment 3, you need to add a command line parameter "strategy" which defines the program's task decomposition and mapping strategy. If the parameter is set to 1, i.e., --strategy 1, your parallel program should run exactly as specified in Assignment 3. "strategy" values correspond to the strategies described in the following sections.

Strategy 1: Vertex-based decomposition and static mapping, the same strategy as your Assignment 3 programs. Vertices are statically distributed among threads such that each thread performs all computations on n/T vertices.
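For reference, the static n/T split can be computed once before the threads are created. The sketch below is one common way to do it (the function name and the remainder handling are our own; your Assignment 3 split may differ slightly when T does not divide n):

```cpp
#include <algorithm>
#include <utility>

// Static vertex-based decomposition (strategy 1): thread t gets a contiguous
// block of roughly n/T vertices; the first n % T threads take one extra vertex
// so that all n vertices are covered. Returns the half-open range [start, end).
std::pair<long, long> vertexRange(long n, int T, int t) {
    long base = n / T;
    long extra = n % T;
    long start = t * base + std::min<long>(t, extra);
    long count = base + (t < extra ? 1 : 0);
    return {start, start + count};
}
```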
2. Edge-based Decomposition for PageRank

Strategy 2: To achieve better load balance among threads, vertices are distributed among threads such that each thread performs computations on approximately m/T edges. Pull-based PageRank distributes vertices based on their in-degrees, while push-based PageRank distributes vertices based on their out-degrees. Both of your parallel programs need to support this strategy by setting the strategy parameter to 2 in the command line (--strategy 2). For example, the pull-based algorithm's pseudo-code is:
Create T threads
Distribute work between threads so that each thread gets to compute on approximately m/T edges
for(i=0; i<max_iterations; i++) {
    for each thread in parallel {
        for each vertex 'v' allocated to the thread {
            for vertex 'u' in inNeighbor(v)
                next_page_rank[v] += (current_page_rank[u]/outdegree[u])
        }
    }
    for each thread in parallel {
        for each vertex 'v' allocated to the thread {
            compute the new_pagerank using the accumulated values in next_page_rank[v].
            current_page_rank[v] = new_pagerank
            Reset next_page_rank[v] to 0
        }
    }
}
As an example of how edge-based task decomposition works, suppose a graph has the following vertex distribution <vertex, in-degree, out-degree>:

v1, 8, 1
v2, 5, 6
v3, 8, 3
v4, 2, 8
v5, 3, 5
v6, 1, 4
v7, 5, 6
v8, 4, 3

The graph has 36 edges. To distribute the work evenly among 4 threads, each thread needs to compute approximately 36/4 = 9 edges. Each thread gets assigned vertices until the total assigned edge count reaches (thread_id+1) * m/T. Examining the in-degrees, the pull-based algorithm has the following vertex distribution: Thread 0: v1, v2 (13 edges >= 1x9); Thread 1: v3 (8 edges, so 13+8 >= 2x9); Thread 2: v4, v5, v6 (6 edges, so 13+8+6 >= 3x9); Thread 3: v7, v8 (9 edges).

For the push-based algorithm, we can use the following vertex distribution based on out-degrees: Thread 0: v1, v2, v3 (10 edges >= 1x9); Thread 1: v4 (8 edges, so 10+8 >= 2x9); Thread 2: v5, v6 (9 edges, so 10+8+9 >= 3x9); Thread 3: v7, v8 (9 edges).

Note that the pull-based algorithm in the above example should have a similar runtime to the vertex-based decomposition, since the longest-running thread computes 13 edges in both cases. However, the push-based algorithm has a shorter runtime, since its longest-running thread computes 10 edges, while the vertex-based decomposition requires 11 edge computations for the longest thread.
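The assignment's rule ("assign vertices until the running edge count reaches (thread_id+1) * m/T") can be prototyped in a few lines and checked against the example above. The function name and the starts-array representation below are our own; note that integer division makes the thresholds approximate when T does not divide m:

```cpp
#include <vector>

// Edge-based decomposition (strategy 2): walk the vertices once, cutting a new
// thread's range whenever the running edge count reaches (t+1) * m / T.
// degrees[v] is the in-degree for the pull version, the out-degree for push.
// Returns the start vertex of each thread's range, plus a final sentinel = n,
// so thread t owns vertices [starts[t], starts[t+1]).
std::vector<int> edgeBasedStarts(const std::vector<int>& degrees, int T) {
    long m = 0;
    for (int d : degrees) m += d;
    std::vector<int> starts(T + 1, (int)degrees.size());
    starts[0] = 0;
    long acc = 0;
    int t = 0;
    for (int v = 0; v < (int)degrees.size() && t < T; ++v) {
        acc += degrees[v];
        if (acc >= (long)(t + 1) * m / T) {  // this thread has its ~m/T edges
            starts[++t] = v + 1;             // next thread starts after v
        }
    }
    return starts;
}
```

Running this on the example's in-degrees {8,5,8,2,3,1,5,4} with T = 4 reproduces the pull distribution above (thread ranges cut after v2, v3, and v6), and the out-degrees {1,6,3,8,5,4,6,3} reproduce the push distribution.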
3. Vertex-based Decomposition for PageRank with Dynamic Mapping

Strategy 3: To achieve even more load balance among threads, vertices are mapped to threads dynamically, based on the time at which different threads finish their tasks. Instead of allocating approximately equal numbers of vertices (or edges) to threads, we dynamically allocate work to each thread whenever it is free. In this strategy, each thread dynamically gets the next vertex to be computed until all the vertices are processed. Both of your parallel programs need to support this strategy by setting the strategy parameter to 3 in the command line (--strategy 3). Below is the pseudo-code showing dynamic task mapping for the push-based PageRank algorithm with vertex-based decomposition:
Create T threads
for each thread in parallel {
    for(i=0; i<max_iterations; i++) {
        while(true) {
            u = getNextVertexToBeProcessed();
            if(u == -1) break;
            edges_processed += outDegree(u) // used in output validation
            for vertex v in outNeighbor(u)
                next_page_rank[v] += (current_page_rank[u]/outdegree[u])
        }
        barrier1
        while(true) {
            v = getNextVertexToBeProcessed();
            if(v == -1) break;
            vertices_processed += 1 // used in output validation
            compute the new_pagerank using the accumulated values in next_page_rank[v].
            current_page_rank[v] = new_pagerank
            Reset next_page_rank[v] to 0
        }
        barrier2
    }
}
You need to implement the above dynamic mapping strategy in your solutions for both "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp".

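One common way to realize getNextVertexToBeProcessed() is a shared atomic counter that every thread increments, returning -1 once all n vertices have been handed out. This is a sketch under our own naming, not the required implementation; note that the counter must be reset between the two phases and between iterations, which the barriers in the pseudo-code make safe to do:

```cpp
#include <atomic>

// Dynamic vertex mapping (strategy 3): a shared atomic counter hands out one
// vertex per call; -1 signals that all n vertices have been claimed.
// reset() must be called between phases/iterations, protected by a barrier.
struct DynamicDispenser {
    std::atomic<long> next{0};
    long n;
    explicit DynamicDispenser(long n_) : n(n_) {}
    long getNextVertexToBeProcessed() {
        long v = next.fetch_add(1, std::memory_order_relaxed);
        return (v < n) ? v : -1;
    }
    void reset() { next.store(0); }
};
```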
4. Vertex-based Decomposition for PageRank with Coarse-Grained Dynamic Mapping

Strategy 4: To reduce the time each thread spends in getNextVertexToBeProcessed(), we will vary the task granularity so that each thread receives multiple vertices to be processed each time it calls getNextVertexToBeProcessed(). Both of your parallel programs need to support this strategy by setting the strategy parameter to 4 in the command line (--strategy 4). You need to update the dynamic load distribution logic as follows:

Each thread processes k vertices and then calls getNextVertexToBeProcessed(). Here, k determines the granularity of the work done by each thread before requesting new work.
 
For example:
If k = 1, the thread calls getNextVertexToBeProcessed() after processing each vertex (exactly the same way as strategy 3).
If k = 1000, the thread calls getNextVertexToBeProcessed() after processing 1000 vertices.

The getNextVertexToBeProcessed() function should return 0, k, 2k, ... depending on the granularity. k should be provided at run time using a command-line parameter, e.g.: --granularity 100
Below is the pseudo-code showing the logic of our parallel solution: 
k = 1000 // granularity
Create T threads
for each thread in parallel {
    for(i=0; i<max_iterations; i++) {
        while(true) {
            u = getNextVertexToBeProcessed()
            if(u == -1) break;
            for (j = 0; j < k; j++) {
                edges_processed += outDegree(u) // used in output validation
                for vertex v in outNeighbor(u)
                    next_page_rank[v] += (current_page_rank[u]/outdegree[u])
                u++
                if(u >= n) break; // n is the total number of vertices in the graph
            }
        }
        barrier1
        while(true) {
            v = getNextVertexToBeProcessed()
            if(v == -1) break;
            for (j = 0; j < k; j++) {
                vertices_processed += 1 // used in output validation
                compute the new_pagerank using the accumulated values in next_page_rank[v].
                current_page_rank[v] = new_pagerank
                Reset next_page_rank[v] to 0
                v++
                if(v >= n) break; // n is the total number of vertices in the graph
            }
        }
        barrier2
    }
}
This strategy should be used with the command-line parameter --granularity to specify the granularity. You need to add support for --granularity to both of your parallel programs "page_rank_pull_parallel.cpp" and "page_rank_push_parallel_atomic.cpp".

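The only change from the strategy 3 counter idea is the step size: advancing the shared counter by k makes successive calls return 0, k, 2k, ... as required, and each caller then processes vertices [v, min(v + k, n)). Again a sketch under assumed names, not the required implementation:

```cpp
#include <atomic>

// Coarse-grained dynamic mapping (strategy 4): each call claims a chunk of k
// vertices, so the shared counter advances in steps of k and successive calls
// return 0, k, 2k, ...; -1 signals that all n vertices have been claimed.
struct ChunkedDispenser {
    std::atomic<long> next{0};
    long n, k;
    ChunkedDispenser(long n_, long k_) : n(n_), k(k_) {}
    long getNextVertexToBeProcessed() {
        long v = next.fetch_add(k, std::memory_order_relaxed);
        return (v < n) ? v : -1;
    }
    void reset() { next.store(0); }
};
```

With k = 1 this degenerates to the strategy 3 behavior, which is consistent with the handout's note that k = 1 matches strategy 3 exactly.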
Input and Output Formats:

1. Your program files should be named page_rank_pull_parallel.cpp and page_rank_push_parallel_atomic.cpp and should support the following command-line parameters:
--nThreads : The number of threads.
--nIterations : The number of iterations (similar to the serial code). Default is 10.
--inputFile : The absolute path to the input graph file (similar to the serial code).
--strategy : A value of either 1, 2, 3, or 4, corresponding to the strategies defined above. Any other value should lead to program termination. The default strategy is 1.
--granularity : A positive integer to be used only in strategy 4. This parameter is ignored for other strategies. You need to check that granularity is always a positive integer. Zero, negative, and non-integer values lead to program termination. The default granularity is 1.
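Since zero, negative, and non-integer granularity values must terminate the program, the --granularity argument needs real validation rather than a bare atoi(). A hypothetical helper (the name and the -1 error convention are ours; assumes the raw string has already been pulled off argv) using strtol:

```cpp
#include <cstdlib>

// Hypothetical helper: parse --granularity's argument string, returning -1 for
// anything that is not a strictly positive integer (the caller should then
// terminate the program, per the assignment's parameter rules).
long parseGranularity(const char* s) {
    char* end = nullptr;
    long k = std::strtol(s, &end, 10);
    if (end == s || *end != '\0' || k <= 0) return -1;
    return k;
}
```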
2. Your parallel solution must output the following information:
Total number of threads used.
Number of iterations.
Strategy used.
Granularity used.
For each thread:
    Thread id (your threads should be numbered between [0, T)).
    Number of vertices processed (across all iterations) - This is the number of vertices processed only in the second for loop, across all iterations. Refer to the pseudo-code of PageRank above.
    Number of edges processed (across all iterations) - This is the number of edges processed only in the first for loop, across all iterations. Refer to the pseudo-code of PageRank above.
    Cumulative time spent waiting at barrier1 (in seconds).
    Cumulative time spent waiting at barrier2 (in seconds).
    Cumulative time spent waiting in getNextVertexToBeProcessed() (in seconds).
    Time taken by the thread (in seconds).
The sum of the pageranks of all vertices.
The total time taken for the entire execution, including the time used in allocating vertices or edges to threads.
3. The sample console output is below. Please note that you should use this exact format for all 4 parts. If an output is not generated (e.g., getNextVertex time in strategies 1 and 2), print 0.0 instead.
Using DOUBLE 
Number of Threads : 4 
Strategy : 4 
Granularity : 1 
Iterations : 20 
Reading graph 
Created graph 
thread_id, num_vertices, num_edges, barrier1_time, barrier2_time, getNextVertex_time, total_time 
0, 4576646, 26069763, 0.001790, 0.001063, 4.067461, 10.366047 
1, 4625648, 23284082, 0.001947, 0.001063, 4.107634, 10.366017 
2, 4617670, 26393229, 0.001654, 0.001049, 4.029203, 10.365975 
3, 4508596, 26353706, 0.001668, 0.001059, 4.101155, 10.365950 
Sum of page ranks : 618961.562500 
Time taken (in seconds) : 10.366616 
Testing

We are not providing a testing script for this assignment. However, you should strictly adhere to the format above, including all the spaces and commas. You can test your program using command lines similar to the test cases of Assignment 3, while adding the strategy parameter. We will not disclose the test cases used for grading before the assignment due date.

Assignment Report

In addition to your parallel code, you need to submit a report (in pdf format) that answers the following two questions:

Q1. Run each of your two parallel programs with strategy 4 above using 8 threads on the test_25M_50M data set with granularities of 10, 100, 1000, 2000. Each of your parallel programs should run 3 times, each run including 20 iterations, on the slow cluster using the default floating point data type. [Total number of runs is 2 (parallel versions) x 4 (different granularities) x 3 (number of runs for each version/granularity combination) = 24 runs]

Plot a graph with average execution time on the y-axis and granularity on the x-axis, where each granularity has 2 bars, one for the average runtime of each of your parallel versions. Your graph should look something like this (obviously with different values):

Q2. Based on the data from the graph in Q1, which of the four granularities is better for the pull algorithm? And which is better for the push_atomic algorithm?
Submission Guidelines

Make sure that your solutions folder has the following files and sub-folders. Let's say your solutions folder is called my_assignment4_solutions. It should contain:
core/ -- The folder containing all core files. It is already available in the assignment package. Do not modify it or remove any files.
Makefile -- Makefile for the project. This file should not be changed.
page_rank_pull_parallel.cpp
page_rank_push_parallel_atomic.cpp
report.pdf -- A pdf file that includes answers to the questions in the previous section.
To create the submission file, follow the steps below:
1. Enter your solutions folder and remove all the object/temporary files.
$ cd my_assignment4_solutions/
$ make clean
$ rm -rf input_graphs (or mv input_graphs ../ to avoid large tarballs)
2. Create the tar.gz file.
$ tar cvzf assignment4.tar.gz *
This creates a compressed tarball that contains the contents of the folder. Submit it via Canvas by the deadline.