I am trying to use `pytorch` to perform simple calculations across multiple GPUs; I am not trying to train a machine learning model. I’ve posted this in the distributed forum here, but I haven’t gotten a response to one particular question. Here is the code I have so far:

```
import os

import pandas as pd
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F


def calc_cos_sims(rank, world_size):
    # The default env:// rendezvous needs these; any free port works
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    cuda_device = torch.device('cuda:' + str(rank))

    # Each process loads its own slice of the data
    data_path = './embed_pairs_df_million_part_' + str(rank) + '.pkl'
    tmp_df = pd.read_pickle(data_path)
    embeds_a_list = [embed_a for embed_a in tmp_df['embeds_a']]
    embeds_b_list = [embed_b for embed_b in tmp_df['embeds_b']]
    embeds_a_tensor = torch.tensor(embeds_a_list, device=cuda_device)
    embeds_b_tensor = torch.tensor(embeds_b_list, device=cuda_device)

    # One cosine similarity per row pair
    cosine_tensor = F.cosine_similarity(embeds_a_tensor, embeds_b_tensor)


def main():
    world_size = 4  # since I have 4 GPUs on a single machine
    mp.spawn(calc_cos_sims,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == '__main__':
    main()
```

Basically, the code calculates the cosine similarity between pairs of embeddings. I have 4 GPUs available, and I have split my data into 4 slices, one per GPU.
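
As a toy illustration of why the split should be safe: the similarity is computed row by row, so computing it per slice and concatenating gives the same result as computing it on the whole data. This is only a CPU sketch; the random tensors, their shapes, and the names `full`, `parts`, and `stitched` are assumptions standing in for the real embeddings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embeds_a = torch.randn(8, 16)  # toy stand-ins for the real embedding columns
embeds_b = torch.randn(8, 16)

# One similarity per row pair (cosine_similarity reduces over dim=1 by default).
full = F.cosine_similarity(embeds_a, embeds_b)

# The same computation done slice by slice, as each GPU process would do it.
num_slices = 4
parts = [
    F.cosine_similarity(a, b)
    for a, b in zip(embeds_a.chunk(num_slices), embeds_b.chunk(num_slices))
]

# Concatenating the per-slice results in slice order rebuilds the full vector.
stitched = torch.cat(parts)
```

Because the slices are independent, the only coordination the processes need is the final step of putting the per-slice results back together in rank order.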

It was recommended that I use the PyTorch collective communication API to aggregate the results. I read through the documentation, but I’m not entirely sure how to implement it. How would that be done in this case, or is there a better way to do all of this? I’d like to save the aggregated results to a file or have them available later in my program.
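
For reference, here is roughly what I imagine the aggregation step could look like. It is a minimal sketch, not a confirmed solution. It assumes every slice produces a result of the same length (which `dist.gather` requires), moves results to the CPU because `gloo` collectives are primarily CPU-based, and uses random placeholder data. The names `gather_to_rank0`, `worker`, and `cosine_sims.pt` are hypothetical.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F


def gather_to_rank0(local_result, rank, world_size):
    """Collect one equally sized tensor per rank onto rank 0."""
    local_result = local_result.cpu()  # gloo gather works on CPU tensors
    if rank == 0:
        buffers = [torch.empty_like(local_result) for _ in range(world_size)]
        dist.gather(local_result, gather_list=buffers, dst=0)
        return torch.cat(buffers)  # concatenated in rank order
    dist.gather(local_result, dst=0)
    return None  # only rank 0 holds the combined result


def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")  # assumed free port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Fall back to CPU when this rank has no matching GPU.
    use_cuda = torch.cuda.is_available() and rank < torch.cuda.device_count()
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

    # Placeholder data; the real code would load this rank's pickle here.
    embeds_a = torch.randn(8, 16, device=device)
    embeds_b = torch.randn(8, 16, device=device)
    local = F.cosine_similarity(embeds_a, embeds_b)

    combined = gather_to_rank0(local, rank, world_size)
    if rank == 0:
        torch.save(combined, "cosine_sims.pt")  # hypothetical output file
    dist.destroy_process_group()


def main(world_size=4):
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

`main()` would be called from under an `if __name__ == '__main__':` guard. Since `mp.spawn` does not return the workers' values, the parent process could read the result back with `torch.load("cosine_sims.pt")` once `main()` returns. If the slices differ in length, `dist.gather_object` might be a simpler fit, at the cost of pickling each result.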

I welcome any feedback about potential improvements. Thank you in advance!