add download from ploughshare #192

Open
andrpie wants to merge 1 commit into master from ploughshare_dijets_download

Conversation


@andrpie andrpie commented Mar 2, 2026

Addresses pinefarm#102, i.e. adds a script that downloads and converts the full color grids from ploughshare for each of the 5 dijet datasets.

The scripts work well on my device (macOS) but, of course, haven't been tested elsewhere.

Thank you @achiefa for the CMS 13TEV script!

@andrpie andrpie requested review from achiefa and scarlehoff March 2, 2026 13:50

achiefa commented Mar 3, 2026

Given that we also need to download the grids for single-inclusive jets, I can take advantage of this PR and add the corresponding scripts for the single-jet cases. I'll start with NNPDF/nnpdf#2407.

@scarlehoff

Thank you very much for this. I have just two (global) comments:

RE the download itself. Would it be possible to separate it from the per-grid script? In particular I am thinking you could add the list of links in something like a .yaml or .json (or .txt for all I care) file, in such a way that you don't need to repeat

FILENAME="applfast-atlas-dijets-v2-fc-fnlo-arxiv-1711.02692"
wget "https://ploughshare.web.cern.ch/ploughshare/db/applfast/$FILENAME/${FILENAME}.tgz"

in every script.
Then do the download in pinefarm as a small Python downloader, instead of doing it in bash (note that there's, e.g., no wget on macOS).
Downloading should not need conda or anything fancy. In this particular case the grids are applgrids, but in other cases they will be pineappl grids directly, so I think it is better if the download and the conversion are separated.
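For concreteness, a stdlib-only downloader along these lines could be as small as the sketch below (the helper names are illustrative, not an existing pinefarm API; only the one-URL-per-line `ploughshare_links.txt` idea comes from this thread):

```python
# Sketch of the suggested approach: keep one ploughshare URL per line in a
# plain text file and fetch each archive with the standard library, so
# neither wget nor conda is needed. Helper names are made up for this
# illustration.
from pathlib import Path
from urllib.request import urlretrieve

def read_links(links_file: Path) -> list[str]:
    """Return non-empty, non-comment lines from the links file."""
    lines = links_file.read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]

def download_all(links_file: Path, dest: Path) -> list[Path]:
    """Download every listed archive into dest, returning the local paths."""
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url in read_links(links_file):
        target = dest / url.rsplit("/", 1)[-1]  # e.g. <FILENAME>.tgz
        urlretrieve(url, target)
        fetched.append(target)
    return fetched
```

With something like this, each dataset only ships its own links file, and any future tweak (say, a custom user agent) lives in one place instead of being repeated per script.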

RE the conversion: I'm ok with this part. However, please remove the conda dependent piece from the script. The installation of pineappl should've been taken care of by pineappl. Leave only the part that is actually needed so that someone looking at the script in the future doesn't need to break it down to understand it.
Also rename the scripts to postrun.sh so that they are in sync with all the other postrun scripts in the repository.

Btw, in particular I'm very much against semi-inclusive checks like this:

if [ -z "$CONDA_BASE" ]; then
    for candidate in \
        "$HOME/miniforge3" \
        "$HOME/mambaforge" \
        "$HOME/miniconda3" \
        "$HOME/anaconda3" \
        "/opt/conda" \
        "/opt/miniforge3" \
        "/opt/miniconda3" \
        "/opt/anaconda3" \
        "/usr/local/miniconda3" \
        "/usr/local/anaconda3"
    do
        if [ -f "$candidate/etc/profile.d/conda.sh" ]; then
            CONDA_BASE="$candidate"
            break
        fi
    done
fi
if [ -z "$CONDA_BASE" ]; then
    echo "Error: Could not find conda installation" >&2
    exit 1
fi

It is practically impossible to be inclusive of every possible conda installation, so it is better not even to try. This might even activate the wrong conda installation, creating chaos on the target computer!


achiefa commented Mar 4, 2026

Hi @scarlehoff, thanks for your comment. Indeed we discussed this yesterday during the code meeting. I agree that the conda part is horrible, to say the least, but it was something I wrote in my own bash script, which was meant to be temporary and local. So I agree with you that the conda part must be removed altogether.

Downloading should not need conda or anything fancy and, for this particular case, they are applgrids but in other cases they will be directly pineappl grids so I think it is better if the download and the conversion are separated.

I see your point. However, I don't want to fall into the rabbit hole of building complex abstractions for such simple problems. In the end, this is meant to be a simple script that downloads and converts the grids, with some renaming conventions which must be set case by case. @andrpie and I will look into this, but at the moment it isn't my highest priority.

RE the conversion: I'm ok with this part. However, please remove the conda dependent piece from the script. The installation of pineappl should've been taken care of by pineappl. Leave only the part that is actually needed so that someone looking at the script in the future doesn't need to break it down to understand it.
Also rename the scripts to postrun.sh so that they are in sync with all the other postrun scripts in the repository.

RE this, I just want to ask for clarification. It's not clear to me how these scripts will be run. Are they meant to be run individually by hand, or do they enter an automated workflow that runs all the scripts for which grids should be "produced"? Honestly, I hadn't thought about this when I wrote the script because it wasn't clear to me. The goal was just to have somewhere to log the steps so that I didn't forget them.


scarlehoff commented Mar 4, 2026

I see your point. However, I don't want to fall into the rabbit hole of building complex abstractions for such simple problems

I hate complex abstractions; I'm happy if you do a simple one :_)
Also because if something changes in ploughshare (e.g., you need to set a Mozilla user agent in wget/curl to avoid being mistaken for an LLM bot), you would need to change it in every script.

RE this, I just want to ask for clarification. It's not clear to me how these scripts will be run. Are they meant to be run individually by hand, or do they enter an automated workflow that runs all the scripts for which grids should be "produced"? Honestly, I hadn't thought about this when I wrote the script because it wasn't clear to me. The goal was just to have somewhere to log the steps so that I didn't forget them.

In the ideal world I was thinking the following:

I download pinefarm and pinecards. Then I go and do

pinefarm run ATLAS_2JET_13TEV_DIF_MJJ-Y <and some theory file I guess...>

then pinefarm will read ploughshare_links.txt, which is just a txt file with one link per line.
When pinefarm sees there is a ploughshare_links.txt file in the pinecard, it does the download: it will loop over these links and download them using only Python primitives, or at worst curl.

Then it will run the pineappl import if necessary.

Then, after the download has finished it automatically runs the postrun.sh script, which will do all the complicated crap reorganization and conversion.

After that the metadata.txt will be burned into the grids.


So this is why I'd like to separate download and re-organization. The pineappl import part I'd put with the download because it seems to be quite painless as well. But it'd be just a call to subprocess, so it doesn't make much of a difference to leave it as part of postrun.sh.

This is just the picture I had in mind. The leading order thing for me is not to repeat the wget piece in every script tbh.
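Under this picture, the post-download step could look something like the following sketch. It is illustrative only: the `postprocess` helper, the `.appl` suffix convention, and the choice of PDF set are assumptions, not pinefarm's actual code; the `pineappl import <input> <output> <pdfset>` CLI form is what the pineappl tool provides for converting applgrids.

```python
# Illustrative sketch of the conversion/hand-off step: convert any
# applgrids with `pineappl import`, then run the pinecard's postrun.sh,
# which does the dataset-specific reorganization. All names here are
# assumptions for illustration.
import subprocess
from pathlib import Path

def postprocess(workdir: Path, downloaded: list[Path]) -> None:
    """Convert applgrids to pineappl format, then run the postrun script."""
    for grid in downloaded:
        if grid.suffix == ".appl":  # assumed marker for an applgrid file
            out = grid.with_suffix(".pineappl.lz4")
            # `pineappl import` converts the grid and cross-checks it
            # against the given PDF set
            subprocess.run(
                ["pineappl", "import", str(grid), str(out),
                 "NNPDF40_nnlo_as_01180"],
                check=True,
            )
    postrun = workdir / "postrun.sh"
    if postrun.exists():  # per-dataset reorganization lives here
        subprocess.run(["bash", "postrun.sh"], cwd=workdir, check=True)
```

The point of the split is visible here: the downloader never needs to know about conversion or renaming, and the postrun script never needs to know where the grids came from.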
