pairtools merge fails with "Argument list too long" #274

@aringeri

When running pairtools merge with a large list of files (~4000 files), I received the following error:

Traceback (most recent call last):
  File "/home/epi2melabs/conda/bin/pairtools", line 11, in <module>
    sys.exit(cli())
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/pairtools/cli/merge.py", line 134, in merge
    merge_py(
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/pairtools/cli/merge.py", line 254, in merge_py
    subprocess.check_call(command, shell=True, stdout=outstream)
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 359, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 1720, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/bin/sh'

I am running pairtools through the wf-pore-c Nextflow pipeline, which internally calls pairtools with the following command (my ~4000 files are in the to_merge/ directory). See source

pairtools merge -o output.pairs.gz --concatenate 'to_merge/*'

I can see from the pairtools source code that the merge function iterates over all the files given to it and appends each one to a single subprocess command:

for path in paths:
    if kwargs.get("cmd_in", None):
        command += r""" <(cat {} | {} | sed -n -e '\''/^[^#]/,$p'\'')""".format(
            path, kwargs["cmd_in"]
        )
    elif path.endswith(".gz"):
        command += (
            r""" <(bgzip -dc -@ {} {} | sed -n -e '\''/^[^#]/,$p'\'')""".format(
                kwargs["nproc_in"], path
            )
        )
    elif path.endswith(".lz4"):
        command += r""" <(lz4c -dc {} | sed -n -e '\''/^[^#]/,$p'\'')""".format(
            path
        )
    else:
        command += r""" <(sed -n -e '\''/^[^#]/,$p'\'' {})""".format(path)
command += "'"
subprocess.check_call(command, shell=True, stdout=outstream)
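To get a rough sense of how quickly this string grows, here is a back-of-the-envelope estimate. It is only a sketch: the file names are hypothetical stand-ins for my real inputs, it uses the shortest (uncompressed) fragment from the loop, and the 128 KiB figure is my assumption for the per-argument limit on a typical Linux system, not something I measured on this machine.

```python
# Per-file fragment for an uncompressed input, copied from the final
# `else` branch of the loop quoted above.
FRAGMENT = r""" <(sed -n -e '\''/^[^#]/,$p'\'' {})"""

# Assumption: with shell=True the whole command is passed to the shell as one
# argument, and a single argument is commonly capped at 128 KiB on Linux.
ASSUMED_SINGLE_ARG_LIMIT = 128 * 1024


def estimate_command_length(paths):
    """Sum the length of the per-file fragments appended to the command."""
    return sum(len(FRAGMENT.format(path)) for path in paths)


# Hypothetical file names standing in for the ~4000 files in to_merge/.
example_paths = ["to_merge/part_{:04d}.pairs".format(i) for i in range(4000)]

estimated = estimate_command_length(example_paths)
print("estimated command length:", estimated)
print("exceeds assumed limit:", estimated > ASSUMED_SINGLE_ARG_LIMIT)
```

The .gz branch adds a longer fragment per file, so my real command is presumably larger than this estimate.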

This triggers the OSError once the command string grows too long for the operating system to accept.
Is there a way to restructure this code so that the command stays small, for example by processing the files in chunks or batches?
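To illustrate the kind of batching I have in mind, here is a minimal sketch, not a patch against pairtools. The names chunked, merge_in_batches and merge_batch are hypothetical; merge_batch stands for whatever routine merges one group of files into an intermediate file (for example, the existing command-building code applied to a shorter list). How headers of the intermediate files should be handled is not addressed here.

```python
def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def merge_in_batches(paths, merge_batch, batch_size=500):
    """Merge `paths` in rounds so that no single call sees more than
    `batch_size` inputs.

    `merge_batch(batch, round_index, batch_index)` is a hypothetical callback
    that merges one batch and returns the path of the file it wrote.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    paths = list(paths)
    if not paths:
        raise ValueError("no input paths given")

    round_index = 0
    # Each round shrinks the list by up to a factor of batch_size.
    while len(paths) > 1:
        paths = [
            merge_batch(batch, round_index, batch_index)
            for batch_index, batch in enumerate(chunked(paths, batch_size))
        ]
        round_index += 1
    return paths[0]


# Dry run with a stand-in callback that only records what it was asked to do.
calls = []


def fake_merge_batch(batch, round_index, batch_index):
    calls.append(len(batch))
    return "tmp/round{}_batch{}.pairs".format(round_index, batch_index)


inputs = ["to_merge/part_{:04d}.pairs".format(i) for i in range(4000)]
final_path = merge_in_batches(inputs, fake_merge_batch, batch_size=500)
print(final_path, len(calls), max(calls))
```

With 4000 inputs and batches of 500, this would run eight merges in the first round and one in the second, so each individual command only ever lists a bounded number of files.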

Much appreciated.
