When running pairtools merge with a large list of files (~4000 files), I received the following error:
Traceback (most recent call last):
  File "/home/epi2melabs/conda/bin/pairtools", line 11, in <module>
    sys.exit(cli())
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/pairtools/cli/merge.py", line 134, in merge
    merge_py(
  File "/home/epi2melabs/conda/lib/python3.8/site-packages/pairtools/cli/merge.py", line 254, in merge_py
    subprocess.check_call(command, shell=True, stdout=outstream)
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 359, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/epi2melabs/conda/lib/python3.8/subprocess.py", line 1720, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/bin/sh'
I am running pairtools through the wf-pore-c Nextflow pipeline, which internally calls pairtools with the following command (my ~4000 files are in the to_merge/ directory). See source
pairtools merge -o output.pairs.gz --concatenate 'to_merge/*'
I can see from the pairtools source code that the merge function iterates over all the files given to it and combines them into a single subprocess command:
pairtools/pairtools/cli/merge.py
Lines 235 to 254 in f896311
for path in paths:
    if kwargs.get("cmd_in", None):
        command += r""" <(cat {} | {} | sed -n -e '\''/^[^#]/,$p'\'')""".format(
            path, kwargs["cmd_in"]
        )
    elif path.endswith(".gz"):
        command += (
            r""" <(bgzip -dc -@ {} {} | sed -n -e '\''/^[^#]/,$p'\'')""".format(
                kwargs["nproc_in"], path
            )
        )
    elif path.endswith(".lz4"):
        command += r""" <(lz4c -dc {} | sed -n -e '\''/^[^#]/,$p'\'')""".format(
            path
        )
    else:
        command += r""" <(sed -n -e '\''/^[^#]/,$p'\'' {})""".format(path)
command += "'"
subprocess.check_call(command, shell=True, stdout=outstream)
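Because every input file adds another fragment to the same string, the command grows with the number of files, and with ~4000 files it exceeds what the OS accepts, which raises the OSError above.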
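To give a rough sense of scale, here is a back-of-the-envelope estimate of the command length. It is only a sketch: it assumes Linux with 4 KiB pages (where a single argument string, such as the one passed to `/bin/sh -c`, is capped at 128 KiB), it reuses only the plain-text branch from the excerpt, and the file names and helper names are made up for illustration.

```python
# Assumption: Linux with 4 KiB pages, where one argv string is capped at 32 pages.
MAX_ARG_STRLEN = 32 * 4096  # 131072 bytes


def fragment(path):
    """Per-file fragment, same shape as the plain-text branch in merge.py."""
    return r""" <(sed -n -e '\''/^[^#]/,$p'\'' {})""".format(path)


def command_length(paths, prefix=""):
    """Byte length of the assembled command (prefix is whatever precedes the loop)."""
    command = prefix + "".join(fragment(p) for p in paths) + "'"
    return len(command.encode())


# Hypothetical file names, roughly matching a to_merge/ directory with 4000 entries
paths = ["to_merge/part_{:04d}.pairs".format(i) for i in range(4000)]

length = command_length(paths)
print("command length:", length, "bytes")
print("exceeds single-argument limit:", length > MAX_ARG_STRLEN)
```

Under these assumptions the string is well past the limit even before the part of the command that precedes the loop is counted, and the `.gz` branch adds even more characters per file.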
Is there a way to restructure this code so that the command does not become so large (potentially process the files in chunks or batches)?
Much appreciated.
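To illustrate the kind of batching I have in mind, here is a minimal sketch for the `--concatenate` case. It is not the pairtools implementation: `chunked` and `concatenate_in_batches` are hypothetical helpers, the batch size of 500 is arbitrary, and header stripping and decompression are left out (plain `cat` stands in for the per-file pipeline).

```python
import subprocess
from typing import Iterator, List, Sequence


def chunked(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def concatenate_in_batches(paths, outstream, batch_size=500):
    """Run one subprocess per batch; every batch writes to the same open file."""
    for batch in chunked(paths, batch_size):
        # Passing an argument list (no shell=True) keeps each call small
        # and avoids building one giant quoted command string.
        subprocess.check_call(["cat", *batch], stdout=outstream)
```

It would be used roughly like this, with `outstream` being a real file opened in binary mode so that each batch continues where the previous one stopped:

```python
import glob

with open("output.pairs", "wb") as out:
    concatenate_in_batches(sorted(glob.glob("to_merge/*")), out)
```

For the sorted (non-concatenate) merge, batching would presumably need two stages: merge each batch into an intermediate file, then merge the intermediates.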