2.20. Troubleshooting

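This section describes how to diagnose and resolve problems that come up when running CWL programs. It focuses on cwltool, although the same techniques may also apply to other CWL runners.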

2.20.1. Running cwltool with a cachedir (Cache Directory)

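When running a workflow, you can use the --cachedir option to have cwltool cache intermediate files (files that are neither inputs nor outputs, but are created while the workflow runs). By default, these files are created in a temporary directory; writing them to a dedicated directory of your choice makes them easier to inspect.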

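The following example, troubleshooting-wf1.cwl, has two steps, step_a and step_b. The workflow is equivalent to running echo "Hello World" | rev, which prints "Hello World" reversed as "dlroW olleH". However, the second step, step_b, contains an error: it should run the rev command, but it runs revv instead, so the step fails.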

troubleshooting-wf1.cwl
cwlVersion: v1.2
class: Workflow

inputs:
  text:
    type: string
    default: 'Hello World'
outputs:
  reversed_message:
    type: string
    outputSource: step_b/reversed_message

steps:
  step_a:
    run:
      class: CommandLineTool
      stdout: stdout.txt
      inputs:
        text: string
      outputs:
        step_a_stdout:
          type: File
          outputBinding:
            glob: 'stdout.txt'
      arguments: ['echo', '-n', '$(inputs.text)']
    in:
      text: text
    out: [step_a_stdout]
  step_b:
    run:
      class: CommandLineTool
      stdout: stdout.txt
      inputs:
        step_a_stdout: File
      outputs:
        reversed_message:
          type: string
          outputBinding:
            glob: stdout.txt
            loadContents: true
            outputEval: $(self[0].contents)
      baseCommand: revv
      arguments: [ $(inputs.step_a_stdout) ]
    in:
      step_a_stdout:
        source: step_a/step_a_stdout
    out: [reversed_message]
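
For comparison, the pipeline that the workflow is meant to reproduce can be run directly in a shell. This is only a sketch: it assumes rev is installed, and it uses printf rather than echo -n for portability. The final check shows why step_b cannot succeed on a machine where revv is not on the PATH.

```shell
# What the two steps are meant to compute together: write the text
# without a trailing newline (step_a), then reverse it (step_b).
reversed=$(printf '%s' 'Hello World' | rev)
echo "$reversed"

# step_b asks for 'revv'. If no such executable is on the PATH,
# the step has nothing to run and fails.
if ! command -v revv >/dev/null 2>&1; then
    echo "revv: command not found"
fi
```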

Let's run this workflow with /tmp/cachedir/ as the value of the --cachedir option (cwltool creates the directory for you if it does not exist):

$ cwltool --cachedir /tmp/cachedir/ troubleshooting-wf1.cwl
INFO /home/docs/checkouts/readthedocs.org/user_builds/common-workflow-languageuser-guide-zh-hans/envs/latest/bin/cwltool 3.1.20240508115724
INFO Resolved 'troubleshooting-wf1.cwl' to 'file:///home/docs/checkouts/readthedocs.org/user_builds/common-workflow-languageuser-guide-zh-hans/checkouts/latest/src/_includes/cwl/troubleshooting/troubleshooting-wf1.cwl'
INFO [workflow ] start
INFO [workflow ] starting step step_a
INFO [step step_a] start
INFO [job step_a] Output of job will be cached in /tmp/cachedir/edb2bbda4f67d8bf15e1112f6a5a10cf
INFO [job step_a] /tmp/cachedir/edb2bbda4f67d8bf15e1112f6a5a10cf$ echo \
    -n \
    'Hello World' > /tmp/cachedir/edb2bbda4f67d8bf15e1112f6a5a10cf/stdout.txt
INFO [job step_a] completed success
INFO [step step_a] completed success
INFO [workflow ] starting step step_b
INFO [step step_b] start
INFO [job step_b] Output of job will be cached in /tmp/cachedir/609ea62e2a895d4dd4f7fd481ae06273
INFO [job step_b] /tmp/cachedir/609ea62e2a895d4dd4f7fd481ae06273$ revv \
    /tmp/nneu764p/stgf7b76632-cc68-4293-9ce1-78ba147abc40/stdout.txt > /tmp/cachedir/609ea62e2a895d4dd4f7fd481ae06273/stdout.txt
ERROR 'revv' not found: [Errno 2] No such file or directory: 'revv'
WARNING [job step_b] completed permanentFail
ERROR [step step_b] Output is missing expected field file:///home/docs/checkouts/readthedocs.org/user_builds/common-workflow-languageuser-guide-zh-hans/checkouts/latest/src/_includes/cwl/troubleshooting/troubleshooting-wf1.cwl#step_b/reversed_message
WARNING [step step_b] completed permanentFail
INFO [workflow ] completed permanentFail
{
    "reversed_message": null
}WARNING Final process status is permanentFail

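The workflow ends in the permanentFail state because step_b fails when it tries to run the nonexistent command revv. step_a ran successfully, and its output was cached in the location you specified with the --cachedir option. You can inspect the intermediate files it created: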

$ tree /tmp/cachedir
/tmp/cachedir
├── 0xhr0oha
├── 609ea62e2a895d4dd4f7fd481ae06273
│   └── stdout.txt
├── 609ea62e2a895d4dd4f7fd481ae06273.status
├── edb2bbda4f67d8bf15e1112f6a5a10cf
│   └── stdout.txt
└── edb2bbda4f67d8bf15e1112f6a5a10cf.status

3 directories, 4 files

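Each workflow step has a unique ID (the long string that looks like a hash). The ${HASH}.status files contain the execution status of each step. In the output of the command above, you can see stdout.txt, the output file of step_a.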
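
To read all the status files at once, a small shell helper along the following lines may be convenient. It is a sketch only: the function name show_cache_status is hypothetical, and it assumes each .status file holds a short plain-text status such as success or permanentFail.

```shell
# Print "<cache key>: <status>" for every entry in a cwltool cache directory.
# show_cache_status is a hypothetical helper, not part of cwltool.
show_cache_status() {
    cachedir=$1
    for status_file in "$cachedir"/*.status; do
        # Skip the unexpanded glob when the directory has no .status files.
        [ -e "$status_file" ] || continue
        printf '%s: %s\n' \
            "$(basename "$status_file" .status)" \
            "$(cat "$status_file")"
    done
}

# Example:
#   show_cache_status /tmp/cachedir
```

The cache key printed on each line matches the directory name shown in the "Output of job will be cached in" log messages, which is how a status can be traced back to a step.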

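Now fix the error in step_b (revv) so that it runs the rev command; in the run below, the corrected workflow is saved as troubleshooting-wf1-stepb-fixed.cwl. Then run cwltool with the same command-line options as before. Notice that the cwltool output now includes the cached output of step_a, while the output of step_b gets a new cache entry. Also notice that the status of step_b is now success.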
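
One possible way to produce the corrected file without editing it by hand is sketched here. The function name fix_step_b is hypothetical; the only change it makes is to replace the misspelled baseCommand, leaving step_a untouched.

```shell
# Write a corrected copy of a workflow file, changing only the misspelled
# baseCommand in step_b. fix_step_b is a hypothetical helper name.
fix_step_b() {
    sed 's/baseCommand: revv/baseCommand: rev/' "$1" > "$2"
}

# Example:
#   fix_step_b troubleshooting-wf1.cwl troubleshooting-wf1-stepb-fixed.cwl
#   cwltool --cachedir /tmp/cachedir/ troubleshooting-wf1-stepb-fixed.cwl
```

Because step_a is unchanged in the corrected copy, re-running with the same cache directory lets cwltool reuse its cached result.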

$ cwltool --cachedir /tmp/cachedir/ troubleshooting-wf1-stepb-fixed.cwl
INFO /home/docs/checkouts/readthedocs.org/user_builds/common-workflow-languageuser-guide-zh-hans/envs/latest/bin/cwltool 3.1.20240508115724
INFO Resolved 'troubleshooting-wf1-stepb-fixed.cwl' to 'file:///home/docs/checkouts/readthedocs.org/user_builds/common-workflow-languageuser-guide-zh-hans/checkouts/latest/src/_includes/cwl/troubleshooting/troubleshooting-wf1-stepb-fixed.cwl'
INFO [workflow ] start
INFO [workflow ] starting step step_a
INFO [step step_a] start
INFO [job step_a] Using cached output in /tmp/cachedir/edb2bbda4f67d8bf15e1112f6a5a10cf
INFO [step step_a] completed success
INFO [workflow ] starting step step_b
INFO [step step_b] start
INFO [job step_b] Output of job will be cached in /tmp/cachedir/3dfb3e8c82b46e9e2d650a90a303a16a
INFO [job step_b] /tmp/cachedir/3dfb3e8c82b46e9e2d650a90a303a16a$ rev \
    /tmp/fixg71v7/stg2f088009-1280-477b-8f0a-237bf2ee8923/stdout.txt > /tmp/cachedir/3dfb3e8c82b46e9e2d650a90a303a16a/stdout.txt
INFO [job step_b] completed success
INFO [step step_b] completed success
INFO [workflow ] completed success
{
    "reversed_message": "dlroW olleH"
}INFO Final process status is success

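In this example, step_a was not recomputed because it was still in the cache and neither its execution nor its output had changed. cwltool also recognized that, because we changed the name of the executable in step_b, it had to recompute that step. This technique is very useful when troubleshooting CWL programs, because it keeps cwltool from needlessly re-running every step.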