Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
The simplest and best way is to leverage pandas_udf in PySpark. In the pandas UDF, you can call subprocess.run to run any shell command and capture its output.
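As a rough illustration of the idea, here is a minimal sketch, assuming Spark 3.x with pandas and PyArrow installed on the executors. The names run_shell, make_udf, demo and the column name cmd are hypothetical and chosen only for this example. It also assumes the commands are trusted, since shell=True hands each string to the shell unmodified.

```python
import subprocess

import pandas as pd


def run_shell(commands: pd.Series) -> pd.Series:
    """Run every shell command in the batch and return its captured stdout."""

    def run_one(cmd: str) -> str:
        # capture_output=True collects stdout/stderr; text=True decodes them to str.
        # shell=True is an assumption of this sketch: only use it with trusted input.
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return proc.stdout.strip()

    return commands.map(run_one)


def make_udf():
    """Wrap run_shell as a Series-to-Series pandas UDF returning strings."""
    # Imported lazily so run_shell can be tried locally without Spark installed.
    from pyspark.sql.functions import pandas_udf

    return pandas_udf(run_shell, returnType="string")


def demo():
    """Hypothetical usage: run one command per row on the executors."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("echo hello",), ("hostname",)], ["cmd"])
    df.withColumn("output", make_udf()("cmd")).show(truncate=False)
```

Calling demo() on a machine with Spark should add an output column holding whatever each command printed. Keeping run_shell as a plain pandas function means it can be checked locally on a pd.Series before it is wrapped as a UDF.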
from pathlib …