Assign Index
Rename and Drop Columns in Spark DataFrames
Comment
You can use withColumnRenamed to rename a column in a DataFrame.
You can also rename a column using alias when selecting columns.
Git Errors and Solutions
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
git clone throws the error message "fatal: unable to fork"
The cause is a missing SSH client.
The solution is simply to install openssh-client.
sudo apt-get install openssh-client
git pull throw …
Docker Images for Programming Languages
Python
continuumio/miniconda3
- It is hard to figure out the version of Python from the version of the Docker image.
continuumio/anaconda3
- It is hard to figure out the version of …
Sort DataFrame in Spark
Comments
- After sorting, rows are ordered across partitions by partition ID (every row in a partition with a lower ID sorts before every row in a partition with a higher ID), and rows within each partition are sorted. This property can be leveraged to implement global ranking of rows. For more details, please refer to Computing global rank of a row in a DataFrame with Spark SQL. However, notice that multi-layer ranking is often more efficient than global ranking in big data applications.
Use pyarrow to Share Data in Memory in Python
References
https://github.com/apache/arrow
https://stackoverflow.com/questions/54582073/sharing-objects-across-workers-using-pyarrow
https://github.com/pytorch/pytorch/issues/13039
https://issues.apache.org/jira/browse/ARROW-5130
https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html