Comments
String is an immutable class in Java. Extensive operations on strings (e.g.,
+ in a big loop) are usually very slow before Java 7 (the + operator is optimized by the compiler automatically starting from Java 7). To avoid this problem (in older versions of Java), you can use the StringBuilder
Aggregate DataFrames in Spark
Aggregation Without Grouping
You can aggregate all values in the columns of a DataFrame. Just use aggregation functions in
select without groupBy, which is very similar to SQL syntax. The aggregation functions
all and any are available since Spark 3.0. However, the same results can be achieved using other aggregation functions such as sum
Using Optimus for Data Profiling in PySpark
Tips & Traps
- Optimus requires Python 3.6+.
Hands on the Python module Multiprocessing
Comments
multiprocess is a fork of the Python standard library multiprocessing. multiprocess extends multiprocessing to provide enhanced serialization, using dill. multiprocess leverages multiprocessing to support the spawning of processes using the API of the Python standard library's threading module.
multiprocessing.Pool.map does not work with lambda functions because lambda functions cannot be pickled. There are multiple approaches to avoid the issue. You can define a function or use functools.partial
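As a minimal sketch of the workaround described above, the snippet below passes a functools.partial of a module-level function to Pool.map instead of a lambda. The names power, square, and is_picklable are hypothetical helpers introduced only for this illustration; they are not part of multiprocessing.

```python
import pickle
from functools import partial
from multiprocessing import Pool


def power(base, exponent):
    """Hypothetical helper: raise base to the given exponent."""
    return base ** exponent


def is_picklable(obj):
    """Return True if obj can be serialized with pickle, which Pool.map relies on."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A partial of a module-level function is pickled by reference to `power`,
# so it can be sent to worker processes, unlike an anonymous lambda.
square = partial(power, exponent=2)

if __name__ == "__main__":
    # The lambda is rejected by pickle; the partial is accepted.
    print(is_picklable(lambda x: x ** 2))
    print(is_picklable(square))

    # Pool.map sends `square` to the workers and collects the results in order.
    with Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module; without it, each worker would try to create its own pool.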
Hands on the Python module threading
Gradle Kotlin DSL
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
AVOID using the Kotlin DSL for Gradle! The Kotlin DSL for Gradle is not mature and lacks documentation at this time. Stick with the Groovy DSL for Gradle.
shadowJar
https://github …