Tips and Traps¶
- You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before applying split. Please refer to the section split below for a more detailed discussion.
- Some string functions (e.g., right) are available in Spark SQL but not as Spark DataFrame functions.
- Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than whitespace by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.
- instr and locate behave similarly to each other except that their parameters are reversed.
- Notice that replace is for replacing elements in a column, NOT for replacement inside each string element. To replace a substring with another one in a string, you have to use either regexp_replace or translate.
- The operator + does not work as concatenation for string columns. You have to use the function concat instead.
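Before diving into examples, a quick review of escape sequences in Python strings, which matters when writing regular expressions for functions such as rlike and regexp_replace.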
import re

# "\\s" and "\s" are the same Python string (a backslash followed by "s"),
# since \s is not a recognized escape sequence;
# re treats it as the whitespace character class
re.search("\\s", "nima ")
s = "\s"
"\s\\s"
"\s" == "\\s"  # True: \s is not a recognized escape sequence in Python
"\n" == "\\n"  # False: \n is a newline while "\\n" is a backslash followed by "n"
"\\n"
"\n"
import pandas as pd
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
SparkSession.builder.appName("PySpark_Str_Func").enableHiveSupport().getOrCreate()
)
df = spark.createDataFrame(
pd.DataFrame(
data=[
("2017/01/01", 1),
("2017/02/01", 2),
("2018/02/05", 3),
(None, 4),
("how \t", 5),
],
columns=["col1", "col2"],
)
)
df.show()
concat¶
The + operator does not work as concatenation for two string columns.
df.withColumn("col", col("date") + col("month")).show()
The function concat concatenates two string columns.
df.withColumn("col", concat(col("date"), col("month"))).show()
df.withColumn("col", concat(col("date"), lit("_"), col("month"))).show()
instr¶
spark.sql(
"""
select instr("abcd", "ab") as index
"""
).show()
spark.sql(
"""
select instr("abcd", "AB") as index
"""
).show()
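The function locate takes the substring as the 1st parameter and the string as the 2nd, i.e., the reverse of instr.
spark.sql(
    """
    select locate("ab", "abcd") as index
    """
).show()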
left¶
spark.sql(
"""
select
left("how are you doing?", 7) as phrase
"""
).show()
length¶
val df = Seq(
("2017", 1),
("2017/02", 2),
("2018/02/05", 3),
(null, 4)
).toDF("date", "month")
df.show
import org.apache.spark.sql.functions.length
df.select($"date", length($"date")).show
ltrim¶
Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than whitespace by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.
spark.sql(
"""
select ltrim("a ", "a a abcd") as after_ltrim
"""
).show()
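Both "a" and the space character are trimmed from the left in the example above, so the result is "bcd".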
df.withColumn("date", translate($"date", "/", "-")).show
public static Column regexp_extract(Column e, String exp, int groupIdx)
df.withColumn("date", regexp_replace(col("date"), "/", "-")).show()
right¶
spark.sql(
"""
select right("abcdefg", 3)
"""
).show()
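In Spark versions where right is not exposed in the DataFrame API, you can still call it via expr. A sketch using the Python df created earlier:
from pyspark.sql.functions import expr

# call the SQL-only function from the DataFrame API
df.select(expr("right(col1, 2)").alias("last_2_chars")).show()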
rlike¶
df.show()
df.filter(col("col1").rlike("\\d{4}/02/\\d{2}")).show()
df.filter(col("col1").rlike(r"\s")).show()
df.createOrReplaceTempView("t1")
spark.sql(
r"""
select
*
from
t1
where
col1 rlike '\\d'
"""
).show()
rtrim¶
Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than whitespace by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.
spark.sql(
"""
select rtrim("abcd\t ") as after_trim
"""
).show()
spark.sql(
"""
select rtrim(" \t", "abcd\t ") as after_trim
"""
).show()
spark.sql(
"""
select rtrim("a ", "a a abcda a a") as after_ltrim
"""
).show()
split¶
If there is a trailing separator, then an empty string is generated at the end of the array. It is suggested that you get rid of the trailing separator before applying split to avoid generating unnecessary empty strings. The benefit of doing this is two-fold.
- It avoids generating non-needed data (empty strings).
- Too many empty strings can cause serious data-skew issues if the corresponding column is used for joining with another table. By avoiding generating those empty strings, we avoid potential Spark issues from the beginning.
spark.sql(
"""
select split("ab;cd;ef", ";") as elements
"""
).show()
spark.sql(
"""
select split("ab;cd;ef;", ";") as elements
"""
).show()
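One way to get rid of a trailing separator before applying split is to strip it with regexp_replace first (a sketch; any equivalent cleanup works):
spark.sql(
    """
    select split(regexp_replace("ab;cd;ef;", ";+$", ""), ";") as elements
    """
).show()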
substring¶
import org.apache.spark.sql.functions._
val df = Seq(
("2017/01/01", 1),
("2017/02/01", 2),
(null, 3)
).toDF("date", "month")
df.show
df.withColumn("year", substring($"date", 1, 4)).show
df.withColumn("month", substring($"date", 6, 2)).show
df.withColumn("month", substring($"date", 9, 2)).show
trim¶
Notice that the functions trim/rtrim/ltrim behave a little counter-intuitively. First, they trim spaces only rather than whitespace by default. Second, when explicitly passing the characters to trim, the 1st parameter is the characters to trim and the 2nd parameter is the string from which to trim characters.
spark.sql(
"""
select trim("abcd\t ") as after_trim
"""
).show()
spark.sql(
"""
select trim(" \t", "abcd\t ") as after_trim
"""
).show()
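Since trim handles spaces only by default, one way to remove all trailing whitespace (including tabs) is regexp_replace with the \s character class:
spark.sql(
    r"""
    select regexp_replace("abcd\t ", "\\s+$", "") as after_trim
    """
).show()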