Tips & Traps
- Optimus requires Python 3.6+.
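A minimal sketch of guarding a notebook against an unsupported interpreter, based on the version requirement above. The helper name `meets_minimum` and the constant `MIN_PYTHON` are hypothetical and are not part of Optimus.

```python
import sys

# Minimum version taken from the note above (Python 3.6+).
MIN_PYTHON = (3, 6)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Fail early with a readable message instead of an obscure import error later.
if not meets_minimum():
    raise RuntimeError(
        "Python %d.%d or newer is required, found %d.%d"
        % (MIN_PYTHON + tuple(sys.version_info[:2]))
    )
```

Running the check in the first cell means the failure appears before any Spark session is started.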
In [1]:
import pandas as pd
import findspark
# A symbolic link of the Spark Home is made to /opt/spark for convenience
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
    SparkSession.builder.appName("PySpark Example").enableHiveSupport().getOrCreate()
)
In [5]:
from optimus import Optimus
ops = Optimus(master="local")
In [6]:
df = ops.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ],
)
df.table()
Viewing 5 of 5 rows / 4 columns
15 partition(s)
names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (int, nullable) |
---|---|---|---|
bumbl#ebéé⋅⋅ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Jazz | 13.0 | First⋅Lieutenant | 8 |
Megatron | None | None | None |
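In the table above, the Megatron row shows None in three cells, but in the input data the function value is the string "None" while height and rank are real nulls. A minimal sketch in plain Python (no Spark or Optimus calls) that separates the two cases; the variable names are illustrative only.

```python
# Same schema and rows as passed to ops.create.df above.
columns = ["names", "height", "function", "rank"]
rows = [
    ("bumbl#ebéé ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]

# Real nulls: the Python object None.
null_counts = {
    name: sum(1 for row in rows if row[i] is None)
    for i, name in enumerate(columns)
}

# Look-alikes: the four-character string "None".
string_none_counts = {
    name: sum(1 for row in rows if row[i] == "None")
    for i, name in enumerate(columns)
}

print("nulls:        ", null_counts)
print('"None" string:', string_none_counts)
```

A null count that ignores the string form will report no missing values for function, so it is worth checking for such placeholder strings before relying on missing-value statistics.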
In [7]:
ops.profiler.run(df)
Overview
Dataset info
Number of columns | 4 |
Number of rows | 5 |
Total Missing (%) | 2 |
Total size in memory | -1 Bytes |
Column types
Categorical | 0 |
Numeric | 0 |
Date | 0 |
Array | 0 |
Not available | 0 |
names
categorical
Unique | 4 |
Unique (%) | |
Missing | 0 |
Missing (%) |
Datatypes
String | 5 |
Integer | |
Decimal | |
Bool | |
Date | |
Missing | 0 |
Null | 0 |
Frequency
Value | Count | Frequency (%) |
---|---|---|
Jazz | 1 | 20.0% |
Megatron | 1 | 20.0% |
bumbl#ebéé | 1 | 20.0% |
Optim'us | 1 | 20.0% |
ironhide& | 1 | 20.0% |
"Missing" | 0 | % |
height
numeric
Unique | 4 |
Unique (%) | |
Missing | 1 |
Missing (%) |
Datatypes
String | |
Integer | |
Decimal | 4 |
Bool | |
Date | |
Missing | 0 |
Null | 1 |
Basic Stats
Mean | 21.125 |
Minimum | 13.0 |
Maximum | 28.0 |
Zeros(%) | 0 |
Quantile statistics
Minimum | 13.0 |
5-th percentile | 13.0 |
Q1 | 13.0 |
Median | 17.5 |
Q3 | 26.0 |
95-th percentile | 28.0 |
Maximum | 28.0 |
Range | |
Interquartile range |
Descriptive statistics
Standard deviation | 7.07549 |
Coef of variation | |
Kurtosis | -1.70021 |
Mean | 21.125 |
MAD | |
Skewness | -0.15561 |
Sum | 84.5 |
Variance | 50.0625 |
function
categorical
Unique | 5 |
Unique (%) | |
Missing | 0 |
Missing (%) |
Datatypes
String | 5 |
Integer | |
Decimal | |
Bool | |
Date | |
Missing | 0 |
Null | 0 |
Frequency
Value | Count | Frequency (%) |
---|---|---|
First Lieutenant | 1 | 20.0% |
Leader | 1 | 20.0% |
Security | 1 | 20.0% |
Espionage | 1 | 20.0% |
None | 1 | 20.0% |
"Missing" | 0 | % |
rank
numeric
Unique | 3 |
Unique (%) | |
Missing | 1 |
Missing (%) |
Datatypes
String | |
Integer | 4 |
Decimal | |
Bool | |
Date | |
Missing | 0 |
Null | 1 |
Basic Stats
Mean | 8.0 |
Minimum | 7 |
Maximum | 10 |
Zeros(%) | 0 |
Quantile statistics
Minimum | 7 |
5-th percentile | 7 |
Q1 | 7 |
Median | 7 |
Q3 | 8 |
95-th percentile | 10 |
Maximum | 10 |
Range | |
Interquartile range |
Descriptive statistics
Standard deviation | 1.41421 |
Coef of variation | |
Kurtosis | -1.0 |
Mean | 8.0 |
MAD | |
Skewness | 0.8165 |
Sum | 32 |
Variance | 2.0 |
Out[7]:
<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
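The descriptive statistics reported for height can be cross-checked by hand. The sketch below uses only the Python standard library and is not how Optimus computes them; it assumes the sample (n - 1) formula for variance and standard deviation, and population central moments for skewness and excess kurtosis. Under those assumptions the results agree with the profiler output above to the displayed precision.

```python
import statistics

# Non-null height values from the DataFrame above.
heights = [17.5, 28.0, 26.0, 13.0]

mean = statistics.mean(heights)
sample_variance = statistics.variance(heights)  # divides by n - 1
sample_std = statistics.stdev(heights)          # square root of the above


def central_moment(values, k):
    """k-th population central moment: mean of (x - mean) ** k."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** k for v in values) / len(values)


m2 = central_moment(heights, 2)
skewness = central_moment(heights, 3) / m2 ** 1.5
excess_kurtosis = central_moment(heights, 4) / m2 ** 2 - 3

print("mean            ", mean)
print("variance        ", sample_variance)
print("std deviation   ", round(sample_std, 5))
print("skewness        ", round(skewness, 5))
print("excess kurtosis ", round(excess_kurtosis, 5))
```

Note the mix of conventions: variance uses the n - 1 denominator while skewness and kurtosis use n, which matters when comparing against other tools on small samples.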
References
https://github.com/ironmussa/Optimus
https://github.com/ironmussa/Optimus/tree/master/examples
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions