Tips & Traps¶
- Optimus requires Python 3.6+; see the install sketch below.
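Optimus 2.x for Spark is distributed on PyPI. A minimal install sketch, assuming the package name is optimuspyspark (double-check the exact name and version on PyPI):
!pip install optimuspyspark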
In [1]:
import pandas as pd
import findspark
# A symbolic link of the Spark Home is made to /opt/spark for convenience
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
    SparkSession.builder.appName("PySpark Example").enableHiveSupport().getOrCreate()
)
In [5]:
from optimus import Optimus
ops = Optimus(master="local")
In [6]:
df = ops.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ],
)
df.table()
Viewing 5 of 5 rows / 4 columns
15 partition(s)

| names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (int, nullable) |
|---|---|---|---|
| bumbl#ebéé⋅⋅ | 17.5 | Espionage | 7 |
| Optim'us | 28.0 | Leader | 10 |
| ironhide& | 26.0 | Security | 7 |
| Jazz | 13.0 | First Lieutenant | 8 |
| Megatron | None | None | None |
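The sample data is intentionally messy: special characters, accents, and trailing spaces (rendered as ⋅ above). Below is a minimal cleaning sketch using Optimus's cols accessor; the method names lower, remove_accents, remove_special_chars, and trim are assumed from the Optimus 2.x API, so verify them against the installed version.
# Clean the "names" column step by step; each cols.* call returns a new DataFrame,
# so the calls can be chained. Method names assumed from the Optimus 2.x cols API.
df.cols.lower("names") \
    .cols.remove_accents("names") \
    .cols.remove_special_chars("names") \
    .cols.trim("names") \
    .table()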
In [7]:
ops.profiler.run(df)
Overview

Dataset info

| Number of columns | 4 |
| Number of rows | 5 |
| Total Missing (%) | 2 |
| Total size in memory | -1 Bytes |

Column types

| Categorical | 0 |
| Numeric | 0 |
| Date | 0 |
| Array | 0 |
| Not available | 0 |

names (categorical)

| Unique | 4 |
| Unique (%) | |
| Missing | 0 |
| Missing (%) | |

Datatypes

| String | 5 |
| Integer | |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 0 |

Frequency

| Value | Count | Frequency (%) |
|---|---|---|
| Jazz | 1 | 20.0% |
| Megatron | 1 | 20.0% |
| bumbl#ebéé | 1 | 20.0% |
| Optim'us | 1 | 20.0% |
| ironhide& | 1 | 20.0% |
| "Missing" | 0 | |

height (numeric)

| Unique | 4 |
| Unique (%) | |
| Missing | 1 |
| Missing (%) | |

Datatypes

| String | |
| Integer | |
| Decimal | 4 |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 1 |

Basic Stats

| Mean | 21.125 |
| Minimum | 13.0 |
| Maximum | 28.0 |
| Zeros (%) | 0 |

Quantile statistics

| Minimum | 13.0 |
| 5-th percentile | 13.0 |
| Q1 | 13.0 |
| Median | 17.5 |
| Q3 | 26.0 |
| 95-th percentile | 28.0 |
| Maximum | 28.0 |
| Range | |
| Interquartile range | |

Descriptive statistics

| Standard deviation | 7.07549 |
| Coef of variation | |
| Kurtosis | -1.70021 |
| Mean | 21.125 |
| MAD | |
| Skewness | -0.15561 |
| Sum | 84.5 |
| Variance | 50.0625 |

function (categorical)

| Unique | 5 |
| Unique (%) | |
| Missing | 0 |
| Missing (%) | |

Datatypes

| String | 5 |
| Integer | |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 0 |

Frequency

| Value | Count | Frequency (%) |
|---|---|---|
| First Lieutenant | 1 | 20.0% |
| Leader | 1 | 20.0% |
| Security | 1 | 20.0% |
| Espionage | 1 | 20.0% |
| None | 1 | 20.0% |
| "Missing" | 0 | |

rank (numeric)

| Unique | 3 |
| Unique (%) | |
| Missing | 1 |
| Missing (%) | |

Datatypes

| String | |
| Integer | 4 |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 1 |

Basic Stats

| Mean | 8.0 |
| Minimum | 7 |
| Maximum | 10 |
| Zeros (%) | 0 |

Quantile statistics

| Minimum | 7 |
| 5-th percentile | 7 |
| Q1 | 7 |
| Median | 7 |
| Q3 | 8 |
| 95-th percentile | 10 |
| Maximum | 10 |
| Range | |
| Interquartile range | |

Descriptive statistics

| Standard deviation | 1.41421 |
| Coef of variation | |
| Kurtosis | -1.0 |
| Mean | 8.0 |
| MAD | |
| Skewness | 0.8165 |
| Sum | 32 |
| Variance | 2.0 |
Out[7]:
<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
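Since Optimus 2.x attaches its accessors to regular pyspark.sql.DataFrame objects (an assumption worth verifying for other versions), the plain Spark DataFrame API is still available for quick, lightweight summaries:
# Standard PySpark summary of the numeric columns; not Optimus-specific.
df.describe("height", "rank").show()
# Column name/type pairs of the underlying Spark DataFrame.
df.dtypes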
References¶
https://github.com/ironmussa/Optimus
https://github.com/ironmussa/Optimus/tree/master/examples
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions