Tips & Traps¶
- Optimus requires Python 3.6+; see the install sketch below.
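Optimus 2.x for Spark is distributed on PyPI. A minimal install sketch, assuming the package name is optimuspyspark (double-check the exact name and version on PyPI):
!pip install optimuspyspark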
In [1]:
import pandas as pd
import findspark
# A symbolic link of the Spark Home is made to /opt/spark for convenience
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
    SparkSession.builder.appName("PySpark Example").enableHiveSupport().getOrCreate()
)
In [5]:
from optimus import Optimus
ops = Optimus(master="local")
In [6]:
df = ops.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ],
)
df.table()
Viewing 5 of 5 rows / 4 columns
15 partition(s)

| names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (int, nullable) |
|---|---|---|---|
| bumbl#ebéé⋅⋅ | 17.5 | Espionage | 7 |
| Optim'us | 28.0 | Leader | 10 |
| ironhide& | 26.0 | Security | 7 |
| Jazz | 13.0 | First Lieutenant | 8 |
| Megatron | None | None | None |
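The sample data is intentionally messy: special characters, accents, and trailing spaces (rendered as ⋅ above). Below is a minimal cleaning sketch using Optimus's cols accessor; the method names lower, remove_accents, remove_special_chars, and trim are assumed from the Optimus 2.x API, so verify them against the installed version.
# Clean the "names" column step by step; each cols.* call returns a new DataFrame,
# so the calls can be chained. Method names assumed from the Optimus 2.x cols API.
df.cols.lower("names") \
    .cols.remove_accents("names") \
    .cols.remove_special_chars("names") \
    .cols.trim("names") \
    .table()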
In [7]:
ops.profiler.run(df)
Overview

Dataset info

| Number of columns | 4 |
| Number of rows | 5 |
| Total Missing (%) | 2 |
| Total size in memory | -1 Bytes |

Column types

| Categorical | 0 |
| Numeric | 0 |
| Date | 0 |
| Array | 0 |
| Not available | 0 |

names (categorical)

| Unique | 4 |
| Unique (%) | |
| Missing | 0 |
| Missing (%) | |

Datatypes

| String | 5 |
| Integer | |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 0 |

Frequency

| Value | Count | Frequency (%) |
|---|---|---|
| Jazz | 1 | 20.0% |
| Megatron | 1 | 20.0% |
| bumbl#ebéé | 1 | 20.0% |
| Optim'us | 1 | 20.0% |
| ironhide& | 1 | 20.0% |
| "Missing" | 0 | |

height (numeric)

| Unique | 4 |
| Unique (%) | |
| Missing | 1 |
| Missing (%) | |

Datatypes

| String | |
| Integer | |
| Decimal | 4 |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 1 |

Basic Stats

| Mean | 21.125 |
| Minimum | 13.0 |
| Maximum | 28.0 |
| Zeros (%) | 0 |

Quantile statistics

| Minimum | 13.0 |
| 5-th percentile | 13.0 |
| Q1 | 13.0 |
| Median | 17.5 |
| Q3 | 26.0 |
| 95-th percentile | 28.0 |
| Maximum | 28.0 |
| Range | |
| Interquartile range | |

Descriptive statistics

| Standard deviation | 7.07549 |
| Coef of variation | |
| Kurtosis | -1.70021 |
| Mean | 21.125 |
| MAD | |
| Skewness | -0.15561 |
| Sum | 84.5 |
| Variance | 50.0625 |

function (categorical)

| Unique | 5 |
| Unique (%) | |
| Missing | 0 |
| Missing (%) | |

Datatypes

| String | 5 |
| Integer | |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 0 |

Frequency

| Value | Count | Frequency (%) |
|---|---|---|
| First Lieutenant | 1 | 20.0% |
| Leader | 1 | 20.0% |
| Security | 1 | 20.0% |
| Espionage | 1 | 20.0% |
| None | 1 | 20.0% |
| "Missing" | 0 | |

rank (numeric)

| Unique | 3 |
| Unique (%) | |
| Missing | 1 |
| Missing (%) | |

Datatypes

| String | |
| Integer | 4 |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0 |
| Null | 1 |

Basic Stats

| Mean | 8.0 |
| Minimum | 7 |
| Maximum | 10 |
| Zeros (%) | 0 |

Quantile statistics

| Minimum | 7 |
| 5-th percentile | 7 |
| Q1 | 7 |
| Median | 7 |
| Q3 | 8 |
| 95-th percentile | 10 |
| Maximum | 10 |
| Range | |
| Interquartile range | |

Descriptive statistics

| Standard deviation | 1.41421 |
| Coef of variation | |
| Kurtosis | -1.0 |
| Mean | 8.0 |
| MAD | |
| Skewness | 0.8165 |
| Sum | 32 |
| Variance | 2.0 |
Out[7]:
<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
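Since Optimus 2.x attaches its accessors to regular pyspark.sql.DataFrame objects (an assumption worth verifying for other versions), the plain Spark DataFrame API is still available for quick, lightweight summaries:
# Standard PySpark summary of the numeric columns; not Optimus-specific.
df.describe("height", "rank").show()
# Column name/type pairs of the underlying Spark DataFrame.
df.dtypes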
References¶
https://github.com/ironmussa/Optimus
https://github.com/ironmussa/Optimus/tree/master/examples
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions