12 July 2020

SQL Northwind Database

by Ru Kein

The Northwind SQL Database Project demonstrates how to use SQL queries and hypothesis testing in order to recommend business strategies for increasing sales and reducing costs for the fictitious “Northwind” company. The Northwind SQL database was created by Microsoft for data scientists to practice SQL queries and hypothesis testing in their analyses.

GitHub repo size GitHub license

northwind

Prerequisites

Before you begin, ensure you have met the following requirements:

FSDS - standard data science tools

You can do a quick install of all the standard data science libraries (matplotlib, pandas, etc) using the FSDS pypi package from James Irving:

pip install -U fsds_100719

Running the Northwind SQL Database Project

To run this project locally, follow these steps:

In the command line/terminal:

git clone https://github.com/hakkeray/northwind-sql-database-project
cd northwind-sql-database-project
jupyter notebook

Hypothesis Testing

Below are 4 hypotheses (each including a null hypothesis and alternative hypothesis) which I will test for statistical significance to determine if there are any relationships which would be useful from a strategic business perspective. Following this I will summarize the results, make final recommendations, and propose ideas for future analytical work.


Objectives

H1: Discount and Order Quantity

Does discount amount have a statistically significant effect on order quantity? If so, at what level(s) of discount?

H2: Countries and Order Quantity: Discount vs Full Price

Do order quantities of individual countries differ when discounted vs full price?

H3: Region and Order Revenue

Does region have a statistically significant effect on average revenue per order?

H4: Month and Order Quantity

Does time of year have a statistically significant effect on average revenue per order?

Process Outline

Outline of process I will follow in order to answer questions above:

-Question

  1. Hypotheses
  2. Exploratory Data Analysis (EDA)

-Select dataset -Group data -Explore data

  1. Assumption Tests: -Sample size -Normality and Variance

  2. Statistical Tests: -Statistical test -Effect size (if necessary) -Post-hoc tests (if necessary)

  3. Summarize Results


Statistical Analysis Pipeline

For #3 and #4 above (Assumption and Statistical Tests):

  1. Check if sample sizes allow us to ignore assumptions by visualizing sample size comparisons for two groups (normality check).
    • Bar Plot: SEM (Standard Error of the Mean)
  2. If above test fails, check for normality and homogeneity of variance:
    • Test Assumption Normality:
      • D’Agostino-Pearson: scipy.stats.normaltest
      • Shapiro-Wilik Test: scipy.stats.shapiro
    • Test for Homogeneity of Variance:
      • Levene’s Test: scipy.stats.levene) Parametric tests (means) Nonparametric tests (medians) 1-sample t test 1-sample Sign, 1-sample Wilcoxon 2-sample t test Mann-Whitney tes One-Way ANOVA Kruskal-Wallis, Mood’s median tes Factorial DOE with one factor and one blocking variable Friedman test
  3. Choose appropriate test based on above
    • T Test (1-sample)
      • stats.ttest_1samp()
    • T Test (2-sample)
      • stats.ttest_ind()
    • Welch’s T-Test (2-sample)
      • stats.ttest_ind(equal_var=False)
    • Mann Whitney U
      • stats.mannwhitneyu()
    • ANOVA
      • stats.f_oneway()
  4. Calculate effect size for significant results.
    • Effect size:
      • cohen’s d

    -Interpretation:

    • Small effect = 0.2 ( cannot be seen by naked eye)
    • Medium effect = 0.5
    • Large Effect = 0.8 (can be seen by naked eye)
  5. If significant, follow up with post-hoc tests (if have more than 2 groups)
    • Tukey’s
      • statsmodels.stats.multicomp.pairwise_tukeyhsd

Contact

If you want to contact me you can reach me at rukeine@gmail.com.

License

This project uses the following license: MIT License.

#         _ __ _   _
#  /\_/\ | '__| | | |
#  [===] | |  | |_| |
#   \./  |_|   \__,_|

Project Demo

# connect to database / import data
import sqlite3
conn = sqlite3.connect('Northwind_small.sqlite')
cur = conn.cursor()
# function for converting tables into dataframes on the fly
def get_table(cur, table):
    cur.execute(f"SELECT * from {table};")
    df = pd.DataFrame(cur.fetchall())
    df.columns = [desc[0] for desc in cur.description]
    return df
# create dataframe of table names for referencing purposes
cur.execute("""SELECT name from sqlite_master WHERE type='table';""")
df_tables = pd.DataFrame(cur.fetchall(), columns=['Table'])
df_tables
Table
0 Employee
1 Category
2 Customer
3 Shipper
4 Supplier
5 Order
6 Product
7 OrderDetail
8 CustomerCustomerDemo
9 CustomerDemographic
10 Region
11 Territory
12 EmployeeTerritory
</div>

H1: Discount–Quantity

Hypotheses

EDA

Select the proper dataset for analysis, perform EDA, and generate data groups for testing.

Select dataset

df_orderDetail = get_table(cur, 'OrderDetail')
df_orderDetail.head()
Id OrderId ProductId UnitPrice Quantity Discount
0 10248/11 10248 11 14.0 12 0.0
1 10248/42 10248 42 9.8 10 0.0
2 10248/72 10248 72 34.8 5 0.0
3 10249/14 10249 14 18.6 9 0.0
4 10249/51 10249 51 42.4 40 0.0

Group

# check value counts for each level of discount
df_orderDetail['Discount'].value_counts()
0.00    1317
0.05     185
0.10     173
0.20     161
0.15     157
0.25     154
0.03       3
0.02       2
0.01       1
0.04       1
0.06       1
Name: Discount, dtype: int64
# insert boolean column showing whether or not an order was discounted
df_orderDetail['discounted'] = np.where(df_orderDetail['Discount'] == 0.0, 0, 1)

# compare number of discount vs fullprice orders
df_orderDetail['discounted'].value_counts()
0    1317
1     838
Name: discounted, dtype: int64
# split orders into two groups (series): discount and fullprice order quantity
fullprice = df_orderDetail.groupby('discounted').get_group(0)['Quantity']
discount = df_orderDetail.groupby('discounted').get_group(1)['Quantity']

Explore

diff = (discount.mean() - fullprice.mean())
diff
5.394523243866239
# visually inspect differences in mean and StDev of distributions
sns.set_style("whitegrid")
%config InlineBackend.figure_format='retina'
%matplotlib inline
fig = plt.figure(figsize=(10,8))
ax = fig.gca()

ax.axvline(fullprice.mean(), color='blue', lw=2, ls='--', label='FP Avg')
ax.axvline(discount.mean(), color='orange', lw=2, ls='--', label='DC Avg')

fdict = {'fontfamily': 'PT Mono','fontsize': 16}

sns.distplot(fullprice, ax=ax, hist=True, kde=True, color='blue')
sns.distplot(discount, ax=ax, hist=True, kde=True, color='orange')
ax.legend(['Full Price', 'Discount'])
ax.set_title("Distribution of Full Price vs Discount Order Quantity", fontdict=fdict)
Text(0.5, 1.0, 'Distribution of Full Price vs Discount Order Quantity')

png

fig = plt.figure(figsize=(10,8))
ax = fig.gca()
ax = sns.barplot(x='Discount', y='Quantity', data=df_orderDetail)
ax.set_title('Discount Levels and Order Qty', fontdict={'family': 'PT Mono', 'size':16})
Text(0.5, 1.0, 'Discount Levels and Order Qty')

png

We can already see that there is a clear relationship between order quantity and specific discount levels before running any statistical tests. However, what is more interesting to note from the visualization above is that the discount levels that DO have an effect appear to be very similar as far as the mean order quantity. The indication is that discount amount produces diminishing returns (offering a discount higher than 5% - the minimum effective amount - does not actually produce higher order quantity which means we are losing revenue we would have otherwise captured).

Assumption Tests

Select the appropriate t-test based on tests for the assumptions of normality and homogeneity of variance.

Sample Size

Check if sample sizes allow us to ignore assumptions; if not, test assumption normality.

# visualize sample size comparisons for two groups (normality check)
import scipy.stats as stat
plt.bar(x='Full Price', height=fullprice.mean(), yerr=stat.sem(fullprice))
plt.bar(x='Discount', height=discount.mean(), yerr=stat.sem(discount))
plt.title("Order Quantity Sample Sizes: Full Price vs Discount")
Text(0.5, 1.0, 'Order Quantity Sample Sizes: Full Price vs Discount')

png

Normality Test

Check assumptions of normality and homogeneity of variance

# Test for normality - D'Agostino-Pearson's normality test: scipy.stats.normaltest
stat.normaltest(fullprice), stat.normaltest(discount)
(NormaltestResult(statistic=544.5770045551502, pvalue=5.579637380545965e-119),
 NormaltestResult(statistic=261.528012299789, pvalue=1.6214878452829618e-57))

Failed normality test (p-values < 0.05). Run non-parametric test:

# Run non-parametric test (since normality test failed)
stat.mannwhitneyu(fullprice, discount)
MannwhitneyuResult(statistic=461541.0, pvalue=6.629381826999866e-11)

Statistical Test

Perform chosen statistical test.

# run tukey test for OQD (Order Quantity Discount) 
data = df_orderDetail['Quantity'].values
labels = df_orderDetail['Discount'].values

import statsmodels.api as sms
model = sms.stats.multicomp.pairwise_tukeyhsd(data,labels)
# save OQD tukey test model results into dataframe (OQD: order quantity discount)
tukey_OQD = pd.DataFrame(data=model._results_table[1:], columns=model._results_table[0])
tukey_OQD
group1 group2 meandiff p-adj lower upper reject
0 0.0 0.01 -19.7153 0.9 -80.3306 40.9001 False
1 0.0 0.02 -19.7153 0.9 -62.593 23.1625 False
2 0.0 0.03 -20.0486 0.725 -55.0714 14.9742 False
3 0.0 0.04 -20.7153 0.9 -81.3306 39.9001 False
4 0.0 0.05 6.2955 0.0011 1.5381 11.053 True
5 0.0 0.06 -19.7153 0.9 -80.3306 40.9001 False
6 0.0 0.1 3.5217 0.4269 -1.3783 8.4217 False
7 0.0 0.15 6.6669 0.0014 1.551 11.7828 True
8 0.0 0.2 5.3096 0.0303 0.2508 10.3684 True
9 0.0 0.25 6.525 0.0023 1.3647 11.6852 True
10 0.01 0.02 0.0 0.9 -74.2101 74.2101 False
11 0.01 0.03 -0.3333 0.9 -70.2993 69.6326 False
12 0.01 0.04 -1.0 0.9 -86.6905 84.6905 False
13 0.01 0.05 26.0108 0.9 -34.745 86.7667 False
14 0.01 0.06 0.0 0.9 -85.6905 85.6905 False
15 0.01 0.1 23.237 0.9 -37.5302 84.0042 False
16 0.01 0.15 26.3822 0.9 -34.4028 87.1671 False
17 0.01 0.2 25.0248 0.9 -35.7554 85.805 False
18 0.01 0.25 26.2403 0.9 -34.5485 87.029 False
19 0.02 0.03 -0.3333 0.9 -55.6463 54.9796 False
20 0.02 0.04 -1.0 0.9 -75.2101 73.2101 False
21 0.02 0.05 26.0108 0.6622 -17.0654 69.087 False
22 0.02 0.06 0.0 0.9 -74.2101 74.2101 False
23 0.02 0.1 23.237 0.7914 -19.8552 66.3292 False
24 0.02 0.15 26.3822 0.6461 -16.7351 69.4994 False
25 0.02 0.2 25.0248 0.7089 -18.0857 68.1354 False
26 0.02 0.25 26.2403 0.6528 -16.8823 69.3628 False
27 0.03 0.04 -0.6667 0.9 -70.6326 69.2993 False
28 0.03 0.05 26.3441 0.3639 -8.9214 61.6096 False
29 0.03 0.06 0.3333 0.9 -69.6326 70.2993 False
30 0.03 0.1 23.5703 0.5338 -11.7147 58.8553 False
31 0.03 0.15 26.7155 0.3436 -8.6001 62.0311 False
32 0.03 0.2 25.3582 0.428 -9.9492 60.6656 False
33 0.03 0.25 26.5736 0.3525 -8.7485 61.8957 False
34 0.04 0.05 27.0108 0.9 -33.745 87.7667 False
35 0.04 0.06 1.0 0.9 -84.6905 86.6905 False
36 0.04 0.1 24.237 0.9 -36.5302 85.0042 False
37 0.04 0.15 27.3822 0.9 -33.4028 88.1671 False
38 0.04 0.2 26.0248 0.9 -34.7554 86.805 False
39 0.04 0.25 27.2403 0.9 -33.5485 88.029 False
40 0.05 0.06 -26.0108 0.9 -86.7667 34.745 False
41 0.05 0.1 -2.7738 0.9 -9.1822 3.6346 False
42 0.05 0.15 0.3714 0.9 -6.2036 6.9463 False
43 0.05 0.2 -0.986 0.9 -7.5166 5.5447 False
44 0.05 0.25 0.2294 0.9 -6.3801 6.839 False
45 0.06 0.1 23.237 0.9 -37.5302 84.0042 False
46 0.06 0.15 26.3822 0.9 -34.4028 87.1671 False
47 0.06 0.2 25.0248 0.9 -35.7554 85.805 False
48 0.06 0.25 26.2403 0.9 -34.5485 87.029 False
49 0.1 0.15 3.1452 0.9 -3.5337 9.824 False
50 0.1 0.2 1.7879 0.9 -4.8474 8.4231 False
51 0.1 0.25 3.0033 0.9 -3.7096 9.7161 False
52 0.15 0.2 -1.3573 0.9 -8.1536 5.4389 False
53 0.15 0.25 -0.1419 0.9 -7.014 6.7302 False
54 0.2 0.25 1.2154 0.9 -5.6143 8.0451 False
# Plot a universal confidence interval of each group mean comparing significant differences in group means. 
# Significant differences at the alpha=0.05 level can be identified by intervals that do not overlap 

oq_data = df_orderDetail['Quantity'].values
discount_labels = df_orderDetail['Discount'].values

from statsmodels.stats.multicomp import MultiComparison
oqd = MultiComparison(oq_data, discount_labels)
results = oqd.tukeyhsd()
results.plot_simultaneous(comparison_name=0.05, xlabel='Order Quantity', ylabel='Discount Level');

png

Effect Size

Calculate effect size using Cohen’s D as well as any post-hoc tests.

#### Cohen's d
def Cohen_d(group1, group2):
    # Compute Cohen's d.
    # group1: Series or NumPy array
    # group2: Series or NumPy array
    # returns a floating point number 
    diff = group1.mean() - group2.mean()

    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()

    # Calculate the pooled threshold as shown earlier
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    
    # Calculate Cohen's d statistic
    d = diff / np.sqrt(pooled_var)
    
    return d
Cohen_d(discount, fullprice)
0.2862724481729282

Post-hoc Tests

The mean quantity per order is similar for each of the discount levels that we identified as significant. The obvious conclusion to draw from this is that offering a discount higher than 5% does not increase the order quantities; higher discounts only produce higher loss in revenue.

# Extract revenue lost per discounted order where discount had no effect on order quantity
cur.execute("""SELECT Discount, 
                SUM(UnitPrice * Quantity) as 'revLoss',
                COUNT(OrderId) as 'NumOrders'
                FROM orderDetail  
                GROUP BY Discount
                HAVING Discount != 0 AND Discount != 0.05
                ORDER BY revLoss DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()

9
Discount revLoss NumOrders
0 0.25 131918.09 154
1 0.20 111476.38 161
2 0.15 102948.44 157
3 0.10 101665.71 173
4 0.03 124.65 3
print("Total Revenue Forfeited $", df.revLoss.sum())
print("Number of Orders Affected ", df.NumOrders.sum())
print("Avg Forfeited Per Order $", df.revLoss.sum()/df.NumOrders.sum())
Total Revenue Forfeited $ 448373.27
Number of Orders Affected  653
Avg Forfeited Per Order $ 686.6359418070444

Analyze Results

Where alpha = 0.05, the null hypothesis is rejected. Discount amount has a statistically significant effect on the quantity in an order where the discount level is equal to 5%, 15%, 20% or 25%.

H2: Country–Discount

Do individual countries show a statistically significant preference for discount?

If so, which countries and to what extent?

Hypotheses

EDA

Select the proper dataset for analysis, perform EDA, and generate data groups for testing.

Select

df_order = get_table(cur, "'Order'")
display(df_order.head())
display(df_orderDetail.head())
OrderId CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
0 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
1 10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
2 10250 HANAR 4 2012-07-08 2012-08-05 2012-07-12 2 65.83 Hanari Carnes Rua do Paço, 67 Rio de Janeiro South America 05454-876 Brazil
3 10251 VICTE 3 2012-07-08 2012-08-05 2012-07-15 1 41.34 Victuailles en stock 2, rue du Commerce Lyon Western Europe 69004 France
4 10252 SUPRD 4 2012-07-09 2012-08-06 2012-07-11 2 51.30 Suprêmes délices Boulevard Tirou, 255 Charleroi Western Europe B-6000 Belgium
Id OrderId ProductId UnitPrice Quantity Discount discounted
0 10248/11 10248 11 14.0 12 0.0 0
1 10248/42 10248 42 9.8 10 0.0 0
2 10248/72 10248 72 34.8 5 0.0 0
3 10249/14 10249 14 18.6 9 0.0 0
4 10249/51 10249 51 42.4 40 0.0 0
# Rename 'Id' to 'OrderId' for joining tables with matching primary key name
df_order.rename({'Id':'OrderId'}, axis=1, inplace=True)
display(df_order.head())
OrderId CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
0 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
1 10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
2 10250 HANAR 4 2012-07-08 2012-08-05 2012-07-12 2 65.83 Hanari Carnes Rua do Paço, 67 Rio de Janeiro South America 05454-876 Brazil
3 10251 VICTE 3 2012-07-08 2012-08-05 2012-07-15 1 41.34 Victuailles en stock 2, rue du Commerce Lyon Western Europe 69004 France
4 10252 SUPRD 4 2012-07-09 2012-08-06 2012-07-11 2 51.30 Suprêmes délices Boulevard Tirou, 255 Charleroi Western Europe B-6000 Belgium
df_order.set_index('OrderId',inplace=True)
display(df_order.head())
CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
OrderId
10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
10250 HANAR 4 2012-07-08 2012-08-05 2012-07-12 2 65.83 Hanari Carnes Rua do Paço, 67 Rio de Janeiro South America 05454-876 Brazil
10251 VICTE 3 2012-07-08 2012-08-05 2012-07-15 1 41.34 Victuailles en stock 2, rue du Commerce Lyon Western Europe 69004 France
10252 SUPRD 4 2012-07-09 2012-08-06 2012-07-11 2 51.30 Suprêmes délices Boulevard Tirou, 255 Charleroi Western Europe B-6000 Belgium
df_country = df_orderDetail.merge(df_order, on='OrderId', copy=True)

Explore

fs.ft.hakkeray.hot_stats(df_country, 'ShipCountry')
-------->
HOT!STATS
<--------

SHIPCOUNTRY
Data Type: object

min    Argentina
max    Venezuela
Name: ShipCountry, dtype: object 

à-la-Mode: 
0    USA
dtype: object


No Nulls Found!

Non-Null Value Counts:
USA            352
Germany        328
Brazil         203
France         184
UK             135
Austria        125
Venezuela      118
Sweden          97
Canada          75
Mexico          72
Belgium         56
Ireland         55
Spain           54
Finland         54
Italy           53
Switzerland     52
Denmark         46
Argentina       34
Portugal        30
Poland          16
Norway          16
Name: ShipCountry, dtype: int64

# Unique Values: 21

Group

countries = df_country.groupby('ShipCountry').groups
countries.keys()
dict_keys(['Argentina', 'Austria', 'Belgium', 'Brazil', 'Canada', 'Denmark', 'Finland', 'France', 'Germany', 'Ireland', 'Italy', 'Mexico', 'Norway', 'Poland', 'Portugal', 'Spain', 'Sweden', 'Switzerland', 'UK', 'USA', 'Venezuela'])
df_countries = df_country[['ShipCountry','Quantity','discounted']].copy()
df_countries.ShipCountry.value_counts()
USA            352
Germany        328
Brazil         203
France         184
UK             135
Austria        125
Venezuela      118
Sweden          97
Canada          75
Mexico          72
Belgium         56
Ireland         55
Spain           54
Finland         54
Italy           53
Switzerland     52
Denmark         46
Argentina       34
Portugal        30
Poland          16
Norway          16
Name: ShipCountry, dtype: int64
import researchpy as rp
rp.summary_cont(df_countries.groupby(['discounted']))
Quantity
N Mean SD SE 95% Conf. Interval
discounted
0 1317 21.715262 17.507493 0.482426 20.769706 22.660818
1 838 27.109785 20.771439 0.717537 25.703412 28.516159

Test

Sample Size

# Check if sample sizes allow us to ignore assumptions;
# visualize sample size comparisons for two groups (normality check)

stat_dict = {}

for k,v in countries.items():
    try:
        grp0 = df_countries.loc[v].groupby('discounted').get_group(0)['Quantity']
        grp1 = df_countries.loc[v].groupby('discounted').get_group(1)['Quantity']
        print(f"{k}")
        
        import scipy.stats as stat

        plt.bar(x='Full Price', height=grp0.mean(), yerr=stat.sem(grp0))
        plt.bar(x='Discounted', height=grp1.mean(), yerr=stat.sem(grp1))
        plt.show()
        
    except:
        pass
        
    try:
        result = stat.ttest_ind(grp0,grp1)
        if result[1] < 0.05:
            stat_dict[k] = result[1]
            print(f"\n{k} PREFERS DISCOUNTS!")
        else:
            continue
    except:
        print(f"{k} does not contain one of the groups.")
stat_dict
Argentina does not contain one of the groups.
Austria

png

Belgium

png

Brazil

png

Canada

png

Canada PREFERS DISCOUNTS!
Denmark

png

Finland

png

France

png

Germany

png

Ireland

png

Italy

png

Mexico

png

Norway PREFERS DISCOUNTS!
Portugal

png

Spain

png

Spain PREFERS DISCOUNTS!
Sweden

png

Switzerland

png

UK

png

UK PREFERS DISCOUNTS!
USA

png

USA PREFERS DISCOUNTS!
Venezuela

png

{'Canada': 0.0010297982736886485,
 'Norway': 0.04480094051665529,
 'Spain': 0.0025087181106716217,
 'UK': 0.00031794803200322925,
 'USA': 0.019868707223971476}
stat_dict
{'Canada': 0.0010297982736886485,
 'Norway': 0.04480094051665529,
 'Spain': 0.0025087181106716217,
 'UK': 0.00031794803200322925,
 'USA': 0.019868707223971476}

Normality Test

fig = plt.figure(figsize=(10,8))
ax = fig.gca(title="Distribution of Full price vs Discounted Orders")

sns.distplot(grp0)
sns.distplot(grp1)
ax.legend(['Full Price','Discounted'])
<matplotlib.legend.Legend at 0x1a25ab2978>

png

# Test for normality - D'Agostino-Pearson's normality test: scipy.stats.normaltest
stat.normaltest(grp0), stat.normaltest(grp1)
(NormaltestResult(statistic=9.316225653095811, pvalue=0.009484344125890621),
 NormaltestResult(statistic=10.255309993341813, pvalue=0.005930451108115991))
# Run non-parametric test (since normality test failed)
stat.mannwhitneyu(grp0, grp1)
MannwhitneyuResult(statistic=1632.5, pvalue=0.44935140740973323)

Canada, Spain, UK and the USA have pvalues < 0.05 indicating there is a relationship between discount and order quantity and the null hypothesis is rejected for these individual countries.

Statistical Test

import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols("Quantity~C(discounted)+C(ShipCountry)+C(discounted):C(ShipCountry)", data=df_countries).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
/Users/hakkeray/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/statsmodels/base/model.py:1752: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 20, but rank is 18
  'rank is %d' % (J, J_), ValueWarning)
/Users/hakkeray/opt/anaconda3/envs/learn-env/lib/python3.6/site-packages/statsmodels/base/model.py:1752: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 20, but rank is 18
  'rank is %d' % (J, J_), ValueWarning)
# reformat scientific notation of results for easier interpretation
anova_table.style.format("{:.5f}", subset=['PR(>F)'])
sum_sq df F PR(>F)
C(discounted) 9.78092e-08 1 3.07557e-10 0.99999
C(ShipCountry) 101347 20 15.9341 0.00000
C(discounted):C(ShipCountry) 15584.9 20 2.4503 0.00061
Residual 672930 2116 nan nan
# calculate ttest_ind p-values and significance for individual countries
print(f"\n Countries with p-values < 0.05 - Null Hypothesis Rejected:")
for k,v in countries.items():
    try:
        grp0 = df_countries.loc[v].groupby('discounted').get_group(0)['Quantity']
        grp1 = df_countries.loc[v].groupby('discounted').get_group(1)['Quantity']
        result = stat.ttest_ind(grp0,grp1)
        if result[1] < 0.05:
            
            print(f"\n\t{k}: {result[1].round(4)}")
        else:
            continue
    except:
        None 
 Countries with p-values < 0.05 - Null Hypothesis Rejected:

	Canada: 0.001

	Spain: 0.0025

	UK: 0.0003

	USA: 0.0199

Although discount does not have a significant effect on countries overall (p = 0.99), there is a statistically significant relationship between order quantities and discount in some of the countries (p=0.0006).

Countries with p-values < 0.05 - Null Hypothesis Rejected:

Canada: 0.001

Spain: 0.0025

UK: 0.0003

USA: 0.0199
y1 = df_countries.groupby('discounted').get_group(1)['Quantity']


fig = plt.figure(figsize=(18,12))
ax = fig.gca()

ax = sns.barplot(x='ShipCountry', y=y1, data=df_countries)

ax.set_title('Average Discount Order Quantity by Country', fontdict={'family': 'PT Mono', 'size':16})
Text(0.5, 1.0, 'Average Discount Order Quantity by Country')

png

Effect Size

Effect size testing is unnecessary since the null hypothesis for the main question was not rejected.

Post-hoc Tests

#!pip install pandasql
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
# Compare number of discount vs fullprice orders by country.
# Create bar plots grouped as discount vs fullprice orders by country
#fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(18,8))

q1 = "SELECT ShipCountry, AVG(Quantity) as OrderQty from df_countries where discounted = 0 group by 1;"
q2 = "SELECT ShipCountry, AVG(Quantity) as OrderQty from df_countries where discounted = 1 group by 1;"

df_fpCount = pysqldf(q1)
df_dcCount = pysqldf(q2)

df_fpCount['Group'] = 'FullPrice'
df_dcCount['Group'] = 'Discount'

df_country_qty = pd.concat([df_fpCount, df_dcCount], axis=0)

display(df_country_qty.describe())

#ax = sns.barplot(x='ShipCountry', y='NumOrders', data=country_df, hue='Group', palette='pastel', orient='v')
#ax.set_title('Number of Fullprice vs Discount Orders by Country', fontdict={'family': 'monospace', 'size':16})

#ax1 = sns.barplot(x='ShipCountry', y='TotalQty', data=country_df, hue='Group', palette='pastel', orient='v')
#ax1.set_title('Total Qty of Fullprice vs Discount Orders by Country', fontdict={'family': 'monospace', 'size':16})

sns.set_style("whitegrid")
fig = plt.figure(figsize=(18,8))
ax = fig.gca(title="Average Order Quantity by Country: Fullprice vs Discount")

sns.barplot(x='ShipCountry', y='OrderQty', ax=ax, data=df_country_qty, hue='Group', 
            palette='pastel', orient='v', ci=68, capsize=.2)

## Set Title,X/Y Labels,fonts,formatting
ax_font = {'family':'monospace','weight':'semibold','size':14}
tick_font = {'size':12,'ha':'center','rotation':45}
t_label = "Average Order Quantity by Country: Fullprice vs Discount Orders"
t_font = {'family': 'PT Mono', 'size':18}

ax.set_ylabel("Order Qty", fontdict=ax_font)
ax.set_xlabel("Country", fontdict=ax_font)
#ax.set_title('Average Order Quantity by Country: Fullprice vs Discount', fontdict={'family': 'PT Mono', 'size':16})
ax.set_title(t_label, fontdict=t_font)

OrderQty
count 39.000000
mean 22.596281
std 7.620086
min 9.970588
25% 17.458458
50% 21.750000
75% 25.985539
max 43.172414
Text(0.5, 1.0, 'Average Order Quantity by Country: Fullprice vs Discount Orders')

png

According to the plot above, the actual number of discounted orders is lower than the number of full price orders. Let’s compare the sum of quantities for these orders in each group.

# Compare number of discount vs fullprice orders by country.
# Create bar plots grouped as discount vs fullprice orders by country
#fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(18,8))

q1 = "SELECT ShipCountry, Count(*) as OrderCount from df_countries where discounted = 0 group by 1;"
q2 = "SELECT ShipCountry, Count(*) as OrderCount from df_countries where discounted = 1 group by 1;"

df_fpCount = pysqldf(q1)
df_dcCount = pysqldf(q2)

df_fpCount['Group'] = 'FullPrice'
df_dcCount['Group'] = 'Discount'

df_country_count = pd.concat([df_fpCount, df_dcCount], axis=0)

display(df_country_count.describe())

#ax = sns.barplot(x='ShipCountry', y='NumOrders', data=country_df, hue='Group', palette='pastel', orient='v')
#ax.set_title('Number of Fullprice vs Discount Orders by Country', fontdict={'family': 'monospace', 'size':16})

#ax1 = sns.barplot(x='ShipCountry', y='TotalQty', data=country_df, hue='Group', palette='pastel', orient='v')
#ax1.set_title('Total Qty of Fullprice vs Discount Orders by Country', fontdict={'family': 'monospace', 'size':16})


fig = plt.figure(figsize=(18,8))
ax = fig.gca(title="Mean QPO by Country")

sns.barplot(x='ShipCountry', y='OrderCount', ax=ax, data=df_country_count, hue='Group', palette='Reds_d', 
            orient='v', ci=68, capsize=.2)

## Set Title,X/Y Labels,fonts,formatting
ax_font = {'family':'monospace','weight':'semibold','size':14}
tick_font = {'size':12,'ha':'center','rotation':45}
t_label = "Count of Fullprice vs Discount Orders by Country"
t_font = {'family': 'PT Mono', 'size':18}

ax.set_ylabel("Number of Orders", fontdict=ax_font)
ax.set_xlabel("Country", fontdict=ax_font)
#ax.set_title('Average Order Quantity by Country: Fullprice vs Discount', fontdict={'family': 'PT Mono', 'size':16})
ax.set_title(t_label, fontdict=t_font)

OrderCount
count 39.000000
mean 55.256410
std 48.722478
min 8.000000
25% 22.000000
50% 39.000000
75% 69.500000
max 210.000000
Text(0.5, 1.0, 'Count of Fullprice vs Discount Orders by Country')

png

This still doesn’t tell us much about whether or not these countries prefer discounts (tend to order more products) or not - in order to get better insight, we need to look at the average order size (mean quantities per order) for each group.

# Compare number of discount vs fullprice orders by country.
# Create bar plots grouped as discount vs fullprice orders by country
#fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(18,8))

#q1 = "SELECT ShipCountry, Count(*) as NumOrders, SUM(Quantity) as TotalQty, AVG(Quantity) as MeanQPO from df_countries where discounted = 0 group by 1;"
#q2 = "SELECT ShipCountry, Count(*) as NumOrders, SUM(Quantity) as TotalQty, AVG(Quantity) as MeanQPO from df_countries where discounted = 1 group by 1;"

q1 = "SELECT ShipCountry, AVG(Quantity) as MeanQPO from df_countries where discounted = 0 group by 1;"
q2 = "SELECT ShipCountry, AVG(Quantity) as MeanQPO from df_countries where discounted = 1 group by 1;"

fullprice_df = pysqldf(q1)
discount_df = pysqldf(q2)

fullprice_df['Group'] = 'FullPrice'
discount_df['Group'] = 'Discount'

country_df = pd.concat([fullprice_df, discount_df], axis=0)

display(country_df.describe())

#ax = sns.barplot(x='ShipCountry', y='NumOrders', data=country_df, hue='Group', palette='pastel', orient='v')
#ax.set_title('Number of Fullprice vs Discount Orders by Country', fontdict={'family': 'monospace', 'size':16})

#ax1 = sns.barplot(x='ShipCountry', y='TotalQty', data=country_df, hue='Group', palette='pastel', orient='v')
#ax1.set_title('Total Qty of Fullprice vs Discount Orders by Country', fontdict={'family': 'monospace', 'size':16})


fig = plt.figure(figsize=(18,8))
ax = fig.gca(title="Mean QPO by Country")

sns.barplot(x='ShipCountry', y='MeanQPO', ax=ax, data=country_df, hue='Group', palette='Greens_d', 
            orient='v', capsize=.2)

## Set Title,X/Y Labels,fonts,formatting
ax_font = {'family':'monospace','weight':'semibold','size':14}
tick_font = {'size':12,'ha':'center','rotation':45}
t_label = "Average Order Quantity by Country: Fullprice vs Discount"
t_font = {'family': 'PT Mono', 'size':18}

ax.set_ylabel("Avg Qty per Order ", fontdict=ax_font)
ax.set_xlabel("Country", fontdict=ax_font)
#ax.set_title('Average Order Quantity by Country: Fullprice vs Discount', fontdict={'family': 'PT Mono', 'size':16})
ax.set_title(t_label, fontdict=t_font)

MeanQPO
count 39.000000
mean 22.596281
std 7.620086
min 9.970588
25% 17.458458
50% 21.750000
75% 25.985539
max 43.172414
Text(0.5, 1.0, 'Average Order Quantity by Country: Fullprice vs Discount')

png

The above plots indicate that when a discount is offered, certain countries order higher quantities of products. Let’s look at the values to determine what percentage more they purchase when an order is discounted.

# add new col for countries where discount has significant effect
fig = plt.figure(figsize=(18,12))
ax = fig.gca()
df_countries['effect_cqd'] = df_countries['ShipCountry'].isin(['Spain', 'UK', 'USA', 'Canada'])
ax = sns.barplot(x='ShipCountry', y='Quantity', hue='effect_cqd', palette='pastel', data=df_countries)

png

q1 = "SELECT ShipCountry, Count(*) as OrderCount from df_countries where discounted = 0 group by 1;"
q2 = "SELECT ShipCountry, Count(*) as OrderCount from df_countries where discounted = 1 group by 1;"

df_fpCount = pysqldf(q1)
df_dcCount = pysqldf(q2)

df_fpCount['Group'] = 'FullPrice'
df_dcCount['Group'] = 'Discount'

df_countryCount = pd.concat([df_fpCount, df_dcCount])

fig = plt.figure(figsize=(18,8))
ax = fig.gca(title="Average Order Quantity by Country")

ax = sns.barplot(x='ShipCountry', y='OrderCount', data=df_countryCount)
ax.set_title('Order Count by Country', fontdict={'family': 'PT Mono', 'size':16})
Text(0.5, 1.0, 'Order Count by Country')

png

Results

For certain individual countries (Spain, Canada, UK, USA), the null hypothesis is rejected with 95% certainty (alpha=0.05)

H3: Region & Revenue

Does average revenue per order vary between different customer regions?

If so, how do the regions rank in terms of average revenue per order?

Additional questions to explore: Does geographic distance between distributor and shipcountry have an effect on order quantity? Does shipping cost have an effect on order quantity?

Hypotheses

$H_0$ the average revenue per order is the same between different customer regions.

$H_1$ Alternate hypothesis: the average revenue per order is different (higher or lower) across different customer regions.

The alpha level (i.e. the probability of rejecting the null hypothesis when it is true) is = 0.05.

EDA

Select the proper dataset for analysis, generate data groups for testing, perform EDA.

Select

# Extract revenue per product per order
cur.execute("""SELECT c.Region, od.OrderId, od.Quantity, od.UnitPrice, od.Discount
FROM Customer c
JOIN 'Order' o ON c.Id = o.CustomerId
JOIN OrderDetail od USING(OrderId);""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
2078
Region OrderId Quantity UnitPrice Discount
0 Western Europe 10248 12 14.0 0.0
1 Western Europe 10248 10 9.8 0.0
2 Western Europe 10248 5 34.8 0.0
3 Western Europe 10249 9 18.6 0.0
4 Western Europe 10249 40 42.4 0.0
# Get total revenue per order

df['Revenue'] = df.Quantity * df.UnitPrice * (1-df.Discount)
# Drop unnecessary columns
df.drop(['Quantity', 'UnitPrice', 'Discount'], axis=1, inplace=True)

Group

# Group data by order and get average revenue per order for each region
df_region = df.groupby(['Region', 'OrderId'])['Revenue'].mean().reset_index()
# drop Order Id (no longer necessary)
df_region.drop('OrderId', axis=1, inplace=True)
# check changes
df_region.head()
Region Revenue
0 British Isles 239.70
1 British Isles 661.25
2 British Isles 352.40
3 British Isles 258.40
4 British Isles 120.20
# Explore sample sizes before testing: n > 30 to pass assumptions
df_region.groupby('Region').count()
Revenue
Region
British Isles 75
Central America 21
Eastern Europe 7
North America 152
Northern Europe 55
Scandinavia 28
South America 127
Southern Europe 64
Western Europe 272

Some of the sample sizes are too small to ignore assumptions of normality. We can combine some regions to meet the required threshold of n > 30.

# Group sub-regions together to create sample sizes adequately large for ANOVA testing  (min 30)

# Group Scandinavia, Northern and Eastern Europe
df_region.loc[(df_region.Region == 'Scandinavia') | (df_region.Region == 'Eastern Europe') | (df_region.Region == 'Northern Europe'), 'Region'] = 'North Europe'

# Group South and Central America
df_region.loc[(df_region.Region == 'South America') | (df_region.Region == 'Central America'), 'Region'] = 'South Americas'

# Review sizes of new groups
df_region.groupby('Region').count()
Revenue
Region
British Isles 75
North America 152
North Europe 90
South Americas 148
Southern Europe 64
Western Europe 272

Explore

fig = plt.figure(figsize=(10,8))
ax = fig.gca()

sns.distplot(grp0)
sns.distplot(grp1)
ax.legend(['Full Price','Discounted'])

# Plot number of orders, total revenue, and average revenue per order by region
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(8,8))
# Number of orders
df_region.groupby(['Region'])['Revenue'].count().plot(kind='barh', ax=ax1, color='b')

# Total Revenue
df_region.groupby(['Region'])['Revenue'].sum().plot(kind='barh', ax=ax2, color='r')

# Average Revenue
df_region.groupby(['Region'])['Revenue'].mean().plot(kind='barh', ax=ax3, color='g')

# Label plots and axes
ax1.set_title('Total Orders')
ax1.set_ylabel('')
ax2.set_title('Total Revenue in US$')
ax2.set_ylabel('')
ax3.set_title('Average Revenue per Order US$')
ax3.set_ylabel('')

fig.subplots_adjust(hspace=0.4);

png

png

The graphs show that Western Europe is the region with the greatest number of orders, and also has the greatest total revenue. However, North America has the most expensive order on average (followed by Western Europe). Southern and Eastern Europe has the lowest number of orders, lowest total revenue, and cheapest order on average. The third graph lent support to the alternate hypothesis that there are significant differences in average order revenue between regions.

Test

Sample Size

Check if sample sizes allow us to ignore assumptions of normality

# visualize sample size comparisons, check normality (pvals)
fig = plt.figure(figsize=(12,6))
ax = fig.gca()

ax = sns.barplot(x='Region', y='Revenue', data=df_region, ci=68, palette="pastel", hue='Region')
ax.set_title('Average Order Revenue by Region', fontdict={'family': 'PT Mono', 'size':16})
Text(0.5, 1.0, 'Average Order Revenue by Region')

png

Normality


Statistical

import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols("Revenue~C(Region)+Revenue:C(Region)", data=df_region).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
# reformat scientific notation of results for easier interpretation
anova_table.style.format("{:.5f}", subset=['PR(>F)'])
sum_sq df F PR(>F)
C(Region) 1.03486e+07 5 3.98262e+30 0.00000
Revenue:C(Region) 5.34162e+08 6 1.71309e+32 0.00000
Residual 4.10034e-22 789 nan nan
# run tukey test for OQD (Order Quantity Discount) 
data = df_region['Revenue'].values
labels = df_region['Region'].values

import statsmodels.api as sms
model = sms.stats.multicomp.pairwise_tukeyhsd(data,labels)

# save OQD tukey test model results into dataframe (OQD: order quantity discount)
tukey_OQD = pd.DataFrame(data=model._results_table[1:], columns=model._results_table[0])
tukey_OQD
group1 group2 meandiff p-adj lower upper reject
0 British Isles North America 116.4615 0.9 -213.9625 446.8854 False
1 British Isles North Europe -88.1693 0.9 -454.2704 277.9318 False
2 British Isles South Americas -84.2501 0.9 -416.146 247.6458 False
3 British Isles Southern Europe -271.7815 0.3745 -670.2535 126.6904 False
4 British Isles Western Europe 81.6889 0.9 -223.7052 387.083 False
5 North America North Europe -204.6308 0.4191 -516.0716 106.8101 False
6 North America South Americas -200.7115 0.2778 -471.1191 69.6961 False
7 North America Southern Europe -388.243 0.0191 -737.1631 -39.3228 True
8 North America Western Europe -34.7725 0.9 -271.9032 202.3581 False
9 North Europe South Americas 3.9193 0.9 -309.0829 316.9214 False
10 North Europe Southern Europe -183.6122 0.7177 -566.4899 199.2655 False
11 North Europe Western Europe 169.8582 0.5251 -114.889 454.6055 False
12 South Americas Southern Europe -187.5315 0.6259 -537.8459 162.783 False
13 South Americas Western Europe 165.939 0.3541 -73.2385 405.1164 False
14 Southern Europe Western Europe 353.4704 0.0242 28.1539 678.787 True

North America and Southern Europe: pval = 0.01, mean diff: -388.24

Southern Europe and Western Europe: pval = 0.02, mean diff: 353.4704

Effect Size

Cohen’s D

northamerica = df_region.loc[df_region['Region'] == 'North America']
southerneurope = df_region.loc[df_region['Region'] == 'Southern Europe']
westerneurope = df_region.loc[df_region['Region'] == 'Western Europe']

na_se = Cohen_d(northamerica.Revenue, southerneurope.Revenue)
se_we = Cohen_d(southerneurope.Revenue, westerneurope.Revenue)

print(na_se, se_we)
0.5891669383438923 -0.5462384714677272

Post-Hoc Tests

# log-transforming revenue per order
logRegion_df = df_region.copy()
logRegion_df['Revenue'] = np.log(df_region['Revenue'])

# Plotting the distributions for the log-transformed data
sns.set_style("whitegrid")

fig = plt.figure(figsize=(12,8))
ax = fig.gca(title="Distribution of Revenue Per Order by Region")

for region in set(logRegion_df.Region):
    region_group = logRegion_df.loc[logRegion_df['Region'] == region]
    sns.distplot(region_group['Revenue'], hist_kws=dict(alpha=0.5), label=region)
    ax.legend()
    ax.set_label('Revenue per Order (log-transformed)')

png

# The data is more normally distributed, and variances from the mean were more similar. 
# run an ANOVA test:

# Fitting a model of revenue per order on Region categories - ANOVA table
lm = ols('Revenue ~ C(Region)', logRegion_df).fit()
sm.stats.anova_lm(lm, typ=2)

sum_sq df F PR(>F)
C(Region) 48.004167 5.0 12.076998 2.713885e-11
Residual 631.999979 795.0 NaN NaN

Results

At an alpha level of 0.05 significance, revenue does vary between regions and therefore the null hypothesis is rejected.

The ANOVA table above revealed that the p-value is lower than the alpha value of 0.05. Therefore I was able to reject the null hypothesis and accept the alternate hypothesis. There are statistically significant differences in average order value between different regions, i.e. customers from different parts of the world spend different amounts of money on their orders, on average. Conclusions Business insights: There are statistically significant differences in the average revenue per order from customers from different regions. Western European customers place the most orders, and are the single biggest contributors to Northwind’s bottom line. However, although North American customers have placed roughly half as many orders as those from Western Europe, they spend more per order, on average. The difference between the region with the most expensive orders on average (North America, $1,945.93) and the region with the least expensive orders (Southern and Eastern Europe, $686.73) is $1,259.20, or 2.8 times more for orders from North America. Southern and Eastern Europe has the smallest number of orders, the lowest total revenue, and the lowest average revenue per order. North American customers have placed a similar number of orders to those from South and Central America, but their average expenditure per order is 1.8 times higher. Potential business actions and directions for future work: If Northwind was looking to focus on more profitable customers, a potential action would be to stop serving customers in Southern and Eastern Europe, and to focus more on customers in Western Europe and North America. However, further analysis would be needed to confirm these findings. For example, it might be the case that some more expensive products are only available in certain regions.


H4: Season+Quantity:ProductCategory

1: Does time of year (month) have an effect on order quantity overall?

2: Does time of year (month) have an effect on order quantity of specific product categories?

3: Does time of year (month) have an effect on order quantity by region?

Hypotheses

EDA

Select

df_months = df_orderDetail.merge(df_order, on='OrderId', copy=True)
df_months.head()
Id OrderId ProductId UnitPrice Quantity Discount discounted CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
0 10248/11 10248 11 14.0 12 0.0 0 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
1 10248/42 10248 42 9.8 10 0.0 0 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
2 10248/72 10248 72 34.8 5 0.0 0 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
3 10249/14 10249 14 18.6 9 0.0 0 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
4 10249/51 10249 51 42.4 40 0.0 0 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
pd.to_datetime(df_months['OrderDate'], format='%Y/%m/%d').head()
0   2012-07-04
1   2012-07-04
2   2012-07-04
3   2012-07-05
4   2012-07-05
Name: OrderDate, dtype: datetime64[ns]
df_months['OrderMonth'] = pd.DatetimeIndex(df_months['OrderDate']).month
df_months['OrderYear'] = pd.DatetimeIndex(df_months['OrderDate']).year
df_months.head()
Id OrderId ProductId UnitPrice Quantity Discount discounted CustomerId EmployeeId OrderDate ... ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry OrderMonth OrderYear
0 10248/11 10248 11 14.0 12 0.0 0 VINET 5 2012-07-04 ... 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France 7 2012
1 10248/42 10248 42 9.8 10 0.0 0 VINET 5 2012-07-04 ... 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France 7 2012
2 10248/72 10248 72 34.8 5 0.0 0 VINET 5 2012-07-04 ... 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France 7 2012
3 10249/14 10249 14 18.6 9 0.0 0 TOMSP 6 2012-07-05 ... 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany 7 2012
4 10249/51 10249 51 42.4 40 0.0 0 TOMSP 6 2012-07-05 ... 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany 7 2012

5 rows × 22 columns

df_months.set_index('OrderDate', inplace=True)
df_months.head()
Id OrderId ProductId UnitPrice Quantity Discount discounted CustomerId EmployeeId RequiredDate ... ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry OrderMonth OrderYear
OrderDate
2012-07-04 10248/11 10248 11 14.0 12 0.0 0 VINET 5 2012-08-01 ... 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France 7 2012
2012-07-04 10248/42 10248 42 9.8 10 0.0 0 VINET 5 2012-08-01 ... 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France 7 2012
2012-07-04 10248/72 10248 72 34.8 5 0.0 0 VINET 5 2012-08-01 ... 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France 7 2012
2012-07-05 10249/14 10249 14 18.6 9 0.0 0 TOMSP 6 2012-08-16 ... 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany 7 2012
2012-07-05 10249/51 10249 51 42.4 40 0.0 0 TOMSP 6 2012-08-16 ... 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany 7 2012

5 rows × 21 columns

Group

# create seasonal-based dataframe with only columns we need
#keep_cols = ['OrderId', 'ProductId', 'UnitPrice', 'Quantity', 'ShipCountry', 'OrderMonth', 'OrderYear', 'Season']
drop_cols = ['OrderId', 'discounted', 'CustomerId', 'EmployeeId', 'Freight', 'RequiredDate', 'ShippedDate', 'ShipVia', 'ShipName', 'ShipAddress', 'ShipCity', 'ShipPostalCode']
df_monthly = df_months.copy()
df_monthly.drop(drop_cols, axis=1, inplace=True)
df_monthly.head()
Id ProductId UnitPrice Quantity Discount ShipRegion ShipCountry OrderMonth OrderYear
OrderDate
2012-07-04 10248/11 11 14.0 12 0.0 Western Europe France 7 2012
2012-07-04 10248/42 42 9.8 10 0.0 Western Europe France 7 2012
2012-07-04 10248/72 72 34.8 5 0.0 Western Europe France 7 2012
2012-07-05 10249/14 14 18.6 9 0.0 Western Europe Germany 7 2012
2012-07-05 10249/51 51 42.4 40 0.0 Western Europe Germany 7 2012
meanqpo = df_monthly.groupby('OrderMonth')['Quantity'].mean()

Explore

Test

Sample Size

sns.set_style("whitegrid")
%config InlineBackend.figure_format='retina'
%matplotlib inline


# Check if sample sizes allow us to ignore assumptions;
# visualize sample size comparisons for two groups (normality check)
fig = plt.figure(figsize=(14,8))
ax = fig.gca()
ax = sns.barplot(x='OrderMonth', y='Quantity', data=df_monthly)
ax.set_title('Monthly Order Qty', fontdict={'family': 'PT Mono', 'size':16})


Text(0.5, 1.0, 'Monthly Order Qty')

png

sns.set_style("whitegrid")
%config InlineBackend.figure_format='retina'
%matplotlib inline


# Check if sample sizes allow us to ignore assumptions;
# visualize sample size comparisons for two groups (normality check)
fig = plt.figure(figsize=(14,8))
ax = fig.gca()
ax = sns.barplot(x='OrderMonth', y='Quantity', data=df_monthly)
ax.set_title('Monthly Order Qty', fontdict={'family': 'PT Mono', 'size':16})


Text(0.5, 1.0, 'Monthly Order Qty')

png

# Anova Test - Season + Quantity ()

import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols("Quantity~C(OrderMonth)+Quantity:C(OrderMonth)", data=df_monthly).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
# reformat scientific notation of results for easier interpretation
anova_table.style.format("{:.5f}", subset=['PR(>F)'])
sum_sq df F PR(>F)
C(OrderMonth) 7395.98 11 2.94204e+29 0.00000
Quantity:C(OrderMonth) 772004 12 2.81504e+31 0.00000
Residual 4.87009e-24 2131 nan nan

Normality

Statistical

# split orders into two groups (series): discount and fullprice order quantity
Jan = df_monthly.groupby('OrderMonth').get_group(1)['Quantity']

# run tukey test for OQD (Order Quantity Discount) 
data = df_monthly['Quantity'].values
labels = df_monthly['OrderMonth'].values

import statsmodels.api as sms
model = sms.stats.multicomp.pairwise_tukeyhsd(data,labels)

# save OQD tukey test model results into dataframe (OQD: order quantity discount)
tukey_OQD = pd.DataFrame(data=model._results_table[1:], columns=model._results_table[0])
tukey_OQD
group1 group2 meandiff p-adj lower upper reject
0 1 2 1.3492 0.9 -4.6052 7.3037 False
1 1 3 -1.8729 0.9 -7.4759 3.73 False
2 1 4 0.5014 0.9 -5.0704 6.0733 False
3 1 5 -4.852 0.358 -11.2668 1.5627 False
4 1 6 -3.2421 0.9 -11.428 4.9438 False
... ... ... ... ... ... ... ...
61 9 11 0.3585 0.9 -6.73 7.4471 False
62 9 12 2.2267 0.9 -4.4923 8.9457 False
63 10 11 -1.5082 0.9 -8.3215 5.3051 False
64 10 12 0.3599 0.9 -6.068 6.7878 False
65 11 12 1.8682 0.9 -4.8142 8.5505 False

66 rows × 7 columns

Results

At a significance level of alpha = 0.05, we reject the null hypothesis which states there is no relationship between time of year (season) and sales revenue or volume of units sold.

Conclusion + Strategic Recommendations

Future Work

Questions to explore in future analyses might include:

  1. Build a product recommendation tool

  2. Create discounts or free shipping offers to increase sales volumes past a certain threshold.
    • Shipping Costs and Order Quantities/Sales Revenue Does shipping cost (freight) have a statistically significant effect on quantity? If so, at what level(s) of shipping cost?
  3. Customer Type and Product Category

Is there a relationship between type of customer and certain product categories? If so, we can run more highly targeted sales and marketing programs for increasing sales of certain products to certain market segments.

metricks

  1. What were the top 3 selling products overall?
  2. Top 3 selling products by country?
  3. Top 3 selling products by region?
  4. How did we do in sales for each product category?
  5. Can we group customers into customer types (fill the empty database) and build a product recommendation tool?
# Extract revenue per product category
cur.execute("""SELECT o.OrderId, o.CustomerId, od.ProductId, od.Quantity, od.UnitPrice, 
                od.Quantity*od.UnitPrice*(1-Discount) as Revenue, p.CategoryId, c.CategoryName
                FROM 'Order' o
                JOIN OrderDetail od 
                ON o.OrderId = od.OrderId
                JOIN Product p 
                ON od.ProductId = p.Id
                JOIN Category c
                ON p.CategoryId = c.Id
                ;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(len(df))
df.head(8)
2155
OrderId CustomerId ProductId Quantity UnitPrice Revenue CategoryId CategoryName
0 10248 VINET 11 12 14.0 168.0 4 Dairy Products
1 10248 VINET 42 10 9.8 98.0 5 Grains/Cereals
2 10248 VINET 72 5 34.8 174.0 4 Dairy Products
3 10249 TOMSP 14 9 18.6 167.4 7 Produce
4 10249 TOMSP 51 40 42.4 1696.0 7 Produce
5 10250 HANAR 41 10 7.7 77.0 8 Seafood
6 10250 HANAR 51 35 42.4 1261.4 7 Produce
7 10250 HANAR 65 15 16.8 214.2 2 Condiments
# Group data by Category and get sum total revenue for each
df_category = df.groupby(['CategoryName'])['Revenue'].sum().reset_index()
df_category
CategoryName Revenue
0 Beverages 267868.1800
1 Condiments 106047.0850
2 Confections 167357.2250
3 Dairy Products 234507.2850
4 Grains/Cereals 95744.5875
5 Meat/Poultry 163022.3595
6 Produce 99984.5800
7 Seafood 131261.7375
df.CategoryId.value_counts()
1    404
4    366
3    334
8    330
2    216
5    196
6    173
7    136
Name: CategoryId, dtype: int64
# Explore sample sizes before testing
categories = df.groupby('CategoryName').groups
categories.keys()
dict_keys(['Beverages', 'Condiments', 'Confections', 'Dairy Products', 'Grains/Cereals', 'Meat/Poultry', 'Produce', 'Seafood'])
df_category.loc[df_category['CategoryName'] == 'Beverages']['Revenue'].sum()
267868.17999999993
#create dict of months and order quantity totals
rev_per_cat = {}

for k,v in categories.items():
    rev = df_category.loc[df_category['CategoryName'] == k]['Revenue'].sum()
    rev_per_cat[k] = rev

rev_per_cat
{'Beverages': 267868.17999999993,
 'Condiments': 106047.08500000002,
 'Confections': 167357.22499999995,
 'Dairy Products': 234507.285,
 'Grains/Cereals': 95744.58750000001,
 'Meat/Poultry': 163022.3595,
 'Produce': 99984.57999999999,
 'Seafood': 131261.73750000002}
# plot order quantity totals by month
fig = plt.figure(figsize=(12,12))
for k,v in rev_per_cat.items():
    plt.bar(x=k, height=v)

png

# What were the top 3 selling product categories in each region or country?
# What were the lowest 3 selling product categories in each region or country?
tags: Your Browser Doesn't Support Canvas, Please Download Chrome or compatible browser.