3.1.6.1.3. Analysis of Iris petal and sepal sizes

Illustrate an analysis on a real dataset:

Visualizing the data to formulate intuitions
Fitting of a linear model
Hypothesis test of the effect of a categorical variable in the presence of a continuous confound
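
Judging from the formula 'sepal_width ~ name + petal_length' used in the script and the coefficient names in its output, the fitted model can be written as follows. This is a sketch that assumes treatment (dummy) coding with setosa as the reference level; the symbols \beta_0 to \beta_3 and \varepsilon_i are notation introduced here, not taken from the original.

```latex
\text{sepal\_width}_i
  = \beta_0
  + \beta_1 \, \mathbb{1}[\text{name}_i = \text{versicolor}]
  + \beta_2 \, \mathbb{1}[\text{name}_i = \text{virginica}]
  + \beta_3 \, \text{petal\_length}_i
  + \varepsilon_i
```

Here \beta_1 and \beta_2 are the offsets of versicolor and virginica relative to setosa at a fixed petal length, and \beta_3 is the slope of the continuous confound.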

[Figure: plot_iris_analysis_1.png, scatter matrix of the iris measurements, colored by species]

Script output:

                            OLS Regression Results
==============================================================================
Dep. Variable:            sepal_width   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.468
Method:                 Least Squares   F-statistic:                     44.63
Date:                Mon, 10 Oct 2016   Prob (F-statistic):           1.58e-20
Time:                        22:14:08   Log-Likelihood:                -38.185
No. Observations:                 150   AIC:                             84.37
Df Residuals:                     146   BIC:                             96.41
Df Model:                           3
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.9813      0.099     29.989      0.000       2.785       3.178
name[T.versicolor]    -1.4821      0.181     -8.190      0.000      -1.840      -1.124
name[T.virginica]     -1.6635      0.256     -6.502      0.000      -2.169      -1.158
petal_length           0.2983      0.061      4.920      0.000       0.178       0.418
==============================================================================
Omnibus:                        2.868   Durbin-Watson:                   1.753
Prob(Omnibus):                  0.238   Jarque-Bera (JB):                2.885
Skew:                          -0.082   Prob(JB):                        0.236
Kurtosis:                       3.659   Cond. No.                         54.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Testing the difference between effect of versicolor and virginica
<F test: F=array([[ 3.24533535]]), p=0.073690587817, df_denom=146, df_num=1>
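
The last line of the output reports an F test of a single linear constraint on the coefficients. As a reading aid, the hypothesis and statistic can be written as below; the notation (c, \hat\beta, \widehat{\mathrm{Cov}}) is an assumption introduced here, with the coefficients ordered as in the summary table (Intercept, versicolor, virginica, petal_length).

```latex
H_0 : c^\top \beta = 0, \qquad c = (0,\; 1,\; -1,\; 0)^\top
\quad\Longleftrightarrow\quad
\beta_{\text{versicolor}} = \beta_{\text{virginica}}

F = \frac{\left(c^\top \hat\beta\right)^2}
         {c^\top \, \widehat{\mathrm{Cov}}(\hat\beta) \, c}
\;\sim\; F(1,\, 146) \quad \text{under } H_0
```

With the reported p-value of about 0.074, the test does not reject equal offsets for versicolor and virginica at the 5% level.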

Python source code: plot_iris_analysis.py

import matplotlib.pyplot as plt
import pandas
from pandas import plotting
from statsmodels.formula.api import ols

# Load the data
data = pandas.read_csv('iris.csv')

##############################################################################
# Plot a scatter matrix

# Express the names as categories
categories = pandas.Categorical(data['name'])

# The parameter 'c' is passed to plt.scatter and will control the color
plotting.scatter_matrix(data, c=categories.codes, marker='o')

fig = plt.gcf()
fig.suptitle("blue: setosa, green: versicolor, red: virginica", size=13)

##############################################################################
# Statistical analysis

# Let us try to explain the sepal width as a function of the petal
# length and the category of iris
model = ols('sepal_width ~ name + petal_length', data).fit()
print(model.summary())

# Now formulate a "contrast", to test if the offsets for versicolor and
# virginica are identical
print('Testing the difference between effect of versicolor and virginica')
print(model.f_test([0, 1, -1, 0]))
plt.show()
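
To make the contrast vector [0, 1, -1, 0] less opaque, here is a minimal NumPy sketch of how an F statistic for a single contrast can be computed by hand. It is not the statsmodels implementation, and it runs on hypothetical synthetic data rather than on iris.csv; the variable names (group, x, y, X, c) are illustrative assumptions.

```python
import numpy as np

# Hypothetical synthetic data (NOT the iris measurements): three groups
# coded 0, 1, 2 and one continuous covariate x.
rng = np.random.default_rng(0)
n_per_group = 50
group = np.repeat([0, 1, 2], n_per_group)
x = rng.normal(size=group.size)

# Groups 1 and 2 share the same true offset (-1.5), so the null
# hypothesis "offset 1 == offset 2" holds for this simulated data.
y = (3.0
     - 1.5 * (group == 1)
     - 1.5 * (group == 2)
     + 0.3 * x
     + rng.normal(scale=0.3, size=group.size))

# Design matrix with treatment (dummy) coding: group 0 is the reference.
# Column order: intercept, group 1, group 2, covariate.
X = np.column_stack([
    np.ones(group.size),
    (group == 1).astype(float),
    (group == 2).astype(float),
    x,
])

# Ordinary least squares fit.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance and estimated covariance of the coefficients.
resid = y - X @ beta
df_resid = X.shape[0] - X.shape[1]
sigma2 = resid @ resid / df_resid
cov_beta = sigma2 * np.linalg.inv(X.T @ X)

# Contrast comparing the two non-reference offsets.
c = np.array([0.0, 1.0, -1.0, 0.0])
f_stat = (c @ beta) ** 2 / (c @ cov_beta @ c)

print("estimated difference:", c @ beta)
print("F =", f_stat, "df_num = 1, df_denom =", df_resid)
```

The contrast has a 1 and a -1 in the positions of the two offsets being compared and zeros elsewhere, so c @ beta is simply their estimated difference. With 150 observations and 4 coefficients the denominator degrees of freedom come out as 146, the same value reported in the script output.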

Total running time of the example: 1.09 seconds ( 0 minutes 1.09 seconds)