
Logistic regression with Python


This will be my first post about machine learning using Python. The prediction model has already been built in https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb, but that notebook can be overwhelming for most people, so this is my attempt to elaborate on the code written there.

The dataset we are using can be obtained from https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/data/. We'll run our prediction model on a local machine. The downloaded data is pretty large, so we have to find a way to randomly select a representative subset to use as the training set. First, let's download the data and load it into a DataFrame. Start by importing the necessary libraries: numpy (for working with numeric values) and pandas (for the DataFrame).

# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

#matplotlib.pyplot is for plotting

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# os and sys let us work with local file paths
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# multilabel_sample_dataframe randomly selects a representative sample for training;
# each record in this dataset has multiple labels attached.
# multilabel_train_test_split then splits the data into a training set and a test set.
from data.multilabel import multilabel_sample_dataframe, multilabel_train_test_split
# SparseInteractions adds interaction terms between the features.
from features.SparseInteractions import SparseInteractions
# multi_multi_log_loss measures the error our model produces, i.e. how accurate it is.
from models.metrics import multi_multi_log_loss

Next, read the data into a DataFrame.

path_to_training_data = os.path.join(os.pardir,
                                     'data',
                                     'TrainingSet.csv')
# Use the first column as the index by which each row can be accessed
df = pd.read_csv(path_to_training_data, index_col=0)

#print the shape of the DataFrame
print(df.shape)
#(400277, 25) 
#400277 rows and 25 columns

At 400,277 rows, the data is too large for our machine to handle, so we resample it.

Resample the Data

After exploring the data with some EDA (Exploratory Data Analysis), we see that there are features with numeric values and features with non-numeric values.
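
As a quick, illustrative sketch of that EDA step (these exact calls are not in the original notebook):

# A minimal EDA sketch (illustrative; not from the original notebook)
print(df.dtypes.value_counts())  # how many columns hold each dtype
print(df.describe())             # summary statistics for the numeric columns
print(df.head())                 # eyeball a few rows of raw values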

# The nine target labels; all are categorical (non-numeric)
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type', 
          'Pre_K',
          'Operating_Status']

# All remaining columns are features; collect them with a list comprehension
NON_LABELS = [c for c in df.columns if c not in LABELS]

SAMPLE_SIZE = 40000
# Sample the data; the dummy-encoded (0/1) labels ensure each class stays represented

sampling = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=43)
# Get dummy variables from the sample we just created
dummy_labels = pd.get_dummies(sampling[LABELS])
# Split the sampled data into a training set and a test set
X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NON_LABELS],
                                                               dummy_labels,
                                                               size=0.2,
                                                               min_count=3,
                                                               seed=43)

pd.get_dummies translates each categorical value of a variable into a binary column of its own, so for every original variable exactly one of its dummy columns equals 1 in each row.
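
A minimal sketch of what get_dummies does, on a made-up single-column frame (the values are hypothetical):

import pandas as pd

# hypothetical values, just to illustrate the encoding
toy = pd.DataFrame({'Pre_K': ['PreK', 'Non PreK', 'PreK']})
print(pd.get_dummies(toy))
#    Pre_K_Non PreK  Pre_K_PreK
# 0               0           1
# 1               1           0
# 2               0           1
# (recent pandas versions print True/False instead of 0/1)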

Create preprocessing tools

NUMERIC_COLUMNS = ['FTE', "Total"]

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

Create function transformers

FunctionTransformer wraps a plain Python function so that it can be used as a step inside a scikit-learn pipeline; validate=False passes the DataFrame through without converting it to a numeric array.

from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

Combine text columns into one column

get_text_data.fit_transform(sampling.head(5))

With output

38     OTHER PURCHASED SERVICES  SCHOOL-WIDE SCHOOL P...
70     Extra Duty Pay/Overtime For Support Personnel ...
198    Supplemental *  Operation and Maintenance of P...
209    REPAIR AND MAINTENANCE SERVICES  PUPIL TRANSPO...
614     GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 H...
dtype: object

There are two numeric columns: "FTE" and "Total".

get_numeric_data.fit_transform(sampling.head(5))

With output

      FTE         Total
38    NaN    653.460000
70    NaN   2153.530000
198   NaN  -8291.860000
209   NaN    618.290000
614  0.71  21747.666875

Create function to evaluate model

from sklearn.metrics import make_scorer

log_loss_scorer = make_scorer(multi_multi_log_loss)
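
multi_multi_log_loss is the competition metric: roughly, a log loss computed per label group and then averaged. As a rough sketch of the underlying idea, using scikit-learn's plain log_loss on toy values (not the competition code itself):

from sklearn.metrics import log_loss

# toy values: true classes and predicted probabilities for 3 rows, 2 classes
y_true = [1, 0, 1]
y_prob = [[0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
print(log_loss(y_true, y_prob))  # ~0.36; lower is better, 0 is a perfect model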

We use a Pipeline so that the output of one step can be passed as the input to the next step.
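
As a minimal sketch of the idea (a hypothetical two-step pipeline, unrelated to the full model below):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

# the output of 'scale' is fed as input to 'clf'
tiny_pl = Pipeline([
    ('scale', MaxAbsScaler()),       # step 1: scale each feature to [-1, 1]
    ('clf', LogisticRegression()),   # step 2: fit a classifier on the scaled data
])
# tiny_pl.fit(X, y) runs both steps in order; tiny_pl.predict(X) reuses the chain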

from sklearn.feature_selection import chi2, SelectKBest

from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.preprocessing import Imputer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import MaxAbsScaler

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
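
The pattern matches runs of alphanumeric characters that are followed by whitespace, so each match becomes one token. A quick illustrative check with Python's re module (not in the original notebook):

import re

# tokens are runs of letters/digits that must be followed by whitespace
print(re.findall('[A-Za-z0-9]+(?=\\s+)', 'OTHER PURCHASED SERVICES '))
# ['OTHER', 'PURCHASED', 'SERVICES']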

Here we use logistic regression for classification, wrapped in a OneVsRestClassifier so that one binary classifier is trained per label column. We also normalize the features with MaxAbsScaler.
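
A minimal sketch of the one-vs-rest idea on made-up multilabel data (purely illustrative; the real model below trains on the pipeline's features):

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# 4 rows, 2 features; 3 independent binary labels per row (hypothetical data)
X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

# one LogisticRegression is fit per label column
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(ovr.predict_proba(X).shape)  # (4, 3): a probability for each label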

%%time

#set a reasonable number of features before adding interactions
chi_k = 300

#create the pipeline object
pl = Pipeline([
        ('union', FeatureUnion( # Use FeatureUnion to combine 2 transformers.
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC, # create vectorization with each Token as alphanumeric.
                                                     non_negative=True, norm=None, binary=False,
                                                     ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

#fit the pipeline to our training data
pl.fit(X_train, y_train.values)

#print the score of our trained pipeline on our test set
print("Logloss score of trained pipeline: ", log_loss_scorer(pl, X_test, y_test.values))

Finally, load the holdout data, predict a probability for every label column, and save the predictions to a CSV file.

path_to_holdout_data = os.path.join(os.pardir,
                                    'data',
                                    'TestSet.csv')

# Load holdout data
holdout = pd.read_csv(path_to_holdout_data, index_col=0)

# Make predictions
predictions = pl.predict_proba(holdout)

# Format correctly in new DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)


# Save prediction_df to csv called "predictions.csv"
prediction_df.to_csv("predictions.csv")