Logistic Regression with Python
This is my first post about machine learning with Python. A complete prediction model already exists at https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb , but it can be overwhelming for most people to follow, so this is my attempt to elaborate on the code it contains.
The dataset we are using can be obtained from https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/data/ . We'll run the prediction model on a local machine. The downloaded data is quite large, so we have to find a way to randomly select a representative subset to use as the training set. First let's download the data and load it into a DataFrame, importing the necessary libraries: numpy (for working with numeric values) and pandas (for the DataFrame).
```Python
# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

# matplotlib.pyplot is for plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# needed for working with local paths
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# We will randomly select the most representative data for training.
# Each record in this model has multiple labels attached to it,
# and we split the data into a training set and a test set.
from data.multilabel import multilabel_sample_dataframe, multilabel_train_test_split

# This class builds interaction terms between features.
from features.SparseInteractions import SparseInteractions

# This function measures the error produced by our model,
# i.e. it tells us how accurate the model is.
from models.metrics import multi_multi_log_loss
```
Next, read the data into a DataFrame.
```Python
path_to_training_data = os.path.join(os.pardir, 'data', 'TrainingSet.csv')

# set the first column as the index by which each row can be accessed
df = pd.read_csv(path_to_training_data, index_col=0)

# print the shape of the DataFrame
print(df.shape)
# (400277, 25) -> 400277 rows and 25 columns
```
At 400,277 rows, the data is too large to train on comfortably on a local machine.
Resample the Data
After exploring the data with EDA (Exploratory Data Analysis), we see that the features contain both numeric and non-numeric values.
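A quick way to check this yourself, using only pandas built-ins (a minimal EDA sketch):

```Python
# dtypes separates the numeric columns from the object (text/categorical) ones
print(df.dtypes.value_counts())

# summary statistics for the numeric columns
print(df.describe())

# first few rows, to eyeball the text columns
print(df.head())
```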
```Python
# non-numeric (categorical) columns: these are the labels to predict
LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type',
          'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']

# everything else, via a list comprehension
NON_LABELS = [c for c in df.columns if c not in LABELS]

SAMPLE_SIZE = 40000

# encode the categorical labels as dummy (0/1) variables and draw a sample
sampling = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=43)

# get the dummy variables from the sample just created
dummy_labels = pd.get_dummies(sampling[LABELS])

# split the sample into a training set and a test set
X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)
```
Dummy encoding translates each category of a variable into a binary column of its own, where exactly one of those columns equals 1 for any given row.
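For instance (a toy frame, not the budget data):

```Python
import pandas as pd

# one categorical column becomes one 0/1 column per category,
# with a single 1 in each row
toy = pd.DataFrame({'Pre_K': ['PreK', 'Non PreK', 'PreK']})
print(pd.get_dummies(toy, dtype=int))
#    Pre_K_Non PreK  Pre_K_PreK
# 0               0           1
# 1               1           0
# 2               0           1
```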
Create preprocessing tools
```Python
NUMERIC_COLUMNS = ['FTE', 'Total']

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.

        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # replace nans with blanks
    text_data.fillna("", inplace=True)

    # joins all of the text items in a row (axis=1) with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)
```

Create the function transformers

```Python
from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
```
Combine text columns into one column
```Python
get_text_data.fit_transform(sampling.head(5))
```
With output:

```
38     OTHER PURCHASED SERVICES SCHOOL-WIDE SCHOOL P...
70     Extra Duty Pay/Overtime For Support Personnel ...
198    Supplemental *  Operation and Maintenance of P...
209    REPAIR AND MAINTENANCE SERVICES PUPIL TRANSPO...
614    GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 H...
dtype: object
```
There are two numeric columns, "FTE" and "Total":
```Python
get_numeric_data.fit_transform(sampling.head(5))
```
With output:

```
      FTE         Total
38    NaN    653.460000
70    NaN   2153.530000
198   NaN  -8291.860000
209   NaN    618.290000
614  0.71  21747.666875
```
Create function to evaluate model
```Python
from sklearn.metrics.scorer import make_scorer

# wrap our custom loss so it can be used as a scikit-learn scorer
log_loss_scorer = make_scorer(multi_multi_log_loss)
```
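The exact implementation lives in the repo's models/metrics.py. Roughly speaking, it computes an ordinary log loss per label group (one group of dummy columns per original categorical column) and averages the group scores. A minimal sketch of that idea, assuming 2-D numpy arrays of probabilities and one-hot truths (the function name and signature here are illustrative, not the repo's actual API):

```Python
import numpy as np

def grouped_log_loss_sketch(predicted, actual, class_column_indices, eps=1e-15):
    """Illustrative only: average the log loss over each group of dummy columns."""
    scores = []
    for cols in class_column_indices:
        # clip to avoid log(0), then renormalize each row within the group
        preds = np.clip(predicted[:, cols], eps, 1 - eps)
        preds = preds / preds.sum(axis=1, keepdims=True)
        scores.append(-np.mean(np.sum(actual[:, cols] * np.log(preds), axis=1)))
    return np.mean(scores)
```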
We use a scikit-learn Pipeline so that the output of one step is fed directly as the input to the next step.
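As a toy illustration of the idea (unrelated to the budget model itself):

```Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression

# a two-step pipeline: fit() runs the scaler's fit_transform, then the
# classifier's fit; predict() runs transform, then predict
pipe = Pipeline([('scale', MaxAbsScaler()),
                 ('clf', LogisticRegression())])
# pipe.fit(X, y); pipe.predict(X)  # X and y are hypothetical inputs
```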
```Python
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

# tokens are runs of alphanumeric characters followed by whitespace
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
```
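To see what that token pattern matches (the sample string is made up):

```Python
import re

# the (?=\s+) lookahead requires trailing whitespace, so the final word
# of a string is only picked up if whitespace follows it
print(re.findall('[A-Za-z0-9]+(?=\\s+)', 'Extra Duty Pay 4 Support Personnel'))
# ['Extra', 'Duty', 'Pay', '4', 'Support']
```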
Here we use logistic regression for classification, wrapped in a OneVsRestClassifier so that one binary classifier is trained per label column. We also normalize the features using MaxAbsScaler.
```Python
%%time

# set a reasonable number of features to keep before adding interactions
chi_k = 300

# create the pipeline object
pl = Pipeline([
        ('union', FeatureUnion(
            # use FeatureUnion to combine the two transformers
            transformer_list=[
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    # vectorize the text, with each alphanumeric run as a token
                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                     non_negative=True,
                                                     norm=None,
                                                     binary=False,
                                                     ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, chi_k))
                ]))
            ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# fit the pipeline to our training data
pl.fit(X_train, y_train.values)

# print the score of our trained pipeline on our test set
print("Logloss score of trained pipeline: ", log_loss_scorer(pl, X_test, y_test.values))
```
Finally, load the holdout set, predict a probability for every label column, and save the predictions to a CSV file:

```Python
path_to_holdout_data = os.path.join(os.pardir, 'data', 'TestSet.csv')

# load holdout data
holdout = pd.read_csv(path_to_holdout_data, index_col=0)

# make predictions
predictions = pl.predict_proba(holdout)

# format correctly in a new DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)

# save prediction_df to a csv called "predictions.csv"
prediction_df.to_csv("predictions.csv")
```