Predicting a target value from text data using linear regression
Machine learning is a popular and in-demand field today. Machine learning algorithms can figure out how to perform important tasks by generalizing from examples, which is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields.
This article is intended for those who are just beginning to learn methods and approaches to problem solving. We consider one of the simplest methods: linear regression applied to text data to predict a target value.
Task
Predict an employee's salary based on the job description.
A data set with job descriptions and the corresponding annual salaries is provided in the file salary-train.csv:
№ | FullDescription | LocationNormalized | ContractTime | SalaryNormalized |
---|---|---|---|---|
0 | International Sales Manager London ****k ****... | London | permanent | 33000 |
1 | An ideal opportunity for an individual that ha... | London | permanent | 50000 |
2 | Online Content and Brand Manager// Luxury Reta... | South East London | permanent | 40000 |
3 | A great local marketleader is seeking a perman... | Dereham | permanent | 22500 |
4 | Registered Nurse / RGN Nursing Home for Young... | Sutton Coldfield | NaN | 20355 |
5 | Sales and Marketing Assistant will provide adm... | Crawley | NaN | 22500 |
6 | Vacancy Ladieswear fashion Area Manager / Regi... | UK | permanent | 32000 |
We have to predict the salary for two examples:
№ | FullDescription | LocationNormalized | ContractTime | SalaryNormalized |
---|---|---|---|---|
0 | We currently have a vacancy for an HR Project ... | Milton Keynes | contract | NaN |
1 | A Web developer opportunity has arisen with an... | Manchester | permanent | NaN |
Method of linear regression
Linear methods are well suited to sparse data, such as text. They train quickly and have a relatively small number of parameters, which helps avoid overfitting. Linear regression is one of the most basic machine learning methods; it is used to recover the relationship between independent and dependent variables.
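As a warm-up, here is a minimal sketch (with toy numbers invented for illustration) of linear regression recovering the dependence between an independent and a dependent variable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated exactly by the rule y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)

# The fitted model recovers the slope and intercept from the examples
assert abs(model.coef_[0] - 2.0) < 1e-6
assert abs(model.intercept_ - 1.0) < 1e-6
```

The rest of the article applies this same idea at scale, with thousands of sparse text features instead of a single number.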
Instruments
Python and packages such as Pandas, SciPy, and Scikit-Learn, which implements many machine learning algorithms.
```python
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import hstack
from sklearn.linear_model import Ridge
```
Solution
Load the data set with the job descriptions and corresponding annual salaries from the file:
```python
train = pandas.read_csv('salary-train.csv')
```
The data are presented as a DataFrame, which is convenient to work with. In total, the training set contains 240 thousand records.
Data Preparation
Before training, we first need to prepare the data, because the model works only with numerical values.
A convenient way to encode text is the "bag of words" model: one feature is created for each unique word in the corpus, and the value of the feature is the number of times the corresponding word occurs in the document. The total number of distinct words in a corpus can reach tens of thousands, while only a small fraction of them occur in any particular text.

We can encode texts in a smarter way by recording, instead of the raw number of occurrences of a word, its TF-IDF value. This figure is the product of two numbers: TF (term frequency) and IDF (inverse document frequency). The first is the ratio of the number of occurrences of a word in a document to the total length of the document. The second depends on how many documents in the corpus contain the word: the more such documents, the smaller the IDF. Thus, the TF-IDF value is high for words that occur many times in a given document and rarely in others. In Scikit-Learn this is implemented in the sklearn.feature_extraction.text.TfidfVectorizer class. The transformation of the training set should be done with the fit_transform function, and of the test set with transform.
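To make the IDF effect concrete, here is a minimal sketch on a toy corpus (the three documents are invented for illustration): a word that appears in every document gets a lower IDF weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "sales manager salary negotiable",
    "nurse salary competitive",
    "web developer salary and sales bonus",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)   # sparse matrix: 3 documents x vocabulary size
vocab = vec.vocabulary_           # maps each word to its column index

# "salary" occurs in all three documents, "nurse" in only one,
# so "nurse" receives a strictly higher IDF weight.
assert vec.idf_[vocab["nurse"]] > vec.idf_[vocab["salary"]]
```

On the real data set the same mechanism downweights ubiquitous words like "experience" or "role" that carry little information about the salary.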
Formatting:
```python
# First, convert train['FullDescription'] to lowercase
train['FullDescription'] = train['FullDescription'].str.lower()
# Then replace everything except letters and digits with spaces;
# this simplifies the further division of the text into words.
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex=True)
# Convert the collection of raw documents to a matrix of TF-IDF features.
# min_df=5 drops words that occur in fewer than 5 documents.
vectorizer = TfidfVectorizer(min_df=5)
X_tfidf = vectorizer.fit_transform(train['FullDescription'])
```
Cleaning:
```python
# Replace NaN in the LocationNormalized and ContractTime columns
# with the special string 'nan'.
train['LocationNormalized'].fillna('nan', inplace=True)
train['ContractTime'].fillna('nan', inplace=True)
```
Transform Data:
The LocationNormalized and ContractTime features are strings, and we cannot work with them directly. We use DictVectorizer to perform binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on.
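A minimal sketch (with two toy records invented here, mirroring the article's columns) of what this one-hot encoding produces: each distinct (feature, value) pair becomes its own binary column.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {"LocationNormalized": "London", "ContractTime": "permanent"},
    {"LocationNormalized": "Manchester", "ContractTime": "contract"},
]

enc = DictVectorizer(sparse=False)
X = enc.fit_transform(records)

# Four binary columns: one per (feature, value) pair seen during fit
assert X.shape == (2, 4)
# Each record activates exactly one column per original feature
assert X.sum() == 4.0
```

The learned columns are listed in `enc.feature_names_`, e.g. `ContractTime=permanent` or `LocationNormalized=London`; values unseen during fit are simply ignored at transform time.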
```python
enc = DictVectorizer()
X_train_categ = enc.fit_transform(train[['LocationNormalized', 'ContractTime']].to_dict('records'))
# Stack the feature matrices horizontally into a single matrix
# with scipy.sparse.hstack. Note that the matrices are sparse:
# in numerical analysis, a sparse matrix is one in which most elements are zero.
X = hstack([X_tfidf, X_train_categ])
```
Now we have a feature matrix ready for fitting.
Ridge Regression
Ridge regression is a remedial measure taken to alleviate multicollinearity among the predictor variables in a regression model. Predictor variables used in a regression are often highly correlated; when they are, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which are left out.
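The stabilising effect of the alpha penalty can be seen on a small synthetic example (the data below are invented for illustration): with two nearly collinear predictors, increasing alpha shrinks the coefficient vector toward zero, which keeps the individual coefficients from blowing up in opposite directions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(241)
x1 = rng.rand(100)
x2 = x1 + 0.01 * rng.rand(100)        # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.rand(100)      # target driven by x1 plus noise

weak = Ridge(alpha=0.001).fit(X, y)
strong = Ridge(alpha=10.0).fit(X, y)

# Heavier regularisation gives a smaller coefficient norm
assert np.linalg.norm(strong.coef_) < np.linalg.norm(weak.coef_)
```

With sparse text features there are tens of thousands of correlated columns, so this shrinkage is what makes the fit well behaved.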
Train a Ridge regression with the parameters alpha=1 and random_state=241.
```python
# The model: Ridge regression
clf = Ridge(alpha=1.0, random_state=241)
# The target value the algorithm has to predict is SalaryNormalized
y = train['SalaryNormalized']
# Train the model on the data
clf.fit(X, y)
```
Prediction
Get the forecast for the two examples from the file salary-test-mini.csv. We have to apply the same data preparation steps to the test set and then predict.
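The key point is that the test set must go through the same transformations, but using transform (not fit_transform), so that the fitted vocabulary and categories are reused and the column counts match. Here is a self-contained sketch of those steps on tiny toy frames invented to mirror the article's columns; applying the identical steps to salary-test-mini.csv produces the X_test matrix used below.

```python
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import hstack

train = pandas.DataFrame({
    'FullDescription': ['Sales manager role in sales', 'Registered nurse, nursing home'],
    'LocationNormalized': ['London', 'Sutton Coldfield'],
    'ContractTime': ['permanent', None],
})
test = pandas.DataFrame({
    'FullDescription': ['Web developer opportunity'],
    'LocationNormalized': ['Manchester'],
    'ContractTime': ['permanent'],
})

# Identical cleaning for train and test
for df in (train, test):
    df['FullDescription'] = df['FullDescription'].str.lower()
    df['FullDescription'] = df['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex=True)
    df['LocationNormalized'] = df['LocationNormalized'].fillna('nan')
    df['ContractTime'] = df['ContractTime'].fillna('nan')

vectorizer = TfidfVectorizer()   # min_df=5 omitted for this tiny toy corpus
enc = DictVectorizer()

# Fit on the training set only
X_tfidf = vectorizer.fit_transform(train['FullDescription'])
X_categ = enc.fit_transform(train[['LocationNormalized', 'ContractTime']].to_dict('records'))

# Test set: transform only, so the columns line up with the training matrix
X_test_tfidf = vectorizer.transform(test['FullDescription'])
X_test_categ = enc.transform(test[['LocationNormalized', 'ContractTime']].to_dict('records'))
X_test = hstack([X_test_tfidf, X_test_categ])

assert X_test.shape[1] == X_tfidf.shape[1] + X_categ.shape[1]
```

Words or category values that never appeared in the training set simply produce zero columns, which is exactly what the fitted model expects.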
```python
rslt = clf.predict(X_test)
```
The answer consists of two numbers: [56555.61500155, 37188.32442618]. This is the solution of the task.
Result
№ | FullDescription | LocationNormalized | ContractTime | SalaryNormalized |
---|---|---|---|---|
0 | We currently have a vacancy for an HR Project ... | Milton Keynes | contract | 56556 |
1 | A Web developer opportunity has arisen with an... | Manchester | permanent | 37188 |
As you can see, this is an easy way to predict values from text data.
Links
- Source code GitHub
- How to Make a Prediction - Intro to Deep Learning #1
- Scikit-Learn
- Scipy