Continue with Machine Learning - Linear Regression
In this post we'll use some financial data to test and apply linear regression. Quandl is:
The premier source for financial, economic, and alternative datasets, serving investment professionals. Quandl’s platform is used by over 250,000 people, including analysts from the world’s top hedge funds, asset managers and investment banks.
Set up
pip install scikit-learn  # machine learning library (imported as sklearn)
pip install quandl        # library for loading financial data
pip install pandas        # library for working with DataFrames
Definition
So what is linear regression?
In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. Linear regression maps continuous data from X to y. It first learns from known data to build a model, then uses that model to predict the unknown y for new values of X.
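To make the definition concrete, here is a minimal sketch of one-variable linear regression using the closed-form least-squares solution in plain Python. The function name `fit_line` and the toy data are made up for illustration; scikit-learn's LinearRegression, used later in this post, handles the general case with several features.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (hypothetical), generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Learn from known data...
slope, intercept = fit_line(xs, ys)

# ...then predict y for an x the model has not seen
new_x = 10.0
prediction = slope * new_x + intercept
print(slope, intercept, prediction)
```

Because the toy data lies exactly on a line, the fitted slope and intercept recover the 2 and 1 used to generate it. Real price data is noisy, so the fit will only approximate it.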
Import Data
Let's create a file called main.py with the following code:
import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL')
print(df.head())
We get the following data:
              Open    High     Low    Close      Volume  Ex-Dividend
Date
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close
Date
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363
2004-08-25          1.0  52.542193  54.167209  52.100830   53.164113

            Adj. Volume
Date
2004-08-19   44659000.0
2004-08-20   22834300.0
2004-08-23   18256100.0
2004-08-24   15247300.0
2004-08-25    9188600.0
Close, Open, Volume and so on are all features. We won't use every feature for machine learning, since some are more useful than others. With some experience and observation we can choose a few features to start with.
import pandas as pd
import quandl

df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
# distance from the close up to the day's high, as a percentage of the close
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
# move from open to close, as a percentage of the open
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())

Output:

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date
2004-08-19   50.322842  3.712563    0.324968   44659000.0
2004-08-20   54.322689  0.710922    7.227007   22834300.0
2004-08-23   54.869377  3.729433   -1.227880   18256100.0
2004-08-24   52.597363  6.417469   -5.726357   15247300.0
2004-08-25   53.164113  1.886792    1.183658    9188600.0
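As a quick sanity check, the two derived features can be recomputed by hand for a single row. The sketch below copies the adjusted prices for 2004-08-19 from the output earlier in this post and applies the same two formulas in plain Python; the variable names are only for illustration.

```python
# Adjusted prices for 2004-08-19, copied from the printed dataframe
adj_open = 50.159839
adj_high = 52.191109
adj_close = 50.322842

# Same formulas as the dataframe code, applied to one row
hl_pct = (adj_high - adj_close) / adj_close * 100.0
pct_change = (adj_close - adj_open) / adj_open * 100.0

print(round(hl_pct, 4), round(pct_change, 4))
```

The results should agree with the HL_PCT and PCT_change values printed for that date (about 3.71 and 0.32), up to rounding of the copied inputs.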
What we are trying to predict here is the Adj. Close price.
forecast_col = 'Adj. Close'
Let's also replace any missing data with a placeholder value, since scikit-learn cannot train on NaN values.
df.fillna(-99999, inplace=True)
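Next we decide how far ahead to forecast. We set forecast_out to 1 percent of the number of rows in the dataframe, then build a label column holding the Adj. Close price that many rows (trading days) in the future. The last rows have no future price, so we drop them.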
import math

df.fillna(-99999, inplace=True)
# number of rows to look ahead: 1% of the dataset, rounded up
forecast_out = int(math.ceil(0.01 * len(df)))
# label = Adj. Close forecast_out rows later
df['label'] = df[forecast_col].shift(-forecast_out)
# the last forecast_out rows have no label, so drop them
df.dropna(inplace=True)
print(df.tail())

Output:

            Adj. Close    HL_PCT  PCT_change  Adj. Volume    label
Date
2017-10-27     1033.67  2.897443    0.259944    5139945.0  1085.09
2017-10-30     1033.13  0.648515    0.385751    2245352.0  1079.78
2017-10-31     1033.04  0.770541    0.003872    1490660.0  1073.56
2017-11-01     1042.60  0.504508    0.605990    2105729.0  1070.85
2017-11-02     1042.97  0.244494    0.286541    1233333.0  1068.86
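The effect of the negative shift can be hard to picture, so here is a minimal plain-Python sketch of the same idea on ten made-up prices. It mimics what shift and dropna do in the code above rather than calling pandas, and it uses 20 percent instead of 1 percent so that the look-ahead is visible on such a short list.

```python
import math

# Hypothetical closing prices, one per trading day
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]

# 20% of ten rows, rounded up, gives a look-ahead of 2 rows
forecast_out = int(math.ceil(0.2 * len(prices)))

# Like shift(-forecast_out): row i gets the price from row i + forecast_out.
# The last forecast_out rows have no future price, so they get None.
labels = [
    prices[i + forecast_out] if i + forecast_out < len(prices) else None
    for i in range(len(prices))
]

# Like dropna(): keep only the rows that have a label
rows = [(p, l) for p, l in zip(prices, labels) if l is not None]

for price, label in rows:
    print(price, label)
```

Each remaining row pairs today's price with the price two rows later, which is the pairing the model is trained on.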
We can see that the Adj. Close and label prices are fairly close to each other.
Let's get to the actual training and prediction.
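import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array(df.drop(columns=['label']))  # features: every column except the label
y = np.array(df['label'])                 # target: the future Adj. Close
X = preprocessing.scale(X)                # standardize each feature column

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # R^2 on the held-out data
print(accuracy)

Output:

0.973621902095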
The score is high: the model explains most of the variation in the held-out labels.
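For regression models, scikit-learn's score method reports the coefficient of determination R^2, not a percentage of correct predictions. A minimal plain-Python sketch of that formula, 1 - SS_res / SS_tot, is shown below; the function name `r_squared` and the sample prices are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted prices
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [102.0, 108.0, 121.0, 129.0]

print(r_squared(y_true, y_pred))       # close predictions give a value near 1
print(r_squared(y_true, y_true))       # perfect predictions give 1.0
print(r_squared(y_true, [115.0] * 4))  # always predicting the mean gives 0.0
```

A value near 1 means the predictions track the labels closely. It does not say how far off an individual prediction is in dollars.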