Machine Learning - Cleaning Data
We will explore stocks data and try to analyze them. The first piece of data we want to obtain is a snapshot of the stocks we want to analyze. One of the best ways to do this is to download data from one of the many stock screener applications that exist. But It is not possible to download it ...
We will explore stocks data and try to analyze them. The first piece of data we want to obtain is a snapshot of the stocks we want to analyze. One of the best ways to do this is to download data from one of the many stock screener applications that exist. But It is not possible to download it directly now. So I have try to scrape the data. Our favorite screener to download stock data from belongs to http://finviz.com.
Get the data
import requests import pandas as pd from bs4 import BeautifulSoup import csv import pdb import datetime #pdb.set_trace() - python step by step debugger command print datetime.datetime.now() print "Finviz Overview Start" url = "https://finviz.com/screener.ashx?v=152&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68" response = requests.get(url) html = response.content soup = BeautifulSoup(html, "html.parser") firstcount = soup.find_all('option') lastnum = len(firstcount) - 1 lastpagenum = firstcount[lastnum].attrs['value'] currentpage = int(lastpagenum) alldata = [] templist = [] # Overview = 111, Valuation = 121, Financial = 161, Ownership = 131, Performance = 141 #pagesarray = [111,121,161,131,141] titleslist = soup.find_all('td',{"class" : "table-top"}) titleslisttickerid = soup.find_all('td',{"class" : "table-top-s"}) titleticker = titleslisttickerid[0].text titlesarray = [] for title in titleslist: titlesarray.append(title.text) titlesarray.insert(1,titleticker) i = 0 while(currentpage > 0): i += 1 print str(i) + " page(s) done" secondurl = "https://finviz.com/screener.ashx?v=152&c=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68" + "&r=" + str(currentpage) secondresponse = requests.get(secondurl) secondhtml = secondresponse.content secondsoup = BeautifulSoup(secondhtml, "html.parser") stockdata = secondsoup.find_all('a', {"class" : "screener-link"}) stockticker = secondsoup.find_all('a', {"class" : "screener-link-primary"}) datalength = len(stockdata) #tickerdatalength = len(stockticker) j=0 m=0 while(j < datalength): templist = [stockdata[j+k].text for k in range(0, 68)] templist.insert(1, stockticker[m].text) alldata.append(templist) templist = [] j += 68 m += 1 currentpage -= 20 with open('stockoverview.csv', 'wb') as csvfile: overview = csv.DictWriter(csvfile, delimiter=',', lineterminator=' ', fieldnames=titlesarray) overview.writeheader() for stock in alldata: df = dict() for i in range(0,69): df[titlesarray[i]] = stock[i] overview.writerow(df) print datetime.datetime.now() print "Finviz Overview Completed"
Explore and Clean Data
We can test and see the first 5 rows
import pandas as pd import re df = pd.read_csv('stockoverview.csv') df = df.sort_values(df.columns[0]) df.to_csv('new_stockoverview.csv', sep=',', index=False) df.head()
The result can look something like this:
No. Ticker Company Sector 0 1 A Agilent Technologies, Inc. Healthcare 1 2 AA Alcoa Corporation Basic Materials 2 3 AAAP Advanced Accelerator Applications S.A. Healthcare 3 4 AABA Altaba Inc. Financial 4 5 AAC AAC Holdings, Inc. Healthcare Industry Country Market Cap P/E Fwd P/E PEG 0 Medical Laboratories & Research USA 23.69B 35.12 25.71 3.07 1 Aluminum USA 9.84B 45.99 14.14 - 2 Diagnostic Substances France 3.63B - - - 3 Asset Management USA 67.84B 41.57 146.81 - 4 Specialized Health Services USA 209.88M - 16.85 - ... RSI from Open Gap Recom Avg Volume Rel Volume Price 0 ... 71.21 -0.11% 0.30% 1.70 1.85M 0.95 73.58 1 ... 55.48 0.76% 0.78% 2.00 4.35M 0.58 53.30 2 ... 69.52 0.45% -0.09% 2.60 744.38K 0.01 82.32 3 ... 67.92 0.88% 0.73% 2.00 9.34M 0.67 77.81 4 ... 47.53 -1.32% 0.55% 1.70 169.23K 0.70 8.95 Change Volume Earnings 0 0.19% 1,754,331 Nov 20/a 1 1.54% 2,521,352 Jan 17/a 2 0.37% 7,485 Feb 16/b 3 1.62% 6,245,947 - 4 -0.78% 117,818 Feb 26/a
As we can see the data is not so clean. Imported numeric data often contains special characters such as percentage signs, dollar signs, commas, and so on. Beside the data are all of type objects except the first column(int64). But in fact if we observe most values numeric.
print(new_df.info) No. 7134 non-null int64 Ticker 7134 non-null object Company 7134 non-null object Sector 7134 non-null object Industry 7134 non-null object Country 7134 non-null object Market Cap 7134 non-null object P/E 7134 non-null object Fwd P/E 7134 non-null object PEG 7134 non-null object P/S 7134 non-null object P/B 7134 non-null object P/C 7134 non-null object P/FCF 7134 non-null object Dividend 7134 non-null object Payout Ratio 7134 non-null object EPS 7134 non-null object EPS this Y 7134 non-null object EPS next Y 7134 non-null object EPS past 5Y 7134 non-null object EPS next 5Y 7134 non-null object Sales past 5Y 7134 non-null object EPS Q/Q 7134 non-null object Sales Q/Q 7134 non-null object Outstanding 7134 non-null object Float 7134 non-null object Insider Own 7134 non-null object Insider Trans 7134 non-null object Inst Own 7134 non-null object Inst Trans 7134 non-null object Float Short 7134 non-null object Short Ratio 7134 non-null object ROA 7134 non-null object ROE 7134 non-null object ROI 7134 non-null object Curr R 7134 non-null object Quick R 7134 non-null object LTDebt/Eq 7134 non-null object Debt/Eq 7134 non-null object Gross M 7134 non-null object Oper M 7134 non-null object Profit M 7134 non-null object Perf Week 7134 non-null object Perf Month 7134 non-null object Perf Quart 7134 non-null object Perf Half 7134 non-null object Perf Year 7134 non-null object Perf YTD 7134 non-null object Beta 7134 non-null object ATR 7134 non-null object Volatility W 7134 non-null object Volatility M 7134 non-null object SMA20 7134 non-null object SMA50 7134 non-null object SMA200 7134 non-null object 50D High 7134 non-null object 50D Low 7134 non-null object 52W High 7134 non-null object 52W Low 7134 non-null object RSI 7134 non-null object from Open 7134 non-null object Gap 7134 non-null object Recom 7134 non-null object Avg Volume 7134 non-null object Rel Volume 7134 non-null object Price 7134 non-null object Change 7134 non-null object Volume 7134 non-null int64 Earnings 7134 non-null object dtypes: int64(2), object(67) memory usage: 3.8+ MB None
So let also convert some values to its appropriate type. Let's create function to clean data
def clean_numeric(s): return pd.to_numeric(re.sub(r'\%|$|,|)|(', ', s), errors='coerce') for i in new_df.columns[7:68]: new_df[i] = new_df[i].apply(clean_numeric) print(new_df.head()) print new_df.info() print new_df.describe()
Now what we get is a cleaner data make further analysis:
[5 rows x 69 columns] No. Ticker Company Sector 0 1 A Agilent Technologies, Inc. Healthcare 1 2 AA Alcoa Corporation Basic Materials 2 3 AAAP Advanced Accelerator Applications S.A. Healthcare 3 4 AABA Altaba Inc. Financial 4 5 AAC AAC Holdings, Inc. Healthcare Industry Country Market Cap P/E Fwd P/E PEG 0 Medical Laboratories & Research USA 23.69B 35.12 25.71 3.07 1 Aluminum USA 9.84B 45.99 14.14 NaN 2 Diagnostic Substances France 3.63B NaN NaN NaN 3 Asset Management USA 67.84B 41.57 146.81 NaN 4 Specialized Health Services USA 209.88M NaN 16.85 NaN ... RSI from Open Gap Recom Avg Volume Rel Volume Price 0 ... 71.21 -0.11 0.30 1.7 NaN 0.95 73.58 1 ... 55.48 0.76 0.78 2.0 NaN 0.58 53.30 2 ... 69.52 0.45 -0.09 2.6 NaN 0.01 82.32 3 ... 67.92 0.88 0.73 2.0 NaN 0.67 77.81 4 ... 47.53 -1.32 0.55 1.7 NaN 0.70 8.95 Change Volume Earnings 0 0.19 1754331 Nov 20/a 1 1.54 2521352 Jan 17/a 2 0.37 7485 Feb 16/b 3 1.62 6245947 - 4 -0.78 117818 Feb 26/a
Let's also check print(df.info)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7134 entries, 0 to 7133 Data columns (total 69 columns): No. 7134 non-null int64 Ticker 7134 non-null object Company 7134 non-null object Sector 7134 non-null object Industry 7134 non-null object Country 7134 non-null object Market Cap 7134 non-null object P/E 3040 non-null float64 Fwd P/E 3079 non-null float64 PEG 2252 non-null float64 P/S 4538 non-null float64 P/B 4440 non-null float64 P/C 4274 non-null float64 P/FCF 2684 non-null float64 Dividend 3714 non-null float64 Payout Ratio 2842 non-null float64 EPS 4732 non-null float64 EPS this Y 4560 non-null float64 EPS next Y 3833 non-null float64 EPS past 5Y 4304 non-null float64 EPS next 5Y 3236 non-null float64 Sales past 5Y 3859 non-null float64 EPS Q/Q 4469 non-null float64 Sales Q/Q 4450 non-null float64 Outstanding 0 non-null float64 Float 0 non-null float64 Insider Own 4548 non-null float64 Insider Trans 4246 non-null float64 Inst Own 4172 non-null float64 Inst Trans 4324 non-null float64 Float Short 4636 non-null float64 Short Ratio 6627 non-null float64 ROA 4331 non-null float64 ROE 4282 non-null float64 ROI 4244 non-null float64 Curr R 3523 non-null float64 Quick R 3452 non-null float64 LTDebt/Eq 4228 non-null float64 Debt/Eq 4228 non-null float64 Gross M 3464 non-null float64 Oper M 3841 non-null float64 Profit M 3817 non-null float64 Perf Week 7130 non-null float64 Perf Month 7110 non-null float64 Perf Quart 7011 non-null float64 Perf Half 6888 non-null float64 Perf Year 6671 non-null float64 Perf YTD 7131 non-null float64 Beta 3984 non-null float64 ATR 7131 non-null float64 Volatility W 7130 non-null float64 Volatility M 7110 non-null float64 SMA20 7131 non-null float64 SMA50 7131 non-null float64 SMA200 7131 non-null float64 50D High 7131 non-null float64 50D Low 7131 non-null float64 52W High 7131 non-null float64 52W Low 7131 non-null float64 RSI 7069 non-null float64 from Open 7131 non-null float64 Gap 7131 non-null float64 Recom 4131 non-null float64 Avg Volume 0 non-null float64 Rel Volume 7127 non-null float64 Price 7131 non-null float64 Change 7131 non-null float64 Volume 7134 non-null int64 Earnings 7134 non-null object dtypes: float64(60), int64(2), object(7) memory usage: 3.8+ MB
As we can see now there are many numeric types. Besides other disturbing symbols are removed. We can check the basic statistics of the data by using new_df.describe(). This methods only work on numeric column.
No. P/E Fwd P/E PEG P/S count 7134.000000 3040.000000 3079.000000 2252.000000 4538.000000 mean 3567.500000 76.372526 40.255661 9.272615 69.915591 std 2059.552743 632.983556 429.527497 74.322798 2049.459084 min 1.000000 0.000000 1.070000 0.000000 0.000000 25% 1784.250000 16.857500 13.550000 1.570000 1.010000 50% 3567.500000 23.485000 18.150000 2.435000 2.465000 75% 5350.750000 37.670000 26.160000 4.162500 5.470000 max 7134.000000 22495.000000 23250.000000 2001.670000 132866.310000 P/B P/C P/FCF Dividend Payout Ratio count 4440.000000 4274.000000 2684.000000 3714.000000 2842.000000 mean 6.980770 125.173613 83.639676 2.948244 45.881175 std 59.992237 2671.309330 938.556950 3.221843 83.578210 min 0.000000 0.020000 0.050000 0.010000 0.000000 25% 1.330000 4.532500 12.417500 1.190000 0.000000 50% 2.180000 10.035000 22.415000 2.030000 23.100000 75% 4.310000 27.540000 41.190000 3.410000 55.100000 max 3028.000000 169596.620000 42543.750000 48.410000 979.100000 ... 52W Low RSI from Open Gap count ... 7131.000000 7069.000000 7131.000000 7131.000000 mean ... 45.540217 58.054506 -0.464629 0.321847 std ... 148.317658 14.748561 2.131743 1.609517 min ... -8.330000 11.580000 -23.530000 -26.350000 25% ... 13.850000 47.590000 -1.130000 0.000000 50% ... 27.670000 58.140000 -0.290000 0.240000 75% ... 49.580000 68.990000 0.110000 0.590000 max ... 11133.330000 98.110000 28.220000 70.000000 Recom Avg Volume Rel Volume Price Change count 4131.000000 0.0 7127.000000 7131.000000 7131.000000 mean 2.299371 NaN 1.150251 87.844182 -0.148512 std 0.594775 NaN 2.038652 3831.215053 2.487509 min 1.000000 NaN 0.000000 0.020000 -28.820000 25% 2.000000 NaN 0.560000 11.915000 -0.910000 50% 2.200000 NaN 0.860000 26.380000 -0.110000 75% 2.700000 NaN 1.270000 50.190000 0.520000 max 5.000000 NaN 81.610000 323480.000000 32.840000 Volume count 7.134000e+03 mean 1.085803e+06 std 4.414247e+06 min 0.000000e+00 25% 2.200200e+04 50% 1.498810e+05 75% 6.555612e+05 max 1.671263e+08 [8 rows x 62 columns]