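This is the first part of a tutorial series on predicting the weather with machine learning. It uses Python and machine learning to build models that predict temperature from data collected from Weather Underground. The series has three parts, covering the following topics:
The data used in this tutorial comes from the free tier of Weather Underground's API service. I will use Python's requests library to call the API and retrieve weather data for Lincoln, Nebraska from 2015 onward. Once collected, the data needs to be processed and aggregated into a suitable format, and then cleaned.
The second article focuses on analyzing trends in the data, with the goal of selecting appropriate features and building a linear regression model with Python's statsmodels and scikit-learn libraries. I will discuss the assumptions that must hold when building a linear regression model and demonstrate how to evaluate the features in order to build a robust model. It ends with testing and validating the model.
The final article focuses on neural networks. I will compare building a neural network model with building a linear regression model in terms of process, results, and accuracy.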
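Weather Underground is a company that collects and distributes weather measurements from around the globe. It offers a wide range of APIs for both commercial and non-commercial use. In this article I describe how to fetch daily weather data through the non-commercial API. If you want to follow along with this tutorial, you will need to sign up for their free developer account. The account provides an API key that is limited to 10 requests per minute and 500 requests per day.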
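The two rate limits translate directly into how a collection script has to be paced. The following is a minimal sketch of that arithmetic; the helper names `min_delay_seconds` and `days_needed` are hypothetical and not part of any Weather Underground client.

```python
import math

REQUESTS_PER_MINUTE = 10   # free-tier limit stated above
REQUESTS_PER_DAY = 500     # free-tier limit stated above


def min_delay_seconds(per_minute=REQUESTS_PER_MINUTE):
    """Smallest pause between calls that stays within the per-minute cap."""
    return 60 / per_minute


def days_needed(total_requests, per_day=REQUESTS_PER_DAY):
    """Number of calendar days needed to issue `total_requests` calls."""
    return math.ceil(total_requests / per_day)


delay = min_delay_seconds()
batches = days_needed(1000)
print(delay, batches)
```

With one request per day of history, a pause of about six seconds between calls respects the per-minute limit, and a 1000-day history has to be split over at least two days of collection.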
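History requests use a URL of the following form, which matches the BASE_URL defined later in this article:

http://api.wunderground.com/api/API_KEY/history_YYYYMMDD/q/STATE/CITY.json

API_KEY: the key you receive when you register an account
YYYYMMDD: the date of the weather data you want
STATE: the two-letter state abbreviation
CITY: the name of the city you are requesting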
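The placeholders above can be filled in programmatically. The following is a minimal sketch, assuming the URL layout used by BASE_URL in this article; `URL_TEMPLATE` and `build_history_url` are hypothetical names introduced only for illustration, and "YOUR_API_KEY" stands in for a real key.

```python
from datetime import datetime

# Assumed layout, taken from the BASE_URL string used in this article.
URL_TEMPLATE = (
    "http://api.wunderground.com/api/{api_key}/history_{date}/q/{state}/{city}.json"
)


def build_history_url(api_key, day, state, city):
    """Return the history URL for one day; `day` is a datetime."""
    return URL_TEMPLATE.format(
        api_key=api_key,
        date=day.strftime("%Y%m%d"),  # YYYYMMDD, for example 20160516
        state=state,
        city=city,
    )


url = build_history_url("YOUR_API_KEY", datetime(2016, 5, 16), "NE", "Lincoln")
print(url)
```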
Calling the API

To request historical data from the Weather Underground API, this tutorial uses the following Python libraries.
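Name | Description | Source |
datetime | Working with dates | Standard library |
time | Working with time | Standard library |
collections | Its namedtuple is used to structure the data | Standard library |
pandas | Processing the data | Third party |
requests | Making HTTP requests | Third party |
matplotlib | Plotting | Third party |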
    from datetime import datetime, timedelta
    import time
    from collections import namedtuple
    import pandas as pd
    import requests
    import matplotlib.pyplot as plt

    API_KEY = "7052ad35e3c73564"
    # The first placeholder is the API key, the second is the date
    BASE_URL = "http://api.wunderground.com/api/{}/history_{}/q/NE/Lincoln.json"

    target_date = datetime(2016, 5, 16)
    features = ["date", "meantempm", "meandewptm", "meanpressurem", "maxhumidity",
                "minhumidity", "maxtempm", "mintempm", "maxdewptm", "mindewptm",
                "maxpressurem", "minpressurem", "precipm"]
    DailySummary = namedtuple("DailySummary", features)
    def extract_weather_data(url, api_key, target_date, days):
        """Collect one DailySummary per day for `days` days, starting at target_date."""
        records = []
        for _ in range(days):
            # build the request URL for the current date
            request = url.format(api_key, target_date.strftime("%Y%m%d"))
            response = requests.get(request)
            if response.status_code == 200:
                data = response.json()["history"]["dailysummary"][0]
                records.append(DailySummary(
                    date=target_date,
                    meantempm=data["meantempm"],
                    meandewptm=data["meandewptm"],
                    meanpressurem=data["meanpressurem"],
                    maxhumidity=data["maxhumidity"],
                    minhumidity=data["minhumidity"],
                    maxtempm=data["maxtempm"],
                    mintempm=data["mintempm"],
                    maxdewptm=data["maxdewptm"],
                    mindewptm=data["mindewptm"],
                    maxpressurem=data["maxpressurem"],
                    minpressurem=data["minpressurem"],
                    precipm=data["precipm"]))
            # wait 6 seconds so that at most 10 requests are sent per minute
            time.sleep(6)
            target_date += timedelta(days=1)
        return records
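First, the function defines a list named records to hold the DailySummary tuples and uses a for loop to step through the requested dates. On each iteration it builds the URL, issues the HTTP request, reads the returned data, uses that data to initialize a DailySummary, and appends it to records. The function returns the historical weather data for the N days starting at the given date.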
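The parsing step can be tried without touching the network. The following is a minimal sketch that feeds a made-up payload through the same `["history"]["dailysummary"][0]` lookup; `MiniSummary`, `parse_daily_summary`, the shortened field list, and the sample values are all hypothetical. Storing the readings as strings is an assumption, chosen to be consistent with the object dtype reported by df.info() later in this article.

```python
from collections import namedtuple
from datetime import datetime

FIELDS = ["date", "meantempm", "maxtempm", "mintempm"]  # shortened for the example
MiniSummary = namedtuple("MiniSummary", FIELDS)


def parse_daily_summary(payload, day):
    """Build a MiniSummary from one decoded JSON response."""
    data = payload["history"]["dailysummary"][0]
    values = {name: data[name] for name in FIELDS if name != "date"}
    return MiniSummary(date=day, **values)


# Made-up payload that mimics the nesting the lookup relies on.
sample = {"history": {"dailysummary": [
    {"meantempm": "18", "maxtempm": "24", "mintempm": "12"}
]}}

record = parse_daily_summary(sample, datetime(2016, 5, 16))
print(record)
```

Separating the parsing from the HTTP call in this way makes it possible to check the field mapping against a saved response before spending any of the daily request quota.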
    records = extract_weather_data(BASE_URL, API_KEY, target_date, 500)

Formatting the data as a Pandas DataFrame
We use the list of DailySummary tuples to initialize a Pandas DataFrame. The DataFrame is a data structure that is used frequently in machine learning work.
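A toy version shows what this conversion produces. The sketch below assumes a shortened, hypothetical `Row` namedtuple with made-up values; only `pd.DataFrame` and `set_index` are taken from the tutorial's own approach.

```python
from collections import namedtuple
from datetime import datetime

import pandas as pd

Row = namedtuple("Row", ["date", "meantempm", "maxtempm"])  # hypothetical, shortened

rows = [
    Row(datetime(2016, 5, 16), "18", "24"),
    Row(datetime(2016, 5, 17), "20", "27"),
]

# Each namedtuple becomes one row; the date column becomes the index.
toy = pd.DataFrame(rows, columns=list(Row._fields)).set_index("date")
print(toy)
```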
    df = pd.DataFrame(records, columns=features).set_index("date")

Deriving the features
mean temperature
mean dewpoint
mean pressure
max humidity
min humidity
max dewpoint
min dewpoint
max pressure
min pressure
    # work on a copy of the first 10 rows so that df itself is left unchanged
    tmp = df[["meantempm", "meandewptm"]].head(10).copy()
    tmp
    # 1 day prior
    N = 1

    # target measurement of mean temperature
    feature = "meantempm"

    # total number of rows
    rows = tmp.shape[0]

    # a list representing the Nth prior measurements of feature
    # notice that the front of the list needs to be padded with N
    # None values to maintain a consistent row count for each N
    nth_prior_measurements = [None]*N + [tmp[feature].iloc[i-N] for i in range(N, rows)]

    # make a new column named feature_N and add it to the DataFrame
    col_name = "{}_{}".format(feature, N)
    tmp[col_name] = nth_prior_measurements
    tmp
    def derive_nth_day_feature(df, feature, N):
        """Add a column feature_N holding the value of feature from N days earlier."""
        rows = df.shape[0]
        nth_prior_measurements = [None]*N + [df[feature].iloc[i-N] for i in range(N, rows)]
        col_name = "{}_{}".format(feature, N)
        df[col_name] = nth_prior_measurements

    # derive the 1, 2 and 3 day prior columns for every feature except the date
    for feature in features:
        if feature != "date":
            for N in range(1, 4):
                derive_nth_day_feature(df, feature, N)
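The same lag columns can also be produced with pandas' built-in `Series.shift`, which moves values down by a number of rows and fills the top with missing values. This is an alternative sketch rather than the method used in this article; `add_lag_features` and the toy data are hypothetical.

```python
import pandas as pd


def add_lag_features(frame, feature, max_lag):
    """Add feature_1 .. feature_max_lag columns using Series.shift."""
    for n in range(1, max_lag + 1):
        frame["{}_{}".format(feature, n)] = frame[feature].shift(n)
    return frame


toy = pd.DataFrame(
    {"meantempm": [10, 12, 15, 11]},
    index=pd.date_range("2015-01-01", periods=4),
)
add_lag_features(toy, "meantempm", 2)
print(toy)
```

One difference to be aware of: `shift` marks the padded rows with NaN rather than None. Both count as missing values for `dropna` and `isnull`.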
    df.columns

    Index(["meantempm", "meandewptm", "meanpressurem", "maxhumidity",
           "minhumidity", "maxtempm", "mintempm", "maxdewptm", "mindewptm",
           "maxpressurem", "minpressurem", "precipm", "meantempm_1",
           "meantempm_2", "meantempm_3", "meandewptm_1", "meandewptm_2",
           "meandewptm_3", "meanpressurem_1", "meanpressurem_2",
           "meanpressurem_3", "maxhumidity_1", "maxhumidity_2", "maxhumidity_3",
           "minhumidity_1", "minhumidity_2", "minhumidity_3", "maxtempm_1",
           "maxtempm_2", "maxtempm_3", "mintempm_1", "mintempm_2", "mintempm_3",
           "maxdewptm_1", "maxdewptm_2", "maxdewptm_3", "mindewptm_1",
           "mindewptm_2", "mindewptm_3", "maxpressurem_1", "maxpressurem_2",
           "maxpressurem_3", "minpressurem_1", "minpressurem_2",
           "minpressurem_3", "precipm_1", "precipm_2", "precipm_3"],
          dtype="object")

Data cleaning
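First I drop the data I am not interested in, to reduce the size of the sample set. The goal is to predict temperature from the previous three days of weather measurements, so of the original same-day columns I keep only the minimum, maximum, and mean temperature, together with all of the derived prior-day columns.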
    # make list of original features without meantempm, mintempm, and maxtempm
    to_remove = [feature
                 for feature in features
                 if feature not in ["meantempm", "mintempm", "maxtempm"]]

    # make a list of columns to keep
    to_keep = [col for col in df.columns if col not in to_remove]

    # select only the columns in to_keep and assign to df
    df = df[to_keep]
    df.columns

    Index(["meantempm", "maxtempm", "mintempm", "meantempm_1", "meantempm_2",
           "meantempm_3", "meandewptm_1", "meandewptm_2", "meandewptm_3",
           "meanpressurem_1", "meanpressurem_2", "meanpressurem_3",
           "maxhumidity_1", "maxhumidity_2", "maxhumidity_3", "minhumidity_1",
           "minhumidity_2", "minhumidity_3", "maxtempm_1", "maxtempm_2",
           "maxtempm_3", "mintempm_1", "mintempm_2", "mintempm_3", "maxdewptm_1",
           "maxdewptm_2", "maxdewptm_3", "mindewptm_1", "mindewptm_2",
           "mindewptm_3", "maxpressurem_1", "maxpressurem_2", "maxpressurem_3",
           "minpressurem_1", "minpressurem_2", "minpressurem_3", "precipm_1",
           "precipm_2", "precipm_3"],
          dtype="object")
    df.info()

    DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27
    Data columns (total 39 columns):
    meantempm          1000 non-null object
    maxtempm           1000 non-null object
    mintempm           1000 non-null object
    meantempm_1        999 non-null object
    meantempm_2        998 non-null object
    meantempm_3        997 non-null object
    meandewptm_1       999 non-null object
    meandewptm_2       998 non-null object
    meandewptm_3       997 non-null object
    meanpressurem_1    999 non-null object
    meanpressurem_2    998 non-null object
    meanpressurem_3    997 non-null object
    maxhumidity_1      999 non-null object
    maxhumidity_2      998 non-null object
    maxhumidity_3      997 non-null object
    minhumidity_1      999 non-null object
    minhumidity_2      998 non-null object
    minhumidity_3      997 non-null object
    maxtempm_1         999 non-null object
    maxtempm_2         998 non-null object
    maxtempm_3         997 non-null object
    mintempm_1         999 non-null object
    mintempm_2         998 non-null object
    mintempm_3         997 non-null object
    maxdewptm_1        999 non-null object
    maxdewptm_2        998 non-null object
    maxdewptm_3        997 non-null object
    mindewptm_1        999 non-null object
    mindewptm_2        998 non-null object
    mindewptm_3        997 non-null object
    maxpressurem_1     999 non-null object
    maxpressurem_2     998 non-null object
    maxpressurem_3     997 non-null object
    minpressurem_1     999 non-null object
    minpressurem_2     998 non-null object
    minpressurem_3     997 non-null object
    precipm_1          999 non-null object
    precipm_2          998 non-null object
    precipm_3          997 non-null object
    dtypes: object(39)
    memory usage: 312.5+ KB
    df = df.apply(pd.to_numeric, errors="coerce")
    df.info()

    DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27
    Data columns (total 39 columns):
    meantempm          1000 non-null int64
    maxtempm           1000 non-null int64
    mintempm           1000 non-null int64
    meantempm_1        999 non-null float64
    meantempm_2        998 non-null float64
    meantempm_3        997 non-null float64
    meandewptm_1       999 non-null float64
    meandewptm_2       998 non-null float64
    meandewptm_3       997 non-null float64
    meanpressurem_1    999 non-null float64
    meanpressurem_2    998 non-null float64
    meanpressurem_3    997 non-null float64
    maxhumidity_1      999 non-null float64
    maxhumidity_2      998 non-null float64
    maxhumidity_3      997 non-null float64
    minhumidity_1      999 non-null float64
    minhumidity_2      998 non-null float64
    minhumidity_3      997 non-null float64
    maxtempm_1         999 non-null float64
    maxtempm_2         998 non-null float64
    maxtempm_3         997 non-null float64
    mintempm_1         999 non-null float64
    mintempm_2         998 non-null float64
    mintempm_3         997 non-null float64
    maxdewptm_1        999 non-null float64
    maxdewptm_2        998 non-null float64
    maxdewptm_3        997 non-null float64
    mindewptm_1        999 non-null float64
    mindewptm_2        998 non-null float64
    mindewptm_3        997 non-null float64
    maxpressurem_1     999 non-null float64
    maxpressurem_2     998 non-null float64
    maxpressurem_3     997 non-null float64
    minpressurem_1     999 non-null float64
    minpressurem_2     998 non-null float64
    minpressurem_3     997 non-null float64
    precipm_1          889 non-null float64
    precipm_2          889 non-null float64
    precipm_3          888 non-null float64
    dtypes: float64(36), int64(3)
    memory usage: 312.5 KB
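The drop in non-null counts for the precipitation columns is what `errors="coerce"` does to entries that cannot be parsed as numbers: they become NaN. The sketch below shows the effect on a toy frame; the values, including the "T" entry, are made up for illustration and are not taken from the collected data.

```python
import pandas as pd

raw = pd.DataFrame({
    "meantempm": ["18", "20", "17"],
    "precipm_1": ["0.00", "T", "1.52"],  # "T" is a made-up non-numeric entry
})

# Convert every column to a numeric dtype; unparseable strings become NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")
print(clean.dtypes)
print(clean)
```

A column that parses entirely as whole numbers becomes int64, while a column that picks up a NaN is stored as float64, which matches the dtype split in the output above.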
    # Call describe on df and transpose it due to the large number of columns
    spread = df.describe().T

    # precalculate interquartile range for ease of use in next calculation
    IQR = spread["75%"] - spread["25%"]

    # create an outliers column which is True when the minimum lies more than
    # 3 IQRs below the first quartile or the maximum lies more than 3 IQRs
    # above the third quartile
    spread["outliers"] = (spread["min"] < (spread["25%"] - 3*IQR)) | (spread["max"] > (spread["75%"] + 3*IQR))

    # just display the features containing extreme outliers
    spread.loc[spread.outliers]
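The same 3 x IQR rule can be checked on a small frame where the answer is obvious. This is a minimal sketch with made-up numbers; the column names `steady` and `spiky` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "steady": [10, 11, 12, 13, 14, 15, 16, 17],
    "spiky":  [10, 11, 12, 13, 14, 15, 16, 90],  # one extreme maximum
})

stats = toy.describe().T
iqr = stats["75%"] - stats["25%"]
low_fence = stats["25%"] - 3 * iqr    # anything below this is an extreme low
high_fence = stats["75%"] + 3 * iqr   # anything above this is an extreme high
stats["outliers"] = (stats["min"] < low_fence) | (stats["max"] > high_fence)
print(stats[["min", "25%", "75%", "max", "outliers"]])
```

Only `spiky` is flagged, because its maximum of 90 lies far above the upper fence while every value in `steady` sits within both fences.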
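Assessing the potential impact of outliers is a difficult part of any analytics project. On the one hand, you need to be concerned about introducing spurious data samples that will significantly distort your models. On the other hand, outliers can be extremely meaningful when predicting outcomes that arise under special circumstances. We will discuss each of the features that contain outliers and see whether we can reach a reasonable conclusion about how to treat them.
The first set of features all appear to be related to max humidity. Looking at the data, I can tell that the outlier for this feature category is a very low minimum value. That value looks questionable, so I want to take a closer look at it, preferably graphically. To do this I will use a histogram.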
    %matplotlib inline
    plt.rcParams["figure.figsize"] = [14, 8]
    df.maxhumidity_1.hist()
    plt.title("Distribution of maxhumidity_1")
    plt.xlabel("maxhumidity_1")
    plt.show()
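Looking at the histogram of the maxhumidity field, the data shows considerable negative skew. I will keep this in mind when selecting prediction models and evaluating how strongly max humidity affects the result. Many basic statistical methods assume that the data is normally distributed. For now I will leave it alone, but keep this anomaly in mind.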
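Negative skew can also be confirmed numerically rather than by eye. The sketch below uses made-up readings that cluster near an upper limit with a few much lower values; `Series.skew` returns the sample skewness, which is negative when the long tail points toward low values.

```python
import pandas as pd

# Made-up humidity-like readings: most near the ceiling, a few far lower.
humidity = pd.Series([100, 100, 97, 96, 94, 93, 93, 90, 87, 62, 47])

skewness = humidity.skew()
print(skewness)
print(humidity.mean(), humidity.median())
```

A mean that falls below the median is another quick sign of the same left-leaning tail.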
    df.minpressurem_1.hist()
    plt.title("Distribution of minpressurem_1")
    plt.xlabel("minpressurem_1")
    plt.show()
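The last data quality issue to address is missing values. Because of the way I built the DataFrame, missing values are represented by NaN. You may recall that I deliberately introduced missing values for the first three days of collected data by deriving features that represent the previous three days of measurements. It is not until three days into the data that all of these features can be derived, so I clearly want to exclude those first three days from the data set.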
    # iterate over the precip columns
    for precip_col in ["precipm_1", "precipm_2", "precipm_3"]:
        # create a boolean array marking the NaN entries
        missing_vals = pd.isnull(df[precip_col])
        # set the missing precipitation values to 0
        df.loc[missing_vals, precip_col] = 0
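The combined effect of filling the precipitation gaps and then dropping the remaining incomplete rows can be checked on a toy frame. This is a minimal sketch with made-up values; it uses `fillna`, which is an equivalent way to write the zero fill for a single column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "meantempm":   [10.0, 12.0, 15.0, 11.0],
    "meantempm_1": [np.nan, 10.0, 12.0, 15.0],  # first row has no prior day
    "precipm_1":   [np.nan, 0.5, np.nan, 0.0],
})

# Fill only the precipitation gaps, leaving other NaNs in place.
toy["precipm_1"] = toy["precipm_1"].fillna(0)

# Drop rows that still contain a NaN (here, only the first row).
trimmed = toy.dropna()
print(trimmed)
```

Filling before dropping matters: if the rows were dropped first, every day with a missing precipitation reading would be lost along with the leading rows.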
    df = df.dropna()

Summary
