Pythonic “Data Science” Specialization

jasperyang, published 2019-07-24 17:58 / 1105 views


Why The "Data Science" Specialization

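Review my statistics knowledge in preparation for deeper study
In his talk at GTC 2015, Andrew Ng said that deep learning is black magic: we understand 50% of it, but do not know how the other 50% works. Sitting in the audience, I thought that of the 50% that can be understood, I probably grasp only 5%.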

Learn from a "standard, efficient" workflow
Mine: emacs org mode + emacs magit + bitbucket + python. There must be some room for improvement.

How

The course uses R. I do not want to learn yet another similar language, so I will look for the corresponding numpy and scipy solutions.

Getting and Cleaning Data

Sources of raw data

Website APIs

Databases

JSON

Raw texts

Data analysis workflow

Raw data --> Processing scripts --> tidy data (often ignored in the classes but really important)

Record the meta data

Record the recipes

--> data analysis (covered in machine learning classes)

--> data communication

What is clean data

Each variable you measure should be in one column.

There should be one table for each "kind" of variable; generally, data should be saved in one file per table. Why? Doesn't that make the files a hassle to manage?

If you have multiple tables, they should include a column that allows them to be linked. See DataFrame.merge and DataFrame.join in pandas.
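
A minimal sketch of linking two tables through a shared column with pandas. The table names, column names, and values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical tables: one row per patient, one row per visit.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34, 51, 29]})
visits = pd.DataFrame({"patient_id": [1, 1, 3], "systolic_mmHg": [120, 118, 131]})

# patient_id is the linking column; a left join keeps patients that have no visits.
linked = patients.merge(visits, on="patient_id", how="left")
print(linked)
```

DataFrame.join does a similar job but aligns on the index by default, so merge is usually the more direct choice when the link is an ordinary column.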

The code book

A code book? (⊙o⊙)…

Info about the variables (including units!)
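Units matter! A measurement without units has no physical meaning!
But the course never mentions significant figures, which must be considered when measuring. Perhaps that is because Python and R handle significant figures well, so there is no need to choose between float and double as in C? In some extreme cases a library like sympy would still be needed, I suppose.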

Info about the summary choice you made

Info about the experimental study design you used
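
On the significant-figures question raised under "Info about the variables": a Python float is an IEEE 754 double, so there is no float/double choice to make, but the language does not track significant figures of a measurement either. The sketch below only shows the precision limit and two standard-library alternatives for exact arithmetic; the values are arbitrary.

```python
import sys
from decimal import Decimal
from fractions import Fraction

# Number of decimal digits a float is guaranteed to preserve.
print(sys.float_info.dig)

# Binary floating point cannot represent 0.1 or 0.2 exactly.
binary_sum = 0.1 + 0.2
print(binary_sum == 0.3)

# Decimal and Fraction compute exactly on decimal and rational inputs.
decimal_sum = Decimal("0.1") + Decimal("0.2")
fraction_sum = Fraction(1, 10) + Fraction(2, 10)
print(decimal_sum, fraction_sum)
```

Recording the precision of each variable in the code book remains a manual step.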

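The code book plays a role similar to a lab notebook in a wet lab. I am glad I learned about emacs org mode early on; it fits well here. But I had been overlooking the importance of the info about the variables.

If there are many features, and the features themselves carry deep meaning, they need to be chosen carefully. I remember a talk in which a financial company used a decision tree to build portfolios; the algorithm itself was unremarkable, but the lecturer stayed tight-lipped about exactly which features were used.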

"There are many stages to the design and analysis of a successful study. The last of these steps is the calculation of an inferential statistic such as a P value, and the application of a "decision rule" to it (for example, P < 0.05). In practice, decisions that are made earlier in data analysis have a much greater impact on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled."

Leek, Jeffrey T., and Roger D. Peng. "Statistics: P values are just the tip of the iceberg." Nature 520.7549 (2015): 612-612.

Downloading Files

I usually just use wget, but that is not easy to integrate into a script. Several Python functions that are likely to be useful when downloading:

# set up the env
os.path.dirname(os.path.realpath(__file__))
os.getcwd()
os.path.join()
os.chdir()
os.path.exists()
os.makedirs()

# download
urllib.request.urlretrieve()
urllib.request.urlopen()

# to tag your downloaded files
datetime.timezone()
datetime.datetime.now()

# an example
import shutil
import ssl
import urllib.request as ur

def download(myurl):
    """
    download to the current directory
    """
    fn = myurl.split("/")[-1]
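    # NOTE: this private helper disables TLS certificate verification;
    # use ssl.create_default_context() where verification is required
    context = ssl._create_unverified_context()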
    with ur.urlopen(myurl, context=context) as response, open(fn, "wb") as out_file:
        shutil.copyfileobj(response, out_file)

    return fn
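
A sketch of how the helpers listed above might be combined: create a target directory, download into it, and record when the download happened. The function name fetch and the default directory "data" are assumptions made for this illustration.

```python
import datetime
import os
import urllib.request


def fetch(url, dest_dir="data"):
    """Download url into dest_dir and return (path, UTC timestamp string)."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the directory exists
    path = os.path.join(dest_dir, url.split("/")[-1])
    urllib.request.urlretrieve(url, path)
    # Timezone-aware timestamp to note in the code book next to the file.
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return path, stamp
```

Storing the timestamp alongside the file matters because online data sets can change between downloads.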

Loading flat files

pandas.read_csv()
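
A small sketch of pandas.read_csv. The CSV content is made up, and io.StringIO stands in for a file on disk so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; the second row has an empty lat field.
csv_text = "camera_id,street,lat\n1,Main St,39.29\n2,Oak Ave,\n"

cameras = pd.read_csv(io.StringIO(csv_text))
print(cameras.dtypes)  # column types are inferred from the data
print(cameras)         # the empty field is read as NaN
```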

Reading XML

Here is a very good introduction

Below are my summaries:

The Python standard library ships with xml.etree.ElementTree for parsing XML. In it, ElementTree represents the whole XML document and Element represents a single node.

The first element in every XML document is called the root element. An XML document can have only one root, so a document with more than one top-level element is not well-formed XML.
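
A short illustration of the single-root rule using xml.etree.ElementTree. Both XML strings are invented for this sketch.

```python
import xml.etree.ElementTree as ET

one_root = "<restaurants><name>A</name><name>B</name></restaurants>"
two_roots = "<name>A</name><name>B</name>"

# A document with a single root parses into an Element.
root = ET.fromstring(one_root)
print(root.tag, [child.text for child in root])

# Two top-level elements make the parser raise ParseError.
try:
    ET.fromstring(two_roots)
    parsed_two_roots = True
except ET.ParseError as err:
    parsed_two_roots = False
    print("not well-formed:", err)
```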

Recursive traversal

# an exercise
# find all elements whose zipcode equals 21231
import xml.etree.ElementTree as ET

xml_fn = download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
tree = ET.parse(xml_fn)
for child in tree.iter():
    if child.tag == "zipcode" and child.text == "21231":
        print(child)

JSON

JSON stands for JavaScript Object Notation

lightweight data storage

To the eye, JSON looks like a nested Python dict. Python's built-in json module is used much like pickle.
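
A minimal round trip through the json module. The record is invented for illustration.

```python
import json

# Hypothetical nested record.
record = {"name": "example", "tags": ["a", "b"], "owner": {"id": 1, "active": True}}

text = json.dumps(record, indent=2)  # Python object -> JSON string
restored = json.loads(text)          # JSON string -> Python object
print(text)
```

As with pickle, dumps and loads work on strings, while dump and load take a file object.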

Pattern Matching

Python makes a distinction between matching and searching. Matching looks only at the start of the target string, whereas searching looks for the pattern anywhere in the target.
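
The difference between matching and searching can be seen in a short sketch; the pattern and target string are arbitrary examples.

```python
import re

pattern = re.compile(r"\d+")
target = "order 66 shipped"

# match only succeeds if the pattern fits at position 0 of the target.
at_start = pattern.match(target)

# search scans the target and returns the first place the pattern fits.
anywhere = pattern.search(target)

print(at_start, anywhere.group())
```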

Always use raw strings for regex.

Character sets
Something like r"[A-Za-z_]" would match an underscore or any uppercase or lowercase ASCII letter.

Most characters that have special meanings in other regular expression contexts lose them within square brackets. The exceptions are ^, which negates the set only if it is the first character after the left (opening) bracket; -, which denotes a range when it sits between two characters; ], which closes the set; and the backslash, which still escapes.
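
A few character-set patterns, sketched with re.findall. The input strings are arbitrary.

```python
import re

# Letters and underscore, one or more times.
identifier_chars = re.findall(r"[A-Za-z_]+", "max_value = 42")

# A leading ^ negates the set: runs of anything that is not a digit.
not_digits = re.findall(r"[^0-9]+", "a1b22c")

# Inside brackets, ., * and + are ordinary characters.
literal = re.findall(r"[.*+]", "1.5*2+3")

print(identifier_chars, not_digits, literal)
```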

Summarizing Data

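import pandas as pd

# small illustrative frame (placeholder data) so that the calls below run
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})

# Look at a bit of the data
df.head()
df.tail()

# summary
df.describe()
df.quantile()

# cov and corr
# DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame, respectively
df.corr()
df.cov()

# to calculate pairwise correlation between a DataFrame's columns or rows and another Series or DataFrame
# ("a" is a placeholder; the column name was left blank in the original note)
df.corrwith(df["a"])

# you can write your own analysis function and apply it to the dataframe, for example:
f = lambda x: x.max() - x.min()
df.apply(f, axis=1)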

Check for missing values

df.dropna()
df.fillna(0)
# to modify in place (the call then returns None)
df.fillna(0, inplace=True)

# fill the NaN values with the column mean
# or use a naive Bayes prediction
df.fillna(df.mean())
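
Before dropping or filling, it can help to count what is actually missing. A minimal sketch with a made-up frame (df_missing is a name chosen for this illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps.
df_missing = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Number of missing values per column.
missing_per_column = df_missing.isna().sum()

# True if any cell in the frame is missing.
any_missing = bool(df_missing.isna().any().any())

# Column means are computed with NaN skipped, then used as fill values.
filled = df_missing.fillna(df_missing.mean())
print(missing_per_column, any_missing, filled, sep="\n")
```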

Exploratory Data Analysis

Analytic graphics

Principles of Analytic Graphics

Show comparisons
If you build a model that makes predictions, report the performance of random guessing alongside it.

Show causality, mechanism, explanation, systematic structure

Show multivariate data
The world is inherently multivariate

Integration of evidence

Describe and document the evidence with appropriate labels, scales, sources, etc.
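
For the "show comparisons" principle above, a sketch of two simple baselines a classifier could be reported against: random guessing and always predicting the majority class. The labels are synthetic and the 70/30 split is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary labels: 70 negatives, 30 positives.
y_true = np.array([0] * 70 + [1] * 30)

# Baseline 1: guess each label uniformly at random.
random_guess = rng.integers(0, 2, size=y_true.size)
random_accuracy = (random_guess == y_true).mean()

# Baseline 2: always predict the majority class (0 here).
majority = np.zeros_like(y_true)
majority_accuracy = (majority == y_true).mean()

print(random_accuracy, majority_accuracy)
```

On imbalanced data the majority-class baseline is often the harder one to beat, so showing both gives a fairer comparison.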

Simple Summaries of Data

Two dimensions

scatterplots

smooth scatterplots

> 2 dimensions

Overlaid/multiple 2-D plots; coplots

Use color, size, shape to add dimensions

Spinning plots

Actual 3-D plots (not very useful)

Graphics File Devices

pdf: useful for line-type graphics, resizes well, not efficient if a plot has many objects/points

svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots

png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression, most web browsers can read this format natively, does not resize well

jpeg: good for photographs or natural scenes, uses lossy compression, does not resize well

tiff: bitmapped format, supports lossless compression
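
The course discusses R graphics devices; a rough Python counterpart, assuming matplotlib is the plotting library, is to pick the format through the file extension passed to savefig. The sketch also uses color and marker size to show a third and fourth variable. All data are random and the file names are placeholders.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
third = rng.uniform(size=50)           # shown as color
fourth = rng.uniform(10, 100, size=50) # shown as marker size

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=third, s=fourth)
fig.colorbar(points, ax=ax, label="third variable")
ax.set_xlabel("x")
ax.set_ylabel("y")

# The output format follows from the file extension.
out_dir = tempfile.mkdtemp()
saved = []
for ext in ("pdf", "svg", "png"):
    path = os.path.join(out_dir, "scatter." + ext)
    fig.savefig(path)
    saved.append(path)
plt.close(fig)
```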

Simulation in R

rnorm: generate random Normal variates with a given mean and standard deviation

dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)

pnorm: evaluate the cumulative distribution function for a Normal distribution

d for density

r for random number generation

p for cumulative distribution

q for quantile function

Setting the random number seed with set.seed ensures reproducibility

> set.seed(1)
> rnorm(5)
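
In keeping with the plan to look for numpy and scipy counterparts, the following is one possible mapping of the d/r/p/q functions. It assumes scipy.stats.norm and numpy's Generator API; the random stream differs from R's, so the same seed does not reproduce R's numbers.

```python
import numpy as np
from scipy import stats

# Counterpart of set.seed(1): a seeded generator gives repeatable draws.
rng = np.random.default_rng(1)
draws = rng.normal(loc=0.0, scale=1.0, size=5)  # like rnorm(5)

density = stats.norm.pdf(0.0)      # like dnorm(0)
cumulative = stats.norm.cdf(1.96)  # like pnorm(1.96)
quantile = stats.norm.ppf(0.975)   # like qnorm(0.975)

print(draws, density, cumulative, quantile)
```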
