emacs org mode + emacs magit + bitbucket + python. There must be some room for improvement.
The course uses R. I don't want to learn yet another similar language, so I will find the corresponding numpy and scipy solutions.
Getting and Cleaning Data
Raw data -> processing scripts -> tidy data (often ignored in the classes but really important) -> data analysis (covered in machine learning classes) -> data communication
DataFrame.merge and DataFrame.join in pandas.
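A minimal sketch of the difference (the frames and column names below are my own invention, not from the course): merge does a SQL-style join on columns, while join aligns on the index.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [4, 5, 6]})

# merge: SQL-style join on columns, inner by default -> only 'b' and 'c' survive
merged = pd.merge(left, right, on='key', how='inner')

# join: aligns on the index, left join by default -> 'a' gets NaN in right_val
joined = left.set_index('key').join(right.set_index('key'))
print(merged)
print(joined)
```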
A codebook? (⊙o⊙)…
Do Python and R handle significant digits well enough that, unlike in C, one does not need to think about float versus double? In some extreme cases a library like sympy would probably still be needed. A codebook plays a role similar to the lab notebook in a wet lab. I am glad I learned emacs org mode early on; it fits well here. But I had overlooked the importance of "Info about the variables".
If there are many features, and the features themselves carry real meaning, they need to be chosen carefully. I remember a talk in which a financial company used decision trees to build portfolios; the algorithm itself was unremarkable, but the lecturer was tight-lipped about exactly which features were used.
"There are many stages to the design and analysis of a successful study. The last of these steps is the calculation of an inferential statistic such as a P value, and the application of a 'decision rule' to it (for example, P < 0.05). In practice, decisions that are made earlier in data analysis have a much greater impact on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled."
Leek, Jeffrey T., and Roger D. Peng. "Statistics: P values are just the tip of the iceberg." Nature 520.7549 (2015): 612.
I usually just use wget directly, but that makes it hard to integrate into a script. A few Python functions likely to be useful when downloading:
# set up the env
os.path.dirname(os.path.realpath(__file__))
os.getcwd()
os.path.join()
os.chdir()
os.path.exists()
os.makedirs()
# download
urllib.request.urlretrieve()
urllib.request.urlopen()
# to tag your downloaded files
datetime.timezone()
datetime.datetime.now()
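A hedged sketch putting a few of those functions together (the directory name data and the restaurants_ prefix are my own choices): create a directory for downloads if missing, then build a UTC timestamp to tag a downloaded file with.

```python
import os
from datetime import datetime, timezone

# create a 'data' directory under the current working directory, if missing
data_dir = os.path.join(os.getcwd(), 'data')
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

# an ISO-style UTC timestamp recording when the file was downloaded
stamp = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H%M%SZ')
tagged_name = os.path.join(data_dir, 'restaurants_' + stamp + '.xml')
print(tagged_name)
```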
# an example
import shutil
import ssl
import urllib.request as ur

def download(myurl):
    """
    download to the current directory
    """
    fn = myurl.split('/')[-1]
    context = ssl._create_unverified_context()
    with ur.urlopen(myurl, context=context) as response, open(fn, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return fn
pandas.read_csv()
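A small sketch of read_csv; I read from an in-memory buffer so it runs without an actual downloaded file (the CSV content and column names are made up):

```python
import io
import pandas as pd

# a tiny CSV in memory stands in for a downloaded file
csv_text = "name,zipcode\nAlice,21231\nBob,21201\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (2, 2)
print(df['zipcode'].dtype)  # dtypes are inferred per column
```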
Here is a very good introduction
Below are my summaries:
Python's standard library ships with xml.etree.ElementTree for parsing XML. There, ElementTree represents the whole XML file and Element represents a single node.
The first element in every XML document is called the root element. An XML file can have only one root, so a document with more than one top-level element does not conform to the XML spec.
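For illustration (my own minimal snippets, not from the course), ElementTree accepts a document with a single root but rejects a second top-level element:

```python
import xml.etree.ElementTree as ET

# well-formed: exactly one root element wrapping everything
root = ET.fromstring("<root><a/><b/></root>")
print(root.tag)  # 'root'

# two top-level elements: not well-formed XML, parsing fails
try:
    ET.fromstring("<a/><b/>")
except ET.ParseError as e:
    print("ParseError:", e)
```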
Traverse recursively:
# an exercise
# find all elements with zipcode equal to 21231
import xml.etree.ElementTree as ET

xml_fn = download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
tree = ET.parse(xml_fn)
for child in tree.iter():
    if child.tag == 'zipcode' and child.text == '21231':
        print(child)
To the eye, JSON looks like nested Python dicts. Python's built-in json module is used much like pickle.
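A quick sketch of that pickle-like API (the record below is made up): dumps/loads round-trip a dict through a JSON string, and json.dump/json.load are the file-object versions, mirroring pickle.

```python
import json

record = {'name': 'restaurant', 'zipcode': 21231, 'tags': ['cheap', 'open']}

# dumps/loads mirror pickle's API, but produce human-readable text
text = json.dumps(record)
back = json.loads(text)
print(text)
print(back == record)  # True: keys are strings, so the round trip is exact
```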
Python makes a distinction between matching and searching. Matching looks only at the start of the target string, whereas searching looks for the pattern anywhere in the target.
Always use raw strings for regex.
Character sets
sth like r\'[A-Za-z_]\'
would match an underscore or any uppercase or lowercase ASCII letter.
Characters that have special meanings in other regular expression contexts do not have special meanings within square brackets. The only character with a special meaning inside square brackets is a ^, and then only if it is the first character after the left (opening) bracket.
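A short example of both points (my own strings): match anchors at the start of the target, search scans anywhere, and inside square brackets a leading ^ negates the set.

```python
import re

pattern = r'[A-Za-z_]+'

# match looks only at the start: '1' is not in the character set
print(re.match(pattern, '123abc'))           # None

# search finds the pattern anywhere in the target
print(re.search(pattern, '123abc').group())  # 'abc'

# inside [...] a leading ^ negates the set: keep only non-digits
print(re.findall(r'[^0-9]', 'a1b2'))         # ['a', 'b']
```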
import pandas as pd
df = pd.DataFrame(data)  # data: e.g. a dict of columns or a 2-D ndarray
# Look at a bit of the data
df.head()
df.tail()
# summary
df.describe()
df.quantile()
# cov and corr
# DataFrame’s corr and cov methods return a full correlation or covariance matrix as a DataFrame, respectively
# to calculate pairwise correlation between a DataFrame's columns or rows
dset.corrwith(dset[''])
# you can write your own analysis function and apply it to the dataframe, for example:
f = lambda x: x.max() - x.min()
df.apply(f, axis=1)
df.dropna()
df.fillna(0)
# to modify inplace
_ = df.fillna(0, inplace=True)
# fill the nan with the mean
# or fill with a naive Bayesian prediction
data.fillna(data.mean())
Principles of Analytic Graphics
Show comparisons
If you build a model that makes predictions, report its performance alongside that of a random guess.
Show causality, mechanism, explanation, systematic structure
Show multivariate data
The world is inherently multivariate
Integration of evidence
Simple Summaries of Data
Two dimensions
> 2 dimensions
Graphics File Devices
rnorm: generate random Normal variates with a given mean and standard deviation
dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
pnorm: evaluate the cumulative distribution function for a Normal distribution
d for density
r for random number generation
p for cumulative distribution
q for quantile function
Setting the random number seed with set.seed ensures reproducibility:
> set.seed(1)
> rnorm(5)
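For the numpy/scipy route mentioned at the top, Python's standard library actually covers this d/p/q/r quartet too; a sketch using statistics.NormalDist (available since Python 3.8):

```python
from statistics import NormalDist

norm = NormalDist(mu=0, sigma=1)

print(norm.pdf(0.0))        # density at a point         (R: dnorm)
print(norm.cdf(0.0))        # cumulative distribution    (R: pnorm) -> 0.5
print(norm.inv_cdf(0.975))  # quantile function          (R: qnorm) -> ~1.96

# random variates; a fixed seed ensures reproducibility  (R: set.seed + rnorm)
draws = norm.samples(5, seed=1)
print(draws)
```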