a side-by-side reference sheet
grammar and invocation | variables and expressions | arithmetic and logic | strings | regexes | dates and time | tuples
arrays | sequences | multidimensional arrays | dictionaries | functions | execution control | file handles | directories
processes and environment | libraries and namespaces | reflection
ordered dictionaries | data sets | import and export | relational algebra | aggregation
vectors | matrices | statistics | linear regression and curve fitting | distributions | univariate charts | bivariate charts
multivariate charts
| matlab | r | numpy | |
|---|---|---|---|
| version used | Octave 3.2 | 2.6 | Python 2.7 NumPy 1.6 SciPy 0.10 Pandas 0.9 Matplotlib 1.0 |
| show version | $ octave --version | $ r --version | sys.version np.__version__ sp.__version__ mpl.__version__ |
| implicit prologue | none | install.packages('ggplot2') library('ggplot2') |
import sys, os, re, math import numpy as np import scipy as sp import scipy.stats as stats import pandas as pd import matplotlib as mpl import matplotlib.pyplot as plt |
| grammar and invocation | |||
| matlab | r | numpy | |
| interpreter |
$ octave foo.m | $ Rscript foo.r $ r -f foo.r |
$ python foo.py |
| repl |
$ octave | $ r | $ python |
| command line program | $ octave --silent --eval 'printf("hi")' | $ Rscript -e 'print("hi")' | $ python -c 'print("hi")' |
| block delimiters | function endfunction if elseif else endif while endwhile do until for endfor |
{ } | offside rule |
| statement separator | ; or newline | ; or sometimes newline | newline or ; newlines not separators inside (), [], {}, triple quote literals, or after backslash: \ |
| end-of-line comment | 1 + 1 % addition Octave only: 1 + 1 # addition |
1 + 1 # addition | 1 + 1 # addition |
| variables and expressions | |||
| matlab | r | numpy | |
| assignment | i = 3 | i = 3 i <- 3 3 -> i assign("i", 3) |
i = 3 |
| compound assignment arithmetic, string, logical |
MATLAB has no compound assignment operators. Octave has these: += -= *= /= none none **= or ^= none none &= |= none |
none | # do not return values: += -= *= /= //= %= **= += *= &= |= ^= |
| increment and decrement | ++x --x x++ x-- |
none | none |
| null | only used in place of numeric values: NA |
NA NULL | None |
| null test | isna(v) true for '', []: isnull(v) |
is.na(v) is.null(v) |
v == None v is None |
| conditional expression | none | (if (x > 0) x else -x) ifelse(x > 0, x, -x) |
x if x > 0 else -x |
| arithmetic and logic | |||
| matlab | r | numpy | |
| true and false |
1 0 true false | TRUE FALSE T F | True False |
| falsehoods | false 0 0.0 matrices evaluate to false unless nonempty and all entries evaluate to true |
FALSE F 0 0.0 matrices evaluate to value of first entry; string in boolean context causes error |
False None 0 0.0 '' [] {} |
| logical operators | ~true | (true & false) Optional negation operator in Octave: ! short-circuit operators: && || |
!TRUE | (TRUE & FALSE) short-circuit operators: && || & and | can operate on and return vectors, but && and || return scalars |
and or not |
| relational operators | == ~= > < >= <= Optional inequality operator in Octave: != |
== != > < >= <= | == != > < >= <= |
| arithmetic operators add, sub, mult, div, quot, rem |
+ - * / none mod(n, divisor) | + - * / ? %% | + - * / // % |
| integer division |
fix(13 / 5) | as.integer(13 / 5) | 13 // 5 |
| integer division by zero |
Inf NaN or -Inf | result of converting Inf or NaN to an integer with as.integer: NA |
raises ZeroDivisionError |
| float division |
13 / 5 | 13 / 5 | float(13) / 5 |
| float division by zero dividend is positive, zero, negative |
these values are literals: Inf NaN -Inf |
these values are literals: Inf NaN -Inf |
raises ZeroDivisionError |
| power | 2 ^ 16 % Octave only: 2 ** 16 |
2 ^ 16 2 ** 16 |
2 ** 16 |
| sqrt |
sqrt(2) | sqrt(2) | math.sqrt(2) |
| sqrt(-1) | % returns 0 + 1i: sqrt(-1) |
# returns NaN: sqrt(-1) # returns 0+1i: sqrt(-1+0i) |
# raises ValueError: math.sqrt(-2) # returns 1.41421j: import cmath cmath.sqrt(-2) |
| transcendental functions | exp log sin cos tan asin acos atan atan2 | exp log sin cos tan asin acos atan atan2 | math.exp math.log math.sin math.cos math.tan math.asin math.acos math.atan math.atan2 |
| transcendental constants | pi e | pi exp(1) | math.pi math.e |
| float truncation round towards zero, to nearest integer, down, up |
fix(x) round(x) floor(x) ceil(x)
as.integer(x) round(x) floor(x) ceiling(x) |
int(x) int(round(x)) math.floor(x) math.ceil(x) |
| absolute value and signum |
abs sign | abs sign | abs(-3.7) math.copysign(1, -3.7) |
| integer overflow | becomes float; largest representable integer in the variable intmax | becomes float; largest representable integer in the variable .Machine$integer.max | becomes arbitrary length integer of type long |
| float overflow |
Inf | Inf | raises OverflowError |
| float limits |
eps realmax realmin |
.Machine$double.eps .Machine$double.xmax .Machine$double.xmin |
np.finfo(np.float64).eps np.finfo(np.float64).max np.finfo(np.float64).tiny |
| complex construction | 1 + 3i | 1 + 3i | 1 + 3j |
| complex decomposition | real imag abs arg conj |
Re Im abs Arg Conj |
import cmath z.real z.imag cmath.polar(z)[1] |
| random number uniform integer, uniform float |
floor(100*rand) rand |
floor(100*runif(1)) runif(1) |
np.random.randint(0, 100) np.random.rand() |
| random seed set, get, and restore |
rand('state', 17) sd = rand('state') rand('state', sd) |
set.seed(17) sd = .Random.seed none |
np.random.seed(17) sd = np.random.get_state() np.random.set_state(sd) |
| bit operators | bitshift(100, 3) bitshift(100, -3) bitand(1, 2) bitor(1, 2) bitxor(1, 2) % MATLAB: bitcmp(1, 'uint16') % Octave: bitcmp(1, 16) |
none | 100 << 3 100 >> 3 1 & 2 1 | 2 1 ^ 2 ~1 |
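
The integer division row combines two different rounding rules: `fix` and `as.integer` truncate toward zero, while Python's `//` rounds toward negative infinity. The sketch below, written in Python 3 syntax with only the standard library, shows where the two disagree. The helper name `trunc_div` is hypothetical and is not part of the sheet.

```python
import math

def trunc_div(n, d):
    """Divide and truncate toward zero, like fix(n / d) in Octave."""
    return math.trunc(n / d)

# For positive operands the two rules agree.
print(13 // 5, trunc_div(13, 5))      # 2 2

# For a negative dividend, // floors while truncation rounds toward zero.
print(-13 // 5, trunc_div(-13, 5))    # -3 -2

# divmod keeps the identity n == q * d + r with the floored quotient.
q, r = divmod(-13, 5)
print(q, r)                           # -3 2
```

When porting `fix(a / b)` or `as.integer(a / b)` to Python, `math.trunc(a / b)` is probably the closer match whenever the operands can be negative.
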
| strings | |||
| matlab | r | numpy | |
| literal | 'don''t say "no"' Octave also has double quoted strings: "don't say \"no\"" |
"don't say \"no\"" 'don\'t say "no"' |
'don\'t say "no"' "don't say \"no\"" r"don't " r'say "no"' |
| newline in literal |
no; use \n escape | yes | no |
| literal escapes | \\ \" \' \0 \a \b \f \n \r \t \v | \\ \" \' \a \b \f \n \r \t \v \ooo | single and double quoted: \newline \\ \' \" \a \b \f \n \r \t \v \ooo \xhh |
| character access |
'hello'(1) | substr("hello", 1, 1) | 'hello'[0] |
| chr and ord | char(65) toascii('A') |
intToUtf8(65) utf8ToInt("A") |
chr(65) ord('A') |
| length |
length('hello') | nchar("hello") | len('hello') |
| concatenate | horzcat('one ', 'two ', 'three') | paste("one ", "two ", "three") | 'one ' + 'two ' + 'three' literals, but not variables, can be concatenated with juxtaposition: 'one ' "two " 'three' |
| replicate |
hbar = repmat('-', 1, 80) | hbar = paste(rep('-', 80), collapse='') | hbar = '-' * 80 |
| index of substring | counts from one, returns zero if not found index('hello', 'el') |
counts from one, returns -1 if not found regexpr("el", "hello") |
counts from zero, raises ValueError if not found: 'hello'.index('el') |
| extract substring |
substr('hello', 1, 4) | substr("hello", 1, 4) | 'hello'[0:4] |
| split | returns tuple: strsplit('foo,bar,baz',',') |
strsplit('foo,bar,baz', ',') | 'foo,bar,baz'.split(',') |
| join | paste("foo", "bar", "baz", sep=",") paste(c('foo', 'bar', 'baz'), collapse=',') |
','.join(['foo', 'bar', 'baz']) | |
| trim | strtrim(' foo ') ?? deblank('foo ') |
gsub("(^[\n\t ]+|[\n\t ]+$)", "", " foo ") sub("^[\n\t ]+", "", " foo") sub("[\n\t ]+$", "", "foo ") |
' foo '.strip() ' foo'.lstrip() 'foo '.rstrip() |
| convert from string, to string | 7 + str2num('12') 73.9 + str2num('.037') horzcat('value: ', num2str(8)) |
7 + as.integer("12") 73.9 + as.double(".037") paste("value: ", toString("8")) |
7 + int('12') 73.9 + float('.037') 'value: ' + str(8) |
| case manipulation | lower('FOO') upper('foo') |
tolower("FOO") toupper("foo") |
'foo'.upper() 'FOO'.lower() 'foo'.capitalize() |
| sprintf |
sprintf('%s: %.3f %d', 'foo', 2.2, 7) | sprintf("%s: %.3f %d", "foo", 2.2, 7) | '%s: %.3f %d' % ('foo', 2.2, 7) |
| regular expressions | |||
| matlab | r | numpy | |
| regex test | regexp('hello', '^[a-z]+$') regexp('hello', '^\S+$') |
regexpr("^[a-z]+$", "hello") > 0 regexpr('^\\S+$', "hello",perl=T) > 0 |
re.search('^[a-z]+$', 'hello') re.search('^\S+$', 'hello') |
| regex substitution | regexprep('foo bar bar','bar','baz','once') regexprep('foo bar bar','bar','baz') |
sub('bar','baz','foo bar') gsub('bar','baz','foo bar bar') |
rx = re.compile('bar') s = rx.sub('baz', 'foo bar', 1) s2 = rx.sub('baz', 'foo bar bar') |
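
As a small illustration of the Python entries in the two regex rows above, the following sketch contrasts replacing only the first match with replacing every match. It assumes only the standard `re` module; the variable names are chosen for illustration.

```python
import re

rx = re.compile(r'bar')

# The third argument (count) limits the number of replacements.
first_only = rx.sub('baz', 'foo bar bar', 1)
print(first_only)    # foo baz bar

# Without count, every match is replaced.
all_matches = rx.sub('baz', 'foo bar bar')
print(all_matches)   # foo baz baz

# re.search returns a match object or None, so it can serve as a test.
is_lower = re.search(r'^[a-z]+$', 'hello') is not None
print(is_lower)      # True
```

The count argument plays the role of the `'once'` option in `regexprep` and of the `sub` versus `gsub` distinction in R.
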
| dates and time | |||
| matlab | r | numpy | |
| current date/time |
t = now | t = as.POSIXlt(Sys.time()) | |
| date/time type | floating point number representing days since year 0 in the Gregorian calendar | POSIXlt | |
| date/time difference type | floating point number representing days | a difftime object which behaves like a floating point number representing seconds | |
| get date parts | datevec(t)(1) datevec(t)(2) datevec(t)(3) |
t$year + 1900 t$mon + 1 t$mday |
|
| get time parts | datevec(t)(4) datevec(t)(5) datevec(t)(6) |
t$hour t$min t$sec |
|
| build date/time from parts | t = datenum([2011 9 20 23 1 2]) | t = as.POSIXlt(Sys.time()) t$year = 2011 - 1900 t$mon = 9 - 1 t$mday = 20 t$hour = 23 t$min = 1 t$sec = 2 |
|
| convert to string |
datestr(t) | print(t) | |
| strptime | t = datenum('2011-09-20 23:01:02', 'yyyy-mm-dd HH:MM:SS') |
t = strptime('2011-09-20 23:01:02', '%Y-%m-%d %H:%M:%S') |
|
| strftime |
datestr(t, 'yyyy-mm-dd HH:MM:SS') | format(t, format='%Y-%m-%d %H:%M:%S') | |
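
The Python column of this section is empty in the sheet. One way to cover the same rows, offered as a sketch rather than as the sheet's own entry, is the standard-library `datetime` module. The format string and variable names below are assumptions made for illustration.

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'

# build date/time from parts
t = datetime(2011, 9, 20, 23, 1, 2)

# get date parts and time parts
print(t.year, t.month, t.day)       # 2011 9 20
print(t.hour, t.minute, t.second)   # 23 1 2

# strptime and strftime round-trip
t2 = datetime.strptime('2011-09-20 23:01:02', fmt)
print(t2 == t)                      # True
print(t.strftime(fmt))              # 2011-09-20 23:01:02

# the difference of two datetimes is a timedelta object
delta = t - datetime(2011, 9, 20)
print(delta.total_seconds())        # 82862.0
```

Unlike the R `POSIXlt` fields, the `year` and `month` attributes need no `+ 1900` or `+ 1` adjustment.
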
| tuples | |||
| matlab | r | numpy | |
| tuple literal |
tup = {1.7, 'hello', [1 2 3]} | tup = list(1.7, "hello", c(1, 2, 3)) | tup = (1.7, "hello", [1,2,3]) |
| tuple element access | tup{1} | tup[[1]] | tup[0] |
| tuple length |
length(tup) | length(tup) | len(tup) |
| arrays | |||
| matlab | r | numpy | |
| literal |
# arrays and vectors are same type a = c(1, 2, 3, 4) |
a = [1, 2, 3, 4] | |
| size |
length(a) | len(a) | |
| empty test |
length(a) == 0 | not a | |
| lookup |
# indices start at one a[1] |
# indices start at zero a[0] |
|
| update |
a[1] = "lorem" | a[0] = 'lorem' | |
| out-of-bounds behavior | a = c() # evaluates as NA: a[10] # increases array size to 10: a[10] = "lorem" |
a = [] # raises IndexError: a[10] # raises IndexError: a[10] = 'lorem' |
|
| index of element | a = c('x', 'y', 'z', 'w', 'y') # c(2, 5): which(a == 'y') |
a = ['x', 'y', 'z', 'w', 'y'] a.index('y') # 1 len(a) - 1 - a[::-1].index('y') # 4 |
|
| slice by endpoints, by length |
a = c("a", "b", "c", "d", "e") # return c("c", "d"): a[seq(3, 4)] a[seq(3, 3 + 1)] |
a = ['a', 'b', 'c', 'd', 'e'] # return ['c', 'd']: a[2:4] a[2:2 + 2] |
|
| slice to end |
tail(a, n=length(a) - 1) | a[1:] | |
| manipulate back |
none | a = [6, 7, 8] a.append(9) a.pop() |
|
| manipulate front |
none | a = [6, 7, 8] a.insert(0, 5) a.pop(0) |
|
| concatenate | a = c(1, 2, 3) a2 = append(a, c(4, 5, 6)) a = append(a, c(4, 5, 6)) |
a = [1, 2, 3] a2 = a + [4, 5, 6] a.extend([4, 5, 6]) |
|
| replicate | a = rep(NA, 10) | a = [None] * 10 a = [None for i in range(0, 10)] |
|
| copy address copy, shallow copy, deep copy |
# arrays cannot be elements of arrays a = c(1, 2, 3, 4) none a2 = a |
import copy a = [1, 2, [3, 4]] a2 = a a3 = list(a) a4 = copy.deepcopy(a) |
|
| arrays as function arguments | modifying parameter will not modify original array | parameter contains address copy; modifying parameter modifies original array | |
| iteration |
for i in [1, 2, 3]: print(i) |
||
| indexed iteration | a = ['do', 're', 'mi', 'fa'] for i, s in enumerate(a): print('%s at index %d' % (s, i)) |
||
| reverse | a = c(1, 2, 3) a2 = rev(a) a = rev(a) |
a = [1, 2, 3] a2 = a[::-1] a.reverse() |
|
| sort | a = c('b', 'A', 'a', 'B') a2 = sort(a) a = sort(a) |
a = ['b', 'A', 'a', 'B'] sorted(a) a.sort() a.sort(key=str.lower) |
|
| dedupe | a = c(1, 2, 2, 3) a2 = unique(a) a = unique(a) |
a = [1, 2, 2, 3] a2 = list(set(a)) a = list(set(a)) |
|
| membership |
7 %in% a is.element(7, a) |
7 in a | |
| intersection |
intersect(c(1, 2), c(2, 3, 4)) | {1, 2} & {2, 3, 4} | |
| union |
union(c(1, 2), c(2, 3, 4)) | {1, 2} | {2, 3, 4} | |
| relative complement, symmetric difference | setdiff(c(1, 2, 3), c(2)) union(setdiff(c(1, 2), c(2, 3, 4)), setdiff(c(2, 3, 4), c(1, 2))) |
{1, 2, 3} - {2} {1, 2} ^ {2, 3, 4} |
|
| map |
map(lambda x: x * x, [1, 2, 3]) # or use list comprehension: [x * x for x in [1, 2, 3]] |
||
| filter |
filter(lambda x: x > 1, [1, 2, 3]) # or use list comprehension: [x for x in [1, 2, 3] if x > 1] |
||
| reduce |
reduce(lambda x, y: x + y, [1, 2, 3], 0) | ||
| universal and existential tests |
all(i % 2 == 0 for i in [1, 2, 3, 4]) any(i % 2 == 0 for i in [1, 2, 3, 4]) |
||
| shuffle and sample | from random import shuffle, sample a = [1, 2, 3, 4] shuffle(a) sample(a, 2) |
||
| zip |
# array of 3 pairs: a = zip([1, 2, 3], ['a', 'b', 'c']) |
||
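
The copy row of this section distinguishes an address copy, a shallow copy, and a deep copy of a Python list. The sketch below, which assumes nothing beyond the standard `copy` module, mutates the original list and shows which copies observe the change.

```python
import copy

a = [1, 2, [3, 4]]
a2 = a                  # address copy: same list object
a3 = list(a)            # shallow copy: new outer list, shared inner list
a4 = copy.deepcopy(a)   # deep copy: nothing shared

a[0] = 'changed'        # replace a top-level element
a[2].append(5)          # mutate the nested list

print(a2)   # ['changed', 2, [3, 4, 5]]
print(a3)   # [1, 2, [3, 4, 5]]
print(a4)   # [1, 2, [3, 4]]
```

The same sharing explains the "arrays as function arguments" row: a Python function receives an address copy, so in-place changes made inside the function are visible to the caller.
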
| sequences | |||
| matlab | r | numpy | |
| range | 1:100 | 1:100 seq(1, 100) |
range(1, 101) |
| iterate over range | range replaces xrange in Python 3: for i in xrange(1, 1000001): code |
||
| instantiate range as array | a = range(1, 11) Python 3: a = list(range(1, 11)) |
||
| arithmetic sequence of integers with difference 10 | 0:10:100 | seq(0, 100, 10) | range(0, 101, 10) |
| arithmetic sequence of floats with difference 0.1 | 0:0.1:10 | seq(0, 10, 0.1) | [0.1 * x for x in range(0, 101)] 3rd arg is length of sequence, not step size: np.linspace(0, 10, 101) |
| multidimensional arrays | |||
| matlab | r | numpy | |
| 1d array literal | [1, 2, 3] commas are optional: [1 2 3] |
none | none |
| 2d array literal | [1, 2; 3, 4] spaces and newlines can replace commas and semicolons: [1 2 3 4] |
none | none |
| 3d array from 2d arrays | A = [1, 2; 3, 4] A(:,:,2) = [5, 6; 7, 8] |
||
| 1d array from elements | c(1, 2, 3) | ||
| 1d array from sequential data type | array(c(1, 2, 3)) | np.array([1, 2, 3]) np.array((1, 2, 3)) |
|
| 2d array from sequential data type | array(c(1,2,3,4),dim=c(2,2)) | np.array([1, 2, 3, 4]).reshape(2, 2) | |
| 2d array from rows | rbind(c(1, 2, 3), c(4, 5, 6)) | ||
| 2d array from columns | cbind(c(1, 4), c(2, 5), c(3, 6)) | ||
| 3d array from sequential data type | array(c(1,2,3,4,5,6,7,8),dim=c(2,2,2)) | np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(2, 2, 2) | |
| 2d array from nested sequential data types | np.array([[1, 2], [3, 4]]) | ||
| 3d array from nested sequential data types | np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) | ||
| must arrays be homogeneous | yes | yes | yes |
| array data type | always numeric | class(c(1, 2, 3)) a = array(c(1, 2, 3)) class(c(a)) |
np.array([1, 2, 3]).dtype |
| data types permitted in arrays | numeric | boolean, numeric, string | np.bool, np.int64, np.float64, np.complex128, and others |
| 1d array element access | indices start at one: [1 2 3](1) |
indices start at one: c(1, 2, 3)[1] |
indices start at zero: a = np.array([1, 2, 3]) a[0] |
| 2d array element access | [1 2; 3 4](1, 1) | a = array(c(1, 2, 3, 4), dim=c(2, 2)) a[1, 1] |
a = np.array([[1, 2], [3, 4]]) a[0][0] or a[0, 0] |
| 1d index access of higher dimensional array | returns 4: [1 2; 3 4](4) |
a = array(c(1, 2, 3, 4), dim=c(2, 2)) returns 4: a[4] |
|
| index of array element | which(c(7,8,9)==9) | ||
| array slice |
[1 2 3](1:2) | c(1,2,3)[1:2] | np.array([1,2,3])[0:2] |
| assign to array slice | |||
| integer array as index | [1 2 3]([1,3,3]) | c(1,2,3)[c(1,3,3)] | np.array([1,2,3])[[0,2,2]] |
| logical array as index | [1 2 3]([true false true]) | c(1,2,3)[c(T,F,T)] | np.array([1,2,3])[[True,False,True]] |
| array length |
length([1 2 3]) | length(c(1,2,3)) | len(np.array([1,2,3])) |
| multidimensional array size | length(dim(a)) dim(a) |
a.ndim a.shape |
|
| array concatenation | cat(2, [1 2 3], [4 5 6]) horzcat([1 2 3], [4 5 6]) |
append(c(1,2,3),c(4,5,6)) | a1 = np.array([1,2,3]) a2 = np.array([4,5,6]) np.concatenate([a1,a2]) |
| multidimensional array concatenation | m = matrix(c(1, 2, 3, 4), nrow=2) m4_by_2 = rbind(m, m) m2_by_4 = cbind(m, m) |
||
| array replication | rep("a", 100) rep(c("a", "b", "c"), c(30, 50, 90)) |
||
| sort | a = [3 1 4 2] a = sort(a) |
a = c(3,1,4,2) a = sort(a) |
a = np.array([3,1,4,2]) a.sort() |
| map | arrayfun( @(x) x*x, [1 2 3]) | sapply(c(1,2,3), function (x) { x * x}) | a = np.array([1,2,3]) np.vectorize(lambda x: x*x)(a) |
| filter | v = [1 2 3] v(v > 2) |
v = c(1,2,3) v[v > 2] |
v = np.array([1,2,3]) a = [x for x in v if x > 2] np.array(a) |
| sample w/o replacement | x = c(3,7,5,12,19,8,4) sample(x, 3) |
from random import sample sample([3,7,5,12,19,8,4], 3) |
|
| dictionaries | |||
| matlab | r | numpy | |
| literal |
d = struct('n', 10, 'avg', 3.7, 'sd', 0.4) | d = list(n=10, avg=3.7, sd=0.4) | d = {'n': 10, 'avg': 3.7, 'sd': 0.4} |
| size | length(fieldnames(d)) | length(d) | len(d) |
| lookup |
d.n | d$n | d['n'] |
| update |
d.var = d.sd**2 | d$var = d$sd**2 | d['var'] = d['sd']**2 |
| out-of-bounds behavior |
error | NULL | raises KeyError |
| is key present |
isfield(d, 'var') | !is.null(d$var) | 'var' in d |
| delete |
d = rmfield(d, 'sd') | d$sd = NULL | del(d['sd']) |
| iterate | for k, v in d.iteritems(): code |
||
| keys and values as arrays | d.keys() d.values() |
||
| merge | d1 = list(a=1, b=2) d2 = list(b=3, c=4) values of first dictionary take precedence: d3 = c(d1, d2) |
d1 = {'a':1, 'b':2} d2 = {'b':3, 'c':4} d1.update(d2) |
|
| invert | to_num = {'t':1, 'f':0} to_let = {v:k for k, v in to_num.items()} |
||
| sort by values | from operator import itemgetter pairs = sorted(d.iteritems(), key=itemgetter(1)) for k, v in pairs: print('{}: {}'.format(k, v)) |
||
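
The merge, invert, and sort rows above use Python 2 spellings such as `iteritems`. As a hedged sketch, the same operations in Python 3 syntax look like this; the printed key order assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
from operator import itemgetter

d1 = {'a': 1, 'b': 2}
d2 = {'b': 3, 'c': 4}

# update() modifies d1 in place; on a shared key the value from d2 wins.
d1.update(d2)
print(d1)       # {'a': 1, 'b': 3, 'c': 4}

# Invert with a comprehension (values must be unique and hashable).
to_num = {'t': 1, 'f': 0}
to_let = {v: k for k, v in to_num.items()}
print(to_let)   # {1: 't', 0: 'f'}

# Sort by values; items() is the Python 3 spelling of iteritems().
pairs = sorted(d1.items(), key=itemgetter(1))
print(pairs)    # [('a', 1), ('b', 3), ('c', 4)]
```

Note the contrast with the R entry in the merge row: with `c(d1, d2)` a lookup of the shared name returns the value from the first list, whereas `update` keeps the value from the second dictionary.
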
| functions | |||
| matlab | r | numpy | |
| definition | function add(a,b) a+b endfunction |
add = function(a,b) {a + b} | |
| invocation |
add(3, 7) | add(3, 7) | |
| return value | how to declare a return variable: function retvar = add(a,b) retvar = a + b endfunction the return value is the value assigned to the return variable if one is defined; otherwise it's the last expression evaluated. |
return argument or last expression evaluated. NULL if return called without an argument. | |
| function value |
@add | add | |
| anonymous function | @(a,b) a+b | function(a,b) {a+b} | |
| missing argument | raises error if code with the parameter that is missing an argument is executed | raises error | |
| extra argument |
ignored | raises error | |
| default argument | function mylog(x, base=10) log(x) / log(base) endfunction |
mylog = function(x,base=10) { log(x) / log(base) } |
|
| variable number of arguments | function s = add(varargin) if nargin == 0 s = 0 else r = add(varargin{2:nargin}) s = varargin{1} + r endif endfunction |
add = function (...) { a = list(...) if (length(a) == 0) return(0) s = 0 for(i in 1:length(a)) { s = s + a[[i]] } return(s) } |
|
| execution control | |||
| matlab | r | numpy | |
| if | if (x > 0) printf('positive\n') elseif (x < 0) printf('negative\n') else printf('zero\n') endif |
if (x > 0) { print('positive') } else if (x < 0) { print('negative') } else { print('zero') } |
if x > 0: print('positive') elif x < 0: print('negative') else: print('zero') |
| while | i = 0 while (i < 10) i++ printf('%d\n', i) endwhile |
while (i < 10) { i = i + 1 print(i) } |
while i < 10: i += 1 print(i) |
| for | for i = 1:10 printf('%d\n', i) endfor |
for (i in 1:10) { print(i) } |
for i in range(1,11): print(i) |
| break/continue |
break continue | break next | break continue |
| raise exception |
error('%s', 'failed') | stop('failed') | raise Exception('failed') |
| handle exception | try error('failed') catch printf('%s\n', lasterr()) end_try_catch |
tryCatch( stop('failed'), error=function(e) print(conditionMessage(e))) |
try: raise Exception('failed') except Exception as e: print(e) |
| finally block | unwind_protect if ( rand > 0.5 ) error('failed') endif unwind_protect_cleanup printf('cleanup') end_unwind_protect |
risky = function() { if (runif(1) > 0.5) { stop('failed') } } tryCatch( risky(), finally=print('cleanup')) |
|
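
The finally block row has no Python entry in the sheet. A minimal sketch of the equivalent construct is `try` with `finally`; the functions `risky` and `run` are hypothetical names, and a boolean flag replaces the random failure of the Octave and R examples so that the behavior is deterministic.

```python
def risky(fail):
    """Raise an exception on request, standing in for a random failure."""
    if fail:
        raise Exception('failed')

def run(fail):
    log = []
    try:
        try:
            risky(fail)
        finally:
            # Runs whether or not risky() raised; the exception then propagates.
            log.append('cleanup')
    except Exception as e:
        log.append('caught: %s' % e)
    return log

print(run(False))   # ['cleanup']
print(run(True))    # ['cleanup', 'caught: failed']
```

As with `unwind_protect` in Octave and `finally=` in `tryCatch`, the cleanup code does not swallow the error; the outer `except` is there only to show the order of events.
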
| file handles | |||
| matlab | r | numpy | |
| standard file handles | stdin stdout stderr | stdin() stdout() stderr() | sys.stdin sys.stdout sys.stderr |
| read line from stdin | line = input("", "s") | line = readLines(n=1) | line = sys.stdin.readline() |
| write line to stdout | puts("hello\n") | cat("hello\n") writeLines("hello") |
print('hello') |
| write formatted string to stdout | printf("%.2f\n", pi) | cat(sprintf("%.2f\n", pi)) | import math print('%.2f' % math.pi) |
| open file for reading | if ((f = fopen("/etc/hosts")) == -1) error("failed to open file") endif |
f = file("/etc/hosts", "r") | f = open('/etc/hosts') |
| open file for writing | if ((f = fopen("/tmp/test", "w")) == -1) error("failed to open file") endif |
f = file("/tmp/test", "w") | f = open('/tmp/test', 'w') |
| open file for appending | if ((f = fopen("/tmp/err.log", "a")) == -1) error("failed to open file") endif |
f = file("/tmp/err.log", "a") | f = open('/tmp/err.log', 'a') |
| close file | fclose(f) | close(f) | f.close() |
| i/o errors | fopen returns -1; fclose throws an error | raise IOError exception | |
| read line | line = fgets(f) | line = readLines(f, n=1) | line = f.readline() |
| iterate over file by line | while(!feof(f)) line = fgets(f) puts(line) endwhile |
for line in f: print(line) |
|
| read file into array of strings | lines = readLines(f) | lines = f.readlines() | |
| write string | fputs(f, "lorem ipsum") | cat("lorem ipsum", file=f) | f.write('lorem ipsum') |
| write line | fputs(f, "lorem ipsum\n") | writeLines("lorem ipsum", con=f) | f.write('lorem ipsum\n') |
| flush file handle | fflush(f) | flush(f) | f.flush() |
| file handle position get, set |
ftell(f) % 3rd arg can be SEEK_CUR or SEEK_END fseek(f, 0, SEEK_SET) |
seek(f) # sets seek point to start of file; # origin can also be "current" or "end" seek(f, where=0, origin="start") |
f.tell() f.seek(0) |
| redirect stdout to file | sink("foo.txt") | ||
| directories | |||
| matlab | r | numpy | |
| working directory get, set |
pwd cd("/tmp") |
getwd() setwd("/tmp") |
os.path.abspath('.') os.chdir('/tmp') |
| build pathname | file.path("/etc", "hosts") | os.path.join('/etc', 'hosts') | |
| dirname and basename | dirname("/etc/hosts") basename("/etc/hosts") |
os.path.dirname('/etc/hosts') os.path.basename('/etc/hosts') |
|
| absolute pathname | normalizePath("..") | os.path.abspath('..') | |
| processes and environment | |||
| matlab | r | numpy | |
| command line arguments | % does not include interpreter name: argv() |
# first arg is name of interpreter: commandArgs() # arguments after --args only: commandArgs(TRUE) |
sys.argv |
| environment variable get, set |
getenv("HOME") setenv("PATH", "/bin") |
Sys.getenv("HOME") Sys.setenv(PATH="/bin") |
os.getenv('HOME') os.environ['PATH'] = '/bin' |
| exit |
exit(0) | quit(save="no", status=0) | sys.exit(0) |
| external command | if (shell_cmd("ls -l /tmp")) error("ls failed") endif |
if (system("ls -l /tmp")) { stop("ls failed") } |
if os.system('ls -l /tmp'): raise Exception('ls failed') |
| libraries and namespaces | |||
| matlab | r | numpy | |
| load library | % if installed as Octave package: pkg load foo |
require("foo") or library("foo") |
import foo |
| list loaded libraries | none | search() | dir() |
| library search path | path() addpath('~/foo') rmpath('~/foo') |
.libPaths() | sys.path |
| source file |
source('foo.m') | source("foo.r") | none |
| install package | % installs packages downloaded from % Octave-Forge in Octave: pkg install foo-1.0.0.tar.gz |
install.packages("ggplot2") | $ pip install scipy |
| list installed packages | pkg list | library() | $ pip freeze |
| reflection | |||
| matlab | r | numpy | |
| data type | class(x) | class(x) | type(x) |
| attributes | if x is an object value: x |
attributes(x) | [m for m in dir(x) if not callable(getattr(x,m))] |
| methods | note that most values are not objects: methods(x) |
none; objects are implemented by functions which dispatch based on type of first arg | [m for m in dir(x) if callable(getattr(x,m))] |
| variables in scope | who() | objects() | dir() |
| undefine variable |
clear('x') | rm(x) | del(x) |
| undefine all variables | clear -a | rm(list=objects()) | |
| eval |
eval('1+1') | eval(parse(text='1+1')) | eval('1+1') |
| function documentation | help tan | help(tan) ?tan |
math.tan.__doc__ |
| list library functions | none | ls("package:moments") | dir(stats) |
| search documentation | not in Octave: docsearch tan |
??tan | $ pydoc -k tan |
| ordered dictionaries | |||
| matlab | r | numpy | |
| constructor | d = pd.Series([3, 5, 7], index=['a', 'b', 'c']) | ||
| aligned arithmetic | d1 = pd.Series([3, 5, 7], index=['a', 'b', 'c']) d2 = pd.Series([1, 2], index=['c', 'b']) d3 = d1 + d2 values of d3 are: 'a': NaN, 'b': 7, 'c': 8 |
||
| aligned arithmetic with fill value | d1 = pd.Series([3, 5, 7], index=['a', 'b', 'c']) d2 = pd.Series([1, 2], index=['c', 'b']) d3 = d1.add(d2, fill_value=0) values of d3 are: 'a': 3, 'b': 7, 'c': 8 |
||
| reindex | d = pd.Series([1, 2], index=['a', 'b']) d2 = d.reindex(['c', 'b', 'a']) values of d2 are: 'c': NaN, 'b': 2, 'a': 1 |
||
| reindex with fill value | |||
| data sets | |||
| r | numpy | ||
| construct from column arrays | gender, height, weight of some people in inches and lbs: sx = c("F","F","F","F","M","M") ht = c(69,64,67,66,72,70) wt = c(150,132,142,139,167,165) people = data.frame(sx, ht, wt) |
sx = ['F', 'F', 'F', 'F', 'M', 'M'] ht = [69, 64, 67, 66, 72, 70] wt = [150, 132, 142, 139, 167, 165] people = pd.DataFrame({'sx': sx, 'ht': ht, 'wt': wt}) |
|
| construct from row tuples | |||
| categorical variable column | by default any column with character data is a factor | ||
| index column | |||
| column names as array | names(people) colnames(people) |
returns Index object: people.columns |
|
| access column as array | vectors: people$ht people[,2] people[['ht']] people[[2]] 1 column data set: people[2] |
people['ht'] if name does not conflict with any DataFrame attributes: people.ht |
|
| access row as tuple | 1 row data set: people[1,] list: as.list(people[1,]) |
people.ix[0] | |
| access datum | datum in 1st row, 2nd column: people[1,2] |
people.get_value(0, 'ht') | |
| order rows by column | people[order(people$ht),] | ||
| order rows by multiple columns | people[order(sx, ht),] | ||
| order rows in descending order | people[order(-people$ht),] | ||
| limit rows | people[seq(1, 3),] | ||
| offset rows | people[seq(4, 6),] | ||
| attach columns | copy columns into variables named sx, ht and wt: attach(people) |
none | |
| detach columns | detach(people) | none | |
| spreadsheet editor | can edit data, in which case return value of edit must be saved people = edit(people) |
none | |
| import and export | |||
| r | numpy | ||
| import tab delimited | # first row defines variable names: df = read.delim('/path/to.tab') |
# first row defines column names: df = pd.read_table('/path/to.tab') |
|
| import csv |
# first row defines variable names: df = read.csv('/path/to.csv') |
# first row defines column names: df = pd.read_csv('/path/to.csv') |
|
| set column separator | df = read.delim('/etc/passwd', sep=':', header=FALSE, comment.char='#') |
# $ grep -v '^#' /etc/passwd > /tmp/passwd df = pd.read_table('/tmp/passwd', sep=':', header=None) |
|
| set column separator to whitespace | df = read.delim('/path/to.txt', sep='') | df = pd.read_table('/path/to.txt', sep='\s+') | |
| set quote character | default quote character for both read.csv and read.delim is double quotes. The quote character is escaped by doubling it. # use single quote as quote character: df = read.csv('/path/to/single-quote.csv', quote="'") # no quote character: df = read.csv('/path/to/no-quote.csv', quote="") |
Both read_table and read_csv use double quotes as the quote character and there is no way to change it. A double quote can be escaped by doubling it. | |
| import file w/o header | # column names are V1, V2, … read.delim('/etc/passwd', sep=':', header=FALSE, comment.char='#') |
# $ grep -v '^#' /etc/passwd > /tmp/passwd # # column names are X0, X1, … df = pd.read_table('/tmp/passwd', sep=':', header=None) |
|
| set column names | df = read.csv('/path/to/no-header.csv', header=FALSE, col.names=c('ht', 'wt', 'age')) |
df = pd.read_csv('/path/to/no-header.csv', names=['ht', 'wt', 'age']) |
|
| set column types | # possible values: NA, 'logical', 'integer', 'numeric', # 'complex', 'character', 'raw', 'factor', 'Date', # 'POSIXct' # # If type is set to NA, actual type will be inferred to be # 'logical', 'integer', 'numeric', 'complex', or 'factor' # df = read.csv('/path/to/data.csv', colClasses=c('integer', 'numeric', 'character')) |
||
| recognize null values | df = read.csv('/path/to/data.csv', colClasses=c('integer', 'logical', 'character'), na.strings=c('nil')) |
df = pd.read_csv('/path/to/data.csv', na_values=['nil']) |
|
| change decimal mark | df = read.csv('/path/to.csv', dec=',') | ||
| recognize thousands separator | none | df = pd.read_csv('/path/to.csv', thousands='.') | |
| unequal row length behavior | Missing fields will be set to NA unless fill is set to FALSE. If the column is of type character then the fill value is an empty string ''. If there are extra fields they will be parsed as an extra row unless flush is set to FALSE |
||
| skip comment lines | df = read.delim('/etc/passwd', sep=':', header=FALSE, comment.char='#') |
none | |
| skip rows | df = read.csv('/path/to/data.csv', skip=4) | df = pd.read_csv('/path/to/data.csv', skiprows=4) # rows to skip can be specified individually: df = pd.read_csv('/path/to/data.csv', skiprows=range(0, 4)) |
|
| max rows to read | df = read.csv('/path/to/data.csv', nrows=4) | df = pd.read_csv('/path/to/data.csv', nrows=4) | |
| index column | none | df = pd.read_csv('/path/to.csv', index_col='key_col') # hierarchical index: df = pd.read_csv('/path/to.csv', index_col=['col1', 'col2']) |
|
| export tab delimited | write.table(df, '/tmp/data.tab', sep='\t') | ||
| export csv |
# first column contains row names unless row.names # set to FALSE write.csv(df, '/path/to.csv', row.names=F) |
||
| relational algebra | |||
| matlab | r | numpy | |
| project columns by name | people[c('sx', 'ht')] | people[['sx', 'ht']] | |
| project columns by position | people[c(1, 2)] | ||
| project expression | convert to cm and kg: transform(people, ht=2.54*ht, wt=wt/2.2) |
||
| project all columns | people[people$ht > 66,] | ||
| rename columns | colnames(people) = c('gender', 'height', 'weight') | ||
| access sub data set | data set of first 3 rows with ht and wt columns reversed people[1:3,c(1,3,2)] |
||
| select rows | subset(people, ht > 66) people[people$ht > 66,] |
people[people['ht'] > 66] | |
| select distinct rows | |||
| split rows | |||
| inner join | pw = read.delim('/etc/passwd', sep=':', header=F, comment.char='#', col.names=c('name', 'passwd', 'uid', 'gid', 'gecos', 'home', 'shell')) grp = read.delim('/etc/group', sep=':', header=F, comment.char='#', col.names=c('name', 'passwd', 'gid', 'members')) merge(pw, grp, by.x='gid', by.y='gid') |
# $ grep -v '^#' /etc/passwd > /tmp/passwd # $ grep -v '^#' /etc/group > /tmp/group pw = pd.read_table('/tmp/passwd', sep=':', header=None, names=['name', 'passwd', 'uid', 'gid', 'gecos', 'home', 'shell']) grp = pd.read_table('/tmp/group', sep=':', header=None, names=['name', 'passwd', 'gid', 'members']) pd.merge(pw, grp, left_on='gid', right_on='gid') |
|
| nulls as join values | |||
| left join | merge(pw, grp, by.x='gid', by.y='gid', all.x=T) | pd.merge(pw, grp, left_on='gid', right_on='gid', how='left') | |
| full join | merge(pw, grp, by.x='gid', by.y='gid', all=T) | pd.merge(pw, grp, left_on='gid', right_on='gid', how='outer') | |
| antijoin | pw[!(pw$gid %in% grp$gid), ] | ||
| cross join | merge(pw, grp, by=c()) | ||
| aggregation | |||
| matlab | r | numpy | |
| row count | length(pw[, 1]) | len(pw) | |
| group by column | install.packages('data.table') library('data.table') # convert from data.frame to data.table: people_dt = data.table(people) people_dt[,max(ht),by=sx] |
grouped = people.groupby('sx') grouped.aggregate(np.max)['ht'] |
|
| multiple aggregated values | grouped = people.groupby('sx') grouped.aggregate(np.max)[['ht', 'wt']] |
||
| group by multiple columns | |||
| aggregation functions | length sum min max mean sd | ||
| nulls and aggregation functions | value of sum, min, max, mean, and sd is NA if any of the values is NA | ||
| rank | |||
| quantile | |||
| having | |||
| vectors | |||
| matlab | r | numpy | |
| vector literal | same as array | same as array | same as array |
| element-wise arithmetic operators | + - .* ./ | + - * / | + - * / |
| result of vector length mismatch | raises error | values in shorter vector are recycled; warning if one vector is not a multiple length of the other | raises ValueError |
| scalar multiplication | 3 * [1, 2, 3] [1, 2, 3] * 3 |
3 * c(1, 2, 3) c(1, 2, 3) * 3 |
3 * np.array([1, 2, 3]) np.array([1, 2, 3]) * 3 |
| dot product | dot([1, 1, 1], [2, 2, 2]) | c(1, 1, 1) %*% c(2, 2, 2) | v1 = np.array([1, 1, 1]) v2 = np.array([2, 2, 2]) np.dot(v1, v2) |
| cross product | cross([1, 0, 0], [0, 1, 0]) | v1 = np.array([1, 0, 0]) v2 = np.array([0, 1, 0]) np.cross(v1, v2) |
|
| norms | norm([1, 2, 3], 1) norm([1, 2, 3], 2) norm([1, 2, 3], Inf) |
vnorm = function(x, t) { norm(matrix(x, ncol=1), t) } vnorm(c(1, 2, 3), "1") vnorm(c(1, 2, 3), "E") vnorm(c(1, 2, 3), "I") |
v = np.array([1, 2, 3]) np.linalg.norm(v, 1) np.linalg.norm(v, 2) np.linalg.norm(v, np.inf) |
| matrices | |||
| matlab | r | numpy | |
| literal or constructor | row contiguous: A = [1, 2; 3, 4] B = [4 3; 2 1] |
column contiguous: A = matrix(c(1, 3, 2, 4), 2, 2) B = matrix(c(4, 2, 3, 1), nrow=2) |
row contiguous: A = np.matrix([[1, 2], [3, 4]]) B = np.matrix([[4, 3], [2, 1]]) |
| zero, identity, ones, diagonal matrix | zeros(3, 3) or zeros(3) eye(3) ones(3, 3) or ones(3) diag([1, 2, 3]) |
matrix(0, 3, 3) diag(3) matrix(1, 3, 3) diag(c(1, 2, 3)) |
|
| dimensions | rows(A) columns(A) |
dim(A)[1] dim(A)[2] |
|
| element access | A(1, 1) | A[1, 1] | A[0, 0] |
| row access | A(1, 1:2) | A[1,] | A[0] |
| column access | A(1:2, 1) | A[, 1] | |
| submatrix access | C = [1, 2, 3; 4, 5, 6; 7, 8, 9] C(1:2, 1:2) |
C = matrix(seq(1, 9), 3, 3, byrow=T) C[1:2, 1:2] |
|
| scalar multiplication | 3 * A A * 3 also: 3 .* A A .* 3 |
3 * A A * 3 |
3 * A A * 3 |
| element-wise operators | .+ .- .* ./ | + - * / | + - np.multiply() np.divide() |
| multiplication | A * B | A %*% B | A * B |
| power | A ** 3 | ||
| kronecker product | kron(A, B) | kronecker(A, B) | np.kron(A, B) |
| comparison | all(all(A==B)) any(any(A!=B)) |
all(A==B) any(A!=B) |
|
| norms | norm(A, 1) norm(A, 2) norm(A, Inf) norm(A, 'fro') |
norm(A, "1") ?? norm(A, "I") norm(A, "F") |
|
| transpose | transpose(A) A' |
t(A) | A.transpose() |
| conjugate transpose | A = [1i, 2i; 3i, 4i] A' |
A = matrix(c(1i, 2i, 3i, 4i), nrow=2, byrow=T) Conj(t(A)) |
A = np.matrix([[1j, 2j], [3j, 4j]]) A.conj().transpose() |
| inverse | inv(A) | solve(A) | np.linalg.inv(A) |
| determinant | det(A) | det(A) | np.linalg.det(A) |
| trace | trace(A) | sum(diag(A)) | A.trace() |
| eigenvalues | eig(A) | eigen(A)$values | np.linalg.eigvals(A) |
| eigenvectors | [evec, eval] = eig(A) evec(1:2) evec(3:4) |
eigen(A)$vectors | np.linalg.eig(A)[1] |
| system of equations | A \ [2;3] | solve(A, c(2, 3)) | np.linalg.solve(A, [2, 3]) |
| statistics | |||
| matlab | r | numpy | |
| first moment statistics | x = [1 2 3 8 12 19] sum(x) mean(x) |
x = c(1,2,3,8,12,19) sum(x) mean(x) |
x = [1,2,3,8,12,19] sp.sum(x) sp.mean(x) |
| second moment statistics | std(x, 1) var(x, 1) |
n = length(x) sd(x) * sqrt((n-1)/n) var(x) * (n-1)/n |
sp.std(x) sp.var(x) |
| second moment statistics for samples | std(x) var(x) |
sd(x) var(x) |
n = float(len(x)) sp.std(x) * math.sqrt(n/(n-1)) sp.var(x) * n/(n-1) |
| skewness | Octave uses sample standard deviation to compute skewness: skewness(x) |
install.packages('moments') library('moments') skewness(x) |
stats.skew(x) |
| kurtosis | Octave uses sample standard deviation to compute kurtosis: kurtosis(x) |
install.packages('moments') library('moments') kurtosis(x) - 3 |
stats.kurtosis(x) |
| nth moment and nth central moment | n = 5 moment(x, n) moment(x, n, "c") |
install.packages('moments') library('moments') n = 5 moment(x, n) moment(x, n, central=T) |
n = 5 ?? stats.moment(x, n) |
| mode | mode([1 2 2 2 3 3 4]) | samp = c(1,2,2,2,3,3,4) names(sort(-table(samp)))[1] |
stats.mode([1,2,2,2,3,3,4])[0][0] |
| quantile statistics | min(x) median(x) max(x) ? |
min(x) median(x) max(x) quantile(x, prob=.90) |
min(x) sp.median(x) max(x) stats.scoreatpercentile(x, 90.0) |
| bivariate statistics | x = [1 2 3] y = [2 4 7] cor(x, y) cov(x, y) |
x = c(1,2,3) y = c(2,4,7) cor(x, y) cov(x, y) |
x = [1,2,3] y = [2,4,7] stats.linregress(x, y)[2] ?? |
| frequency table | x = c(1,2,1,1,2,5,1,2,7) tab = table(x) |
||
| invert frequency table | rep(as.integer(names(tab)), unname(tab)) |
||
| bin | x = c(1.1, 3.7, 8.9, 1.2, 1.9, 4.1) xf = cut(x, breaks=c(0, 3, 6, 9)) bins = tapply(x, xf, length) |
||
| linear regression and curve fitting | |||
| matlab | r | numpy | |
| linear regression y = ax + b | x = [1 2 3] y = [2 4 7] [lsq, res] = polyfit(x, y, 1) a = lsq(1) b = lsq(2) y - (a*x+b) |
x = c(1,2,3) y = c(2,4,7) lsq = lm(y ~ x) a = lsq$coefficients[2] b = lsq$coefficients[1] lsq$residuals |
x = np.array([1,2,3]) y = np.array([2,4,7]) lsq = stats.linregress(x, y) a = lsq[0] b = lsq[1] y - (a*x+b) |
| distributions | |||
| matlab | r | numpy | |
| empirical density function | |||
| empirical cumulative distribution | F is a right-continuous step function: F = ecdf(rnorm(100)) |
||
| empirical quantile function | F = ecdf(rnorm(100)) Finv = ecdf(F(seq(0, 1, .01))) |
||
| binomial | binopdf(x, n, p) binocdf(x, n, p) binoinv(y, n, p) binornd(n, p) |
dbinom(x, n, p) pbinom(x, n, p) qbinom(y, n, p) rbinom(1, n, p) |
stats.binom.pmf(x, n, p) stats.binom.cdf(x, n, p) stats.binom.ppf(y, n, p) stats.binom.rvs(n, p) |
| poisson | poisspdf(x, lambda) poisscdf(x, lambda) poissinv(y, lambda) poissrnd(lambda) |
dpois(x, lambda) ppois(x, lambda) qpois(y, lambda) rpois(1, lambda) |
stats.poisson.pmf(x, lambda) stats.poisson.cdf(x, lambda) stats.poisson.ppf(y, lambda) stats.poisson.rvs(lambda, size=1) |
| normal | normpdf(x, mu, sigma) normcdf(x, mu, sigma) norminv(y, mu, sigma) normrnd(mu, sigma) |
dnorm(x, mu, sigma) pnorm(x, mu, sigma) qnorm(y, mu, sigma) rnorm(1, mu, sigma) |
stats.norm.pdf(x, mu, sigma) stats.norm.cdf(x, mu, sigma) stats.norm.ppf(y, mu, sigma) stats.norm.rvs(mu, sigma) |
| gamma | gampdf(x, k, theta) gamcdf(x, k, theta) gaminv(y, k, theta) gamrnd(k, theta) |
dgamma(x, k, scale=theta) pgamma(x, k, scale=theta) qgamma(y, k, scale=theta) rgamma(1, k, scale=theta) |
stats.gamma.pdf(x, k, scale=theta) stats.gamma.cdf(x, k, scale=theta) stats.gamma.ppf(y, k, scale=theta) stats.gamma.rvs(k, scale=theta) |
| exponential | exppdf(x, lambda) expcdf(x, lambda) expinv(y, lambda) exprnd(lambda) |
dexp(x, lambda) pexp(x, lambda) qexp(y, lambda) rexp(1, lambda) |
stats.expon.pdf(x, scale=1.0/lambda) stats.expon.cdf(x, scale=1.0/lambda) stats.expon.ppf(y, scale=1.0/lambda) stats.expon.rvs(scale=1.0/lambda) |
| chi-squared | chi2pdf(x, nu) chi2cdf(x, nu) chi2inv(y, nu) chi2rnd(nu) |
dchisq(x, nu) pchisq(x, nu) qchisq(y, nu) rchisq(1, nu) |
stats.chi2.pdf(x, nu) stats.chi2.cdf(x, nu) stats.chi2.ppf(y, nu) stats.chi2.rvs(nu) |
| beta | betapdf(x, alpha, beta) betacdf(x, alpha, beta) betainv(y, alpha, beta) betarnd(alpha, beta) |
dbeta(x, alpha, beta) pbeta(x, alpha, beta) qbeta(y, alpha, beta) rbeta(1, alpha, beta) |
stats.beta.pdf(x, alpha, beta) stats.beta.cdf(x, alpha, beta) stats.beta.ppf(y, alpha, beta) stats.beta.rvs(alpha, beta) |
| uniform | unifpdf(x, a, b) unifcdf(x, a, b) unifinv(y, a, b) unifrnd(a, b) |
dunif(x, a, b) punif(x, a, b) qunif(y, a, b) runif(1, a, b) |
stats.uniform.pdf(x, a, b-a) stats.uniform.cdf(x, a, b-a) stats.uniform.ppf(y, a, b-a) stats.uniform.rvs(a, b-a) |
| Student's t | dt(x, nu) pt(x, nu) qt(y, nu) rt(1, nu) |
stats.t.pdf(x, nu) stats.t.cdf(x, nu) stats.t.ppf(y, nu) stats.t.rvs(nu) |
|
| Snedecor's F | df(x, d1, d2) pf(x, d1, d2) qf(y, d1, d2) rf(1, d1, d2) |
stats.f.pdf(x, d1, d2) stats.f.cdf(x, d1, d2) stats.f.ppf(y, d1, d2) stats.f.rvs(d1, d2) |
|
| univariate charts | |||
| matlab | r | matplotlib | |
vertical bar chart |
bar([7 3 8 5 5]) | cnts = c(7,3,8,5,5) names(cnts) = c("a","b","c","d","e") barplot(cnts) x = floor(6*runif(100)) barplot(table(x)) |
cnts = [7,3,8,5,5] plt.bar(range(0,len(cnts)), cnts) |
horizontal bar chart |
barh([7 3 8 5 5]) | cnts = c(7,3,8,5,5) names(cnts) = c("a","b","c","d","e") barplot(cnts, horiz=T) |
cnts = [7,3,8,5,5] plt.barh(range(0,len(cnts)), cnts) |
pie chart |
labels = {'a','b','c','d','e'} pie([7 3 8 5 5], labels) |
cnts = c(7,3,8,5,5) names(cnts) = c("a","b","c","d","e") pie(cnts) |
cnts = [7,3,8,5,5] labs = ['a','b','c','d','e'] plt.pie(cnts, labels=labs) |
dot plot |
stripchart(floor(10*runif(50)), method="stack", offset=1, pch=19) |
||
stem plot |
generates an ascii chart: stem(20*rnorm(100)) |
||
histogram |
hist(randn(1, 100), 10) | hist(rnorm(100), breaks=10) | plt.hist(sp.randn(100), bins=range(-5,5)) |
box plot |
boxplot(rnorm(100)) boxplot(rnorm(100), rexp(100), runif(100)) |
plt.boxplot(sp.randn(100)) plt.boxplot([sp.randn(100), np.random.uniform(size=100), np.random.exponential(size=100)]) |
|
| chart title | bar([7 3 8 5 5]) title('bar chart example') |
all chart functions except for stem accept a main parameter: boxplot(rnorm(100), main="boxplot example", sub="to illustrate options") |
plt.boxplot(sp.randn(100)) plt.title('boxplot example') |
| bivariate charts | |||
| matlab | r | matplotlib | |
stacked bar chart |
d = [7 1; 3 2; 8 1; 5 3; 5 1] bar(d, 'stacked') |
d = matrix(c(7,1,3,2,8,1,5,3,5,1), nrow=2) labels = c("a","b","c","d","e") barplot(d,names.arg=labels) |
a1 = [7,3,8,5,5] a2 = [1,2,1,3,1] plt.bar(range(0,5), a1, color='r') plt.bar(range(0,5), a2, color='b', bottom=a1) |
grouped bar chart |
d = [7 1; 3 2; 8 1; 5 3; 5 1] bar(d) |
d = matrix(c(7,1,3,2,8,1,5,3,5,1), nrow=2) labels = c("a","b","c","d","e") barplot(d,names.arg=labels,beside=TRUE) |
|
scatter plot |
plot(randn(1,50),randn(1,50),'+') | plot(rnorm(50), rnorm(50)) | plt.scatter(sp.randn(50), sp.randn(50)) |
hexagonal binning |
install.packages('hexbin') library('hexbin') plot(hexbin(rnorm(1000), rnorm(1000), xbins=12)) |
plt.hexbin(sp.randn(1000), sp.randn(1000), gridsize=12) |
|
linear regression line |
x = 0:20 y = 2 * x + rnorm(21)*10 fit = lm(y ~ x) plot(x, y) lines(x, fit$fitted.values, type='l') |
x = range(0,20) err = sp.randn(20)*10 y = [2*i for i in x] + err A = np.vstack([x,np.ones(len(x))]).T m, c = np.linalg.lstsq(A, y)[0] plt.scatter(x, y) plt.plot(x, [m*i + c for i in x]) |
|
polygonal line plot |
plot(1:20,randn(1,20)) | plot(1:20, rnorm(20), type="l") | plt.plot(range(0,20), sp.randn(20)) |
cubic spline |
f = splinefun(rnorm(20)) x = seq(1, 20, .1) plot(x, f(x), type="l") |
||
function plot |
fplot(@sin, [-4 4]) | x = seq(-4, 4, .01) plot(x, sin(x), type="l") |
|
quantile-quantile plot |
qqplot(runif(50),rnorm(50)) lines(c(-9,9), c(-9,9), col="red") |
||
| axis labels | plot( 1:20, (1:20) .** 2) xlabel('x') ylabel('x squared') |
plot(1:20, (1:20)^2, xlab="x", ylab="x squared") |
|
| axis limits | plot( 1:20, (1:20) .** 2) axis([1 20 -200 500]) |
plot(1:20, (1:20)^2, xlim=c(0, 20), ylim=c(-200,500)) |
|
| logarithmic y-axis | semilogy(x, x .** 2, x, x .** 3, x, x .** 4, x, x .** 5) |
x = 0:20 plot(x, x^2, log="y",type="l") lines(x, x^3, col="blue") lines(x, x^4, col="green") lines(x, x^5, col="red") |
x = range(0, 20) y = [] for i in [2,3,4,5]: y.append([j**i for j in x]) for i in [0,1,2,3]: plt.semilogy(x, y[i]) |
| multivariate charts | |||
| matlab | r | matplotlib | |
additional line set |
plot(1:20, randn(1, 20), 1:20, randn(1, 20)) optional method: plot(1:20, randn(1, 20)) hold on plot(1:20, randn(1, 20)) |
plot(1:20, rnorm(20), type="l") lines(1:20, rnorm(20), col="red") |
|
legend |
x = (1:20) y = x + rnorm(20) y2 = x - 2 + rnorm(20) plot(x, y, type="l", col="black") lines(x, y2, type="l", col="red") legend(1, 15, c('first', 'second'), lty=c(1,1), lwd=c(2.5, 2.5), col=c('black', 'red')) |
||
additional point set |
plot(rnorm(20), rnorm(20)) points(rnorm(20), rnorm(20), col='red') |
||
stacked area chart |
x = rep(0:4, each=3) y = round(5 * runif(15)) letter = rep(LETTERS[1:3], 5) df = data.frame(x, y, letter) p = ggplot(df, aes(x=x, y=y, group=letter, fill=letter)) p + geom_area(position='stack') |
||
overlapping area chart |
x = rep(0:4, each=3) y = round(5 * runif(15)) letter = rep(LETTERS[1:3], 5) df = data.frame(x, y, letter) alpha = rep(I(2/10), each=15) p = ggplot(df, aes(x=x, ymin=0, ymax=y, group=letter, fill=letter, alpha=alpha)) p + geom_ribbon() |
||
3d scatter plot |
install.packages('scatterplot3d') library('scatterplot3d') scatterplot3d(rnorm(50), rnorm(50), rnorm(50), type="h") |
||
bubble chart |
df = data.frame(x=rnorm(20), y=rnorm(20), z=rnorm(20)) p = ggplot(df, aes(x=x, y=y, size=z)) p + geom_point() |
||
scatter plot matrix |
x = rnorm(20) y = rnorm(20) z = x + 3*y w = y + 0.1*rnorm(20) df = data.frame(x, y, z, w) pairs(df) |
||
contour plot |
m = matrix(0, 100, 100) for (i in 2:100) { for (j in 2:100) { m[i,j] = (m[i-1,j] + m[i,j-1])/2 + runif(1) - 0.5 } } filled.contour(1:100, 1:100, m) |
||
| _______________________________________________________ | _______________________________________________________ | _______________________________________________________ | |
General
version used
The version of software used to check the examples in the reference sheet.
show version
How to determine the version of an installation.
implicit prologue
Code which examples in the sheet assume to have already been executed.
r:
The ggplot2 library must be installed and loaded to use the plotting functions qplot and ggplot.
Grammar and Invocation
interpreter
How to invoke the interpreter on a script.
repl
How to launch a command line read-eval-print loop for the language.
r:
R installations come with a clickable GUI REPL.
command line program
How to pass the code to be executed to the interpreter as a command line argument.
environment variables
How to get and set an environment variable.
block delimiters
Punctuation or keywords which define blocks.
octave:
The list of keywords which define blocks is not exhaustive. Blocks are also defined by
- switch, case, otherwise, endswitch
- unwind_protect, unwind_protect_cleanup, end_unwind_protect
- try, catch, end_try_catch
statement separator
How statements are separated.
octave:
Use a backslash to escape a newline and continue a statement on the following line. MATLAB, in contrast, uses three periods ('...') to continue a statement on the following line.
end-of-line comment
Character used to start a comment that goes to the end of the line.
Variables and Expressions
assignment
r:
Traditionally <- was used in R for assignment. Using an = for assignment was introduced in version 1.4.0 sometime before 2002. -> can also be used for assignment:
3 -> x
compound assignment
The compound assignment operators.
increment and decrement operator
The operator for incrementing the value in a variable; the operator for decrementing the value in a variable.
null
octave:
NA can be used for missing numerical values. Using a comparison operator on it always returns false, including NA == NA. Using a logical operator on NA raises an error.
r:
Relational operators return NA when one of the arguments is NA. In particular NA == NA is NA. When acting on values that might be NA, the logical operators observe the rules of ternary logic, treating NA as the unknown value.
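Both behaviors have rough Python analogues. The following is a minimal sketch using only the standard library: a float NaN compares unequal to itself, much like Octave's NA, and the hypothetical helpers kleene_and and kleene_or mimic ternary logic with None standing in for the unknown value. The helper names and the use of None are assumptions of this sketch, not part of any of the three languages.

```python
import math

nan = float("nan")

# A float NaN never compares equal, even to itself.
nan_equals_itself = (nan == nan)
nan_detected = math.isnan(nan)


def kleene_and(a, b):
    """Three-valued AND; None plays the role of the unknown value (assumption)."""
    if a is False or b is False:
        return False          # a definite False decides the result
    if a is None or b is None:
        return None           # otherwise an unknown operand keeps it unknown
    return True


def kleene_or(a, b):
    """Three-valued OR; None plays the role of the unknown value (assumption)."""
    if a is True or b is True:
        return True           # a definite True decides the result
    if a is None or b is None:
        return None
    return False
```

With these definitions, kleene_and(None, False) is False and kleene_or(None, True) is True, while the remaining combinations involving None stay unknown.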
null test
How to test if a value is null.
conditional expression
A conditional expression.
Arithmetic and Logic
true and false
The boolean literals.
octave:
true and false are functions which return matrices of ones and zeros of type logical. If no arguments are specified they return single entry matrices. If one argument is provided, a square matrix is returned. If two arguments are provided, they are the row and column dimensions.
falsehoods
Values which evaluate to false in a conditional test.
octave:
When used in a conditional, matrices evaluate to false unless they are nonempty and all their entries evaluate to true. Because strings are matrices of characters, an empty string ('' or "") will evaluate to false. Most other strings will evaluate to true, but it is possible to create a nonempty string which evaluates to false by inserting a null character; e.g. "false\000".
r:
When used in a conditional, a vector evaluates to the boolean value of its first entry. Using a vector with more than one entry in a conditional results in a warning message. Using an empty vector in a conditional, c() or NULL, raises an error.
logical operators
The boolean operators.
octave:
Note that MATLAB does not use the exclamation point '!' for negation.
&& and || are short circuit logical operators.
relational operators
The relational operators.
octave:
Note that MATLAB does not use '!=' for an inequality test.
arithmetic operators
The arithmetic operators: addition, subtraction, multiplication, division, quotient, remainder.
octave:
mod is a function and not an infix operator. The result of mod has the same sign as the second argument (the divisor), whereas the result of rem has the same sign as the first argument (the dividend).
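The two remainder conventions can be compared in Python, where the % operator takes the sign of the divisor and math.fmod takes the sign of the dividend. This is a sketch for comparison only; the variable names are illustrative.

```python
import math

# The % operator takes the sign of the divisor.
mod_results = (-7 % 3, 7 % -3)

# math.fmod takes the sign of the dividend.
rem_results = (math.fmod(-7, 3), math.fmod(7, -3))
```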
integer division
How to compute the quotient of two integers.
integer division by zero
What happens when an integer is divided by zero.
float division
How to perform float division, even if the arguments are integers.
float division by zero
What happens when a float is divided by zero.
power
octave:
^ is a synonym for **.
r:
^ is a synonym for **.
sqrt
The square root function.
sqrt(-1)
The result of taking the square root of a negative number.
transcendental functions
The standard transcendental functions.
transcendental constants
Constants for pi and e.
float truncation
Ways of converting a float to a nearby integer.
absolute value
The absolute value and signum of a number.
integer overflow
What happens when an expression evaluates to an integer which is too big to be represented.
float overflow
What happens when an expression evaluates to a float which is too big to be represented.
float limits
The machine epsilon; the largest representable float and the smallest (i.e. closest to negative infinity) representable float.
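In plain Python these limits are available from sys.float_info. A short sketch, assuming IEEE 754 double precision floats:

```python
import sys

epsilon = sys.float_info.epsilon     # gap between 1.0 and the next larger float
largest = sys.float_info.max         # largest finite float
smallest = -sys.float_info.max       # finite float closest to negative infinity

# Adding epsilon to 1.0 changes the value; adding half of it does not.
distinguishable = (1.0 + epsilon) > 1.0
absorbed = (1.0 + epsilon / 2) == 1.0
```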
complex construction
Literals for complex numbers.
complex decomposition
How to decompose a complex number into its real and imaginary parts; how to decompose a complex number into its absolute value and argument; how to get the complex conjugate.
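A sketch of the three decompositions using Python's built-in complex type and the cmath module:

```python
import cmath

z = 3 + 4j

real_part = z.real            # real part
imag_part = z.imag            # imaginary part
modulus = abs(z)              # absolute value
argument = cmath.phase(z)     # argument in radians
conjugate = z.conjugate()     # complex conjugate
```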
random number
How to generate a random integer from a uniform distribution; how to generate a random float from a uniform distribution.
random seed
How to set, get, and restore the seed used by the random number generator.
octave:
At startup the random number generator is seeded using operating system entropy.
r:
At startup the random number generator is seeded using the current time.
numpy:
On Unix the random number generator is seeded at startup from /dev/urandom.
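Setting, getting, and restoring the generator state can be sketched with the standard library random module. numpy.random offers similarly named functions (seed, get_state, set_state), but the sketch below deliberately sticks to the standard library.

```python
import random

random.seed(17)                 # set the seed
saved = random.getstate()       # get the current generator state
first = [random.random() for _ in range(3)]

random.setstate(saved)          # restore the saved state
second = [random.random() for _ in range(3)]
```

Because the state was restored before the second draw, first and second hold the same three values.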
bit operators
The bit operators left shift, right shift, and, or, xor, and negation.
matlab/octave:
bitshift takes a second argument which is positive for left shift and negative for right shift.
bitcmp takes a second argument which is the size in bits of the integer being operated on. Octave is not compatible with MATLAB in how the integer size is indicated.
r:
There is a library on CRAN called bitops which provides bit operators.
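For comparison, Python has infix operators for all six operations. The helpers bitshift and bitcmp below are hypothetical Python functions modeled on the conventions described above (signed shift count, explicit bit width); they are not the Octave functions themselves.

```python
left = 5 << 2       # left shift
right = 20 >> 2     # right shift
both = 5 & 3        # and
either = 5 | 3      # or
exclusive = 5 ^ 3   # xor
negated = ~5        # negation (two's complement)


def bitshift(n, k):
    """Shift left for positive k, right for negative k (illustrative helper)."""
    return n << k if k >= 0 else n >> -k


def bitcmp(n, bits):
    """Complement n within a fixed width of `bits` bits (illustrative helper)."""
    return ~n & ((1 << bits) - 1)
```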
Strings
literal
The syntax for a string literal.
newline in literal
Can a newline be included in a string literal? Equivalently, can a string literal span more than one line of source code?
literal escapes
Escape sequences for including special characters in string literals.
character access
How to get the character in a string at a given index.
chr and ord
How to convert an ASCII code to a character; how to convert a character to its ASCII code.
length
How to get the number of characters in a string.
concatenate
How to concatenate strings.
replicate
How to create a string which consists of a character or substring repeated a fixed number of times.
index of substring
How to get the index of first occurrence of a substring.
extract substring
How to get the substring at a given index.
split
How to split a string into an array of substrings. In the original string the substrings must be separated by a character, string, or regex pattern which will not appear in the array of substrings.
The split operation can be used to extract the fields from a field delimited record of data.
join
How to join an array of substrings into a single string. The substrings can be separated by a specified character or string.
Joining is the inverse of splitting.
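A sketch of splitting and joining in Python. The record string is made-up sample data in the style of a colon-delimited record.

```python
import re

record = "alice:x:501:20"           # made-up field-delimited record
fields = record.split(":")          # split on a fixed separator
rejoined = ":".join(fields)         # joining undoes the split

# Splitting on a regex pattern; the separators do not appear in the result.
loose = re.split(r"\s*,\s*", "a , b,c")
```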
trim
How to remove whitespace from the beginning and the end of a string.
Trimming is often performed on user provided input.
convert from string, to string
How to convert strings to numbers and vice versa.
case manipulation
How to put a string into all caps. How to put a string into all lower case letters. How to capitalize the first letter of a string.
sprintf
How to create a string using a printf style format.
Regular Expressions
regex test
How to test whether a string matches a regular expression.
regex substitution
How to replace all substrings which match a pattern with a specified string; how to replace the first substring which matches a pattern with a specified string.
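A sketch of a regex test and both kinds of substitution with Python's re module; the sample text is invented.

```python
import re

text = "do re mi mi mi"

has_match = re.search(r"\bmi\b", text) is not None    # regex test
replace_all = re.sub(r"mi", "ma", text)               # replace every match
replace_first = re.sub(r"mi", "ma", text, count=1)    # replace only the first match
```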
Date and Time
current date/time
How to get the current date and time.
r:
Sys.time() returns a value of type POSIXct.
date/time type
The data type used to hold a combined date and time value.
date/time difference type
The data type used to hold the difference between two date/time types.
get date parts
How to get the year, the month as an integer from 1 through 12, and the day of the month from a date/time value.
get time parts
How to get the hour as an integer from 0 through 23, the minute, and the second from a date/time value.
build date/time from parts
How to build a date/time value from the year, month, day, hour, minute, and second as integers.
convert to string
How to convert a date value to a string using the default format for the locale.
strptime
How to parse a date/time value from a string in the manner of strptime from the C standard library.
strftime
How to write a date/time value to a string in the manner of strftime from the C standard library.
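Python's datetime class exposes both operations under the same names as the C functions. A minimal sketch with an arbitrary sample timestamp:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"

parsed = datetime.strptime("2012-05-03 10:00:00", fmt)   # string to date/time
formatted = parsed.strftime(fmt)                          # date/time to string

date_parts = (parsed.year, parsed.month, parsed.day)
time_parts = (parsed.hour, parsed.minute, parsed.second)
```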
Tuples
| | homogeneous array | vector | tuple | record | map |
|---|---|---|---|---|---|
| NumPy | list | vector | tuple | dict | dict |
| Octave | rank 1 matrix | rank 1 matrix | cell array | struct | |
| R | vector | vector | list | list | |
tuple literal
How to create a tuple, which we define as a fixed length, inhomogeneous list.
tuple element access
How to access an element of a tuple.
tuple length
How to get the number of elements in a tuple.
Arrays
literal
size
empty test
lookup
update
out-of-bounds behavior
index of element
slice
slice to end
manipulate back
manipulate front
concatenate
replicate
copy
How to make an address copy, a shallow copy, and a deep copy of an array.
After an address copy is made, modifications to the copy also modify the original array.
After a shallow copy is made, the addition, removal, or replacement of elements in the copy does not modify the original array. However, if elements in the copy are modified, those elements are also modified in the original array.
A deep copy is a recursive copy. The original array is copied and a deep copy is performed on all elements of the array. No change to the contents of the copy will modify the contents of the original array.
r:
R does not provide a way to perform an address copy.
Because arrays cannot be elements of arrays, there is no distinction between a shallow copy and a deep copy.
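For contrast, the three kinds of copy can all be observed with a nested Python list. This sketch uses the standard copy module; the variable names are illustrative.

```python
import copy

original = [1, [2, 3]]

address = original                # address copy: a second name for the same list
shallow = copy.copy(original)     # new outer list, shared inner list
deep = copy.deepcopy(original)    # fully independent copy

shallow.append(4)                 # adding an element leaves original untouched
shallow[1].append(99)             # modifying a shared element changes original too
deep[1].append(-1)                # changes to the deep copy never reach original
```

After these statements original is [1, [2, 3, 99]]: the append to the shallow copy did not affect it, but the change to the shared inner list did.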
Sequences
range
Multidimensional Arrays
Arrays map integers to arbitrary values. The arrays supported by the languages in this reference sheet are homogeneous, which means that the values in the codomain of the array must all be of the same type.
The languages in this sheet all support multidimensional arrays. A multidimensional array maps tuples of integers to values. All tuples which can be used as indices in a multidimensional array are of the same length and this length is the dimension of the array.
Arrays use contiguous regions of memory to store their values. Thus, an array with an element at index 1 and index 10 must allocate space for elements at indices 2 through 9, even if values are not explicitly set or needed. The shape of a multidimensional array can be expressed by a tuple of positive integers with the same length as the dimension of the array.
Arrays provide constant time access when looking up values by their indices.
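The mapping from an index tuple to a position in contiguous storage can be sketched in pure Python. The function flat_index is a hypothetical helper; zero-based indices and a row-major layout are assumptions of this sketch, and a column-major layout would accumulate the offset in the opposite order.

```python
def flat_index(index, shape):
    """Offset of a multidimensional index in row-major contiguous storage."""
    if len(index) != len(shape):
        raise IndexError("index length must equal the dimension of the array")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of bounds")
        offset = offset * n + i
    return offset


shape = (2, 3, 4)                       # a 3-dimensional array
storage = list(range(2 * 3 * 4))        # one contiguous block of 24 values
value = storage[flat_index((1, 2, 3), shape)]
```

The lookup costs a fixed number of arithmetic operations per dimension, which is why access by index takes constant time.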
A vector is a one dimensional array which supports these operations:
- addition on vectors of the same length
- scalar multiplication
- a dot product
- a norm
The languages in this reference sheet provide the above operations for all one dimensional arrays which contain numeric values.
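The four vector operations listed above can be written out in pure Python to make their definitions explicit. The function names are illustrative, and norm here is the Euclidean norm; the languages in the sheet provide built-in equivalents.

```python
import math


def add(u, v):
    """Element-wise sum of two vectors of the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    """Scalar multiplication."""
    return [c * a for a in v]


def dot(u, v):
    """Dot product of two vectors of the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean norm, defined through the dot product."""
    return math.sqrt(dot(v, v))
```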
NumPy adds the homogeneous ndarray type to the native Python list. A Python list is nonhomogeneous and one dimensional, but because it can contain lists as values it can be used to hold multidimensional data. Python lists are described in the Python reference sheet.
array literal
octave:
An array in Octave is in fact a 1 x n matrix.
r:
c(1,2,3) is a vector and array(c(1,2,3)) is a one dimensional array. The documentation says that some functions may treat the two objects differently. In the absence of knowing what those differences are it seems best to use the vector.
2d array literal
3d array literal
must arrays be homogeneous
Can an array be created with elements of different type?
octave:
The array literal
[1,'foo',3]
will create an array with 5 elements of class char.
r:
The array literal
c(1,'foo',3)
will create an array of 3 elements of class character, which is the R string type.
array data types
What data types are permitted in arrays.
octave:
Arrays in Octave can only contain numeric elements. This follows from the fact that Octave "arrays" are in fact 1 x n matrices.
Array literals can have a nested structure, but Octave will flatten them. The following literals create the same array:
[ 1 2 3 [ 4 5 6] ]
[ 1 2 3 4 5 6 ]
Logical values can be put into an array because true and false are synonyms for 1 and 0. Thus the following literals create the same arrays:
[ true false false ]
[ 1 0 0 ]
If a string is encountered in an array literal, the string is treated as an array of ASCII values and it is concatenated with other ASCII values to produce a string. The following literals all create the same string:
[ 'foo', 98, 97, 114]
[ 'foo', 'bar' ]
'foobar'
If the other numeric values in an array literal that includes a string are not integer values that fit into an ASCII byte, then they are converted to byte sized values.
r:
Array literals can have a nested structure, but R will flatten them. The following literals produce the same array of 6 elements:
c(1,2,3,c(4,5,6))
c(1,2,3,4,5,6)
If an array literal contains a mixture of booleans and numbers, then the boolean literals will be converted to 1 (for TRUE and T) and 0 (for FALSE and F).
If an array literal contains strings and either booleans or numbers, then the booleans and numbers will be converted to their string representations. For the booleans the string representations are "TRUE" and "FALSE".
array element access
index of array element
array length
array concatenation
multidimensional array concatenation
map
filter
reduce
Dictionaries
dictionary literal
The syntax for a dictionary literal.
dictionary lookup
How to use a key to lookup a value in a dictionary.
Ordered Dictionaries
Data Sets
| r | pandas | |
|---|---|---|
| ordered dictionary | list() | Series() |
| data set | data frame | DataFrame() |
| data set column type | vector | Series |
| row.names | Index() | |
| hierarchical index | ||
| factor (ordered and unordered) |
construct from column arrays
How to construct a data set from a set of arrays representing the columns.
construct from row tuples
categorical variable column
index column
column names as array
How to show the names of the columns.
access column as array
How to access a column in a data set.
access row as tuple
How to access a row in a data set.
r:
people[1,] returns the 1st row from the data set people as a new data set with one row. This can be converted to a list using the function as.list. There is often no need because lists and one row data sets have nearly the same behavior.
access datum
How to access a single datum in a data set; i.e. the value in a column of a single row.
order rows by column
How to sort the rows in a data set according to the values in a specified column.
order rows by multiple columns
order rows in descending order
How to sort the rows in descending order according to the values in a specified column.
limit rows
How to select the first n rows according to some ordering.
offset rows
How to select rows starting at offset n according to some ordering.
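Ordering, limiting, and offsetting can be sketched in pure Python on a list of row dictionaries. The people rows below are made-up sample data, and the list of dictionaries only stands in for a data set.

```python
people = [
    {"name": "ann", "ht": 64},
    {"name": "bob", "ht": 71},
    {"name": "cat", "ht": 68},
]

by_height = sorted(people, key=lambda row: row["ht"])                 # order rows by column
descending = sorted(people, key=lambda row: row["ht"], reverse=True)  # descending order
first_two = by_height[:2]                                             # limit rows to the first n = 2
from_offset = by_height[1:]                                           # rows starting at offset n = 1
```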
attach columns
How to make each column name a variable in the current scope which refers to the column as an array.
r:
Each column of the data set is copied into a variable named after the column containing the column as a vector. Modifying the data in the variable does not alter the original data set.
detach columns
How to remove attached column names from the current scope.
spreadsheet editor
How to view and edit the data set in a spreadsheet.
Import and Export
import tab delimited file
Load a data set from a tab delimited file.
import comma-separated values file
Load a data set from a CSV file.
set column separator
How to set the column separator when importing a delimited file.
set quote character
How to change the quote character. Quoting is used when strings contain the column separator or the line terminator.
import file w/o header
How to import a file that lacks a header.
set column names
How to set the column names.
set column types
How to indicate the type of the columns.
r:
If the column types are not set or if the type is set to NA or NULL, then the type will be set to logical, integer, numeric, complex, or factor.
recognize null values
Specify the input values which should be converted to null values.
unequal row length behavior
What happens when a row of input has fewer or more than the expected number of columns.
skip comment lines
How to skip comment lines.
skip rows
maximum rows to read
index column
export tab delimited file
export comma-separated values file
Save a data set to a CSV file.
r:
If row.names is not set to F, the initial column will be the row number as a string starting from "1".
Relational Algebra
map data set
How to apply a mapping transformation to the rows of a data set.
filter data set
How to select the rows of a data set that satisfy a predicate.
Aggregation
Functions
definition
invocation
function value
Execution Control
if
How to write a branch statement.
while
How to write a conditional loop.
for
How to write a C-style for statement.
break/continue
How to break out of a loop. How to jump to the next iteration of a loop.
raise exception
How to raise an exception.
handle exception
How to handle an exception.
finally block
How to write code that executes even if an exception is raised.
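Raising, handling, and cleanup can be shown together in one Python sketch. The functions risky and safe_sqrt and the log list are hypothetical names used only for illustration.

```python
log = []


def risky(x):
    """Raise an exception for invalid input."""
    if x < 0:
        raise ValueError("negative input")
    return x ** 0.5


def safe_sqrt(x):
    """Handle the exception; the finally block runs in both cases."""
    try:
        return risky(x)
    except ValueError as err:
        log.append("handled: %s" % err)
        return None
    finally:
        log.append("finally ran")


result_ok = safe_sqrt(9)
result_bad = safe_sqrt(-1)
```

The finally block appends to log on both calls, whether or not an exception was raised.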
File Handles
standard file handles
Standard input, standard output, and standard error.
read line from stdin
write line to stdout
write formatted string to stdout
open file for reading
open file for writing
open file for appending
close file
i/o errors
read line
iterate over file by line
read file into array of strings
write string
write line
flush file handle
file handle position
redirect stdout to file
Directories
working directory
How to get and set the working directory.
Processes and Environment
command line arguments
How to get the command line arguments.
environment variables
How to get and set an environment variable.
Libraries and Namespaces
load library
How to load a library.
list loaded libraries
Show the list of libraries which have been loaded.
library search path
The list of directories the interpreter will search looking for a library to load.
source file
How to source a file.
r:
When sourcing a file, the suffix, if any, must be specified, unlike when loading a library. Also, a library may contain a shared object, but a sourced file must consist of just R source code.
install package
How to install a package.
list installed packages
How to list the packages which have been installed.
Reflection
data type
How to get the data type of a value.
r:
For vectors class returns the mode of the vector which is the type of data contained in it. The possible modes are
- numeric
- complex
- logical
- character
- raw
Some of the more common class types for non-vector entities are:
- matrix
- array
- list
- factor
- data.frame
attributes
How to get the attributes for an object.
r:
Arrays and vectors do not have attributes.
methods
How to get the methods for an object.
variables in scope
How to list the variables in scope.
undefine variable
How to undefine a variable.
undefine all variables
How to undefine all variables.
eval
How to interpret a string as source code and execute it.
function documentation
How to get the documentation for a function.
list library functions
How to list the functions and other definitions in a library.
search documentation
How to search the documentation by keyword.
Vectors
vector literal
element-wise arithmetic operators
scalar multiplication
dot product
cross product
norms
octave:
The norm function returns the p-norm, where the second argument is p. If no second argument is provided, the 2-norm is returned.
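The p-norm described above can be sketched in plain Python to make the role of the second argument concrete. The function name p_norm and its default of p=2 are chosen here for illustration and are not part of any of the libraries covered by this sheet.

```python
def p_norm(v, p=2):
    """Return the p-norm of a sequence of numbers; p defaults to 2."""
    if p == float("inf"):
        # the infinity norm is the largest absolute value
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

v = [3, -4]
one_norm = p_norm(v, 1)             # sum of absolute values
two_norm = p_norm(v)                # Euclidean length
inf_norm = p_norm(v, float("inf"))  # largest absolute value
```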
Matrices
literal or constructor
Literal syntax or constructor for creating a matrix.
The elements of a matrix must be specified in a linear order. If the elements of each row of the matrix are adjacent to other elements of the same row in the linear order we say the order is row contiguous. If the elements of each column are adjacent to other elements of the same column we say the order is column contiguous.
octave:
Square brackets are used for matrix literals. Semicolons are used to separate rows, and commas separate row elements. Optionally, newlines can be used to separate rows and whitespace to separate row elements.
r:
Matrices are created by passing a vector containing all of the elements, as well as the number of rows and columns, to the matrix constructor.
If there are not enough elements in the data vector, the values will be recycled. If there are too many, the extra values will be ignored. However, the number of elements in the data vector must be a factor or a multiple of the number of elements in the final matrix or an error results.
When consuming the elements in the data vector, R will normally fill by column. To change this behavior pass a byrow=T argument to the matrix constructor:
A = matrix(c(1,2,3,4),nrow=2,byrow=T)
dimensions
How to get the dimensions of a matrix.
element access
How to access an element of a matrix. All languages described here follow the convention from mathematics of specifying the row index before the column index.
octave:
Rows and columns are indexed from one.
r:
Rows and columns are indexed from one.
row access
How to access a row.
column access
How to access a column.
submatrix access
How to access a submatrix.
scalar multiplication
How to multiply a matrix by a scalar.
element-wise operators
Operators which act on two identically sized matrices element by element. Note that element-wise multiplication of two matrices is used less frequently in mathematics than matrix multiplication.
from numpy import array
matrix(array(A) * array(B))
matrix(array(A) / array(B))
multiplication
How to multiply matrices. Matrix multiplication should not be confused with element-wise multiplication of matrices. Matrix multiplication is non-commutative and only requires that the number of columns of the matrix on the left match the number of rows of the matrix on the right. Element-wise multiplication, by contrast, is commutative and requires that the dimensions of the two matrices be equal.
kronecker product
The Kronecker product is a non-commutative operation defined on any two matrices. If A is m x n and B is p x q, then the Kronecker product is a matrix with dimensions mp x nq.
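A minimal sketch of the Kronecker product in plain Python, with matrices represented as lists of rows, may help show where the mp x nq dimensions come from and why the operation is non-commutative. The helper name kron is an assumption made for this example.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    p = len(B)       # rows of B
    q = len(B[0])    # columns of B
    # entry (i, j) of the result is A[i // p][j // q] * B[i % p][j % q]
    return [
        [A[i // p][j // q] * B[i % p][j % q] for j in range(len(A[0]) * q)]
        for i in range(len(A) * p)
    ]

A = [[1, 2]]            # 1 x 2
B = [[0, 1], [1, 0]]    # 2 x 2
AB = kron(A, B)         # 2 x 4
BA = kron(B, A)         # also 2 x 4, but with different entries
```

Here both products have the same shape, yet the entries differ, which is what non-commutative means in this context.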
comparison
How to test two matrices for equality.
octave:
== and != perform entry-wise comparison. The result of using either operator on two matrices is a matrix of boolean values.
~= is a synonym for !=.
r:
== and != perform entry-wise comparison. The result of using either operator on two matrices is a matrix of boolean values.
norms
How to compute the 1-norm, the 2-norm, the infinity norm, and the Frobenius norm.
octave:
norm(A) is the same as norm(A,2).
Statistics
A statistic is a single number which summarizes a population of data. The most familiar example is the mean or average. Statistics defined for discrete populations can often be meaningfully extended to continuous distributions by replacing summations with integration.
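An important class of statistics are the nth moments. The nth moment $\mu'_n$ of a population of k values $x_i$ with mean μ is:
$$\mu'_n = \frac{1}{k} \sum_{i=1}^{k} x_i^n \qquad (1)$$
The nth central moment $\mu_n$ of the same population is:
$$\mu_n = \frac{1}{k} \sum_{i=1}^{k} (x_i - \mu)^n \qquad (2)$$
first moment statistics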
The sum and the mean.
The mean is the first moment. It is one definition of the center of the population. The median and the mode are also used to define the center. In most populations they will be close to but not identical to the mean.
second moment statistics
The variance and the standard deviation. The variance is the second central moment. It is a measure of the spread or width of the population.
The standard deviation is the square root of the variance. It is also a measurement of population spread. The standard deviation has the same units of measurement as the data in the population.
second moment statistics for samples
The sample variance and sample standard deviation.
skewness
The skewness of a population.
The skewness measures the asymmetry of the population. The skewness will be negative, positive, or zero when the population is more spread out on the left, more spread out on the right, or similarly spread out on both sides, respectively.
The skewness can be calculated from the third central moment $\mu_3$ and the standard deviation σ:
$$\gamma_1 = \frac{\mu_3}{\sigma^3} \qquad (3)$$
When estimating the population skewness from a sample of k values a correction factor is often used, yielding the sample skewness:
$$G_1 = \frac{\sqrt{k(k-1)}}{k-2} \cdot \frac{\mu_3}{\sigma^3} \qquad (4)$$
octave and matlab:
Octave uses the sample standard deviation to compute skewness. This behavior is different from Matlab and should possibly be regarded as a bug.
Matlab, but not Octave, will take a flag as a second parameter. When set to zero Matlab returns the sample skewness:
skewness(x, 0)
numpy:
Set the named parameter bias to False to get the sample skewness:
stats.skew(x, bias=False)
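The difference between the population skewness and the corrected sample skewness can be written out in plain Python. This is a sketch, assuming the commonly used correction factor sqrt(k(k-1))/(k-2); the function name skewness and its sample flag are invented for illustration.

```python
import math

def skewness(xs, sample=False):
    """Population skewness, or the corrected sample skewness if sample=True."""
    k = len(xs)
    mu = sum(xs) / k
    m2 = sum((x - mu) ** 2 for x in xs) / k   # second central moment (variance)
    m3 = sum((x - mu) ** 3 for x in xs) / k   # third central moment
    g1 = m3 / m2 ** 1.5                       # third moment over sigma cubed
    if sample:
        # assumed correction factor for estimating from a sample
        return math.sqrt(k * (k - 1)) / (k - 2) * g1
    return g1

right_tailed = [1, 2, 3, 4, 10]   # more spread out on the right
symmetric = [1, 2, 3, 4, 5]       # equally spread out on both sides
```

For a right-tailed data set the result is positive, and the corrected value is larger in magnitude than the uncorrected one because the factor exceeds one.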
kurtosis
The kurtosis of a population.
The formula for kurtosis is:
$$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3 \qquad (5)$$
When kurtosis is negative the sides of a distribution tend to be more convex than when the kurtosis is positive. A negative kurtosis distribution tends to have a wide, flat peak and narrow tails. Such a distribution is called platykurtic. A positive kurtosis distribution tends to have a narrow, sharp peak and long tails. Such a distribution is called leptokurtic.
The fourth standardized moment is
$$\beta_2 = \frac{\mu_4}{\sigma^4} \qquad (6)$$
The fourth standardized moment is sometimes taken as the definition of kurtosis in older literature. The modern definition is preferred because it assigns the normal distribution a kurtosis of zero.
octave:
Octave uses the sample standard deviation when computing kurtosis. This should probably be regarded as a bug.
r:
R uses the older fourth standardized moment definition of kurtosis.
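The two definitions of kurtosis differ only by the constant 3, which a short plain Python sketch can make explicit. The function names and the two small data sets are chosen for illustration only.

```python
def fourth_standardized_moment(xs):
    """Fourth central moment divided by the squared variance."""
    k = len(xs)
    mu = sum(xs) / k
    m2 = sum((x - mu) ** 2 for x in xs) / k
    m4 = sum((x - mu) ** 4 for x in xs) / k
    return m4 / m2 ** 2

def excess_kurtosis(xs):
    """Modern definition: shifted so the normal distribution scores zero."""
    return fourth_standardized_moment(xs) - 3.0

flat = [1, 2, 3, 4, 5, 6]              # evenly spread values
peaked = [0, 5, 5, 5, 5, 5, 5, 10]     # most values at the center, two far out
```

The evenly spread data set comes out platykurtic (negative) and the centrally piled one leptokurtic (positive) under the modern definition.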
nth moment and nth central moment
How to compute the nth moment (also called the nth absolute moment) and the nth central moment for arbitrary n.
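As a language-neutral illustration, the two kinds of moment can be computed directly from their definitions in plain Python. The function names moment and central_moment are assumptions made for this sketch.

```python
def moment(xs, n):
    """nth moment: the mean of the nth powers of the values."""
    return sum(x ** n for x in xs) / len(xs)

def central_moment(xs, n):
    """nth central moment: the mean of the nth powers of deviations from the mean."""
    mu = moment(xs, 1)
    return sum((x - mu) ** n for x in xs) / len(xs)

xs = [1, 2, 3, 4, 10]
mean = moment(xs, 1)               # first moment
variance = central_moment(xs, 2)   # second central moment
```

The variance can also be recovered from the first two moments as moment(xs, 2) - moment(xs, 1) ** 2.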
mode
The mode is the most common value in the sample.
The mode is a measure of central tendency like the mean and the median. A problem with the mean is that it can produce values not found in the data. For example the mean number of persons in an American household was 2.6 in 2009.
The mode might not be unique. If there are two modes the sample is said to be bimodal, and in general if there is more than one mode the sample is said to be multimodal.
quantile statistics
If the data is sorted from smallest to largest, the minimum is the first value, the median is the middle value, and the maximum is the last value. If there are an even number of data points, the median is the average of the middle two points.
The median divides the population into two halves. When the population is divided into four parts the division markers are called the first, second, and third quartile. When the population is divided into a hundred parts the division markers are called percentiles. If the population is divided into n parts the markers are called the 1st, 2nd, …, (n-1)th n-quantile.
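A short sketch using Python's standard statistics module shows these quantities side by side. Note that statistics.quantiles interpolates between data points, and several interpolation conventions are in use, so the exact cut points may differ slightly from those reported by other tools.

```python
import statistics

data = [7, 1, 9, 3, 5, 8, 2, 6, 4]

lowest, highest = min(data), max(data)
median = statistics.median(data)                # middle value of the sorted data
even_median = statistics.median([1, 2, 3, 4])   # average of the middle two values
quartiles = statistics.quantiles(data, n=4)     # three cut points
deciles = statistics.quantiles(data, n=10)      # nine cut points
```

The second quartile coincides with the median, and dividing into n parts always yields n-1 markers.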
bivariate statistics
The correlation and the covariance.
The correlation is a number from -1 to 1. It is a measure of the linearity of the data, with values of -1 and 1 indicating a perfectly linear relationship. When the correlation is positive the quantities tend to increase together and when the correlation is negative one quantity will tend to increase as the other decreases.
A variable can be completely dependent on another and yet the two variables can have zero correlation. This happens for $Y = X^2$ where X is uniformly distributed on the interval [-1, 1]. Anscombe's quartet gives four examples of data sets each with the same fairly high correlation 0.816 and yet which show significant qualitative differences when plotted.
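The zero-correlation case can be checked numerically. The sketch below uses plain Python and evenly spaced points on [-1, 1] as a stand-in for a uniform distribution; it takes the correlation to be the covariance divided by the product of the two standard deviations. The function names are chosen for illustration.

```python
import math

def covariance(xs, ys):
    """Mean of the products of deviations from the respective means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [i / 10 for i in range(-10, 11)]   # evenly spaced on [-1, 1]
squares = [x * x for x in xs]           # completely determined by xs
line = [2 * x + 1 for x in xs]          # linear in xs

r_squares = correlation(xs, squares)    # essentially zero
r_line = correlation(xs, line)          # essentially one
```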
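The covariance is defined by
$$\operatorname{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \qquad (7)$$
The correlation is the normalized version of the covariance. It is defined by
$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (8)$$
frequency table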
How to compute the frequency table for a data set. A frequency table counts how often each value occurs in the data set.
r:
The table function returns an object of type table.
invert frequency table
How to convert a frequency table back into the original data set.
The order of the original data set is not preserved.
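In plain Python, both directions can be sketched with collections.Counter from the standard library. This is only an illustration of the idea and not the technique used in any of the three columns of the sheet.

```python
from collections import Counter

data = ["b", "a", "c", "a", "b", "a"]

# frequency table: maps each distinct value to the number of times it occurs
freq = Counter(data)

# inverting the table repeats each value according to its count;
# the original ordering is lost, so the result is sorted for comparison
restored = sorted(freq.elements())
```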
bin
How to bin a data set. The result is a frequency table where each frequency represents the number of samples from the data set for an interval.
r:
The cut function returns a factor.
A labels parameter can be provided with a vector argument to assign the bins names. Otherwise bin names are constructed from the breaks using "[0.0,1.0)" style notation.
The hist function can be used to bin a data set:
x = c(1.1, 3.7, 8.9, 1.2, 1.9, 4.1)
hist(x, breaks=c(0, 3, 6, 9), plot=FALSE)
hist returns an object of type histogram. The counts are in the $counts attribute.
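The counting step behind binning can be sketched in plain Python using the same data and breaks as the example above. The helper name bin_counts is hypothetical, and the choice of intervals that are open on the left and closed on the right is an assumption of this sketch; values outside the breaks are silently dropped.

```python
from bisect import bisect_left

def bin_counts(xs, breaks):
    """Count the values falling in each interval (breaks[i], breaks[i+1]]."""
    counts = [0] * (len(breaks) - 1)
    for x in xs:
        # index of the interval whose right endpoint is the first break >= x
        i = bisect_left(breaks, x) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

x = [1.1, 3.7, 8.9, 1.2, 1.9, 4.1]
counts = bin_counts(x, [0, 3, 6, 9])
```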
Linear Regression and Curve Fitting
linear regression y = ax + b
How to get the slope a and intercept b for a line which best approximates the data. How to get the residuals.
If there are more than two data points, then the system is overdetermined and in general there is no exact solution for the slope and the intercept. Linear regression looks for the line that fits the points as well as possible. The least squares solution is the line that minimizes the sum of the squares of the vertical distances of the points from the line.
The residuals are the difference between the actual values of y and the calculated values using ax + b. The norm of the residuals can be used as a measure of the goodness of fit.
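For a single predictor the least squares slope and intercept have a closed form, which the following plain Python sketch uses. The function name fit_line and the four data points are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Least squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx          # slope
    b = my - a * mx        # the fitted line passes through the point of means
    return a, b

xs = [0, 1, 2, 3]
ys = [1, 3, 4, 8]
a, b = fit_line(xs, ys)

# residuals: actual y minus the value predicted by the line
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
residual_norm = math.sqrt(sum(r * r for r in residuals))
```

When an intercept is fitted, the residuals sum to zero, so it is their norm rather than their sum that measures the goodness of fit.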
Distributions
A distribution density function f(x) is a non-negative function which, when integrated over its entire domain is equal to one. The distributions described in this sheet have as their domain the real numbers. The support of a distribution is the part of the domain on which the density function is non-zero.
A distribution density function can be used to describe the values one is likely to see when drawing a sample from a population. Values in areas where the density function is large are more likely than values in areas where the density function is small. Values where the density function is zero do not occur. Thus it can be useful to plot the density function.
To derive probabilities from a density function one must integrate or use the associated cumulative density function
$$F(x) = \int_{-\infty}^{x} f(t)\, dt \qquad (9)$$
which gives the probability of seeing a value less than or equal to x. As probabilities are non-negative and no greater than one, F is a function from (-∞, ∞) to [0,1]. The inverse of F is called the inverse cumulative distribution function or the quantile function for the distribution.
For each distribution statistical software will generally provide four functions: the density, the cumulative distribution, the quantile, and a function which returns random numbers in frequencies that match the distribution. If the software does not provide a random number generating function for the distribution, the quantile function can be composed with the built-in random number generator that most languages have as long as it returns uniformly distributed floats from the interval [0, 1].
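| function | names in use |
|---|---|
| density | density, probability density, probability mass |
| cumulative distribution | cumulative density, cumulative distribution, distribution |
| quantile | inverse cumulative density, inverse cumulative distribution, quantile, percentile, percent point |
| random number generator | random variate |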
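Composing a quantile function with a uniform random number generator can be sketched as follows for the exponential distribution, whose quantile function has a simple closed form. The function name exponential_quantile and the parameter name rate are assumptions of this sketch; the generator is seeded only to make the run repeatable.

```python
import math
import random

def exponential_quantile(u, rate):
    """Inverse CDF of the exponential distribution: solves 1 - exp(-rate*x) = u."""
    return -math.log(1.0 - u) / rate

rng = random.Random(0)   # seeded so the run is repeatable
rate = 2.0

# rng.random() returns uniform floats in [0, 1), so 1 - u is never zero
samples = [exponential_quantile(rng.random(), rate) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)   # should be close to 1 / rate
```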
Discrete distributions such as the binomial and the Poisson do not have density functions in the normal sense. Instead they have probability mass functions which assign probabilities which sum up to one to the integers. In R warnings will be given if non-integer values are provided to the mass functions dbinom and dpois.
The cumulative distribution function of a discrete distribution can still be defined on the reals. Such a function is constant except at the integers where it may have jump discontinuities.
Most well known distributions are in fact parametrized families of distributions. This table lists some of them with their parameters and properties.
The information entropy of a continuous distribution with density f(x) is defined as:
$$h(f) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx \qquad (10)$$
In Bayesian analysis the distribution with the greatest entropy, subject to the known facts about the distribution, is called the maximum entropy probability distribution. It is considered the best distribution for modeling the current state of knowledge.
binomial
The probability mass, cumulative distribution, quantile, and random number generating functions for the binomial distribution.
The binomial distribution is a discrete distribution. It models the number of successful trials when n is the number of trials and p is the chance of success for each trial. An example is the number of heads when flipping a coin 100 times. If the coin is fair then p is 0.50.
numpy:
Random numbers in a binomial distribution can also be generated with:
np.random.binomial(n, p)
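For reference, the binomial probability mass and cumulative distribution functions can be written directly from their definitions in plain Python. The function names binom_pmf and binom_cdf are chosen for this sketch; math.comb requires Python 3.8 or later.

```python
import math

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for any real x; the value is constant between integers."""
    top = min(n, math.floor(x))
    return sum(binom_pmf(i, n, p) for i in range(0, top + 1))

n, p = 100, 0.5                                      # 100 flips of a fair coin
total = sum(binom_pmf(x, n, p) for x in range(n + 1))   # the masses sum to one
at_most_half = binom_cdf(50, n, p)
```

Because the cumulative distribution is a step function, binom_cdf(50.9, n, p) returns the same value as binom_cdf(50, n, p).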
poisson
The probability mass, cumulative distribution, quantile, and random number generating functions for the Poisson distribution.
The Poisson distribution is a discrete distribution. It is described by a parameter lam which is the mean value for the distribution. The Poisson distribution is used to model events which happen at a specified average rate and independently of each other. Under these circumstances the time between successive events will be described by an exponential distribution and the events are said to be described by a Poisson process.
numpy:
Random numbers in a Poisson distribution can also be generated with:
np.random.poisson(lam, size=1)
normal
The probability density, cumulative distribution, quantile, and random number generating functions for the normal distribution.
The parameters are the mean μ and the standard deviation σ. The standard normal distribution has μ of 0 and σ of 1.
The normal distribution is the maximum entropy distribution for a given mean and variance. According to the central limit theorem, if {X_1, …, X_n} are any independent and identically distributed random variables with mean μ and finite variance σ², then for large n the distribution of S_n := Σ X_i / n approaches a normal distribution with mean μ and variance σ²/n.
numpy:
Random numbers in a standard normal distribution can also be generated with:
np.random.randn()
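The central limit theorem can be checked with a small simulation in plain Python. The sketch below averages uniform random numbers, which have mean 1/2 and variance 1/12, and compares the spread of the averages with the predicted variance divided by n. The sample size, number of trials, and seed are arbitrary choices.

```python
import random

rng = random.Random(1)   # seeded so the run is repeatable
n = 30                   # size of each sample
trials = 20_000          # number of sample means to collect

# each entry is the mean of n uniform values on [0, 1)
means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]

observed_mean = sum(means) / trials
observed_var = sum((m - observed_mean) ** 2 for m in means) / trials
expected_var = (1 / 12) / n   # variance of one uniform value divided by n
```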
gamma
The probability density, cumulative distribution, quantile, and random number generating functions for the gamma distribution.
The parameter k is called the shape parameter and θ is called the scale parameter. The rate of the distribution is β = 1/θ.
If X_i are n independent random variables with Γ(k_i, θ) distribution, then Σ X_i has distribution Γ(Σ k_i, θ).
If X has Γ(k, θ) distribution, then αX has Γ(k, αθ) distribution.
exponential
The probability density, cumulative distribution, quantile, and random number generating functions for the exponential distribution.
chi-squared
The probability density, cumulative distribution, quantile, and random number generating functions for the chi-squared distribution.
beta
The probability density, cumulative distribution, quantile, and random number generating functions for the beta distribution.
uniform
The probability density, cumulative distribution, quantile, and random number generating functions for the uniform distribution.
The uniform distribution is described by the parameters a and b which delimit the interval on which the density function is nonzero.
The uniform distribution is the maximum entropy probability distribution with support [a, b].
Consider the uniform distribution on [0, b]. Suppose that we take k samples from it, and m is the largest of the samples. The minimum variance unbiased estimator for b is
$$\hat{b} = \frac{k + 1}{k}\, m \qquad (11)$$
octave, r, numpy:
a and b are optional parameters and default to 0 and 1 respectively.
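The estimator for the upper bound of a uniform distribution on [0, b] from the largest of k samples, m(1 + 1/k), can be checked for bias by simulation. The following plain Python sketch uses made-up values for b, k, and the number of trials, and a fixed seed for repeatability.

```python
import random

rng = random.Random(2)
b = 10.0          # true upper bound, treated as unknown by the estimator
k = 5             # samples per experiment
trials = 50_000

estimates = []
for _ in range(trials):
    m = max(rng.uniform(0, b) for _ in range(k))   # largest of the k samples
    estimates.append(m * (1 + 1 / k))              # scale the maximum up slightly

avg_estimate = sum(estimates) / trials             # should be close to b
```

The raw maximum always falls at or below b, so on its own it underestimates the bound; the factor (1 + 1/k) compensates for this on average.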
Student's t
The probability density, cumulative distribution, quantile, and random number generating functions for Student's t distribution.
Snedecor's F
The probability density, cumulative distribution, quantile, and random number generating functions for Snedecor's F distribution.
Univariate Charts
vertical bar chart
A chart in which numerical values are represented by vertical bars. The bars are aligned at the bottom.
r:
How to produce a bar chart using ggplot2:
cnts = c(7,3,8,5,5)
names = c("a","b","c","d","e")
df = data.frame(names, cnts)
qplot(names, data=df, geom="bar", weight=cnts)
horizontal bar chart
A bar chart with horizontal bars which are aligned on the left.
pie chart
A pie chart displays values using the areas of circular sectors or equivalently the lengths of the arcs of those sectors.
A pie chart implies that the values are percentages of a whole.
dot plot
A chart which displays small, integral values with stacks of dots.
stem plot
Also called a stem-and-leaf plot.
A stem plot is a concise way of storing a small set of numbers which makes their distribution visually evident.
The original set of numbers can be recovered with some loss of accuracy by appending the number on the left with each of the digits on the right. In the example below the original data set contained -43, -42, -41, -39, -38, -35, …, 35, 44, 46, 50, 58.
> stem(20*rnorm(100))
The decimal point is 1 digit(s) to the right of the |
-4 | 321
-2 | 98544054310
-0 | 8864333111009998776444332222110
0 | 0001122333333466667778899122334555666789
2 | 00023669025
4 | 4608
histogram
A histogram is a bar chart where each bar represents a range of values that the data points can fall in. The data is tabulated to find out how often data points fall in each of the bins and in the final chart the length of the bars corresponds to the frequency.
A common method for choosing the number of bins using the number of data points n is Sturges' formula:
$$\lceil \log_2 n \rceil + 1 \qquad (12)$$
r:
How to make a histogram with the ggplot2 library:
x = rnorm(50)
binwidth = (max(x)-min(x))/10
qplot(x, geom="histogram", binwidth=binwidth)
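Sturges' formula, taken here in its usual form of the ceiling of log2 of the number of data points plus one, can be turned into a bin width as sketched below in plain Python. The function names sturges_bins and bin_width are hypothetical.

```python
import math

def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' formula for n data points."""
    return math.ceil(math.log2(n)) + 1

def bin_width(xs):
    """Width of each bin when the data range is split into that many bins."""
    return (max(xs) - min(xs)) / sturges_bins(len(xs))

bins_for_50 = sturges_bins(50)
bins_for_100 = sturges_bins(100)
```

The suggested number of bins grows slowly: doubling the number of data points adds only one bin.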
box plot
Also called a box-and-whisker plot.
The box shows the locations of the 1st quartile, median, and 3rd quartile. These are the same as the 25th percentile, 50th percentile, and 75th percentile.
The whiskers are sometimes used to show the maximum and minimum values of the data set. Outliers are sometimes shown explicitly with dots, in which case all remaining data points occur inside the whiskers.
r:
How to create a box plot with ggplot2:
qplot(x="rnorm", y=rnorm(50), geom="boxplot")
qplot(x=c("rnorm", "rexp", "runif"), y=c(rnorm(50), rexp(50), runif(50)), geom="boxplot")
chart title
How to set the chart title.
r:
The qplot command supports the main option for setting the title:
qplot(x="rnorm", y=rnorm(50), geom="boxplot", main="boxplot example")
Bivariate Charts
stacked bar chart
Two or more data sets with a common set of labels can be charted with a stacked bar chart. This makes the sum of the data sets for each label readily apparent.
grouped bar chart
Optionally data sets with a common set of labels can be charted with a grouped bar chart which clusters the bars for each label. The grouped bar chart makes it easier to perform comparisons between labels for each data set.
scatter plot
A scatter plot can be used to determine if two variables are correlated.
r:
How to make a scatter plot with ggplot:
x = rnorm(50)
y = rnorm(50)
p = ggplot(data.frame(x, y), aes(x, y))
p = p + layer(geom="point")
p
hexagonal binning
A hexagonal binning is the two-dimensional analog of a histogram. The number of data points in each hexagon is tabulated, and then color or grayscale is used to show the frequency.
A hexagonal binning is superior to a scatter plot when the number of data points is high because most scatter plot software doesn't indicate when points occur on top of each other.
linear regression line
How to plot a line determined by linear regression on top of a scatter plot.
polygonal line plot
How to connect the dots of a data set with a polygonal line.
cubic spline
How to connect the dots of a data set with a line which has a continuous 2nd derivative.
function plot
How to plot a function.
quantile-quantile plot
Also called a Q-Q plot.
A quantile-quantile plot is a scatter plot created from two data sets. Each point depicts the quantile of the first data set with its x position and the corresponding quantile of the second data set with its y position.
If the data sets are drawn from the same distribution then most of the points should be close to the line y = x. If the data sets are drawn from distributions which have a linear relation then the Q-Q plot should also be close to linear.
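The points of a Q-Q plot can be computed without any plotting library. The plain Python sketch below assumes the two data sets have the same size, in which case the corresponding quantiles are simply the sorted values paired up; data sets of unequal size would need interpolation, which is not shown. The function name qq_points is made up for this example.

```python
import random

def qq_points(xs, ys):
    """Pair up the order statistics of two equal-sized data sets."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(xs), sorted(ys)))

rng = random.Random(3)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [3 * x + 2 for x in a]    # same sample after a linear transformation
rng.shuffle(b)                # the order within a data set does not matter
points = qq_points(a, b)
```

Since the second data set is a linear function of the first, every point lies on the line y = 3x + 2 rather than on y = x.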
axis labels
How to label the x and y axes.
r:
How to label the axes with ggplot2:
x = rnorm(20)
y = x^2
p = ggplot(data.frame(x, y), aes(x, y))
p + layer(geom="point") + xlab('x') + ylab('x squared')
axis limits
How to manually set the range of values displayed by an axis.
logarithmic y-axis
Multivariate Charts
additional line set
legend
How to put a legend on a chart.
r:
The named parameter lwd is the line width. It is roughly the width in pixels, though the exact interpretation is device specific.
The named parameter lty specifies the line type. The value can be either an integer or a string:
| number | string |
|---|---|
| 0 | 'blank' |
| 1 | 'solid' |
| 2 | 'dashed' |
| 3 | 'dotted' |
| 4 | 'dotdash' |
| 5 | 'longdash' |
| 6 | 'twodash' |
additional point set
stacked area chart
overlapping area chart
3d scatter plot
bubble chart
scatter plot matrix
contour plot
MATLAB
Octave Manual
MATLAB Documentation
gnuplot Documentation
Differences between Octave and MATLAB
Octave-Forge Packages
The basic data type of MATLAB is a matrix of floats. There is no distinction between a scalar and a 1x1 matrix, and functions that work on scalars typically work on matrices as well by performing the scalar function on each entry in the matrix and returning the results in a matrix with the same dimensions. Operators such as the logical operators ('&' '|' '!'), relational operators ('==', '!=', '<', '>'), and arithmetic operators ('+', '-') all work this way. However the multiplication '*' and division '/' operators perform matrix multiplication and matrix division, respectively. The '.*' and './' operators are available if entry-wise multiplication or division is desired.
Floats are by default double precision; single precision can be specified with the single constructor. MATLAB has convenient matrix literal notation: commas or spaces can be used to separate row entries, and semicolons or newlines can be used to separate rows.
Arrays and vectors are implemented as single-row (1xn) matrices. As a result an n-element vector must be transposed before it can be multiplied on the right of an mxn matrix.
Numeric literals that lack a decimal point such as 17 and -34 create floats, in contrast to most other programming languages. To create an integer, an integer constructor which specifies the size such as int8 and uint16 must be used. Matrices of integers are supported, but the entries in a given matrix must all have the same numeric type.
Strings are implemented as single-row (1xn) matrices of characters, and as a result matrices cannot contain strings. If a string is put in a matrix literal, each character in the string becomes an entry in the resulting matrix. This is consistent with how matrices are treated if they are nested inside another matrix. The following literals all yield the same string or 1xn matrix of characters:
'foo'
[ 'f' 'o' 'o' ]
[ 'foo' ]
[ [ 'f' 'o' 'o' ] ]
true and false are functions which return matrices of ones and zeros. The ones and zeros have type logical instead of double, which is created by the literals 1 and 0. Other than having a different class, the 0 and 1 of type logical behave the same as the 0 and 1 of type double.
MATLAB has a tuple type (in MATLAB terminology, a cell array) which can be used to hold multiple strings. It can also hold values with different types.
Octave is a free, open source application for floating point and matrix computations which can interface with numerical routines implemented in C or Fortran. Octave implements the core MATLAB language, and as a result MATLAB scripts will usually run under Octave. Octave scripts are less likely to run under MATLAB because of extensions which Octave has made to the core language. Octave's plotting functions use gnuplot.
R
An Introduction to R
The Comprehensive R Archive Network
ggplot2 reference manual
R is an application for statistical analysis. It is a free, open source implementation of the S programming language developed at Bell Labs.
The basic data types of R are vectors of floats, vectors of strings, and vectors of booleans. There is no distinction between a scalar and a vector with one entry in it, and functions and operators which accept a scalar argument will typically accept a vector argument, returning a vector of the same size with the scalar operation performed on each of the entries of the original vector.
The scalars in a vector must all be of the same type, but R also provides a list data type which can be used as a tuple (entries accessed by index) or a record (entries accessed by name).
In addition R provides a data frame type which is a list (in R terminology) of vectors all of the same length. Data frames are equivalent to the data sets of other statistical analysis packages.
NumPy
NumPy and SciPy Documentation
matplotlib intro
NumPy for Matlab Users
Pandas Documentation
Pandas Method/Attribute Index
NumPy is a Python library which provides a data type called array. It differs from the Python list data type in the following ways:
- N-dimensional. Although the list type can be nested to hold higher dimension data, the array can hold higher dimension data in a space efficient manner without using indirection.
- homogeneous. The elements of an array are restricted to be of a specified type. The NumPy library introduces new primitive types not available in vanilla Python. However, the element type of an array can be object which permits storing anything in the array.
In the reference sheet the array section covers the vanilla Python list and the multidimensional array section covers the NumPy array.
List the NumPy primitive types
SciPy, Matplotlib, and Pandas are libraries which depend on NumPy.























