Numerical Analysis & Statistics: MATLAB, R, NumPy, Julia

a side-by-side reference sheet

matlabrnumpyjulia
version usedMATLAB 8.3

Octave 3.8
3.1Python 2.7
NumPy 1.7
SciPy 0.13
Pandas 0.12
Matplotlib 1.3
0.4
show version$matlab -nojvm -nodisplay -r 'exit'$ octave --version
$R --versionsys.version np.__version__ sp.__version__ mpl.__version__$ julia --version
implicit prologuenoneimport sys, os, re, math
import numpy as np
import scipy as sp
import scipy.stats as stats
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
grammar and invocation
matlabrnumpyjulia
interpreter

$cat >>foo.m 1 + 1 exit$ matlab -nojvm -nodisplay -r "run('foo.m')"

$octave foo.m$ cat >>foo.r
1 + 1

$Rscript foo.r$ R -f foo.r
$cat >>foo.py print(1 + 1)$ python foo.py
$cat >>foo.jl println(1 + 1)$ julia foo.jl
repl

$matlab -nojvm -nodisplay$ octave
$R$ python$julia command line program$ matlab -nojvm -nodisplay -r 'disp(1 + 1); exit'

$octave --silent --eval '1 + 1'$ Rscript -e 'print("hi")'python -c 'print("hi")'$julia -e 'println("hi")' block delimitersfunction end if elseif else end while end for end { }offside rule statement separator; or newline Newlines not separators after three dots: ... Output is suppressed when lines end with a semicolon. ; or sometimes newline Newlines not separators inside (), [], {}, '', "", or after binary operator. newline or ; Newlines not separators inside (), [], {}, triple quote literals, or after backslash: \ end-of-line comment 1 + 1 % addition1 + 1 # addition1 + 1 # addition1 + 1 # addition variables and expressions matlabrnumpyjulia assignmenti = 3i = 3 i <- 3 3 -> i assign("i", 3) i = 3i = 3 parallel assignment swap compound assignment arithmetic, string, logical nonenone# do not return values: += -= *= /= //= %= **= += *= &= |= ^= null% Only used in place of numeric values: NaN NA NULLNone np.nan # None cannot be stored in a numpy array; # np.nan can if dtype is float64. # Only used in place of float values: NaN null testisnan(v) % true for '', []: isempty(v) is.na(v) is.null(v) v == None v is None np.isnan(np.nan) # np.nan == np.nan is False isnan(v) conditional expressionnone(if (x > 0) x else -x) ifelse(x > 0, x, -x) x if x > 0 else -xx > 0 ? x : -x arithmetic and logic matlabrnumpyjulia true and false 1 0 true falseTRUE FALSE T FTrue Falsetrue false falsehoodsfalse 0 0.0 matrices evaluate to false unless nonempty and all entries evaluate to true FALSE F 0 0.0 matrices evaluate to value of first entry; string in boolean context causes error False None 0 0.0 '' [] {}false logical operators~true | (true & false) % short-circuit operators: && || !TRUE | (TRUE & FALSE) short-circuit operators: && || & and | can operate on and return vectors, but && and || return scalars and or not&& || ! relational operators == ~= > < >= <=== != > < >= <=== != > < >= <=== != > < >= <= integer typeInt8 Int16 Int32 Int64 Int128 Int is either 32 or 64 bits, depending on WORD_SIZE unsigned typeUInt8 UInt16 UInt32 UInt64 UInt128 UInt is either 32 or 64 bits, depending on WORD_SIZE float type Float16 Float32 Float64 arithmetic operators add, sub, mult, div, quot, rem + - * / none mod(n, divisor)+ - * / %/% %%+ - * / // %+ - * / div(n, divisor) rem(n, divisor) # always non-negative: mod(n, divisor) integer division fix(13 / 5)13 %/% 5 as.integer(13 / 5) 13 // 5div(13, 5) integer division by zero Inf NaN or -Infresult of converting Inf or NaN to an integer with as.integer: NA raises ZeroDivisionErrorraises DivideError float division 13 / 513 / 5float(13) / 513 / 5 5 \ 13 float division by zero dividend is positive, zero, negative these values are literals: Inf NaN -Inf these values are literals: Inf NaN -Inf raises ZeroDivisionErrorthese values are literals: Inf NaN -Inf power2 ^ 162 ^ 16 2 ** 16 2 ** 162 ^ 16 sqrt sqrt(2)sqrt(2)math.sqrt(2)sqrt(2) sqrt -1% returns 0 + 1i: sqrt(-1) # returns NaN: sqrt(-1) # returns 0+1i: sqrt(-1+0i) # raises ValueError: math.sqrt(-1) # returns 1.41421j: import cmath cmath.sqrt(-1) # raises DomainError: sqrt(-1) # returns 0.0 + 1.0im: sqrt(-1 + 0im) transcendental functionsexp log sin cos tan asin acos atan atan2exp log sin cos tan asin acos atan atan2math.exp math.log math.sin math.cos math.tan math.asin math.acos math.atan math.atan2exp log sin cos tan asin acos atan atan2 transcendental constantspi epi exp(1)math.pi math.epi e float truncation round towards zero, to nearest integer, down, up fix(x) round(x) floor(x) ceil(x) as.integer(x) round(x) floor(x) ceiling(x) int(x) int(round(x)) math.floor(x) math.ceil(x) Int(trunc(x)) Int(round(x)) Int(floor(x)) Int(ceil(x)) # trunc() and other functions return floats. # Int() raises InexactError if float argument has # nonzero fractional portion. absolute value and signum abs signabs signabs(-3.7) math.copysign(1, -3.7) abs(-3.7) sign(-3.7) integer overflowbecomes float; largest representable integer in the variable intmaxbecomes float; largest representable integer in the variable .Machine$integer.maxbecomes arbitrary length integer of type long
float overflow

InfInfraises OverflowError
float limits

eps
realmax
realmin
.Machine$double.eps .Machine$double.xmax
.Machine$double.xmin np.finfo(np.float64).eps np.finfo(np.float64).max np.finfo(np.float64).min rational construction22 // 7 rational decompositionnum(22 // 7) den(22 // 7) complex construction 1 + 3i1 + 3i1 + 3j1 + 3im complex(1, 3) complex decompositionreal imag abs arg conj Re Im abs Arg Conj import cmath z.real z.imag cmath.polar(z)[1] real(1 + 3im) imag(1 + 3im) abs(1 + 3im) angle(1 + 3im) conj(1 + 3im) random number uniform integer, uniform float floor(100 * rand) rand floor(100 * runif(1)) runif(1) np.random.randint(0, 100) np.random.rand() rand(1:100) rand() random seed set, get, and restore rand('state', 17) sd = rand('state') rand('state', sd) set.seed(17) sd = .Random.seed none np.random.seed(17) sd = np.random.get_state() np.random.set_state(sd) bit operatorsbitshift(100, 3) bitshift(100, -3) bitand(1, 2) bitor(1, 2) bitxor(1, 2) % MATLAB: bitcmp(1, 'uint16') % Octave: bitcmp(1, 16) none100 << 3 100 >> 3 1 & 2 1 | 2 1 ^ 2 ~1 binary, octal, and hex literals0b101010 0o52 0x2a radix convert integer to and from string with radix base(7, 42) parse(Int, "60", 7) strings matlabrnumpyjulia literal'don''t say "no"' % Octave only: "don't say \"no\"" "don't say \"no\"" 'don\'t say "no"' 'don\'t say "no"' "don't say \"no\"" r"don't " r'say "no"' "don't say \"no\"" newline in literal noyesnoyes literal escapes% Octave double quote only: \\ \" \' \0 \a \b \f \n \r \t \v \\ \" \' \a \b \f \n \r \t \v \ooo# single and double quoted: \newline \\ \' \" \a \b \f \n \r \t \v \ooo \xhh \\ \" \' \a \b \f \n \t \r \v \ooo \xhh \uhhhh \Uhhhhhhhh variable interpolationcount = 3 item = "ball" println("$count $(item)s") expression interpolation "1 + 1 =$(1 + 1)"
concatenatestrcat('one ', 'two ', 'three')paste("one", "two", "three", sep=" ")'one ' + 'two ' + 'three'
literals, but not variables, can be concatenated with juxtaposition:
'one ' "two " 'three'
"one " * "two " * "three"

string("one ", "two ", "three")
replicate

hbar = repmat('-', 1, 80)hbar = paste(rep('-', 80), collapse='')hbar = '-' * 80hbar = "-" ^ 80

hbar = repeat("-", 80)
index of substring% returns array of one-indexed
% locations

strfind('hello', 'el')
counts from one, returns

regexpr("el", "hello")
'hello'.index('el')
# returns UnitRange:
search("hello", "el")
extract substring

s = 'hello'
% syntax error: 'hello'(1:4)
s(1:4)
substr("hello", 1, 4)'hello'[0:4]"hello"[1:4]
split% returns cell array:
strsplit('foo,bar,baz', ',')
strsplit('foo,bar,baz', ',')'foo,bar,baz'.split(',')split("foo,bar,baz", ",")
join% takes cell array as arg:
strjoin({'foo', 'bar', 'baz'}, ',')
paste("foo", "bar", "baz", sep=",")
paste(c('foo', 'bar', 'baz'),
collapse=',')
','.join(['foo', 'bar', 'baz'])join(["foo", "bar", "baz"], ",")
trim
both sides, left, right
strtrim(' foo ')
none
deblank('foo ')
gsub("(^[\n\t ]+|[\n\t ]+$)", "", " foo ") sub("^[\n\t ]+", "", " foo") sub("[\n\t ]+$", "", "foo ")
' foo '.strip()
' foo'.lstrip()
'foo '.rstrip()
on right, on left, centered
s = repmat(' ', 1, 10)
s(1:5) = 'lorem'
strjust(s, 'left')
strjust(s, 'right')
strjust(s, 'center')
sprintf("%-10s", "lorem")
sprintf("%10s", "lorem")
none
'lorem'.ljust(10)
'lorem'.rjust(10)
'lorem'.center(10)
number to string

strcat('value: ', num2str(8))paste("value: ", toString("8"))'value: ' + str(8)
string to number7 + str2num('12')
73.9 + str2num('.037')
7 + as.integer("12")
73.9 + as.double(".037")
7 + int('12')
73.9 + float('.037')
7 + parse(Int, "12")
73.9 + parse(Float64, ".037")
translate caselower('FOO')
upper('foo')
tolower("FOO")
toupper("foo")
'foo'.upper()
'FOO'.lower()
uppercase("foo")
lowercase("FOO")
sprintf

sprintf('%s: %.3f %d', 'foo', 2.2, 7)sprintf("%s: %.3f %d", "foo", 2.2, 7)'%s: %.3f %d' % ('foo', 2.2, 7)@sprintf("%s: %.2f %d", "foo", 2.2, 7)
length

length('hello')nchar("hello")len('hello')length("hello")

# index of first byte of last char:
endof("hello")
character access

s = 'hello'
% syntax error: 'hello'(1)
s(1)
substr("hello", 1, 1)'hello'[0]"hello"[1]

# index must be byte-index of the first byte of a
# character. Raises BoundsErrror if no such byte,
# and UnicodeError if byte not first in char.
chr and ordchar(65)
double('A')
intToUtf8(65)
utf8ToInt("A")
chr(65)
ord('A')
Char(65)
Int('A')
regular expressions
matlabrnumpyjulia
character class abbreviations. \d \D \s \S \w \W

% also C-string style backslash escapes:
\a \b \f \n \r \t \v
# escape backslash in strings by doubling:
. \d \D \s \S \w \W
. \d \D \s \S \w \W. \d \D \h \H \s \S \v \V \w \W
anchors

^ $\< \># escape backslash in strings by doubling: ^$ \< \> \b \B
^ $\A \b \B \Z^$ \A \b \B \z \Z
match testregexp('hello', '^[a-z]+$') regexp('hello', '^\S+$')
regexpr("^[a-z]+$", "hello") > 0 regexpr('^\\S+$', "hello") > 0
re.search(r'^[a-z]+$', 'hello') re.search(r'^\S+$', 'hello')
ismatch(r"^[a-z]+$", "hello") case insensitive match testregexpi('Lorem Ipsum', 'lorem')regexpr('(?i)lorem', "Lorem Ipsum") > 0re.search(r'lorem', 'Lorem Ipsum', re.I)ismatch(r"lorem"i, "Lorem Ipsum") modifiers none(?i) (?m) (?s) (?x)re.I re.M re.S re.Xi m s x substitution first match, all matches s = 'do re mi mi mi' regexprep(s, 'ma', 'once') regexprep(s, 'mi', 'ma') sub('mi', 'ma', 'do re mi mi mi') gsub('mi', 'ma', 'do re mi mi mi') rx = re.compile(r'mi') s = rx.sub('ma', 'do re mi mi mi', 1) s2 = rx.sub('ma', 'do re mi mi mi') replace("do re mi mi mi", r"mi", s"ma", 1) replace("do re mi mi mi", r"mi", s"ma") backreference in match and substitutionregexp('do do', '(\w+) \1') regexprep('do re', '(\w+) (\w+)', '$2 $1') regexpr('(\\w+) \\1', 'do do') > 0 sub('(\\w+) (\\w+)', '\\2 \\1', 'do re') none rx = re.compile(r'(\w+) (\w+)') rx.sub(r'\2 \1', 'do re') ismatch(r"(\w+) \1", "do do") group capturerx = '(\d{4})-(\d{2})-(\d{2})' m = re.search(rx, '2010-06-03') yr, mo, dy = m.groups() rx = r"(\d{4})-(\d{2})-(\d{2})" m = match(rx, "2010-06-03") yr, mn, dy = m.captures dates and time matlabrnumpyjulia current date/time t = nowt = as.POSIXlt(Sys.time())import datetime t = datetime.datetime.now() t = now() date/time typefloating point number representing days since year 0 in the Gregorian calendarPOSIXltdatettimeDateTime date/time difference typefloating point number representing daysa difftime object which behaves like a floating point number representing secondstimedelta, which can be converted to float value in seconds via total_seconds() methodBase.Dates.Millisecond get date partsdv = datevec(t) dv(1) dv(2) dv(3) % syntax error: datevec(t)(1) t$year + 1900
t$mon + 1 t$mday
t.year
t.month
t.day
Dates.year(t)
Dates.month(t)
Dates.day(t)
get time partsdv = datevec(t)
dv(4)
dv(5)
dv(6)
t$hour t$min
t$sec t.hour t.minute t.second Dates.hour(t) Dates.minute(t) Dates.second(t) build date/time from partst = datenum([2011 9 20 23 1 2])t = as.POSIXlt(Sys.time()) t$year = 2011 - 1900
t$mon = 9 - 1 t$mday = 20
t$hour = 23 t$min = 1
t$sec = 2 import datetime t = datetime.datetime(2011, 9, 20, 23, 1, 2) t = DateTime(2011, 9, 20, 23, 1, 2) convert to string datestr(t)print(t)str(t)"$t"
parse datetimes = '2011-09-20 23:01:02'
fmt = 'yyyy-mm-dd HH:MM:SS'
t = datenum(s, fmt)
t = strptime('2011-09-20 23:01:02',
'%Y-%m-%d %H:%M:%S')
import datetime

s = '2011-05-03 10:00:00'
fmt = '%Y-%m-%d %H:%M:%S'
t = datetime.datetime.strptime(s, fmt)
fmt = "yyyy-mm-dd HH:MM:SS"
t = DateTime("2011-05-03 10:00:00", fmt)

# fmt string can be compiled:
df = Dates.DateFormat(fmt)
t2 = DateTime("2011-05-03 10:00:00", df)
format datetime

datestr(t, 'yyyy-mm-dd HH:MM:SS')format(t, format='%Y-%m-%d %H:%M:%S')t.strftime('%Y-%m-%d %H:%M:%S')Dates.format(t, "yyyy-mm-dd HH:MM:SS")
sleeppause(0.5)Sys.sleep(0.5)import time

time.sleep(0.5)
sleep(0.5)
tuples
matlabrnumpyjulia
type

celllisttupleTuple{T[, ...]}
literal

tup = {1.7, 'hello', [1 2 3]}tup = list(1.7, "hello", c(1, 2, 3))tup = (1.7, "hello", [1,2,3])tup = (1.7, "foo", [1, 2, 3])
lookup element

% indices start at one:
tup{1}
# indices start at one:
tup[[1]]
# indices start at zero:
tup[0]
# indices start at one:
tup[1]
update element

tup{1} = 2.7tup[[1]] = 2.7tuples are immutabletuples are immutable
length

length(tup)length(tup)len(tup)length(tup)
arrays
matlabrnumpyjulia
element typealways numeric# "numeric":
class(c(1, 2, 3))

# arrays can also have "boolean" or "string" elements
# values can have different types:
[type(x) for x in a]
a = [1, 2, 3]

# Array{Int64, 2}:
typeof(a)
# Int64:
typeof(a[1])
literal

a = [1, 2, 3, 4]

% commas are optional:
a = [1 2 3 4]
# use c() constructor:
a = c(1, 2, 3, 4)
a = [1, 2, 3, 4]a = [1, 2, 3, 4]
size

length(a)length(a)len(a)length(a)
empty test

length(a) == 0

% An array used in a conditional test is
% false unless nonempty and all entries evaluate
% as true.
length(a) == 0not aisempty(a)
lookup

% Indices start at one:
a(1)
# Indices start at one:
a[1]
# Indices start at zero:
a[0]
# Indices start at one:
a[1]
update

a(1) = -1a[1] = -1a[0] = -1a[1] = -1
out-of-bounds behaviora = []

% error:
a(1)

% increases array size to 10;
% zero-fills slots 1 through 9:

a(10) = 10
a = c()
# evaluates as NA:
a[10]
# increases array size to 10:
a[10] = "lorem"
a = []
# raises IndexError:
a[10]
# raises IndexError:
a[10] = 'lorem'
a = []

# raises BoundsError:
a[10]
# raises BoundsError:
a[10] = "lorem"
index of elementa = [7 8 9 10 8]

% returns [2 5]:
find(a == 8)

% returns 2:
find(a == 8, 1, 'first')
a = c('x', 'y', 'z', 'w', 'y')

# c(2, 5):
which(a == 'y')
a = ['x', 'y', 'z', 'w', 'y']

a.index('y')   # 1
a.rindex('y')  # 4
a = ["x", "y", "z", "w", "y"]

# 2:
findfirst(a, "y")
# 5:
julia> findlast(a, "y")
slice
by endpoints
a = ['a' 'b' 'c' 'd' 'e']

% ['c' 'd']:
a(3:4)
a = c("a", "b", "c", "d", "e")

# c("c", "d"):
a[seq(3, 4)]
a = ['a', 'b', 'c', 'd', 'e']

# ['c', 'd']:
a[2:4]
a = ["a", "b", "c", "d", "e"]

# ["c", "d"]:
a[3:4]
slice to end

a = ['a' 'b' 'c' 'd' 'e']

% ['c' 'd' 'e']:
a(3:end)
a = c("a", "b", "c", "d", "e")

# both return c("c", "d", "e"):
tail(a, n=length(a) - 2)
a[-1:-2]
a = ['a', 'b', 'c', 'd', 'e']

# ['c', 'd', 'e']:
a[2:]
a = ["a", "b", "c", "d", "e"]

# ["c", "d", "e"]:
a[3:end]
integer array as index[7 8 9]([1 3 3])c(7, 8, 9)[c(1, 3, 3)]np.array([7, 8, 9])[[0, 2, 2]]# [7, 9, 9]:
[7, 8, 9][[1, 3, 3]]
logical array as index[7 8 9]([true false true])c(7, 8, 9)[c(T, F, T)]np.array([7, 8, 9])[[True, False, True]]# [7, 9]:
[7, 8, 9][[true, false, true]]
concatenatea = [1 2 3]
a2 = [a [4 5 6]]
a = [a [4 5 6]]
% or:
a = horzcat(a, a2)
a = c(1, 2, 3)
a2 = append(a, c(4, 5, 6))
a = append(a, c(4, 5, 6))
a = [1, 2, 3]
a2 = a + [4, 5, 6]
a.extend([4, 5, 6])
a = [1, 2, 3]
a2 = vcat(a, [4, 5, 6])
a = vcat(a, [4, 5, 6])
replicatea = repmat(NA, 1, 10)a = rep(NA, 10)

# 30 a's, 50 b's, and 90 c's:
rep(c("a", "b", "c"), c(30, 50, 90))
a = [None] * 10
a = [None for i in range(0, 10)]
fill(NaN, 10)
copy
address copy, shallow copy, deep copy
There is no address copy. Because arrays cannot be nested, there is no distinction between shallow copy and deep copy. Assignment and passing an array to a function can be regarded as performing a shallow or deep copy, though MATLAB does not allocate memory for a 2nd array until one of the arrays is modified.Arrays in R behave like arrays in MATLAB.import copy

a = [1, 2, [3, 4]]

a2 = a
a3 = list(a)
a4 = copy.deepcopy(a)
a = Any[1, 2, [3, 4]]

a2 = a
a3 = copy(a)
a4 = deepcopy(a)
iteration

a = [9 7 3]
for i = 1:numel(a)
x = a(i)
disp(x)
end
for (x in c(9, 7, 3)) {
print(x)
}
for i in [9, 7, 3]:
print(i)
for i = [9, 7, 3]
println(i)
end
indexed iterationfor (i in seq_along(a)) {
cat(sprintf("%s at index %d\n", i, a[i]))
}
a = ['do', 're', 'mi', 'fa']
for i, s in enumerate(a):
print('%s at index %d' % (s, i))
a = ["do", "re", "mi", "fa"]
for (i, s) in enumerate(a)
println(i, " ", s)
end
reversea = [1 2 3]
a2 = fliplr(a)
a = fliplr(a)
a = c(1, 2, 3)
a2 = rev(a)
a = rev(a)
a = [1, 2, 3]
a2 = a[::-1]
a.reverse()
a = [1, 2, 3]
a2 = reverse(a)
reverse!(a)
sorta = [3 1 4 2]
a = sort(a)
a = c('b', 'A', 'a', 'B')
a2 = sort(a)
a = sort(a)
a = ['b', 'A', 'a', 'B']
sorted(a)
a.sort()
a.sort(key=str.lower)
a = [3, 1, 4, 2]
a2 = sort(a)
sort!(a)
dedupea = [1 2 2 3]
a2 = unique(a)
a = c(1, 2, 2, 3)
a2 = unique(a)
a = [1, 2, 2, 3]
a2 = list(set(a)))
a = unique([1, 2, 2, 3])
membership

ismember(7, a)7 %in% a
is.element(7, a)
7 in a7 in a
7 ∈ a
a ∋ 7
intersection

intersect([1 2], [2 3 4])intersect(c(1, 2), c(2, 3, 4)){1, 2} & {2, 3, 4}intersection([1, 2], [2, 3, 4])
∩([1, 2], [2, 3, 4])
union

union([1 2], [2 3 4])union(c(1, 2), c(2, 3, 4)){1, 2} | {2, 3, 4}union([1, 2], [2, 3, 4])
∪([1, 2], [2, 3, 4])
relative complement, symmetric differencesetdiff([1 2 3], [2])

a1 = [1 2]
a2 = [2 3 4]
union(setdiff(a1, a2), setdiff(a2, a1))
setdiff(c(1, 2, 3), c(2))

union(setdiff(c(1, 2), c(2, 3, 4)),
setdiff(c(2, 3, 4), c(1, 2)))
{1, 2, 3} - {2}

{1, 2} ^ {2, 3, 4}
setdiff([1, 2, 3], [2])
symdiff([1, 2], [2, 3, 4])
map

arrayfun( @(x) x*x, [1 2 3])sapply(c(1,2,3), function (x) { x * x})map(lambda x: x * x, [1, 2, 3])
# or use list comprehension:
[x * x for x in [1, 2, 3]]
filter

a = [1 2 3]
a(a > 2)
a = c(1, 2, 3)
a[a > 2]

Filter(function(x) { x > 2}, a)
filter(lambda x: x > 1, [1, 2, 3])
# or use list comprehension:
[x for x in [1, 2, 3] if x > 1]
reduce

Reduce(function(x, y) { x + y }, c(1, 2, 3), 0)reduce(lambda x, y: x + y, [1 ,2, 3], 0)reduce(+, [1, 2, 3])
foldl(-, 0, [1, 2, 3])
foldr(-, 0, [1, 2, 3])
universal and existential tests

all(mod([1 2 3 4], 2) == 0)
any(mod([1 2 3 4]) == 0)
all(c(1, 2, 3, 4) %% 2 == 0)
any(c(1, 2, 3, 4) %% 2 == 0)
all(i % 2 == 0 for i in [1, 2, 3, 4])
any(i % 2 == 0 for i in [1, 2, 3, 4])
all([x % 2 == 0 for x in [1, 2, 3, 4]])
any([x % 2 == 0 for x in [1, 2, 3, 4]])
shuffle and samplea = c(1, 1, 2, 3, 9, 28)
sample(a, 3)
a[sample.int(length(a))]
from random import shuffle, sample

a = [1, 2, 3, 4]
shuffle(a)
sample(a, 2)
zip

none; MATLAB arrays can't be nested# R arrays can't be nested.
# One approximation of zip is a 2d array:

a = rbind(c(1, 2, 3), c('a', 'b', 'c'))

# To prevent data type coercion, use a data frame:
df = data.frame(numbers=c(1, 2, 3),
letters=c('a', 'b', 'c'))
# array of 3 pairs:
a = zip([1, 2, 3], ['a', 'b', 'c'])
arithmetic sequences
matlabrnumpyjulia
unit difference1:100# type integer:
1:100
seq(1, 100)

# type double:
seq(1, 100, 1)
range(1, 101)1:100
difference of 100:10:100# type double:
seq(0, 100, 10)
range(0, 101, 10)0:10:100
difference of 0.10:0.1:10seq(0, 10, 0.1)[0.1 * x for x in range(0, 101)]

# 3rd arg is length of sequence, not step size:
sp.linspace(0, 10, 100)
0:0.1:10
iterate over arithmetic sequence# range replaces xrange in Python 3:
n = 0
for i in xrange(1, 1000001):
n += i
n = 0
for i in 1:1000000
n += i
end
convert arithmetic sequence to arraya = range(1, 11)
# Python 3:
a = list(range(1, 11))
a = Array(1:10)
two dimensional arrays
matlabrnumpyjulia
element typealways numericA = array(c(1, 2, 3))

# "array":
class(A)

# "boolean", "numeric", or "string":
class(c(A))
np.array([1, 2, 3]).dtype

# possible values: np.bool, np.int64, np.float64,
# np.complex128, ...
literal[1, 2; 3, 4]

% commas optional; newlines can replace semicolons::
[1 2
3 4]
nonenone
construct from sequencereshape([1 2 3 4], 2, 2)array(c(1, 2, 3, 4), dim=c(2, 2))np.array([1, 2, 3, 4]).reshape(2, 2)
construct from rowsrow1 = [1 2 3]
row2 = [4 5 6]

A = [row1; row2]
rbind(c(1, 2, 3), c(4, 5, 6))np.vstack((np.array([1, 2, 3]), np.array([4, 5, 6])))

np.array([[1, 2], [3, 4]])
construct from columnscol1 = [1; 4]
col2 = [2; 5]
col3 = [3; 6]

% commas are optional:
A = [col1, col2, col3]
cbind(c(1, 4), c(2, 5), c(3, 6))cols = (
np.array([1, 4]),
np.array([2, 5]),
np.array([3, 6])
)
np.vstack(cols).transpose()
construct from subarraysA = [1 3; 2 4]

A4_by_2 = [A; A]
A2_by_4 = [A A]
A = matrix(c(1, 2, 3, 4), nrow=2)
A4_by_2 = rbind(A, A)
A2_by_4 = cbind(A, A)
A = np.array([[1, 2], [3, 4]])
A2_by_4 = np.hstack([A, A])
A4_by_2 = np.vstack([A, A])
size
number of elements, number of dimensions, dimension lengths
numel(A)
ndims(A)
size(A)

% length of 1st dimension (i.e. # of rows):
size(A, 1)

% length of longest dimension:
length(A)
length(A)
length(dim(A))
dim(A)
A.size
A.ndim
A.shape

# number of rows:
len(A)
length(A)
ndims(A)
size(A)
lookup% indices start at one:
[1 2; 3 4](1, 1)
# indices start at one:
A = array(c(1, 2, 3, 4), dim=c(2, 2)

A[1, 1]
# indices start at zero:
A = np.array([[1, 2], [3, 4]])

A[0][0] or
A[0, 0]
1d lookupA = [2 4; 6 8]
% returns 8:
A(4)

% convert to column vector of length 4:
A2 = A(:)
A = array(c(2, 4, 6, 8), dim=c(2, 2))

# returns 8:
A[4]
A = np.array([[2, 4], [6, 8]])

# returns np.array([6, 8]):
A[1]

# returns 8:
A.reshape(4)[3]
lookup row or columnA = [1 2 3; 4 5 6; 7 8 9]

% 2nd row:
A(2, :)

% 2nd column:
A(:, 2)
A = t(array(1:9, dim=c(3, 3)))

# 2nd row:
A[2, ]

# 2nd column:
A[, 2]
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# 2nd row:
A[1, :]

# 2nd column:
A[:, 1]
updateA = [2 4; 6 8]
A(1, 1) = 3
A = array(c(2, 4, 6, 8), dim=c(2, 2))
A[1, 1] = 3
A = np.array([[2, 4], [6, 8]])
A[0, 0] = 3
update row or columnA = [1 2; 3 4]

% [2 1; 3 4]:
A(1, :) = [2 1]

% [3 1; 2 4]:
A(:, 1) = [3 2]
A = t(array(1:4, dim=c(2, 2)))

A[1, ] = c(2, 1)
A[, 1] = c(3, 2)
A = np.array([[1, 2], [3, 4]])

A[0, :] = [2, 1]
# or
A[0] = [2, 1]

A[:, 0] = [3, 2]
update subarrayA = ones(3, 3)
A(1:2, 1:2) = 2 * ones(2, 2)
% or just:
A(1:2, 1:2) = 2
A = array(1, dim=c(3, 3))
A[1:2, 1:2] = array(2, dim=c(2, 2))
# or just:
A[1:2, 1:2] = 2
A = np.ones([3, 3])
A[0:2, 0:2] = 2 * np.ones([2, 2])
out-of-bounds behaviorA = [2 4; 6 8]

% error:
x = A(3, 1)

% becomes 3x2 array with zero at (3, 2):
A(3, 1) = 9
Lookups and updates both cause subscript out of bounds error.Lookups and updates both raise an IndexError exception.
slice subarrayA = reshape(1:16, 4, 4)'

% top left 2x2 subarray:
A(1:2, 1:2)

% bottom right 2x2 subarray:
A(end-1:end, end-1:end)

% 2x2 array containing corners:
A([1 4], [1 4])
A([1 end], [1 end])
A = t(array(1:16, dim=c(4, 4)))

# top left 2x2 subarray:
A[1:2, 1:2]

# bottom right 2x2 subarray:
A[-1:-2, -1:-2]

# 2x2 array containing corners:
A[c(1, 4), c(1, 4)]
A = np.array(range(1, 17)).reshape(4, 4)

# top left 2x2 subarray:
A[0:2, 0:2]

# bottom right 2x2 subarray:
A[2:, 2:]
transposeA = [1 2; 3 4]

A'
transpose(A)
A = array(c(1, 2, 3, 4), dim=c(2, 2))
t(A)
A = np.array([[1, 2], [3, 4]])
A.transpose()
A.T
flip% [ 2 1; 4 3]:
fliplr([1 2; 3 4])

% [3 4; 1 2]:
flipud([1 2; 3 4])
# install.packages('pracma'):
require(pracma)

A = t(array(1:4, dim=c(2, 2)))

fliplr(A)
flipud(A)
A = np.array([[1, 2], [3, 4]])

np.fliplr(A)
np.flipud(A)
circular shift

along columns, along rows
A = [1 2; 3, 4]

% [3 4; 1 2]:
circshift(A, 1)

% [2 1; 4 3]:
circshift(A, 1, 2)

% The 2nd argument can be any integer; negative values shift
% in the opposite direction.
# install.packages('pracma'):
require(pracma)

A = t(array(1:4, dim=c(2, 2)))

circshift(A, c(1, 0))
circshift(A, c(0, 1))
A = np.array([[1, 2], [3, 4]])

np.roll(A, 1, axis=0)
np.roll(A, 1, axis=1)
rotate
clockwise, counter-clockwise
A = [1 2; 3 4]

% [3 1; 4 2]:
rot90(A, -1)

% [2 4; 1 3]:
rot90(A)

% set 2nd arg to 2 for 180 degree rotation
# install.packages('pracma'):
require(pracma)

A = t(array(1:4, dim=c(2, 2)))

rot90(A)
rot90(A, -1)
rot90(A, 2)
A = np.array([[1, 2], [3, 4]])

np.rot90(A)
np.rot90(A, -1)
np.rot90(A, 2)
reduce
rows, columns
M = [1 2; 3 4]

% sum each row:
cellfun(@sum, num2cell(M, 2))

% sum each column:
cellfun(@sum, num2cell(M, 1))

% sum(M, 2) and sum(M, 1) also sum rows and columns
M = matrix(c(1, 2, 3, 4), nrow=2)

# sum each row:
apply(M, 1, sum)

# sum each column:
apply(M, 2, sum)
M = np.array([[1, 2], [3, 4]])

# np.add is a built-in universal function. All universal functions have a reduce method.

# np.sum(A, 1,) and np.sum(A, 0) also sum rows and columns
three dimensional arrays
matlabrnumpyjulia
construct from sequencereshape([1 2 3 4 5 6 7 8], 2, 2, 2)array(c(1, 2, 3, 4, 5, 6, 7, 8), dim=c(2, 2, 2))np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(2, 2, 2)
construct from nested sequencesnonenonenp.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
construct 3d array from 2d arraysA = [1, 2; 3, 4]
A(:,:,2) = [5, 6; 7, 8]
permute axesA = reshape([1 2 3 4 5 6 7 8], 2, 2, 2)

% swap 2nd and 3rd axes:
permute(A, [1 3 2])
A = array(1:8, dim=c(2, 2, 2))

# swap 2nd and 3rd axes:
aperm(A, perm=c(1, 3, 2))
A = np.array(range(1, 9)).reshape(2, 2, 2)

# swap 2nd and 3rd axes:
A.transpose((0, 2, 1))
flipA = reshape([1 2 3 4 5 6 7 8], 2, 2, 2)

flipdim(A, 3)
nonenone
circular shiftA = reshape([1 2 3 4 5 6 7 8], 2, 2, 2)

% 3rd arg specifies axis:
circshift(A, 1, 3)
noneA = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(2, 2, 2)

np.roll(A, 1, axis=2)
dictionaries
matlabrnumpyjulia
literal

% no literal; use constructor:
d = struct('n', 10, 'avg', 3.7, 'sd', 0.4)

% or build from two cell arrays:
d = cell2struct({10 3.7 0.4}, {'n' 'avg' 'sd'}, 2)
# keys are 'n', 'avg', and 'sd':
d = list(n=10, avg=3.7, sd=0.4)

# keys are 1, 2, and 3:
d2 = list('do', 're', 'mi')
d = {'n': 10, 'avg': 3.7, 'sd': 0.4}d = Dict("n"=>10.0, "avg"=>3.7, "sd"=>0.4)
size

length(fieldnames(d))length(d)len(d)length(d)
lookup

d.n
getfield(d, 'n')
d[['n']]

# if 'n' is a key:
d$n d['n'] update d.var = d.sd**2d$var = d$sd**2d['var'] = d['sd']**2 missing key behavior errorNULLraises KeyError is key present isfield(d, 'var')is.null(d$var)'var' in dhaskey(d, "var")
delete

d = rmfield(d, 'sd')d$sd = NULLdel(d['sd']) iteratefor i = 1:numel(fieldnames(d)) k = fieldnames(d){i} v = d.(k) code end for (k in names(d)) { v = d[[k]] code } for k, v in d.iteritems(): code keys and values as arrays% these return cell arrays: fieldnames(d) struct2cell(d) names(d) unlist(d, use.names=F) d.keys() d.values() mergenoned1 = list(a=1, b=2) d2 = list(b=3, c=4) # values of first dictionary take precedence: d3 = c(d1, d2) d1 = {'a':1, 'b':2} d2 = {'b':3, 'c':4} d1.update(d2) functions matlabrnumpyjulia define functionfunction add(x, y) x + y end add = function(x, y) {x + y}function add(x,y) x + y end # optional syntax when body is an expression: add(x, y) = x + y invoke function add(3, 7)add(3, 7)add(3, 7) nested functionfunction ret1 = add3(x, y, z) function ret2 = add2(x, y) ret2 = x + y; end ret1 = add2(x, y) + z; end add3 = function(x, y, z) { add2 = function(x, y) { x + y } add2(x, y) + z } function add3(x, y, z) function add2(x2, y2) x2 + y2 end add2(x, y) + z end missing argument behaviorraises error if code with the parameter that is missing an argument is executedraises errorraises MethodError extra argument behavior ignoredraises errorraises MethodError default argumentfunction mylog(x, base=10) log(x) / log(base) end mylog = function(x,base=10) { log(x) / log(base) } variadic functionfunction s = add(varargin) if nargin == 0 s = 0 else r = add(varargin{2:nargin}) s = varargin{1} + r end end add = function (...) { a = list(...) if (length(a) == 0) return(0) s = 0 for(i in 1:length(a)) { s = s + a[[i]] } return(s) } return valuefunction ret = add(x, y) ret = x + y; end % If a return variable is declared, the value assigned % to it is returned. Otherwise the value of the last % statement will be used if it does not end with a % semicolon. return argument or last expression evaluated. NULL if return called without an argument.return argument or last expression evaluated. Void if return called without an argument. multiple return valuesfunction [x, y] = first_two(a) x = a(1); y = a(2); end % sets first to 7; second to 8: [first, second] = first_two([7 8 9]) function first_two(a) a[1], a[2] end x, y = first_two([1, 2, 3]) anonymous function literal% body must be an expression: @(x, y) x + y function(x, y) {x + y}add = (x, y) -> x + y add = function(x, y) x + y end invoke anonymous functionadd(1, 2) closuremake_counter = function() { i = 0 function() { i <<- i + 1 i } } function as value @addaddadd overload operator call operator like function+(3, 7)+(3, 7) execution control matlabrnumpyjulia ifif (x > 0) disp('positive') elseif (x < 0) disp('negative') else disp('zero') end if (x > 0) { print('positive') } else if (x < 0) { print('negative') } else { print('zero') } if x > 0: print('positive') elif x < 0: print('negative') else: print('zero') if x > 0 println("positive") elseif x < 0 println("negative") else println("zero") end whilei = 0 while (i < 10) i = i + 1 disp(i) end while (i < 10) { i = i + 1 print(i) } while i < 10: i += 1 print(i) i = 0 while i < 10 i += 1 println(i) end forfor i = 1:10 disp(i) end for (i in 1:10) { print(i) } for i in range(1,11): print(i) for i = 1:10 println(i) end break/continue break continuebreak nextbreak continuebreak continue raise exception error('%s', 'failed')stop('failed')raise Exception('failed')throw("failed") handle exceptiontry error('failed') catch err disp(err) end tryCatch( stop('failed'), error=function(e) print(message(e))) try: raise Exception('failed') except Exception as e: print(e) file handles matlabrnumpyjulia standard file handles0 1 2 % Octave has predefined variables % containing the above descriptors: stdin stdout stderr stdin() stdout() stderr()sys.stdin sys.stdout sys.stderrSTDIN STDOUT STDERR read line from stdinline = input('', 's')line = readLines(n=1)line = sys.stdin.readline()line = readline() write line to stdoutfprintf(1, 'hello\n')cat("hello\n") writeLines("hello") print('hello')println("hello") write formatted string to stdoutfprintf(1, '%.2f\n', pi)cat(sprintf("%.2f\n", pi))import math print('%.2f' % math.pi) open file for readingf = fopen('/etc/hosts') if (f == -1) error('failed to open file') end f = file("/etc/hosts", "r")f = open('/etc/hosts')f = open("/etc/hosts") open file for writingif ((f = fopen('/tmp/test', 'w') == -1) error('failed to open file') endif f = file("/tmp/test", "w")f = open('/tmp/test', 'w')f = open("/etc/hosts", "w") open file for appendingif ((f = fopen('/tmp/err.log', 'a') == -1) error('failed to open file') endif f = file("/tmp/err.log", "a")f = open('/tmp/err.log', 'a')f = open("/tmp/err.log", "a") close filefclose(f)close(f)f.close()close(f) i/o errorsfopen returns -1; fclose throws an errorraise IOError exception read lineline = fgets(f)line = readLines(f, n=1)line = f.readline()line = readline(f) iterate over file by linewhile(!feof(f)) line = fgets(f) puts(line) endwhile for line in f: print(line) read file into array of stringslines = readLines(f)lines = f.readlines()lines = readlines(f) write stringfputs(f, 'lorem ipsum')cat("lorem ipsum", file=f)f.write('lorem ipsum')write(f, "lorem ipsum") write linefputs(f, 'lorem ipsum\n')writeLines("lorem ipsum", con=f)f.write('lorem ipsum\n') flush file handlefflush(f)flush(f)f.flush() file handle position get, set ftell(f) % 3rd arg can be SEEK_CUR or SEEK_END fseek(f, 0, SEEK_SET) seek(f) # sets seek point to 12 bytes after start; # origin can also be "current" or "end" seek(f, where=0, origin="start") f.tell() f.seek(0) redirect stdout to filesink("foo.txt") write variables to fileA = [1 2; 3 4] B = [4 3; 2 1] save('data.mdata', 'A', 'B') A = matrix(c(1, 3, 2, 4), nrow=2) B = matrix(c(4, 2, 3, 1), nrow=2) save(A, B, file='data.rdata') A = np.matrix([[1, 2], [3, 4]]) B = np.matrix([[4, 3], [2, 1]]) # Data must be of type np.array; # file will have .npz suffix: np.savez('data', A=A, B=B) read variables from file% puts A and B in scope: load('data.mdata') % puts just A in scope: load('data.mdata', 'A') # puts A and B in scope: load('data.rdata') data = np.load('data.npz') A = data['A'] B = data['B'] write all variables in scope to filesave('data.txt')save.image('data.txt') directories matlabrnumpyjulia working directory get, set pwd cd('/tmp') getwd() setwd("/tmp") os.path.abspath('.') os.chdir('/tmp') build pathnamefullfile('/etc', 'hosts')file.path("/etc", "hosts")os.path.join('/etc', 'hosts') dirname and basename[dir, base] = fileparts('/etc/hosts')dirname("/etc/hosts") basename("/etc/hosts") os.path.dirname('/etc/hosts') os.path.basename('/etc/hosts') absolute pathnamenormalizePath("..")os.path.abspath('..') iterate over directory by file% lists /etc: ls('/etc') % lists working directory: ls() # list.files() defaults to working directory for (path in list.files('/etc')) { print(path) } for filename in os.listdir('/etc'): print(filename) glob pathsglob('/etc/*')Sys.glob('/etc/*')import glob glob.glob('/etc/*') processes and environment matlabrnumpyjulia command line arguments% does not include interpreter name: argv() # first arg is name of interpreter: commandArgs() # arguments after --args only: commandArgs(TRUE) sys.argvARGS environment variable get, set getenv('HOME') setenv('PATH', '/bin') Sys.getenv("HOME") Sys.setenv(PATH="/bin") os.getenv('HOME') os.environ['PATH'] = '/bin' ENV["HOME"] ENV["PATH"] = "/bin" exit exit(0)quit(save="no", status=0)sys.exit(0)exit(0) external commandif (shell_cmd('ls -l /tmp')) error('ls failed') endif if (system("ls -l /tmp")) { stop("ls failed") } if os.system('ls -l /tmp'): raise Exception('ls failed') command substitutions = readall(‘ls) libraries and namespaces matlabrnumpyjulia load libraryWhen a function is invoked, MATLAB searches the library path for a file with the same name and a .m suffix. Other functions defined in the file are not visible outside the file.# quoting the name of the package is optional: require("foo") # or: library("foo") # if the package does not exist, require returns false, and library raises an error. import fooinclude("foo.jl") list loaded librariesnonesearch()dir() library search pathpath() addath(’~/foo') rmpath('~/foo') .libPaths()sys.path source file run('foo.m')source("foo.r")none install package% Octave: how to install package % downloaded from Octave-Forge: pkg install foo-1.0.0.tar.gz install.packages("ggplot2")$ pip install scipy
require("foo")
# or:
library("foo")
import foo
list installed packages% Octave:
pkg list
library()
installed.packages()
$pip freeze reflection matlabrnumpyjulia data typeclass(x)class(x) # often more informative: str(x) type(x)typeof(x) attributes% if x holds an object: x attributes(x)[m for m in dir(x) if not callable(getattr(o,m))] fieldnames(x) methods% if x holds an object: methods(x) none; objects are implemented by functions which dispatch based on type of first arg[m for m in dir(x) if callable(getattr(o,m))] methods(x) variables in scopewho() % with size and type: whos() objects() ls() # with type and description: ls.str() dir()whos() undefine variable clear('x')rm(v)del(x)none undefine all variablesclear -arm(list=objects())none eval eval('1 + 1')eval(parse(text='1 + 1'))eval('1 + 1')eval(parse("1 + 1")) function documentationhelp tanhelp(tan) ?tan math.tan.__doc__?tan list library functionsnonels("package:moments")dir(stats)whos(Base) search documentationdocsearch tan??tan$ pydoc -k tanapropos("tan")
tables
rnumpyjulia
construct from column arrayssx = {'F' 'F' 'F' 'M' 'M' 'M'}
ht = [69 64 67 68 72 71]
wt = [148 132 142 149 167 165]
cols = {'sx', 'ht', 'wt'}
people = table(sx', ht', wt', 'VariableNames', cols)
# gender, height, weight of some people
# in inches and lbs:

sx = c("F", "F", "F", "M", "M", "M")
ht = c(69, 64, 67, 68, 72, 71)
wt = c(148, 132, 142, 149, 167, 165)
people = data.frame(sx, ht, wt)
sx = ['F', 'F', 'F', 'F', 'M', 'M']
ht = [69, 64, 67, 66, 72, 70]
wt = [150, 132, 142, 139, 167, 165]
people = pd.DataFrame({'sx': sx, 'ht': ht, 'wt': wt})
construct from row dictionariesrows = [
{'sx': 'F', 'ht': 69, 'wt': 150},
{'sx': 'F', 'ht': 64, 'wt': 132},
{'sx': 'F', 'ht': 67, 'wt': 142},
{'sx': 'F', 'ht': 66, 'wt': 139},
{'sx': 'M', 'ht': 72, 'wt': 167},
{'sx': 'M', 'ht': 70, 'wt': 165}]
people = pd.DataFrame(rows)
sizeheight(people)
width(people)
nrow(people)
ncol(people)

# number of rows and cols in 2-element vector:
dim(people)
len(people)
len(people.columns)
column names as arraypeople.Properties.VariableNamesnames(people)
colnames(people)
returns Index object:
people.columns
access column as arraypeople.ht
people.(2)
# vectors:
people$ht people[,2] people[['ht']] people[[2]] # 1 column data frame: people[2] people['ht'] # if name does not conflict with any DataFrame attributes: people.ht access row as tuplepeople(1,:)# 1 row data frame: people[1, ] # list: as.list(people[1, ]) people.ix[0] access datum% height of 1st person: people(1,2) # height of 1st person: people[1,2] people.get_value(0, 'ht') order rows by columnsortrows(people, 'ht')people[order(people$ht), ]people.sort(['ht'])
order rows by multiple columnssortrows(people, {'sx', 'ht'})people[order(people$sx, people$ht), ]people.sort(['sx', 'ht'])
order rows in descending ordersortrows(people, 'ht', 'descend')people[order(-people$ht), ]people.sort('ht', ascending=[False]) limit rows people(1:3, :)people[seq(1, 3), ]people[0:3] offset rows people(4:6, :)people[seq(4, 6), ]people[3:] reshapepeople$couple = c(1, 2, 3, 1, 2, 3)
reshape(people, idvar="couple", direction="wide",
timevar="sx", v.names=c("ht", "wt"))
remove rows with null fieldssx = c('F', 'F', 'M', 'M')
wt = c(120, NA, 150, 170)

df = data.frame(sx, wt)
df2 = na.omit(df)
attach columns# put columns ht, wt, and sx
# in variable name search path:

attach(people)
sum(ht)

# alternative which doesn't put columns in
# search path:

with(people, sum(ht))
none
detach columns

detach(people)none
spreadsheet editorcan edit data, in which case return value of edit must be saved
people = edit(people)
none
import and export
rnumpyjulia
import tab delimited# first row defines variable names:

# file suffix must be .txt, .dat, or .csv
# first row defines variable names:
# first row defines column names:
import csv

% first row defines variable names:
# first row defines variable names:
# first row defines column names:
'Delimiter', ':',
sep=':',
comment.char='#')
# $grep -v '^#' /etc/passwd > /tmp/passwd df = pd.read_table('/tmp/passwd', sep=':', header=None) set column separator to whitespacedf = read.delim('/path/to.txt', sep='')df = read_table('/path/to.txt', sep='\s+') set quote character# default quote character for both read.csv and read.delim # is double quotes. The quote character is escaped by doubling it. # use single quote as quote character: df = read.csv('/path/to/single-quote.csv', quote="'") # no quote character: df = read.csv('/path/to/no-quote.csv', quote="") Both read_table and read_csv use double quotes as the quote character and there is no way to change it. A double quote can be escaped by doubling it. import file w/o header# column names are V1, V2, … read.delim('/etc/passwd', sep=':', header=FALSE, comment.char='#') #$ grep -v '^#' /etc/passwd > /tmp/passwd
#
# column names are X0, X1, …

df.Properties.VariableNames = {'ht', 'wt', 'age'}
col.names=c('ht', 'wt', 'age'))
names=['ht', 'wt', 'age'])
set column types# possible values: NA, 'logical', 'integer', 'numeric',
# 'complex', 'character', 'raw', 'factor', 'Date',
# 'POSIXct'
#
# If type is set to NA, actual type will be inferred to be
# 'logical', 'integer', 'numeric', 'complex', or 'factor'
#

colClasses=c('integer', 'numeric', 'character'))
colClasses=c('integer', 'logical', 'character'),
na.strings=c('nil'))
na_values=['nil'])
change decimal markdf = read.csv('/path/to.csv', dec=',')
recognize thousands separatornonedf = read_csv('/path/to.csv', thousands='.')
unequal row length behaviorMissing fields will be set to NA unless fill is set to FALSE. If the column is of type character then the fill value is an empty string ''.

If there are extra fields they will be parsed as an extra row unless
flush is set to FALSE
sep=':',
comment.char='#')
none

# rows to skip can be specified individually:

# hierarchical index:
export tab delimitedwrite.table(df, '/tmp/data.tab', sep='\t')
export csv

# first column contains row names unless row.names
# set to FALSE

write.csv(df, '/path/to.csv', row.names=F)
relational algebra
matlabrnumpyjulia
project columns by namepeople(:, {'sx', 'ht'})people[c('sx', 'ht')]people[['sx', 'ht']]
project columns by positionpeople(:, [1 2])people[c(1, 2)]
project expression# convert to cm and kg:
transform(people, ht=2.54*ht, wt=wt/2.2)
project all columnspeople(people.ht > 66, :)people[people$ht > 66, ] rename columnscolnames(people) = c('gender', 'height', 'weight') access sub data frame# data frame of first 3 rows with # ht and wt columns reversed: people[1:3, c(1, 3, 2)] select rowspeople(people.ht > 66, :)subset(people, ht > 66) people[people$ht > 66, ]
people[people['ht'] > 66]
select distinct rowsunique(people(:,{'sx'}))unique(people[c('sx')])
split rows# class(x) is list:
x = split(people, people$sx == 'F') # data.frame only containing females: x$T
sep=':',
comment.char='#',
col.names=c('name', 'passwd', 'uid', 'gid', 'gecos',
'home', 'shell'))

sep=':',
comment.char='#',
col.names=c('name', 'passwd', 'gid', 'members'))

merge(pw, grp, by.x='gid', by.y='gid')
# $grep -v '^#' /etc/passwd > /tmp/passwd #$ grep -v '^#' /etc/group > /tmp/group

pw = pd.read_table('/tmp/passwd', sep=':', header=None, names=['name', 'passwd', 'uid', 'gid', 'gecos', 'home', 'shell'])

pd.merge(pw, grp, left_on='gid', right_on='gid')
nulls as join values
left joinmerge(pw, grp, by.x='gid', by.y='gid', all.x=T)pd.merge(pw, grp, left_on='gid', right_on='gid', how='left')
full joinmerge(pw, grp, by.x='gid', by.y='gid', all=T)pd.merge(pw, grp, left_on='gid', right_on='gid', how='outer')
antijoinpw[!(pw$gid %in% grp$gid), ]
cross joinmerge(pw, grp, by=c())
aggregation
matlabrnumpyjulia
group by columngrouped = people.groupby('sx')
grouped.aggregate(np.max)['ht']
multiple aggregated valuesgrouped = people.groupby('sx')
grouped.aggregate(np.max)[['ht', 'wt']]
group by multiple columns
aggregation functions
nulls and aggregation functions
vectors
matlabrnumpyjulia
vector literal

same as arraysame as arraysame as array
element-wise arithmetic operators+ - .* ./+ - * /+ - * /
result of vector length mismatchraises errorvalues in shorter vector are recycled; warning if one vector is not a multiple length of the otherraises ValueError
scalar multiplication3 * [1, 2, 3]
[1, 2, 3] * 3
3 * c(1, 2, 3)
c(1, 2, 3) * 3
3 * np.array([1, 2, 3])
np.array([1, 2, 3]) * 3
dot productdot([1, 1, 1], [2, 2, 2])c(1, 1, 1) %*% c(2, 2, 2)v1 = np.array([1, 1, 1])
v2 = np.array([2, 2, 2])
np.dot(v1, v2)
cross productcross([1, 0, 0], [0, 1, 0])v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
np.cross(v1, v2)
normsnorm([1, 2, 3], 1)
norm([1, 2, 3], 2)
norm([1, 2, 3], Inf)
vnorm = function(x, t) {
norm(matrix(x, ncol=1), t)
}

vnorm(c(1, 2, 3), "1")
vnorm(c(1, 2, 3), "E")
vnorm(c(1, 2, 3), "I")
v = np.array([1, 2, 3])
np.linalg.norm(v, 1)
np.linalg.norm(v, 2)
np.linalg.norm(v, np.inf)
matrices
matlabrnumpyjulia
literal or constructor% row-major order:
A = [1, 2; 3, 4]
B = [4 3
2 1]
# column-major order:
A = matrix(c(1, 3, 2, 4), 2, 2)
B = matrix(c(4, 2, 3, 1), nrow=2)

# row-major order:
A = matrix(c(1, 2, 3, 4), nrow=2, byrow=T)
# row-major order:
A = np.matrix([[1, 2], [3, 4]])
B = np.matrix([[4, 3], [2, 1]])
constant matrices
all zeros, all ones
zeros(3, 3) or zeros(3)
ones(3, 3) or ones(3)
matrix(0, 3, 3)
matrix(1, 3, 3)
np.matrix(np.ones([3, 3]))
np.matrix(np.zeros([3, 3]))
zeros(Float64, (3, 3))
ones(Float64, (3, 3))
diagonal matrices
and identity
diag([1, 2, 3])
% 3x3 identity:
eye(3)
diag(c(1, 2, 3)
# 3x3 identity:
diag(3)
np.diag([1, 2, 3])
np.identity(3)
dimensionsrows(A)
columns(A)
dim(A)[1]
dim(A)[2]
nrows, ncols = A.shape
element access

A(1, 1)A[1, 1]A[0, 0]
row access

A(1, 1:2)A[1, ]A[0]
column access

A(1:2, 1)A[, 1]A[:, 0]
submatrix accessC = [1, 2, 3; 4, 5, 6; 7, 8, 9]
C(1:2, 1:2)
C = matrix(seq(1, 9), 3, 3, byrow=T)
C[1:2, 1:2]
A = np.matrix(range(1, 10)).reshape(3, 3)
A[:2, :2]
scalar multiplication3 * A
A * 3
also:
3 .* A
A .* 3
3 * A
A * 3
3 * A
A * 3
element-wise operators.+ .- .* ./+ - * /+ - np.multiply() np.divide()
multiplication

A * BA %*% BA * B
powerA ^ 3

% power of each entry:
A .^ 3
A ** 3
kronecker productkron(A, B)kronecker(A, B)np.kron(A, B)
comparisonall(all(A == B))
any(any(A != B))
all(A == B)
any(A != B)
np.all(A == B)
np.any(A != B)
normsnorm(A, 1)
norm(A, 2)
norm(A, Inf)
norm(A, 'fro')
norm(A, "1")
??
norm(A, "I")
norm(A, "F")
transposetranspose(A)
A'
t(A)A.transpose()
conjugate transposeA = [1i, 2i; 3i, 4i]
A'
A = matrix(c(1i, 2i, 3i, 4i), nrow=2, byrow=T)
Conj(t(A))
A = np.matrix([[1j, 2j], [3j, 4j]])
A.conj().transpose()
inverse

inv(A)solve(A)np.linalg.inv(A)
pseudoinverseA = [0 1; 0 0]

pinv(A)
install.packages('corpcor')
library(corpcor)

A = matrix(c(0, 0, 1, 0), nrow=2)
pseudoinverse(A)
A = np.matrix([[0, 1], [0, 0]])

np.linalg.pinv(A)
determinant

det(A)det(A)np.linalg.det(A)
trace

trace(A)sum(diag(A))A.trace()
eigenvalues

eig(A)eigen(A)$valuesnp.linalg.eigvals(A) eigenvectors[evec, eval] = eig(A) % each column of evec is an eigenvector % eval is a diagonal matrix of eigenvalues eigen(A)$vectorsnp.linalg.eig(A)[1]
singular value decompositionX = randn(10)

[u, d, v] = svd(X)
X = matrix(rnorm(100), nrow=10)
result = svd(X)

# singular values:
result$d # matrix of eigenvectors: result$u

# unitary matrix:
result$v np.linalg.svd(np.random.randn(100).reshape(10, 10)) solve system of equationsA \ [2;3]solve(A, c(2, 3))np.linalg.solve(A, [2, 3]) sparse matrices matlabrnumpyjulia sparse matrix construction% 100x100 matrix with 5 at (1, 1) and 4 at (2, 2): X = sparse([1 2], [1 2], [5 4], 100, 100) X = spMatrix(100, 100, c(1, 2), c(1, 2), c(5, 4))import scipy.sparse as sparse row, col, val = [5, 4], [1, 2], [1, 2] X = sparse.coo_matrix((val, (row, col)), shape=(100, 100)) sparse matrix decomposition[rows, cols, vals] = find(X) % just the values: nonzeros(X) sparse identity matrix% 100x100 identity: speye(100) sparse.identity(100) # not square; ones on diagonal: sparse.eye(100, 200) dense matrix to sparse matrix and back X = sparse([1 0 0; 0 0 0; 0 0 0]) X2 = full(X) imoprt scipy.sparse as sparse A = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]]) X = sparse.coo_matrix(A) X2 = X.todense() sparse matrix storage% is storage sparse: issparse(X) % memory allocation in bytes: nzmax(X) % number of nonzero entries: nnz(X) # memory allocation in bytes: object.size(X) import scipy.sparse as sparse sparse.issparse(X) optimization matlabrnumpyjulia linear minimization% minimize c * x subject to the constraint that A * x <= b linprog(c, A, b) linear minimization with bounds on decision variables% also impose constraints Lb <= x <= Ub linprog(c, A, b, [], [], Lb, Ub) linear minimization with equality constraints% also impose constraints Aeq * x == beq lingprog(c, A, b, Aeq, beq) quadratic minimizationquadprog() integer linear minimizationintlinprog() polynomials matlabrnumpyjulia exact polynomial fitx = [1 2 3 4] y = [3 9 2 1] % polynomial coefficient array: p = polyfit(x, y, 3) % plot polynomial: xx = -10:.1:10 yy = polyval(p, xx) plot(xx, yy) exact polynomial fit with derivative values piecewise polynomial fit cubic spline f = spline(1:20, normrnd(0, 1, 1, 20)) x = 1:.1:20 plot(x, ppval(f, x)) f = splinefun(rnorm(20)) x = seq(1, 20, .1) plot(x, f(x), type="l") underdetermined polynomail fit overdetermined polynomial fit multivariate polynomial fit descriptive statistics matlabrnumpyjulia first moment statisticsx = [1 2 3 8 12 19] sum(x) mean(x) x = c(1,2,3,8,12,19) sum(x) mean(x) x = [1,2,3,8,12,19] sp.sum(x) sp.mean(x) x = [1 2 3 8 12 19] sum(x) mean(x) second moment statisticsstd(x, 1) var(x, 1) n = length(x) sd(x) * sqrt((n-1)/n) var(x) * (n-1)/n sp.std(x) sp.var(x) second moment statistics for samplesstd(x) var(x) sd(x) var(x) n = float(len(x)) sp.std(x) * math.sqrt(n/(n-1)) sp.var(x) * n/(n-1) std(x) var(x) skewnessOctave uses sample standard deviation to compute skewness: skewness(x) install.packages('moments') library('moments') skewness(x) stats.skew(x) kurtosisOctave uses sample standard deviation to compute kurtosis: kurtosis(x) install.packages('moments') library('moments') kurtosis(x) - 3 stats.kurtosis(x) nth moment and nth central momentn = 5 moment(x, n) moment(x, n, "c") install.packages('moments') library('moments') n = 5 moment(x, n) moment(x, n, central=T) n = 5 ?? stats.moment(x, n) modemode([1 2 2 2 3 3 4])samp = c(1,2,2,2,3,3,4) names(sort(-table(samp)))[1] stats.mode([1,2,2,2,3,3,4])[0][0] quantile statisticsmin(x) median(x) max(x) iqr(x) quantile(x, .90) min(x) median(x) max(x) IQR(x) quantile(x, prob=.90) min(x) sp.median(x) max(x) ?? stats.scoreatpercentile(x, 90.0) bivariate statistiscs correlation, covariance x = [1 2 3] y = [2 4 7] cor(x, y) cov(x, y) x = c(1,2,3) y = c(2,4,7) cor(x, y) cov(x, y) x = [1,2,3] y = [2,4,7] stats.linregress(x, y)[2] ?? correlation matrixx1 = randn(100, 1) x2 = 0.5 * x1 + randn(100, 1) x3 = 0.1 * x1 + 0.1 * x2 + 0.1 * randn(100, 1) corr([x1 x2 x3]) x1 = rnorm(100) x2 = x1 + 0.5 * rnorm(100) x3 = 0.3 * x1 + 0.1 * 2 + 0.1 * rnorm(100) cor(cbind(x1, x2, x3)) data set to frequency tablex = c(1,2,1,1,2,5,1,2,7) tab = table(x) frequency table to data setrep(as.integer(names(tab)), unname(tab)) binx = c(1.1, 3.7, 8.9, 1.2, 1.9, 4.1) xf = cut(x, breaks=c(0, 3, 6, 9)) # bins are (0, 3], (3, 6], and (6, 9]: bins = tapply(x, xf, length) distributions matlabrnumpyjulia binomial density, cumulative, quantile, sample of 10 binopdf(x, n, p) binocdf(x, n, p) binoinv(y, n, p) binornd(n, p, 1, 10) dbinom(x, n, p) pbinom(x, n, p) qbinom(y, n, p) rbinom(10, n, p) stats.binom.pmf(x, n, p) stats.binom.cdf(x, n, p) stats.binom.ppf(y, n, p) stats.binom.rvs(n, p) poisson density, cumulative, quantile, sample of 10 poisspdf(x, lambda) poisscdf(x, lambda) poissinv(y, lambda) poissrnd(lambda, 1, 10) dpois(x, lambda) ppois(x, lambda) qpois(y, lambda) rpois(10, lambda) stats.poisson.pmf(x, lambda) stats.poisson.cdf(x, lambda) stats.poisson.ppf(y, lambda) stats.poisson.rvs(lambda, size=1) normal density, cumulative, quantile, sample of 10 normpdf(x, mu, sigma) normcdf(x, mu, sigma) norminv(y, mu, sigma) normrnd(mu, sigma, 1, 10) dnorm(x, mu, sigma) pnorm(x, mu, sigma) qnorm(y, mu, sigma) rnorm(10, mu, sigma) stats.norm.pdf(x, mu, sigma) stats.norm.cdf(x, mu, sigma) stats.norm.ppf(y, mu, sigma) stats.norm.rvs(mu, sigma) gamma density, cumulative, quantile, sample of 10 gampdf(x, k, theta) gamcdf(x, k, theta) gaminv(y, k, theta) gamrnd(k, theta, 1, 10) dgamma(x, k, scale=theta) pgamma(x, k, scale=theta) qgamma(y, k, scale=theta) rgamma(10, k, scale=theta) stats.gamma.pdf(x, k, scale=theta) stats.gamma.cdf(x, k, scale=theta) stats.gamma.ppf(y, k, scale=theta) stats.gamma.rvs(k, scale=theta) exponential density, cumulative, quantile, sample of 10 exppdf(x, lambda) expcdf(x, lambda) expinv(y, lambda) exprnd(lambda, 1, 10) dexp(x, lambda) pexp(x, lambda) qexp(y, lambda) rexp(10, lambda) stats.expon.pdf(x, scale=1.0/lambda) stats.expon.cdf(x, scale=1.0/lambda) stats.expon.ppf(x, scale=1.0/lambda) stats.expon.rvs(scale=1.0/lambda) chi-squared density, cumulative, quantile, sample of 10 chi2pdf(x, nu) chi2cdf(x, nu) chi2inv(y, nu) chi2rnd(nu, 1, 10) dchisq(x, nu) pchisq(x, nu) qchisq(y, nu) rchisq(10, nu) stats.chi2.pdf(x, nu) stats.chi2.cdf(x, nu) stats.chi2.ppf(y, nu) stats.chi2.rvs(nu) beta density, cumulative, quantile, sample of 10 betapdf(x, alpha, beta) betacdf(x, alpha, beta) betainvf(y, alpha, beta) betarnd(alpha, beta, 1, 10) dbeta(x, alpha, beta) pbeta(x, alpha, beta) qbeta(y, alpha, beta) rbeta(10, alpha, beta) stats.beta.pdf(x, alpha, beta) stats.beta.cdf(x, alpha, beta) stats.beta.ppf(y, alpha, beta) stats.beta.pvs(alpha, beta) uniform density, cumulative, quantile, sample of 10 unifpdf(x, a, b) unifcdf(x, a, b) unifinv(y, a, b) unifrnd(a, b, 1, 10) dunif(x, a, b) punif(x, a, b) qunif(y, a, b) runif(10, a, b) stats.uniform.pdf(x, a, b) stats.uniform.cdf(x, a, b) stats.uniform.ppf(y, a, b) stats.unifrom.rvs(a, b) Student's t density, cumulative, quantile, sample of 10 tpdf(x, nu) tcdf(x, nu) tinv(y, nu) trnd(nu, 1, 10) dt(x, nu) pt(x, nu) qt(y, nu) rt(10, nu) stats.t.pdf(x, nu) stats.t.cdf(x, nu) stats.t.ppf(y, nu) stats.t.rvs(nu) Snedecor's F density, cumulative, quantile, sample of 10 fpdf(x, d1, d2) fcdf(x, d1, d2) finv(y, d1, d2) frnd(d1, d2, 1, 10) df(x, d1, d2) pf(x, d1, d2) qf(y, d1, d2) rf(10, d1, d2) stats.f.pdf(x, d1, d2) stats.f.cdf(x, d1, d2) stats.f.ppf(y, d1, d2) stats.f.rvs(d1, d2) empirical density function%$ apt-get install octave-econometrics

x = (-3:.05:3)'
y = kernel_density(x, normrnd(0, 1, 100, 1))
dfunc = density(rnorm(100))

dfunc$x dfunc$y
empirical cumulative distributionF is a right-continuous step function:
F = ecdf(rnorm(100))
empirical quantile functionF = ecdf(rnorm(100))
Finv = ecdf(F(seq(0, 1, .01)))
linear regression
matlabrnumpyjulia
simple linear regression
coefficient, intercept, and residuals
x = [1 2 3]
y = [2 4 7]

[lsq, res] = polyfit(x, y, 1)
a = lsq(1)
b = lsq(2)
y - (a*x+b)
x = seq(10)
y = 2 * x + 1 + rnorm(10)

fit = lm(y ~ x)
summary(fit)

# yhat = ax + b:
a = fit$coefficients[2] b = fit$coefficients[1]

# y - (ax + b):
fit$residuals x = np.array([1,2,3]) y = np.array([2,4,7]) lsq = stats.linregress(x, y) a = lsq[0] b = lsq[1] y - (a*x+b) no interceptx = seq(10) y = 2 * x + 1 + rnorm(10) fit = lm(y ~ x + 0) summary(fit) # y = ax: a = fit$coefficients[1]
multiple linear regressionx1 = rnorm(100)
x2 = rnorm(100)
y = 2 * x2 + rnorm(100)

fit = lm(y ~ x1 + x2)
summary(fit)
interactionx1 = rnorm(100)
x2 = rnorm(100)
y = 2 * x1 + x2 + 3 * x1 * x2 + rnorm(100)

# x1, x2, and x1*x2 as predictors:
fit = lm(y ~ x1 * x2)
summary(fit)

# just x1*x2 as predictor:
fit2 = lm(Y ~ x1:x2)
logistic regressiony = round(runif(100))
x1 = round(runif(100))
x2 = y + rnorm(100)

fit = glm(y ~ x1 + x2, family="binomial")
summary(fit)
statistical tests
matlabrnumpyjulia
wilcoxon signed-rank test
variable is symmetric around zero
x = unifrnd(-0.5, 0.5, 100, 1)

% null hypothesis is true:
wilcoxon_test(x, zeros(100, 1))

% alternative hypothesis is true:
wilcoxon_test(x + 1.0, zeros(100, 1))
# null hypothesis is true:
wilcox.test(runif(100) - 0.5)

alternative hypothesis is true:
wilcox.test(runif(100) + 0.5)
stats.wilcoxon()
kruskal-wallis rank sum test
variables have same location parameter
x = unifrnd(0, 1, 200, 1)

% null hypothesis is true:
kruskal_wallis_test(randn(100, 1), randn(200, 1))

% alternative hypothesis is true:
kruskal_wallis_test(randn(100, 1), x)
# null hypothesis is true:
kruskal.test(list(rnorm(100), rnorm(200)))

# alternative hypothesis is true:
kruskal.test(list(rnorm(100), runif(200)))
stats.kruskal()
kolmogorov-smirnov test
variables have same distribution
x = randn(100, 1)
y1 = randn(100, 1)
y2 = unifrnd(-0.5, 0.5, 100, 1)

% null hypothesis is true:
kolmogorov_smirnov_test_2(x, y1)

% alternative hypothesis is true:
kolmogorov_smirnov_test_2(x, y2)
# null hypothesis is true:
ks.test(rnorm(100), rnorm(100))

# alternative hypothesis is true:
ks.test(rnorm(100), runif(100) - 0.5)
stats.ks_2samp()
one-sample t-test
mean of normal variable with unknown variance is zero
x1 = 3 * randn(100, 1)
x2 = 3 * randn(100, 1) + 3

% null hypothesis is true:
t_test(x1, 0)

% alternative hypothesis is true:
t_test(x2, 0)
# null hypothesis is true:
t.test(rnorm(100, 0, 3))

# alternative hypothesis is true:
t.test(rnorm(100, 3, 3))
stats.ttest_1samp()
independent two-sample t-test
two normal variables have same mean
x = randn(100, 1)
y1 = randn(100, 1)
y2 = randn(100, 1) + 1.5

% null hypothesis is true:
t_test_2(x, y1)

% alternative hypothesis is true:
t_test_2(x, y2)
# null hypothesis is true:
t.test(rnorm(100), rnorm(100))

# alternative hypothesis is true:
t.test(rnorm(100), rnorm(100, 3))
stats.ttest_ind()
one-sample binomial test
binomial variable parameter is as given
n = 100
x = rbinom(1, n, 0.5)

# null hypothesis that p=0.5 is true:
binom.test(x, n)

# alternative hypothesis is true:
binom.test(x, n, p=0.3)
stats.binom_test()
two-sample binomial test
parameters of two binomial variables are equal
prop_test_2()n = 100
x1 = rbinom(1, n, 0.5)
x2 = rbinom(1, n, 0.5)

# null hypothesis that p=0.5 is true:
prop.test(c(x1, x2), c(n, n))

y = rbinom(1, n, 0.3)
# alternative hypothesis is true:
prop.test(c(x1, y), c(n, n))
chi-squared test
parameters of multinomial variable are all equal
chisquare_test_independence()fair = floor(6 * runif(100)) + 1
loaded = floor(7 * runif(100)) + 1

# null hypothesis is true:
chisq.test(table(fair))

# alternative hypothesis is true:
stats.chisquare()
poisson test
parameter of poisson variable is as given
# null hypothesis is true:
poisson.test(rpois(1, 100), r=100)

# alternative test is true:
poisson.test(rpois(1, 150), r=100)
F test
ratio of variance of normal variables is as given
var_test()x = rnorm(100)
y = rnorm(100, 0, sd=sqrt(3))

# null hypothesis is true:
var.test(y, x, ratio=3)

# alternative hypothesis is true:
var.test(y, x, ratio=1)
pearson product moment test
normal variables are not correlated
cor_test()x1 = rnorm(100)
x2 = rnorm(100)
y = x2 + rnorm(100)

# null hypothesis is true:
cor.test(y, x1)

# alternative hypothesis is true:
cor.test(y, x2)
stats.pearsonr()
shapiro-wilk test
variable has normal distribution
# null hypothesis is true:
shapiro.test(rnorm(1000))

# alternative hypothesis is true:
shapiro.test(runif(1000))
stats.shapiro()
bartlett's test
two or more normal variables have same variance
bartlett_test()x = rnorm(100)
y1 = rnorm(100)
y2 = 0.1 * rnorm(100)

# null hypothesis is true:
bartlett.test(list(x, y1))

# alternative hypothesis is true:
bartlett.test(list(x, y))
stats.bartlett()
levene's test
two or more variables have same variance
install.packages('reshape', 'car')
library(reshape)
library(car)

x = rnorm(100)
y1 = rnorm(100)
y2 = 0.1 * rnorm(100)

# null hypothesis is true:
df = melt(data.frame(x, y1))
leveneTest(df$value, df$variable)

# alternative hypothesis is true:
df = melt(data.frame(x, y2))
leveneTest(df$value, df$variable)
stats.levene()
one-way anova
two or more normal variables have same mean
x1 = randn(100, 1)
x2 = randn(100, 1)
x3 = randn(100, 1)
x = [x1; x2; x3]
y = [x1; x2; x3 + 0.5]
units = ones(100, 1)
grp = [units; 2 * units; 3 * units]

% null hypothesis is true:
anova(x, grp)

% alternative hypothesis is true:
anova(y, grp)
install.packages('reshape')
library(reshape)

# null hypothesis that all means are the same
# is true:

x1 = rnorm(100)
x2 = rnorm(100)
x3 = rnorm(100)

df = melt(data.frame(x1, x2, x3))
fit = lm(df$value ~ df$variable)
anova(fit)
stats.f_oneway()
time series
matlabrnumpyjulia
time series# first observation time is 1:
y = ts(rnorm(100))

# first observation time is 0:
y2 = ts(rnorm(100), start=0)

plot(y)
# first observation time is 0:
y = pd.Series(randn(100))

# first observation time is 1:
y2 = pd.Series(randn(100), index=range(1,101))

y.plot()
monthly time series# monthly observations 1993-1997:
y = ts(rnorm(60), frequency=12, start=1993)

# monthly observations from Oct 1993:
y2 = ts(rnorm(60), frequency=12, start=c(1993, 10))

plot(y)
dt = pd.datetime(2013, 1, 1)
idx = pd.date_range(dt, periods=60, freq='M')
y = pd.Series(randn(60), index=idx)

dt2 = pd.datetime(2013, 10, 1)
idx2 = pd.date_range(dt2, periods=60, freq='M')
y2 = pd.Series(randn(60), index=idx2)
lookup by timestart = tsp(y2)[1]
end = tsp(y2)[2]
freq = tsp(y2)[3]

# value for Jan 1994:
y2[(1994 - start) * freq + 1]
y2[pd.datetime(2014, 1, 31)]
lookup by position in seriesfor (i in 1:length(y)) {
print(y[i])
}
for i in range(0, len(y)):
y.ix[i]
aligned arithmeticy = ts(rnorm(10), start=0)
y2 = ts(rnorm(10), start=5)

# time series with 5 data points:
y3 = y + y2
y = pd.Series(randn(10))
y2 = pd.Series(randn(10), index=range(5, 15))

# time series with 15 data points; 10 of
# which are NaN:

# y3 = y + y2
lag operatorx = ts(rnorm(100))
y = x + lag(x, 1)
x = pd.Series(randn(100))
y = x + x.shift(-1)
lagged difference

delta = diff(y, lag=1)delta = y.diff(1)
simple moving averageinstall.packages('TTR')
library('TTR')

ma = SMA(y, n=4)

plot(y)
lines(ma, col='red')
y = pd.Series(randn(50))
ma = pd.rolling_mean(y, 4)

plot(y, 'k', ma, 'r')
weighted moving averageinstall.packages('TTR')
library('TTR')

ma = WMA(y, n=4, wts=c(1, 2, 3, 4))

plot(y)
lines(ma, col='red')
exponential smoothingx = rnorm(100)
fit = HoltWinters(x, alpha=0.5, beta=F, gamma=F)

values = fit$fitted plot(fit) alpha = 0.5 span = (2 / alpha) - 1 fit = pd.ewma(y, span=span, adjust=False) fit.plot() exponential smoothing with best least squares fitx = rnorm(100) fit = HoltWinters(x, beta=F, gamma=F) alpha = fit$a
plot(fit)
decompose into seasonal and trendraw = seq(1,100) + rnorm(100) + rep(seq(1,10), 10)
y = ts(raw, frequency=10)

# additive model: t + s + r:
yd = decompose(y)
yd$trend yd$seasonal
yd$random plot(yd) # multiplicative model: t * s * r: yd2 = decompose(y, type="multiplicative") correlogramx = rnorm(100) x2 = append(x[4:100], x[1:3]) acf(x, lag.max=20) acf(x + x2, lag.max=20) test for stationarity arma arima arima with automatic model selection fast fourier transform matlabrnumpyjulia fftx = 3 * sin(1:100) + sin(3 * (1:100)) + randn(1, 100) dft = fft(x) inverse fft shift constant component to center two-dimensional fft n-dimensional fft clustering matlabrnumpyjulia distance matrixpts = [1 1; 1 2; 2 1; 2 3; 3 4; 4 4] % value at (i, j) is distance between i-th % and j-th observation dm = squareform(pdist(pts, 'euclidean')) distance options'euclidean' 'seuclidian' 'cityblock' 'minkowski' 'chebychev' 'mahalanobis' 'cosine' 'correlation' 'spearman' 'hamming' 'jaccard' hierarchical clusters dendogram silhouette plot k-means univariate charts matlabrnumpyjulia vertical bar chartbar([7 3 8 5 5])cnts = c(7,3,8,5,5) names(cnts) = c("a","b","c","d","e") barplot(cnts) x = floor(6*runif(100)) barplot(table(x)) cnts = [7,3,8,5,5] plt.bar(range(0,len(cnts)), cnts) horizontal bar chart barh([7 3 8 5 5])cnts = c(7,3,8,5,5) names(cnts) = c("a","b","c","d","e") barplot(cnts, horiz=T) cnts = [7,3,8,5,5] plt.barh(range(0,len(cnts)), cnts) pie chart labels = {'a','b','c','d','e'} pie([7 3 8 5 5], labels) cnts = c(7,3,8,5,5) names(cnts) = c("a","b","c","d","e") pie(cnts) cnts = [7,3,8,5,5] labs = ['a','b','c','d','e'] plt.pie(cnts, labels=labs) stem-and-leaf plot nonegenerates an ascii chart: stem(20*rnorm(100)) none histogram hist(randn(1, 100), 10)hist(rnorm(100), breaks=10)plt.hist(sp.randn(100), bins=range(-5,5)) box plot boxplot(randn(1, 100)) boxplot([randn(1, 100) exprnd(1, 1, 100) unifrnd(0, 1, 1, 100)]') boxplot(rnorm(100)) boxplot(rnorm(100), rexp(100), runif(100)) plt.boxplot(sp.randn(100)) plt.boxplot([sp.randn(100), np.random.uniform(size=100), np.random.exponential(size=100)]) chart titlebar([7 3 8 5 5]) title('bar chart example') all chart functions except for stem accept a main parameter: boxplot(rnorm(100), main="boxplot example", sub="to illustrate options") plt.boxplot(sp.randn(100)) plt.title('boxplot example') grid of subplots% 3rd arg refers to the subplot; % subplots are numbered in row-major order. for i = 1:4 subplot(2, 2, i), hist(randn(50)) end for (i in split.screen(c(2, 2))) { screen(n=i) hist(rnorm(100)) } for i in [1, 2, 3, 4]: plt.subplot(2, 2, i) plt.hist(sp.randn(100), bins=range(-5,5)) open new plot windowopen new plot figure open new plot hist(rnorm(100)) dev.new() hist(rnorm(100)) close all plot windowsclose allgraphics.off() save plot as pngf = figure hist(randn(100)) print(f, '-dpng', 'histogram.png') png('hist.png') hist(rnorm(100)) dev.off() y = randn(50) plot(y) savefig('line-plot.png') bivariate charts matlabrnumpyjulia stacked bar chart d = [7 1; 3 2; 8 1; 5 3; 5 1] bar(d, 'stacked') d = matrix(c(7,1,3,2,8,1,5,3,5,1), nrow=2) labels = c("a","b","c","d","e") barplot(d,names.arg=labels) a1 = [7,3,8,5,5] a2 = [1,2,1,3,1] plt.bar(range(0,5), a1, color='r') plt.bar(range(0,5), a2, color='b') grouped bar chart d = [7 1; 3 2; 8 1; 5 3; 5 1] bar(d) d = matrix(c(7,1,3,2,8,1,5,3,5,1), nrow=2) labels = c("a","b","c","d","e") barplot(d,names.arg=labels,beside=TRUE) scatter plot plot(randn(1,50),randn(1,50),'+')plot(rnorm(50), rnorm(50))plt.scatter(sp.randn(50), sp.randn(50), marker='x') point types'.': point 'o': circle 'x': x-mark '+': plus '*': star 's': square 'd': diamond 'v': triangle (down) '^': triangle (up) '<': triangle (left) '>': traingle (right) 'p': pentagram 'h': hexagram Integer values for pch parameter: 0: open square 1: open circle 2: open triangle, points up 3: cross 4: x 5: open diamond 6: open triangle, points down 15: solid square 16: solid circle 17: solid triangle, points up 18: solid diamond marker parameter takes these string values: '.': point ',': pixel 'o': circle 'v': triangle_down '^': triangle_up '<': triangle_left '>': triangle_right '1': tri_down '2': tri_up '3': tri_left '4': tri_right '8': octagon 's': square 'p': pentagon '*': star 'h': hexagon1 'H': hexagon2 '+': plus 'x': x 'D': diamond 'd': thin_diamond '|': vline '_': hline hexagonal binning install.packages('hexbin') library('hexbin') plot(hexbin(rnorm(1000), rnorm(1000), xbins=12)) hexbin(randn(1000), randn(1000), gridsize=12) linear regression line x = 0:20 y = 2 * x + rnorm(21)*10 fit = lm(y ~ x) plot(y) lines(x, fit$fitted.values, type='l')
x = range(0,20)
err = sp.randn(20)*10
y = [2*i for i in x] + err

A = np.vstack([x,np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y)[0]

plt.scatter(x, y)
plt.plot(x, [m*i + c for i in x])

polygonal line plot
plot(1:20,randn(1,20))plot(1:20, rnorm(20), type="l")plt.plot(range(0,20), sp.randn(20), '-')
line typesOptional 3rd argument to plot:

'-': solid
':': dotted
'-.': dashdot
'—': dashed
Integer or string values for lty parameter:

0: 'blank'
1: 'solid' (default)
2: 'dashed'
3: 'dotted'
4: 'dotdash'
5: 'longdash'
6: 'twodash'
Optional 3rd argument to plot:

'-': solid line
'--': dashed line
'-.': dash-dot line
':': dotted line
'.': point
',': pixel
'o': circle
'v': triangle_down
'^': triangle_up
'<': triangle_left
'>': triangle_right
'1': tri_down
'2': tri_up
'3': tri_left
'4': tri_right
's': square
'p': pentagon
'*': star
'h': hexagon1
'H': hexagon2
'+': plus
'x': x
'D': diamond
'd': thin_diamond
'|': vline
'_': hline

function plot
fplot(@sin, [-4 4])x = seq(-4, 4, .01)
plot(sin(x), type="l")
x = [i * .01 for i in range(-400, 400)]
plt.plot(x, sin(x), '-')

quantile-quantile plot
qqplot(runif(50), rnorm(50))
lines(c(-9,9), c(-9,9), col="red")
axis labelsplot( 1:20, (1:20) .** 2)
xlabel('x')
ylabel('x squared')
plot(1:20, (1:20)^2,
xlab="x", ylab="x squared")
x = range(0, 20)
plt.plot(x, [i * i for i in x], '-')
plt.xlabel('x')
plt.ylabel('x squared')
axis limitsplot( 1:20, (1:20) .** 2)
% [xmin, xmax, ymin, ymax]:
axis([1 20 -200 500])
plot(1:20, (1:20)^2,
xlim=c(0, 20), ylim=c(-200,500))
x = range(0, 20)
plt.plot(x, [i * i for i in x], '-')
plt.xlim([0, 20])
plt.ylim([-200, 500])
logarithmic y-axissemilogy(x, x .** 2,
x, x .** 3,
x, x .** 4,
x, x .** 5)
x = 0:20
plot(x, x^2, log="y",type="l")
lines(x, x^3, col="blue")
lines(x, x^4, col="green")
lines(x, x^5, col="red")
x = range(0, 20)

for i in [2,3,4,5]:
y.append([j**i for j in x])

for i in [0,1,2,3]:
semilogy(x, y[i])
multivariate charts
matlabrnumpyjulia

plot(1:20, randn(1, 20),
1:20, randn(1, 20))

optional method:
plot(1:20, randn(1, 20))
hold on
plot(1:20, randn(1, 20))
plot(1:20, rnorm(20), type="l")
lines(1:20, rnorm(20), col="red")
colorsUse color letters by themselves for colored lines. Use '.r' for red dots.

'b': blue
'g': green
'r': red
'c': cyan
'm': magenta
'y': yellow
'k': black
'w': white
# Use the col parameter to specify the color of points
# and lines.
#
# The colors() function returns a list of recognized
# names for colors.

plot(rnorm(10), col='red')
plot(rnorm(10), col='#FF0000')
superimposed plots with different y-axis scalesx <- 1:10
y <- rnorm(10)
z <- rnorm(10) * 1000
par(mar = c(5, 4, 4, 4) + 0.3)
plot(x, y, type='l')
par(new=T)
plot(x, z, col='red', type='l', axes=F,
xlab='', ylab='')
axis(side=4, col='red', col.axis='red',
at=pretty(range(z)))
mtext('z', side=4, line=3, col='red')

legend
x = (1:20)
y = x + rnorm(20)
y2 = x - 2 + rnorm(20)

plot(x, y, type="l", col="black")
lines(x, y2, type="l", col="red")
legend('topleft', c('first', 'second'),
lty=c(1,1), lwd=c(2.5, 2.5),
col=c('black', 'red'))

plot(randn(20), randn(20), '.k', randn(20), randn(20), '.r')plot(rnorm(20), rnorm(20))
points(rnorm(20), rnorm(20),
col='red')

stacked area chart
install.packages('ggplot2')
library('ggplot2')

x = rep(0:4, each=3)
y = round(5 * runif(15))
letter = rep(LETTERS[1:3], 5)
df = data.frame(x, y, letter)

p = ggplot(df, aes(x=x, y=y,
group=letter,
fill=letter))
p + geom_area(position='stack')

overlapping area chart
install.packages('ggplot2')
library('ggplot2')

x = rep(0:4, each=3)
y = round(5 * runif(15))
letter = rep(LETTERS[1:3], 5)
df = data.frame(x, y, letter)
alpha = rep(I(2/10), each=15)

p = ggplot(df, aes(x=x, ymin=0, ymax=y,
group=letter, fill=letter,
alpha=alpha))
p + geom_ribbon()

3d scatter plot
install.packages('scatterplot3d')
library('scatterplot3d')

scatterplot3d(rnorm(50), rnorm(50),
rnorm(50), type="h")

bubble chart
install.packages('ggplot2')
library('ggplot2')

df = data.frame(x=rnorm(20),
y=rnorm(20), z=rnorm(20))

p = ggplot(df, aes(x=x, y=y, size=z))
p + geom_point()

scatter plot matrix
x = rnorm(20)
y = rnorm(20)
z = x + 3*y
w = y + 0.1*rnorm(20)
df = data.frame(x, y, z, w)

pairs(df)

contour plot
m = matrix(0, 100, 100)
for (i in 2:100) {
for (j in 2:100) {
m[i,j] = (m[i-1,j] + m[i,j-1])/2 +
runif(1) - 0.5
}
}

filled.contour(1:100, 1:100, m)
________________________________________________________________________________________________________________________________________________________________________________________________________

# General

## version used

The version of software used to check the examples in the reference sheet.

## show version

How to determine the version of an installation.

## implicit prologue

Code which examples in the sheet assume to have already been executed.

r:

The ggplot2 library must be installed and loaded to use the plotting functions qplot and ggplot.

# Grammar and Invocation

## interpreter

How to invoke the interpreter on a script.

## repl

How to launch a command line read-eval-print loop for the language.

r:

R installations come with a GUI REPL.

The shell zsh has a built-in command r which re-runs the last command. Shell built-ins take precedence over external commands, but one can invoke the R REPL with:

# Distributions

A distribution density function f(x) is a non-negative function which, when integrated over its entire domain is equal to one. The distributions described in this sheet have as their domain the real numbers. The support of a distribution is the part of the domain on which the density function is non-zero.

A distribution density function can be used to describe the values one is likely to see when drawing an example from a population. Values in areas where the density function is large are more likely than values in areas where the density function is small. Values where there density function is zero do not occur. Thus it can be useful to plot the density function.

To derive probabilities from a density function one must integrate or use the associated cumulative density function

\begin{align} F(x) = \int_{-\infty}^x f(t) dt \end{align}

which gives the probability of seeing a value less than or equal to x. As probabilities are non-negative and no greater than one, F is a function from (-, ) to [0,1]. The inverse of F is called the inverse cumulative distribution function or the quantile function for the distribution.

For each distribution statistical software will generally provide four functions: the density, the cumulative distribution, the quantile, and a function which returns random numbers in frequencies that match the distribution. If the software does not provide a random number generating function for the distribution, the quantile function can be composed with the built-in random number generator that most languages have as long as it returns uniformly distributed floats from the interval [0, 1].

 density probability density probability mass cumulative density cumulative distribution distribution inverse cumulative density inverse cumulative distribution quantile percentile percent point random variate

Discrete distributions such as the binomial and the poisson do not have density functions in the normal sense. Instead they have probability mass functions which assign probabilities which sum up to one to the integers. In R warnings will be given if non integer values are provided to the mass functions dbinom and dpoiss.

The cumulative distribution function of a discrete distribution can still be defined on the reals. Such a function is constant except at the integers where it may have jump discontinuities.

Most well known distributions are in fact parametrized families of distributions. This table lists some of them with their parameters and properties.

The information entropy of a continuous distribution with density f(x) is defined as:

\begin{align} -\int_\mathbb{R} f(x) \; \log(f(x)) \; dx \end{align}

In Bayesian analysis the distribution with the greatest entropy, subject to the known facts about the distribution, is called the maximum entropy probability distribution. It is considered the best distribution for modeling the current state of knowledge.

## binomial

The probability mass, cumulative distribution, quantile, and random number generating functions for the binomial distribution.

The binomial distribution is a discrete distribution. It models the number of successful trails when n is the number of trials and p is the chance of success for each trial. An example is the number of heads when flipping a coin 100 times. If the coin is fair then p is 0.50.

numpy:

Random numbers in a binomial distribution can also be generated with:

np.random.binomial(n, p)


## poisson

The probability mass, cumulative distribution, quantile, and random number generating functions for the Poisson distribution.

The Poisson distribution is a discrete distribution. It is described by a parameter lam which is the mean value for the distribution. The Poisson distribution is used to model events which happen at a specified average rate and independently of each other. Under these circumstances the time between successive events will be described by an exponential distribution and the events are said to be described by a poisson process.

numpy:

Random numbers in a Poisson distribution can also be generated with:

np.random.poisson(lam, size=1)


## normal

The probability density, cumulative distribution, quantile, and random number generating functions for the normal distribution.

The parameters are the mean μ and the standard deviation σ. The standard normal distribution has μ of 0 and σ of 1.

The normal distribution is the maximum entropy distribution for a given mean and variance. According to the central limit theorem, if {X1, …, Xn} are any independent and identically distributed random variables with mean μ and variance σ2, then Sn := Σ Xi / n converges to a normal distribution with mean μ and variance σ2/n.

numpy:

Random numbers in a normal distribution can also be generated with:

np.random.randn()


## gamma

The probability density, cumulative distribution, quantile, and random number generating functions for the gamma distribution.

The parameter k is called the shape parameter and θ is called the scale parameter. The rate of the distribution is β = 1/θ.

If Xi are n independent random variables with Γ(ki, θ) distribution, then Σ Xi has distribution Γ(Σ ki, θ).

If X has Γ(k, θ) distribution, then αX has Γ(k, αθ) distribution.

## exponential

The probability density, cumulative distribution, quantile, and random number generating functions for the exponential distribution.

## chi-squared

The probability density, cumulative distribution, quantile, and random number generating functions for the chi-squared distribution.

## beta

The probability density, cumulative distribution, quantile, and random number generating functions for the beta distribution.

## uniform

The probability density, cumulative distribution, quantile, and random number generating functions for the uniform distribution.

The uniform distribution is described by the parameters a and b which delimit the interval on which the density function is nonzero.

The uniform distribution the is maximum entropy probability distribution with support [a, b].

Consider the uniform distribution on [0, b]. Suppose that we take k samples from it, and m is the largest of the samples. The minimum variance unbiased estimator for b is

\begin{align} \frac{k+1}{k}m \end{align}

octave, r, numpy:

a and b are optional parameters and default to 0 and 1 respectively.

## Student's t

The probability density, cumulative distribution, quantile, and random number generating functions for Student's t distribution.

## Snedecor's F

The probability density, cumulative distribution, quantile, and random number generating functions for Snedecor's F distribution.

## empirical density function

How to construct a density function from a sample.

# Linear Regression

## simple linear regression

How to get the slope a and intercept b for a line which best approximates the data. How to get the residuals.

If there are more than two data points, then the system is overdetermined and in general there is no solution for the slope and the intercept. Linear regression looks for line that fits the points as best as possible. The least squares solution is the line that minimizes the sum of the square of the distances of the points from the line.

The residuals are the difference between the actual values of y and the calculated values using ax + b. The norm of the residuals can be used as a measure of the goodness of fit.

# Statistical Tests

A selection of statistical tests. For each test the null hypothesis of the test is stated in the left column.

In a null hypothesis test one considers the p-value, which is the chance of getting data which is as or more extreme than the observed data if the null hypothesis is true. The null hypothesis is usually a supposition that the data is drawn from a distribution with certain parameters.

The extremeness of the data is determined by comparing the expected value of a parameter according to the null hypothesis to the estimated value from the data. Usually the parameter is a mean or variance. In a one-tailed test the p-value is the chance the difference is greater than the observed amount; in a two-tailed test the p-value is the chance the absolute value of the difference is greater than the observed amount.

Octave and MATLAB have different names for the statistical test functions. The sheet shows the Octave functions; the corresponding MATLAB functions are:

octavematlab
wilcoxon_testranksum
kruskal_wallis_testkruskalwallis
kolmogorov_smirnov_testkstest
kolmogorov_smirnov_test_2kstest2
t_testttest
t_test_2ttest2

## wilcoxon signed-rank test

matlab

wilcoxon_test() is an Octave function. The MATLAB function is ranksum().

## kolmogorov-smirnov test

Test whether two samples are drawn from the same distribution.

matlab:

kolmogorov_smirnov_test_2() and kolmogorov_smirnov_test() are Octave functions. The corresponding MATLAB functions are kstest2() and kstest().

kolmogorov_smirnov_test() is a one sample test; it tests whether a sample is drawn from one of the standard continuous distributions. A one sample KS test gives a repeatable p-value; generating a sample and using a two sample KS test does not.

x = randn(100, 1)

% null hypothesis is true:
kolmogorov_smirnov_test(x, "norm", 0, 1)

% alternative hypothesis is true:
kolmogorov_smirnov_test(x, "unif", -0.5, 0.5)


r:

## independent two-sample t-test

Test whether two normal variables have same mean.

r:

If the normal variables are known to have the same variance, the variance can be pooled to estimate standard error:

t.test(x, y, var.equal=T)


If the variance cannot be pooled, then Welch's t-test is used. This uses a lower (often non-integral) degrees-of-freedom value, which in turn results in a higher p-value.

## bartlett's test

A test whether variables are drawn from normal distributions with the same variance.

## levene's test

A test whether variables are drawn from distributions with the same variance.

# Time Series

A time series is a sequence of data points collected repeatedly on a uniform time interval.

A time series can be represented by a dictionary which maps timestamps to the type of the data points. A more efficient implementation exploits the fact that the time interval is uniform and stores the data points in an array. To recover the timestamps of the data points, the timestamp of the first data point and the length of the time interval are also stored.

## time series

How to create a time series from an array.

## monthly time series

How to create a time series with one data point per month.

## lookup by time

How to get to a data point in a time series by when the data point was collected.

## lookup by position in series

How to get a data point in a time series by its ordinal position.

# Univariate Charts

## vertical bar chart

A chart in which numerical values are represented by horizontal bars. The bars are aligned at the bottom.

r:

How to produce a bar chart using ggplot2:

cnts = c(7,3,8,5,5)
names = c("a","b","c","d","e")
df = data.frame(names, cnts)
qplot(names, data=df, geom="bar", weight=cnts)


## horizontal bar chart

A bar chart with horizontal bars which are aligned on the left.

## pie chart

A pie chart displays values using the areas of circular sectors or equivalently the lengths of the arcs of those sectors.

A pie chart implies that the values are percentages of a whole.

## stem-and-leaf plot

A stem-and-leaf plot is a concise way of storing a small set of numbers which makes their distribution visually evident.

The original set of numbers can be recovered with some loss of accuracy by appending the number on the left with each of the digits on the right. In the example below the original data set contained -43, -42, -41, -39, -38, -35, …, 35, 44, 46, 50, 58.

> stem(20*rnorm(100))

The decimal point is 1 digit(s) to the right of the |

-4 | 321
-2 | 98544054310
-0 | 8864333111009998776444332222110
0 | 0001122333333466667778899122334555666789
2 | 00023669025
4 | 4608


## histogram

A histogram is a bar chart where each bar represents a range of values that the data points can fall in. The data is tabulated to find out how often data points fall in each of the bins and in the final chart the length of the bars corresponds to the frequency.

A common method for choosing the number of bins using the number of data points is Sturges' formula:

\begin{align} \lceil \log_2{x} + 1 \rceil \end{align}

r:

How to make a histogram with the ggplot2 library:

qplot(rnorm(50), geom="histogram", binwidth=binwidth)
binwidth = (max(x)-min(x))/10
qplot(rnorm(50), geom="histogram", binwidth=binwidth)


## box plot

Also called a box-and-whisker plot.

The box shows the locations of the 1st quartile, median, and 3rd quartile. These are the same as the 25th percentile, 50th percentile, and 75th percentile.

The whiskers are sometimes used to show the maximum and minimum values of the data set. Outliers are sometimes shown explicitly with dots, in which case all remaining data points occur inside the whiskers.

r:

How to create a box plot with ggplot2:

qplot(x="rnorm", y=rnorm(50), geom="boxplot")

qplot(x=c("rnorm", "rexp", "runif"), y=c(rnorm(50), rexp(50), runif(50)), geom="boxplot")


## chart title

How to set the chart title.

r:

The qplot commands supports the main options for setting the title:

qplot(x="rnorm", y=rnorm(50), geom="boxplot", main="boxplot example")


# Bivariate Charts

## stacked bar chart

Two or more data sets with a common set of labels can be charted with a stacked bar chart. This makes the sum of the data sets for each label readily apparent.

## grouped bar chart

Optionally data sets with a common set of labels can be charted with a grouped bar chart which clusters the bars for each label. The grouped bar chart makes it easier to perform comparisons between labels for each data set.

## scatter plot

A scatter plot can be used to determine if two variables are correlated.

r:

How to make a scatter plot with ggplot:

x = rnorm(50)
y = rnorm(50)
p = ggplot(data.frame(x, y), aes(x, y))
p = p + layer(geom="point")
p


## hexagonal binning

A hexagonal binning is the two-dimensional analog of a histogram. The number of data points in each hexagon is tabulated, and then color or grayscale is used to show the frequency.

A hexagonal binning is superior to a scatter-plot when the number of data points is high because most scatter-plot software doesn't indicate when points are occur on top of each other.

## linear regression line

How to plot a line determined by linear regression on top of a scatter plot.

## polygonal line plot

How to connect the dots of a data set with a polygonal line.

## function plot

How to plot a function.

## quantile-quantile plot

Also called a Q-Q plot.

A quantile-quantile plot is a scatter plot created from two data sets. Each point depicts the quantile of the first data set with its x position and the corresponding quantile of the second data set with its y position.

If the data sets are drawn from the same distribution then most of the points should be close to the line y = x. If the data sets are drawn from distributions which have a linear relation then the Q-Q plot should also be close to linear.

If the two data sets have the same number of elements, one can simply sort them and create the scatterplot.

If the number of elements is different, one a set of quantiles (such as percentiles) for each set. The quantile function of MATLAB and R is convenient for this. With Python, one can use scipy.stats.scoreatpercentile.

## axis labels

How to label the x and y axes.

r:

How to label the axes with ggplot2:

x = rnorm(20)
y = x^2

p = ggplot(data.frame(x, y), aes(x, y))
p + layer(geom="point") + xlab('x') + ylab('x squared')


## axis limits

How to manually set the range of values displayed by an axis.

# Multivariate Charts

How to add another line to a plot.

## colors

How to set the color of points and lines.

## superimposed plots with different y-axis scales

How to superimpose two plots with different y-axis scales.

To minimize the risk that the reader will read off an incorrect y-value for a data point, the example uses the same color for the y-axis as it does for the corresponding data set.

## legend

How to put a legend on a chart.

r:

These strings can be used as the first argument to control the legend position:

• "bottomright"
• "bottom"
• "bottomleft"
• "left"
• "topleft"
• "top"
• "topright"
• "right"
• "center"

The named parameter lwd is the line width. It is roughly the width in pixels, though the exact interpretation is device specific.

The named parameter lty specifies the line type. The value can be either an integer or a string:

numberstring
0'blank'
1'solid'
2'dashed'
3'dotted'
4'dotdash'
5'longdash'
6'twodash'

# MATLAB

The basic data type of MATLAB is a matrix of floats. There is no distinction between a scalar and a 1x1 matrix, and functions that work on scalars typically work on matrices as well by performing the scalar function on each entry in the matrix and returning the results in a matrix with the same dimensions. Operators such as the logical operators ('&' '|' '!'), relational operators ('==', '!=', '<', '>'), and arithmetic operators ('+', '-') all work this way. However the multiplication '*' and division '/' operators perform matrix multiplication and matrix division, respectively. The .* and ./ operators are available if entry-wise multiplication or division is desired.

Floats are by default double precision; single precision can be specified with the single constructor. MATLAB has convenient matrix literal notation: commas or spaces can be used to separate row entries, and semicolons or newlines can be used to separate rows.

Arrays and vectors are implemented as single-row (1xn) matrices. As a result an n-element vector must be transposed before it can be multiplied on the right of a mxn matrix.

Numeric literals that lack a decimal point such as 17 and -34 create floats, in contrast to most other programming languages. To create an integer, an integer constructor which specifies the size such as int8 and uint16 must be used. Matrices of integers are supported, but the entries in a given matrix must all have the same numeric type.

Strings are implemented as single-row (1xn) matrices of characters. Matrices cannot contain strings. If a string is put in matrix literal, each character in the string becomes an entry in the resulting matrix. This is consistent with how matrices are treated if they are nested inside another matrix. The following literals all yield the same string or 1xn matrix of characters:

'foo'
[ 'f' 'o' 'o' ]
[ 'foo' ]
[ [ 'f' 'o' 'o' ] ]
`

true and false are functions which return matrices of ones and zeros. The ones and zeros have type logical instead of double, which is created by the literals 1 and 0. Other than having a different class, the 0 and 1 of type logical behave the same as the 0 and 1 of type double.

MATLAB has a tuple type (in MATLAB terminology, a cell array) which can be used to hold multiple strings. It can also hold values with different types.

# R

The primitive data types of R are vectors of floats, vectors of strings, and vectors of booleans. There is no distinction between a scalar and a vector with one entry in it. Functions and operators which accept a scalar argument will typically accept a vector argument, returning a vector of the same size with the scalar operation performed on each the entries of the original vector.

The scalars in a vector must all be of the same type, but R also provides a list data type which can be used as a tuple (entries accessed by index), record (entries accessed by name), or even as a dictionary.

In addition R provides a data frame type which is a list (in R terminology) of vectors all of the same length. Data frames are equivalent to the data sets of other statistical analysis packages.

# NumPy

NumPy is a Python library which provides a data type called array. It differs from the Python list data type in the following ways:

• N-dimensional. Although the list type can be nested to hold higher dimension data, the array can hold higher dimension data in a space efficient manner without using indirection.
• homogeneous. The elements of an array are restricted to be of a specified type. The NumPy library introduces new primitive types not available in vanilla Python. However, the element type of an array can be object which permits storing anything in the array.

In the reference sheet the array section covers the vanilla Python list and the multidimensional array section covers the NumPy array.

List the NumPy primitive types

SciPy, Matplotlib, and Pandas are libraries which depend on Numpy.

# Julia

http://julialang.org/