a side-by-side reference sheet
grammar and invocation | variables and expressions | arithmetic and logic | strings | regexes | dates and time | arrays | dictionaries | tables | relational algebra | aggregation | functions | execution control | files | libraries and namespaces | reflection
sql | awk | pig | |
---|---|---|---|
version used | PostgreSQL 9.0 | 20070501 | 0.9 |
show version | > SELECT version(); | $ awk --version | $ pig --version |
grammar and invocation | |||
sql | awk | pig | |
interpreter | $ psql -f foo.sql | $ awk -f foo.awk bar.txt | $ pig -f foo.pig |
repl | $ psql | none | $ pig -x local |
input data format | multiple tables defined by create table statements | single field and record delimited file. By default fields are delimited by whitespace and records by newlines | multiple files. Default loading function is PigStorage and default delimiter is tab |
statement separator | ; | ; or newline | ; and newline when using REPL |
block delimiters | none in SQL; PL/SQL uses keywords to delimit blocks | { } | |
end-of-line comment | -- comment | # comment | -- comment |
multiple line comment | none | none | /* comment another comment */ |
variables and expressions | |||
sql | awk | pig | |
case sensitive? | no | yes | functions and aliases are case sensitive; commands and operators are not |
quoted identifier | CREATE TABLE "select" ( "foo bar" INT ); MySQL: CREATE TABLE `select` ( `foo bar` INT ); | none | none |
null | NULL | "" | null |
null test | foo IS NULL | foo == "" | in filter clause: is null |
coalesce | coalesce(foo, 0) | foo == "" ? 0 : foo | foo is null ? 0 : foo |
nullif | nullif(foo, -999) | ||
conditional expression | CASE WHEN x > 0 THEN x ELSE -x END | x > 0 ? x : -x | x > 0 ? x : -x |
arithmetic and logic | |||
sql | awk | pig | |
boolean type | BOOLEAN | values are untyped | none |
true and false | TRUE FALSE | 1 0 | none |
falsehoods | FALSE the predicate of a where clause evaluates as false if it contains a null | 0 "" | none; the first operand of a conditional expression must be a comparison operator expression |
logical operators | AND OR NOT | && || ! | in filter clause: and or not |
logical expression | |||
relational operators | = != > < >= <= | == != > < >= <= | == != > < >= <= |
integer type | smallint 2 bytes int 4 bytes bigint 8 bytes | variables are untyped | int 4 bytes long 8 bytes |
float type | float 4 bytes double 8 bytes | variables are untyped | float 4 bytes double 8 bytes |
fixed type | numeric(precision, scale) | variables are untyped | none |
arithmetic operators addition, subtraction, multiplication, division, remainder | + - * / % | + - * / % | + - * / % |
integer division | 13 / 5 | int(13 / 5) | 13 / 5 |
integer division by zero | error | error | null |
float division | cast(13 as float) / 5 | 13 / 5 | 1.0 * 13 / 5 |
float division by zero | error | error | null |
power | 2 ^ 32 | 2 ^ 32 2 ** 32 | none |
sqrt | sqrt(2) | sqrt(2) | SQRT(2) |
sqrt -1 | error | generates a warning: nan | NaN |
transcendental functions | exp ln sin cos tan asin acos atan atan2 | exp log sin cos none none none none atan2 | EXP LOG SIN COS TAN ASIN ACOS ATAN none |
float truncation | cast(2.7 as int) round(2.7) ceil(2.7) floor(2.7) | int(2.7) | (int)2.7 ROUND(2.7) CEIL(2.7) FLOOR(2.7) |
absolute value | abs(-2.7) | none | ABS(-2.7) |
random number | random() | rand() | RANDOM() |
bit operators | << >> & | # ~ | gawk: lshift(val, cnt) rshift(val, cnt) and(val1, val2) or(val1, val2) xor(val1, val2) compl(val) | none |
strings | |||
sql | awk | pig | |
types | text varchar(n) char(n) | variables are untyped | chararray bytearray |
literal | 'don''t say "no"' PostgreSQL escape literal: E'don\'t say "no"' | "don't say \"no\"" | 'don\'t say "no"' |
length | length('lorem ipsum') | length("lorem ipsum") | SIZE('lorem ipsum') |
escapes | no backslash escapes in SQL standard string literals in PostgreSQL escape literals: \b \f \n \r \t \ooo \xhh \uhhhh \uhhhhhhhh | \\ \" \a \b \f \n \r \t \v \ooo \xhh | \n \t \uhhhh |
concatenation | 'Hello' || ', ' || 'World!' | ("Hello" ", " "World!") | CONCAT(CONCAT('Hello', ', '), 'World!') |
split | regexp_split_to_array( 'do re mi', ' ' ) | split("do re mi", a, " ") | STRSPLIT('do re mi', ' ') |
case manipulation | upper('lorem') lower('LOREM') initcap('lorem') | toupper("lorem") tolower("LOREM") none | UPPER('lorem') LOWER('LOREM') UCFIRST('lorem') |
strip | trim(' lorem ') ltrim(' lorem') rtrim('lorem ') | none | TRIM(' lorem ') |
index of substring | index starts from 1; returns 0 if not found: strpos('lorem ipsum', 'ipsum') | index starts from 1; returns 0 if not found: index("lorem ipsum", "ipsum") | index starts from 0; returns -1 if not found: INDEXOF('lorem ipsum', 'ipsum', 0) |
extract substring | substr('lorem ipsum', 7, 5) | substr("lorem ipsum", 7, 5) | SUBSTRING('lorem ipsum', 6, 11) |
sprintf | select format('%s %s %s', 'foo', 7, 13.2); | sprintf("%s %d %f", "foo", 7, 13.2) | REGISTER /PATH/TO/piggybank-0.3-amzn.jar; DEFINE FORMAT org.apache.pig.piggybank. evaluation.string.FORMAT(); foo = FOREACH foo GENERATE FORMAT('%s %d %f', 'foo', 7, 13.2); |
regular expressions | |||
sql | awk | pig | |
match | select * from pwt where name similar to 'r[a-z]+'; | matching inside pattern: $1 ~ /^r[a-z]+/ { print $0 } matching inside action: { if (match($1, /^r[a-z]+$/)) print $0 } | root_pwf = filter pwf by SIZE(REGEX_EXTRACT(name, '^(root)$',1)) > 0; |
substitute | select regexp_replace('foo bar', 'bar$', 'baz'); | s = "foo bar" sub(/bar$/, "baz", s) | |
extract subgroup | select (regexp_matches('foobar', '(f..)bar'))[1]; | gawk: match("foobar", /(f..)bar/, a) subgroup in a[1] | properties = FOREACH urls GENERATE FLATTEN(EXTRACT($0, '^https?://([^/]+)')) as host:chararray; |
dates and time | |||
sql | awk | pig | |
current date and time | now() | gawk: systime() | REGISTER /PATH/TO/piggybank-0.3-amzn.jar; DEFINE DATE_TIME org.apache.pig.piggybank. evaluation.datetime.DATE_TIME(); bar = FOREACH foo GENERATE DATE_TIME('-00:00'); |
datetime to string | to_char(now(), 'YYYY-MM-DD HH24:MI:SS') | gawk: strftime("%Y-%m-%d %H:%M:%S", systime()) | REGISTER /PATH/TO/piggybank-0.3-amzn.jar; DEFINE FORMAT_DT org.apache.pig.piggybank. evaluation.datetime.FORMAT_DT(); baz = FOREACH bar GENERATE FORMAT_DT('yyyy-MM-dd HH::mm:ss', $0); |
string to datetime | to_timestamp('2011-09-26 00:00:47', 'YYYY-MM-DD HH24:MI:SS') | gawk: mktime("2011 09 26 00 00 47") | |
arrays | |||
sql | awk | pig | |
literal | PostgreSQL: create temp table foo ( a int[] ); insert into foo values ( '{1,2,3}' ); | none | one_row = LOAD '/tmp/one_line.txt'; foo = FOREACH one_row GENERATE (1,2,3); |
size | PostgreSQL: select array_upper(a, 1) from foo; | none, but split function returns length | bar = FOREACH foo GENERATE SIZE(a); |
lookup | PostgreSQL: select a[1] from foo; | a[0] | bar = FOREACH foo GENERATE a.$0; |
update | |||
iteration | | for (i in a) print i, a[i] | |
membership | |||
dictionaries | |||
sql | awk | pig | |
literal | none | none | ['t'#1, 'f'#0] |
tables | |||
sql | awk | pig | |
order rows by column | ordering depends on type of column select name from pwt order by name; | ordering is lexical unless -n flag is used $ sort -k1 -t: /etc/passwd | names = foreach pwf generate name; ordered_names = order names by name; |
order rows by multiple columns | select group, name from pwt order by group, name; | $ sort -t: -k4,4 -k1,1 /etc/passwd | ordered = order pwf by gid, name; |
order rows in descending order | select name from pwt order by name desc; | $ sort -k1 -r -t: /etc/passwd | ordered_names = order names by name desc; |
limit | select name from pwt order by name limit 10; | NR <= 10 | first_ten = limit ordered_names 10; |
offset | select name from pwt order by name limit 10 offset 10; | ||
relational algebra | |||
sql | awk | pig | |
project columns by name | select name, pw from pwt; | none | name_pw = foreach pwf generate name, pw; |
project columns by position | none | { print $1, $2 } | name_pw = foreach pwf generate $0, $1; |
project expression | |||
project all columns | select * from pwt; | # prints input line: { print $0 } | pwf2 = foreach pwf generate *; |
rename columns | select uid as userid, gid as groupid from pwt; | none | usergroups = foreach pwf generate uid as userid, gid as groupid; |
select rows | select * from pwt where name = 'root'; | $1 == "root" { print $0 } | pwf3 = filter pwf by name == 'root'; |
select distinct rows | select distinct gid from pwt; | $ cat > gid.awk BEGIN { FS=":" } !/^#/ { print $4 } $ awk -f gid.awk /etc/passwd > /tmp/x $ sort -u /tmp/x | gids1 = foreach pwf generate gid; gids2 = distinct gids1; |
split rows | none | none | split pwf into rootpwf if name == 'root', otherpwf if name != 'root'; |
inner join | create temp table gt ( name text, pw text, gid int, members text ); copy gt from '/etc/group' with delimiter ':'; select * from pwt join gt on pwt.gid = gt.gid; | $ awk '!/^#/' /etc/passwd > /tmp/x $ sort -k4,4 -t: < /tmp/x > /tmp/pw $ awk '!/^#/' /etc/group > /tmp/y $ sort -k3,3 -t: < /tmp/y > /tmp/g $ join -t: -14 -23 /tmp/pw /tmp/g | gf = load '/etc/group' using PigStorage(':') as (name:chararray, pw:chararray, gid:int, members:chararray); pwgf = join pwf by gid, gf by gid; |
nulls as join values | input tuples with nulls for join values are omitted from inner joins | no null value; empty strings are joinable values | input tuples with nulls for join values are omitted from inner joins |
self join | select pwt1.name, pwt2.name from pwt as pwt1, pwt as pwt2 where pwt1.gid = pwt2.gid; | $ join -t: -14 -24 /tmp/pw /tmp/pw | pwf2 = foreach pwf generate *; joined_by_gid = join pwf by gid, pwf2 by gid; name_pairs = foreach joined_by_gid generate pwf::name, pwf2::name; |
left join | select * from customers c left join orders o on c.id = o.customer_id; | $ join -t: -a1 -11 -22 /tmp/c /tmp/o | j = join customers by id left, orders by customer_id; |
full join | select * from customers c full join orders o on c.id = o.customer_id; | $ join -t: -a1 -a2 -11 -22 /tmp/c /tmp/o | j = join customers by id full, orders by customer_id; |
antijoin | |||
cross join | create table files ( file text ); insert into files values ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'); create table ranks ( rank int ); insert into ranks values (1), (2), (3), (4), (5), (6), (7), (8); create table chessboard ( file text, rank int ); insert into chessboard select * from files, ranks; | specify nonexistent join fields: $ join -12 -22 /tmp/f /tmp/r | files = load '/tmp/files.txt' as (file:chararray); ranks = load '/tmp/ranks.txt' as (rank:int); chessboard = cross files, ranks; |
aggregation | |||
sql | awk | pig | |
group by | select gid, count(*) from pwt group by gid; | BEGIN { FS=":" } !/^#/ { cnt[$4]++ } END { for (i in cnt) print i, cnt[i] } | by_group = group pwf by gid; cnts = foreach by_group generate $0, COUNT($1); |
group by multiple columns | select gid, sh, count(*) from pwt group by gid, sh; | none | by_group = group pwf by (gid, sh); cnts = foreach by_group generate FLATTEN($0), COUNT($1); |
aggregation functions | count sum min max avg stddev | none | COUNT SUM MIN MAX AVG none |
functions | |||
pl/sql | awk | pig | |
define function | create or replace function add ( x int, y int ) returns int as $$ begin return x + y; end; $$ language plpgsql; | can be defined at position of pattern-action statement: function add(x, y) { return x+y } | in org/hpglot/Add.java: package org.hpglot; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class Add extends EvalFunc<Integer> { @Override public Integer exec(Tuple input) throws IOException { Integer x, y; if (input == null || input.size() < 2) { return null; } x = (Integer) input.get(0); y = (Integer) input.get(1); return x + y; } } |
invoke function | select add(x, y) from foo; | { print add($1, $2) } | REGISTER Add.jar DEFINE ADD org.hpglot.Add(); foo = LOAD 'foo.txt' AS (x:int, y:int); bar = FOREACH foo GENERATE ADD(x, y); |
drop function | drop function add(integer, integer); | none | none |
execution control | |||
pl/sql | awk | pig | |
if | | if (!0) print "foo"; else print "bar" | none |
while | | i = 0; while (i<5) print i++ | none |
for | | for (i=0; i<5; i++) print i | none |
files | |||
sql | awk | pig | |
set field delimiter | $ grep -v '^#' /etc/passwd > /tmp/pw create table pwt ( name text, pw text, uid int, gid int, gecos text, home text, sh text ); copy pwt from '/tmp/pw' with delimiter ':'; | BEGIN { FS=":" } | pwf = LOAD '/etc/passwd' USING PigStorage(':') AS (name:chararray, pw:chararray, uid:int, gid:int, gecos:chararray, home:chararray, sh:chararray); |
write table to file | copy pwt to '/tmp/pwt.txt' with delimiter ':'; | none | STORE foo INTO '/tmp/foo.tab'; |
libraries and namespaces | |||
sql | awk | pig | |
reflection | |||
sql | awk | pig | |
table schema | PostgreSQL: \d foo MySQL: describe foo; | none | DESCRIBE foo; |
_________________________________________ | _________________________________________ | _________________________________________ |
General
versions used
The versions used for testing code in the reference sheet.
show version
How to get the version.
Grammar and Invocation
interpreter
How to run the interpreter on a script.
repl
How to invoke the REPL.
statement separator
The statement separator.
block delimiters
The delimiters used for blocks.
end-of-line comment
How to create a comment that ends at the next newline.
multiple line comment
How to comment out multiple lines.
Variables and Expressions
case sensitive?
Are identifiers which differ only by case treated as distinct identifiers?
quoted identifier
How to quote an identifier.
Quoting an identifier is a way to include characters which aren't normally permitted in an identifier.
In SQL quoting is also a way to refer to an identifier that would otherwise be interpreted as a reserved word
null
The null literal.
pig:
PigStorage, the default function for loading and persisting relations, represents a null value with an empty string. Null is distinct from an empty string since null == '' evaluates as false. Thus PigStorage cannot load or store an empty string.
null test
How to test whether an expression is null.
sql:
The expression null = null evaluates as null, which is a ternary boolean value distinct from true and false. Expressions built up from arithmetic operators or comparison operators which contain a null evaluate as null. When logical operators are involved, null behaves like the unknown value of Kleene logic.
coalesce
How to use the value of an expression, replacing it with an alternate value if it is null.
nullif
How to use the value of an expression, replacing a specific value with null.
conditional expression
The syntax for a conditional expression.
Arithmetic and Logic
true and false
Literals for true and false.
falsehoods
Values which evaluate as false in a boolean context.
logical operators
The logical operators. Logical operators impose a boolean context on their arguments and return a boolean value.
relational operators
The comparison operators, also known as the relational operators.
integer type
Integer types.
sql:
Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.
awk:
Variables are untyped and implicit conversions are performed between numeric and string types.
The numeric literal for zero, 0, evaluates as false, but the string "0" evaluates as true. Hence we can infer that awk has at least two distinct data types.
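A minimal check of this behavior at the shell:

```shell
# number zero is false; the non-empty string constant "0" is true
awk 'BEGIN {
  print (0   ? "true" : "false")
  print ("0" ? "true" : "false")
}'
# prints: false, then true
```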
float type
Floating point decimal types.
sql:
Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.
fixed type
Fixed precision decimal types.
sql:
Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.
arithmetic operations
The arithmetic operators: addition, subtraction, multiplication, division, modulus, and exponentiation.
integer division
How to compute the quotient of two numbers. The quotient is always an integer.
integer division by zero
What happens when an integer is divided by zero.
pig:
Division by zero evaluates to null. Recall that PigStorage stores nulls in files as empty strings.
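In awk, all division is floating point, so the int() wrapper shown in the table is what makes the operation an integer division:

```shell
# awk division is always floating point; int() truncates the quotient toward zero
awk 'BEGIN { print 13 / 5, int(13 / 5), int(-13 / 5) }'
# prints: 2.6 2 -2
```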
float division
How to perform floating point division, even if the operands are integers.
float division by zero
The result of dividing a float by zero.
pig:
Division by zero evaluates to null. Recall that PigStorage stores nulls in files as empty strings.
power
How to raise a number to a power.
sqrt
How to get the square root of a number.
sqrt -1
The result of taking the square root of negative one.
transcendental functions
The standard transcendental functions of mathematics.
float truncation
How to truncate floats to integers. The functions (1) round towards zero, (2) round to the nearest integer, (3) round towards positive infinity, and (4) round towards negative infinity.
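awk has only int(), which rounds toward zero; for non-negative numbers, adding 0.5 before truncating is a common way to emulate round-to-nearest:

```shell
# int() truncates toward zero; int(x + 0.5) rounds to nearest for non-negative x
awk 'BEGIN { x = 2.7; print int(x), int(x + 0.5), int(-2.7) }'
# prints: 2 3 -2
```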
absolute value
The absolute value of a number.
random number
How to generate a uniform random float in the interval [0, 1).
Strings
types
The available string types.
pig:
A chararray is a string of Unicode characters. Like in Java the characters are UTF-16 encoded.
A bytearray is a string of bytes. Data imported into Pig is of type bytearray unless declared otherwise.
literal
The syntax for string literals.
sql:
MySQL also has double quoted string literals. PostgreSQL and most other databases use double quotes for identifiers.
In a MySQL double quoted string double quote characters must be escaped with reduplication but single quote characters do not need to be escaped.
pig:
Single quoted string literals are of type chararray. There is no syntax for a bytearray literal.
length
How to get the length of a string.
escapes
Escape sequences which are available in string literals.
sql:
Here is a portable way to include a newline character in a SQL string:
select 'foo' || chr(10) || 'bar';
MySQL double and single quoted strings support C-style backslash escapes. Backslash escapes are not part of the SQL standard. Their interpretation can be disabled at the session level with
SET sql_mode='NO_BACKSLASH_ESCAPES';
concatenation
How to concatenate strings.
split
How to split a string into an array of substrings.
sql:
How to split a string into multiple rows of data:
=> create temp table foo ( bar text );
CREATE TABLE
=> insert into foo select regexp_split_to_table('do re mi', ' ');
INSERT 0 3
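The awk split call from the table can be exercised directly; it returns the field count and fills the array starting at index 1:

```shell
# split returns the number of fields; array indices start at 1
awk 'BEGIN { n = split("do re mi", a, " "); print n, a[1], a[3] }'
# prints: 3 do mi
```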
case manipulation
How to uppercase a string; how to lower case a string; how to capitalize the first character.
strip
How to remove whitespace from the edges of a string.
index of substring
How to get the leftmost index of a substring in a string.
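The awk convention matches SQL's: indices start at 1 and 0 signals no match:

```shell
# index is 1-based; 0 means the substring was not found
awk 'BEGIN { print index("lorem ipsum", "ipsum"), index("lorem ipsum", "xyz") }'
# prints: 7 0
```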
extract substring
How to extract a substring from a string.
sprintf
How to create a string from a format.
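The awk version is the simplest of the three; sprintf returns the formatted string rather than printing it:

```shell
# sprintf formats into a string; printf would write it directly
awk 'BEGIN { s = sprintf("%s %d %.1f", "foo", 7, 13.2); print s }'
# prints: foo 7 13.2
```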
Regular Expressions
match
pig:
Pig does not directly support a regex match test. The technique illustrated is to extract a subgroup and see whether the resulting tuple has anything in it. This can be done in the by clause of a filter statement, but not as the first operand of a conditional expression.
substitute
How to perform substitution on a string.
awk:
sub and the global variant gsub return the number of substitutions performed.
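A quick demonstration of the return value:

```shell
# gsub replaces every match in place and returns the number of substitutions
awk 'BEGIN { s = "foo bar bar"; n = gsub(/bar/, "baz", s); print n, s }'
# prints: 2 foo baz baz
```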
extract subgroup
How to extract the substring matched by a parenthesized subgroup of a regular expression.
Date and Time
Arrays
literal
The syntax for an array literal.
sql:
The syntax for arrays is specific to PostgreSQL. MySQL does not support arrays.
Defining a column to be an array violates first normal form.
pig:
Pig tuples can be used to store a sequence of data in a field. Pig tuples are heterogeneous; the components do not need to be of the same type.
size
How to get the number of elements in an array.
lookup
How to get the value in an array by index.
update
How to change a value in an array.
iteration
How to iterate through the values of an array.
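In awk the for-in loop visits indices in an unspecified order, so code that depends on order should loop numerically from 1 to the length instead:

```shell
# for (i in a) gives no order guarantee; a counted loop preserves order
awk 'BEGIN {
  n = split("do re mi", a, " ")
  for (i = 1; i <= n; i++) print i, a[i]
}'
# prints: 1 do / 2 re / 3 mi (one pair per line)
```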
Dictionaries
literal
pig:
Dictionaries are called maps in Pig. The keys must be character arrays, but the values can be any type.
Tables
order by
How to sort the rows in table using the values in one of the columns.
order by multiple columns
How to sort the rows in a table using the values in multiple columns. If the values in the first column are the same, the values in the second column are used as a tie breaker.
limit
offset
Relational Algebra
In a mapping operation the output relation has the same number of rows as the input relation. A mapping operation can be specified with a function which accepts an input record and returns an output record.
In a filtering operation the output relation has a less than or equal number of rows as the input relation. A filtering operation can be specified with a function which accepts an input record and returns a boolean value.
input data format
The data formats the language can operate on.
set field delimiter
For languages which can operate on field and record delimited files, how to set the field delimiter.
sql:
The PostgreSQL copy command requires superuser privilege unless the input source is stdin. Here is an example of how to use the copy command without superuser privilege:
$ ( echo "copy pwt from stdin with delimiter ':';"; cat /tmp/pw ) | psql
The copy command is not part of the SQL standard. MySQL uses the following:
load data infile '/etc/passwd' into table pwt fields terminated by ':';
Both PostgreSQL and MySQL will use tab characters if a field separator is not specified. MySQL permits the record terminator to be changed from the default newline, but PostgreSQL does not.
select column by name
How to select fields by name.
select column by position
select all columns
rename columns
filter rows
split rows
An aggregation operation is similar to a filtering operation in that it accepts an input relation and produces an output relation with less than or equal number of rows. An aggregation is defined by two functions: a partitioning function which accepts a record and produces a partition value, and a reduction function which accepts a set of records which share a partition value and produces an output record.
select distinct
How to remove duplicate rows from the output set.
Removing duplicates can be accomplished with an aggregation operation in which the partition value is the entire row and a reduction function which returns the first row in the set of rows sharing the partition value.
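At the shell, the awk-plus-sort pipeline in the table can be collapsed into a single awk pass, using an array to remember rows already seen:

```shell
# print each line only the first time it is seen, preserving input order
printf 'b\na\nb\na\n' | awk '!seen[$0]++'
# prints: b / a (one per line)
```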
inner join
In an inner join, only tuples from the input relations which satisfy a join predicate are used in the output relation.
A special but common case is when the join predicate consists of an equality test or a conjunction of two or more equality tests. Such a join is called an equi-join.
awk:
If awk is available at the command line then chances are good that join is also available.
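join requires both inputs to be sorted on the join field; a minimal equi-join with colon-delimited files (the file paths and data here are illustrative):

```shell
# both inputs must be sorted on the join field; -t sets the field delimiter
printf '1:alice\n2:bob\n'   > /tmp/users.txt
printf '1:admin\n2:staff\n' > /tmp/roles.txt
join -t: /tmp/users.txt /tmp/roles.txt
# prints: 1:alice:admin / 2:bob:staff
```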
null treatment in joins
How rows which have a null value for the join column are handled.
Neither SQL nor Pig includes such rows in the output relation unless an outer join (i.e. a left, right, or full join) is specified. Even in the case of an outer join, the rows with a null join column value are not joined with any rows from the other relation, even if the other relation also contains rows with null join column values. Instead the columns that derive from the other input relation will have null values in the output relation.
self join
A self join is when a relation is joined with itself.
If a relation contained a list of people and their parents, then a self join could be used to find a person's grandparents.
pig:
An alias cannot be used in a JOIN statement more than once. Thus to join a relation with itself it must first be copied with a FOREACH statement.
left join
How to include rows from the input relation listed on the left (i.e. listed first) which have values in the join column which don't match any rows from the input relation listed on the right (i.e. listed second). The term is short for left outer join.
As an example, a left join between customers and orders would have a row for every order placed and the customer who placed it. In addition it would have rows for customers who haven't placed any orders. Such rows would have null values for the order information.
sql:
Here is a complete example with the schemas and data used in the left join:
create table customers ( id int, name text );
insert into customers values ( 1, 'John' ), ( 2, 'Mary' ), (3, 'Jane');
create table orders ( id int, customer_id int, amount numeric(9, 2));
insert into orders values ( 1, 2, 12.99 );
insert into orders values ( 2, 3, 5.99 );
insert into orders values ( 3, 3, 12.99 );
select * from customers c left join orders o on c.id = o.customer_id;
id | name | id | customer_id | amount
----+------+----+-------------+--------
1 | John | | |
2 | Mary | 1 | 2 | 12.99
3 | Jane | 2 | 3 | 5.99
3 | Jane | 3 | 3 | 12.99
(4 rows)
left outer join is synonymous with left join. The following query is identical to the one above:
select *
from customers c
left outer join orders o
on c.id = o.customer_id;
pig:
For a complete example assume the following data is in /tmp/customers.txt:
1:John
2:Mary
3:Jane
and /tmp/orders.txt:
1:2:12.99
2:3:5.99
3:3:12.99
Here is the Pig session:
customers = LOAD '/tmp/customers.txt' USING PigStorage(':') AS (id:int, name:chararray);
orders = LOAD '/tmp/orders.txt' USING PigStorage(':') AS (id:int, customer_id:int, amount:float);
j = join customers by id left, orders by customer_id;
dump j;
Here is the output:
(1,John,,,)
(2,Mary,1,2,12.99)
(3,Jane,2,3,5.99)
(3,Jane,3,3,12.99)
full join
A full join is a join in which rows with null values for the join condition from both input relations are included in the output relation.
Left joins, right joins, and full joins are collectively called outer joins.
sql:
We illustrate a full join by using the schema and data from the left join example and adding an order with a null customer_id:
insert into orders values ( 4, null, 7.99);
select * from customers c full join orders o on c.id = o.customer_id;
id | name | id | customer_id | amount
----+------+----+-------------+--------
1 | John | | |
2 | Mary | 1 | 2 | 12.99
3 | Jane | 2 | 3 | 5.99
3 | Jane | 3 | 3 | 12.99
| | 4 | | 7.99
(5 rows)
pig:
For a complete example assume the following data is in /tmp/customers.txt:
1:John
2:Mary
3:Jane
and /tmp/orders.txt:
1:2:12.99
2:3:5.99
3:3:12.99
4::7.99
Here is the Pig session:
customers = LOAD '/tmp/customers.txt' USING PigStorage(':') AS (id:int, name:chararray);
orders = LOAD '/tmp/orders.txt' USING PigStorage(':') AS (id:int, customer_id:int, amount:float);
j = join customers by id full, orders by customer_id;
dump j;
Here is the output:
(1,John,,,)
(2,Mary,1,2,12.99)
(3,Jane,2,3,5.99)
(3,Jane,3,3,12.99)
(,,4,,7.99)
cross join
A cross join is a join with no join predicate. It is also called a Cartesian product. If the input relations have N1, N2, …, Nm rows respectively, then the output relation has $$\prod_{i=1}^{m} N_i$$ rows.
Aggregation
group by
sql:
The columns in the select clause of a select statement with a group by clause must be expressions built up of columns listed in the group by clause and aggregation functions. Aggregation functions can contain expressions containing columns not in the group by clause as arguments.
pig:
The output relation of a GROUP BY operation is always a relation with two fields. The first is the partition value, and the second is a bag containing all the tuples in the input relation which have the partition value.
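The awk idiom in the table row follows the same pattern and can be run standalone; output order from a for-in loop is unspecified, hence the sort:

```shell
# count rows per key, awk's equivalent of GROUP BY ... COUNT(*)
printf 'a\nb\na\na\n' | awk '{ cnt[$1]++ } END { for (k in cnt) print k, cnt[k] }' | sort
# prints: a 3 / b 1
```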
group by multiple columns
How to group by multiple columns. The output relation will have a row for each distinct tuple of column values.
pig:
Tuples must be used to group on multiple fields. Tuple syntax is used in the GROUP BY statement and the first field in the output relation will be tuple. The FLATTEN function can be used to replace the tuple field with multiple fields, one for each component of the tuple.
aggregation functions
The aggregation functions.
sql:
Rows for which the expression given as an argument of an aggregation function is null are excluded from a result. In particular, COUNT(foo) is the number of rows for which the foo column is not null. COUNT(*) always returns the number of rows in the input set, including even rows with columns that are all null.
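The distinction can be sketched in awk, with empty strings standing in for nulls the way PigStorage stores them:

```shell
# NR counts all records (like COUNT(*)); the guard skips "null" rows (like COUNT(foo))
printf 'x\n\ny\n' | awk '$0 != "" { nonnull++ } END { print nonnull, NR }'
# prints: 2 3
```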
pig:
The Pig aggregation functions operate on bags. The GROUP BY operator produces a relation of tuples in which the first field is the partition value and the second is a bag of all the tuples which have the partition value. Since $1 references the second component of a tuple, it is often the argument to the aggregation functions.
A join is an operation on m input relations. If the input relations have n1, n2, …, nm columns respectively, then the output relation has $$\sum_{i=1}^{m} n_i$$ columns.
Functions
define function
How to define a function.
sql:
To be able to write PL/pgSQL functions on a PostgreSQL database, someone with superuser privilege must run the following command:
create language plpgsql;
invoke function
How to invoke a function.
drop function
How to remove a function.
sql:
PL/pgSQL permits functions with the same name and different parameter types. Resolution happens at invocation using the types of the arguments.
When dropping the function the parameter types must be specified. There is no statement for dropping multiple functions with a common name.
Execution Control
if
How to execute code conditionally.
while
How to implement a while loop.
for
How to implement a C-style for loop.
Files
Library and Namespaces
Reflection
SQL
PostgreSQL 9.1: The SQL Language
MySQL 5.6 Reference Manual
SQL has been the leading query language for relational databases since the early 1980s. It received its first ISO standardization in 1986.
SQL statements are classified into three types: data manipulation language (DML), data definition language (DDL), and data control language (DCL). DDL defines and alters the database schema. DCL controls the privileges of database users. DML queries and modifies the data in the tables.
sql statement | type |
---|---|
select | DML |
insert | DML |
update | DML |
delete | DML |
create | DDL |
alter | DDL |
drop | DDL |
grant | DCL |
revoke | DCL |
Awk
awk - pattern-directed scanning and processing language
POSIX specification for awk
POSIX specification for join
POSIX specification for sort
Awk has been included on all Unix systems since 7th Edition Unix in 1979. It provides a concise language for performing transformations on files. An entire program can be provided to the awk interpreter as the first argument on the command line. Because awk string literals use double quotes, single quotes are usually used to quote the awk program for the benefit of the shell. Here's an example which prints the default shell used by root:
awk 'BEGIN{FS=":"} $1=="root" {print $7}' /etc/passwd
An awk script is a sequence of pattern-action pairs. Awk will iterate through the lines of standard input or the lines of the specified input files, testing each pattern against the line and executing the corresponding action if the pattern matches.
Patterns are usually slash delimited regular expressions, e.g. /lorem/, and logical expressions built up from them using the logical operators &&, ||, and !. If an action is provided without an accompanying pattern, the action is executed once for every line of input. The keywords BEGIN and END are special patterns which cause the following action to be executed once at the start and end of execution, respectively.
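A small script exercising all three kinds of pattern:

```shell
# BEGIN runs before any input, the comparison pattern tests each record,
# and END runs after the last record
printf 'root:x\nbin:x\n' | awk -F: '
  BEGIN { print "start" }
  $1 == "root" { print $1 }
  END { print NR, "lines" }'
# prints: start / root / 2 lines
```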
Pig
Apache Pig docs
piggybank.jar
piggybank-0.3-amzn.jar
Pig is a language for specifying Hadoop map reduce jobs. Pig scripts are shorter than equivalent Java source code, especially if joins are required.
There are products such as Hive which can convert an SQL statement to a map reduce job, but Pig has an advantage over Hive in that it can handle a greater variety of data formats in the input files.
Although Pig is intended to be used with a Hadoop grid, Hadoop is not required to run a Pig job. Running a Pig job locally is a convenient way to test a Pig job before running it on a grid.
In addition to some numeric and string data types, Pig provides three compound data types: bags, tuples, and maps. A bag is an array of tuples. It is equivalent to a database table; it is the type Pig uses to hold data which it reads in from files.
Pig has a limited type of variable called an alias. The only data type which can be stored in an alias is a bag. When a bag is stored in an alias it is called a relation or an outer bag. A bag can also be stored in the field of a tuple, in which case it is called an inner bag.
pig relational operators:
Pig provides the following 15 operators for manipulating relations:
relational operator | input relations | output relations | output rows |
---|---|---|---|
CROSS | 2+ | 1 | N1 x N2 x … |
DISTINCT | 1 | 1 | N or fewer |
FILTER | 1 | 1 | N or fewer |
FOREACH | 1 | 1 | N |
GROUP | 1+ | 1 | number of groups |
JOIN | 2+ | 1 | bounded by N1 x N2 x … |
LIMIT | 1 | 1 | min(N, limit argument) |
LOAD | 0 | 1 | lines in file |
MAPREDUCE | 0 | 1 | depends on mapreduce job |
ORDER BY | 1 | 1 | N |
SAMPLE | 1 | 1 | N * fraction argument |
SPLIT | 1 | n | can exceed N if split conditions overlap |
STORE | 1 | 0 | 0 |
STREAM | 1 | 1 | depends on external script |
UNION | 2+ | 1 | N1 + N2 + … + Nn |
Most of the above operators create a new relation from existing relations. Exceptions are LOAD and MAPREDUCE which create relations from external files, STORE which writes a relation to a file, and SPLIT which can create more than one relation.
piggybank UDFs:
It is easy to write user defined functions (UDFs) in Java and make them available to Pig. piggybank.jar and piggybank-0.3-amzn.jar are two publicly available libraries of UDFs.
The functions can be made available by registering the jar, with its actual path substituted for /PATH/TO/, at the top of the Pig script:
REGISTER /PATH/TO/piggybank.jar;
REGISTER /PATH/TO/piggybank-0.3-amzn.jar;
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();