a side-by-side reference sheet
grammar and invocation | variables and expressions | arithmetic and logic | strings | regexes | dates and time | arrays | dictionaries | tables | relational algebra | aggregation | functions | execution control | files | libraries and namespaces | reflection
sql | awk | pig | |
---|---|---|---|
version used | PostgreSQL 9.0 | 20070501 | 0.9 |
show version | > SELECT version(); | $ awk --version | $ pig --version |
grammar and invocation | |||
sql | awk | pig | |
interpreter | $ psql -f foo.sql | $ awk -f foo.awk bar.txt | $ pig -f foo.pig |
repl | $ psql | none | $ pig -x local |
input data format | multiple tables defined by create table statements | single field and record delimited file. By default fields are delimited by whitespace and records by newlines | multiple files. Default loading function is PigStorage and default delimiter is tab |
statement separator | ; | ; or newline | ; and newline when using REPL |
block delimiters | none in SQL; PL/SQL uses keywords to delimit blocks | { } | |
end-of-line comment | -- comment | # comment | -- comment |
multiple line comment | none | none | /* comment another comment */ |
variables and expressions | |||
sql | awk | pig | |
case sensitive? | no | yes | functions and aliases are case sensitive; commands and operators are not |
quoted identifier | CREATE TABLE "select" ( "foo bar" INT ); MySQL: CREATE TABLE `select` ( `foo bar` INT ); | none | none |
null | NULL | "" | null |
null test | foo IS NULL | foo == "" | in filter clause: is null |
coalesce | coalesce(foo, 0) | foo == "" ? 0 : foo | foo is null ? 0 : foo |
nullif | nullif(foo, -999) | ||
conditional expression | CASE WHEN x > 0 THEN x ELSE -x END | x > 0 ? x : -x | x > 0 ? x : -x |
arithmetic and logic | |||
sql | awk | pig | |
boolean type | BOOLEAN | values are untyped | none |
true and false | TRUE FALSE | 1 0 | none |
falsehoods | FALSE the predicate of a where clause evaluates as false if it contains a null | 0 "" | none; the first operand of a conditional expression must be a comparison operator expression |
logical operators | AND OR NOT | && || ! | in filter clause: and or not |
logical expression | |||
relational operators | = != > < >= <= | == != > < >= <= | == != > < >= <= |
integer type | smallint 2 bytes int 4 bytes bigint 8 bytes | variables are untyped | int 4 bytes long 8 bytes |
float type | float 4 bytes double 8 bytes | variables are untyped | float 4 bytes double 8 bytes |
fixed type | numeric(precision, scale) | variables are untyped | none |
arithmetic operators addition, subtraction, multiplication, division, remainder | + - * / % | + - * / % | + - * / % |
integer division | 13 / 5 | int(13 / 5) | 13 / 5 |
integer division by zero | error | error | null |
float division | cast(13 as float) / 5 | 13 / 5 | 1.0 * 13 / 5 |
float division by zero | error | error | null |
power | 2 ^ 32 | 2 ^ 32 2 ** 32 | none |
sqrt | sqrt(2) | sqrt(2) | SQRT(2) |
sqrt -1 | error | generates a warning: nan | NaN |
transcendental functions | exp ln sin cos tan asin acos atan atan2 | exp log sin cos none none none none atan2 | EXP LOG SIN COS TAN ASIN ACOS ATAN none |
float truncation | cast(2.7 as int) round(2.7) ceil(2.7) floor(2.7) | int(2.7) | (int)2.7 ROUND(2.7) CEIL(2.7) FLOOR(2.7) |
absolute value | abs(-2.7) | none | ABS(-2.7) |
random number | random() | rand() | RANDOM() |
bit operators | << >> & | # ~ | gawk: lshift(val, cnt) rshift(val, cnt) and(val1, val2) or(val1, val2) xor(val1, val2) compl(val) | none |
strings | |||
sql | awk | pig | |
types | text varchar(n) char(n) | variables are untyped | chararray bytearray |
literal | 'don''t say "no"' PostgreSQL escape literal: E'don\'t say "no"' | "don't say \"no\"" | 'don\'t say "no"' |
length | length('lorem ipsum') | length("lorem ipsum") | SIZE('lorem ipsum') |
escapes | no backslash escapes in SQL standard string literals in PostgreSQL escape literals: \b \f \n \r \t \ooo \xhh \uhhhh \uhhhhhhhh | \\ \" \a \b \f \n \r \t \v \ooo \xhh | \n \t \uhhhh |
concatenation | 'Hello' || ', ' || 'World!' | ("Hello" ", " "World!") | CONCAT(CONCAT('Hello', ', '), 'World!') |
split | regexp_split_to_array( 'do re mi', ' ' ) | split("do re mi", a, " ") | STRSPLIT('do re mi', ' ') |
case manipulation | upper('lorem') lower('LOREM') initcap('lorem') | toupper("lorem") tolower("LOREM") none | UPPER('lorem') LOWER('LOREM') UCFIRST('lorem') |
strip | trim(' lorem ') ltrim(' lorem') rtrim('lorem ') | none | TRIM(' lorem ') |
index of substring | index starts from 1; returns 0 if not found: strpos('lorem ipsum', 'ipsum') | index starts from 1; returns 0 if not found: index("lorem ipsum", "ipsum") | index starts from 0; returns -1 if not found: INDEXOF('lorem ipsum', 'ipsum', 0) |
extract substring | substr('lorem ipsum', 7, 5) | substr("lorem ipsum", 7, 5) | SUBSTRING('lorem ipsum', 6, 11) |
sprintf | select format('%s %s %s', 'foo', 7, 13.2); | sprintf("%s %d %f", "foo", 7, 13.2) | REGISTER /PATH/TO/piggybank-0.3-amzn.jar; DEFINE FORMAT org.apache.pig.piggybank. evaluation.string.FORMAT(); foo = FOREACH foo GENERATE FORMAT('%s %d %f', 'foo', 7, 13.2); |
regular expressions | |||
sql | awk | pig | |
match | select * from pwt where name similar to 'r[a-z]+'; | matching inside pattern: $1 ~ /^r[a-z]+/ { print $0 } matching inside action: { if (match($1, /^r[a-z]+$/)) print $0 } | root_pwf = filter pwf by SIZE(REGEX_EXTRACT(name, '^(root)$',1)) > 0; |
substitute | select regexp_replace('foo bar', 'bar$', 'baz'); | s = "foo bar" sub(/bar$/, "baz", s) | |
extract subgroup | select (regexp_matches('foobar', '(f..)bar'))[1]; | gawk: match("foobar", /(f..)bar/, a) subgroup in a[1] | properties = FOREACH urls GENERATE FLATTEN(EXTRACT($0, '^https?://([^/]+)')) as host:chararray; |
dates and time | |||
sql | awk | pig | |
current date and time | now() | gawk: systime() | REGISTER /PATH/TO/piggybank-0.3-amzn.jar; DEFINE DATE_TIME org.apache.pig.piggybank. evaluation.datetime.DATE_TIME(); bar = FOREACH foo GENERATE DATE_TIME('-00:00'); |
datetime to string | to_char(now(), 'YYYY-MM-DD HH24:MI:SS') | gawk: strftime("%Y-%m-%d %H:%M:%S", systime()) | REGISTER /PATH/TO/piggybank-0.3-amzn.jar; DEFINE FORMAT_DT org.apache.pig.piggybank. evaluation.datetime.FORMAT_DT(); baz = FOREACH bar GENERATE FORMAT_DT('yyyy-MM-dd HH::mm:ss', $0); |
string to datetime | to_timestamp('2011-09-26 00:00:47', 'YYYY-MM-DD HH24:MI:SS') | gawk: mktime("2011 09 26 00 00 47") | |
arrays | |||
sql | awk | pig | |
literal | PostgreSQL: create temp table foo ( a int[] ); insert into foo values ( '{1,2,3}' ); | none | one_row = LOAD '/tmp/one_line.txt'; foo = FOREACH one_row GENERATE (1,2,3); |
size | PostgreSQL: select array_upper(a, 1) from foo; | none, but split function returns length | bar = FOREACH foo GENERATE SIZE(a); |
lookup | PostgreSQL: select a[1] from foo; | a[0] | bar = FOREACH foo GENERATE a.$0; |
update | |||
iteration | | for (i in a) print i, a[i] | |
membership | |||
dictionaries | |||
sql | awk | pig | |
literal | none | none | ['t'#1, 'f'#0] |
tables | |||
sql | awk | pig | |
order rows by column | ordering depends on type of column select name from pwt order by name; | ordering is lexical unless -n flag is used $ sort -k1 -t: /etc/passwd | names = foreach pwf generate name; ordered_names = order names by name; |
order rows by multiple columns | select group, name from pwt order by group, name; | $ sort -t: -k4,4 -k1,1 /etc/passwd | ordered = order pwf by gid, name; |
order rows in descending order | select name from pwt order by name desc; | $ sort -k1 -r -t: /etc/passwd | ordered_names = order names by name desc; |
limit | select name from pwt order by name limit 10; | NR <= 10 | first_ten = limit ordered_names 10; |
offset | select name from pwt order by name limit 10 offset 10; | ||
relational algebra | |||
sql | awk | pig | |
project columns by name | select name, pw from pwt; | none | name_pw = foreach pwf generate name, pw; |
project columns by position | none | { print $1, $2 } | name_pw = foreach pwf generate $0, $1; |
project expression | |||
project all columns | select * from pwt; | # prints input line: { print $0 } | pwf2 = foreach pwf generate *; |
rename columns | select uid as userid, gid as groupid from pwt; | none | usergroups = foreach pwf generate uid as userid, gid as groupid; |
select rows | select * from pwt where name = 'root'; | $1 == "root" { print $0 } | pwf3 = filter pwf by name == 'root'; |
select distinct rows | select distinct gid from pwt; | $ cat > gid.awk BEGIN { FS=":" } !/^#/ { print $4 } $ awk -f gid.awk /etc/passwd > /tmp/x $ sort -u /tmp/x | gids1 = foreach pwf generate gid; gids2 = distinct gids1; |
split rows | none | none | split pwf into rootpwf if name == 'root', otherpwf if name != 'root'; |
inner join | create temp table gt ( name text, pw text, gid int, members text ); copy gt from '/etc/group' with delimiter ':'; select * from pwt join gt on pwt.gid = gt.gid; | $ awk '!/^#/' /etc/passwd > /tmp/x $ sort -k4,4 -t: < /tmp/x > /tmp/pw $ awk '!/^#/' /etc/group > /tmp/y $ sort -k3,3 -t: < /tmp/y > /tmp/g $ join -t: -14 -23 /tmp/pw /tmp/g | gf = load '/etc/group' using PigStorage(':') as (name:chararray, pw:chararray, gid:int, members:chararray); pwgf = join pwf by gid, gf by gid; |
nulls as join values | input tuples with nulls for join values are omitted from inner joins | no null value; empty strings are joinable values | input tuples with nulls for join values are omitted from inner joins |
self join | select pwt1.name, pwt2.name from pwt as pwt1, pwt as pwt2 where pwt1.gid = pwt2.gid; | $ join -t: -14 -24 /tmp/pw /tmp/pw | pwf2 = foreach pwf generate *; joined_by_gid = join pwf by gid, pwf2 by gid; name_pairs = foreach joined_by_gid generate pwf::name, pwf2::name; |
left join | select * from customers c left join orders o on c.id = o.customer_id; | $ join -t: -a1 -11 -22 /tmp/c /tmp/o | j = join customers by id left, orders by customer_id; |
full join | select * from customers c full join orders o on c.id = o.customer_id; | $ join -t: -a1 -a2 -11 -22 /tmp/c /tmp/o | j = join customers by id full, orders by customer_id; |
antijoin | |||
cross join | create table files ( file text ); insert into files values ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'); create table ranks ( rank int ); insert into ranks values (1), (2), (3), (4), (5), (6), (7), (8); create table chessboard ( file text, rank int ); insert into chessboard select * from files, ranks; | specify nonexistent join fields: $ join -12 -22 /tmp/f /tmp/r | files = load '/tmp/files.txt' as (file:chararray); ranks = load '/tmp/ranks.txt' as (rank:int); chessboard = cross files, ranks; |
aggregation | |||
sql | awk | pig | |
group by | select gid, count(*) from pwt group by gid; | BEGIN { FS=":" } !/^#/ { cnt[$4]++ } END { for (i in cnt) print i, cnt[i] } | by_group = group pwf by gid; cnts = foreach by_group generate $0, COUNT($1); |
group by multiple columns | select gid, sh, count(*) from pwt group by gid, sh; | none | by_group = group pwf by (gid, sh); cnts = foreach by_group generate FLATTEN($0), COUNT($1); |
aggregation functions | count sum min max avg stddev | none | COUNT SUM MIN MAX AVG none |
functions | |||
pl/sql | awk | pig | |
define function | create or replace function add ( x int, y int ) returns int as $$ begin return x + y; end; $$ language plpgsql; | can be defined at position of pattern-action statement: function add(x, y) { return x+y } | in org/hpglot/Add.java: package org.hpglot; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; public class Add extends EvalFunc<Integer> { @Override public Integer exec(Tuple input) throws IOException { Integer x, y; if (input == null || input.size() < 2) { return null; } x = (Integer) input.get(0); y = (Integer) input.get(1); return x + y; } } |
invoke function | select add(x, y) from foo; | { print add($1, $2) } | REGISTER Add.jar DEFINE ADD org.hpglot.Add(); foo = LOAD 'foo.txt' AS (x:int, y:int); bar = FOREACH foo GENERATE ADD(x, y); |
drop function | drop function add(integer, integer); | none | none |
execution control | |||
pl/sql | awk | pig | |
if | | if (!0) print "foo"; else print "bar" | none |
while | | i = 0; while (i<5) print i++ | none |
for | | for (i=0; i<5; i++) print i | none |
files | |||
sql | awk | pig | |
set field delimiter | $ grep -v '^#' /etc/passwd > /tmp/pw create table pwt ( name text, pw text, uid int, gid int, gecos text, home text, sh text ); copy pwt from '/tmp/pw' with delimiter ':'; | BEGIN { FS=":" } | pwf = LOAD '/etc/passwd' USING PigStorage(':') AS (name:chararray, pw:chararray, uid:int, gid:int, gecos:chararray, home:chararray, sh:chararray); |
write table to file | copy pwt to '/tmp/pwt.txt' with delimiter ':'; | none | STORE foo INTO '/tmp/foo.tab'; |
libraries and namespaces | |||
sql | awk | pig | |
reflection | |||
sql | awk | pig | |
table schema | PostgreSQL: \d foo MySQL: describe foo; | none | DESCRIBE foo; |
_________________________________________ | _________________________________________ | _________________________________________ |
General
versions used
The versions used for testing code in the reference sheet.
show version
How to get the version.
Grammar and Invocation
interpreter
How to run the interpreter on a script.
repl
How to invoke the REPL.
statement separator
The statement separator.
block delimiters
The delimiters used for blocks.
end-of-line comment
How to create a comment that ends at the next newline.
multiple line comment
How to comment out multiple lines.
Variables and Expressions
case sensitive?
Are identifiers which differ only by case treated as distinct identifiers?
quoted identifier
How to quote an identifier.
Quoting an identifier is a way to include characters which aren't normally permitted in an identifier.
In SQL quoting is also a way to refer to an identifier that would otherwise be interpreted as a reserved word
null
The null literal.
pig:
PigStorage, the default function for loading and persisting relations, represents a null value with an empty string. Null is distinct from an empty string since null == '' evaluates as false. Thus PigStorage cannot load or store an empty string.
null test
How to test whether an expression is null.
sql:
The expression null = null evaluates as null, which is a ternary boolean value distinct from true and false. Expressions built up from arithmetic operators or comparison operators which contain a null evaluate as null. When logical operators are involved, null behaves like the unknown value of Kleene logic.
coalesce
How to use the value of an expression, replacing it with an alternate value if it is null.
nullif
How to use the value of an expression, replacing a specific value with null.
conditional expression
The syntax for a conditional expression.
Arithmetic and Logic
true and false
Literals for true and false.
falsehoods
Values which evaluate as false in a boolean context.
logical operators
The logical operators. Logical operators impose a boolean context on their arguments and return a boolean value.
relational operators
The comparison operators, also known as the relational operators.
integer type
Integer types.
sql:
Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.
awk:
Variables are untyped and implicit conversions are performed between numeric and string types.
The numeric literal for zero, 0, evaluates as false, but the string "0" evaluates as true. Hence we can infer that awk has at least two distinct data types.
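A minimal check of this behavior at the shell:

```shell
# number zero is false; the non-empty string constant "0" is true
awk 'BEGIN {
  print (0   ? "true" : "false")
  print ("0" ? "true" : "false")
}'
# prints: false, then true
```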
float type
Floating point decimal types.
sql:
Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.
fixed type
Fixed precision decimal types.
sql:
Datatypes are database specific, but the mentioned types are provided by both PostgreSQL and MySQL.
arithmetic operations
The arithmetic operators: addition, subtraction, multiplication, division, modulus, and exponentiation.
integer division
How to compute the quotient of two numbers. The quotient is always an integer.
integer division by zero
What happens when an integer is divided by zero.
pig:
Division by zero evaluates to null. Recall that PigStorage stores nulls in files as empty strings.
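In awk, all division is floating point, so the int() wrapper shown in the table is what makes the operation an integer division:

```shell
# awk division is always floating point; int() truncates the quotient toward zero
awk 'BEGIN { print 13 / 5, int(13 / 5), int(-13 / 5) }'
# prints: 2.6 2 -2
```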
float division
How to perform floating point division, even if the operands are integers.
float division by zero
The result of dividing a float by zero.
pig:
Division by zero evaluates to null. Recall that PigStorage stores nulls in files as empty strings.
power
How to raise a number to a power.
sqrt
How to get the square root of a number.
sqrt -1
The result of taking the square root of negative one.
transcendental functions
The standard transcendental functions of mathematics.
float truncation
How to truncate floats to integers. The functions (1) round towards zero, (2) round to the nearest integer, (3) round towards positive infinity, and (4) round towards negative infinity.
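awk has only int(), which rounds toward zero; for non-negative numbers, adding 0.5 before truncating is a common way to emulate round-to-nearest:

```shell
# int() truncates toward zero; int(x + 0.5) rounds to nearest for non-negative x
awk 'BEGIN { x = 2.7; print int(x), int(x + 0.5), int(-2.7) }'
# prints: 2 3 -2
```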
absolute value
The absolute value of a number.
random number
How to generate a uniform random float in the interval [0, 1).
Strings
types
The available string types.
pig:
A chararray is a string of Unicode characters. Like in Java the characters are UTF-16 encoded.
A bytearray is a string of bytes. Data imported into Pig is of type bytearray unless declared otherwise.
literal
The syntax for string literals.
sql:
MySQL also has double quoted string literals. PostgreSQL and most other databases use double quotes for identifiers.
In a MySQL double quoted string double quote characters must be escaped with reduplication but single quote characters do not need to be escaped.
pig:
Single quoted string literals are of type chararray. There is no syntax for a bytearray literal.
length
How to get the length of a string.
escapes
Escape sequences which are available in string literals.
sql:
Here is a portable way to include a newline character in a SQL string:
select 'foo' || chr(10) || 'bar';
MySQL double and single quoted strings support C-style backslash escapes. Backslash escapes are not part of the SQL standard. Their interpretation can be disabled at the session level with
SET sql_mode='NO_BACKSLASH_ESCAPES';
concatenation
How to concatenate strings.
split
How to split a string into an array of substrings.
sql:
How to split a string into multiple rows of data:
=> create temp table foo ( bar text );
CREATE TABLE
=> insert into foo select regexp_split_to_table('do re mi', ' ');
INSERT 0 3
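The awk split call from the table can be exercised directly; it returns the field count and fills the array starting at index 1:

```shell
# split returns the number of fields; array indices start at 1
awk 'BEGIN { n = split("do re mi", a, " "); print n, a[1], a[3] }'
# prints: 3 do mi
```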
case manipulation
How to uppercase a string; how to lower case a string; how to capitalize the first character.
strip
How to remove whitespace from the edges of a string.
index of substring
How to get the leftmost index of a substring in a string.
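The awk convention matches SQL's: indices start at 1 and 0 signals no match:

```shell
# index is 1-based; 0 means the substring was not found
awk 'BEGIN { print index("lorem ipsum", "ipsum"), index("lorem ipsum", "xyz") }'
# prints: 7 0
```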
extract substring
How to extract a substring from a string.
sprintf
How to create a string from a format.
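The awk version is the simplest of the three; sprintf returns the formatted string rather than printing it:

```shell
# sprintf formats into a string; printf would write it directly
awk 'BEGIN { s = sprintf("%s %d %.1f", "foo", 7, 13.2); print s }'
# prints: foo 7 13.2
```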
Regular Expressions
match
pig:
Pig does not directly support a regex match test. The technique illustrated is to extract a subgroup and see whether the resulting tuple has anything in it. This can be done in the by clause of a filter statement, but not as the first operand of a conditional expression.
substitute
How to perform substitution on a string.
awk:
sub and the global variant gsub return the number of substitutions performed.
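A quick demonstration of the return value:

```shell
# gsub replaces every match in place and returns the number of substitutions
awk 'BEGIN { s = "foo bar bar"; n = gsub(/bar/, "baz", s); print n, s }'
# prints: 2 foo baz baz
```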
extract subgroup
How to extract the substring matched by a parenthesized subgroup of a regular expression.
Date and Time
Arrays
literal
The syntax for an array literal.
sql:
The syntax for arrays is specific to PostgreSQL. MySQL does not support arrays.
Defining a column to be an array violates first normal form.
pig:
Pig tuples can be used to store a sequence of data in a field. Pig tuples are heterogeneous; the components do not need to be of the same type.
size
How to get the number of elements in an array.
lookup
How to get the value in an array by index.
update
How to change a value in an array.
iteration
How to iterate through the values of an array.
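In awk the for-in loop visits indices in an unspecified order, so code that depends on order should loop numerically from 1 to the length instead:

```shell
# for (i in a) gives no order guarantee; a counted loop preserves order
awk 'BEGIN {
  n = split("do re mi", a, " ")
  for (i = 1; i <= n; i++) print i, a[i]
}'
# prints: 1 do / 2 re / 3 mi (one pair per line)
```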
Dictionaries
literal
pig:
Dictionaries are called maps in Pig. The keys must be character arrays, but the values can be any type.
Tables
order by
How to sort the rows in table using the values in one of the columns.
order by multiple columns
How to sort the rows in a table using the values in multiple columns. If the values in the first column are the same, the values in the second column are used as a tie breaker.
limit
offset
Relational Algebra
In a mapping operation the output relation has the same number of rows as the input relation. A mapping operation can be specified with a function which accepts an input record and returns an output record.
In a filtering operation the output relation has a less than or equal number of rows as the input relation. A filtering operation can be specified with a function which accepts an input record and returns a boolean value.
input data format
The data formats the language can operate on.
set field delimiter
For languages which can operate on field and record delimited files, how to set the field delimiter.
sql:
The PostgreSQL copy command requires superuser privilege unless the input source is stdin. Here is an example of how to use the copy command without superuser privilege:
$ ( echo "copy pwt from stdin with delimiter ':';"; cat /tmp/pw ) | psql
The copy command is not part of the SQL standard. MySQL uses the following:
load data infile '/etc/passwd' into table pwt fields terminated by ':';
Both PostgreSQL and MySQL will use tab characters if a field separator is not specified. MySQL permits the record terminator to be changed from the default newline, but PostgreSQL does not.
select column by name
How to select fields by name.
select column by position
select all columns
rename columns
filter rows
split rows
An aggregation operation is similar to a filtering operation in that it accepts an input relation and produces an output relation with less than or equal number of rows. An aggregation is defined by two functions: a partitioning function which accepts a record and produces a partition value, and a reduction function which accepts a set of records which share a partition value and produces an output record.
select distinct
How to remove duplicate rows from the output set.
Removing duplicates can be accomplished with an aggregation operation in which the partition value is the entire row and a reduction function which returns the first row in the set of rows sharing the partition value.
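At the shell, the awk-plus-sort pipeline in the table can be collapsed into a single awk pass, using an array to remember rows already seen:

```shell
# print each line only the first time it is seen, preserving input order
printf 'b\na\nb\na\n' | awk '!seen[$0]++'
# prints: b / a (one per line)
```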
inner join
In an inner join, only tuples from the input relations which satisfy a join predicate are used in the output relation.
A special but common case is when the join predicate consists of an equality test or a conjunction of two or more equality tests. Such a join is called an equi-join.
awk:
If awk is available at the command line then chances are good that join is also available.
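join requires both inputs to be sorted on the join field; a minimal equi-join with colon-delimited files (the file paths and data here are illustrative):

```shell
# both inputs must be sorted on the join field; -t sets the field delimiter
printf '1:alice\n2:bob\n'   > /tmp/users.txt
printf '1:admin\n2:staff\n' > /tmp/roles.txt
join -t: /tmp/users.txt /tmp/roles.txt
# prints: 1:alice:admin / 2:bob:staff
```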
null treatment in joins
How rows which have a null value for the join column are handled.
Neither SQL nor Pig includes such rows in the output relation unless an outer join (i.e. a left, right, or full join) is specified. Even in the case of an outer join, the rows with a null join column value are not joined with any rows from the other relation, even if the other relation also contains rows with null join column values. Instead the columns that derive from the other input relation will have null values in the output relation.
self join
A self join is when a relation is joined with itself.
If a relation contained a list of people and their parents, then a self join could be used to find a person's grandparents.
pig:
An alias cannot be used in a JOIN statement more than once. Thus to join a relation with itself it must first be copied with a FOREACH statement.
left join
How to include rows from the input relation listed on the left (i.e. listed first) which have values in the join column which don't match any rows from the input relation listed on the right (i.e. listed second). The term is short for left outer join.
As an example, a left join between customers and orders would have a row for every order placed and the customer who placed it. In addition it would have rows for customers who haven't placed any orders. Such rows would have null values for the order information.
sql:
Here is a complete example with the schemas and data used in the left join:
create table customers ( id int, name text );
insert into customers values ( 1, 'John' ), ( 2, 'Mary' ), (3, 'Jane');
create table orders ( id int, customer_id int, amount numeric(9, 2));
insert into orders values ( 1, 2, 12.99 );
insert into orders values ( 2, 3, 5.99 );
insert into orders values ( 3, 3, 12.99 );
select * from customers c left join orders o on c.id = o.customer_id;
id | name | id | customer_id | amount
----+------+----+-------------+--------
1 | John | | |
2 | Mary | 1 | 2 | 12.99
3 | Jane | 2 | 3 | 5.99
3 | Jane | 3 | 3 | 12.99
(4 rows)
left outer join is synonymous with left join. The following query is identical to the one above:
select *
from customers c
left outer join orders o
on c.id = o.customer_id;
pig:
For a complete example assume the following data is in /tmp/customers.txt:
1:John
2:Mary
3:Jane
and /tmp/orders.txt:
1:2:12.99
2:3:5.99
3:3:12.99
Here is the Pig session:
customers = LOAD '/tmp/customers.txt' USING PigStorage(':') AS (id:int, name:chararray);
orders = LOAD '/tmp/orders.txt' USING PigStorage(':') AS (id:int, customer_id:int, amount:float);
j = join customers by id left, orders by customer_id;
dump j;
Here is the output:
(1,John,,,)
(2,Mary,1,2,12.99)
(3,Jane,2,3,5.99)
(3,Jane,3,3,12.99)
full join
A full join is a join in which rows with null values for the join condition from both input relations are included in the output relation.
Left joins, right joins, and full joins are collectively called outer joins.
sql:
We illustrate a full join by using the schema and data from the left join example and adding an order with a null customer_id:
insert into orders values ( 4, null, 7.99);
select * from customers c full join orders o on c.id = o.customer_id;
id | name | id | customer_id | amount
----+------+----+-------------+--------
1 | John | | |
2 | Mary | 1 | 2 | 12.99
3 | Jane | 2 | 3 | 5.99
3 | Jane | 3 | 3 | 12.99
| | 4 | | 7.99
(5 rows)
pig:
For a complete example assume the following data is in /tmp/customers.txt:
1:John
2:Mary
3:Jane
and /tmp/orders.txt:
1:2:12.99
2:3:5.99
3:3:12.99
4::7.99
Here is the Pig session:
customers = LOAD '/tmp/customers.txt' USING PigStorage(':') AS (id:int, name:chararray);
orders = LOAD '/tmp/orders.txt' USING PigStorage(':') AS (id:int, customer_id:int, amount:float);
j = join customers by id full, orders by customer_id;
dump j;
Here is the output:
(1,John,,,)
(2,Mary,1,2,12.99)
(3,Jane,2,3,5.99)
(3,Jane,3,3,12.99)
(,,4,,7.99)
cross join
A cross join is a join with no join predicate. It is also called a Cartesian product. If the input relations have N1, N2, …, Nm rows respectively, then the output relation has $$\prod_{i=1}^{m} N_i$$ rows.
Aggregation
group by
sql:
The columns in the select clause of a select statement with a group by clause must be expressions built up of columns listed in the group by clause and aggregation functions. Aggregation functions can contain expressions containing columns not in the group by clause as arguments.
pig:
The output relation of a GROUP BY operation is always a relation with two fields. The first is the partition value, and the second is a bag containing all the tuples in the input relation which have the partition value.
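The awk idiom in the table row follows the same pattern and can be run standalone; output order from a for-in loop is unspecified, hence the sort:

```shell
# count rows per key, awk's equivalent of GROUP BY ... COUNT(*)
printf 'a\nb\na\na\n' | awk '{ cnt[$1]++ } END { for (k in cnt) print k, cnt[k] }' | sort
# prints: a 3 / b 1
```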
group by multiple columns
How to group by multiple columns. The output relation will have a row for each distinct tuple of column values.
pig:
Tuples must be used to group on multiple fields. Tuple syntax is used in the GROUP BY statement and the first field in the output relation will be tuple. The FLATTEN function can be used to replace the tuple field with multiple fields, one for each component of the tuple.
aggregation functions
The aggregation functions.
sql:
Rows for which the expression given as an argument of an aggregation function is null are excluded from a result. In particular, COUNT(foo) is the number of rows for which the foo column is not null. COUNT(*) always returns the number of rows in the input set, including even rows with columns that are all null.
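The distinction can be sketched in awk, with empty strings standing in for nulls the way PigStorage stores them:

```shell
# NR counts all records (like COUNT(*)); the guard skips "null" rows (like COUNT(foo))
printf 'x\n\ny\n' | awk '$0 != "" { nonnull++ } END { print nonnull, NR }'
# prints: 2 3
```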
pig:
The Pig aggregation functions operate on bags. The GROUP BY operator produces a relation of tuples in which the first field is the partition value and the second is a bag of all the tuples which have the partition value. Since $1 references the second component of a tuple, it is often the argument to the aggregation functions.
A join is an operation on m input relations. If the input relations have n1, n2, …, nm columns respectively, then the output relation has $$\sum_{i=1}^{m} n_i$$ columns.
Functions
define function
How to define a function.
sql:
To be able to write PL/pgSQL functions on a PostgreSQL database, someone with superuser privilege must run the following command:
create language plpgsql;
invoke function
How to invoke a function.
drop function
How to remove a function.
sql:
PL/pgSQL permits functions with the same name and different parameter types. Resolution happens at invocation using the types of the arguments.
When dropping the function the parameter types must be specified. There is no statement for dropping multiple functions with a common name.
Execution Control
if
How to execute code conditionally.
while
How to implement a while loop.
for
How to implement a C-style for loop.
Files
Library and Namespaces
Reflection
SQL
PostgreSQL 9.1: The SQL Language
MySQL 5.6 Reference Manual
SQL has been the leading query language for relational databases since the early 1980s. It received its first ISO standardization in 1986.
SQL statements are classified into three types: data manipulation language (DML), data definition language (DDL), and data control language (DCL). DDL defines and alters the database schema. DCL controls the privileges of database users. DML queries and modifies the data in the tables.
sql statement | type |
---|---|
select | DML |
insert | DML |
update | DML |
delete | DML |
create | DDL |
alter | DDL |
drop | DDL |
grant | DCL |
revoke | DCL |
Awk
awk - pattern-directed scanning and processing language
POSIX specification for awk
POSIX specification for join
POSIX specification for sort
Awk has been included on all Unix systems since 7th Edition Unix in 1979. It provides a concise language for performing transformations on files. An entire program can be provided to the awk interpreter as the first argument on the command line. Because awk string literals use double quotes, single quotes are usually used to quote the awk program for the benefit of the shell. Here's an example which prints the default shell used by root:
awk 'BEGIN{FS=":"} $1=="root" {print $7}' /etc/passwd
An awk script is a sequence of pattern-action pairs. Awk will iterate through the lines of standard input or the lines of the specified input files, testing each pattern against the line and executing the corresponding action if the pattern matches.
Patterns are usually slash delimited regular expressions, e.g. /lorem/, and logical expressions built up from them using the logical operators &&, ||, and !. If an action is provided without an accompanying pattern, the action is executed once for every line of input. The keywords BEGIN and END are special patterns which cause the following action to be executed once at the start and end of execution, respectively.
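A small script exercising all three kinds of pattern:

```shell
# BEGIN runs before any input, the comparison pattern tests each record,
# and END runs after the last record
printf 'root:x\nbin:x\n' | awk -F: '
  BEGIN { print "start" }
  $1 == "root" { print $1 }
  END { print NR, "lines" }'
# prints: start / root / 2 lines
```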
Pig
Apache Pig docs
piggybank.jar
piggybank-0.3-amzn.jar
Pig is a language for specifying Hadoop map reduce jobs. Pig scripts are shorter than equivalent Java source code, especially if joins are required.
There are products such as Hive which can convert an SQL statement to a map reduce job, but Pig has an advantage over Hive in that it can handle a greater variety of data formats in the input files.
Although Pig is intended to be used with a Hadoop grid, Hadoop is not required to run a Pig job. Running a Pig job locally is a convenient way to test a Pig job before running it on a grid.
In addition to some numeric and string data types, Pig provides three compound data types: bags, tuples, and maps. A bag is an array of tuples. It is equivalent to a database table; it is the type Pig uses to hold data which it reads in from files.
Pig has a limited type of variable called an alias. The only data type which can be stored in an alias is a bag. When a bag is stored in an alias it is called a relation or an outer bag. A bag can also be stored in the field of a tuple, in which case it is called an inner bag.
pig relational operators:
Pig provides the following 15 operators for manipulating relations:
relational operator | input relations | output relations | output rows |
---|---|---|---|
CROSS | 2+ | 1 | N1 x N2 x … |
DISTINCT | 1 | 1 | N or fewer |
FILTER | 1 | 1 | N or fewer |
FOREACH | 1 | 1 | N |
GROUP | 1+ | 1 | number of groups |
JOIN | 2+ | 1 | bounded by N1 x N2 x … |
LIMIT | 1 | 1 | min(N, limit argument) |
LOAD | 0 | 1 | lines in file |
MAPREDUCE | 0 | 1 | depends on mapreduce job |
ORDER BY | 1 | 1 | N |
SAMPLE | 1 | 1 | N * fraction argument |
SPLIT | 1 | n | can exceed N if split conditions overlap |
STORE | 1 | 0 | 0 |
STREAM | 1 | 1 | depends on external script |
UNION | 2+ | 1 | N1 + N2 + … + Nn |
Most of the above operators create a new relation from existing relations. Exceptions are LOAD and MAPREDUCE which create relations from external files, STORE which writes a relation to a file, and SPLIT which can create more than one relation.
piggybank UDFs:
It is easy to write user defined functions (UDFs) in Java and make them available to Pig. piggybank.jar and piggybank-0.3-amzn.jar are two publicly available libraries of UDFs.
The functions can be made available by registering the jar, with its actual path substituted for /PATH/TO/, at the top of the Pig script:
REGISTER /PATH/TO/piggybank.jar;
REGISTER /PATH/TO/piggybank-0.3-amzn.jar;
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();