- · Basic Operators
- · Relational Operators
- · Functions
- · Load/Store Functions
- · Math Functions
- · String Functions
- · Tuple, Bag, Map Functions
- · UDFs
- · Data Types
- · Complex Types
Basic Operators
Operator
|
Description
|
Example
|
Arithmetic Operators
|
+, -, *, /, %, ?:
|
X = FOREACH A GENERATE f1, f2,
f1%f2;
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B)); |
Boolean Operators
|
and, or, not
|
X = FILTER A BY (f1==8) OR (NOT
(f2+f3 > f1));
|
Cast Operators
|
Casting from one datatype to
another
|
B = FOREACH A GENERATE (int)$0 +
1;
B = FOREACH A GENERATE $0 + 1, $1 + 1.0 |
Comparison Operators
|
==, !=, >, <, >=, <=,
matches
|
X = FILTER A BY (f1 == 8);
X = FILTER A BY (f2 == ‘apache’); X = FILTER A BY (f1 matches ‘.*apache.*’); |
Construction Operators
|
Used to construct tuple (), bag {}
and map []
|
B = foreach A generate (name,
age);
B = foreach A generate {(name, age)}, {name, age}; B = foreach A generate [name, gpa]; |
Dereference Operators
|
dereference tuples (tuple.id or
tuple.(id,…)), bags (bag.id or bag.(id,…)) and maps (map#’key’)
|
X = FOREACH A GENERATE f2.t1,f2.t3
(dereferencing is used to retrieve two fields from tuple f2)
|
Disambiguate Operator
|
( :: ) used to identify field
names after JOIN, COGROUP, CROSS, or FLATTEN operators
|
A = load ‘data1′ as (x, y);
B = load ‘data2′ as (x, y, z); C = join A by x, B by x; D = foreach C generate A::y; |
Flatten Operator
|
Flatten un-nests tuples as well as
bags
|
consider a relation that has a
tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will
cause that tuple to become (a, b, c).
|
Null Operator
|
is null, is not null
|
X = FILTER A BY f1 is not null;
|
Sign Operators
|
+ -> has no effect, – ->
changes the sign of a positive/negative number
|
A = LOAD ‘data’ as (x, y, z);
B = FOREACH A GENERATE -x, y; |
Operator
|
Description
|
Example
|
COGROUP/GROUP
|
Groups the data in one or more
relations. The COGROUP operator groups together tuples that have the same
group key (key field)
|
A = load ‘student’ AS
(name:chararray,age:int,gpa:float);
B = GROUP A BY age; |
CROSS
|
Computes the cross product of two
or more relations
|
X = CROSS A,B A = (1, 2, 3) B =
(2, 4)
DUMP X; (4, 2, 1) (8, 9) (1,2,3,2,4) (1, 3) (1,2,3,8,9) (1,2,3,1,3) (4,2,1,2,4) (4,2,1,8,9) (4,2,1,1,3) |
DEFINE
|
Assigns an alias to a UDF or
streaming command.
|
DEFINE CMD `perl PigStreaming.pl –
nameMap` input(stdin using PigStreaming(‘,’)) output(stdout using
PigStreaming(‘,’));
A = LOAD ‘file’; B = STREAM B THROUGH CMD; |
DISTINCT
|
Removes duplicate tuples in a
relation.
|
X = DISTINCT A; A = (8,3,4)
DUMP X; (1,2,3) (1,2,3) (4,3,3) (4,3,3) (4,3,3) (8,3,4) (1,2,3) |
FILTER
|
Selects tuples from a relation
based on some condition.
|
X = FILTER A BY f3 == 3; A =
(1,2,3)
DUMP X; (4,5,6) (1,2,3) (7,8,9) (4,3,3) (4,3,3) (8,4,3) (8,4,3) |
FOREACH
|
Generates transformation of data
for each row as specified
|
X = FOREACH A GENERATE a1, a2; A =
(1,2,3)
DUMP X; (4,2,5) (1,2) (8,3,6) (4,2) (8,3) |
IMPORT
|
Import macros defined in a
separate file.
|
/* myscript.pig */
IMPORT ‘my_macro.pig’; |
JOIN
|
Performs an inner join of two or
more relations based on common field values.
|
X = JOIN A BY a1, B BY b1;
DUMP X (1,2,1,3) A = (1,2) B = (1,3) (1,2,1,2) (4,5) (1,2) (4,5,4,7) (4,7) |
LOAD
|
Loads data from the file system.
|
A = LOAD ‘myfile.txt’;
LOAD ‘myfile.txt’ AS (f1:int, f2:int, f3:int); |
MAPREDUCE
|
Executes native MapReduce jobs
inside a Pig script.
|
A = LOAD ‘WordcountInput.txt’;
B = MAPREDUCE ‘wordcount.jar’ STORE A INTO ‘inputDir’ LOAD ‘outputDir’ AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; |
ORDERBY
|
Sorts a relation based on one or
more fields.
|
A = LOAD ‘mydata’ AS (x: int, y:
map[]);
B = ORDER A BY x; |
SAMPLE
|
Partitions a relation into two or
more relations, selects a random data sample with the stated sample size.
|
Relation X will contain 1% of the
data in relation A.
A = LOAD ‘data’ AS (f1:int,f2:int,f3:int); X = SAMPLE A 0.01; |
SPLIT
|
Partitions a relation into two or
more relations based on some expression.
|
SPLIT input_var INTO output_var IF
(field1 is not null), ignored_var IF (field1 is null);
|
STORE
|
Stores or saves results to the
file system.
|
STORE A INTO ‘myoutput’ USING
PigStorage (‘*’);
1*2*3 4*2*1 |
STREAM
|
Sends data to an external script
or program
|
A = LOAD ‘data’;
B = STREAM A THROUGH `stream.pl -n 5`; |
UNION
|
Computes the union of two or more
relations. (Does not preserve the order of tuples)
|
X = UNION A, B; A = (1,2,3) B =
(2,4)
DUMP X; (4,2,1) (8,9) (1,2,3) (1,3) (4,2,1) (2,4) (8,9) (1,3) |
Function
|
Syntax
|
Description
|
AVG
|
AVG(expression
|
Computes the average of the
numeric values in a single-column bag.
|
CONCAT
|
CONCAT (expression, expression)
|
Concatenates two expressions of
identical type.
|
COUNT
|
COUNT(expression)
|
Computes the number of elements in
a bag, it ignores null.
|
COUNT_STAR
|
COUNT_STAR(expression)
|
Computes the number of elements in
a bag, it includes null.
|
DIFF
|
DIFF (expression, expression)
|
Compares two fields in a tuple,
any tuples that are in one bag but not the other are returned in a bag.
|
DIFF
|
DIFF (expression, expression)
|
Compares two fields in a tuple,
any tuples that are in one bag but not the other are returned in a bag.
|
IsEmpty
|
IsEmpty(expression)
|
Checks if a bag or map is empty.
|
MAX
|
MAX(expression)
|
Computes the maximum of the
numeric values or chararrays in a single-column bag
|
MIN
|
MIN(expression)
|
Computes the minimum of the
numeric values or chararrays in a single-column bag.
|
SIZE
|
SIZE(expression)
|
Computes the number of elements
based on any Pig data type. SIZE includes NULL values in the size computation
|
SUM
|
SUM(expression)
|
Computes the sum of the numeric
values in a single-column bag.
|
TOKENIZE
|
TOKENIZE(expression [,
'field_delimiter'])
|
Splits a string and outputs a bag
of words.
|
FUnction
|
Syntax
|
Description
|
Handling Compression
|
A = load ‘myinput.gz’;
store A into ‘myoutput.gz’; |
PigStorage and TextLoader support
gzip and bzip compression for both read (load) and write (store). BinStorage
does not support compression.
|
BinStorage
|
A = LOAD ‘data’ USING
BinStorage();
|
Loads and stores data in
machine-readable format.
|
JsonLoader, JsonStorage
|
A = load ‘a.json’ using
JsonLoader();
|
Load or store JSON data.
|
PigDump
|
STORE X INTO ‘output’ USING PigDump();
|
Stores data in UTF-8 format.
|
PigStorage
|
A = LOAD ‘student’ USING
PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float);
|
Loads and stores data as
structured text files.
|
TextLoader
|
A = LOAD ‘data’ USING
TextLoader();
|
Loads unstructured data in UTF-8
format.
|
Operator
|
Description
|
Example
|
ABS
|
ABS(expression)
|
Returns the absolute value of an
expression. If the result is not negative (x ≥ 0), the result is returned. If
the result is negative (x < 0), the negation of the result is returned.
|
ACOS
|
ACOS(expression)
|
Returns the arc cosine of an
expression.
|
ASIN
|
ASIN(expression)
|
Returns the arc sine of an
expression.
|
ATAN
|
ATAN(expression)
|
Returns the arc tangent of an
expression.
|
CBRT
|
CBRT(expression)
|
Returns the cube root of an expression.
|
CEIL
|
CEIL(expression)
|
Returns the value of an expression
rounded up to the nearest integer. This function never decreases the result
value.
|
COS
|
COS(expression)
|
Returns the trigonometric cosine
of an expression.
|
COSH
|
COSH(expression)
|
Returns the hyperbolic cosine of
an expression.
|
EXP
|
EXP(expression)
|
Returns Euler’s number e raised to
the power of x.
|
FLOOR
|
FLOOR(expression)
|
Returns the value of an expression
rounded down to the nearest integer. This function never increases the result
value.
|
LOG
|
LOG(expression)
|
Returns the natural logarithm
(base e) of an expression.
|
LOG10
|
LOG10(expression)
|
Returns the base 10 logarithm of
an expression.
|
RANDOM
|
RANDOM( )
|
Returns a pseudo random number
(type double) greater than or equal to 0.0 and less than 1.0.
|
ROUND
|
ROUND(expression)
|
Returns the value of an expression
rounded to an integer (if the result type is float) or rounded to a long (if
the result type is double).
|
SIN
|
SIN(expression)
|
Returns the sine of an expression.
|
SINH
|
SINH(expression)
|
Returns the hyperbolic sine of an
expression.
|
SQRT
|
SQRT(expression)
|
Returns the positive square root
of an expression.
|
TAN
|
TAN(expression)
|
Returns the trignometric tangent
of an angle.
|
TANH
|
TANH(expression)
|
Returns the hyperbolic tangent of
an expression.
|
Operator
|
Description
|
Example
|
INDEXOF
|
INDEXOF(string, ‘character’,
startIndex)
|
Returns the index of the first
occurrence of a character in a string, searching forward from a start index.
|
LAST_INDEX
|
LAST_INDEX_OF(expression)
|
Returns the index of the last
occurrence of a character in a string, searching backward from a start index.
|
LCFIRST
|
LCFIRST(expression)
|
Converts the first character in a
string to lower case.
|
LOWER
|
LOWER(expression)
|
Converts all characters in a
string to lower case.
|
REGEX_EXTRACT
|
REGEX_EXTRACT (string, regex,
index)
|
Performs regular expression
matching and extracts the matched group defined by an index parameter. The
function uses Java regular expression form.
|
REGEX_EXTRACT_ALL
|
REGEX_EXTRACT (string, regex)
|
Performs regular expression
matching and extracts all matched groups. The function uses Java regular
expression form.
|
REPLACE
|
REPLACE(string, ‘oldChar’,
‘newChar’);
|
Replaces existing characters in a
string with new characters.
|
STRSPLIT
|
STRSPLIT(string, regex, limit)
|
Splits a string around matches of
a given regular expression.
|
SUBSTRING
|
SUBSTRING(string, startIndex,
stopIndex)
|
Returns a substring from a given
string.
|
TRIM
|
TRIM(expression)
|
Returns a copy of a string with
leading and trailing white space removed.
|
UCFIRST
|
UCFIRST(expression)
|
Returns a string with the first
character converted to upper case.
|
UPPER
|
UPPER(expression)
|
Returns a string converted to
upper case.
|
Operator
|
Description
|
Example
|
TOTUPLE
|
TOTUPLE(expression [, expression
...])
|
Converts one or more expressions
to type tuple.
|
TOBAG
|
TOBAG(expression [, expression
...])
|
Converts one or more expressions
to individual tuples which are then placed in a bag.
|
TOMAP
|
TOMAP(key-expression, value-expression
[, key-expression, value-expression ...])
|
Converts key/value expression
pairs into a map. Needs an even number of expressions as parameters. The
elements must comply with map type rules.
|
TOP
|
TOP(topN,column,relation)
|
Returns the top-n tuples from a
bag of tuples.
|
Pig provides extensive support for user defined functions (UDFs)
as a way to specify custom processing. Pig UDFs can currently be implemented in
three languages: Java, Python, JavaScript and Ruby.
Registering UDFs
Registering Java UDFs:
---register_java_udf.pig
register
'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends' as
(exchange:chararray, symbol:chararray,
date:chararray,
dividends:float);
Registering Python UDFs (The Python script must be in your
current directory):
--register_python_udf.pig
register
'production.py' using jython as bballudfs;
players = load 'baseball' as (name:chararray,
team:chararray,
pos:bag{t:(p:chararray)},
bat:map[]);
Writing UDFs
Java UDFs:
package myudfs;
import
java.io.IOException;
import
org.apache.pig.EvalFunc;
import
org.apache.pig.data.Tuple;
public class UPPER
extends EvalFunc
{
public String exec(Tuple input) throws
IOException {
if (input == null || input.size() == 0)
return null;
try{
String str =
(String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new
IOException("Caught exception processing input row ", e);
}
}
}
Python UDFs
#Square - Square of a
number of any data type
@outputSchemaFunction("squareSchema")
-- Defines a script delegate function that defines schema for this function
depending upon the input type.
def square(num):
return ((num)*(num))
@schemaFunction("squareSchema")
--Defines delegate function and is not registered to Pig.
def squareSchema(input):
return input
#Percent- Percentage
@outputSchema("percent:double")
--Defines schema for a script UDF in a format that Pig understands and is able
to parse
def percent(num, total):
return num * 100 / total
Simple
Types
Operator
|
Description
|
Example
|
int
|
Signed 32-bit integer
|
10
|
long
|
Signed 64-bit integer
|
Data: 10L or 10l
Display: 10L |
float
|
32-bit floating point
|
Data: 10.5F or 10.5f or 10.5e2f or
10.5E2F
Display: 10.5F or 1050.0F |
double
|
64-bit floating point
|
Data: 10.5 or 10.5e2 or 10.5E2
Display: 10.5 or 1050.0 |
chararray
|
Character array (string) in
Unicode UTF-8 format
|
hello world
|
bytearray
|
Byte array (blob)
|
|
boolean
|
boolean
|
true/false (case insensitive)
|
Operator
|
Description
|
Example
|
tuple
|
An ordered set of fields.
|
(19,2)
|
bag
|
An collection of tuples.
|
{(19,2), (18,1)}
|
map
|
A set of key value pairs.
|
[name#John,phone#5551212]Pig Functunal Commands
|
No comments:
Post a Comment