26 changes: 16 additions & 10 deletions ch03/README.md
@@ -27,10 +27,15 @@ cd gmail

## Download Apache Pig ##
```
-wget http://www.trieuvan.com/apache/pig/pig-0.10.1/pig-0.10.1.tar.gz
-tar -xvzf pig-0.10.1.tar.gz
-cd pig-0.10.1
-ant
+wget http://mirrors.ibiblio.org/apache/pig/pig-0.12.0/pig-0.12.0.tar.gz
+tar -xvzf pig-*.tar.gz
+cd pig-0.12.0
```

## Compile Pig for Hadoop 2.0.x ##

```
ant clean jar-withouthadoop -Dhadoopversion=23
```

Now you can run 'bin/pig'!
@@ -89,11 +94,12 @@ bin/mongo agile_data

## Install MongoDB's Java Driver ##

-The MongoDB Java driver is available at https://github.com/mongodb/mongo-java-driver/downloads Download it, and place it at the base of your MongoDB install directory.
+The MongoDB Java driver is available at https://github.com/mongodb/mongo-java-driver/downloads, or you can download a recent snapshot like the one below. Register its path in pig/mongo.pig.

```
cd <my_mongodb_install_path>
-wget https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.10.1.jar
+wget \
+https://oss.sonatype.org/content/repositories/snapshots/org/mongodb/mongo-java-driver/2.12.0-SNAPSHOT/mongo-java-driver-2.12.0-20140213.053134-54.jar
```

## Install mongo-hadoop ##
@@ -120,15 +126,15 @@ find .|grep jar
Fix the paths in 'ch03/pig/mongo.pig' to point at your install paths, then run it to store the email sent counts in MongoDB.

```
-REGISTER </my_mongo_install_path>/mongo-2.10.1.jar
-REGISTER </my_mongo_install_path>/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
-REGISTER </my_mongo_install_path>/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
+REGISTER $HOME/mongo-java-driver*.jar
+REGISTER $HOME/core/target/mongo-hadoop-core-*.jar
+REGISTER $HOME/pig/target/mongo-hadoop-pig-*.jar

set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

sent_counts = LOAD '/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:long);
-STORE sent_counts INTO 'mongodb://localhost/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoStorage();
+STORE sent_counts INTO 'mongodb://localhost/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoInsertStorage('','');
```
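
To make the record shape concrete, here is a minimal Python sketch of the documents the STORE statement above would be expected to produce. It is not part of this pull request. It assumes the input is tab-delimited (Pig's default delimiter when LOAD is used without an explicit storage function) and that the field names follow the LOAD schema (`from`, `to`, `total`). The function name `parse_sent_counts` and the sample addresses are hypothetical.

```python
def parse_sent_counts(lines):
    """Turn tab-delimited 'from<TAB>to<TAB>total' lines into dicts.

    Assumption: one record per line, three tab-separated fields,
    mirroring the schema (from:chararray, to:chararray, total:long).
    """
    docs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sender, recipient, total = line.split("\t")
        docs.append({"from": sender, "to": recipient, "total": int(total)})
    return docs


# Hypothetical sample rows, not taken from the real data set.
sample = [
    "a@example.com\tb@example.com\t3\n",
    "\n",
    "c@example.com\ta@example.com\t12\n",
]
docs = parse_sent_counts(sample)
print(docs)
```

Comparing a few parsed rows against what appears in the `agile_data.sent_counts` collection is one way to sanity-check that the job wrote what you expected.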

## Connect to MongoDB from Python ##
17 changes: 8 additions & 9 deletions ch03/pig/avro_to_mongo.pig
@@ -2,20 +2,19 @@
%default HOME `echo \$HOME/Software/`

/* Load Avro jars and define shortcut */
-REGISTER $HOME/pig/build/ivy/lib/Pig/avro-1.5.3.jar
-REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-1.1.jar
-REGISTER $HOME/pig/contrib/piggybank/java/piggybank.jar
-define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
+REGISTER $HOME/pig/build/ivy/lib/Pig/avro-*.jar
+REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-*.jar
+DEFINE AvroStorage org.apache.pig.builtin.AvroStorage();

/* MongoDB libraries and configuration */
-REGISTER $HOME/mongo-hadoop/mongo-2.10.1.jar
-REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
-REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
+REGISTER $HOME/mongo-java-driver*.jar
+REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
+REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

/* Set speculative execution off so we don't have the chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

-avros = load '$avros' using AvroStorage(); /* For example, 'enron.avro' */
-store avros into '$mongourl' using MongoStorage(); /* For example, 'mongodb://localhost/enron.emails' */
+avros = load '/tmp/sent_counts.txt' using AvroStorage(); /* For example, 'enron.avro' */
+store avros into 'mongodb://localhost/agile_data.sent_counts' using MongoInsertStorage(); /* For example, 'mongodb://localhost/enron.emails' */
9 changes: 4 additions & 5 deletions ch03/pig/mongo.pig
100644 → 100755
@@ -1,12 +1,11 @@
/* Set Home Directory - where we install software */
%default HOME `echo \$HOME/Software/`

-REGISTER $HOME/mongo-hadoop/mongo-2.10.1.jar
-REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
-REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
+REGISTER $HOME/mongo-java-driver*.jar
+REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core_2.2.0-1.2.0.jar
+REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig_2.2.0-1.2.0.jar

set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

sent_counts = LOAD '/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:long);
-STORE sent_counts INTO 'mongodb://localhost/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoStorage();
+STORE sent_counts INTO 'mongodb://127.0.0.1:27017/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoInsertStorage('','');
11 changes: 5 additions & 6 deletions ch03/pig/sent_counts.pig
@@ -1,16 +1,15 @@
/* Set Home Directory - where we install software */
%default HOME `echo \$HOME/Software/`

-REGISTER $HOME/pig/build/ivy/lib/Pig/avro-1.5.3.jar
-REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-1.1.jar
-REGISTER $HOME/pig/contrib/piggybank/java/piggybank.jar
-
-DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
+REGISTER $HOME/pig/pig-*.jar
+REGISTER $HOME/pig/build/ivy/lib/Pig/avro-*.jar
+REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-*.jar
+DEFINE AvroStorage org.apache.pig.builtin.AvroStorage();

rmf /tmp/sent_counts.txt

/* Load the emails in avro format (edit the path to match where you saved them) using the AvroStorage UDF from Piggybank */
-messages = LOAD '/me/Data/test_mbox' USING AvroStorage();
+messages = LOAD '/Data/test_mbox' USING AvroStorage();

/* Filter nulls, they won't help */
messages = FILTER messages BY (from IS NOT NULL) AND (tos IS NOT NULL);