AWS EMR is a service that makes it easy to use Spark, Hive, and the rest of that stack.
I created the server with a few clicks in EMR.
This took about 10 minutes.
Analyzing sample data with SparkR
I adapted the walkthrough from this article:
http://engineer.recruit-lifestyle.co.jp/techblog/2015-08-19-sparkr/
Getting the data
http://stat-computing.org/dataexpo/2009/the-data.html
Download the data for 2001, 2002, and 2003 from the page above.
    $ wget http://stat-computing.org/dataexpo/2009/2001.csv.bz2
Decompress
    $ bunzip2 2001.csv.bz2
Upload to S3
    $ aws s3 cp 2001.csv s3://samplebucket/airline/
Repeat the same steps for 2002 and 2003.
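The three steps can also be wrapped in a loop instead of being typed once per year. The following is only a minimal sketch: the `run` and `upload_year` helpers and the `DRY_RUN` switch are names made up for this illustration, and the bucket is the sample bucket name used in this post. By default it only prints the commands; set `DRY_RUN=0` to actually execute them.

```shell
#!/bin/sh
# Sketch: download, decompress, and upload each year of the airline data.
# Assumptions: wget, bunzip2, and the AWS CLI are installed and configured.
set -e

BUCKET="s3://samplebucket/airline/"   # sample bucket name from this post
DRY_RUN="${DRY_RUN:-1}"               # hypothetical switch: 1 = only print commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper: the same three steps shown above, for a single year.
upload_year() {
    year="$1"
    run wget "http://stat-computing.org/dataexpo/2009/${year}.csv.bz2"
    run bunzip2 "${year}.csv.bz2"
    run aws s3 cp "${year}.csv" "$BUCKET"
}

for year in 2001 2002 2003; do
    upload_year "$year"
done
```

Running it as-is lists the nine commands, which is a cheap way to check the URLs and the bucket path before downloading anything.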
Hive
    $ hive
    hive> add jar /usr/lib/hive/lib/hive-contrib.jar;
    Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
    Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
    hive> create table airline(
        > Year STRING,
        > Month STRING,
        > DayofMonth STRING,
        > DayOfWeek STRING,
        > DepTime STRING,
        > CRSDepTime STRING,
        > ArrTime STRING,
        > CRSArrTime STRING,
        > UniqueCarrier STRING,
        > FlightNum STRING,
        > TailNum STRING,
        > ActualElapsedTime STRING,
        > CRSElapsedTime STRING,
        > AirTime STRING,
        > ArrDelay STRING,
        > DepDelay STRING,
        > Origin STRING,
        > Dest STRING,
        > Distance STRING,
        > TaxiIn STRING,
        > TaxiOut STRING,
        > Cancelled STRING,
        > CancellationCode STRING,
        > Diverted STRING,
        > CarrierDelay STRING,
        > WeatherDelay STRING,
        > NASDelay STRING,
        > SecurityDelay STRING,
        > LateAircraftDelay STRING
        > )
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
        > LOCATION 's3://samplebucket/airline/' tblproperties ("skip.header.line.count"="1");
    hive> select * from airline limit 1;
    OK
    2001	1	17	3	1806	1810	1931	1934	US	375	N700��	85	84	60	-3	-4	BWI	CLT	361	5	20	0	NA	0	NA	NA	NA	NA	NA
SparkR
    $ sparkR
    > install.packages("magrittr")
    > library(magrittr)
    > hiveContext <- sparkRHive.init(sc)
    > airline <- sql(hiveContext, "select * from airline")
    > class(airline)
    [1] "DataFrame"
    attr(,"package")
    [1] "SparkR"
    > airline %>%
    +   filter(airline$Origin == "JFK") %>%
    +   group_by(airline$Dest) %>%
    +   agg(count=n(airline$Dest)) %>%
    +   head
      Dest count
    1  IAH  1214
    2  STL  2922
    3  SNA   805
    4  MSP  1580
    5  STT  1085
    6  SAN  2723
And that was all it took.