AWS EMR is a managed service that lets you spin up Spark, Hive, and the rest of that toolchain with minimal effort.
Click through the EMR console to create a cluster.
This takes about 10 minutes.
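If you prefer the command line to clicking through the console, something like the following should create an equivalent cluster. This is a sketch: the cluster name, key pair name, release label, and instance sizing are placeholders, so adjust them to your account.

```shell
# Hypothetical CLI equivalent of the console steps above.
# KeyName, instance type/count, and release label are placeholders.
aws emr create-cluster \
  --name "sparkr-sample" \
  --release-label emr-4.2.0 \
  --applications Name=Hive Name=Spark \
  --ec2-attributes KeyName=mykey \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```

The command returns the new cluster ID; SSH into the master node once the cluster reaches the WAITING state.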
Let's analyze some sample data with SparkR.
This is an adaptation of the walkthrough here:
http://engineer.recruit-lifestyle.co.jp/techblog/2015-08-19-sparkr/
Getting the data
http://stat-computing.org/dataexpo/2009/the-data.html
Download the 2001, 2002, and 2003 data from the page above:
$ wget http://stat-computing.org/dataexpo/2009/2001.csv.bz2
Decompress:
$ bunzip2 2001.csv.bz2
Upload to S3:
$ aws s3 cp 2001.csv s3://samplebucket/airline/
Repeat the same steps for 2002 and 2003.
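The three manual steps above can be collapsed into one loop. This is a sketch using the same URL pattern and bucket as above; each command is prefixed with `echo` so it prints what it would run, so remove the `echo`s to actually download, decompress, and upload.

```shell
# Dry-run loop over the three years (same URL pattern and bucket
# as in the post). Drop the "echo" prefixes to execute for real.
for year in 2001 2002 2003; do
  echo wget "http://stat-computing.org/dataexpo/2009/${year}.csv.bz2"
  echo bunzip2 "${year}.csv.bz2"
  echo aws s3 cp "${year}.csv" "s3://samplebucket/airline/"
done
```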
Hive
$ hive
hive> add jar /usr/lib/hive/lib/hive-contrib.jar;
Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
hive> create table airline(
    >   Year STRING,
    >   Month STRING,
    >   DayofMonth STRING,
    >   DayOfWeek STRING,
    >   DepTime STRING,
    >   CRSDepTime STRING,
    >   ArrTime STRING,
    >   CRSArrTime STRING,
    >   UniqueCarrier STRING,
    >   FlightNum STRING,
    >   TailNum STRING,
    >   ActualElapsedTime STRING,
    >   CRSElapsedTime STRING,
    >   AirTime STRING,
    >   ArrDelay STRING,
    >   DepDelay STRING,
    >   Origin STRING,
    >   Dest STRING,
    >   Distance STRING,
    >   TaxiIn STRING,
    >   TaxiOut STRING,
    >   Cancelled STRING,
    >   CancellationCode STRING,
    >   Diverted STRING,
    >   CarrierDelay STRING,
    >   WeatherDelay STRING,
    >   NASDelay STRING,
    >   SecurityDelay STRING,
    >   LateAircraftDelay STRING
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    > LOCATION 's3://samplebucket/airline/'
    > tblproperties ("skip.header.line.count"="1");
hive> select * from airline limit 1;
OK
2001 1 17 3 1806 1810 1931 1934 US 375 N700�� 85 84 60 -3 -4 BWI CLT 361 5 20 0 NA 0 NA NA NA NA NA
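Once the table is defined, you don't have to stay in the interactive shell; `hive -e` runs a single query and exits, which is handy for sanity checks. A sketch (assumes the `airline` table created above):

```shell
# Non-interactive sanity check: row counts per year.
# Requires the EMR master node with the airline table already created.
hive -e "select Year, count(*) from airline group by Year;"
```

Since `Year` comes from the file names we uploaded, you'd expect one row each for 2001, 2002, and 2003 (plus possibly a header row if any file slipped past skip.header.line.count).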
SparkR
$ sparkR
> install.packages("magrittr")
> library(magrittr)
> hiveContext <- sparkRHive.init(sc)
> airline <- sql(hiveContext, "select * from airline")
> class(airline)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"
> airline %>%
+   filter(airline$Origin == "JFK") %>%
+   group_by(airline$Dest) %>%
+   agg(count = n(airline$Dest)) %>%
+   head
  Dest count
1  IAH  1214
2  STL  2922
3  SNA   805
4  MSP  1580
5  STT  1085
6  SAN  2723
And that's all it took.