Hive is awkward to use from Spark (or perhaps not usable at all?), so I tried Spark SQL instead.
It took a fair amount of trial and error, so here are my notes.
Data file
Prepare a file whose lines have the following format:
stock code,date,open,high,low,close,volume
It looks like this.
1301,2004-04-01,198,198,195,196,651000
1301,2004-04-02,194,196,194,196,490000
1301,2004-04-05,196,200,195,197,1478000
1301,2004-04-06,202,208,200,207,4324000
Upload this file to S3.
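To check the file locally before uploading, the line format above can be parsed with plain Scala. This is only a minimal sketch: `PriceRow` and `PriceRowParser` are hypothetical names that do not appear in the original code, and treating the prices as `Int` and the volume as `Long` is an assumption based on the sample rows.

```scala
import scala.util.Try

// Hypothetical record type for one line of the price file (not part of the original job).
final case class PriceRow(
  code: String,
  date: String,
  open: Int,
  high: Int,
  low: Int,
  close: Int,
  volume: Long
)

object PriceRowParser {
  // Parses "code,date,open,high,low,close,volume".
  // Returns None when the column count is wrong or a number does not parse.
  def parse(line: String): Option[PriceRow] =
    line.split(",").map(_.trim) match {
      case Array(code, date, open, high, low, close, volume) =>
        Try(
          PriceRow(code, date, open.toInt, high.toInt, low.toInt, close.toInt, volume.toLong)
        ).toOption
      case _ => None
    }
}
```

Returning `Option` instead of throwing makes it easy to count or skip malformed lines before they reach the cluster.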
build.sbt
Write it as follows. Building the assembly failed with errors, so I ended up with this configuration.
name := "spark_sample"

version := "1.0-SNAPSHOT"

scalaVersion := "2.11.7"

// additional libraries
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.5.2",
  "org.apache.spark" %% "spark-hive" % "1.5.2",
  "org.apache.spark" %% "spark-streaming" % "1.5.2",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2",
  "org.apache.spark" %% "spark-streaming-flume" % "1.5.2",
  "org.apache.spark" %% "spark-mllib" % "1.5.2",
  "org.apache.commons" % "commons-lang3" % "3.0",
  "org.eclipse.jetty" % "jetty-client" % "8.1.14.v20131031",
  "com.typesafe.play" %% "play-json" % "2.3.10",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.4",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.3",
  "org.elasticsearch" % "elasticsearch-hadoop-mr" % "2.0.0.RC1",
  "net.sf.opencsv" % "opencsv" % "2.0",
  "com.twitter.elephantbird" % "elephant-bird" % "4.5",
  "com.twitter.elephantbird" % "elephant-bird-core" % "4.5",
  "com.hadoop.gplcompression" % "hadoop-lzo" % "0.4.17",
  "mysql" % "mysql-connector-java" % "5.1.31",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3",
  "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.5.0-M3",
  "com.github.scopt" %% "scopt" % "3.2.0",
  "org.scalatest" %% "scalatest" % "2.2.1" % "test",
  "com.holdenkarau" %% "spark-testing-base" % "1.5.1_0.2.1",
  "org.apache.hive" % "hive-jdbc" % "1.2.1"
)

resolvers ++= Seq(
  "JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
  "Spray Repository" at "http://repo.spray.cc/",
  "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "Akka Repository" at "http://repo.akka.io/releases/",
  "Twitter4J Repository" at "http://twitter4j.org/maven2/",
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Twitter Maven Repo" at "http://maven.twttr.com/",
  "scala-tools" at "https://oss.sonatype.org/content/groups/scala-tools",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
  "Second Typesafe repo" at "http://repo.typesafe.com/typesafe/maven-releases/",
  "Mesosphere Public Repository" at "http://downloads.mesosphere.io/maven",
  Resolver.sonatypeRepo("public")
)

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
    case m if m.startsWith("META-INF") => MergeStrategy.discard
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
    case "about.html" => MergeStrategy.rename
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  }
}
For reference, project/assembly.sbt looks like this:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
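The merge strategy in build.sbt is written with the older `mergeStrategy` key and the `<<=` operator. As far as I know, the sbt-assembly README for the 0.14.x line documents the same setting as `assemblyMergeStrategy` with `:=`. The following is a sketch of how the rules could look in that form, assuming that key is available in the plugin version in use; it is a build.sbt fragment, not a tested drop-in replacement.

```scala
// build.sbt fragment (sketch): the same rules expressed with assemblyMergeStrategy.
// Assumes sbt-assembly 0.14.x; verify against the plugin version you actually use.
assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.startsWith("META-INF")              => MergeStrategy.discard
  case PathList("javax", "servlet", xs @ _*)      => MergeStrategy.first
  case PathList("org", "apache", xs @ _*)         => MergeStrategy.first
  case PathList("org", "jboss", xs @ _*)          => MergeStrategy.first
  case "about.html"                               => MergeStrategy.rename
  case "reference.conf"                           => MergeStrategy.concat
  case _                                          => MergeStrategy.first
}
```

Because the final case already catches everything, there is no need to fall back to the previous strategy here.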
SqlSample.scala
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-15-to-16
I used this part of the guide as a reference.
package sample

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark._
import org.apache.spark.api.java._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object SqlSample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkSQL").setMaster("yarn-cluster")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Import Row.
    import org.apache.spark.sql.Row;
    // Import Spark SQL data types
    import org.apache.spark.sql.types.{StructType,StructField,StringType};

    val histRDD = sc.textFile(args(0)).map(_.split(",")).
      map(p => Row(p(0), p(1),p(2),p(3),p(4),p(5),p(6)))

    val schemaString = "code date open high low close volume"
    val schema = StructType(
      schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Apply the schema to the RDD.
    val histDataFrame = sqlContext.createDataFrame(histRDD, schema)

    // Register the DataFrames as a table.
    histDataFrame.registerTempTable("priceHist")

    // SQL statements can be run by using the sql methods provided by sqlContext.
    val results = sqlContext.sql("SELECT code,date,open FROM priceHist where code='6758'")
    val ary=results.map(_.getValuesMap[Any](List("code", "date","open"))).collect()

    val outputLocation = args(1) // s3n://bucket/
    val data=sc.makeRDD(ary)
    data.saveAsTextFile(outputLocation)

    sc.stop()
  }
}
build
$ sbt package
Upload the resulting JAR to S3 as well.
EMR
As before, create an EMR cluster and add a Spark application with Add Step. For the JAR, specify the one uploaded earlier.
Spark-submit options
--class sample.SqlSample
Arguments
s3n://bucket/output
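The output files are written here. Note that SqlSample reads the input path from args(0) and the output location from args(1), so the S3 path of the data file has to be given before the output location.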
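Since SqlSample indexes directly into args(0) and args(1), a missing argument only shows up as an exception on the cluster. A small check like the one below could fail earlier with a readable message. It is a sketch only: `JobArgs` and the usage text are hypothetical and not part of the original program.

```scala
object JobArgs {
  // Returns Right((inputPath, outputPath)) when exactly two non-empty arguments are given,
  // otherwise Left with a usage message.
  def parse(args: Array[String]): Either[String, (String, String)] =
    args match {
      case Array(input, output) if input.nonEmpty && output.nonEmpty =>
        Right((input, output))
      case _ =>
        Left("usage: SqlSample <input path> <output path>")
    }
}
```

In `main`, one could match on the result and exit before creating the SparkContext when it is a `Left`.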
When the step runs, the data is saved to the output location, with each row represented as a Map.
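The Map representation comes from getValuesMap combined with saveAsTextFile, which writes each element's toString. If plain CSV lines are easier to consume downstream, each Map could be rendered in a fixed column order before saving. This is a minimal plain-Scala sketch: `OutputFormatSketch` is a hypothetical name, and the demo values are taken from the first sample row of the data file purely for illustration.

```scala
object OutputFormatSketch {
  // Column order used by the query in SqlSample.
  val columns = List("code", "date", "open")

  // Renders one result row as a CSV line; missing columns become empty strings.
  def toCsv(row: Map[String, Any]): String =
    columns.map(c => row.getOrElse(c, "").toString).mkString(",")

  def main(args: Array[String]): Unit = {
    val row: Map[String, Any] =
      Map("code" -> "1301", "date" -> "2004-04-01", "open" -> "198")
    println(row)        // Map(code -> 1301, date -> 2004-04-01, open -> 198)
    println(toCsv(row)) // 1301,2004-04-01,198
  }
}
```

In the job itself, the same function could be applied to each collected Map before the results are written out.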