utsubo – Page 11

docker build errors

Posted by utsubo on 2016-01-12 in docker, linux

A docker build that has been working fine can sometimes start failing out of the blue:

$ docker build .
..
Err http://archive.ubuntu.com/ubuntu/ trusty-security/main libnss3-nssdb all 2:3.19.2.1-0ubuntu0.14.04.1
404	Not Found [IP: 91.189.88.149 80]
Err http://archive.ubuntu.com/ubuntu/ trusty-security/main libnss3 amd64 2:3.19.2.1-0ubuntu0.14.04.1
404	Not Found [IP: 91.189.88.149 80]
Fetched 108 MB in 3min 14s (555 kB/s)
Unable to correct missing packages.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/n/nss/libnss3-nssdb_3.19.2.1-0ubuntu0.14.04.1_all.deb	404	Not Found [IP: 91.189.88.149 80]
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/n/nss/libnss3_3.19.2.1-0ubuntu0.14.04.1_amd64.deb	404	Not Found [IP: 91.189.88.149 80]

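When this happens, rebuilding without the cache fixes it. The likely cause is a cached apt-get update layer whose package index still points at package versions that have since been removed from the archive.

$ docker build --no-cache .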

A Hadoop Java sample on EMR

Posted by utsubo on 2016-01-08 in AWS

Create a directory layout like this:

├── bin
├── pom.xml
└── src
    └── main
        └── java
            └── emrhadoop
                ├── WordCountMain.java
                ├── WordCountMapper.java
                └── WordCountReducer.java

Create pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>jp.qri.emr</groupId>
	<artifactId>emrhive</artifactId>
	<version>1.0-SNAPSHOT</version>
	<packaging>jar</packaging>
	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>
	<repositories>
		<repository>
			<id>cloudera</id>
			<url>https://repository.cloudera.com/content/repositories/releases/</url>
		</repository>
	</repositories>
	<dependencies>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.12</version>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-core</artifactId>
			<version>1.2.1</version>
		</dependency>
	</dependencies>
	<build>
		<plugins>
						<plugin>
								<groupId>org.apache.maven.plugins</groupId>
								<artifactId>maven-dependency-plugin</artifactId>
								<configuration>
										<outputDirectory>
												${project.build.directory}
										</outputDirectory>
								</configuration>
						</plugin>
						<plugin>
								<artifactId>maven-assembly-plugin</artifactId>
								<configuration>
										<descriptorRefs>
												<descriptorRef>jar-with-dependencies</descriptorRef>
										</descriptorRefs>
								</configuration>
						</plugin>
						<plugin>
								<groupId>org.apache.maven.plugins</groupId>
								<artifactId>maven-shade-plugin</artifactId>
								<version>2.4.2</version>
								<configuration>
										<transformers>
												<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
														<resource>reference.conf</resource>
												</transformer>
										</transformers>
								</configuration>
								<executions>
										<execution>
												<phase>package</phase>
												<goals>
														<goal>shade</goal>
												</goals>
										</execution>
								</executions>
						</plugin>
				</plugins>
	</build>
</project>

Make the project importable into Eclipse:

mvn eclipse:eclipse

The Java files look like this.

WordCountMain.java

package emrhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountMain {

    /**
     * Configure and run the Job.
     *
     * @param args input path and output path
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {

        System.out.println("Master node start");

        // Configure the Job that runs on the worker nodes
        Job job = Job.getInstance();
        job.setJarByClass(WordCountMain.class);
        job.setJobName("wordcount");

        // Specify the key and value types of the job output
        // (also used for the Mapper output unless overridden)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Specify the Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // If no Reducer is needed, specify: job.setNumReduceTasks(0);

        // Specify the format used to read the input data and pass it to the Mapper
        job.setInputFormatClass(TextInputFormat.class);
        // Specify the format used to write out the data received from the Reducer
        job.setOutputFormatClass(TextOutputFormat.class);

        // Read the arguments
        // Note: when launched from the CLI, args[0] may apparently hold the main class name instead.
        String inputPath = args[0];
        System.out.println("arg 0 : " + inputPath);
        String outputPath = args[1];
        System.out.println("arg 1 : " + outputPath);

        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        // Run the Job
        boolean result = job.waitForCompletion(true);
        System.out.println("result : " + result);

        System.out.println("Master node end");
		}
}

WordCountMapper.java

package emrhadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper
 *
 * The generic parameters given when extending Mapper determine the types of the map method:
 * Mapper<input key type, input value type, output key type, output value type>
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * Initialization
     */
    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper setup");
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Take the input value (one line of data)
        String line = value.toString();

        // Split it into words
        StringTokenizer tokenizer = new StringTokenizer(line);

        IntWritable one = new IntWritable(1);
        Text word = new Text();

        // Loop over the words
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());

            // Pass (word, count) to the Reducer for each word. Here we simply emit 1 per word,
            // but the Mapper could also pre-aggregate before passing values to the Reducer.
            context.write(word, one);
        }
    }

    /**
     * Cleanup
     */
    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        System.out.println("Mapper cleanup");
		}
}

WordCountReducer.java

package emrhadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Reducer
 *
 * The generic parameters given when extending Reducer determine the types of the reduce method:
 * Reducer<input key type, input value type, output key type, output value type>
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // Sum the values passed from the Mapper
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Write the result
        context.write(key, new IntWritable(sum));
		}

}
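
To make the data flow between the Mapper and the Reducer easier to picture, here is a minimal sketch in plain Ruby. It does not use Hadoop at all; the helper names map_line, shuffle, reduce_counts and word_count are made up for this illustration, and the grouping step only imitates what the framework does between the two phases.

```ruby
# Plain-Ruby sketch of the WordCount data flow (no Hadoop involved).
# map_line, shuffle, reduce_counts and word_count are hypothetical helpers.

# Map phase: emit a [word, 1] pair for every whitespace-separated token.
def map_line(line)
  line.split.map { |word| [word, 1] }
end

# Shuffle phase: group the emitted values by key, imitating what the
# framework does between the Mapper and the Reducer.
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }
end

# Reduce phase: sum the values collected for one key.
def reduce_counts(word, values)
  [word, values.sum]
end

def word_count(lines)
  pairs = lines.flat_map { |line| map_line(line) }
  shuffle(pairs).map { |word, values| reduce_counts(word, values) }.to_h
end

p word_count(["a b a", "b c"])
```

Each input line is mapped independently, which is what lets Hadoop spread the map calls across nodes; only the reduce step needs all values for a given word in one place.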

Build the JAR

mvn package

AWS Console

First, create the EMR cluster.
From Create Cluster, choose Go to advanced options.
Under Hardware Configuration, change the EC2 instance type as needed. m1.medium is probably the cheapest.
EMR now supports VPC, so if you want the cluster inside a VPC, select it here.
Set the key pair, security groups, and so on as appropriate.
Then copy the built JAR to S3.
Copy the input file for WordCount to S3:

aws s3 cp input.txt s3://bucket/input/

Launch the job from Steps.

For Step type, select Custom JAR.
For JAR location, enter the S3 location of the JAR file copied earlier.
For Arguments, enter:

emrhadoop.WordCountMain s3n://bucket/input/input.txt s3n://bucket/output

Note that the job fails if the output directory already exists, because Hadoop refuses to write into an existing output directory.

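Running Spark on Amazon EMR

Posted by utsubo on 2015-12-22 in AWS

I got stuck on this for quite a while, so here are my notes.

I ran Spark on an EMR cluster created inside a VPC. There are a fair number of samples around, but none of them worked properly as-is, so it took some effort.

EMR

First, create the EMR cluster inside the VPC. Open the EMR console and click Create Cluster.

To create the cluster inside a VPC, you need to proceed via Go to advanced options near the top.

Configure the VPC and subnet, set permissions and so on, and create the cluster.

The computation program

Create a directory layout like the one below.

This uses Scala 2.10 and JDK 1.8.

In S3, create a bucket named bucket with an output directory under it beforehand, and set permissions so that EMR can access it.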

├── build.sbt
├── project
│   └── assembly.sbt
├── src
│   ├── main
│   │   ├── java
│   │   ├── resources
│   │   └── scala
│   │       └── sample
│   │           └── SparkPi.scala
│   └── test
│       ├── resources
│       └── scala
└── target

build.sbt

name := "emrscala"

version := "0.0.1"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
	("org.apache.spark" %% "spark-sql" % "1.3.1").
		exclude("org.mortbay.jetty", "servlet-api").
		exclude("com.google.guava","guava").
		exclude("org.apache.hadoop","hadoop-yarn-api").
		exclude("commons-beanutils", "commons-beanutils-core").
		exclude("commons-beanutils", "commons-beanutils").
		exclude("commons-collections", "commons-collections").
		exclude("commons-logging", "commons-logging").
		exclude("org.spark-project.spark", "unused"). 
		exclude("com.twitter", "parquet-encoding").
		exclude("com.twitter", "parquet-column").
		exclude("com.twitter", "parquet-hadoop-bundle").
		exclude("org.datanucleus", "datanucleus-api-jdo").
		exclude("org.datanucleus", "datanucleus-core").
		exclude("org.datanucleus", "datanucleus-rdbms").
		exclude("com.esotericsoftware.minlog", "minlog"),
	("org.apache.spark" %% "spark-mllib" % "1.3.1").
		exclude("org.mortbay.jetty", "servlet-api").
		exclude("com.google.guava","guava").
		exclude("org.apache.hadoop","hadoop-yarn-api").
		exclude("commons-beanutils", "commons-beanutils-core").
		exclude("commons-beanutils", "commons-beanutils").
		exclude("commons-collections", "commons-collections").
		exclude("commons-logging", "commons-logging").
		exclude("org.spark-project.spark", "unused"). 
		exclude("com.twitter", "parquet-encoding").
		exclude("com.twitter", "parquet-column").
		exclude("com.twitter", "parquet-hadoop-bundle").
		exclude("org.datanucleus", "datanucleus-api-jdo").
		exclude("org.datanucleus", "datanucleus-core").
		exclude("org.datanucleus", "datanucleus-rdbms").
		exclude("com.esotericsoftware.minlog", "minlog"),
	("org.apache.spark" %% "spark-hive" % "1.3.1").
		exclude("org.mortbay.jetty", "servlet-api").
		exclude("com.google.guava","guava").
		exclude("org.apache.hadoop","hadoop-yarn-api").
		exclude("commons-beanutils", "commons-beanutils-core").
		exclude("commons-beanutils", "commons-beanutils").
		exclude("commons-collections", "commons-collections").
		exclude("commons-logging", "commons-logging").
		exclude("org.spark-project.spark", "unused"). 
		exclude("com.twitter", "parquet-encoding").
		exclude("com.twitter", "parquet-column").
		exclude("com.twitter", "parquet-hadoop-bundle").
		exclude("org.datanucleus", "datanucleus-api-jdo").
		exclude("org.datanucleus", "datanucleus-core").
		exclude("org.datanucleus", "datanucleus-rdbms").
		exclude("com.esotericsoftware.minlog", "minlog"),
	("org.apache.spark" %% "spark-core" % "1.3.1").
		exclude("org.mortbay.jetty", "servlet-api").
		exclude("com.google.guava","guava").
		exclude("org.apache.hadoop","hadoop-yarn-api").
		exclude("commons-beanutils", "commons-beanutils-core").
		exclude("commons-beanutils", "commons-beanutils").
		exclude("commons-collections", "commons-collections").
		exclude("commons-logging", "commons-logging").
		exclude("org.spark-project.spark", "unused"). 
		exclude("com.twitter", "parquet-encoding").
		exclude("com.twitter", "parquet-column").
		exclude("com.twitter", "parquet-hadoop-bundle").
		exclude("org.datanucleus", "datanucleus-api-jdo").
		exclude("org.datanucleus", "datanucleus-core").
		exclude("org.datanucleus", "datanucleus-rdbms").
		exclude("com.esotericsoftware.minlog", "minlog")
)

assembly.sbt

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")

SparkPi.scala

package sample

import scala.math.random

import org.apache.spark._

object SparkPi {
  def main(args: Array[String]): Unit = {
    // NOTE: a master set here takes precedence over the one passed by spark-submit;
    // "local[2]" runs the job locally with two threads.
    val conf = new SparkConf().setAppName("Spark Pi").setMaster("local[2]")
    val spark = new SparkContext(conf)
    val slices = 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    // Count the random points in [-1, 1] x [-1, 1] that fall inside the unit circle
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    val pi = 4.0 * count / n
    println("Pi is roughly " + pi)

    // Save the estimate to S3
    val outputLocation = "s3n://bucket/output"
    val data = spark.makeRDD(Seq(pi))
    data.saveAsTextFile(outputLocation + "/pi")
    spark.stop()
  }
}
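
For readers who want to check the arithmetic without a cluster, the same Monte Carlo estimate can be sketched in plain Ruby. This is only an illustration of the technique, not part of the Spark job; estimate_pi is a made-up helper name. Points are drawn uniformly from the square [-1, 1] x [-1, 1], and the fraction landing inside the unit circle approximates pi / 4.

```ruby
# Plain-Ruby sketch of the Monte Carlo estimate used by SparkPi (no Spark).
# estimate_pi is a hypothetical helper name.
def estimate_pi(samples, rng = Random.new)
  # Count the random points that fall inside the unit circle.
  inside = samples.times.count do
    x = rng.rand * 2 - 1
    y = rng.rand * 2 - 1
    x * x + y * y < 1
  end
  # The circle covers pi/4 of the square, so scale the hit ratio by 4.
  4.0 * inside / samples
end

puts "Pi is roughly #{estimate_pi(200_000)}"
```

In the Spark version the sampling loop is what gets split into slices and run in parallel; the final reduce simply adds up the per-slice hit counts.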

Build

sbt assembly

Copy the resulting JAR file to S3.

Run

In EMR, choose Add step and select Custom JAR.

For JAR location, select the JAR uploaded earlier.

For Arguments, enter:

--verbose sample.SparkPi

Run the step with these settings.

After a while, the result is stored under s3://bucket/output/pi.

Trying SparkR on AWS EMR

Posted by utsubo on 2015-12-03 in AWS, R

AWS EMR is a service that makes it easy to use Spark, Hive, and the rest of that stack.

A few clicks is all it takes to create the servers on EMR.

This takes about 10 minutes.

Let's analyze some sample data with SparkR.

This is adapted from the following article:

http://engineer.recruit-lifestyle.co.jp/techblog/2015-08-19-sparkr/

Getting the data

http://stat-computing.org/dataexpo/2009/the-data.html

Download the 2001, 2002, and 2003 data from here.

$ wget http://stat-computing.org/dataexpo/2009/2001.csv.bz2

Decompress

$ bunzip2 2001.csv.bz2

Upload to S3

$ aws s3 cp 2001.csv s3://samplebucket/airline/

Repeat the same steps for 2002 and 2003.

Hive

$ hive
hive> add jar /usr/lib/hive/lib/hive-contrib.jar;
Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
hive> create table airline(
		> Year STRING,
		> Month STRING,
		> DayofMonth STRING,
		> DayOfWeek STRING,
		> DepTime STRING,
		> CRSDepTime STRING,
		> ArrTime STRING,
		> CRSArrTime STRING,
		> UniqueCarrier STRING,
		> FlightNum STRING,
		> TailNum STRING,
		> ActualElapsedTime STRING,
		> CRSElapsedTime STRING,
		> AirTime STRING,
		> ArrDelay STRING,
		> DepDelay STRING,
		> Origin STRING,
		> Dest STRING,
		> Distance STRING,
		> TaxiIn STRING,
		> TaxiOut STRING,
		> Cancelled STRING,
		> CancellationCode STRING,
		> Diverted STRING,
		> CarrierDelay STRING,
		> WeatherDelay STRING,
		> NASDelay STRING,
		> SecurityDelay STRING,
		> LateAircraftDelay STRING
		> )
		> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
		> LOCATION 's3://samplebucket/airline/' tblproperties ("skip.header.line.count"="1");
hive> select * from airline limit 1;
OK
2001	1	17	3	1806	1810	1931	1934	US	375	N700��	85	84	60	-3	-4	BWI	CLT	361	5	20	0	NA	0	NA	NA	NA	NA	NA

SparkR

$ sparkR
> install.packages("magrittr")
> library(magrittr)
> hiveContext <- sparkRHive.init(sc)
> airline<-sql(hiveContext,"select * from airline")
> class(airline)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"
> airline %>%
+	 filter(airline$Origin == "JFK") %>%
+	 group_by(airline$Dest) %>%
+	 agg(count=n(airline$Dest)) %>%
+	 head
	Dest count																																		
1	IAH	1214
2	STL	2922
3	SNA	 805
4	MSP	1580
5	STT	1085
6	SAN	2723

And that is how easily it can be done.

Deep learning in R

Posted by utsubo on 2015-10-30 in R

Deep learning has become quite the buzzword lately.

I doubt that simply running an analysis with it will make sales skyrocket, but I looked into how to use it from R.

From R, this is done with the h2o package.

Environment

R version 3.2.2
MacOS 10.11.1
jdk 1.8.0_40

Installing h2o

Install it following this article:

http://d.hatena.ne.jp/dichika/20140503/p1

install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))

If you get an error like the following on Ubuntu or similar, install the missing dependency as shown below.

ERROR: configuration failed for package 'RCurl'

$ sudo apt-get install libcurl4-openssl-dev

install.packages("RCurl")

install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))

Deeplearning

I added h2o deep learning to the code from this article:

http://yut.hatenablog.com/entry/20120827/1346024147

library( kernlab )
data(spam)
rowdata<-nrow(spam)
random_ids<-sample(rowdata,rowdata*0.5)
spam_training<-spam[random_ids,]
spam_predicting<-spam[-random_ids,]

#svm
library( kernlab )
spam_svm<-ksvm(type ~., data=spam_training )
spam_predict<-predict(spam_svm,spam_predicting[,-58])
table(spam_predict, spam_predicting[,58])

# nnet
library( nnet )
spam_nn<-nnet(type ~., data=spam_training,size = 2, rang = .1, decay = 5e-4, maxit = 200 )
spam_predict<-predict(spam_nn,spam_predicting[,-58],type="class")
table(spam_predict, spam_predicting[,58])

# naivebayes
library( e1071 )
spam_nn<-naiveBayes(type ~., data=spam_training)
spam_predict<-predict(spam_nn,spam_predicting[,-58],type="class")
table(spam_predict, spam_predicting[,58])


# deeplearning
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
spam_h2o<-h2o.deeplearning(x=1:57,y=58,training_frame=as.h2o(spam_training))
spam_predict<-h2o.predict(spam_h2o,as.h2o(spam_predicting[,-58]))
table(as.data.frame(spam_predict)[,1],spam_predicting[,58])
h2o.shutdown(localH2O)

結果

SVM

spam_predict nonspam spam
		 nonspam		1338	117
		 spam				 62	784

spam_predict nonspam spam

nonspam 1338 117

spam 62 784

(1338+784)/(1338+117+62+784)=0.9222077

nnet

spam_predict nonspam spam
     nonspam    1313   86
     spam         87  815

(1313+815)/(1313+86+87+815)=0.9248153

naivebayes

spam_predict nonspam spam
     nonspam     752   51
     spam        648  850

(752+850)/(752+51+648+850)=0.696219

h2o

          nonspam spam
  nonspam    1321   83
  spam         74  823

(1321+823)/(1321+83+74+823)=0.9317688

Deep learning (h2o) has the highest accuracy at about 93.2%, with nnet and SVM close behind at about 92%.
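
The accuracy figures above are the diagonal of each table divided by the total number of rows. As a small cross-check, here is a sketch in Ruby rather than R; the MATRICES constant and the accuracy helper are names made up for this illustration, and the numbers are copied from the tables above.

```ruby
# Confusion matrices copied from the results above.
# Rows are predictions, columns are actual labels, both ordered nonspam, spam.
MATRICES = {
  "SVM"        => [[1338, 117], [62, 784]],
  "nnet"       => [[1313, 86], [87, 815]],
  "naivebayes" => [[752, 51], [648, 850]],
  "h2o"        => [[1321, 83], [74, 823]]
}

# Accuracy = correctly classified rows (the diagonal) / all rows.
# accuracy is a hypothetical helper, not part of any library.
def accuracy(matrix)
  correct = matrix.each_with_index.map { |row, i| row[i] }.reduce(:+)
  total = matrix.flatten.reduce(:+)
  correct.to_f / total
end

MATRICES.each do |name, matrix|
  puts format("%-10s %.7f", name, accuracy(matrix))
end
```

The off-diagonal cells show where each model goes wrong. For naivebayes, most of the errors are the 648 nonspam messages that it predicted as spam.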

Creating an instance from a class name in Ruby

Posted by: utsubo Date: 2015-09-25 in ruby

Environment: ruby 2.2.2

class Test
	attr_accessor :a
	def initialize(a)
		@a=a
	end
	def to_str
		return "class:"+@a.to_s
	end
end

p Test.name	 # get the class name
# "Test"

cls=(eval Test.name).new("kk")	# create an instance from the class name
p cls.a
# "kk"

p cls.to_str
# "class:kk"
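
eval executes whatever string it is given, so if the class name might come from outside the program, Object.const_get is a safer way to do the same lookup. The following is a minimal sketch of that alternative, not the method used above; Sample and instantiate are hypothetical names introduced only for this example.

```ruby
# Sample is a hypothetical stand-in for the Test class above.
class Sample
  attr_accessor :a

  def initialize(a)
    @a = a
  end
end

# Hypothetical helper: look the class up by name, then instantiate it.
# Object.const_get raises NameError if no such constant exists,
# and it never evaluates the string as code.
def instantiate(class_name, *args)
  Object.const_get(class_name).new(*args)
end

obj = instantiate(Sample.name, "kk")
p obj.a
```

Because const_get only resolves constants, a string such as "system('ls')" raises NameError instead of being executed.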

References

http://d.hatena.ne.jp/stakizawa/20070505/t1

http://yiaowang.web.fc2.com/programing/ruby_tips/etc_01.html

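Connecting to SQL Server on Azure from Linux

Posted by: utsubo Date: 2015-07-03 in azure

I have been hooked on Azure lately. An MSDN subscription includes a free usage allowance for Azure, which is very convenient: when you want to try something out, you can quickly spin up a server and test it.

Officially, Azure's database service appears to be SQL Server. MySQL and others seem to be available through third-party services, but since I am using Azure anyway, I decided to use SQL Server.

I have no particular complaints about SQL Server itself, but it is awkward to use from Linux. This time I try connecting from the command line using unixodbc and freetds.

Environment

OS: Ubuntu 14.04

Steps

Install unixodbc, freetds, and tdsodbc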

sudo apt-get install freetds-common
sudo apt-get install freetds-bin
sudo apt-get install unixodbc
sudo apt-get install tdsodbc

/etc/odbc.ini

[SQLServer]
	Servername = SqlServer
	Driver = FreeTDS
	database = master

/etc/odbcinst.ini

The [FreeTDS] section name here must match the Driver value in odbc.ini.

[FreeTDS]
Description	= TDS driver (Sybase/MS SQL)
Driver		= /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup		= /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
FileUsage = 1
CPTimeout = 5
CPReuse = 5

/etc/freetds/freetds.conf

The [SqlServer] section name here must match the Servername value in odbc.ini.

Set host to the host name of the Azure SQL server you are connecting to.

[SqlServer]
	host = xxxxxxxxx.database.windows.net
	port = 1433
	tds version = 8.0
Environment variables

$ cat .bash_profile 
export ODBCINI=/etc/odbc.ini
export ODBCSYSINI=/etc
export FREETDSCONF=/etc/freetds/freetds.conf

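Connecting

Replace sqlserverusername and sqlserverpassword with your SQL Server user name and password.

The key point is that the user name after -U must be followed by @xxxxxxxxx, where xxxxxxxxx is the host name of the SQL server you are connecting to (the same xxxxxxxxx as in host in freetds.conf).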

$ tsql -S SqlServer -U sqlserverusername@xxxxxxxxx -P sqlserverpassword -D master
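
To make the user@server rule concrete, here is a small Ruby sketch that derives the login from the host value in freetds.conf and assembles the same tsql command line. The helper names azure_login and tsql_command are hypothetical, and the values are the placeholders used in this post.

```ruby
require "shellwords"

# Hypothetical helper: build the "user@server" login described above,
# taking the server name from the first label of the host name.
def azure_login(user, host)
  server = host.split(".").first
  "#{user}@#{server}"
end

# Hypothetical helper: assemble the tsql command line, shell-escaping each argument.
def tsql_command(dsn:, user:, host:, password:, database: "master")
  ["tsql", "-S", dsn, "-U", azure_login(user, host),
   "-P", password, "-D", database].shelljoin
end

puts tsql_command(dsn: "SqlServer",
                  user: "sqlserverusername",
                  host: "xxxxxxxxx.database.windows.net",
                  password: "sqlserverpassword")
```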

I referred to the following sites:

http://jyukutyo.hatenablog.com/entry/20111024/1319473427

http://makeitsmartjp.com/2013/02/centos-sqlserver.html

Getting stuck installing JD-Eclipse

Posted by: utsubo Date: 2015-06-18 in java

Until recently I installed JD-Eclipse, the Java decompiler plugin for Eclipse, from its update site, but I got stuck when installing it on Luna, so here are my notes.

OS MacOSX 10.10
Eclipse 4.4 luna
Java 8

http://totech.hateblo.jp/entry/2015/02/19/145004

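As described there, adding the update site http://jd.benow.ca/jd-eclipse/update via Help > Install New Software installs JD-Eclipse version 0.1.5.

With this version, class files are not decompiled properly. At first I suspected the file association and checked the settings, but it was still set to Class File Editor and nothing looked wrong.

After struggling for quite a while, I noticed that the instructions on the JD-Eclipse site say to install from a file, so I did exactly that.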

https://github.com/java-decompiler/jd-eclipse

Download jd-eclipse-site-1.0.0-RC2.zip from the releases page:

https://github.com/java-decompiler/jd-eclipse/releases

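In Eclipse, open Help > Install New Software, click Add, choose Archive, select the ZIP file you just downloaded, and install it.

This installs version 1.0.0.

The file association also changes to JD Class File Viewer.

Now selecting a class file decompiles it as expected.

Perhaps the older version of JD-Eclipse does not support Java 8.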

Debugging with Mono

Posted by: utsubo Date: 2015-06-02 in mono

Notes to self

Debug log output

http://www.mono-project.com/docs/advanced/pinvoke/dllnotfoundexception/

$ MONO_LOG_LEVEL=debug mono GdiExample.exe

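Drawing charts in R (9)

Posted by: utsubo Date: 2015-05-13 in R

I got a bit stuck when saving a chart created in R to a file, so here is a note.

Sometimes you want to loop over stock codes and generate a chart for each one. Candlesticks alone work fine, but when you overlay something on top of them, writing to a file can fail.

Candlesticks

This works: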

png("file.png")
candleChart(ohlc,theme="white")
dev.off()

Candlesticks + overlay

With this, points is not drawn:

png("file.png")
candleChart(ohlc,theme="white")
addTA(points,on=1,col="red",type="b")
dev.off()

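This way the chart is written to the file correctly. Wrapping addTA in plot() draws the overlay explicitly, and dev.copy then copies the finished chart to a PNG file: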

candleChart(ohlc,theme="white")
plot(addTA(points,on=1,col="red",type="b"))
dev.copy(png,"file.png")
dev.off()

I referred to these:

http://stackoverflow.com/questions/18556548/is-it-possible-to-build-a-quantmod-chart-incrementally-and-export-the-final-resu

http://stackoverflow.com/questions/18342703/r-appears-to-fail-to-execute-a-line-in-a-function/18342756#18342756