How to Build Big Data Pipelines for Hadoop Using Open Source Software

How to Build Big Data Pipelines for Hadoop
Dr. Mark Pollack

Big Data
- "Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
- A subjective and moving target.
- Big data in many sectors today ranges from tens of terabytes to multiple petabytes.

Enterprise Data Trends
(diagram slide)

Data Access Landscape - The Value of Data
- The value from data exceeds hardware and software costs.
- Much of the value lies in connecting data sets, e.g. grouping e-commerce users by user agent: Orbitz shows more expensive hotels to Mac users, identified by user-agent strings such as
  Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3

Spring and Data Access
- Spring has always provided excellent data access support:
  - Transaction management
  - Portable data access exception hierarchy
  - JDBC - JdbcTemplate
  - ORM - Hibernate, JPA, JDO, iBatis support
  - Cache support (Spring 3.1)
- The Spring Data project started in 2010.
- Its goal is to "refresh" Spring's data access support in light of the new data access landscape.

Spring Data Mission Statement
"Provide a familiar and consistent Spring-based programming model for Big Data, NoSQL, and relational stores while retaining store-specific features and capabilities."

Spring Data Supported Technologies
- Relational: JPA, JDBC Extensions
- NoSQL: Redis, HBase, Mongo, Neo4j, Lucene, Gemfire
- Big Data: Hadoop (HDFS and M/R), Hive, Pig, Cascading, Splunk
- Access: Repositories, QueryDSL, REST

A View of a Big Data System
(diagram slide: data streams from log files, sensors, mobile devices, SaaS, and social feeds enter an ingestion engine; an unstructured data store feeds stream processing, interactive processing on a structured DB, and batch analysis; a distribution engine serves integration apps, analytical apps, and real-time analytics, with monitoring and deployment across the system; Spring projects can be used throughout to provide a solution)

Big Data Problems are Integration Problems
- Real-world big data solutions require workflow across systems.
- They share the core components of a classic integration workflow.
- Big data solutions need to integrate with existing data and apps.
- Event-driven processing.
- Batch workflows.

Spring projects offer substantial integration functionality
- Spring Integration: for building and configuring message-based integration flows using input and output adapters, channels, and processors.
- Spring Batch: for building and operating batch workflows and manipulating data in files and ETL; the basis for JSR 352 in Java EE 7.
- Spring Data: for manipulating data in relational DBs as well as a variety of NoSQL databases and data grids (inside Gemfire 7.0).
- Spring for Apache Hadoop: for orchestrating Hadoop and non-Hadoop workflows in conjunction with Batch and Integration processing (inside GPHD 1.2).

Integration is an essential part of Big Data
(diagram slide)

Some Existing Big Data Integration Tools
(diagram slide)

Hadoop as a Big Data Platform
(diagram slide)

Spring for Hadoop - Goals
- Hadoop has a poor out-of-the-box programming model; applications are generally a collection of scripts calling command-line apps.
- Spring simplifies developing Hadoop applications by providing a familiar and consistent programming and configuration model across a wide range of use cases:
  - HDFS usage
  - Data analysis (MR/Pig/Hive/Cascading)
  - Workflow
  - Event streams
  - Integration
- It lets you start small and grow.

Relationship with Other Spring Projects
(diagram slide)

Spring Hadoop Core Functionality
(diagram slide)

Capabilities: Spring + Hadoop
- Declarative configuration: create, configure, and parameterize Hadoop connectivity and all job types.
- Environment profiles: easily move from dev to qa to prod.
- Developer productivity: create well-formed applications, not spaghetti-script applications.
- Simplifies the HDFS and FsShell API with support for JVM scripting.
- Runner classes for MR/Pig/Hive/Cascading for small workflows.
- Helper "Template" classes for Pig/Hive/HBase.
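The "environment profiles" bullet above is the deck's shorthand for swapping hadoop-dev.properties, hadoop-qa.properties, and so on without touching the application. A minimal sketch of that idea in plain Java follows; the property-file names and the hd.fs key come from later slides, while the class name EnvPropertiesDemo and the -Denv flag are illustrative assumptions, not the SpringLauncher used in the deck.

    import java.io.InputStream;
    import java.util.Properties;

    // Illustrative launcher sketch, not the SpringLauncher from the slides.
    public class EnvPropertiesDemo {
        public static void main(String[] args) throws Exception {
            // pick the environment at launch time, e.g. java -Denv=qa ...
            String env = System.getProperty("env", "dev");
            Properties props = new Properties();
            try (InputStream in = EnvPropertiesDemo.class
                    .getResourceAsStream("/hadoop-" + env + ".properties")) {
                props.load(in);   // hadoop-dev.properties or hadoop-qa.properties
            }
            // hd.fs holds the HDFS URI, as in the properties files shown later in the deck
            System.out.println("hd.fs = " + props.getProperty("hd.fs"));
        }
    }

In the Spring-based setup shown later, the same selection happens by pointing the application context at the properties file for the chosen environment.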
Core Map Reduce Idea
(diagram slide)

Counting Words - Configuring M/R
Standard Hadoop APIs:

    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);

Counting Words - M/R Code
Standard Hadoop API - Mapper:

    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Standard Hadoop API - Reducer:

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
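Taken together, the snippets above are the standard Hadoop word-count example. A sketch of a complete driver that wires the TokenizerMapper and IntSumReducer into the job configuration shown above follows; the class name WordCountDriver and the use of the reducer as a combiner are assumptions added for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Illustrative driver class; the name and the combiner setting are not from the deck.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /wc/input
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /wc/output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }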
Running Hadoop Example Jars
Standard Hadoop:

    bin/hadoop jar hadoop-examples.jar wordcount /wc/input /wc/output

SHDP (Spring Hadoop): the same example jar is declared and parameterized in the Spring application context instead of on the command line (XML configuration not shown).

Running Hadoop Tools
Standard Hadoop:

    bin/hadoop jar -conf myhadoop-site.xml -D ignoreCase=true wordcount.jar org.myorg.WordCount /wc/input /wc/output

SHDP: the tool is declared in the application context and configured with properties such as ignoreCase=true (XML configuration not shown).

Configuring Hadoop
applicationContext.xml declares the Hadoop configuration with placeholders resolved from hadoop-dev.properties:

    fs.default.name=${hd.fs}
    input.path=/wc/input/
    output.path=/wc/word/
    hd.fs=hdfs://localhost:9000

HDFS and Hadoop Shell as APIs
Access all "bin/hadoop fs" commands through FsShell: mkdir, chmod, test, ...

    class MyScript {

        @Autowired
        FsShell fsh;

        @PostConstruct
        void init() {
            String outputDir = "/data/output";
            if (fsh.test(outputDir)) {
                fsh.rmr(outputDir);
            }
        }
    }

HDFS and FsShell as APIs
FsShell is designed to support JVM scripting languages.

copy-files.groovy:

    // use the shell (made available under variable fsh)
    if (!fsh.test(inputDir)) {
        fsh.mkdir(inputDir)
        fsh.copyFromLocal(sourceFile, inputDir)
        fsh.chmod(700, inputDir)
    }
    if (fsh.test(outputDir)) {
        fsh.rmr(outputDir)
    }

The script can be kept inline in appCtx.xml or externalized and referenced from the application context.

$ demo

Streaming Jobs and Environment Configuration

    bin/hadoop jar hadoop-streaming.jar \
        -input /wc/input -output /wc/output \
        -mapper /bin/cat -reducer /bin/wc \
        -files stopwords.txt

Launching against the dev environment:

    env=dev java -jar SpringLauncher.jar applicationContext.xml

hadoop-dev.properties:

    input.path=/wc/input/
    output.path=/wc/word/
    hd.fs=hdfs://localhost:9000

Launching the same application against the qa environment:

    env=qa java -jar SpringLauncher.jar applicationContext.xml

hadoop-qa.properties:

    input.path=/gutenberg/input/
    output.path=/gutenberg/word/
    hd.fs=hdfs://darwin:9000

Word Count - Injecting Jobs
Use dependency injection to obtain a reference to the Hadoop Job, then perform additional runtime configuration and submit it:

    public class WordService {

        @Inject
        private Job mapReduceJob;

        public void processWords() {
            mapReduceJob.submit();
        }
    }

Pig

What is Pig?
- An alternative to writing MapReduce applications, intended to improve productivity.
- Pig applications are written in the Pig Latin language, a high-level data processing language in the spirit of sed and awk, not SQL.
- Pig Latin describes a sequence of steps; each step performs a transformation on an item of data in a collection.
- Extensible with user-defined functions (UDFs).
- A PigServer is responsible for translating Pig Latin into MapReduce jobs.

Counting Words - Pig Latin Script

    input_lines = LOAD '/tmp/books' AS (line:chararray);

    -- Extract words from each line and put them into a pig bag
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

    -- filter out any words that are just white space
    filtered_words = FILTER words BY word MATCHES '\\w+';

    -- create a group for each word
    word_groups = GROUP filtered_words BY word;

    -- count the entries in each group
    word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

    ordered_word_count = ORDER word_count BY count DESC;
    STORE ordered_word_count INTO '/tmp/number-of-words';
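As the "What is Pig?" slide notes, a PigServer does the actual translation of Pig Latin into MapReduce; the PigRunner and PigTemplate introduced next are Spring-managed conveniences around it. A minimal sketch of driving the word-count script through the raw Pig API follows; the script path wordcount.pig and the class name PigWordCount are assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // PigServer compiles Pig Latin into MapReduce jobs and submits them
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.setBatchOn();                       // queue the script's statements
            pig.registerScript("wordcount.pig");    // the script shown above (path is an assumption)
            pig.executeBatch();                     // run the resulting MapReduce jobs
            pig.shutdown();
        }
    }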
Using Pig
Standard Pig command line:

    pig -x mapreduce wordcount.pig
    pig wordcount.pig -P pig.properties -p pig.exec.nocombiner=true

Spring Hadoop:
- Creates a PigServer.
- Optional execution of scripts on application startup, with properties such as pig.exec.nocombiner=true and script parameters such as ignoreCase=TRUE supplied from the application context (XML configuration not shown).

Spring's PigRunner
Execute a small Pig workflow (HDFS script, Pig Latin, HDFS script), passing parameters such as inputDir=${inputDir} and outputDir=${outputDir} (XML configuration not shown).

Schedule a Pig Job
PigRunner implements Callable, so Spring's scheduling support can run it:

    @Scheduled(cron = "0 0 12 * * ?")
    public void process() {
        pigRunner.call();
    }

PigTemplate
Simplifies the programmatic use of Pig; common tasks are one-liners. Configured with the Hadoop connection properties:

    fs.default.name=${hd.fs}
    mapred.job.tracker=${mapred.job.tracker}

PigTemplate - Programmatic Use

    public class PigPasswordRepository implements PasswordRepository {

        private PigTemplate pigTemplate;
        private String pigScript = "classpath:password-analysis.pig";

        public void processPasswordFile(String inputFile) {
            String outputDir = baseOutputDir + File.separator + counter.incrementAndGet();
            Properties scriptParameters = new Properties();
            scriptParameters.put("inputDir", inputFile);
            scriptParameters.put("outputDir", outputDir);
            pigTemplate.executeScript(pigScript, scriptParameters);
        }
        // ...
    }

Hive

What is Hive?
- An alternative to writing MapReduce applications, intended to improve productivity.
- Hive applications are written in HiveQL, which is in the spirit of SQL.
- A HiveServer is responsible for translating HiveQL into MapReduce jobs.
- Access via JDBC, ODBC, or Thrift RPC.

Counting Words - HiveQL

    -- import the file as lines
    CREATE EXTERNAL TABLE lines(line string);
    LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;

    -- create a virtual view that splits the lines
    SELECT word, count(*) FROM lines
    LATERAL VIEW explode(split(line, ' ')) lTable AS word
    GROUP BY word;

Using Hive
Command line:

    $HIVE_HOME/bin/hive -f wordcount.sql -d ignoreCase=TRUE -h hive-server.host

JDBC based:

    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection("jdbc:hive://server:port/default", "", "");
    try {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("...");
        while (res.next()) {
            // ...
        }
    } catch (SQLException ex) {
        // ...
    } finally {
        try { con.close(); } catch (Exception ex) { }
    }

Using Hive with Spring Hadoop
Access Hive through its JDBC client and use JdbcTemplate, reusing existing knowledge of Spring's rich ResultSet-to-POJO mapping features:

    public long count() {
        return jdbcTemplate.queryForLong("select count(*) from " + tableName);
    }

    List<String> result = jdbcTemplate.query("select * from passwords",
        new ResultSetExtractor<List<String>>() {
            public List<String> extractData(ResultSet rs) throws SQLException {
                // extract data from result set
            }
        });

Standard Hive Thrift API
HiveClient is not thread-safe and throws checked exceptions:

    public long count() {
        HiveClient hiveClient = createHiveClient();
        try {
            hiveClient.execute("select count(*) from " + tableName);
            return Long.parseLong(hiveClient.fetchOne());
        // checked exceptions
        } catch (HiveServerException ex) {
            throw translateException(ex);
        } catch (org.apache.thrift.TException tex) {
            throw translateException(tex);
        } finally {
            try {
                hiveClient.shutdown();
            } catch (org.apache.thrift.TException tex) {
                logger.debug("Unexpected exception on shutting down HiveClient", tex);
            }
        }
    }

    protected HiveClient createHiveClient() {
        TSocket transport = new TSocket(host, port, timeout);
        HiveClient hive = new HiveClient(new TBinaryProtocol(transport));
        try {
            transport.open();
        } catch (TTransportException e) {
            throw translateException(e);
        }
        return hive;
    }
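The JdbcTemplate calls on the "Using Hive with Spring Hadoop" slide assume a DataSource wired to Hive's JDBC driver, which the slides leave out. A hedged setup sketch follows; the host, port, and table name are placeholders, and queryForObject is used as the non-deprecated equivalent of queryForLong.

    import org.apache.hadoop.hive.jdbc.HiveDriver;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.datasource.SimpleDriverDataSource;

    public class HiveJdbcCount {
        public static void main(String[] args) {
            // DataSource over the Hive JDBC driver named in the slides' Class.forName call;
            // URL, port, and table name below are assumptions, not values from the deck
            SimpleDriverDataSource dataSource = new SimpleDriverDataSource(
                    new HiveDriver(), "jdbc:hive://localhost:10000/default", "", "");
            JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);

            Long count = jdbcTemplate.queryForObject("select count(*) from passwords", Long.class);
            System.out.println("row count = " + count);
        }
    }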
Spring Hadoop Batch & Integration

Hadoop Workflows Managed by Spring Batch
- Reuse the same Batch infrastructure and knowledge to manage Hadoop workflows.
- A step can be any Hadoop job type or an HDFS script.

Capabilities: Spring + Hadoop + Batch
Spring Batch for file/DB/NoSQL-driven applications:
- Collect: process local files.
- Transform: scripting or Java code to transform and enrich.
- RT analysis: N/A.
- Ingest: (batch/aggregate) write to HDFS or split/filter.
- Batch analysis: orchestrate Hadoop steps in a workflow.
- Distribute: copy data out of HDFS to structured storage.
- JMX enabled, along with a REST interface for job control.

Spring Batch Configuration for Hadoop
Reuse previous Hadoop job definitions as workflow steps (XML configuration not shown).

Capabilities: Spring + Hadoop + SI
Spring Integration for event-driven applications:
- Collect: single-node or distributed data collection (TCP/JMS/Rabbit).
- Transform: scripting or Java code to transform and enrich.
- RT analysis: connectivity to multiple analysis techniques.
- Ingest: write to HDFS, split/filter the data stream to other stores.
- JMX enabled, plus a control bus for starting/stopping individual components.

Ingesting - Copying Local Log Files into HDFS
- Poll a local directory for files; files are rolled over every 10 minutes.
- Copy files to a staging area and then to HDFS.
- Use an aggregator to wait to "process all files available every hour" before launching the MR job.

Ingesting - Syslog into HDFS
- Use the syslog adapter.
- A transformer categorizes messages.
- Route to specific channels based on category.
- One route leads to an HDFS write, with filtered data stored in Redis.

Ingesting - Multi-node Syslog into HDFS
- Syslog collection across multiple machines.
- Use TCP adapters (or other middleware) to forward events.

Ingesting - JDBC to HDFS
- Use Spring Batch with a JdbcItemReader and a FileItemWriter.

Exporting HDFS to Local Files
- Use FsShell, included as a step in the Batch workflow (see the Tasklet sketch at the end of this section).
- Spring Batch can fire events when jobs end, and Spring Integration can poll HDFS.

    // use the shell (made available under variable fsh)
    fsh.copyToLocal(sourceDir, outputDir);

Exporting HDFS to JDBC
- Use Spring Batch with a MultiFileItemReader and a JdbcItemWriter.

Exporting HDFS to Mongo
- Use Spring Batch with a MultiFileItemReader and a MongoItemWriter.
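The export slides say to include the FsShell copy as a step in a Batch workflow. A hedged sketch of how that inline script might be wrapped as a Spring Batch Tasklet follows; the class name HdfsExportTasklet and the constructor-injected directories are illustrative assumptions, not code from the deck.

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;
    import org.springframework.data.hadoop.fs.FsShell;

    // Illustrative tasklet; wraps the copyToLocal call from the slide as a batch step.
    public class HdfsExportTasklet implements Tasklet {

        private final FsShell fsh;          // injected, as in the slides' scripts
        private final String sourceDir;     // HDFS directory to export (assumption)
        private final String outputDir;     // local target directory (assumption)

        public HdfsExportTasklet(FsShell fsh, String sourceDir, String outputDir) {
            this.fsh = fsh;
            this.sourceDir = sourceDir;
            this.outputDir = outputDir;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
            // same call as the inline script on the "Exporting HDFS to Local Files" slide
            fsh.copyToLocal(sourceDir, outputDir);
            return RepeatStatus.FINISHED;
        }
    }

A step built from such a tasklet can then sit alongside the MapReduce, Pig, and Hive steps in the same workflow, as the "Hadoop Workflows Managed by Spring Batch" slide describes.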