Published: 2016-07-28 07:30:30 | Source: reader submission


Apache Spark

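Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, that was open-sourced by the UC Berkeley AMP Lab. It has the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output can be kept in memory, so it no longer has to be written to and read back from HDFS. This makes Spark a better fit for iterative MapReduce-style algorithms such as those used in data mining and machine learning.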


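Apache Spark 2.0.0 has been released. This release mainly updates the APIs, adds SQL 2003 support, adds support for R UDFs, and improves performance. 300 developers contributed 2,500 patches.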

API updates in Apache Spark 2.0.0:

  • Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.

  • SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.

  • A new, streamlined configuration API for SparkSession

  • Simpler, more performant accumulator API

  • A new, improved Aggregator API for typed aggregation in Datasets
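
The typed Aggregator mentioned in the last bullet is organized around four operations: a zero value, a per-record reduce, a merge of partial buffers, and a finish step. The following is a minimal sketch of that contract in plain Python, without Spark. The class `Average` and the helper `run_aggregator` are hypothetical names introduced for illustration, and the nested lists stand in for the partitions of a Dataset.

```python
from functools import reduce
from typing import List, Tuple

# Plain-Python model of the zero/reduce/merge/finish contract.
# This is NOT Spark code; all names here are illustrative assumptions.
Buffer = Tuple[float, int]  # (running sum, running count)


class Average:
    """Hypothetical aggregator that computes a mean."""

    def zero(self) -> Buffer:
        # Initial buffer for an empty partition.
        return (0.0, 0)

    def reduce(self, buf: Buffer, value: float) -> Buffer:
        # Fold one input record into the buffer.
        return (buf[0] + value, buf[1] + 1)

    def merge(self, b1: Buffer, b2: Buffer) -> Buffer:
        # Combine two partial buffers from different partitions.
        return (b1[0] + b2[0], b1[1] + b2[1])

    def finish(self, buf: Buffer) -> float:
        # Turn the final buffer into the output value.
        return buf[0] / buf[1] if buf[1] else float("nan")


def run_aggregator(agg: Average, partitions: List[List[float]]) -> float:
    # Fold each partition independently, then merge the partial buffers.
    partials = [reduce(agg.reduce, part, agg.zero()) for part in partitions]
    merged = reduce(agg.merge, partials, agg.zero())
    return agg.finish(merged)


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]
mean = run_aggregator(Average(), partitions)
print(mean)
```

Keeping reduce and merge separate is what lets each partition be folded on its own before the partial results are combined.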

SQL updates in Apache Spark 2.0.0:

  • A native SQL parser that supports both ANSI SQL and HiveQL

  • Native DDL command implementations

  • Subquery support, including

    • Uncorrelated Scalar Subqueries

    • Correlated Scalar Subqueries

    • NOT IN predicate Subqueries (in WHERE/HAVING clauses)

    • IN predicate subqueries (in WHERE/HAVING clauses)

    • (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)

  • View canonicalization support
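
To make the subquery categories above concrete, here is a small sketch of what each query shape looks like. It runs against SQLite through Python's built-in `sqlite3` module as a stand-in engine, not against Spark; the `dept` and `emp` tables and their rows are made up for illustration. In Spark 2.0, comparable statements would presumably be submitted through `spark.sql(...)`, and dialect details may differ.

```python
import sqlite3

# SQLite is used here only as a stand-in SQL engine (assumption);
# the tables and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER, name TEXT);
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'hr');
    INSERT INTO emp VALUES ('ann', 1, 120), ('bob', 1, 100), ('cy', 2, 90);
""")


def rows(sql):
    return conn.execute(sql).fetchall()


# Uncorrelated scalar subquery: the inner query does not reference the outer row.
above_avg = rows(
    "SELECT name FROM emp "
    "WHERE salary > (SELECT AVG(salary) FROM emp) ORDER BY name"
)

# Correlated scalar subquery: the inner query references the outer row (e.dept_id).
top_paid = rows(
    "SELECT e.name FROM emp e "
    "WHERE e.salary = (SELECT MAX(i.salary) FROM emp i WHERE i.dept_id = e.dept_id) "
    "ORDER BY e.name"
)

# IN / NOT IN predicate subqueries in a WHERE clause.
staffed = rows(
    "SELECT name FROM dept WHERE id IN (SELECT dept_id FROM emp) ORDER BY name"
)
unstaffed = rows(
    "SELECT name FROM dept WHERE id NOT IN (SELECT dept_id FROM emp) ORDER BY name"
)

# (NOT) EXISTS predicate subquery in a WHERE clause.
no_employees = rows(
    "SELECT d.name FROM dept d "
    "WHERE NOT EXISTS (SELECT 1 FROM emp e WHERE e.dept_id = d.id) ORDER BY d.name"
)

for label, result in [
    ("above_avg", above_avg),
    ("top_paid", top_paid),
    ("staffed", staffed),
    ("unstaffed", unstaffed),
    ("no_employees", no_employees),
]:
    print(label, result)
```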

Some new features:

  • Native CSV data source, based on Databricks’ spark-csv module

  • Off-heap memory management for both caching and runtime execution

  • Hive style bucketing support

  • Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
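
As an illustration of the sketch-based statistics in the last bullet, below is a minimal count-min sketch in plain Python. It is not Spark's implementation; the class `CountMinSketch`, its default `width` and `depth`, and the SHA-256 based hashing are all assumptions chosen for readability. The structure trades exactness for fixed memory: an estimate is never lower than the true count, but hash collisions can push it higher.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch; parameters and hashing are assumptions."""

    def __init__(self, width: int = 2000, depth: int = 5):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive a per-row hash by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # The minimum across rows is the least-collided counter.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
words = ["spark"] * 50 + ["hive"] * 7 + ["orc"] * 3
for w in words:
    sketch.add(w)

print(sketch.estimate("spark"), sketch.estimate("hive"), sketch.estimate("orc"))
```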

Performance enhancements:

  • Substantial (2-10x) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.

  • Improved Parquet scan throughput through vectorization

  • Improved ORC performance

  • Many improvements in the Catalyst query optimizer for common workloads

  • Improved window function performance via native implementations for all window functions

  • Automatic file coalescing for native data sources
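
The idea behind whole-stage code generation, as commonly described, is to collapse a chain of operators into a single function so that rows no longer pass through one iterator per operator. The sketch below only contrasts the two execution shapes in plain Python: it generates no code and says nothing about Spark's actual speedups. All function names are hypothetical.

```python
from typing import Iterable, Iterator


# Operator-at-a-time pipeline: each operator is its own generator,
# so every row crosses several iterator boundaries.
def scan(rows: Iterable[int]) -> Iterator[int]:
    for r in rows:
        yield r


def filter_even(child: Iterator[int]) -> Iterator[int]:
    for r in child:
        if r % 2 == 0:
            yield r


def project_square(child: Iterator[int]) -> Iterator[int]:
    for r in child:
        yield r * r


def sum_agg(child: Iterator[int]) -> int:
    total = 0
    for r in child:
        total += r
    return total


def run_pipeline(rows: Iterable[int]) -> int:
    return sum_agg(project_square(filter_even(scan(rows))))


# Fused version: the same scan, filter, projection, and aggregation
# written as the single loop a code generator would aim to emit.
def run_fused(rows: Iterable[int]) -> int:
    total = 0
    for r in rows:
        if r % 2 == 0:
            total += r * r
    return total


pipeline_result = run_pipeline(range(10))
fused_result = run_fused(range(10))
print(pipeline_result, fused_result)
```

Both functions compute the same result; the difference is only in how much per-row call overhead sits between the operators.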

For more information about this release, see the release notes.

Download: http://spark.apache.org/downloads.html


