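1. A DataFrame query does not return plain objects.
2. Data queried through the DataFrame API comes back as another DataFrame.
3. A DataFrame is only evaluated when an action operator is applied to it.
4. Data queried through Spark SQL also comes back as a DataFrame.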
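The four points above can be checked with a small local job. The following is a minimal sketch, not the author's code: it assumes Spark 1.x on the classpath, and the local master, the application name, the object name ReturnTypeSketch, and the two in-memory rows are illustrative stand-ins for the parquet data.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: local master, app name, and sample rows are assumptions.
object ReturnTypeSketch {

  // Returns the row count seen through the DataFrame API and the
  // column names of the DataFrame produced by Spark SQL.
  def run(): (Long, Seq[String]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ReturnTypeSketch")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Hypothetical rows standing in for the parquet file.
      val df: DataFrame = Seq(
        ("2016-06-14 10:00:00", "CN", "Asia"),
        ("2016-06-14 10:05:00", "US", "America")
      ).toDF("timestamp", "country", "area")

      // Both calls hand back another DataFrame, not the selected values.
      val viaApi: DataFrame = df.select("country", "area")
      df.registerTempTable("infotable")
      val viaSql: DataFrame = sqlContext.sql("select country, area from infotable")

      // Nothing has been computed yet; count() is an action and runs the query.
      (viaApi.count(), viaSql.columns.toSeq)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (rows, cols) = run()
    println("rows: " + rows + ", columns: " + cols.mkString(", "))
  }
}
```

The explicit DataFrame type annotations are there only to make the return types visible; the compiler would infer them.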
Raw data
    scala> val df = sqlContext.read.parquet("hdfs://hadoop14:9000/yuhui/parquet/part-r-00004.gz.parquet")
    df: org.apache.spark.sql.DataFrame = [timestamp: string, appkey: string, app_version: string,
      channel: string, lang: string, os_type: string, os_version: string, display: string,
      device_type: string, mac: string, network: string, nettype: string, suuid: string,
      register_days: int, country: string, area: string, province: string, city: string,
      event: string, use_interval_cat: string, use_duration_cat: string, use_interval: bigint,
      use_duration: bigint, os_upgrade_from: string, app_upgrade_from: string, page_name: string,
      event_name: string, error_type: string]
Code
    package dataframe

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    /**
      * Created by yuhui on 2016/6/14.
      */
    object DataFrameTest {

      def main(args: Array[String]): Unit = {
        dataFrameInto()
      }

      def dataFrameInto(): Unit = {
        val conf = new SparkConf()
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val df = sqlContext.read.parquet("hdfs://hadoop14:9000/yuhui/parquet")

        // The three variants below are discussed in the walkthrough; only the SQL version is active.
        // df.map(line => printInfo(line.getString(0)))
        // df.foreach(line => printInfo(line.getString(0) + " , " + line.getString(14) + " , " + line.getString(15)))
        // df.select("timestamp", "country", "area").foreach(line => printInfo(line.toString))

        // Register the DataFrame as a temporary table so it can be queried with SQL.
        df.registerTempTable("infotable")
        sqlContext.sql("select timestamp , country , area from infotable").foreach(line => printInfo(line.toString))
      }

      def printInfo(msg: String): Unit = {
        println("printInfo function-->" + msg)
      }
    }
Code walkthrough
1. df.map(line => printInfo(line.getString(0)))
This line never calls printInfo(): map is a lazy transformation and no action follows it, so Spark has nothing to execute.
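The same behaviour can be reproduced without a cluster. The sketch below is only an analogy: it uses a plain Scala Iterator rather than Spark, and the object name LazyMapAnalogy, the sample strings, and the call-recording buffer are hypothetical. Like a Spark transformation, map on an Iterator only describes the work; the function runs once something consumes the result.

```scala
import scala.collection.mutable.ArrayBuffer

// Analogy only: a lazy Scala Iterator stands in for a Spark transformation.
object LazyMapAnalogy {

  // Returns the number of printInfo calls observed after map and after foreach.
  def demo(): (Int, Int) = {
    val seen = ArrayBuffer.empty[String]

    // Stand-in for the article's printInfo; it also records each call.
    def printInfo(msg: String): Unit = {
      seen += msg
      println("printInfo function-->" + msg)
    }

    // map is lazy here: printInfo has not been called yet.
    val mapped = Iterator("row-1", "row-2", "row-3").map(r => printInfo(r))
    val callsAfterMap = seen.size

    // Consuming the iterator plays the role of the action.
    mapped.foreach(_ => ())
    val callsAfterForeach = seen.size

    (callsAfterMap, callsAfterForeach)
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = demo()
    println("calls after map: " + before + ", calls after foreach: " + after)
  }
}
```

In Spark the corresponding fix is to follow the map with an action such as count() or foreach().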
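2. df.foreach(line => printInfo(line.getString(0) + " , " + line.getString(14) + " , " + line.getString(15)))
foreach is a Spark action, so the job runs and printInfo() is called for every row. Indexes 0, 14, and 15 are the timestamp, country, and area columns of the schema shown under Raw data.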
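3. df.select("timestamp", "country", "area").foreach(line => printInfo(line.toString))
select() uses the DataFrame API to keep only the three columns and returns a new DataFrame; the foreach action then triggers execution and prints each row.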
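4. sqlContext.sql("select timestamp , country , area from infotable").foreach(line => printInfo(line.toString))
The Spark SQL query also returns a DataFrame, so it too runs only when the foreach action is applied, which prints each result row.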
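A related point when reading values out: the function passed to foreach runs on the executors, so on a cluster the println output usually ends up in the executor logs rather than on the driver console. If a few values are needed on the driver, one option is to bring them back with an action such as take. This is a minimal sketch under the same assumptions as before (Spark 1.x, local master, hypothetical sample rows, hypothetical object and method names), not the author's method.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: local master, app name, and sample rows are assumptions.
object DriverSideValues {

  // Returns up to `limit` formatted result rows to the driver.
  def firstRows(limit: Int): Array[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DriverSideValues")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Hypothetical rows standing in for the parquet data.
      val df = Seq(
        ("2016-06-14 10:00:00", "CN", "Asia"),
        ("2016-06-14 10:05:00", "US", "America"),
        ("2016-06-14 10:10:00", "DE", "Europe")
      ).toDF("timestamp", "country", "area")
      df.registerTempTable("infotable")

      sqlContext.sql("select country, area from infotable")
        .take(limit) // action: an Array[Row] is returned to the driver
        .map(row => row.getString(0) + " , " + row.getString(1))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // This println runs in the driver process itself.
    firstRows(2).foreach(println)
  }
}
```

take is used instead of collect so that only a bounded number of rows is pulled into driver memory.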
Original article: https://blog.csdn.net/silentwolfyh/article/details/51669839