How to select a column that is not in the GROUP BY clause in Hive?
Try this:

select id1, id2, id3
from (
  select id1, id2, id3,
         ROW_NUMBER() over (partition by id1, id2) as rnum
  from test
) t
where rnum = 1;

Output:
1 1 1
1 2 1
2 2 2

Categories : Hadoop

Hadoop cluster HTTP port for namenode is not working
This is probably a network error rather than anything Hadoop-specific. Can you try this: http://www.cyberciti.biz/tips/no-route-to-host-error-and-solution.html

Categories : Hadoop

Hadoop - Defining and working on data with no delimiter, no space/space between some columns
Can you try this?

input.txt:
0000856214AB25 256 T PL1423AS
2563458547CD12 748 S AK2523YU

Hive table creation with regex:
hive> CREATE TABLE test_regex(
    >   f1 STRING, f2 STRING,
    >   f3 STRING, f4 STRING,
    >   f5 STRING, f6 STRING,
    >   f7 STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES ("input.regex" =
    > "([0

Categories : Hadoop

How to select ${mapred.local.dir}?
1. Is LocalDirAllocator.java used to manage the ${mapred.local.dir} directories? Yes, the tasktracker uses LocalDirAllocator to manage the local directories/disks in order to store intermediate data (the way it allocates space is given in the explanation). 2. Is the method getLocalPathForWrite() of LocalDirAllocator.java used to select a ${mapred.local.dir} directory? There are 3 overl

Categories : Hadoop

java.net.URISyntaxException when starting HIVE
Add this property in hive-site.xml:

<configuration>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>Will remove your error occurring because of metastore_db in shark</description>
  </property>
</configuration>

Add the Java and Hadoop paths in hive-env.sh according to your system:
# Set HADOOP_HOM

Categories : Hadoop

Google cloud click to deploy hadoop
The three big uses of persistent disks (PDs) are:

1. Logs, both daemon and job (or container in YARN). These can get quite large with debug logging turned on and can result in many writes per second.
2. MapReduce shuffle. These can be large, but benefit more from higher IOPS and throughput.
3. HDFS (image and data).

Due to the layout of directories, persistent disks will also be used for other items lik

Categories : Hadoop

general form of MapReduce format
Let's walk through the series of transformations to your data. We start with the raw data:

AAA BBB CCC
DDD EEE AAA
GGG CCC BBB

Suppose we're using TextInputFormat as the InputFormat. Then the input to the mapper will be key/value pairs that look something like this:

0   AAA BBB CCC
13  DDD EEE AAA
26  GGG CCC BBB

Here, the file is broken up into lines. The key is the position in the file, and
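As a rough illustration of the signature that pairing implies (a sketch, not part of the original answer), a mapper reading TextInputFormat input receives the byte offset as a LongWritable key and the line as a Text value; the class name and the token-counting output below are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: input key = byte offset of the line, input value = the line itself.
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit each whitespace-separated token with a count of 1.
        for (String token : line.toString().split("\\s+")) {
            context.write(new Text(token), new LongWritable(1L));
        }
    }
}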

Categories : Hadoop

Hadoop namenode needs to be formatted after every computer start
Looks like you are not overriding the HDFS configurations dfs.name.dir and dfs.data.dir; by default they point to the /tmp directory, which is cleared when your machine restarts. You have to change this from /tmp to another location in your home directory by overriding these values in the hdfs-site.xml file located in your Hadoop configuration directory. Do the following steps: Create a directory i
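A minimal hdfs-site.xml sketch of such an override, assuming a /home/hadoop-user/hdfs directory (the paths are placeholders; create them and set their permissions first):

<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop-user/hdfs/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop-user/hdfs/datanode</value>
</property>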

Categories : Hadoop

How to solve tHDFS component issue in Talend Open Studio for Big Data
The issue was solved in Oracle VM VirtualBox by following these steps:

1. Go to File -> Preferences -> Network, select Host-Only Networks, add the VirtualBox Host-Only Ethernet Adapter and click OK.
2. Right-click the HortonWorks Sandbox, go to Settings -> Network, select Host-Only Adapter in the "Attached to" field and click OK.
3. Now start the

Categories : Hadoop

Spark writing to hdfs not working with the saveAsNewAPIHadoopFile method
Can you try the following code?

import org.apache.hadoop.io._
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val nums = sc.makeRDD(1 to 3).map(x => (new IntWritable(x), new Text("a" * x)))
nums.saveAsNewAPIHadoopFile[TextOutputFormat[IntWritable, Text]]("/data/newAPIHadoopFile")

The following code also worked for me:

val x = sc.parallelize(List("THIS","ISCOOL")).map(x => (N

Categories : Hadoop

Filebrowser in HUE not working, uploads fail for user
Seems like it is using 'hadoop' instead of 'hue'. Which Hue distribution are you using? You should not need to modify the hue.ini file by default. How to configure Hue for HDFS (WebHDFS): add to core-site.xml:

<!-- Hue WebHDFS proxy user setting -->
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <na

Categories : Hadoop

Hbase vs Cassandra: Which is better for a timeseries data storage?
Chocolate or Vanilla ice cream - which is better? I would suggest that you would be the best decision maker. Set up development environments for each option, and this will tell you much more about operational and tuning issues than, I think, anyone else might be able to give you. :)

Categories : Hadoop

Apache Flume: cannot commit transaction. Heap space limit reached
You have a few knobs available to turn to get this working appropriately: Increase byteCapacity: a1.channels.ch1.byteCapacity = 6912212. Increasing memory as suggested in the comment above (JAVA_OPTS="-Xms512m -Xmx1024m -Dcom.sun.management.jmxremote") is probably the best option, because the default byteCapacity is 80% of the process's max memory, which is already consuming a lot pro

Categories : Hadoop

Which is better in terms of speed, Shark or Spark?
Spark is a framework for distributed data processing; you can write your code in Scala, Java and Python. Shark was renamed to Spark SQL and it is a SQL engine on top of Spark - you write SQL queries and they are executed using the Spark framework. Here's the Spark programming guide: https://spark.apache.org/docs/latest/programming-guide.html Here's the Spark SQL guide: https://spark.apache.org/docs

Categories : Hadoop

Where does Hive store data on the file system?
Hive works on top of Hadoop, which means it uses HDFS for storage (it can also use another file system). If your Hive uses HDFS as its file system, go to a terminal on the machine where Hadoop is installed and run: hadoop dfs -ls /user/hive/warehouse
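If the warehouse is not in the default location, a quick way to check it (a sketch, not part of the original answer) is to print the hive.metastore.warehouse.dir property from the Hive CLI:

hive -e "SET hive.metastore.warehouse.dir;"
# prints something like: hive.metastore.warehouse.dir=/user/hive/warehouse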

Categories : Hadoop

Hive load data action using oozie
You must define a hive-default.xml file to execute Hive scripts in Oozie, and that file has to be mentioned in the workflow.xml as:

<property>
  <name>mapred.job.queue.name</name>
  <value>${queueName}</value>
</property>
<property>
  <name>o

Categories : Hadoop

Register Hbase table in Hive
1) Copy the Hive and HBase jars into the Hadoop lib directory:
sudo cp /usr/lib/hive/lib/hive-common-0.7.0-cdh3u0.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/hive/lib/hbase-0.90.1-cdh3u0.jar /usr/lib/hadoop/lib/

2) Stop Hadoop and HBase using the following commands:
/usr/lib/hadoop/bin/stop-all.sh
/usr/lib/hbase/bin/stop-hbase.sh

3) Restart Hadoop and HBase using:
/usr/lib/hadoop/bin/start-all.sh
/usr/lib/hbase/bin/start-hbase.sh
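Once the jars are in place and the services are back up, the table itself is registered from the Hive shell with the HBase storage handler; a sketch, in which the table name, column family and column are placeholders:

CREATE EXTERNAL TABLE hbase_in_hive(rowkey STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:col1")
TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");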

Categories : Hadoop

How to select policy of block placement in the DataNode?
The right directory is chosen in a round-robin manner when a block arrives at the datanode. You can alter this behavior by changing dfs.datanode.fsdataset.volume.choosing.policy to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy; the directory is then chosen based on the space available on each volume (refer to the configurations here: https://hadoop.apache.org
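A sketch of the corresponding hdfs-site.xml entry (property name and class are the ones quoted above):

<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>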

Categories : Hadoop

hadoop file system change directory command
Think about it like this: Hadoop has a special file system called HDFS which runs on top of an existing (say, Linux) file system. There is no concept of a current or present working directory, a.k.a. pwd. Let's say we have the following structure in HDFS: d1/ d2/ f1 d3/ f2 d4/ f3. You could cd in your Linux file system to move from one directory to the other, but do you think changi
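In practice you simply address everything by path on every command; a quick sketch, where the directory and file names are placeholders:

hadoop fs -ls /d1/d2        # list a directory by its full path
hadoop fs -cat /d1/d2/f1    # read a file by its full path
hadoop fs -ls               # no path: defaults to your HDFS home, /user/<username>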

Categories : Hadoop

Datastax hadoop nodes basics
In Datastax Enterprise you run Hadoop on nodes that are also running Cassandra. The most common deployment is to make two datacenters (logical groupings of nodes). One datacenter is devoted to analytics and contains the machines which run Hadoop and C* at the same time; the other datacenter is C*-only and serves the OLTP function of your cluster. The C* processes on the analytics nodes are conne

Categories : Hadoop

Hive Table Data with MapReduce
It seems that your Hadoop installation contains multiple bindings for slf4j; removing one of the bindings might solve the problem. Add the following exclusion to the dependency that caused the conflict:

<exclusions>
  <exclusion>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
  </exclusion>
</exclusions>

Categories : Hadoop

DSE with Hadoop: Error in getting started
I managed to find the solution after a lot of struggle. I had been guessing all this time that the problem was one mis-step somewhere that was not very obvious, at least to me, and that's how it turned out. So, for the benefit of anybody else who may face the same problem, here is what the problem was and what worked. The DSE documentation specifies that for DSE with integrated Hadoop yo

Categories : Hadoop

Pig 0.13 error only in mapreduce mode
Found the solution. The problem was that I was using Maven to build the project and I was building the jar with dependencies. This caused dependencies that have classes with the same class path to override each other (like FileSystem.java for hadoop-hdfs and hadoop-common), and the solution was simply to build the jar without including the dependencies.

Categories : Hadoop

Spark job seems not to parallelize well
Your initial partitions are based on the set of folders in your root (sc.parallelize(pathsStr)). There are two steps in your flow that can significantly unbalance your partitions: 1) reading the list of files within each folder, if some folders have many more files than other folders; 2) reading the TSV lines from each file, if some files have many more lines than others. If your files are roughl
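If uneven partitions do turn out to be the cause, one hedged workaround (not part of the original answer) is to force a shuffle after reading so the records are spread evenly, at the cost of moving data across the network; the path and partition multiplier below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical rebalancing sketch: read the skewed input, then repartition it.
val sc = new SparkContext(new SparkConf().setAppName("rebalance-sketch"))
val lines = sc.textFile("hdfs:///data/folders/*/*.tsv")      // partitions follow the file splits
val balanced = lines.repartition(sc.defaultParallelism * 4)  // shuffle to even out partition sizes
println(balanced.count())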

Categories : Hadoop

Questions using Cloudera Quickstart 5.2.0 and Oozie
Use this command, since they are stored in a database: hue dumpdata beeswax.savedquery > your_file.json. You can then re-import them, but you will lose the ids of the queries ('pk') and users ('owner'), I think. You might be interested in this documentation if you are using HUE: http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/Hue-2-User-Guide/hue27.html Have a nice day :)

Categories : Hadoop

Cluster configuration and hdfs
The /user/prema path is a folder within HDFS. The folder /home/hadoop-user/hdfs/data is a folder within the regular filesystem. The regular filesystem folder is the place where HDFS stores its data, so when you read data from HDFS it actually goes to the physical regular-filesystem folder to read it. You should never need to touch this data, as its format is not very user-friendly - the HDFS tak

Categories : Hadoop

Spark iterate HDFS directory
You can use org.apache.hadoop.fs.FileSystem, specifically FileSystem.listFiles([path], true). And with Spark:

FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
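listFiles returns a RemoteIterator of LocatedFileStatus, so it has to be drained explicitly; a small sketch assuming a Spark shell (sc available) and a placeholder path:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listFiles(new Path("hdfs:///some/dir"), true) // true = recurse into subdirectories
while (files.hasNext) {
  println(files.next().getPath)
}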

Categories : Hadoop

Hadoop hdfs unable to locate file
Almost all hadoop dfs utilities follow the Unix style. The syntax of hadoop dfs -put is hadoop dfs -put <source_file> <destination>. Here the destination can be a directory or a file. In your case the /user directory exists but the directory prema doesn't, so when you copy files from local to HDFS, prema will be used as the name of the file. googlebooks-eng-all-1gram-20120701-0 and /user/prema
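So the usual fix is to create the target directory first and then copy into it; a sketch using the paths from the question:

hadoop dfs -mkdir /user/prema
hadoop dfs -put googlebooks-eng-all-1gram-20120701-0 /user/prema/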

Categories : Hadoop

Hive table reading from GZIP contains meta information like file name in the first row
That depends on what version of Hive you are using. For Hive 0.13 and above: there is a table property, tblproperties ("skip.header.line.count"="1"), which you can use while creating the table, so it will skip that number of lines. For Hive 0.12 and below: you need to remove the line/header manually or by using a shell/python script. Hope it helps...!!!
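A minimal sketch of the Hive 0.13+ variant; the table name, columns and delimiter are placeholders:

CREATE TABLE gz_import (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");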

Categories : Hadoop

Streaming Kmeans Mahout one file output
Hadoop manages data chunking (i.e., splitting a file into multiple blocks). This means that from your perspective (i.e., from HDFS) there is one file. However, on the datanodes' file systems there are many.

Categories : Hadoop

How to browse the filesystem of hadoop-2.5.0-cdh5.2.0 without download?
If you use localhost:50070/dfshealth.html to browse the HDFS file system, you cannot view text files. Use localhost:50070/dfshealth.jsp to get the older view of the file system, where you can view files.

Categories : Hadoop

How to load the data without text qualifiers dynamically from a file using PIG/HIVE/HBASE?
Can you try like this?

input.txt
"123","456","789"
"abc","def","ghi"

PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE REPLACE(line,'\"','') AS line1;
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line1,'\,',3));
D = FOREACH C GENERATE $0,$1,$2;
DUMP D;

Output:
(123,456,789)
(abc,def,ghi)

In your case you can change the above 3rd line to STRSPLIT(line1,'\,',150), where 150 is the

Categories : Hadoop

How to run hadoop cluster balancer from gateway machine?
The best way to check whether your cluster is balanced is to visit the namenode web UI, or run hadoop dfsadmin -report for the latest stats. Don't go by the time it has taken or the log on the console. Also, it is not best practice to run the balancer on the namenode; it should be run from a client node.
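A sketch of the two commands run from a client/gateway node; the 10% threshold is just an example value:

hadoop dfsadmin -report        # per-datanode used/remaining capacity
hadoop balancer -threshold 10  # move blocks until nodes are within 10% of average utilization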

Categories : Hadoop

Apache spark - dealing with auto-updating inputs
Spark alone cannot recognize if a file has been updated. It does its job when reading the file for the first time, and that's all. By default, Spark won't know that a file has been updated and won't know which parts of the file are updates. You should rather work with folders: Spark can run on a folder and can recognize whether there is a new file to process in it -> sc.textFile(PATH_FOLDER)...
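A tiny sketch of the folder-based approach, assuming a placeholder directory into which new files are dropped; each job run re-reads the whole folder and therefore picks up whatever arrived since the last run:

val lines = sc.textFile("hdfs:///data/incoming")  // reads every file currently in the folder
println(lines.count())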

Categories : Hadoop

running multiple datanodes on same machine
To start multiple datanodes on a single node, first download or build the hadoop binary.

1) Download the hadoop binary, or build it from the hadoop source.
2) Prepare the hadoop configuration to run on a single node (change the Hadoop default tmp dir location from /tmp to some other reliable location).
3) Add the following script to the $HADOOP_HOME/bin directory and chmod it to 744.
4) Format HDFS - bin/hado

Categories : Hadoop

How to access flume event header attributes?
Found out you can add the filename or absolute file path using the following:

flumeagent.sources.src1.fileHeader = true
flumeagent.sources.src1.fileHeaderKey = file
flumeagent.sources.src1.basenameHeader = true
flumeagent.sources.src1.basenameHeaderKey = basename

Note: the above are added on the source, but they are used in the sink:

flumeagent.sinks.sinkname.hdfs.path = /user/name/flumedir/%y-%m-%d/%{file} o

Categories : Hadoop

How to write pig script for calculating node degree and count
Can you try this?

input.txt
1 2
1 3
1 4
2 1
2 5
3 1
4 8

PigScript:
A = LOAD 'input.txt' USING PigStorage() AS (id:int, friends:int);
B = GROUP A BY id;
C = FOREACH B GENERATE FLATTEN(COUNT(A.friends)) AS cnt;
D = GROUP C BY cnt;
E = FOREACH D GENERATE COUNT(C.cnt), group;
F = ORDER E BY group DESC;
DUMP F;

Output:
(1,3)
(1,2)
(2,1)

Categories : Hadoop

Hadoop warning and error while copying to HDFS on Amazon AWS EC2
One of the best step-by-step guides to creating a multi-node cluster in Amazon EC2 is here. It explains each and every step. It seems you are already done with the first part; go through the second part, which will help you. Hope it helps you.

Categories : Hadoop

