Data not correctly read from Hadoop using FileSystem API

I am trying to read a file from Hadoop using the FileSystem API. I am able to connect to Hadoop and read the file; however, the file contents come out garbled.
Below is the code:
import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.security.UserGroupInformation;

public class HdfsToInfaWriter {
    public static void main(String[] args) {
        String hdfsuri = args[0];
        String localuri = args[1];
        String hdusername = args[2];

        Configuration conf = new Configuration();
        conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("file:///etc/hadoop/conf/hdfs-site.xml"));
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("fs.defaultFS", hdfsuri);
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

        try {
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab("**************",
                    "********************");
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.setProperty("HADOOP_USER_NAME", hdusername);
        System.setProperty("hadoop.home.dir", "/");

        try {
            FileSystem fs = FileSystem.get(URI.create(hdfsuri), conf);
            Path hdfsreadpath = new Path(hdfsuri);
            // Check whether a compression codec matches the file name.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            System.out.println("the class for codec is " + factory.getCodec(hdfsreadpath));
            File src1 = new File(localuri);
            System.out.println("before copy");
            FileUtil.copy(fs, hdfsreadpath, src1, false, conf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
When I use the HDFS command hdfs dfs -cat /bigdatahdfs/datamart/trial.txt, the file shows as simple readable text.
But when I copy the file to the local system and run cat /home/trial1.txt, the output is as below:
▒▒▒1K▒;▒▒
=▒<▒▒▒&▒▒▒
NOTE: I have also tried using the IOUtils API; the output is the same.
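For reference, a minimal sketch of the IOUtils-based read mentioned in the note (the class and method names are placeholders, and it assumes conf carries the same Kerberos configuration and login as above):

import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCatSketch {
    // Streams one HDFS file to the given output stream.
    static void cat(Configuration conf, String hdfsUri, String file, OutputStream out)
            throws IOException {
        FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);
        try (FSDataInputStream in = fs.open(new Path(file))) {
            IOUtils.copyBytes(in, out, 4096, false); // 4096-byte copy buffer; don't close out
        }
    }
}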

Related

StreamingFileSink doesn't work sometimes when trying to write to S3

I am trying to write to an S3 sink.
private static StreamingFileSink<String> createS3SinkFromStaticConfig(
        final Map<String, Properties> applicationProperties) {
    Properties sinkProperties = applicationProperties.get(SINK_PROPERTIES);
    String s3SinkPath = sinkProperties.getProperty(SINK_S3_PATH_KEY);
    return StreamingFileSink
            .forRowFormat(
                    new Path(s3SinkPath),
                    new SimpleStringEncoder<String>(StandardCharsets.UTF_8.toString()))
            .build();
}
The following code works, and I can see the results in S3:
input.map(value -> { // Parse the JSON
        JsonNode jsonNode = jsonParser.readValue(value, JsonNode.class);
        return new Tuple2<>(jsonNode.get("ticker").asText(), jsonNode.get("price").asDouble());
    }).returns(Types.TUPLE(Types.STRING, Types.DOUBLE))
    .keyBy(0) // Logically partition the stream per stock symbol
    .timeWindow(Time.seconds(10), Time.seconds(5)) // Sliding window definition
    .min(1) // Calculate minimum price per stock over the window
    .setParallelism(3) // Set parallelism for the min operator
    .map(value -> value.f0 + ": ----- " + value.f1.toString() + "\n")
    .addSink(createS3SinkFromStaticConfig(applicationProperties));
But the following doesn't write anything to S3.
KeyedStream<EnrichedMetric, EnrichedMetricKey> input = env.addSource(new EnrichedMetricSource())
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<EnrichedMetric>forMonotonousTimestamps()
                        .withTimestampAssigner((event, l) -> event.getEventTime()))
        .keyBy(new EnrichedMetricKeySelector());

DataStream<String> statsStream = input
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .process(new PValueStatisticsWindowFunction());

statsStream.addSink(createS3SinkFromStaticConfig(applicationProperties));
PValueStatisticsWindowFunction is a ProcessWindowFunction, as below:
@Override
public void process(EnrichedMetricKey enrichedMetricKey,
                    Context context,
                    Iterable<EnrichedMetric> in,
                    Collector<String> out) throws Exception {
    int count = 0;
    for (EnrichedMetric m : in) {
        count++;
    }
    out.collect("Count: " + count);
}
When I run the Flink app locally, statsStream.print() prints the results to log/flink-*-taskexecutor-*.out.
In the cluster, I can see that checkpointing is enabled and the checkpoint history in the Flink dashboard. I have also made sure the S3 path is in the format s3a://<bucket>.
Not sure what I am missing here.
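One thing worth noting for anyone hitting the same symptom: StreamingFileSink only moves in-progress part files to their finished, visible state when a checkpoint completes, so checkpointing must be enabled on the environment. A minimal sketch (the 60-second interval is an arbitrary choice):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // StreamingFileSink finalizes part files only on checkpoint completion;
        // without this, nothing ever becomes visible in S3.
        env.enableCheckpointing(60_000L); // checkpoint every 60s; the value is an assumption
        // ... build the job and call env.execute(...) as usual
    }
}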

How to retain the original Last Modified/Created date of files on FTP when downloading them using the FTPClient class and Selenium [duplicate]

I am using org.apache.commons.net.ftp.FTPClient to retrieve files from an FTP server. It is crucial that I preserve the last-modified timestamp of the file when it is saved on my machine. Does anyone have a suggestion for how to solve this?
This is how I solved it:
public boolean retrieveFile(String path, String filename, long lastModified) throws IOException {
    File localFile = new File(path + "/" + filename);
    OutputStream outputStream = new FileOutputStream(localFile);
    boolean success = client.retrieveFile(filename, outputStream);
    outputStream.close();
    localFile.setLastModified(lastModified);
    return success;
}
I wish the Apache-team would implement this feature.
This is how you can use it:
List<FTPFile> ftpFiles = Arrays.asList(client.listFiles());
for (FTPFile file : ftpFiles) {
    // getTimestamp() returns a Calendar; getTimeInMillis() yields the long we need
    retrieveFile("/tmp", file.getName(), file.getTimestamp().getTimeInMillis());
}
You can modify the timestamp after downloading the file.
The timestamp can be retrieved through the LIST command, or the (non-standard) MDTM command.
You can see how to modify the timestamp here: http://www.mkyong.com/java/how-to-change-the-file-last-modified-date-in-java/
When downloading a list of files, like all files returned by FTPClient.mlistDir or FTPClient.listFiles, use the timestamps returned with the listing to update the timestamps of the local downloaded files:
String remotePath = "/remote/path";
String localPath = "C:\\local\\path";

FTPFile[] remoteFiles = ftpClient.mlistDir(remotePath);
for (FTPFile remoteFile : remoteFiles) {
    File localFile = new File(localPath + "\\" + remoteFile.getName());
    OutputStream outputStream = new BufferedOutputStream(new FileOutputStream(localFile));
    if (ftpClient.retrieveFile(remotePath + "/" + remoteFile.getName(), outputStream)) {
        System.out.println("File " + remoteFile.getName() + " downloaded successfully.");
    }
    outputStream.close();
    localFile.setLastModified(remoteFile.getTimestamp().getTimeInMillis());
}
When downloading a single specific file only, use FTPClient.mdtmFile to retrieve the remote file's timestamp and update the timestamp of the downloaded local file accordingly:
File localFile = new File("C:\\local\\path\\file.zip");

FTPFile remoteFile = ftpClient.mdtmFile("/remote/path/file.zip");
if (remoteFile != null) {
    OutputStream outputStream = new BufferedOutputStream(new FileOutputStream(localFile));
    if (ftpClient.retrieveFile(remoteFile.getName(), outputStream)) {
        System.out.println("File downloaded successfully.");
    }
    outputStream.close();
    localFile.setLastModified(remoteFile.getTimestamp().getTimeInMillis());
}

Why NullPointerException when decompressing with Apache Commons Compress?

See the linked stack trace for the NullPointerException.
I generate tar.gz files and send them to others to decompress, but their program (which was not created by me) throws the error above; only one file triggers it.
But when using the command tar -xzvf *** on my computer and on their computer, no problem occurred.
So I want to know what is wrong in my program below:
public static void archive(ArrayList<File> files, File destFile) throws Exception {
    TarArchiveOutputStream taos = new TarArchiveOutputStream(new FileOutputStream(destFile));
    taos.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);
    for (File file : files) {
        //LOG.info("file Name: " + file.getName());
        archiveFile(file, taos, "");
    }
}

private static void archiveFile(File file, TarArchiveOutputStream taos, String dir) throws Exception {
    TarArchiveEntry entry = new TarArchiveEntry(dir + file.getName());
    entry.setSize(file.length());
    taos.putArchiveEntry(entry);
    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
    int count;
    byte[] data = new byte[BUFFER]; // BUFFER is a class constant defined elsewhere
    while ((count = bis.read(data, 0, BUFFER)) != -1) {
        taos.write(data, 0, count);
    }
    bis.close();
    taos.closeArchiveEntry();
}
The stack trace looks like a bug in Apache Commons Compress (https://issues.apache.org/jira/browse/COMPRESS-223) that was fixed in version 1.7 (released almost three years ago).
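If it helps to verify which version is actually on the classpath at runtime, here is a small diagnostic sketch (my addition, not part of the original answer); it prints the Implementation-Version from the commons-compress jar manifest, which should be 1.7 or later:

import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

public class CompressVersionCheck {
    public static void main(String[] args) {
        // May print null if the jar's manifest carries no Implementation-Version.
        Package p = TarArchiveOutputStream.class.getPackage();
        System.out.println("commons-compress version: " + p.getImplementationVersion());
    }
}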

Can I write streams or bytes to an Apache Commons Compress Tarfile?

The Apache Commons Compress library seems focused on writing a TarArchiveEntry to a TarArchiveOutputStream, but it looks like the only way to create a TarArchiveEntry is with a File object.
I don't have files to write to the tar; I have byte[]s in memory, or preferably streams. And I don't want to write a bunch of temp files to disk just so I can build a tar.
Is there any way I can do something like:
TarEntry entry = new TarEntry(int size, String filename);
entry.write(byte[] fileContents);
TarArchiveOutputStream tarOut = new TarArchiveOutputStream();
tarOut.write(entry);
tarOut.flush();
tarOut.close();
Or, even better....
InputStream nioTarContentsInputStream = .....
TarEntry entry = new TarEntry(int size, String filename);
entry.write(nioTarContentsInputStream);
TarArchiveOutputStream tarOut = new TarArchiveOutputStream();
tarOut.write(entry);
tarOut.flush();
tarOut.close();
Use the following code:
byte[] test1Content = new byte[] { /* Some data */ };
TarArchiveEntry entry1 = new TarArchiveEntry("test1.txt");
entry1.setSize(test1Content.length);
TarArchiveOutputStream out = new TarArchiveOutputStream(new FileOutputStream("out.tar"));
out.putArchiveEntry(entry1);
out.write(test1Content);
out.closeArchiveEntry();
out.close();
This builds the desired tar file with a single file in it, with the contents from the byte[].
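For the stream-based variant from the question, the same pattern works as long as the length is known before writing, because the tar format records the entry size in the header ahead of the data. A hedged sketch (the helper name is mine; IOUtils here is the commons-compress utility class, and a stream of truly unknown length would have to be buffered into a byte[] first):

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.utils.IOUtils;

static void addStreamEntry(TarArchiveOutputStream out, String name,
                           InputStream in, long size) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(name);
    entry.setSize(size); // the size must be known before the data is written
    out.putArchiveEntry(entry);
    IOUtils.copy(in, out); // copies the stream into the current entry
    out.closeArchiveEntry();
}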

Edit any file which is wrapped in a jar file

I want to implement the following with my Java code in Eclipse.
I need to edit a .dict file which is inside a jar file.
My directory structure is:
C:\Users\bhavik.kama\Desktop\Sphinx\sphinx4-1.0beta6-bin\sphinx4-1.0beta6\modified_jar_dict\WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar\dict\cmudict04.dict
cmudict04.dict is the text file I want to edit, and WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar is the jar file.
How can I edit this cmudict04.dict file, which resides in the WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar\dict\ directory, at runtime from a Java application? And I want the jar file to contain the updated file afterwards.
Can you provide any help?
Thank you in advance.
I would recommend using java.util.zip. With these classes you can read and write the files inside the archive, but modifying the contents in place is not guaranteed to work because the archive may be cached.
Sample tutorial: http://www.javaworld.com/community/node/8362
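To illustrate, a minimal sketch that reads the .dict entry out of the jar with java.util.zip (jar files are zip files; the paths are taken from the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ReadDictEntry {
    public static void main(String[] args) throws IOException {
        try (ZipFile jar = new ZipFile("WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar")) {
            ZipEntry entry = jar.getEntry("dict/cmudict04.dict"); // entry path inside the jar
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jar.getInputStream(entry)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}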
You can't edit files that are contained in a jar file and have them saved in the jar file without extracting the file first, updating it, and creating a new jar by copying the contents of the old one over to the new one, deleting the old one, and renaming the new one in its place (a sketch of this approach follows).
My suggestion is to find a better solution.
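A minimal sketch of that copy-and-replace approach for a single entry (the method and names are mine, using plain java.util.zip; error handling is omitted):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ReplaceJarEntry {
    // Copies every entry from oldJar into newJar, swapping in newContent
    // for the one entry being edited. Afterwards, newJar can be renamed
    // over oldJar.
    static void replace(File oldJar, File newJar, String entryName, byte[] newContent)
            throws IOException {
        try (ZipFile in = new ZipFile(oldJar);
             ZipOutputStream out = new ZipOutputStream(new FileOutputStream(newJar))) {
            Enumeration<? extends ZipEntry> entries = in.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                out.putNextEntry(new ZipEntry(e.getName()));
                if (e.getName().equals(entryName)) {
                    out.write(newContent); // replaced content
                } else if (!e.isDirectory()) {
                    try (InputStream is = in.getInputStream(e)) {
                        byte[] buf = new byte[8192];
                        int n;
                        while ((n = is.read(buf)) != -1) {
                            out.write(buf, 0, n); // copy unchanged bytes
                        }
                    }
                }
                out.closeEntry();
            }
        }
    }
}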
I succeeded in editing the jar file and wrapping it back up as it was, with the following code:
public void run() throws IOException {
    Manifest manifest = new Manifest();
    manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // JarOutputStream target = new JarOutputStream(new FileOutputStream("E:\\hiren1\\WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar"), manifest);
    // add(new File("E:\\hiren1\\WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/"), target);
    JarOutputStream target = new JarOutputStream(
            new FileOutputStream("C:\\Users\\bhavik.kama\\Desktop\\Sphinx\\sphinx4-1.0beta6-bin\\sphinx4-1.0beta6\\modified_jar_dict\\WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar"), manifest);
    add(new File("C:\\Users\\bhavik.kama\\Desktop\\Sphinx\\sphinx4-1.0beta6-bin\\sphinx4-1.0beta6\\modified_jar_dict\\WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/"), target);
    target.close();
}
private void add(File source, JarOutputStream target) throws IOException {
    // isFirst and firstDir are instance fields used to strip the root
    // directory prefix from entry names.
    BufferedInputStream in = null;
    try {
        if (source.isDirectory()) {
            //String name = source.getPath().replace("\\", "/");
            if (isFirst) {
                firstDir = source.getParent() + "\\";
                isFirst = false;
            }
            String name = source.getPath();
            name = name.replace(firstDir, "");
            if (!name.isEmpty()) {
                if (!name.endsWith("/"))
                    name += "/";
                JarEntry entry = new JarEntry(name);
                entry.setTime(source.lastModified());
                target.putNextEntry(entry);
                target.closeEntry();
            }
            for (File nestedFile : source.listFiles())
                add(nestedFile, target);
            return;
        }
        String name = source.getPath();
        name = name.replace(firstDir, "").replace("\\", "/");
        //JarEntry entry = new JarEntry(source.getPath().replace("\\", "/"));
        JarEntry entry = new JarEntry(name);
        //JarEntry entry = new JarEntry(source.getName());
        entry.setTime(source.lastModified());
        target.putNextEntry(entry);
        in = new BufferedInputStream(new FileInputStream(source));
        byte[] buffer = new byte[1024];
        while (true) {
            int count = in.read(buffer);
            if (count == -1)
                break;
            target.write(buffer, 0, count);
        }
        target.closeEntry();
    } finally {
        if (in != null)
            in.close();
    }
}