I want to know the filename of the file I am querying from in Apache Drill. I know we can query INPUT__FILE__NAME in Hive, but I could not find anything similar in Apache Drill.
This is a new feature that was merged recently. I think it should be available in the 1.7 release.
https://issues.apache.org/jira/browse/DRILL-3474
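As a hedged sketch, here is roughly what such a query might look like once the feature is available (the implicit column name filename and the sample path are assumptions based on the linked JIRA, not confirmed syntax):

```sql
-- filename is the implicit column proposed in DRILL-3474;
-- dfs.`/path/to/data` is a placeholder workspace/path.
SELECT filename, t.*
FROM dfs.`/path/to/data` AS t
LIMIT 10;
```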
What is the data-config.xml file, when shall I use it, and how is it configured? Can anyone please explain in detail?
The data-config.xml file is an example configuration file for how to use the DataImportHandler in Solr. It's one way of getting data into Solr, allowing one of the servers to connect through JDBC (or through a few other plugins) to a database server or a set of files and import them into Solr.
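As an illustration, a minimal data-config.xml could look something like this (the JDBC URL, credentials, and table/field names here are made up for the example):

```xml
<dataConfig>
  <!-- JDBC connection the DataImportHandler uses to pull rows -->
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- Each row returned by the query becomes one Solr document -->
    <entity name="item" query="SELECT id, name FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```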
DIH has a few issues (for example the non-distributed way it works), so it's usually suggested to write the indexing code yourself and POST the documents to Solr from a suitable client, such as SolrJ, Solarium, SolrClient, MySolr, etc.
It has been mentioned that the DIH functionality really should be moved into a separate application, but that hasn't happened yet as far as I know.
I have 5000 files in a folder, and new files keep getting loaded into the same folder on a daily basis. I need to pick up only the latest file each day among all the files.
Is it possible to achieve this scenario in Mule out of the box?
I tried placing a File component inside a Poll component (to make use of watermarks), but it is not working.
Is there any way to achieve this? If not, please suggest the best approach (any relevant links welcome).
Mule Studio 5.3, Runtime 3.7.2.
Thanks in advance
Short answer: there is no quick out-of-the-box solution, but there are other ways. I'm not saying this is the right or only way to solve it, but I've implemented a similar scenario like this before:
A normal File inbound endpoint with a database table as a file log: each time a file is picked up, a component checks whether its name already appears in the table. Using a choice router or a filter, I only continue if it isn't in there already, and after processing I add the filename to the table.
This is quite a "heavy" solution, though. A simpler approach would be to use an idempotent filter with an object store, for example backed by a Redis server: https://github.com/mulesoft/redis-connector/blob/master/src/test/resources/redis-objectstore-tests-config.xml
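For illustration, an idempotent filter on the original filename could look roughly like this in Mule 3 (the flow name, path, and store prefix are assumptions; without an explicit store the filter defaults to an in-memory one, but it can be pointed at Redis as above):

```xml
<flow name="pollNewFiles">
  <file:inbound-endpoint path="/data/incoming" autoDelete="false"/>
  <!-- Drop any file whose name has been seen before -->
  <idempotent-message-filter
      idExpression="#[message.inboundProperties.originalFilename]"
      storePrefix="processed-files"/>
  <logger level="INFO"
      message="Processing #[message.inboundProperties.originalFilename]"/>
</flow>
```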
It is actually very simple if your incoming filename contains a timestamp: you can configure the file inbound connector by setting file:filename-regex-filter pattern="myfilename_#[function:timestamp].csv". I hope this helps.
Maybe you can use a Quartz scheduler (specify the time in a cron expression), followed by a Groovy script in which you can start the file connector. Keep the file connector in another flow.
I am using the Java API quick-start program to upload CSV files into BigQuery tables. I uploaded more than a thousand files, but in one of the tables one file's rows are duplicated in BigQuery. I went through the logs but found only one entry for its upload. I also went through many similar questions where @jordan mentioned that the bug was fixed. Can there be any reason for this behavior? I have yet to try the suggested solution of setting a job ID, but I just cannot find any reason on my side for the duplicate entries...
Since BigQuery is append-only by design, you need to accept that there may be some duplicates in your system, and write your queries in such a way that they select the most recent version of each row.
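For example, if each row carries a unique key and a load timestamp (both assumed here; your schema will differ), a query along these lines keeps only the latest version of each row:

```sql
-- id and loaded_at are assumed columns; BigQuery standard SQL syntax.
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) AS rn
  FROM mydataset.mytable
)
WHERE rn = 1;
```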
In ApacheDS, I'm trying to load the sample LDAP data provided in LDIF format, following the instructions here.
I followed the same pattern I used to load the context record, by specifying an LDIF file in server.xml for the partition (directions, albeit lousy ones, are located here).
So I have...
<apacheDS id="apacheDS">
  <ldapServer>#ldapServer</ldapServer>
  <ldifDirectory>sevenSeasRoot.ldif</ldifDirectory>
  <ldifDirectory>apache_ds_tutorial.ldif</ldifDirectory>
</apacheDS>
The sevenSeasRoot.ldif file seems to have loaded properly, because I can see an entry for it in LdapBrowser. But there are no records under it.
What am I doing wrong? How do I configure server.xml to load the child records for sevenSeas?
Just a quick note: the config element says ldif*Directory*, but you are passing a file. Try pointing it at a directory containing your LDIF files instead.
OTOH, I guess you are using 1.5.7, which is very old; it would be better to try the latest version for better support.
I have a Vector<String>. Now I want to store those Strings in a database. But there is one "but": the user mustn't install anything other than the J2RE; he will just copy the program to his computer and run it. Does Java have such a kind of database?
P.S. Previously I thought about object serialization or just a simple text/XML file, but according to the task it must be a database... So the user just copies my program and runs it, without installing any additional software except the J2RE.
I think HSQLDB is the right choice for your problem. You just need the HSQLDB JAR on your classpath; then use the file-based database configuration.
You can embed Apache Derby in your application. This will run on a JRE installation.
The only thing I know of is JavaDB, but I don't know if that's included in J2RE.
For more info, see the JavaDB site.
Edit
After reading the JavaDB site, it seems it's only included in the JDK, which I assume would not be sufficient for you.
Why the requirement for a database? You have one Vector: is there any other data linked to each String? If you just have to search for Strings in the Vector, you can do that without a database. Ordering the list, searching for substring matches, etc. can all be done using Java string functions. Even if the list contains 100,000 Strings, it should still be fast.
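To illustrate the point above, here is a small self-contained sketch of sorting, exact lookup, and substring search over a list of Strings using only the standard library (the class and method names are made up for the example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class StringSearchDemo {

    // Return all entries containing the given substring (case-insensitive).
    public static List<String> findMatches(List<String> entries, String query) {
        List<String> matches = new ArrayList<>();
        String needle = query.toLowerCase();
        for (String entry : entries) {
            if (entry.toLowerCase().contains(needle)) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>(List.of("apple", "Banana", "cherry", "pineapple"));

        // Sorting enables fast exact lookups via binary search.
        Collections.sort(data);
        System.out.println("Found apple at index: " + Collections.binarySearch(data, "apple"));

        // Substring search is a simple linear scan.
        System.out.println("Entries containing \"apple\": " + findMatches(data, "apple"));
    }
}
```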
JavaDB and Derby are very closely related. JavaDB is the Sun distribution of Derby. You can get Derby directly from the Apache web site (http://db.apache.org/derby) and embed it directly into your application, and both JavaDB and Derby require only the JRE in order to run.
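As a sketch, embedding Derby and storing the Vector looks roughly like this (this assumes derby.jar is on the classpath; the database name and table are made up, and ;create=true creates the database directory on first use):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.Vector;

public class EmbeddedDerbyDemo {
    public static void main(String[] args) throws Exception {
        Vector<String> values = new Vector<>(List.of("alpha", "beta"));

        // Embedded mode: no server process, the DB lives in a local directory.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:mydb;create=true")) {
            conn.createStatement().execute("CREATE TABLE entries (value VARCHAR(255))");
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO entries VALUES (?)")) {
                for (String s : values) {
                    ps.setString(1, s);
                    ps.executeUpdate();
                }
            }
        }
    }
}
```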