Inputs of Mapreduce

Khaled BEN BAHRI
Hello all,

I'm new to MapReduce and I'm developing a MapReduce function that takes
XML documents as input.

How do I prepare the input files and specify them to the map function?

Thanks for your help.

Best regards
Khaled


Re: Inputs of Mapreduce

edward choi
Khaled,

By default, Hadoop MapReduce reads input files line by line, so each call to
the map function receives a single line.
An XML document usually spans many lines.
So you will have to pack each XML document into a single line.
Or you can write your own input format, for which you will need to refer to a
guide book.
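
To illustrate the first option, here is a minimal Python sketch of a pre-processing step that collapses one XML document onto a single line before it is uploaded as job input. The function name and the sample document are made up for illustration. It assumes line breaks inside text nodes are not significant, so it is not safe for CDATA or whitespace-sensitive content.

```python
import re


def xml_to_single_line(xml_text: str) -> str:
    """Collapse an XML document onto one line so a line-based reader sees one record."""
    # Drop whitespace that sits purely between tags (indentation, blank lines).
    compact = re.sub(r">\s+<", "><", xml_text.strip())
    # Replace any remaining line breaks (inside text nodes) with a single space.
    return re.sub(r"\s*[\r\n]+\s*", " ", compact)


# Hypothetical sample document, spread over several lines.
doc = """<book>
  <title>Hadoop
     Guide</title>
  <year>2010</year>
</book>"""

line = xml_to_single_line(doc)
print(line)
```

Writing one such line per document into the input file means the default line-by-line reader hands each map call exactly one document.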

In reply to Khaled BEN BAHRI <[hidden email]>, 2010/7/13

Re: Inputs of Mapreduce

shujamughal
Hi Khaled,
XML files can be processed using Hadoop streaming. Check out the following
link.

http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
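
As far as I recall, that FAQ entry passes an option of the form -inputreader "StreamXmlRecord,begin=BEGIN_STRING,end=END_STRING" to the streaming jar, so that everything between the two strings is treated as one record. Below is a rough Python sketch of what the mapper side could look like. The <record> delimiters and the id and name fields are hypothetical, and it assumes the record text reaches the mapper on standard input, possibly spread over several lines, with at most one record starting per line.

```python
import xml.etree.ElementTree as ET

BEGIN = "<record>"   # hypothetical begin string
END = "</record>"    # hypothetical end string


def records(lines, begin=BEGIN, end=END):
    """Yield each chunk of text from `begin` through `end`; a chunk may span lines."""
    buf = []
    inside = False
    for line in lines:
        if not inside and begin in line:
            inside = True
            line = line[line.index(begin):]  # discard anything before the record
        if inside:
            buf.append(line)
            if end in line:
                text = "".join(buf)
                yield text[: text.index(end) + len(end)]
                buf, inside = [], False


def map_records(lines):
    """Emit one tab-separated key/value string per XML record."""
    for rec in records(lines):
        elem = ET.fromstring(rec)
        key = elem.findtext("id", default="")
        value = elem.findtext("name", default="")
        yield f"{key}\t{value}"


# Stand-in for standard input; a real mapper would call map_records(sys.stdin).
sample = [
    "<records>\n",
    "  <record>\n",
    "    <id>1</id><name>alpha</name>\n",
    "  </record>\n",
    "  <record><id>2</id><name>beta</name></record>\n",
    "</records>\n",
]
for out in map_records(sample):
    print(out)
```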

Regards
Shuja

In reply to edward choi <[hidden email]>, Tue, Jul 13, 2010 at 2:24 PM

--
Regards
Shuja-ur-Rehman Baig
_________________________________
MS CS - School of Science and Engineering
Lahore University of Management Sciences (LUMS)
Sector U, DHA, Lahore, 54792, Pakistan
Cell: +92 3214207445

Re: Inputs of Mapreduce

Paul Ingles
We tried using the Hadoop streaming XML format a while ago and it didn't quite go as expected. I don't remember why, but it gave some weird results, such as dropping some records or getting to 98% complete and then stopping.

The Mahout project also has an XmlInputFormat [1], which is what we ended up using. I also posted something on my blog about it all [2], including a little about my understanding (so far) of input formats, record readers, etc.
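
For anyone wondering what a record reader for XML has to do, here is a small Python sketch of the general idea: scan the byte stream for a start tag, then read through to the matching end tag, and hand that whole span to the mapper as one record. This is only an illustration under my own assumptions, not Mahout's code. The function name, the <doc> tags, and the chunk size are made up, and a start tag that carries attributes would need a looser match.

```python
import io


def read_xml_records(stream, start_tag: bytes, end_tag: bytes, chunk_size: int = 4096):
    """Yield (byte_offset, record) pairs; each record runs from start_tag through end_tag."""
    buf = b""
    base = 0  # offset in the stream of buf[0]
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            s = buf.find(start_tag)
            if s < 0:
                # No start tag yet: keep only a short tail in case a tag straddles chunks.
                keep = len(start_tag) - 1
                if len(buf) > keep:
                    base += len(buf) - keep
                    buf = buf[len(buf) - keep:]
                break
            e = buf.find(end_tag, s + len(start_tag))
            if e < 0:
                # Record is still open: drop bytes before it and wait for more input.
                base += s
                buf = buf[s:]
                break
            stop = e + len(end_tag)
            yield base + s, buf[s:stop]
            base += stop
            buf = buf[stop:]
        if not chunk:
            return  # end of stream; an unterminated record is discarded


# Hypothetical input; the tiny chunk size forces tags to straddle chunk boundaries.
data = b"<docs><doc>a</doc>\n<doc>\n b\n</doc></docs>"
recs = list(read_xml_records(io.BytesIO(data), b"<doc>", b"</doc>", chunk_size=4))
for offset, rec in recs:
    print(offset, rec)
```

The offset plays the role of the key and the record bytes the role of the value, so each map call sees one complete element regardless of how many lines it spans.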

Hope that helps,
Paul

1. http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
2. http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

In reply to Shuja Rehman, 13 Jul 2010, at 12:26