Help designing application architecture


venito camelas
I'm pretty new to this and I have a use case I'm not sure how to implement. I'll try to explain it, and I'd appreciate it if anyone could point me in the right direction.

The use case has these requirements:
 1 - Any user should be able to define the format of the information they want to store (a channel). For example, user X defines a channel named "coordinate":
coordinate = {
"X" : "Float",
"Y" : "Float",
"instant" : "Timestamp"
}
  Every channel has some time value: either an instant (as above) or a period of time ("start" : "Timestamp", "end" : "Timestamp").

 2 - Given the previous example, the user should be able to ask the following questions:
2.1 When was the last time I went near {X : x, Y : y}?  --> Process the information to find the "near" places and return the newest one.
2.2 Where was I on March 6th between 1pm and 2pm?       --> Query by time
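
As a rough illustration of questions 2.1 and 2.2, here is a minimal in-memory sketch. It assumes a fixed coordinate type, Euclidean distance, and a caller-chosen radius for "near"; none of that is specified above, and TimedPoint and PointQueries are hypothetical names. It says nothing about where the data should live.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory model of one "coordinate" entry.
final class TimedPoint {
    final double x;
    final double y;
    final Instant instant;

    TimedPoint(double x, double y, Instant instant) {
        this.x = x;
        this.y = y;
        this.instant = instant;
    }
}

final class PointQueries {
    // 2.1: newest instant at which a point lay within `radius` of (x, y).
    // Assumption: "near" means Euclidean distance <= radius.
    static Optional<Instant> lastTimeNear(List<TimedPoint> points, double x, double y, double radius) {
        Instant newest = null;
        for (TimedPoint p : points) {
            boolean near = Math.hypot(p.x - x, p.y - y) <= radius;
            if (near && (newest == null || p.instant.isAfter(newest))) {
                newest = p.instant;
            }
        }
        return Optional.ofNullable(newest);
    }

    // 2.2: all points recorded in the half-open interval [from, to).
    static List<TimedPoint> between(List<TimedPoint> points, Instant from, Instant to) {
        List<TimedPoint> result = new ArrayList<>();
        for (TimedPoint p : points) {
            if (!p.instant.isBefore(from) && p.instant.isBefore(to)) {
                result.add(p);
            }
        }
        return result;
    }
}
```

Both methods are linear scans, so they only show the shape of the two questions; a real store would need an index on time (and some spatial index) to answer them efficiently.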



For 1) I was thinking of using some document-oriented storage because of the channels' lack of fixed structure, though I'm not sure that's the only thing to consider.

For 2.1) I'd use a MapReduce (MR) job.

For 2.2) I think it would be better to have the information in the document storage and run the queries there.

Is it a good approach to store the information both in HDFS and in the document-oriented storage (for processing and querying respectively)?

As I mentioned at the beginning, I'm really new to this and I'm just trying to learn, so sorry if my questions are silly.

Any suggestion or any good reference related to this will be much appreciated.

Re: Help designing application architecture

Ted Yu-3
For 1) you don't have to introduce external storage.

You can define case classes for the known formats.

FYI
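
For illustration only: case classes are a Scala feature, so a Java version of this suggestion would be a small immutable class per known format. Below is a minimal sketch for the "coordinate" channel, assuming Float maps to double and Timestamp maps to java.time.Instant (both mappings are assumptions, not something stated in the thread). It only covers formats that are fixed at compile time.

```java
import java.time.Instant;
import java.util.Objects;

// Java stand-in for a Scala case class: immutable fields plus value equality.
// Assumed mapping: "Float" -> double, "Timestamp" -> java.time.Instant.
final class Coordinate {
    final double x;
    final double y;
    final Instant instant;

    Coordinate(double x, double y, Instant instant) {
        this.x = x;
        this.y = y;
        this.instant = Objects.requireNonNull(instant, "instant");
    }

    // Two coordinates are equal when all three fields are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Coordinate)) return false;
        Coordinate other = (Coordinate) o;
        return Double.compare(x, other.x) == 0
                && Double.compare(y, other.y) == 0
                && instant.equals(other.instant);
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y, instant);
    }

    @Override
    public String toString() {
        return "Coordinate(" + x + "," + y + "," + instant + ")";
    }
}
```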

On Thu, Jul 7, 2016 at 4:40 PM, venito camelas <[hidden email]> wrote:


Re: Help designing application architecture

venito camelas
Sorry, but I did not understand.
From what I can see, case classes are a Scala feature, and I'm using Java (I could consider learning Scala and switching to it, since I have not started yet and this is for learning purposes only).

What do you mean by known formats? When a user creates a channel, they can only choose from some basic types (string, long, timestamp, etc.) and from channels they created previously. Example:

The user first creates 2 simple channels (Coordinate and Temperature):
Coordinate = {
"X" : "Float",
"Y" : "Float",
"instant" : "Timestamp"
}

Temperature = {
"value" : "Float",
"measurement_unit" : "String",
"instant" : "Timestamp"
}

Then the user creates a new channel using the 2 previously created ones:
Measurement = {
"coord" : "Coordinate",
"temp" : "Temperature",
"instant" : "Timestamp"
}
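
The rule that a channel may only use basic types and previously created channels could be sketched as a small registry. This is only an illustration: ChannelRegistry is a hypothetical name, the set of basic types is an assumption based on the list above, and channels are kept as plain field-name-to-type-name maps.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical registry of user-defined channels, stored as field name -> type name.
final class ChannelRegistry {
    // Basic types assumed for illustration; the post only says "string, long, timestamp, etc".
    private static final Set<String> BASIC_TYPES =
            new HashSet<>(Arrays.asList("String", "Long", "Float", "Timestamp"));

    private final Map<String, Map<String, String>> channels = new LinkedHashMap<>();

    // Registers a channel; every field type must be basic or an already-registered channel.
    void define(String name, Map<String, String> fields) {
        if (BASIC_TYPES.contains(name) || channels.containsKey(name)) {
            throw new IllegalArgumentException("Name already in use: " + name);
        }
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String type = field.getValue();
            if (!BASIC_TYPES.contains(type) && !channels.containsKey(type)) {
                throw new IllegalArgumentException(
                        "Unknown type '" + type + "' for field '" + field.getKey() + "'");
            }
        }
        // Copy the map so later changes by the caller do not alter the definition.
        channels.put(name, new LinkedHashMap<>(fields));
    }

    boolean isDefined(String name) {
        return channels.containsKey(name);
    }
}
```

With this check, Measurement can only be defined after both Coordinate and Temperature exist, which also rules out circular definitions.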


Now, when data comes in, I validate its format against the defined channel's format; if it doesn't match, I throw an error. Example:

{
"coord" : {
"X" : 31.75,
"Y" : "32.75",
"instant" : "2016-06-20T13:28:06.419Z"
},
"temp" : {
"value" : 25.6,
"measurement_unit" : "Celsius",
"instant" : "2016-06-20T13:28:06.419Z"
},
"instant" : "2016-06-20T13:28:06.419Z"
}

That piece of data will fail validation because the "Y" value doesn't have Float type (as defined in the Coordinate channel).
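
A minimal sketch of that validation step, assuming the incoming data has already been parsed into nested maps (for example by a JSON library) and that channel definitions are kept as field-name-to-type-name maps. ChannelValidator is a hypothetical name, only three basic types are handled, and rejecting missing or extra fields is an assumption.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Map;

// Hypothetical validator: checks parsed data (nested maps) against channel definitions.
final class ChannelValidator {
    // channels: channel name -> (field name -> type name).
    private final Map<String, Map<String, String>> channels;

    ChannelValidator(Map<String, Map<String, String>> channels) {
        this.channels = channels;
    }

    // Throws IllegalArgumentException on the first mismatch it finds.
    void validate(String channel, Map<String, Object> data) {
        Map<String, String> fields = channels.get(channel);
        if (fields == null) {
            throw new IllegalArgumentException("Unknown channel: " + channel);
        }
        // Assumption: missing and extra fields are both errors.
        if (!fields.keySet().equals(data.keySet())) {
            throw new IllegalArgumentException(
                    channel + ": expected fields " + fields.keySet() + " but got " + data.keySet());
        }
        for (Map.Entry<String, String> field : fields.entrySet()) {
            check(channel + "." + field.getKey(), field.getValue(), data.get(field.getKey()));
        }
    }

    @SuppressWarnings("unchecked")
    private void check(String path, String type, Object value) {
        switch (type) {
            case "Float":
                if (!(value instanceof Number)) throw mismatch(path, type);
                break;
            case "String":
                if (!(value instanceof String)) throw mismatch(path, type);
                break;
            case "Timestamp":
                if (!(value instanceof String)) throw mismatch(path, type);
                try {
                    Instant.parse((String) value); // ISO-8601, e.g. 2016-06-20T13:28:06.419Z
                } catch (DateTimeParseException e) {
                    throw mismatch(path, type);
                }
                break;
            default: // any other type name is treated as a channel: recurse into the nested object
                if (!(value instanceof Map)) throw mismatch(path, type);
                validate(type, (Map<String, Object>) value);
        }
    }

    private static IllegalArgumentException mismatch(String path, String type) {
        return new IllegalArgumentException(path + " is not a valid " + type);
    }
}
```

For the data above, validating against Measurement recurses into "coord" and fails on the "Y" field, because a quoted "32.75" arrives as a String rather than a Number.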

Is there any chance you could explain a little more what you said previously? It would really help me.

Thank you

2016-07-07 20:54 GMT-03:00 Ted Yu <[hidden email]>: